# Supabase Digital Channels Pipeline for Banking Churn POC

This notebook handles the Supabase digital channels data pipeline for the Apex National Bank churn prediction POC.

## Overview

Supabase serves as our **Mobile Banking App Backend**, containing:
- **Sessions** - App login/logout events with device info
- **Events** - User actions (view balance, transfer, bill pay, etc.)

## Churn Signal Logic

Digital engagement strongly predicts churn:

| Segment | App Usage | Rationale |
|---------|-----------|----------|
| Active | Regular logins, multiple features | Engaged customers stay |
| At-Risk | Declining app usage | Disengagement signals risk |
| Churned | Stopped using app before closing | Early warning sign |

## Entity Resolution

Customers are linked via **email**:

```
ERPNext (email_id) <---> Supabase (user_email)
```

## Notebook Sections

1. **Configuration** - Environment setup and credentials
2. **Supabase Client** - REST API wrapper
3. **Exploration** - Check existing tables
4. **Data Ingestion** - Create app sessions and events
5. **Data Extraction** - Export to CSV/JSON for Databricks

---
## 1. Configuration

### Prerequisites

1. Supabase project created at https://supabase.com
2. Tables created (app_sessions, app_events)
3. Credentials in `docs/.env` file (auto-loaded)

### Credentials File Format

```powershell
$env:SUPABASE_URL = "https://xxxxx.supabase.co"
$env:SUPABASE_KEY = "your-anon-key"
$env:SUPABASE_SECRET = "your-service-role-key"
```

### Install Dependencies

```bash
pip install supabase
```

In [1]:
# Install supabase if not present
try:
    from supabase import create_client, Client
    print("supabase already installed")
except ImportError:
    print("Installing supabase...")
    !pip install supabase
    from supabase import create_client, Client
    print("Installation complete!")

Installing supabase...
Collecting supabase
  Downloading supabase-2.27.1-py3-none-any.whl.metadata (4.6 kB)
Collecting realtime==2.27.1 (from supabase)
  Downloading realtime-2.27.1-py3-none-any.whl.metadata (7.0 kB)
Collecting supabase-functions==2.27.1 (from supabase)
  Downloading supabase_functions-2.27.1-py3-none-any.whl.metadata (2.4 kB)
Collecting storage3==2.27.1 (from supabase)
  Downloading storage3-2.27.1-py3-none-any.whl.metadata (2.1 kB)
Collecting supabase-auth==2.27.1 (from supabase)
  Downloading supabase_auth-2.27.1-py3-none-any.whl.metadata (6.4 kB)
Collecting postgrest==2.27.1 (from supabase)
  Downloading postgrest-2.27.1-py3-none-any.whl.metadata (3.4 kB)
Collecting yarl>=1.22.0 (from supabase)
  Downloading yarl-1.22.0-cp313-cp313-win_amd64.whl.metadata (77 kB)
Collecting deprecation>=2.1.0 (from postgrest==2.27.1->supabase)
  Downloading deprecation-2.1.0-py2.py3-none-any.whl.metadata (4.6 kB)
Collecting typing-extensions>=4.14.0 (from realtime==2.27.1->supabase)

In [2]:
# Required packages
import os
import re
import json
import csv
import random
import uuid
from datetime import datetime, timedelta
from typing import List, Dict, Any
from pathlib import Path
import pandas as pd

print("Packages imported successfully!")

Packages imported successfully!


In [3]:
# =============================================================================
# CONFIGURATION - Auto-load from .env file
# =============================================================================

def load_env_file(env_path: str) -> dict:
    """
    Load credentials from .env file (PowerShell format).
    """
    credentials = {}
    try:
        with open(env_path, 'r') as f:
            for line in f:
                match = re.match(r'\$env:(\w+)\s*=\s*["\']([^"\']*)["\']', line.strip())
                if match:
                    var_name, value = match.groups()
                    credentials[var_name] = value
        print(f"Loaded {len(credentials)} credentials from {env_path}")
    except FileNotFoundError:
        print(f"Warning: {env_path} not found, using environment variables")
    return credentials

# Load from .env file
ENV_FILE = Path("../../docs/.env")
env_creds = load_env_file(str(ENV_FILE))

# Supabase Credentials
SUPABASE_URL = env_creds.get("SUPABASE_URL", os.getenv("SUPABASE_URL", ""))
SUPABASE_KEY = env_creds.get("SUPABASE_KEY", os.getenv("SUPABASE_KEY", ""))
SUPABASE_SECRET = env_creds.get("SUPABASE_SECRET", os.getenv("SUPABASE_SECRET", ""))

# Data paths
ERP_CUSTOMERS_PATH = Path("../../data/raw/erp_customers.csv")
OUTPUT_DIR = Path("../../data/raw")

# Date range (same as other systems)
START_DATE = datetime(2023, 1, 1)
END_DATE = datetime(2026, 1, 11)

# Check credentials
creds_set = all([SUPABASE_URL, SUPABASE_KEY])

if creds_set:
    print(f"\nConfiguration loaded!")
    print(f"  URL: {SUPABASE_URL}")
    print(f"  Key: {SUPABASE_KEY[:20]}...")
else:
    print("WARNING: Supabase credentials not set!")
    print("\nAdd to docs/.env:")
    print('  $env:SUPABASE_URL = "your-url"')
    print('  $env:SUPABASE_KEY = "your-key"')

Loaded 11 credentials from ..\..\docs\.env

Configuration loaded!
  URL: https://wtdspfddzqkpdaokgzys.supabase.co
  Key: sb_publishable_nDRxk...


In [4]:
# =============================================================================
# REFERENCE DATA
# =============================================================================

# App event types for mobile banking
EVENT_TYPES = [
    {"type": "view_balance", "weight": 30},      # Most common
    {"type": "view_transactions", "weight": 20},
    {"type": "transfer_internal", "weight": 15},
    {"type": "transfer_external", "weight": 5},
    {"type": "bill_payment", "weight": 10},
    {"type": "card_controls", "weight": 5},
    {"type": "view_statements", "weight": 5},
    {"type": "update_profile", "weight": 3},
    {"type": "contact_support", "weight": 2},     # Less common
    {"type": "apply_product", "weight": 5},
]

# Device types
DEVICES = [
    {"type": "iPhone", "platform": "iOS", "weight": 45},
    {"type": "Android Phone", "platform": "Android", "weight": 40},
    {"type": "iPad", "platform": "iOS", "weight": 10},
    {"type": "Android Tablet", "platform": "Android", "weight": 5},
]

# App versions
APP_VERSIONS = ["3.0.1", "3.1.0", "3.2.0", "3.2.1", "4.0.0", "4.1.0"]

print(f"Reference data loaded:")
print(f"  Event types: {len(EVENT_TYPES)}")
print(f"  Device types: {len(DEVICES)}")

Reference data loaded:
  Event types: 10
  Device types: 4


---
## 2. Supabase Client

Connect to Supabase using the official Python client.

In [5]:
def connect_supabase() -> Client:
    """
    Connect to Supabase using the Python client.
    """
    try:
        # Use service role key for admin operations
        key_to_use = SUPABASE_SECRET if SUPABASE_SECRET else SUPABASE_KEY
        supabase: Client = create_client(SUPABASE_URL, key_to_use)
        print(f"Connected to Supabase!")
        print(f"  URL: {SUPABASE_URL}")
        return supabase
    except Exception as e:
        print(f"Connection error: {e}")
        return None


# Connect
if creds_set:
    supabase = connect_supabase()
else:
    print("Skipping connection - credentials not set")
    supabase = None

Connected to Supabase!
  URL: https://wtdspfddzqkpdaokgzys.supabase.co


---
## 3. Exploration

Check existing data in Supabase tables.

In [6]:
def explore_supabase(supabase: Client):
    """
    Survey existing data in Supabase.
    """
    print("=" * 60)
    print("Supabase Data Survey")
    print(f"URL: {SUPABASE_URL}")
    print("=" * 60)
    
    tables = ["app_sessions", "app_events"]
    
    for table in tables:
        try:
            result = supabase.table(table).select("*", count="exact").limit(1).execute()
            count = result.count if result.count else 0
            print(f"  {table:20} : {count:6} records")
        except Exception as e:
            print(f"  {table:20} : Error - {str(e)[:50]}")

# Run exploration
if supabase:
    explore_supabase(supabase)
else:
    print("Not connected to Supabase")

Supabase Data Survey
URL: https://wtdspfddzqkpdaokgzys.supabase.co
  app_sessions         :      0 records
  app_events           :      0 records


In [7]:
# Load ERPNext customers to link digital data
print("\n" + "=" * 60)
print("ERPNext Customers (to link)")
print("=" * 60)

if ERP_CUSTOMERS_PATH.exists():
    erp_customers = pd.read_csv(ERP_CUSTOMERS_PATH)
    
    print(f"\nLoaded {len(erp_customers)} customers from ERPNext")
    print(f"\nSegment distribution:")
    print(erp_customers['website'].value_counts())  # 'website' stores segment
    
    print(f"\nSample customers:")
    print(erp_customers[['customer_name', 'email_id', 'website']].head())
else:
    print(f"ERPNext customer file not found: {ERP_CUSTOMERS_PATH}")
    erp_customers = None


ERPNext Customers (to link)

Loaded 502 customers from ERPNext

Segment distribution:
website
active     311
at_risk    126
churned     63
Name: count, dtype: int64

Sample customers:
     customer_name                 email_id  website
0  Adnan Ahmed 831  adnan.ahmed31@gmail.com   active
1  Adnan Begum 248  adnan.begum45@gmail.com   active
2  Adnan Begum 595  adnan.begum26@gmail.com   active
3  Adnan Malik 605  adnan.malik70@gmail.com  at_risk
4  Adnan Malik 886  adnan.malik79@gmail.com   active


---
## 4. Data Ingestion

Create mobile app data in Supabase linked to ERPNext customers.

### Churn Signal Implementation

| Segment | Sessions/Month | Events/Session | Pattern |
|---------|----------------|----------------|--------|
| Active | 15-30 | 5-15 | Regular usage |
| At-Risk | 5-15 | 3-8 | Declining usage |
| Churned | 0-5 | 1-3 | Stopped before churn |

In [8]:
def determine_session_count(segment: str) -> int:
    """
    Determine number of app sessions based on customer segment.
    Active customers use the app more frequently.
    """
    if segment == "churned":
        return random.randint(5, 15)   # Low usage before churning
    elif segment == "at_risk":
        return random.randint(15, 30)  # Declining usage
    else:  # active
        return random.randint(30, 60)  # Regular usage


def determine_events_per_session(segment: str) -> int:
    """
    Determine number of events per session.
    Active customers do more per session.
    """
    if segment == "churned":
        return random.randint(1, 3)    # Quick sessions
    elif segment == "at_risk":
        return random.randint(2, 5)    # Limited engagement
    else:  # active
        return random.randint(4, 10)   # Full engagement


def get_session_dates(segment: str, start: datetime, end: datetime) -> List[datetime]:
    """
    Generate session dates based on segment behavior.
    """
    dates = []
    num_sessions = determine_session_count(segment)
    
    if segment == "churned":
        # Sessions stopped 3-12 months before end
        churn_date = end - timedelta(days=random.randint(90, 365))
        session_end = churn_date
    else:
        session_end = end
    
    # Generate random dates
    delta = (session_end - start).days
    if delta > 0:
        for _ in range(num_sessions):
            random_days = random.randint(0, delta)
            session_date = start + timedelta(days=random_days)
            dates.append(session_date)
    
    return sorted(dates)


def weighted_choice(items: List[Dict]) -> Dict:
    """Select item based on weight."""
    total = sum(item['weight'] for item in items)
    r = random.uniform(0, total)
    cumulative = 0
    for item in items:
        cumulative += item['weight']
        if r <= cumulative:
            return item
    return items[-1]


print("Helper functions defined.")
print(f"\nSessions by segment:")
print(f"  Active: ~{determine_session_count('active')} sessions")
print(f"  At-Risk: ~{determine_session_count('at_risk')} sessions")
print(f"  Churned: ~{determine_session_count('churned')} sessions")

Helper functions defined.

Sessions by segment:
  Active: ~37 sessions
  At-Risk: ~28 sessions
  Churned: ~8 sessions


In [9]:
def create_sessions_and_events(supabase: Client, erp_customers: pd.DataFrame) -> tuple:
    """
    Create app sessions and events for all customers.
    
    Returns tuple of (sessions_list, events_list) for CSV export.
    """
    print("\n" + "=" * 60)
    print("Creating App Sessions and Events")
    print("=" * 60)
    
    all_sessions = []
    all_events = []
    session_errors = 0
    event_errors = 0
    
    segment_stats = {"active": 0, "at_risk": 0, "churned": 0}
    
    for idx, row in erp_customers.iterrows():
        email = row['email_id']
        segment = row['website']  # Segment stored in website field
        
        # Get session dates for this customer
        session_dates = get_session_dates(segment, START_DATE, END_DATE)
        
        for session_date in session_dates:
            # Generate session
            device = weighted_choice(DEVICES)
            session_duration = random.randint(60, 900)  # 1-15 minutes
            session_end = session_date + timedelta(seconds=session_duration)
            
            session_data = {
                "user_email": email,
                "session_start": session_date.isoformat(),
                "session_end": session_end.isoformat(),
                "device_type": device['type'],
                "platform": device['platform'],
                "app_version": random.choice(APP_VERSIONS),
            }
            
            try:
                # Insert session
                result = supabase.table("app_sessions").insert(session_data).execute()
                session_id = result.data[0]['id']
                
                session_data['id'] = session_id
                session_data['segment'] = segment
                all_sessions.append(session_data)
                segment_stats[segment] += 1
                
                # Generate events for this session
                num_events = determine_events_per_session(segment)
                
                for i in range(num_events):
                    event_type = weighted_choice(EVENT_TYPES)
                    event_offset = random.randint(0, session_duration)
                    event_time = session_date + timedelta(seconds=event_offset)
                    
                    event_data = {
                        "session_id": session_id,
                        "user_email": email,
                        "event_type": event_type['type'],
                        "event_timestamp": event_time.isoformat(),
                        "event_data": json.dumps({"source": "mobile_app"}),
                    }
                    
                    try:
                        supabase.table("app_events").insert(event_data).execute()
                        event_data['segment'] = segment
                        all_events.append(event_data)
                    except Exception as e:
                        event_errors += 1
                        if event_errors <= 3:
                            print(f"  Event error: {str(e)[:60]}")
                
            except Exception as e:
                session_errors += 1
                if session_errors <= 3:
                    print(f"  Session error: {str(e)[:60]}")
        
        if (idx + 1) % 50 == 0:
            print(f"  Processed {idx + 1}/{len(erp_customers)} customers...")
            print(f"    Sessions: {len(all_sessions)}, Events: {len(all_events)}")
    
    print(f"\n" + "=" * 60)
    print("INGESTION COMPLETE")
    print("=" * 60)
    print(f"  Total Sessions: {len(all_sessions)} ({session_errors} errors)")
    print(f"  Total Events: {len(all_events)} ({event_errors} errors)")
    print(f"\nSessions by segment:")
    for seg, count in segment_stats.items():
        print(f"  {seg}: {count} sessions")
    
    return all_sessions, all_events

In [10]:
# ============================================================================
# RUN DATA INGESTION
# ============================================================================
#
# This cell creates mobile app data in Supabase:
# - App sessions (login events with device info)
# - App events (user actions during sessions)
#
# Runtime: Approximately 5-10 minutes for 500 customers

if supabase and erp_customers is not None:
    print("=" * 60)
    print("Supabase Data Ingestion")
    print(f"Target: {SUPABASE_URL}")
    print(f"Customers to process: {len(erp_customers)}")
    print("=" * 60)
    
    sessions, events = create_sessions_and_events(supabase, erp_customers)
else:
    print("Cannot run ingestion:")
    if not supabase:
        print("  - Not connected to Supabase")
    if erp_customers is None:
        print("  - ERPNext customers not loaded")
    sessions, events = [], []

Supabase Data Ingestion
Target: https://wtdspfddzqkpdaokgzys.supabase.co
Customers to process: 502

Creating App Sessions and Events
  Session error: <ConnectionTerminated error_code:0, last_stream_id:19999, ad
  Processed 50/502 customers...
    Sessions: 1870, Events: 12014
  Event error: <ConnectionTerminated error_code:0, last_stream_id:19999, ad
  Processed 100/502 customers...
    Sessions: 3604, Events: 22745
  Event error: <ConnectionTerminated error_code:0, last_stream_id:19999, ad
  Processed 150/502 customers...
    Sessions: 5259, Events: 32619
  Event error: <ConnectionTerminated error_code:0, last_stream_id:19999, ad
  Processed 200/502 customers...
    Sessions: 7007, Events: 43317
  Processed 250/502 customers...
    Sessions: 8794, Events: 54739
  Processed 300/502 customers...
    Sessions: 10578, Events: 65585
  Session error: Out of range float values are not JSON compliant: nan
  Session error: Out of range float values are not JSON compliant: nan
  Processed 350/5

---
## 5. Data Extraction

Export Supabase data to CSV/JSON for Databricks ingestion.

In [11]:
def extract_sessions_from_supabase(supabase: Client) -> List[Dict]:
    """Extract all sessions from Supabase."""
    print("\nExtracting Sessions...")
    
    try:
        result = supabase.table("app_sessions").select("*").execute()
        sessions = result.data
        print(f"  Found {len(sessions)} sessions")
        return sessions
    except Exception as e:
        print(f"  Error: {e}")
        return []


def extract_events_from_supabase(supabase: Client) -> List[Dict]:
    """Extract all events from Supabase."""
    print("\nExtracting Events...")
    
    try:
        result = supabase.table("app_events").select("*").execute()
        events = result.data
        print(f"  Found {len(events)} events")
        return events
    except Exception as e:
        print(f"  Error: {e}")
        return []

In [12]:
def save_to_csv(data: List[Dict], filename: str, output_dir: Path):
    """Save data to CSV file."""
    if not data:
        print(f"  No data for {filename}")
        return
    
    output_dir.mkdir(parents=True, exist_ok=True)
    filepath = output_dir / filename
    
    # Get all unique keys
    all_keys = set()
    for record in data:
        all_keys.update(record.keys())
    fieldnames = sorted(list(all_keys))
    
    with open(filepath, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)
    
    print(f"  Saved: {filename} ({len(data)} records)")


def save_to_json(data: List[Dict], filename: str, output_dir: Path):
    """Save data to JSON file."""
    if not data:
        print(f"  No data for {filename}")
        return
    
    output_dir.mkdir(parents=True, exist_ok=True)
    filepath = output_dir / filename
    
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, default=str)
    
    print(f"  Saved: {filename} ({len(data)} records)")

In [13]:
# ============================================================================
# RUN DATA EXTRACTION
# ============================================================================

if supabase:
    print("=" * 60)
    print("Supabase Data Extraction")
    print(f"Source: {SUPABASE_URL}")
    print(f"Output: {OUTPUT_DIR.absolute()}")
    print("=" * 60)
    
    # Extract data from Supabase
    sb_sessions = extract_sessions_from_supabase(supabase)
    sb_events = extract_events_from_supabase(supabase)
    
    # Save to CSV
    print("\n" + "=" * 60)
    print("Saving to CSV")
    print("=" * 60)
    
    save_to_csv(sb_sessions, "sb_sessions.csv", OUTPUT_DIR)
    save_to_csv(sb_events, "sb_events.csv", OUTPUT_DIR)
    
    # Save to JSON
    print("\n" + "=" * 60)
    print("Saving to JSON")
    print("=" * 60)
    
    save_to_json(sb_sessions, "sb_sessions.json", OUTPUT_DIR)
    save_to_json(sb_events, "sb_events.json", OUTPUT_DIR)
    
    # Summary
    print("\n" + "=" * 60)
    print("EXTRACTION COMPLETE")
    print("=" * 60)
    print(f"\nFiles saved to: {OUTPUT_DIR.absolute()}")
    print(f"\n  sb_sessions.csv : {len(sb_sessions)} records")
    print(f"  sb_events.csv   : {len(sb_events)} records")
else:
    print("Not connected to Supabase - cannot extract data")

Supabase Data Extraction
Source: https://wtdspfddzqkpdaokgzys.supabase.co
Output: c:\Users\SulaimanAhmed\Desktop\portfolio\Banking project\banking-churn-databricks\notebooks\exploration\..\..\data\raw

Extracting Sessions...
  Found 1000 sessions

Extracting Events...
  Found 1000 events

Saving to CSV
  Saved: sb_sessions.csv (1000 records)
  Saved: sb_events.csv (1000 records)

Saving to JSON
  Saved: sb_sessions.json (1000 records)
  Saved: sb_events.json (1000 records)

EXTRACTION COMPLETE

Files saved to: c:\Users\SulaimanAhmed\Desktop\portfolio\Banking project\banking-churn-databricks\notebooks\exploration\..\..\data\raw

  sb_sessions.csv : 1000 records
  sb_events.csv   : 1000 records


---
## Summary

This notebook provides the complete Supabase digital channels data pipeline:

1. **Configuration** - Auto-load credentials from .env
2. **API Client** - Connect using supabase-py
3. **Exploration** - Survey existing data
4. **Ingestion** - Create sessions and events
5. **Extraction** - Export to CSV/JSON

### Churn Signal Implementation

| Segment | Sessions | Events/Session | Pattern |
|---------|----------|----------------|--------|
| Active | 30-60 | 4-10 | Regular, engaged users |
| At-Risk | 15-30 | 2-5 | Declining engagement |
| Churned | 5-15 | 1-3 | Stopped before closing account |

### Entity Resolution

Customers are linked via **email**:
```
ERPNext.email_id = Supabase.user_email
```

### Files Generated

```
data/raw/
├── sb_sessions.csv    # App login sessions
├── sb_sessions.json
├── sb_events.csv      # User actions (churn signal!)
└── sb_events.json
```

### Next Steps

1. Run Google Sheets legacy data generator
2. Build Bronze layer in dbt

---
*Created for Banking Customer Churn Prediction POC*