# Lesson 3: Exercise 2 - Conform Rider IDs Across PostgreSQL and Neo4j

## Goal

Create a **conformed rider dimension** by reconciling rider identifiers across PostgreSQL (trips) and Neo4j (graph edges). This ensures that analytics can join trips and graph edges using a single, consistent rider key.

## Prerequisites

You should have completed:
- **Lesson 1, Exercise 1**: Connected to PostgreSQL (`raw_trips` table)
- **Lesson 1, Exercise 3**: Connected to Neo4j (graph edges)
- **Lesson 2, Exercise 1**: Designed the `dw_dim_rider` table

## What You Will Build

A Pandas-based conformance process that:

1. Extracts rider identifiers from PostgreSQL trips
2. Extracts rider identifiers from Neo4j graph edges
3. Identifies common and unique riders across systems
4. Creates a unified rider dimension table
5. Validates referential integrity

### Why Conformance Matters

Different source systems often use different naming conventions or ID formats. For example:
- Trips use `rider_id` directly
- Graph edges reference riders as `from_node_id` with `from_node_type = 'Rider'`
- Graph edges also have `rider_segment` attributes not present in trips

Conformance creates a single source of truth for each entity.

---

## Imports and Dependencies

Run this cell first to import all required libraries.

In [None]:
# ========= Imports
import os
from datetime import datetime
from typing import Tuple, List, Set
import numpy as np
import pandas as pd
from sqlalchemy import create_engine, text
from neo4j import GraphDatabase

print("All imports successful!")
print(f"   - pandas version: {pd.__version__}")
print(f"   - numpy version: {np.__version__}")

---
## Configuration

**Important:** These credentials match the populate scripts from Lesson 1. Update only if your environment differs.

In [None]:
# ========= PostgreSQL Configuration ==========
# These match the populate-postgres.py script from Lesson 1.

PG_HOST = "localhost"      # Database host
PG_PORT = "5432"           # Database port
PG_DB = "postgres"         # Database name (populate script uses 'postgres')
PG_USER = "temp"           # User from populate-postgres.py
PG_PASSWORD = "temp"       # Password from populate-postgres.py

PG_URI = f"postgresql://{PG_USER}:{PG_PASSWORD}@{PG_HOST}:{PG_PORT}/{PG_DB}"

# ========= Neo4j Configuration ==========
# These match the populate-neo4j.py script from Lesson 1.

NEO4J_URI = "bolt://localhost:7687"  # Neo4j connection URI
NEO4J_USER = "neo4j"                 # User from populate-neo4j.py
NEO4J_PASSWORD = "neo4jpass"         # Password from populate-neo4j.py

# Output path
OUTPUT_DIM_RIDER = "/tmp/dim_rider_conformed.csv"

print("Configuration set!")
print(f"   - PostgreSQL: {PG_HOST}:{PG_PORT}/{PG_DB} (user: {PG_USER})")
print(f"   - Neo4j: {NEO4J_URI} (user: {NEO4J_USER})")
print(f"   - Output: {OUTPUT_DIM_RIDER}")

---
## Populate the Databases

Run these cells to populate PostgreSQL and Neo4j with sample data (if not already done).

In [None]:
!python populate-postgres.py

In [None]:
!python populate-neo4j.py

---
## Helper Functions

Utility functions for data cleaning and Neo4j queries.

In [None]:
# ========= Helper Functions

def trim_df(df: pd.DataFrame) -> pd.DataFrame:
    """
    Standardize text fields and handle NaN values.
    """
    df = df.copy()
    for c in df.select_dtypes(include=['object']).columns:
        df[c] = df[c].astype(str).str.strip()
        # Use mask instead of replace to avoid FutureWarning
        df[c] = df[c].mask(df[c].isin(['nan', 'None', 'NaN', '']), np.nan)
    return df


def run_cypher(session, cypher: str, **params) -> list:
    """
    Execute a Cypher query and return results as list of dicts.
    """
    result = session.run(cypher, **params)
    return [record.data() for record in result]


print("Helper functions defined: trim_df(), run_cypher()")

---
## Step 1: Extract Riders from PostgreSQL

Get all unique rider IDs from the `raw_trips` table.

**TODO**: Write a SQL query to extract unique rider IDs from `raw_trips` along with:
- Count of trips per rider
- First trip datetime
- Last trip datetime

Use GROUP BY on rider_id and filter out NULL rider_ids.
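
As a rough illustration of the aggregation shape described above, the sketch below runs a grouped query against an in-memory SQLite table instead of the lesson's PostgreSQL database. The timestamp column name `started_at` and the sample rows are assumptions made for this example; check the actual `raw_trips` schema before adapting it.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here.
# ASSUMPTION: the trip timestamp column is called started_at (hypothetical name).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_trips (rider_id TEXT, started_at TEXT)")
conn.executemany(
    "INSERT INTO raw_trips VALUES (?, ?)",
    [
        ("R001", "2024-01-05 08:00:00"),
        ("R001", "2024-02-10 17:30:00"),
        ("R002", "2024-01-20 09:15:00"),
        (None, "2024-03-01 12:00:00"),  # NULL rider_id, excluded by the WHERE clause
    ],
)

SQL_RIDERS_SKETCH = """
SELECT
    rider_id,
    COUNT(*)        AS trip_count,
    MIN(started_at) AS first_trip,
    MAX(started_at) AS last_trip
FROM raw_trips
WHERE rider_id IS NOT NULL
GROUP BY rider_id
ORDER BY rider_id
"""

rows = conn.execute(SQL_RIDERS_SKETCH).fetchall()
conn.close()

for row in rows:
    print(row)
```

The same `SELECT ... WHERE ... GROUP BY` structure should carry over to PostgreSQL, with the real timestamp column substituted in.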

In [None]:
# ========= STEP 1: Extract riders from PostgreSQL
print("Step 1: Extracting riders from PostgreSQL...")
print("-" * 50)

# Connect to PostgreSQL
pg_engine = create_engine(PG_URI)

# TODO: Write SQL to extract unique rider IDs with trip count and date range
SQL_RIDERS = """
-- TODO: Write your SELECT statement here
-- Include: rider_id, trip_count, first_trip, last_trip

"""

with pg_engine.connect() as conn:
    pg_riders = pd.read_sql(text(SQL_RIDERS), conn)

pg_riders = trim_df(pg_riders)

print(f"Extracted {len(pg_riders):,} unique riders from PostgreSQL")
print(f"\nSample:")
display(pg_riders.head())

# Store as set for later comparison
trips_rider_ids = set(pg_riders['rider_id'].dropna().unique())
print(f"\nUnique rider IDs: {len(trips_rider_ids):,}")

---
## Step 2: Extract Riders from Neo4j

Get rider information from graph edges. In Neo4j, riders appear in multiple contexts:
- As `from_node_id` when `from_node_type = 'Rider'`
- As `to_node_id` when `to_node_type = 'Rider'`
- In the `rider_id` property of certain edges

**TODO**: Write a Cypher query to extract rider information from Neo4j. You can either:
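1. Query Rider nodes directly: `MATCH (r:Rider) RETURN r.id AS rider_id, r.rider_segment AS rider_segment` (the aliases matter, because the code below expects a `rider_id` column)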
2. Or extract from edge properties if riders aren't separate nodes
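
If riders are only reachable through edge properties, one option is to return edge rows as dicts and flatten them in Python. The sketch below assumes each edge dict carries the keys named in the bullets above (`from_node_id`, `from_node_type`, `to_node_id`, `to_node_type`, optionally `rider_id`) plus `rider_segment`. The helper `riders_from_edges` and the sample rows are hypothetical, not part of the lesson's scripts.

```python
def riders_from_edges(edge_records):
    """Collect one row per rider from edge-shaped dicts (hypothetical helper)."""
    riders = {}
    for edge in edge_records:
        candidates = []
        if edge.get("from_node_type") == "Rider":
            candidates.append(edge.get("from_node_id"))
        if edge.get("to_node_type") == "Rider":
            candidates.append(edge.get("to_node_id"))
        if edge.get("rider_id") is not None:
            candidates.append(edge["rider_id"])

        segment = edge.get("rider_segment")
        for rider_id in candidates:
            if rider_id is None:
                continue
            # Keep the first non-null segment seen for each rider.
            if riders.get(rider_id) is None:
                riders[rider_id] = segment

    return [
        {"rider_id": rider_id, "rider_segment": segment}
        for rider_id, segment in sorted(riders.items())
    ]


# ASSUMPTION: made-up sample edges, shaped like the output of run_cypher().
sample_edges = [
    {"from_node_id": "R001", "from_node_type": "Rider",
     "to_node_id": "S010", "to_node_type": "Station", "rider_segment": "commuter"},
    {"from_node_id": "S011", "from_node_type": "Station",
     "to_node_id": "R002", "to_node_type": "Rider", "rider_segment": None},
    {"from_node_id": "S012", "from_node_type": "Station",
     "to_node_id": "S013", "to_node_type": "Station",
     "rider_id": "R003", "rider_segment": "tourist"},
    {"from_node_id": "R002", "from_node_type": "Rider",
     "to_node_id": "S010", "to_node_type": "Station", "rider_segment": "casual"},
]

rider_rows = riders_from_edges(sample_edges)
for row in rider_rows:
    print(row)
```

The resulting list can be passed to `pd.DataFrame(...)` to get the `rider_id` and `rider_segment` columns that the rest of the notebook expects.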

In [None]:
# ========= STEP 2: Extract riders from Neo4j
print("Step 2: Extracting riders from Neo4j...")
print("-" * 50)

import warnings
# Suppress Neo4j informational notifications about null handling
warnings.filterwarnings('ignore', message='.*AggregationSkippedNull.*')

# Connect to Neo4j
neo4j_driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

# TODO: Write Cypher query to extract riders
CYPHER_RIDERS = """
// TODO: Write your Cypher query here
// Return: rider_id, rider_segment
"""

with neo4j_driver.session() as session:
    results = run_cypher(session, CYPHER_RIDERS)

neo4j_riders = pd.DataFrame(results)

if not neo4j_riders.empty:
    neo4j_riders = trim_df(neo4j_riders)
    print(f"Extracted {len(neo4j_riders):,} rider records from Neo4j")
    print(f"\nSample:")
    display(neo4j_riders.head())
    
    # Store unique rider IDs
    edges_rider_ids = set(neo4j_riders['rider_id'].dropna().unique())
    print(f"\nUnique rider IDs: {len(edges_rider_ids):,}")
else:
    print("No riders found in Neo4j")
    edges_rider_ids = set()

---
## Step 3: Analyze Overlap Between Sources

Compare rider IDs from both systems to understand the overlap.

**TODO**: Use set operations to calculate:
- Common riders (intersection): `trips_rider_ids & edges_rider_ids`
- Trips-only riders (difference): `trips_rider_ids - edges_rider_ids`
- Edges-only riders (difference): `edges_rider_ids - trips_rider_ids`
- Total unique riders (union): `trips_rider_ids | edges_rider_ids`
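
To see how the four set operations relate, here is a small worked example on made-up IDs. The variable names differ from the notebook's on purpose so that running it does not overwrite your own results.

```python
# ASSUMPTION: toy IDs for illustration only.
toy_trips_ids = {"R001", "R002", "R003"}
toy_edges_ids = {"R002", "R003", "R004"}

toy_common = toy_trips_ids & toy_edges_ids       # in both sources
toy_trips_only = toy_trips_ids - toy_edges_ids   # only in trips
toy_edges_only = toy_edges_ids - toy_trips_ids   # only in edges
toy_all = toy_trips_ids | toy_edges_ids          # in either source

# The three partitions are disjoint and together make up the union.
partition_total = len(toy_common) + len(toy_trips_only) + len(toy_edges_only)

toy_overlap_pct = len(toy_common) / len(toy_all) * 100
print(f"Overlap rate: {toy_overlap_pct:.1f}%")
```

The partition check (`common + trips_only + edges_only == union`) is a cheap sanity test worth repeating on the real sets.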

In [None]:
# ========= STEP 3: Analyze overlap
print("Step 3: Analyzing overlap between sources...")
print("-" * 50)

# TODO: Calculate overlap using set operations
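common_riders = set()  # TODO: riders in both sources
trips_only = set()     # TODO: riders only in trips
edges_only = set()     # TODO: riders only in edges
all_riders = set()     # TODO: riders in either source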

print(f"OVERLAP ANALYSIS:")
print(f"   PostgreSQL (trips):  {len(trips_rider_ids):,}")
print(f"   Neo4j (edges):       {len(edges_rider_ids):,}")
print(f"   ---")
print(f"   Common to both:      {len(common_riders):,}")
print(f"   Only in trips:       {len(trips_only):,}")
print(f"   Only in edges:       {len(edges_only):,}")
print(f"   Total unique:        {len(all_riders):,}")

# Overlap percentage
if len(all_riders) > 0:
    overlap_pct = len(common_riders) / len(all_riders) * 100
    print(f"\n   Overlap rate: {overlap_pct:.1f}%")

---
## Step 4: Build Conformed Rider Dimension

Create a unified rider dimension that includes:
- All unique rider IDs from both sources
- Rider segment (from Neo4j where available)
- Trip activity metrics (from PostgreSQL)
- Source flags indicating where each rider appears

**TODO**: Build the conformed dimension by:

1. Start with all unique rider_ids from `all_riders`
2. Left join to `pg_riders` to get trip metrics
3. Left join to `neo4j_riders` to get rider_segment
4. Add boolean flags `in_trips` and `in_edges`, and fill missing trip counts with 0
5. Fill missing segments with 'Unknown'
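
The steps above can be sketched on toy data as follows. The frames `toy_pg` and `toy_neo` are made-up stand-ins for `pg_riders` and `neo4j_riders`; this is one possible arrangement of the joins, not the reference solution.

```python
import pandas as pd

# ASSUMPTION: toy stand-ins for pg_riders and neo4j_riders.
toy_pg = pd.DataFrame({"rider_id": ["R001", "R002"], "trip_count": [3, 1]})
toy_neo = pd.DataFrame(
    {"rider_id": ["R002", "R003"], "rider_segment": ["commuter", None]}
)

# One row per rider on the graph side, so the join cannot multiply rows.
toy_neo = toy_neo.drop_duplicates(subset="rider_id")

# 1. Base frame with every rider ID from either source.
toy_ids = sorted(set(toy_pg["rider_id"]) | set(toy_neo["rider_id"]))
toy_dim = pd.DataFrame({"rider_id": toy_ids})

# 2-3. Left joins keep every base row even when a source has no match.
toy_dim = toy_dim.merge(toy_pg, on="rider_id", how="left")
toy_dim = toy_dim.merge(toy_neo, on="rider_id", how="left")

# 4. Source flags.
toy_dim["in_trips"] = toy_dim["rider_id"].isin(toy_pg["rider_id"])
toy_dim["in_edges"] = toy_dim["rider_id"].isin(toy_neo["rider_id"])

# 5. Defaults for riders missing from one source.
toy_dim["trip_count"] = toy_dim["trip_count"].fillna(0).astype(int)
toy_dim["rider_segment"] = toy_dim["rider_segment"].fillna("Unknown")

print(toy_dim)
```

Deduplicating the graph side before merging matters: a left join against a frame with repeated `rider_id` values would produce more than one dimension row per rider.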

In [None]:
# ========= STEP 4: Build conformed dimension
print("Step 4: Building conformed rider dimension...")
print("-" * 50)

# TODO: Create base DataFrame with all unique rider IDs


# TODO: Left join to PostgreSQL data (trip metrics)


# TODO: Add flag for riders in trips


# TODO: Left join to Neo4j data (rider segment)
if not neo4j_riders.empty:
    pass  # TODO: replace this line with the left join on rider_id

# TODO: Add flag for riders in edges


# TODO: Fill missing trip counts with 0


# TODO: Fill missing segments with 'Unknown'


# Placeholder for now
dim_rider_final = pd.DataFrame()

print(f"Conformed dimension created: {len(dim_rider_final):,} riders")
print(f"\nSample rows:")
display(dim_rider_final.head(10))

---
## Step 5: Validate the Conformed Dimension

Ensure all source riders are captured and check referential integrity.
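
Beyond coverage, a dimension key is usually expected to be unique and non-null, and every fact-side key should resolve to a dimension row. The sketch below shows those three checks on plain lists. The helper `check_dimension_keys` is hypothetical, and it assumes missing keys are represented as `None` (drop NaN values first if the keys come from a DataFrame column).

```python
from collections import Counter


def check_dimension_keys(dim_keys, fact_keys):
    """Return basic integrity findings for a dimension key column (hypothetical helper)."""
    non_null = [key for key in dim_keys if key is not None]
    counts = Counter(non_null)
    return {
        # Dimension rows with no key at all.
        "null_keys": len(dim_keys) - len(non_null),
        # Keys that appear on more than one dimension row.
        "duplicate_keys": sorted(key for key, n in counts.items() if n > 1),
        # Fact-side keys with no matching dimension row.
        "orphan_fact_keys": sorted(set(fact_keys) - set(non_null)),
    }


# ASSUMPTION: made-up keys chosen to trigger each finding once.
findings = check_dimension_keys(
    dim_keys=["R001", "R002", "R002", None],
    fact_keys=["R001", "R002", "R009"],
)
print(findings)
```

On the real data, something like `dim_rider_final['rider_id'].dropna().tolist()` and `list(trips_rider_ids)` could be passed in; a clean dimension would report zero nulls and two empty lists.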

In [None]:
# ========= STEP 5: Validate
print("Step 5: Validating conformed dimension...")
print("-" * 50)

# Check all source riders captured
dim_rider_set = set(dim_rider_final['rider_id'].unique()) if not dim_rider_final.empty else set()
trips_covered = trips_rider_ids.issubset(dim_rider_set)
edges_covered = edges_rider_ids.issubset(dim_rider_set)

print(f"Coverage check:")
print(f"   All trips riders captured: {trips_covered}")
print(f"   All edges riders captured: {edges_covered}")

# Segment distribution
if not dim_rider_final.empty and 'rider_segment' in dim_rider_final.columns:
    print(f"\nRider segment distribution:")
    segment_dist = dim_rider_final['rider_segment'].value_counts()
    for segment, count in segment_dist.items():
        pct = count / len(dim_rider_final) * 100
        print(f"   {segment}: {count:,} ({pct:.1f}%)")

---
## Step 6: Output Conformed Dimension

Save the conformed rider dimension for loading into the warehouse.

In [None]:
# ========= STEP 6: Output
print("Step 6: Outputting conformed dimension...")
print("-" * 50)

# Save to CSV
if not dim_rider_final.empty:
    dim_rider_final.to_csv(OUTPUT_DIM_RIDER, index=False)
    file_size = os.path.getsize(OUTPUT_DIM_RIDER) / 1024
    
    print(f"Saved to: {OUTPUT_DIM_RIDER}")
    print(f"File size: {file_size:.1f} KB")
    print(f"Total riders: {len(dim_rider_final):,}")
else:
    print("No data to save - complete the TODO sections above first.")

---
## Step 7: Clean Up

Close all database connections.

In [None]:
# ========= STEP 7: Clean up
pg_engine.dispose()
neo4j_driver.close()
print("Database connections closed.")

---
## Key Takeaways

### Conformance Strategies

1. **Union approach**: Combine all unique identifiers from all sources
2. **Attribute enrichment**: Pull attributes from whichever source has them
3. **Conflict resolution**: When sources disagree, use rules (e.g., most recent, most common)
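
The two example rules in the last item can give different answers on the same data, which is why the rule should be chosen deliberately. A minimal sketch, assuming each observation is a `(value, ISO date string)` pair; both helper names and the sample observations are hypothetical.

```python
from collections import Counter


def resolve_most_recent(observations):
    """Pick the value with the latest timestamp (ISO strings sort chronologically)."""
    return max(observations, key=lambda obs: obs[1])[0]


def resolve_most_common(observations):
    """Pick the value that appears most often."""
    values = [value for value, _ in observations]
    return Counter(values).most_common(1)[0][0]


# ASSUMPTION: made-up segment observations for one rider.
segment_observations = [
    ("casual", "2024-01-01"),
    ("commuter", "2024-03-01"),
    ("casual", "2024-02-01"),
]

most_recent = resolve_most_recent(segment_observations)
most_common = resolve_most_common(segment_observations)
print(most_recent, most_common)
```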

### Common Challenges

- **ID format differences**: Some systems use prefixes (e.g., `R12345` vs `12345`)
- **Case sensitivity**: `rider_001` vs `RIDER_001`
- **Orphan records**: Riders that appear in edges but never took a trip
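
The first two challenges are often handled by normalizing IDs before comparing them. The sketch below shows one possible rule set: trim whitespace, uppercase, and add a prefix to purely numeric IDs. The canonical form chosen here (uppercase with an `R` prefix) is an assumption for illustration; the right convention depends on your source systems.

```python
def normalize_rider_id(raw, prefix="R"):
    """Trim, uppercase, and prefix bare numeric IDs (assumed convention)."""
    if raw is None:
        return None
    cleaned = str(raw).strip().upper()
    if not cleaned:
        return None
    # A bare number such as "12345" gets the canonical prefix added.
    if cleaned.isdigit():
        return f"{prefix}{cleaned}"
    return cleaned


# ASSUMPTION: sample inputs modeled on the formats mentioned above.
samples = ["R12345", "12345", " r12345 ", "rider_001", "RIDER_001", "", None]
normalized = [normalize_rider_id(value) for value in samples]
print(normalized)
```

Applying the same function to both sources before the set comparison in Step 3 keeps format differences from showing up as false "trips-only" or "edges-only" riders.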