# Lesson 3: Exercise 2 Solution - Conform Rider IDs Across PostgreSQL and Neo4j

## Goal

Create a **conformed rider dimension** by reconciling rider identifiers across PostgreSQL (trips) and Neo4j (graph edges). This ensures that analytics can join trips and graph edges using a single, consistent rider key.

## Prerequisites

You should have completed:
- **Lesson 1, Exercise 1**: Connected to PostgreSQL (`raw_trips` table)
- **Lesson 1, Exercise 3**: Connected to Neo4j (graph edges)
- **Lesson 2, Exercise 1**: Designed the `dw_dim_rider` table

## What You Will Build

A pandas-based conformance process that:

1. Extracts rider identifiers from PostgreSQL trips
2. Extracts rider identifiers from Neo4j graph edges
3. Identifies common and unique riders across systems
4. Creates a unified rider dimension table
5. Validates referential integrity

### Why Conformance Matters

Different source systems often use different naming conventions or ID formats. For example:
- Trips use `rider_id` directly
- Graph edges reference riders as `from_node_id` with `from_node_type = 'Rider'`
- Graph edges also have `rider_segment` attributes not present in trips

Conformance creates a single source of truth for each entity.

### Acceptance Criteria

- All unique riders from both sources are captured
- Rider dimension includes attributes from both systems where available
- Output is ready for loading into `dw_dim_rider`

---


## Imports and Dependencies

Run this cell first to import all required libraries.

In [1]:
# ========= Imports
import os
from datetime import datetime
from typing import Tuple, List, Set
import numpy as np
import pandas as pd
from sqlalchemy import create_engine, text
from neo4j import GraphDatabase

print("All imports successful!")
print(f"   - pandas version: {pd.__version__}")
print(f"   - numpy version: {np.__version__}")

All imports successful!
   - pandas version: 2.3.1
   - numpy version: 2.2.6


---
## Configuration

**Important:** These credentials match the populate scripts from Lesson 1. Update only if your environment differs.

In [2]:
# ========= PostgreSQL Configuration ==========
# These match the populate-postgres.py script from Lesson 1.

PG_HOST = "localhost"      # Database host
PG_PORT = "5432"           # Database port
PG_DB = "postgres"         # Database name (populate script uses 'postgres')
PG_USER = "temp"           # User from populate-postgres.py
PG_PASSWORD = "temp"       # Password from populate-postgres.py

PG_URI = f"postgresql://{PG_USER}:{PG_PASSWORD}@{PG_HOST}:{PG_PORT}/{PG_DB}"

# ========= Neo4j Configuration ==========
# These match the populate-neo4j.py script from Lesson 1.

NEO4J_URI = "bolt://localhost:7687"  # Neo4j connection URI
NEO4J_USER = "neo4j"                 # User from populate-neo4j.py
NEO4J_PASSWORD = "neo4jpass"         # Password from populate-neo4j.py

# Output path
OUTPUT_DIM_RIDER = "/tmp/dim_rider_conformed.csv"

print("Configuration set!")
print(f"   - PostgreSQL: {PG_HOST}:{PG_PORT}/{PG_DB} (user: {PG_USER})")
print(f"   - Neo4j: {NEO4J_URI} (user: {NEO4J_USER})")
print(f"   - Output: {OUTPUT_DIM_RIDER}")

Configuration set!
   - PostgreSQL: localhost:5432/postgres (user: temp)
   - Neo4j: bolt://localhost:7687 (user: neo4j)
   - Output: /tmp/dim_rider_conformed.csv
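If your environment differs from the Lesson 1 defaults, one option is to read the connection settings from environment variables instead of editing the cell. The sketch below is only an illustration: the variable names (`PG_HOST`, `PG_PORT`, `PG_DB`, `PG_USER`, `PG_PASSWORD`) and the `build_pg_uri` helper are assumptions, not part of the populate scripts.

```python
import os
from typing import Mapping
from urllib.parse import quote_plus


def build_pg_uri(env: Mapping[str, str] = os.environ) -> str:
    """Build a PostgreSQL URI, falling back to the Lesson 1 defaults.

    The environment variable names are hypothetical; pick whatever your
    deployment already uses.
    """
    host = env.get("PG_HOST", "localhost")
    port = env.get("PG_PORT", "5432")
    db = env.get("PG_DB", "postgres")
    user = env.get("PG_USER", "temp")
    # quote_plus escapes characters such as '@' or '/' that would break the URI.
    password = quote_plus(env.get("PG_PASSWORD", "temp"))
    return f"postgresql://{user}:{password}@{host}:{port}/{db}"


# With an empty mapping, the result matches the hard-coded PG_URI above.
print(build_pg_uri({}))
```

Passing the mapping as an argument keeps the helper easy to test without touching the real process environment.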


---
## Verify Database Setup

These cells verify that PostgreSQL and Neo4j have data from Lesson 1. If verification fails, run the populate scripts:

In [8]:
!python populate-postgres.py
!python populate-neo4j.py

Postgres is up!
Connecting to PostgreSQL...
Created table public.raw_trips.
Loaded trips from ./data/van_transit_trips_postgres.csv into public.raw_trips.
Connected to Neo4j at bolt://localhost:7687 as neo4j
Loaded 122 edges … (Rider)-[:SHARES_TRIP_WITH]->(Rider)
Loaded 446 edges … (Rider)-[:ALIGHTED]->(Station)
Loaded 835 edges … (Rider)-[:TRANSFERRED_TO]->(Station)
Loaded 1335 edges … (Rider)-[:BOARDED]->(Vehicle)
Loaded 1370 edges … (Rider)-[:BOARDED]->(Vehicle)
Loaded 1802 edges … (Route)-[:ROUTE_SERVES]->(Station)
Loaded 2302 edges … (Station)-[:CONNECTS_TO]->(Station)
Loaded 2500 edges … (Station)-[:CONNECTS_TO]->(Station)

✅ Loaded 2500 edges from ./data/van_transit_graph_edges_neo4j.csv


In [9]:
# ========= Verify PostgreSQL ==========
print("Verifying PostgreSQL...")
try:
    pg_engine = create_engine(PG_URI)
    with pg_engine.connect() as conn:
        result = conn.execute(text("SELECT COUNT(*) FROM raw_trips"))
        count = result.scalar()
    print(f"   OK: raw_trips has {count:,} rows")
    pg_engine.dispose()
except Exception as e:
    print(f"   ERROR: {e}")
    print("   Run: python populate-postgres.py")

Verifying PostgreSQL...
   OK: raw_trips has 2,500 rows


In [10]:
# ========= Verify Neo4j ==========
print("Verifying Neo4j...")
try:
    neo4j_driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))
    with neo4j_driver.session() as session:
        result = session.run("MATCH (n) RETURN COUNT(n) as count")
        count = result.single()["count"]
    print(f"   OK: Neo4j has {count:,} nodes")
    neo4j_driver.close()
except Exception as e:
    print(f"   ERROR: {e}")
    print("   Run: python populate-neo4j.py")

Verifying Neo4j...
   OK: Neo4j has 2,043 nodes


---
## Helper Functions

Utility functions for data cleaning and Neo4j queries.

In [11]:
# ========= Helper Functions

def trim_df(df: pd.DataFrame) -> pd.DataFrame:
    """
    Standardize text fields and handle NaN values.
    """
    df = df.copy()
    for c in df.select_dtypes(include=['object']).columns:
        df[c] = df[c].astype(str).str.strip()
        # Use mask instead of replace to avoid FutureWarning
        df[c] = df[c].mask(df[c].isin(['nan', 'None', 'NaN', '']), np.nan)
    return df


def run_cypher(session, cypher: str, **params) -> list:
    """
    Execute a Cypher query and return results as list of dicts.
    """
    result = session.run(cypher, **params)
    return [record.data() for record in result]


print("Helper functions defined: trim_df(), run_cypher()")

Helper functions defined: trim_df(), run_cypher()
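To see why the cleaning step matters for conformance, the small check below runs the same `trim_df` logic on made-up values. The function is repeated only so the snippet runs on its own, and the sample rows are illustrative rather than taken from `raw_trips`.

```python
import numpy as np
import pandas as pd


def trim_df(df: pd.DataFrame) -> pd.DataFrame:
    """Same logic as the helper above, repeated so this snippet is self-contained."""
    df = df.copy()
    for c in df.select_dtypes(include=["object"]).columns:
        df[c] = df[c].astype(str).str.strip()
        df[c] = df[c].mask(df[c].isin(["nan", "None", "NaN", ""]), np.nan)
    return df


# Made-up input: padded ID, clean ID, the literal string "None", empty string.
raw = pd.DataFrame({
    "rider_id": pd.Series([" R1 ", "R2", "None", ""], dtype=object),
    "trip_count": [1, 2, 3, 4],
})
clean = trim_df(raw)

print(clean["rider_id"].iloc[0])          # padding removed
print(clean["rider_id"].isna().tolist())  # placeholder strings become missing
```

Without the trim, `" R1 "` and `"R1"` would count as two different riders in the set comparison later on. Note that the helper also treats the literal text `None` as missing, which is usually what you want for IDs but worth knowing.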


---
## Step 1: Extract Riders from PostgreSQL

Get all unique rider IDs from the `raw_trips` table.

In [12]:
# ========= STEP 1: Extract riders from PostgreSQL
print("Step 1: Extracting riders from PostgreSQL...")
print("-" * 50)

# Connect to PostgreSQL
pg_engine = create_engine(PG_URI)

# Extract unique rider IDs with trip count
SQL_RIDERS = """
SELECT 
    rider_id,
    COUNT(*) as trip_count,
    MIN(board_datetime) as first_trip,
    MAX(board_datetime) as last_trip
FROM raw_trips
WHERE rider_id IS NOT NULL
GROUP BY rider_id
"""

with pg_engine.connect() as conn:
    pg_riders = pd.read_sql(text(SQL_RIDERS), conn)

pg_riders = trim_df(pg_riders)

print(f"Extracted {len(pg_riders):,} unique riders from PostgreSQL")
print(f"\nSample:")
display(pg_riders.head())

# Store as set for later comparison
trips_rider_ids = set(pg_riders['rider_id'].dropna().unique())
print(f"\nUnique rider IDs: {len(trips_rider_ids):,}")

Step 1: Extracting riders from PostgreSQL...
--------------------------------------------------
Extracted 2,455 unique riders from PostgreSQL

Sample:


|   | rider_id | trip_count | first_trip | last_trip |
|---|----------|------------|------------|-----------|
| 0 | R66173 | 1 | 2025-03-25 11:54:35 | 2025-03-25 11:54:35 |
| 1 | R22664 | 1 | 2025-01-23 20:11:28 | 2025-01-23 20:11:28 |
| 2 | R64150 | 1 | 2024-08-05 01:29:18 | 2024-08-05 01:29:18 |
| 3 | R33434 | 1 | 2024-05-02 12:05:40 | 2024-05-02 12:05:40 |
| 4 | R14438 | 1 | 2025-01-25 09:50:00 | 2025-01-25 09:50:00 |



Unique rider IDs: 2,455


---
## Step 2: Extract Riders from Neo4j

Get rider information from graph edges. In Neo4j, riders appear in multiple contexts:
- As `from_node_id` when `from_node_type = 'Rider'`
- As `to_node_id` when `to_node_type = 'Rider'`
- In the `rider_id` property of certain edges

In [13]:
# ========= STEP 2: Extract riders from Neo4j
print("Step 2: Extracting riders from Neo4j...")
print("-" * 50)

import warnings
# Suppress Neo4j informational notifications about null handling
warnings.filterwarnings('ignore', message='.*AggregationSkippedNull.*')

# Connect to Neo4j
neo4j_driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

# Primary query: read riders from nodes labeled Rider
# and return each node's ID and rider_segment property
CYPHER_RIDERS = """
MATCH (r:Rider)
RETURN 
    coalesce(r.id, r.rider_id) as rider_id,
    r.rider_segment as rider_segment
"""

# Alternative: Extract from edge properties if riders aren't separate nodes
CYPHER_RIDERS_FROM_EDGES = """
MATCH (a)-[r]->(b)
WHERE labels(a)[0] = 'Rider' OR r.rider_id IS NOT NULL
WITH 
    CASE 
        WHEN labels(a)[0] = 'Rider' THEN coalesce(a.id, a.rider_id)
        ELSE r.rider_id 
    END as rider_id,
    CASE
        WHEN labels(a)[0] = 'Rider' THEN a.rider_segment
        ELSE r.rider_segment
    END as rider_segment
WHERE rider_id IS NOT NULL
RETURN DISTINCT rider_id, rider_segment
"""

with neo4j_driver.session() as session:
    # Try the node-based query first
    try:
        results = run_cypher(session, CYPHER_RIDERS)
        if not results:
            # Fall back to edge-based extraction
            results = run_cypher(session, CYPHER_RIDERS_FROM_EDGES)
    except Exception as e:
        print(f"   Using edge-based extraction: {e}")
        results = run_cypher(session, CYPHER_RIDERS_FROM_EDGES)

neo4j_riders = pd.DataFrame(results)

if not neo4j_riders.empty:
    neo4j_riders = trim_df(neo4j_riders)
    print(f"Extracted {len(neo4j_riders):,} rider records from Neo4j")
    print(f"\nSample:")
    display(neo4j_riders.head())
    
    # Store unique rider IDs
    edges_rider_ids = set(neo4j_riders['rider_id'].dropna().unique())
    print(f"\nUnique rider IDs: {len(edges_rider_ids):,}")
else:
    print("No riders found in Neo4j")
    edges_rider_ids = set()

Step 2: Extracting riders from Neo4j...
--------------------------------------------------
Extracted 1,483 rider records from Neo4j

Sample:


|   | rider_id | rider_segment |
|---|----------|---------------|
| 0 | R42192 | NaN |
| 1 | R42701 | NaN |
| 2 | R78643 | NaN |
| 3 | R71761 | NaN |
| 4 | R90961 | NaN |



Unique rider IDs: 1,483
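The three contexts listed at the start of this step can also be handled in pandas if the edges are available as a flat table. This is a rough sketch under that assumption: the column layout follows the names mentioned above (`from_node_id`, `from_node_type`, `to_node_id`, `to_node_type`, `rider_id`), while the rows and the `rider_ids_from_edges` helper are invented for illustration.

```python
import pandas as pd

# Made-up edge rows; one rider per context.
edges = pd.DataFrame({
    "from_node_id": ["R1", "S5", "T2"],
    "from_node_type": ["Rider", "Station", "Route"],
    "to_node_id": ["V7", "R3", "S6"],
    "to_node_type": ["Vehicle", "Rider", "Station"],
    "rider_id": [None, None, "R4"],
})


def rider_ids_from_edges(edges: pd.DataFrame) -> set:
    """Collect rider IDs from the source end, the target end, and the edge property."""
    from_side = edges.loc[edges["from_node_type"] == "Rider", "from_node_id"]
    to_side = edges.loc[edges["to_node_type"] == "Rider", "to_node_id"]
    prop = edges["rider_id"].dropna()
    return set(from_side) | set(to_side) | set(prop)


print(sorted(rider_ids_from_edges(edges)))
```

Taking the union of all three avoids missing riders that only ever appear on the receiving end of an edge.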


In [14]:
# Check rider segment distribution from Neo4j
if not neo4j_riders.empty and 'rider_segment' in neo4j_riders.columns:
    print("Rider segments from Neo4j:")
    print("-" * 50)
    segment_counts = neo4j_riders['rider_segment'].value_counts(dropna=False)
    display(segment_counts)

Rider segments from Neo4j:
--------------------------------------------------


rider_segment
NaN    1483
Name: count, dtype: int64

---
## Step 3: Analyze Overlap Between Sources

Compare rider IDs from both systems to understand the overlap.

In [15]:
# ========= STEP 3: Analyze overlap
print("Step 3: Analyzing overlap between sources...")
print("-" * 50)

# Calculate overlap
common_riders = trips_rider_ids & edges_rider_ids
trips_only = trips_rider_ids - edges_rider_ids
edges_only = edges_rider_ids - trips_rider_ids
all_riders = trips_rider_ids | edges_rider_ids

print(f"OVERLAP ANALYSIS:")
print(f"   PostgreSQL (trips):  {len(trips_rider_ids):,}")
print(f"   Neo4j (edges):       {len(edges_rider_ids):,}")
print(f"   ---")
print(f"   Common to both:      {len(common_riders):,}")
print(f"   Only in trips:       {len(trips_only):,}")
print(f"   Only in edges:       {len(edges_only):,}")
print(f"   Total unique:        {len(all_riders):,}")

# Overlap percentage
if len(all_riders) > 0:
    overlap_pct = len(common_riders) / len(all_riders) * 100
    print(f"\n   Overlap rate: {overlap_pct:.1f}%")

Step 3: Analyzing overlap between sources...
--------------------------------------------------
OVERLAP ANALYSIS:
   PostgreSQL (trips):  2,455
   Neo4j (edges):       1,483
   ---
   Common to both:      56
   Only in trips:       2,399
   Only in edges:       1,427
   Total unique:        3,882

   Overlap rate: 1.4%
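The figures above follow from plain set arithmetic, and it is worth checking that they are internally consistent. The sketch below wraps the same operations in a hypothetical `overlap_report` helper and then re-derives the printed totals from the reported counts.

```python
def overlap_report(a: set, b: set) -> dict:
    """Summarise how two ID sets overlap."""
    common = a & b
    union = a | b
    return {
        "common": len(common),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "total": len(union),
        # Overlap is measured against the union, as in the cell above.
        "overlap_pct": round(len(common) / len(union) * 100, 1) if union else 0.0,
    }


# Tiny made-up example.
report = overlap_report({"R1", "R2", "R3"}, {"R3", "R4"})
print(report)

# Inclusion-exclusion check on the counts reported in this notebook:
# |trips| + |edges| - |common| should equal the total unique riders.
total_unique = 2455 + 1483 - 56
print(total_unique, round(56 / total_unique * 100, 1))
```

Because the rate is computed against the union rather than against either source, it will always be lower than the share of either source's riders that also appear in the other.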


---
## Step 4: Build Conformed Rider Dimension

Create a unified rider dimension that includes:
- All unique rider IDs from both sources
- Rider segment (from Neo4j where available)
- Trip activity metrics (from PostgreSQL)
- Source flags indicating where each rider appears

In [16]:
# ========= STEP 4: Build conformed dimension
print("Step 4: Building conformed rider dimension...")
print("-" * 50)

# Start with PostgreSQL riders (has trip metrics)
pg_riders['in_trips'] = True

# Get rider segments from Neo4j (take most common segment per rider)
if not neo4j_riders.empty and 'rider_segment' in neo4j_riders.columns:
    rider_segments = neo4j_riders.groupby('rider_id')['rider_segment'].agg(
        lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None
    ).reset_index()
    rider_segments['in_edges'] = True
else:
    rider_segments = pd.DataFrame(columns=['rider_id', 'rider_segment', 'in_edges'])

# Merge sources
dim_rider = pd.merge(
    pg_riders[['rider_id', 'trip_count', 'first_trip', 'last_trip', 'in_trips']],
    rider_segments,
    on='rider_id',
    how='outer'
)

# Fill missing flags using np.where (avoids fillna FutureWarning)
dim_rider['in_trips'] = np.where(dim_rider['in_trips'].isna(), False, dim_rider['in_trips']).astype(bool)
dim_rider['in_edges'] = np.where(dim_rider['in_edges'].isna(), False, dim_rider['in_edges']).astype(bool)

# Fill missing segments with 'Unknown'
dim_rider['rider_segment'] = np.where(dim_rider['rider_segment'].isna(), 'Unknown', dim_rider['rider_segment'])

# Fill missing trip counts with 0
dim_rider['trip_count'] = np.where(dim_rider['trip_count'].isna(), 0, dim_rider['trip_count']).astype(int)

# Sort by rider_id
dim_rider = dim_rider.sort_values('rider_id').reset_index(drop=True)

print(f"Conformed dimension created: {len(dim_rider):,} riders")
print(f"\nSample rows:")
display(dim_rider.head(10))

Step 4: Building conformed rider dimension...
--------------------------------------------------
Conformed dimension created: 3,882 riders

Sample rows:


|   | rider_id | trip_count | first_trip | last_trip | in_trips | rider_segment | in_edges |
|---|----------|------------|------------|-----------|----------|---------------|----------|
| 0 | R10014 | 1 | 2025-02-10 00:56:01 | 2025-02-10 00:56:01 | True | Unknown | False |
| 1 | R10036 | 0 | NaT | NaT | False | Unknown | True |
| 2 | R10059 | 1 | 2024-07-06 10:02:03 | 2024-07-06 10:02:03 | True | Unknown | False |
| 3 | R10076 | 1 | 2024-04-23 13:43:46 | 2024-04-23 13:43:46 | True | Unknown | False |
| 4 | R10085 | 0 | NaT | NaT | False | Unknown | True |
| 5 | R10090 | 0 | NaT | NaT | False | Unknown | True |
| 6 | R10116 | 1 | 2025-02-21 00:08:39 | 2025-02-21 00:08:39 | True | Unknown | False |
| 7 | R10159 | 0 | NaT | NaT | False | Unknown | True |
| 8 | R10178 | 1 | 2024-07-25 13:53:28 | 2024-07-25 13:53:28 | True | Unknown | False |
| 9 | R10191 | 0 | NaT | NaT | False | Unknown | True |
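An alternative way to derive the source flags is to let the merge itself report where each row came from, using the `indicator=True` option of `DataFrame.merge`. The snippet below is a sketch on made-up rows, not a replacement for the cell above, and `Commuter` is an invented segment value.

```python
import pandas as pd

# Made-up inputs standing in for pg_riders and rider_segments.
trips = pd.DataFrame({"rider_id": ["R1", "R2"], "trip_count": [3, 1]})
graph = pd.DataFrame({"rider_id": ["R2", "R3"], "rider_segment": ["Commuter", None]})

# indicator=True adds a `_merge` column: left_only, right_only, or both.
merged = trips.merge(graph, on="rider_id", how="outer", indicator=True)
merged["in_trips"] = merged["_merge"].isin(["left_only", "both"])
merged["in_edges"] = merged["_merge"].isin(["right_only", "both"])
merged = (
    merged.drop(columns="_merge")
    .sort_values("rider_id")
    .reset_index(drop=True)
)

print(merged[["rider_id", "in_trips", "in_edges"]])
```

This produces boolean flags directly, so there is no need to set the flags before the merge and fill the resulting gaps afterwards.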


In [17]:
# Add warehouse-ready fields (SCD Type 2 support)
print("Adding warehouse-ready fields...")
print("-" * 50)

# Create final dimension structure matching dw_dim_rider
dim_rider_final = pd.DataFrame({
    'rider_id': dim_rider['rider_id'],
    'rider_segment': dim_rider['rider_segment'],
    'trip_count': dim_rider['trip_count'],
    'first_trip': dim_rider['first_trip'],
    'last_trip': dim_rider['last_trip'],
    'in_trips': dim_rider['in_trips'],
    'in_edges': dim_rider['in_edges'],
    'effective_from': datetime.now(),
    'effective_to': pd.NaT,
    'is_current': True
})

print(f"Final dimension structure:")
print(dim_rider_final.dtypes)
print(f"\nSample:")
display(dim_rider_final.head())

Adding warehouse-ready fields...
--------------------------------------------------
Final dimension structure:
rider_id                  object
rider_segment             object
trip_count                 int64
first_trip        datetime64[ns]
last_trip         datetime64[ns]
in_trips                    bool
in_edges                    bool
effective_from    datetime64[us]
effective_to      datetime64[ns]
is_current                  bool
dtype: object

Sample:


|   | rider_id | rider_segment | trip_count | first_trip | last_trip | in_trips | in_edges | effective_from | effective_to | is_current |
|---|----------|---------------|------------|------------|-----------|----------|----------|----------------|--------------|------------|
| 0 | R10014 | Unknown | 1 | 2025-02-10 00:56:01 | 2025-02-10 00:56:01 | True | False | 2025-12-15 20:42:42.490580 | NaT | True |
| 1 | R10036 | Unknown | 0 | NaT | NaT | False | True | 2025-12-15 20:42:42.490580 | NaT | True |
| 2 | R10059 | Unknown | 1 | 2024-07-06 10:02:03 | 2024-07-06 10:02:03 | True | False | 2025-12-15 20:42:42.490580 | NaT | True |
| 3 | R10076 | Unknown | 1 | 2024-04-23 13:43:46 | 2024-04-23 13:43:46 | True | False | 2025-12-15 20:42:42.490580 | NaT | True |
| 4 | R10085 | Unknown | 0 | NaT | NaT | False | True | 2025-12-15 20:42:42.490580 | NaT | True |
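The `effective_from`, `effective_to`, and `is_current` columns only become useful once an attribute changes. As a rough illustration of how SCD Type 2 versioning typically works, the sketch below expires the current row and appends a new one. The `apply_scd2_change` helper and the `Commuter` segment are hypothetical; the notebook itself only creates the initial version of each row.

```python
from datetime import datetime


def apply_scd2_change(rows, rider_id, new_segment, as_of):
    """Expire the current row for rider_id and append a new current version."""
    out = [dict(r) for r in rows]  # copy each row so the input stays untouched
    for row in out:
        if row["rider_id"] == rider_id and row["is_current"]:
            if row["rider_segment"] == new_segment:
                return out  # attribute unchanged, so no new version is needed
            row["effective_to"] = as_of
            row["is_current"] = False
    out.append({
        "rider_id": rider_id,
        "rider_segment": new_segment,
        "effective_from": as_of,
        "effective_to": None,
        "is_current": True,
    })
    return out


history = [{
    "rider_id": "R10014",
    "rider_segment": "Unknown",
    "effective_from": datetime(2025, 12, 15),
    "effective_to": None,
    "is_current": True,
}]
# "Commuter" is a made-up segment value used only for illustration.
updated = apply_scd2_change(history, "R10014", "Commuter", datetime(2026, 1, 1))
for row in updated:
    print(row["rider_segment"], row["effective_to"], row["is_current"])
```

Each rider then has exactly one row with `is_current = True`, and the earlier rows preserve what the segment was at any past point in time.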


---
## Step 5: Validate Conformance

Verify that the conformed dimension properly captures all riders from both sources.

In [18]:
# ========= STEP 5: Validate conformance
print("Step 5: Validating conformance...")
print("-" * 50)

# Check all source riders are captured
dim_rider_ids = set(dim_rider_final['rider_id'].dropna())

print("Coverage check:")
trips_covered = trips_rider_ids.issubset(dim_rider_ids)
edges_covered = edges_rider_ids.issubset(dim_rider_ids)
print(f"   All trips riders captured: {trips_covered}")
print(f"   All edges riders captured: {edges_covered}")

# Segment distribution
print(f"\nRider segment distribution:")
segment_dist = dim_rider_final['rider_segment'].value_counts()
for segment, count in segment_dist.items():
    pct = count / len(dim_rider_final) * 100
    print(f"   {segment}: {count:,} ({pct:.1f}%)")

# Source distribution
print(f"\nSource coverage:")
both = dim_rider_final[dim_rider_final['in_trips'] & dim_rider_final['in_edges']]
trips_only_dim = dim_rider_final[dim_rider_final['in_trips'] & ~dim_rider_final['in_edges']]
edges_only_dim = dim_rider_final[~dim_rider_final['in_trips'] & dim_rider_final['in_edges']]
print(f"   In both sources: {len(both):,}")
print(f"   Trips only:      {len(trips_only_dim):,}")
print(f"   Edges only:      {len(edges_only_dim):,}")

Step 5: Validating conformance...
--------------------------------------------------
Coverage check:
   All trips riders captured: True
   All edges riders captured: True

Rider segment distribution:
   Unknown: 3,882 (100.0%)

Source coverage:
   In both sources: 56
   Trips only:      2,399
   Edges only:      1,427


In [19]:
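# Referential integrity check - verify that a sample of up to 1,000
# distinct trip rider_ids can join to the conformed dimension
print("Referential integrity check:")
print("-" * 50)

# Reload a sample of distinct rider IDs from raw_trips to check the join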
with pg_engine.connect() as conn:
    trips_sample = pd.read_sql(
        text("SELECT DISTINCT rider_id FROM raw_trips WHERE rider_id IS NOT NULL LIMIT 1000"),
        conn
    )

# Check trips can join
trips_join_check = trips_sample.merge(
    dim_rider_final[['rider_id']], 
    on='rider_id', 
    how='left', 
    indicator=True
)
unmatched_trips = trips_join_check[trips_join_check['_merge'] == 'left_only']
print(f"Trips with unmatched rider_id: {len(unmatched_trips)}")

if len(unmatched_trips) == 0:
    print("   All sampled trips can join to dim_rider - PASSED")
else:
    print(f"   WARNING: {len(unmatched_trips)} trips have unmatched rider_ids")

Referential integrity check:
--------------------------------------------------
Trips with unmatched rider_id: 0
   All sampled trips can join to dim_rider - PASSED
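The check above works on a sample of distinct IDs. The same pattern, a left anti-join, can be applied to every fact row when the table is small enough to load. The sketch below uses made-up frames and a hypothetical `find_orphan_keys` helper to show the idea.

```python
import pandas as pd


def find_orphan_keys(fact: pd.DataFrame, dim: pd.DataFrame, key: str = "rider_id") -> pd.DataFrame:
    """Return fact rows whose key has no match in dim (a left anti-join)."""
    checked = fact.merge(
        dim[[key]].drop_duplicates(),  # de-duplicate so fact rows are not multiplied
        on=key,
        how="left",
        indicator=True,
    )
    return checked.loc[checked["_merge"] == "left_only"].drop(columns="_merge")


# Made-up data: trip 3 references a rider that is missing from the dimension.
trips = pd.DataFrame({"trip_id": [1, 2, 3], "rider_id": ["R1", "R2", "R9"]})
dim = pd.DataFrame({"rider_id": ["R1", "R2", "R3"]})

orphans = find_orphan_keys(trips, dim)
print(orphans)
```

Returning the offending rows, rather than only a count, makes it easier to trace a failure back to the source records.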


---
## Step 6: Output Conformed Dimension

Save the conformed rider dimension for loading into the warehouse.

In [20]:
# ========= STEP 6: Output
print("Step 6: Outputting conformed dimension...")
print("-" * 50)

# Save to CSV
dim_rider_final.to_csv(OUTPUT_DIM_RIDER, index=False)
file_size = os.path.getsize(OUTPUT_DIM_RIDER) / 1024

print(f"Saved to: {OUTPUT_DIM_RIDER}")
print(f"File size: {file_size:.1f} KB")
print(f"Total riders: {len(dim_rider_final):,}")

Step 6: Outputting conformed dimension...
--------------------------------------------------
Saved to: /tmp/dim_rider_conformed.csv
File size: 330.0 KB
Total riders: 3,882
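One thing to keep in mind when the file is read back for loading: CSV stores no type information, so timestamp columns return as plain text unless you ask pandas to parse them. The round trip below is a small sketch on made-up rows, using an in-memory buffer instead of a file on disk.

```python
import io

import pandas as pd

# Made-up two-row dimension with one missing timestamp.
dim = pd.DataFrame({
    "rider_id": ["R1", "R2"],
    "first_trip": pd.to_datetime(["2025-01-01 08:00:00", None]),
    "is_current": [True, True],
})

buf = io.StringIO()
dim.to_csv(buf, index=False)

buf.seek(0)
plain = pd.read_csv(buf)                             # timestamps come back as text
buf.seek(0)
typed = pd.read_csv(buf, parse_dates=["first_trip"])  # timestamps restored

print(plain["first_trip"].dtype)
print(typed["first_trip"].dtype)
```

For the file written above, the same would apply to `first_trip`, `last_trip`, `effective_from`, and `effective_to`.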


---
## Step 7: Clean Up

Close all database connections.

In [21]:
# ========= STEP 7: Clean up
pg_engine.dispose()
neo4j_driver.close()
print("Database connections closed.")

Database connections closed.


---
## Summary

In [22]:
# ========= Final Summary
print("=" * 60)
print("CONFORMANCE JOB SUMMARY: Rider ID Reconciliation")
print("=" * 60)
print(f"""
Sources:
  - PostgreSQL: {PG_HOST}:{PG_PORT}/{PG_DB} (raw_trips)
  - Neo4j:      {NEO4J_URI} (graph edges)

Output:
  - Conformed dimension: {OUTPUT_DIM_RIDER}

Rider Counts:
  - From trips:    {len(trips_rider_ids):,}
  - From edges:    {len(edges_rider_ids):,}
  - Common:        {len(common_riders):,}
  - Total unique:  {len(dim_rider_final):,}

Segments:
{segment_dist.to_string()}

Referential Integrity: {'PASSED' if len(unmatched_trips) == 0 else 'FAILED'}

Status:    {'SUCCESS' if len(unmatched_trips) == 0 and trips_covered and edges_covered else 'FAILED'}
Completed: {datetime.now().isoformat()}
""")
print("=" * 60)

CONFORMANCE JOB SUMMARY: Rider ID Reconciliation

Sources:
  - PostgreSQL: localhost:5432/postgres (raw_trips)
  - Neo4j:      bolt://localhost:7687 (graph edges)

Output:
  - Conformed dimension: /tmp/dim_rider_conformed.csv

Rider Counts:
  - From trips:    2,455
  - From edges:    1,483
  - Common:        56
  - Total unique:  3,882

Segments:
rider_segment
Unknown    3882

Referential Integrity: PASSED

Status:    SUCCESS
Completed: 2025-12-15T20:43:02.758940



---
## Key Takeaways

### Conformance Strategies

1. **Union approach**: Combine all unique identifiers from all sources
2. **Attribute enrichment**: Pull attributes from whichever source has them
3. **Conflict resolution**: When sources disagree, use rules (e.g., most recent, most common)
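
To make the conflict-resolution rules concrete, the sketch below applies "most common" and "most recent" to the same made-up observations. The segment values and the `observed_at` column are assumptions for illustration, since every segment in this dataset is `Unknown`.

```python
import pandas as pd

# Made-up observations: each row is one sighting of a rider's segment.
obs = pd.DataFrame({
    "rider_id": ["R1", "R1", "R1", "R2", "R2"],
    "rider_segment": ["Commuter", "Commuter", "Student", "Student", "Tourist"],
    "observed_at": pd.to_datetime(
        ["2025-01-01", "2025-01-02", "2025-01-03", "2025-01-01", "2025-01-05"]
    ),
})

# Rule 1: most common value per rider. Series.mode() returns sorted values,
# so ties are broken alphabetically here.
most_common = obs.groupby("rider_id")["rider_segment"].agg(lambda s: s.mode().iloc[0])

# Rule 2: most recent value per rider.
most_recent = (
    obs.sort_values("observed_at")
    .groupby("rider_id")["rider_segment"]
    .last()
)

print(most_common.to_dict())
print(most_recent.to_dict())
```

The two rules disagree for both riders in this toy data, which is why the chosen rule should be written down alongside the dimension rather than left implicit in the code.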

### Common Challenges

- **ID format differences**: Some systems use prefixes (e.g., `R12345` vs `12345`)
- **Case sensitivity**: `rider_001` vs `RIDER_001`
- **Orphan records**: Riders that appear in only one source, such as the 1,427 riders here that appear in edges but have no trips in `raw_trips`
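
The first two challenges can often be handled by normalizing IDs before comparing them. The sketch below shows one possible rule set; the `normalize_rider_id` helper and the choice of `R` as the canonical prefix are assumptions, so adjust them to whatever format your sources agree on.

```python
from typing import Optional


def normalize_rider_id(raw: Optional[object], prefix: str = "R") -> Optional[str]:
    """Trim, upper-case, and add the prefix to bare numeric IDs.

    Returns None for missing or empty input. The canonical prefix is an
    assumption for this example.
    """
    if raw is None:
        return None
    s = str(raw).strip().upper()
    if not s:
        return None
    if s.isdigit():
        s = prefix + s  # "12345" becomes "R12345"; leading zeros are kept
    return s


for value in [" r12345 ", "12345", "rider_001", ""]:
    print(repr(value), "->", normalize_rider_id(value))
```

Applying the same function to both sources before building the ID sets ensures that formatting differences do not show up as false "only in trips" or "only in edges" riders.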
