# Lesson 1 — Exercise 3 (Python): Connect to Neo4j and Preview Graph Edges

## Goal

Connect to a Neo4j instance containing graph edges (relationships like
`Rider-BOARDED->Vehicle`, `Station-CONNECTS_TO->Station`, etc.) and
quickly profile which **node types** and **relationship types** you'll
need to map into `stg.edges_raw`.

## What to build

A Jupyter notebook that:

1.  Connects with the official `neo4j` Python driver.

2.  Runs Cypher to:

    -   Count edges by `relationship`.
    -   Count by `(from_node_type, to_node_type)`.
    -   Sample 10 edges with fields needed later (`edge_id`, node
        IDs/types, `timestamp`, `route_id`, `mode`, etc.).

3.  Writes `/tmp/edges_preview.csv` and prints summary tables.

### Acceptance criteria

-   Reads `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` from the environment.
-   Prints top 5 relationships and top 5 node-type pairs.
-   Writes a 10-row sample with columns that will map into
    `stg.edges_raw` (edge_id, from_node_id/type, to_node_id/type,
    relationship, timestamp, route_id, mode, rider_id, station_id… when
    present).

------------------------------------------------------------------------

## Solution: Lesson 1, Exercise 3 (Connect to Neo4j and Preview Graph Edges)

Populate the Neo4j database by running the provided loader script.

In [1]:
!python populate-neo4j.py

Connected to Neo4j at bolt://localhost:7687 as neo4j
Loaded 122 edges … (Rider)-[:SHARES_TRIP_WITH]->(Rider)
Loaded 446 edges … (Rider)-[:ALIGHTED]->(Station)
Loaded 835 edges … (Rider)-[:TRANSFERRED_TO]->(Station)
Loaded 1335 edges … (Rider)-[:BOARDED]->(Vehicle)
Loaded 1370 edges … (Rider)-[:BOARDED]->(Vehicle)
Loaded 1802 edges … (Route)-[:ROUTE_SERVES]->(Station)
Loaded 2302 edges … (Station)-[:CONNECTS_TO]->(Station)
Loaded 2500 edges … (Station)-[:CONNECTS_TO]->(Station)

✅ Loaded 2500 edges from ./data/van_transit_graph_edges_neo4j.csv


### Step 1: Imports and Configuration

Load required libraries and set up connection parameters from environment variables.

In [2]:
import os
import pandas as pd
from neo4j import GraphDatabase

# --- Configuration (from environment variables or defaults) ---
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USER = os.environ.get("NEO4J_USER", "neo4j")
NEO4J_PASSWORD = os.environ.get("NEO4J_PASSWORD", "neo4jpass")
OUT_FILE = "/tmp/edges_preview.csv"

print(f"Neo4j URI: {NEO4J_URI}")
print(f"Neo4j User: {NEO4J_USER}")
print(f"Output file: {OUT_FILE}")

Neo4j URI: bolt://localhost:7687
Neo4j User: neo4j
Output file: /tmp/edges_preview.csv
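
The cell above falls back to default values, including a default password, when a variable is unset. As a minimal sketch of a stricter alternative, the hypothetical helper `load_neo4j_config` below (not part of the original notebook) raises an error when any of the three variables is missing or empty. The `demo_env` values are placeholders for illustration only.

```python
import os


def load_neo4j_config(env=os.environ):
    """Return (uri, user, password); raise if any variable is unset or empty."""
    names = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return tuple(env[name] for name in names)


# Placeholder values for illustration; in the notebook, call load_neo4j_config()
# with no argument so the real environment is read.
demo_env = {
    "NEO4J_URI": "bolt://localhost:7687",
    "NEO4J_USER": "neo4j",
    "NEO4J_PASSWORD": "example-only",
}
uri, user, password = load_neo4j_config(demo_env)
print(f"Neo4j URI: {uri}")
print(f"Neo4j User: {user}")
```

Failing fast here keeps a misconfigured environment from silently connecting with the fallback credentials.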


### Step 2: Connect to Neo4j

Establish a connection to the Neo4j database using the official Python driver.

In [3]:
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

# Verify connectivity
driver.verify_connectivity()
print("Successfully connected to Neo4j!")

Successfully connected to Neo4j!


### Step 3: Count Edges by Relationship Type

Query the graph for the five most common relationship types and the number of edges of each.

In [4]:
CYPHER_REL_COUNTS = """
MATCH ()-[r]->()
RETURN type(r) AS relationship, count(*) AS n
ORDER BY n DESC
LIMIT 5
"""

with driver.session() as session:
    result = session.run(CYPHER_REL_COUNTS)
    rel_counts = pd.DataFrame([record.data() for record in result])

print("Top 5 Relationship Types:")
print("-" * 40)
display(rel_counts)

Top 5 Relationship Types:
----------------------------------------


|   | relationship | n |
|---|---|---:|
| 0 | CONNECTS_TO | 698 |
| 1 | BOARDED | 535 |
| 2 | ROUTE_SERVES | 432 |
| 3 | TRANSFERRED_TO | 389 |
| 4 | ALIGHTED | 324 |
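
The profiling cells in this notebook all follow the same pattern: open a session, run a query, and convert each record with `record.data()`. A minimal sketch of that pattern as a reusable function is shown below. The name `fetch_rows`, the constant `CYPHER_REL_COUNTS_TOP_N`, and the `$top_n` parameter are illustrative and not part of the original notebook; the sketch relies on `session.run` accepting query parameters as keyword arguments.

```python
def fetch_rows(driver, cypher, **params):
    """Run one Cypher query and return its records as a list of dicts."""
    with driver.session() as session:
        result = session.run(cypher, **params)
        # Read the records while the session is still open.
        return [record.data() for record in result]


# Same query as above, with the row limit passed as a query parameter.
CYPHER_REL_COUNTS_TOP_N = """
MATCH ()-[r]->()
RETURN type(r) AS relationship, count(*) AS n
ORDER BY n DESC
LIMIT $top_n
"""

# In the notebook, with `driver` from Step 2 and pandas imported as pd:
# rel_counts = pd.DataFrame(fetch_rows(driver, CYPHER_REL_COUNTS_TOP_N, top_n=5))
```

Passing the limit as a parameter makes it easy to rerun the query with a larger `top_n` and see every relationship type rather than only the top five.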


### Step 4: Count by Node-Type Pairs

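Understand which types of nodes are connected to each other (e.g., Rider to Station, Route to Station). The query uses `labels(a)[0]`, so a node with several labels is counted under its first label only.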

In [5]:
CYPHER_NODETYPE_PAIRS = """
MATCH (a)-[r]->(b)
RETURN labels(a)[0] AS from_node_type, labels(b)[0] AS to_node_type, count(*) AS n
ORDER BY n DESC
LIMIT 5
"""

with driver.session() as session:
    result = session.run(CYPHER_NODETYPE_PAIRS)
    nodetype_pairs = pd.DataFrame([record.data() for record in result])

print("Top 5 Node-Type Pairs:")
print("-" * 40)
display(nodetype_pairs)

Top 5 Node-Type Pairs:
----------------------------------------


|   | from_node_type | to_node_type | n |
|---|---|---|---:|
| 0 | Rider | Station | 713 |
| 1 | Station | Station | 698 |
| 2 | Rider | Vehicle | 535 |
| 3 | Route | Station | 432 |
| 4 | Rider | Rider | 122 |
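
A quick arithmetic check helps confirm that the two profiles agree with each other and with the loader. The sketch below copies the counts from the Step 3 and Step 4 outputs and the 2500-edge total from the loader message; the variable names are illustrative.

```python
# Counts copied from the Step 3 output (top 5 relationship types).
REL_COUNTS_TOP5 = {
    "CONNECTS_TO": 698,
    "BOARDED": 535,
    "ROUTE_SERVES": 432,
    "TRANSFERRED_TO": 389,
    "ALIGHTED": 324,
}

# Counts copied from the Step 4 output (top 5 node-type pairs).
PAIR_COUNTS_TOP5 = {
    ("Rider", "Station"): 713,
    ("Station", "Station"): 698,
    ("Rider", "Vehicle"): 535,
    ("Route", "Station"): 432,
    ("Rider", "Rider"): 122,
}

TOTAL_EDGES = 2500  # total reported by the loader script

top5_rel_total = sum(REL_COUNTS_TOP5.values())
pair_total = sum(PAIR_COUNTS_TOP5.values())
outside_top5 = TOTAL_EDGES - top5_rel_total
rider_to_station = REL_COUNTS_TOP5["TRANSFERRED_TO"] + REL_COUNTS_TOP5["ALIGHTED"]

print(f"Edges in top 5 relationship types: {top5_rel_total}")  # 2378
print(f"Edges in top 5 node-type pairs:    {pair_total}")      # 2500
print(f"Edges outside the top 5 types:     {outside_top5}")    # 122
print(f"TRANSFERRED_TO + ALIGHTED:         {rider_to_station}")  # 713
```

The five node-type pairs account for all 2500 edges, while the top five relationship types cover 2378. The remaining 122 match the Rider-to-Rider pair count and the `SHARES_TRIP_WITH` line in the loader log, so the `LIMIT 5` in Step 3 hides one relationship type.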


### Step 5: Sample Edges with Key Fields

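Pull a sample of edges with the fields we'll need to map into the staging table `stg.edges_raw`. Because the query has no `ORDER BY`, `LIMIT 10` returns an arbitrary 10 edges; in the run below all of them are `SHARES_TRIP_WITH`.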

In [6]:
# Sample edges with properties needed for staging
# Uses properties stored by the data loader directly
CYPHER_SAMPLE = """
MATCH (a)-[r]->(b)
RETURN
  r.edge_id                             AS edge_id,
  a.id                                  AS from_node_id,
  labels(a)[0]                          AS from_node_type,
  b.id                                  AS to_node_id,
  labels(b)[0]                          AS to_node_type,
  type(r)                               AS relationship,
  r.timestamp                           AS timestamp,
  r.route_id                            AS route_id,
  r.mode                                AS mode,
  r.rider_id                            AS rider_id,
  r.station_id                          AS station_id,
  r.rider_segment                       AS rider_segment,
  r.edge_strength                       AS edge_strength,
  r.total_fare_cad                      AS total_fare_cad
LIMIT 10
"""

with driver.session() as session:
    result = session.run(CYPHER_SAMPLE)
    sample_edges = pd.DataFrame([record.data() for record in result])

print(f"Sampled {len(sample_edges)} edges")
print("-" * 40)
display(sample_edges)

Sampled 10 edges
----------------------------------------


|   | edge_id | from_node_id | from_node_type | to_node_id | to_node_type | relationship | timestamp | route_id | mode | rider_id | station_id | rider_segment | edge_strength | total_fare_cad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---:|---:|
| 0 | EDGE300010 | R42192 | Rider | R37585 | Rider | SHARES_TRIP_WITH | 2025-03-24T19:54:31.000000000 | R101 | skytrain | R42192 | S028 | Student | 0.498 | 3.4 |
| 1 | EDGE300024 | R42701 | Rider | R12676 | Rider | SHARES_TRIP_WITH | 2024-05-06T08:28:24.000000000 | R112 | seabus | R42701 | S007 | Commuter | 0.371 | 2.83 |
| 2 | EDGE300042 | R42192 | Rider | R15184 | Rider | SHARES_TRIP_WITH | 2024-06-10T13:24:13.000000000 | R008 | skytrain | R42192 | S002 | Commuter | 0.849 | 3.19 |
| 3 | EDGE300043 | R78643 | Rider | R59654 | Rider | SHARES_TRIP_WITH | 2024-11-02T04:15:31.000000000 | R049 | bus | R78643 | S004 | Commuter | 0.208 | 2.19 |
| 4 | EDGE300048 | R71761 | Rider | R41733 | Rider | SHARES_TRIP_WITH | 2024-01-27T15:28:47.000000000 | R063 | wce | R71761 | S020 | Commuter | 0.716 | 4.57 |
| 5 | EDGE300052 | R90961 | Rider | R61816 | Rider | SHARES_TRIP_WITH | 2024-08-13T15:29:11.000000000 | R089 | wce | R90961 | S023 | Occasional | 0.15 | 2.05 |
| 6 | EDGE300062 | R81218 | Rider | R40846 | Rider | SHARES_TRIP_WITH | 2024-05-12T15:03:52.000000000 | R056 | bus | R81218 | S025 | Commuter | 0.33 | 3.31 |
| 7 | EDGE300080 | R16815 | Rider | R50565 | Rider | SHARES_TRIP_WITH | 2025-03-23T08:21:35.000000000 | R110 | skytrain | R16815 | S004 | Commuter | 0.277 | 4.51 |
| 8 | EDGE300081 | R54981 | Rider | R30292 | Rider | SHARES_TRIP_WITH | 2024-08-22T04:53:13.000000000 | R076 | bus | R54981 | S003 | Student | 0.731 | 2.17 |
| 9 | EDGE300111 | R86041 | Rider | R47297 | Rider | SHARES_TRIP_WITH | 2024-04-29T19:05:13.000000000 | R042 | bus | R86041 | S015 | Student | 0.199 | 2.18 |
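
All ten sampled edges share one relationship type, so the preview says little about the other edge shapes. One possible remedy, sketched below, is to fetch the rows in a stable order (for example by adding `ORDER BY r.edge_id` to the query and dropping the `LIMIT`) and then keep a few rows per relationship type in Python. The function `sample_per_relationship` and the `demo_edge_rows` data are hypothetical illustrations, not output from the database.

```python
from collections import defaultdict


def sample_per_relationship(rows, per_type=2):
    """Keep the first `per_type` rows seen for each relationship type."""
    seen = defaultdict(int)
    picked = []
    for row in rows:
        rel = row["relationship"]
        if seen[rel] < per_type:
            seen[rel] += 1
            picked.append(row)
    return picked


# Illustrative rows only; real rows would come from record.data().
demo_edge_rows = [
    {"edge_id": "E1", "relationship": "SHARES_TRIP_WITH"},
    {"edge_id": "E2", "relationship": "SHARES_TRIP_WITH"},
    {"edge_id": "E3", "relationship": "SHARES_TRIP_WITH"},
    {"edge_id": "E4", "relationship": "BOARDED"},
    {"edge_id": "E5", "relationship": "CONNECTS_TO"},
]
preview = sample_per_relationship(demo_edge_rows, per_type=2)
print([row["edge_id"] for row in preview])
```

With six relationship types in this graph, `per_type=2` would give a preview of up to 12 rows, so the 10-row acceptance criterion would need a final slice such as `preview[:10]`.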


### Step 6: Validate Required Columns for Staging

Check that the sample contains the columns we need to map into `stg.edges_raw`. This confirms column names only; it does not check for null values.

In [7]:
# Columns required for staging table mapping
required_columns = [
    "edge_id",
    "from_node_id",
    "from_node_type",
    "to_node_id",
    "to_node_type",
    "relationship",
    "timestamp"
]

present = [c for c in required_columns if c in sample_edges.columns]
missing = [c for c in required_columns if c not in sample_edges.columns]

print("Column Validation for stg.edges_raw:")
print("-" * 40)
print(f"Present: {present}")
if missing:
    print(f"MISSING: {missing}")
else:
    print("All required columns present!")

# Show data types
print("\nColumn Data Types:")
print(sample_edges.dtypes)

Column Validation for stg.edges_raw:
----------------------------------------
Present: ['edge_id', 'from_node_id', 'from_node_type', 'to_node_id', 'to_node_type', 'relationship', 'timestamp']
All required columns present!

Column Data Types:
edge_id            object
from_node_id       object
from_node_type     object
to_node_id         object
to_node_type       object
relationship       object
timestamp          object
route_id           object
mode               object
rider_id           object
station_id         object
rider_segment      object
edge_strength     float64
total_fare_cad    float64
dtype: object
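
Because every column is aliased in the Cypher `RETURN` clause, the column names will be present whenever the query returns rows, even if a property is missing on every edge. A minimal sketch of a complementary check that counts missing values per column is shown below. It works on the list of dicts produced by `record.data()`; the helper name `count_missing` and the `demo_null_rows` data are illustrative assumptions.

```python
def count_missing(rows, columns):
    """Count rows where each column is absent, None, or an empty string."""
    return {
        col: sum(1 for row in rows if row.get(col) in (None, ""))
        for col in columns
    }


# Illustrative rows only; real rows would come from record.data().
demo_null_rows = [
    {"edge_id": "E1", "timestamp": "2024-05-06T08:28:24", "route_id": "R112"},
    {"edge_id": "E2", "timestamp": None, "route_id": ""},
    {"edge_id": "E3", "route_id": "R008"},
]
missing_counts = count_missing(demo_null_rows, ["edge_id", "timestamp", "route_id"])
print(missing_counts)

# With the DataFrame from Step 5, a comparable check is:
# sample_edges[required_columns].isna().sum()
```

A nonzero count for a required column such as `edge_id` or `timestamp` is worth investigating before mapping the data into `stg.edges_raw`.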


### Step 7: Write Sample to CSV

Save the sample edges to a CSV file for reference.

In [8]:
sample_edges.to_csv(OUT_FILE, index=False)
print(f"Wrote {len(sample_edges)} rows to {OUT_FILE}")

# Verify the file
verify_df = pd.read_csv(OUT_FILE)
print(f"Verified: {len(verify_df)} rows, {len(verify_df.columns)} columns")

Wrote 10 rows to /tmp/edges_preview.csv
Verified: 10 rows, 14 columns


### Step 8: Clean Up

Close the Neo4j driver connection.

In [9]:
driver.close()
print("Neo4j connection closed.")

Neo4j connection closed.
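
If an earlier cell raises an error, this clean-up cell may never run and the driver stays open. Outside a notebook, one way to guarantee the close is to wrap the work in a context manager. The sketch below uses `contextlib.closing`, which calls `close()` on exit; `run_with_driver`, `make_driver`, and `work` are hypothetical names introduced for illustration.

```python
from contextlib import closing


def run_with_driver(make_driver, work):
    """Create a driver, pass it to `work`, and close it even if `work` fails."""
    with closing(make_driver()) as driver:
        return work(driver)


# In a script, with the configuration from Step 1:
# run_with_driver(
#     lambda: GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD)),
#     lambda driver: driver.verify_connectivity(),
# )
```

The driver object returned by `GraphDatabase.driver` can also be used directly in a `with` statement, which has the same effect.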


------------------------------------------------------------------------

## Summary

In this exercise, you:

1. Connected to Neo4j using the official Python driver
2. Profiled relationship types to understand the graph structure
3. Analyzed node-type pairs to see how entities connect
4. Sampled edges with fields needed for the staging table
5. Validated that required columns are present
6. Exported a preview CSV for reference