# Lesson 1 - Exercise 3 (Python): Connect to Neo4j and Preview Graph Edges

## Goal

Connect to a Neo4j instance containing graph edges (relationships like
`Rider-BOARDED->Station`, `Station-CONNECTS_TO->Station`, etc.) and
quickly profile which **node types** and **relationship types** you'll
need to map into `stg.edges_raw`.

## What to build

A Jupyter notebook that:

1.  Connects with the official `neo4j` Python driver.

2.  Runs Cypher to:

    -   Count edges by `relationship`.
    -   Count by `(from_node_type, to_node_type)`.
    -   Sample 10 edges with fields needed later (`edge_id`, node
        IDs/types, `timestamp`, `route_id`, `mode`, etc.).

3.  Writes `/tmp/edges_preview.csv` and prints summary tables.

------------------------------------------------------------------------

Populate the Neo4j database by running the provided script:

In [None]:
!python populate-neo4j.py

### Step 1: Imports and Configuration

Load the required libraries and read the connection parameters from environment variables, falling back to local defaults.

In [None]:
import os
import pandas as pd
from neo4j import GraphDatabase

# --- Configuration (from environment variables or defaults) ---
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USER = os.environ.get("NEO4J_USER", "neo4j")
NEO4J_PASSWORD = os.environ.get("NEO4J_PASSWORD", "neo4jpass")
OUT_FILE = "/tmp/edges_preview.csv"

print(f"Neo4j URI: {NEO4J_URI}")
print(f"Neo4j User: {NEO4J_USER}")
print(f"Output file: {OUT_FILE}")

### Step 2: Connect to Neo4j

Establish a connection to the Neo4j database using the official Python driver.

In [None]:
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

# Verify connectivity
driver.verify_connectivity()
print("Successfully connected to Neo4j!")

### Step 3: Count Edges by Relationship Type

Query the graph to see which relationship types exist and how many of each.

**TODO**: Write a Cypher query that counts edges by their relationship type. The query should:
- Match all relationships using the pattern `(a)-[r]->(b)`
- Use `type(r)` to get the relationship type
- Group by relationship type and count occurrences
- Order by count descending and limit to the top 5

Execute the query and convert results to a DataFrame.
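
The execute-and-convert step has the same shape in Steps 3 to 5, so it can help to wrap it in a small helper. The sketch below is one possible way to do that, not the exercise's reference solution: `run_query` is a hypothetical name, and it assumes only that the driver exposes `session()` as a context manager and that each returned record has a `data()` method, as the hint in the next cell suggests.

```python
def run_query(driver, cypher, **params):
    """Run a Cypher query and return its rows as a list of dicts.

    `run_query` is a hypothetical helper, not part of the neo4j driver.
    """
    with driver.session() as session:
        result = session.run(cypher, **params)
        # Read the records while the session is still open; the result
        # cannot be consumed after the session has closed.
        return [record.data() for record in result]
```

With a live driver, `pd.DataFrame(run_query(driver, CYPHER_REL_COUNTS))` would then give the DataFrame for this step.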

In [None]:
CYPHER_REL_COUNTS = """
// TODO: Write your Cypher query here
"""

with driver.session() as session:
    # TODO: Execute the query and convert results to a DataFrame
    # Hint: Use session.run() and convert records with record.data()
    result = None
    rel_counts = pd.DataFrame()

print("Top 5 Relationship Types:")
print("-" * 40)
display(rel_counts)

### Step 4: Count by Node-Type Pairs

Understand how different entity types connect in the graph.

**TODO**: Write a Cypher query that counts edges grouped by the combination of:
- Source node type (use `labels(a)[0]` to get the first label)
- Target node type (use `labels(b)[0]`)

Order by count descending and limit to the top 5.
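
To make the grouping concrete, here is a plain-Python sketch of the same aggregation over a handful of made-up edges. The edge tuples and the `count_node_pairs` name are illustrative assumptions, not data from the exercise graph; the real counts should come from your Cypher query. It also assumes each node carries exactly one label, since with several labels you should not rely on which one `labels(a)[0]` returns.

```python
from collections import Counter

# Hypothetical edges as (from_node_type, to_node_type) pairs; in the
# exercise these values come from labels(a)[0] and labels(b)[0].
edges = [
    ("Rider", "Station"),
    ("Station", "Station"),
    ("Rider", "Station"),
    ("Rider", "Station"),
    ("Station", "Station"),
]

def count_node_pairs(pairs, top_n=5):
    """Count edges per (source type, target type), most frequent first."""
    return Counter(pairs).most_common(top_n)

for (from_type, to_type), cnt in count_node_pairs(edges):
    print(f"{from_type} -> {to_type}: {cnt}")
```

Each distinct pair becomes one output row with its count, which is the shape `node_pair_counts` should have once the query is written.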

In [None]:
CYPHER_NODE_PAIRS = """
// TODO: Write your Cypher query here
"""

with driver.session() as session:
    # TODO: Execute the query and convert results to a DataFrame
    result = None
    node_pair_counts = pd.DataFrame()

print("Top 5 Node-Type Pairs:")
print("-" * 40)
display(node_pair_counts)

### Step 5: Sample Edges with Properties

Get a sample of edges with all the properties we'll need for the staging table.

**TODO**: Write a Cypher query that returns a sample of 10 edges with the following fields:
- `edge_id`: from `r.edge_id`
- `from_node_id`: from `a.id`
- `from_node_type`: from `labels(a)[0]`
- `to_node_id`: from `b.id`
- `to_node_type`: from `labels(b)[0]`
- `relationship`: from `type(r)`
- `timestamp`: from `r.timestamp`
- `route_id`: from `r.route_id`
- `mode`: from `r.mode`
- `rider_id`: from `r.rider_id`
- `station_id`: from `r.station_id`
- `rider_segment`: from `r.rider_segment`
- `edge_strength`: from `r.edge_strength`
- `total_fare_cad`: from `r.total_fare_cad`

In [None]:
CYPHER_SAMPLE = """
// TODO: Write your Cypher query here
"""

with driver.session() as session:
    # TODO: Execute the query and convert results to a DataFrame
    result = None
    sample_edges = pd.DataFrame()

print(f"Sampled {len(sample_edges)} edges")
print("-" * 40)
display(sample_edges)

### Step 6: Validate Required Columns for Staging

Check that the sample contains the columns we need to map into `stg.edges_raw`.

In [None]:
# Columns required for staging table mapping
required_columns = [
    "edge_id",
    "from_node_id",
    "from_node_type",
    "to_node_id",
    "to_node_type",
    "relationship",
    "timestamp"
]

present = [c for c in required_columns if c in sample_edges.columns]
missing = [c for c in required_columns if c not in sample_edges.columns]

print("Column Validation for stg.edges_raw:")
print("-" * 40)
print(f"Present: {present}")
if missing:
    print(f"MISSING: {missing}")
else:
    print("All required columns present!")

# Show data types
print("\nColumn Data Types:")
print(sample_edges.dtypes)

### Step 7: Write Sample to CSV

Save the sample edges to a CSV file for reference.

**TODO**: Write the `sample_edges` DataFrame to the CSV file specified by `OUT_FILE`. Then read it back to verify the write was successful.

In [None]:
# TODO: Write sample_edges to CSV (without the index)


# TODO: Read the file back to verify
verify_df = pd.DataFrame()  # placeholder so the cell runs; replace with the DataFrame read back from OUT_FILE

print(f"Wrote {len(sample_edges)} rows to {OUT_FILE}")
print(f"Verified: {len(verify_df)} rows, {len(verify_df.columns)} columns")

### Step 8: Clean Up

Close the Neo4j driver connection.

In [None]:
driver.close()
print("Neo4j connection closed.")

------------------------------------------------------------------------

## Summary

In this exercise, you:

1. Connected to Neo4j using the official Python driver
2. Profiled relationship types to understand the graph structure
3. Analyzed node-type pairs to see how entities connect
4. Sampled edges with fields needed for the staging table
5. Validated that required columns are present
6. Exported a preview CSV for reference

These patterns (env-driven config, bounded reads, quick stats, CSV outputs) are exactly what you'll reuse when building **ETL pipelines** in later lessons and for the final **e-commerce project**.