# 📊 Graph Database Fundamentals with Neo4j

## Welcome to Graph Databases!

In this notebook, you'll learn how to use **graph databases** to model and query complex, interconnected data. While vector databases excel at similarity search, graph databases are optimized for understanding and traversing **relationships** between entities.

## 🎯 What You'll Learn

- Core graph database concepts: nodes, relationships, and properties
- **Cypher query language** for pattern matching and graph traversal
- Building knowledge graphs from structured data
- Advanced pattern matching and multi-hop relationship queries
- When to use graph databases vs. vector databases in your AI applications

## 💼 Business Value: Why Graph Databases Matter

Graph databases power some of the world's most critical applications:

- **Social Networks**: LinkedIn uses graphs to show "how you're connected" to other professionals
- **Fraud Detection**: Banks detect fraud rings by analyzing transaction networks in real-time
- **Recommendation Systems**: Netflix and Amazon use graphs to find "people like you" and recommend content
- **Knowledge Management**: Enterprises build knowledge graphs to connect documents, people, projects, and expertise
- **Supply Chain Optimization**: Companies model complex supply networks to identify bottlenecks and risks

Unlike traditional databases that struggle with complex joins across multiple tables, graph databases are designed to traverse relationships at lightning speed—even across millions of connections.

## 🔗 How This Connects to Your Previous Learning

You've already learned about:
- **Vector databases**: Great for similarity search ("find documents similar to this")
- **RAG systems**: Retrieve relevant context for LLM queries
- **LlamaIndex**: Orchestrate data ingestion and retrieval

Graph databases complement these tools by adding **relationship intelligence**. In the next notebook, you'll combine graph databases with vector search to create **GraphRAG**—the next evolution of retrieval-augmented generation.

Let's get started! 🚀

---
## 📚 Part 1: Theory - What Are Graph Databases?

### Understanding Graph Databases

A **graph database** stores data as a network of interconnected nodes and relationships, rather than rows and tables. This structure mirrors how we naturally think about connected information: people know people, products belong to categories, documents cite other documents.

Traditional **relational databases** require JOIN operations to traverse relationships. Each additional hop ("find friends of friends of friends") adds another join, so query cost can grow steeply with traversal depth and table size. **Document databases** like MongoDB store data as self-contained documents, but struggle to represent complex relationships efficiently.

Graph databases solve this by making relationships **first-class citizens**. Relationships are stored directly alongside the data, so following a relationship from a node is a direct lookup whose cost does not depend on the total size of the database. A query that finds "friends within 3 degrees of separation" only touches the neighborhood around the starting node, which is why it can stay fast even with millions of users.

### 🔑 Key Terminology

Every graph database is built from three fundamental elements:

1. **Nodes** (also called vertices): Represent entities in your domain
   - Examples: Person, Company, Product, Document
   - Have **labels** to categorize them (e.g., `:Employee`, `:Department`)
   - Contain **properties** with key-value data (e.g., `name: "Sarah"`, `age: 35`)

2. **Relationships** (also called edges): Connect nodes and represent how they're related
   - Examples: WORKS_FOR, PURCHASED, KNOWS, MANAGES
   - Always have a **direction** (though you can query in either direction)
   - Also contain **properties** (e.g., `since: 2020`, `strength: 0.8`)

3. **Properties**: Key-value pairs that store data on nodes and relationships
   - Examples: `salary: 75000`, `created_at: "2024-01-15"`, `priority: "high"`
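The three elements above can be sketched in plain Python to make the model concrete. This is only an illustration of the property graph idea, not how Neo4j stores data internally; the `Node` and `Relationship` classes and the "Acme" company are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet


@dataclass
class Node:
    """An entity: labels categorize it, properties describe it."""
    node_id: int
    labels: FrozenSet[str]
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed connection from start to end, with its own properties."""
    rel_type: str
    start: Node
    end: Node
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical sample data for illustration
sarah = Node(1, frozenset({"Employee"}), {"name": "Sarah", "age": 35})
acme = Node(2, frozenset({"Company"}), {"name": "Acme"})
works_for = Relationship("WORKS_FOR", sarah, acme, {"since": 2020})

# Print the relationship in a Cypher-like shape
print(
    f"({works_for.start.properties['name']})"
    f"-[:{works_for.rel_type} {works_for.properties}]->"
    f"({works_for.end.properties['name']})"
)
```

Note that the relationship holds direct references to its start and end nodes, and carries properties of its own, just like a node does.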

### 🎯 When to Use Graph Databases vs. Vector Databases

Choose **graph databases** when:
- Your data has complex, multi-hop relationships (organizational hierarchies, social networks)
- You need to answer "how are these connected?" questions
- Relationships are as important as the entities themselves
- You're detecting patterns across connections (fraud rings, influence networks)

Choose **vector databases** when:
- You need semantic similarity search ("find documents similar to this query")
- Your data is unstructured (text, images, audio)
- Relationships are less important than content similarity

The best systems use **both**: vector search to find relevant content, graph traversal to understand context and relationships. You'll learn this hybrid approach in Notebook 2!

### 🌟 Real-World Use Cases

1. **LinkedIn Connections**: "How are you connected to this person?" requires traversing the professional network graph
2. **Fraud Detection**: Banks analyze transaction graphs to find suspicious patterns—multiple accounts transferring to the same destination
3. **Recommendation Engines**: "Users who bought X also bought Y" is a graph traversal problem
4. **Drug Discovery**: Pharmaceutical companies use knowledge graphs to find relationships between proteins, diseases, and compounds
5. **IT Operations**: Map dependencies between services to understand cascade failures

### 💡 Key Point: The Power of Relationship Traversal

In a relational database, finding "friends of friends" requires multiple self-joins, and each additional hop becomes more expensive as the tables grow. In a graph database, you simply follow the relationships: each hop is a direct lookup, so query time depends on how much of the graph you actually traverse rather than on the total amount of data stored. This is called **index-free adjacency**, and it's what makes graph databases uniquely powerful for connected data.
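To make the traversal idea concrete, here is a minimal in-memory sketch using adjacency lists and a breadth-first search. It is an analogy rather than Neo4j's actual storage engine, and the `FRIENDS` data and `within_hops` helper are hypothetical. The point it illustrates: every hop is a direct lookup on the current node, so the work grows with the relationships visited, not with the total number of people stored.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical friendship graph stored as adjacency lists:
# each person holds direct references to their neighbors.
FRIENDS: Dict[str, List[str]] = {
    "Ana": ["Ben", "Cy"],
    "Ben": ["Ana", "Dee"],
    "Cy": ["Ana", "Dee"],
    "Dee": ["Ben", "Cy", "Eli"],
    "Eli": ["Dee"],
}


def within_hops(graph: Dict[str, List[str]], start: str, max_hops: int) -> Set[str]:
    """Return everyone reachable from start in at most max_hops steps."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        person, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph[person]:  # direct lookup, no table scan or join
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, depth + 1))
    seen.discard(start)  # the starting person is not their own friend
    return seen


print(sorted(within_hops(FRIENDS, "Ana", 1)))
print(sorted(within_hops(FRIENDS, "Ana", 2)))
```

Raising `max_hops` widens the neighborhood explored, which is what drives the cost of a multi-hop query.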

### 🎯 Key Takeaways

- Graph databases store data as **nodes** (entities) connected by **relationships**
- Relationships are stored directly, enabling fast traversal across millions of connections
- Traversal cost depends on the part of the graph you explore, not on total database size, so multi-hop queries stay practical where deep relational joins become expensive
- Graph databases excel at "how are these connected?" questions
- They complement vector databases—use graphs for relationships, vectors for similarity
- Real-world applications: social networks, fraud detection, recommendations, knowledge management

---
## ⚙️ Part 2: Setup - Installing Neo4j in Google Colab

Neo4j is one of the most widely used graph databases. We'll install **Neo4j Community Edition** directly in this Colab environment, with no external accounts or services needed!

### What We're Installing

- **Neo4j database**: Graph database engine that runs locally in this Colab VM
- **Python driver**: Official Neo4j Python library for executing Cypher queries
- **Supporting libraries**: pandas, matplotlib, networkx for data manipulation and visualization

This installation takes about 30 seconds. Let's begin! ⚡

In [None]:
# 📦 Install Neo4j Community Edition in Colab
# This sets up a full Neo4j database server in this virtual machine

import time
import subprocess
import os

print("🔄 Step 1: Installing Java (Neo4j requirement)...")
!apt-get update -qq > /dev/null 2>&1
!apt-get install -y openjdk-17-jdk wget gnupg > /dev/null 2>&1

print("✅ Java installed!")

print("\n📥 Step 2: Adding Neo4j repository...")
# Add Neo4j GPG key
!wget -O - https://debian.neo4j.com/neotechnology.gpg.key | apt-key add - > /dev/null 2>&1
# Add Neo4j repository
!echo 'deb https://debian.neo4j.com stable latest' | tee /etc/apt/sources.list.d/neo4j.list > /dev/null 2>&1

print("✅ Neo4j repository added!")

print("\n📥 Step 3: Installing Neo4j Community Edition...")
!apt-get update -qq > /dev/null 2>&1
!apt-get install -y neo4j > /dev/null 2>&1

print("✅ Neo4j installed!")

# Verify installation
neo4j_installed = os.path.exists('/usr/bin/neo4j')
if not neo4j_installed:
    print("❌ Neo4j installation failed!")
    print("Checking for neo4j command...")
    !which neo4j
    raise Exception("Neo4j not found. Installation failed.")

print("\n🔐 Step 4: Setting initial password...")
!neo4j-admin dbms set-initial-password password123 2>&1 | grep -v "Warning"

print("✅ Password set!")

print("\n⚙️ Step 5: Configuring Neo4j...")
neo4j_conf = "/etc/neo4j/neo4j.conf"

if os.path.exists(neo4j_conf):
    # Backup original config
    !cp /etc/neo4j/neo4j.conf /etc/neo4j/neo4j.conf.backup
    
    # Update configuration for Colab
    config_updates = {
        'server.bolt.enabled': 'true',
        'server.bolt.listen_address': '0.0.0.0:7687',
        'server.http.enabled': 'true',
        'server.http.listen_address': '0.0.0.0:7474',
        'dbms.security.auth_enabled': 'true',
        'server.default_listen_address': '0.0.0.0',
        'server.directories.logs': '/var/log/neo4j',
        'server.directories.data': '/var/lib/neo4j/data'
    }
    
    # Read existing config
    with open(neo4j_conf, 'r') as f:
        config_lines = f.readlines()
    
    # Update or add configuration
    with open(neo4j_conf, 'w') as f:
        for line in config_lines:
            written = False
            # Iterate over a snapshot, since matched keys are popped from the dict below
            for key, value in list(config_updates.items()):
                if line.startswith(f'#{key}') or line.startswith(key):
                    f.write(f'{key}={value}\n')
                    config_updates.pop(key)
                    written = True
                    break
            if not written:
                f.write(line)
        
        # Add remaining configs that weren't in the file
        for key, value in config_updates.items():
            f.write(f'{key}={value}\n')
    
    print("✅ Configuration updated!")
else:
    print(f"⚠️ Config file not found at {neo4j_conf}")

print("\n🚀 Step 6: Starting Neo4j...")
start_result = !neo4j start 2>&1
print('\n'.join(start_result))

print("\n📚 Step 7: Installing Python drivers...")
!pip install -q neo4j py2neo

print("✅ Python drivers installed!")

# Wait for Neo4j to be ready
print("\n⏳ Step 8: Waiting for Neo4j to be ready...")
max_attempts = 40
attempt = 0
neo4j_ready = False

while attempt < max_attempts and not neo4j_ready:
    try:
        # Check if Neo4j reports itself as running.
        # Note: "not running" also contains "running", so the port check below is the real test.
        status = subprocess.run(['neo4j', 'status'],
                                capture_output=True,
                                text=True,
                                timeout=5)
        if 'running' in status.stdout.lower():
            # Double-check by trying to connect to the Bolt port
            import socket
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.settimeout(2)
            port_status = sock.connect_ex(('127.0.0.1', 7687))
            sock.close()
            if port_status == 0:
                neo4j_ready = True
                break
    except Exception:
        # Neo4j is not up yet (or the status call timed out); keep polling
        pass
    
    attempt += 1
    time.sleep(2)
    if attempt % 5 == 0:
        print(f"   Still waiting... ({attempt}/{max_attempts})")

if neo4j_ready:
    time.sleep(3)  # Give it a few more seconds
    print("\n✅ Neo4j is fully ready!")
else:
    print("\n⚠️ Neo4j may still be initializing...")
    print("   Don't worry - try the connection test and use the troubleshooting cell if needed.")

print("\n" + "="*60)
print("📊 CONNECTION DETAILS")
print("="*60)
print("URI:      bolt://localhost:7687")
print("Username: neo4j")
print("Password: password123")
print("="*60)

print("\n🔍 Final status check:")
!neo4j status

In [None]:
# 📦 Install additional dependencies for data manipulation and visualization

print("📥 Installing additional libraries...")
!pip install -q pandas matplotlib networkx python-louvain

# Suppress unnecessary warnings for cleaner output
import warnings
warnings.filterwarnings("ignore")

print("✅ All dependencies installed!")

In [None]:
# 📚 Import all necessary libraries

# Neo4j driver for database connection and queries
from neo4j import GraphDatabase

# Data manipulation and analysis
import pandas as pd
import json
from typing import List, Dict, Any, Optional

# Visualization libraries
import matplotlib.pyplot as plt
import networkx as nx
from matplotlib.patches import Patch

# Utilities
import time
from pprint import pprint

print("✅ All libraries imported successfully!")

In [None]:
# 🔗 Configure Neo4j connection parameters

NEO4J_URI = "bolt://localhost:7687"
NEO4J_USER = "neo4j"
NEO4J_PASSWORD = "password123"

# Create the database driver (connection pool)
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

print("✅ Neo4j driver created!")

In [None]:
# 🧪 Test Neo4j connection with retry logic

def test_neo4j_connection(max_retries=5, retry_delay=3):
    """
    Verify that we can connect to Neo4j and execute a simple query.
    This function tests the connection with retry logic for robustness.
    
    Args:
        max_retries: Maximum number of connection attempts
        retry_delay: Seconds to wait between retries
    """
    for attempt in range(1, max_retries + 1):
        try:
            print(f"🔄 Connection attempt {attempt}/{max_retries}...")
            
            # Open a session and run a simple query
            with driver.session() as session:
                # Count all nodes in the database (should be 0 for a fresh install)
                result = session.run("MATCH (n) RETURN count(n) as node_count")
                record = result.single()
                node_count = record["node_count"]
                
                print("\n✅ Successfully connected to Neo4j!")
                print(f"📊 Current node count: {node_count}")
                print("🎉 Database is ready to use!")
                return True
                
        except Exception as e:
            print(f"❌ Attempt {attempt} failed: {str(e)[:100]}")
            
            if attempt < max_retries:
                print(f"⏳ Waiting {retry_delay} seconds before retry...")
                time.sleep(retry_delay)
            else:
                print("\n❌ All connection attempts failed!")
                print("\n🔧 Troubleshooting steps:")
                print("   1. Check if Neo4j is running: !neo4j status")
                print("   2. Try restarting Neo4j: !neo4j restart")
                print("   3. Wait a bit longer and run this cell again")
                print("   4. Check Neo4j logs: !cat /var/log/neo4j/neo4j.log | tail -50")
                return False

# Run the connection test with retries
test_neo4j_connection()

In [None]:
# 🔧 TROUBLESHOOTING: Run this cell if connection failed

print("🔍 Checking Neo4j status...")
!neo4j status

print("\n🔄 Attempting to restart Neo4j...")
!neo4j restart

print("\n⏳ Waiting 30 seconds for Neo4j to fully initialize...")
import time
time.sleep(30)

print("\n✅ Neo4j restart complete!")
print("📊 Now try running the connection test cell above again.")

# Optionally check logs for errors
print("\n📋 Recent Neo4j logs (last 20 lines):")
!tail -20 /var/log/neo4j/neo4j.log 2>/dev/null || echo "Could not access logs"

### 🔧 Troubleshooting Neo4j Connection Issues

If the connection test above failed, try these steps:

**Option 1: Restart Neo4j**
Run the troubleshooting cell above to restart Neo4j and wait longer for it to initialize.

**Option 2: Check Neo4j Status**
Verify if Neo4j is actually running with: `!neo4j status`

**Option 3: View Logs**
Check Neo4j logs for errors: `!tail -50 /var/log/neo4j/neo4j.log`

**Common Issues:**
- Neo4j needs more time to start (wait 30-60 seconds)
- Memory issues in Colab (restart runtime and try again)
- Port conflicts (less common in Colab)

If problems persist, run the troubleshooting cell above and then re-run the connection test.

---
## 🔤 Part 3: Cypher Basics - The Graph Query Language

### Introduction to Cypher

**Cypher** is Neo4j's query language for working with graphs. If you know SQL, you'll find Cypher familiar but more intuitive for connected data. Instead of thinking about tables and joins, you think about **patterns** in the graph.

### Basic Syntax Patterns

Cypher uses ASCII art to represent graph patterns:

```cypher
()                    // A node
(p:Person)            // A node with label 'Person', aliased as 'p'
(p:Person {name: "Sarah"})  // A node with properties

-->                   // A relationship (directed)
-[r]->                // A relationship, aliased as 'r'
-[:KNOWS]->           // A relationship with type 'KNOWS'
-[:KNOWS {since: 2020}]->  // A relationship with properties

(p1:Person)-[:KNOWS]->(p2:Person)  // A complete pattern
```

### Core Cypher Commands

- **CREATE**: Add new nodes and relationships to the graph
- **MATCH**: Find patterns in the graph (roughly the role of `FROM` and `JOIN` in a SQL `SELECT`)
- **WHERE**: Filter results based on conditions
- **RETURN**: Specify what data to return
- **DELETE**: Remove nodes and relationships (a node that still has relationships cannot be deleted; use DETACH DELETE to remove a node together with its relationships)
- **MERGE**: Create the pattern if it doesn't exist, or match it if it does (helps prevent duplicates)
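The difference between CREATE and MERGE is easy to miss, so here is a small in-memory analogy in plain Python. It only mimics the semantics described above and is not how Neo4j implements them; the `create` and `merge` functions and the list-based store are hypothetical.

```python
from typing import Dict, List

Node = Dict[str, object]


def create(store: List[Node], label: str, **props) -> Node:
    """Mimic CREATE: always add a new node, even if an identical one exists."""
    node = {"label": label, **props}
    store.append(node)
    return node


def merge(store: List[Node], label: str, **props) -> Node:
    """Mimic MERGE: return the matching node if present, otherwise create it."""
    for node in store:
        if node["label"] == label and all(node.get(k) == v for k, v in props.items()):
            return node
    return create(store, label, **props)


created: List[Node] = []
create(created, "Person", name="Alice")
create(created, "Person", name="Alice")  # second, duplicate node

merged: List[Node] = []
merge(merged, "Person", name="Alice")
merge(merged, "Person", name="Alice")  # matches the existing node instead

print(len(created), len(merged))
```

Running the same CREATE twice leaves two nodes, while running the same MERGE twice leaves one. That is why MERGE is usually the safer choice for cells that may be re-run.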

### 💡 Key Point: Cypher's Pattern-Matching Power

The magic of Cypher is in pattern matching. Instead of writing complex joins, you simply **draw the pattern** you're looking for using ASCII art. The database figures out how to traverse the graph efficiently. Want to find "people who work with someone who knows a third person"? Just write the pattern:

```cypher
(p1:Person)-[:WORKS_WITH]->(p2:Person)-[:KNOWS]->(p3:Person)
```

Let's try some examples!

In [None]:
# 🛠️ Helper function to execute Cypher queries

def run_query(query: str, parameters: Optional[Dict[str, Any]] = None) -> List[Dict[str, Any]]:
    """
    Execute a Cypher query and return results as a list of dictionaries.
    
    Args:
        query: Cypher query string
        parameters: Optional dictionary of parameters for the query
    
    Returns:
        List of result records as dictionaries (an empty list if the query fails)
    """
    try:
        with driver.session() as session:
            result = session.run(query, parameters or {})
            # Convert results to list of dictionaries
            records = [dict(record) for record in result]
            return records
    except Exception as e:
        print(f"❌ Query execution failed: {str(e)}")
        print(f"Query: {query}")
        return []

print("✅ Helper function created!")
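The `parameters` argument of the helper is worth using whenever a query contains user-supplied or changing values. The sketch below shows one way to keep the value out of the query text by using a `$min_age` placeholder; the `people_older_than` function is a hypothetical name introduced for illustration, and the block only builds the query so that it runs without a database.

```python
from typing import Any, Dict, Tuple


def people_older_than(min_age: int) -> Tuple[str, Dict[str, Any]]:
    """Build a parameterized query; the value travels separately from the text."""
    query = (
        "MATCH (p:Person) "
        "WHERE p.age > $min_age "
        "RETURN p.name AS name, p.age AS age "
        "ORDER BY p.age DESC"
    )
    return query, {"min_age": min_age}


query, params = people_older_than(30)
print(query)
print(params)
# With a running database, pass both to the helper: run_query(query, params)
```

Keeping values in parameters avoids quoting mistakes and injection problems that come with string formatting, and the same query text can be reused with different values.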

### 🧪 Let's Practice Basic Cypher Queries

We'll start with simple examples to understand Cypher syntax:

In [None]:
# 1️⃣ CREATE a single node
print("1️⃣ Creating a Person node...")

query = """
CREATE (p:Person {name: "Alice", age: 30, role: "Data Scientist"})
RETURN p.name as name, p.age as age, p.role as role
"""
result = run_query(query)
print(f"✅ Created: {result[0]}\n")

In [None]:
# 2️⃣ CREATE multiple nodes with different labels
print("2️⃣ Creating multiple nodes...")

query = """
CREATE (p:Person {name: "Bob", age: 35}),
       (c:Company {name: "TechCorp", industry: "Software"}),
       (s:Skill {name: "Python", category: "Programming"})
RETURN 'Nodes created' as status
"""
result = run_query(query)
print(f"✅ {result[0]['status']}\n")

In [None]:
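# 3️⃣ CREATE nodes with relationships
# Note: CREATE always adds new nodes, so this creates a second "Alice" node
# alongside the one from step 1 (a MERGE on the Person node would reuse it instead).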
print("3️⃣ Creating nodes with relationships...")

query = """
CREATE (alice:Person {name: "Alice"})-[:WORKS_FOR {since: 2020}]->(company:Company {name: "DataCo"})
RETURN alice.name as person, company.name as company
"""
result = run_query(query)
print(f"✅ {result[0]['person']} works for {result[0]['company']}\n")

In [None]:
# 4️⃣ MATCH and return nodes
print("4️⃣ Finding all Person nodes...")

query = """
MATCH (p:Person)
RETURN p.name as name, p.age as age
ORDER BY p.name
"""
results = run_query(query)
print(f"✅ Found {len(results)} people:")
for person in results:
    print(f"   - {person['name']}, age: {person.get('age', 'N/A')}")
print()

In [None]:
# 5️⃣ MATCH patterns with relationships
print("5️⃣ Finding who works for which companies...")

query = """
MATCH (p:Person)-[r:WORKS_FOR]->(c:Company)
RETURN p.name as employee, c.name as company, r.since as since
"""
results = run_query(query)
if results:
    print(f"✅ Found {len(results)} employment relationships:")
    for rel in results:
        print(f"   - {rel['employee']} → {rel['company']} (since {rel.get('since', 'unknown')})")
else:
    print("No employment relationships found.")
print()

In [None]:
# 6️⃣ Use WHERE to filter results
print("6️⃣ Finding people over age 30...")

query = """
MATCH (p:Person)
WHERE p.age > 30
RETURN p.name as name, p.age as age
ORDER BY p.age DESC
"""
results = run_query(query)
if results:
    print(f"✅ Found {len(results)} people over 30:")
    for person in results:
        print(f"   - {person['name']}: {person['age']} years old")
else:
    print("No people over 30 found.")
print()

In [None]:
# 7️⃣ Clean up test data
print("7️⃣ Cleaning up test data...")

query = """
MATCH (n)
DETACH DELETE n
"""
run_query(query)
print("✅ All nodes and relationships deleted!\n")

# Verify database is empty
verify_query = "MATCH (n) RETURN count(n) as count"
result = run_query(verify_query)
print(f"📊 Current node count: {result[0]['count']}")

### 🎯 Cypher Basics - Key Takeaways

- Cypher uses **ASCII art** to represent graph patterns: `(node)-[:RELATIONSHIP]->(node)`
- **CREATE** adds new data, **MATCH** finds existing patterns
- Use **DETACH DELETE** to remove nodes and their relationships
- Properties are specified in curly braces: `{name: "Alice", age: 30}`
- Pattern matching is intuitive—just draw what you're looking for!
- Cypher reads like English: "Match this pattern, where this condition, return that"

Now that you understand basic Cypher, let's build a real knowledge graph! 🚀

---
## 🏢 Part 4: Building a Knowledge Graph

### Use Case: Company Organizational Knowledge Graph

We'll build a knowledge graph representing a company's organizational structure. This is a common real-world use case where graph databases excel.

### 📊 Our Graph Schema

**Entities (Nodes):**
- **Employee**: People who work at the company
- **Department**: Organizational units
- **Project**: Work initiatives
- **Skill**: Technical and professional competencies

**Relationships:**
- **(Employee)-[:WORKS_IN]->(Department)**: Where someone works
- **(Employee)-[:MANAGES]->(Employee)**: Organizational hierarchy
- **(Employee)-[:ASSIGNED_TO]->(Project)**: Project assignments
- **(Employee)-[:HAS_SKILL]->(Skill)**: Professional competencies

### 💡 Why This Structure Matters

This graph enables powerful queries like:
- "Who has Python skills in the Data Science department?"
- "Show me the reporting chain from employee to CEO"
- "Find colleagues who work on similar projects"
- "Which departments collaborate on projects most?"

Let's create realistic mock data for our company!
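As a rough preview, the first three questions above might translate into Cypher along these lines. These are sketches written against the schema described in this section, not queries taken from later in the notebook; the `SCHEMA_QUERIES` dictionary, the parameter names, the 10-hop bound on the reporting chain, and the `required_params` helper are all assumptions made for illustration.

```python
import re
from typing import Dict, Set

# Hypothetical query sketches for the schema in this section
SCHEMA_QUERIES: Dict[str, str] = {
    # Who has a given skill in a given department?
    "skill_in_department": """
        MATCH (e:Employee)-[:WORKS_IN]->(:Department {name: $department}),
              (e)-[:HAS_SKILL]->(:Skill {name: $skill})
        RETURN e.name AS name, e.title AS title
        ORDER BY name
    """,
    # Longest management path that ends at the given employee
    "reporting_chain": """
        MATCH path = (top:Employee)-[:MANAGES*1..10]->(e:Employee {name: $name})
        RETURN [n IN nodes(path) | n.name] AS chain
        ORDER BY length(path) DESC
        LIMIT 1
    """,
    # Colleagues who share at least one project with the given employee
    "project_colleagues": """
        MATCH (e:Employee {name: $name})-[:ASSIGNED_TO]->(p:Project)<-[:ASSIGNED_TO]-(peer:Employee)
        RETURN peer.name AS colleague, collect(p.name) AS shared_projects
        ORDER BY colleague
    """,
}


def required_params(query: str) -> Set[str]:
    """List the $parameters a query expects, to check them before running it."""
    return set(re.findall(r"\$(\w+)", query))


for name, query in SCHEMA_QUERIES.items():
    print(name, sorted(required_params(query)))
```

Once the graph is loaded, each query could be passed to the `run_query` helper together with a matching parameter dictionary, for example `{"department": "Data Science", "skill": "Python"}`.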

In [None]:
# 📝 Generate comprehensive mock data for our company knowledge graph

# Employees with realistic details
employees = [
    {"id": "emp001", "name": "Sarah Chen", "title": "VP of Data Science", "years_experience": 12, "email": "sarah.chen@company.com"},
    {"id": "emp002", "name": "Marcus Johnson", "title": "Senior ML Engineer", "years_experience": 7, "email": "marcus.j@company.com"},
    {"id": "emp003", "name": "Priya Patel", "title": "Data Scientist", "years_experience": 4, "email": "priya.p@company.com"},
    {"id": "emp004", "name": "James Liu", "title": "ML Engineer", "years_experience": 5, "email": "james.liu@company.com"},
    {"id": "emp005", "name": "Elena Rodriguez", "title": "Data Analyst", "years_experience": 3, "email": "elena.r@company.com"},
    {"id": "emp006", "name": "David Kim", "title": "VP of Engineering", "years_experience": 15, "email": "david.kim@company.com"},
    {"id": "emp007", "name": "Amy Zhang", "title": "Senior Backend Engineer", "years_experience": 8, "email": "amy.zhang@company.com"},
    {"id": "emp008", "name": "Carlos Santos", "title": "Frontend Engineer", "years_experience": 5, "email": "carlos.s@company.com"},
    {"id": "emp009", "name": "Lisa Anderson", "title": "DevOps Engineer", "years_experience": 6, "email": "lisa.a@company.com"},
    {"id": "emp010", "name": "Mohammed Ali", "title": "Backend Engineer", "years_experience": 4, "email": "mohammed.a@company.com"},
    {"id": "emp011", "name": "Sophie Martin", "title": "Product Manager", "years_experience": 9, "email": "sophie.m@company.com"},
    {"id": "emp012", "name": "Raj Sharma", "title": "Product Designer", "years_experience": 6, "email": "raj.s@company.com"},
    {"id": "emp013", "name": "Nina Williams", "title": "UX Researcher", "years_experience": 5, "email": "nina.w@company.com"},
    {"id": "emp014", "name": "Alex Turner", "title": "CEO", "years_experience": 20, "email": "alex.turner@company.com"},
    {"id": "emp015", "name": "Jennifer Lee", "title": "CTO", "years_experience": 18, "email": "jennifer.lee@company.com"},
    {"id": "emp016", "name": "Robert Brown", "title": "Senior Data Scientist", "years_experience": 8, "email": "robert.b@company.com"},
    {"id": "emp017", "name": "Maria Garcia", "title": "Junior ML Engineer", "years_experience": 2, "email": "maria.g@company.com"},
    {"id": "emp018", "name": "Tom Wilson", "title": "Data Engineer", "years_experience": 5, "email": "tom.w@company.com"},
]

# Departments with budgets
departments = [
    {"id": "dept001", "name": "Data Science", "budget": 2500000, "location": "Building A"},
    {"id": "dept002", "name": "Engineering", "budget": 5000000, "location": "Building B"},
    {"id": "dept003", "name": "Product", "budget": 1500000, "location": "Building A"},
    {"id": "dept004", "name": "Executive", "budget": 3000000, "location": "Building C"},
]

# Projects with status and priority
projects = [
    {"id": "proj001", "name": "Customer Churn Prediction", "status": "active", "priority": "high", "budget": 500000},
    {"id": "proj002", "name": "Recommendation Engine v2", "status": "active", "priority": "high", "budget": 750000},
    {"id": "proj003", "name": "Real-time Analytics Dashboard", "status": "active", "priority": "medium", "budget": 300000},
    {"id": "proj004", "name": "Mobile App Redesign", "status": "planning", "priority": "medium", "budget": 400000},
    {"id": "proj005", "name": "Data Pipeline Optimization", "status": "active", "priority": "high", "budget": 350000},
    {"id": "proj006", "name": "NLP Chatbot", "status": "completed", "priority": "low", "budget": 200000},
    {"id": "proj007", "name": "Fraud Detection System", "status": "active", "priority": "critical", "budget": 900000},
    {"id": "proj008", "name": "Infrastructure Migration", "status": "active", "priority": "high", "budget": 600000},
    {"id": "proj009", "name": "A/B Testing Platform", "status": "planning", "priority": "medium", "budget": 250000},
    {"id": "proj010", "name": "Computer Vision API", "status": "active", "priority": "medium", "budget": 450000},
]

# Skills taxonomy
skills = [
    {"name": "Python", "category": "Programming", "level": "essential"},
    {"name": "TensorFlow", "category": "ML Framework", "level": "advanced"},
    {"name": "PyTorch", "category": "ML Framework", "level": "advanced"},
    {"name": "SQL", "category": "Database", "level": "essential"},
    {"name": "Neo4j", "category": "Database", "level": "specialized"},
    {"name": "AWS", "category": "Cloud", "level": "essential"},
    {"name": "Docker", "category": "DevOps", "level": "essential"},
    {"name": "Kubernetes", "category": "DevOps", "level": "advanced"},
    {"name": "React", "category": "Frontend", "level": "essential"},
    {"name": "Node.js", "category": "Backend", "level": "essential"},
    {"name": "NLP", "category": "AI Specialty", "level": "specialized"},
    {"name": "Computer Vision", "category": "AI Specialty", "level": "specialized"},
    {"name": "Scikit-learn", "category": "ML Framework", "level": "essential"},
    {"name": "Apache Spark", "category": "Big Data", "level": "advanced"},
    {"name": "Product Strategy", "category": "Business", "level": "specialized"},
]

# Relationships: Employee works in Department
works_in = [
    ("emp001", "dept001"), ("emp002", "dept001"), ("emp003", "dept001"),
    ("emp004", "dept001"), ("emp005", "dept001"), ("emp016", "dept001"),
    ("emp017", "dept001"), ("emp018", "dept001"),
    ("emp006", "dept002"), ("emp007", "dept002"), ("emp008", "dept002"),
    ("emp009", "dept002"), ("emp010", "dept002"),
    ("emp011", "dept003"), ("emp012", "dept003"), ("emp013", "dept003"),
    ("emp014", "dept004"), ("emp015", "dept004"),
]

# Relationships: Manager manages Employee
manages = [
    ("emp014", "emp015"),  # CEO -> CTO
    ("emp014", "emp001"),  # CEO -> VP Data Science
    ("emp014", "emp006"),  # CEO -> VP Engineering
    ("emp015", "emp006"),  # CTO -> VP Engineering
    ("emp001", "emp002"),  # VP DS -> Senior ML Engineer
    ("emp001", "emp016"),  # VP DS -> Senior Data Scientist
    ("emp002", "emp003"),  # Senior ML -> Data Scientist
    ("emp002", "emp004"),  # Senior ML -> ML Engineer
    ("emp016", "emp017"),  # Senior DS -> Junior ML Engineer
    ("emp001", "emp018"),  # VP DS -> Data Engineer
    ("emp001", "emp005"),  # VP DS -> Data Analyst
    ("emp006", "emp007"),  # VP Eng -> Senior Backend
    ("emp007", "emp010"),  # Senior Backend -> Backend Engineer
    ("emp006", "emp008"),  # VP Eng -> Frontend Engineer
    ("emp006", "emp009"),  # VP Eng -> DevOps Engineer
    ("emp011", "emp012"),  # PM -> Product Designer
    ("emp011", "emp013"),  # PM -> UX Researcher
]

# Relationships: Employee assigned to Project
assigned_to = [
    ("emp001", "proj001"), ("emp003", "proj001"), ("emp005", "proj001"),
    ("emp002", "proj002"), ("emp004", "proj002"), ("emp016", "proj002"),
    ("emp018", "proj003"), ("emp007", "proj003"), ("emp005", "proj003"),
    ("emp008", "proj004"), ("emp012", "proj004"), ("emp013", "proj004"),
    ("emp018", "proj005"), ("emp009", "proj005"), ("emp007", "proj005"),
    ("emp004", "proj006"), ("emp003", "proj006"),
    ("emp002", "proj007"), ("emp016", "proj007"), ("emp004", "proj007"),
    ("emp009", "proj008"), ("emp007", "proj008"), ("emp010", "proj008"),
    ("emp011", "proj009"), ("emp012", "proj009"),
    ("emp003", "proj010"), ("emp017", "proj010"),
]

# Relationships: Employee has Skill
has_skill = [
    # Sarah Chen - VP Data Science
    ("emp001", "Python"), ("emp001", "TensorFlow"), ("emp001", "AWS"), ("emp001", "SQL"),
    # Marcus Johnson - Senior ML Engineer
    ("emp002", "Python"), ("emp002", "TensorFlow"), ("emp002", "PyTorch"), ("emp002", "Docker"),
    # Priya Patel - Data Scientist
    ("emp003", "Python"), ("emp003", "Scikit-learn"), ("emp003", "SQL"), ("emp003", "NLP"),
    # James Liu - ML Engineer
    ("emp004", "Python"), ("emp004", "PyTorch"), ("emp004", "Docker"), ("emp004", "NLP"),
    # Elena Rodriguez - Data Analyst
    ("emp005", "Python"), ("emp005", "SQL"), ("emp005", "Scikit-learn"),
    # David Kim - VP Engineering
    ("emp006", "Python"), ("emp006", "AWS"), ("emp006", "Kubernetes"), ("emp006", "Docker"),
    # Amy Zhang - Senior Backend Engineer
    ("emp007", "Python"), ("emp007", "Node.js"), ("emp007", "SQL"), ("emp007", "Docker"), ("emp007", "AWS"),
    # Carlos Santos - Frontend Engineer
    ("emp008", "React"), ("emp008", "Node.js"),
    # Lisa Anderson - DevOps Engineer
    ("emp009", "Docker"), ("emp009", "Kubernetes"), ("emp009", "AWS"),
    # Mohammed Ali - Backend Engineer
    ("emp010", "Python"), ("emp010", "Node.js"), ("emp010", "SQL"), ("emp010", "Docker"),
    # Sophie Martin - Product Manager
    ("emp011", "Product Strategy"), ("emp011", "SQL"),
    # Raj Sharma - Product Designer
    ("emp012", "React"),
    # Nina Williams - UX Researcher
    ("emp013", "Product Strategy"),
    # Robert Brown - Senior Data Scientist
    ("emp016", "Python"), ("emp016", "TensorFlow"), ("emp016", "Scikit-learn"), ("emp016", "SQL"),
    # Maria Garcia - Junior ML Engineer
    ("emp017", "Python"), ("emp017", "TensorFlow"), ("emp017", "Computer Vision"),
    # Tom Wilson - Data Engineer
    ("emp018", "Python"), ("emp018", "SQL"), ("emp018", "Apache Spark"), ("emp018", "AWS"),
]

print("✅ Mock data generated!")
print(f"   📊 {len(employees)} employees")
print(f"   🏢 {len(departments)} departments")
print(f"   📁 {len(projects)} projects")
print(f"   🎯 {len(skills)} skills")
print(f"   🔗 {len(works_in) + len(manages) + len(assigned_to) + len(has_skill)} relationships")
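Before loading hand-written data like this, it can help to check that every relationship tuple points at an id that actually exists. If relationships are later created by matching on these ids, a mistyped id would typically result in a silently missing relationship rather than an error. The sketch below uses tiny sample lists so that it runs on its own; `find_dangling` is a hypothetical helper, and `dept009` is a deliberate typo.

```python
from typing import Iterable, List, Set, Tuple


def find_dangling(
    pairs: Iterable[Tuple[str, str]],
    valid_sources: Set[str],
    valid_targets: Set[str],
) -> List[Tuple[str, str]]:
    """Return the pairs whose source or target id is not in the known id sets."""
    return [
        (source, target)
        for source, target in pairs
        if source not in valid_sources or target not in valid_targets
    ]


# Small hypothetical sample in the same shape as the notebook's lists
sample_employees = [{"id": "emp001"}, {"id": "emp002"}]
sample_departments = [{"id": "dept001"}]
sample_works_in = [("emp001", "dept001"), ("emp002", "dept009")]  # dept009 does not exist

employee_ids = {e["id"] for e in sample_employees}
department_ids = {d["id"] for d in sample_departments}

print(find_dangling(sample_works_in, employee_ids, department_ids))
```

With the notebook's own lists, the same check would be `find_dangling(works_in, {e["id"] for e in employees}, {d["id"] for d in departments})`. For `has_skill`, the target set would be the skill names rather than ids.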

In [None]:
# 🧹 Clear existing data to start fresh

print("🧹 Clearing database...")
run_query("MATCH (n) DETACH DELETE n")
print("✅ Database cleared!\n")

In [None]:
# 🏗️ Create Employee nodes

def create_employees():
    """
    Create all Employee nodes in the graph.
    Uses UNWIND for efficient batch creation.
    """
    query = """
    UNWIND $employees AS emp
    CREATE (e:Employee {
        id: emp.id,
        name: emp.name,
        title: emp.title,
        years_experience: emp.years_experience,
        email: emp.email
    })
    """
    run_query(query, {"employees": employees})
    print(f"✅ Created {len(employees)} Employee nodes")

create_employees()

In [None]:
# 🏢 Create Department nodes

def create_departments():
    """
    Create all Department nodes in the graph.
    """
    query = """
    UNWIND $departments AS dept
    CREATE (d:Department {
        id: dept.id,
        name: dept.name,
        budget: dept.budget,
        location: dept.location
    })
    """
    run_query(query, {"departments": departments})
    print(f"✅ Created {len(departments)} Department nodes")

create_departments()

In [None]:
# 📁 Create Project nodes

def create_projects():
    """
    Create all Project nodes in the graph.
    """
    query = """
    UNWIND $projects AS proj
    CREATE (p:Project {
        id: proj.id,
        name: proj.name,
        status: proj.status,
        priority: proj.priority,
        budget: proj.budget
    })
    """
    run_query(query, {"projects": projects})
    print(f"✅ Created {len(projects)} Project nodes")

create_projects()

In [None]:
# 🎯 Create Skill nodes

def create_skills():
    """
    Create all Skill nodes in the graph.
    """
    query = """
    UNWIND $skills AS skill
    CREATE (s:Skill {
        name: skill.name,
        category: skill.category,
        level: skill.level
    })
    """
    run_query(query, {"skills": skills})
    print(f"✅ Created {len(skills)} Skill nodes")

create_skills()

In [None]:
# 🔗 Create all relationships

def create_relationships():
    """
    Create all relationships between nodes.
    This establishes the graph structure.
    """
    # WORKS_IN relationships
    works_in_query = """
    UNWIND $relationships AS rel
    MATCH (e:Employee {id: rel.emp_id})
    MATCH (d:Department {id: rel.dept_id})
    CREATE (e)-[:WORKS_IN]->(d)
    """
    works_in_data = [{"emp_id": e, "dept_id": d} for e, d in works_in]
    run_query(works_in_query, {"relationships": works_in_data})
    print(f"✅ Created {len(works_in)} WORKS_IN relationships")
    
    # MANAGES relationships
    manages_query = """
    UNWIND $relationships AS rel
    MATCH (manager:Employee {id: rel.manager_id})
    MATCH (report:Employee {id: rel.report_id})
    CREATE (manager)-[:MANAGES]->(report)
    """
    manages_data = [{"manager_id": m, "report_id": r} for m, r in manages]
    run_query(manages_query, {"relationships": manages_data})
    print(f"✅ Created {len(manages)} MANAGES relationships")
    
    # ASSIGNED_TO relationships
    assigned_query = """
    UNWIND $relationships AS rel
    MATCH (e:Employee {id: rel.emp_id})
    MATCH (p:Project {id: rel.proj_id})
    CREATE (e)-[:ASSIGNED_TO]->(p)
    """
    assigned_data = [{"emp_id": e, "proj_id": p} for e, p in assigned_to]
    run_query(assigned_query, {"relationships": assigned_data})
    print(f"✅ Created {len(assigned_to)} ASSIGNED_TO relationships")
    
    # HAS_SKILL relationships
    skill_query = """
    UNWIND $relationships AS rel
    MATCH (e:Employee {id: rel.emp_id})
    MATCH (s:Skill {name: rel.skill_name})
    CREATE (e)-[:HAS_SKILL]->(s)
    """
    skill_data = [{"emp_id": e, "skill_name": s} for e, s in has_skill]
    run_query(skill_query, {"relationships": skill_data})
    print(f"✅ Created {len(has_skill)} HAS_SKILL relationships")

create_relationships()

In [None]:
# 🔍 Verify data load

print("📊 Verifying knowledge graph structure...\n")

# Count nodes by label
node_query = """
MATCH (n)
RETURN labels(n)[0] as label, count(*) as count
ORDER BY count DESC
"""
node_results = run_query(node_query)
print("📍 Nodes by type:")
for row in node_results:
    print(f"   {row['label']}: {row['count']}")

# Count relationships by type
rel_query = """
MATCH ()-[r]->()
RETURN type(r) as relationship, count(*) as count
ORDER BY count DESC
"""
rel_results = run_query(rel_query)
print("\n🔗 Relationships by type:")
for row in rel_results:
    print(f"   {row['relationship']}: {row['count']}")

print("\n✅ Knowledge graph successfully created!")

---
## 🔍 Part 5: Pattern Matching & Traversal

### The Power of Pattern Matching

Now that our knowledge graph is built, we can use the core strengths of graph databases: **pattern matching** and **relationship traversal**. A SQL query needs another join for every extra hop, whereas the cost of a graph traversal depends mainly on how many nodes and relationships the query actually visits rather than on the total size of the dataset.

### What We'll Explore

- **Single-hop queries**: Direct relationships (employee → department)
- **Multi-hop queries**: Chained relationships (employee → project → other employees)
- **Variable-length paths**: Find connections within N degrees
- **Aggregations**: Count, collect, and analyze patterns
- **Pattern combinations**: Complex queries matching multiple patterns simultaneously

### 💡 Key Point: Traversal Performance

Neo4j's **index-free adjacency** means each node physically stores references to its connected relationships. Traversing from one node to another is a pointer lookup, with no index search per hop. This is why a graph database can follow large numbers of relationships quickly when a query starts from a well-anchored node, while the equivalent chain of joins in a relational database tends to get more expensive with every hop.
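
As a rough illustration of the idea (a toy model in plain Python, not how Neo4j actually lays out its store), compare a flat edge table that must be scanned on every lookup with per-node adjacency lists. The names `edge_table`, `adjacency_lists`, `neighbors_by_scan`, and `neighbors_by_adjacency` are made up for this sketch, and the department IDs are hypothetical sample values.

```python
from collections import defaultdict

# Hypothetical sample edges: (start node, relationship type, end node)
edge_table = [
    ("emp002", "WORKS_IN", "dept01"),
    ("emp003", "WORKS_IN", "dept01"),
    ("emp008", "WORKS_IN", "dept02"),
    ("emp002", "HAS_SKILL", "Python"),
]

def neighbors_by_scan(node, rel_type):
    """Join-style lookup without an index: inspect every row of the edge table."""
    return [end for start, rel, end in edge_table
            if start == node and rel == rel_type]

# Adjacency model: each node keeps its own list of outgoing relationships
adjacency_lists = defaultdict(list)
for start, rel, end in edge_table:
    adjacency_lists[start].append((rel, end))

def neighbors_by_adjacency(node, rel_type):
    """Traversal-style lookup: touch only the relationships stored on this node."""
    return [end for rel, end in adjacency_lists.get(node, [])
            if rel == rel_type]

print(neighbors_by_scan("emp002", "WORKS_IN"))
print(neighbors_by_adjacency("emp002", "WORKS_IN"))
```

Both functions return the same answer. The difference is how much data each one has to look at: the scan grows with the size of the whole table, the adjacency lookup only with the number of relationships on one node.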

Let's explore with practical queries! 🚀

In [None]:
# 1️⃣ Find all employees in a specific department
print("1️⃣ Who works in the Data Science department?\n")

query = """
MATCH (e:Employee)-[:WORKS_IN]->(d:Department {name: "Data Science"})
RETURN e.name as name, e.title as title, e.years_experience as experience
ORDER BY e.years_experience DESC
"""
results = run_query(query)
print(f"📊 Found {len(results)} Data Science employees:\n")
for emp in results:
    print(f"   • {emp['name']} - {emp['title']} ({emp['experience']} years)")
print()

In [None]:
# 2️⃣ Find organizational hierarchy (who manages whom)
print("2️⃣ Organizational reporting structure:\n")

query = """
MATCH (manager:Employee)-[:MANAGES]->(report:Employee)
RETURN manager.name as manager, manager.title as manager_title,
       report.name as report, report.title as report_title
ORDER BY manager.name
"""
results = run_query(query)
print(f"📊 Found {len(results)} management relationships:\n")
current_manager = None
for rel in results[:10]:  # Show first 10 for brevity
    if rel['manager'] != current_manager:
        current_manager = rel['manager']
        print(f"\n👤 {rel['manager']} ({rel['manager_title']}) manages:")
    print(f"   └─ {rel['report']} ({rel['report_title']})")
print()

In [None]:
# 3️⃣ Find employees with specific skills
print("3️⃣ Who has TensorFlow skills?\n")

query = """
MATCH (e:Employee)-[:HAS_SKILL]->(s:Skill {name: "TensorFlow"})
RETURN e.name as name, e.title as title
ORDER BY e.name
"""
results = run_query(query)
print(f"📊 Found {len(results)} employees with TensorFlow skills:\n")
for emp in results:
    print(f"   🔧 {emp['name']} - {emp['title']}")
print()

In [None]:
# 4️⃣ Find employees working on high-priority projects
print("4️⃣ Who's working on high-priority projects?\n")

query = """
MATCH (e:Employee)-[:ASSIGNED_TO]->(p:Project)
WHERE p.priority IN ['high', 'critical']
RETURN p.name as project, p.priority as priority, p.status as status,
       collect(e.name) as team_members
ORDER BY 
    CASE p.priority 
        WHEN 'critical' THEN 1 
        WHEN 'high' THEN 2 
        ELSE 3 
    END
"""
results = run_query(query)
print(f"📊 Found {len(results)} high-priority projects:\n")
for proj in results:
    priority_emoji = "🔴" if proj['priority'] == 'critical' else "🟠"
    print(f"{priority_emoji} {proj['project']} [{proj['status']}]")
    print(f"   Team: {', '.join(proj['team_members'])}\n")
print()

In [None]:
# 5️⃣ Find colleagues (people in same department)
print("5️⃣ Who are Sarah Chen's colleagues?\n")

query = """
MATCH (sarah:Employee {name: "Sarah Chen"})-[:WORKS_IN]->(d:Department)<-[:WORKS_IN]-(colleague:Employee)
WHERE sarah <> colleague
RETURN colleague.name as name, colleague.title as title, d.name as department
ORDER BY colleague.name
"""
results = run_query(query)
print(f"📊 Sarah Chen has {len(results)} colleagues in her department:\n")
for col in results:
    print(f"   👥 {col['name']} - {col['title']}")
print()

In [None]:
# 6️⃣ Skill overlap analysis
print("6️⃣ Who shares the most skills with Sarah Chen?\n")

query = """
MATCH (sarah:Employee {name: "Sarah Chen"})-[:HAS_SKILL]->(s:Skill)<-[:HAS_SKILL]-(colleague:Employee)
WHERE sarah <> colleague
RETURN colleague.name as name, 
       colleague.title as title,
       count(s) as shared_skills,
       collect(s.name) as skills
ORDER BY shared_skills DESC
LIMIT 5
"""
results = run_query(query)
print(f"📊 Top colleagues by shared skills with Sarah Chen:\n")
for i, person in enumerate(results, 1):
    print(f"{i}. {person['name']} ({person['title']})")
    print(f"   Shared skills ({person['shared_skills']}): {', '.join(person['skills'])}\n")
print()

In [None]:
# 7️⃣ Project team composition
print("7️⃣ Who's on the 'Customer Churn Prediction' team?\n")

query = """
MATCH (e:Employee)-[:ASSIGNED_TO]->(p:Project {name: "Customer Churn Prediction"})
RETURN p.name as project, 
       p.status as status,
       p.priority as priority,
       collect({name: e.name, title: e.title}) as team
"""
results = run_query(query)
if results:
    proj = results[0]
    print(f"📁 Project: {proj['project']}")
    print(f"   Status: {proj['status']} | Priority: {proj['priority']}\n")
    print(f"   👥 Team Members ({len(proj['team'])}):\n")
    for member in proj['team']:
        print(f"      • {member['name']} - {member['title']}")
print()

In [None]:
# 8️⃣ Find paths between employees (relationship chains)
print("8️⃣ How is Sarah Chen connected to Carlos Santos?\n")

query = """
MATCH path = (sarah:Employee {name: "Sarah Chen"})-[*1..4]-(carlos:Employee {name: "Carlos Santos"})
RETURN [node in nodes(path) | node.name] as connection_path,
       [rel in relationships(path) | type(rel)] as relationship_types,
       length(path) as path_length
ORDER BY path_length
LIMIT 3
"""
results = run_query(query)
print(f"📊 Found {len(results)} connection paths:\n")
for i, path in enumerate(results, 1):
    print(f"Path {i} ({path['path_length']} degrees of separation):")
    for j in range(len(path['connection_path']) - 1):
        print(f"   {path['connection_path'][j]} --[{path['relationship_types'][j]}]-- ", end="")
    print(path['connection_path'][-1])
    print()
print()

In [None]:
# 9️⃣ Department budget analysis with employee count
print("9️⃣ Department resource analysis:\n")

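query = """
MATCH (d:Department)
OPTIONAL MATCH (e:Employee)-[:WORKS_IN]->(d)
WITH d, count(e) as employee_count
RETURN d.name as department,
       d.budget as budget,
       employee_count,
       // Guard against departments with no employees (avoids division by zero)
       CASE WHEN employee_count = 0 THEN null
            ELSE toFloat(d.budget) / employee_count
       END as budget_per_employee
ORDER BY budget DESC
"""
results = run_query(query)
print("📊 Department Resource Allocation:\n")
for dept in results:
    per_employee = dept['budget_per_employee']
    print(f"🏢 {dept['department']}")
    print(f"   Budget: ${dept['budget']:,}")
    print(f"   Employees: {dept['employee_count']}")
    if per_employee is None:
        print("   Budget per employee: n/a (no employees)\n")
    else:
        print(f"   Budget per employee: ${per_employee:,.0f}\n")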
print()

In [None]:
# 🔟 Find potential collaborators (shared skills + different departments)
print("🔟 Recommended collaborators for cross-team projects:\n")

query = """
MATCH (e1:Employee {name: "Priya Patel"})-[:HAS_SKILL]->(s:Skill)<-[:HAS_SKILL]-(e2:Employee)
MATCH (e1)-[:WORKS_IN]->(d1:Department), (e2)-[:WORKS_IN]->(d2:Department)
WHERE e1 <> e2 AND d1 <> d2
RETURN e2.name as name, 
       e2.title as title,
       d2.name as department,
       count(DISTINCT s) as shared_skills,
       collect(DISTINCT s.name)[0..3] as sample_skills
ORDER BY shared_skills DESC
LIMIT 5
"""
results = run_query(query)
print(f"📊 Priya Patel should collaborate with:\n")
for i, person in enumerate(results, 1):
    print(f"{i}. {person['name']} from {person['department']}")
    print(f"   {person['title']}")
    print(f"   {person['shared_skills']} shared skills: {', '.join(person['sample_skills'])}\n")
print()

In [None]:
# 1️⃣1️⃣ Multi-hop: Find employees by skill category, then their projects
print("1️⃣1️⃣ What projects are ML experts working on?\n")

query = """
MATCH (e:Employee)-[:HAS_SKILL]->(s:Skill)
WHERE s.category = "ML Framework"
MATCH (e)-[:ASSIGNED_TO]->(p:Project)
RETURN DISTINCT p.name as project, 
       p.priority as priority,
       collect(DISTINCT e.name) as ml_experts
ORDER BY 
    CASE p.priority 
        WHEN 'critical' THEN 1 
        WHEN 'high' THEN 2 
        ELSE 3 
    END
"""
results = run_query(query)
print(f"📊 Projects with ML Framework experts:\n")
for proj in results:
    print(f"📁 {proj['project']} [{proj['priority']}]")
    print(f"   ML Experts: {', '.join(proj['ml_experts'])}\n")
print()

In [None]:
# 1️⃣2️⃣ Find the most in-demand skills
print("1️⃣2️⃣ Most in-demand skills across the company:\n")

query = """
MATCH (e:Employee)-[:HAS_SKILL]->(s:Skill)
RETURN s.name as skill,
       s.category as category,
       count(e) as employee_count,
       collect(e.name)[0..3] as sample_employees
ORDER BY employee_count DESC
LIMIT 10
"""
results = run_query(query)
print(f"📊 Top 10 Skills:\n")
for i, skill in enumerate(results, 1):
    print(f"{i}. {skill['skill']} ({skill['category']})")
    print(f"   {skill['employee_count']} employees have this skill")
    print(f"   Including: {', '.join(skill['sample_employees'])}...\n")
print()

### 🎯 Pattern Matching - Key Takeaways

- **Pattern matching is intuitive**: Write queries that look like the relationships you're finding
- **Multi-hop traversals are efficient**: Each hop follows stored relationships directly, so cost tracks the part of the graph a query touches rather than the total data size
- **Variable-length paths** `[*1..4]` find connections within N degrees of separation
- **Aggregations** like `count()` and `collect()` enable powerful analytics
- **Combining patterns** allows complex queries (skills + projects + departments simultaneously)
- **Graph queries reveal insights** that are awkward or impractical to express in relational databases

These queries would require complex multi-table joins in SQL, often with performance degradation as hops are added. On a small graph like this one, Neo4j typically answers them in milliseconds! ⚡
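
To make the variable-length pattern more concrete, here is a minimal plain-Python sketch of what `[*1..4]` asks for: every path of 1 to 4 relationships between two nodes, ignoring direction and never reusing a relationship within one path. The toy relationship list and the function name `paths_within` are assumptions made for this illustration; Neo4j's own matcher is far more sophisticated.

```python
# Toy undirected graph: (relationship id, node_a, node_b); all values are hypothetical
toy_relationships = [
    (0, "Sarah", "Data Science"),
    (1, "Priya", "Data Science"),
    (2, "Priya", "Dashboard"),
    (3, "Carlos", "Dashboard"),
]

def paths_within(start, goal, max_hops):
    """Return every path from start to goal that uses 1..max_hops relationships,
    never reusing a relationship within a single path."""
    found = []

    def walk(node, used, trail):
        # Record the path whenever we stand on the goal after at least one hop
        if node == goal and used:
            found.append(trail)
        # Stop extending once the hop budget is spent
        if len(used) == max_hops:
            return
        for rel_id, a, b in toy_relationships:
            if rel_id in used or node not in (a, b):
                continue
            next_node = b if node == a else a
            walk(next_node, used | {rel_id}, trail + [next_node])

    walk(start, frozenset(), [start])
    return sorted(found, key=len)

for path in paths_within("Sarah", "Carlos", max_hops=4):
    print(" -- ".join(path))
```

Lowering `max_hops` to 3 returns no paths in this toy graph, which mirrors how a tighter upper bound in `[*1..3]` can turn a match into an empty result.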

---
## 📊 Part 6: Graph Visualization

### Visualizing Graph Structure

Sometimes the best way to understand a graph is to **see it**. We'll use NetworkX and Matplotlib to visualize subgraphs from our knowledge graph.

### What We'll Visualize

- Department organizational structure
- Project team networks
- Skill relationships
- Complete subgraphs for specific departments

Let's create some beautiful visualizations! 🎨

In [None]:
# 🎨 Create a visualization function

def visualize_subgraph(query: str, title: str, figsize=(14, 10)):
    """
    Execute a Cypher query and visualize the resulting subgraph.
    
    Args:
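        query: Cypher query that returns nodes and relationships as separate
               columns (Path values are not unpacked, and edges are only drawn
               for relationships the query returns)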
        title: Title for the visualization
        figsize: Figure size tuple (width, height)
    """
    try:
        # Execute query
        with driver.session() as session:
            result = session.run(query)
            
            # Create NetworkX graph
            G = nx.DiGraph()
            node_labels_map = {}
            node_colors_map = {}
            
            # Color scheme for different node types
            color_scheme = {
                'Employee': '#3498db',    # Blue
                'Department': '#2ecc71',  # Green
                'Project': '#e74c3c',     # Red
                'Skill': '#9b59b6'        # Purple
            }
            
            # Process query results
            for record in result:
                for key in record.keys():
                    item = record[key]
                    
                    # Handle nodes
                    if hasattr(item, 'labels'):
                        node_id = item.element_id
                        label = list(item.labels)[0] if item.labels else 'Unknown'
                        name = item.get('name', item.get('title', 'Unknown'))
                        
                        G.add_node(node_id, label=label, name=name)
                        node_labels_map[node_id] = name
                        node_colors_map[node_id] = color_scheme.get(label, '#95a5a6')
                    
                    # Handle relationships
                    elif hasattr(item, 'type'):
                        start_id = item.start_node.element_id
                        end_id = item.end_node.element_id
                        rel_type = item.type
                        
                        # Add nodes if they don't exist
                        if start_id not in G:
                            start_label = list(item.start_node.labels)[0] if item.start_node.labels else 'Unknown'
                            start_name = item.start_node.get('name', item.start_node.get('title', 'Unknown'))
                            G.add_node(start_id, label=start_label, name=start_name)
                            node_labels_map[start_id] = start_name
                            node_colors_map[start_id] = color_scheme.get(start_label, '#95a5a6')
                        
                        if end_id not in G:
                            end_label = list(item.end_node.labels)[0] if item.end_node.labels else 'Unknown'
                            end_name = item.end_node.get('name', item.end_node.get('title', 'Unknown'))
                            G.add_node(end_id, label=end_label, name=end_name)
                            node_labels_map[end_id] = end_name
                            node_colors_map[end_id] = color_scheme.get(end_label, '#95a5a6')
                        
                        G.add_edge(start_id, end_id, relationship=rel_type)
            
            # Check if graph has nodes
            if len(G.nodes()) == 0:
                print("❌ No data returned from query. Cannot visualize empty graph.")
                return
            
            # Create visualization
            plt.figure(figsize=figsize)
            
            # Use spring layout for better visualization
            pos = nx.spring_layout(G, k=2, iterations=50, seed=42)
            
            # Draw nodes
            node_colors = [node_colors_map[node] for node in G.nodes()]
            nx.draw_networkx_nodes(G, pos, node_color=node_colors, 
                                   node_size=2000, alpha=0.9, 
                                   edgecolors='black', linewidths=2)
            
            # Draw edges
            nx.draw_networkx_edges(G, pos, edge_color='gray', 
                                   arrows=True, arrowsize=20, 
                                   arrowstyle='->', width=2, alpha=0.6)
            
            # Draw labels
            nx.draw_networkx_labels(G, pos, node_labels_map, 
                                    font_size=9, font_weight='bold')
            
            # Create legend
            legend_elements = [Patch(facecolor=color, label=label, edgecolor='black')
                               for label, color in color_scheme.items()]
            plt.legend(handles=legend_elements, loc='upper left', fontsize=10)
            
            plt.title(title, fontsize=16, fontweight='bold', pad=20)
            plt.axis('off')
            plt.tight_layout()
            plt.show()
            
            print(f"✅ Visualized {len(G.nodes())} nodes and {len(G.edges())} relationships\n")
            
    except Exception as e:
        print(f"❌ Visualization failed: {str(e)}")

print("✅ Visualization function created!")

In [None]:
# 📊 Visualization 1: Data Science Department Structure
print("📊 Visualizing Data Science department structure...\n")

# Return nodes first, then relationships: edges are only drawn for returned relationships
query = """
MATCH (d:Department {name: "Data Science"})<-[w:WORKS_IN]-(e:Employee)
OPTIONAL MATCH (e)-[r:MANAGES]->(report:Employee)
RETURN d, e, report, w, r
"""

visualize_subgraph(query, "Data Science Department - Organizational Structure")

In [None]:
# 📊 Visualization 2: Project Team Network
print("📊 Visualizing 'Fraud Detection System' project team...\n")

query = """
MATCH (p:Project {name: "Fraud Detection System"})<-[a:ASSIGNED_TO]-(e:Employee)
OPTIONAL MATCH (e)-[h:HAS_SKILL]->(s:Skill)
RETURN p, e, s, a, h
"""

visualize_subgraph(query, "Fraud Detection System - Project Team & Skills")

In [None]:
# 📊 Visualization 3: Python Skill Network
print("📊 Visualizing Python skill network...\n")

query = """
MATCH (s:Skill {name: "Python"})<-[h:HAS_SKILL]-(e:Employee)
MATCH (e)-[w:WORKS_IN]->(d:Department)
RETURN s, e, d, h, w
LIMIT 10
"""

visualize_subgraph(query, "Python Skill Network - Who Knows Python?")

In [None]:
# 📊 Visualization 4: Cross-Department Collaboration
print("📊 Visualizing cross-department project collaboration...\n")

query = """
MATCH (p:Project {name: "Real-time Analytics Dashboard"})<-[a:ASSIGNED_TO]-(e:Employee)
MATCH (e)-[w:WORKS_IN]->(d:Department)
RETURN p, e, d, a, w
"""

visualize_subgraph(query, "Real-time Analytics Dashboard - Cross-Department Collaboration")

---
## 🚀 Part 7: Advanced Graph Patterns

### Beyond Basic Queries

Now that you're comfortable with pattern matching, let's explore advanced graph algorithms and patterns:

- **Shortest paths**: Find the most direct connection between nodes
- **Recommendation algorithms**: Suggest connections based on graph patterns
- **Centrality analysis**: Identify the most important/connected nodes
- **Pattern detection**: Find pairs of nodes that look like they should be connected but aren't

These patterns power real-world applications from LinkedIn's "How you're connected" to fraud detection networks.
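
To make the first item concrete, the following is a minimal breadth-first search in plain Python, which is the textbook way to find a fewest-hop path in an unweighted graph. It is only a sketch of the idea, not Neo4j's implementation of `shortestPath`. The adjacency dictionary is a small hand-built subset of the `HAS_SKILL` data above (first names only), and `toy_adjacency` and `shortest_path` are names invented for this example.

```python
from collections import deque

# Undirected toy graph built from a few HAS_SKILL pairs (subset, for illustration)
toy_adjacency = {
    "Elena": ["Python", "SQL"],
    "Python": ["Elena", "David"],
    "SQL": ["Elena", "Amy"],
    "David": ["Python", "AWS"],
    "Amy": ["SQL", "AWS"],
    "AWS": ["David", "Amy", "Lisa"],
    "Lisa": ["AWS"],
}

def shortest_path(adjacency, start, goal, max_hops=10):
    """Breadth-first search: returns one fewest-hop path, or None if none exists
    within max_hops relationships."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Do not extend paths that already used the full hop budget
        if len(path) - 1 == max_hops:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(shortest_path(toy_adjacency, "Elena", "Lisa"))
```

Because this toy graph contains only skill links, the path it finds may be longer than the one Neo4j returns on the full knowledge graph, where departments, projects, and management links offer additional routes.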

In [None]:
# 🛤️ Shortest Path: Find the shortest connection between two employees
print("🛤️ Finding shortest path between Elena Rodriguez and Lisa Anderson...\n")

query = """
MATCH (start:Employee {name: "Elena Rodriguez"}),
      (end:Employee {name: "Lisa Anderson"}),
      path = shortestPath((start)-[*..10]-(end))
RETURN [node in nodes(path) | node.name] as path_nodes,
       [rel in relationships(path) | type(rel)] as path_relationships,
       length(path) as path_length
"""
results = run_query(query)

if results:
    path = results[0]
    print(f"📊 Shortest path found: {path['path_length']} hops\n")
    print("Connection chain:")
    for i in range(len(path['path_nodes']) - 1):
        print(f"   {path['path_nodes'][i]}")
        print(f"      ↓ [{path['path_relationships'][i]}]")
    print(f"   {path['path_nodes'][-1]}")
else:
    print("No path found between these employees.")
print()

In [None]:
# 💡 Recommendation: Suggest new team members for a project
print("💡 Recommending new team members for 'Customer Churn Prediction' project...\n")

query = """
// Find skills needed for the project (based on current team skills)
MATCH (currentTeam:Employee)-[:ASSIGNED_TO]->(p:Project {name: "Customer Churn Prediction"})
MATCH (currentTeam)-[:HAS_SKILL]->(projectSkills:Skill)

// Find employees with those skills who are NOT on the project
MATCH (candidate:Employee)-[:HAS_SKILL]->(projectSkills)
WHERE NOT (candidate)-[:ASSIGNED_TO]->(p)

// Count their active projects (prefer people with lighter workload)
OPTIONAL MATCH (candidate)-[:ASSIGNED_TO]->(otherProjects:Project)
WHERE otherProjects.status = 'active'

WITH candidate, 
     count(DISTINCT projectSkills) as matching_skills,
     count(DISTINCT otherProjects) as current_workload,
     collect(DISTINCT projectSkills.name) as skills

RETURN candidate.name as name,
       candidate.title as title,
       matching_skills,
       current_workload,
       skills[0..3] as sample_skills
ORDER BY matching_skills DESC, current_workload ASC
LIMIT 5
"""
results = run_query(query)

print(f"📊 Top 5 recommended candidates:\n")
for i, candidate in enumerate(results, 1):
    print(f"{i}. {candidate['name']} - {candidate['title']}")
    print(f"   ✅ {candidate['matching_skills']} matching skills: {', '.join(candidate['sample_skills'])}")
    print(f"   📊 Current workload: {candidate['current_workload']} active projects\n")
print()

In [None]:
# 🌟 Centrality: Find most connected employees (collaboration hubs)
print("🌟 Finding collaboration hubs (most connected employees)...\n")

query = """
MATCH (e:Employee)
OPTIONAL MATCH (e)-[:MANAGES]->(reports:Employee)
OPTIONAL MATCH (e)-[:ASSIGNED_TO]->(projects:Project)
OPTIONAL MATCH (e)-[:HAS_SKILL]->(skills:Skill)
OPTIONAL MATCH (e)-[r]-(connected)

WITH e,
     count(DISTINCT reports) as direct_reports,
     count(DISTINCT projects) as project_count,
     count(DISTINCT skills) as skill_count,
     count(DISTINCT r) as total_connections

RETURN e.name as name,
       e.title as title,
       direct_reports,
       project_count,
       skill_count,
       total_connections,
       (direct_reports * 3 + project_count * 2 + skill_count) as influence_score
ORDER BY influence_score DESC
LIMIT 10
"""
results = run_query(query)

print(f"📊 Top 10 Most Connected Employees (Collaboration Hubs):\n")
for i, person in enumerate(results, 1):
    print(f"{i}. {person['name']} - {person['title']}")
    print(f"   👥 {person['direct_reports']} direct reports")
    print(f"   📁 {person['project_count']} projects")
    print(f"   🎯 {person['skill_count']} skills")
    print(f"   🔗 {person['total_connections']} total connections")
    print(f"   ⭐ Influence Score: {person['influence_score']}\n")
print()

In [None]:
# 🔍 Pattern Detection: Find employees who should know each other but don't collaborate
print("🔍 Finding potential collaboration opportunities...\n")

query = """
// Find pairs of employees with overlapping skills but no shared projects
MATCH (e1:Employee)-[:HAS_SKILL]->(s:Skill)<-[:HAS_SKILL]-(e2:Employee)
WHERE e1.id < e2.id  // Report each pair once (compares the id property; the id() function is deprecated in Neo4j 5)

// Count shared skills
WITH e1, e2, count(DISTINCT s) as shared_skills
WHERE shared_skills >= 2

// Check if they share any projects
OPTIONAL MATCH (e1)-[:ASSIGNED_TO]->(p:Project)<-[:ASSIGNED_TO]-(e2)

// Only return pairs with no shared projects
WITH e1, e2, shared_skills, count(p) as shared_projects
WHERE shared_projects = 0

// Get their departments
MATCH (e1)-[:WORKS_IN]->(d1:Department)
MATCH (e2)-[:WORKS_IN]->(d2:Department)

RETURN e1.name as person1,
       e1.title as title1,
       d1.name as dept1,
       e2.name as person2,
       e2.title as title2,
       d2.name as dept2,
       shared_skills,
       CASE WHEN d1 <> d2 THEN 'Cross-department' ELSE 'Same department' END as opportunity_type
ORDER BY shared_skills DESC
LIMIT 8
"""
results = run_query(query)

print(f"📊 Potential Collaboration Opportunities ({len(results)} found):\n")
for i, opp in enumerate(results, 1):
    emoji = "🌉" if opp['opportunity_type'] == 'Cross-department' else "🤝"
    print(f"{i}. {emoji} {opp['opportunity_type']}")
    print(f"   {opp['person1']} ({opp['title1']}, {opp['dept1']})")
    print(f"   ↔️")
    print(f"   {opp['person2']} ({opp['title2']}, {opp['dept2']})")
    print(f"   💡 {opp['shared_skills']} shared skills\n")
print()

### 🎯 Advanced Patterns - Key Takeaways

- **Shortest path algorithms** find optimal connections through complex networks
- **Recommendation systems** can be built using graph pattern matching
- **Centrality analysis** identifies influential nodes (leaders, connectors, experts)
- **Pattern detection** reveals hidden opportunities (potential collaborations, gaps)
- **Graph algorithms** handle problems that are awkward to express and slow to run as relational joins
- Real-world applications: social network analysis, fraud detection, supply chain optimization
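
The pattern-detection logic can also be read as plain set operations, which may help when reasoning about what the Cypher version should return. The sketch below assumes two hypothetical dictionaries, `skills_by_employee` and `projects_by_employee`; the skill sets come from the mock data above, while the project IDs are invented for illustration.

```python
from itertools import combinations

skills_by_employee = {
    "Priya Patel": {"Python", "Scikit-learn", "SQL", "NLP"},
    "Elena Rodriguez": {"Python", "SQL", "Scikit-learn"},
    "Carlos Santos": {"React", "Node.js"},
}
# Hypothetical project assignments, for illustration only
projects_by_employee = {
    "Priya Patel": {"proj_a"},
    "Elena Rodriguez": {"proj_b"},
    "Carlos Santos": {"proj_b"},
}

def collaboration_opportunities(skills, projects, min_shared=2):
    """Pairs with at least min_shared skills in common and no shared project."""
    pairs = []
    # combinations() yields each unordered pair exactly once
    for a, b in combinations(sorted(skills), 2):
        shared_skills = skills[a] & skills[b]
        shared_projects = projects.get(a, set()) & projects.get(b, set())
        if len(shared_skills) >= min_shared and not shared_projects:
            pairs.append((a, b, len(shared_skills)))
    # Most overlapping pairs first
    return sorted(pairs, key=lambda pair: -pair[2])

for person1, person2, count in collaboration_opportunities(
        skills_by_employee, projects_by_employee):
    print(f"{person1} and {person2}: {count} shared skills")
```

The graph query does the same filtering, but it finds candidate pairs by walking `HAS_SKILL` relationships instead of comparing every pair of employees.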

---
## ✅ Part 8: Best Practices & Common Pitfalls

### 🎯 Best Practices for Graph Databases

Now that you understand graph fundamentals, let's discuss how to build production-quality graph applications.

#### 1. **When to Use Graph Databases**
✅ **Use graph databases when:**
- Your data has complex, multi-hop relationships (social networks, org charts)
- You need to traverse relationships frequently ("friends of friends", supply chains)
- Relationship queries are central to your application
- Your schema evolves frequently (graphs are schema-flexible)
- You need real-time pattern matching (fraud detection, recommendations)

❌ **Don't use graph databases when:**
- You primarily need simple CRUD operations on independent records
- Your queries rarely involve relationships (use document DB instead)
- You need pure analytical aggregations (use data warehouse instead)
- Relationships are sparse and simple (relational DB may suffice)

#### 2. **Cypher Query Optimization**
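- **Start with specific nodes**: Anchor queries on a label plus an indexed property. An inline property map and an equivalent `WHERE` equality generally produce the same plan, so the label and the index are what matter
  ```cypher
  // ✅ Good: Label + indexed property lets the planner use an index seek
  // (assumes an index or uniqueness constraint on Employee.id)
  MATCH (e:Employee {id: "emp001"})-[:MANAGES]->(reports)
  
  // ❌ Bad: No label, so no index applies and every node is checked
  MATCH (e {id: "emp001"})-[:MANAGES]->(reports)
  ```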

- **Limit relationship traversal depth**: Unbounded traversals can be expensive
  ```cypher
  // ✅ Good: Limited depth
  MATCH path = (a)-[*1..4]-(b)
  
  // ❌ Bad: Could traverse entire graph
  MATCH path = (a)-[*]-(b)
  ```

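- **Use PROFILE and EXPLAIN**: `EXPLAIN` shows the planned execution without running the query; `PROFILE` runs it and reports the rows and database hits for each operator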
  ```cypher
  PROFILE MATCH (e:Employee)-[:HAS_SKILL]->(s:Skill {name: "Python"})
  RETURN e.name
  ```

#### 3. **Indexing and Constraints**
Always create indexes and constraints for properties you query frequently:

```cypher
// Unique constraints (also create indexes)
CREATE CONSTRAINT IF NOT EXISTS FOR (e:Employee) REQUIRE e.id IS UNIQUE

// Indexes for frequent lookups
CREATE INDEX IF NOT EXISTS FOR (e:Employee) ON (e.name)
CREATE INDEX IF NOT EXISTS FOR (p:Project) ON (p.status)
```
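
From Python, each schema statement is usually sent as its own query. The sketch below shows one way to do that; `SCHEMA_STATEMENTS` and `apply_schema` are hypothetical names, and the example passes a stand-in executor that only records the statements so it runs without a database. In this notebook you would presumably pass `run_query` instead, on the assumption that it accepts a single Cypher string and works for schema commands.

```python
SCHEMA_STATEMENTS = [
    "CREATE CONSTRAINT IF NOT EXISTS FOR (e:Employee) REQUIRE e.id IS UNIQUE",
    "CREATE INDEX IF NOT EXISTS FOR (e:Employee) ON (e.name)",
    "CREATE INDEX IF NOT EXISTS FOR (p:Project) ON (p.status)",
]

def apply_schema(execute):
    """Send each schema statement separately through the supplied callable
    and return how many were sent."""
    for statement in SCHEMA_STATEMENTS:
        execute(statement)
    return len(SCHEMA_STATEMENTS)

# Stand-in executor: records statements instead of talking to Neo4j
recorded = []
applied = apply_schema(recorded.append)
print(f"Applied {applied} schema statements")
```

Because every statement uses `IF NOT EXISTS`, running the function again should be harmless, which makes it convenient to call before loading data.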

#### 4. **Graph Modeling Best Practices**

**Nodes vs. Properties:**
- Make something a **node** if:
  - It has relationships to other entities
  - You'll query it independently
  - It appears in multiple contexts

- Make something a **property** if:
  - It's a simple attribute (name, age, status)
  - It doesn't need relationships
  - It's specific to one entity

**Example:**
```cypher
// ✅ Good: Skills are nodes (can be shared, have relationships)
(Employee)-[:HAS_SKILL]->(Skill {name: "Python"})

// ❌ Bad: Skills as properties (answering "who else has this skill?" means checking every employee's list)
(Employee {skills: ["Python", "SQL"]})
```
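
The trade-off can be sketched in plain Python. The property-style model has to check every employee's list to answer "who has this skill?", while the relationship-style model can be grouped by skill once and then looked up directly. The skill lists are taken from the mock data above; `employee_skill_lists`, `employees_by_skill`, and the two functions are names invented for this illustration.

```python
from collections import defaultdict

# Property-style model: each employee carries its own skill list
employee_skill_lists = {
    "emp005": ["Python", "SQL", "Scikit-learn"],
    "emp011": ["Product Strategy", "SQL"],
    "emp012": ["React"],
}

def who_has_skill_scan(skill):
    """Property model: inspect every employee's list."""
    return sorted(emp for emp, skill_list in employee_skill_lists.items()
                  if skill in skill_list)

# Relationship-style model: (employee, skill) pairs, like the has_skill list
has_skill_pairs = [(emp, skill)
                   for emp, skill_list in employee_skill_lists.items()
                   for skill in skill_list]

# Group the pairs by skill so each skill "node" knows its employees
employees_by_skill = defaultdict(set)
for emp, skill in has_skill_pairs:
    employees_by_skill[skill].add(emp)

def who_has_skill_lookup(skill):
    """Relationship model: go straight to the skill and read its employees."""
    return sorted(employees_by_skill.get(skill, set()))

print(who_has_skill_scan("SQL"))
print(who_has_skill_lookup("SQL"))
```

Both calls give the same employees; the second simply mirrors what a `Skill` node with incoming `HAS_SKILL` relationships provides in the graph.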

#### 5. **Relationship Direction Matters**
- Always define relationship direction logically (even if you query both ways)
- Use descriptive relationship types: `MANAGES`, `WORKS_IN`, not `RELATED_TO`
- You can traverse relationships in any direction regardless of how they're stored:
  ```cypher
  // Stored direction: (Manager)-[:MANAGES]->(Employee)
  
  // Query in stored direction
  MATCH (m:Employee)-[:MANAGES]->(e:Employee)
  
  // Query in reverse direction
  MATCH (e:Employee)<-[:MANAGES]-(m:Employee)
  
  // Query ignoring direction
  MATCH (e1:Employee)-[:MANAGES]-(e2:Employee)
  ```

#### 6. **Testing Queries Incrementally**
Build complex queries step by step:
```cypher
// Step 1: Find the anchor node
MATCH (e:Employee {name: "Sarah Chen"})
RETURN e

// Step 2: Add one relationship
MATCH (e:Employee {name: "Sarah Chen"})-[:MANAGES]->(reports)
RETURN e, reports

// Step 3: Add filters and aggregations
MATCH (e:Employee {name: "Sarah Chen"})-[:MANAGES]->(reports)
WHERE reports.years_experience > 5
RETURN e.name, collect(reports.name) as experienced_reports
```

### ⚠️ Common Mistakes to Avoid

1. **Overmodeling (too many node types)**
   - Don't create a node type for every possible entity
   - Example: Don't create `:StreetAddress`, `:City`, `:State` as separate nodes if you only need address data as properties

2. **Not using indexes for frequent lookups**
   - If you query `WHERE e.email = "..."` often, create an index on `email`
   - With an index, the lookup becomes a direct seek instead of a scan over every node with that label

3. **Ignoring relationship direction**
   - Define relationships in the natural direction: `(Employee)-[:WORKS_FOR]->(Company)`
   - Don't store a second relationship in the opposite direction unless it carries a different meaning; a single relationship can be traversed both ways

4. **Over-relying on properties instead of relationships**
   - If many nodes share the same property value and you frequently group by it or traverse through it, consider making it a node
   - Example: Instead of `(Employee {department: "Engineering"})`, use `(Employee)-[:WORKS_IN]->(Department {name: "Engineering"})`

5. **Not cleaning up test data**
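   - Clear test graphs with `MATCH (n) DETACH DELETE n` when you're done; it deletes every node and relationship, so run it only against a test database
   - In production, match on specific labels and properties so you delete only the intended data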

6. **Writing queries that scan the entire graph**
   - Start queries with an indexed lookup or a label-specific node match
   - Avoid queries like `MATCH (n) WHERE n.name = "Alice"` (no label, so every node is scanned); prefer `MATCH (e:Employee {name: "Alice"})`
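
For mistakes 2 and 6, the two halves of the fix are an index and a lookup that names the label so the planner can use it. The sketch below builds both statements as strings; the helper names and the index naming scheme are hypothetical, and the `CREATE INDEX ... IF NOT EXISTS` form assumes Neo4j 4.x or later. Labels and property keys cannot be passed as query parameters, so the helpers validate them before formatting.

```python
import re

# Conservative identifier check, since labels/properties are interpolated.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check(identifier):
    if not _IDENTIFIER.fullmatch(identifier):
        raise ValueError(f"unsafe identifier: {identifier!r}")
    return identifier


def index_statement(label, prop):
    """Build an idempotent CREATE INDEX statement for label(prop)."""
    label, prop = _check(label), _check(prop)
    name = f"{label.lower()}_{prop.lower()}_idx"  # hypothetical naming scheme
    return f"CREATE INDEX {name} IF NOT EXISTS FOR (n:{label}) ON (n.{prop})"


def anchored_lookup(label, prop):
    """Build a lookup that names the label, so an index on it can be used."""
    label, prop = _check(label), _check(prop)
    return f"MATCH (n:{label} {{{prop}: $value}}) RETURN n"
```

For example, `session.run(index_statement("Employee", "email"))` creates the index once, and `session.run(anchored_lookup("Employee", "email"), {"value": "..."})` performs the lookup. Prefixing a query with `PROFILE` shows whether the plan uses an index seek or falls back to a scan.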

### 🎯 Key Takeaways

- **Graph databases excel at connected data**—use them when relationships matter
- **Index frequently queried properties** for performance
- **Model thoughtfully**: nodes for entities with relationships, properties for attributes
- **Relationship direction is semantic**, but you can query in any direction
- **Test queries incrementally**—build complexity step by step
- **Avoid overmodeling**—not everything needs to be a node
- **Profile your queries** to understand and optimize performance
- **Graph databases complement vector databases**—use both for GraphRAG!

These practices will help you build scalable, performant graph applications! 🚀

---
## 🚀 Part 9: Next Steps and Extensions

### Congratulations! 🎉

You've learned the fundamentals of graph databases with Neo4j:
- ✅ Core graph concepts (nodes, relationships, properties)
- ✅ Cypher query language for pattern matching
- ✅ Building knowledge graphs from structured data
- ✅ Advanced pattern matching and traversals
- ✅ Graph visualization techniques
- ✅ Best practices and common pitfalls

### 🔜 What's Next: Notebook 2 - GraphRAG

In the next notebook, you'll take your graph database skills to the next level by combining them with:
- **Vector embeddings** for semantic search
- **LlamaIndex** for document ingestion and retrieval
- **LLMs** for entity extraction and query generation
- **Hybrid search** that leverages both graph relationships and vector similarity

You'll learn how **GraphRAG** (Graph + Retrieval-Augmented Generation) can provide richer context and help reduce hallucinations compared to vector-only RAG systems.

### 📚 Practice Exercises

Before moving on, try these extensions to reinforce your learning:

#### Exercise 1: Add Collaboration Relationships
Create `COLLABORATES_WITH` relationships between employees who work on the same projects:
```cypher
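MATCH (e1:Employee)-[:ASSIGNED_TO]->(p:Project)<-[:ASSIGNED_TO]-(e2:Employee)
WHERE id(e1) < id(e2)  // one orientation per pair; also excludes e1 = e2
WITH e1, e2, count(p) AS shared_projects  // aggregate before MERGE
MERGE (e1)-[c:COLLABORATES_WITH]->(e2)
SET c.projects = shared_projects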
```
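
To sanity-check what this exercise should produce, the same pairing logic can be reproduced in plain Python on a toy dataset. This is only an illustration, not a replacement for the Cypher: the project names and every employee other than Sarah Chen are made up, and sorting names stands in for the `id(e1) < id(e2)` ordering.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: project -> employees assigned to it.
assignments = {
    "Apollo": ["Sarah Chen", "Raj Patel", "Mia Lopez"],
    "Hermes": ["Sarah Chen", "Raj Patel"],
}


def shared_project_counts(assignments):
    """Count shared projects for each unordered pair of employees."""
    counts = Counter()
    for members in assignments.values():
        # sorted() fixes one orientation per pair, like id(e1) < id(e2).
        for first, second in combinations(sorted(set(members)), 2):
            counts[(first, second)] += 1
    return dict(counts)


pairs = shared_project_counts(assignments)
print(pairs[("Raj Patel", "Sarah Chen")])  # 2 (both are on Apollo and Hermes)
```

Each key corresponds to one `COLLABORATES_WITH` relationship and each value to its `projects` property, so three pairs here means the graph version should create three relationships for the same data.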

#### Exercise 2: Time-Based Relationships
Add temporal properties to model career progression:
- Add `joined_date` property to `WORKS_IN` relationships
- Add `started_date` and `ended_date` to `ASSIGNED_TO` relationships
- Query: "Who joined the Data Science department in 2023?"
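
As a starting point for the last bullet, one option is a half-open date range, sketched below. It assumes `joined_date` is stored as a Cypher `date` value on the `WORKS_IN` relationship; the constant and helper names are hypothetical.

```python
from datetime import date

# Assumes w.joined_date holds a Cypher date value.
JOINED_IN_YEAR = """
MATCH (e:Employee)-[w:WORKS_IN]->(d:Department {name: $department})
WHERE w.joined_date >= date($start) AND w.joined_date < date($end)
RETURN e.name AS name, w.joined_date AS joined_date
ORDER BY joined_date
"""


def year_bounds(year):
    """Return ISO strings for the half-open interval [Jan 1, next Jan 1)."""
    return date(year, 1, 1).isoformat(), date(year + 1, 1, 1).isoformat()


def joined_in_year_params(department, year):
    """Build the parameter dict expected by JOINED_IN_YEAR."""
    start, end = year_bounds(year)
    return {"department": department, "start": start, "end": end}
```

A call such as `session.run(JOINED_IN_YEAR, joined_in_year_params("Data Science", 2023))` would then answer the question. Filtering on `w.joined_date.year = 2023` is shorter to write, but a range comparison also works unchanged for quarters or arbitrary periods.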

#### Exercise 3: Build a Recommendation Query
Create a query that recommends skills for employees to learn based on:
- Skills of people in similar roles
- Skills needed for projects in their department
- Skills they don't already have

#### Exercise 4: Model a Different Domain
Apply what you've learned to a new domain:
- **Supply Chain**: Suppliers, Products, Warehouses, Shipments
- **Social Network**: Users, Posts, Comments, Likes, Follows
- **Research Network**: Papers, Authors, Citations, Institutions, Topics
- **Course Platform**: Students, Courses, Modules, Prerequisites, Certificates

### 📖 Additional Resources

- **Neo4j Documentation**: [neo4j.com/docs](https://neo4j.com/docs/)
- **Cypher Manual**: [neo4j.com/docs/cypher-manual](https://neo4j.com/docs/cypher-manual/current/)
- **Neo4j Graph Academy**: Free online courses at [neo4j.com/graphacademy](https://neo4j.com/graphacademy/)
- **Graph Data Science Library**: Advanced algorithms for centrality, community detection, and more

### 🎯 Key Skills You've Gained

- Understanding when to use graph databases vs. other data stores
- Writing Cypher queries for complex relationship patterns
- Modeling real-world domains as knowledge graphs
- Optimizing graph queries for performance
- Visualizing graph structures
- Applying graph analytics to solve business problems

### 🌟 Ready for GraphRAG?

You now have a solid foundation in graph databases. In Notebook 2, you'll see how combining graph databases with vector embeddings and LLMs creates powerful AI applications that understand both **semantic meaning** (via vectors) and **structural relationships** (via graphs).

See you in the next notebook! 🚀

In [None]:
# 🧹 Optional: Clean up and close connections

print("🧹 Closing Neo4j driver connection...")
driver.close()
print("✅ Connection closed successfully!")
print("\n👋 Thanks for learning graph databases with Neo4j!")
print("📚 See you in Notebook 2: GraphRAG with LlamaIndex! 🚀")