# 🔗 GraphRAG: Combining Graph Databases with Vector Search

## Welcome to GraphRAG!

In this notebook, you'll learn how to build **GraphRAG** (Graph Retrieval-Augmented Generation) systems that combine the power of:
- **Graph databases** (Neo4j) for understanding relationships
- **Vector embeddings** for semantic similarity search
- **LLMs** (GPT-5-nano) for intelligent entity extraction and query generation

## 🎯 What You'll Learn

- How to automatically extract entities and relationships from unstructured text using LLMs
- Building knowledge graphs programmatically with LlamaIndex
- Creating vector embeddings for semantic search
- Combining graph traversal with vector similarity to retrieve context that similarity search alone misses
- When GraphRAG outperforms traditional RAG systems

## 💼 Why GraphRAG Matters

Traditional RAG (Retrieval-Augmented Generation) retrieves documents based on **semantic similarity alone**. This works well for simple lookups, but it tends to fall short when:

- **Multi-hop reasoning** is needed ("Who worked with people who know experts in NLP?")
- **Relationships matter** ("Show me all projects connected to the Data Science team")
- **Entity disambiguation** is required (distinguishing between multiple people named "John")
- **Structured knowledge** improves context (organizational hierarchies, citation networks)

**GraphRAG solves these problems** by:
1. Extracting entities and relationships using LLMs
2. Building a knowledge graph that captures structure
3. Using vector search to find relevant entities
4. Traversing the graph to gather rich, connected context
5. Providing LLMs with both semantically similar AND structurally related information

## 🔗 Building on Notebook 1

You've already learned:
- Graph database concepts (nodes, relationships, properties)
- Cypher query language for pattern matching
- Manual knowledge graph construction

Now you'll automate everything using:
- **LLM-powered entity extraction** (GPT-5-nano)
- **LlamaIndex** for document processing and retrieval
- **Hybrid search** combining vectors and graphs

Let's begin! 🚀

---
## 📚 Part 1: Theory - Understanding GraphRAG

### What is GraphRAG?

**GraphRAG** (Graph Retrieval-Augmented Generation) is an advanced RAG technique that enhances traditional vector-based retrieval with graph database capabilities. Instead of only finding semantically similar documents, GraphRAG understands **how entities are connected**.

### Traditional RAG vs. GraphRAG

**Traditional RAG:**
1. Chunk documents into passages
2. Create vector embeddings for each chunk
3. User asks a question
4. Find top-K most similar chunks
5. Pass chunks to LLM for answer generation

**Limitations:**
- No understanding of relationships between entities
- Can't answer "who knows whom" or "what's connected to what"
- Misses context from related but not semantically similar documents
- Poor performance on multi-hop questions
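
The vector-only pipeline listed above can be illustrated with a minimal, dependency-free sketch. The three-dimensional vectors and the chunk names (`churn_update`, `fraud_kickoff`, `mobile_redesign`) are made up for illustration; a real system would obtain embeddings from an embedding model and store them in a vector database.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: List[float],
    chunk_vecs: Dict[str, List[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every chunk against the query and keep the k best."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings (real models use hundreds of dimensions)
toy_chunk_vectors = {
    "churn_update": [0.9, 0.1, 0.0],
    "fraud_kickoff": [0.1, 0.9, 0.0],
    "mobile_redesign": [0.0, 0.1, 0.9],
}
toy_query_vector = [0.8, 0.2, 0.0]

top_hits = top_k_chunks(toy_query_vector, toy_chunk_vectors, k=2)
for chunk_id, score in top_hits:
    print(f"{chunk_id}: {score:.3f}")
```

Nothing in this ranking knows how the chunks relate to each other, which is the gap the limitations above describe.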

**GraphRAG:**
1. Extract entities and relationships from documents using LLMs
2. Build a knowledge graph (nodes = entities, edges = relationships)
3. Create vector embeddings for both documents AND entities
4. User asks a question
5. Find relevant entities via vector search
6. Traverse the graph to find connected entities and documents
7. Combine semantically similar and structurally related context
8. Pass enriched context to LLM for answer generation
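
Steps 5 and 6 above can be sketched without any database. The example assumes the seed entities have already been chosen by vector search, and it uses a small hand-written adjacency map (`toy_graph`) in place of Neo4j; both the graph contents and the function name `expand_from_seeds` are illustrative.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical undirected adjacency map built from extracted relationships
toy_graph: Dict[str, Set[str]] = {
    "Sarah Chen": {"Customer Churn Prediction", "Marcus Johnson"},
    "Marcus Johnson": {"Sarah Chen", "Recommendation Engine v2"},
    "Customer Churn Prediction": {"Sarah Chen"},
    "Recommendation Engine v2": {"Marcus Johnson", "James Liu"},
    "James Liu": {"Recommendation Engine v2"},
}


def expand_from_seeds(
    graph: Dict[str, Set[str]],
    seeds: List[str],
    max_hops: int = 2,
) -> Dict[str, int]:
    """Breadth-first expansion: map each reachable entity to its hop distance."""
    distances = {seed: 0 for seed in seeds if seed in graph}
    queue = deque(distances)
    while queue:
        current = queue.popleft()
        if distances[current] >= max_hops:
            continue  # Do not expand beyond the hop limit
        for neighbor in graph.get(current, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return distances


# Pretend vector search returned "Sarah Chen" as the most relevant entity
context_entities = expand_from_seeds(toy_graph, ["Sarah Chen"], max_hops=2)
print(context_entities)
```

With `max_hops=2`, "Recommendation Engine v2" is pulled in through Marcus Johnson even though it might not be semantically close to the query, while "James Liu" (three hops away) is left out.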

**Advantages:**
- ✅ Understands relationships ("Sarah manages Marcus who works on Project X")
- ✅ Multi-hop reasoning ("Find experts connected to AI researchers in our network")
- ✅ Entity disambiguation (distinguishing between different "John Smiths")
- ✅ Richer context from connected entities
- ✅ Explainable retrieval paths ("I found this via Sarah → Project X → Document Y")
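
The "explainable retrieval path" idea can be made concrete with a shortest-path search that remembers how each node was reached. This is only a sketch over a hypothetical three-node graph (`toy_edges`); in Neo4j the same information would come from a path query rather than from Python.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical adjacency list mirroring the example in the bullet above
toy_edges: Dict[str, List[str]] = {
    "Sarah": ["Project X"],
    "Project X": ["Sarah", "Document Y"],
    "Document Y": ["Project X"],
}


def retrieval_path(graph: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Return the shortest chain of nodes from start to goal, or None if unreachable."""
    parents: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent pointers back to the start, then reverse
            path: List[str] = []
            cursor: Optional[str] = node
            while cursor is not None:
                path.append(cursor)
                cursor = parents[cursor]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


path = retrieval_path(toy_edges, "Sarah", "Document Y")
explanation = " → ".join(path) if path else "no path found"
print(explanation)
```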

### 🔑 Key Components of GraphRAG

1. **Entity Extraction**: Use LLMs to identify entities (people, organizations, concepts) in text
2. **Relationship Extraction**: Identify how entities are connected
3. **Knowledge Graph Construction**: Store entities and relationships in a graph database
4. **Vector Embeddings**: Create semantic representations of entities and documents
5. **Hybrid Retrieval**: Combine vector similarity with graph traversal
6. **Context Assembly**: Gather relevant information from multiple sources
7. **LLM Generation**: Generate answers using enriched context

### 🌟 Real-World Use Cases

1. **Enterprise Knowledge Management**: "Find all documents related to projects that Sarah's team collaborated on with Engineering"
2. **Research Paper Discovery**: "Show me papers cited by NLP experts who also published on transformers"
3. **Customer Support**: "Find solutions used by customers in similar industries with related issues"
4. **Legal Document Analysis**: "Identify all cases related to this statute through precedent citations"
5. **Medical Knowledge Graphs**: "Find treatment protocols for conditions related to this patient's diagnosis"

### 💡 Key Point: When to Use GraphRAG

Use **GraphRAG** when:
- Your domain has rich entity relationships (organizational, citation, social networks)
- Questions require multi-hop reasoning
- Entity disambiguation is important
- Relationships are as important as content similarity

Use **Traditional RAG** when:
- Documents are mostly independent
- Simple semantic similarity is sufficient
- Lower complexity and faster implementation are priorities

### 🎯 Key Takeaways

- GraphRAG combines vector search with graph traversal to retrieve context that similarity alone would miss
- LLMs extract entities and relationships automatically from text
- Knowledge graphs capture structure that vectors alone cannot
- Hybrid search provides both semantic and structural relevance
- GraphRAG excels at multi-hop reasoning and entity-centric questions

---
## ⚙️ Part 2: Setup - Installing Dependencies

We'll install everything needed for GraphRAG:

- **Neo4j Community Edition**: Graph database (same as Notebook 1)
- **LlamaIndex**: Document processing and RAG orchestration framework
- **OpenAI Python SDK**: For GPT-5-nano API access
- **ChromaDB**: Vector database for embeddings
- **Additional libraries**: pandas, networkx, matplotlib for analysis

This setup takes about 60-90 seconds. Let's begin! ⚡

In [None]:
# 📦 Install Neo4j Community Edition in Colab
# (Same process as Notebook 1)

import time
import os

print("🔄 Step 1: Installing Java (required for Neo4j)...")
!apt-get update -qq > /dev/null 2>&1
!apt-get install -y default-jre wget > /dev/null 2>&1
print("✅ Java installed!\n")

print("📥 Step 2: Downloading Neo4j Community Edition 4.4.36...")
!wget -q https://dist.neo4j.org/neo4j-community-4.4.36-unix.tar.gz
print("✅ Downloaded!\n")

print("📦 Step 3: Extracting Neo4j...")
!tar -xf neo4j-community-4.4.36-unix.tar.gz
!mv neo4j-community-4.4.36 neo4j
print("✅ Extracted to 'neo4j' folder!\n")

print("🔐 Step 4: Setting initial password...")
!neo4j/bin/neo4j-admin set-initial-password password123
print("✅ Password set to: password123\n")

print("🚀 Step 5: Starting Neo4j server...")
!neo4j/bin/neo4j start
print("✅ Neo4j starting...\n")

print("⏳ Step 6: Waiting for Neo4j to be fully ready...")
import socket
max_attempts = 30
for attempt in range(max_attempts):
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(1)
        result = sock.connect_ex(('127.0.0.1', 7687))
        sock.close()
        if result == 0:
            print(f"✅ Neo4j is ready! (took {attempt * 2} seconds)\n")
            break
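    except OSError:
        pass  # Socket error; retry on the next iteration
    if attempt % 5 == 0 and attempt > 0:
        print(f"   Still waiting... ({attempt}/{max_attempts})")
    time.sleep(2)
else:
    # The loop ended without a successful connection
    print("⚠️ Neo4j did not respond on port 7687; check the files under neo4j/logs/\n")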

print("=" * 60)
print("📊 NEO4J CONNECTION DETAILS")
print("=" * 60)
print("URI:      bolt://localhost:7687")
print("Username: neo4j")
print("Password: password123")
print("=" * 60)
print("\n🎉 Neo4j setup complete!\n")

In [None]:
# 📦 Install Python dependencies for GraphRAG

print("📥 Installing GraphRAG dependencies...")
print("This may take 30-60 seconds...\n")

# Fix dependency issue first
!pip install -q jedi

# Core dependencies
!pip install -q neo4j py2neo
!pip install -q openai
!pip install -q llama-index
!pip install -q llama-index-graph-stores-neo4j
!pip install -q llama-index-vector-stores-chroma
!pip install -q llama-index-embeddings-openai
!pip install -q chromadb
!pip install -q pandas networkx matplotlib

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

print("✅ All dependencies installed!")
print("\n📦 Installed packages:")
print("  ✓ neo4j, py2neo - Graph database drivers")
print("  ✓ openai - GPT-5-nano API access")
print("  ✓ llama-index - RAG orchestration framework")
print("  ✓ chromadb - Vector database")
print("  ✓ pandas, networkx, matplotlib - Data analysis and visualization")

In [None]:
# 📚 Import all necessary libraries

# Neo4j for graph database
from neo4j import GraphDatabase
from llama_index.graph_stores.neo4j import Neo4jGraphStore

# LlamaIndex core
# ServiceContext is not imported: it is deprecated in favor of Settings and unused here
from llama_index.core import (
    Document,
    VectorStoreIndex,
    StorageContext,
    Settings
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import (
    TitleExtractor,
    QuestionsAnsweredExtractor,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# ChromaDB for vector storage
import chromadb

# Data manipulation and visualization
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from typing import List, Dict, Any, Optional
import json
from pprint import pprint

print("✅ All libraries imported successfully!")

In [None]:
# Configure OpenAI API key
# Method 1: Try to get API key from Colab secrets (recommended)
try:
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    print("✅ API key loaded from Colab secrets")
except Exception:  # Not running in Colab, or the secret is missing
    # Method 2: Manual input (fallback)
    from getpass import getpass
    print("💡 To use Colab secrets: Go to 🔑 (left sidebar) → Add new secret → Name: OPENAI_API_KEY")
    OPENAI_API_KEY = getpass("Enter your OpenAI API Key: ")

# Validate the key before using it (os.environ rejects None values)
if not OPENAI_API_KEY or OPENAI_API_KEY.strip() == "":
    raise ValueError("❌ ERROR: No API key provided!")

# Set the API key as an environment variable
import os
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

print("✅ Authentication configured!")

# Configure which OpenAI model to use
OPENAI_MODEL = "gpt-5-nano"  # Cost-efficient model used for entity extraction
print(f"🤖 Selected Model: {OPENAI_MODEL}")

# Initialize OpenAI client
from openai import OpenAI
client = OpenAI(api_key=OPENAI_API_KEY)

# Configure LlamaIndex to use OpenAI
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

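# Setting llm to None disables LlamaIndex's own LLM (it substitutes a MockLLM),
# so LlamaIndex query engines will not produce real LLM answers.
# We call the OpenAI client directly for more control.
Settings.llm = None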
Settings.embed_model = OpenAIEmbedding(api_key=OPENAI_API_KEY)

print("✅ OpenAI client configured for LlamaIndex!")

In [None]:
# 🔗 Configure Neo4j connection

NEO4J_URI = "bolt://localhost:7687"
NEO4J_USER = "neo4j"
NEO4J_PASSWORD = "password123"

# Create Neo4j driver
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

# Test connection
def test_connection():
    try:
        with driver.session() as session:
            result = session.run("MATCH (n) RETURN count(n) as count")
            count = result.single()["count"]
            print("✅ Successfully connected to Neo4j!")
            print(f"📊 Current node count: {count}")
            return True
    except Exception as e:
        print(f"❌ Connection failed: {e}")
        return False

test_connection()

# Initialize LlamaIndex Neo4j graph store
# Note: refresh_schema=False because APOC procedures are not installed in Neo4j Community Edition
graph_store = Neo4jGraphStore(
    username=NEO4J_USER,
    password=NEO4J_PASSWORD,
    url=NEO4J_URI,
    database="neo4j",
    refresh_schema=False,  # Disable APOC-dependent schema refresh
)

print("\n✅ Neo4j graph store initialized for LlamaIndex!")
print("   (Schema refresh disabled - APOC not required)")

In [None]:
# 🗄️ Initialize ChromaDB vector store

# Create ChromaDB client (in-memory for Colab)
chroma_client = chromadb.EphemeralClient()

# Create a collection for our documents
# (get_or_create_collection keeps this cell safe to re-run)
collection_name = "graphrag_documents"
chroma_collection = chroma_client.get_or_create_collection(name=collection_name)

# Initialize LlamaIndex ChromaDB vector store
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

print("✅ ChromaDB vector store initialized!")
print(f"   Collection: {collection_name}")
print(f"   Mode: Ephemeral (in-memory)")

---
## 📄 Part 3: Loading Sample Documents

### Creating Unstructured Text Data

In this section, we'll create sample documents that simulate real-world scenarios:
- **Project updates** describing team collaborations
- **Research summaries** mentioning authors and citations
- **Meeting notes** with action items and relationships

These documents contain **implicit entities** (people, projects, organizations) and **relationships** (works on, collaborates with, cites) that we'll extract using GPT-5-nano.

### Why Unstructured Text?

Real-world knowledge exists in:
- Email threads and Slack messages
- Research papers and technical reports
- Meeting notes and project documentation
- Customer support tickets
- Legal contracts and medical records

**GraphRAG automatically extracts structure** from this unstructured text, building a knowledge graph without manual data entry!

In [None]:
# 📝 Create sample documents with rich entities and relationships

sample_documents = [
    {
        "title": "Q4 2024 AI Research Team Update",
        "content": """
The AI Research team, led by Dr. Sarah Chen, has made significant progress on the 
Customer Churn Prediction project. The team includes Marcus Johnson (Senior ML Engineer) 
and Priya Patel (Data Scientist), who have been collaborating closely with the 
Data Engineering team headed by Tom Wilson.

The project leverages transformer-based models and achieved 89% accuracy in predicting 
customer churn. Sarah presented these results at the NeurIPS 2024 conference, where 
she connected with Prof. Andrew Ng from Stanford University, who expressed interest 
in collaborating on future research.

The project is scheduled to move to production in Q1 2025, with support from the 
Engineering department led by David Kim. Marcus will be leading the deployment effort.
"""
    },
    {
        "title": "Recommendation Engine v2 - Technical Design",
        "content": """
The Recommendation Engine v2 project aims to improve our content recommendation system 
using graph neural networks. The technical lead is Marcus Johnson, working with James Liu 
(ML Engineer) and Robert Brown (Senior Data Scientist).

This project builds upon research from "Attention Is All You Need" (Vaswani et al., 2017) 
and incorporates recent advances in graph representation learning. The team cited work by 
Prof. Jure Leskovec from Stanford on knowledge graph embeddings as a key inspiration.

James Liu previously worked on the NLP Chatbot project with Priya Patel, bringing valuable 
experience in transformer architectures. Robert Brown contributed his expertise in 
feature engineering, drawing from his background at Google Research where he collaborated 
with Dr. Geoffrey Hinton's team.

The project integrates with our existing Neo4j knowledge graph, which contains user behavior 
data collected by the Analytics team under Sophie Martin (Product Manager).
"""
    },
    {
        "title": "Fraud Detection System - Project Kickoff Notes",
        "content": """
Meeting Date: October 15, 2024
Attendees: Marcus Johnson, Robert Brown, James Liu, Amy Zhang (Senior Backend Engineer), 
Lisa Anderson (DevOps Engineer)

The Fraud Detection System project was initiated following a request from the Finance 
department. This critical project will use anomaly detection algorithms to identify 
suspicious transaction patterns in real-time.

Marcus Johnson will serve as technical lead, with Robert Brown focusing on model development. 
The system will be deployed on AWS infrastructure managed by Lisa Anderson. Amy Zhang will 
build the API layer for integration with our existing payment processing system.

The project references research on graph-based fraud detection by Prof. Danai Koutra from 
University of Michigan. Robert Brown attended her keynote at KDD 2024 and proposed adapting 
her techniques for our use case.

Timeline: 6-month project with go-live targeted for April 2025. The project will be reviewed 
monthly by Alex Turner (CEO) and Jennifer Lee (CTO).
"""
    },
    {
        "title": "Research Paper: Deep Learning for Time Series Forecasting",
        "content": """
Authors: Priya Patel, Elena Rodriguez, Tom Wilson
Affiliation: TechCorp AI Research Lab

Abstract: This paper presents a novel approach to time series forecasting using deep learning 
techniques. We build upon recent work by Prof. Yoshua Bengio on attention mechanisms and 
propose a hybrid model combining LSTMs with transformer architectures.

Our approach was validated on the Customer Churn Prediction dataset, demonstrating significant 
improvements over baseline methods. We acknowledge contributions from Dr. Sarah Chen, who 
provided guidance on model architecture, and Marcus Johnson, who assisted with hyperparameter 
tuning.

This work was presented at the Real-time Analytics Dashboard internal symposium and received 
positive feedback from Jennifer Lee (CTO) and David Kim (VP of Engineering). We plan to submit 
this work to ICML 2025.

Related Work: Our approach builds on "BERT: Pre-training of Deep Bidirectional Transformers" 
(Devlin et al., 2018) and "Temporal Fusion Transformers" (Lim et al., 2021). We also 
incorporate ideas from graph neural networks research by Prof. Jure Leskovec at Stanford.
"""
    },
    {
        "title": "Engineering Team Quarterly Review - Q4 2024",
        "content": """
The Engineering team, under VP David Kim, delivered exceptional results this quarter. 
Key accomplishments include:

1. Infrastructure Migration: Led by Lisa Anderson and Mohammed Ali (Backend Engineer), 
the team successfully migrated 80% of our services to Kubernetes. This project involved 
close collaboration with the DevOps team and was completed ahead of schedule.

2. Mobile App Redesign: Carlos Santos (Frontend Engineer) and Raj Sharma (Product Designer) 
shipped the new mobile interface, which increased user engagement by 35%. Nina Williams 
(UX Researcher) conducted extensive user testing that informed the final design.

3. A/B Testing Platform: Sophie Martin (Product Manager) spearheaded the development of 
our new A/B testing infrastructure. Amy Zhang built the backend services, while Carlos 
Santos implemented the frontend dashboard.

The team collaborated extensively with the Data Science department (Dr. Sarah Chen) on the 
Real-time Analytics Dashboard project. Tom Wilson (Data Engineer) built the data pipelines, 
while Amy Zhang and Elena Rodriguez (Data Analyst) created the visualization layer.

Next quarter priorities were set in consultation with Alex Turner (CEO) and Jennifer Lee (CTO). 
The focus will be on AI/ML model deployment automation and expanding our cloud infrastructure.
"""
    },
    {
        "title": "Partnership Announcement: TechCorp and Stanford AI Lab",
        "content": """
TechCorp is excited to announce a research partnership with Stanford University's AI Lab, 
led by Prof. Andrew Ng and Prof. Jure Leskovec. This collaboration emerged from discussions 
at NeurIPS 2024 where Dr. Sarah Chen presented our work on customer churn prediction.

The partnership will focus on three areas:

1. Graph Neural Networks: Prof. Jure Leskovec will advise our team (Marcus Johnson, 
Robert Brown, James Liu) on applying graph neural networks to recommendation systems.

2. Transfer Learning: Prof. Andrew Ng will collaborate with Priya Patel and Maria Garcia 
(Junior ML Engineer) on transfer learning techniques for computer vision applications 
related to our Computer Vision API project.

3. Knowledge Graphs: The Stanford team will work with our Data Engineering group (Tom Wilson, 
Elena Rodriguez) to enhance our Neo4j-based knowledge graph infrastructure.

This partnership was championed by Jennifer Lee (CTO) and Alex Turner (CEO), with support 
from David Kim (VP of Engineering) and Dr. Sarah Chen (VP of Data Science). The first joint 
research workshop is scheduled for January 2025 at Stanford's campus.

Prof. Andrew Ng commented: "TechCorp's work on combining graph databases with machine learning 
aligns perfectly with our research vision. We're particularly impressed by the team's implementation 
of attention mechanisms in their recommendation engine."
"""
    }
]

# Convert to LlamaIndex Document objects
documents = [
    Document(
        text=doc["content"],
        metadata={"title": doc["title"]}
    )
    for doc in sample_documents
]

print(f"✅ Created {len(documents)} sample documents")
print("\n📄 Document titles:")
for i, doc in enumerate(sample_documents, 1):
    print(f"  {i}. {doc['title']}")

---
## 🤖 Part 4: LLM-Powered Entity Extraction

### Automating Knowledge Graph Construction

Instead of manually identifying entities and relationships (as in Notebook 1), we'll use **GPT-5-nano** to automatically extract:

- **Entities**: People, organizations, projects, technologies, locations
- **Relationships**: works_on, collaborates_with, leads, cites, presents_at
- **Properties**: Roles, affiliations, dates, locations
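
Because entity and relationship types come back from an LLM, they may not match the expected vocabulary exactly (for example `"Person|Organization"` or `"co-authored with"`). Cypher does not accept labels or relationship types as query parameters, so these strings end up interpolated into query text. The following is a minimal sketch of normalizing them first; the helper names `safe_label` and `safe_relationship_type` are hypothetical, and the allowed set simply mirrors the types used in this notebook.

```python
import re

# Entity types this notebook asks the LLM to use
ALLOWED_ENTITY_TYPES = {
    "Person", "Organization", "Project", "Technology",
    "Location", "Event", "Concept",
}


def safe_label(raw_type: str, default: str = "Entity") -> str:
    """Accept only known entity types; anything else falls back to a generic label."""
    candidate = (raw_type or "").strip()
    return candidate if candidate in ALLOWED_ENTITY_TYPES else default


def safe_relationship_type(raw_type: str, default: str = "RELATED_TO") -> str:
    """Collapse non-alphanumeric runs to underscores and upper-case the result."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", raw_type or "").strip("_").upper()
    if not cleaned or cleaned[0].isdigit():
        return default
    return cleaned


print(safe_label("Person"))                         # Person
print(safe_label("Person|Organization"))            # Entity
print(safe_relationship_type("co-authored with"))   # CO_AUTHORED_WITH
```

A whitelist is stricter than character filtering, which is why it is used for labels here; relationship types are left open-ended because the LLM may legitimately invent new ones.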

### How It Works

1. **Prompt Engineering**: We'll craft prompts that instruct the LLM to extract structured information
2. **Entity Recognition**: GPT-5-nano identifies named entities in the text
3. **Relationship Extraction**: The LLM infers connections between entities
4. **Structured Output**: We'll parse the LLM response into graph-ready format

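This approach removes the manual annotation step, so it can be applied to much larger document sets. Each document still costs one LLM call, and the extracted output is worth validating before it reaches the graph.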
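
Step 4 (structured output) is where things most often go wrong, since the model may wrap its JSON in a Markdown fence or reference entities it never declared. Below is a hedged sketch of a tolerant parser; `parse_extraction` is a hypothetical helper, and the sample response is hand-written rather than real model output.

```python
import json
import re
from typing import Any, Dict

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, flags=re.DOTALL)
EMPTY: Dict[str, Any] = {"entities": [], "relationships": []}


def parse_extraction(raw: str) -> Dict[str, Any]:
    """Parse an LLM reply into entities and relationships, dropping malformed items."""
    match = FENCED_BLOCK.search(raw)
    payload = match.group(1) if match else raw
    try:
        data = json.loads(payload.strip())
    except json.JSONDecodeError:
        return dict(EMPTY)
    if not isinstance(data, dict):
        return dict(EMPTY)

    # Keep only entities that are objects with a non-empty string name
    entities = [
        e for e in (data.get("entities") or [])
        if isinstance(e, dict) and isinstance(e.get("name"), str) and e["name"].strip()
    ]
    known_names = {e["name"] for e in entities}

    # Keep only relationships whose endpoints were both declared as entities
    relationships = [
        r for r in (data.get("relationships") or [])
        if isinstance(r, dict)
        and r.get("source") in known_names
        and r.get("target") in known_names
    ]
    return {"entities": entities, "relationships": relationships}


# Hand-written sample reply (not real model output)
sample_reply = (
    "Here is the result:\n" + FENCE + "json\n"
    '{"entities": [{"name": "Sarah Chen", "type": "Person"},'
    ' {"name": "AI Research", "type": "Organization"}],'
    ' "relationships": [{"source": "Sarah Chen", "target": "AI Research", "type": "leads"},'
    ' {"source": "Sarah Chen", "target": "Unknown Person", "type": "knows"}]}\n' + FENCE
)
parsed = parse_extraction(sample_reply)
print(len(parsed["entities"]), len(parsed["relationships"]))
```

Dropping relationships with undeclared endpoints is a design choice: it avoids edges that would silently match nothing when written to the graph.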

In [None]:
# 🤖 Create entity extraction function using GPT-5-nano

def extract_entities_and_relationships(text: str, doc_title: str = "") -> Dict[str, Any]:
    """
    Use GPT-5-nano to extract entities and relationships from text.
    
    Args:
        text: The document text to analyze
        doc_title: Optional document title for context
        
    Returns:
        Dictionary containing extracted entities and relationships
    """
    
    # Only the first 2000 characters are sent, to keep the prompt small
    extraction_prompt = f"""
Extract entities and relationships from the following text. Return the results as a JSON object.

TEXT:
{text[:2000]}

Extract:
1. **Entities** with types: Person, Organization, Project, Technology, Location, Event, Concept
2. **Relationships** between entities (who works with whom, who leads what, etc.)

Return JSON format:
{{
  "entities": [
    {{"name": "Entity Name", "type": "Person|Organization|Project|Technology|Location|Event|Concept", "properties": {{"role": "...", "affiliation": "..."}}}},
    ...
  ],
  "relationships": [
    {{"source": "Entity 1", "target": "Entity 2", "type": "works_on|collaborates_with|leads|cites|presents_at|employed_by", "properties": {{}}}},
    ...
  ]
}}

Focus on the most important entities and clear relationships. Be concise.
"""
    
    try:
        # Call the configured model (OPENAI_MODEL, set during setup) using the Responses API
        response = client.responses.create(
            model=OPENAI_MODEL,
            input=extraction_prompt,
            reasoning={"effort": "minimal"}  # Optimize for speed
        )
        
        # Extract the response text
        result_text = response.output_text if hasattr(response, 'output_text') else str(response)
        
        # Parse JSON (handle cases where LLM might add markdown formatting)
        if "```json" in result_text:
            result_text = result_text.split("```json")[1].split("```")[0]
        elif "```" in result_text:
            result_text = result_text.split("```")[1].split("```")[0]
        
        extracted = json.loads(result_text.strip())
        extracted["document_title"] = doc_title
        
        return extracted
        
    except Exception as e:
        print(f"❌ Extraction error: {e}")
        return {"entities": [], "relationships": [], "document_title": doc_title}

print("✅ Entity extraction function created!")
print("   Model: gpt-5-nano")
print("   Format: JSON output with entities and relationships")

In [None]:
# 🔄 Extract entities from all documents

print("🤖 Extracting entities and relationships from documents...")
print("This will take 30-60 seconds (using GPT-5-nano)...\n")

extracted_data = []

for i, doc in enumerate(sample_documents, 1):
    print(f"📄 Processing document {i}/{len(sample_documents)}: {doc['title'][:50]}...")
    
    result = extract_entities_and_relationships(doc["content"], doc["title"])
    extracted_data.append(result)
    
    # Show sample of extracted entities
    if result["entities"]:
        print(f"   ✓ Found {len(result['entities'])} entities, {len(result['relationships'])} relationships")
    else:
        print(f"   ⚠️ No entities extracted")
    
print(f"\n✅ Extraction complete!")
print(f"📊 Total entities extracted: {sum(len(d['entities']) for d in extracted_data)}")
print(f"🔗 Total relationships extracted: {sum(len(d['relationships']) for d in extracted_data)}")

# Show a sample of extracted data
print("\n📋 Sample extraction from first document:")
if extracted_data[0]["entities"]:
    print("\nEntities (first 3):")
    for entity in extracted_data[0]["entities"][:3]:
        print(f"  - {entity['name']} ({entity['type']})")
    
    print("\nRelationships (first 3):")
    for rel in extracted_data[0]["relationships"][:3]:
        print(f"  - {rel['source']} --[{rel['type']}]--> {rel['target']}")

---
## 🏗️ Part 5: Building the Knowledge Graph Automatically

### From Extracted Data to Neo4j

Now that we've extracted entities and relationships using GPT-5-nano, we'll automatically populate our Neo4j knowledge graph. This process:

1. **Deduplicates entities** (merge multiple mentions of "Sarah Chen" into one node)
2. **Creates nodes** for each unique entity
3. **Establishes relationships** between entities
4. **Adds properties** (roles, affiliations, etc.)
5. **Links documents** to entities they mention

This is where **manual work** (Notebook 1) becomes **fully automated** (Notebook 2)!
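
One caveat on step 1: merging on an exact name only deduplicates mentions that are spelled identically. The sample documents refer to the same person as "Dr. Sarah Chen" and "Sarah", so some normalization before merging can help. The sketch below shows one simple, assumption-laden approach (stripping a few honorifics and collapsing whitespace); `canonical_name` and `merge_mentions` are hypothetical helpers, and real entity resolution usually needs more than this.

```python
import re
from typing import Any, Dict, List

# A small, non-exhaustive list of honorifics to strip (an assumption for this sketch)
TITLE_PATTERN = re.compile(r"^(dr|prof|mr|mrs|ms)\.?\s+", flags=re.IGNORECASE)


def canonical_name(name: str) -> str:
    """Collapse repeated whitespace and drop a leading honorific."""
    collapsed = " ".join(name.split())
    return TITLE_PATTERN.sub("", collapsed)


def merge_mentions(mentions: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Group mentions by canonical name and union their properties."""
    merged: Dict[str, Dict[str, Any]] = {}
    for mention in mentions:
        key = canonical_name(mention["name"])
        node = merged.setdefault(
            key,
            {"name": key, "type": mention.get("type", "Entity"), "properties": {}},
        )
        node["properties"].update(mention.get("properties", {}))
    return merged


mentions = [
    {"name": "Dr. Sarah Chen", "type": "Person", "properties": {"role": "VP of Data Science"}},
    {"name": "Sarah  Chen", "type": "Person", "properties": {"affiliation": "TechCorp"}},
    {"name": "Prof. Andrew Ng", "type": "Person"},
]
merged = merge_mentions(mentions)
print(sorted(merged))
```

This does not resolve a bare first name such as "Sarah" to "Sarah Chen"; that would need context-aware matching, which is outside the scope of this sketch.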

In [None]:
# 🏗️ Build the knowledge graph from extracted data

# Clear existing data
with driver.session() as session:
    session.run("MATCH (n) DETACH DELETE n")
    print("🧹 Cleared existing graph data\n")

# Helper function to insert entities and relationships
def build_graph_from_extractions(extracted_data_list):
    """
    Build Neo4j knowledge graph from LLM-extracted data.
    """
    with driver.session() as session:
        total_entities = 0
        total_relationships = 0
        
        for extraction in extracted_data_list:
            doc_title = extraction.get("document_title", "Unknown")
            
            # Create Document node
            session.run(
                """
                MERGE (d:Document {title: $title})
                """,
                title=doc_title
            )
            
            # Create entity nodes
            for entity in extraction.get("entities", []):
                name = entity.get("name", "").strip()
                entity_type = entity.get("type", "Entity")
                properties = entity.get("properties", {})
                
                if not name:
                    continue
                
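                # Create or merge entity node. Labels cannot be passed as Cypher
                # parameters, so the LLM-supplied type is reduced to alphanumerics
                # and backtick-quoted before being interpolated into the query.
                label = "".join(ch for ch in str(entity_type) if ch.isalnum()) or "Entity"
                query = f"""
                MERGE (e:`{label}` {{name: $name}})
                SET e += $properties
                """
                session.run(query, name=name, properties=properties)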
                
                # Link entity to document
                session.run(
                    """
                    MATCH (e {name: $name}), (d:Document {title: $doc_title})
                    MERGE (e)-[:MENTIONED_IN]->(d)
                    """,
                    name=name,
                    doc_title=doc_title
                )
                
                total_entities += 1
            
            # Create relationships
            for rel in extraction.get("relationships", []):
                source = rel.get("source", "").strip()
                target = rel.get("target", "").strip()
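                raw_type = str(rel.get("type") or "RELATED_TO")
                # Relationship types cannot be parameterized, so keep only identifier-safe characters
                rel_type = "".join(ch if ch.isalnum() else "_" for ch in raw_type).upper()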
                rel_props = rel.get("properties", {})
                
                if not source or not target:
                    continue
                
                session.run(
                    f"""
                    MATCH (s {{name: $source}}), (t {{name: $target}})
                    MERGE (s)-[r:{rel_type}]->(t)
                    SET r += $properties
                    """,
                    source=source,
                    target=target,
                    properties=rel_props
                )
                
                total_relationships += 1
        
        return total_entities, total_relationships

# Build the graph
print("🏗️ Building knowledge graph in Neo4j...")
entities_created, rels_created = build_graph_from_extractions(extracted_data)

print(f"\n✅ Knowledge graph built successfully!")
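# These counters track processed mentions; MERGE collapses repeats into single nodes/edges
print(f"   📊 Entity mentions processed: {entities_created}")
print(f"   🔗 Relationship mentions processed: {rels_created}")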

# Verify the graph
with driver.session() as session:
    result = session.run("""
        MATCH (n)
        RETURN labels(n)[0] as type, count(*) as count
        ORDER BY count DESC
    """)
    print(f"\n📋 Node types in graph:")
    for record in result:
        print(f"   - {record['type']}: {record['count']}")

---
## 🎯 Part 6: Creating Vector Embeddings

### Why Vector Embeddings?

While our knowledge graph captures **structure** (who knows whom, who works on what), vector embeddings capture **semantic meaning**. By combining both, we get:

- **Semantic search**: Find conceptually similar content
- **Entity disambiguation**: Distinguish between entities with similar names using context
- **Hybrid retrieval**: Use vector similarity to find entry points, then traverse the graph

### What We'll Create

- Document embeddings for each text
- Entity embeddings (using entity names + context)
- Store everything in ChromaDB for fast similarity search
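
For the entity embeddings, "entity names + context" has to be turned into a text string before it can be embedded. One possible way to build that string from an extraction result is sketched below; the function name `entity_embedding_texts` and the exact text layout are assumptions, and the resulting strings would then be passed to the same embedding model used for the document chunks.

```python
from typing import Any, Dict


def entity_embedding_texts(extraction: Dict[str, Any]) -> Dict[str, str]:
    """Build one descriptive string per entity from its type, properties, and outgoing edges."""
    texts: Dict[str, str] = {}
    for entity in extraction.get("entities", []):
        name = entity.get("name", "").strip()
        if not name:
            continue
        parts = [f"{name} ({entity.get('type', 'Entity')})"]
        # Sorted so the text is stable across runs
        for key, value in sorted(entity.get("properties", {}).items()):
            parts.append(f"{key}: {value}")
        for rel in extraction.get("relationships", []):
            if rel.get("source") == name:
                parts.append(f"{rel.get('type', 'related_to')} {rel.get('target')}")
        texts[name] = "; ".join(parts)
    return texts


# Hand-written extraction result in the same shape the extraction step returns
sample_extraction = {
    "entities": [
        {"name": "Marcus Johnson", "type": "Person",
         "properties": {"role": "Senior ML Engineer"}},
        {"name": "Recommendation Engine v2", "type": "Project", "properties": {}},
    ],
    "relationships": [
        {"source": "Marcus Johnson", "target": "Recommendation Engine v2", "type": "leads"},
    ],
}
entity_texts = entity_embedding_texts(sample_extraction)
for name, text in entity_texts.items():
    print(f"{name!r} -> {text!r}")
```

Including the role and outgoing relationships gives the embedding something to distinguish two entities that share a name.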

In [None]:
# 🎯 Create vector embeddings for documents

# Split documents into chunks for better retrieval
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)

print(f"📄 Split {len(documents)} documents into {len(nodes)} chunks")

# Create embeddings and store in ChromaDB
from llama_index.core import VectorStoreIndex, StorageContext

storage_context = StorageContext.from_defaults(vector_store=vector_store)

print("\n🔄 Creating embeddings (this may take 30-60 seconds)...")
vector_index = VectorStoreIndex(
    nodes,
    storage_context=storage_context,
    show_progress=True
)

print("\n✅ Vector embeddings created and stored in ChromaDB!")
print(f"   📊 Total document chunks embedded: {len(nodes)}")
print(f"   🎯 Embedding model: text-embedding-ada-002")

---
## 🔍 Part 7-8: Vector Search vs Graph Search

### Traditional RAG (Vector-Only)

Traditional RAG uses only vector similarity:

In [None]:
# 🔍 Traditional Vector-Only Search (Baseline)

def vector_search(query: str, top_k: int = 3):
    """
    Pure vector similarity search (traditional RAG baseline).
    """
    query_engine = vector_index.as_query_engine(similarity_top_k=top_k)
    response = query_engine.query(query)
    
    return response

# Test vector search
query = "Who is working on the Recommendation Engine project?"
print(f"❓ Query: {query}\n")
print("📊 Vector Search Results:")
result = vector_search(query)
print(f"\nAnswer: {result}")

### Graph-Based Retrieval

Graph search uses Cypher to traverse relationships:

In [None]:
# 🔗 Graph-Based Search

def graph_search(entity_name: str):
    """
    Retrieve entity and its connections from the knowledge graph.
    """
    with driver.session() as session:
        result = session.run(
            """
            MATCH (e {name: $name})-[r]-(connected)
            RETURN e.name as entity, type(r) as relationship, 
                   connected.name as connected_entity, labels(connected)[0] as type
            LIMIT 20
            """,
            name=entity_name
        )
        return [dict(record) for record in result]

# Test graph search
print("🔗 Graph Search Results for 'Marcus Johnson':\n")
connections = graph_search("Marcus Johnson")
for conn in connections[:5]:
    # The Cypher pattern is undirected, so print the relationship without an arrowhead
    print(f"  {conn['entity']} --[{conn['relationship']}]-- {conn['connected_entity']} ({conn['type']})")

---
## 🚀 Part 9-10: Hybrid Search - GraphRAG in Action

### Combining Vector + Graph

GraphRAG combines both approaches:
1. Use vector search to find relevant entities
2. Traverse the graph to find connected context
3. Assemble rich, relationship-aware context for the LLM

In [None]:
# 🚀 GraphRAG: Hybrid Vector + Graph Search

def graphrag_search(query: str, top_k: int = 3, max_hops: int = 2):
    """
    Hybrid search combining vector similarity and graph traversal.
    
    Steps:
    1. Vector search to find relevant documents
    2. Extract mentioned entities from those documents
    3. Graph traversal to find connected entities and documents
    4. Assemble comprehensive context

    Note: max_hops is currently unused; graph_search() returns only
    direct (1-hop) neighbors of each entity.
    """
    print(f"🔍 GraphRAG Search for: '{query}'\n")
    
    # Step 1: Vector search for relevant documents
    print("📊 Step 1: Vector search for relevant documents...")
    vector_results = vector_index.as_query_engine(similarity_top_k=top_k).query(query)
    
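    # Step 2: Extract entities mentioned in top documents
    print("📋 Step 2: Identifying entities from results...")
    entities_found = set()
    
    # Text of the chunks that the vector search actually retrieved
    retrieved_text = " ".join(
        source.node.get_content() for source in vector_results.source_nodes
    ).lower()
    
    # Keep only graph entities whose names appear in the retrieved chunks.
    # Without this filter, the query returns entities from every document,
    # regardless of the question. (Substring matching is a simple heuristic.)
    with driver.session() as session:
        result = session.run(
            """
            MATCH (e)-[:MENTIONED_IN]->(d:Document)
            RETURN DISTINCT e.name as entity
            """
        )
        for record in result:
            name = record["entity"]
            if name and name.lower() in retrieved_text:
                entities_found.add(name)
    
    print(f"   Found {len(entities_found)} entities")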
    
    # Step 3: Graph traversal from these entities
    print("🔗 Step 3: Traversing knowledge graph...")
    graph_context = []
    
    for entity in sorted(entities_found)[:5]:  # Limit for demo; sorted so runs are reproducible
        connections = graph_search(entity)
        graph_context.extend(connections[:3])
    
    print(f"   Retrieved {len(graph_context)} relationship triples")
    
    # Step 4: Assemble final context
    print("\n✅ GraphRAG Results:\n")
    print(f"📊 Vector-based answer:\n{vector_results}\n")
    print(f"🔗 Graph-enriched context:")
    for ctx in graph_context[:5]:
        # graph_search() matches relationships in either direction, so no arrowhead
        print(f"   - {ctx['entity']} --[{ctx['relationship']}]-- {ctx['connected_entity']}")
    
    return {"vector_answer": str(vector_results), "graph_context": graph_context}

# Test GraphRAG
result = graphrag_search("Who collaborates with Marcus Johnson on AI projects?")
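
`graphrag_search` returns a dictionary with `vector_answer` and `graph_context` keys but only prints them. One possible way to turn that dictionary into a single context string for a follow-up LLM prompt is sketched below. The `build_context` helper and the sample triples are hypothetical, and the prompt layout is just one reasonable choice.

```python
def build_context(result, max_triples=10):
    """Format a graphrag_search-style result as plain text for an LLM prompt."""
    lines = []
    seen = set()
    for triple in result["graph_context"]:
        key = (triple["entity"], triple["relationship"], triple["connected_entity"])
        if key in seen:
            continue  # the same triple can be reached from several entities
        seen.add(key)
        lines.append(f"- {key[0]} [{key[1]}] {key[2]}")
        if len(lines) >= max_triples:
            break
    facts = "\n".join(lines) if lines else "- (none found)"
    return (
        f"Retrieved answer:\n{result['vector_answer']}\n\n"
        f"Known relationships:\n{facts}"
    )

# Made-up sample shaped like the dictionary returned by graphrag_search
sample_result = {
    "vector_answer": "Marcus Johnson works on the Recommendation Engine.",
    "graph_context": [
        {"entity": "Marcus Johnson", "relationship": "WORKS_ON",
         "connected_entity": "Recommendation Engine"},
        {"entity": "Marcus Johnson", "relationship": "WORKS_ON",
         "connected_entity": "Recommendation Engine"},
        {"entity": "Marcus Johnson", "relationship": "COLLABORATES_WITH",
         "connected_entity": "Sarah Chen"},
    ],
}
context = build_context(sample_result)
print(context)
```

Deduplicating and capping the triples keeps the prompt short, which matters once the graph grows beyond a demo.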

---
## 🎓 Part 11: Advanced GraphRAG Patterns

### Multi-Hop Reasoning

GraphRAG excels at multi-hop questions that require traversing multiple relationships:

In [None]:
# 🎓 Advanced Pattern: Multi-Hop Reasoning

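import re

def multi_hop_query(start_entity: str, relationship_pattern: str, hops: int = 2):
    """
    Traverse multiple relationship hops in the knowledge graph.
    Example: Find people who work with people who collaborated with start_entity

    relationship_pattern is one or more relationship types separated by "|",
    e.g. "COLLABORATES_WITH|WORKS_ON".
    """
    # Cypher cannot parameterize relationship types or path lengths, so
    # validate both before interpolating them into the query string.
    if not re.fullmatch(r"\w+(\|\w+)*", relationship_pattern):
        raise ValueError(f"Invalid relationship pattern: {relationship_pattern!r}")
    hops = int(hops)
    if hops < 1:
        raise ValueError("hops must be at least 1")

    with driver.session() as session:
        query = f"""
        MATCH path = (start {{name: $start}})-[:{relationship_pattern}*1..{hops}]-(connected)
        WHERE (connected:Person OR connected:Organization) AND connected <> start
        RETURN connected.name as name, 
               labels(connected)[0] as type,
               min(length(path)) as distance
        ORDER BY distance
        LIMIT 15
        """
        result = session.run(query, start=start_entity)
        return [dict(record) for record in result]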

# Example: Find people within 2 degrees of Sarah Chen
print("🔗 Multi-hop query: People within 2 connections of Sarah Chen\n")
results = multi_hop_query("Sarah Chen", "COLLABORATES_WITH|WORKS_ON", hops=2)

for r in results[:10]:
    print(f"  {r['name']} ({r['type']}) - {r['distance']} hop(s) away")
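
The variable-length Cypher pattern behaves much like a breadth-first search with a depth limit. The sketch below shows that idea on an in-memory adjacency list, so you can experiment without a database. The `toy_graph` connections are invented for illustration and do not reflect the notebook's actual data.

```python
from collections import deque

def neighbors_within(graph, start, max_hops):
    """Return {node: shortest hop distance} for nodes within max_hops of start."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    distances.pop(start)  # exclude the starting node itself
    return distances

# Hypothetical undirected graph stored as an adjacency list
toy_graph = {
    "Sarah Chen": ["Recommendation Engine"],
    "Recommendation Engine": ["Sarah Chen", "Marcus Johnson"],
    "Marcus Johnson": ["Recommendation Engine", "Data Science Team"],
    "Data Science Team": ["Marcus Johnson"],
}
reachable = neighbors_within(toy_graph, "Sarah Chen", max_hops=2)
for name, distance in reachable.items():
    print(f"{name} - {distance} hop(s) away")
```

With `max_hops=2`, the node three hops away is never visited, which is the same reason a depth limit keeps graph queries fast.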

### Entity Disambiguation

GraphRAG uses graph context to disambiguate entities:

In [None]:
# 🎯 Entity Disambiguation using Graph Context

def disambiguate_entity(entity_name: str):
    """
    Use graph relationships to provide context for entity disambiguation.
    """
    with driver.session() as session:
        result = session.run(
            """
            MATCH (e {name: $name})
            OPTIONAL MATCH (e)-[r1]-(connected1)
            OPTIONAL MATCH (e)-[:MENTIONED_IN]->(d:Document)
            // Group by node (not by name) so same-named entities stay separate
            WITH e,
                 collect(DISTINCT type(r1)) as relationship_types,
                 collect(DISTINCT connected1.name)[0..5] as connected_to,
                 collect(DISTINCT d.title)[0..3] as mentioned_in_docs
            RETURN e.name as name, 
                   labels(e)[0] as type,
                   relationship_types,
                   connected_to,
                   mentioned_in_docs
            """,
            name=entity_name
        )
        
        for record in result:
            print(f"Entity: {record['name']} ({record['type']})")
            print(f"  Connected via: {', '.join(record['relationship_types'][:5])}")
            print(f"  Connected to: {', '.join([c for c in record['connected_to'] if c][:5])}")
            print(f"  Mentioned in: {', '.join(record['mentioned_in_docs'])}")

# Example
print("🔍 Disambiguating 'Marcus Johnson':\n")
disambiguate_entity("Marcus Johnson")
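
Once each same-named entity has its own neighborhood, a simple way to pick the intended one is to compare those neighbors with terms from the user's question. The sketch below uses plain set overlap. The `pick_candidate` helper, the candidate records, and the query terms are all hypothetical, and a real system might use embeddings rather than exact string matches.

```python
def pick_candidate(candidates, query_terms):
    """Choose the candidate whose graph neighbors overlap most with the query terms."""
    query = {term.lower() for term in query_terms}

    def overlap(candidate):
        neighbors = {name.lower() for name in candidate["neighbors"]}
        return len(query & neighbors)

    return max(candidates, key=overlap)

# Two made-up people who share a name but sit in different parts of the graph
candidates = [
    {"id": "person-1", "name": "John Smith",
     "neighbors": ["NLP Team", "Search Project"]},
    {"id": "person-2", "name": "John Smith",
     "neighbors": ["Finance", "Budget Report"]},
]
best = pick_candidate(candidates, ["budget report", "finance"])
print(f"Best match: {best['id']} (neighbors: {', '.join(best['neighbors'])})")
```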

---
## 🚀 Part 12: Next Steps & Production Considerations

### Congratulations! 🎉

You've learned how to build complete GraphRAG systems that combine:
- ✅ LLM-powered entity extraction (GPT-5-nano)
- ✅ Automated knowledge graph construction (Neo4j)
- ✅ Vector embeddings for semantic search (ChromaDB)
- ✅ Hybrid retrieval combining graphs and vectors
- ✅ Multi-hop reasoning and entity disambiguation

### 🏭 Production Considerations

**Scaling Entity Extraction:**
- Batch processing for large document corpora
- Caching LLM extractions to reduce API costs
- Fine-tuning models for domain-specific entity types
- Using cheaper models (gpt-5-nano) for extraction, premium models for generation
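
As one illustration of caching extractions, the sketch below keys results by a hash of the document text, so the same text never triggers a second call. `ExtractionCache` and `fake_extract` are hypothetical names. `fake_extract` is a stand-in for a real LLM call, and the in-memory dictionary would need to be replaced by persistent storage in production.

```python
import hashlib

class ExtractionCache:
    """Cache extraction results by a hash of the input text."""

    def __init__(self, extract_fn):
        self._extract_fn = extract_fn
        self._store = {}
        self.misses = 0  # number of times the underlying function was called

    def extract(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._extract_fn(text)
        return self._store[key]

def fake_extract(text):
    # Stand-in for an LLM call: treat capitalized words as "entities"
    return [word for word in text.split() if word.istitle()]

cache = ExtractionCache(fake_extract)
first = cache.extract("Marcus Johnson leads the team")
second = cache.extract("Marcus Johnson leads the team")  # served from cache
print(first, "| underlying calls:", cache.misses)
```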

**Knowledge Graph Optimization:**
- Entity resolution and deduplication
- Confidence scores for extracted relationships
- Temporal relationships (when did X work with Y?)
- Graph embeddings for entity similarity
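
A first step toward entity resolution is to group names that differ only in case, punctuation, or spacing. The sketch below shows that normalization. The `normalize` and `group_duplicates` helpers and the sample names are hypothetical, and this will not catch harder cases such as nicknames or abbreviations.

```python
import re
from collections import defaultdict

def normalize(name):
    """Lowercase, strip punctuation, and collapse whitespace."""
    without_punctuation = re.sub(r"[^\w\s]", "", name)
    return re.sub(r"\s+", " ", without_punctuation).strip().lower()

def group_duplicates(names):
    """Map each normalized name to the raw spellings that produced it."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return dict(groups)

raw_names = ["Marcus Johnson", "marcus  johnson", "Marcus Johnson.", "Sarah Chen"]
groups = group_duplicates(raw_names)
for canonical, variants in groups.items():
    print(f"{canonical}: {variants}")
```

Groups with more than one spelling are candidates for merging into a single node in the graph.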

**Vector Store Scaling:**
- Persistent ChromaDB or Pinecone for production
- Hybrid search with BM25 + dense embeddings
- Metadata filtering for faster retrieval
- Query result caching
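
One common way to combine a keyword ranking (such as BM25) with a dense-embedding ranking is reciprocal rank fusion (RRF). The notebook does not use it, so the sketch below is only an illustration: the two input rankings are made-up document IDs, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists; each item scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from two retrievers
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that rank well in both lists rise to the top, and the method needs only ranks, so the two retrievers' scores never have to be put on a common scale.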

**GraphRAG Query Optimization:**
- Limit graph traversal depth to prevent slowdowns
- Use Cypher query optimization (indexes, profiling)
- Parallel retrieval (vector + graph searches simultaneously)
- LLM result caching for common queries
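
Parallel retrieval can be sketched with a thread pool from the standard library. The stub functions below stand in for the notebook's `vector_search` and `graph_search`, and `retrieve_in_parallel` is a hypothetical helper. If you pass the real functions, keep in mind that each thread should open its own Neo4j session, as `graph_search` already does.

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve_in_parallel(vector_fn, graph_fn, query, entity_name):
    """Run the vector and graph lookups concurrently and collect both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        vector_future = pool.submit(vector_fn, query)
        graph_future = pool.submit(graph_fn, entity_name)
        return {
            "vector_answer": vector_future.result(),
            "graph_context": graph_future.result(),
        }

# Stubs standing in for the real retrieval functions
def stub_vector_search(query):
    return f"stub answer for: {query}"

def stub_graph_search(entity_name):
    return [{"entity": entity_name, "relationship": "WORKS_ON",
             "connected_entity": "Recommendation Engine"}]

combined = retrieve_in_parallel(
    stub_vector_search,
    stub_graph_search,
    "Who works on the Recommendation Engine?",
    "Marcus Johnson",
)
print(combined["vector_answer"])
print(combined["graph_context"])
```

Because both lookups mostly wait on network I/O, threads are usually enough here and total latency approaches that of the slower call.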

### 📚 Further Learning

1. **LlamaIndex Graph RAG Documentation**: Advanced GraphRAG patterns
2. **Neo4j Graph Data Science**: Centrality, community detection, graph algorithms
3. **Hybrid Search Research**: Papers on combining dense + sparse + graph retrieval
4. **Entity Linking**: Linking extracted entities to knowledge bases (Wikidata, DBpedia)

### 🛠️ Practice Exercises

1. **Add New Documents**: Extract entities from your own documents
2. **Custom Entity Types**: Modify extraction prompts for your domain
3. **Complex Queries**: Write multi-hop Cypher queries for your use case
4. **Evaluation**: Compare GraphRAG vs traditional RAG on complex questions
5. **Visualization**: Use NetworkX to visualize the extracted knowledge graph

### 🎯 Key Takeaways

- **GraphRAG > Traditional RAG** for entity-centric and relationship-heavy domains
- **LLMs automate** what used to require manual knowledge engineering
- **Hybrid search** provides both semantic and structural relevance
- **Knowledge graphs** enable explainable, multi-hop reasoning
- **Production systems** require careful optimization of extraction, storage, and retrieval

You now have the skills to build production GraphRAG systems! 🚀

### 📊 System Architecture Summary

```
Documents (Unstructured Text)
      |
      v
GPT-5-nano (Entity Extraction)
      |
      v
Entities + Relationships (Structured)
      |
      ├─────> Neo4j (Knowledge Graph)
      |
      └─────> ChromaDB (Vector Embeddings)
             |
             v
        GraphRAG Engine
             |
             ├─> Vector Search (Semantic)
             ├─> Graph Traversal (Structural)
             └─> Hybrid Results (Best of Both)
```

Thank you for completing this notebook! See you in your next AI project! 🎓

In [None]:
# 🧹 Cleanup: Close connections

print("🧹 Closing database connections...")
driver.close()
print("✅ Connections closed!")
print("\n👋 Thanks for learning GraphRAG!")
print("🚀 Now go build amazing knowledge-graph-powered AI applications!")