# GraphRAG: Knowledge Graph-Enhanced Retrieval

GraphRAG combines traditional RAG with a knowledge graph, so retrieval can follow explicit relationships between entities and answer questions that span several connected facts.

## Learning Objectives

By the end of this notebook, you will:
1. Understand knowledge graphs and their benefits for RAG
2. Extract entities and relationships from documents
3. Build a knowledge graph index
4. Query using graph traversal
5. Combine vector search with graph-based retrieval

---

## What is GraphRAG?

Traditional RAG retrieves text chunks based on their embedding similarity to the query. **GraphRAG** adds:

- **Entity extraction**: Identify people, places, concepts
- **Relationship mapping**: How entities connect
- **Graph traversal**: Follow relationships for context

### When to Use GraphRAG

| Use Case | Vector RAG | GraphRAG |
|----------|------------|----------|
| Simple Q&A | ✓ | |
| Entity relationships | | ✓ |
| Multi-hop reasoning | | ✓ |
| Complex queries | | ✓ |
| Document similarity | ✓ | |

In [None]:
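# Setup
# Allow nested event loops so async LlamaIndex calls work inside Jupyter
import nest_asyncio
nest_asyncio.apply()

# Load environment variables (e.g. OPENAI_API_KEY) from a local .env file
from dotenv import load_dotenv
load_dotenv()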

from llama_index.core import (
    Settings,
    KnowledgeGraphIndex,
    SimpleDirectoryReader,
    Document,
    StorageContext,
)
from llama_index.core.graph_stores import SimpleGraphStore
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.chunk_size = 512

print("✓ Setup complete!")

## 1. Knowledge Graph Basics

A knowledge graph consists of:
- **Nodes**: Entities (people, concepts, things)
- **Edges**: Relationships between entities
- **Properties**: Attributes of nodes and edges

Example: `(Python) --[is_a]--> (Programming Language)`
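
To make the structure concrete, here is a minimal sketch of a graph stored as triplets in plain Python. It does not use LlamaIndex; the names `toy_triplets`, `toy_graph`, and `toy_nodes` are illustrative assumptions, and the facts mirror the sample documents used later in this notebook.

```python
from collections import defaultdict

# Each triplet is (subject, predicate, object); the data is illustrative only.
toy_triplets = [
    ("Python", "is_a", "Programming Language"),
    ("Python", "created_by", "Guido van Rossum"),
    ("Guido van Rossum", "worked_at", "Google"),
]

# Adjacency list: subject -> list of (predicate, object) edges.
toy_graph = defaultdict(list)
for subject, predicate, obj in toy_triplets:
    toy_graph[subject].append((predicate, obj))

# Nodes are every entity that appears as a subject or an object.
toy_nodes = {s for s, _, _ in toy_triplets} | {o for _, _, o in toy_triplets}

# Print the outgoing edges of one node in the notation used above.
for predicate, obj in toy_graph["Python"]:
    print(f"(Python) --[{predicate}]--> ({obj})")
```

A subject-keyed adjacency list like this makes it cheap to look up the outgoing edges of an entity, which is the basic operation behind graph traversal.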

In [None]:
# Create sample documents with rich entity relationships
documents = [
    Document(text="""
Python is a programming language created by Guido van Rossum in 1991.
Python is widely used in machine learning and data science.
Guido van Rossum worked at Google and later at Dropbox.
Python's design philosophy emphasizes code readability.
    """),
    Document(text="""
Machine learning is a subset of artificial intelligence.
Deep learning is a type of machine learning that uses neural networks.
TensorFlow and PyTorch are popular deep learning frameworks.
TensorFlow was developed by Google Brain team.
PyTorch was developed by Meta AI (formerly Facebook AI Research).
    """),
    Document(text="""
Neural networks are inspired by biological neurons.
Geoffrey Hinton is known as the godfather of deep learning.
Geoffrey Hinton worked at Google and the University of Toronto.
Yann LeCun developed convolutional neural networks.
Yann LeCun is the Chief AI Scientist at Meta.
    """),
]

print(f"Created {len(documents)} documents with entity relationships")

## 2. Building a Knowledge Graph Index

LlamaIndex can automatically extract entities and relationships:

In [None]:
# Create a simple graph store (in-memory)
graph_store = SimpleGraphStore()

# Create storage context with graph store
storage_context = StorageContext.from_defaults(graph_store=graph_store)

# Build knowledge graph index
# This extracts entities and relationships using the LLM
print("Building knowledge graph index...")
print("(This extracts entities and relationships using the LLM)\n")

kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=10,
    show_progress=True,
)

print("\n✓ Knowledge graph index built!")

In [None]:
# Explore the extracted knowledge graph
print("Extracted triplets (subject, predicate, object):")
print("=" * 60)

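# Collect all triplets from the graph store.
# SimpleGraphStore keeps a dict that maps each subject to a list of
# [predicate, object] pairs. `_data` is a private attribute, so this
# access pattern may change between LlamaIndex versions.
triplets = []
for subject, relations in graph_store._data.graph_dict.items():
    for predicate, obj in relations:
        triplets.append((subject, predicate, obj))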

# Display sample triplets
for subject, predicate, obj in triplets[:15]:
    print(f"  ({subject}) --[{predicate}]--> ({obj})")

if len(triplets) > 15:
    print(f"  ... and {len(triplets) - 15} more triplets")

## 3. Querying the Knowledge Graph

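Knowledge graph query engines support several retrieval modes: `keyword` (the default) matches keywords from the query against graph entities, `embedding` matches by vector similarity, and `hybrid` combines both. The embedding-based modes require an index built with `include_embeddings=True`. This notebook uses keyword mode: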

In [None]:
# Create a query engine in keyword mode (the default retriever mode):
# keywords are extracted from the query and matched against graph entities
kg_query_engine = kg_index.as_query_engine(
    retriever_mode="keyword",
    include_text=True,  # Include the source text of matched triplets as context
    response_mode="tree_summarize",
    verbose=True,
)

print("✓ Query engine ready!")

In [None]:
# Test entity relationship queries
queries = [
    "Who created Python?",
    "What is the relationship between TensorFlow and Google?",
    "Who are the key figures in deep learning?",
]

for query in queries:
    print(f"\n{'='*60}")
    print(f"Query: {query}")
    print("-" * 60)
    response = kg_query_engine.query(query)
    print(f"Answer: {response}")

## 4. Multi-hop Reasoning

Knowledge graphs enable multi-hop reasoning, that is, answering a question by following a chain of relationships across several entities:

In [None]:
# Multi-hop query example
# This requires traversing multiple relationships
multi_hop_queries = [
    "Which companies have employed people who contributed to deep learning?",
    "What frameworks are used for the field that Python is used in?",
    "What do Geoffrey Hinton and Yann LeCun have in common?",
]

for query in multi_hop_queries:
    print(f"\n{'='*60}")
    print(f"Multi-hop Query: {query}")
    print("-" * 60)
    response = kg_query_engine.query(query)
    print(f"Answer: {response}")
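
To illustrate what "following a chain of relationships" means, here is a minimal sketch of a breadth-first search over triplets in plain Python. It is not how LlamaIndex traverses the graph internally; `TOY_TRIPLETS`, `build_neighbors`, `find_path`, and the predicate names are hypothetical and chosen to mirror the sample documents.

```python
from collections import deque

# Hypothetical triplets mirroring the sample documents.
TOY_TRIPLETS = [
    ("Python", "used_for", "machine learning"),
    ("deep learning", "is_type_of", "machine learning"),
    ("TensorFlow", "is_framework_for", "deep learning"),
    ("TensorFlow", "developed_by", "Google Brain"),
]


def build_neighbors(triplets):
    """Map each entity to its neighbors, adding an inverse edge for each triplet."""
    neighbors = {}
    for subj, pred, obj in triplets:
        neighbors.setdefault(subj, []).append((pred, obj))
        neighbors.setdefault(obj, []).append((f"inverse_{pred}", subj))
    return neighbors


def find_path(triplets, start, goal, max_hops=3):
    """Breadth-first search for the shortest relationship chain within max_hops."""
    neighbors = build_neighbors(triplets)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        entity, path = queue.popleft()
        if entity == goal:
            return path
        if len(path) >= max_hops:
            continue  # Do not expand beyond the hop limit
        for pred, nxt in neighbors.get(entity, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [(entity, pred, nxt)]))
    return None


path = find_path(TOY_TRIPLETS, "Python", "TensorFlow")
for subj, pred, obj in path:
    print(f"({subj}) --[{pred}]--> ({obj})")
```

The path from `Python` to `TensorFlow` takes three hops through `machine learning` and `deep learning`. A hop limit keeps traversal bounded, at the cost of missing longer chains.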

## 5. Custom Entity and Relationship Extraction

You can customize entity extraction for your domain:

In [None]:
from llama_index.core import PromptTemplate

# Custom triplet extraction prompt
CUSTOM_KG_TRIPLET_EXTRACT_PROMPT = PromptTemplate(
    """Extract knowledge graph triplets from the following text.
Focus on:
- PERSON entities (researchers, developers, founders)
- ORGANIZATION entities (companies, universities, research labs)
- TECHNOLOGY entities (programming languages, frameworks, algorithms)
- CONCEPT entities (fields of study, methodologies)

For relationships, use predicates like:
- created_by, developed_by, founded_by
- works_at, worked_at
- is_part_of, is_type_of
- used_for, enables

Text: {text}

Extract triplets in the format (subject, predicate, object), one per line.
Return up to {max_knowledge_triplets} triplets.

Triplets:
"""
)

print("✓ Custom extraction prompt defined!")
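
The prompt above asks the LLM for `(subject, predicate, object)` lines. As a rough sketch of how text in that format could be turned into tuples, the helper below parses such output in plain Python. `parse_triplets` and `sample_output` are hypothetical and are not LlamaIndex's own parser; the sample text is hand-written, not real model output.

```python
import re

# Matches each parenthesized group that contains no nested parentheses.
TRIPLET_PATTERN = re.compile(r"\(([^()]+)\)")


def parse_triplets(llm_output, max_triplets=10):
    """Parse '(subject, predicate, object)' groups into 3-tuples, skipping malformed ones."""
    results = []
    for match in TRIPLET_PATTERN.finditer(llm_output):
        parts = [part.strip() for part in match.group(1).split(",")]
        if len(parts) != 3 or not all(parts):
            continue  # Not exactly three non-empty fields
        results.append(tuple(parts))
        if len(results) >= max_triplets:
            break
    return results


# Hand-written stand-in for an LLM response.
sample_output = """
(Python, created_by, Guido van Rossum)
(TensorFlow, developed_by, Google Brain)
(incomplete, triplet)
"""
parsed = parse_triplets(sample_output)
print(parsed)
```

A comma-split parser like this drops any triplet whose entity name itself contains a comma or parentheses, which is one reason to keep entity names short in the extraction prompt.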

In [None]:
# Build knowledge graph with custom prompt
custom_graph_store = SimpleGraphStore()
custom_storage_context = StorageContext.from_defaults(graph_store=custom_graph_store)

print("Building knowledge graph with custom extraction...")

custom_kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=custom_storage_context,
    max_triplets_per_chunk=10,
    kg_triplet_extract_template=CUSTOM_KG_TRIPLET_EXTRACT_PROMPT,
    show_progress=True,
)

print("\n✓ Custom knowledge graph built!")

## 6. Hybrid Retrieval: Combining Vector and Graph

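Vector similarity search and graph traversal can be combined behind a single router query engine. With `LLMSingleSelector`, the router sends each question to one of the two engines based on their descriptions rather than merging results from both: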

In [None]:
from llama_index.core import VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

# Create vector index from same documents
vector_index = VectorStoreIndex.from_documents(documents)
vector_query_engine = vector_index.as_query_engine()

# Create tools for both approaches
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description="Useful for questions about general concepts, definitions, "
                "and when looking for similar content. Good for 'what is' questions.",
)

kg_tool = QueryEngineTool.from_defaults(
    query_engine=kg_query_engine,
    description="Useful for questions about relationships between entities, "
                "'who created/works at/developed' questions, and tracing connections.",
)

# Create router that selects best approach
hybrid_query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[vector_tool, kg_tool],
    verbose=True,
)

print("✓ Hybrid query engine ready!")

In [None]:
# Test hybrid retrieval
hybrid_queries = [
    "What is deep learning?",  # Should use vector (concept question)
    "Who developed PyTorch?",  # Should use KG (relationship question)
    "What companies are involved in AI research?",  # Could use either
]

for query in hybrid_queries:
    print(f"\n{'='*60}")
    print(f"Query: {query}")
    print("-" * 60)
    response = hybrid_query_engine.query(query)
    print(f"Answer: {response}")
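
The router above picks one engine per query. An alternative way to combine the two approaches is to query both and merge the retrieved results. The sketch below shows the merging step only, on plain strings; `merge_results` and the sample hits are hypothetical and this is not a LlamaIndex API.

```python
from itertools import zip_longest


def merge_results(vector_hits, graph_hits, limit=5):
    """Interleave two ranked lists, dropping duplicates while keeping rank order."""
    merged, seen = [], set()
    for pair in zip_longest(vector_hits, graph_hits):
        for hit in pair:
            if hit is not None and hit not in seen:
                seen.add(hit)
                merged.append(hit)
    return merged[:limit]


# Hypothetical retrieved snippets, best match first.
vector_hits = ["chunk about deep learning", "chunk about PyTorch", "chunk about Python"]
graph_hits = ["chunk about PyTorch", "chunk about Meta AI"]

print(merge_results(vector_hits, graph_hits))
```

Interleaving gives both retrievers equal weight. A production system would more likely score and rerank the merged candidates.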

## 7. Visualizing the Knowledge Graph

Print a text summary of the entities and relationships in the extracted knowledge graph:

In [None]:
# Create a simple text visualization of the graph
def visualize_graph_text(graph_store):
    """Create a text visualization of the knowledge graph."""
    print("\n=== Knowledge Graph Visualization ===")
    print("\nEntities and their relationships:\n")
    
    entities = set()
    relationships = []
    
    # `_data.graph_dict` maps each subject to a list of [predicate, object] pairs
    for subject, relations in graph_store._data.graph_dict.items():
        entities.add(subject)
        for predicate, obj in relations:
            entities.add(obj)
            relationships.append((subject, predicate, obj))
    
    print(f"Total entities: {len(entities)}")
    print(f"Total relationships: {len(relationships)}")
    
    # Show a sample of entities in alphabetical order
    print("\n--- Sample Entities ---")
    for entity in sorted(entities)[:20]:
        print(f"  • {entity}")
    
    print("\n--- Sample Relationships ---")
    for subj, pred, obj in relationships[:15]:
        print(f"  {subj} --[{pred}]--> {obj}")

visualize_graph_text(graph_store)
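
For a graphical view, one option is to export the triplets as Graphviz DOT text and render it with any DOT-compatible tool. The sketch below is plain Python with no dependencies; `triplets_to_dot` is a hypothetical helper, and it assumes triplets shaped as `(subject, predicate, object)` tuples.

```python
def triplets_to_dot(triplets, graph_name="KnowledgeGraph"):
    """Render (subject, predicate, object) triplets as Graphviz DOT text."""

    def quote(value):
        # Escape backslashes and double quotes so the DOT output stays valid
        return '"' + str(value).replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = [f"digraph {graph_name} {{"]
    for subj, pred, obj in triplets:
        lines.append(f"  {quote(subj)} -> {quote(obj)} [label={quote(pred)}];")
    lines.append("}")
    return "\n".join(lines)


dot_text = triplets_to_dot([("Python", "created_by", "Guido van Rossum")])
print(dot_text)
```

Each triplet becomes one labeled, directed edge, so the rendered diagram matches the `(subject) --[predicate]--> (object)` notation used throughout this notebook.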

## 8. Summary

You've learned the core GraphRAG techniques available in LlamaIndex.

### Key Takeaways

| Concept | Description |
|---------|-------------|
| **Knowledge Graph** | Nodes (entities) + Edges (relationships) |
| **Triplet Extraction** | (subject, predicate, object) from text |
| **Multi-hop Reasoning** | Following relationship chains |
| **Hybrid Retrieval** | Combining vector + graph approaches |

### When to Use GraphRAG

1. **Entity-heavy domains**: People, organizations, products
2. **Relationship queries**: "Who works at...", "Who created..."
3. **Multi-hop questions**: Require traversing multiple relationships
4. **Knowledge integration**: Combining info across documents

### Next Steps

In the next notebook, we'll explore LlamaCloud for managed services.

---

## Exercises

1. **Domain-specific extraction**: Create custom prompts for your domain

2. **Neo4j integration**: Use Neo4j as a persistent graph store

3. **Graph visualization**: Use networkx or pyvis for interactive visualization

4. **Entity resolution**: Handle duplicate entities with different names
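
As a starting point for the entity resolution exercise, here is a minimal sketch that normalizes entity names and maps known aliases to one canonical form. The `ALIASES` table, `canonical`, and `resolve_triplets` are hypothetical; real entity resolution usually needs fuzzy matching or embeddings on top of this.

```python
# Hypothetical alias table: normalized alias -> canonical name.
ALIASES = {
    "facebook ai research": "meta ai",
    "fair": "meta ai",
}


def canonical(name, aliases=ALIASES):
    """Lowercase, collapse whitespace, then map known aliases to one name."""
    key = " ".join(name.casefold().split())
    return aliases.get(key, key)


def resolve_triplets(triplets, aliases=ALIASES):
    """Canonicalize subjects and objects, dropping triplets that become duplicates."""
    resolved, seen = [], set()
    for subj, pred, obj in triplets:
        triplet = (canonical(subj, aliases), pred, canonical(obj, aliases))
        if triplet not in seen:
            seen.add(triplet)
            resolved.append(triplet)
    return resolved


raw = [
    ("PyTorch", "developed_by", "Meta AI"),
    ("pytorch", "developed_by", "Facebook AI Research"),
]
print(resolve_triplets(raw))
```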

In [None]:
# Exercise space
# Build your knowledge graph application here!