[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/advanced_rag/01_GraphRAG_Complete.ipynb)

# GraphRAG Complete - End-to-End Pipeline

## Overview

This notebook demonstrates a **complete end-to-end GraphRAG (Graph-based Retrieval Augmented Generation) system** using the Semantica framework. 

GraphRAG combines the power of:
- **Vector Search**: Semantic similarity matching for finding relevant text chunks
- **Knowledge Graphs**: Structured representation of entities and their relationships
- **Graph Traversal**: Multi-hop reasoning across entity connections

### Key Features

- **Real-World Data**: Uses actual data sources via RSS feeds, web scraping, and local files (NO mock data)
- **Complete Pipeline**: From data ingestion to LLM-powered question answering
- **Hybrid Retrieval**: Combines vector similarity search with knowledge graph traversal
- **Multi-hop Reasoning**: Follows relationships across the graph for deeper context

### What You'll Learn

- How to ingest real-world data from multiple sources (RSS feeds, web, local files)
- How to build knowledge graphs from unstructured text
- How to implement hybrid search combining vectors and graphs
- How to use ContextRetriever for intelligent context expansion
- How to integrate LLMs with GraphRAG for question answering
- How to visualize and export knowledge graphs

### Pipeline Architecture

The complete GraphRAG pipeline consists of eight phases (Phase 0 through Phase 7):

1. **Phase 0**: Setup & Foundation Seeding
2. **Phase 1**: Multi-Source Ingestion
3. **Phase 2**: Document Processing & Chunking
4. **Phase 3**: Comprehensive Semantic Extraction
5. **Phase 4**: Knowledge Graph Construction & Refinement
6. **Phase 5**: Vector Store Population
7. **Phase 6**: GraphRAG Query System
8. **Phase 7**: Visualization & Export

---

## Installation

Install Semantica and required dependencies:


In [None]:
# Install Semantica and all required dependencies
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers


---

## Phase 0: Setup & Configuration

In this phase, we configure Semantica with all necessary components for the GraphRAG pipeline.

### Configuration Components

- **Embedding Provider**: Sentence Transformers for generating text embeddings
- **Extraction Provider**: Groq LLM for entity and relationship extraction
- **Inference Provider**: Groq LLM for answer generation
- **Vector Store**: FAISS for efficient vector similarity search
- **Knowledge Graph**: NetworkX backend for graph operations


In [None]:
# Core imports will be added in cells where they're first used
import os
import json


### Set API Keys

Configure API keys for LLM providers. In production, use environment variables.


In [None]:
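# Set up API keys
# Note: In production, set the variable in your shell instead: export GROQ_API_KEY="your-key"
# The fallback below is only a placeholder; replace it with a real key or Groq calls will fail.
os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY", "Your Groq API")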


### Configure Semantica

Create a configuration dictionary specifying:
- Embedding model and provider
- Extraction model and provider  
- Inference model and provider
- Vector store backend and dimensions
- Knowledge graph backend and settings


In [None]:
# Create configuration dictionary
config_dict = {
    "project_name": "GraphRAG_Complete",
    
    # Embedding configuration
    "embedding": {
        "provider": "sentence_transformers", 
        "model": "all-MiniLM-L6-v2"  # 384-dimensional embeddings
    }, 
    
    # Extraction configuration (for NER and relation extraction)
    "extraction": {
        "provider": "groq", 
        "model": "llama-3.1-8b-instant", 
        "temperature": 0.0  # Deterministic extraction
    },
    
    # Inference configuration (for answer generation)
    "inference": {
        "provider": "groq",
        "model": "llama-3.3-70b-versatile"
    },
    
    # Vector store configuration
    "vector_store": {
        "provider": "faiss", 
        "dimension": 384  # Must match embedding dimension
    },
    
    # Knowledge graph configuration
    "knowledge_graph": {
        "backend": "networkx", 
        "merge_entities": True  # Automatically merge duplicate entities
    }
}


### Initialize Core Components

Initialize the main Semantica core and vector store using the configuration.
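The configuration states the vector dimension explicitly (`vector_store.dimension`), while the embedding model determines it implicitly. A mismatch between the two usually surfaces only when vectors are stored, so a quick check up front can save a debugging round. The sketch below is a hypothetical helper, not part of Semantica; `check_dimensions` and the `KNOWN_EMBEDDING_DIMS` lookup table are names introduced here for illustration, and the table contains only the model and dimension stated in this notebook.

```python
# Hypothetical helper (not a Semantica API): confirm that the vector store
# dimension agrees with the embedding model before any vectors are stored.
KNOWN_EMBEDDING_DIMS = {"all-MiniLM-L6-v2": 384}  # value taken from this notebook


def check_dimensions(cfg: dict) -> int:
    """Return the shared dimension, or raise ValueError on a mismatch."""
    model = cfg["embedding"]["model"]
    store_dim = cfg["vector_store"]["dimension"]
    expected = KNOWN_EMBEDDING_DIMS.get(model)
    if expected is None:
        raise ValueError(f"Unknown embedding model: {model!r}")
    if expected != store_dim:
        raise ValueError(
            f"{model} produces {expected}-d vectors, "
            f"but the vector store expects {store_dim}"
        )
    return expected


# A trimmed-down config with only the two keys the check needs
sample_cfg = {
    "embedding": {"model": "all-MiniLM-L6-v2"},
    "vector_store": {"dimension": 384},
}
dim = check_dimensions(sample_cfg)
print(f"Embedding and vector store agree on {dim} dimensions")
```

If you switch to a different embedding model, add its dimension to the lookup table (or read it from the model itself) before relying on the check.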


In [None]:
from semantica.core import Semantica, ConfigManager
from semantica.vector_store import VectorStore

# Load configuration and initialize Semantica core
config = ConfigManager().load_from_dict(config_dict)
core = Semantica(config=config)

# Initialize vector store with matching dimension
vs = VectorStore(backend="faiss", dimension=384)

print("Configuration complete. Semantica initialized.")
print(f"  - Embedding dimension: 384")
print(f"  - Vector store backend: FAISS")
print(f"  - Knowledge graph backend: NetworkX")


---

## Phase 0.1: Foundation Seeding

Seed the knowledge graph with verified ground truth data. This provides a foundation of known facts that the system can build upon.

### Why Foundation Seeding?

- **Quality Assurance**: Start with verified, accurate knowledge
- **Domain Expertise**: Incorporate expert knowledge into the graph
- **Better Extraction**: Guide entity extraction with known entities
- **Conflict Resolution**: Use seed data as reference for resolving conflicts
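Because seed data acts as ground truth, it is worth checking that it is internally consistent before loading it. The following is a minimal sketch, assuming the entity/relationship dictionary layout used in this notebook; `validate_seed` is a hypothetical helper and not part of `SeedDataManager`. It flags duplicate entity IDs and relationships that point at entities that were never declared.

```python
def validate_seed(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the seed data is consistent."""
    problems = []
    seen = set()
    for entity in data.get("entities", []):
        if entity["id"] in seen:
            problems.append(f"duplicate entity id: {entity['id']}")
        seen.add(entity["id"])
    for rel in data.get("relationships", []):
        for end in ("source", "target"):
            if rel[end] not in seen:
                problems.append(f"{rel['type']}: unknown {end} {rel[end]!r}")
    return problems


# Toy seed data: the second relationship refers to an undeclared entity
toy_seed = {
    "entities": [{"id": "hyaluronic_acid"}, {"id": "niacinamide"}],
    "relationships": [
        {"source": "hyaluronic_acid", "target": "niacinamide", "type": "COMPLEMENTS"},
        {"source": "retinol", "target": "niacinamide", "type": "COMPLEMENTS"},
    ],
}
seed_problems = validate_seed(toy_seed)
print(seed_problems)
```

An empty result means every relationship endpoint resolves to a declared entity.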


In [None]:
# Define foundation data (ground truth)
foundation_data = {
    "entities": [
        {
            "id": "hyaluronic_acid", 
            "name": "Hyaluronic Acid", 
            "type": "Ingredient", 
            "properties": {"role": "Humectant"}
        },
        {
            "id": "retinol", 
            "name": "Retinol", 
            "type": "Ingredient", 
            "properties": {"role": "Anti-aging actives"}
        },
        {
            "id": "niacinamide", 
            "name": "Niacinamide", 
            "type": "Ingredient", 
            "properties": {"role": "Barrier repair"}
        }
    ],
    "relationships": [
        {
            "source": "hyaluronic_acid", 
            "target": "niacinamide", 
            "type": "COMPLEMENTS", 
            "properties": {"benefit": "Hydration + Barrier"}
        }
    ]
}


### Save Seed Data

Save the foundation data to a JSON file. The file is registered with SeedDataManager in the next step.


In [None]:
# Save foundation data to JSON file
with open("skincare_base.json", "w") as f:
    json.dump(foundation_data, f, indent=2)

print("Foundation data saved to skincare_base.json")


### Create Foundation Graph

Use SeedDataManager to load the seed data and create the foundation graph.


In [None]:
from semantica.seed import SeedDataManager

# Initialize SeedDataManager
seed_manager = SeedDataManager()

# Register the seed data source
seed_manager.register_source("core_ontology", "json", "skincare_base.json")

# Create foundation graph from seed data
foundation_graph = seed_manager.create_foundation_graph()

print(f"Phase 0.1 Complete. Seeded {len(foundation_data['entities'])} primary nodes and {len(foundation_data['relationships'])} relationships.")
print(f"  - Foundation graph created successfully")
print(f"  - Seed data will be merged with extracted data in Phase 4")


---

## Phase 1: Multi-Source Ingestion

Ingest real-world data from multiple sources to build a comprehensive knowledge base.

### Data Sources

- **RSS/Atom Feeds**: Real-time updates from blogs and news sites
- **Web Pages**: Structured content from websites
- **Local Files**: Expert guides and documentation

### Components

- FeedIngestor, WebIngestor, FileIngestor for ingestion
- TextNormalizer for content cleaning and normalization


In [None]:
# Import ingestion and normalization modules
from semantica.ingest import FeedIngestor, WebIngestor, FileIngestor
from semantica.normalize import TextNormalizer


### Initialize Ingestion Components

Create instances of the ingestion classes and text normalizer.


In [None]:
# Initialize ingestion components
normalizer = TextNormalizer()
feed_ingestor = FeedIngestor()
web_ingestor = WebIngestor()
file_ingestor = FileIngestor()

# Container for all ingested content
all_content = []

print("Ingestion components initialized.")


### Ingest RSS/Atom Feeds

RSS and Atom feeds provide structured, regularly updated content from blogs and news sites.


In [None]:
# Define feed URLs
feed_urls = [
    "https://makeupandbeautyblog.com/feed",
    "https://www.drbaileyskincare.com/blogs/blog.atom"
]

print("Ingesting from RSS/Atom feeds...")
for url in feed_urls:
    try:
        print(f"  Processing: {url}")
        feed_data = feed_ingestor.ingest_feed(url)
        
        # Extract content from feed items (limit to 3 per feed to avoid rate limits)
        ingested = 0
        for item in feed_data.items[:3]:
            text = item.content or item.description or item.title
            if text:
                all_content.append(text)
                ingested += 1  # Count only items that actually yielded text
        print(f"    Successfully ingested {ingested} items")
    except Exception as e:
        print(f"    Warning: Failed to ingest {url}: {e}")

print(f"\nTotal feed items ingested: {len([c for c in all_content if c])}")


### Ingest Web Pages

Web ingestion extracts content from specific web pages, including Wikipedia articles and documentation.


In [None]:
# Define web URLs to ingest
web_urls = [
    "https://en.wikipedia.org/wiki/Retinol"
]

print("Ingesting from web pages...")
for url in web_urls:
    try:
        print(f"  Processing: {url}")
        content = web_ingestor.ingest_url(url)
        if content and content.text:
            all_content.append(content.text)
            print(f"    Successfully ingested content ({len(content.text)} characters)")
    except Exception as e:
        print(f"    Warning: Failed to ingest {url}: {e}")

print(f"\nTotal documents ingested so far (feeds + web pages): {len([c for c in all_content if c])}")


### Ingest Local Files

Local files provide expert knowledge and documentation that may not be available online.


In [None]:
# Define local files to ingest
local_files = [
    "expert_skincare_guide.txt",
    "data/sample_graphrag_paper.txt"
]

print("Ingesting from local files...")
for file_path in local_files:
    try:
        if os.path.exists(file_path):
            print(f"  Processing: {file_path}")
            file_data = file_ingestor.ingest_file(file_path)
            # ingest_file returns a single FileObject, not a list
            text = getattr(file_data, 'text', None)
            raw = getattr(file_data, 'content', None)
            if not text and isinstance(raw, bytes):
                # If content is bytes, decode it
                try:
                    text = raw.decode('utf-8')
                except UnicodeDecodeError:
                    print(f"    Warning: Could not decode content from {file_path}")
            elif not text and isinstance(raw, str):
                text = raw
            # Report success only when text was actually added
            if text:
                all_content.append(text)
                print(f"    Successfully ingested {file_path}")
            else:
                print(f"    Warning: No text extracted from {file_path}")
        else:
            print(f"    Warning: File not found: {file_path}")
    except Exception as e:
        print(f"    Warning: Failed to ingest {file_path}: {e}")

print(f"\nTotal documents ingested from all sources: {len([c for c in all_content if c])}")


### Normalize Content

Text normalization cleans and standardizes the ingested content:
- Removes extra whitespace
- Normalizes encoding
- Standardizes formatting
- Filters out very short content
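To make these steps concrete, here is a minimal sketch built only on the standard library. It is an illustrative stand-in and makes no claim about how `TextNormalizer` is implemented; `simple_normalize` is a hypothetical name, and the 50-character cutoff mirrors the filter used in the cell below.

```python
import re
import unicodedata


def simple_normalize(text: str) -> str:
    """Illustrative stand-in for the steps listed above (not TextNormalizer itself)."""
    text = unicodedata.normalize("NFKC", text)             # unify Unicode forms
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # standardize line endings
    text = re.sub(r"[ \t]+", " ", text)                    # collapse runs of spaces/tabs
    text = re.sub(r" ?\n ?", "\n", text)                   # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)                 # keep at most one blank line
    return text.strip()


raw_docs = [
    "  Retinol\u00a0 is   a form\r\n\r\n\r\nof vitamin A, widely used in skincare products.  ",
    "too short",
]
# Drop very short documents, then normalize the rest
clean_docs = [simple_normalize(doc) for doc in raw_docs if len(doc) > 50]
print(clean_docs)
```

The non-breaking space, the Windows line endings, and the repeated blank lines all disappear, and the short document is filtered out before normalization.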


In [None]:
# Normalize all ingested content
print("Normalizing content...")
normalized_content = []

for text in all_content:
    if text and len(text) > 50:  # Filter out very short content
        normalized_text = normalizer.normalize(text)
        normalized_content.append(normalized_text)

print(f"\nPhase 1 Complete. Ingested {len(normalized_content)} documents from multiple sources.")
print(f"  - Feeds: {len(feed_urls)} feed URLs attempted")
print(f"  - Web pages: {len(web_urls)} URLs attempted")
print(f"  - Local files: {len(local_files)} file paths attempted")
print(f"  - Total normalized documents: {len(normalized_content)}")


---

## Phase 2: Document Processing & Chunking

Split documents into semantic chunks that preserve entity boundaries and maintain context.

### Why Chunking?

- **Processing Efficiency**: LLMs have token limits
- **Context Preservation**: Maintain semantic coherence within chunks
- **Entity Awareness**: Preserve entity boundaries to avoid splitting important relationships

### Chunking Strategy

- **EntityAwareChunker**: Intelligently splits text while preserving entity mentions
- **Chunk Size**: 1000 characters (balance between context and processing efficiency)
- **Overlap**: 200 characters (ensures continuity between chunks)
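The size and overlap settings are easiest to understand with a plain sliding window. The sketch below is a simplified illustration, not the `EntityAwareChunker` algorithm: it cuts at fixed character offsets and makes no attempt to respect entity boundaries. `sliding_window_chunks` is a hypothetical helper name.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows where consecutive windows share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample_text = "x" * 2500
demo_chunks = sliding_window_chunks(sample_text)
print([len(c) for c in demo_chunks])  # [1000, 1000, 900]
```

With a 1000-character window and a 200-character overlap, the window advances 800 characters at a time, so a 2500-character document yields three chunks. An entity-aware chunker would additionally shift the cut points so that an entity mention is not split across two chunks.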


In [None]:
# Import chunking module
from semantica.split import EntityAwareChunker


### Initialize Chunker

Configure the EntityAwareChunker with appropriate chunk size and overlap.


In [None]:
# Initialize EntityAwareChunker
# - chunk_size: Maximum characters per chunk (1000)
# - chunk_overlap: Characters to overlap between chunks (200) for context continuity
chunker = EntityAwareChunker(chunk_size=1000, chunk_overlap=200)

print("EntityAwareChunker initialized.")
print(f"  - Chunk size: 1000 characters")
print(f"  - Overlap: 200 characters")


### Chunk All Documents

Process all normalized documents into semantic chunks.


In [None]:
# Chunk all normalized documents
all_chunks = []

print("Chunking documents...")
for i, text in enumerate(normalized_content, 1):
    if text:
        chunks = chunker.chunk(text)
        all_chunks.extend(chunks)
        print(f"  Document {i}: {len(chunks)} chunks created")

print(f"\nPhase 2 Complete. Generated {len(all_chunks)} semantic chunks.")
if all_chunks:
    avg_size = sum(len(str(c.text)) for c in all_chunks) / len(all_chunks)
    print(f"  - Average chunk size: {avg_size:.0f} characters")
    print(f"  - Total chunks: {len(all_chunks)}")


---

## Phase 3: Semantic Extraction

Extract structured knowledge from unstructured text using multiple extraction methods.

### Extraction Methods

1. **Named Entity Recognition (NER)**: Identify entities (people, places, concepts)
2. **Relation Extraction**: Find relationships between entities
3. **Triplet Extraction**: Extract RDF-style subject-predicate-object triplets
4. **Event Detection**: Identify events and their participants
5. **Coreference Resolution**: Link pronouns and references to entities

Multiple methods provide comprehensive coverage and increased reliability.


In [None]:
# Import semantic extraction modules
from semantica.semantic_extract import (
    NERExtractor, 
    RelationExtractor, 
    TripletExtractor,
    EventDetector,
    CoreferenceResolver
)

### Initialize Extractors

Create instances of all extraction classes with appropriate configurations.


In [None]:
# Initialize Named Entity Recognition extractor
# Uses LLM method with Groq for high-quality entity extraction
ner = NERExtractor(
    method="llm", 
    provider="groq", 
    model="llama-3.1-8b-instant"
)

# Initialize Relation Extraction extractor
# Extracts relationships between entities
rel_ext = RelationExtractor(
    method="llm", 
    provider="groq", 
    model="llama-3.1-8b-instant"
)

# Initialize Triplet Extraction extractor
# Extracts RDF-style triplets (subject-predicate-object)
triplet_ext = TripletExtractor(
    method="llm", 
    provider="groq", 
    model="llama-3.1-8b-instant"
)

# Initialize Event Detection extractor
# Detects events and their participants
event_detector = EventDetector()

# Initialize Coreference Resolution
# Links pronouns and references to entities
coref_resolver = CoreferenceResolver()

print("All extractors initialized successfully.")
print("  - NER Extractor: Ready")
print("  - Relation Extractor: Ready")
print("  - Triplet Extractor: Ready")
print("  - Event Detector: Ready")
print("  - Coreference Resolver: Ready")


### Initialize Results Container

Create a dictionary to store all extraction results.


In [None]:
# Container for all extraction results
combined_results = {
    "entities": [],
    "relationships": [],
    "triplets": [],
    "events": []
}

print("Results container initialized.")


### Extract Named Entities

The cell below runs all four extraction steps on each chunk, starting with Named Entity Recognition, which identifies entities such as:
- **Persons**: People, characters
- **Organizations**: Companies, institutions
- **Locations**: Places, regions
- **Concepts**: Ideas, topics
- **Products**: Items, ingredients


In [None]:
# Process first 10 chunks to avoid rate limits
print("Extracting semantic information from chunks...")
print(f"Processing {min(10, len(all_chunks))} chunks...\n")

for i, chunk in enumerate(all_chunks[:10]):
    txt = str(chunk.text)
    if len(txt) < 50:
        continue
        
    print(f"Chunk {i+1}/{min(10, len(all_chunks))}:")
    
    try:
        # Step 1: Named Entity Recognition
        print(f"  Extracting entities...")
        entities = ner.extract(txt)
        
        for e in entities:
            combined_results["entities"].append({
                "name": e.text, 
                "type": e.label, 
                "id": e.text.lower().replace(' ', '_').replace('-', '_'),
                "confidence": getattr(e, 'confidence', 0.8)
            })
        
        print(f"    Found {len(entities)} entities")
        
        # Step 2: Relation Extraction (requires entities)
        if entities:
            print(f"  Extracting relationships...")
            try:
                relations = rel_ext.extract(txt, entities=entities)
                
                for r in relations:
                    combined_results["relationships"].append({
                        "source": r.subject, 
                        "target": r.object, 
                        "type": r.predicate,
                        "confidence": getattr(r, 'confidence', 0.7)
                    })
                
                print(f"    Found {len(relations)} relationships")
            except Exception as e:
                print(f"    Warning: Error extracting relationships: {e}")
        else:
            print(f"  Skipping relationship extraction (no entities found)")
        
        # Step 3: Triplet Extraction
        print(f"  Extracting triplets...")
        try:
            triplets = triplet_ext.extract_triplets(txt, entities=entities if entities else None)
            
            for t in triplets:
                combined_results["triplets"].append({
                    "subject": t.subject,
                    "predicate": t.predicate,
                    "object": t.object
                })
            
            print(f"    Found {len(triplets)} triplets")
        except Exception as e:
            print(f"    Warning: Error extracting triplets: {e}")
        
        # Step 4: Event Detection
        print(f"  Detecting events...")
        try:
            events = event_detector.detect_events(txt)
            
            for evt in events:
                combined_results["events"].append({
                    "type": evt.event_type,
                    "text": evt.text,
                    "participants": evt.participants
                })
            
            print(f"    Found {len(events)} events")
        except Exception as e:
            print(f"    Warning: Error detecting events: {e}")
        
        print()  # Blank line between chunks
        
    except Exception as e:
        print(f"  Warning: Error processing chunk: {e}\n")
        continue


### Extraction Process Overview

The extraction loop processes each chunk through four steps:

1. **Named Entity Recognition**: Identifies entities in the text
2. **Relation Extraction**: Finds relationships between entities (requires entities from step 1)
3. **Triplet Extraction**: Extracts RDF-style triplets
4. **Event Detection**: Identifies temporal events and participants

All extraction results are stored in the `combined_results` dictionary.


### Understanding the Extraction Results

Each extraction method produces different types of structured data:

- **Entities**: Individual concepts, people, places, products
- **Relationships**: Connections between entities (e.g., "Hyaluronic Acid COMPLEMENTS Niacinamide", as in the seed data)
- **Triplets**: RDF-style facts (subject-predicate-object)
- **Events**: Temporal occurrences with participants


### Extraction Summary

Review the extraction results and statistics.


In [None]:
# Display extraction summary
print("=" * 60)
print("Phase 3 Complete - Extraction Summary")
print("=" * 60)
print(f"Entities extracted: {len(combined_results['entities'])}")
print(f"Relationships extracted: {len(combined_results['relationships'])}")
print(f"Triplets extracted: {len(combined_results['triplets'])}")
print(f"Events detected: {len(combined_results['events'])}")
print("=" * 60)

# Show sample entities
if combined_results['entities']:
    print("\nSample entities:")
    for entity in combined_results['entities'][:5]:
        print(f"  - {entity['name']} ({entity['type']})")

# Show sample relationships
if combined_results['relationships']:
    print("\nSample relationships:")
    for rel in combined_results['relationships'][:5]:
        print(f"  - {rel['source']} --[{rel['type']}]--> {rel['target']}")


---

## Phase 4: Knowledge Graph Construction & Refinement

Build the knowledge graph from extracted entities and relationships, then refine it through entity resolution and analysis.

### Graph Construction Process

1. Graph building from entities and relationships
2. Entity resolution to deduplicate and merge similar entities
3. Graph analysis of structure and properties
4. Centrality calculation to identify important entities
5. Community detection to find clusters of related entities

Entity resolution merges duplicate entities (e.g., "Retinol" and "retinol") to improve graph quality.


### Build Initial Knowledge Graph

Use GraphBuilder to create the initial graph structure from extracted entities and relationships.


In [None]:
from semantica.kg import GraphBuilder

# Initialize GraphBuilder with entity merging enabled
gb = GraphBuilder(merge_entities=True)

# Build knowledge graph from extraction results
print("Building knowledge graph...")
kg = gb.build(sources=[combined_results])

print(f"Initial graph statistics:")
print(f"  - Entities: {len(kg.get('entities', []))}")
print(f"  - Relationships: {len(kg.get('relationships', []))}")
print(f"  - Metadata: {kg.get('metadata', {})}")


### Resolve Entities (Deduplication)

Entity resolution merges duplicate entities using semantic similarity. This is crucial for:
- **Consistency**: Ensure each real-world entity appears only once
- **Accuracy**: Improve graph quality by removing duplicates
- **Information Aggregation**: Combine properties from multiple mentions
- **Quality**: Create a cleaner, more accurate graph
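The effect of a similarity threshold can be illustrated without any embedding model. The sketch below swaps in plain string similarity (`difflib.SequenceMatcher`) for the semantic similarity that the resolver uses, so treat it as an analogy rather than the `EntityResolver` method. `merge_similar_names` is a hypothetical helper, and the 0.85 default mirrors the threshold used in this notebook.

```python
from difflib import SequenceMatcher


def merge_similar_names(names: list[str], threshold: float = 0.85) -> dict[str, list[str]]:
    """Greedy clustering: each name joins the first canonical name it resembles."""
    clusters: dict[str, list[str]] = {}
    for name in names:
        key = name.strip().lower()
        for canonical in clusters:
            if SequenceMatcher(None, key, canonical.lower()).ratio() >= threshold:
                clusters[canonical].append(name)
                break
        else:
            # No existing cluster is similar enough, so start a new one
            clusters[name.strip()] = [name]
    return clusters


mentions = ["Retinol", "retinol", "Niacinamide", "Hyaluronic Acid", "hyaluronic acid "]
name_clusters = merge_similar_names(mentions)
for canonical, members in name_clusters.items():
    print(f"{canonical}: {members}")
```

Five mentions collapse into three canonical entities. String similarity catches case and whitespace variants, but it would miss synonyms, which is where embedding-based similarity helps.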


In [None]:
from semantica.kg import EntityResolver

# Initialize EntityResolver
# similarity_threshold: Minimum similarity (0.85 = 85%) to consider entities as duplicates
resolver = EntityResolver(similarity_threshold=0.85)

print("Resolving entities (deduplication)...")
print(f"  Method: Semantic similarity matching")
print(f"  Threshold: 0.85 (85% similarity)")

# Resolve entities using semantic method
resolved_entities = resolver.resolve_entities(
    kg.get('entities', []), 
)

# Create final graph with resolved entities
kg_final = {
    **kg,
    'entities': resolved_entities
}

print(f"\nEntity resolution complete:")
print(f"  - Original entities: {len(kg.get('entities', []))}")
print(f"  - Resolved entities: {len(kg_final['entities'])}")
print(f"  - Entities merged: {len(kg.get('entities', [])) - len(kg_final['entities'])}")


### Analyze Graph Structure

Graph analysis provides insights into the graph's topology and connectivity.
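For intuition about the reported numbers, the sketch below computes density, average degree, and the number of connected components for a small undirected graph using only the standard library. It assumes a simple graph (no self-loops or parallel edges) and is not the `GraphAnalyzer` implementation; `graph_metrics` and the toy node names are illustrative.

```python
from collections import deque


def graph_metrics(nodes: list[str], edges: list[tuple[str, str]]) -> dict:
    """Density, average degree and component count for an undirected simple graph."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    n = len(nodes)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge is counted twice
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    avg_degree = 2 * m / n if n else 0.0
    # Count connected components with breadth-first search
    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            for nbr in adj[queue.popleft()]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
    return {
        "num_nodes": n,
        "num_edges": m,
        "density": density,
        "avg_degree": avg_degree,
        "connected_components": components,
    }


toy_nodes = ["retinol", "niacinamide", "hyaluronic_acid", "vitamin_c"]
toy_links = [("retinol", "niacinamide"), ("hyaluronic_acid", "niacinamide")]
metrics_sketch = graph_metrics(toy_nodes, toy_links)
print(metrics_sketch)
```

Here two of the six possible edges exist, giving a density of about 0.33, and the isolated `vitamin_c` node forms a second component.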


In [None]:
from semantica.kg import GraphAnalyzer

# Initialize GraphAnalyzer
analyzer = GraphAnalyzer()

print("Analyzing graph structure...")

# Perform comprehensive graph analysis
analysis = analyzer.analyze_graph(kg_final)

# Extract metrics from nested structure
metrics = analysis.get('metrics', {})
connectivity = analysis.get('connectivity', {})

print(f"\nGraph structure metrics:")
print(f"  - Graph density: {metrics.get('density', 0):.4f}")
print(f"  - Connected components: {connectivity.get('connected_components', 0)}")
print(f"  - Average degree: {metrics.get('avg_degree', 0):.2f}")
print(f"  - Total nodes: {metrics.get('num_nodes', 0)}")
print(f"  - Total edges: {metrics.get('num_edges', 0)}")


### Calculate Centrality Measures

Centrality measures identify the most important entities in the graph:
- **Degree Centrality**: Entities with many connections
- **Betweenness Centrality**: Entities that bridge different parts of the graph
- **Closeness Centrality**: Entities that are close to all other entities
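Degree centrality is the simplest of the three and can be computed by hand. The sketch below uses the standard normalization, degree divided by (n - 1), on a toy undirected edge list; `degree_centrality` here is a standalone illustrative function, not the `CentralityCalculator` method, and the edges are made up for the example.

```python
def degree_centrality(edges: list[tuple[str, str]]) -> list[tuple[str, float]]:
    """Normalized degree centrality, deg(v) / (n - 1), sorted highest first."""
    neighbors: dict[str, set[str]] = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    n = len(neighbors)
    if n < 2:
        return [(node, 0.0) for node in neighbors]
    scores = {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}
    # Sort by score (descending), breaking ties alphabetically for stable output
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


toy_edges = [
    ("retinol", "niacinamide"),
    ("retinol", "hyaluronic_acid"),
    ("retinol", "vitamin_c"),
    ("niacinamide", "hyaluronic_acid"),
]
ranking = degree_centrality(toy_edges)
for node, score in ranking:
    print(f"{node}: {score:.4f}")
```

In this toy graph `retinol` is connected to every other node, so it scores 1.0, while `vitamin_c`, with a single connection, scores about 0.33.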


In [None]:
from semantica.kg import CentralityCalculator

# Initialize CentralityCalculator
centrality_calc = CentralityCalculator()

print("Calculating centrality measures...")

# Calculate degree centrality (simplest and most common)
# This returns a dictionary containing 'centrality', 'rankings', etc.
result = centrality_calc.calculate_degree_centrality(kg_final)

if result and "rankings" in result:
    # Use the pre-computed rankings (highest scores first)
    top_entities = result["rankings"][:5]

    print(f"\nTop 5 entities by degree centrality:")
    for rank, item in enumerate(top_entities, 1):
        # item['node'] is the entity ID, item['score'] is the normalized score
        print(f"  {rank}. {item['node']}: {item['score']:.4f}")
else:
    print("  No centrality data available")

### Detect Communities

Community detection identifies clusters of closely connected entities, revealing:
- **Thematic Groups**: Entities that belong to the same topic
- **Subnetworks**: Dense clusters within the graph
- **Domain Structure**: How knowledge is organized
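Modularity-based methods such as Louvain search for the partition that maximizes a quality score called modularity. The sketch below computes only that score for a given partition of an undirected, unweighted graph; it does not implement the Louvain search itself. The `modularity` function and the toy graph are illustrative.

```python
def modularity(edges: list[tuple[str, str]], partition: list[set[str]]) -> float:
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    if m == 0:
        return 0.0
    membership = {node: idx for idx, group in enumerate(partition) for node in group}
    internal = [0] * len(partition)    # edges with both endpoints inside community c
    degree_sum = [0] * len(partition)  # total degree of the nodes in community c
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in range(len(partition))
    )


# Two triangles joined by a single bridge edge (c-d)
bridge_graph = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
q_split = modularity(bridge_graph, [{"a", "b", "c"}, {"d", "e", "f"}])
q_single = modularity(bridge_graph, [{"a", "b", "c", "d", "e", "f"}])
print(f"two communities: {q_split:.4f}, one community: {q_single:.4f}")
```

Splitting the graph at the bridge scores higher than lumping everything together, which is the kind of partition a modularity optimizer would favor.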


In [None]:
# Initialize CommunityDetector
from semantica.kg import CommunityDetector
community_detector = CommunityDetector()

print("Detecting communities...")
print("  Method: Louvain algorithm (greedy modularity optimization)")

# Detect communities using Louvain algorithm
result = community_detector.detect_communities(
    kg_final, 
    method="louvain"
)

# Default to an empty list so later cells can reference `communities` safely
communities = result.get("communities", []) if result else []

if communities:
    print(f"\nCommunity detection results:")
    print(f"  - Total communities found: {len(communities)}")
    
    # Show top 3 communities
    sorted_communities = sorted(communities, key=len, reverse=True)
    for i, community in enumerate(sorted_communities[:3], 1):
        print(f"  - Community {i}: {len(community)} entities")
        # Show sample entities from this community
        sample_entities = list(community)[:3]
        print(f"    Sample entities: {', '.join(map(str, sample_entities))}")
else:
    print("  No communities detected")


### Phase 4 Summary

Review the final graph statistics and quality metrics.


In [None]:
print("=" * 60)
print("Phase 4 Complete - Knowledge Graph Summary")
print("=" * 60)
# Extract metrics from nested structure (if not already extracted)
metrics = analysis.get('metrics', {})
connectivity = analysis.get('connectivity', {})

print(f"Final graph contains:")
print(f"  - Entities: {len(kg_final['entities'])}")
print(f"  - Relationships: {len(kg_final.get('relationships', []))}")
print(f"  - Graph density: {metrics.get('density', 0):.4f}")
print(f"  - Connected components: {connectivity.get('connected_components', 0)}")
print(f"  - Communities: {len(communities) if communities else 0}")
print("=" * 60)


---

## Phase 5: Vector Store Population

Generate embeddings for all document chunks and store them in the vector database for hybrid retrieval.

### Why Vector Store?

- **Semantic Search**: Find text chunks by meaning, not just keywords
- **Hybrid Retrieval**: Combine with graph traversal for comprehensive results
- **Fast Lookup**: Efficient similarity search using FAISS

### Embedding Process

1. Generate embeddings for all chunks using the configured embedding model
2. Store vectors with metadata linking back to original chunks
3. Enable fast similarity search for retrieval
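Conceptually, similarity search compares a query vector against every stored vector and returns the closest ones along with their metadata. The sketch below does this by brute force with cosine similarity on toy 3-dimensional vectors. It is an assumption-laden illustration: FAISS uses optimized index structures, and the notebook does not state which distance metric the vector store applies. `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, vectors, metadata, k=2):
    """Brute-force search: score every stored vector and keep the k best."""
    scored = [
        (cosine_similarity(query_vec, vec), meta)
        for vec, meta in zip(vectors, metadata)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors standing in for 384-dimensional embeddings
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
toy_metadata = [{"chunk_id": 0}, {"chunk_id": 1}, {"chunk_id": 2}]
hits = top_k([1.0, 0.0, 0.0], toy_vectors, toy_metadata, k=2)
for score, meta in hits:
    print(f"chunk {meta['chunk_id']}: {score:.4f}")
```

The metadata travels with each score, which is what lets a retrieved vector be traced back to its original chunk text.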


In [None]:
# Prepare texts for embedding generation
print("Preparing texts for embedding...")
texts = [str(c.text) for c in all_chunks]

print(f"  - Total chunks to embed: {len(texts)}")
print(f"  - Embedding model: all-MiniLM-L6-v2")
print(f"  - Expected dimension: 384")


In [None]:
# Generate embeddings using the configured embedding generator
print("Generating embeddings...")
embeddings = core.embedding_generator.generate_embeddings(texts)

print(f"Embeddings generated successfully:")
print(f"  - Total embeddings: {len(embeddings)}")
print(f"  - Embedding dimension: {embeddings.shape[1] if len(embeddings) > 0 else 0}")


### Prepare Metadata

Create metadata for each vector that links it back to the original chunk and provides context.


In [None]:
# Create metadata for each vector
metadata_list = []

print("Preparing metadata...")
for i, chunk in enumerate(all_chunks):
    metadata_list.append({
        "text": str(chunk.text),
        "chunk_id": i,
        "source": "multi_source_ingestion",
        "chunk_length": len(str(chunk.text))
    })

print(f"  - Metadata entries created: {len(metadata_list)}")


### Store Vectors

Store all embeddings in the FAISS vector store for fast similarity search.


In [None]:
# Store vectors in FAISS vector store
print("Storing vectors in vector store...")
vs.store_vectors(vectors=embeddings, metadata=metadata_list)

print(f"\nPhase 5 Complete. Vector store populated successfully.")
print(f"  - Vectors stored: {len(embeddings)}")
print(f"  - Vector store backend: FAISS")
print(f"  - Embedding dimension: {embeddings.shape[1] if len(embeddings) > 0 else 0}")
print(f"  - Ready for similarity search")


---

## Phase 6: GraphRAG Query System

Implement hybrid retrieval that combines vector search with graph traversal.

### Hybrid Retrieval

- **Vector Search**: Finds semantically similar text chunks
- **Graph Traversal**: Follows entity relationships to find connected information

### AgentContext Features

- Auto-detection of RAG vs GraphRAG based on available components
- Graph expansion with configurable hop limits
- Hybrid scoring combining vector and graph scores
- Entity linking to connect query terms to graph entities
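Hybrid scoring can be pictured as a weighted blend of the two signals. This notebook sets `hybrid_alpha=0.6` and describes it as 60% weight on the graph score and 40% on the vector score; the linear formula below is an assumption about how such weights are typically applied, not a confirmed description of AgentContext internals. `hybrid_score` and the candidate scores are made up for illustration.

```python
def hybrid_score(vector_score: float, graph_score: float, alpha: float = 0.6) -> float:
    """Weighted blend: alpha on the graph score, (1 - alpha) on the vector score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * graph_score + (1.0 - alpha) * vector_score


# Toy candidates: chunk_a is textually closer, chunk_b is better connected
candidates = [
    {"id": "chunk_a", "vector": 0.90, "graph": 0.20},
    {"id": "chunk_b", "vector": 0.60, "graph": 0.80},
]
ranked = sorted(
    candidates,
    key=lambda c: hybrid_score(c["vector"], c["graph"]),
    reverse=True,
)
for c in ranked:
    print(f"{c['id']}: {hybrid_score(c['vector'], c['graph']):.2f}")
```

With alpha at 0.6 the well-connected chunk wins (0.72 versus 0.48); with alpha at 0 the ranking would fall back to pure vector similarity and the order would flip.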


In [None]:
# Import context and provider modules
from semantica.context import AgentContext, EntityLinker
from semantica.semantic_extract.providers import create_provider


### Initialize AgentContext

AgentContext is the main interface for GraphRAG. It orchestrates:
- Vector store queries
- Knowledge graph traversal
- Hybrid scoring and ranking
- Context aggregation


In [None]:
# Initialize AgentContext with hybrid retrieval enabled
print("Initializing AgentContext for GraphRAG...")

ctx = AgentContext(
    vector_store=vs,                    # Vector store for semantic search
    knowledge_graph=kg_final,           # Knowledge graph for relationship traversal
    use_graph_expansion=True,           # Enable graph traversal
    max_expansion_hops=2,               # Traverse up to 2 hops from initial entities
    hybrid_alpha=0.6                    # 60% weight on graph, 40% on vector
)

print("AgentContext initialized successfully.")
print(f"  - Graph expansion: Enabled")
print(f"  - Max expansion hops: 2")
print(f"  - Hybrid alpha: 0.6 (60% graph, 40% vector)")


### Initialize LLM Provider

Set up the LLM provider for generating final answers from retrieved context.


In [None]:
# Initialize LLM provider for answer generation
llm_provider = create_provider("groq", model="llama-3.3-70b-versatile")

print("LLM provider initialized.")
print(f"  - Provider: Groq")
print(f"  - Model: llama-3.3-70b-versatile")
print(f"  - Ready for answer generation")


### Initialize Entity Linker

EntityLinker resolves text mentions to entities in the knowledge graph, enabling better query understanding.


In [None]:
# Initialize EntityLinker
linker = EntityLinker(knowledge_graph=kg_final)

print("EntityLinker initialized.")
print(f"  - Knowledge graph: Linked")
print(f"  - Ready for entity resolution in queries")
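
To make the idea of resolving text mentions to graph entities concrete, here is a deliberately simple sketch based on normalized whole-word matching. It does not reflect how `EntityLinker` works internally; `normalize`, `link_entities`, and the entity list are hypothetical names introduced for illustration.

```python
import re

def normalize(text):
    """Lowercase and collapse runs of non-alphanumeric characters to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def link_entities(query, entity_names):
    """Return entity names whose normalized form appears as whole words in the query."""
    padded_query = f" {normalize(query)} "
    matches = []
    for name in entity_names:
        key = normalize(name)
        if key and f" {key} " in padded_query:
            matches.append(name)
    return matches

# Hypothetical entity names standing in for nodes of the knowledge graph
known_entities = ["Retinol", "Niacinamide", "Hyaluronic Acid"]

linked = link_entities(
    "What ingredients synergize with Retinol to prevent irritation?",
    known_entities,
)
print(linked)  # ['Retinol']
```

Exact matching like this misses plurals, synonyms, and misspellings, which is where embedding-based or alias-aware linkers usually earn their keep.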


### Interactive Query Example


In [None]:
# Define the user query
user_query = "What ingredients synergize with Retinol to prevent irritation?"

print("=" * 70)
print("GRAPH RAG QUERY PROCESSING")
print("=" * 70)
print(f"Query: {user_query}\n")


### Retrieve Context Using Hybrid Search

Use AgentContext to retrieve relevant context using both vector search and graph traversal.


In [None]:
# Retrieve context using hybrid retrieval
print("Retrieving context using hybrid search...")
print("  - Vector search: Finding semantically similar chunks")
print("  - Graph traversal: Following entity relationships")
print("  - Max results: 5")
print("  - Graph expansion: Enabled (2 hops)\n")

context_results = ctx.retrieve(
    user_query, 
    max_results=5,
    use_graph=True,              # Enable graph-based retrieval
    expand_graph=True,           # Expand to related entities
    include_entities=True,       # Include related entities in results
    include_relationships=True   # Include relationships in results
)

print(f"Retrieved {len(context_results)} context results.\n")


### Display Retrieved Context

Review the retrieved context and multi-hop connections discovered by GraphRAG.


In [None]:
if context_results:
    print("=" * 70)
    print("MULTI-HOP CONTEXT DISCOVERED")
    print("=" * 70)
    
    for i, res in enumerate(context_results, 1):
        print(f"\nResult {i}:")
        print(f"  Content: {res.get('content', '')[:150]}...")
        print(f"  Relevance Score: {res.get('score', 0):.4f}")
        
        # Display related entities (multi-hop connections)
        if res.get('related_entities'):
            print(f"  Related Entities ({len(res['related_entities'])}):")
            for entity in res['related_entities'][:3]:
                entity_name = entity.get('name', entity.get('content', 'Unknown'))
                print(f"    - {entity_name}")
        
        # Display related relationships
        if res.get('related_relationships'):
            print(f"  Related Relationships: {len(res['related_relationships'])}")
    
    print("\n" + "=" * 70)
else:
    print("No relevant context found.")
    print("Possible causes and next steps:")
    print("  - The knowledge graph doesn't contain relevant information")
    print("  - Try a different query")
    print("  - Check if entities in the query exist in the graph")


### Generate Final Answer

Use the LLM to synthesize a comprehensive answer from the retrieved context.


In [None]:
if context_results:
    # Combine all retrieved context
    context_text = "\n\n".join([r.get('content', '') for r in context_results])
    
    # Create prompt for LLM
    prompt = f"""Based on the following context from a knowledge graph, answer the user query accurately and comprehensively.

Context:
{context_text}

Query: {user_query}

Provide a detailed answer based on the context above:"""
    
    print("=" * 70)
    print("GENERATING FINAL ANSWER")
    print("=" * 70)
    print("Using LLM to synthesize answer from retrieved context...\n")
    
    try:
        final_answer = llm_provider.generate(prompt, temperature=0.3)
        print(final_answer)
    except Exception as e:
        print(f"Warning: LLM generation failed: {e}")
        print("\nHowever, we successfully retrieved relevant context using GraphRAG!")
        print("The context above can be used to answer the query manually.")
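
The prompt above joins only the `content` field of each result. If the retrieved results also carry `related_relationships`, those can be rendered into the context so the LLM sees the graph edges explicitly. The sketch below assumes each relationship is a dict with `source`, `type`, and `target` keys; that layout, the `format_context` helper, and the sample data are assumptions, not confirmed Semantica output.

```python
def format_context(results, max_relationships=5):
    """Render retrieved results as numbered blocks, each followed by its relationships."""
    blocks = []
    for i, res in enumerate(results, 1):
        lines = [f"[{i}] {res.get('content', '')}"]
        # 'source', 'type', and 'target' are assumed key names
        for rel in (res.get("related_relationships") or [])[:max_relationships]:
            lines.append(
                f"    {rel.get('source', '?')} -[{rel.get('type', 'related_to')}]-> {rel.get('target', '?')}"
            )
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

# Placeholder results shaped like the dicts printed earlier (made-up content)
sample_results = [
    {
        "content": "Example chunk mentioning Retinol.",
        "related_relationships": [
            {"source": "Niacinamide", "type": "synergizes_with", "target": "Retinol"}
        ],
    },
    {"content": "Second example chunk."},
]

context_block = format_context(sample_results)
print(context_block)
```

Numbering the blocks also makes it possible to ask the model to cite which block supports each claim.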


---

## Phase 7: Visualization & Export

Visualize the knowledge graph and export it in various formats for analysis and sharing.

### Visualization

- Network visualization with multiple layout algorithms (spring, circular, hierarchical); the cell below renders a static spring layout

### Export Formats

- **JSON**: Human-readable, easy to process
- **GraphML**: Standard graph format, compatible with many tools
- **RDF**: Semantic web format, supports SPARQL queries
- **CSV**: Tabular format for analysis


### Visualize Knowledge Graph

Create a visual representation of the knowledge graph showing entities and their relationships.


In [None]:
from semantica.visualization import KGVisualizer
import matplotlib.pyplot as plt

# Initialize KGVisualizer
viz = KGVisualizer()

print("Visualizing knowledge graph...")
print("  - Layout: Spring (force-directed)")
print("  - Title: GraphRAG Knowledge Graph")

try:
    viz.visualize_network(
        kg_final, 
        layout="spring", 
        title="GraphRAG Knowledge Graph",
        output="static"
    )
    plt.show()
    print("Graph visualization complete.")
except Exception as e:
    print(f"Warning: Visualization error: {e}")
    print("Graph structure:")
    print(f"  - Entities: {len(kg_final.get('entities', []))}")
    print(f"  - Relationships: {len(kg_final.get('relationships', []))}")


### Export Knowledge Graph

Export the graph for different use cases. The cell below writes JSON and GraphML files with `GraphExporter`.


In [None]:
from semantica.export import GraphExporter

# Initialize GraphExporter
exporter = GraphExporter()

print("\nExporting knowledge graph...")

# Export to JSON (human-readable, easy to process)
try:
    exporter.export_knowledge_graph(kg_final, "graphrag_kg.json", format="json")
    print("  Exported to JSON: graphrag_kg.json")
except Exception as e:
    print(f"  Warning: JSON export error: {e}")

# Export to GraphML (standard graph format)
try:
    exporter.export_knowledge_graph(kg_final, "graphrag_kg.graphml", format="graphml")
    print("  Exported to GraphML: graphrag_kg.graphml")
except Exception as e:
    print(f"  Warning: GraphML export error: {e}")

print("\nPhase 7 Complete. Graph visualized and exported.")
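
CSV is listed among the export formats but is not written by the cell above. As a rough illustration of what a tabular export could look like, the sketch below writes relationships with Python's standard `csv` module. It assumes the graph is a dict with a `relationships` list whose items have `source`, `type`, and `target` keys; those key names, the `relationships_to_csv` helper, and the toy graph are hypothetical.

```python
import csv
import io

def relationships_to_csv(kg, stream):
    """Write one CSV row per relationship to `stream`; return the number of rows written."""
    writer = csv.writer(stream)
    writer.writerow(["source", "type", "target"])
    count = 0
    for rel in kg.get("relationships", []):
        # 'source', 'type', and 'target' are assumed key names
        writer.writerow([rel.get("source", ""), rel.get("type", ""), rel.get("target", "")])
        count += 1
    return count

# Toy graph with a made-up relationship
toy_kg = {
    "entities": [{"name": "Retinol"}, {"name": "Niacinamide"}],
    "relationships": [
        {"source": "Niacinamide", "type": "synergizes_with", "target": "Retinol"}
    ],
}

buffer = io.StringIO()
rows_written = relationships_to_csv(toy_kg, buffer)
print(buffer.getvalue())
```

To write to disk instead of memory, pass a file opened with `open(path, "w", newline="")`, which avoids blank lines between rows on Windows.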


---

## Summary

This notebook demonstrated a complete GraphRAG pipeline from data ingestion to question answering.

### Pipeline Overview

The complete pipeline consisted of 8 phases, numbered 0 through 7:

1. **Phase 0**: Setup & Foundation Seeding
   - Configured Semantica with all components
   - Seeded knowledge graph with ground truth data

2. **Phase 1**: Multi-Source Ingestion
   - Ingested from RSS feeds, web pages, and local files
   - Normalized all content for processing

3. **Phase 2**: Document Processing & Chunking
   - Split documents into semantic chunks
   - Preserved entity boundaries

4. **Phase 3**: Comprehensive Semantic Extraction
   - Extracted entities using NER
   - Extracted relationships between entities
   - Extracted RDF triplets
   - Detected events and their participants

5. **Phase 4**: Knowledge Graph Construction & Refinement
   - Built graph from extracted data
   - Resolved duplicate entities
   - Analyzed graph structure
   - Calculated centrality measures
   - Detected communities

6. **Phase 5**: Vector Store Population
   - Generated embeddings for all chunks
   - Stored vectors with metadata

7. **Phase 6**: GraphRAG Query System
   - Implemented hybrid retrieval
   - Demonstrated multi-hop reasoning
   - Generated answers using LLM

8. **Phase 7**: Visualization & Export
   - Visualized knowledge graph
   - Exported in multiple formats

### Key Takeaways

- **GraphRAG** combines vector search (semantic similarity) with graph traversal (explicit entity relationships)
- **Multi-hop reasoning** helps answer complex questions that require connecting multiple pieces of information
- **Hybrid retrieval** can surface context that vector-only search misses, particularly when the answer spans several related entities
- **Semantica framework** provides the components used here for ingestion, extraction, graph construction, retrieval, and export

### Next Steps

- Experiment with different queries to explore the knowledge graph
- Add more data sources to expand the knowledge base
- Fine-tune extraction parameters for your specific domain
- Explore advanced features like temporal reasoning and conflict resolution
- Integrate with persistent graph stores (Neo4j, ArangoDB) for production use
