# Graph-Enhanced Agentic RAG for Corporate Intelligence
## 🧠 Exploration & Development Notebook

This notebook demonstrates the complete implementation of an intelligent Q&A system that combines:
- **Vector-based semantic retrieval** for understanding document content
- **Graph-based relational understanding** for entity relationships
- **Multi-agent decision logic** for intelligent query routing
- **LLM-powered synthesis** for comprehensive answers

### 🎯 What We're Building
An intelligent question-answering system for corporate documents that can:
1. Process PDF annual reports and extract structured knowledge
2. Store information in both vector and graph databases
3. Intelligently route queries to appropriate retrieval systems
4. Generate comprehensive answers combining semantic and relational insights

---

## 1. Environment Setup and Dependencies

First, let's install and import all the required dependencies for our Graph-Enhanced RAG system.

In [None]:
# Install required packages (uncomment to run)
# !pip install streamlit langchain langchain-community transformers torch
# !pip install sentence-transformers faiss-cpu neo4j PyMuPDF pdfminer.six
# !pip install plotly networkx pyvis matplotlib pandas numpy scikit-learn
# !pip install python-dotenv pydantic requests tqdm

# Import core libraries
import os
import sys
import json
import logging
import time
from typing import Dict, Any, List, Optional, Tuple
import warnings
warnings.filterwarnings('ignore')

# Add project root to path
project_root = os.path.dirname(os.getcwd())
if project_root not in sys.path:
    sys.path.append(project_root)

print("✅ Core libraries imported successfully!")
print(f"📁 Project root: {project_root}")

In [None]:
# Import project modules
try:
    from config import Config, config
    from src.document_processor import DocumentProcessor
    from src.knowledge_extractor import KnowledgeExtractor
    from src.vector_store import VectorStore, VectorStoreManager
    from src.graph_store import GraphStore
    from src.agents.query_classifier import QueryClassifierAgent
    from src.agents.vector_agent import VectorAgent
    from src.agents.graph_agent import GraphAgent
    from src.agents.synthesis_agent import SynthesisAgent
    from src.utils.embeddings import EmbeddingManager
    from src.utils.visualization import GraphVisualizer
    
    print("✅ Project modules imported successfully!")
    
except ImportError as e:
    print(f"⚠️ Import error: {e}")
    print("Make sure you're running this notebook from the project directory")
    
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

## 2. Document Ingestion Pipeline

Let's create a sample corporate document and demonstrate the document processing pipeline.

In [None]:
# Create sample corporate document content
sample_corporate_text = """
MICROSOFT CORPORATION
Annual Report 2023

EXECUTIVE SUMMARY
Microsoft Corporation reported strong financial performance in fiscal year 2023, with total revenue 
reaching $211.9 billion, representing a 7% increase year-over-year. Chief Executive Officer Satya Nadella 
emphasized the company's continued focus on cloud computing and artificial intelligence.

BUSINESS OVERVIEW
Microsoft operates through three primary segments: Productivity and Business Processes, Intelligent Cloud, 
and More Personal Computing. The company's Azure cloud platform continues to show robust growth, 
capturing significant market share in the enterprise segment.

ACQUISITIONS AND PARTNERSHIPS
During fiscal 2023, Microsoft completed the acquisition of Activision Blizzard for $68.7 billion, 
significantly expanding its gaming portfolio. The company also announced strategic partnerships with 
OpenAI to integrate advanced AI capabilities across its product suite.

RISK FACTORS
Key risks include increased competition in cloud services, regulatory challenges in major markets, 
cybersecurity threats, and potential economic downturns affecting enterprise spending. The company 
continues to invest heavily in security infrastructure and compliance programs.

FINANCIAL HIGHLIGHTS
- Total revenue: $211.9 billion (+7% YoY)
- Operating income: $88.5 billion (+10% YoY)  
- Net income: $72.4 billion (+11% YoY)
- Azure revenue growth: +27% YoY
- Microsoft 365 subscribers: 67 million

MANAGEMENT TEAM
Satya Nadella serves as Chief Executive Officer, Amy Hood as Chief Financial Officer, 
and Brad Smith as President and Chief Legal Officer. The leadership team brings 
extensive experience in technology and business operations.
"""

# Initialize document processor
processor = DocumentProcessor()

# Process the sample text (simulating PDF processing)
print("📄 Processing sample corporate document...")

# Simulate document chunks
chunks = processor.chunk_text(sample_corporate_text, chunk_size=500, overlap=100)

print(f"✅ Document processed successfully!")
print(f"📊 Created {len(chunks)} text chunks")
print(f"📏 Average chunk size: {sum(len(chunk['text']) for chunk in chunks) // len(chunks)} characters")

# Display first few chunks
print("\n🔍 Sample chunks:")
for i, chunk in enumerate(chunks[:3]):
    print(f"\nChunk {i+1} (words: {chunk['word_count']}):")
    print(chunk['text'][:200] + "...")

## 3. Entity and Relationship Extraction with Gemma

Now let's extract structured knowledge (entities and relationships) from our document using the Gemma model.

In [None]:
# Initialize knowledge extractor (uses Gemma model)
extractor = KnowledgeExtractor()

print("🧠 Extracting entities and relationships from document chunks...")
print("⏳ This may take a moment as we process each chunk...")

# Extract knowledge from document chunks
entities, relations = extractor.process_document_chunks(chunks)

# Display results
print(f"\n✅ Knowledge extraction completed!")
print(f"🎯 Extracted {len(entities)} unique entities")
print(f"🔗 Extracted {len(relations)} unique relationships")

# Display sample entities by type
print("\n📋 Sample Entities by Type:")
entity_types = {}
for entity in entities:
    entity_types.setdefault(entity.type, []).append(entity)

for entity_type, type_entities in entity_types.items():
    print(f"\n{entity_type.title()}s ({len(type_entities)}):")
    for entity in type_entities[:5]:  # Show first 5
        print(f"  • {entity.name} (confidence: {entity.confidence:.2f})")

# Display sample relationships
print(f"\n🔗 Sample Relationships ({len(relations)} total):")
for i, relation in enumerate(relations[:10]):  # Show first 10
    print(f"{i+1:2d}. {relation.subject} → {relation.predicate} → {relation.object}")
    
print(f"\n💡 Note: These are generated using mock data since Gemma model requires GPU resources.")
print("In production, this would use the actual Gemma model for more accurate extraction.")

## 4. Vector Database Setup and Storage

Let's set up our FAISS vector database for semantic search capabilities.

In [None]:
# Initialize vector store
print("🔍 Setting up vector database...")

vector_store = VectorStore(
    embedding_model_name="all-MiniLM-L6-v2",
    dimension=384
)

# Add document chunks to vector store
document_metadata = {
    'filename': 'microsoft_annual_report_2023.pdf',
    'company': 'Microsoft Corporation',
    'year': 2023,
    'document_type': 'annual_report'
}

print("📚 Adding document chunks to vector database...")
vector_store.add_document_chunks(chunks, document_metadata)

# Get vector store statistics
stats = vector_store.get_statistics()
print(f"\n📊 Vector Database Statistics:")
print(f"   • Total documents: {stats['total_documents']}")
print(f"   • Vector dimension: {stats['dimension']}")
print(f"   • Embedding model: {stats['embedding_model']}")

# Test semantic search
print("\n🔍 Testing semantic search...")
test_queries = [
    "What was Microsoft's revenue in 2023?",
    "Who is the CEO of Microsoft?",
    "What acquisitions did Microsoft make?",
    "What are the main business risks?"
]

for query in test_queries[:2]:  # Test first 2 queries
    print(f"\nQuery: '{query}'")
    results = vector_store.search(query, top_k=3)
    
    print(f"Found {len(results)} results:")
    for i, result in enumerate(results):
        print(f"  {i+1}. Score: {result.score:.3f}")
        print(f"     Text: {result.text[:100]}...")
        print(f"     Source: {result.metadata.get('filename', 'Unknown')}")

## 5. Graph Database Setup with Neo4j

Now let's set up our graph database to store entity relationships.

In [None]:
# Initialize graph store (using mock implementation if Neo4j not available)
print("🕸️ Setting up graph database...")

graph_store = GraphStore(
    uri="bolt://localhost:7687",  # Default Neo4j URI
    user="neo4j",
    password="password"
)

# Add entities to graph database
print("👥 Adding entities to graph database...")
graph_store.add_entities(entities)

# Add relationships to graph database  
print("🔗 Adding relationships to graph database...")
graph_store.add_relations(relations)

# Get graph statistics
graph_stats = graph_store.get_graph_statistics()
print(f"\n📊 Graph Database Statistics:")
print(f"   • Total nodes: {graph_stats.get('total_nodes', 0)}")
print(f"   • Total relationships: {graph_stats.get('total_relationships', 0)}")
print(f"   • Node types: {graph_stats.get('node_counts', {})}")

# Test graph queries
print("\n🔍 Testing graph queries...")

test_entity = "Microsoft Corporation"
print(f"\nFinding relationships for '{test_entity}':")

relationships = graph_store.find_relationships(test_entity)
print(f"Found {len(relationships)} relationships:")

for i, rel in enumerate(relationships[:5]):  # Show first 5
    subject = rel.get('subject', {}).get('name', 'Unknown') if isinstance(rel.get('subject'), dict) else str(rel.get('subject', ''))
    rel_type = rel.get('relationship', {}).get('type', 'UNKNOWN') if isinstance(rel.get('relationship'), dict) else 'UNKNOWN'
    obj = rel.get('object', {}).get('name', 'Unknown') if isinstance(rel.get('object'), dict) else str(rel.get('object', ''))
    
    print(f"  {i+1}. {subject} → {rel_type} → {obj}")

# Test specific relation types
print(f"\nQuerying acquisitions:")
acquisitions = graph_store.query_by_relation_type("ACQUIRED", limit=5)
print(f"Found {len(acquisitions)} acquisition relationships:")

for i, acq in enumerate(acquisitions):
    subject = acq.get('subject', {}).get('name', 'Unknown') if isinstance(acq.get('subject'), dict) else 'Unknown'
    obj = acq.get('object', {}).get('name', 'Unknown') if isinstance(acq.get('object'), dict) else 'Unknown'
    print(f"  {i+1}. {subject} acquired {obj}")

## 6. Multi-Agent Query Routing System

Let's implement our intelligent query classification system that routes questions to the appropriate retrieval backend.

In [None]:
# Initialize query classifier
classifier = QueryClassifierAgent()

print("🤖 Testing Query Classification System")
print("=" * 50)

# Test different types of queries
test_queries = [
    # Semantic queries
    "What challenges did Microsoft face in 2023?",
    "Summarize the company's financial performance",
    "What are the main business risks mentioned?",
    
    # Relational queries  
    "Who is the CEO of Microsoft?",
    "Which companies did Microsoft acquire in 2023?",
    "List all partnerships mentioned in the report",
    
    # Hybrid queries
    "What did the CEO say about acquisitions and financial performance?",
    "Analyze Microsoft's growth strategy and key partnerships",
    "Summarize the leadership team and their business performance"
]

classifications = []

for query in test_queries:
    result = classifier.classify_query(query)
    classifications.append({
        'query': query,
        'type': result.query_type,
        'confidence': result.confidence,
        'reasoning': result.reasoning,
        'approaches': result.suggested_approaches
    })
    
    print(f"\nQuery: '{query}'")
    print(f"  Type: {result.query_type} (confidence: {result.confidence:.2f})")
    print(f"  Reasoning: {result.reasoning}")
    print(f"  Suggested approach: Vector={result.suggested_approaches['use_vector']}, Graph={result.suggested_approaches['use_graph']}")

# Summary statistics
from collections import Counter
type_counts = Counter([c['type'] for c in classifications])

print(f"\n📊 Classification Summary:")
for query_type, count in type_counts.items():
    print(f"  • {query_type}: {count} queries")

print(f"\n💡 The classifier intelligently routes queries based on their semantic content and structure!")

## 7. Vector Retrieval Agent Implementation

Let's test our vector retrieval agent for semantic search operations.

In [None]:
# Initialize vector retrieval agent
vector_agent = VectorAgent(vector_store)

print("🔍 Testing Vector Retrieval Agent")
print("=" * 40)

# Test semantic queries
semantic_queries = [
    "What was Microsoft's financial performance in 2023?",
    "What are the key business risks mentioned?",
    "Tell me about Microsoft's cloud business"
]

for query in semantic_queries:
    print(f"\n📝 Query: '{query}'")
    
    # Perform vector retrieval
    result = vector_agent.retrieve(query, top_k=5)
    
    print(f"⏱️  Retrieval time: {result.retrieval_time:.3f}s")
    print(f"📊 Found {len(result.results)} results")
    
    # Display top results
    for i, search_result in enumerate(result.results[:3]):  # Top 3
        print(f"\n  Result {i+1} (Score: {search_result.score:.3f}):")
        print(f"    Text: {search_result.text[:150]}...")
        print(f"    Source: {search_result.metadata.get('filename', 'Unknown')}")
    
    # Get explanation
    explanation = vector_agent.explain_retrieval(result)
    print(f"  Quality Assessment: {explanation['quality_assessment']}")

# Test retrieval with filters
print(f"\n🎯 Testing filtered retrieval...")
filtered_result = vector_agent.retrieve_by_document(
    query="revenue and financial performance", 
    document_name="microsoft_annual_report_2023.pdf"
)

print(f"Filtered results: {len(filtered_result.results)} (from specific document)")

print(f"\n✅ Vector retrieval agent working successfully!")

## 8. Graph Retrieval Agent Implementation

Now let's test our graph retrieval agent for relationship-based queries.

In [None]:
# Initialize graph retrieval agent
graph_agent = GraphAgent(graph_store)

print("🕸️ Testing Graph Retrieval Agent")
print("=" * 40)

# Test relational queries
relational_queries = [
    "Who is the CEO of Microsoft?",
    "Which companies did Microsoft acquire?",
    "What partnerships does Microsoft have?",
    "List all executives at Microsoft"
]

for query in relational_queries:
    print(f"\n📝 Query: '{query}'")
    
    # Perform graph retrieval
    result = graph_agent.retrieve(query)
    
    print(f"⏱️  Retrieval time: {result.retrieval_time:.3f}s")
    print(f"🎯 Query type: {result.query_type}")
    print(f"👥 Found {len(result.entities)} entities")
    print(f"🔗 Found {len(result.relationships)} relationships")
    
    # Display entities
    if result.entities:
        print(f"\n  Entities:")
        for i, entity in enumerate(result.entities[:3]):  # Top 3
            name = entity.get('name', 'Unknown') if isinstance(entity, dict) else str(entity)
            etype = entity.get('type', 'Unknown') if isinstance(entity, dict) else 'Unknown'
            print(f"    {i+1}. {name} ({etype})")
    
    # Display relationships
    if result.relationships:
        print(f"\n  Relationships:")
        for i, rel in enumerate(result.relationships[:3]):  # Top 3
            subject = rel.get('subject', {}).get('name', 'Unknown') if isinstance(rel.get('subject'), dict) else 'Unknown'
            predicate = rel.get('relationship', {}).get('type', 'UNKNOWN') if isinstance(rel.get('relationship'), dict) else 'UNKNOWN'
            obj = rel.get('object', {}).get('name', 'Unknown') if isinstance(rel.get('object'), dict) else 'Unknown'
            print(f"    {i+1}. {subject} → {predicate} → {obj}")
    
    # Get explanation
    explanation = graph_agent.explain_retrieval(result)
    print(f"  Quality: {explanation['data_quality']}")

# Test entity connections
print(f"\n🌐 Testing entity connection search...")
connection_result = graph_agent.find_entity_connections("Microsoft Corporation", max_hops=2)

print(f"Connected entities: {len(connection_result.entities)}")
print(f"Connection relationships: {len(connection_result.relationships)}")

print(f"\n✅ Graph retrieval agent working successfully!")

## 9. Synthesis Agent - Combining Vector & Graph Results

The synthesis agent is the final component that merges insights from both vector and graph retrieval to generate comprehensive answers.

In [None]:
# Initialize synthesis agent
synthesis_agent = SynthesisAgent()

print("🧬 Testing Synthesis Agent")
print("=" * 40)

# Test comprehensive queries requiring both vector and graph information
comprehensive_queries = [
    "What are Microsoft's main strategic initiatives and how do they relate to their partnerships?",
    "Explain the relationship between Microsoft's AI investments and their recent acquisitions",
    "How does Microsoft's corporate structure support their cloud computing strategy?"
]

for query in comprehensive_queries:
    print(f"\n📋 Complex Query: '{query}'")
    print("-" * 60)
    
    # Get vector results (semantic similarity)
    vector_results = vector_agent.retrieve(query)
    print(f"📄 Vector retrieval: {len(vector_results.chunks)} relevant chunks")
    
    # Get graph results (relational data)
    graph_results = graph_agent.retrieve(query)
    print(f"🕸️  Graph retrieval: {graph_results.entities.__len__()} entities, {graph_results.relationships.__len__()} relationships")
    
    # Synthesize comprehensive answer
    synthesis_result = synthesis_agent.synthesize(
        query=query,
        vector_results=vector_results,
        graph_results=graph_results
    )
    
    print(f"\n🎯 Synthesis Quality Score: {synthesis_result.confidence:.2f}")
    print(f"📊 Sources Used: {synthesis_result.source_count} documents")
    print(f"🔗 Knowledge Connections: {synthesis_result.relationship_count} relationships")
    
    # Display synthesized answer (truncated for brevity)
    answer = synthesis_result.answer[:500]
    print(f"\n💡 Synthesized Answer:")
    print(f"   {answer}{'...' if len(synthesis_result.answer) > 500 else ''}")
    
    # Show synthesis explanation
    explanation = synthesis_agent.explain_synthesis(synthesis_result)
    print(f"\n🔍 Synthesis Method: {explanation['method']}")
    print(f"   Confidence: {explanation['confidence']}")

print(f"\n✅ Synthesis agent successfully combines vector & graph insights!")

## 10. Full System Integration & Streamlit UI

Now let's see how all components work together in the main Streamlit application.

In [None]:
from src.agents import query_classifier


print("🚀 Graph-Enhanced Agentic RAG System")
print("=" * 50)

# Test end-to-end pipeline
print("\n🔄 Testing complete RAG pipeline...")

# Mock a complex business query
business_query = "What are Microsoft's key competitive advantages and strategic partnerships in cloud computing?"

print(f"\n❓ Business Question: '{business_query}'")
print("-" * 70)

# Step 1: Query Classification
print("1️⃣  Classifying query type...")
query_type = query_classifier.classify(business_query)
print(f"   → Query type: {query_type}")

# Step 2: Multi-modal retrieval
print("\n2️⃣  Performing multi-modal retrieval...")
vector_results = vector_agent.retrieve(business_query)
graph_results = graph_agent.retrieve(business_query)

print(f"   📄 Vector: {len(vector_results.chunks)} relevant chunks")
print(f"   🕸️  Graph: {len(graph_results.entities)} entities, {len(graph_results.relationships)} relationships")

# Step 3: Synthesis
print("\n3️⃣  Synthesizing comprehensive answer...")
final_answer = synthesis_agent.synthesize(
    query=business_query,
    vector_results=vector_results,
    graph_results=graph_results
)

print(f"   ✅ Generated answer with confidence: {final_answer.confidence:.2f}")
print(f"   📊 Using {final_answer.source_count} sources and {final_answer.relationship_count} relationships")

print(f"\n💡 Final Answer Preview:")
print(f"   {final_answer.answer[:300]}...")

print(f"\n🎯 System Performance Summary:")
print(f"   • Document Processing: ✅ Working")
print(f"   • Knowledge Extraction: ✅ Working") 
print(f"   • Vector Storage: ✅ Working")
print(f"   • Graph Storage: ✅ Working")
print(f"   • Query Classification: ✅ Working")
print(f"   • Multi-Agent Retrieval: ✅ Working")
print(f"   • Answer Synthesis: ✅ Working")

print(f"\n🌟 Your Graph-Enhanced Agentic RAG system is ready!")

print(f"\n" + "="*50)
print("🖥️  LAUNCH STREAMLIT APPLICATION")
print("="*50)
print("To run the full web interface, execute in terminal:")
print("   cd 'C:\\Users\\Sanjay Reddy\\OneDrive\\Desktop\\Graph-Enhanced Agentic RAG for Corporate Intelligence'")
print("   streamlit run app.py")
print("")
print("Features available in the web app:")
print("✓ Document upload and processing")
print("✓ Interactive Q&A interface") 
print("✓ Knowledge graph visualization")
print("✓ System monitoring dashboard")
print("✓ Multi-agent reasoning display")
print("")
print("🎉 Happy exploring your intelligent corporate Q&A system!")