# Graph RAG Querying

This notebook demonstrates a simplified Retrieval-Augmented Generation (RAG) approach for Building Information Modeling (BIM) data using LangChain.

**What we'll accomplish:**
- Set up a simple LangChain-based RAG pipeline
- Load pre-computed entity embeddings into a FAISS vector store
- Perform semantic search and generate answers using a small LLM
- Query the knowledge graph directly by translating questions into SPARQL
- Analyze and visualize the results

## 0. Setup
This notebook can run either in Google Colab or locally. The setup cell automatically configures your environment.

In [None]:
import os
from pathlib import Path

# Detect environment
try:
    from IPython import get_ipython
    IN_COLAB = 'google.colab' in str(get_ipython())
except Exception:
    IN_COLAB = False

# Configure environment
if IN_COLAB:
    !git clone https://github.com/qaecy/built2025.git
    %cd built2025
    requirements_path = "requirements.txt"
else:
    # Find requirements.txt based on current directory
    current_dir = Path().resolve()
    requirements_path = "../requirements.txt" if current_dir.name == "notebooks" else "requirements.txt"
    print(f"Looking for requirements at: {Path(requirements_path).resolve()}")

# Install dependencies if requirements.txt exists
if os.path.exists(requirements_path):
    %pip install -r {requirements_path}
    if IN_COLAB:
        %pip install -e .
    print("✓ Environment setup complete")
else:
    print("⚠️ Could not find requirements.txt")

## 1. Import Libraries and Setup Paths

We'll import necessary libraries and set up paths for our data and embeddings.

In [None]:
import sys
import pandas as pd
from pathlib import Path
from IPython.display import display
import matplotlib.pyplot as plt

# Resolve the project root (the parent folder when running from notebooks/)
project_root = Path().resolve()
if project_root.name == 'notebooks':
    project_root = project_root.parent

# Add project to path if running locally
if not IN_COLAB and str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Import our simplified RAG implementation
from src.graph_rag.vector_based import VectorRAG

# Define paths
DATA_DIR = project_root / "data"
EMBEDDINGS_DIR = DATA_DIR / "embeddings" / "buildingsmart_duplex"
GRAPH_DIR = DATA_DIR / "graph" / "buildingsmart_duplex"

print(f"Embeddings directory: {EMBEDDINGS_DIR}")
print(f"Graph directory: {GRAPH_DIR}")

## 2. Simple RAG with LangChain

In this section, we'll demonstrate a simplified LangChain-based RAG approach.

### 2.1 Explore Available Embeddings

In [None]:
# Display available embedding files
embedding_files = list(EMBEDDINGS_DIR.glob("*.json"))
embedding_info = [{
    "Filename": file.name,
    "Source File": file.stem.replace("_embeddings", ".ttl"),
    "Size (MB)": round(file.stat().st_size / (1024 * 1024), 2)
} for file in embedding_files]

display(pd.DataFrame(embedding_info))

### 2.2 Initialize RAG System

In [None]:
# Initialize our Vector RAG system
# llm_model = "Qwen/Qwen2.5-Coder-0.5B-Instruct"  # NOTE: faster, less accurate
# llm_model = "Qwen/Qwen2.5-Coder-1.5B-Instruct"  # NOTE: slower, more accurate
# llm_model = "Qwen/Qwen2.5-Coder-3B-Instruct"
# llm_model = "Qwen/Qwen2.5-Coder-3B-Instruct-GPTQ-Int4"  # NOTE: can't run this here, requires a GPU
# llm_model = "Qwen/Qwen2.5-0.5B-Instruct"
# llm_model = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
# llm_model = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# TODO: decide which model to use; all of these models struggle to produce usable answers.
#   1) could clean the data further before embedding
#   2) look into better LLMs
# files = [
#     EMBEDDINGS_DIR / "Duplex_A_20110907_embeddings.json",
# ]
vector_rag = VectorRAG(embedding_files=list(EMBEDDINGS_DIR.glob("*.json")))

### 2.3 Ask Questions

In [None]:
# List of questions to ask
questions = [
    "What types of doors are in the building?",
    "How many windows are in the building?",
    "What materials are used in the exterior walls?",
    "What rooms have smoke detectors?"
]

# Ask each question and display results
top_k = 5
for question in questions:
    print("\n" + "="*50)
    print(f"Question: {question}")
    print("="*50)
    
    # Get answer from RAG system
    result = vector_rag.query(question, top_k=top_k)
    
    # Display the answer
    print(f"\nAnswer:\n{result['answer']}")
    
    # Display sources in a table
    sources_df = pd.DataFrame([{
        'Entity': s['entity'],
        'Source': s['source'],
        'Context': s['text'][:100] + ('...' if len(s['text']) > 100 else '')  # Truncate long context
    } for s in result['sources']])
    
    print("\nSources:")
    display(sources_df)

### 2.4 Visualize Sources

In [None]:
# Custom question
custom_question = "What are the dimensions of the kitchen?"

# Get answer
result = vector_rag.query(custom_question)
print(f"Question: {custom_question}\n")
print(f"Answer:\n{result['answer']}\n")

# Get analysis
if 'analysis' in result:
    analysis = result['analysis']
    
    # Create visualizations
    plt.figure(figsize=(12, 5))
    
    # Source distribution
    plt.subplot(1, 2, 1)
    sources = list(analysis['source_distribution'].keys())
    counts = list(analysis['source_distribution'].values())
    plt.bar(sources, counts)
    plt.title('Sources Used in Answer')
    plt.xticks(rotation=45, ha='right')
    plt.ylabel('Number of References')
    
    # Entity types
    if 'entity_types' in analysis and analysis['entity_types']:
        plt.subplot(1, 2, 2)
        types = list(analysis['entity_types'].keys())
        type_counts = list(analysis['entity_types'].values())
        plt.bar(types, type_counts)
        plt.title('Entity Types Referenced')
        plt.xticks(rotation=45, ha='right')
        plt.ylabel('Count')
    
    plt.tight_layout()
    plt.show()
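
The `analysis` dictionary plotted above is produced inside `VectorRAG`, whose implementation is not shown in this notebook. As a rough sketch of how a `source_distribution` tally could be derived from the `sources` list, assuming each source is a dict with a `source` key as in section 2.3 (the helper name `summarize_sources` and the sample records are hypothetical):

```python
from collections import Counter


def summarize_sources(sources):
    """Count how many retrieved snippets came from each source file."""
    return dict(Counter(s["source"] for s in sources))


# Made-up records that only mimic the shape used in section 2.3
sample_sources = [
    {"entity": "entity_1", "source": "file_a.ttl", "text": "..."},
    {"entity": "entity_2", "source": "file_a.ttl", "text": "..."},
    {"entity": "entity_3", "source": "file_b.ttl", "text": "..."},
]

distribution = summarize_sources(sample_sources)
print(distribution)
```

A dictionary of this shape can be passed straight to the bar chart code above via `.keys()` and `.values()`.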

## 3. Graph RAG (Query-Based)

Now, let's explore a different RAG approach that queries a knowledge graph directly using SPARQL.
This method leverages an LLM to translate natural language questions into SPARQL queries, which are then executed against a graph database loaded from TTL files.

**Key components:**
- **Graph Store:** `pyoxigraph` is used to load RDF triples from TTL files.
- **NL-to-SPARQL:** An LLM (e.g., GPT-4o) translates questions to SPARQL using schema hints and few-shot examples.
- **Schema Context:** A reduced schema description (`reduced_schema.txt`) provides the vocabulary.
- **Few-Shot Examples:** `few_shot_examples.json` demonstrates query patterns.
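
To make the NL-to-SPARQL step more concrete, here is a minimal sketch of how a prompt could be assembled from schema hints and few-shot examples. It is not the prompt used by `QueryRAG`: the function name `build_sparql_prompt`, the instruction wording, and the `question`/`sparql` keys of each example are assumptions made for illustration, and the real layout of `few_shot_examples.json` may differ.

```python
def build_sparql_prompt(question, schema_text, examples):
    """Assemble a few-shot prompt that asks an LLM for a SPARQL query."""
    parts = [
        "Translate the question into a SPARQL query.",
        "Use only the vocabulary described in the schema.",
        "",
        "Schema:",
        schema_text.strip(),
        "",
    ]
    # Each example shows the model one question/query pair
    for ex in examples:
        parts += [f"Question: {ex['question']}", f"SPARQL: {ex['sparql']}", ""]
    # The prompt ends with the new question and an open slot for the query
    parts += [f"Question: {question}", "SPARQL:"]
    return "\n".join(parts)


# Hypothetical stand-ins for the schema file and the examples file
schema_text = "ifc:IfcDoor and ifc:IfcWall are classes. rdfs:label gives a readable name."
examples = [
    {
        "question": "How many IfcWall instances exist?",
        "sparql": "SELECT (COUNT(?w) AS ?n) WHERE { ?w a ifc:IfcWall }",
    }
]

prompt = build_sparql_prompt("How many IfcDoor instances exist?", schema_text, examples)
print(prompt)
```

Ending the prompt with an open `SPARQL:` slot nudges the model to return only the query, which keeps the output easy to pass to the graph store.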

In [None]:
# Import the QueryRAG class
from src.graph_rag.query_based import QueryRAG

# Define paths (relative to project root established earlier)
TTL_DIR = GRAPH_DIR # Assuming TTLs are in the 'graph/buildingsmart_duplex' directory
SCHEMA_FILE = DATA_DIR / "graph" / "reduced_schema.txt"
EXAMPLES_FILE = DATA_DIR / "graph" / "few_shot_examples.json"

print(f"TTL directory: {TTL_DIR}")
print(f"Schema file: {SCHEMA_FILE}")
print(f"Examples file: {EXAMPLES_FILE}")

# Check if OPENAI_API_KEY is set and non-empty (required for QueryRAG)
api_key_set = bool(os.environ.get("OPENAI_API_KEY"))
if not api_key_set:
    print("\n⚠️ WARNING: OPENAI_API_KEY environment variable not set.")
    print("QueryRAG requires an OpenAI API key to function.")
    print("Please set it (e.g., os.environ['OPENAI_API_KEY'] = 'your_key'); otherwise the QueryRAG steps below are skipped.")

# Initialize QueryRAG (only if API key is set)
query_rag = None
if api_key_set:
    ttl_files = list(TTL_DIR.glob("*.ttl"))
    if not ttl_files:
        print(f"Error: No TTL files found in {TTL_DIR}")
    elif not SCHEMA_FILE.is_file():
        print(f"Error: Schema file not found at {SCHEMA_FILE}")
    elif not EXAMPLES_FILE.is_file():
        print(f"Error: Examples file not found at {EXAMPLES_FILE}")
    else:
        try:
            print(f"\nInitializing QueryRAG with {len(ttl_files)} TTL files...")
            query_rag = QueryRAG(
                ttl_files=ttl_files,
                schema_file=SCHEMA_FILE,
                examples_file=EXAMPLES_FILE
                # llm_model="gpt-4o" # Default is now gpt-4o
            )
            print("✓ QueryRAG initialized.")
        except Exception as e:
            print(f"\n❌ Error initializing QueryRAG: {e}")
else:
    print("\nSkipping QueryRAG initialization due to missing API key.")

In [None]:
# Define example questions for QueryRAG
graph_questions = [
    "How many IfcDoor instances exist?", # Uses COUNT pattern
    "Show me the labels of all IfcWall instances.", # Uses basic select + optional label
    "What is the Area of the space labeled 'A103'?", # Uses property lookup via ifc:hasPropertySet
    "Does a wall with label 'Basic Wall:Interior - Furring (152 mm Stud):190774' exist?" # Should generate an ASK query
]

# Query using QueryRAG if initialized
if query_rag and query_rag.chain:
    for question in graph_questions:
        print("\n" + "="*50)
        print(f"Question: {question}")
        print("="*50)
        
        try:
            result = query_rag.query(question)
            
            print("\nGenerated SPARQL:")
            print(result["sparql_query"] or "N/A")
            
            # Display raw results (limited to the first five rows)
            print("\nRaw Results:")
            if isinstance(result["raw_results"], list):
                if not result["raw_results"]:
                    print("[]")
                else:
                    print("[")
                    for i, res_dict in enumerate(result["raw_results"]):
                        if i >= 5:  # Limit display for brevity
                            print(f"  ... ({len(result['raw_results']) - 5} more)")
                            break
                        print(f"  {res_dict}")
                    print("]")
            else:
                print(result["raw_results"] or "N/A")
            
            print("\nFormatted Answer:")
            print(result["answer"])
            
        except Exception as e:
            print(f"\n❌ An unexpected error occurred during query processing: {e}")
            # Uncomment for a full traceback when debugging:
            # import traceback
            # traceback.print_exc()
else:
    print("QueryRAG was not initialized successfully (check API key and file paths), skipping queries.")


## 5. Understanding the RAG Pipeline(s)

Let's break down how the RAG systems work:

**Vector-Based RAG (Section 2):**
1. **Document Loading**: Loads pre-computed *embeddings* from JSON files.
2. **Vector Store Creation**: Embeddings loaded into FAISS for *semantic similarity search*.
3. **Query Processing**: Question is embedded -> FAISS finds similar documents -> Context is retrieved.
4. **Answer Generation**: LLM uses question + retrieved *text context* to generate answer.
5. **Source Tracking**: Tracks *source documents* used for context.
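
The retrieval steps above can be sketched without any external libraries. This is not the `VectorRAG` implementation: it swaps FAISS for a brute-force cosine-similarity search, uses made-up three-dimensional vectors in place of real embeddings, and the names `retrieve` and `build_prompt` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, documents, top_k=2):
    """Rank documents by similarity to the query and keep the best top_k."""
    ranked = sorted(
        documents,
        key=lambda d: cosine_similarity(query_vec, d["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, retrieved):
    """Combine the retrieved text (with its source) and the question."""
    context = "\n".join(f"- {d['text']} (source: {d['source']})" for d in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Made-up documents; real embeddings have hundreds of dimensions
documents = [
    {"text": "Door D1 is a single-flush door.", "source": "file_a.ttl", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Window W1 is a fixed window.", "source": "file_a.ttl", "embedding": [0.1, 0.9, 0.0]},
    {"text": "Door D2 is a double door.", "source": "file_b.ttl", "embedding": [0.8, 0.2, 0.1]},
]
query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded question

top = retrieve(query_vec, documents, top_k=2)
print(build_prompt("What types of doors are in the building?", top))
```

The prompt built at the end is what the LLM would receive in step 4, and keeping `source` on each document is what makes step 5 possible.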

**Graph-Based RAG (Section 3):**
1. **Graph Loading**: Loads *RDF triples* from TTL files into a graph store (`pyoxigraph`).
2. **NL-to-SPARQL**: LLM translates natural language question into a *SPARQL query* using schema/examples.
3. **Query Execution**: The generated SPARQL query is executed directly against the graph database.
4. **Answer Formatting**: Results from the SPARQL query are formatted into a readable answer (a future version may use an LLM for this step).
5. **Transparency**: Shows the *generated SPARQL query* and *raw graph results*.
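
As an illustration of how these steps chain together, the sketch below wires stubbed components into a single function that returns the same keys the notebook reads from `QueryRAG.query` (`sparql_query`, `raw_results`, `answer`). Everything else is assumed: `run_query_pipeline` and the stub functions are hypothetical, no LLM or graph store is involved, and the count returned by the stub is a placeholder rather than a value from the Duplex model.

```python
def run_query_pipeline(question, translate, execute, format_answer):
    """Chain NL-to-SPARQL translation, query execution and answer formatting."""
    result = {"sparql_query": None, "raw_results": None, "answer": ""}
    try:
        result["sparql_query"] = translate(question)
        result["raw_results"] = execute(result["sparql_query"])
        result["answer"] = format_answer(question, result["raw_results"])
    except Exception as exc:
        # Keep whatever was produced so far so the failure can be inspected
        result["answer"] = f"Query failed: {exc}"
    return result


def fake_translate(question):
    """Stub for the LLM call: always returns the same COUNT query."""
    return "SELECT (COUNT(?d) AS ?count) WHERE { ?d a ifc:IfcDoor }"


def fake_execute(sparql):
    """Stub for the graph store: returns one placeholder row."""
    return [{"count": 3}]


def simple_format(question, rows):
    """Turn result rows into a short readable string."""
    if not rows:
        return "No results found."
    return "; ".join(", ".join(f"{k}={v}" for k, v in row.items()) for row in rows)


result = run_query_pipeline(
    "How many IfcDoor instances exist?", fake_translate, fake_execute, simple_format
)
print(result["sparql_query"])
print(result["answer"])
```

Returning the generated query and raw rows alongside the answer is what provides the transparency described in step 5.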

These approaches allow models to answer questions based on specific building information, either by finding relevant text snippets (vector) or querying structured data relationships (graph).

## 6. Next Steps

Here are some ways to extend these RAG systems:

**Vector RAG:**
1. Try different embedding models.
2. Experiment with different LLMs for generation.
3. Improve text chunking strategies before embedding.

**Graph RAG:**
4. Refine the NL-to-SPARQL prompt with more/better few-shot examples or improved schema representation.
5. Try a more capable LLM for better SPARQL generation accuracy.
6. Implement validation for generated SPARQL queries before execution.
7. Use an LLM to summarize the raw SPARQL results into a more natural answer.
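
For the validation idea, one lightweight option is a pre-flight check that rejects anything other than a read-only query form before it reaches the graph store. The sketch below is a heuristic under stated assumptions, not a SPARQL parser: `check_sparql` is a hypothetical helper, it only inspects the first keyword after any `PREFIX`/`BASE` declarations, and the brace count can be fooled by braces inside string literals.

```python
import re

# Query forms defined by SPARQL 1.1 Query; update operations are not in this list
READ_ONLY_FORMS = ("SELECT", "ASK", "CONSTRUCT", "DESCRIBE")

# Leading PREFIX/BASE declarations, which may precede the query form
PROLOGUE = re.compile(
    r"^(?:\s*(?:PREFIX\s+\S*\s*<[^>]*>|BASE\s*<[^>]*>))*\s*", re.IGNORECASE
)


def check_sparql(query):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    body = PROLOGUE.sub("", query)
    match = re.match(r"[A-Za-z]+", body)
    form = match.group(0).upper() if match else ""
    if form not in READ_ONLY_FORMS:
        problems.append(f"unsupported or missing query form: {form or 'none'}")
    if query.count("{") != query.count("}"):
        problems.append("unbalanced braces")
    return problems


good = "PREFIX ifc: <http://example.org/ifc#>\nSELECT ?d WHERE { ?d a ifc:IfcDoor }"
print(check_sparql(good))
print(check_sparql("DELETE WHERE { ?s ?p ?o }"))
print(check_sparql("SELECT ?d WHERE { ?d a ifc:IfcDoor"))
```

A check like this is cheap enough to run on every generated query; parsing the query with a real SPARQL parser would catch far more errors.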

**Hybrid Approaches:**
8. Combine vector search with graph queries (e.g., use vector search to find relevant entities, then use graph queries to get specific properties of those entities).
9. Use graph context to enhance vector retrieval or LLM generation.
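
The first hybrid idea can be sketched with toy data: a vector search selects the most relevant entity, and a lookup over triples then fetches that entity's properties. Everything here is hypothetical, including the `hybrid_query` name, the `ex:` identifiers, the two-dimensional embeddings, and the in-memory list of triples standing in for a graph store; the dot-product score assumes the vectors are already normalized.

```python
def hybrid_query(question_vec, entity_index, triples, top_k=1):
    """Pick candidate entities by vector score, then collect their properties."""

    def score(entry):
        # Dot product; equals cosine similarity only for normalized vectors
        return sum(q * e for q, e in zip(question_vec, entry["embedding"]))

    candidates = sorted(entity_index, key=score, reverse=True)[:top_k]
    results = {}
    for entry in candidates:
        # Graph step: gather every (predicate, object) pair for the entity
        results[entry["entity"]] = {
            p: o for s, p, o in triples if s == entry["entity"]
        }
    return results


# Toy data; all identifiers and values are placeholders
entity_index = [
    {"entity": "ex:Space_A103", "embedding": [0.9, 0.1]},
    {"entity": "ex:Door_1", "embedding": [0.1, 0.9]},
]
triples = [
    ("ex:Space_A103", "rdfs:label", "A103"),
    ("ex:Space_A103", "ex:area", "placeholder"),
    ("ex:Door_1", "rdfs:label", "Door 1"),
]

properties = hybrid_query([1.0, 0.0], entity_index, triples)
print(properties)
```

In a fuller version the triple lookup would be replaced by a SPARQL query restricted to the entities returned by the vector search.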