# Graph RAG Querying

This notebook demonstrates and compares two complementary approaches to Retrieval Augmented Generation (RAG) for Building Information Modeling (BIM) data, building upon the data processed and embedded in Part 1 (`01_data_integration.ipynb`) [![Open data integration notebook in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/qaecy/built2025/blob/main/notebooks/01_data_integration.ipynb).

**Goal:** Query the processed data (RDF graph and text embeddings) using two different RAG methods.

**The Two Approaches:**
1.  **Vector-Based RAG**: Uses semantic search over pre-computed text embeddings of building entities.
   - *Best for:* Finding relevant text passages, understanding entity types and descriptions (e.g., "What types of doors are in the building?").
   - *Analogy:* Think of this as a smart search engine that finds relevant text and uses it to answer questions.
2.  **Query-Based RAG**: Translates natural language questions into SPARQL queries executed against an RDF knowledge graph.
   - *Best for:* Precise property lookups, relationship queries, counting instances (e.g., "What is the area of room A103?").
   - *Analogy:* Think of this as a database query system that can precisely look up properties and relationships.

**When to use each**:
- Use Vector RAG for descriptive, similarity-based questions.
- Use Graph RAG (the query-based approach) for precise, property-based questions.
- Consider combining both for complex queries.

**Learning Objectives**
By the end of this notebook, you will:
1. Understand when to use vector-based vs. query-based RAG.
2. Set up and use both RAG approaches with BIM data.
3. Compare the strengths and limitations of each method.
4. See how the approaches can be combined for more powerful querying.

**Notebook Structure**
0. **Setup** - Configure the environment and install dependencies.
1. **Import Libraries and Setup Paths** - Load the libraries and locate the data prepared in Part 1.
2. **Method 1: Vector-Based RAG** - Implement semantic search over text embeddings.
3. **Method 2: Graph-Based RAG** - Query the knowledge graph using SPARQL.
4. **Comparison, Summary, and Next Steps** - Analyze the approaches and explore extensions.

## 0. Setup
This notebook can run in either Google Colab or locally. The setup cell below automatically configures your environment by detecting whether it's running in Colab or locally, cloning the repository if needed (Colab), and installing dependencies from `requirements.txt`.

**Key Point:** This setup ensures the notebook runs consistently anywhere with minimal configuration. It is identical to Part 1's setup.

**Key Dependencies and Their Purpose**
- **LangChain**: Framework for building RAG applications.
- **pyoxigraph**: Graph database for executing SPARQL queries.
- **FAISS**: Vector similarity search library.
- **OpenAI API**: Required for both RAG approaches (different models used).
  - Vector RAG uses `gpt-3.5-turbo` (more cost-effective for text generation).
  - Graph RAG uses `gpt-4o` (better at SPARQL generation).

In [None]:
import os
from pathlib import Path

# Detect environment
try:
    from IPython import get_ipython
    IN_COLAB = 'google.colab' in str(get_ipython())
except Exception:
    IN_COLAB = False

# Configure environment
if IN_COLAB:
    !git clone https://github.com/qaecy/bilt2025.git
    %cd bilt2025
    requirements_path = "requirements.txt"
    from google.colab import userdata
    os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
else:    
    # Find requirements.txt based on current directory
    current_dir = Path().resolve()
    requirements_path = "../requirements.txt" if current_dir.name == "notebooks" else "requirements.txt"
    print(f"Looking for requirements at: {Path(requirements_path).resolve()}")

# Install dependencies if requirements.txt exists
if os.path.exists(requirements_path):
    %pip install -r {requirements_path}
    if IN_COLAB:
        %pip install -e .
    print("✓ Environment setup complete")
else:
    print("⚠️ Could not find requirements.txt")

## 1. Import Libraries and Setup Paths

Here we import the necessary libraries and set up paths to the data prepared in Part 1. Note the different handling for Colab vs. local environments to ensure correct path setup.

**Key Point:** These libraries provide the tools for our RAG approaches, including our custom RAG classes (`VectorRAG` and `QueryRAG`).

In [None]:
import sys
import pandas as pd
from pathlib import Path
from IPython.display import display
import matplotlib.pyplot as plt

# Resolve the project root once (the parent directory when running from notebooks/)
project_root = Path().resolve()
if project_root.name == 'notebooks':
    project_root = project_root.parent

# Add project to path if running locally
if not IN_COLAB and str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Import our simplified RAG implementations
from src.graph_rag.vector_based import VectorRAG
from src.graph_rag.query_based import QueryRAG

# Define paths
DATA_DIR = project_root / "data"
EMBEDDINGS_DIR = DATA_DIR / "embeddings" / "buildingsmart_duplex"
GRAPH_DIR = DATA_DIR / "graph" / "buildingsmart_duplex"
SCHEMA_FILE = DATA_DIR / "graph" / "reduced_schema.txt"
EXAMPLES_FILE = DATA_DIR / "graph" / "few_shot_examples.json"

print(f"Embeddings directory: {EMBEDDINGS_DIR}")
print(f"Graph directory: {GRAPH_DIR}")
print(f"Schema file: {SCHEMA_FILE}")
print(f"Examples file: {EXAMPLES_FILE}")

## 2. Method 1: Vector-Based RAG

In this section, we'll demonstrate a vector-based RAG approach using LangChain. This approach is particularly useful for:
- Finding similar text descriptions
- Understanding entity types and their characteristics
- Answering questions that require semantic understanding

**Key Point:** This approach leverages pre-computed embeddings for semantic search and answer generation, acting like a smart search engine for relevant text.

### 2.1 Explore Available Embeddings

Let's start by listing the embedding files created in Part 1. These JSON files contain the text representations and vector embeddings for entities from our building models.

In [None]:
# Display available embedding files
embedding_files = list(EMBEDDINGS_DIR.glob("*.json"))
embedding_info = [{
    "Filename": file.name,
    "Source File": file.stem.replace("_embeddings", ".ttl"),
    "Size (MB)": round(file.stat().st_size / (1024 * 1024), 2)
} for file in embedding_files]

display(pd.DataFrame(embedding_info))
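
Before loading the embeddings, it can help to peek inside one of these files. The helper below is a minimal sketch that deliberately assumes nothing about the field names used in Part 1: `summarize_json` is a hypothetical name (not part of the repository), and it only reports the top-level type, the number of entries, and the keys of the first record.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return a small, schema-agnostic summary of a JSON file."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)

    summary = {
        "top_level_type": type(data).__name__,
        "size": len(data) if isinstance(data, (list, dict)) else None,
    }

    # Peek at the first record (list) or first value (dict) without assuming field names
    if isinstance(data, list) and data:
        first = data[0]
    elif isinstance(data, dict) and data:
        first = next(iter(data.values()))
    else:
        first = None

    summary["first_record_keys"] = sorted(first) if isinstance(first, dict) else None
    return summary


# Example usage in this notebook (assumes at least one embedding file exists):
# print(summarize_json(embedding_files[0]))
```

Note that this reads the whole file into memory, which is fine for files of a few megabytes but may be slow for much larger exports.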

### 2.2 Initialize Vector RAG System

We initialize our `VectorRAG` class from `src/graph_rag/vector_based.py`.

**Behind the Scenes (`VectorRAG`):**
1. **Document Loading**: Loads pre-computed *embeddings* (text + vector) from the specified JSON files.
2. **Vector Store Creation**: Embeddings are loaded into a FAISS vector store for efficient *semantic similarity search*.
3. **Query Processing**: The input question is embedded, and FAISS finds the most semantically similar documents (text chunks) based on vector distance.
4. **Context Retrieval & Prompting**: The retrieved text chunks serve as context. A LangChain `RetrievalQA` chain inserts this context, together with the user's original question, into its internal prompt template.
5. **Answer Generation**: The combined prompt (context + question) is sent to an OpenAI model (`gpt-3.5-turbo` by default) to synthesize a natural language answer.
6. **Source Tracking**: The system keeps track of the *source documents* (text chunks) used to generate the answer, providing transparency.

In [None]:
# Initialize our vector based RAG
vector_rag = VectorRAG(embedding_files=list(EMBEDDINGS_DIR.glob("*.json")))
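
To make the similarity search step more concrete, here is a small pure-Python sketch of top-k retrieval. It is a stand-in for what FAISS does at scale, not the `VectorRAG` implementation: the function names, the `embedding` field, and the toy 3-dimensional vectors are all assumptions for illustration, and cosine similarity is used here even though a FAISS index may be configured with a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, docs, k=5):
    """Return the k (score, doc) pairs most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, d["embedding"]), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy documents with made-up 3-dimensional embeddings
docs = [
    {"text": "Door M_Single-Flush", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Window fixed", "embedding": [0.1, 0.9, 0.0]},
    {"text": "Wall exterior brick", "embedding": [0.0, 0.2, 0.9]},
]

query_vec = [1.0, 0.0, 0.0]  # pretend this is the embedded question
for score, doc in top_k_similar(query_vec, docs, k=2):
    print(f"{score:.2f}  {doc['text']}")
```

A brute-force scan like this is linear in the number of documents; libraries such as FAISS exist precisely to make the same lookup fast for large collections.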

### 2.3 Ask Questions (Vector-Based)

Now, let's ask some questions suited for this approach. The `vector_rag.query(question, top_k=k)` method works as follows:
1. Embeds the input `question`.
2. Performs a similarity search in the FAISS vector store to find the `top_k` most relevant text chunks (documents/context).
3. Sends the retrieved context and the original question to the LLM (via the prompt template) to generate an answer.
4. Returns the generated answer along with the source documents used for transparency.

**Key Point:** The system finds relevant text snippets and uses an LLM to synthesize an answer. The sources show *which* text chunks were used.

In [None]:
# List of questions to ask
questions = [
    "What types of doors are in the building?",
    "How many windows are in the building?",
    "What materials are used in the exterior walls?",
    "What rooms have smoke detectors?"
]

# Ask each question and display results
top_k = 5
for question in questions:
    print("\n" + "="*50)
    print(f"Question: {question}")
    print("="*50)
    
    # Get answer from RAG system
    result = vector_rag.query(question, top_k=top_k)
    
    # Display the answer
    print(f"\nAnswer:\n{result['answer']}")
    
    # Display sources in a table
    sources_df = pd.DataFrame([{
        'Entity': s['entity'],
        'Source': s['source'],
        'Context': s['text'][:100] + ('...' if len(s['text']) > 100 else '')  # Truncate long context
    } for s in result['sources']])
    
    print("\nSources:")
    display(sources_df)
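
Steps 3 and 4 of `vector_rag.query` (prompting the LLM with the retrieved context and returning the sources) can be pictured with the sketch below. It is an illustration under stated assumptions rather than the code inside `VectorRAG`: `build_prompt`, `answer_with_sources`, and the prompt wording are hypothetical, and the LLM is replaced by a plain function so the example runs without an API key. The returned dictionary mirrors the `answer` and `sources` keys used in the cell above.

```python
def build_prompt(question, chunks):
    """Combine numbered context chunks and the question into one prompt string."""
    context = "\n\n".join(
        f"[{i}] {chunk['text']}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer_with_sources(question, chunks, llm):
    """Call the LLM with the assembled prompt and keep the chunks as sources."""
    prompt = build_prompt(question, chunks)
    return {"answer": llm(prompt), "sources": chunks}


# Made-up retrieved chunks, using the same keys as the sources table above
chunks = [
    {"entity": "door_1", "source": "example.ttl", "text": "A single flush door."},
    {"entity": "door_2", "source": "example.ttl", "text": "A double glass door."},
]


def fake_llm(prompt):
    # Stand-in for a real model call; it only reports how much text it received
    return f"(model reply to a prompt of {len(prompt)} characters)"


result = answer_with_sources("What types of doors are in the building?", chunks, fake_llm)
print(result["answer"])
print(len(result["sources"]), "source chunks returned")
```

Because the model only ever sees the chunks placed in the prompt, the quality of the answer depends directly on what the retrieval step returned.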

## 3. Method 2: Graph-Based RAG (Query-Based)

Now, let's explore the query-based RAG approach that interacts directly with the knowledge graph using SPARQL. This method is particularly useful for:
- Precise property lookups
- Relationship queries
- Counting instances
- Complex property path queries

**Key Point:** This approach is like a database query system that can precisely look up properties and relationships within the structured graph data.

### 3.1 Initialize Graph RAG System

We initialize our `QueryRAG` class from `src/graph_rag/query_based.py`. This requires the OpenAI API key to be set, because the class relies on an LLM (`gpt-4o` by default) to translate natural language into SPARQL.

**Key Concept: Context is Crucial for LLM Guidance**
Unlike the vector approach which relies on embedding similarity, the query-based approach needs explicit guidance for the LLM to translate natural language into precise SPARQL. We provide this guidance through two key context files:
1. `reduced_schema.txt`: Contains essential classes and properties the LLM needs to know about.
2. `few_shot_examples.json`: Provides concrete examples of question-to-SPARQL translation patterns.

These files are manually curated to give the LLM the necessary vocabulary and structural patterns to work with our specific graph data.
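
As a rough picture of how the two context files might be combined, the sketch below assembles a few-shot prompt for SPARQL generation. Everything in it is an assumption for illustration: the `question`/`sparql` field names, the schema snippet, the prompt wording, and the function name `build_sparql_prompt` are hypothetical and may differ from the actual files and from what `QueryRAG` does internally.

```python
def build_sparql_prompt(question, schema_text, examples):
    """Assemble schema, few-shot examples, and the new question into one prompt."""
    shots = "\n\n".join(
        f"Question: {ex['question']}\nSPARQL:\n{ex['sparql']}" for ex in examples
    )
    return (
        "Translate the question into a SPARQL query. "
        "Use only the classes and properties listed in the schema.\n\n"
        f"Schema:\n{schema_text}\n\n"
        f"Examples:\n{shots}\n\n"
        f"Question: {question}\nSPARQL:\n"
    )


# Hypothetical stand-ins for reduced_schema.txt and few_shot_examples.json
schema_text = "Classes: ifc:IfcDoor, ifc:IfcWindow\nProperties: rdfs:label"
examples = [
    {
        "question": "How many IfcDoor instances are there?",
        "sparql": "SELECT (COUNT(?d) AS ?count) WHERE { ?d a ifc:IfcDoor . }",
    }
]

prompt = build_sparql_prompt(
    "How many IfcWindow instances are there?", schema_text, examples
)
print(prompt)
```

Ending the prompt with `SPARQL:` nudges the model to continue with a query in the same shape as the examples, which is why a new question that follows a demonstrated pattern tends to translate more reliably.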

**Behind the Scenes (`QueryRAG`):**
1. **Graph Loading**: Loads *RDF triples* from the specified TTL files into an in-memory `pyoxigraph` graph store.
2. **NL-to-SPARQL Context Prep**: Prepares context for the LLM, including:
   - The simplified schema description (`reduced_schema.txt`) to provide vocabulary (available classes, properties) and relationship information.
   - Few-shot examples (`few_shot_examples.json`) demonstrating common question-to-SPARQL translation patterns. The chosen examples specifically aim to teach the LLM:
      - How to **count** instances (`SELECT (COUNT(...) ...)` pattern).
      - How to retrieve **labels** along with instances (`SELECT ?instance ?label ... OPTIONAL { ?instance rdfs:label ?label . }` pattern).
      - How to perform **property lookups** that require traversing through intermediate PropertySet nodes (`?instance ifc:hasPropertySet ?pset . ?pset prop:PropertyName ?valueNode . ?valueNode rdf:value ?value .` pattern).
      - How to check for the **existence** of something (`ASK WHERE {...}` pattern - *Note: While not in the default `few_shot_examples.json`, the system prompt encourages this, and we test it later.*).
3. **SPARQL Generation**: Sets up a LangChain chain (`GraphSparqlQAChain`) that sends the natural language question, schema, and examples to the LLM (`gpt-4o` by default) to generate a SPARQL query.
4. **Query Execution**: The generated SPARQL query is executed directly against the `pyoxigraph` graph store.
5. **Answer Formatting**: The raw results from the SPARQL query are formatted (currently basic formatting, potentially LLM-enhanced in future) into a readable natural language answer.
6. **Transparency**: The intermediate *generated SPARQL query* and the *raw graph results* are returned alongside the final answer.

In [None]:
# Check if OPENAI_API_KEY is set and non-empty (required for QueryRAG)
api_key_set = bool(os.environ.get("OPENAI_API_KEY"))
if not api_key_set:
    print("\n⚠️ WARNING: OPENAI_API_KEY environment variable not set.")
    print("QueryRAG requires an OpenAI API key to function.")
    print("Please set it (e.g., os.environ['OPENAI_API_KEY'] = 'your_key'); otherwise QueryRAG initialization is skipped.")

# Initialize QueryRAG (only if API key is set)
query_rag = None
if api_key_set:
    ttl_files = list(GRAPH_DIR.glob("*.ttl"))
    if not ttl_files:
        print(f"Error: No TTL files found in {GRAPH_DIR}")
    elif not SCHEMA_FILE.is_file():
        print(f"Error: Schema file not found at {SCHEMA_FILE}")
    elif not EXAMPLES_FILE.is_file():
        print(f"Error: Examples file not found at {EXAMPLES_FILE}")
    else:
        try:
            print(f"\nInitializing QueryRAG with {len(ttl_files)} TTL files...")
            query_rag = QueryRAG(
                ttl_files=ttl_files,
                schema_file=SCHEMA_FILE,
                examples_file=EXAMPLES_FILE
                # llm_model="gpt-4o" # Default is now gpt-4o
            )
            print("✓ QueryRAG initialized.")
        except Exception as e:
            print(f"\n❌ Error initializing QueryRAG: {e}")
else:
    print("\nSkipping QueryRAG initialization due to missing API key.")
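
The PropertySet traversal pattern listed under *NL-to-SPARQL Context Prep* is the least intuitive of the four, so here is a plain-Python picture of what that SPARQL pattern walks through. This is not how `pyoxigraph` evaluates queries; it is a toy triple list with made-up identifiers and values (`ex:space_A101`, `ex:pset_1`, the volumes, and the helper names are all hypothetical), intended only to show the three hops from instance to property set to value node to literal.

```python
# Triples as (subject, predicate, object); every identifier and value is made up
triples = [
    ("ex:space_A101", "rdfs:label", "A101"),
    ("ex:space_A101", "ifc:hasPropertySet", "ex:pset_1"),
    ("ex:space_A101", "ifc:hasPropertySet", "ex:pset_2"),
    ("ex:pset_1", "prop:Volume", "ex:value_1"),
    ("ex:pset_2", "prop:Volume", "ex:value_2"),
    ("ex:value_1", "rdf:value", 42.5),
    ("ex:value_2", "rdf:value", 40.0),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the graph."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def lookup_property(label, prop):
    """Follow label -> instance -> property set -> value node -> literal value."""
    values = []
    for instance in subjects("rdfs:label", label):
        for pset in objects(instance, "ifc:hasPropertySet"):
            for value_node in objects(pset, prop):
                values.extend(objects(value_node, "rdf:value"))
    return values


print(lookup_property("A101", "prop:Volume"))
```

In this toy graph the space carries the property in two different property sets, so the lookup returns two values. A SPARQL query following the same pattern would likewise return one row per matching path, which is worth keeping in mind when a single property lookup yields several results.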

### 3.2 Ask Questions (Graph-Based)

Let's ask questions that are better suited for graph traversal and precise lookups. The `query_rag.query(question)` method works as follows:
1. Sends the `question`, schema information, and few-shot examples to the LLM.
2. The LLM generates a SPARQL query based on its understanding of the question and the provided context.
3. The generated SPARQL query is executed against the loaded `pyoxigraph` graph database.
4. The raw results from the query are processed (potentially using the LLM again for summarization, though currently basic formatting is applied) to generate a natural language answer.
5. Returns the final answer, the generated SPARQL query, and the raw graph results for transparency.

**Key Point:** This method provides precise answers derived directly from graph queries, offering high accuracy for structured data retrieval.

In [None]:
# Define example questions for QueryRAG
graph_questions = [
    "How many IfcWindow instances are there?", # Same COUNT pattern as the few-shot examples, different class
    "Show me the labels of all IfcSpace instances.", # Same SELECT+LABEL pattern as the few-shot examples, different class
    "What is the volume of the space labeled 'A101'?", # Same property lookup pattern as the few-shot examples, different property/instance
    "Does a door with label 'M_Single-Flush:0915 x 2134mm:190721' exist?" # ASK pattern (encouraged by the system prompt, not in the default few-shot examples)
]

# Query using QueryRAG if initialized
if query_rag and query_rag.chain:
    for question in graph_questions:
        print("\n" + "="*50)
        print(f"Question: {question}")
        print("="*50)
        
        try:
            result = query_rag.query(question)
            
            print("\nGenerated SPARQL:")
            print(result["sparql_query"] or "N/A")
            
            # Display Raw Results (nicer formatting)
            print("\nRaw Results:")
            if isinstance(result["raw_results"], list):
                if not result["raw_results"]:
                    print("[]")
                else:
                    print("[")
                    for i, res_dict in enumerate(result["raw_results"]):
                        if i >= 5: # Limit display for brevity
                            print(f"  ... ({len(result['raw_results']) - 5} more)")
                            break
                        print(f"  {res_dict}")
                    print("]")
            else:
                print(result["raw_results"] or "N/A")
            
            print("\nFormatted Answer:")
            print(result["answer"])
            
        except Exception as e:
            print(f"\n❌ An unexpected error occurred during query processing: {e}")
else:
    print("QueryRAG was not initialized successfully (check API key and file paths), skipping queries.")
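
Two small pieces of glue sit between the LLM and the final answer in the steps above: cleaning the generated query before execution, and turning raw results into readable text. The sketch below illustrates both under stated assumptions. `extract_sparql` and `format_answer` are hypothetical helpers, not functions from `QueryRAG`; the idea that a model may wrap its query in a Markdown code fence, and that an `ASK` query yields a boolean while a `SELECT` yields a list of dictionaries, are assumptions about typical behavior rather than facts about this repository.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker (three backticks)
_FENCED = re.compile(
    FENCE + r"(?:sparql)?\s*(.*?)" + FENCE, flags=re.DOTALL | re.IGNORECASE
)


def extract_sparql(llm_output):
    """Return the query text, dropping an optional Markdown fence and outer whitespace."""
    match = _FENCED.search(llm_output)
    query = match.group(1) if match else llm_output
    return query.strip()


def format_answer(raw_results, max_rows=5):
    """Basic formatting: yes/no for booleans, one line per row for result lists."""
    if isinstance(raw_results, bool):
        return "Yes." if raw_results else "No."
    if not raw_results:
        return "No results found."
    rows = [
        ", ".join(f"{key}={value}" for key, value in row.items())
        for row in raw_results[:max_rows]
    ]
    extra = len(raw_results) - max_rows
    if extra > 0:
        rows.append(f"... ({extra} more)")
    return "\n".join(rows)


reply = FENCE + "sparql\nSELECT ?s WHERE { ?s ?p ?o }\n" + FENCE
print(extract_sparql(reply))
print(format_answer([{"count": 12}]))
print(format_answer(True))
```

Keeping these steps separate from the LLM call makes failures easier to diagnose, because the generated query can be inspected both before and after cleaning.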


## 4. Comparison, Summary, and Next Steps

We've explored two distinct RAG approaches for BIM data based on the results from our queries. Here's a summary reflecting their observed strengths and ideal use cases:

### Comparison: When to Use Each Approach (Based on Observed Results)

| Use Case                     | Vector RAG                                    | Graph RAG                                         | Notes Based on Experiments                                          |
|------------------------------|-----------------------------------------------|---------------------------------------------------|---------------------------------------------------------------------|
| Finding similar descriptions | ✅ Good (e.g., Wall Materials, Smoke Detectors) | ❌ Not ideal                                      | Vector RAG excels at finding relevant *text* passages.              |
| Precise property lookups     | ⚠️ Limited (If explicitly in text)           | ✅ **Best** (e.g., Volume of Space A101)            | Graph queries retrieve exact values; revealed multiple Volume entries. |
| Relationship queries         | ⚠️ Implicit                                  | ✅ **Best** (Built for structure)                  | Graph explicitly models links (e.g., Space -> PSet -> Volume).      |
| Counting instances           | ⚠️ Approximate/Inaccurate (Window count wrong) | ✅ **Precise** (Window count correct)             | Graph `COUNT` is reliable; Vector RAG seemed to misinterpret.         |
| Complex property paths       | ❌ Impossible                                | ✅ **Best** (e.g., PSet traversal for properties) | SPARQL handles multi-step traversals required for properties.     |
| General knowledge / Fuzzy Qs | ✅ Good (e.g., Smoke detector location)        | ⚠️ Limited (Needs specific schema/examples)       | Vector search handles broader topics; Graph needs precise mapping.    |
| Exact data point retrieval   | ⚠️ Depends on text source                    | ✅ **Best** (e.g., Count, Volume, Existence check) | Graph ensures accuracy for structured facts via SPARQL.               |

**Key Takeaway:** Neither approach is universally "better"; they are complementary tools, validated by our results.
- **Vector RAG** proved useful for retrieving descriptive text and handling fuzzier queries, but struggled with precise counts or specific property values not explicitly stated in the text. This is expected: the LLM only sees the `top_k` retrieved chunks (5 in our runs), so it cannot tally every instance in the model.
- **Graph RAG** excelled at precise, structured data retrieval (counting, property lookups, existence checks) but requires curated context (schema, examples) and correct query execution logic.

### Summary
This notebook demonstrated:
- Setting up and using Vector-Based RAG, highlighting its strengths in semantic text retrieval (materials, detector info) and weaknesses in precise counting.
- Setting up and using Graph-Based RAG, showing its power for accurate data retrieval (counts, labels, properties like Volume) once context (schema/examples) is provided and execution bugs (like the newline issue) are resolved.

### Future Directions & Next Steps
Potential next steps could include:
- **Hybrid Approaches:** Experimenting with workflows that use vector search to find entities and graph search to get details, or using graph context to enrich embeddings.
- **Agent Systems:** Building an agent that intelligently routes questions to the appropriate RAG system based on the question type.
- **Context Enhancement:** Improving the Graph RAG's `reduced_schema.txt` and `few_shot_examples.json` to handle more query types.
- **Visualization:** Integrating tools to display query results spatially or graphically.
- **Scaling:** Testing and optimizing these techniques on larger, more complex building models.
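
As a starting point for the routing idea under *Agent Systems*, the sketch below sends a question to one of the two systems based on simple keyword cues. It is a deliberately naive illustration: the cue patterns and the name `route_question` are hypothetical, and a real agent would more likely use an LLM or a trained classifier to make this decision.

```python
import re

# Hypothetical cues suggesting a precise, structured question
STRUCTURED_CUES = [
    r"\bhow many\b",
    r"\bcount\b",
    r"\bwhat is the (area|volume|height|width)\b",
    r"\bexists?\b",
    r"\blabel(s|ed|led)?\b",
]


def route_question(question):
    """Return 'query' for structured questions and 'vector' for descriptive ones."""
    lowered = question.lower()
    if any(re.search(pattern, lowered) for pattern in STRUCTURED_CUES):
        return "query"
    return "vector"


for q in [
    "How many IfcWindow instances are there?",
    "What is the volume of the space labeled 'A101'?",
    "What materials are used in the exterior walls?",
]:
    print(f"{route_question(q):6s} <- {q}")
```

In this notebook, the returned label could then select between `vector_rag.query(question, top_k=top_k)` and `query_rag.query(question)`. Keyword rules like these are brittle, so treat them as a baseline to compare a smarter router against.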
