# RAG Query System Testing

This notebook tests the RAG (Retrieval-Augmented Generation) system for querying climate literature.

## System Overview

- **Retrieval**: SQLAlchemy + pgvector for semantic search
- **Synthesis**: GPT-4o for generating responses
- **Data**: Document chunks stored in PostgreSQL from lit_mining_v3.ipynb
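
As a rough illustration of the semantic search named in the Retrieval bullet, the sketch below ranks chunks by cosine similarity between a query vector and each chunk vector. It is pure Python with toy 3-dimensional vectors; the function names and data are hypothetical, and the actual system delegates this ranking to pgvector inside PostgreSQL.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k):
    """Return the k (score, text) pairs most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, c["embedding"]), c["text"]) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings" (hypothetical; real embeddings have many more dimensions).
toy_chunks = [
    {"text": "groundwater rises with sea level", "embedding": [0.9, 0.1, 0.0]},
    {"text": "beach erosion rates on Maui", "embedding": [0.1, 0.9, 0.0]},
    {"text": "storm drain backflow at high tide", "embedding": [0.5, 0.0, 0.5]},
]

ranked = top_k([1.0, 0.0, 0.0], toy_chunks, k=2)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

The `top_k` argument here plays the same role as the `top_k` parameter passed to the queries later in the notebook: it caps how many of the best-scoring chunks are handed to the synthesis step.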

## Prerequisites

1. Document chunks must be uploaded to PostgreSQL (run lit_mining_v3.ipynb first)
2. `DATABASE_URL` must be set in the `.env` file
3. `OPENAI_API_KEY` must be set in the environment (the setup cell also reads it from `.env`)

## Setup and Initialization

In [12]:
# Import required libraries
from rag_query_system import ClimateRAGSystem
import os
from dotenv import load_dotenv
import json

# Load environment variables
load_dotenv('../.env')

# Verify environment
DATABASE_URL = os.getenv("DATABASE_URL")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

if not DATABASE_URL:
    raise ValueError("DATABASE_URL not found in environment")
if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY not found in environment")

print("✅ Environment variables loaded")
print(f"   Database: {DATABASE_URL.split('@')[1] if '@' in DATABASE_URL else 'configured'}")

✅ Environment variables loaded
   Database: localhost:5432/climate_viewer_dev
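
The `split('@')[1]` display in the setup cell assumes the URL contains exactly one `@`. If a password happened to contain `@`, that expression would print a fragment of the password instead of the host. A minimal sketch of a safer display using the standard library follows; the helper name `describe_database` and the example URL are made up for illustration.

```python
from urllib.parse import urlsplit

def describe_database(url):
    """Return 'host:port/dbname' without any username or password."""
    parts = urlsplit(url)
    host = parts.hostname or "unknown-host"
    port = f":{parts.port}" if parts.port else ""
    return f"{host}{port}{parts.path}"

# Made-up URL whose password contains '@'.
example = "postgresql://user:p@ss@localhost:5432/climate_viewer_dev"
print(describe_database(example))
```

`urlsplit` separates credentials from the host at the last `@` in the network location, so the credentials never reach the printed string.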


In [13]:
# Initialize the RAG system
rag_system = ClimateRAGSystem(
    database_url=DATABASE_URL,
    model="gpt-4o",
    embedding_model="text-embedding-3-small"
)

print("✅ RAG system initialized")
print("   Model: gpt-4o")
print("   Embedding model: text-embedding-3-small")

✅ RAG system initialized
   Model: gpt-4o
   Embedding model: text-embedding-3-small


## System Health Check

In [15]:
from sqlalchemy import create_engine, text

engine = create_engine(DATABASE_URL)

with engine.connect() as conn:
    result = conn.execute(text("SELECT COUNT(*) FROM document_chunks"))
    total_chunks = result.scalar()
    
    result = conn.execute(text("SELECT COUNT(*) FROM document_chunks WHERE embedding IS NOT NULL"))
    chunks_with_embeddings = result.scalar()
    
    result = conn.execute(text("""
        SELECT DISTINCT unnest(relevant_layers) as layer 
        FROM document_chunks 
        WHERE relevant_layers IS NOT NULL
        ORDER BY layer
    """))
    layers = [row[0] for row in result]

print("📊 Database Status:")
print(f"   Total chunks: {total_chunks}")
print(f"   Chunks with embeddings: {chunks_with_embeddings}")
print(f"\n📁 Available Layers ({len(layers)}):")
for layer in layers:
    print(f"   - {layer}")

📊 Database Status:
   Total chunks: 568
   Chunks with embeddings: 568

📁 Available Layers (8):
   - annual_high_wave_flooding
   - compound_flooding
   - drainage_backflow
   - emergent_and_shallow_groundwater
   - future_erosion_hazard_zone
   - groundwater_inundation
   - low_lying_flooding
   - passive_marine_flooding
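
The two count queries in the health check could be collapsed into one round trip, because `COUNT(column)` in standard SQL counts only non-NULL values. The sketch below demonstrates this with an in-memory SQLite table standing in for `document_chunks`; the rows are toy data, not the project's database.

```python
import sqlite3

# In-memory stand-in for the document_chunks table (toy rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document_chunks (id INTEGER PRIMARY KEY, embedding BLOB)")
conn.executemany(
    "INSERT INTO document_chunks (embedding) VALUES (?)",
    [(b"\x01",), (b"\x02",), (None,)],
)

# COUNT(*) counts all rows; COUNT(embedding) skips rows where embedding IS NULL.
total, with_embeddings = conn.execute(
    "SELECT COUNT(*), COUNT(embedding) FROM document_chunks"
).fetchone()
conn.close()

missing = total - with_embeddings
print(f"Total chunks: {total}, with embeddings: {with_embeddings}, missing: {missing}")
```

A non-zero `missing` value would indicate chunks that semantic search cannot reach until their embeddings are generated.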


## Test 1: Groundwater Inundation

In [16]:
query = "What are the projected impacts of groundwater inundation in Hawaii due to sea level rise?"

result = rag_system.generate_response(
    query=query,
    top_k=10,
    layers=["groundwater_inundation"],
    min_confidence="MEDIUM",
    temperature=0.3
)

rag_system.print_response(result)

🔍 Retrieving top 10 chunks...
   Filtering by layers: groundwater_inundation
✅ Retrieved 10 chunks
📝 Building context prompt...
🤖 Generating response with gpt-4o...

ANSWER
Groundwater inundation in Hawaii due to sea level rise is a significant concern, especially for low-lying areas like Honolulu and Waikiki. As sea levels rise, the groundwater table, which is the level of water beneath the ground, also rises. This can lead to flooding from below, even if the ocean itself doesn't directly overflow onto the land.

Here's what this means in practical terms: As sea levels are projected to rise by up to 1 meter (about 3.2 feet) by the end of the century, areas like Waikiki could see up to 42% of their land affected by groundwater flooding. This is because the groundwater is pushed up closer to the surface, leading to flooding in basements, roads, and other infrastructure. For example, nearly 90% of on-site sewage disposal systems in Honolulu are already compromised during hig

## Test 2: Coastal Erosion

In [5]:
query = "What are the observed and projected erosion rates for Hawaiian beaches?"

result = rag_system.generate_response(
    query=query,
    top_k=8,
    layers=["future_erosion_hazard_zone"],
    min_confidence="HIGH",
    temperature=0.2
)

rag_system.print_response(result)

🔍 Retrieving top 8 chunks...
   Filtering by layers: future_erosion_hazard_zone
✅ Retrieved 8 chunks
📝 Building context prompt...
🤖 Generating response with gpt-4o...

ANSWER
Hawaiian beaches are experiencing noticeable erosion, and this trend is expected to continue and even worsen in the future. Here's a breakdown of what's happening:

**Current Erosion Rates:**
- On average, beaches on the islands of Kauai, Oahu, and Maui are eroding at a rate of about 0.11 meters (or roughly 4 inches) per year. This means that over the past century, many beaches have been gradually shrinking.
- Specifically, Maui is seeing the most erosion, with about 78% of its beaches eroding at an average rate of 0.13 meters (or about 5 inches) per year. Oahu's beaches are eroding too, but at a slower rate of about 0.03 meters (or just over an inch) per year.

**Future Projections:**
- Looking ahead, the situation is expected to get worse due to rising sea levels. By 2050, around 92% of Hawaiian shore

## Test 3: Multi-Layer Flooding

In [None]:
query = "What types of flooding will affect Honolulu and Waikiki with 3 feet of sea level rise?"

result = rag_system.generate_response(
    query=query,
    top_k=15,
    layers=["passive_marine_flooding", "groundwater_inundation", "drainage_backflow", "compound_flooding"],
    min_confidence="MEDIUM",
    temperature=0.3
)

rag_system.print_response(result)
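
Test 3 combines several layers with a `min_confidence` floor. The notebook does not show how `ClimateRAGSystem` applies these filters, so the following is only a sketch of one plausible semantics: a chunk is kept when its `relevant_layers` overlap the requested layers and its confidence meets the floor. The ordering LOW < MEDIUM < HIGH, the existence of a `LOW` level, and the function name are assumptions.

```python
CONFIDENCE_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}  # assumed ordering

def passes_filters(chunk, layers=None, min_confidence="LOW"):
    """True if the chunk overlaps the requested layers and meets the confidence floor."""
    if CONFIDENCE_RANK[chunk["confidence"]] < CONFIDENCE_RANK[min_confidence]:
        return False
    if layers is None:  # no layer filter: every layer is acceptable
        return True
    return bool(set(chunk["relevant_layers"]) & set(layers))

# Toy chunks (hypothetical); only the fields needed by the filter are included.
toy_chunks = [
    {"id": 1, "relevant_layers": ["groundwater_inundation"], "confidence": "HIGH"},
    {"id": 2, "relevant_layers": ["future_erosion_hazard_zone"], "confidence": "HIGH"},
    {"id": 3, "relevant_layers": ["drainage_backflow", "compound_flooding"], "confidence": "LOW"},
    {"id": 4, "relevant_layers": ["passive_marine_flooding"], "confidence": "MEDIUM"},
]
requested = ["passive_marine_flooding", "groundwater_inundation",
             "drainage_backflow", "compound_flooding"]

kept = [c["id"] for c in toy_chunks if passes_filters(c, requested, "MEDIUM")]
print(kept)
```

Under these assumptions, chunk 2 is dropped for its layer and chunk 3 for its confidence, which is why a multi-layer query may warrant a larger `top_k` than a single-layer one.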

## Test 4: Open Query (No Filters)

In [None]:
query = "What infrastructure in Hawaii is most vulnerable to sea level rise by 2050?"

result = rag_system.generate_response(
    query=query,
    top_k=12,
    layers=None,
    min_confidence="MEDIUM",
    temperature=0.3
)

rag_system.print_response(result)

## Test 5: Auto-Detection of Layers from Keywords

The system can automatically detect relevant layers based on keywords in your query!
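
The implementation of the detection step is not shown in this notebook. As a minimal sketch of how keyword-based detection could work, the code below matches lowercase substrings against a keyword map. The map contents and the function name `detect_layers` are hypothetical and are not taken from `rag_query_system`.

```python
# Hypothetical keyword map; the real system's keywords are not shown in this notebook.
LAYER_KEYWORDS = {
    "groundwater_inundation": ["groundwater", "water table"],
    "future_erosion_hazard_zone": ["erosion", "shoreline retreat"],
    "annual_high_wave_flooding": ["wave", "runup"],
    "drainage_backflow": ["storm drain", "backflow"],
    "compound_flooding": ["compound"],
}

def detect_layers(query):
    """Return layers whose keywords appear in the query, in map order."""
    text = query.lower()
    return [
        layer
        for layer, keywords in LAYER_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]

print(detect_layers("What is the impact of groundwater on buildings?"))
print(detect_layers("Will storm drains overflow with sea level rise?"))
print(detect_layers("How fast are beaches eroding?"))
```

In this sketch, exact substring matching misses inflected forms: "eroding" does not contain "erosion", so the third query returns an empty list. Adding variants such as "eroding" and "eroded" to the keyword list would catch them, at the cost of maintaining a longer map.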

In [6]:
# Example 1: Query naming "passive marine flooding" - checks whether passive_marine_flooding is auto-detected
query = "What is passive marine flooding?"

result = rag_system.generate_response(
    query=query,
    top_k=10,
    layers=None,  # No manual layer specification (auto_detect_layers=True is the default)
    min_confidence="MEDIUM",
    temperature=0.3
)

rag_system.print_response(result)

🔍 Retrieving top 10 chunks...
✅ Retrieved 10 chunks
📝 Building context prompt...
🤖 Generating response with gpt-4o...

ANSWER
Passive marine flooding, sometimes called marine inundation, occurs when rising sea levels cause ocean water to gradually overflow onto the land. Imagine it like a bathtub slowly filling up and spilling over the edge. This type of flooding doesn't need a storm or big waves to happen; it's a result of the sea level itself getting higher over time. 

In Hawaii, this means that as sea levels rise, low-lying coastal areas could start to see water creeping in more frequently and eventually staying there permanently. For example, research from the University of Hawaiʻi shows that with sea level rise scenarios ranging from 0 to 10 feet, areas that are close to the ocean and at low elevation are at risk. This could affect places like Waikīkī and other parts of Honolulu, where even a few feet of sea level rise could lead to regular flooding of roads, homes

In [None]:
# Example 2: Query with "erosion" keyword - should auto-detect future_erosion_hazard_zone
query = "What are the beach erosion rates on Oahu and Maui?"

result = rag_system.generate_response(
    query=query,
    top_k=8,
    # Note: layers=None and auto_detect_layers=True are defaults
    temperature=0.3
)

rag_system.print_response(result)

In [None]:
# Example 3: Query with "wave" keyword - should auto-detect annual_high_wave_flooding
query = "What is the impact of wave runup and high waves on coastal infrastructure?"

result = rag_system.generate_response(
    query=query,
    top_k=10,
    temperature=0.3
)

rag_system.print_response(result)

In [11]:
# Example 4: Multiple keywords - should detect multiple layers
query = "What are the combined effects of storm drain backflow and coastal flooding in Honolulu?"

result = rag_system.generate_response(
    query=query,
    top_k=15,  # Retrieve more chunks since we're covering multiple layers
    temperature=0.3
)

rag_system.print_response(result)

🔍 Auto-detected layers: passive_marine_flooding, drainage_backflow
🔍 Retrieving top 15 chunks...
   Filtering by layers: passive_marine_flooding, drainage_backflow
✅ Retrieved 15 chunks
📝 Building context prompt...
🤖 Generating response with gpt-4o...

ANSWER
In Honolulu, the combined effects of storm drain backflow and coastal flooding can lead to significant flooding challenges, especially in areas like Waikīkī. Here's how it works: as sea levels rise, ocean water can push back into storm drains, causing them to overflow. This backflow can flood streets and properties, particularly during high tides or storms when water levels are already elevated.

By 2050, with projections of about 2 feet of sea level rise, areas in Honolulu could experience more frequent and severe flooding. For instance, Waikīkī, a major tourist hub, is particularly vulnerable. The rising groundwater levels, combined with inadequate drainage systems, mean that even a small increase in sea level 

### Testing the Layer Detection Function Directly

You can also test the layer detection without running a full query:

In [10]:
# Test layer detection on various queries
test_queries = [
    "What is the impact of groundwater on buildings?",
    "How fast are beaches eroding?",
    "What happens during high wave events?",
    "Will storm drains overflow with sea level rise?",
    "What areas are below critical elevation thresholds?",
    "How does compound flooding work?",
]

print("Testing Layer Auto-Detection:\n")
print("=" * 80)

for query in test_queries:
    detected = rag_system.detect_layers_from_query(query)
    print(f"\nQuery: {query}")
    if detected:
        print(f"Detected Layers: {', '.join(detected)}")
    else:
        print("Detected Layers: None (will search all layers)")
    print("-" * 80)

Testing Layer Auto-Detection:


Query: What is the impact of groundwater on buildings?
Detected Layers: groundwater_inundation
--------------------------------------------------------------------------------

Query: How fast are beaches eroding?
Detected Layers: None (will search all layers)
--------------------------------------------------------------------------------

Query: What happens during high wave events?
Detected Layers: annual_high_wave_flooding
--------------------------------------------------------------------------------

Query: Will storm drains overflow with sea level rise?
Detected Layers: drainage_backflow
--------------------------------------------------------------------------------

Query: What areas are below critical elevation thresholds?
Detected Layers: low_lying_flooding
--------------------------------------------------------------------------------

Query: How does compound flooding work?
Detected Layers: compound_flooding
-------------------------------

## Advanced: Direct Chunk Retrieval

In [9]:
query = "groundwater flooding in urban areas"

chunks = rag_system.retrieve_chunks(
    query=query,
    top_k=5,
    layers=["groundwater_inundation"],
    min_confidence="HIGH"
)

print(f"Retrieved {len(chunks)} chunks:\n")

for i, chunk in enumerate(chunks, 1):
    print(f"[{i}] {chunk['filename']}")
    print(f"    Similarity: {chunk['similarity_score']:.4f}")
    print(f"    Confidence: {chunk['confidence']}")
    print(f"    Text: {chunk['text'][:200]}...\n")

Retrieved 5 chunks:

[1] Habel_et_al_flood_comparison.md
    Similarity: 0.6315
    Confidence: HIGH
    Text: ## Sea-Level Rise induced Multi-Mechanism flooding and contribution to Urban infrastructure failure

Sea-level rise (SLR) induced flooding is often envisioned as solely originating from a direct marin...

[2] Habel_et_al_flood_comparison.md
    Similarity: 0.5915
    Confidence: HIGH
    Text: Here a method is developed that identifies flooding extents and infrastructure vulnerabilities that are likely to result from alternate flood sources over coming decades. The method includes simulatio...

[3] Habel_et_al_flood_comparison.md
    Similarity: 0.5828
    Confidence: HIGH
    Text: In such cases, it is assumed that flooding will remain unless all mechanisms featured in that area are mitigated. For example, if GWI and direct marine flooding are featured in an area, and only the d...

[4] annurev-marine-020923-120737.md
    Similarity: 0.5812
    Confidence: HIGH
    Text: Rate

## Custom Query Cell

In [8]:
# Write your own custom query
query = "What is groundwater flooding?"

result = rag_system.generate_response(
    query=query,
    top_k=10,
    layers=None,
    min_confidence="MEDIUM",
    temperature=0.3,
    auto_detect_layers=True
)

rag_system.print_response(result)

🔍 Auto-detected layers: groundwater_inundation
🔍 Retrieving top 10 chunks...
   Filtering by layers: groundwater_inundation
✅ Retrieved 10 chunks
📝 Building context prompt...
🤖 Generating response with gpt-4o...

ANSWER
Groundwater flooding is a type of flooding that happens when the water table, which is the upper level of groundwater, rises to the point where it reaches the surface. This can cause flooding from below, rather than from rain or ocean water coming in from above. In places like Honolulu and Waikiki, this is becoming a concern because as sea levels rise, the groundwater levels rise too. 

For example, research shows that in Waikiki, about 42% of the area has groundwater depths shallower than 1.3 meters (about 4 feet). This means that as sea levels continue to rise, the groundwater could reach the surface more frequently, leading to flooding even on sunny days. This is particularly important because it can affect basements, roads, and underground infrastructu