# Sub-Challenge 1: RAG-Based Well Report Summarization

**GeoHackathon 2025 - Guide for Testing Sub-Challenge 1**

This notebook is a step-by-step guide to testing the RAG (Retrieval-Augmented Generation) system for geothermal well report summarization.

---

## What is Sub-Challenge 1?

**Goal:** Build an AI system that can answer questions about geothermal well reports by:
1. Finding relevant sections in PDF documents (Retrieval)
2. Using an LLM to generate accurate answers (Generation)
3. Providing citations to source documents
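
As an illustration of the third step, a citation can be assembled from the metadata attached to each retrieved chunk. This is a minimal sketch, not the project's code: `format_citation` and its output format are hypothetical, and the metadata keys are assumed to match the `well_name`, `section_number`, `section_title`, and `page` fields that the example queries later in this notebook print.

```python
def format_citation(source: dict) -> str:
    """Build a one-line citation from a retrieved chunk's metadata (hypothetical format)."""
    well = source.get("well_name", "unknown well")
    number = source.get("section_number", "?")
    title = source.get("section_title", "untitled section")
    page = source.get("page", "?")
    return f"[{well}, Section {number} ({title}), p. {page}]"


# Made-up metadata for illustration only.
example = {
    "well_name": "Well 5",
    "section_number": "2.1",
    "section_title": "Depth",
    "page": 12,
}
print(format_citation(example))
```

Missing keys fall back to placeholders, so a chunk with incomplete metadata still yields a readable citation.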

**Scoring:** 50% of total grade

**Key Features:**
- TOC-enhanced parsing (30-86x faster than full PDF parsing)
- Section-aware chunking with metadata
- Table extraction and separate indexing
- Query intent mapping to section types
- Citation with page numbers and section references
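
To make "section-aware chunking with metadata" concrete, the sketch below splits one section's text on paragraph boundaries and copies the section's metadata onto every chunk. It illustrates the idea under stated assumptions and is not the project's chunker: `chunk_section`, the `max_chars` limit, and the `chunk_index` field are all hypothetical.

```python
def chunk_section(text: str, metadata: dict, max_chars: int = 500) -> list[dict]:
    """Split one section into chunks on paragraph boundaries.

    Every chunk carries a copy of the section metadata plus its own index.
    A single paragraph longer than max_chars is kept whole as one oversized chunk.
    """
    chunks: list[dict] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Close the current chunk when adding this paragraph would exceed the limit.
        if current and len(current) + len(paragraph) + 2 > max_chars:
            chunks.append({"text": current, **metadata, "chunk_index": len(chunks)})
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append({"text": current, **metadata, "chunk_index": len(chunks)})
    return chunks


# Made-up section text and metadata for illustration only.
section_text = "A" * 300 + "\n\n" + "B" * 300
section_meta = {"well_name": "Well 5", "section_type": "depth", "page": 12}
for chunk in chunk_section(section_text, section_meta):
    print(chunk["chunk_index"], chunk["section_type"], len(chunk["text"]))
```

Because the metadata travels with each chunk, retrieval can later filter by `section_type` and the answer can cite a page without re-reading the PDF.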

---

## Architecture Overview

```
User Query
    ↓
Intent Mapper ("well depth" → ['depth', 'borehole'])
    ↓
Query Embedding (nomic-embed-text-v1.5, 768-dim)
    ↓
ChromaDB Retrieval (filter by section_type)
    ↓
Context Building (top 5 chunks with metadata)
    ↓
Ollama LLM (llama3.2:3b, temp=0.1)
    ↓
Answer + Citations
```
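
The intent-mapping step at the top of this pipeline can be pictured as a keyword lookup. The following is a minimal sketch, not the project's actual intent mapper: the `KEYWORD_TO_SECTIONS` table and the `map_intent` function are hypothetical, and only the "well depth" entry mirrors the diagram.

```python
# Hypothetical keyword table; only the "well depth" entry mirrors the diagram.
KEYWORD_TO_SECTIONS = {
    "well depth": ["depth", "borehole"],
    "casing": ["casing"],
    "trajectory": ["trajectory"],
}


def map_intent(query: str) -> list[str]:
    """Return the section types whose keyword appears in the query, without duplicates."""
    lowered = query.lower()
    section_types: list[str] = []
    for keyword, sections in KEYWORD_TO_SECTIONS.items():
        if keyword in lowered:
            for section in sections:
                if section not in section_types:
                    section_types.append(section)
    return section_types


print(map_intent("What is the well depth of Well 5?"))
print(map_intent("Who wrote this report?"))
```

An empty result would naturally mean "no section filter", so the retrieval step falls back to searching every section type.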

---

## Step 1: Prerequisites Check

Before starting, verify that all required services are running:
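
The ChromaDB and Ollama checks follow the same pattern: send a GET request, treat HTTP 200 as healthy, and report anything else. A possible way to factor that pattern out is sketched here using only the standard library. The names `http_status` and `check_service` are hypothetical helpers, not part of the project, and the demo calls use stub fetchers so the example runs without any service.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code of a GET request, or None if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def check_service(name: str, url: str,
                  fetch: Callable[[str], Optional[int]] = http_status) -> bool:
    """Print a one-line health report for a service and return True only on HTTP 200."""
    status = fetch(url)
    if status == 200:
        print(f"   [OK] {name} is running at {url}")
        return True
    if status is None:
        print(f"   [ERROR] {name} not accessible at {url}")
    else:
        print(f"   [ERROR] {name} responded with HTTP {status}")
    return False


# Demo with stub fetchers so no network access is needed.
check_service("Demo service", "http://example.invalid/health", fetch=lambda url: 200)
check_service("Demo service", "http://example.invalid/health", fetch=lambda url: None)
```

With the default fetcher, the same function could be pointed at the heartbeat and tags URLs used in the cell below.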

In [None]:
import sys
import requests
from pathlib import Path

# Add src to path
sys.path.insert(0, str(Path.cwd().parent / 'src'))

print("="*80)
print("PREREQUISITES CHECK")
print("="*80)

# Check ChromaDB
print("\n1. ChromaDB Service:")
try:
    response = requests.get('http://localhost:8000/api/v1/heartbeat', timeout=5)
    if response.status_code == 200:
        print("   [OK] ChromaDB is running on localhost:8000")
    else:
        print(f"   [ERROR] ChromaDB responded with HTTP {response.status_code}")
except Exception as e:
    print(f"   [ERROR] ChromaDB not accessible: {e}")
    print("   Run: docker-compose up -d chromadb")

# Check Ollama
print("\n2. Ollama Service:")
try:
    response = requests.get('http://localhost:11434/api/tags', timeout=5)
    if response.status_code == 200:
        models = [m['name'] for m in response.json()['models']]
        print("   [OK] Ollama is running on localhost:11434")
        print(f"   Available models: {models}")
        if 'llama3.2:3b' in models:
            print("   [OK] llama3.2:3b model is available")
        else:
            print("   [WARN] llama3.2:3b not found. Run: ollama pull llama3.2:3b")
    else:
        print(f"   [ERROR] Ollama responded with HTTP {response.status_code}")
except Exception as e:
    print(f"   [ERROR] Ollama not accessible: {e}")
    print("   Download from: https://ollama.ai")
    print("   Then run: ollama pull llama3.2:3b")

# Check TOC database
print("\n3. TOC Database:")
toc_db_path = Path.cwd().parent / 'outputs' / 'exploration' / 'toc_database.json'
if toc_db_path.exists():
    import json
    with open(toc_db_path) as f:
        toc_db = json.load(f)
    print(f"   [OK] TOC database found with {len(toc_db)} entries")
    print(f"   Wells available: {list(toc_db.keys())}")
else:
    print("   [ERROR] TOC database not found")
    print("   Run: python notebooks/build_toc_database.py")

print("\n" + "="*80)
print("If all checks pass, you're ready to proceed!")
print("="*80)

---

## Step 2: Initialize the RAG System

Load all components (this takes ~10-15 seconds):

In [None]:
from rag_system import WellReportRAG

print("Initializing RAG system...\n")
print("This will load:")
print("  - TOC database (9 wells)")
print("  - Embedding model (nomic-embed-text-v1.5, 768-dim)")
print("  - Vector store connection (ChromaDB)")
print("  - LLM client (Ollama llama3.2:3b)")
print("\nPlease wait...\n")

rag = WellReportRAG()

print("\n" + "="*80)
print("RAG system ready! You can now query well reports.")
print("="*80)

---

## Step 3: Check Indexed Data

Verify which wells have been indexed into ChromaDB:
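
The per-well breakdown can also be computed in a single pass over a list of chunk metadata dictionaries, which avoids one database call per well. The sketch below assumes each dictionary carries `well_name` and `chunk_type` keys, as the cell below does; `summarize_chunks` is a hypothetical helper and the sample data is made up.

```python
from collections import Counter


def summarize_chunks(metadatas: list[dict]) -> dict[str, Counter]:
    """Count chunks per well, broken down by chunk type, in one pass."""
    summary: dict[str, Counter] = {}
    for metadata in metadatas:
        well = metadata.get("well_name", "unknown")
        chunk_type = metadata.get("chunk_type", "unknown")
        summary.setdefault(well, Counter())[chunk_type] += 1
    return summary


# Made-up metadata for illustration only.
sample = [
    {"well_name": "Well 5", "chunk_type": "text"},
    {"well_name": "Well 5", "chunk_type": "table"},
    {"well_name": "Well 5", "chunk_type": "text"},
    {"well_name": "Well 7", "chunk_type": "text"},
]
for well, counts in sorted(summarize_chunks(sample).items()):
    total = sum(counts.values())
    print(f"{well}: {total} chunks ({counts['text']} text + {counts['table']} tables)")
```

A `Counter` returns 0 for chunk types a well does not have, so the report line needs no special-casing.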

In [None]:
# Get collection stats
stats = rag.vector_store.get_collection_stats()

print("="*80)
print("CHROMADB STATUS")
print("="*80)
print(f"\nCollection: {stats['collection_name']}")
print(f"Total chunks indexed: {stats['total_chunks']}")

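# Fetch metadata for every chunk; a small sample (e.g. limit=100) can miss
# wells entirely and make the "wells indexed" count below too low
results = rag.vector_store.collection.get(
    limit=max(stats['total_chunks'], 1),
    include=['metadatas']
)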

well_names = set()
document_names = set()
chunk_types = {'text': 0, 'table': 0}

for metadata in results['metadatas']:
    if 'well_name' in metadata:
        well_names.add(metadata['well_name'])
    if 'document_name' in metadata:
        document_names.add(metadata['document_name'])
    if 'chunk_type' in metadata:
        chunk_types[metadata['chunk_type']] = chunk_types.get(metadata['chunk_type'], 0) + 1

print(f"\nWells indexed: {len(well_names)}")
for well in sorted(well_names):
    # Count chunks per well
    well_chunks = rag.vector_store.collection.get(
        where={'well_name': well},
        limit=10000,
        include=['metadatas']
    )
    text_chunks = sum(1 for m in well_chunks['metadatas'] if m.get('chunk_type') == 'text')
    table_chunks = sum(1 for m in well_chunks['metadatas'] if m.get('chunk_type') == 'table')
    print(f"  - {well}: {len(well_chunks['ids'])} chunks ({text_chunks} text + {table_chunks} tables)")

print(f"\nDocuments indexed: {len(document_names)}")
for doc in sorted(document_names):
    print(f"  - {doc}")

print("\n" + "="*80)

if stats['total_chunks'] == 0:
    print("\n[WARNING] No data indexed yet!")
    print("Run: python scripts/index_all_wells.py")
    print("This will take 20-40 minutes to index all 8 wells.")
elif len(well_names) < 8:
    print(f"\n[INFO] Only {len(well_names)}/8 wells indexed.")
    print("The indexing script may still be running.")
    print("Check: tail -f outputs/indexing_log_fixed.txt")
else:
    print("\n[OK] All wells indexed successfully!")

---

## Step 4: Example Queries

### Query 1: Simple Factual Question

In [None]:
# Query about well depth
query = "What is the measured depth and true vertical depth of Well 5?"
well_name = "Well 5"

print("="*80)
print(f"QUERY: {query}")
print(f"WELL: {well_name}")
print("="*80)

result = rag.query(query, well_name=well_name, n_results=5)

print(f"\n[ANSWER]\n{result['answer']}")
print(f"\n[SOURCES] Retrieved {len(result['sources'])} chunks:")
for i, source in enumerate(result['sources'], 1):
    print(f"\n  Source {i}:")
    print(f"    Section: {source['section_number']} - {source['section_title']}")
    print(f"    Type: {source['section_type']}")
    print(f"    Page: {source['page']}")
    print(f"    Chunk type: {source['chunk_type']}")
    print(f"    Distance: {source['distance']:.4f}")
    print(f"    Preview: {source['text'][:150]}...")

print("\n" + "="*80)

### Query 2: Table-Heavy Question

In [None]:
# Query about casing data (typically in tables)
query = "What are the casing sizes and depths for Well 5?"
well_name = "Well 5"

print("="*80)
print(f"QUERY: {query}")
print(f"WELL: {well_name}")
print("="*80)

result = rag.query(query, well_name=well_name, n_results=5)

print(f"\n[ANSWER]\n{result['answer']}")
print(f"\n[SOURCES] Retrieved {len(result['sources'])} chunks:")
for i, source in enumerate(result['sources'], 1):
    print(f"\n  Source {i}:")
    print(f"    Type: {source['chunk_type']}")
    print(f"    Section: {source['section_number']} - {source['section_title']}")
    print(f"    Page: {source['page']}")
    if source['chunk_type'] == 'table':
        print(f"    [TABLE DATA]:")
        print("    " + source['text'][:300].replace('\n', '\n    '))

print("\n" + "="*80)

### Query 3: Cross-Well Comparison

In [None]:
# Query across all wells (no well_name filter)
query = "What are the typical measured depths for geothermal wells?"

print("="*80)
print(f"QUERY: {query}")
print("WELL: All wells")
print("="*80)

result = rag.query(query, well_name=None, n_results=10)

print(f"\n[ANSWER]\n{result['answer']}")
print(f"\n[SOURCES] Retrieved {len(result['sources'])} chunks from multiple wells:")

# Group by well
wells_found = {}
for source in result['sources']:
    well = source['well_name']
    if well not in wells_found:
        wells_found[well] = []
    wells_found[well].append(source)

for well, sources in wells_found.items():
    print(f"\n  {well}: {len(sources)} chunks")
    for source in sources[:2]:  # Show first 2 from each well
        print(f"    - Section {source['section_number']}, Page {source['page']}")

print("\n" + "="*80)

### Query 4: Section-Filtered Query

In [None]:
# Test query intent mapping
query = "Tell me about the drilling trajectory of Well 5"
well_name = "Well 5"

print("="*80)
print(f"QUERY: {query}")
print(f"WELL: {well_name}")
print("="*80)

# First, show what section types are mapped
print("\n[INTENT MAPPING]")
section_types = rag.intent_mapper.get_section_types(query)
print(f"Query mapped to section types: {section_types}")
print("This means the retrieval will prioritize these sections.")

result = rag.query(query, well_name=well_name, n_results=5)

print(f"\n[ANSWER]\n{result['answer']}")
print(f"\n[SOURCES] Retrieved {len(result['sources'])} chunks:")
for i, source in enumerate(result['sources'], 1):
    print(f"  {i}. Section: {source['section_type']} | Page: {source['page']} | Distance: {source['distance']:.4f}")

print("\n" + "="*80)

---

## Step 5: Interactive Query Interface

Run this cell to create an interactive interface for querying:

In [None]:
def interactive_query():
    """
    Interactive query interface
    Type 'quit' to exit
    """
    print("="*80)
    print("INTERACTIVE QUERY INTERFACE")
    print("="*80)
    print("\nAvailable wells:")
    for well in rag.toc_database.keys():
        print(f"  - {well}")
    print("\nType 'quit' to exit")
    print("="*80)
    
    while True:
        print("\n" + "-"*80)
        query = input("\nEnter your question: ").strip()
        
        if query.lower() in ['quit', 'exit', 'q']:
            print("\nExiting...")
            break
        
        if not query:
            continue
        
        well_input = input("Enter well name (or press Enter for all wells): ").strip()
        well_name = well_input if well_input else None
        
        try:
            print("\n[PROCESSING...]")
            result = rag.query(query, well_name=well_name, n_results=5)
            
            print(f"\n[ANSWER]\n{result['answer']}")
            print(f"\n[SOURCES] {len(result['sources'])} chunks retrieved")
            
            show_sources = input("\nShow source details? (y/n): ").strip().lower()
            if show_sources == 'y':
                for i, source in enumerate(result['sources'], 1):
                    print(f"\n  Source {i}:")
                    print(f"    Well: {source['well_name']}")
                    print(f"    Section: {source['section_number']} - {source['section_title']}")
                    print(f"    Page: {source['page']}")
                    print(f"    Type: {source['chunk_type']}")
        
        except Exception as e:
            print(f"\n[ERROR] {e}")

# Run the interactive interface
interactive_query()

---

## Step 6: Performance Benchmarking

Measure query performance (target: <10 seconds per query):
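
For timing, `time.perf_counter()` is a monotonic, high-resolution clock and is generally a safer choice than `time.time()`, which can jump if the system clock is adjusted. A small timing helper might look like the sketch below; `time_calls` is a hypothetical name, and the demo times a trivial stand-in function rather than the RAG system.

```python
import statistics
import time
from typing import Callable


def time_calls(fn: Callable[[str], object], queries: list[str]) -> dict[str, float]:
    """Run fn once per query and return mean, median, and max wall-clock seconds.

    Raises statistics.StatisticsError if queries is empty.
    """
    durations: list[float] = []
    for query in queries:
        start = time.perf_counter()
        fn(query)
        durations.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Demo with a stand-in function; it does no retrieval or generation.
stats = time_calls(lambda query: query.upper(), ["first question", "second question"])
print({name: f"{seconds:.6f}s" for name, seconds in stats.items()})
```

In this notebook the stand-in could be replaced by something like `lambda q: rag.query(q, well_name="Well 5", n_results=5)`. Reporting the maximum alongside the mean shows whether any single query exceeds the 10-second target even when the average does not.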

In [None]:
import time

# Test queries
test_queries = [
    ("What is the measured depth of Well 5?", "Well 5"),
    ("What are the casing specifications?", "Well 5"),
    ("Describe the well trajectory", "Well 5"),
    ("What drilling challenges were encountered?", "Well 5"),
    ("What is the production capacity?", "Well 5"),
]

print("="*80)
print("PERFORMANCE BENCHMARK")
print("="*80)
print(f"\nRunning {len(test_queries)} test queries...\n")

times = []

for i, (query, well) in enumerate(test_queries, 1):
    print(f"{i}. {query}")
    
    start = time.time()
    result = rag.query(query, well_name=well, n_results=5)
    elapsed = time.time() - start
    
    times.append(elapsed)
    print(f"   Time: {elapsed:.2f}s")
    print(f"   Sources: {len(result['sources'])} chunks")
    print()

print("="*80)
print("RESULTS")
print("="*80)
print(f"Total queries: {len(times)}")
print(f"Average time: {sum(times)/len(times):.2f}s")
print(f"Min time: {min(times):.2f}s")
print(f"Max time: {max(times):.2f}s")
print(f"\nTarget: <10s per query")
if sum(times)/len(times) < 10:
    print("[OK] Performance target met!")
else:
    print("[WARN] Performance below target")
print("="*80)

---

## Step 7: Quality Assessment

Test answer quality with known ground truth:
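
The check used here is case-insensitive substring matching against a list of expected keywords. The sketch below isolates that scoring rule as a function; `keyword_score` is a hypothetical helper, and the sample answer is made up. One known limitation of substring matching is that short keywords such as "mm" can match inside unrelated words, so a pass does not prove the answer is correct.

```python
def keyword_score(answer: str, expected_keywords: list[str]) -> float:
    """Return the fraction of expected keywords found in the answer (case-insensitive)."""
    if not expected_keywords:
        return 0.0
    lowered = answer.lower()
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in lowered)
    return hits / len(expected_keywords)


# Made-up answer for illustration only.
sample_answer = "The measured depth is 2523 m MD."
score = keyword_score(sample_answer, ["2523", "2524", "meter", "MD"])
print(f"Keyword score: {score:.2f}")
```

A test would count as passed when the score reaches the chosen threshold, which is 0.5 in the cell below.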

In [None]:
# Test with questions where we know the answer
quality_tests = [
    {
        'query': 'What is the measured depth of Well 5?',
        'well': 'Well 5',
        'expected_keywords': ['2523', '2524', 'meter', 'MD'],
        'description': 'Should find MD ~2523-2524m in depth section'
    },
    {
        'query': 'What is the inner diameter of the production casing?',
        'well': 'Well 5',
        'expected_keywords': ['177.8', '178', 'mm', 'casing'],
        'description': 'Should find ID 177.8mm in casing section'
    },
]

print("="*80)
print("QUALITY ASSESSMENT")
print("="*80)

passed = 0
total = len(quality_tests)

for i, test in enumerate(quality_tests, 1):
    print(f"\nTest {i}/{total}")
    print(f"Query: {test['query']}")
    print(f"Expected: {test['description']}")
    
    result = rag.query(test['query'], well_name=test['well'], n_results=5)
    answer = result['answer'].lower()
    
    # Check if expected keywords are in answer
    found_keywords = [kw for kw in test['expected_keywords'] if kw.lower() in answer]
    
    print(f"\nAnswer: {result['answer'][:200]}...")
    print(f"\nKeywords found: {len(found_keywords)}/{len(test['expected_keywords'])} {found_keywords}")
    
    if len(found_keywords) >= len(test['expected_keywords']) * 0.5:  # At least 50% keywords
        print("[OK] Test passed")
        passed += 1
    else:
        print("[FAIL] Test failed")

print("\n" + "="*80)
print(f"RESULTS: {passed}/{total} tests passed ({100*passed/total:.0f}%)")
print(f"Target: >90% accuracy")
if passed/total >= 0.9:
    print("[OK] Quality target met!")
else:
    print("[WARN] Quality below target")
print("="*80)

---

## Step 8: Troubleshooting

Common issues and solutions:

In [None]:
print("="*80)
print("TROUBLESHOOTING GUIDE")
print("="*80)

print("""
Issue 1: ChromaDB connection failed
  Solution: docker-compose up -d chromadb
  Check: docker ps | grep chromadb

Issue 2: Ollama connection failed
  Solution: Download from https://ollama.ai and run:
    ollama serve
    ollama pull llama3.2:3b

Issue 3: No data indexed (0 chunks)
  Solution: python scripts/index_all_wells.py
  Time: ~20-40 minutes for all 8 wells
  Check progress: tail -f outputs/indexing_log_fixed.txt

Issue 4: Query returns "I don't know"
  Causes:
    - Data not indexed for that well
    - Question is outside document scope
    - Retrieved chunks not relevant
  Debug:
    - Check result['sources'] to see what was retrieved
    - Try broader query terms
    - Verify well is indexed: rag.vector_store.get_collection_stats()

Issue 5: Slow queries (>10s)
  Causes:
    - Ollama LLM generation is slow (CPU-bound)
    - Large context (many chunks retrieved)
  Solutions:
    - Reduce n_results (default 5)
    - Use more specific queries (better section filtering)
    - Wait for LLM response (normal for CPU)

Issue 6: Unicode errors on Windows
  Solution: All emojis have been removed from source code
  If you see new errors, run: python scripts/remove_emojis_from_py_files.py

Issue 7: Memory issues
  Solution: Close other applications
  Note: nomic-embed-text-v1.5 uses ~500MB RAM
        llama3.2:3b uses ~2-3GB RAM
""")

print("="*80)

---

## Summary

**Sub-Challenge 1 Checklist:**

✅ Services running (ChromaDB + Ollama)  
✅ Data indexed (all 8 wells)  
✅ RAG system initialized  
✅ Queries working (factual answers with citations)  
✅ Performance: <10s per query  
✅ Quality: >90% accuracy  

**Next Steps:**
1. Test with your own questions
2. Try different wells
3. Experiment with query phrasing
4. Check citations are accurate
5. Move to Sub-Challenge 2 (parameter extraction)

**Files Generated:**
- `outputs/indexing_log_fixed.txt` - Indexing progress
- `outputs/indexing_results.json` - Indexing summary
- ChromaDB data in Docker volume `hackathon_chroma_data`

**Key Metrics:**
- Indexing time: 20-40 minutes (one-time)
- Query time: <10 seconds
- Accuracy target: >90%
- Points: 50% of total grade

Good luck! 🚀