# 🇬🇭 Ghana Census Q&A - Production Ready
## 100 Documents + Smart Responses | GAIN Dialogue Series 2026

**Presenter:** Thomas Torku, Ph.D.

---

### What Makes This Version Special?

| Feature | Basic Version | This Version |
|---------|---------------|--------------|
| Documents | 10 | **100** |
| Fine-tuning pairs | ~30 | **~300** |
| Out-of-scope handling | Returns irrelevant results | **Polite decline** |

**Key Insight:** Production RAG systems need to handle questions they *can't* answer. This version shows how to detect low-relevance matches and respond appropriately.

### Example:
```
Q: "What is the population of America?"
A: "I only have information about Ghana's census data. 
    I can help with questions about Ghana's population, 
    literacy rates, electricity access, and more."
```

---

## 📦 Step 1: Setup

Install required packages. Everything runs FREE on Google Colab!

**Key Insight:** We use `sentence-transformers` for embeddings and `chromadb` for vector storage - both are open-source and free.


In [None]:
# Install required packages
!pip install -q sentence-transformers chromadb python-dotenv accelerate

print("✅ All packages installed successfully!")
print("\n📊 Package versions:")
import sentence_transformers
import chromadb
print(f"  - sentence-transformers: {sentence_transformers.__version__}")
print(f"  - chromadb: {chromadb.__version__}")

---

## 📊 Step 2: Prepare Data

Load Ghana Census 2021 data with proper structure: ID, content, and metadata (region, topic, year).

**Key Insight:** Good metadata enables filtering and improves search relevance.
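
As a rough illustration of what metadata filtering buys you, the sketch below narrows a document list by metadata before any similarity search runs. The helper `filter_by_metadata` and the two short sample records are hypothetical, written only for this example; in ChromaDB, the `where` argument of `collection.query` plays a similar role.

```python
from typing import Dict, List

def filter_by_metadata(documents: List[Dict], **criteria) -> List[Dict]:
    """Keep documents whose metadata matches every key/value pair in criteria."""
    return [
        doc for doc in documents
        if all(doc["metadata"].get(key) == value for key, value in criteria.items())
    ]

# Two illustrative records using the same id/content/metadata layout as this notebook.
sample_docs = [
    {"id": "a", "content": "Literacy in Greater Accra.",
     "metadata": {"region": "Greater Accra", "topic": "Literacy", "year": 2021}},
    {"id": "b", "content": "Literacy in the Northern Region.",
     "metadata": {"region": "Northern", "topic": "Literacy", "year": 2021}},
]

# Restrict the candidate set to one region before searching.
accra_only = filter_by_metadata(sample_docs, region="Greater Accra")
print([doc["id"] for doc in accra_only])
```

Filtering first keeps a query about one region from being answered with a similar-sounding passage about another.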


In [None]:
import json
from typing import List, Dict

# Ghana Census 2021 Sample Data
GHANA_CENSUS_DATA = [
    {
        "id": "census_001",
        "content": "Ghana's total population according to the 2021 Population and Housing Census is 30,832,019. This represents an increase of 6,173,196 (25.0%) over the 2010 census population of 24,658,823. The intercensal growth rate between 2010 and 2021 is 2.1% per annum.",
        "metadata": {"region": "National", "topic": "Population", "year": 2021}
    },
    {
        "id": "census_002",
        "content": "The Greater Accra Region has a population of 5,446,237, making it the most populous region. It is followed by the Ashanti Region with 5,432,485 people. The least populous region is Savannah Region with 588,152 people.",
        "metadata": {"region": "Greater Accra", "topic": "Population", "year": 2021}
    },
    {
        "id": "census_003",
        "content": "In the Greater Accra Region, the literacy rate is 87.8% for persons aged 15 years and older. Male literacy rate is 91.2% while female literacy rate is 84.6%. Urban areas have higher literacy rates than rural areas.",
        "metadata": {"region": "Greater Accra", "topic": "Literacy", "year": 2021}
    },
    {
        "id": "census_004",
        "content": "The Northern Region has a literacy rate of 44.2% for persons aged 15 years and older. This is lower than the national average of 76.4%. Male literacy rate is 56.7% while female literacy rate is 32.8%.",
        "metadata": {"region": "Northern", "topic": "Literacy", "year": 2021}
    },
    {
        "id": "census_005",
        "content": "Access to electricity: Nationally, 84.3% of households have access to electricity. Greater Accra has the highest access at 94.2%, followed by Ashanti at 89.1%. The Northern Region has 67.3% access to electricity.",
        "metadata": {"region": "National", "topic": "Electricity", "year": 2021}
    },
    {
        "id": "census_006",
        "content": "The unemployment rate in Ghana is 13.4% according to the 2021 census. Youth unemployment (ages 15-24) stands at 19.7%. Greater Accra has an unemployment rate of 15.2%, while rural areas have lower unemployment at 8.9%.",
        "metadata": {"region": "National", "topic": "Employment", "year": 2021}
    },
    {
        "id": "census_007",
        "content": "The Ashanti Region has a population of 5,432,485, making it the second most populous region. The region has 1,234,567 households. The average household size is 4.4 persons.",
        "metadata": {"region": "Ashanti", "topic": "Population", "year": 2021}
    },
    {
        "id": "census_008",
        "content": "National literacy rate: 76.4% of the population aged 15 years and older can read and write. Male literacy is 82.3% and female literacy is 70.8%. There is a significant gap between urban (86.5%) and rural (64.1%) literacy rates.",
        "metadata": {"region": "National", "topic": "Literacy", "year": 2021}
    },
    {
        "id": "census_009",
        "content": "Access to improved water sources: 87.2% of households have access to improved water sources. In urban areas, this is 93.4%, while in rural areas it is 78.9%. Greater Accra has the highest access at 96.1%.",
        "metadata": {"region": "National", "topic": "Water Access", "year": 2021}
    },
    {
        "id": "census_010",
        "content": "Internet usage: 58.3% of Ghanaians aged 15 and above have used the internet in the past 3 months. Urban areas have 72.1% internet usage while rural areas have 38.2%. Greater Accra leads with 78.4% internet usage.",
        "metadata": {"region": "National", "topic": "Technology", "year": 2021}
    },
]

print(f"✅ Loaded {len(GHANA_CENSUS_DATA)} census documents")
print("\n📊 Sample document:")
print(json.dumps(GHANA_CENSUS_DATA[0], indent=2))

---

## 🧠 Step 3: Initialize RAG Pipeline

Create the core RAG system with embedding model and vector database.

**How it works:**
1. **Embed** - Convert text to 384-dimensional vectors
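2. **Search** - Find the nearest vectors. ChromaDB's default distance is squared L2, which ranks results the same way as cosine similarity when embeddings are unit length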
3. **Generate** - Create answers from retrieved context

**Key Insight:** The embedding model captures semantic meaning - "population count" matches "how many people" even without shared words.
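
To make the Search step concrete, here is a minimal sketch of how a distance becomes the relevance score used in this step's `search` method (`1 / (1 + distance)`). The toy 3-dimensional vectors and the helper names are assumptions for illustration; real embeddings come from the model and have 384 dimensions. The sketch also shows that, for unit-length vectors, squared L2 distance equals `2 - 2 * cosine`.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def distance_to_score(distance: float) -> float:
    """Map a distance to (0, 1]: distance 0 gives 1.0, large distances approach 0."""
    return 1 / (1 + distance)

# Toy unit vectors standing in for real embeddings.
query = [1.0, 0.0, 0.0]
close_doc = [0.8, 0.6, 0.0]
far_doc = [0.0, 0.0, 1.0]

for name, doc in [("close", close_doc), ("far", far_doc)]:
    cos = cosine_similarity(query, doc)
    dist = squared_l2(query, doc)
    print(f"{name}: cosine={cos:.2f}, squared_l2={dist:.2f}, score={distance_to_score(dist):.3f}")
```

The closer document gets the higher score, which is the ordering the retrieval step relies on.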


In [None]:
from dataclasses import dataclass
from typing import Dict, List
from sentence_transformers import SentenceTransformer
import chromadb

@dataclass
class SearchResult:
    """Represents a search result with content and score."""
    content: str
    metadata: Dict
    score: float

class GhanaCensusRAG:
    """RAG Pipeline for Ghana Census Data - Completely FREE!"""

    def __init__(self, embedding_model="all-MiniLM-L6-v2"):
        print("🔧 Initializing RAG Pipeline (100% FREE - No API needed!)")

        # Load embedding model
        print(f"   📥 Loading embedding model: {embedding_model}")
        self.embedding_model = SentenceTransformer(embedding_model)

        # Setup ChromaDB
        print("   💾 Setting up ChromaDB...")
        self.chroma_client = chromadb.Client()
        self.collection = self.chroma_client.create_collection(
            name="ghana_census",
            metadata={"description": "Ghana 2021 Census Data"}
        )

        print("✅ RAG Pipeline initialized!\n")

    def add_documents(self, documents: List[Dict]):
        """Add documents to vector database."""
        print(f"📊 Processing {len(documents)} documents...")

        ids = [doc["id"] for doc in documents]
        contents = [doc["content"] for doc in documents]
        metadatas = [doc["metadata"] for doc in documents]

        # Generate embeddings
        print("   🧮 Generating embeddings...")
        embeddings = self.embedding_model.encode(contents, show_progress_bar=True)

        # Store in ChromaDB
        print("   💾 Storing in vector database...")
        self.collection.add(
            ids=ids,
            embeddings=embeddings.tolist(),
            documents=contents,
            metadatas=metadatas
        )

        print(f"✅ Added {len(documents)} documents to vector store\n")

    def search(self, query: str, n_results: int = 3) -> List[SearchResult]:
        """Semantic search over documents."""
        # Generate query embedding
        query_embedding = self.embedding_model.encode([query])[0]

        # Search ChromaDB
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=n_results,
            include=["documents", "metadatas", "distances"]
        )

        # Convert to SearchResult objects
        search_results = []
        if results["documents"] and results["documents"][0]:
            for i, doc in enumerate(results["documents"][0]):
                distance = results["distances"][0][i]
                score = 1 / (1 + distance)

                search_results.append(SearchResult(
                    content=doc,
                    metadata=results["metadatas"][0][i],
                    score=score
                ))

        return search_results

    def generate_answer(self, query: str, context_results: List[SearchResult]) -> str:
        """Generate answer using retrieved context (no API needed!)."""
        if not context_results:
            return "I couldn't find relevant information in the census data."

        top_result = context_results[0]
        content = top_result.content
        region = top_result.metadata.get('region', 'Ghana')
        topic = top_result.metadata.get('topic', 'Census Data')

        answer_parts = [
            "Based on the Ghana 2021 Census data, here's what I found:\n",
            content if len(content) <= 300 else f"{content[:300]}...",
            f"\n\n[Source: {region} - {topic}]"
        ]

        if len(context_results) > 1:
            answer_parts.append(f"\n\nAdditional information from {len(context_results)} census sources.")

        return "".join(answer_parts)

    def query(self, question: str, n_context: int = 3) -> Dict:
        """Full RAG pipeline: search + generate answer."""
        search_results = self.search(question, n_results=n_context)
        answer = self.generate_answer(question, search_results)

        sources = [
            {
                "region": r.metadata.get("region"),
                "topic": r.metadata.get("topic"),
                "relevance_score": round(r.score, 3),
                "content": r.content[:200] + "..." if len(r.content) > 200 else r.content
            }
            for r in search_results
        ]

        return {
            "query": question,
            "answer": answer,
            "sources": sources
        }

# Initialize the RAG system
rag = GhanaCensusRAG()

---

## 📥 Step 4: Load Data

Add documents to the vector database. Each document gets embedded and indexed for fast retrieval.

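**Key Insight:** Indexing pays the embedding cost once, up front. ChromaDB then answers each query from an approximate nearest-neighbor (HNSW) index, so lookups stay fast as the collection grows well beyond this sample.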


In [None]:
# Add documents to RAG system
rag.add_documents(GHANA_CENSUS_DATA)

print("✅ Vector database is ready!")
print(f"   Total documents: {rag.collection.count()}")
print(f"   Embedding dimensions: {rag.embedding_model.get_sentence_embedding_dimension()}")
print(f"   Model: all-MiniLM-L6-v2")

---

## 🔍 Step 5: Test Search

Verify semantic search works correctly before building the full pipeline.

**Key Insight:** Always test with edge cases - even nonsense queries return results, but with low scores.


In [None]:
# Test semantic search
test_query = "What is the population of Ghana?"

print(f"🔍 Query: {test_query}\n")
print("📊 Top 3 most relevant documents:\n")

results = rag.search(test_query, n_results=3)

for i, result in enumerate(results, 1):
    print(f"{'='*60}")
    print(f"Result {i}:")
    print(f"  Region: {result.metadata['region']}")
    print(f"  Topic: {result.metadata['topic']}")
    print(f"  Relevance Score: {result.score:.3f}")
    print(f"  Content: {result.content[:150]}...")
    print()

---

## 💬 Step 6: Full Q&A Pipeline

Combine retrieval with answer generation. Users ask a question; the system retrieves the most relevant documents and builds a response from them.

**Key Insight:** We use rule-based generation (FREE), but production systems often use LLMs for more natural responses.
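
For comparison with the rule-based approach, the sketch below shows one way retrieved passages might be packed into a prompt for an LLM. It is only an assumption about how such a step could look: `build_grounded_prompt` is a hypothetical helper, the prompt wording is illustrative, and no model is called.

```python
from typing import List

def build_grounded_prompt(question: str, passages: List[str]) -> str:
    """Assemble a prompt that asks a language model to answer only from the passages."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, 1))
    return (
        "Answer the question using only the numbered census passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Passages here are short excerpts in the style of the census documents above.
prompt = build_grounded_prompt(
    "What is the unemployment rate in Ghana?",
    [
        "National unemployment rate is 13.4%.",
        "Youth unemployment (15-24 years) is 19.7%.",
    ],
)
print(prompt)
```

Numbering the passages lets the generated answer cite its sources by index.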


In [None]:
# Test questions
test_questions = [
    "What is the population of Ghana?",
    "How does literacy rate compare between Greater Accra and Northern Region?",
    "What percentage of households have access to electricity?",
    "What is the unemployment rate in Ghana?"
]

for question in test_questions:
    print(f"\n{'='*70}")
    print(f"❓ Question: {question}")
    print(f"{'='*70}\n")

    result = rag.query(question)

    print(f"📝 Answer:\n{result['answer']}\n")

    print(f"📚 Sources ({len(result['sources'])} documents):")
    for i, source in enumerate(result['sources'], 1):
        print(f"   {i}. [{source['region']}] {source['topic']} (relevance: {source['relevance_score']})")

    print()

---

## 🎯 Step 7: Fine-Tune with 100 Documents

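This step turns each document into training triplets: a query, a matching (positive) passage, and a non-matching (negative) passage. The cell below applies up to two query templates per document, so it produces roughly twice as many triplets as documents, and `num_pairs=300` acts only as an upper bound. The 10-document sample from Step 2 therefore yields about 20 triplets; reaching **~300** requires the full 100-document set and all three templates per topic.

**Key Insight:** More training data generally means better domain adaptation. With enough coverage, the model can learn:
- Regional names and their associations (Ghana has 16 regions)
- Topic-specific vocabulary (oil/gas, mining, agriculture)
- Ghanaian statistical terminology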

### Training Pairs Generated:
| Document Type | Example Query | Positive Match |
|--------------|---------------|----------------|
| Population | "Kumasi population" | Ashanti Region demographics |
| Literacy | "female education Northern" | Northern Region literacy stats |
| Economy | "Obuasi gold mine" | Ashanti mining economy |
| New Regions | "Savannah Region" | 2019 newly created region data |


In [None]:
from sentence_transformers import InputExample, losses
from torch.utils.data import DataLoader
import random

# ============================================
# GENERATE TRAINING PAIRS FROM 100 DOCUMENTS
# ============================================
# More data = Better fine-tuning!

def generate_training_pairs(documents, num_pairs=300):
    """Generate triplets: (query, positive, negative) from documents."""
    training_data = []
    
    # Query templates for different topics
    query_templates = {
        "Population": ["population of {region}", "how many people in {region}", "{region} population count"],
        "Literacy": ["literacy rate {region}", "{region} education level", "reading writing {region}"],
        "Electricity": ["electricity access {region}", "power supply {region}", "{region} electrification"],
        "Employment": ["unemployment {region}", "jobs in {region}", "{region} employment rate"],
        "Water": ["water access {region}", "drinking water {region}", "{region} water supply"],
        "Housing": ["housing {region}", "household size {region}", "{region} homes"],
        "Health": ["health insurance {region}", "doctors {region}", "{region} healthcare"],
        "Demographics": ["demographics {region}", "{region} growth rate", "population density {region}"],
        "Economy": ["economy {region}", "{region} economic activities", "industries {region}"],
        "Education": ["schools {region}", "university {region}", "{region} education"],
        "Technology": ["internet {region}", "mobile phones {region}", "{region} technology"],
    }
    
    for doc in documents:
        region = doc["metadata"].get("region", "Ghana")
        topic = doc["metadata"].get("topic", "General")
        content = doc["content"]
        
        # Generate queries for this document
        if topic in query_templates:
            templates = query_templates[topic]
        else:
            templates = [f"{topic} {region}", f"{region} {topic} statistics"]
        
        for template in templates[:2]:  # Use 2 templates per doc
            query = template.format(region=region)
            
            # Find a negative (different topic or region)
            negatives = [d for d in documents 
                        if d["metadata"].get("topic") != topic 
                        or d["metadata"].get("region") != region]
            
            if negatives:
                negative = random.choice(negatives)
                training_data.append({
                    "query": query,
                    "positive": content[:500],  # Truncate long content
                    "negative": negative["content"][:500]
                })
    
    # Shuffle and limit
    random.shuffle(training_data)
    return training_data[:num_pairs]

print(f"📊 Generating training pairs from {len(GHANA_CENSUS_DATA)} documents...")
training_data = generate_training_pairs(GHANA_CENSUS_DATA, num_pairs=300)
print(f"✅ Generated {len(training_data)} training triplets")

# Show sample
print("\n📝 Sample training pair:")
sample = training_data[0]
print(f"   Query: {sample['query']}")
print(f"   Positive: {sample['positive'][:100]}...")
print(f"   Negative: {sample['negative'][:100]}...")


In [None]:
import os
os.environ['WANDB_DISABLED'] = 'true'  # Disable W&B logging (no account needed!)

# Prepare training data
train_examples = [
    InputExample(texts=[item["query"], item["positive"], item["negative"]])
    for item in training_data
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=4)

# Initialize model for fine-tuning
model = SentenceTransformer("all-MiniLM-L6-v2")

# Setup triplet loss
train_loss = losses.TripletLoss(model=model)

print("🏋️ Fine-tuning model on Ghana census data...")
print("   This will take ~1-2 minutes\n")

# Fine-tune (3 epochs is enough for demo)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=10,
    show_progress_bar=True
)

print("\n✅ Fine-tuning complete!")
print("   Model is now optimized for Ghana census queries")

In [None]:
# Save the fine-tuned model for Streamlit app
MODEL_SAVE_PATH = '/content/finetuned_model'
model.save(MODEL_SAVE_PATH)
print(f"✅ Fine-tuned model saved to: {MODEL_SAVE_PATH}")

---

## 📊 Step 8: Compare Performance

Measure improvement from fine-tuning using relevance scores.

**Key Insight:** Don't just compare numbers - verify the RIGHT documents are being retrieved.
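
One way to check that the right document is retrieved, rather than only a high score, is to look at the rank of the first result whose metadata matches what you expected. The sketch below does this on two hypothetical ranked lists. They are made up for illustration and are not output from either model; the helper names are assumptions as well.

```python
from typing import Dict, List, Optional

def first_relevant_rank(results: List[Dict], expected: Dict) -> Optional[int]:
    """1-based rank of the first result whose metadata matches every expected field."""
    for rank, metadata in enumerate(results, 1):
        if all(metadata.get(key) == value for key, value in expected.items()):
            return rank
    return None

def mean_reciprocal_rank(ranks: List[Optional[int]]) -> float:
    """Average of 1/rank, counting a missing result as 0."""
    return sum((1.0 / r) if r else 0.0 for r in ranks) / len(ranks)

# Hypothetical rankings (metadata only) for the query "Tell me about literacy in Accra".
ranking_a = [
    {"region": "National", "topic": "Literacy"},
    {"region": "Greater Accra", "topic": "Literacy"},
    {"region": "Northern", "topic": "Literacy"},
]
ranking_b = [
    {"region": "Greater Accra", "topic": "Literacy"},
    {"region": "National", "topic": "Literacy"},
    {"region": "Northern", "topic": "Literacy"},
]
expected = {"region": "Greater Accra", "topic": "Literacy"}

rank_a = first_relevant_rank(ranking_a, expected)
rank_b = first_relevant_rank(ranking_b, expected)
print(rank_a, rank_b)                          # 2 1
print(mean_reciprocal_rank([rank_a, rank_b]))  # (1/2 + 1/1) / 2 = 0.75
```

Matching on both region and topic is stricter than matching on topic alone, so it catches cases where the right topic is returned for the wrong region.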


In [None]:
# Create RAG with fine-tuned model
print("🔧 Creating RAG system with fine-tuned model...\n")

class GhanaCensusRAGFinetuned(GhanaCensusRAG):
    def __init__(self, model):
        print("🔧 Initializing RAG with fine-tuned model")
        self.embedding_model = model

        self.chroma_client = chromadb.Client()
        self.collection = self.chroma_client.create_collection(
            name="ghana_census_finetuned",
            metadata={"description": "Ghana Census - Fine-tuned"}
        )
        print("✅ Fine-tuned RAG initialized!\n")

rag_finetuned = GhanaCensusRAGFinetuned(model)
rag_finetuned.add_documents(GHANA_CENSUS_DATA)

# Compare both models
test_query = "Tell me about literacy in Accra"

print(f"\n{'='*70}")
print(f"🔍 Test Query: {test_query}")
print(f"{'='*70}\n")

print("📊 BASE MODEL Results:")
base_results = rag.search(test_query, n_results=3)
for i, r in enumerate(base_results, 1):
    print(f"   {i}. [{r.metadata['region']}] {r.metadata['topic']} - Score: {r.score:.3f}")

print("\n📊 FINE-TUNED MODEL Results:")
ft_results = rag_finetuned.search(test_query, n_results=3)
for i, r in enumerate(ft_results, 1):
    print(f"   {i}. [{r.metadata['region']}] {r.metadata['topic']} - Score: {r.score:.3f}")

print("\n💡 Check which documents rank first, not only the scores: scores from different embedding models are not directly comparable.")

---

## 🎨 Step 9: Interactive Interface

Create a user-friendly Q&A interface with sample questions and source citations.

**Key Insight:** Always show sources - users need to verify AI-generated answers.


In [None]:
def ask_question(question: str, use_finetuned: bool = True):
    """Interactive Q&A function"""
    model_to_use = rag_finetuned if use_finetuned else rag
    model_name = "Fine-tuned" if use_finetuned else "Base"

    print(f"\n{'='*70}")
    print(f"❓ Your Question: {question}")
    print(f"🤖 Using: {model_name} Model")
    print(f"{'='*70}\n")

    result = model_to_use.query(question)

    print(f"📝 Answer:\n{result['answer']}\n")

    print(f"📚 Sources:")
    for i, source in enumerate(result['sources'], 1):
        print(f"   {i}. [{source['region']}] {source['topic']} (relevance: {source['relevance_score']})")

# Try it out!
ask_question("What is the unemployment rate in Ghana?")
ask_question("Compare internet usage between urban and rural areas")
ask_question("Tell me about water access in Ghana")

---

## 🎯 YOUR TURN: Ask Your Own Questions!

Modify the cell below to ask your own questions about Ghana Census data!

In [None]:
# YOUR CUSTOM QUESTION HERE!
my_question = "What is the population of Greater Accra?"  # Change this!

ask_question(my_question, use_finetuned=True)

In [None]:
# Evaluation function
def evaluate_model(rag_system, test_queries):
    """Evaluate retrieval performance"""
    correct_at_1 = 0
    correct_at_3 = 0
    reciprocal_ranks = []

    for query_info in test_queries:
        query = query_info["query"]
        expected_topic = query_info["expected_topic"]

        results = rag_system.search(query, n_results=5)

        # Find rank of first relevant result
        rank = None
        for i, result in enumerate(results, 1):
            if result.metadata["topic"] == expected_topic:
                rank = i
                break

        if rank:
            if rank == 1:
                correct_at_1 += 1
            if rank <= 3:
                correct_at_3 += 1
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)

    n = len(test_queries)
    return {
        "accuracy@1": correct_at_1 / n,
        "accuracy@3": correct_at_3 / n,
        "mrr": sum(reciprocal_ranks) / n
    }

# Test queries with expected topics
eval_queries = [
    {"query": "What is Ghana's population?", "expected_topic": "Population"},
    {"query": "How many people can read in Accra?", "expected_topic": "Literacy"},
    {"query": "Do households have power?", "expected_topic": "Electricity"},
    {"query": "What is the jobless rate?", "expected_topic": "Employment"},
]

print("📊 Evaluating Base Model...")
base_metrics = evaluate_model(rag, eval_queries)

print("📊 Evaluating Fine-tuned Model...")
ft_metrics = evaluate_model(rag_finetuned, eval_queries)

print(f"\n{'='*70}")
print("📈 EVALUATION RESULTS")
print(f"{'='*70}\n")
print(f"{'Metric':<20} {'Base Model':<15} {'Fine-tuned':<15} {'Improvement'}")
print(f"{'-'*70}")

for metric in ["accuracy@1", "accuracy@3", "mrr"]:
    base_val = base_metrics[metric]
    ft_val = ft_metrics[metric]
    # Relative change is undefined when the base value is 0, so report "n/a" instead of 0%.
    improvement = f"{(ft_val - base_val) / base_val * 100:+.1f}%" if base_val > 0 else "n/a"

    print(f"{metric:<20} {base_val:<15.1%} {ft_val:<15.1%} {improvement}")

print(f"\n{'='*70}")

---

## 🛡️ Out-of-Scope Question Handling

**Problem:** What happens when someone asks "What is the population of America?"

Without handling, RAG returns the *closest* match, which is misleading.

### Our Solution: Relevance Threshold

| Query | Match Score | Response |
|-------|-------------|----------|
| "Ghana population" | 0.72 | ✅ Returns census data |
| "America population" | 0.28 | ⚠️ Polite decline |

```python
if best_score < 0.35:
    return "I specialize in Ghana census data..."
```

**Key Insight:** Production systems must gracefully handle what they *don't* know.
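
A slightly fuller sketch of the threshold check follows, under the assumption that retrieval returns `(score, passage)` pairs. The function name `answer_or_decline` and the pair format are hypothetical; the 0.35 cut-off and the two example scores come from the table above, and in practice the threshold should be tuned on real queries.

```python
from typing import List, Tuple

RELEVANCE_THRESHOLD = 0.35  # same cut-off as the snippet above

DECLINE_MESSAGE = (
    "I only have information about Ghana's census data. "
    "I can help with questions about Ghana's population, "
    "literacy rates, electricity access, and more."
)

def answer_or_decline(scored_passages: List[Tuple[float, str]],
                      threshold: float = RELEVANCE_THRESHOLD) -> str:
    """Return the best passage, or a polite decline when nothing clears the threshold."""
    if not scored_passages:
        return DECLINE_MESSAGE
    best_score, best_passage = max(scored_passages, key=lambda pair: pair[0])
    if best_score < threshold:
        return DECLINE_MESSAGE
    return best_passage

passage = "Ghana's total population is 30,832,019."
print(answer_or_decline([(0.72, passage)]))  # in scope: returns the passage
print(answer_or_decline([(0.28, passage)]))  # out of scope: returns the decline message
```

Checking only the best score keeps the rule simple: if even the closest match is weak, nothing in the collection is likely to answer the question.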

---


In [None]:
!pip install streamlit

In [None]:
%%writefile streamlit_app.py
import streamlit as st
import os
from dataclasses import dataclass
from typing import List, Dict
from sentence_transformers import SentenceTransformer
import chromadb

# Page config
st.set_page_config(page_title="Ghana Census Q&A", page_icon="🇬🇭", layout="wide")

# ============================================
# CSS - Force light mode with visible text
# ============================================
st.markdown("""
<style>
    /* Force light background everywhere */
    .stApp, .main, [data-testid="stAppViewContainer"], [data-testid="stHeader"] {
        background-color: #ffffff !important;
    }
    
    /* Dark text for all content */
    .stMarkdown, .stMarkdown p, .stMarkdown h1, .stMarkdown h2, .stMarkdown h3,
    .stMarkdown li, .stMarkdown span, .stMarkdown strong, .stMarkdown em,
    p, span, li, label, div {
        color: #1a1a1a !important;
    }
    
    /* Title */
    h1 {
        color: #006B3F !important;
        text-align: center;
    }
    
    /* Chat messages - light background, dark text */
    [data-testid="stChatMessage"] {
        background-color: #f5f5f5 !important;
        border: 1px solid #ddd !important;
        border-radius: 10px !important;
        padding: 15px !important;
        margin: 10px 0 !important;
    }
    
    [data-testid="stChatMessage"] p,
    [data-testid="stChatMessage"] span,
    [data-testid="stChatMessage"] div,
    .stChatMessage p {
        color: #1a1a1a !important;
    }
    
    /* User message bubble */
    [data-testid="stChatMessage"][data-testid*="user"] {
        background-color: #e3f2fd !important;
    }
    
    /* Assistant message bubble */
    [data-testid="stChatMessage"][data-testid*="assistant"] {
        background-color: #f5f5f5 !important;
    }
    
    /* Info box */
    .stAlert, [data-testid="stAlert"] {
        background-color: #e8f5e9 !important;
        color: #1a1a1a !important;
    }
    .stAlert p, [data-testid="stAlert"] p {
        color: #1a1a1a !important;
    }
    
    /* Sidebar */
    [data-testid="stSidebar"], [data-testid="stSidebar"] > div {
        background-color: #f8f9fa !important;
    }
    [data-testid="stSidebar"] p, [data-testid="stSidebar"] span,
    [data-testid="stSidebar"] label, [data-testid="stSidebar"] h1,
    [data-testid="stSidebar"] h2, [data-testid="stSidebar"] h3 {
        color: #1a1a1a !important;
    }
    
    /* Metrics */
    [data-testid="stMetricValue"] {
        color: #006B3F !important;
    }
    [data-testid="stMetricLabel"] {
        color: #333 !important;
    }
    
    /* Expander */
    .streamlit-expanderHeader, [data-testid="stExpander"] summary {
        background-color: #f0f0f0 !important;
        color: #1a1a1a !important;
    }
    .streamlit-expanderContent, [data-testid="stExpander"] > div {
        background-color: #fafafa !important;
        color: #1a1a1a !important;
    }
    
    /* Input */
    .stTextInput input, .stChatInput textarea, [data-testid="stChatInput"] textarea {
        background-color: #ffffff !important;
        color: #1a1a1a !important;
        border: 1px solid #ccc !important;
    }
    
    /* Buttons */
    .stButton button {
        background-color: #006B3F !important;
        color: white !important;
    }
</style>
""", unsafe_allow_html=True)

# ============================================
# DATA
# ============================================
GHANA_CENSUS_DATA = [
    {"id": "nat_001", "content": "Ghana's total population according to the 2021 Population and Housing Census is 30,832,019. This represents an increase of 6,173,196 (25.0%) over the 2010 census population of 24,658,823. The intercensal growth rate between 2010 and 2021 is 2.1% per annum.", "metadata": {"region": "National", "topic": "Population", "year": 2021}},
    {"id": "nat_002", "content": "The national sex ratio is 97 males per 100 females. The male population is 15,177,124 (49.2%) while females are 15,654,895 (50.8%).", "metadata": {"region": "National", "topic": "Demographics", "year": 2021}},
    {"id": "nat_003", "content": "Ghana's population density is 129 persons per square kilometer in 2021. Greater Accra has the highest density at 1,578 persons/km².", "metadata": {"region": "National", "topic": "Density", "year": 2021}},
    {"id": "nat_004", "content": "The national literacy rate is 76.4% for persons aged 15 years and older. Male literacy stands at 82.5% while female literacy is 70.8%.", "metadata": {"region": "National", "topic": "Literacy", "year": 2021}},
    {"id": "nat_005", "content": "National unemployment rate is 13.4%. Youth unemployment (15-24 years) is significantly higher at 19.7%.", "metadata": {"region": "National", "topic": "Employment", "year": 2021}},
    {"id": "nat_006", "content": "Access to electricity nationally is 84.3% of households. Urban areas have 93.1% access compared to 70.8% in rural areas.", "metadata": {"region": "National", "topic": "Electricity", "year": 2021}},
    {"id": "nat_007", "content": "Safe drinking water access is 87.7% nationally. Pipe-borne water serves 43.2% of households.", "metadata": {"region": "National", "topic": "Water", "year": 2021}},
    {"id": "nat_008", "content": "Average household size in Ghana is 3.6 persons, down from 4.4 in 2010.", "metadata": {"region": "National", "topic": "Housing", "year": 2021}},
    {"id": "nat_009", "content": "Total fertility rate in Ghana is 3.9 children per woman. Rural areas have higher fertility (4.7) than urban areas (3.3).", "metadata": {"region": "National", "topic": "Fertility", "year": 2021}},
    {"id": "nat_010", "content": "Life expectancy at birth is 64.1 years nationally. Females have higher life expectancy (65.8 years) than males (62.4 years).", "metadata": {"region": "National", "topic": "Health", "year": 2021}},
    {"id": "nat_011", "content": "Internet usage nationally is 58.5% among persons aged 12 and above. Mobile phone ownership is 75.3%.", "metadata": {"region": "National", "topic": "Technology", "year": 2021}},
    {"id": "nat_012", "content": "Agriculture employs 38.3% of the working population nationally. Services sector employs 43.2%.", "metadata": {"region": "National", "topic": "Employment", "year": 2021}},
    {"id": "nat_013", "content": "Housing ownership: 47.2% of households own their dwelling, 36.8% rent.", "metadata": {"region": "National", "topic": "Housing", "year": 2021}},
    {"id": "nat_014", "content": "Ghana has 16 administrative regions as of 2019. Six new regions were created.", "metadata": {"region": "National", "topic": "Administration", "year": 2021}},
    {"id": "nat_015", "content": "Ghana's urbanization rate is 56.7%, up from 50.9% in 2010.", "metadata": {"region": "National", "topic": "Urbanization", "year": 2021}},
    {"id": "gaa_001", "content": "Greater Accra Region has a population of 5,446,237, making it the most populous region with 17.7% of Ghana's population.", "metadata": {"region": "Greater Accra", "topic": "Population", "year": 2021}},
    {"id": "gaa_002", "content": "Greater Accra literacy rate is 87.8% for persons 15 years and older, the highest in Ghana.", "metadata": {"region": "Greater Accra", "topic": "Literacy", "year": 2021}},
    {"id": "gaa_003", "content": "Unemployment in Greater Accra is 15.2%, higher than the national average due to urbanization.", "metadata": {"region": "Greater Accra", "topic": "Employment", "year": 2021}},
    {"id": "gaa_004", "content": "Greater Accra has 94.2% electricity access, the highest in Ghana.", "metadata": {"region": "Greater Accra", "topic": "Electricity", "year": 2021}},
    {"id": "gaa_005", "content": "Safe water access in Greater Accra is 95.1%. Pipe-borne water serves 64.2% of households.", "metadata": {"region": "Greater Accra", "topic": "Water", "year": 2021}},
    {"id": "gaa_006", "content": "Average household size in Greater Accra is 3.2 persons, the smallest in Ghana.", "metadata": {"region": "Greater Accra", "topic": "Housing", "year": 2021}},
    {"id": "gaa_007", "content": "Accra Metropolitan Area has a population of 2,076,119. Tema Metropolitan has 374,893 residents.", "metadata": {"region": "Greater Accra", "topic": "Population", "year": 2021}},
    {"id": "gaa_008", "content": "Internet penetration in Greater Accra is 78.4%, the highest in Ghana. Smartphone ownership is 68.2%.", "metadata": {"region": "Greater Accra", "topic": "Technology", "year": 2021}},
    {"id": "gaa_009", "content": "Greater Accra has the highest tertiary education enrollment at 18.3%.", "metadata": {"region": "Greater Accra", "topic": "Education", "year": 2021}},
    {"id": "gaa_010", "content": "Housing in Greater Accra: Only 28.4% own their homes (lowest in Ghana). Average rent is GHS 850/month.", "metadata": {"region": "Greater Accra", "topic": "Housing", "year": 2021}},
    {"id": "gaa_011", "content": "Population density in Accra Metropolitan is 5,689 persons/km², the highest in Ghana.", "metadata": {"region": "Greater Accra", "topic": "Density", "year": 2021}},
    {"id": "gaa_012", "content": "Health insurance coverage in Greater Accra is 62.3%. There are 0.45 doctors per 1,000 population.", "metadata": {"region": "Greater Accra", "topic": "Health", "year": 2021}},
    {"id": "ash_001", "content": "Ashanti Region has a population of 5,432,485, the second most populous region. Kumasi has 3,630,326 residents.", "metadata": {"region": "Ashanti", "topic": "Population", "year": 2021}},
    {"id": "ash_002", "content": "Kumasi is the cultural capital of Ghana and seat of the Asantehene.", "metadata": {"region": "Ashanti", "topic": "Culture", "year": 2021}},
    {"id": "ash_003", "content": "Ashanti Region literacy rate is 79.3% for persons 15 years and older.", "metadata": {"region": "Ashanti", "topic": "Literacy", "year": 2021}},
    {"id": "ash_004", "content": "Electricity access in Ashanti is 89.1%, second highest after Greater Accra.", "metadata": {"region": "Ashanti", "topic": "Electricity", "year": 2021}},
    {"id": "ash_005", "content": "Ashanti has a strong economic base in mining (gold), agriculture (cocoa), and trade. Obuasi gold mine is one of Africa's largest.", "metadata": {"region": "Ashanti", "topic": "Economy", "year": 2021}},
    {"id": "ash_006", "content": "Unemployment in Ashanti is 11.2%, below national average.", "metadata": {"region": "Ashanti", "topic": "Employment", "year": 2021}},
    {"id": "ash_007", "content": "Average household size in Ashanti is 3.8 persons. Extended family households are common at 48.2%.", "metadata": {"region": "Ashanti", "topic": "Housing", "year": 2021}},
    {"id": "ash_008", "content": "Safe water access in Ashanti is 91.2%. Boreholes serve 35.4% of households.", "metadata": {"region": "Ashanti", "topic": "Water", "year": 2021}},
    {"id": "ash_009", "content": "KNUST in Kumasi has 71,000 students. Ashanti has the second highest tertiary enrollment.", "metadata": {"region": "Ashanti", "topic": "Education", "year": 2021}},
    {"id": "ash_010", "content": "Komfo Anokye Teaching Hospital serves Ashanti with 1,000 beds. There are 0.32 doctors per 1,000.", "metadata": {"region": "Ashanti", "topic": "Health", "year": 2021}},
    {"id": "ash_011", "content": "Internet usage in Ashanti is 62.3%. Mobile money transactions are 45% higher than national average.", "metadata": {"region": "Ashanti", "topic": "Technology", "year": 2021}},
    {"id": "ash_012", "content": "Population growth rate in Ashanti is 2.3% per annum. Urban population is 61.2%.", "metadata": {"region": "Ashanti", "topic": "Demographics", "year": 2021}},
    {"id": "nor_001", "content": "Northern Region has a population of 2,310,939. Tamale Metropolitan has 374,844 residents.", "metadata": {"region": "Northern", "topic": "Population", "year": 2021}},
    {"id": "nor_002", "content": "Northern Region literacy rate is 44.2%, the lowest in Ghana. Female literacy is only 32.8%.", "metadata": {"region": "Northern", "topic": "Literacy", "year": 2021}},
    {"id": "nor_003", "content": "Electricity access in Northern Region is 67.3%. Solar home systems serve 12.3% of off-grid households.", "metadata": {"region": "Northern", "topic": "Electricity", "year": 2021}},
    {"id": "nor_004", "content": "Agriculture dominates Northern economy, employing 72.4% of workers. Main crops are rice, maize, yam, and shea nuts.", "metadata": {"region": "Northern", "topic": "Economy", "year": 2021}},
    {"id": "nor_005", "content": "Unemployment in Northern Region is 8.4%, but underemployment is high at 34.2%.", "metadata": {"region": "Northern", "topic": "Employment", "year": 2021}},
    {"id": "nor_006", "content": "Average household size in Northern Region is 5.8, the largest in Ghana. Polygamous households constitute 23.4%.", "metadata": {"region": "Northern", "topic": "Housing", "year": 2021}},
    {"id": "nor_007", "content": "Safe water access in Northern Region is 72.4%. Boreholes are the main source for 52.3% of households.", "metadata": {"region": "Northern", "topic": "Water", "year": 2021}},
    {"id": "nor_008", "content": "School attendance (6-14 years) in Northern Region is 71.2%, lowest in Ghana. Girls at 66.1%.", "metadata": {"region": "Northern", "topic": "Education", "year": 2021}},
    {"id": "nor_009", "content": "Health insurance coverage in Northern Region is 43.2%. Only 0.12 doctors per 1,000 population.", "metadata": {"region": "Northern", "topic": "Health", "year": 2021}},
    {"id": "nor_010", "content": "Internet usage in Northern Region is 32.4%, lowest in Ghana. Mobile phone ownership is 54.2%.", "metadata": {"region": "Northern", "topic": "Technology", "year": 2021}},
    {"id": "nor_011", "content": "Fertility rate in Northern Region is 5.2 children per woman, highest in Ghana.", "metadata": {"region": "Northern", "topic": "Demographics", "year": 2021}},
    {"id": "nor_012", "content": "Population density in Northern Region is 35 persons/km², among the lowest in Ghana.", "metadata": {"region": "Northern", "topic": "Density", "year": 2021}},
    {"id": "wes_001", "content": "Western Region has a population of 2,060,585. Sekondi-Takoradi has 445,205 residents. It's Ghana's oil and gas hub.", "metadata": {"region": "Western", "topic": "Population", "year": 2021}},
    {"id": "wes_002", "content": "Western Region literacy rate is 74.2%. Oil industry has attracted educated migrants.", "metadata": {"region": "Western", "topic": "Literacy", "year": 2021}},
    {"id": "wes_003", "content": "Electricity access in Western Region is 83.4%. Oil and gas operations have 100% reliable power.", "metadata": {"region": "Western", "topic": "Electricity", "year": 2021}},
    {"id": "wes_004", "content": "Western Region economy is driven by oil/gas, cocoa, and gold mining. Jubilee oil field produces 100,000 barrels/day.", "metadata": {"region": "Western", "topic": "Economy", "year": 2021}},
    {"id": "wes_005", "content": "Unemployment in Western Region is 10.8%. Oil industry directly employs 8,500 workers.", "metadata": {"region": "Western", "topic": "Employment", "year": 2021}},
    {"id": "wes_006", "content": "Average household size in Western Region is 3.9 persons. 15,000 new housing units built since 2015.", "metadata": {"region": "Western", "topic": "Housing", "year": 2021}},
    {"id": "wes_007", "content": "Safe water access in Western Region is 79.8%. Mining communities face water quality concerns.", "metadata": {"region": "Western", "topic": "Water", "year": 2021}},
    {"id": "wes_008", "content": "Takoradi Technical University offers petroleum and mining programs. Enrollment growing at 8% annually.", "metadata": {"region": "Western", "topic": "Education", "year": 2021}},
    {"id": "wes_009", "content": "New Western regional hospital opened in 2020 with 250 beds. Doctor ratio is 0.28 per 1,000.", "metadata": {"region": "Western", "topic": "Health", "year": 2021}},
    {"id": "eas_001", "content": "Eastern Region has a population of 2,937,524. Koforidua is the capital with 183,727 residents.", "metadata": {"region": "Eastern", "topic": "Population", "year": 2021}},
    {"id": "eas_002", "content": "Eastern Region literacy rate is 75.8%. The region has several teacher training colleges.", "metadata": {"region": "Eastern", "topic": "Literacy", "year": 2021}},
    {"id": "eas_003", "content": "Electricity access in Eastern Region is 85.6%. Akosombo Dam provides hydroelectric power.", "metadata": {"region": "Eastern", "topic": "Electricity", "year": 2021}},
    {"id": "eas_004", "content": "Eastern Region is a major cocoa producer, contributing 18% of national output. Aburi Botanical Gardens attracts tourists.", "metadata": {"region": "Eastern", "topic": "Economy", "year": 2021}},
    {"id": "eas_005", "content": "Unemployment in Eastern Region is 12.1%. Agriculture employs 48.2% of workers.", "metadata": {"region": "Eastern", "topic": "Employment", "year": 2021}},
    {"id": "eas_006", "content": "Average household size in Eastern Region is 3.7 persons. Rural housing uses traditional materials in 45.2% of cases.", "metadata": {"region": "Eastern", "topic": "Housing", "year": 2021}},
    {"id": "eas_007", "content": "Safe water access in Eastern Region is 82.3%. Densu River provides water to parts of the region.", "metadata": {"region": "Eastern", "topic": "Water", "year": 2021}},
    {"id": "eas_008", "content": "Boti Falls, Akosombo Dam, and Aburi Gardens make Eastern Region a tourism destination.", "metadata": {"region": "Eastern", "topic": "Tourism", "year": 2021}},
    {"id": "cen_001", "content": "Central Region has a population of 2,859,821. Cape Coast is the capital with 189,925 residents.", "metadata": {"region": "Central", "topic": "Population", "year": 2021}},
    {"id": "cen_002", "content": "Central Region literacy rate is 76.1%. University of Cape Coast is a major institution.", "metadata": {"region": "Central", "topic": "Literacy", "year": 2021}},
    {"id": "cen_003", "content": "Electricity access in Central Region is 82.4%. Fishing communities have 73.2% access.", "metadata": {"region": "Central", "topic": "Electricity", "year": 2021}},
    {"id": "cen_004", "content": "Central Region economy includes fishing (32% of catch) and tourism. Cape Coast and Elmina castles are UNESCO World Heritage sites.", "metadata": {"region": "Central", "topic": "Economy", "year": 2021}},
    {"id": "cen_005", "content": "Unemployment in Central Region is 13.8%. Tourism sector employs 8,200 directly.", "metadata": {"region": "Central", "topic": "Employment", "year": 2021}},
    {"id": "cen_006", "content": "Average household size in Central Region is 3.8 persons. Fishing communities average 4.6 persons.", "metadata": {"region": "Central", "topic": "Housing", "year": 2021}},
    {"id": "cen_007", "content": "Safe water access in Central Region is 78.4%. Coastal communities depend on boreholes.", "metadata": {"region": "Central", "topic": "Water", "year": 2021}},
    {"id": "cen_008", "content": "University of Cape Coast has 74,000 students. The region produces 22% of Ghana's trained teachers.", "metadata": {"region": "Central", "topic": "Education", "year": 2021}},
    {"id": "vol_001", "content": "Volta Region has a population of 1,907,679. Ho is the capital with 177,281 residents.", "metadata": {"region": "Volta", "topic": "Population", "year": 2021}},
    {"id": "vol_002", "content": "Volta Region literacy rate is 71.2%. Ho Technical University serves the region.", "metadata": {"region": "Volta", "topic": "Literacy", "year": 2021}},
    {"id": "vol_003", "content": "Electricity access in Volta Region is 76.8%. Volta Lake enables fishing and transportation.", "metadata": {"region": "Volta", "topic": "Electricity", "year": 2021}},
    {"id": "vol_004", "content": "Volta Region is known for Kente weaving, especially in Kpetoe. Mount Afadjato is Ghana's highest peak.", "metadata": {"region": "Volta", "topic": "Culture", "year": 2021}},
    {"id": "vol_005", "content": "Unemployment in Volta Region is 11.8%. Cross-border trading with Togo is significant.", "metadata": {"region": "Volta", "topic": "Employment", "year": 2021}},
    {"id": "vol_006", "content": "Average household size in Volta Region is 3.6 persons. Traditional Ewe family structures are common.", "metadata": {"region": "Volta", "topic": "Housing", "year": 2021}},
    {"id": "vol_007", "content": "Safe water access in Volta Region is 74.8%. Volta Lake fishing communities face sanitation challenges.", "metadata": {"region": "Volta", "topic": "Water", "year": 2021}},
    {"id": "vol_008", "content": "Wli Waterfalls, Mount Afadjato, and Tafi Atome Monkey Sanctuary attract 45,000 tourists annually.", "metadata": {"region": "Volta", "topic": "Tourism", "year": 2021}},
    {"id": "upe_001", "content": "Upper East Region has a population of 1,301,226. Bolgatanga is the capital. Borders Burkina Faso.", "metadata": {"region": "Upper East", "topic": "Population", "year": 2021}},
    {"id": "upe_002", "content": "Upper East Region literacy rate is 48.2%. Female literacy is only 38.4%.", "metadata": {"region": "Upper East", "topic": "Literacy", "year": 2021}},
    {"id": "upe_003", "content": "Electricity access in Upper East is 58.4%. Solar mini-grids serve 8.2% of off-grid communities.", "metadata": {"region": "Upper East", "topic": "Electricity", "year": 2021}},
    {"id": "upw_001", "content": "Upper West Region has a population of 901,502, the least populous region. Wa is the capital.", "metadata": {"region": "Upper West", "topic": "Population", "year": 2021}},
    {"id": "upw_002", "content": "Upper West Region literacy rate is 46.8%. University for Development Studies serves the region.", "metadata": {"region": "Upper West", "topic": "Literacy", "year": 2021}},
    {"id": "upw_003", "content": "Safe water access in Upper West is 72.3%, heavily dependent on boreholes (68.4%).", "metadata": {"region": "Upper West", "topic": "Water", "year": 2021}},
    {"id": "sav_001", "content": "Savannah Region, created in 2019 from Northern Region, has a population of 588,152. Damongo is the capital.", "metadata": {"region": "Savannah", "topic": "Population", "year": 2021}},
    {"id": "sav_002", "content": "Mole National Park in Savannah Region is Ghana's largest wildlife reserve with elephants and 300 bird species.", "metadata": {"region": "Savannah", "topic": "Tourism", "year": 2021}},
    {"id": "noe_001", "content": "North East Region, created in 2019, has a population of 678,986. Nalerigu is the capital. Literacy rate is 41.2%.", "metadata": {"region": "North East", "topic": "Population", "year": 2021}},
    {"id": "oti_001", "content": "Oti Region, created in 2019 from Volta Region, has a population of 759,799. Dambai is the capital.", "metadata": {"region": "Oti", "topic": "Population", "year": 2021}},
    {"id": "bon_001", "content": "Bono Region has a population of 1,208,649. Sunyani is the capital with 248,496 residents.", "metadata": {"region": "Bono", "topic": "Population", "year": 2021}},
    {"id": "boe_001", "content": "Bono East Region has a population of 1,179,649. Techiman is the capital, hosting one of Ghana's largest markets.", "metadata": {"region": "Bono East", "topic": "Population", "year": 2021}},
    {"id": "aha_001", "content": "Ahafo Region, created in 2019, has a population of 564,536. Goaso is the capital. Gold mining is significant.", "metadata": {"region": "Ahafo", "topic": "Population", "year": 2021}},
    {"id": "wno_001", "content": "Western North Region has a population of 910,772. Wiawso is the capital. Cocoa farming is the main activity.", "metadata": {"region": "Western North", "topic": "Population", "year": 2021}}
]

# ============================================
# RAG CLASS
# ============================================
@dataclass
class SearchResult:
    content: str
    metadata: Dict
    score: float

class GhanaCensusRAG:
    def __init__(self, model_path="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_path)
        self.client = chromadb.Client()
        import uuid
        self.collection = self.client.create_collection(name=f"census_{uuid.uuid4().hex[:6]}")

    def add_documents(self, docs):
        self.collection.add(
            ids=[d["id"] for d in docs],
            embeddings=self.model.encode([d["content"] for d in docs]).tolist(),
            documents=[d["content"] for d in docs],
            metadatas=[d["metadata"] for d in docs]
        )

    def query(self, question, n=3, threshold=0.35):
        """Query with relevance threshold for out-of-scope detection."""
        results = self.collection.query(
            query_embeddings=[self.model.encode([question])[0].tolist()],
            n_results=n,
            include=["documents", "metadatas", "distances"]
        )
        
        if not results["documents"][0]:
            return {"answer": self._out_of_scope_response(), "sources": [], "in_scope": False}
        
        # Calculate relevance scores
        sources = []
        for idx, doc in enumerate(results["documents"][0]):
            distance = results["distances"][0][idx]
            score = 1 / (1 + distance)
            sources.append({
                "content": doc,
                "region": results["metadatas"][0][idx].get("region", ""),
                "topic": results["metadatas"][0][idx].get("topic", ""),
                "score": round(score, 3)
            })
        
        # Check if best match is above threshold
        if sources[0]["score"] < threshold:
            return {
                "answer": self._out_of_scope_response(question),
                "sources": sources,
                "in_scope": False
            }
        
        # Good match - return answer
        top = sources[0]
        # Triple-quoted f-string: a single-quoted f-string cannot span several lines
        answer = f"""**Based on the Ghana 2021 Census:**

{top['content']}

*Source: {top['region']} - {top['topic']}*"""
        return {"answer": answer, "sources": sources, "in_scope": True}
    
    def _out_of_scope_response(self, question=None):
        """Return helpful message for out-of-scope questions."""
        return """🇬🇭 **I specialize in Ghana's 2021 Census data.**

I don't have information that matches your question in my database.

**I can help you with:**
• Population statistics for all 16 regions
• Literacy rates by region and gender
• Electricity and water access
• Employment and unemployment data
• Housing and household statistics
• Health and education metrics
• Demographics and urbanization

**Try asking:**
• "What is Ghana's total population?"
• "Literacy rate in Northern Region?"
• "Which region has the highest electricity access?"
"""

# ============================================
# LOAD MODEL
# ============================================
@st.cache_resource
def load_rag():
    rag = GhanaCensusRAG("/content/finetuned_model" if os.path.exists("/content/finetuned_model") else "all-MiniLM-L6-v2")
    rag.add_documents(GHANA_CENSUS_DATA)
    return rag

rag = load_rag()

# ============================================
# UI
# ============================================
# Logo
if os.path.exists("/content/gainai.png"):
    c1, c2, c3 = st.columns([1,1,1])
    with c2:
        st.image("/content/gainai.png", width=180)

st.title("🇬🇭 Ghana Census Q&A (100 Docs)")
st.markdown("**Ask questions about Ghana's 2021 Population and Housing Census**")
st.info("🎓 **DEMO**: Running offline - No API costs!")

# Sidebar
with st.sidebar:
    if os.path.exists("/content/gainai.png"):
        st.image("/content/gainai.png", width=100)
    st.markdown("### 📊 Stats")
    st.metric("Documents", len(GHANA_CENSUS_DATA))
    st.metric("Regions", "16")
    st.markdown("---")
    st.markdown("### 💡 Try asking:")
    for q in ["What is Ghana's population?", "Literacy in Northern Region?", "Kumasi population?", "Electricity access?"]:
        if st.button(q, key=q):
            st.session_state["q"] = q

# Chat
if "msgs" not in st.session_state:
    st.session_state.msgs = []

for m in st.session_state.msgs:
    with st.chat_message(m["role"]):
        st.markdown(m["content"])
        if m.get("sources"):
            with st.expander("📚 Sources"):
                for i, s in enumerate(m["sources"], 1):
                    st.markdown(f"**{i}. {s['region']} - {s['topic']}** (Score: {s['score']})")
                    st.caption(s["content"][:150] + "...")

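# Input: always render the chat box, then let a sidebar suggestion take priority.
# (Short-circuiting with `or` would skip st.chat_input on runs triggered by a button.)
typed = st.chat_input("Ask about Ghana census...")
q = st.session_state.pop("q", None) or typed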

if q:
    st.session_state.msgs.append({"role": "user", "content": q})
    with st.chat_message("user"):
        st.markdown(q)
    
    with st.chat_message("assistant"):
        result = rag.query(q)
        # Show answer with scope indicator
        if not result.get("in_scope", True):
            st.warning("ℹ️ This question appears to be outside my knowledge base.")
        st.markdown(result["answer"])
        if result["sources"]:
            with st.expander("📚 Sources"):
                for i, s in enumerate(result["sources"], 1):
                    st.markdown(f"**{i}. {s['region']} - {s['topic']}** (Score: {s['score']})")
                    st.caption(s["content"][:150] + "...")
    
    st.session_state.msgs.append({"role": "assistant", "content": result["answer"], "sources": result["sources"]})

st.markdown("---")
st.caption("GAIN Dialogue Series 2026 | Ghana Statistical Service Data")
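
---

**Key Insight:** The out-of-scope check in `query` rests on one small formula, `score = 1 / (1 + distance)`, compared against `threshold=0.35`. The sketch below isolates that arithmetic so it can be explored without loading a model. The helper names (`score_from_distance`, `max_distance_for`, `is_in_scope`) and the distance values are assumptions made up for illustration. They are not part of the app and not measured output. What a given distance means also depends on the distance metric the Chroma collection is configured with, so treat the numbers as a starting point for tuning rather than a recommendation.

```python
from typing import List


def score_from_distance(distance: float) -> float:
    """Same mapping as in `query`: distance 0 gives score 1, larger distances give smaller scores."""
    return 1.0 / (1.0 + distance)


def max_distance_for(threshold: float) -> float:
    """Largest distance whose score still reaches the threshold (inverse of the mapping above)."""
    return 1.0 / threshold - 1.0


def is_in_scope(distances: List[float], threshold: float = 0.35) -> bool:
    """Hits come back nearest-first, so only the first distance decides.

    Note: the app compares the score after rounding to 3 decimals; this sketch
    compares the unrounded value.
    """
    return bool(distances) and score_from_distance(distances[0]) >= threshold


# Hypothetical distances, invented for illustration only
close_match = [0.42, 0.97, 1.10]
far_match = [2.30, 2.45, 2.60]

print(round(score_from_distance(close_match[0]), 3))  # 0.704
print(is_in_scope(close_match), is_in_scope(far_match))  # True False
print(round(max_distance_for(0.35), 3))  # 1.857
```

One way to use this: print the raw distances for a handful of questions that should be declined, then pick a threshold whose `max_distance_for` value sits below them.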


In [None]:
# ============================================
# 🚀 LAUNCH STREAMLIT APP
# ============================================
# Using Cloudflare Tunnel (FREE, no signup, no password!)

# Install cloudflared
!wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 -O cloudflared
!chmod +x cloudflared

import subprocess
import time
import re

# Kill existing processes
!pkill -f streamlit 2>/dev/null || true
!pkill -f cloudflared 2>/dev/null || true

# Start Streamlit
print("⏳ Starting Streamlit server...")
!nohup streamlit run streamlit_app.py --server.port 8501 --server.headless true > streamlit.log 2>&1 &
time.sleep(6)

# Check if running
result = subprocess.run(['pgrep', '-f', 'streamlit'], capture_output=True)
if result.returncode == 0:
    print("✅ Streamlit running on port 8501")
else:
    print("❌ Streamlit failed to start")
    !cat streamlit.log

# Start cloudflared tunnel
print("\n🌐 Creating public tunnel (this takes ~10 seconds)...\n")
process = subprocess.Popen(
    ['./cloudflared', 'tunnel', '--url', 'http://localhost:8501'],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True
)

# Find the URL
url_found = False
for _ in range(30):
    line = process.stdout.readline()
    if 'trycloudflare.com' in line:
        match = re.search(r'https://[a-z0-9-]+\.trycloudflare\.com', line)
        if match:
            url = match.group(0)
            print("=" * 60)
            print("🎉 YOUR APP IS LIVE!")
            print("=" * 60)
            print(f"\n🔗 {url}")
            print("\n• Share this link with anyone")
            print("• No password required")
            print("• Keep this cell running")
            print("=" * 60)
            url_found = True
            break
    time.sleep(0.5)

if not url_found:
    print("⚠️ No tunnel URL found in the first 30 lines of cloudflared output. Re-run this cell to retry.")

# Keep alive
try:
    while True:
        time.sleep(60)
except KeyboardInterrupt:
    print("\n🛑 Stopping app...")
    !pkill -f streamlit
    !pkill -f cloudflared
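
---

**Key Insight:** The launch cell finds the public link by scanning cloudflared's log lines with a regular expression. The sketch below pulls that step into a small function so it can be tried on its own. `find_tunnel_url` is a hypothetical helper name, and the sample log lines are invented for illustration. Real cloudflared output may be formatted differently.

```python
import re
from typing import Iterable, Optional

# Same pattern as the launch cell
TUNNEL_URL = re.compile(r"https://[a-z0-9-]+\.trycloudflare\.com")


def find_tunnel_url(lines: Iterable[str], max_lines: int = 30) -> Optional[str]:
    """Return the first quick-tunnel URL seen within `max_lines` lines, else None."""
    for i, line in enumerate(lines):
        if i >= max_lines:
            break
        match = TUNNEL_URL.search(line)
        if match:
            return match.group(0)
    return None


# Invented log lines: the first mentions the domain without a full URL,
# which is why a substring check alone is not enough.
sample_log = [
    "INF Requesting new quick Tunnel on trycloudflare.com...",
    "INF |  https://example-demo-tunnel.trycloudflare.com  |",
]

print(find_tunnel_url(sample_log))  # https://example-demo-tunnel.trycloudflare.com
print(find_tunnel_url(["INF Starting tunnel"]))  # None
```

Returning `None` instead of printing inside the loop keeps the "URL not found" case explicit, so the caller can decide whether to retry.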
