# 🇬🇭 Ghana Census Q&A System
## Building AI Applications with RAG | GAIN Dialogue Series 2026

**Presenter:** Thomas Torku, Ph.D.  
**Date:** January 2026

---

### What We're Building

A question-answering system that retrieves information from Ghana's 2021 Census data and generates accurate, sourced answers.

**Architecture:**
```
Query → Embed → Search Vector DB → Retrieve Context → Generate Answer
```

**Tech Stack:** Python • Sentence-Transformers • ChromaDB • Streamlit

---



## 📦 Step 1: Setup

Install required packages. Everything runs FREE on Google Colab!

**Key Insight:** We use `sentence-transformers` for embeddings and `chromadb` for vector storage - both are open-source and free.


In [None]:
# Install required packages
!pip install -q sentence-transformers chromadb python-dotenv accelerate

print("✅ All packages installed successfully!")
print("\n📊 Package versions:")
import sentence_transformers
import chromadb
print(f"  - sentence-transformers: {sentence_transformers.__version__}")
print(f"  - chromadb: {chromadb.__version__}")

---

## 📊 Step 2: Prepare Data

Load Ghana Census 2021 data with proper structure: ID, content, and metadata (region, topic, year).

**Key Insight:** Good metadata enables filtering and improves search relevance.
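
To make the metadata point concrete, here is a minimal sketch of filtering documents by metadata in plain Python. The helper `filter_by_metadata` and the two abbreviated sample records are hypothetical and are not part of the pipeline built later. In ChromaDB, the comparable mechanism is the `where` argument of `collection.query`, for example `where={"topic": "Literacy"}`.

```python
from typing import Dict, List

# Two records shaped like the census documents in this notebook (content abbreviated).
SAMPLE_DOCS = [
    {"id": "census_003", "content": "Greater Accra literacy rate is 87.8%.",
     "metadata": {"region": "Greater Accra", "topic": "Literacy", "year": 2021}},
    {"id": "census_005", "content": "84.3% of households have access to electricity.",
     "metadata": {"region": "National", "topic": "Electricity", "year": 2021}},
]

def filter_by_metadata(docs: List[Dict], **criteria) -> List[Dict]:
    """Keep only documents whose metadata matches every key=value criterion."""
    return [
        doc for doc in docs
        if all(doc["metadata"].get(key) == value for key, value in criteria.items())
    ]

literacy_docs = filter_by_metadata(SAMPLE_DOCS, topic="Literacy")
print([doc["id"] for doc in literacy_docs])
```

Filtering first narrows the candidate set, so a query about literacy is never answered from an electricity record that happens to be close in vector space.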


In [None]:
import json
from typing import List, Dict

# Ghana Census 2021 Sample Data
GHANA_CENSUS_DATA = [
    {
        "id": "census_001",
        "content": "Ghana's total population according to the 2021 Population and Housing Census is 30,832,019. This represents an increase of 6,047,539 (24.4%) over the 2010 census population of 24,658,823. The intercensal growth rate between 2010 and 2021 is 2.1% per annum.",
        "metadata": {"region": "National", "topic": "Population", "year": 2021}
    },
    {
        "id": "census_002",
        "content": "The Greater Accra Region has a population of 5,446,237, making it the most populous region. It is followed by the Ashanti Region with 5,432,485 people. The least populous region is Savannah Region with 588,152 people.",
        "metadata": {"region": "Greater Accra", "topic": "Population", "year": 2021}
    },
    {
        "id": "census_003",
        "content": "In the Greater Accra Region, the literacy rate is 87.8% for persons aged 15 years and older. Male literacy rate is 91.2% while female literacy rate is 84.6%. Urban areas have higher literacy rates than rural areas.",
        "metadata": {"region": "Greater Accra", "topic": "Literacy", "year": 2021}
    },
    {
        "id": "census_004",
        "content": "The Northern Region has a literacy rate of 44.2% for persons aged 15 years and older. This is lower than the national average of 76.4%. Male literacy rate is 56.7% while female literacy rate is 32.8%.",
        "metadata": {"region": "Northern", "topic": "Literacy", "year": 2021}
    },
    {
        "id": "census_005",
        "content": "Access to electricity: Nationally, 84.3% of households have access to electricity. Greater Accra has the highest access at 94.2%, followed by Ashanti at 89.1%. The Northern Region has 67.3% access to electricity.",
        "metadata": {"region": "National", "topic": "Electricity", "year": 2021}
    },
    {
        "id": "census_006",
        "content": "The unemployment rate in Ghana is 13.4% according to the 2021 census. Youth unemployment (ages 15-24) stands at 19.7%. Greater Accra has an unemployment rate of 15.2%, while rural areas have lower unemployment at 8.9%.",
        "metadata": {"region": "National", "topic": "Employment", "year": 2021}
    },
    {
        "id": "census_007",
        "content": "The Ashanti Region has a population of 5,432,485, making it the second most populous region. The region has 1,234,567 households. The average household size is 4.4 persons.",
        "metadata": {"region": "Ashanti", "topic": "Population", "year": 2021}
    },
    {
        "id": "census_008",
        "content": "National literacy rate: 76.4% of the population aged 15 years and older can read and write. Male literacy is 82.3% and female literacy is 70.8%. There is a significant gap between urban (86.5%) and rural (64.1%) literacy rates.",
        "metadata": {"region": "National", "topic": "Literacy", "year": 2021}
    },
    {
        "id": "census_009",
        "content": "Access to improved water sources: 87.2% of households have access to improved water sources. In urban areas, this is 93.4%, while in rural areas it is 78.9%. Greater Accra has the highest access at 96.1%.",
        "metadata": {"region": "National", "topic": "Water Access", "year": 2021}
    },
    {
        "id": "census_010",
        "content": "Internet usage: 58.3% of Ghanaians aged 15 and above have used the internet in the past 3 months. Urban areas have 72.1% internet usage while rural areas have 38.2%. Greater Accra leads with 78.4% internet usage.",
        "metadata": {"region": "National", "topic": "Technology", "year": 2021}
    },
]

print(f"✅ Loaded {len(GHANA_CENSUS_DATA)} census documents")
print("\n📊 Sample document:")
print(json.dumps(GHANA_CENSUS_DATA[0], indent=2))

---

## 🧠 Step 3: Initialize RAG Pipeline

Create the core RAG system with embedding model and vector database.

**How it works:**
1. **Embed** - Convert text to 384-dimensional vectors
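2. **Search** - Find the nearest vectors by distance (ChromaDB defaults to squared L2; for unit-length embeddings this ranks results the same way as cosine similarity)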
3. **Generate** - Create answers from retrieved context

**Key Insight:** The embedding model captures semantic meaning - "population count" matches "how many people" even without shared words.
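
The search step compares vectors by distance. The sketch below uses made-up 3-dimensional vectors (not real model output) to show how cosine similarity and squared L2 distance relate: for unit-length vectors, squared L2 equals `2 - 2 * cosine`, so both produce the same ranking. All function and variable names here are illustrative.

```python
import math
from typing import List

def normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a_unit, b_unit = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))

def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical toy "embeddings"; the real ones in this notebook have 384 dimensions.
query_vec = [1.0, 2.0, 0.0]
similar_doc = [2.0, 4.0, 0.1]
unrelated_doc = [0.0, 0.1, 3.0]

for name, doc_vec in [("similar", similar_doc), ("unrelated", unrelated_doc)]:
    cos = cosine_similarity(query_vec, doc_vec)
    dist = squared_l2(normalize(query_vec), normalize(doc_vec))
    # For unit vectors: squared L2 distance = 2 - 2 * cosine similarity
    print(f"{name}: cosine={cos:.3f}, squared_l2={dist:.3f}, 2-2cos={2 - 2 * cos:.3f}")
```

The last two printed columns match on each row, which is why a smaller distance can be read as a higher similarity when embeddings are normalized.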


In [None]:
import os
from dataclasses import dataclass
from sentence_transformers import SentenceTransformer
import chromadb

@dataclass
class SearchResult:
    """Represents a search result with content and score."""
    content: str
    metadata: Dict
    score: float

class GhanaCensusRAG:
    """RAG Pipeline for Ghana Census Data - Completely FREE!"""

    def __init__(self, embedding_model="all-MiniLM-L6-v2"):
        print("🔧 Initializing RAG Pipeline (100% FREE - No API needed!)")

        # Load embedding model
        print(f"   📥 Loading embedding model: {embedding_model}")
        self.embedding_model = SentenceTransformer(embedding_model)

        # Setup ChromaDB
        print("   💾 Setting up ChromaDB...")
        self.chroma_client = chromadb.Client()
        # get_or_create_collection allows this cell to be re-run without an "already exists" error
        self.collection = self.chroma_client.get_or_create_collection(
            name="ghana_census",
            metadata={"description": "Ghana 2021 Census Data"}
        )

        print("✅ RAG Pipeline initialized!\n")

    def add_documents(self, documents: List[Dict]):
        """Add documents to vector database."""
        print(f"📊 Processing {len(documents)} documents...")

        ids = [doc["id"] for doc in documents]
        contents = [doc["content"] for doc in documents]
        metadatas = [doc["metadata"] for doc in documents]

        # Generate embeddings
        print("   🧮 Generating embeddings...")
        embeddings = self.embedding_model.encode(contents, show_progress_bar=True)

        # Store in ChromaDB
        print("   💾 Storing in vector database...")
        self.collection.add(
            ids=ids,
            embeddings=embeddings.tolist(),
            documents=contents,
            metadatas=metadatas
        )

        print(f"✅ Added {len(documents)} documents to vector store\n")

    def search(self, query: str, n_results: int = 3) -> List[SearchResult]:
        """Semantic search over documents."""
        # Generate query embedding
        query_embedding = self.embedding_model.encode([query])[0]

        # Search ChromaDB
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=n_results,
            include=["documents", "metadatas", "distances"]
        )

        # Convert to SearchResult objects
        search_results = []
        if results["documents"] and results["documents"][0]:
            for i, doc in enumerate(results["documents"][0]):
                distance = results["distances"][0][i]
                score = 1 / (1 + distance)

                search_results.append(SearchResult(
                    content=doc,
                    metadata=results["metadatas"][0][i],
                    score=score
                ))

        return search_results

    def generate_answer(self, query: str, context_results: List[SearchResult]) -> str:
        """Generate answer using retrieved context (no API needed!)."""
        if not context_results:
            return "I couldn't find relevant information in the census data."

        top_result = context_results[0]
        content = top_result.content
        region = top_result.metadata.get('region', 'Ghana')
        topic = top_result.metadata.get('topic', 'Census Data')

        answer_parts = [
            "Based on the Ghana 2021 Census data, here's what I found:\n",
            content if len(content) <= 300 else f"{content[:300]}...",
            f"\n\n[Source: {region} - {topic}]"
        ]

        if len(context_results) > 1:
            # Count only the sources beyond the top result quoted above
            extra = len(context_results) - 1
            answer_parts.append(f"\n\n{extra} additional census source(s) were also retrieved.")

        return "".join(answer_parts)

    def query(self, question: str, n_context: int = 3) -> Dict:
        """Full RAG pipeline: search + generate answer."""
        search_results = self.search(question, n_results=n_context)
        answer = self.generate_answer(question, search_results)

        sources = [
            {
                "region": r.metadata.get("region"),
                "topic": r.metadata.get("topic"),
                "relevance_score": round(r.score, 3),
                "content": r.content[:200] + "..." if len(r.content) > 200 else r.content
            }
            for r in search_results
        ]

        return {
            "query": question,
            "answer": answer,
            "sources": sources
        }

# Initialize the RAG system
rag = GhanaCensusRAG()

---

## 📥 Step 4: Load Data

Add documents to the vector database. Each document gets embedded and indexed for fast retrieval.

**Key Insight:** ChromaDB builds an approximate nearest-neighbor (HNSW) index, so lookups stay fast as a collection grows. With only 10 documents here, search is effectively instant.
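
For intuition about what an index saves, here is a minimal sketch of the brute-force alternative: compare the query against every stored vector and keep the `k` closest. The function `brute_force_top_k` and the toy 2-dimensional vectors are hypothetical. This is not how ChromaDB searches internally; it is the baseline that an approximate index avoids.

```python
import heapq
from typing import Dict, List, Tuple

def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_top_k(
    query: List[float], vectors: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Scan every vector and return the k (id, distance) pairs with the smallest distance."""
    scored = ((doc_id, squared_l2(query, vec)) for doc_id, vec in vectors.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

# Toy 2-dimensional vectors for illustration only.
toy_vectors = {
    "doc_a": [0.0, 0.0],
    "doc_b": [1.0, 1.0],
    "doc_c": [5.0, 5.0],
}
top_two = brute_force_top_k([0.9, 1.2], toy_vectors, k=2)
print(top_two)
```

The scan costs one distance computation per stored document, so its running time grows linearly with the collection. That is fine for a demo and too slow for very large corpora.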


In [None]:
# Add documents to RAG system
rag.add_documents(GHANA_CENSUS_DATA)

print("✅ Vector database is ready!")
print(f"   Total documents: {rag.collection.count()}")
print(f"   Embedding dimensions: {rag.embedding_model.get_sentence_embedding_dimension()}")
print("   Model: all-MiniLM-L6-v2")

---

## 🔍 Step 5: Test Search

Verify semantic search works correctly before building the full pipeline.

**Key Insight:** Always test with edge cases. Vector search returns the nearest documents for any query, so even a nonsense query gets results, typically with lower scores.
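
One way to handle low-scoring results is to drop anything below a cutoff before answering. The sketch below is a hypothetical helper, not part of the pipeline above. `ScoredHit`, `keep_confident`, the example distances, and the `0.5` threshold are all assumptions that would need tuning on real queries.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ScoredHit:
    """Hypothetical stand-in for a search result: text plus relevance score."""
    content: str
    score: float

def distance_to_score(distance: float) -> float:
    """Same conversion as the notebook's search method: smaller distance, higher score."""
    return 1 / (1 + distance)

def keep_confident(hits: List[ScoredHit], min_score: float = 0.5) -> List[ScoredHit]:
    """Return only the hits whose score reaches the cutoff."""
    return [hit for hit in hits if hit.score >= min_score]

# Made-up distances for illustration only.
hits = [
    ScoredHit("Ghana's total population is 30,832,019.", distance_to_score(0.4)),
    ScoredHit("Internet usage is 58.3% among Ghanaians aged 15 and above.", distance_to_score(1.8)),
]
confident = keep_confident(hits)
print([(hit.content, round(hit.score, 3)) for hit in confident])
```

If the filtered list is empty, the system can reply that it found nothing relevant instead of quoting an unrelated document.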


In [None]:
# Test semantic search
test_query = "What is the population of Ghana?"

print(f"🔍 Query: {test_query}\n")
print("📊 Top 3 most relevant documents:\n")

results = rag.search(test_query, n_results=3)

for i, result in enumerate(results, 1):
    print(f"{'='*60}")
    print(f"Result {i}:")
    print(f"  Region: {result.metadata['region']}")
    print(f"  Topic: {result.metadata['topic']}")
    print(f"  Relevance Score: {result.score:.3f}")
    print(f"  Content: {result.content[:150]}...")
    print()

---

## 💬 Step 6: Full Q&A Pipeline

Combine retrieval with answer generation. Users ask questions; the system finds relevant data and generates responses.

**Key Insight:** Our generator is a simple template that quotes the top retrieved document (FREE). Production systems often pass the retrieved context to an LLM for more natural responses.
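
If an LLM were used for generation, the retrieved documents would usually be packed into a prompt. The sketch below only builds that prompt string and makes no API call. `build_prompt` and the instruction wording are hypothetical and are not taken from any particular LLM provider.

```python
from typing import List

def build_prompt(question: str, contexts: List[str]) -> str:
    """Assemble numbered census excerpts and the user question into one prompt."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(contexts, 1))
    return (
        "Answer the question using only the census excerpts below. "
        "Cite excerpt numbers, and say so if the answer is not present.\n\n"
        f"Excerpts:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is the unemployment rate in Ghana?",
    [
        "The unemployment rate in Ghana is 13.4% according to the 2021 census.",
        "Youth unemployment (ages 15-24) stands at 19.7%.",
    ],
)
print(prompt)
```

Numbering the excerpts gives the model something concrete to cite, which keeps the answer traceable to its sources.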


In [None]:
# Test questions
test_questions = [
    "What is the population of Ghana?",
    "How does literacy rate compare between Greater Accra and Northern Region?",
    "What percentage of households have access to electricity?",
    "What is the unemployment rate in Ghana?"
]

for question in test_questions:
    print(f"\n{'='*70}")
    print(f"❓ Question: {question}")
    print(f"{'='*70}\n")

    result = rag.query(question)

    print(f"📝 Answer:\n{result['answer']}\n")

    print(f"📚 Sources ({len(result['sources'])} documents):")
    for i, source in enumerate(result['sources'], 1):
        print(f"   {i}. [{source['region']}] {source['topic']} (relevance: {source['relevance_score']})")

    print()

---

## 🎯 Step 7: Fine-Tune Embeddings

Improve retrieval for Ghana-specific terminology (regions, districts, local terms).

**How it works:**
- Create query-document pairs
- Train model to bring related pairs closer in vector space
- Result: Better retrieval for domain-specific queries

**Key Insight:** Fine-tuning can teach the model associations such as "Kumasi" with "Ashanti Region", which a general-purpose model may not represent well, provided the training pairs actually include them.
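
The idea of pulling related pairs closer can be written as a triplet loss: the query (anchor) should be nearer to the positive than to the negative by at least some margin. The sketch below computes that loss on toy 2-dimensional vectors. The vectors and the `margin=1.0` value are illustrative assumptions, not the settings used by the training cell.

```python
import math
from typing import List

def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(
    anchor: List[float], positive: List[float], negative: List[float], margin: float = 1.0
) -> float:
    """Zero once the positive is closer than the negative by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor = [0.0, 0.0]

# Well-separated triplet: positive at distance 1, negative at distance 3.
good = triplet_loss(anchor, positive=[1.0, 0.0], negative=[3.0, 0.0])

# Violating triplet: the "positive" is farther away than the "negative".
bad = triplet_loss(anchor, positive=[3.0, 0.0], negative=[1.0, 0.0])

print(f"good triplet loss: {good}, bad triplet loss: {bad}")
```

Training lowers this loss by moving embeddings, so triplets that already satisfy the margin contribute nothing and the violating ones drive the updates.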


In [None]:
from sentence_transformers import InputExample, losses
from torch.utils.data import DataLoader

# Training data: (query, positive example, negative example)
training_data = [
    {
        "query": "How many people live in Ghana?",
        "positive": "Ghana's total population according to the 2021 Population and Housing Census is 30,832,019.",
        "negative": "The unemployment rate in Ghana is 13.4% according to the 2021 census."
    },
    {
        "query": "What is the population of Ashanti Region?",
        "positive": "The Ashanti Region has a population of 5,432,485, making it the second most populous region.",
        "negative": "The literacy rate in the Northern Region is 44.2% for persons aged 15 years and older."
    },
    {
        "query": "Greater Accra literacy statistics",
        "positive": "In the Greater Accra Region, the literacy rate is 87.8% for persons aged 15 years and older.",
        "negative": "Access to electricity: Nationally, 84.3% of households have access to electricity."
    },
    {
        "query": "Northern region education levels",
        "positive": "The Northern Region has a literacy rate of 44.2% for persons aged 15 years and older.",
        "negative": "The Greater Accra Region has a population of 5,446,237, making it the most populous region."
    },
    {
        "query": "How many households have power?",
        "positive": "Access to electricity: Nationally, 84.3% of households have access to electricity.",
        "negative": "Internet usage: 58.3% of Ghanaians aged 15 and above have used the internet."
    },
]

print(f"📊 Training data: {len(training_data)} examples")
print("\nSample training triplet:")
print(f"  Query: {training_data[0]['query']}")
print(f"  Positive: {training_data[0]['positive'][:60]}...")
print(f"  Negative: {training_data[0]['negative'][:60]}...")

In [None]:
import os
os.environ['WANDB_DISABLED'] = 'true'  # Disable W&B logging (no account needed!)

# Prepare training data
train_examples = [
    InputExample(texts=[item["query"], item["positive"], item["negative"]])
    for item in training_data
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=4)

# Initialize model for fine-tuning
model = SentenceTransformer("all-MiniLM-L6-v2")

# Setup triplet loss
train_loss = losses.TripletLoss(model=model)

print("🏋️ Fine-tuning model on Ghana census data...")
print("   This will take ~1-2 minutes\n")

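# Fine-tune (3 epochs is enough for demo)
# 5 examples at batch_size=4 give 2 steps per epoch, so 6 steps in total;
# keep warmup shorter than that so the learning rate reaches its peak.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=1,
    show_progress_bar=True
)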

print("\n✅ Fine-tuning complete!")
print("   Model has been adapted to the sample Ghana census queries")

In [None]:
# Save the fine-tuned model for Streamlit app
MODEL_SAVE_PATH = '/content/finetuned_model'
model.save(MODEL_SAVE_PATH)
print(f"✅ Fine-tuned model saved to: {MODEL_SAVE_PATH}")

---

## 📊 Step 8: Compare Performance

Compare the base and fine-tuned models on the same query, looking at which documents each one ranks first.

**Key Insight:** Don't just compare numbers - verify the RIGHT documents are being retrieved.
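
Rank-based metrics check whether the right document comes back, regardless of raw score magnitudes. The sketch below works through mean reciprocal rank (MRR) on hypothetical ranked topic lists. The rankings are invented for illustration and are not output from either model.

```python
from typing import List, Optional, Tuple

def first_relevant_rank(retrieved_topics: List[str], expected_topic: str) -> Optional[int]:
    """Return the 1-based rank of the first matching topic, or None if absent."""
    for rank, topic in enumerate(retrieved_topics, 1):
        if topic == expected_topic:
            return rank
    return None

def mean_reciprocal_rank(runs: List[Tuple[List[str], str]]) -> float:
    """Average of 1/rank over all queries; a miss contributes 0."""
    total = 0.0
    for retrieved, expected in runs:
        rank = first_relevant_rank(retrieved, expected)
        total += 1.0 / rank if rank else 0.0
    return total / len(runs)

# (retrieved topics in ranked order, expected topic) for three hypothetical queries
runs = [
    (["Population", "Literacy", "Employment"], "Population"),      # hit at rank 1
    (["Population", "Literacy", "Electricity"], "Literacy"),       # hit at rank 2
    (["Water Access", "Technology", "Population"], "Employment"),  # miss
]
mrr = mean_reciprocal_rank(runs)
print(f"MRR = {mrr:.3f}")
```

Here the three queries contribute 1, 1/2, and 0, so the mean is 0.5. A model can report higher similarity scores while leaving this number unchanged, or even lowering it.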


In [None]:
# Create RAG with fine-tuned model
print("🔧 Creating RAG system with fine-tuned model...\n")

class GhanaCensusRAGFinetuned(GhanaCensusRAG):
    def __init__(self, model):
        print("🔧 Initializing RAG with fine-tuned model")
        self.embedding_model = model

        self.chroma_client = chromadb.Client()
        # get_or_create_collection allows this cell to be re-run without an "already exists" error
        self.collection = self.chroma_client.get_or_create_collection(
            name="ghana_census_finetuned",
            metadata={"description": "Ghana Census - Fine-tuned"}
        )
        print("✅ Fine-tuned RAG initialized!\n")

rag_finetuned = GhanaCensusRAGFinetuned(model)
rag_finetuned.add_documents(GHANA_CENSUS_DATA)

# Compare both models
test_query = "Tell me about literacy in Accra"

print(f"\n{'='*70}")
print(f"🔍 Test Query: {test_query}")
print(f"{'='*70}\n")

print("📊 BASE MODEL Results:")
base_results = rag.search(test_query, n_results=3)
for i, r in enumerate(base_results, 1):
    print(f"   {i}. [{r.metadata['region']}] {r.metadata['topic']} - Score: {r.score:.3f}")

print("\n📊 FINE-TUNED MODEL Results:")
ft_results = rag_finetuned.search(test_query, n_results=3)
for i, r in enumerate(ft_results, 1):
    print(f"   {i}. [{r.metadata['region']}] {r.metadata['topic']} - Score: {r.score:.3f}")

print("\n💡 Compare the ranking order: scores from two different embedding models are not directly comparable.")

---

## 🎨 Step 9: Interactive Interface

Create a user-friendly Q&A interface with sample questions and source citations.

**Key Insight:** Always show sources - users need to verify AI-generated answers.


In [None]:
def ask_question(question: str, use_finetuned: bool = True):
    """Interactive Q&A function"""
    model_to_use = rag_finetuned if use_finetuned else rag
    model_name = "Fine-tuned" if use_finetuned else "Base"

    print(f"\n{'='*70}")
    print(f"❓ Your Question: {question}")
    print(f"🤖 Using: {model_name} Model")
    print(f"{'='*70}\n")

    result = model_to_use.query(question)

    print(f"📝 Answer:\n{result['answer']}\n")

    print(f"📚 Sources:")
    for i, source in enumerate(result['sources'], 1):
        print(f"   {i}. [{source['region']}] {source['topic']} (relevance: {source['relevance_score']})")

# Try it out!
ask_question("What is the unemployment rate in Ghana?")
ask_question("Compare internet usage between urban and rural areas")
ask_question("Tell me about water access in Ghana")

---

## 🎯 YOUR TURN: Ask Your Own Questions!

Modify the cell below to ask your own questions about Ghana Census data!

In [None]:
# YOUR CUSTOM QUESTION HERE!
my_question = "What is the population of Greater Accra?"  # Change this!

ask_question(my_question, use_finetuned=True)

In [None]:
# Evaluation function
def evaluate_model(rag_system, test_queries):
    """Evaluate retrieval performance"""
    correct_at_1 = 0
    correct_at_3 = 0
    reciprocal_ranks = []

    for query_info in test_queries:
        query = query_info["query"]
        expected_topic = query_info["expected_topic"]

        results = rag_system.search(query, n_results=5)

        # Find rank of first relevant result
        rank = None
        for i, result in enumerate(results, 1):
            if result.metadata["topic"] == expected_topic:
                rank = i
                break

        if rank:
            if rank == 1:
                correct_at_1 += 1
            if rank <= 3:
                correct_at_3 += 1
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)

    n = len(test_queries)
    return {
        "accuracy@1": correct_at_1 / n,
        "accuracy@3": correct_at_3 / n,
        "mrr": sum(reciprocal_ranks) / n
    }

# Test queries with expected topics
eval_queries = [
    {"query": "What is Ghana's population?", "expected_topic": "Population"},
    {"query": "How many people can read in Accra?", "expected_topic": "Literacy"},
    {"query": "Do households have power?", "expected_topic": "Electricity"},
    {"query": "What is the jobless rate?", "expected_topic": "Employment"},
]

print("📊 Evaluating Base Model...")
base_metrics = evaluate_model(rag, eval_queries)

print("📊 Evaluating Fine-tuned Model...")
ft_metrics = evaluate_model(rag_finetuned, eval_queries)

print(f"\n{'='*70}")
print("📈 EVALUATION RESULTS")
print(f"{'='*70}\n")
print(f"{'Metric':<20} {'Base Model':<15} {'Fine-tuned':<15} {'Improvement'}")
print(f"{'-'*70}")

for metric in ["accuracy@1", "accuracy@3", "mrr"]:
    base_val = base_metrics[metric]
    ft_val = ft_metrics[metric]
    improvement = ((ft_val - base_val) / base_val * 100) if base_val > 0 else 0

    print(f"{metric:<20} {base_val:<15.1%} {ft_val:<15.1%} {improvement:+.1f}%")

print(f"\n{'='*70}")

In [None]:
!pip install streamlit

In [None]:
%%writefile streamlit_app.py
import streamlit as st
import os
from dataclasses import dataclass
from typing import List, Dict
from sentence_transformers import SentenceTransformer
import chromadb

# Page config
st.set_page_config(page_title="Ghana Census Q&A", page_icon="🇬🇭", layout="wide")

# ============================================
# CSS - Force light mode with visible text
# ============================================
st.markdown("""
<style>
    /* Force light background everywhere */
    .stApp, .main, [data-testid="stAppViewContainer"], [data-testid="stHeader"] {
        background-color: #ffffff !important;
    }
    
    /* Dark text for all content */
    .stMarkdown, .stMarkdown p, .stMarkdown h1, .stMarkdown h2, .stMarkdown h3,
    .stMarkdown li, .stMarkdown span, .stMarkdown strong, .stMarkdown em,
    p, span, li, label, div {
        color: #1a1a1a !important;
    }
    
    /* Title */
    h1 {
        color: #006B3F !important;
        text-align: center;
    }
    
    /* Chat messages - light background, dark text */
    [data-testid="stChatMessage"] {
        background-color: #f5f5f5 !important;
        border: 1px solid #ddd !important;
        border-radius: 10px !important;
        padding: 15px !important;
        margin: 10px 0 !important;
    }
    
    [data-testid="stChatMessage"] p,
    [data-testid="stChatMessage"] span,
    [data-testid="stChatMessage"] div,
    .stChatMessage p {
        color: #1a1a1a !important;
    }
    
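    /* User message bubble.
       Assumes the avatar inside the message has a data-testid containing "user"
       (the exact value varies by Streamlit version). */
    [data-testid="stChatMessage"]:has([data-testid*="user" i]) {
        background-color: #e3f2fd !important;
    }
    
    /* Assistant message bubble (same assumption, for "assistant") */
    [data-testid="stChatMessage"]:has([data-testid*="assistant" i]) {
        background-color: #f5f5f5 !important;
    }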
    
    /* Info box */
    .stAlert, [data-testid="stAlert"] {
        background-color: #e8f5e9 !important;
        color: #1a1a1a !important;
    }
    .stAlert p, [data-testid="stAlert"] p {
        color: #1a1a1a !important;
    }
    
    /* Sidebar */
    [data-testid="stSidebar"], [data-testid="stSidebar"] > div {
        background-color: #f8f9fa !important;
    }
    [data-testid="stSidebar"] p, [data-testid="stSidebar"] span,
    [data-testid="stSidebar"] label, [data-testid="stSidebar"] h1,
    [data-testid="stSidebar"] h2, [data-testid="stSidebar"] h3 {
        color: #1a1a1a !important;
    }
    
    /* Metrics */
    [data-testid="stMetricValue"] {
        color: #006B3F !important;
    }
    [data-testid="stMetricLabel"] {
        color: #333 !important;
    }
    
    /* Expander */
    .streamlit-expanderHeader, [data-testid="stExpander"] summary {
        background-color: #f0f0f0 !important;
        color: #1a1a1a !important;
    }
    .streamlit-expanderContent, [data-testid="stExpander"] > div {
        background-color: #fafafa !important;
        color: #1a1a1a !important;
    }
    
    /* Input */
    .stTextInput input, .stChatInput textarea, [data-testid="stChatInput"] textarea {
        background-color: #ffffff !important;
        color: #1a1a1a !important;
        border: 1px solid #ccc !important;
    }
    
    /* Buttons */
    .stButton button {
        background-color: #006B3F !important;
        color: white !important;
    }
</style>
""", unsafe_allow_html=True)

# ============================================
# DATA
# ============================================
GHANA_CENSUS_DATA = [
    {"id": "census_001", "content": "Ghana's total population according to the 2021 Population and Housing Census is 30,832,019. This represents an increase of 6,047,539 (24.4%) over the 2010 census population of 24,658,823. The intercensal growth rate between 2010 and 2021 is 2.1% per annum.", "metadata": {"region": "National", "topic": "Population", "year": 2021}},
    {"id": "census_002", "content": "The Greater Accra Region has a population of 5,446,237, making it the most populous region. It is followed by the Ashanti Region with 5,432,485 people. The least populous region is Savannah Region with 588,152 people.", "metadata": {"region": "Greater Accra", "topic": "Population", "year": 2021}},
    {"id": "census_003", "content": "In the Greater Accra Region, the literacy rate is 87.8% for persons aged 15 years and older. Male literacy rate is 91.2% while female literacy rate is 84.6%. Urban areas have higher literacy rates than rural areas.", "metadata": {"region": "Greater Accra", "topic": "Literacy", "year": 2021}},
    {"id": "census_004", "content": "The Northern Region has a literacy rate of 44.2% for persons aged 15 years and older. This is lower than the national average of 76.4%. Male literacy rate is 56.7% while female literacy rate is 32.8%.", "metadata": {"region": "Northern", "topic": "Literacy", "year": 2021}},
    {"id": "census_005", "content": "Access to electricity: Nationally, 84.3% of households have access to electricity. Greater Accra has the highest access at 94.2%, followed by Ashanti at 89.1%. The Northern Region has 67.3% access to electricity.", "metadata": {"region": "National", "topic": "Electricity", "year": 2021}},
    {"id": "census_006", "content": "The Ashanti Region, with Kumasi as its capital, has a population of 5,432,485. It is known as the cultural heartland of Ghana. The region has a literacy rate of 79.3% and electricity access of 89.1%.", "metadata": {"region": "Ashanti", "topic": "Demographics", "year": 2021}},
    {"id": "census_007", "content": "Employment statistics: The national unemployment rate is 13.4%. Youth unemployment (ages 15-24) is higher at 19.7%. Greater Accra has the highest employment rate among all regions.", "metadata": {"region": "National", "topic": "Employment", "year": 2021}},
    {"id": "census_008", "content": "Access to safe drinking water: 87.7% of households nationally have access to improved water sources. Greater Accra (95.1%) and Ashanti (91.2%) have the highest access rates. Upper West Region has 72.3% access.", "metadata": {"region": "National", "topic": "Water", "year": 2021}},
    {"id": "census_009", "content": "The Volta Region has a population of 1,907,679 and is located in the eastern part of Ghana. It has a literacy rate of 71.2% and 76.8% access to electricity. The region is known for the Volta Lake.", "metadata": {"region": "Volta", "topic": "Demographics", "year": 2021}},
    {"id": "census_010", "content": "Housing statistics: The average household size in Ghana is 3.6 persons. Greater Accra has the smallest average household size (3.2) while Northern Region has the largest (5.8). Most urban households live in compound houses.", "metadata": {"region": "National", "topic": "Housing", "year": 2021}},
]

# ============================================
# RAG CLASS
# ============================================
@dataclass
class SearchResult:
    content: str
    metadata: Dict
    score: float

class GhanaCensusRAG:
    def __init__(self, model_path="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_path)
        self.client = chromadb.Client()
        import uuid
        self.collection = self.client.create_collection(name=f"census_{uuid.uuid4().hex[:6]}")

    def add_documents(self, docs):
        self.collection.add(
            ids=[d["id"] for d in docs],
            embeddings=self.model.encode([d["content"] for d in docs]).tolist(),
            documents=[d["content"] for d in docs],
            metadatas=[d["metadata"] for d in docs]
        )

    def query(self, question, n=3):
        results = self.collection.query(
            query_embeddings=[self.model.encode([question])[0].tolist()],
            n_results=n,
            include=["documents", "metadatas", "distances"]
        )
        
        if not results["documents"][0]:
            return {"answer": "No relevant information found.", "sources": []}
        
        sources = []
        for i, doc in enumerate(results["documents"][0]):
            score = 1 / (1 + results["distances"][0][i])
            sources.append({
                "content": doc,
                "region": results["metadatas"][0][i].get("region", ""),
                "topic": results["metadatas"][0][i].get("topic", ""),
                "score": round(score, 3)
            })
        
        top = sources[0]
        answer = f"**Based on the Ghana 2021 Census:**\n\n{top['content']}\n\n*Source: {top['region']} - {top['topic']}*"
        return {"answer": answer, "sources": sources}

# ============================================
# LOAD MODEL
# ============================================
@st.cache_resource
def load_rag():
    rag = GhanaCensusRAG("/content/finetuned_model" if os.path.exists("/content/finetuned_model") else "all-MiniLM-L6-v2")
    rag.add_documents(GHANA_CENSUS_DATA)
    return rag

rag = load_rag()

# ============================================
# UI
# ============================================
# Logo
if os.path.exists("/content/gainai.png"):
    c1, c2, c3 = st.columns([1,1,1])
    with c2:
        st.image("/content/gainai.png", width=180)

st.title("🇬🇭 Ghana Census Q&A")
st.markdown("**Ask questions about Ghana's 2021 Population and Housing Census**")
st.info("🎓 **DEMO**: Running offline - No API costs!")

# Sidebar
with st.sidebar:
    if os.path.exists("/content/gainai.png"):
        st.image("/content/gainai.png", width=100)
    st.markdown("### 📊 Stats")
    st.metric("Documents", len(GHANA_CENSUS_DATA))
    st.metric("Regions", "16")
    st.markdown("---")
    st.markdown("### 💡 Try asking:")
    for q in ["What is Ghana's population?", "Literacy in Northern Region?", "Kumasi population?", "Electricity access?"]:
        if st.button(q, key=q):
            st.session_state["q"] = q

# Chat
if "msgs" not in st.session_state:
    st.session_state.msgs = []

for m in st.session_state.msgs:
    with st.chat_message(m["role"]):
        st.markdown(m["content"])
        if m.get("sources"):
            with st.expander("📚 Sources"):
                for i, s in enumerate(m["sources"], 1):
                    st.markdown(f"**{i}. {s['region']} - {s['topic']}** (Score: {s['score']})")
                    st.caption(s["content"][:150] + "...")

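# Input
# Render the chat box on every run; a sidebar button click takes priority over typed text.
typed = st.chat_input("Ask about Ghana census...")
q = st.session_state.pop("q", None) or typed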

if q:
    st.session_state.msgs.append({"role": "user", "content": q})
    with st.chat_message("user"):
        st.markdown(q)
    
    with st.chat_message("assistant"):
        result = rag.query(q)
        st.markdown(result["answer"])
        if result["sources"]:
            with st.expander("📚 Sources"):
                for i, s in enumerate(result["sources"], 1):
                    st.markdown(f"**{i}. {s['region']} - {s['topic']}** (Score: {s['score']})")
                    st.caption(s["content"][:150] + "...")
    
    st.session_state.msgs.append({"role": "assistant", "content": result["answer"], "sources": result["sources"]})

st.markdown("---")
st.caption("GAIN Dialogue Series 2026 | Ghana Statistical Service Data")


In [None]:
# ============================================
# 🚀 LAUNCH STREAMLIT APP
# ============================================
# Using Cloudflare Tunnel (FREE, no signup, no password!)

# Install cloudflared
!wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 -O cloudflared
!chmod +x cloudflared

import subprocess
import time
import re

# Kill existing processes
!pkill -f streamlit 2>/dev/null || true
!pkill -f cloudflared 2>/dev/null || true

# Start Streamlit
print("⏳ Starting Streamlit server...")
!nohup streamlit run streamlit_app.py --server.port 8501 --server.headless true > streamlit.log 2>&1 &
time.sleep(6)

# Check if running
result = subprocess.run(['pgrep', '-f', 'streamlit'], capture_output=True)
if result.returncode == 0:
    print("✅ Streamlit running on port 8501")
else:
    print("❌ Streamlit failed to start")
    !cat streamlit.log

# Start cloudflared tunnel
print("\n🌐 Creating public tunnel (this takes ~10 seconds)...\n")
process = subprocess.Popen(
    ['./cloudflared', 'tunnel', '--url', 'http://localhost:8501'],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True
)

# Find the URL
url_found = False
for _ in range(30):
    line = process.stdout.readline()
    if 'trycloudflare.com' in line:
        match = re.search(r'https://[a-z0-9-]+\.trycloudflare\.com', line)
        if match:
            url = match.group(0)
            print("=" * 60)
            print("🎉 YOUR APP IS LIVE!")
            print("=" * 60)
            print(f"\n🔗 {url}")
            print("\n• Share this link with anyone")
            print("• No password required")
            print("• Keep this cell running")
            print("=" * 60)
            url_found = True
            break
    time.sleep(0.5)

if not url_found:
    print("⚠️ No tunnel URL found in the first 30 lines of cloudflared output; try re-running this cell")

# Keep alive
try:
    while True:
        time.sleep(60)
except KeyboardInterrupt:
    print("\n🛑 Stopping app...")
    !pkill -f streamlit
    !pkill -f cloudflared
