# **5.0** ‎ Retrieval Augmented Generation (RAG) Setup

### Purpose of This Notebook

This notebook sets up a **vector database indexer** to support **Retrieval-Augmented Generation (RAG)** in the municipal chatbot pipeline.

In a RAG system, the chatbot retrieves relevant information from external documents (e.g., policies, FAQs) to ground LLM responses. \
This requires converting documents into vector embeddings and storing them in a **vector database** that supports fast semantic search.

---

### What is Indexing and Why It Matters

**Indexing** involves:
- Breaking documents into text chunks
- Converting each chunk into a dense vector embedding
- Storing those vectors in a searchable database (e.g., Chroma, FAISS)

This enables the chatbot to:
- Retrieve the most relevant content for a given query
- Provide accurate, grounded answers instead of relying solely on the LLM’s internal memory
- Support fast, scalable semantic search over large document sets
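
The three indexing steps above can be sketched in a few lines of plain Python. This is a toy illustration only: `toy_embed` is a hypothetical stand-in for a real embedding model (it just hashes words into a small fixed-size vector), `chunk_text` and `build_index` are illustrative names, and the "database" is a Python list rather than Chroma or FAISS.

```python
import hashlib
import math
from typing import Dict, List


def chunk_text(text: str, max_words: int = 8) -> List[str]:
    """Step 1: break a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(chunk: str, dim: int = 16) -> List[float]:
    """Step 2 (stand-in): hash each word into one of `dim` buckets, then L2-normalise."""
    vec = [0.0] * dim
    for word in chunk.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(documents: List[str]) -> List[Dict]:
    """Step 3: store every chunk together with its vector in a searchable structure."""
    index = []
    for doc_id, doc in enumerate(documents):
        for chunk in chunk_text(doc):
            index.append({"doc_id": doc_id, "text": chunk, "vector": toy_embed(chunk)})
    return index


index = build_index([
    "Rodents were seen near the bin centre behind the hawker stalls last night",
    "Construction noise continues after permitted hours",
])
print(len(index), "chunks indexed")
```

A real pipeline replaces `toy_embed` with a trained model, so that chunks with similar meaning (not just shared words) end up close together.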


---

### This Notebook Covers

- Document loading and chunking
- Embedding generation using `SentenceTransformer`
- Storing vectors and metadata in **ChromaDB**
- Testing retrieval for downstream use in RAG workflows

### **5.0.1** ‎ ‎ Setup PostgreSQL DB

To connect to PostgreSQL, we'll need to install `psycopg2`, one of the most popular PostgreSQL database adapters for Python.

In [12]:
!pip install psycopg2-binary --quiet

In [11]:
import os

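# Set environment variables for the local PostgreSQL connection.
# Replace the password placeholder with your own; avoid committing real credentials.
os.environ["DB_NB_NAME"] = "municipal_app"
os.environ["DB_NB_USER"] = "postgres"
os.environ["DB_NB_PASSWORD"] = "<your-db-password>"
os.environ["DB_NB_HOST"] = "localhost"
os.environ["DB_NB_PORT"] = "5432"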

### **5.0.2** ‎ ‎ Load Issue Records from PostgreSQL

This notebook assumes you already have live municipal issue records stored in a PostgreSQL database, to be embedded for RAG later on. \
If you do not, you can run the two scripts located in `./ai_models/chatbot/scripts/postgresql_setup`, which set up the database and populate it with mock data.

In [13]:
import psycopg2
import os
import pandas as pd

# Load from .env
from dotenv import load_dotenv
load_dotenv()

DB_NAME = os.getenv("DB_NB_NAME", "database")
DB_USER = os.getenv("DB_NB_USER", "postgres")
DB_PASSWORD = os.getenv("DB_NB_PASSWORD")
DB_HOST = os.getenv("DB_NB_HOST", "localhost")
DB_PORT = os.getenv("DB_NB_PORT", "5432")

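def fetch_issues(query, limit=100):
    """Run a parameterised query and return the result as a DataFrame."""
    conn = psycopg2.connect(
        dbname=DB_NAME, user=DB_USER, password=DB_PASSWORD,
        host=DB_HOST, port=DB_PORT
    )
    try:
        # `limit` fills the single %s placeholder in the query.
        df = pd.read_sql(query, conn, params=(limit,))
    finally:
        # Close the connection even if the query fails.
        conn.close()
    return df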

query = """
        SELECT issue_id, description, latitude, longitude,
               status, severity, full_address,
               (SELECT type_name FROM issue_types WHERE issue_type_id = i.issue_type_id) AS issue_type,
               (SELECT subtype_name FROM issue_subcategories WHERE subcategory_id = i.issue_subcategory_id) AS subcategory,
               (SELECT agency_name FROM agencies WHERE agency_id = i.agency_id) AS agency,
               datetime_reported
        FROM issues i
        ORDER BY datetime_reported DESC
        LIMIT %s
    """

issues_df = fetch_issues(query, 200)
issues_df.head()



|   | issue_id | description | latitude | longitude | status | severity | full_address | issue_type | subcategory | agency | datetime_reported |
|---|----------|-------------|----------|-----------|--------|----------|--------------|------------|-------------|--------|-------------------|
| 0 | 244 | Kind author which court back explain daughter.... | 1.3361 | 103.9281 | Closed | Critical | Bedok Reservoir | Pests | Cockroaches in Food Establishment | Housing and Development Board (HDB) | 2025-04-15 16:43:19 |
| 1 | 340 | Drop trouble action marriage table. Energy som... | 1.3321 | 103.8478 | Closed | Critical | Toa Payoh Central | Animals & Bird | Bird Issues | Housing and Development Board (HDB) | 2025-04-15 16:34:58 |
| 2 | 123 | Analysis meet town environment. Treatment its ... | 1.3331 | 103.742 | Closed | Low | Jurong East MRT | Pests | Rodents in Food Establishment | Building and Construction Authority (BCA) | 2025-04-15 16:21:39 |
| 3 | 235 | Put foreign side news pattern nor. Page langua... | 1.3361 | 103.9281 | Closed | Low | Bedok Reservoir | Animals & Bird | Injured Animal | People’s Association (PA) | 2025-04-15 15:43:14 |
| 4 | 55 | Time decision teach future. Scene offer chance... | 1.3361 | 103.9281 | Closed | Low | Bedok Reservoir | Construction Sites | Construction Noise | Urban Redevelopment Authority (URA) | 2025-04-15 15:33:56 |


# **5.1** ‎ RAG Indexing

### How is Indexing Useful?
RAG indexing is the process of preparing external documents for retrieval in a RAG system. \
This involves splitting documents into chunks, generating embeddings using an embedding model, and storing them in a searchable vector database. \
Once indexed, these chunks can be efficiently retrieved based on semantic similarity to user queries, allowing for more context-aware answers in the chatbot.

---
### What is a Vector Database?

A **Vector Database (VectorDB)** stores and searches high-dimensional vectors—typically from embeddings of text, images, or other unstructured data.\
It's a key component in **Retrieval-Augmented Generation (RAG)** systems, enabling fast similarity search to retrieve relevant context for a query.
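
To make the idea concrete, here is a minimal in-memory sketch of what a vector database does: it stores vectors alongside metadata, optionally filters on that metadata, and ranks the remaining entries by similarity to a query vector. `TinyVectorStore` is a hypothetical class written for this illustration, and the two-dimensional vectors are made up; it is not how Chroma or FAISS are implemented.

```python
import math
from typing import Dict, List, Optional


class TinyVectorStore:
    """A hypothetical in-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self._rows: List[Dict] = []

    def add(self, item_id: str, vector: List[float], metadata: Dict) -> None:
        self._rows.append({"id": item_id, "vector": vector, "metadata": metadata})

    @staticmethod
    def _cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def query(self, vector: List[float], n_results: int = 3,
              where: Optional[Dict] = None) -> List[str]:
        # Keep only rows whose metadata matches every key/value pair in `where`.
        candidates = [
            row for row in self._rows
            if not where or all(row["metadata"].get(k) == v for k, v in where.items())
        ]
        # Rank the remaining rows by similarity to the query vector.
        candidates.sort(key=lambda row: self._cosine(vector, row["vector"]), reverse=True)
        return [row["id"] for row in candidates[:n_results]]


store = TinyVectorStore()
store.add("a", [1.0, 0.0], {"agency": "NEA"})
store.add("b", [0.9, 0.1], {"agency": "HDB"})
store.add("c", [0.0, 1.0], {"agency": "NEA"})

print(store.query([1.0, 0.0], n_results=2))                           # ['a', 'b']
print(store.query([1.0, 0.0], n_results=2, where={"agency": "NEA"}))  # ['a', 'c']
```

The second call shows why metadata filtering matters: the filter removes `b` before ranking, so a less similar but eligible entry takes its place.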

---
### Comparison Between VectorDB Options

Below are three VectorDB options that have been trialed or are planned for trial:

| Feature                | FAISS                   | ChromaDB                 | Huawei CSS                  |
|------------------------|--------------------------|---------------------------|------------------------------|
| Deployment             | Self-hosted              | Local / Docker            | Managed Cloud Service        |
| Persistence            | ❌ Manual                | ✅ Built-in               | ✅ Built-in                  |
| Metadata Filtering     | ❌                       | ✅                        | ✅                           |
| Integration Ease       | ⚠️ Manual setup          | ✅ Python-native          | ⚠️ Needs API wrapper         |
| Scalability            | High (RAM/GPU-based)     | Moderate (~millions)      | High (cloud-native)          |
| Best For               | Fast, local search       | Prototyping & testing     | Production RAG pipelines     |


---

### Which Are We Using?

We're using **ChromaDB** locally for its simplicity, built-in persistence, and seamless integration with tools like LangChain—ideal for quick experimentation.\
For production, we’ll switch to **Huawei Cloud CSS**, a managed and scalable vector database suited for large-scale RAG applications with hybrid search support.\
This setup balances ease of development locally with performance and scalability in deployment.

### **5.1.1** ‎ ‎ Embedding the Documents

Before indexing, we need to embed the documents that we plan to store in the vector database. \
As mentioned, **embedding** is the process of converting text into a fixed-size numeric vector that captures its meaning. \
These vectors allow us to compare and search text semantically, not just by keywords.
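
For reference, a common way to compare two embeddings is cosine similarity. The formula is standard; the symbols $\mathbf{u}$ and $\mathbf{v}$ (two embeddings of dimension $d$) are notation introduced here for illustration and do not appear elsewhere in the notebook.

$$
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
= \frac{\sum_{i=1}^{d} u_i v_i}{\sqrt{\sum_{i=1}^{d} u_i^2} \, \sqrt{\sum_{i=1}^{d} v_i^2}}
$$

As a small worked example with made-up vectors $\mathbf{u} = (1, 2, 2)$ and $\mathbf{v} = (2, 1, 2)$: the dot product is $2 + 2 + 4 = 8$, both norms are $\sqrt{9} = 3$, so the similarity is $8 / 9 \approx 0.89$. Values near 1 indicate vectors pointing in nearly the same direction, which for embeddings corresponds to similar meaning.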


### **Convert Data to Text Chunks**

In [14]:
import textwrap

def format_issue_row(row):
    return textwrap.dedent(f"""\
        Description: {row.description}
        Location: {row.full_address} ({row.latitude:.4f}, {row.longitude:.4f})
        Reported on: {row.datetime_reported.strftime('%Y-%m-%d')}
        Severity: {row.severity}, Status: {row.status}
        Category: {row.issue_type} > {row.subcategory}
        Agency: {row.agency}
    """).strip()

documents = [format_issue_row(row) for _, row in issues_df.iterrows()]

metadata = [
    {
        "issue_id": int(row.issue_id),
        "agency": row.agency,
        "issue_type": row.issue_type,
        "subcategory": row.subcategory,
        "latitude": float(row.latitude),
        "longitude": float(row.longitude),
        "status": row.status,
        "severity": row.severity,
        "reported_on": row.datetime_reported.strftime("%Y-%m-%d"),
        "full_address": row.full_address,
    }
    for _, row in issues_df.iterrows()
]

print(documents)
print(metadata)

['Description: Kind author which court back explain daughter. Within five should night very lot when.\nLocation: Bedok Reservoir (1.3361, 103.9281)\nReported on: 2025-04-15\nSeverity: Critical, Status: Closed\nCategory: Pests > Cockroaches in Food Establishment\nAgency: Housing and Development Board (HDB)', 'Description: Drop trouble action marriage table. Energy some international who say group travel reveal.\nLocation: Toa Payoh Central (1.3321, 103.8478)\nReported on: 2025-04-15\nSeverity: Critical, Status: Closed\nCategory: Animals & Bird > Bird Issues\nAgency: Housing and Development Board (HDB)', 'Description: Analysis meet town environment. Treatment its whatever success.\nLocation: Jurong East MRT (1.3331, 103.7420)\nReported on: 2025-04-15\nSeverity: Low, Status: Closed\nCategory: Pests > Rodents in Food Establishment\nAgency: Building and Construction Authority (BCA)', 'Description: Put foreign side news pattern nor. Page language local million set deal great.\nLocation: Bedo

### **Recursive Text Splitting (Token-aware)**
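
The notebook leaves this step empty: each issue record is formatted as one short document and embedded whole. For longer documents, a recursive splitter tries coarse separators first (paragraphs, then lines, then sentences, then words) and falls back to finer ones only when a piece is still over budget. The sketch below is a hypothetical standard-library illustration of that idea, not the LangChain splitter. `count_tokens` approximates tokens by whitespace-separated words, which is an assumption; a real pipeline would count with the embedding model's tokenizer.

```python
from typing import List

SEPARATORS = ["\n\n", "\n", ". ", " "]


def count_tokens(text: str) -> int:
    """Approximate token count (assumption: one token per whitespace-separated word)."""
    return len(text.split())


def recursive_split(text: str, max_tokens: int,
                    separators: List[str] = SEPARATORS) -> List[str]:
    """Split text into chunks of at most max_tokens, preferring coarse separators."""
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= max_tokens:
        return [text]
    if not separators:
        # No separator left: hard-cut on words as a last resort.
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    buffer = ""
    for piece in text.split(sep):
        candidate = f"{buffer}{sep}{piece}" if buffer else piece
        if count_tokens(candidate) <= max_tokens:
            buffer = candidate  # the piece still fits in the current chunk
            continue
        if buffer:
            chunks.append(buffer.strip())
        buffer = ""
        if count_tokens(piece) <= max_tokens:
            buffer = piece
        else:
            # The piece alone is too large: recurse with finer separators.
            chunks.extend(recursive_split(piece, max_tokens, finer))
    if buffer:
        chunks.append(buffer.strip())
    return chunks


sample = ("Alpha beta gamma. Delta epsilon zeta eta.\n\n"
          "Theta iota kappa lambda mu nu xi omicron")
chunks = recursive_split(sample, max_tokens=4)
print(chunks)
```

Note that a separator sitting exactly on a chunk boundary is dropped in this sketch, so a sentence-final period can be lost there.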

### **Generate Embeddings**

We are using a `SentenceTransformer` model from Hugging Face to convert the texts into vectors (embeddings), so we'll need the `sentence-transformers` dependency:

In [2]:
!pip install sentence-transformers --quiet

In [15]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(documents, show_progress_bar=True)
embeddings = [e.tolist() for e in embeddings]


### **5.1.2** ‎ ‎ Indexing with ChromaDB

To create and query a vector store with ChromaDB, you'll need to install the following package:

In [16]:
!pip install chromadb --quiet

In [17]:
import chromadb

# Initialize a persistent (on-disk) ChromaDB client and open the collection
chroma_client = chromadb.PersistentClient(path="../vector_stores/chroma_store_textonly")
collection = chroma_client.get_or_create_collection("municipal_issues")

# Add to Chroma (upsert, so re-running this cell updates existing IDs instead of failing or duplicating)
ids = [f"issue_{meta['issue_id']}" for meta in metadata]

collection.upsert(
    documents=documents,
    embeddings=embeddings,
    metadatas=metadata,
    ids=ids
)

print("Stored all issue documents in Chroma.")

Stored all issue documents in Chroma.


### **Query ChromaDB via Vector Similarity Search**

For ChromaDB (and most vector databases), the default search method is vector similarity search, typically using cosine similarity or Euclidean distance. \
When a user submits a query, the retrieval system follows these steps:

1. Embed the query (Chroma does this for you if the collection has an embedding function; in this notebook we embed it ourselves with the same `SentenceTransformer` model).

2. Calculate the similarity between the query embedding and the stored document embeddings.

3. Return the top N most similar documents based on the similarity score.
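
Steps 2 and 3 amount to scoring every stored vector against the query vector and keeping the best N. The sketch below does this exhaustively with made-up two-dimensional vectors and hypothetical helper names, and compares the two metrics mentioned above. Real vector databases, Chroma included, use approximate nearest-neighbour indexes rather than a full scan.

```python
import heapq
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_n_by_cosine(query_vec: List[float], stored: Dict[str, List[float]], n: int) -> List[str]:
    # Higher cosine similarity means more similar.
    return heapq.nlargest(n, stored, key=lambda doc_id: cosine_similarity(query_vec, stored[doc_id]))


def top_n_by_l2(query_vec: List[float], stored: Dict[str, List[float]], n: int) -> List[str]:
    # Lower distance means more similar.
    return heapq.nsmallest(n, stored, key=lambda doc_id: squared_l2(query_vec, stored[doc_id]))


stored = {"A": [3.0, 0.0], "B": [1.0, 0.5], "C": [0.0, 2.0]}
query_vec = [1.0, 0.0]

print(top_n_by_cosine(query_vec, stored, 2))  # ['A', 'B']
print(top_n_by_l2(query_vec, stored, 2))      # ['B', 'A']
```

The two metrics disagree here because the toy vectors have different lengths: cosine ignores magnitude, while Euclidean distance does not. For unit-length vectors the two orderings coincide, since the squared distance equals 2 minus twice the cosine similarity.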

In [35]:
# Use a separate name so the SQL `query` string defined earlier is not overwritten
user_query = "Rodent problems in Ang Mo Kio"
query_embedding = model.encode(user_query).tolist()

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    where={"agency": "National Environment Agency (NEA)"}
)

for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print("---")
    print(doc)
    print(meta)


---
Description: Foreign gun every field treatment. Side here back site artist key out.
Location: Ang Mo Kio Hub (1.3691, 103.8494)
Reported on: 2025-04-13
Severity: High, Status: Closed
Category: Pests > Rodents in Food Establishment
Agency: National Environment Agency (NEA)
{'full_address': 'Ang Mo Kio Hub', 'severity': 'High', 'reported_on': '2025-04-13', 'subcategory': 'Rodents in Food Establishment', 'issue_type': 'Pests', 'latitude': 1.3691, 'agency': 'National Environment Agency (NEA)', 'status': 'Closed', 'longitude': 103.8494, 'issue_id': 82}
---
Description: Charge time cultural far relationship article suddenly. Treat above pretty power organization focus interest. Finally paper special whether.
Location: Ang Mo Kio Hub (1.3691, 103.8494)
Reported on: 2025-04-13
Severity: Medium, Status: Closed
Category: Pests > Rodents in Common Areas
Agency: National Environment Agency (NEA)
{'latitude': 1.3691, 'severity': 'Medium', 'issue_id': 228, 'issue_type': 'Pests', 'subcategory': '

# **5.2** ‎ RAG Reranking

**RAG reranking** is an optional but powerful enhancement to the RAG workflow. \
After the initial retrieval, a reranking step re-sorts the retrieved chunks by their semantic relevance to the user query. \
This is done using more precise models, often **cross-encoders** or hosted **reranker APIs**. \
These models consider the full interaction between the query and each chunk, rather than comparing two independently computed embeddings.

Reranking is especially useful when:
- Many chunks have similar embedding scores

- The top-ranked results include noise or loosely related content

- The LLM output depends heavily on receiving the most contextually aligned chunks

By reordering retrieved chunks before passing them to the LLM, reranking helps **prioritise the most meaningful context**. \
This can lead to more accurate, focused, and trustworthy responses to municipal queries.
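
The two-stage pattern can be sketched without any model at all. In the example below, `toy_relevance` is a hypothetical stand-in for a cross-encoder: it scores a (query, passage) pair jointly, here simply as the share of query words found in the passage. The candidate passages are invented for illustration and listed in the order a first-stage vector search might have returned them.

```python
from typing import Callable, List, Tuple


def toy_relevance(query: str, passage: str) -> float:
    """Hypothetical stand-in for a cross-encoder: share of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words)


def rerank(query: str, candidates: List[str],
           score_fn: Callable[[str, str], float], top_k: int) -> List[Tuple[str, float]]:
    """Score each (query, passage) pair jointly and keep the top_k passages."""
    scored = [(passage, score_fn(query, passage)) for passage in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Candidates in first-stage retrieval order (made-up data).
candidates = [
    "pest control schedule for the estate",
    "rodent sighting reported in ang mo kio",
    "bird nuisance near ang mo kio hub",
]

reranked = rerank("rodent problems in ang mo kio", candidates, toy_relevance, top_k=2)
for passage, score in reranked:
    print(f"{score:.2f}  {passage}")
```

The scoring function runs once per candidate, which is why reranking is applied to a short list of retrieved chunks rather than to the whole collection.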

### **5.2.1** ‎ ‎ Flashrank Reranking

**Flashrank** is an open-source reranking library built on **cross-encoder models**: it scores the relevance of each chunk by jointly encoding the query and the passage. \
Unlike vector similarity search (which compares independently computed embeddings), Flashrank evaluates the full interaction between the query and each document. \
This typically makes it more accurate for fine-grained ranking, at the cost of extra compute per candidate. \
With compact models such as `ms-marco-MiniLM-L-12-v2`, Flashrank balances speed and semantic precision. \
This suits RAG pipelines that need smarter result ordering before generation.

#### ✅ Advantages
- **High Precision** – Cross-encoders deeply model the interaction between query and document

- **Plug-and-Play** – Easy to integrate with any retrieval system (e.g. Chroma, FAISS)

- **Open Source** – No API costs, fully local, and customizable

- **Lightweight Models Available** – Fast enough for real-time use with smaller models

#### ⚠️ Limitations
- **Slower than Embedding Search** – Requires forward passes per (query, document) pair

- **Scales Linearly with Candidates** – Best used on top-k results, not entire corpus

- **No Training Out of the Box** – You must choose a suitable pre-trained reranker model

Flashrank is best used for **post-retrieval reranking**: reordering the top-N retrieved results so that only the most relevant chunks are passed to the LLM. \
To use Flashrank, install the following if you have not already done so (the wrapper below also relies on LangChain):

In [6]:
!pip install --upgrade --quiet  flashrank

#### Wrapper Function for Flash Reranker

In [45]:
from typing import Dict, List, Optional
from langchain_core.documents import Document
from langchain.retrievers.document_compressors import FlashrankRerank

def rerank_chroma_results_with_flashrank(
    collection,
    query: str,
    embedding_model,
    filter: Optional[Dict] = None,
    top_n: int = 5,
    rerank_top_k: int = 3,
) -> List[Document]:
    """
    Performs a filtered ChromaDB query, then reranks the results using Flashrank via LangChain.

    Parameters:
        collection: The chromadb.Collection object.
        query (str): The user query to search against.
        embedding_model: The embedding model used to encode the query.
        filter (dict, optional): Filter to apply in Chroma query (e.g., {"agency": "NEA"}).
        top_n (int): Number of results to retrieve from Chroma before reranking.
        rerank_top_k (int): Number of top reranked documents to return.

    Returns:
        List[Document]: LangChain Document objects sorted by semantic relevance.
    """
    # Encode query into embedding
    query_embedding = embedding_model.encode(query).tolist()

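    # Query Chroma; pass None rather than an empty dict when no filter is given,
    # since some Chroma versions reject an empty `where` clause
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_n,
        where=filter or None
    )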

    docs = results["documents"][0]
    metas = results["metadatas"][0]

    if not docs:
        return []

    # Wrap into LangChain Documents
    wrapped_docs = [
        Document(page_content=doc, metadata=meta)
        for doc, meta in zip(docs, metas)
    ]

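    # Apply Flashrank reranking. FlashrankRerank keeps only its own `top_n`
    # documents (3 by default), so pass rerank_top_k through explicitly.
    reranker = FlashrankRerank(top_n=rerank_top_k)
    reranked_docs = reranker.compress_documents(wrapped_docs, query)

    return reranked_docs[:rerank_top_k]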

#### Using Flash Reranker with ChromaDB

In [46]:
reranked_docs = rerank_chroma_results_with_flashrank(
    collection=collection,
    query="Rodent problems in Ang Mo Kio",
    embedding_model=model,
    filter={"agency": "National Environment Agency (NEA)"},
    top_n=5,
    rerank_top_k=3
)

for i, doc in enumerate(reranked_docs, 1):
    print(f"Rank {i}")
    print("Score:", doc.metadata.get("relevance_score"))
    print("Text:", doc.page_content)
    print("Metadata:", doc.metadata)
    print("---")


Rank 1
Score: 0.9842671
Text: Description: Charge time cultural far relationship article suddenly. Treat above pretty power organization focus interest. Finally paper special whether.
Location: Ang Mo Kio Hub (1.3691, 103.8494)
Reported on: 2025-04-13
Severity: Medium, Status: Closed
Category: Pests > Rodents in Common Areas
Agency: National Environment Agency (NEA)
Metadata: {'id': 1, 'relevance_score': np.float32(0.9842671), 'agency': 'National Environment Agency (NEA)', 'full_address': 'Ang Mo Kio Hub', 'longitude': 103.8494, 'reported_on': '2025-04-13', 'issue_id': 228, 'subcategory': 'Rodents in Common Areas', 'severity': 'Medium', 'latitude': 1.3691, 'status': 'Closed', 'issue_type': 'Pests'}
---
Rank 2
Score: 0.9096412
Text: Description: Foreign gun every field treatment. Side here back site artist key out.
Location: Ang Mo Kio Hub (1.3691, 103.8494)
Reported on: 2025-04-13
Severity: High, Status: Closed
Category: Pests > Rodents in Food Establishment
Agency: National Environmen