<h1 align="center" style="background-color:black; color:lime; padding:10px; font-family:monospace;">
📜 pdf_loader.py
</h1>

```python
from langchain_community.document_loaders import PyPDFLoader

def load_pdf(file_path: str):
    """
    Load a PDF file and return its content as a list of documents.

    Args:
        file_path (str): The path to the PDF file.

    Returns:
        list: A list of documents extracted from the PDF.
    """
    loader = PyPDFLoader(file_path)
    documents = loader.load()
    
    return documents
```

---

## **Step-by-Step Explanation**

### **1. Import**

```python
from langchain_community.document_loaders import PyPDFLoader
```

* **`PyPDFLoader`** is a document loader class in LangChain’s **community** package (`langchain_community`). It relies on the `pypdf` library being installed.
* It **reads PDF files** and returns their contents in the format the rest of LangChain works with: a **list of Document objects**.
* Each **Document** contains:

  * `.page_content` → the text of the page.
  * `.metadata` → extra info like page number, file name, etc.

Think of this as a **PDF-to-structured-text converter** that’s LangChain-friendly.

---

### **2. Function Definition**

```python
def load_pdf(file_path: str):
```

* `file_path` → the location of your PDF file as a **string**.
* The `: str` after `file_path` is **type hinting**, telling other developers (and IDEs) that this parameter should be a string.

---

### **3. Loader Creation**

```python
loader = PyPDFLoader(file_path)
```

* Here you’re **creating an instance** of `PyPDFLoader` and telling it which PDF file to read.
* At this point, the file isn’t fully processed yet — you’ve just prepared a “loader” object that knows **where to look**.

---

### **4. Load the PDF**

```python
documents = loader.load()
```

* `.load()` actually **opens the PDF, reads each page, and extracts the text**.
* It splits the PDF into **a list of `Document` objects** — usually one per page.
* Example of what `documents` might look like:

```python
[
    # PyPDFLoader numbers pages from 0, so the first page has "page": 0
    Document(page_content="Text from page 1", metadata={"page": 0, "source": "file.pdf"}),
    Document(page_content="Text from page 2", metadata={"page": 1, "source": "file.pdf"})
]
```

* This structure is perfect for **LLM pipelines** because:

  1. You can process each page separately.
  2. Metadata helps you know where text came from.
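
To make those two points concrete, here is a minimal sketch that walks over page-level documents and labels each one using its metadata. `PageDoc` is a hypothetical stand-in for LangChain's `Document` so the snippet runs without extra dependencies, the sample text is made up, and the page index is assumed to start at 0, as `PyPDFLoader` reports it.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for LangChain's Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Made-up sample data shaped like the output of loader.load()
documents = [
    PageDoc("Text from page 1", {"page": 0, "source": "file.pdf"}),
    PageDoc("Text from page 2", {"page": 1, "source": "file.pdf"}),
]


def describe(doc: PageDoc) -> str:
    """Build a short label showing where a piece of text came from."""
    page = doc.metadata.get("page", "?")
    source = doc.metadata.get("source", "unknown")
    return f"{source} (page index {page}): {len(doc.page_content)} characters"


# Each page is handled on its own, and the metadata travels with it
labels = [describe(doc) for doc in documents]
for label in labels:
    print(label)
```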

---

### **5. Return the Documents**

```python
return documents
```

* The function sends back the **list of Document objects** so you can use them later for:

  * Splitting into smaller chunks
  * Creating embeddings
  * Storing in a vector database
  * Running LLM queries

---

## **Why This Function is Useful**

* You now have a **reusable utility**: call `load_pdf("myfile.pdf")` to get the PDF’s extractable text in LangChain’s Document format. Scanned, image-only pages yield little or no text unless OCR is added.
* It hides the internal complexity of opening, parsing, and structuring the PDF.
* You can later swap `PyPDFLoader` for another loader (e.g., `PyMuPDFLoader`) without changing the rest of your code.
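
One way to make that swap explicit is to pass the loader class in as a parameter. This is only a sketch, not part of the original module: `load_with`, `FakePdfLoader`, and its canned output are hypothetical, and the only assumption about real loaders is that they accept a file path and expose `.load()`, as `PyPDFLoader` does above.

```python
from typing import Any, Callable, List


def load_with(loader_cls: Callable[[str], Any], file_path: str) -> List[Any]:
    """Instantiate any loader that takes a path and exposes .load()."""
    loader = loader_cls(file_path)
    return loader.load()


class FakePdfLoader:
    """Hypothetical stand-in loader so the sketch runs without LangChain."""

    def __init__(self, file_path: str):
        self.file_path = file_path

    def load(self) -> List[dict]:
        # A real loader would parse the file; this one returns canned data.
        return [{"page_content": "hello", "metadata": {"source": self.file_path}}]


pages = load_with(FakePdfLoader, "myfile.pdf")
print(pages[0]["metadata"]["source"])
```

With LangChain installed, `load_with(PyPDFLoader, "myfile.pdf")` would follow the same path, and the calling code would not need to change when the loader does.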

---

✅ **In short**:
This function **loads a PDF**, processes it into a **list of page-level Document objects**, and returns them — making it the first step in most **RAG (Retrieval-Augmented Generation)** pipelines.

<h1 align="center" style="background-color:black; color:lime; padding:10px; font-family:monospace;">
📜 text_splitter.py
</h1>

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from typing import List
from langchain_core.documents import Document

def split_text(documents: List[Document], chunk_size: int = 512, chunk_overlap: int = 64) -> List[Document]:
    """
    Split a list of LangChain Document objects into smaller chunks.

    Args:
        documents (List[Document]): The documents to split.
        chunk_size (int): Max size of each chunk.
        chunk_overlap (int): Overlap between chunks.

    Returns:
        List[Document]: Smaller chunks with metadata preserved.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ".", " ", ""]
    )
    chunks = splitter.split_documents(documents) 
    
    return chunks
```

---

## **1️⃣ What’s the purpose of this function?**

When working with **Large Language Models (LLMs)**, we can’t pass in huge documents (like an entire PDF) in one go. Models have a **limited context window** (the exact size depends on the model), and embedding-based search works best on short, focused passages.
So, we **split big documents into smaller chunks** while keeping each chunk meaningful on its own.

---

## **2️⃣ Imports explained**

* **`RecursiveCharacterTextSplitter`**

  * A LangChain text splitter that prefers **natural boundaries** (paragraphs, lines, sentences, words) and cuts in the middle of a word only as a last resort.
  * It tries larger separators (`\n\n` = paragraph) first, then smaller ones (`\n`, `.` = sentence, `" "` = space), and finally splits at the character level if needed.

* **`List`** (from `typing`)

  * Used for type hints to specify that `documents` is a **list** of something.

* **`Document`** (from `langchain_core.documents`)

  * LangChain’s special object that holds both:

    1. **Text content** (`page_content`)
    2. **Metadata** (like page number, source filename, etc.)

---

## **3️⃣ Function Parameters**

* **`documents`**: A list of LangChain `Document` objects (already loaded from a PDF or other source).
* **`chunk_size`**:

  * Maximum number of **characters** in each chunk.
  * Default: `512`.
* **`chunk_overlap`**:

  * Number of characters to **repeat** between consecutive chunks.
  * This helps continuity: a sentence that straddles a chunk boundary is more likely to appear whole in at least one chunk.
  * Default: `64`.

---

## **4️⃣ How the splitter works**

```python
splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=["\n\n", "\n", ".", " ", ""]
)
```

* **`separators`**: Order in which the splitter tries to break text:

  1. Paragraphs (`\n\n`)
  2. New lines (`\n`)
  3. Sentences (`.`)
  4. Spaces (` `)
  5. No separator (character-by-character as a last resort)

The splitter tries to **preserve natural breaks** first, and only if that fails, it cuts smaller.
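
The fallback order can be sketched in plain Python. This is a simplified illustration of the idea, not LangChain's actual implementation: it ignores `chunk_overlap`, and the names `recursive_split` and `split_sketch` are made up for this example.

```python
from typing import List, Sequence

SEPARATORS = ["\n\n", "\n", ".", " ", ""]


def recursive_split(text: str, chunk_size: int, separators: Sequence[str]) -> List[str]:
    """Split text into pieces of at most chunk_size characters,
    trying coarse separators before fine ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: fixed-width character slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)
    pieces = text.split(sep)
    # Keep each separator attached to the piece it followed.
    parts = [p + sep for p in pieces[:-1]] + [pieces[-1]]
    chunks: List[str] = []
    current = ""
    for part in parts:
        if len(current) + len(part) <= chunk_size:
            current += part  # merge small neighbours into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # Still too long: retry this part with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


def split_sketch(text: str, chunk_size: int) -> List[str]:
    """Run the recursive split, then drop surrounding whitespace and empty pieces."""
    raw = recursive_split(text, chunk_size, SEPARATORS)
    return [c.strip() for c in raw if c.strip()]


text = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
pieces = split_sketch(text, 40)
for piece in pieces:
    print(repr(piece))
```

In this run the first and third paragraphs fit within 40 characters and stay whole, while the second one is too long and ends up being cut at a space.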

---

## **5️⃣ Splitting the documents**

```python
chunks = splitter.split_documents(documents)
```

* `split_documents`:

  * Takes a list of `Document` objects
  * Splits them into **multiple smaller `Document` objects**
  * **Preserves metadata** (e.g., if the original had `{"source": "file.pdf"}`, each chunk will still have that metadata)

---

## **6️⃣ Return value**

* **Returns**: A **list of small chunks**, each a `Document` object containing:

  * `page_content` → the chunk’s text
  * `metadata` → same metadata as the original

---

## **7️⃣ Example**

If you have **one PDF page** with 1,500 characters:

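* `chunk_size=512`, `chunk_overlap=64` → each new chunk starts `512 - 64 = 448` characters after the previous one, so an idealized split looks like:

  * **Chunk 1**: characters 0–512
  * **Chunk 2**: characters 448–960 (64 characters overlap with Chunk 1)
  * **Chunk 3**: characters 896–1408
  * **Chunk 4**: characters 1344–1500

  In practice the splitter cuts at separators, so real chunk boundaries will differ from these exact offsets.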

This overlap ensures sentences are not lost when chunks are processed separately.
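
The overlap arithmetic can be checked with a few lines of Python. This sketch assumes idealized fixed-size windows (the helper `window_offsets` is hypothetical); the real splitter cuts at separators, so its chunk boundaries will not match these offsets exactly.

```python
from typing import List, Tuple


def window_offsets(total: int, chunk_size: int, chunk_overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for fixed-size overlapping windows."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    offsets = []
    start = 0
    while start < total:
        end = min(start + chunk_size, total)
        offsets.append((start, end))
        if end == total:
            break  # the text is fully covered
        start += step
    return offsets


offsets = window_offsets(1500, 512, 64)
print(offsets)
```

For 1,500 characters this gives four windows, because a window starting at 896 can only reach 1408 and a final short window is needed for the rest.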

---

✅ **In short:**
This function takes large text documents and **cuts them into bite-sized pieces** that are small enough for LLMs to handle, **while preserving important context** between pieces.


<h1 align="center" style="background-color:black; color:lime; padding:10px; font-family:monospace;">
📜 embed_store.py
</h1>

## **Big Picture**

This function takes a list of **documents** (already loaded, e.g., from PDFs or text files), converts them into **embeddings** using Cohere’s multilingual model, and stores them inside a **Chroma vector database** on disk for future retrieval.

Think of it like:

📄 Documents → 🧠 Convert to numbers (embeddings) → 📦 Store in Chroma DB → 🔍 Search later.

---

## **Detailed Breakdown**

### 1️⃣ Imports

```python
from langchain_chroma import Chroma
```

* **Chroma** → a **vector store** that saves and retrieves embeddings efficiently.
* You can save it to disk and reload later.

```python
from langchain_community.embeddings import CohereEmbeddings
```

* **CohereEmbeddings** → LangChain wrapper for Cohere’s embedding API.
* We’ll use it to convert text into **vector representations**.

```python
from langchain_core.documents import Document
from typing import List
```

* **Document** → LangChain’s standard way to represent a text chunk + metadata.
* **List** → For type hinting, saying `documents` should be a **list** of `Document`.

```python
import os
from dotenv import load_dotenv
```

* `load_dotenv()` loads **environment variables** from a `.env` file (e.g., your API key).
* `os.getenv()` then reads `COHERE_API_KEY` from the environment, so the key never has to appear in your source code.

---

### 2️⃣ Load environment variables

```python
load_dotenv()
```

* Reads your `.env` file.
* Ensures `os.getenv("COHERE_API_KEY")` works.
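
Note that `os.getenv()` returns `None` when a variable is missing, which can surface later as a confusing authentication error. One possible guard is sketched below; the helper `require_env` and the variable name `DEMO_API_KEY` are made up for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value


# Demo with a made-up variable so the sketch does not touch real keys
os.environ["DEMO_API_KEY"] = "example-value"
demo_key = require_env("DEMO_API_KEY")
print(demo_key)
```

In the module above, the same helper could be called as `require_env("COHERE_API_KEY")` before creating the embedding model.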

---

### 3️⃣ The Function

```python
def embed_and_store_documents(
    documents: List[Document],
    persist_directory: str = "./chroma_store",
    model_name: str = "embed-multilingual-v3.0"
):
```

* **`documents`** → The list of already-loaded text chunks.
* **`persist_directory`** → Folder where Chroma will store the embeddings on disk.
* **`model_name`** → Which Cohere model to use. Default is multilingual (`embed-multilingual-v3.0`).

---

### 4️⃣ Create the Embedding Model

```python
embedding_model = CohereEmbeddings(
    cohere_api_key=os.getenv("COHERE_API_KEY"),
    model=model_name,
    user_agent="langchain" 
)
```

* Uses your **Cohere API key**.
* Uses the specified **model**.
* `user_agent="langchain"` is just an identifier for API usage tracking.

💡 **What it does:**
This object will take any text and return its **vector embedding** (a list of numbers).

---

### 5️⃣ Create / Load the Vector Store

```python
vectorstore = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding_model
)
```

* **`persist_directory`** → Where the Chroma DB will be saved so you can reload later.
* **`embedding_function`** → The embedding model to use for converting queries & documents.

---

### 6️⃣ Add Documents to the Vector Store

```python
vectorstore.add_documents(documents)
```

* Converts each `Document` into **embeddings** using the `embedding_model`.
* Saves them in the **Chroma DB**.
* This makes them **searchable by meaning**, not just keywords.

---

### 7️⃣ Confirmation

```python
print(f"✅ Stored {len(documents)} documents.")
```

* Prints how many documents were stored.

---

### 8️⃣ Return the Vector Store

```python
return vectorstore
```

* Lets you use the `vectorstore` immediately for retrieval.

---

## **Flow Summary**

1. Load your `.env` so API keys are available.
2. Create a Cohere embedding model.
3. Create / load a Chroma vector database.
4. Convert all your documents into embeddings.
5. Store them in the Chroma DB.
6. Return the DB object for querying later.

---

💡 **Example Usage**

```python
from langchain_core.documents import Document

docs = [
    Document(page_content="This is a sample text about AI."),
    Document(page_content="Another document about machine learning.")
]

vectorstore = embed_and_store_documents(docs)

# Later, you can search:
results = vectorstore.similarity_search("What is AI?")
for r in results:
    print(r.page_content)
```

<h1 align="center" style="background-color:black; color:lime; padding:10px; font-family:monospace;">
📜 retriever.py
</h1>

## **Code Purpose**

The `get_retriever` function sets up a **Retriever** using:

* **Cohere embeddings** for text vectorization (turning text into number vectors).
* **Chroma** as the **vector database** to store and search those vectors.

It returns a retriever that finds the 5 most similar documents for a given query.

---

## **Step-by-Step Breakdown**

```python
from langchain_chroma import Chroma
```

* **Imports the Chroma vector database integration** for LangChain.
* Chroma stores your documents in a vectorized form and allows fast similarity search.

---

```python
from langchain_community.embeddings import CohereEmbeddings
```

* Imports **CohereEmbeddings** class from LangChain's community module.
* Cohere provides embedding models like `"embed-multilingual-v3.0"` that support multiple languages.

---

```python
import os
```

* Imports Python’s built-in `os` module to access environment variables (for the API key).

---

```python
def get_retriever(persist_directory="./chroma_store", model_name="embed-multilingual-v3.0"):
```

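* Defines a **function** `get_retriever` with two optional parameters:

  * `persist_directory` → Folder where Chroma stores its data files.
  * `model_name` → Name of the Cohere embedding model (default: multilingual model). It should match the model used when the documents were stored, otherwise query and document vectors are not comparable.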

---

```python
    embedding = CohereEmbeddings(
        cohere_api_key=os.getenv("COHERE_API_KEY"),
        model=model_name,
        user_agent="langchain"
    )
```

* Creates an **embedding object** that:

  * Reads the Cohere API key from an **environment variable** (`COHERE_API_KEY`).
  * Uses the specified `model_name` to generate embeddings.
  * Sets `user_agent="langchain"` so Cohere knows this request is from a LangChain app.

💡 **What embeddings do**: They convert text into numerical vectors so we can measure "semantic similarity" between pieces of text.
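
To see what "semantic similarity" means numerically, here is a small sketch using cosine similarity on made-up 3-dimensional vectors. Real embeddings have many more dimensions, and cosine similarity is only one common measure; which distance Chroma actually uses depends on how the collection is configured.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as "not similar" to avoid dividing by zero
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for embeddings of three short texts
passport = [0.9, 0.1, 0.0]
visa = [0.8, 0.2, 0.1]
weather = [0.0, 0.1, 0.9]

related = cosine_similarity(passport, visa)
unrelated = cosine_similarity(passport, weather)
print(f"passport vs visa:    {related:.3f}")
print(f"passport vs weather: {unrelated:.3f}")
```

The related pair scores close to 1, while the unrelated pair scores close to 0.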

---

```python
    vectorstore = Chroma(   
        persist_directory=persist_directory,
        embedding_function=embedding
    )
```

* Creates a **Chroma vector database** object that:

  * Uses `persist_directory` for saving/reloading the database.
  * Uses the `embedding_function` (Cohere) to embed new data when needed.

---

```python
    return vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})
```

* Turns the vector store into a **Retriever**.
* **`search_type="similarity"`** → finds documents whose embeddings are closest to the query embedding.
* **`search_kwargs={"k": 5}`** → returns the **top 5 most similar** documents.

---

## **How It Works Together**

1. **Embedding Model** — Cohere turns your query & documents into vectors.
2. **Vector Store** — Chroma stores those vectors in a database on disk (`./chroma_store`).
3. **Retriever** — Instead of you writing the search logic, `.as_retriever()` returns a ready-to-use search tool.
4. **Result** — When you pass a query to the retriever, it:

   * Embeds the query with Cohere.
   * Compares it with stored vectors.
   * Returns the top-5 most relevant document chunks.
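
The "compare and return the top results" step can be sketched without any vector database. Everything here is illustrative: the in-memory `index`, its made-up vectors, and the plain dot-product score are stand-ins for what Chroma and Cohere do at scale.

```python
import heapq
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: Sequence[float], index: Dict[str, Sequence[float]], k: int = 5) -> List[str]:
    """Return the k stored texts whose vectors score highest against the query."""
    return heapq.nlargest(k, index, key=lambda text: dot(query_vec, index[text]))


# Made-up "vector store": text mapped to a tiny pretend embedding
index = {
    "How to apply for a passport": [0.9, 0.1, 0.0],
    "Passport renewal fees": [0.8, 0.3, 0.1],
    "NIC application steps": [0.2, 0.9, 0.1],
    "Land registry opening hours": [0.1, 0.2, 0.9],
}

query_vec = [1.0, 0.0, 0.0]  # pretend this is the embedded user query
hits = retrieve(query_vec, index, k=2)
print(hits)
```

A real retriever does the same ranking, but uses an index structure so it does not have to score every stored vector one by one.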

---

✅ **Example Usage**

```python
retriever = get_retriever()

# Search for relevant documents
docs = retriever.invoke("தமிழ் மொழி வரலாறு")  # Tamil language history
for doc in docs:
    print(doc.page_content)
```

<h1 align="center" style="background-color:black; color:lime; padding:10px; font-family:monospace;">
📜 qa_with_retriever.py
</h1>

## **Purpose of This Code**

This script connects **Google’s Gemini AI model** with a **Retriever** (powered by Chroma + Cohere embeddings) to build a **multilingual Q&A bot for Sri Lankan government services**.

When a user asks a question:

1. The retriever searches for relevant document chunks from stored data.
2. The Gemini model generates an answer in **the same language** as the question, using only the retrieved context.

---

## **Step-by-Step Breakdown**

### **Imports**

```python
import os
import google.generativeai as genai
from dotenv import load_dotenv
from modules.retriever import get_retriever
```

* **`os`** — Access environment variables (API keys).
* **`google.generativeai`** — Google’s SDK for Gemini AI models.
* **`dotenv`** — Loads `.env` file so API keys can be stored securely.
* **`get_retriever`** — Custom function (from your earlier code) that sets up a Chroma-based retriever for searching stored documents.

---

### **Load Environment Variables**

```python
load_dotenv()
```

* Reads the `.env` file and loads values into environment variables.
* This is where your **`GOOGLE_API_KEY`** is stored.

---

### **Configure Gemini API**

```python
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
gemini_model = genai.GenerativeModel("gemini-1.5-flash")
```

* **`genai.configure`** → Passes your Gemini API key so Google can authenticate your requests.
* **`GenerativeModel("gemini-1.5-flash")`** → Creates a client object for the **Gemini 1.5 Flash** model (fast, multi-modal, and supports multiple languages). Nothing is downloaded; the model runs on Google’s servers.

---

### **Set Up the Retriever**

```python
retriever = get_retriever(
    persist_directory="./chroma_store",
    model_name="embed-multilingual-v3.0"
)
```

* Creates a **Retriever**:

  * Uses the multilingual Cohere embedding model to handle **Tamil, Sinhala, and English**.
  * Reads from your **Chroma vector database** stored in `./chroma_store`.

---

## **Main Function**

```python
def answer_query_with_gemini(query: str) -> str:
```

* Takes the **user’s question** (`query`) and returns an AI-generated answer.

---

### **Step 1: Retrieve Relevant Context**

```python
docs = retriever.invoke(query)
if not docs:
    return "No relevant documents found for your question."
```

* **`retriever.invoke(query)`** → Finds the top 5 most semantically similar document chunks to the query.
* If no results → returns a message saying nothing relevant was found.

---

```python
context = "\n\n".join(doc.page_content for doc in docs)
```

* Joins all retrieved document texts into a **single block of context**.
* This context will be fed into Gemini so it only answers from relevant official information.
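
The same retrieved chunks also carry metadata, so the context-building step could collect source names at the same time. This is an optional variation, not what the original function does: `Doc`, `build_context`, and the sample chunks are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_context(docs: List[Doc]) -> Tuple[str, List[str]]:
    """Join chunk texts into one context block and list their distinct sources."""
    context = "\n\n".join(doc.page_content for doc in docs)
    sources: List[str] = []
    for doc in docs:
        source = doc.metadata.get("source")
        if source and source not in sources:
            sources.append(source)  # keep first-seen order, skip duplicates
    return context, sources


# Made-up chunks standing in for retriever output
docs = [
    Doc("Chunk one.", {"source": "a.pdf"}),
    Doc("Chunk two.", {"source": "a.pdf"}),
    Doc("Chunk three.", {"source": "b.pdf"}),
]
context, sources = build_context(docs)
print(context)
print(sources)
```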

---

### **Step 2: Prepare the Prompt for Gemini**

```python
prompt = f"""
You are a trusted virtual assistant for Sri Lankan government services.

Your job is to help users in **Tamil**, **Sinhala**, or **English**, depending on the language of the question. Do NOT translate. Reply in the same language.

Use the following context from official documents to answer the user's question. 

--- CONTEXT START ---
{context}
--- CONTEXT END ---

--- USER QUESTION ---
{query}
--- END ---

Guidelines:
- ✅ Respond only using the given context.
- ❌ If not found in context, say "I'm not sure based on the available information."
- 🧾 Format the answer in **clear bullet points or numbered steps**.
- 🔁 Do not repeat the question.

Answer:
"""
```

This is a **structured, multilingual-aware prompt** that:

* Tells Gemini it is a **Sri Lankan government services assistant**.
* Instructs Gemini to answer in **the same language as the query**.
* Restricts answers to **only the retrieved context**.
* Encourages **bullet points or numbered steps**.
* Tells it **not to repeat the question**.

---

### **Step 3: Generate the Answer**

```python
try:
    response = gemini_model.generate_content(prompt)
    return response.text.strip()
except Exception as e:  
    return f"Gemini API error: {e}"
```

* **`generate_content`** → Sends the prompt to Gemini and gets a response.
* **`.text.strip()`** → Returns the clean text answer (without extra spaces).
* If Gemini API fails → Returns an error message.

<h1 align="center" style="background-color:black; color:lime; padding:10px; font-family:monospace;">
📜 app.py
</h1>

## **📂 File Overview**

This script is a **Streamlit web app** that acts as a **chatbot interface** for answering questions about Sri Lankan government services.
It connects the **frontend UI** (Streamlit) with your **backend logic** (`answer_query_with_gemini` function) to display responses in a **chat format**.

---

## **1️⃣ Imports**

```python
import streamlit as st
from modules.qa_with_retriever import answer_query_with_gemini
```

* **`streamlit` (`st`)** → Used to create the **UI components** (sidebar, inputs, chat bubbles).
* **`answer_query_with_gemini`** → The **function you wrote earlier** that:

  1. Retrieves relevant documents from ChromaDB.
  2. Sends them to **Google Gemini** for an AI-generated answer.
  3. Returns that answer back to the app.

---

## **2️⃣ Page Configuration**

```python
st.set_page_config(
    page_title="🇱🇰 Gov Services FAQ Bot",
    page_icon="🤖",
    layout="wide",
    initial_sidebar_state="expanded"
)
```

* **`page_title`** → Title shown in the browser tab.
* **`page_icon`** → Emoji for the favicon.
* **`layout="wide"`** → Makes the app span the entire width.
* **`initial_sidebar_state="expanded"`** → Sidebar is open by default.

---

## **3️⃣ Sidebar UI**

```python
with st.sidebar:
    st.title("🤖 Sri Lanka Gov FAQ Bot")
    st.markdown(""" ... """)
    st.caption("Powered with ❤️ by Gemini, Cohere & Streamlit")
```

* Displays **bot name** and description.
* Lists the **technologies used**.
* Gives **instructions** on how to use the bot.
* Adds a **developer credit section** (name, GitHub, email).
* Includes a **disclaimer** (important for AI apps).

---

## **4️⃣ Main Title**

```python
st.markdown(
    "<h2 style='text-align:center; margin-bottom:0;'>🇱🇰 Government Services FAQ Assistant</h2>",
    unsafe_allow_html=True
)
st.markdown(
    "<p style='text-align:center; color: gray; margin-top:5px;'>Ask questions in Tamil, Sinhala, or English</p>",
    unsafe_allow_html=True
)
```

* HTML is used here to **center-align** and **style** the text.
* `unsafe_allow_html=True` → Allows HTML styling inside Streamlit markdown.

---

## **5️⃣ Chat History Storage**

```python
if "messages" not in st.session_state:
    st.session_state.messages = []
```

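* `st.session_state` → Stores data that **persists across reruns** within one browser session. Streamlit re-executes the script from top to bottom on every interaction, so ordinary variables are reset each time.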
* `"messages"` → List of all past messages (`role` and `content`) so that:

  * Chat history **doesn’t disappear** after every response.
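
The "initialize once, then keep appending" pattern can be simulated in plain Python. Here a regular dict stands in for `st.session_state`, and the hypothetical `rerun` function plays the role of one top-to-bottom execution of the script.

```python
from typing import Optional


def rerun(session_state: dict, new_message: Optional[dict] = None) -> int:
    """Simulate one script run against a store that outlives the run."""
    if "messages" not in session_state:
        session_state["messages"] = []  # only happens on the very first run
    if new_message is not None:
        session_state["messages"].append(new_message)
    return len(session_state["messages"])


session = {}  # hypothetical stand-in for st.session_state

rerun(session, {"role": "user", "content": "Hello"})
rerun(session, {"role": "assistant", "content": "Hi, how can I help?"})
count = rerun(session)  # a rerun with no new input keeps the history

print(count)
```

Without the `if "messages" not in ...` guard, every run would reset the list and the history would be lost.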

---

## **6️⃣ Display Previous Chat Messages**

```python
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"], unsafe_allow_html=True)
```

* Loops through `messages` and displays each in a **chat bubble**.
* `msg["role"]` is either `"user"` or `"assistant"`.
* `msg["content"]` is the actual text.

---

## **7️⃣ User Input Field**

```python
query = st.chat_input("Type your question about passports, NIC, land docs, and more...")
```

* Special Streamlit component for a **chat-style input box**.
* When the user types and hits **Enter**, `query` will store that text.

---

## **8️⃣ When User Sends a Message**

```python
if query:
    st.session_state.messages.append({"role": "user", "content": query})
```

* Adds the user’s question to chat history.

```python
with st.chat_message("user"):
    st.markdown(query)
```

* Displays the user’s message immediately.

---

## **9️⃣ Getting the Bot’s Answer**

```python
with st.chat_message("assistant"):
    with st.spinner("Getting answer..."):
        result = answer_query_with_gemini(query)
```

* Shows a **loading spinner** while fetching the answer.
* Calls **`answer_query_with_gemini`** to:

  * Retrieve relevant data.
  * Ask Gemini to answer.
  * Return the final reply.

---

## **🔟 Processing the Response**

```python
if isinstance(result, dict):
    answer = result.get("answer", "")
    sources = result.get("sources", [])
else:
    answer = result
    sources = []
```

* Handles both:

  * **Dict format**: Contains `"answer"` and `"sources"`.
  * **String format**: Just an answer without sources.
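
The same branching can be wrapped in a small helper so it is easy to test on its own. `normalize_result` is a hypothetical name and not part of the original app:

```python
from typing import Any, List, Tuple


def normalize_result(result: Any) -> Tuple[str, List[str]]:
    """Return (answer, sources) whether the backend sent a dict or a plain string."""
    if isinstance(result, dict):
        return result.get("answer", ""), list(result.get("sources", []))
    return str(result), []


# Both shapes end up in the same (answer, sources) form
from_string = normalize_result("Just an answer")
from_dict = normalize_result({"answer": "An answer", "sources": ["file.pdf"]})
print(from_string)
print(from_dict)
```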

---

## **1️⃣1️⃣ Formatting & Display**

```python
formatted_answer = answer.replace("\n", "<br>")
st.markdown(formatted_answer, unsafe_allow_html=True)
```

* Replaces line breaks (`\n`) with `<br>` so they render properly in HTML.
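
Because the answer is rendered with `unsafe_allow_html=True`, any tags in the model's output are interpreted as HTML. One possible variation, sketched here with a hypothetical helper `to_html_lines`, escapes special characters before inserting the `<br>` tags:

```python
import html


def to_html_lines(answer: str) -> str:
    """Escape HTML-special characters, then turn newlines into <br> tags."""
    return html.escape(answer).replace("\n", "<br>")


# Made-up answer containing characters that HTML would otherwise interpret
formatted = to_html_lines("Step 1: bring <ID>\nStep 2: pay & collect")
print(formatted)
```

The order matters: escaping after the replacement would also escape the `<br>` tags themselves.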

---

## **1️⃣2️⃣ Showing Sources**

```python
if sources:
    st.markdown("<hr>", unsafe_allow_html=True)
    st.markdown("**Sources:**")
    for s in sources:
        st.markdown(f"- `{s}`")
```

* If sources are available, they’re displayed below the answer.

---

## **1️⃣3️⃣ Saving Bot’s Message to History**

```python
st.session_state.messages.append({"role": "assistant", "content": formatted_answer})
```

* Stores the assistant’s reply so it’s visible in future renders.

---

## **💡 How It Works in Sequence**

1. User visits the page → sees the **sidebar instructions** and **title**.
2. User types a question in the chat box.
3. Question is stored in session state and displayed.
4. The app calls `answer_query_with_gemini(query)`:

   * Retrieves relevant documents from Chroma.
   * Sends them to Gemini AI for processing.
5. Gemini returns an answer.
6. Answer is **formatted** and displayed with any sources.
7. Both question and answer are **saved to session state** for history.
