# Simple Retrieval Augmented Generation (RAG) System

In this notebook, we will build a simple RAG system to answer questions based on Markdown files. As an [Obsidian](https://obsidian.md) user, I store all my notes as Markdown files in a directory that Obsidian calls a *vault*. The idea is to build a system that:
1. Reads these files
2. Splits them into chunks if required
3. Converts them into embeddings
4. Stores the embeddings in a persistent database called ChromaDB
5. Retrieves chunks from the database relevant to the query
6. Sends the query to an LLM with the retrieved chunks provided as context

We begin by loading the required packages and creating an OpenAI client, which we will use both to generate embeddings and to answer the query.

In [None]:
import os
import pathlib
from typing import Dict, List, Optional

import chromadb
from chromadb import Collection
from dotenv import load_dotenv
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI

In [None]:
load_dotenv("../secrets.env")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=OPENAI_API_KEY)
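
`os.getenv` returns `None` when the variable is missing, for example if `secrets.env` was not found. A minimal sketch of an explicit check that fails with a message naming the missing variable; `require_env` is a hypothetical helper, not part of the original notebook:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable and an empty string.
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value
```

It could be used right after `load_dotenv(...)`, as in `OPENAI_API_KEY = require_env("OPENAI_API_KEY")`.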

Next, we define the root directory of an Obsidian vault and recursively load the Markdown files in it. To enable filtering of files based on their location, we add each file's directory path (relative to the vault root) as another key in the output dictionary.

In [None]:
VAULT_DIR = (
    "/Users/tejaskale/Library/Mobile Documents/iCloud~md~obsidian/Documents/RAG Test"
)

In [None]:
def load_markdown_files(vault_dir: str) -> List[Dict[str, str]]:
    """
    Recursively loads all Markdown (.md) files from the specified Obsidian vault
    directory.

    Args:
        vault_dir (str): Path to the root directory of the Obsidian vault.

    Returns:
        list of dict: A list where each element is a dictionary with 'content'
        (str) and 'path' (str) keys, representing the file content and its relative path.
    """
    md_files = []
    for path in pathlib.Path(vault_dir).rglob("*.md"):
        with open(path, "r", encoding="utf-8") as f:
            content = f.read()
        # Exclude the file name from the relative path (keep only the directory)
        relative_path = str(path.relative_to(vault_dir).parent.as_posix())
        md_files.append({"content": content, "path": relative_path})
    return md_files


markdown_files = load_markdown_files(VAULT_DIR)

In [None]:
markdown_files[0]
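
To see what the `path` value looks like, the sketch below applies the same `relative_to(...).parent.as_posix()` expression to a throwaway directory. The folder and file names are made up for illustration. Note that a note stored directly in the vault root gets the path `"."`.

```python
import pathlib
import tempfile

# Build a throwaway vault with one top-level note and one nested note.
with tempfile.TemporaryDirectory() as tmp:
    vault = pathlib.Path(tmp)
    (vault / "Projects" / "RAG").mkdir(parents=True)
    (vault / "inbox.md").write_text("# Inbox", encoding="utf-8")
    (vault / "Projects" / "RAG" / "notes.md").write_text("# Notes", encoding="utf-8")

    # Same expression as in load_markdown_files: keep only the directory part
    # of each file's path relative to the vault root.
    relative_dirs = sorted(
        path.relative_to(vault).parent.as_posix() for path in vault.rglob("*.md")
    )

print(relative_dirs)  # ['.', 'Projects/RAG']
```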

Having loaded the required files, we now split each file into chunks, embed each chunk using OpenAI's embedding model, and save the embeddings to a persistent ChromaDB vector store. In this example, we begin with the default values for *chunk size* and *chunk overlap* provided in the tutorials (both are measured in characters, since `length_function=len`). If the answers we get from the LLM do not match our expectations, these parameters can be tweaked.

While adding documents to ChromaDB, we store the file's relative directory path as metadata for each chunk. This lets us restrict a query to chunks from a given directory, which avoids searching the whole vault and reduces the chance of retrieving unrelated context.
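
To make *chunk size* and *chunk overlap* concrete, here is a deliberately naive fixed-width splitter. It is not how `RecursiveCharacterTextSplitter` chooses boundaries (that splitter tries to break on separators such as paragraph breaks first); it only illustrates how the two parameters interact. `sliding_window_chunks` is a hypothetical name.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-width character windows that overlap.

    Illustration only: each chunk starts (chunk_size - chunk_overlap)
    characters after the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap repeats more text across neighbouring chunks, which helps keep sentences that straddle a boundary intact at the cost of storing and embedding more characters.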

In [None]:
def split_text(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> List[str]:
    """
    Splits text into chunks using Langchain's RecursiveCharacterTextSplitter.

    Args:
        text (str): The input text to split.
        chunk_size (int, optional): The maximum size of each chunk. Defaults to 1000.
        chunk_overlap (int, optional): The number of overlapping characters between
        chunks. Defaults to 200.

    Returns:
        List[str]: A list of text chunks.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    return splitter.split_text(text)


def embed_and_store_markdown_files(
    markdown_files: List[Dict[str, str]],
    client: OpenAI,
    persist_directory: str = "./chroma_db",
    collection_name: str = "obsidian_vault",
) -> Collection:
    """
    Splits each markdown file into chunks, generates embeddings for each chunk using the provided OpenAI client,
    and stores the embeddings along with the chunk text and metadata in a persistent ChromaDB collection.

    Args:
        markdown_files (list): List of dictionaries, each containing 'content' (str)
            and 'path' (str) for a markdown file.
        client (OpenAI): An OpenAI client instance for generating embeddings.
        persist_directory (str, optional): Directory path to persist the ChromaDB
            database. Defaults to "./chroma_db".
        collection_name (str, optional): Name of the ChromaDB collection. Defaults
            to "obsidian_vault".

    Returns:
        Collection: The ChromaDB collection cursor after storing all embeddings.
    """
    chroma_client = chromadb.PersistentClient(path=persist_directory)
    collection = chroma_client.get_or_create_collection(collection_name)
    for idx, file in enumerate(markdown_files):
        chunks = split_text(file["content"])
        for i, chunk in enumerate(chunks):
            response = client.embeddings.create(
                input=chunk, model="text-embedding-ada-002"
            )
            embedding = response.data[0].embedding
            collection.add(
                embeddings=[embedding],
                documents=[chunk],
                metadatas=[{"path": file["path"]}],
                ids=[f"{idx}_{i}"],
            )

    return collection


# Run the embedding and storage process, and get the collection cursor
collection = embed_and_store_markdown_files(markdown_files, client)
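
The loop above sends one embeddings request per chunk. The OpenAI embeddings endpoint also accepts a list of strings as `input`, so chunks can be grouped into batches to reduce the number of requests. The sketch below shows only the batching logic; the embedding call is injected as `embed_batch` so the example runs without network access. `embed_in_batches`, `embed_batch`, and the batch size of 100 are assumptions, not part of the original notebook.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    chunks: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed chunks in groups of batch_size, preserving input order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    embeddings: List[List[float]] = []
    for start in range(0, len(chunks), batch_size):
        batch = list(chunks[start : start + batch_size])
        vectors = embed_batch(batch)
        # Guard against an embedder that drops or adds items.
        if len(vectors) != len(batch):
            raise ValueError("embed_batch must return one vector per input")
        embeddings.extend(vectors)
    return embeddings


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Stand-in embedder for demonstration: one number per chunk (its length)."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0]]
```

With the real client, `embed_batch` could wrap `client.embeddings.create(input=batch, model="text-embedding-ada-002")` and return `[item.embedding for item in response.data]`, assuming the items come back in input order (each item also carries an `index` field that can be used to sort them).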

Having stored the vector embeddings of our documents in a database, we now write a function that takes a user query and an optional metadata filter, and uses them to retrieve the `n` most relevant chunks from the ChromaDB collection.

In [None]:
def retrieve_relevant_documents(
    query: str,
    client: OpenAI,
    collection: Collection,
    n_results: int = 5,
    metadata_filter: Optional[Dict[str, str]] = None,
) -> List[Dict[str, str]]:
    """
    Retrieves the top n most relevant documents from the ChromaDB collection for a given query,
    optionally filtering by metadata.

    Args:
        query (str): The user query.
        client (OpenAI): An OpenAI client instance for generating the query embedding.
        collection (Collection): The ChromaDB collection to search.
        n_results (int, optional): Number of top documents to retrieve. Defaults to 5.
        metadata_filter (dict, optional): Metadata filter for narrowing down results.

    Returns:
        List[Dict[str, str]]: List of dictionaries with 'document' and 'metadata' keys.
    """
    # Generate embedding for the query
    response = client.embeddings.create(input=query, model="text-embedding-ada-002")
    query_embedding = response.data[0].embedding

    # Query the ChromaDB collection
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        where=metadata_filter if metadata_filter else None,
        include=["documents", "metadatas"],
    )

    # Format the results
    docs = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        docs.append({"document": doc, "metadata": meta})
    return docs
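
Conceptually, the query step ranks stored vectors by how close they are to the query vector. The toy sketch below does this by brute force with cosine similarity on hand-written two-dimensional vectors. ChromaDB itself uses an index and whichever distance metric is configured on the collection, so this illustrates the idea and is not a description of its internals. `cosine_similarity` and `top_n` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]], n: int
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the n most similar vectors."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = top_n([1.0, 0.0], doc_vectors, n=2)
print([index for index, _ in ranked])  # [0, 2]
```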

In [None]:
retrieve_relevant_documents(
    "What is the most important mindset for vibe coding?", client, collection
)
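
The `metadata_filter` argument is passed straight to Chroma's `where` parameter. A filter such as `{"path": "Projects"}` matches on equality, so it selects chunks whose stored directory is exactly `Projects` and not those in its subdirectories. The sketch below builds either an equality filter or an `$in` filter covering several directories. `build_path_filter` is a hypothetical helper and the directory names are made up.

```python
from typing import Any, Dict, Optional, Sequence


def build_path_filter(directories: Sequence[str]) -> Optional[Dict[str, Any]]:
    """Build a Chroma `where` filter on the 'path' metadata key.

    Returns None for no directories, an equality filter for one directory,
    and an `$in` filter for several. The return type is Dict[str, Any]
    because the `$in` form nests a dict, unlike the Dict[str, str] hint
    used in the notebook's function signature.
    """
    unique = list(dict.fromkeys(directories))  # de-duplicate, keep order
    if not unique:
        return None
    if len(unique) == 1:
        return {"path": unique[0]}
    return {"path": {"$in": unique}}


print(build_path_filter(["Projects"]))
# {'path': 'Projects'}
print(build_path_filter(["Projects", "Projects/RAG"]))
# {'path': {'$in': ['Projects', 'Projects/RAG']}}
```

The result can be passed as `metadata_filter=` to `retrieve_relevant_documents`.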

With the relevant context available, we now define a function that takes the query as input, retrieves the relevant documents, builds a prompt from the query and the documents, and asks the LLM for an answer.

In [None]:
def answer_query_with_context(
    query: str,
    client: OpenAI,
    collection: Collection,
    n_results: int = 5,
    metadata_filter: Optional[Dict[str, str]] = None,
) -> str:
    """
    Answers a user query by retrieving relevant documents from the ChromaDB collection,
    constructing a prompt with the query and context, and querying GPT-4o mini for a response.

    Args:
        query (str): The user query.
        client (OpenAI): An OpenAI client instance.
        collection (Collection): The ChromaDB collection to search.
        n_results (int, optional): Number of top documents to retrieve. Defaults to 5.
        metadata_filter (Optional[Dict[str, str]], optional): Metadata filter for narrowing down results.

    Returns:
        str: The response generated by GPT-4o mini.
    """
    relevant_docs = retrieve_relevant_documents(
        query, client, collection, n_results=n_results, metadata_filter=metadata_filter
    )

    context = "\n\n".join(doc["document"] for doc in relevant_docs)
    prompt = (
        f"Answer the following question using the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        f"Answer:"
    )

    system_prompt = """
    You are a helpful assistant who answers questions based only on the provided context.
    If the answer is not directly available in the context, respond with:
    'The answer is not available in the provided context.'
    """

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {"role": "user", "content": prompt},
        ],
        max_tokens=512,
        temperature=0.2,
    )

    return response.choices[0].message.content.strip()

In [None]:
answer_query_with_context(
    "What is the most important mindset for vibe coding?", client, collection
)
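
The prompt above joins the chunk texts only. Since each retrieved item also carries its `path` metadata, one possible variation is to label every chunk with its source directory so that it is visible where a passage came from. This is a sketch of that idea, not the notebook's method; `build_prompt` and the labelling format are assumptions, and the sample documents are invented.

```python
from typing import Any, Dict, List


def build_prompt(query: str, docs: List[Dict[str, Any]]) -> str:
    """Build a prompt whose context blocks are numbered and tagged with a source."""
    blocks = []
    for number, doc in enumerate(docs, start=1):
        # Metadata may be missing or None; fall back to a placeholder.
        metadata = doc.get("metadata") or {}
        source = metadata.get("path", "unknown")
        blocks.append(f"[{number}] (source: {source})\n{doc['document']}")

    context = "\n\n".join(blocks)
    return (
        "Answer the following question using the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


sample_docs = [
    {"document": "RAG combines retrieval with generation.", "metadata": {"path": "Projects/RAG"}},
    {"document": "Chunks are embedded before storage.", "metadata": None},
]
prompt = build_prompt("What is RAG?", sample_docs)
print(prompt)
```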

## Resources
For this notebook, I relied on two sources:
- [Module 1](https://github.com/DataTalksClub/llm-zoomcamp/tree/main/01-intro) of the course LLM Zoomcamp
- [Simple RAG](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/simple_rag.ipynb) notebook by Nir Diamant