# RAG - Vector DB (Chroma)

What are we doing here?
* Get data for RAG. The assumption is that data preparation and cleaning are already done.
* Extract the metadata and content.
* Chunk the data (chunk size = 1000 characters, overlap = 20 characters).
* Embed the chunks using an embedding model (here, all-MiniLM-L6-v2).
* Store the content, metadata, and embeddings in a vector DB (ChromaDB).
* Retrieve the content most relevant to the user query.
* Call the LLM (OpenAI API) and generate the answer.


# Step 1: Data Preparation

The data is already prepared and stored as a CSV file. Details are [here](https://github.com/tva04/create-data-set/).

In [1]:
import pandas as pd
# The raw GitHub URL for the CSV file
github_csv_url = 'https://raw.githubusercontent.com/tva04/create-data-set/main/data-set/wikipedia_sports_data.csv'

try:
    custom_data = pd.read_csv(github_csv_url)
    print("DataFrame successfully loaded!")
    print("\nFirst 5 rows:")
    display(custom_data.head())

except Exception as e:
    print(f"An error occurred while reading the CSV: {e}")

DataFrame successfully loaded!

First 5 rows:


|   | Title | Section | Subsection | Content |
|---|-------|---------|------------|---------|
| 0 | Cricket_World_Cup | Introduction |  | The ICC Men's Cricket World Cup is a quadrenni... |
| 1 | Cricket_World_Cup | History |  | The first international cricket match was play... |
| 2 | Cricket_World_Cup | History | Prudential World Cups (1975–1983) | The inaugural Cricket World Cup was hosted in ... |
| 3 | Cricket_World_Cup | History | Various champions (1987–1996) | India and Pakistan jointly hosted the 1987 tou... |
| 4 | Cricket_World_Cup | History | Australian treble (1999–2007) | In 1999, the event was hosted by England, with... |


In [2]:
custom_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Title       96 non-null     object
 1   Section     96 non-null     object
 2   Subsection  59 non-null     object
 3   Content     96 non-null     object
dtypes: object(4)
memory usage: 3.1+ KB


# Step 2: Embedding & Storing

## ChromaDB

In [3]:
!pip install chromadb



In [4]:
import chromadb
from sentence_transformers import SentenceTransformer

In [5]:
chroma_data_path = 'chroma_data'
COLLECTION_NAME = "wiki-sports-data"
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"
CHUNK_OVERLAP = 20
CHUNK_SIZE = 1000

In [6]:
chroma_client = chromadb.PersistentClient(path=chroma_data_path)
chroma_collection = chroma_client.get_or_create_collection(name=COLLECTION_NAME)

## Storing Data (Embedding)

In [7]:
embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME)



### Chunking the data

In [8]:
def simple_text_splitter(text: str, chunk_size: int, chunk_overlap: int):
    """
    Split text into fixed-size character chunks, where consecutive chunks
    share `chunk_overlap` characters.
    """
    # An overlap >= chunk size would never advance and loop forever
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    chunks = []
    start = 0
    text_length = len(text)

    while start < text_length:
        # Determine the end of the current chunk
        end = min(start + chunk_size, text_length)
        chunks.append(text[start:end])

        # Stop once the final chunk reaches the end of the text
        if end == text_length:
            break

        # Advance by the stride so the next chunk overlaps this one
        start += chunk_size - chunk_overlap

    return chunks
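
To see how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, the sketch below lists the character offsets each chunk would cover for a given text length. It follows the same stride logic as `simple_text_splitter` (stride = chunk size minus overlap). The helper name `chunk_spans` is hypothetical and not part of the notebook.

```python
def chunk_spans(text_length: int, chunk_size: int, chunk_overlap: int):
    """Return (start, end) offsets using the same stride as simple_text_splitter."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    spans = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        spans.append((start, end))
        if end == text_length:
            break
        start += stride
    return spans


# A 2500-character text with the notebook's settings (1000 / 20)
spans = chunk_spans(2500, chunk_size=1000, chunk_overlap=20)
print(spans)  # [(0, 1000), (980, 1980), (1960, 2500)]
```

Each chunk starts 980 characters after the previous one, so neighbouring chunks share 20 characters and the last chunk is simply shorter.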

In [9]:
def embedd_data(df):
    """
    Chunk each row's content and build the parallel lists Chroma expects:
    documents, metadatas and ids. Embeddings are computed in a later cell.
    """
    all_contents = []
    all_metadatas = []
    all_ids = []
    total_chunks = 0
    # Iterate over the DataFrame rows
    for index, row in df.iterrows():
        doc_content = row['Content']
        # Note: a missing Subsection (NaN) is rendered as the text "nan"
        doc_title = f"{row['Title']} - {row['Section']} - {row['Subsection']}"

        # Chunk the document content
        chunks = simple_text_splitter(doc_content, CHUNK_SIZE, CHUNK_OVERLAP)

        # Prepare the chunk text, metadata and ID for Chroma
        for i, chunk in enumerate(chunks):
            # Global chunk ID across all documents
            chunk_id = total_chunks + i + 1

            all_contents.append(chunk)
            all_metadatas.append({
                "source": doc_title,
                "doc_index": index,
                "chunk_id": chunk_id
            })
            # Build a descriptive ID from the source label and the per-row chunk number
            all_ids.append(f"{doc_title.replace(' ', '_')}_chunk{i+1}")
        total_chunks += len(chunks)
    return all_contents, all_metadatas, all_ids, total_chunks
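
Rows without a `Subsection` are read by pandas as NaN, and the f-string in `embedd_data` turns that into the literal text `nan` (visible in IDs such as `..._-_nan_chunk1`). A minimal sketch of one way to skip missing parts when building the label, and to add the row index to the ID so that two rows with the same label cannot collide. Both helper names (`build_source_label`, `build_chunk_id`) are hypothetical and not part of the notebook.

```python
import math


def build_source_label(title, section, subsection=None):
    """Join the non-missing parts with ' - ', skipping None, NaN and empty strings."""
    parts = []
    for value in (title, section, subsection):
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue  # pandas represents missing CSV cells as float NaN
        text = str(value).strip()
        if text:
            parts.append(text)
    return " - ".join(parts)


def build_chunk_id(label: str, doc_index: int, chunk_number: int) -> str:
    """Include the row index so rows that share a label still get distinct IDs."""
    return f"{label.replace(' ', '_')}_row{doc_index}_chunk{chunk_number}"


label = build_source_label("Cricket_World_Cup", "Introduction", float("nan"))
print(label)                        # Cricket_World_Cup - Introduction
print(build_chunk_id(label, 0, 1))  # Cricket_World_Cup_-_Introduction_row0_chunk1
```

Switching to IDs like these would change the stored IDs, so the outputs shown later in this notebook would no longer match character for character.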


In [10]:
all_contents, all_metadatas, all_ids, total_chunks = embedd_data(custom_data)

In [11]:
print('Sample content\n', all_contents[0])
print('\n\nSample metadata\n', all_metadatas[0])
print('\n\nSample IDs\n', all_ids[0])

Sample content
 The ICC Men's Cricket World Cup is a quadrennial world cup for cricket in One Day International (ODI) format, organised by the International Cricket Council (ICC). The tournament is one of the world's most viewed sporting events and considered the flagship event of the international cricket calendar by the ICC. 
The first Cricket World Cup was organised in England in June 1975, with the first ODI cricket match having been played only four years earlier. However, a separate Women's Cricket World Cup had been held two years before the first men's tournament, and a tournament involving multiple international teams had been held as early as 1912, when a triangular tournament of Test matches was played between Australia, England and South Africa. The first three World Cups were held in England. From the 1987 tournament onwards, hosting has been shared between countries under an unofficial rotation system, with 14 ICC members having hosted at least one match in the tournament

In [12]:
print(f"Generating embeddings for {total_chunks} total chunks...")

embeddings = embedding_model.encode(all_contents).tolist()
chroma_collection.add(
        embeddings=embeddings,
        documents=all_contents,
        metadatas=all_metadatas,
        ids=all_ids
    )

Generating embeddings for 201 total chunks...
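
Each chunk ID is built from the title, section, subsection and the per-row chunk number, so two rows that share the same three fields would produce the same ID. Chroma identifies records by ID, so a quick check before calling `add` can catch such collisions early. The sketch below uses only the standard library. The function name `find_duplicate_ids` and the sample IDs are hypothetical.

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return the IDs that occur more than once, in sorted order."""
    counts = Counter(ids)
    return sorted(doc_id for doc_id, count in counts.items() if count > 1)


# Hypothetical IDs in the same format the notebook builds
sample_ids = [
    "Cricket_World_Cup_-_History_-_nan_chunk1",
    "Cricket_World_Cup_-_History_-_nan_chunk2",
    "Cricket_World_Cup_-_History_-_nan_chunk1",  # same label coming from a second row
]
duplicates = find_duplicate_ids(sample_ids)
print(duplicates)  # ['Cricket_World_Cup_-_History_-_nan_chunk1']
```

In the notebook this would be called as `find_duplicate_ids(all_ids)`. An empty list means every ID is unique.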


In [13]:
# Checking the data
results = chroma_collection.get(
        limit=2,
        include=['documents', 'metadatas', 'embeddings']
    )
print('ID\n',results['ids'][0])
print('\nContent\n',results['documents'][0])
print('\nEmbeddings\n\n',results['embeddings'][0])


ID
 Cricket_World_Cup_-_Introduction_-_nan_chunk1

Content
 The ICC Men's Cricket World Cup is a quadrennial world cup for cricket in One Day International (ODI) format, organised by the International Cricket Council (ICC). The tournament is one of the world's most viewed sporting events and considered the flagship event of the international cricket calendar by the ICC. 
The first Cricket World Cup was organised in England in June 1975, with the first ODI cricket match having been played only four years earlier. However, a separate Women's Cricket World Cup had been held two years before the first men's tournament, and a tournament involving multiple international teams had been held as early as 1912, when a triangular tournament of Test matches was played between Australia, England and South Africa. The first three World Cups were held in England. From the 1987 tournament onwards, hosting has been shared between countries under an unofficial rotation system, with 14 ICC members having

# Step 3: Retrieval

In a separate session you would first load the vector DB and the collection again. Here we simply reuse the `chroma_collection` variable from the step above.

In [14]:
query_text = input('Please enter your query: ')

Please enter your query: Who won 2025 Women's cricket world cup?


In [15]:
query_embedding = embedding_model.encode(query_text).tolist()
print("Query embedded.", query_embedding)

Query embedded. [-0.026585392653942108, 0.0594341866672039, -0.017878128215670586, 0.013185758143663406, 0.04435097798705101, 0.04049894958734512, -0.020816070958971977, -0.027476631104946136, 0.015999823808670044, 0.11172843724489212, -0.06452883034944534, -0.12033694982528687, 0.023265743628144264, 0.0061559597961604595, -0.0030877466779202223, 0.00782565213739872, -0.03508227318525314, -0.1372017115354538, 0.03862108290195465, -0.09695789217948914, 0.059644002467393875, 0.02103356644511223, -0.0019856570288538933, 0.019657714292407036, 0.0638575330376625, -0.04160309210419655, -0.03529170900583267, 0.030449291691184044, -0.055108048021793365, 0.015728633850812912, -0.01078282855451107, 0.06073644757270813, -0.08497782051563263, 0.005017654504626989, -0.09447012096643448, -0.0015362582635134459, -0.04060957953333855, 0.061204176396131516, 0.09397917985916138, -0.02925211936235428, -0.03506992757320404, -0.11853161454200745, -0.01537895668298006, 0.016465188935399055, 0.05260299518704
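
Retrieval ranks stored chunks by the distance between their vectors and this query vector. Assuming the collection keeps Chroma's default squared L2 metric and that the embeddings are unit-length (both are assumptions here and are not checked in the notebook), squared L2 distance and cosine similarity are tied together by `distance = 2 - 2 * cosine`, so a smaller distance means a more similar chunk. A standard-library sketch with made-up toy vectors:

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def squared_l2(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings
query = normalize([1.0, 2.0, 0.5])
chunk = normalize([0.8, 1.9, 1.0])

d = squared_l2(query, chunk)
cos = cosine_similarity(query, chunk)
print(round(d, 6), round(2 - 2 * cos, 6))  # the two values agree for unit vectors
```

Under the same assumptions, a distance of about 0.42 would correspond to a cosine similarity of roughly 0.79.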

In [16]:
retrieved_results = chroma_collection.query(
    query_embeddings=query_embedding,
    n_results=5, # Retrieve top 5 results
    include=['documents', 'metadatas', 'distances']
)

print("Retrieved Documents:")
for i in range(len(retrieved_results['ids'])):
    print(f"\n--- Result {i+1} ---")
    print(f"Document ID: {retrieved_results['ids'][i]}")
    print(f"Distance: {retrieved_results['distances'][i]}")
    print(f"Source: {retrieved_results['metadatas'][i]}")
    print(f"Content: {retrieved_results['documents'][i]}")

Retrieved Documents:

--- Result 1 ---
Document ID: ["2025_Women's_Cricket_World_Cup_-_Introduction_-_nan_chunk1", '2027_Cricket_World_Cup_-_Background_-_nan_chunk1', '2027_Cricket_World_Cup_-_Introduction_-_nan_chunk1', '2027_Cricket_World_Cup_-_Background_-_Preparations_chunk1', "2025_Women's_Cricket_World_Cup_-_Match_officials_-_nan_chunk1"]
Distance: [0.4231768250465393, 0.7897672653198242, 0.8235656023025513, 0.836820125579834, 0.8507355451583862]
Source: [{'source': "2025_Women's_Cricket_World_Cup - Introduction - nan", 'chunk_id': 186, 'doc_index': 80}, {'chunk_id': 181, 'doc_index': 75, 'source': '2027_Cricket_World_Cup - Background - nan'}, {'doc_index': 74, 'chunk_id': 180, 'source': '2027_Cricket_World_Cup - Introduction - nan'}, {'doc_index': 77, 'source': '2027_Cricket_World_Cup - Background - Preparations', 'chunk_id': 183}, {'doc_index': 86, 'source': "2025_Women's_Cricket_World_Cup - Match officials - nan", 'chunk_id': 192}]
Content: ["The 2025 ICC Women's Cricket World
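
`query` returns one inner list per query embedding, so `retrieved_results['ids']` has length 1 here and the loop above prints all five hits under "Result 1", as the output shows. A sketch that walks over the hits of the first query instead. The dictionary below is a hand-written stand-in shaped like the output above (first two hits only, chunk text replaced by placeholders), not a real Chroma response, and `iter_hits` is a hypothetical helper.

```python
# Hand-written stand-in shaped like the output above (first two hits only)
retrieved_results = {
    "ids": [[
        "2025_Women's_Cricket_World_Cup_-_Introduction_-_nan_chunk1",
        "2027_Cricket_World_Cup_-_Background_-_nan_chunk1",
    ]],
    "distances": [[0.4231768250465393, 0.7897672653198242]],
    "metadatas": [[
        {"source": "2025_Women's_Cricket_World_Cup - Introduction - nan"},
        {"source": "2027_Cricket_World_Cup - Background - nan"},
    ]],
    "documents": [["<text of chunk 1>", "<text of chunk 2>"]],
}


def iter_hits(results, query_index=0):
    """Yield one dict per retrieved chunk for the selected query."""
    ids = results["ids"][query_index]
    distances = results["distances"][query_index]
    metadatas = results["metadatas"][query_index]
    documents = results["documents"][query_index]
    for doc_id, distance, metadata, document in zip(ids, distances, metadatas, documents):
        yield {
            "id": doc_id,
            "distance": distance,
            "source": metadata["source"],
            "content": document,
        }


hits = list(iter_hits(retrieved_results))
for rank, hit in enumerate(hits, start=1):
    print(f"\n--- Result {rank} ---")
    print(f"Document ID: {hit['id']}")
    print(f"Distance: {hit['distance']:.4f}")
    print(f"Source: {hit['source']}")
    print(f"Content: {hit['content']}")
```

With the real `retrieved_results` this prints five separate result blocks, ordered from smallest to largest distance.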

# Step 4: LLM

In [17]:
!pip install openai
import openai



In [18]:
from google.colab import drive
drive.mount('/content/drive')
# Read the OpenAI API key; strip whitespace so no trailing newline ends up in the key
with open('/content/drive/MyDrive/Secrets/openai_api_key.txt', "r") as f:
  openai.api_key = f.read().strip()

# Choose the GPT model
gpt_model = 'gpt-4o-mini'

Mounted at /content/drive
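
A key stored in a text file often carries a trailing newline, and some environments provide the key through an environment variable instead. A minimal sketch of a loader that prefers the environment variable and falls back to a file, stripping whitespace either way. The helper name `load_api_key` is hypothetical, and the demo writes a fake key to a temporary file rather than touching Google Drive.

```python
import os
import tempfile
from pathlib import Path


def load_api_key(key_file, env_var="OPENAI_API_KEY"):
    """Return the key from the environment if set, otherwise from key_file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    key = Path(key_file).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"No API key found in {env_var} or {key_file}")
    return key


# Demo with a fake key and an environment variable name assumed to be unset
with tempfile.TemporaryDirectory() as tmp:
    key_path = Path(tmp) / "openai_api_key.txt"
    key_path.write_text("sk-example-not-a-real-key\n", encoding="utf-8")
    loaded = load_api_key(key_path, env_var="DEMO_UNSET_VARIABLE")

print(repr(loaded))  # 'sk-example-not-a-real-key'
```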


In [21]:
def call_llm(prompt):
  """Send the prompt to the chat model and return the answer text (or the error message)."""
  try:
    response = openai.chat.completions.create(
        model=gpt_model,
        messages=[
            # Instructions go in the system message; the user message carries the prompt
            {"role": "system", "content": "You are an expert assistant. Read the input and answer the question."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.1
    )
    return response.choices[0].message.content.strip()
  except Exception as e:
    # `e` is unbound once the except block ends, so return from inside it
    print(e)
    return str(e)

In [26]:
prompt_template = """Answer the question based on the context below.\n#####\nContext:\n{}.\n#####\nQuestion:\n{}"""
formatted_prompt = prompt_template.format("\n\n".join(retrieved_results['documents'][0]), query_text)
print(formatted_prompt)

Answer the question based on the context below.
#####
Context:
The 2025 ICC Women's Cricket World Cup was the 13th edition of Women's Cricket World Cup. India hosted the World Cup for the fourth time, after the 1978, 1997 and 2013 editions, with the tournament held from 30 September to 2 November 2025. This was the last time the tournament had eight teams. India became champions after defeating South Africa in the final, securing their maiden World Cup title. Australia where the defending champions, but were knocked out in the Semi-final by eventual champions India.

The Cricket World Cup is a quadrennial One Day International tournament played between men's national cricket teams, organized by the International Cricket Council (ICC). The tournament is held every four years, and was first played in 1975 in England. The last tournament, held in 2023 in India was contested by 10 teams. Australia are the defending champions, having defeated India in the final of the previous edition.
In N
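
The prompt above joins the retrieved chunks without saying where each one came from. One possible variation, sketched under assumptions: label each chunk with its `source` metadata, cap the total context length, and ask the model to say so when the context lacks the answer. The function `build_prompt`, its `max_context_chars` parameter and the instruction wording are hypothetical choices, not the notebook's template.

```python
def build_prompt(question, documents, metadatas, max_context_chars=4000):
    """Assemble a prompt from retrieved chunks, most relevant first."""
    blocks = []
    used = 0
    for document, metadata in zip(documents, metadatas):
        block = f"[Source: {metadata.get('source', 'unknown')}]\n{document.strip()}"
        # Always keep the first block; stop once the character budget would be exceeded
        if blocks and used + len(block) > max_context_chars:
            break
        blocks.append(block)
        used += len(block)
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n"
        f"#####\nContext:\n{context}\n#####\nQuestion:\n{question}"
    )


# One sentence and its source label taken from the retrieval output above
documents = [
    "India became champions after defeating South Africa in the final, "
    "securing their maiden World Cup title.",
]
metadatas = [{"source": "2025_Women's_Cricket_World_Cup - Introduction - nan"}]
question = "Who won 2025 Women's cricket world cup?"

prompt = build_prompt(question, documents, metadatas)
print(prompt)
```

In the notebook the call would look like `build_prompt(query_text, retrieved_results['documents'][0], retrieved_results['metadatas'][0])`.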

In [24]:
llm_response = call_llm(formatted_prompt)

In [25]:
display(llm_response)

"India won the 2025 ICC Women's Cricket World Cup."