# RAG using ChromaDB, LlamaIndex and OpenAI

What are we doing here?

* Get the data for RAG. Data preparation and cleaning are assumed to be already done.
* Store the data in ChromaDB (using LlamaIndex).
* Query the index and call OpenAI to generate the answer.

# Step 1: Data preparation

* The data is already prepared and stored as a CSV file. Details are [here](https://github.com/tva04/create-data-set/).
* CSV data: https://raw.githubusercontent.com/tva04/create-data-set/main/data-set/wikipedia_sports_data.csv
* [This notebook](https://github.com/tva04/gen-ai-rag/blob/main/rag_chromadb.ipynb) reads the CSV data from the GitHub link and converts it to a data frame.
* Here, the CSV file was manually copied to the `my-data` folder so that LlamaIndex's `SimpleDirectoryReader` can load it.
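The manual copy into `my-data` could also be scripted. The following is a minimal sketch using only the standard library; the helper name `fetch_csv` and its `dest_dir` parameter are made up for illustration and are not part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL of the prepared CSV, as listed above.
CSV_URL = (
    "https://raw.githubusercontent.com/tva04/create-data-set/"
    "main/data-set/wikipedia_sports_data.csv"
)


def fetch_csv(url: str, dest_dir: str = "my-data") -> Path:
    """Download `url` into `dest_dir` and return the local file path.

    Hypothetical helper: the notebook itself copies the file by hand.
    """
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Use the last segment of the URL path as the local file name.
    target = target_dir / url.rsplit("/", 1)[-1]
    urlretrieve(url, str(target))
    return target
```

Calling `fetch_csv(CSV_URL)` should leave `wikipedia_sports_data.csv` inside `my-data`, which is the folder that `SimpleDirectoryReader` reads later on.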

# Step 2: Configure LLM

In [4]:
%pip install openai



In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
# Choose the GPT model and the embedding model
gpt_model = 'gpt-4o-mini'
embedding_model = 'text-embedding-3-small'

In [7]:
import openai
import os
from openai import OpenAI

# Read the OpenAI API key from a text file on Google Drive
with open('/content/drive/MyDrive/Secrets/openai_api_key.txt', "r") as f:
  api_key_value = f.read().strip()
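Outside Colab, the key file on Drive may not be available. As a hedged sketch, the key could be looked up in an environment variable first and read from the file only as a fallback. The helper `load_api_key` is hypothetical, and using `OPENAI_API_KEY` as the variable name is an assumption based on the usual convention.

```python
import os
from pathlib import Path


def load_api_key(key_file: str, env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment if set, else from `key_file`.

    Hypothetical helper, not part of the original notebook.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to the text file; strip the trailing newline.
        key = Path(key_file).read_text().strip()
    if not key:
        raise ValueError(f"No API key found in ${env_var} or {key_file}")
    return key
```

With this in place, `api_key_value = load_api_key('/content/drive/MyDrive/Secrets/openai_api_key.txt')` would behave like the cell above whenever the environment variable is not set.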

# Step 3: Set up LlamaIndex and ChromaDB

Install the necessary LlamaIndex libraries.

In [8]:
%pip install llama-index llama-index-vector-stores-chroma



In [9]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

In [20]:
llm = OpenAI(model=gpt_model, api_key=api_key_value)
embed_model = OpenAIEmbedding(model=embedding_model, api_key=api_key_value)

Set up ChromaDB

In [11]:
import chromadb

In [12]:
COLLECTION_NAME = "wiki-sports-data"
chroma_data_path = 'chroma_data'

In [13]:
chroma_client = chromadb.PersistentClient(path=chroma_data_path)
chroma_collection = chroma_client.get_or_create_collection(name=COLLECTION_NAME)

Load data from CSV

In [14]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("my-data").load_data()

In [15]:
documents

[Document(id_='64c450c7-0b25-478b-8757-416037fd58d8', embedding=None, metadata={'file_path': '/content/my-data/wikipedia_sports_data.csv', 'file_name': 'wikipedia_sports_data.csv', 'file_type': 'text/csv', 'file_size': 155742, 'creation_date': '2025-11-25', 'last_modified_date': '2025-11-25'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='Cricket_World_Cup, Introduction, nan, The ICC Men\'s Cricket World Cup is a quadrennial world cup for cricket in One Day International (ODI) format, organised by the International Cricket Council (ICC). The tournament is one of the world\'s most viewed sporting events and considered the flagship event o

# Step 3 (continued): Vector store

In [16]:
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext

In [17]:
# Wrap the Chroma collection with ChromaVectorStore
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Create a StorageContext to hold the vector store
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Step 4: Create Index

In [21]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
)

# Step 5: Data Retrieval

In [22]:
query_engine = index.as_query_engine(llm=llm)
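
Conceptually, the query engine embeds the question, retrieves the most similar chunks from the vector store, and passes them to the LLM as context. The retrieval step can be sketched in plain Python as below. The toy vectors and chunk texts are made up, and cosine similarity is only one common choice of measure; the scores reported by LlamaIndex and ChromaDB may be computed differently.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up 3-dimensional "embeddings"; real embeddings have far more dimensions.
toy_chunks = [
    ("chunk about the women's world cup", [0.9, 0.1, 0.0]),
    ("chunk about football", [0.0, 0.2, 0.9]),
    ("chunk about the men's world cup", [0.7, 0.6, 0.1]),
]
toy_query = [1.0, 0.2, 0.0]

for score, text in top_k(toy_query, toy_chunks, k=2):
    print(f"{score:.4f}  {text}")
```

Only the retrieved chunks reach the LLM, so the quality of the answer depends on how well this ranking matches the question.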

In [23]:
response = query_engine.query("Who won 2025 Women's cricket world cup?")

In [24]:
print(response.response)

India won the 2025 Women's Cricket World Cup, securing their maiden title by defeating South Africa in the final.


In [26]:
print("\nSources Retrieved from ChromaDB (Context)")
for source_node in response.source_nodes:
    print(f"Confidence Score (Similarity): {source_node.score:.4f}")
    print(f"Text Chunk:\n{source_node.text[:150]}\n")


Sources Retrieved from ChromaDB (Context)
Confidence Score (Similarity): 0.5337
Text Chunk:
2025_Women's_Cricket_World_Cup, Qualification, nan, The West Indies, semi-finalists at the preceding 2022 tournament, failed to qualify for the World 

Confidence Score (Similarity): 0.4547
Text Chunk:
In November 2021, the ICC announced that the 2027 Cricket World Cup will be played in South Africa, Zimbabwe and Namibia.
2027_Cricket_World_Cup, Back
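
The second retrieved chunk is about the 2027 men's tournament, so it adds little to this question. One possible way to drop weak matches is a score cutoff, sketched below. `ScoredChunk` is a hypothetical stand-in for the entries of `response.source_nodes`, and the cutoff of 0.5 is an arbitrary value chosen for illustration, not a recommendation.

```python
from typing import NamedTuple


class ScoredChunk(NamedTuple):
    """Hypothetical stand-in for an entry of `response.source_nodes`."""
    score: float
    text: str


def filter_by_score(chunks, min_score):
    """Keep only chunks whose similarity score is at least `min_score`."""
    return [c for c in chunks if c.score is not None and c.score >= min_score]


# Scores copied from the output above; the texts are shortened.
retrieved = [
    ScoredChunk(0.5337, "2025_Women's_Cricket_World_Cup, Qualification, ..."),
    ScoredChunk(0.4547, "In November 2021, the ICC announced ..."),
]

kept = filter_by_score(retrieved, min_score=0.5)
for chunk in kept:
    print(f"{chunk.score:.4f}  {chunk.text}")
```

A suitable cutoff depends on the embedding model and the data, so it is worth inspecting the scores for a few known questions before fixing a value.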

