<a href="https://colab.research.google.com/github/tjoelc/AI-Engineering-Lab/blob/main/RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## *Hands-On ColBERT with Example*
In this section, we will get hands-on with ColBERT and check how it performs against regular embedding models.
## **Step 1: Download Libraries**
We will start by installing the following libraries:

In [1]:
!pip install ragatouille langchain langchain_openai chromadb einops sentence-transformers tiktoken

Collecting langchain_core (from ragatouille)
  Downloading langchain_core-0.3.79-py3-none-any.whl.metadata (3.2 kB)
Collecting langchain-text-splitters<1.0.0,>=0.3.9 (from langchain)
  Downloading langchain_text_splitters-0.3.11-py3-none-any.whl.metadata (1.8 kB)
Downloading langchain_core-0.3.79-py3-none-any.whl (449 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 449.8/449.8 kB 9.0 MB/s eta 0:00:00
Downloading langchain_text_splitters-0.3.11-py3-none-any.whl (33 kB)
Installing collected packages: langchain_core, langchain-text-splitters
  Attempting uninstall: langchain_core
    Found existing installation: langchain-core 1.0.0
    Uninstalling langchain-core-1.0.0:
      Successfully uninstalled langchain-core-1.0.0
  Attempting uninstall: langchain-text-splitters
    Found existing installation: langchain-text-splitters 1.0.0
    Uninstalling langchain-text-splitters-1.0.0:
      Successfully uninstalled langchain-text-splitters-1.0.0


## **Step 2: Download Pre-trained Model**

In [2]:
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

********************************************************************************
--------------------------------------------
RAGatouille version 0.0.10 will be migrating to a PyLate backend 
instead of the current Stanford ColBERT backend.
PyLate is a fully mature, feature-equivalent backend, that greatly facilitates compatibility.
However, please pin version <0.0.10 if you require the Stanford ColBERT backend.
********************************************************************************
  from ragatouille import RAGPretrainedModel
W1023 20:24:06.953000 7226 torch/utils/cpp_extension.py:118] No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authenticat

[Oct 23, 20:24:13] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




Now let’s download a Wikipedia page and perform retrieval on it:

In [3]:
from ragatouille.utils import get_wikipedia_page

document = get_wikipedia_page("Elon_Musk")
print("Character Count:", len(document))
print(document[:1000])

Character Count: 46042
Elon Reeve Musk ( EE-lon; born June 28, 1971) is a businessman and entrepreneur known for his leadership of Tesla, SpaceX, Twitter, and xAI. Musk has been the wealthiest person in the world since 2021; as of October 2025, Forbes estimates his net worth to be US$500 billion.
Born into a wealthy family in Pretoria, South Africa, Musk emigrated in 1989 to Canada; he had obtained Canadian citizenship at birth through his Canadian-born mother. He received bachelor's degrees in 1997 from the University of Pennsylvania in Philadelphia, United States, before moving to California to pursue business ventures. In 1995, Musk co-founded the software company Zip2. Following its sale in 1999, he co-founded X.com, an online payment company that later merged to form PayPal, which was acquired by eBay in 2002. That year, Musk also became an American citizen.
In 2002, Musk founded the space technology company SpaceX, becoming its CEO and chief engineer; the company has since led innovat
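
Note that `len()` on a Python string counts characters, not words. The following minimal sketch shows the difference on a short sample string, chosen here only for illustration rather than taken from the downloaded page:

```python
# Hypothetical sample text, used only to illustrate the two counts.
sample = "Musk co-founded the software company Zip2."

char_count = len(sample)          # number of characters, including spaces
word_count = len(sample.split())  # number of whitespace-separated tokens

print("Characters:", char_count)
print("Words:", word_count)
```

Applying the same two expressions to `document` would give its character count and an approximate word count.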

## **Step 3: Indexing**

Now we will create an index on this document.

In [4]:
RAG.index(
   # List of Documents
   collection=[document],
   # List of IDs for the above Documents
   document_ids=['elon_musk'],
   # List of Dictionaries for the metadata for the above Documents
   document_metadatas=[{"entity": "person", "source": "wikipedia"}],
   # Name of the index
   index_name="Elon2",
   # Chunk Size of the Document Chunks
   max_document_length=256,
   # Whether to Split Document or Not
   split_documents=True
   )

This is a behaviour change from RAGatouille 0.8.0 onwards.
This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations.
If you're confident with FAISS working on your machine, pass use_faiss=True to revert to the FAISS-using behaviour.
--------------------


[Oct 23, 20:30:56] #> Note: Output directory .ragatouille/colbert/indexes/Elon2 already exists


[Oct 23, 20:30:56] #> Will delete 11 files already at .ragatouille/colbert/indexes/Elon2 in 20 seconds...
[Oct 23, 20:31:17] [0] 		 #> Encoding 51 passages..


100%|██████████| 2/2 [00:36<00:00, 18.27s/it]

[Oct 23, 20:31:54] [0] 		 avg_doclen_est = 194.4705810546875 	 len(local_sample) = 51
[Oct 23, 20:31:54] [0] 		 Creating 1,024 partitions.
[Oct 23, 20:31:54] [0] 		 *Estimated* 9,917 embeddings.
[Oct 23, 20:31:54] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/Elon2/plan.json ..





used 14 iterations (2.7074s) to cluster 9423 items into 1024 clusters
[0.035, 0.037, 0.041, 0.038, 0.034, 0.039, 0.035, 0.036, 0.034, 0.035, 0.036, 0.033, 0.033, 0.038, 0.037, 0.036, 0.029, 0.035, 0.033, 0.035, 0.04, 0.034, 0.035, 0.035, 0.033, 0.038, 0.036, 0.038, 0.035, 0.036, 0.035, 0.039, 0.04, 0.036, 0.035, 0.031, 0.039, 0.038, 0.036, 0.041, 0.035, 0.033, 0.034, 0.034, 0.034, 0.034, 0.036, 0.04, 0.039, 0.036, 0.034, 0.036, 0.036, 0.035, 0.036, 0.037, 0.042, 0.038, 0.041, 0.035, 0.034, 0.037, 0.035, 0.035, 0.039, 0.037, 0.036, 0.038, 0.035, 0.036, 0.038, 0.034, 0.031, 0.037, 0.037, 0.036, 0.037, 0.039, 0.037, 0.036, 0.037, 0.038, 0.037, 0.039, 0.036, 0.035, 0.033, 0.038, 0.036, 0.041, 0.037, 0.037, 0.036, 0.039, 0.036, 0.035, 0.038, 0.035, 0.035, 0.036, 0.037, 0.038, 0.036, 0.037, 0.039, 0.033, 0.035, 0.033, 0.032, 0.031, 0.038, 0.038, 0.038, 0.035, 0.035, 0.036, 0.033, 0.04, 0.037, 0.037, 0.036, 0.039, 0.033, 0.038, 0.034, 0.037, 0.034, 0.033]


0it [00:00, ?it/s]

[Oct 23, 20:31:57] [0] 		 #> Encoding 51 passages..



  0%|          | 0/2 [00:00<?, ?it/s]
 50%|█████     | 1/2 [00:21<00:21, 21.69s/it]
100%|██████████| 2/2 [00:33<00:00, 16.92s/it]
1it [00:34, 34.11s/it]
100%|██████████| 1/1 [00:00<00:00, 858.61it/s]

[Oct 23, 20:32:31] #> Optimizing IVF to store map from centroids to list of pids..
[Oct 23, 20:32:31] #> Building the emb2pid mapping..
[Oct 23, 20:32:31] len(emb2pid) = 9918



100%|██████████| 1024/1024 [00:00<00:00, 28176.65it/s]

[Oct 23, 20:32:31] #> Saved optimized IVF to .ragatouille/colbert/indexes/Elon2/ivf.pid.pt
Done indexing!





'.ragatouille/colbert/indexes/Elon2'

Here we call the **.index()** method of the RAG object to index our document. We pass it the following arguments:

- **collection:** The list of documents that we want to index. Here we have only one document, so the list contains a single item.
- **document_ids:** Each document expects a unique document ID. Here we pass elon_musk because the document is about Elon Musk.
- **document_metadatas:** The metadata for each document. This is a list of dictionaries, where each dictionary holds the key-value metadata for one document.
- **index_name:** The name of the index that we are creating. Let’s name it Elon2.
- **max_document_length:** This is similar to the chunk size. It specifies how large each document chunk should be. Here we set it to 256. If we do not specify a value, 256 is used as the default chunk size.
- **split_documents:** A boolean value, where True indicates that we want to split our document according to the given chunk size, and False indicates that we want to store the entire document as a single chunk.
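
To make the effect of the chunk size and `split_documents` concrete, here is a simplified sketch of fixed-size chunking. It splits on whitespace-separated words, which is an assumption made purely for illustration: RAGatouille's own splitter counts model tokens and may treat boundaries differently, so its chunks will not match these. The function name `chunk_words` is hypothetical.

```python
from typing import List

def chunk_words(text: str, max_len: int = 256, split: bool = True) -> List[str]:
    """Split text into chunks of at most max_len whitespace-separated words."""
    if not split:
        # Mirrors split_documents=False: keep the whole document as one chunk.
        return [text]
    words = text.split()
    return [" ".join(words[i:i + max_len]) for i in range(0, len(words), max_len)]

chunks = chunk_words("one two three four five", max_len=2)
print(chunks)  # ['one two', 'three four', 'five']
```

A smaller chunk size yields more, shorter passages to index, while disabling splitting stores one long passage per document.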

## **Step 4: General Query**

In [5]:
results = RAG.search(query="What companies did Elon Musk find?", k=3, index_name='Elon2')
for i, doc in enumerate(results):
   print(f"---------------------------------- doc-{i} ------------------------------------")
   print(doc["content"])

Loading searcher for index Elon2 for the first time... This may take a few seconds
[Oct 23, 20:42:24] #> Loading codec...
[Oct 23, 20:42:24] #> Loading IVF...
[Oct 23, 20:42:24] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Oct 23, 20:42:24] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 8192.00it/s]

[Oct 23, 20:42:24] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 458.09it/s]

[Oct 23, 20:42:24] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...





[Oct 23, 20:42:24] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: What companies did Elon Musk find?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2054,  3316,  2106,  3449,  2239, 14163,  6711,  2424,
         1029,   102,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])





---------------------------------- doc-0 ------------------------------------
Elon Reeve Musk ( EE-lon; born June 28, 1971) is a businessman and entrepreneur known for his leadership of Tesla, SpaceX, Twitter, and xAI. Musk has been the wealthiest person in the world since 2021; as of October 2025, Forbes estimates his net worth to be US$500 billion.
Born into a wealthy family in Pretoria, South Africa, Musk emigrated in 1989 to Canada; he had obtained Canadian citizenship at birth through his Canadian-born mother. He received bachelor's degrees in 1997 from the University of Pennsylvania in Philadelphia, United States, before moving to California to pursue business ventures. In 1995, Musk co-founded the software company Zip2. Following its sale in 1999, he co-founded X.com, an online payment company that later merged to form PayPal, which was acquired by eBay in 2002. That year, Musk also became an American citizen.
In 2002, Musk founded the space technology company SpaceX, becoming i
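
Each search hit is a dictionary, and the loop above reads only its `"content"` key. As a rough sketch, the helper below prints a compact summary per hit. It runs on a made-up stand-in list; the `"score"` and `"rank"` keys are assumptions about what the result dictionaries contain, and the helper name `summarize` is hypothetical.

```python
# Hypothetical stand-in for the list returned by RAG.search(); only the
# "content" key is confirmed by the code above, the other keys are assumed.
mock_results = [
    {"content": "first chunk text", "score": 21.5, "rank": 1},
    {"content": "second chunk text", "score": 19.0, "rank": 2},
]

def summarize(results, width=40):
    """Return one line per hit: rank, score (if present), and a content preview."""
    lines = []
    for i, hit in enumerate(results):
        rank = hit.get("rank", i + 1)      # fall back to list position
        score = hit.get("score")           # may be absent
        score_txt = f"{score:.2f}" if score is not None else "n/a"
        lines.append(f"rank={rank} score={score_txt} {hit['content'][:width]}")
    return lines

for line in summarize(mock_results):
    print(line)
```

Because the helper falls back gracefully when a key is missing, it can be pointed at `results` without knowing the exact schema in advance.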

## **Step 5: Specific Query**

In [7]:
results = RAG.search(query="How much Tesla stocks did Elon sold in \
Decemeber 2022?", k=3, index_name='Elon2')


for i, doc in enumerate(results):
   print(f"---------------------------------- doc-{i} ------------------------------------")
   print(doc["content"])

---------------------------------- doc-0 ------------------------------------
Tesla began delivery of the Roadster, an electric sports car, in 2008. With sales of about 2,500 vehicles, it was the first mass production all-electric car to use lithium-ion battery cells. Under Musk, Tesla has since launched several well-selling electric vehicles, including the four-door sedan Model S (2012), the crossover Model X (2015), the mass-market sedan Model 3 (2017), the crossover Model Y (2020), and the pickup truck Cybertruck (2023).
In May 2020, Musk resigned as chairman of the board as part of the settlement of a lawsuit from the SEC over him tweeting that funding had been "secured" for potentially taking Tesla private.
The company has also constructed multiple lithium-ion battery and electric vehicle factories, called Gigafactories. Since its initial public offering in 2010, Tesla stock has risen significantly; it became the most valuable carmaker in summer 2020, and it entered the S&P 50

In the code above, we ask a very specific question: how much Tesla stock Elon sold in December 2022. In the retrieved output, doc-1 contains the answer: Elon sold $3.6 billion worth of his Tesla stock. Again, ColBERT successfully retrieved the relevant chunk for the given query.
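
ColBERT scores a passage by comparing every query token embedding with every passage token embedding, a scheme known as late interaction or MaxSim. The toy sketch below illustrates that scoring rule in plain Python. The two-dimensional vectors are made up, and real ColBERT uses learned, normalized token embeddings plus an optimized index, none of which is reproduced here.

```python
from typing import List

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy token embeddings (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens
doc_b = [[1.0, 0.0], [0.9, 0.1]]              # matches only the first one

print(maxsim_score(query, doc_a))
print(maxsim_score(query, doc_b))
```

Here `doc_a` scores higher because each query token finds a close match in it, which is how a chunk mentioning both the stock sale and the date can outrank one that matches only part of a specific query.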

## **Step 6: Testing Other Models**
Let’s now try the same questions with other embedding models, both open-source and closed-source:

In [8]:
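from langchain_community.embeddings import HuggingFaceEmbeddings
from transformers import AutoModel

# Load the Jina checkpoint directly; this downloads and caches the model files.
# Note that `model` itself is not used by the embeddings object created below.
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

model_name = "jinaai/jina-embeddings-v2-base-en"
# Run the embedding model on the CPU
model_kwargs = {'device': 'cpu'}

# LangChain wrapper that loads the model by name and embeds text with it
embeddings = HuggingFaceEmbeddings(
   model_name=model_name,
   model_kwargs=model_kwargs,
)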



  embeddings = HuggingFaceEmbeddings(
Some weights of BertModel were not initialized from the model checkpoint at jinaai/jina-embeddings-v2-base-en and are newly initialized: ['embeddings.position_embeddings.weight', 'encoder.layer.0.intermediate.dense.bias', 'encoder.layer.0.intermediate.dense.weight', 'encoder.layer.0.output.LayerNorm.bias', 'encoder.layer.0.output.LayerNorm.weight', 'encoder.layer.0.output.dense.bias', 'encoder.layer.0.output.dense.weight', 'encoder.layer.1.intermediate.dense.bias', 'encoder.layer.1.intermediate.dense.weight', 'encoder.layer.1.output.LayerNorm.bias', 'encoder.layer.1.output.LayerNorm.weight', 'encoder.layer.1.output.dense.bias', 'encoder.layer.1.output.dense.weight', 'encoder.layer.10.intermediate.dense.bias', 'encoder.layer.10.intermediate.dense.weight', 'encoder.layer.10.output.LayerNorm.bias', 'encoder.layer.10.output.LayerNorm.weight', 'encoder.layer.10.output.dense.bias', 'encoder.layer.10.output.dense.weight', 'encoder.layer.11.intermediate.de
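
The "Some weights of BertModel were not initialized" warning in the output above usually means a checkpoint was loaded into a model class that does not match it, in which case the embeddings would be computed from partly random weights. One possible remedy, offered as an assumption rather than a confirmed fix, is to allow the model's custom code when building the embeddings. The sketch assumes that `HuggingFaceEmbeddings` forwards `model_kwargs` to `SentenceTransformer` and that the installed sentence-transformers version accepts `trust_remote_code`. The helper name `build_jina_embeddings` is hypothetical.

```python
MODEL_NAME = "jinaai/jina-embeddings-v2-base-en"

# Assumed keyword arguments; trust_remote_code runs code shipped with the
# model repository, so enable it only for sources you trust.
MODEL_KWARGS = {"device": "cpu", "trust_remote_code": True}

def build_jina_embeddings():
    """Construct the embeddings wrapper lazily so the import stays optional."""
    from langchain_community.embeddings import HuggingFaceEmbeddings
    return HuggingFaceEmbeddings(model_name=MODEL_NAME, model_kwargs=MODEL_KWARGS)
```

If the warning no longer appears after building the embeddings this way, the retrieval results in the following steps may differ from the ones recorded here.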

## **Step 7: Create Embeddings**

In [9]:
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=256,
    chunk_overlap=0)
splits = text_splitter.split_text(document)
vectorstore = Chroma.from_texts(texts=splits,
                                embedding=embeddings,
                                collection_name="elon")
retriever = vectorstore.as_retriever(search_kwargs = {'k':3})
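
Unlike ColBERT, this setup stores one vector per chunk and returns the `k` chunks whose vectors are closest to the query vector. The sketch below shows that idea with cosine similarity on made-up vectors. The metric Chroma actually applies depends on how the collection is configured, so cosine is used here only as an illustrative assumption, and `top_k` is a hypothetical helper.

```python
import math
from typing import List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def top_k(query_vec, chunk_vecs, k=3) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) for the k most similar chunks."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors, made up for illustration.
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(top_k([1.0, 0.1], chunk_vecs, k=3))
```

Since the whole chunk is compressed into a single vector, fine-grained details such as a specific month or amount have to survive that compression to influence the ranking.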



## **Step 8: Querying the Retriever**

In [10]:
docs = retriever.invoke("What companies did Elon Musk find?")

for i, doc in enumerate(docs):
    print(f"---------------------------------- doc-{i} ------------------------------------")
    print(doc.page_content)

---------------------------------- doc-0 ------------------------------------
During his speech after the second inauguration of Donald Trump, Musk twice made a gesture interpreted by many as a Nazi or a fascist Roman salute. He thumped his right hand over his heart, fingers spread wide, and then extended his right arm out, emphatically, at an upward angle, palm down and fingers together. He then repeated the gesture to the crowd behind him. As he finished the gestures, he said to the crowd, "My heart goes out to you. It is thanks to you that the future of civilization is assured."
It was widely condemned as an intentional Nazi salute in Germany, where making such gestures is illegal. The Anti-Defamation League said it was not a Nazi salute, but other Jewish organizations disagreed and condemned the salute. American public opinion was divided on partisan lines as to whether it was a fascist salute. Musk dismissed the accusations of Nazi sympathies, deriding them as "dirty tricks" and a
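
When comparing retrievers on the same question, reading every chunk by eye gets tedious. A small helper like the one below, a sketch with the hypothetical name `contains_answer`, checks whether any retrieved chunk mentions an expected phrase. The example chunks are placeholders, not real retrieval output.

```python
from typing import Iterable

def contains_answer(chunks: Iterable[str], needle: str) -> bool:
    """True if any retrieved chunk contains the expected phrase (case-insensitive)."""
    needle = needle.lower()
    return any(needle in chunk.lower() for chunk in chunks)

# Placeholder chunks, not real retrieval output.
example_chunks = [
    "Musk co-founded Zip2 in 1995.",
    "Tesla delivered the Roadster in 2008.",
]
print(contains_answer(example_chunks, "zip2"))
print(contains_answer(example_chunks, "SpaceX"))
```

For the ColBERT results this could be called with `[d["content"] for d in results]`, and for the LangChain retrievers with `[d.page_content for d in docs]`.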



## **Step 9: Testing OpenAI’s Embedding Model**


In [17]:
from google.colab import userdata
import os

# Get the API key from Colab Secrets and set it as an environment variable
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma


embeddings = OpenAIEmbeddings()

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
              model_name = "gpt-4",
              chunk_size = 256,
              chunk_overlap  = 0,
              )

splits = text_splitter.split_text(document)
vectorstore = Chroma.from_texts(texts=splits,
                                embedding=embeddings,
                                collection_name="elon_collection")

retriever = vectorstore.as_retriever(search_kwargs = {'k':3})
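
The cell above reads the key from Colab Secrets, which only works inside Google Colab. As a hedged sketch, the helper below tries the environment first, then Colab Secrets, and finally prompts for the key. The name `load_openai_key` and the fallback order are assumptions for illustration, not part of any library.

```python
import os
from getpass import getpass

def load_openai_key(prompt=getpass) -> str:
    """Find the OpenAI key in the environment, then Colab Secrets, then by prompting."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        try:
            from google.colab import userdata  # importable only inside Colab
            key = userdata.get("OPENAI_API_KEY")
        except Exception:
            key = None  # not running in Colab, or the secret is unavailable
    if not key:
        key = prompt("OpenAI API key: ")
    # Export the key so libraries that read the environment can find it.
    os.environ["OPENAI_API_KEY"] = key
    return key
```

Calling `load_openai_key()` once before creating `OpenAIEmbeddings()` would let the same notebook run both in Colab and on a local machine.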

In [18]:
docs = retriever.invoke("How much Tesla stocks did Elon sold in Decemeber 2022?")

for i, doc in enumerate(docs):
  print(f"""---------------------------------- doc-{i} ------------------------------------""")
  print(doc.page_content)

---------------------------------- doc-0 ------------------------------------
In 2019, Musk stated in a tweet that Tesla would build half a million cars that year. The SEC reacted by asking a court to hold him in contempt for violating the terms of the 2018 settlement agreement. A joint agreement between Musk and the SEC eventually clarified the previous agreement details, including a list of topics about which Musk needed preclearance. In 2020, a judge blocked a lawsuit that claimed a tweet by Musk regarding Tesla stock price ("too high imo") violated the agreement. Freedom of Information Act (FOIA)-released records showed that the SEC concluded Musk had subsequently violated the agreement twice by tweeting regarding "Tesla's solar roof production volumes and its stock price".
---------------------------------- doc-1 ------------------------------------
In October 2023, the SEC sued Musk over his refusal to testify a third time in an investigation into whether he violated federal law 