#🚀 Routing in RAG (Retrieval-Augmented Generation)

🔹 What is Routing in RAG?

In Retrieval-Augmented Generation (RAG), routing refers to how a query is directed to the right retrieval or generation pipeline. It ensures that the most relevant documents, models, or APIs handle the query efficiently.



🔥 1. Why is Routing Important in RAG?

In large-scale RAG systems, different types of queries require different processing. Routing helps:

✅ Improve Accuracy – Ensures the best retriever/generator is used.

✅ Optimize Latency – Sends queries to fast/efficient paths.

✅ Reduce Cost – Uses specialized models only when needed.

✅ Enable Multi-Source Retrieval – Queries different knowledge bases (e.g., vector DB, APIs).

⚙️ 2. Routing Components in a RAG System

1️⃣ Query Classification for Routing
Classifies queries to choose the right retriever & generator.

Uses ML models or rule-based approaches.
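
As a complement to the ML-based classifier, the rule-based approach mentioned above can be sketched with a few regular expressions. This is a minimal sketch, not a recommended rule set: the `RULES` table, the keyword patterns, their order, and the `classify_by_rules` name are all illustrative assumptions.

```python
import re

# Hypothetical keyword rules; the patterns and their priority order are assumptions.
# The first rule whose pattern matches wins.
RULES = [
    ("real-time", re.compile(r"\b(current|today|now|latest|live)\b", re.I)),
    ("technical", re.compile(r"\b(install|configure|error|debug|how to)\b", re.I)),
    ("opinion", re.compile(r"\b(i think|best|worst|should|favorite)\b", re.I)),
]

def classify_by_rules(query, default="factual"):
    """Return the label of the first matching rule, or `default` if none match."""
    for label, pattern in RULES:
        if pattern.search(query):
            return label
    return default

for q in [
    "What is the capital of France?",
    "How to install TensorFlow on a GPU?",
    "The weather in London is rainy today.",
]:
    print(q, "->", classify_by_rules(q))
```

Rules like these are cheap and predictable, which makes them a reasonable first filter in front of a slower ML classifier.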

In [None]:
"""1️⃣ Query Classification for Routing
Classifies queries to choose the right retriever & generator.
Uses ML models or rule-based approaches."""

from transformers import pipeline

# facebook/bart-large-mnli is an NLI model. Loaded with the "text-classification"
# task, it returns NLI labels (entailment / neutral / contradiction) instead of the
# routing categories, so the "zero-shot-classification" task is used here.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
labels = ["factual", "opinion", "real-time", "technical"]


queries = [
    "What is the capital of France?", # Factual: Asking for a verifiable fact
    "The Earth is flat.", # Factual, but incorrect - the classifier might still label it as factual
    "I think Python is a great programming language.", # Opinion: Expressing a subjective viewpoint
    "The current stock price of Apple is...", # Real-time: Requires up-to-date information
    "How to install TensorFlow on a GPU?",  # Technical: Involves specific procedures or knowledge in a technical domain
    "The best movie of all time is The Shawshank Redemption.", # Opinion: Subjective preference
    "Global warming is a serious threat to our planet.", # Factual, with a degree of opinion/interpretation
    "What is the meaning of life?", # Philosophical, could be classified as opinion or factual depending on the model's interpretation
    "The weather in London is rainy today.", # Real-time, weather information changes frequently
    "How to configure a VPN connection on Windows?", # Technical, involves networking configuration
]

for query in queries:
    # Candidate labels are scored against the query; results are sorted by score
    result = classifier(query, candidate_labels=labels)
    print(f"Query: {query}")
    print(f"Classification: {result['labels'][0]} ({result['scores'][0]:.2f})\n")

In [None]:
"""2️⃣ Multi-Retriever Routing"""
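def route_retrieval(query_type):
    # Illustrative mapping only; the full example later in this notebook
    # routes real-time queries to an external API instead of the vector DB.
    if query_type == "factual":
        return "vector_db"
    elif query_type == "opinion":
        return "api"
    elif query_type == "real-time":
        return "vector_db"
    elif query_type == "technical":
        return "api"
    else:
        # Unknown query types skip retrieval entirely
        return "llm-only"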

query_type = "real-time"
retriever = route_retrieval(query_type)
print(retriever)


vector_db


In [None]:
"""3️⃣ Multi-Generator Routing"""
def route_generation(query_type):
    if query_type in ["factual", "real-time"]:
        return "small_llm"
    elif query_type == "technical":
        return "large_llm"
    else:
        return "fine_tuned_model"

query_type = "technical"
generator = route_generation(query_type)
print(f"Using: {generator}")


Using: large_llm


🎯 3. End-to-End RAG Routing Flow

Step-by-Step Query Flow

1️⃣ User Query → Classified into categories (Factual, Technical, Opinion, etc.)

2️⃣ Retriever Routing → Query sent to the best retrieval method (Vector, Hybrid, API).

3️⃣ Document Ranking & Filtering → Relevant documents ranked.

4️⃣ Generator Routing → Best LLM chosen based on complexity.

5️⃣ Response Generation → Final response generated.



User Query  
   │  
   ├──► Query Classifier  
   │       ├──► "Technical" → Hybrid Retrieval  
   │  
   ├──► Multi-Retriever Routing  
   │       ├──► FAISS + BM25 + API Calls  
   │  
   ├──► Document Ranking  
   │       ├──► Reranker Model (e.g., Cohere Rerank)  
   │  
   ├──► Multi-Generator Routing  
   │       ├──► GPT-4 for Complex Queries  
   │  
   └──► Final Response Generated  


🔥 4. Optimizing Routing in RAG


✅ Best Practices

✔ Use ML-based Query Classification – Improves accuracy.

✔ Implement Hybrid Retrieval – Best of vector & keyword search.

✔ Dynamically Select LLMs – Save cost on simple queries.

✔ Cache Common Responses – Reduce latency.
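
The caching practice above can be sketched with Python's built-in `functools.lru_cache`. This is a minimal sketch under stated assumptions: `normalize`, `answer`, `_cached_answer`, and the `calls` counter are hypothetical names, and the returned string is a placeholder for a real classify, retrieve, and generate pipeline.

```python
from functools import lru_cache

calls = {"count": 0}  # Counts how often the (expensive) pipeline actually runs

def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries share a cache key."""
    return " ".join(query.lower().split())

@lru_cache(maxsize=1024)
def _cached_answer(normalized_query: str) -> str:
    calls["count"] += 1
    # Placeholder for the real routing + retrieval + generation pipeline
    return f"answer for: {normalized_query}"

def answer(query: str) -> str:
    return _cached_answer(normalize(query))

print(answer("What is RAG?"))
print(answer("  what is   RAG? "))  # Served from the cache
print("Pipeline runs:", calls["count"])
```

An exact-match cache only helps with repeated queries; caching by embedding similarity is a common extension but needs a similarity threshold to be tuned.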


🚀 Advanced Techniques

✅ RAG with Agentic Routing – Uses LLMs to decide the pipeline.

✅ Meta-Prompting for Routing – Uses prompt engineering to determine best strategy.

✅ RLHF for Routing Optimization – Fine-tunes routing models using user feedback.



In [None]:
from transformers import pipeline

# Query Classifier
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Routing Functions
def classify_query(query):
    labels = ["factual", "opinion", "real-time", "technical"]
    result = classifier(query, candidate_labels=labels)
    return result['labels'][0]  # Get highest-confidence label

def route_retrieval(query_type):
    routes = {
        "factual": "vector_db",
        "technical": "hybrid_search",
        "real-time": "external_api",
        "opinion": "llm_only"
    }
    return routes.get(query_type, "vector_db")

def route_generation(query_type):
    routes = {
        "factual": "small_llm",
        "technical": "large_llm",
        "real-time": "real_time_api",
        "opinion": "fine_tuned_model"
    }
    return routes.get(query_type, "small_llm")

# Example Query
query = "What are the latest advancements in quantum computing?"
query_type = classify_query(query)
retrieval_method = route_retrieval(query_type)
generation_method = route_generation(query_type)

# Output Routing Decisions
print(f"Query Type: {query_type}")
print(f"Retrieval Method: {retrieval_method}")
print(f"Generation Method: {generation_method}")



Query Type: technical
Retrieval Method: hybrid_search
Generation Method: large_llm


#🚀 Query Construction in Retrieval-Augmented Generation (RAG)

🔹 What is Query Construction?

Query Construction in RAG is the process of modifying, reformulating, or enhancing the user’s query to improve retrieval performance. The goal is to ensure that the retriever fetches the most relevant documents for the LLM.



🔥 1. Why is Query Construction Important?

Poorly formulated queries can result in irrelevant or incomplete document retrieval.

✅ Improves search accuracy by making queries more meaningful.

✅ Enhances retrieval performance by adding context.

✅ Boosts response quality by ensuring the model gets useful documents.

✅ Handles ambiguity by rewording vague queries.

##⚙️ 2. Query Construction Techniques

In [None]:
"""1️⃣ Keyword Extraction"""
import spacy
nlp = spacy.load("en_core_web_sm")

def extract_keywords(query):
    doc = nlp(query)
    keywords = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(keywords)

query = "what are the latest advancements in quantum computing?"
refined_query = extract_keywords(query)
print(refined_query)

latest advancements quantum computing


In [None]:
"""2️⃣ Query Expansion"""
from transformers import pipeline

# Load an instruction-tuned text-generation model
query_expander = pipeline("text-generation", model="tiiuae/falcon-7b-instruct", device_map="auto")

def expand_query(query):
    prompt = f"Expand the following query by providing related terms and synonyms: {query}"
    # max_new_tokens limits only the generated text (max_length would also count the prompt);
    # return_full_text=False drops the echoed prompt from the output
    response = query_expander(
        prompt,
        max_new_tokens=100,
        num_return_sequences=1,
        do_sample=True,
        temperature=0.7,
        return_full_text=False,
    )
    return response[0]['generated_text'].strip()

query = "How does deep learning work?"
expanded_query = expand_query(query)
print("Expanded Query:", expanded_query)



In [None]:
"""3️⃣ Query Rewriting (Reformulation)"""
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "t5-base"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

def rewrite_query(query):
    input_text = f"Rephrase this query for better retrieval: {query}"
    input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
    output_ids = model.generate(input_ids, max_length=100, num_return_sequences=1, num_beams=4, early_stopping=True)
    rewritten_query = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return rewritten_query

query = "Tell me about the latest in AI research?"
rewritten_query = rewrite_query(query)
print("Rewritten Query:", rewritten_query)




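Rewritten Query: Rephrase this query for better retrieval: Tell me about the latest in AI research?

Note that `t5-base` echoes the prompt instead of rewriting it. The original T5 checkpoints were trained on fixed task prefixes (such as `summarize:`), not free-form instructions, so an instruction-tuned checkpoint such as `google/flan-t5-base` is likely a better fit for this prompt.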


In [None]:
"""4️⃣ Multi-Turn Query Construction"""
!pip install --upgrade langchain langchain-community
from langchain.memory import ConversationBufferMemory


memory = ConversationBufferMemory(memory_key="chat_history")

# Store earlier conversation turns (user input and model output) in memory
memory.save_context({"input": "Tell me about transformers?"}, {"output": "Transformers are deep learning models."})
memory.save_context({"input": "How do they compare to RNNs?"}, {"output": "They process sequences in parallel."})

# Load the accumulated history; it supplies the context needed to resolve
# references such as "they" in the second question
query = memory.load_memory_variables({})["chat_history"]
print(query)

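
To show how stored history can turn a follow-up into a self-contained retrieval query, here is a minimal sketch in plain Python with no LangChain dependency. The `build_standalone_query` function, its `max_turns` parameter, and the follow-up question are hypothetical; production systems often ask an LLM to rewrite the follow-up instead of concatenating turns.

```python
def build_standalone_query(history, follow_up, max_turns=2):
    """Prefix a follow-up question with recent turns so it can be retrieved on its own.

    history: list of (user_message, assistant_message) tuples, oldest first.
    """
    recent = history[-max_turns:] if max_turns > 0 else []
    context = " ".join(
        f"User: {user} Assistant: {assistant}" for user, assistant in recent
    )
    return f"{context} Follow-up: {follow_up}".strip()

history = [
    ("Tell me about transformers?", "Transformers are deep learning models."),
    ("How do they compare to RNNs?", "They process sequences in parallel."),
]

# "Which one" is ambiguous without the earlier turns
print(build_standalone_query(history, "Which one trains faster?"))
```

Limiting the number of turns keeps the query short, which matters because long queries tend to dilute both keyword and embedding matches.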


##🎯 3. Full RAG Query Construction Pipeline

In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer
import spacy

# Load Models
nlp = spacy.load("en_core_web_sm")
t5_model = T5ForConditionalGeneration.from_pretrained("t5-small")
t5_tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Query Construction Functions
def extract_keywords(query):
    doc = nlp(query)
    return " ".join([token.text for token in doc if token.is_alpha and not token.is_stop])

def rewrite_query(query):
    input_text = f"Paraphrase this query for better search results: {query}"
    inputs = t5_tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
    output = t5_model.generate(inputs['input_ids'], max_length=50, do_sample=True, temperature=0.7)
    return t5_tokenizer.decode(output[0], skip_special_tokens=True)

# Example Usage
query = "What is the latest in AI research?"
rewritten_query = rewrite_query(query)
keywords = extract_keywords(rewritten_query)

print("Original Query:", query)
print("Rewritten Query:", rewritten_query)
print("Keywords:", keywords)


Original Query: What is the latest in AI research?
Rewritten Query: Paraphrase this query for better search results: What is latest in AI research?
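Keywords: Paraphrase query better search results latest AI research

Because `t5-small` returns the instruction text almost verbatim here, prompt words such as "Paraphrase" and "search" leak into the extracted keywords. Extracting keywords from the original query, or using an instruction-tuned rewriter, avoids this.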


🎯 Key Takeaways

✅ Query Construction is crucial for RAG to improve retrieval.

✅ Techniques include Keyword Extraction, Expansion, and Rewriting.

✅ Using LLMs like GPT & T5 can automate query enhancement.

✅ Multi-turn query construction helps in conversations.

✅ Combining methods leads to the best retrieval performance!

🚀 Next Steps? Try integrating this into FAISS, BM25, or LangChain RAG pipelines! 💡

#Indexing (Multi Representation)

##1️⃣ What is Indexing in RAG?
In Retrieval-Augmented Generation (RAG), indexing refers to the preprocessing and storage of documents or knowledge sources in a way that enables efficient retrieval. The quality of retrieval directly impacts how well the language model generates responses.

##2️⃣ What is Multi-Representation Indexing?
Traditional indexing stores only one representation of a document, such as a TF-IDF vector or a single embedding.
Multi-Representation Indexing (MRI) enhances this by storing multiple perspectives of each document, improving retrieval diversity and accuracy.

##🔹 Types of Representations in Multi-Representation Indexing
- Lexical Representation (Sparse Retrieval)
  - Uses traditional methods like BM25 or TF-IDF.
  - Helps in retrieving documents based on exact word matches.
- Semantic Representation (Dense Retrieval)
  - Uses Transformer-based embeddings (e.g., BERT or OpenAI embeddings), typically stored in a vector index such as FAISS.
  - Retrieves documents based on semantic similarity, even if different words are used.
- Hybrid Representation
  - Combines BM25 + Dense Vectors for improved results.
  - Can use a re-ranking model (like ColBERT or Cross-Encoders) to boost relevant documents.
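
One common way to combine a lexical ranking with a dense ranking is Reciprocal Rank Fusion (RRF), which needs only rank positions, not comparable scores. The sketch below is a swapped-in illustration rather than something the notebook itself uses: the function name, the document IDs, and the constant `k=60` (a frequently used default) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document receives 1 / (k + rank) from every list it appears in;
    documents are returned in order of decreasing total score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a BM25 retriever and a dense retriever
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
```

Documents that appear near the top of both lists rise to the front, while documents found by only one retriever are still kept.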

##3️⃣ Why is Multi-Representation Indexing Important in RAG?

✅ Better Recall & Accuracy → Some queries work better with lexical retrieval, while others need dense retrieval.

✅ More Context Awareness → Captures both keyword-based and semantic meanings.

✅ Handles Vocabulary Mismatch → Dense embeddings can retrieve paraphrased information even when the query and the document share no exact terms.

✅ Supports Hybrid Search → Improves response quality for complex queries.

##4️⃣ How to Implement Multi-Representation Indexing?

We can use LangChain + FAISS + BM25 for hybrid retrieval.



In [None]:
!pip install langchain langchain-community faiss-cpu unstructured openai tiktoken



In [None]:
# Sample documents (plain strings, so no document loader is needed)
documents = [
    "Transformers are deep learning models that replaced RNNs.",
    "BM25 is a ranking function used in traditional search engines.",
    "FAISS is an efficient similarity search library for vector retrieval.",
]


In [None]:
!pip install rank_bm25



In [None]:
# Generate multiple representations: a BM25 (sparse) index and a FAISS (dense) index
# Requires: pip install openai tiktoken (OpenAIEmbeddings raises ImportError without tiktoken)
import os

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from rank_bm25 import BM25Okapi

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY" # Replace with your actual API key


# Tokenize for BM25
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

# Split text into chunks (create_documents returns Document objects)
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=10)
chunks = text_splitter.create_documents(documents)

# Embed the chunks and store them in FAISS. from_documents accepts Document
# objects, whereas from_texts expects plain strings.
embeddings = OpenAIEmbeddings()
vector_store = FAISS.from_documents(chunks, embeddings)

In [None]:
# Implement hybrid search

def hybrid_search(query, k=2):
    query_tokens = query.lower().split()

    # BM25 (lexical) retrieval: keep the k highest-scoring documents
    bm25_scores = bm25.get_scores(query_tokens)
    bm25_ranked = sorted(zip(documents, bm25_scores), key=lambda x: x[1], reverse=True)
    bm25_results = [doc for doc, _ in bm25_ranked[:k]]

    # FAISS (dense) retrieval
    dense_results = [res.page_content for res in vector_store.similarity_search(query, k=k)]

    # Merge both result lists, dropping duplicates while preserving order
    return list(dict.fromkeys(bm25_results + dense_results))


# Test query
query = "What is the latest in AI research?"
retrieved_docs = hybrid_search(query)
print(retrieved_docs)

##5️⃣ Key Takeaways


✅ Multi-Representation Indexing improves retrieval by combining lexical (BM25) and semantic (FAISS) approaches.

✅ Hybrid Retrieval ensures that even if a query lacks exact words, it can still retrieve relevant results.

✅ LangChain + FAISS + BM25 provide a practical way to implement multi-representation indexing in RAG.

#Indexing (RAPTOR) in RAG


RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) is an indexing and retrieval technique for Retrieval-Augmented Generation (RAG). Instead of indexing only flat chunks, it recursively clusters chunks and summarizes each cluster, building a tree whose leaves are the original chunks and whose higher levels are increasingly abstract summaries.

##🔹 Why RAPTOR for RAG?
Multi-Level Context – Summary nodes capture themes that span many chunks, while leaf nodes keep the fine-grained details.

Better Answers to Broad Questions – Queries that need information from across a long document can match a summary node instead of relying on a few isolated chunks.

Flexible Retrieval – The same index serves both detail-oriented and high-level queries.
Scalable – Summarization runs once at indexing time; queries only need an embedding similarity search.

##🔹 How RAPTOR Works
Preprocessing:

Split documents into short chunks and embed each chunk.
These chunks become the leaf nodes of the tree.

Indexing:

Cluster the chunk embeddings (the RAPTOR paper uses Gaussian mixture models over dimensionality-reduced embeddings).
Summarize each cluster with an LLM, embed the summaries, and repeat the cluster-and-summarize step on them until no further clustering is practical.

Query Execution:

Embed the query.
Either traverse the tree from the top, following the most similar nodes at each level, or search all nodes at once (the "collapsed tree" strategy).
Retrieve the top-k most relevant nodes, which may mix original chunks and summaries.

##🔹 Simple RAPTOR-style Indexing

The example below is a simplified stand-in: it combines BM25 with a FAISS index over TF-IDF vectors on flat chunks and does not build RAPTOR's summary tree.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from rank_bm25 import BM25Okapi
import faiss
import numpy as np

# Sample documents
docs = [
    "The Eiffel Tower is in Paris.",
    "Machine learning is a subset of AI.",
    "Paris is the capital of France.",
    "Deep learning is a type of machine learning.",
    "France is in Europe."
]

# BM25 (sparse) indexing
tokenized_docs = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized_docs)

# FAISS indexing over TF-IDF vectors (a stand-in for learned dense embeddings)
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(docs).toarray().astype(np.float32)  # FAISS expects float32
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# Query processing
query = "Where is the Eiffel Tower?"
tokenized_query = query.lower().split()

# BM25 retrieval: indices of the two highest-scoring documents
bm25_scores = bm25.get_scores(tokenized_query)
top_bm25_idx = np.argsort(bm25_scores)[::-1][:2]

# FAISS retrieval: indices of the two nearest neighbours
query_vector = vectorizer.transform([query]).toarray().astype(np.float32)
_, top_faiss_idx = index.search(query_vector, 2)

# Combine results: union of the document indices found by both retrievers
retrieved_idx = sorted(set(top_bm25_idx.tolist() + top_faiss_idx.flatten().tolist()))
print("Retrieved Documents:", [docs[i] for i in retrieved_idx])


Retrieved Documents: ['The Eiffel Tower is in Paris.', 'Paris is the capital of France.']
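
For comparison, the tree-building idea behind RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval: cluster the chunks, summarize each cluster, repeat) can be sketched as follows. This is a heavily simplified sketch under stated assumptions, not the paper's implementation: TF-IDF vectors and KMeans stand in for learned embeddings and Gaussian-mixture clustering, `summarize` merely concatenates text where RAPTOR would call an LLM, and `build_tree`, `retrieve`, and `branching` are hypothetical names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize(texts):
    # Placeholder: RAPTOR would ask an LLM for an abstractive summary here.
    return " ".join(texts)

def build_tree(chunks, vectorizer, branching=2, max_levels=3):
    """Return a flat list of (level, text) nodes: leaves plus cluster summaries."""
    nodes = [(0, text) for text in chunks]
    layer, level = list(chunks), 0
    while len(layer) > branching and level < max_levels:
        level += 1
        vectors = vectorizer.transform(layer)
        labels = KMeans(n_clusters=branching, n_init=10, random_state=0).fit_predict(vectors)
        # One summary node per cluster becomes the next layer up
        layer = [
            summarize([t for t, lab in zip(layer, labels) if lab == c])
            for c in sorted(set(labels))
        ]
        nodes.extend((level, text) for text in layer)
    return nodes

def retrieve(query, nodes, vectorizer, k=2):
    """Collapsed-tree retrieval: score every node, leaf or summary, against the query."""
    texts = [text for _, text in nodes]
    sims = cosine_similarity(vectorizer.transform([query]), vectorizer.transform(texts))[0]
    return [nodes[i] for i in np.argsort(sims)[::-1][:k]]

chunks = [
    "The Eiffel Tower is in Paris.",
    "Machine learning is a subset of AI.",
    "Paris is the capital of France.",
    "Deep learning is a type of machine learning.",
    "France is in Europe.",
]
vectorizer = TfidfVectorizer().fit(chunks)
tree = build_tree(chunks, vectorizer)

for level, text in retrieve("Where is the Eiffel Tower?", tree, vectorizer):
    print(f"level {level}: {text}")
```

With real LLM summaries, higher-level nodes are much shorter than the concatenations used here, which is what lets them compete with leaf chunks for broad questions.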


##🔹 Key Takeaways
RAPTOR indexes a tree of recursively clustered and summarized chunks rather than a flat list of chunks.

Retrieval can draw on both detailed leaf chunks and higher-level summaries, depending on what the query needs.

This helps RAG answer questions whose context is spread across long documents.

#Indexing (ColBERT) in RAG


ColBERT (Contextualized Late Interaction over BERT) is an advanced indexing and retrieval method that improves efficiency and accuracy in Retrieval-Augmented Generation (RAG). Unlike traditional retrieval techniques, ColBERT leverages BERT embeddings with late interaction to enhance search relevance.

##🔹 Why Use ColBERT for RAG?
Context-Aware Retrieval – Uses deep contextualized embeddings for better ranking.

Efficient Indexing – Document token embeddings are computed offline, so only the query has to be encoded at search time.

Late Interaction Mechanism – Stores token-level embeddings, allowing fast and flexible retrieval.

Scalability – Unlike full BERT-based cross-encoders, ColBERT is optimized for large-scale retrieval.

##🔹 How ColBERT Indexing Works
Document Tokenization:

Splits documents into tokens and encodes them using BERT-based embeddings.

Indexing Stage:

Stores each token’s embedding separately in a FAISS index for efficient retrieval.
Allows asymmetric storage (query embeddings remain small, while documents have rich embeddings).

Query Processing & Retrieval:

Encodes the query into token embeddings.
Uses "MaxSim" late interaction to score documents efficiently by computing token-wise maximum similarities.
Retrieves the most relevant results based on token-level semantic relevance.
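
The MaxSim scoring step described above can be sketched with NumPy. This is a minimal sketch, assuming token embeddings are already available as matrices with one row per token; the tiny hand-written vectors and the `maxsim_score` name are illustrative, and a real ColBERT model would produce the embeddings with a BERT encoder.

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: for each query token, take its best cosine
    similarity over all document tokens, then sum over the query tokens."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                  # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy 2-dimensional "token embeddings" (assumed values for illustration)
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # covers both query tokens
doc_b = np.array([[1.0, 0.0]])                          # covers only the first

scores = [maxsim_score(query, doc) for doc in (doc_a, doc_b)]
ranking = np.argsort(scores)[::-1]
print("Scores:", scores)
print("Ranking (best first):", ranking.tolist())
```

Because each query token picks its own best-matching document token, a document is rewarded for covering every part of the query rather than for overall similarity alone.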


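##🔹 Simple ColBERT-style Indexing

The example below is a simplification: it encodes each document as one pooled sentence embedding, so it demonstrates single-vector dense retrieval with FAISS rather than token-level late interaction.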

In [None]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np


# Load a BERT-based model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Sample documents
docs = [
    "The Eiffel Tower is located in Paris.",
    "Machine learning is a branch of artificial intelligence.",
    "Paris is the capital of France.",
    "Deep learning is a subset of machine learning.",
    "France is a country in Europe.",
]

# Encode documents
doc_embeddings = model.encode(docs, convert_to_numpy=True)

#build faiss index
index = faiss.IndexFlatL2(doc_embeddings.shape[1])
index.add(doc_embeddings)

#query processing
query = "Where is the Eiffel Tower?"
query_embedding = model.encode([query], convert_to_numpy=True)

#search
top_k = 2
distances, indices = index.search(query_embedding, top_k)

#retrieve results
retrieved_docs = [docs[i] for i in indices[0]]
print("Retrieved Documents:", retrieved_docs)


Retrieved Documents: ['The Eiffel Tower is located in Paris.', 'Paris is the capital of France.']


##🔹 Key Takeaways
ColBERT enhances retrieval in RAG using token-level embeddings.

Late interaction mechanism balances efficiency and deep contextual understanding.

Combining ColBERT with FAISS makes it scalable for large-scale RAG applications.