## Optimizing RAG performance
- Better retriever
- Lower computational load

### Dataset preparation
- Using the Stanford Question Answering Dataset (SQuAD)

In [1]:
# %pip install datasets
# %pip install einops
import nest_asyncio
nest_asyncio.apply()

In [2]:

from datasets import load_dataset

# loading the dataset
dataset = load_dataset("squad")

# Extract the unique contexts from the dataset

data = [item["context"] for item in dataset["train"]]

texts = list(set(data))


In [3]:
texts[0]

'The United States is the chief remaining nation to assign official responsibilities to a region called the Near East. Within the government the State Department has been most influential in promulgating the Near Eastern regional system. The countries of the former empires of the 19th century have in general abandoned the term and the subdivision in favor of Middle East, North Africa and various forms of Asia. In many cases, such as France, no distinct regional substructures have been employed. Each country has its own French diplomatic apparatus, although regional terms, including Proche-Orient and Moyen-Orient, may be used in a descriptive sense. The most influential agencies in the United States still using Near East as a working concept are as follows.'

### Embed dataset
Embeddings are computed at the context level: each element of the `texts` list above is embedded into a single vector.

In [4]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from tqdm import tqdm
def batch_iterate(lst, batch_size):
    for i in range(0, len(lst), batch_size):
        yield lst[i : i + batch_size]
    
class EmbedData:
    """
    A class for generating and managing text embeddings using a Hugging Face embedding model.
    This class handles the loading of an embedding model and batch processing of text data
    to generate embeddings.
    Attributes:
        embed_model_name (str): Name of the Hugging Face model to use for embeddings.
            Defaults to "nomic-ai/nomic-embed-text-v1.5".
        embed_model: Loaded Hugging Face embedding model instance.
        batch_size (int): Number of texts to process in each batch. Defaults to 32.
        embeddings (list): Storage for generated embeddings.
    Example:
        >>> embed_data = EmbedData()
        >>> texts = ["Sample text 1", "Sample text 2"]
        >>> embed_data.embed(texts)
        >>> embeddings = embed_data.embeddings
    """
    def __init__(self, 
                 embed_model_name="nomic-ai/nomic-embed-text-v1.5",
                 batch_size = 32):
        self.embed_model_name = embed_model_name
        self.embed_model = self._load_embed_model()
        self.batch_size = batch_size
        self.embeddings = []

    def _load_embed_model(self):
        """
        Load and initialize a HuggingFace embedding model with specified configurations.

        Returns:
            HuggingFaceEmbedding: Initialized embedding model instance configured with the model name.
        """
        embed_model = HuggingFaceEmbedding(model_name=self.embed_model_name,
                                           trust_remote_code=True,
                                           cache_folder='./hf_cache')
        return embed_model
    
    def generate_embedding(self, context):
        return self.embed_model.get_text_embedding_batch(context)
    

    def embed(self, contexts):
        """
        Embeds a list of contexts into vector representations using batched processing.
        This method processes the input contexts in batches and generates embeddings 
        for each context using the underlying embedding model. The embeddings are stored
        internally in the class instance.
        Args:
            contexts (list): List of text contexts to be embedded.
                             Each context should be a string.
        Example:
            embedder = EmbedData()
            contexts = ["text1", "text2", "text3"]
            embedder.embed(contexts)
        """
        self.contexts = contexts
        
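        # Ceiling division so the progress bar total includes the final partial batch
        n_batches = (len(contexts) + self.batch_size - 1) // self.batch_size
        for batch_context in tqdm(batch_iterate(contexts, self.batch_size),
                                  total=n_batches,
                                  desc="Embedding data in batches"):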
                                  
            batch_embeddings = self.generate_embedding(batch_context)
            
            self.embeddings.extend(batch_embeddings)


In [18]:
batch_size = 32

embeddata = EmbedData(batch_size=batch_size)

embeddata.embed(texts)

<All keys matched successfully>
Embedding data in batches: 591it [27:37,  2.81s/it]                           
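
As a small sanity check on the batching logic, the sketch below reuses a generator equivalent to `batch_iterate` on a toy list. It illustrates that the number of batches is the ceiling of `len(lst) / batch_size`, so a floor-divided progress bar `total` can undercount by one when the last batch is partial. The names `num_batches` and `toy_texts` are hypothetical and are not part of the notebook.

```python
def batch_iterate(lst, batch_size):
    """Yield consecutive slices of lst, each with at most batch_size items."""
    for i in range(0, len(lst), batch_size):
        yield lst[i : i + batch_size]


def num_batches(n_items, batch_size):
    """Ceiling division: the number of batches batch_iterate will yield."""
    return (n_items + batch_size - 1) // batch_size


# Hypothetical stand-in for the `texts` list
toy_texts = [f"text {i}" for i in range(70)]
batches = list(batch_iterate(toy_texts, 32))

print([len(b) for b in batches])          # [32, 32, 6]
print(len(batches), num_batches(70, 32))  # 3 3
print(70 // 32)                           # 2 (floor division misses the partial batch)
```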


In [19]:
# Write the embeddings to a pickle file
import pickle
with open("data/squad_embedded_full.pickle", "wb") as h:
    pickle.dump(embeddata,h)

In [20]:
import dill as pickle

with open("data/squad_embedded_full.pickle", "rb") as h:
    embeddata = pickle.load(h)

### Vector Database
Now that the dataset is embedded, we can define a vector database and load the embeddings into it.

In [71]:
## Qdrant
from qdrant_client import models
from qdrant_client import QdrantClient
class QdrantVDB:
    def __init__(self, collection_name, vector_dim=768, batch_size=512):
        self.collection_name = collection_name
        self.batch_size = batch_size
        self.vector_dim = vector_dim
    def define_client(self):
        self.client = QdrantClient(url="http://localhost:6333",
                                   prefer_grpc=True)
        
    def create_collection(self):
        if not self.client.collection_exists(collection_name=self.collection_name):
            self.client.create_collection(collection_name=self.collection_name,

                                          vectors_config=models.VectorParams(
                                                              size=self.vector_dim,
                                                              distance=models.Distance.DOT,
                                                              on_disk=True),
                                          optimizers_config=models.OptimizersConfigDiff(
                                                                            default_segment_number=5,
                                                                            indexing_threshold=0),
                                            # Adding Binary quantization for faster search
                                          quantization_config=models.BinaryQuantization(
                                                        binary=models.BinaryQuantizationConfig(always_ram=True)),
                                         )
    
    def ingest_data(self, embeddata):
        for batch_context, batch_embeddings in tqdm(zip(batch_iterate(embeddata.contexts, self.batch_size),
                                                        batch_iterate(embeddata.embeddings, self.batch_size)),
                                                    total=len(embeddata.contexts)//self.batch_size,
                                                    desc = "Ingesting in batches"):
            self.client.upload_collection(collection_name=self.collection_name,
                                          vectors=batch_embeddings,
                                          payload=[{"context": context} for context in batch_context])
            
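        # Re-enable indexing once, after all batches are uploaded
        # (it was disabled at creation time with indexing_threshold=0)
        self.client.update_collection(collection_name=self.collection_name,
                                      optimizer_config=models.OptimizersConfigDiff(indexing_threshold=20000))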

In [72]:
database = QdrantVDB(collection_name="squad_collection_qa")
database.define_client()
database.create_collection()
database.ingest_data(embeddata)

Ingesting in batches: 37it [00:04,  8.83it/s]                        
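
The collection above is created with `models.Distance.DOT`. Dot product and cosine similarity produce the same ranking only when the vectors have unit length, so it may be worth checking the norm of a few stored embeddings. Whether the embeddings produced in this notebook are unit-normalized is an assumption to verify, not something the notebook states. The toy vectors and helper names below are hypothetical and only illustrate the difference.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """Cosine similarity: dot product divided by both lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


query = [1.0, 0.0]
short_aligned = [0.9, 0.1]   # almost the same direction, small magnitude
long_off_axis = [3.0, 3.0]   # 45 degrees away, large magnitude

# On raw vectors, the dot product favors the longer vector
print(dot(query, short_aligned) < dot(query, long_off_axis))        # True

# Cosine similarity favors the better-aligned vector
print(cosine(query, short_aligned) > cosine(query, long_off_axis))  # True

# After normalization, the dot product agrees with cosine similarity
q, s = normalize(query), normalize(short_aligned)
print(math.isclose(dot(q, s), cosine(query, short_aligned)))        # True
```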


### Retriever
Search and retrieve from the vector database.

In [73]:
import time
class Retriever:
    def __init__(self, vector_db, embeddata):
        self.vector_db = vector_db
        self.embeddata = embeddata
    def search(self, query):
        query_embedding = self.embeddata.embed_model.get_query_embedding(query)

        # Start the timer for logging the time taken for search
        start_time = time.time()

        result = self.vector_db.client.search(
            collection_name = self.vector_db.collection_name,
            query_vector = query_embedding,
            search_params = models.SearchParams(
                quantization = models.QuantizationSearchParams(
                    ignore = False,      # use the quantized vectors for the search
                    rescore = True,      # re-rank candidates with the original vectors
                    oversampling = 2.0   # pre-select 2x the limit before rescoring
                )
            ),
            timeout = 1000,
        )

        end_time = time.time()
        elapsed_time = end_time - start_time

        print(f"Execution time for the search: {elapsed_time:.4f} seconds")

        return result

In [74]:
## Sample query 
query = "What is the capital of France?"
Retriever(database, embeddata).search(query)[0]

Execution time for the search: 0.0584 seconds


ScoredPoint(id='fa116322-8428-4a95-a4a4-c401f40489a5', version=253, score=0.7405614256858826, payload={'context': 'Paris is located in northern central France. By road it is 450 kilometres (280 mi) south-east of London, 287 kilometres (178 mi) south of Calais, 305 kilometres (190 mi) south-west of Brussels, 774 kilometres (481 mi) north of Marseille, 385 kilometres (239 mi) north-east of Nantes, and 135 kilometres (84 mi) south-east of Rouen. Paris is located in the north-bending arc of the river Seine and includes two islands, the Île Saint-Louis and the larger Île de la Cité, which form the oldest part of the city. The river\'s mouth on the English Channel (La Manche) is about 233 mi (375 km) downstream of the city, established around 7600 BC. The city is spread widely on both banks of the river. Overall, the city is relatively flat, and the lowest point is 35 m (115 ft) above sea level. Paris has several prominent hills, the highest of which is Montmartre at 130 m (427 ft). Montmart

The retriever works well.
Next, let's integrate it with an LLM to generate responses from the retrieved context and the user query.

### Defining RAG - LLM 

In [75]:
from llama_index.llms.ollama import Ollama
class RAG:
    def __init__(self,
                 retriever,
                 llm_name="llama3.2:1b"):
        self.llm_name = llm_name
        self.llm = self._setup_llm()
        self.retriever = retriever
        self.qa_prompt_tmpl_str = """Context information is below.
                                     ---------------------
                                     Context: {context}      
                                     ---------------------
                                     Given the context information above I want you
                                     to think step by step to answer the query in a 
                                     crisp manner. In case you don't know the 
                                     answer, please say 'I don't know!'
                                     ---------------------
                                     Query: {query}
                                     ---------------------
                                     Answer:"""

    def _setup_llm(self):
        return Ollama(model=self.llm_name)
    
    def generate_context(self, query):
        result = self.retriever.search(query)
        # Collect the stored context text from each retrieved point's payload
        contexts = [point.payload["context"] for point in result]

        # Separate the retrieved contexts so the LLM can tell them apart
        return "\n\n---\n\n".join(contexts)
    
    def query(self, query):
        context = self.generate_context(query=query)

        prompt = self.qa_prompt_tmpl_str.format(context=context,
                                                query=query)
        response = self.llm.complete(prompt)
        return dict(response)['text']
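
To inspect what the LLM actually receives, the prompt assembly can be reproduced without Qdrant or Ollama. The sketch below is a simplified stand-in for `generate_context` plus the `format` call in `query`, with shortened prompt wording and made-up contexts. `QA_PROMPT`, `build_prompt`, and the sample strings are hypothetical names for illustration. It uses `textwrap.dedent` so that source-code indentation does not end up inside the prompt.

```python
from textwrap import dedent

# Simplified version of the QA template, dedented so that the
# indentation of the source code is not sent to the model
QA_PROMPT = dedent("""\
    Context information is below.
    ---------------------
    Context: {context}
    ---------------------
    Given the context information above, answer the query concisely.
    If you don't know the answer, say 'I don't know!'
    ---------------------
    Query: {query}
    ---------------------
    Answer:""")


def build_prompt(contexts, query, separator="\n\n---\n\n"):
    """Join the retrieved contexts and fill in the template."""
    return QA_PROMPT.format(context=separator.join(contexts), query=query)


# Made-up contexts standing in for retrieved payloads
sample_contexts = [
    "Paris is located in northern central France.",
    "The river Seine flows through Paris.",
]
prompt = build_prompt(sample_contexts, "Which river flows through Paris?")
print(prompt)
```

Printing the prompt before calling the LLM is a cheap way to confirm that the retrieved passages are relevant and that the separators are where they are expected.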

### Using RAG

In [76]:
retriever = Retriever(database, embeddata)

rag = RAG(retriever=retriever)

In [79]:
query = """The premium and VIP services in Airports
           are reserved for which type of passengers?"""

answer = rag.query(query=query)

Execution time for the search: 0.1595 seconds


In [80]:
from IPython.display import display, Markdown

display(Markdown(f"**Answer:** {str(answer)}"))

**Answer:** The premium and VIP services in airports are usually reserved for First and Business class passengers.

### Conclusion
- We sped up RAG retrieval using binary quantization, which reduces compute time and memory usage at the cost of some accuracy.
- Other quantization methods, such as `ScalarQuantization`, are worth exploring for better output quality.
- Binary quantization usually works best on high-dimensional data.
- For smaller datasets, we are better off using traditional RAG without quantization.
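
The trade-off in the points above can be illustrated with a minimal pure-Python sketch of sign-based binary quantization. This is an illustration under simplifying assumptions, not Qdrant's internal implementation: each component is reduced to one bit (1 if positive, 0 otherwise), and codes are compared with the Hamming distance. The helper names and toy vectors are hypothetical.

```python
def binarize(vector):
    """Pack the sign of each component into an integer bit code."""
    bits = 0
    for x in vector:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


# Memory per vector for the 768-dimensional collection used in this notebook
dim = 768
float32_bytes = dim * 4   # 4 bytes per float32 component
binary_bytes = dim // 8   # 1 bit per component
print(float32_bytes, binary_bytes, float32_bytes // binary_bytes)  # 3072 96 32

v1 = [0.9, -0.2, 0.4, -0.7]
v2 = [0.8, -0.1, 0.5, -0.6]   # close to v1
v3 = [-0.9, 0.3, -0.4, 0.7]   # opposite of v1

codes = [binarize(v) for v in (v1, v2, v3)]
print([format(c, "04b") for c in codes])   # ['1010', '1010', '0101']
print(hamming(codes[0], codes[1]))         # 0
print(hamming(codes[0], codes[2]))         # 4
```

Note that `v1` and `v2` are different vectors but receive the same code, which is the accuracy cost of quantization in miniature. With only four dimensions the codes are very coarse; with hundreds of dimensions there are far more bits to tell vectors apart.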

## Limitations
- Binary quantization reduces each vector embedding to a binary representation, which leads to a loss of granularity relative to the original data.
- We can increase precision by oversampling the approximate nearest neighbours and rescoring them.
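
The effect of oversampling can be sketched in a few lines of pure Python. The example below is a toy model of the idea rather than Qdrant's implementation: candidates are pre-selected by Hamming distance on sign bits, then re-ranked with the exact dot product on the original vectors. The function `search` and the toy data are hypothetical, chosen so that the best match is missed without oversampling and recovered with it.

```python
import math


def binarize(vector):
    """Reduce each component to a sign bit (1 if positive, else 0)."""
    return [1 if x > 0 else 0 for x in vector]


def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def dot(a, b):
    """Exact dot product on the original vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(query, vectors, limit, oversampling=1.0):
    """Pre-select limit * oversampling candidates on binary codes, then rescore."""
    n_candidates = min(len(vectors), math.ceil(limit * oversampling))
    q_code = binarize(query)
    codes = [binarize(v) for v in vectors]
    # Approximate stage: rank by Hamming distance on the binary codes
    by_hamming = sorted(range(len(vectors)), key=lambda i: hamming(q_code, codes[i]))
    candidates = by_hamming[:n_candidates]
    # Exact stage: re-rank the candidates with the original vectors
    rescored = sorted(candidates, key=lambda i: dot(query, vectors[i]), reverse=True)
    return rescored[:limit]


query = [1.0, 0.2, 0.2]
vectors = [
    [0.3, 0.3, 0.3],     # same sign pattern as the query, low exact score
    [2.0, -0.1, 0.1],    # best exact match, but one sign bit differs
    [-0.5, 0.5, 0.5],
    [-1.0, -1.0, -1.0],
]

print(search(query, vectors, limit=1, oversampling=1.0))  # [0]
print(search(query, vectors, limit=1, oversampling=2.0))  # [1]
```

Larger oversampling factors widen the candidate pool and so recover more true neighbours, at the cost of more exact comparisons in the rescoring stage.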
