This notebook demonstrates the "Augmentation" and "Generation" stages of Retrieval-Augmented Generation (RAG).

> **_"Augmentation"_** is the process of combining information retrieved from external sources with the user's query and feeding the result into a language model (e.g., Qwen or Llama) to improve the quality and relevance of the generated answer.
>
> _Why is augmentation useful?_ <br>
> Without augmentation, the LLM can rely only on what it learned during training. With augmentation:
>
> - You can provide **up-to-date** or **domain-specific** info
>
> - You reduce **hallucinations** (made-up facts)
>
> - You help smaller or fine-tuned models perform better on questions their training data does not cover
>
>
> **_"Generation"_** is the final step, where a **language model (LLM)** takes the **augmented input** (the **original user query + retrieved documents**) and produces a natural language response.
>
> _Think of it like this:_
> - The LLM is a smart student.
> - Retrieval gives it the right textbook pages. 
> - Augmentation is handing the student those pages along with the exam question, so the student can give a better-informed answer.
> - Generation is the student writing the answer. It is **informed by real data**, not just pretraining: the language model uses the retrieved context to produce **fact-based, grounded answers**, reducing hallucination and improving domain-specific performance.

## **Create the Vector Store**

In [1]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"

# Set up Chromadb
import chromadb
from chromadb.utils import embedding_functions
from chromadb.api.models.Collection import Collection

def create_vector_store(db_path: str, model_name: str) -> Collection:
    """
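    Creates (or reopens) a persistent ChromaDB collection backed by a SentenceTransformer embedding model.

    Args:
        db_path (str): Path where the ChromaDB database will be stored.
        model_name (str): Name of the SentenceTransformer model used to embed text.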

    Returns:
        Collection: A ChromaDB collection object for storing and retrieving embedded vectors.
    """

    # Initialize a ChromaDB PersistentClient with the specified database path
    client = chromadb.PersistentClient(path=db_path)
    
    # Create an embedding function backed by a SentenceTransformer model
    embeddings = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=model_name,
        device=device,
        trust_remote_code=True
    )

    # Create the collection, or reuse it if it already exists in the database
    db = client.get_or_create_collection(
        name="pdf_chunks",  # Name of the collection where embeddings are stored
        embedding_function=embeddings
    )

    # Return the created ChromaDB collection
    return db


In [2]:
db_alibaba_gte = create_vector_store(db_path="./chroma_alibaba_gte.db", model_name="Alibaba-NLP/gte-multilingual-base")

Some weights of the model checkpoint at Alibaba-NLP/gte-multilingual-base were not used when initializing NewModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
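
The notebook queries the `pdf_chunks` collection without adding any documents to it, so it presumably relies on chunks stored in an earlier step. As a minimal sketch, the hypothetical helper `ensure_populated` below (not part of the original notebook) fails early when the collection is empty. It assumes only that the object exposes a `count()` method, as ChromaDB collections do.

```python
def ensure_populated(db) -> int:
    """Return the number of stored items, or raise if the collection is empty."""
    n_items = db.count()  # ChromaDB collections report their size via count()
    if n_items == 0:
        raise ValueError(
            "The collection is empty; add the PDF chunks before querying it."
        )
    return n_items
```

For example, `ensure_populated(db_alibaba_gte)` could be called right after the store is created.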


## **Retrieve Chunks**

In [3]:
from typing import Any, Dict

def retrieve_chunks(db: Collection, query: str, n_results: int = 2) -> Dict[str, Any]:
    """
    Retrieves relevant chunks from the vector store for the given query.

    Args:
        db (Collection): The vector store object.
        query (str): The search query text.
        n_results (int, optional): The number of relevant chunks to retrieve. Defaults to 2.

    Returns:
        Dict[str, Any]: The query result, a dictionary with keys such as "ids", "documents" and "distances", where each of these holds one list per query text.
    """

    # Perform a query on the database to get the most relevant chunks
    relevant_chunks = db.query(query_texts=[query], n_results=n_results)

    # Return the retrieved relevant chunks
    return relevant_chunks


In [4]:
query = "What is the attention mechanism?"
relevant_chunks = retrieve_chunks(db=db_alibaba_gte, query=query)
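
`db.query` returns a dictionary, not a flat list: `ids`, `documents` and `distances` each hold one inner list per query text. The sketch below shows one way to inspect that structure. The helper `summarize_chunks` and the sample data are made up for illustration, and the code assumes the default result fields are present.

```python
from typing import Any, Dict, List, Tuple

def summarize_chunks(result: Dict[str, Any], preview_chars: int = 60) -> List[Tuple[str, float, str]]:
    """Return (id, distance, text preview) for the chunks of the first query."""
    ids = result["ids"][0]
    distances = result["distances"][0]
    documents = result["documents"][0]
    return [
        (chunk_id, distance, text[:preview_chars])
        for chunk_id, distance, text in zip(ids, distances, documents)
    ]

# Made-up data shaped like a query result for a single query text
sample_result = {
    "ids": [["chunk-0", "chunk-7"]],
    "distances": [[0.21, 0.35]],
    "documents": [["Attention Is All You Need ...", "An attention function maps a query ..."]],
}

for chunk_id, distance, preview in summarize_chunks(sample_result):
    print(f"{chunk_id}  distance={distance:.2f}  {preview}")
```

With the notebook's variables, `summarize_chunks(relevant_chunks)` would list the retrieved chunks; a smaller distance indicates a closer match to the query.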

## **Build Context**

In [5]:
def build_context(relevant_chunks) -> str:
    """
    Builds a single context string by combining texts from relevant chunks.

    Args:
        relevant_chunks: relevant chunks retrieved from the vector store.

    Returns:
        str: A single string containing all document chunks combined with newline separators.
    """

    # combine the text from relevant chunks with newline separator
    context = "\n".join(relevant_chunks['documents'][0])

    # Return the combined context string
    return context


In [6]:
context = build_context(relevant_chunks=relevant_chunks)
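
Joining every chunk can produce a prompt longer than the model can handle once `n_results` grows. A minimal sketch, assuming a simple character budget (the helper name `build_context_with_budget` and the `max_chars` parameter are illustrative, not from the notebook): chunks are added in retrieval order until the next one would exceed the budget.

```python
from typing import Any, Dict

def build_context_with_budget(relevant_chunks: Dict[str, Any], max_chars: int = 4000) -> str:
    """Join chunks in retrieval order, stopping before the budget is exceeded."""
    selected = []
    used = 0
    for chunk in relevant_chunks["documents"][0]:
        # Account for the newline separator between chunks
        extra = len(chunk) + (1 if selected else 0)
        if used + extra > max_chars:
            break
        selected.append(chunk)
        used += extra
    return "\n".join(selected)
```

A token-based budget computed with the model's tokenizer would be more precise; characters are used here only to keep the sketch free of dependencies.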

## **Augment Prompt**

In [7]:
def augment_prompt(context: str, query: str) -> str:
    """
    Generates a RAG prompt based on the given context and query.

    Args:
        context (str): The context the LLM should use to answer the question.
        query (str): The user query that needs to be answered based on the context.

    Returns:
        str: The generated RAG prompt.
    """

    # Format the prompt with the provided context and query
    rag_prompt = f""" You are an AI model trained for question answering. You should answer the given question based on the given context only.
    Question : {query}
    \n
    Context : {context}
    \n
    If the answer is not present in the given context, respond as: The answer to this question is not available in the provided content.
    """

    # Return the formatted prompt
    return rag_prompt


In [8]:
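# Note: this rebinds the name `augment_prompt` from the function to the prompt string,
# so the function definition must be re-run before this cell can be executed again.
augment_prompt = augment_prompt(context=context, query=query)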
print(augment_prompt)

 You are an AI model trained for question answering. You should answer the given question based on the given context only.
    Question : What is the attention mechanism?
    

    Context :  Attention Is All You Need Ashish Vaswani Google Brain avaswani@google.comNoam Shazeer Google Brain noam@google.comNiki Parmar Google Research nikip@google.comJakob Uszkoreit Google Research usz@google.com Llion Jones Google Research llion@google.comAidan N. Gomezy University of Toronto aidan@cs.toronto.eduŁukasz Kaiser Google Brain lukaszkaiser@google.com Illia Polosukhinz illia.polosukhin@gmail.com Abstract The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions en
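
The printed prompt keeps the indentation of the triple-quoted string, so the `Question` and `Context` lines reach the model with leading spaces. One possible way to avoid this, sketched below with the hypothetical names `PROMPT_TEMPLATE` and `build_rag_prompt`, is to dedent the template once with `textwrap.dedent` and fill it in with `str.format`.

```python
from textwrap import dedent

# Dedent once so the template carries no leading indentation
PROMPT_TEMPLATE = dedent("""\
    You are an AI model trained for question answering. You should answer the given question based on the given context only.

    Question: {query}

    Context: {context}

    If the answer is not present in the given context, respond as: The answer to this question is not available in the provided content.
    """)

def build_rag_prompt(context: str, query: str) -> str:
    """Fill the dedented template with the retrieved context and the user query."""
    return PROMPT_TEMPLATE.format(context=context, query=query)
```

Dedenting the template before substitution matters: the retrieved context may contain its own newlines, which would otherwise change what `dedent` treats as common indentation.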

## **LLM Response**

In [9]:
from transformers import AutoModelForCausalLM, AutoTokenizer

def ask_llm(prompt: str) -> str:
    """
    Sends a prompt to Alibaba's Qwen LLM and returns the answer.

    Args:
        prompt (str): The augmented prompt.

    Returns:
        str: The LLM generated answer.
    """

    model_name = "Qwen/Qwen2.5-1.5B-Instruct"

    # Load the model weights; torch_dtype="auto" lets transformers pick the dtype from the checkpoint
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype="auto",
    )

    # Move the model to the GPU when one is available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Wrap the prompt in the chat format the instruct model expects
    messages = [
        # {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # Generate up to 512 new tokens
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512
    )
    # Drop the prompt tokens so that only the newly generated tokens remain
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    # Decode the generated tokens back into text
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    return response


In [10]:
response = ask_llm(prompt=augment_prompt)

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
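
`ask_llm` reloads the model and tokenizer on every call, which is slow when several questions are asked in a row. A minimal sketch of one alternative, assuming the same model name as above: the hypothetical helper `load_qwen` (not part of the notebook) caches the loaded pair with `functools.lru_cache`.

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def load_qwen(model_name: str = "Qwen/Qwen2.5-1.5B-Instruct"):
    """Load the model and tokenizer once; later calls return the cached pair."""
    # Imported lazily so that defining this function stays cheap
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto").to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer
```

Inside `ask_llm`, the two `from_pretrained` calls could then be replaced by `model, tokenizer = load_qwen()`.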


In [11]:
print(f"User Query:\n'{query}'\n\nLLM's Response:\n'{response}'")

User Query:
'What is the attention mechanism?'

LLM's Response:
'The attention mechanism is a technique used in machine learning, particularly in natural language processing and computer vision tasks. It allows a model to focus on different parts of its input during training and inference.

In the provided text, the attention mechanism is introduced as part of the Transformer architecture proposed by Ashish Vaswani et al., which aims to improve upon traditional recurrent or convolutional neural network-based sequence transduction models. The attention mechanism in Transformers uses self-attention to weigh the importance of different parts of the input sequence during the computation of a contextual representation. This helps the model to capture dependencies between elements within sequences more effectively than previous architectures did.'
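
Because the prompt asks the model to reply with a fixed sentence when the context lacks the answer, downstream code can check for that sentence. A minimal sketch, assuming the model reproduces the sentence closely (which is not guaranteed); `FALLBACK_SENTENCE` and `is_unanswered` are illustrative names:

```python
FALLBACK_SENTENCE = "The answer to this question is not available in the provided content."

def is_unanswered(response: str) -> bool:
    """Return True when the response contains the fallback sentence from the prompt."""
    # Normalize case and whitespace before comparing
    normalized = " ".join(response.lower().split())
    return FALLBACK_SENTENCE.lower().rstrip(".") in normalized
```

For the response shown above, `is_unanswered(response)` would return `False`, since the model answered from the retrieved context.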
