This notebook demonstrates RAG's "Indexing" & "Retrieval" processes.

> **_"Indexing"_** processes raw documents (a PDF file in this example) by extracting their content (parsing) and then splitting it into smaller, meaningful chunks. These chunks are then converted into vector embeddings using an embedding model (two embedding models are demonstrated here) and stored in a vector database (ChromaDB) for efficient retrieval at query time.
>
> Steps in Indexing:
> - _Parsing_ 
>   - Extract raw text from documents (e.g., PDFs, Web pages, etc)
> - _Chunking_ 
>   - Split large text into smaller chunks (e.g., paragraphs or sentences) to improve retrieval granularity.
> - _Embedding_ 
>   - Each chunk is passed through an embedding model (like a Sentence Transformer or BGE) to convert it into a dense vector representation.
> - _Storing in a Vector Database / Index_ 
>   - These embeddings are stored in a vector index using tools like ChromaDB, FAISS, or Pinecone.
>
> **_"Retrieval"_** is the process of **finding and returning the most relevant documents or text chunks** from an external knowledge base (which was previously indexed) in response to a user query or prompt.
> 
> Steps in Retrieval:
> -  _User Input / Query_ 
>    - Example: ```What is the attention mechanism?```
> - _Query Embedding_ 
>   - The input is passed through an **embedding model** to get a **dense vector representation**.
> - _Similarity Search_ 
>   - This vector is compared with all the document vectors in the **vector index** (created during the indexing step) using **cosine similarity**, **dot product**, or **L2 distance**. 
> - _Top-K Retrieval_
>   - The top ```K``` most similar document chunks are retrieved. These are the chunks whose embeddings are closest to the query embedding, i.e. the ones treated as **most relevant to the input query**.
> - _Return Results_
>   - The retrieved chunks (usually 3–10, depending on your setup) are passed into the generation model as part of the prompt.
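
The "Similarity Search" and "Top-K Retrieval" steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the 3-dimensional vectors are made up, the helper names (`dot`, `cosine_similarity`, `l2_distance`, `top_k`) are hypothetical, and a real vector database such as ChromaDB uses an index rather than a linear scan.

```python
import math
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the two vector norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query: Sequence[float],
          doc_vectors: List[Sequence[float]],
          k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) for the k most similar vectors."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(doc_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings (made up for illustration, not produced by a real model)
doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
query_vector = [1.0, 0.0, 0.0]

results = top_k(query_vector, doc_vectors, k=2)
for index, score in results:
    print(f"chunk {index}: cosine similarity {score:.2f}")
```

Note that cosine similarity and dot product rank higher values as better matches, while L2 distance ranks lower values as better matches.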

## **Extract Text**

In [1]:
from typing import List
from PyPDF2 import PdfReader

def text_extract(pdf_path: str) -> str:
    """
    Extracts text from all pages of a given PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the PDF, concatenated with newline separators.
    """

    # An empty list to store extracted text from PDF pages
    pdf_pages = []

    # Open the PDF file in binary read mode
    with open(pdf_path, 'rb') as file:

        # Create a PdfReader object to read the PDF
        pdf_reader = PdfReader(file)

        # Iterate through all pages in the PDF
        for page in pdf_reader.pages:

            # Extract text from the current page
            text = page.extract_text()

            # Append the extracted text to the list
            pdf_pages.append(text)

    # Join all extracted text using newline separator
    pdf_text = "\n".join(pdf_pages)

    # Return the extracted text as a single string
    return pdf_text


In [2]:
# Download the PDF file
import requests

pdf_url = 'https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf'
response = requests.get(pdf_url, timeout=30)
response.raise_for_status()  # Stop here if the download failed instead of saving an error page

pdf_path = 'attention_is_all_you_need.pdf'
with open(pdf_path, 'wb') as file:
    file.write(response.content)

In [3]:
pdf_text = text_extract(pdf_path)
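
Text extracted from a PDF tends to keep the hyphens that split words at line ends, which is presumably why a fragment like "convolu- tional" shows up in the retrieved chunks later in this notebook. A minimal cleanup sketch, assuming the split looks like `-\n` in the extracted text; `dehyphenate` is a hypothetical helper, not part of PyPDF2:

```python
import re


def dehyphenate(text: str) -> str:
    """Rejoin words that were split with a hyphen at a line break."""
    # "convolu-\ntional" becomes "convolutional"
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


sample = "recurrent and convolu-\ntional layers"
cleaned = dehyphenate(sample)
print(cleaned)
```

The trade-off is that a genuine compound broken at a line end (for example `self-\nattention`) also loses its hyphen, so this is a heuristic rather than a safe default.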

## **Chunk Text**



In [4]:
from typing import List
import re
from collections import deque


def text_chunk(text: str, max_length: int = 1000) -> List[str]:
    """
    Splits a given text into chunks while ensuring that sentences remain intact.

    The function maintains sentence boundaries by splitting based on punctuation
    (. ! ?) and attempts to fit as many sentences as possible within `max_length`
    per chunk.

    Args:
        text (str): The input text to be chunked.
        max_length (int, optional): Target maximum length of each chunk, in characters.
            Default is 1000. A single sentence longer than this is kept whole as its own chunk.

    Returns:
        List[str]: A list of text chunks, each containing full sentences.
    """

    # Split text into sentences while ensuring punctuation (. ! ?) stays at the end
    sentences = deque(re.split(r'(?<=[.!?])\s+', text.replace('\n', ' ')))

    # An empty list to store the final chunks
    chunks = []

    # Temporary string to hold the current chunk
    chunk_text = ""

    while sentences:
        # Access sentence from the deque and strip any extra spaces
        sentence = sentences.popleft().strip()

        # Check if the sentence is non-empty before processing
        if sentence:
            # If adding this sentence exceeds max_length and chunk_text is not empty, store the current chunk
            if len(chunk_text) + len(sentence) > max_length and chunk_text:

                # Save the current chunk
                chunks.append(chunk_text)

                # Start a new chunk with the current sentence
                chunk_text = sentence
            else:
                # Append the sentence, adding a separating space only if the chunk already has text
                chunk_text = f"{chunk_text} {sentence}" if chunk_text else sentence

    # Add the last chunk if there's any remaining text
    if chunk_text:
        chunks.append(chunk_text)

    return chunks

In [5]:
chunks = text_chunk(pdf_text)
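
To see what the sentence-splitting regex inside `text_chunk` does on its own, here is a small standalone sketch. The sample strings are made up for illustration; the pattern is the one used above.

```python
import re

# Split on whitespace that follows ., ! or ? (the lookbehind keeps the punctuation)
pattern = r'(?<=[.!?])\s+'

sample = "Attention maps a query to an output. Is it fast? Yes! It uses keys\nand values."
sentences = re.split(pattern, sample.replace('\n', ' '))
for sentence in sentences:
    print(repr(sentence))

# Caveat: abbreviations ending in a period are treated as sentence ends
print(re.split(pattern, "See Fig. 2 for details."))
```

The second call shows the main limitation of a punctuation-only split: "Fig." is cut off from "2 for details.", so chunks built from scientific text can break at references and abbreviations.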

## **Create the Vector Store**

In [None]:
import torch
# Set up Chromadb
import chromadb
from chromadb.utils import embedding_functions
from chromadb.api.models.Collection import Collection  # import the class, not the module

def create_vector_store(db_path: str, model_name: str) -> Collection:
    """
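    Creates a persistent ChromaDB vector store backed by a Sentence Transformers embedding model.

    Args:
        db_path (str): Path where the ChromaDB database will be stored.
        model_name (str): Name of the Sentence Transformers model used to embed the chunks.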

    Returns:
        Collection: A ChromaDB collection object for storing and retrieving embedded vectors.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Initialize a ChromaDB PersistentClient with the specified database path
    client = chromadb.PersistentClient(path=db_path)
    
    # Create an embedding function that wraps the given Sentence Transformers model
    embeddings = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=model_name,
        device=device,
        trust_remote_code=True
    )

    # Create the collection on first use, or reopen it if it already exists
    db = client.get_or_create_collection(
        name="pdf_chunks",  # Name of the collection where embeddings will be stored
        embedding_function=embeddings
    )

    # Return the created ChromaDB collection
    return db


In [12]:
db_default = create_vector_store(db_path="./chroma_defautl.db", model_name="sentence-transformers/all-MiniLM-L6-v2")
db_alibaba_gte = create_vector_store(db_path="./chroma_alibaba_gte.db", model_name="Alibaba-NLP/gte-multilingual-base")

Some weights of the model checkpoint at Alibaba-NLP/gte-multilingual-base were not used when initializing NewModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [8]:
# Insert chunks into vector store
import os
import uuid

def insert_chunks_vectordb(chunks: List[str], db: Collection, file_path: str) -> None:
    """
    Inserts text chunks into a ChromaDB vector store with metadata.

    Args:
        chunks (List[str]): List of text chunks to be stored.
        db (Collection): The ChromaDB collection where the chunks will be inserted.
        file_path (str): Path of the source file for metadata.

    Returns:
        None
    """

    # Extract the file name from the given file path
    file_name = os.path.basename(file_path)

    # Generate unique IDs for each chunk
    id_list = [str(uuid.uuid4()) for _ in range(len(chunks))]

    # Create metadata for each chunk, storing the chunk index and source file name
    metadata_list = [{"chunk": i, "source": file_name} for i in range(len(chunks))]

    # Define batch size for inserting chunks to optimize performance
    batch_size = 40

    # Insert chunks into the database in batches
    for i in range(0, len(chunks), batch_size):
        end_id = min(i + batch_size, len(chunks))  # Ensure we don't exceed list length

        # Add the batch of chunks to the vector store
        db.add(
            documents=chunks[i:end_id],
            metadatas=metadata_list[i:end_id],
            ids=id_list[i:end_id]
        )

    print(f"{len(chunks)} chunks added to the vector store")
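
The batching loop in `insert_chunks_vectordb` can be checked in isolation. The sketch below reproduces only the slicing arithmetic; `batch_ranges` is a hypothetical helper introduced for illustration.

```python
from typing import List, Tuple


def batch_ranges(n_items: int, batch_size: int) -> List[Tuple[int, int]]:
    """Return the (start, end) slice bounds used for each batch."""
    return [
        (start, min(start + batch_size, n_items))  # clamp the last batch
        for start in range(0, n_items, batch_size)
    ]


print(batch_ranges(36, 40))   # fewer items than one batch
print(batch_ranges(100, 40))  # the last batch is shorter
```

With a batch size of 40, a document that yields 40 chunks or fewer is inserted in a single `db.add` call.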


In [25]:
insert_chunks_vectordb(chunks=chunks, db=db_default, file_path=pdf_path)
insert_chunks_vectordb(chunks=chunks, db=db_alibaba_gte, file_path=pdf_path)

36 chunks added to the vector store
36 chunks added to the vector store
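
Because the IDs come from `uuid.uuid4()`, every run of the insert cell generates fresh IDs, so re-running it against the same persistent collection stores the same chunks again under new IDs. One possible way around this, sketched here as an assumption rather than as what this notebook does, is to derive IDs from the file name and chunk index; `deterministic_ids` is a hypothetical helper.

```python
import uuid
from typing import List


def deterministic_ids(file_name: str, n_chunks: int) -> List[str]:
    """Build stable IDs: the same file name and chunk index give the same ID."""
    return [
        str(uuid.uuid5(uuid.NAMESPACE_URL, f"{file_name}#chunk-{i}"))
        for i in range(n_chunks)
    ]


ids_first_run = deterministic_ids("attention_is_all_you_need.pdf", 36)
ids_second_run = deterministic_ids("attention_is_all_you_need.pdf", 36)
print(ids_first_run == ids_second_run)
```

Stable IDs like these could then be passed to the collection's `upsert` method instead of `add`, so that a repeated run overwrites existing entries rather than duplicating them.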


## **Retrieve Chunks**

In [13]:
from typing import Any, Dict

def retrieve_chunks(db: Collection, query: str, n_results: int = 2) -> Dict[str, Any]:
    """
    Retrieves relevant chunks from the vector store for the given query.

    Args:
        db (Collection): The vector store object
        query (str): The search query text.
        n_results (int, optional): The number of relevant chunks to retrieve. Defaults to 2.

    Returns:
        Dict[str, Any]: The query result; the matching chunk texts are under the "documents" key.
    """

    # Perform a query on the database to get the most relevant chunks
    relevant_chunks = db.query(query_texts=[query], n_results=n_results)

    # Return the retrieved relevant chunks
    return relevant_chunks
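
The value returned by `db.query` is a dictionary of nested lists, with one inner list per query text, which is why the cells below index `["documents"][0]`. The sketch below uses a hand-written dictionary that only mimics that shape; the IDs, texts, metadata, and distances are invented, and `flatten_first_query` is a hypothetical helper.

```python
from typing import Any, Dict, List, Tuple

# Hand-written stand-in for a query result (all values are made up)
example_result: Dict[str, Any] = {
    "ids": [["id-a", "id-b"]],
    "documents": [["first chunk text", "second chunk text"]],
    "metadatas": [[{"chunk": 12, "source": "paper.pdf"},
                   {"chunk": 3, "source": "paper.pdf"}]],
    "distances": [[0.41, 0.57]],
}


def flatten_first_query(result: Dict[str, Any]) -> List[Tuple[str, int, float]]:
    """Pair each retrieved text with its chunk index and distance."""
    docs = result["documents"][0]    # index 0: results for the first query text
    metas = result["metadatas"][0]
    dists = result["distances"][0]
    return [(doc, meta["chunk"], dist) for doc, meta, dist in zip(docs, metas, dists)]


for doc, chunk_index, distance in flatten_first_query(example_result):
    print(f"chunk {chunk_index} (distance {distance}): {doc}")
```

Chroma reports distances rather than similarities, so smaller values indicate closer matches. Printing them next to each chunk makes it easier to compare how confident each embedding model is in its top results.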


##### Retrieve with ChromaDB's default Embedding Model 
Model : "sentence-transformers/all-MiniLM-L6-v2"

In [14]:
query = "What is the attention mechanism?"
relevant_chunks = retrieve_chunks(db=db_default, query=query)

for i, doc in enumerate(relevant_chunks["documents"][0]):
    print(f"Reference {i}\n'{doc}'\n")
    

Reference 0
'We also experimented with using learned positional embeddings [ 8] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training. 4 Why Self-Attention In this section we compare various aspects of self-attention layers to the recurrent and convolu- tional layers commonly used for mapping one variable-length sequence of symbol representations (x1;:::;x n)to another sequence of equal length (z1;:::;z n), withxi;zi2Rd, such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata. One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.'

Reference 1
'In these models, the number of operatio

##### Retrieve with Alibaba's General Text Embedding model
Model : "Alibaba-NLP/gte-multilingual-base"

In [15]:
query = "What is the attention mechanism?"
relevant_chunks = retrieve_chunks(db=db_alibaba_gte, query=query)

for i, doc in enumerate(relevant_chunks["documents"][0]):
    print(f"Reference {i}\n'{doc}'\n")
    

Reference 0
' Attention Is All You Need Ashish Vaswani Google Brain avaswani@google.comNoam Shazeer Google Brain noam@google.comNiki Parmar Google Research nikip@google.comJakob Uszkoreit Google Research usz@google.com Llion Jones Google Research llion@google.comAidan N. Gomezy University of Toronto aidan@cs.toronto.eduŁukasz Kaiser Google Brain lukaszkaiser@google.com Illia Polosukhinz illia.polosukhin@gmail.com Abstract The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring signiﬁcantly less time to train.'



### Conclusion
I'd say that Alibaba's embedding model does better at retrieving the relevant piece of information for this query, likely thanks to its stronger ability to capture the semantic meaning of the text.