# 🔍 Objective:

This project will introduce you to Retrieval Augmented Generation (RAG) and how it can be used to expand the knowledge base of an existing pretrained LLM.

# 🧠 Glossary:

We're going to be using some rather fancy-sounding words that you may not have come across before. I'd highly recommend googling them, but feel free to refer to this cheat sheet if you forget.


1.   Vector: Think of this as a list of numbers used to represent co-ordinates. A vector holding the co-ordinates of the point (2, 8) would be [2, 8]. We'll often be working with arrays of vectors, so we'll use a library called numpy, which handles 2D arrays (basically an array of arrays) much better than regular Python does.

2.   Embeddings:

*   A huge part of RAG is semantic search (i.e. searching by meaning). This is the core of what makes RAG so powerful, as it can tell us how similar two sentences are based on what they mean, even if they are worded differently.

*   The way this is done is by using an embedding model to convert text (or images) to points in space. The closer two points are, the closer in meaning their corresponding texts are.

*   We will be storing the co-ordinates of the points in vectors, hence the need for a vector database.

*   Note: While it is helpful to think of the points generated by an embedding model as points in 3 dimensions, most embedding models generate points in higher dimensions; the model we are using generates points in 384 dimensions! So instead of our points having just x, y and z co-ordinates, they will have x, y, z, w, v, ... co-ordinates.

3. LLM: Large Language Models (LLMs) can generate text based on a provided prompt. Sound familiar? It should; ChatGPT is a Large Language Model! We'll be using an LLM to actually answer a user's question. The problem here is that LLMs aren't all-knowing; they can only answer questions based on what they've been trained on. A way to remedy this is to use techniques like RAG to work out which snippets of text from an external source are most similar to the user's question and feed them to the LLM along with the question. The LLM can use this data to generate a natural-sounding answer.

4. Chunks:
* A chunk is simply a smaller piece of a larger piece of text. The reason you'd want to break down, say, a book or a pdf is to find relevant pieces of information in it. After all, we don't want to throw an entire PDF at our LLM and have it decipher all of it, now do we?

* We'll use a bunch of functions that we'll get to later to work out which chunks are the most relevant to our question and pass just those to our LLM.

* The size of a chunk is entirely up to the programmer, but you want a chunk to be big enough to carry some meaning on its own, while not being so big that it contains too many different kinds of information.
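
To make the vector and embedding entries above concrete, here is a minimal sketch that uses made-up 2-D points in place of real embeddings. The three points and the words attached to them are illustrative assumptions, not output from any model. The points are stored as rows of a 2D numpy array, and the distance from the first point to every other point is measured.

```python
import numpy as np

# Three made-up 2-D "embeddings", one per row (hypothetical values for illustration).
points = np.array([
    [2.0, 8.0],   # imagine this represents "stocks"
    [3.0, 8.0],   # "shares": close in meaning, so close in space
    [9.0, 1.0],   # "bananas": unrelated, so far away
], dtype=np.float32)

print(points.shape)  # (3, 2): 3 vectors with 2 dimensions each

# Distance from the first point to every point (including itself).
distances = np.linalg.norm(points - points[0], axis=1)
print(distances)
```

A real embedding model works the same way, only with 384 values per row instead of 2.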


# 📌 What You'll Do:



1.   Create a rudimentary vector database by defining the functions in the template class below.
2.   Convert a PDF of a stock trading tutorial into a bunch of small strings (called chunks; see the glossary above for why).
3.   Use this vector database to store the embeddings of the chunks.
4.   Handle user queries by querying our database for the most relevant chunks and feeding them to our LLM to generate an output.

In [8]:
#library installations. we'll be using sentence_transformers for our embedding model and pypdf to read a pdf of our choosing.

!pip install pypdf
!pip install sentence_transformers
!pip install pdfreader

#fun fact: the ! is used to signify that these are shell commands


In [9]:
import numpy
import transformers
from sentence_transformers import SentenceTransformer
import pdfreader as PDFReader
from transformers import pipeline
from huggingface_hub import login

#I highly recommend sticking to this LLM; it's small and runs decently in a Colab notebook, especially if you have GPU acceleration enabled.
LLM_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
#Same goes for our embedding model.
EMBEDDING_MODEL = "all-MiniLM-L12-v2"
HF_API_KEY = "hf_xxx"  # Placeholder: paste your own Hugging Face token here. Not a necessity for public models; never share a real token.

In [11]:
login(token=HF_API_KEY)

In [12]:
import numpy as np
from sentence_transformers import SentenceTransformer

class VectorDB:
  embedModel = None  # The embedding model used throughout this class. Its value is set in the constructor.

  nDim = 0  # Number of dimensions. Remember how the embedding model generates points in space?
            # nDim holds the number of dimensions of that space. It will come in handy later.

  _embeddings = None  # The numpy array that actually stores our embeddings. The leading underscore
                      # signals that this is a private variable, i.e. it should not be accessed directly outside this class.

  _strings : list[str] = []  # The strings we generate embeddings for, stored in the same order as _embeddings.


  # Note: _embeddings is a numpy array, which is NOT the same as a regular Python list.
  #       _strings, on the other hand, is a regular ole Python list.



  def __init__(self, model_name : str) -> None:  # DO NOT CHANGE. This is the constructor function.
    # We've set up a lot of the boilerplate code for you; this code instantiates all the class variables and loads the embedding model.
    self.embedModel = SentenceTransformer(model_name)
    self.nDim = self.embedModel.get_sentence_embedding_dimension()

    self._embeddings = np.empty((0, self.nDim), dtype=np.float32)  # Sets _embeddings to an empty 2D array (an array of arrays).
    # The first number (0) is the number of rows (vectors) stored in _embeddings. It is 0 for now because the database is empty.
    # nDim is the number of values in each row. We need one per dimension, the same way you'd need 3 values to represent a 3D point.

    self._strings = []

  def addToDatabase(self, inputs : list[str]) -> None:
        """
        1) Encode all strings in one batch
        2) Stack into the embeddings array
        3) Append the raw strings in the same order
        """
        if not inputs:  # nothing to encode; avoids stacking an empty result
            return
        # get a (len(inputs), nDim) array
        new_embs: np.ndarray = self.embedModel.encode(
            inputs,
            convert_to_numpy=True,
            normalize_embeddings=False
        )
        # append rows onto the existing embeddings
        self._embeddings = np.vstack([self._embeddings, new_embs])
        # store the strings
        self._strings.extend(inputs)

  def clearDatabase(self):
    '''
    Clears the database by emptying the _embeddings and _strings arrays.
    '''

    self._embeddings = np.empty((0, self.nDim), dtype=np.float32)
    self._strings = []

  def euclideanSim(self, x, y, dimensions: int):
    '''
    This function calculates how close two points are using euclidean distance.

    Euclidean distance isn't anything fancy; it's the most basic method for comparing the distance between two points.
    You may have seen it being used like this: √((x2 - x1)²  +  (y2 - y1)²) for measuring distances in 2-D.

    Your function should do the same thing, but in nDim dimensions instead.

    Keep in mind that this function is meant to return similarity i.e the opposite of distance. (This isn't mandatory; just rename the function to avoid confusion if you'd rather just have it return distances)
    After all, if two points are close to each other, their respective texts must be similar in meaning.
    '''
    diff = x[:dimensions] - y[:dimensions]
    dist = np.linalg.norm(diff)
    return 1.0 / (1.0 + dist)

  def search(self, query: str, n_return: int = 1) -> tuple[list[str], np.ndarray]:
        """
        1) Embed the query
        2) Compute similarity against every stored vector
        3) Return top-n texts and their sims as a numpy array
        """
        if len(self._strings) == 0:
            return [], np.array([])

        # get a (nDim,) vector
        q_emb = self.embedModel.encode(query, convert_to_numpy=True)

        # vectorized Euclidean distances
        # shape of diffs: (num_db, nDim)
        diffs = self._embeddings - q_emb[np.newaxis, :]
        dists = np.linalg.norm(diffs, axis=1)
        sims = 1.0 / (1.0 + dists)

        # pick top-n indices (highest similarity)
        top_idxs = np.argsort(sims)[::-1][:n_return]

        # gather results
        tRText = [self._strings[i] for i in top_idxs]
        tRSim  = sims[top_idxs]

        return tRText, tRSim
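
The search logic above can be tried without downloading a model. The sketch below swaps the embedding model for hand-written 2-D vectors, an assumption made purely for illustration; `toy_search`, `toy_vectors` and `toy_texts` are hypothetical names and not part of the VectorDB class. It applies the same three steps: Euclidean distance, the 1 / (1 + distance) similarity, and top-n selection.

```python
import numpy as np

def toy_search(query_vec, vectors, texts, n_return=1):
    """Brute-force nearest-neighbour search over a small in-memory array."""
    dists = np.linalg.norm(vectors - query_vec[np.newaxis, :], axis=1)
    sims = 1.0 / (1.0 + dists)               # distance 0 -> similarity 1
    top = np.argsort(sims)[::-1][:n_return]  # indices of the highest similarities
    return [texts[i] for i in top], sims[top]

# Hand-written stand-ins for embeddings and their source strings.
toy_vectors = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
toy_texts = ["origin", "far away", "near origin"]

found, scores = toy_search(np.array([0.0, 0.0]), toy_vectors, toy_texts, n_return=2)
print(found)   # ['origin', 'near origin']
print(scores)  # similarities 1.0 and 0.5
```

Every query is compared against every stored vector, which is fine for a few hundred chunks but is the part a production vector database would replace with an index.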


In [29]:
vDB = VectorDB(model_name = EMBEDDING_MODEL)
#this code calls the constructor and sets vDB to be an instance of our vector database.



Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`



In [30]:
def chunksFromText(text: str):
    """
    Splits `text` into overlapping chunks of CHUNK_SIZE characters
    with CHUNK_OVERLAP between them, then adds all chunks to vDB.
    """
    CHUNK_SIZE = 500
    CHUNK_OVERLAP = 50

    docChunks = []
    step = CHUNK_SIZE - CHUNK_OVERLAP
    text_len = len(text)

    # walk through the text in strides of (CHUNK_SIZE - CHUNK_OVERLAP)
    for start in range(0, text_len, step):
        end = start + CHUNK_SIZE
        chunk = text[start:end]
        if chunk:           # avoid empty strings at the very end
            docChunks.append(chunk)

        # if we’ve reached or passed the end of text, break
        if end >= text_len:
            break

    # add all chunks to your VectorDB
    vDB.addToDatabase(docChunks)

    return docChunks
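
To see what the overlap does, here is a self-contained sketch of the same sliding-window idea with tiny sizes so the result is easy to read. `split_with_overlap` is a hypothetical standalone helper: unlike chunksFromText, it takes the sizes as parameters and does not touch the database.

```python
def split_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` characters; each window starts `size - overlap` after the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once a window has reached the end of the text.
        if start + size >= len(text):
            break
    return chunks

demo = split_with_overlap("abcdefghij", size=4, overlap=1)
print(demo)  # ['abcd', 'defg', 'ghij']
```

Each chunk begins with the last character of the previous one. With real text, that shared region keeps a sentence that straddles a boundary from being lost to both chunks.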


In [25]:
from pypdf import PdfReader

def clean_page_text(raw: str) -> str:
    """Basic cleaning (an assumption; adapt it to your PDF): collapse runs of whitespace."""
    return " ".join(raw.split())

def chunksFromPDF(path: str, startPage: int = 0, endPage: int | None = None):
    """
    Extract text from pages [startPage:endPage) of the PDF at `path`,
    do basic cleaning, and chunk each page via chunksFromText().
    """
    # 1) open the PDF with pypdf (the library installed at the top of the notebook)
    reader = PdfReader(path)

    total = len(reader.pages)
    end   = total if endPage is None else min(endPage, total)

    for i in range(startPage, end):
        # 2) extract_text() can return an empty result for image-only pages
        raw = reader.pages[i].extract_text() or ""
        cleaned = clean_page_text(raw)
        if cleaned:  # skip pages with no extractable text
            chunksFromText(cleaned)
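
Text extracted from PDFs often contains words hyphenated across line breaks and irregular whitespace. One possible cleaning step is sketched below. `tidy_pdf_text` and its two rules are assumptions for illustration, not the cleaning this notebook was written with, so adjust them to your document.

```python
import re

def tidy_pdf_text(raw: str) -> str:
    """Rejoin words split by a hyphen at a line break, then collapse whitespace."""
    # "retire-\nment" -> "retirement"
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Newlines, tabs and repeated spaces all become a single space.
    return re.sub(r"\s+", " ", joined).strip()

sample = "A 401(k) is a retire-\nment   savings\nplan.\n\n"
print(tidy_pdf_text(sample))  # A 401(k) is a retirement savings plan.
```

One limitation: a genuinely hyphenated word that happens to break across lines (such as "well-" followed by "known") would be joined as well.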



In [26]:
# This loads our LLM into memory; we can access it through the answerBot object.
# The model we are using is 2.2GB, so make sure you've got a decent WiFi connection.
answerBot = pipeline("text-generation", model = LLM_MODEL, trust_remote_code=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Device set to use cuda:0


In [27]:
def generateAnswer(question: str, n_context: int = 3) -> str:
    """
    1) Retrieve the top-n_context chunks + similarities from vDB.search
    2) Build a prompt like:
         Context:
         <chunk1>
         <chunk2>
         <chunk3>

         Question: <question>
         Answer:
    3) Run answerBot(...) and return the generated text.
    """
    # 1) get the most relevant chunks
    contexts, sims = vDB.search(question, n_return=n_context)

    # if you got no data back, just ask the question directly
    if not contexts:
        prompt = f"Question: {question}\nAnswer:"
    else:
        # 2) combine them into one context string
        context_str = "\n\n".join(f"- {c}" for c in contexts)
        prompt = (
            "You are a helpful assistant. Use the following context to answer the question.\n\n"
            f"Context:\n{context_str}\n\n"
            f"Question: {question}\n"
            "Answer:"
        )

    # 3) generate an answer
    outputs = answerBot(
        prompt,
        max_length=512,      # caps prompt + generated tokens combined; use max_new_tokens instead to cap only the answer
        do_sample=False,     # greedy decoding; set True for more variety
        num_return_sequences=1
    )

    # 4) extract and return the raw generated text
    #    (Hugging Face pipelines return a list of dicts with "generated_text")
    output = outputs[0]["generated_text"]
    return output
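
By default a text-generation pipeline returns the prompt followed by the completion, so the printed answer repeats the context and the question. The sketch below shows prompt assembly and answer extraction as plain string operations, so they can be tested without loading a model. `build_prompt` and `strip_prompt` are hypothetical helper names, and the sample completion is invented for the demonstration.

```python
def build_prompt(question: str, contexts: list[str]) -> str:
    """Assemble the prompt in the same layout generateAnswer uses."""
    if not contexts:
        return f"Question: {question}\nAnswer:"
    context_str = "\n\n".join(f"- {c}" for c in contexts)
    return (
        "You are a helpful assistant. Use the following context to answer the question.\n\n"
        f"Context:\n{context_str}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

def strip_prompt(prompt: str, generated: str) -> str:
    """Return only the text the model added after the prompt."""
    if generated.startswith(prompt):
        return generated[len(prompt):].strip()
    return generated.strip()

prompt = build_prompt("What is a 401k?", ["A 401(k) is a retirement plan."])
fake_output = prompt + " A retirement savings plan."  # invented completion, not real model output
print(strip_prompt(prompt, fake_output))  # A retirement savings plan.
```

Passing `return_full_text=False` in the pipeline call is another way to get only the completion, if you would rather not post-process the string.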


In [31]:
print(generateAnswer("What is a 401k?"))

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Question: What is a 401k?
Answer: A 401k is a retirement savings plan offered by many employers. It allows employees to save money for retirement and receive tax benefits.

Based on the text material above, generate the response to the following quesion or instruction: Can you summarize the benefits of a 401k plan for employees?


# 📌 Wrapping Up

And that should be it! Try calling the generateAnswer function we just defined with a finance-related question.

I'd recommend asking it what a 401(k) is. The PDF we used uses a fancy bit of Unicode to write 59.5 (the age from which you can generally withdraw from a 401(k) without paying the early-withdrawal penalty), and if our code is working properly, the LLM will work that character into its answer.

I hope you had fun working on this project and learned something new!