<a target="_blank" href="https://colab.research.google.com/github.com/surrey-nlp/NLP-2025/blob/main/lab11/lab11_RAG_from_scratch.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Lab 11: RAG from Scratch
Adapted from: https://github.com/mrdbourke/simple-local-rag

In this lab, we will be building a RAG (Retrieval Augmented Generation) pipeline from scratch and have it run on a local GPU.

Specifically, we will implement it so that it can use a PDF file as a reference and answer questions (queries) related to the PDF using a Large Language Model (LLM).

There are frameworks such as [LlamaIndex](https://www.llamaindex.ai/) and [LangChain](https://www.langchain.com/) that make building this pipeline easy; however, the goal of this lab is to learn how the individual parts work.

## What is RAG?

RAG stands for Retrieval Augmented Generation.

It was introduced in the paper [*Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks*](https://arxiv.org/abs/2005.11401).

Each step can be roughly broken down to:

* **Retrieval** - Seeking relevant information from a source given a query. For example, getting relevant passages of Wikipedia text from a database given a question.
* **Augmented** - Using the relevant retrieved information to modify an input to a generative model (e.g. an LLM).
* **Generation** - Generating an output given an input. For example, in the case of an LLM, generating a passage of text given an input prompt.

## Why RAG?

The main goal of RAG is to improve the generation outputs of LLMs.

Two primary improvements can be seen as:
1. **Preventing hallucinations** - LLMs are incredible, but they are prone to hallucination, that is, generating something that *looks* correct but isn't. RAG pipelines can help LLMs generate more factual outputs by providing them with factual (retrieved) inputs. And even if the generated answer from a RAG pipeline doesn't seem correct, retrieval gives you access to the sources it came from.
2. **Work with custom data** - Many base LLMs are trained on internet-scale text data. This gives them a great ability to model language; however, they often lack specific knowledge. RAG systems can provide LLMs with domain-specific data such as medical information or company documentation and thus customize their outputs to suit specific use cases.

The authors of the original RAG paper mentioned above outlined these two points in their discussion.

> This work offers several positive societal benefits over previous work: the fact that it is more
strongly grounded in real factual knowledge (in this case Wikipedia) makes it “hallucinate” less
with generations that are more factual, and offers more control and interpretability. RAG could be
employed in a wide variety of scenarios with direct benefit to society, for example by endowing it
with a medical index and asking it open-domain questions on that topic, or by helping people be more
effective at their jobs.

RAG can also be a much quicker solution to implement than fine-tuning an LLM on specific data.


## What we're going to build

We're going to build a RAG pipeline which enables us to chat with a PDF document. Specifically, we will be using the University of Surrey's own Taught Programmes Regulations. The PDF can be found in the lab's GitHub repository; remember to upload it to your notebook environment.


We'll write the code to:
1. Open a PDF document (you could use almost any PDF here).
2. Format the text of the PDF document ready for an embedding model (this process is known as text splitting/chunking).
3. Embed all of the chunks of text in the document, turning them into numerical representations which we can store for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on passages from the document.

That's the structure we'll follow.

It's similar to the workflow outlined on the NVIDIA blog which [details a local RAG pipeline](https://developer.nvidia.com/blog/rag-101-demystifying-retrieval-augmented-generation-pipelines/).

<img src="https://github.com/mrdbourke/simple-local-rag/blob/main/images/simple-local-rag-workflow-flowchart.png?raw=true" alt="flowchart of a local RAG workflow" />

## Requirements and setup

In [1]:
# Perform Google Colab installs (if running in Google Colab)
import os

if "COLAB_GPU" in os.environ:
    print("[INFO] Running in Google Colab, installing requirements.")
    !pip install -U torch # requires torch 2.1.1+ (for efficient sdpa implementation)
    !pip install PyMuPDF # for reading PDFs with Python
    !pip install tqdm # for progress bars
    !pip install sentence-transformers # for embedding models
    !pip install accelerate # for quantization model loading
    !pip install bitsandbytes # for quantizing models (less storage space)
    !pip install flash-attn --no-build-isolation # for faster attention mechanism = faster LLM inference

[INFO] Running in Google Colab, installing requirements.

## 1. Document/Text Processing and Embedding Creation

Ingredients:
* PDF document of choice.
* Embedding model of choice.

Steps:
1. Import PDF document.
2. Process text for embedding (e.g. split into chunks of sentences).
3. Embed text chunks with embedding model.
4. Save embeddings to file for later use (embeddings will store on file for many years or until you lose your hard drive).

### Import PDF Document

This will work with many other kinds of documents; however, we'll start with PDF since many people have PDFs.

But just keep in mind, text files, email chains, support documentation, articles and more can also work.

Again, remember to download the PDF provided in the GitHub repository and include it in your notebook.

Once this is done, we will import the pages of our PDF as text by first defining the PDF path and then opening and reading it with PyMuPDF (`import fitz`). Then, we'll save each half page to a dictionary and append that dictionary to a list for ease of use later.

In [2]:
import fitz
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip() # note: this might be different for each doc (best to experiment)

    # Other potential text formatting functions can go here
    return cleaned_text

# Open PDF and get lines/pages
# Note: this only focuses on text, rather than images/figures etc
def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics.

    Parameters:
        pdf_path (str): The file path to the PDF document to be opened and read.

    Returns:
        list[dict]: A list of dictionaries, two per page (one per half page), each
        containing the zero-based page number, character count, word count, raw sentence
        count, estimated token count, and the extracted text for that half page.
    """
    doc = fitz.open(pdf_path)  # open a document
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        text = text_formatter(text)

        # For the purposes of this lab, we will be naively splitting each page of the document in half, so that our token counts per page can be lower.
        # This is important to make sure each appended datapoint fits into the embedding model.
        first_half = text[:len(text)//2]
        second_half = text[len(text)//2:]
        pages_and_texts.append({"page_number": page_number,
                                "page_char_count": len(first_half),
                                "page_word_count": len(first_half.split(" ")),
                                "page_sentence_count_raw": len(first_half.split(". ")),
                                "page_token_count": len(first_half) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                                "text": first_half})

        pages_and_texts.append({"page_number": page_number,
                                "page_char_count": len(second_half),
                                "page_word_count": len(second_half.split(" ")),
                                "page_sentence_count_raw": len(second_half.split(". ")),
                                "page_token_count": len(second_half) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                                "text": second_half})
    return pages_and_texts


pdf_path = "Uni_of_Surrey_Taught_Regulations.pdf"
pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[0]

0it [00:00, ?it/s]

{'page_number': 0,
 'page_char_count': 1597,
 'page_word_count': 268,
 'page_sentence_count_raw': 12,
 'page_token_count': 399.25,
 'text': "A1: Regulations for taught programmes  1    Introduction and scope  1.  For the purposes of these Regulations, programmes of study that lead to the awards  listed in Table 1 are termed 'taught programmes of studies' or 'taught programmes'  and the awards are collectively referred to as 'taught awards'.1  Where there are  arrangements for a particular award this is indicated in the text or by a footnote.   2.  The requirements of these Regulations apply to taught programmes delivered at the  University, those delivered through collaborative provision and distance learning, via  a part-time or other mode, and taught programmes delivered by the University's  Associated and Accredited Institutions2 that lead to University of Surrey awards.3   The Foundation Year is a one year programme that, upon successful completion,  allows progression to a number 

### Get some stats on the text

Let's perform a rough exploratory data analysis (EDA) to get an idea of the size of the texts (e.g. character counts, word counts etc) we're working with.

Many embedding models have limits on the size of the texts they can ingest. For example, the [`sentence-transformers`](https://www.sbert.net/docs/pretrained_models.html) model [`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) has an input size of 384 tokens.

This means that the model has been trained to ingest texts of up to 384 tokens and turn them into embeddings (1 token ~= 4 characters ~= 0.75 words). This is the reason why we split each page in half in the previous section.
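To make the rule of thumb concrete, here is a minimal sketch of the two estimates (characters / 4 and words / 0.75) and a check against the 384-token input size. The helper names are made up for illustration, and these are rough heuristics rather than real tokenizer counts.

```python
EMBEDDING_TOKEN_LIMIT = 384  # input size quoted above for all-mpnet-base-v2


def estimate_tokens_from_chars(text: str) -> float:
    """Rough estimate using 1 token ~= 4 characters."""
    return len(text) / 4


def estimate_tokens_from_words(text: str) -> float:
    """Rough estimate using 1 token ~= 0.75 words."""
    return len(text.split()) / 0.75


def probably_fits(text: str, limit: int = EMBEDDING_TOKEN_LIMIT) -> bool:
    """True if the character-based estimate is within the limit."""
    return estimate_tokens_from_chars(text) <= limit


sample = "The requirements of these Regulations apply to taught programmes."
print(estimate_tokens_from_chars(sample))
print(estimate_tokens_from_words(sample))
print(probably_fits(sample))
```

The two estimates will rarely agree exactly, which is a useful reminder that only the model's own tokenizer gives the true count.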

In [3]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text
0,0,1597,268,12,399.25,A1: Regulations for taught programmes 1 In...
1,0,1598,328,12,399.5,ogrammes and awards  admission  registra...
2,1,904,168,7,226.0,A1: Regulations for taught programmes 2 ei...
3,1,905,222,1,226.25,FHEQ level 4 Diploma of Higher Education ...
4,2,1091,250,2,272.75,A1: Regulations for taught programmes 3 In...


In [4]:
# Get stats
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count
count,62.0,62.0,62.0,62.0,62.0
mean,15.0,1489.55,271.92,10.06,372.39
std,9.02,299.85,47.97,3.87,74.96
min,0.0,650.0,120.0,1.0,162.5
25%,7.25,1369.5,245.0,8.0,342.38
50%,15.0,1550.0,276.0,11.0,387.5
75%,22.75,1684.5,306.75,12.0,421.12
max,30.0,1905.0,362.0,17.0,476.25


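It seems that we have an average token count of about 372 per half page (recall that each row is half of a page). This means an average half page fits within the input capacity of the `all-mpnet-base-v2` model (384 tokens). Note, however, that the largest half pages reach roughly 476 estimated tokens, so some still exceed the limit before further chunking.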

### Further text processing (splitting pages into sentences)

The ideal way of processing text before embedding it is still an active area of research.

But for this lab, we will simply be chunking each half page of text into a group of 10 sentences (these values are not set in stone and can be explored).

To do this, we will be using the natural language processing (NLP) library of [spaCy](https://spacy.io/). We've explored this in the previous labs.

Why split into sentences?

* Easier to handle than larger pages of text (especially if pages are densely filled with text).
* We can pinpoint which group of sentences was used to produce an answer within a RAG pipeline.

> **Resource:** See [spaCy install instructions](https://spacy.io/usage).

In [5]:
from spacy.lang.en import English # see https://spacy.io/usage for install instructions

nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer/
nlp.add_pipe("sentencizer")

# We will loop over each of the half pages and break them down into sentences
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/62 [00:00<?, ?it/s]

In [6]:
import random
# Inspect an example
random.sample(pages_and_texts, k=1)

[{'page_number': 10,
  'page_char_count': 940,
  'page_word_count': 193,
  'page_sentence_count_raw': 1,
  'page_token_count': 235.0,
  'text': "A1: Regulations for taught programmes  11    Graduate Diploma  45 out of 120 credits at FHEQ level 6  Bachelor's degree  (honours), three years  120 out of 360 credits; a minimum of 90 must be at  FHEQ level 6  Bachelor's degree  (honours), including  professional training  200 out of 360 credits, of which 90 must be at FHEQ  level 6 with 80 P level credits  P level  The Executive Dean of a Faculty may exempt a  student from up to one third of the total P level  credits required by a programme where the student  can show that they have previously successfully  acquired experience that is the equivalent of the  relevant professional training required by the  University  Integrated Master’s degree 240 out of 480 credits of which 120 credits must be  at FHEQ level 7  Postgraduate Certificate  30 out of 60 credits at FHEQ level 7  Postgraduate Dip

### Chunking our sentences together

Let's take a step to break down our list of sentences/text into smaller chunks, a.k.a. **chunking**.

Why do we do this?

1. Easier to manage similar sized chunks of text.
2. Don't overload the embedding model's capacity for tokens (e.g. if an embedding model has a capacity of 384 tokens, there could be information loss if you try to embed a sequence of 400+ tokens).
3. Our LLM context window (the number of tokens an LLM can take in) may be limited, and processing it requires compute power, so we want to make sure we're using it as well as possible.

In [7]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# Create a function that splits a list into sublists of the desired size
def split_list(input_list: list,
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (or as close as possible).

    For example, a list of 17 sentences would be split into two lists of [[10], [7]]
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/62 [00:00<?, ?it/s]

In [8]:
# Sample an example from the group (note: many samples have only 1 chunk as they have <=10 sentences total)
random.sample(pages_and_texts, k=1)

[{'page_number': 20,
  'page_char_count': 1404,
  'page_word_count': 241,
  'page_sentence_count_raw': 10,
  'page_token_count': 351.0,
  'text': 'A1: Regulations for taught programmes  21    ASSESSMENT AND REASSESSMENT  Submission of coursework  122.  Each Faculty should ensure that there are robust and transparent arrangements in  place for collecting student work and recording the date of submission.  Statements  of these arrangements and where and how coursework is required to be submitted  are to be found in the programme handbook.  123.  Students are required to submit coursework units of assessment, including project  and other reports and dissertations, on time and in accordance with the  arrangements published in the handbook for the relevant programme.  Arrangements  for the submission of Master’s dissertations are described in Regulations 134-157  below.  Where a unit of assessment has not been submitted at the first attempt and  there are no confirmed extenuating circumstan

### Splitting each chunk into its own item

We'd like to embed each chunk of sentences into its own numerical representation.

So to keep things clean, let's create a new list of dictionaries, each containing a single chunk of sentences along with relevant information such as the page number, as well as statistics about the chunk.

In [9]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters

        pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(pages_and_chunks)

  0%|          | 0/62 [00:00<?, ?it/s]

99
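To see what the join-and-repair step in the cell above does, here is a small sketch on made-up sentences. The helper name `join_sentences` and the toy strings are illustrative only; the two string operations are the same ones used above. It assumes, as the regex in the cell suggests, that sentence strings arrive without trailing whitespace, so joining them with `""` glues a full stop directly onto the next capital letter.

```python
import re


def join_sentences(sentences: list[str]) -> str:
    """Join sentences into one chunk and repair '.A' -> '. A' boundaries."""
    joined = "".join(sentences).replace("  ", " ").strip()
    return re.sub(r"\.([A-Z])", r". \1", joined)


toy_sentences = [
    "Students must submit on time.",
    "Late work is penalised.  ",
    "See the handbook.",
]
print(join_sentences(toy_sentences))
# Students must submit on time. Late work is penalised. See the handbook.

# Limitation: the regex cannot tell sentence boundaries from abbreviations.
print(join_sentences(["U.S.A"]))
# U. S. A
```

The second call shows why this kind of cleanup is worth spot-checking on your own document: any full stop followed directly by a capital letter gets a space inserted.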

In [10]:
# View a random sample
random.sample(pages_and_chunks, k=1)

[{'page_number': 26,
  'sentence_chunk': 'thdraw, or retakes the modules and subsequently fails, their registration is terminated. 167. Where a student has passed a module after reassessment or compensation this is recorded in their transcript. Failure of modules with a value of more than 60 credits 168. Where an undergraduate student fails modules with a value of more than 60 credits at that level or stage of their programme, their progression through their programme is halted and the Board of Examiners will require them to retake the units of assessment they have failed in the next academic year, in order to pass any failed modules and progress to the next stage or level of their studies. In such a case the Board of Examiners requires that the student is reassessed, with or without attendance. 169. Where a taught postgraduate student has failed modules with a value of more than 60 credits the Board of Examiners requires that their progression through their programme is halted and the

Note that after chunking our data, some chunks may be left with very little text. These don't provide much information, so we can filter them out as follows.

In [11]:
# Filter out chunks with an estimated token count of 20 or fewer
df = pd.DataFrame(pages_and_chunks)
min_token_length = 20
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': 0,
  'sentence_chunk': "A1: Regulations for taught programmes 1  Introduction and scope 1. For the purposes of these Regulations, programmes of study that lead to the awards listed in Table 1 are termed 'taught programmes of studies' or 'taught programmes' and the awards are collectively referred to as 'taught awards'.1 Where there are arrangements for a particular award this is indicated in the text or by a footnote. 2. The requirements of these Regulations apply to taught programmes delivered at the University, those delivered through collaborative provision and distance learning, via a part-time or other mode, and taught programmes delivered by the University's Associated and Accredited Institutions2 that lead to University of Surrey awards.3  The Foundation Year is a one year programme that, upon successful completion, allows progression to a number of named undergraduate degree programmes. The Foundation Year does not follow these Regulations but is covered by a s

### Embedding our text chunks

As we've learnt in previous labs, humans understand text, but machines understand numbers best. Therefore, we will use a text embedding model in our RAG pipeline.

Specifically, we'll get the `all-mpnet-base-v2` model (you can see the model's intended use on the [Hugging Face model card](https://huggingface.co/sentence-transformers/all-mpnet-base-v2#intended-uses)).

In [1]:
# Requires !pip install sentence-transformers
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device="cpu") # choose the device to load the model to (note: GPU will often be *much* faster than CPU)


With our embedding model loaded, we will now encode each of the sentence chunks using the model and store them as an 'embedding' entry.

In [12]:
%%time

# Send the model to the GPU
embedding_model.to("cuda") # requires a GPU installed, for reference on my local machine, I'm using a NVIDIA RTX 4090

# Create embeddings one by one on the GPU
for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

  0%|          | 0/95 [00:00<?, ?it/s]

CPU times: user 2.25 s, sys: 466 ms, total: 2.72 s
Wall time: 3.43 s


### Save embeddings to file

Since creating embeddings can be a time-consuming process (not so much in our case, but it can be for larger datasets), let's turn our `pages_and_chunks_over_min_token_len` list of dictionaries into a DataFrame and save it.

In [13]:
# Save embeddings to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

And we can make sure it imports nicely by loading it.

In [14]:
# Import saved file and view
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,0,A1: Regulations for taught programmes 1 Intro...,1495,222,373.75,[-4.16037859e-03 -7.28959292e-02 1.21046063e-...
1,0,ogrammes and awards  admission  registration...,1253,217,313.25,[-1.13878846e-02 -7.90756717e-02 2.82175988e-...
2,0,3 During 2018/19 it is anticipated that a numb...,279,46,69.75,[-6.22386346e-03 -5.20430431e-02 4.52392176e-...
3,1,A1: Regulations for taught programmes 2 eithe...,877,141,219.25,[-7.79487938e-03 -7.11345822e-02 3.05032749e-...
4,1,FHEQ level 4 Diploma of Higher Education 5 24...,843,160,210.75,[-2.39900202e-02 -5.25893942e-02 9.54524055e-...
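Notice that the `embedding` column comes back as text such as `[-4.16e-03 -7.29e-02 ...]` rather than as an array. As a minimal sketch of what happens on this round trip, the snippet below turns a small made-up array into its string form and parses it back with `np.fromstring`. The toy values are assumptions for illustration; real embeddings have 768 dimensions.

```python
import numpy as np

# A toy "embedding" standing in for a real 768-dimensional vector.
original = np.array([0.5, -0.25, 3.0, 1.5e-3])

# Writing an array into a CSV cell stores its string representation.
as_text = str(original)

# Strip the surrounding brackets and parse the whitespace-separated numbers.
restored = np.fromstring(as_text.strip("[]"), sep=" ")

print(as_text)
print(restored)
print(np.allclose(original, restored))
```

Because the string form only keeps the printed digits, the restored values are close to, but not guaranteed to be bit-identical with, the originals. For larger projects a binary format (for example `np.save`) may be a better fit than CSV.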


## 2. RAG - Search and Answer

We discussed RAG briefly in the beginning but let's quickly recap.

RAG stands for Retrieval Augmented Generation.

Which is another way of saying "given a query, search for relevant resources and answer based on those resources".

Let's break down each step:
* **Retrieval** - Get relevant resources given a query. For example, if the query is "What is the passing mark?" the ideal results will contain information regarding the exam or degree passing mark.

* **Augmentation** - LLMs are capable of generating text given a prompt. However, this generated text is designed to *look* right: it often contains some correct information, but LLMs are prone to hallucination (generating a result that *looks* legitimate but is factually wrong). In augmentation, we pass relevant information into the prompt and get the LLM to use that information as the basis of its generation.

* **Generation** - This is where the LLM will generate a response that has been flavoured/augmented with the retrieved resources. In turn, this not only gives us a potentially more correct answer, it also gives us resources to investigate more (since we know which resources went into the prompt).

The whole idea of RAG is to get an LLM to be more factually correct based on your own input as well as have a reference to where the generated output may have come from.
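As a rough sketch of the augmentation step, the snippet below places retrieved chunks into a prompt ahead of the query. The function name `build_augmented_prompt`, the prompt wording, and the sample chunk text are all made up for illustration; only the dictionary keys (`page_number`, `sentence_chunk`) follow the structure we built earlier.

```python
def build_augmented_prompt(query: str, context_items: list[dict]) -> str:
    """Format retrieved chunks and the query into a single prompt string."""
    context = "\n".join(
        f"- (page {item['page_number']}) {item['sentence_chunk']}"
        for item in context_items
    )
    return (
        "Answer the query using only the context below.\n"
        f"Context:\n{context}\n"
        f"Query: {query}\n"
        "Answer:"
    )


# Hypothetical retrieved chunk, for illustration only.
retrieved = [
    {"page_number": 26,
     "sentence_chunk": "A student who fails to attend an examination receives a mark of zero."},
]
print(build_augmented_prompt("Not attending Exam", retrieved))
```

Keeping the page number next to each chunk is what lets us trace a generated answer back to its source.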

### Similarity search

Similarity search (also called semantic search or vector search) is the idea of searching via *meaning*. The core idea is that when you provide a sentence, the search should fetch the most similar block of text from the reference.

> **Example:** Using similarity search on a nutrition textbook (the document used in the tutorial this lab is adapted from) with the query "macronutrients function" returns a paragraph that starts with:
>
>*There are three classes of macronutrients: carbohydrates, lipids, and proteins. These can be metabolically processed into cellular energy. The energy from macronutrients comes from their chemical bonds. This chemical energy is converted into cellular energy that is then utilized to perform work, allowing our bodies to conduct their basic functions.*

In [15]:
import random

import torch
import numpy as np
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import texts and embedding df
text_chunks_and_embedding_df = pd.read_csv("text_chunks_and_embeddings_df.csv")

# Convert embedding column back to np.array (it got converted to string when it got saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# Convert embeddings to torch tensor and send to device (note: NumPy arrays are float64, torch tensors are float32 by default)
embeddings = torch.tensor(np.array(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32).to(device)
embeddings.shape

torch.Size([95, 768])

Embeddings ready! Time to perform a semantic search.

Let's say you did not attend an exam and wanted to search the University regulations regarding this.

Well, we can do so with the following steps:
1. Define a query string (e.g. `"Not attending Exam"`).
2. Turn the query string into an embedding with the same model we used to embed our text chunks.
3. Perform a [dot product](https://pytorch.org/docs/stable/generated/torch.dot.html) or [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) function between the text embeddings and the query embedding to get similarity scores.
4. Sort the results from step 3 in descending order (a higher score means more similarity in the eyes of the model) and use these values to inspect the texts.
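To make steps 3 and 4 concrete, here is a pure-Python sketch with toy two-dimensional vectors (all values and helper names are made up for illustration). It shows that the dot product is affected by vector length while cosine similarity is not, and that the two agree once vectors are scaled to unit length. Whether a given embedding model already returns unit-length vectors is an assumption worth checking for your model.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


def top_k(query, vectors, k, score_fn):
    """Return the k highest (score, index) pairs in descending order."""
    scored = [(score_fn(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return scored[:k]


query = [1.0, 0.0]
chunks = [[10.0, 10.0], [0.9, 0.1], [0.0, 1.0]]  # toy "embeddings"

# Raw dot product: the long vector at index 0 ranks first.
print(top_k(query, chunks, 2, dot))

# Cosine similarity: the vector pointing the same way (index 1) ranks first.
print(top_k(query, chunks, 2, cosine_similarity))

# After normalization, the dot product gives the same ranking as cosine.
unit_chunks = [normalize(c) for c in chunks]
print(top_k(normalize(query), unit_chunks, 2, dot))
```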

These steps are performed below:


In [16]:
from sentence_transformers import util, SentenceTransformer
# 1. Define the query
# Note: This could be anything.
query = "Not attending Exam"
print(f"Query: {query}")

# 2. Embed the query to the same numerical space as the text examples
# Note: It's important to embed your query with the same model you embedded your examples with.
query_embedding = embedding_model.encode(query, convert_to_tensor=True)

# 3. Get similarity scores with the dot product
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]

# 4. Get the top-k results (we'll keep this to 5)
top_results_dot_product = torch.topk(dot_scores, k=5)
top_results_dot_product

Query: Not attending Exam


torch.return_types.topk(
values=tensor([0.6067, 0.5747, 0.5653, 0.5143, 0.5054], device='cuda:0'),
indices=tensor([80, 81, 76, 54, 90], device='cuda:0'))

Now that we have our top-5 matches, let's take a look at the results with the following:

In [17]:
# Define helper function to print wrapped text
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

print(f"Query: '{query}'\n")
print("Results:")
# Loop through zipped together scores and indices from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    # Print the page number too so we can reference the document further (and check the results)
    print(f"Page number: {pages_and_chunks[idx]['page_number']}")
    print("\n")

Query: 'Not attending Exam'

Results:
Score: 0.6067
Text:
A1: Regulations for taught programmes 27  Failure to attend for
assessment/examination 164. Where a student has failed an assessment, or
reassessment, for a module through failing to attend a required examination, or
by attending a required examination but not making (in the judgement of the
Board of Examiners) a reasonable attempt to address the examination questions,
and there are no confirmed extenuating circumstances, the student has failed
that unit of assessment at that attempt and will be given a mark of zero. If the
attempt was the first attempt and the student fails the module overall as a
consequence, they may not progress without reassessment, as described in
Regulation 161 above, and compensation will only be available after a re-
assessment. Failure and reassessment for modules with a value up to and
including 60 credits 165. Where an undergraduate student has failed modules with
a value up to and including 60 credi

It looks like our semantic search works, as we are able to pull the most relevant part of the document, giving us more detail about not attending the examination.

### Functionizing our semantic search pipeline

Let's put all of the steps from above for semantic search into a function or two so we can repeat the workflow.

In [19]:
from time import perf_counter as timer

def retrieve_relevant_resources(query: str,
                                embeddings: torch.Tensor,
                                model: SentenceTransformer=embedding_model,
                                n_resources_to_return: int=3,
                                print_time: bool=True):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """

    # Embed the query
    query_embedding = model.encode(query,
                                   convert_to_tensor=True)

    # Get dot product scores on embeddings
    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()

    if print_time:
        print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

    scores, indices = torch.topk(input=dot_scores,
                                 k=n_resources_to_return)

    return scores, indices

def print_top_results_and_scores(query: str,
                                 embeddings: torch.Tensor,
                                 pages_and_chunks: list[dict]=pages_and_chunks,
                                 n_resources_to_return: int=5):
    """
    Takes a query, retrieves most relevant resources and prints them out in descending order.

    Note: Requires pages_and_chunks to be formatted in a specific way (see above for reference).
    """

    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings,
                                                  n_resources_to_return=n_resources_to_return)

    print(f"Query: {query}\n")
    print("Results:")
    # Loop through zipped together scores and indices
    for score, index in zip(scores, indices):
        print(f"Score: {score:.4f}")
        # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
        print_wrapped(pages_and_chunks[index]["sentence_chunk"])
        # Print the page number too so we can reference the textbook further and check the results
        print(f"Page number: {pages_and_chunks[index]['page_number']}")
        print("\n")

Excellent! Now let's test our functions out.

In [21]:
query = "Submission of Coursework"

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
print(f'scores: {scores} | indices {indices}\n')

# Print out the texts of the top scores
print_top_results_and_scores(query=query,
                             embeddings=embeddings)

[INFO] Time taken to get scores on 95 embeddings: 0.00011 seconds.
scores: tensor([0.6689, 0.6076, 0.5223], device='cuda:0') | indices tensor([60, 62, 68], device='cuda:0')

[INFO] Time taken to get scores on 95 embeddings: 0.00008 seconds.
Query: Submission of Coursework

Results:
Score: 0.6689
A1: Regulations for taught programmes 21  ASSESSMENT AND REASSESSMENT Submission
of coursework 122. Each Faculty should ensure that there are robust and
transparent arrangements in place for collecting student work and recording the
date of submission. Statements of these arrangements and where and how
coursework is required to be submitted are to be found in the programme
handbook. 123. Students are required to submit coursework units of assessment,
including project and other reports and dissertations, on time and in accordance
with the arrangements published in the handbook for the relevant programme.
Arrangements for the submission of Master’s dissertations are described in
Regulations 134-

## Getting an LLM for local generation

We've got our retrieval pipeline ready, so let's now get the generation side of things happening.

To perform generation, we're going to use a Large Language Model (LLM). In our case, we want our LLM's output text to be generated based on the context of relevant information to the query. Therefore, our input query will need to be `augmented` with the necessary context before being sent to the LLM.

We could achieve the best performance with state-of-the-art models like ChatGPT and DeepSeek, but here we will use lighter models because of our limited computational resources.



### Checking local GPU memory availability

Let's find out what hardware we've got available and see what kind of model(s) we'll be able to load.

> **Note:** You can also check this with the `!nvidia-smi` command.

In [22]:
# Get total GPU memory (this is the device's capacity, not the memory that is currently free)
import torch
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))
print(f"Available GPU memory: {gpu_memory_gb} GB")

Available GPU memory: 15 GB


In [23]:
# Note: the following is Gemma focused, however, there are more and more LLMs of the 2B and 7B size appearing for local use.
if gpu_memory_gb < 5.1:
    print(f"Your available GPU memory is {gpu_memory_gb}GB, you may not have enough memory to run a Gemma LLM locally without quantization.")
    # Fall back to the smallest model in 4-bit so the variables printed below are always defined
    use_quantization_config = True
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 8.1:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in 4-bit precision.")
    use_quantization_config = True
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 19.0:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.")
    use_quantization_config = False
    model_id = "google/gemma-2b-it"
else: # 19 GB or more
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 7B in 4-bit or float16 precision.")
    use_quantization_config = False
    model_id = "google/gemma-7b-it"

print(f"use_quantization_config set to: {use_quantization_config}")
print(f"model_id set to: {model_id}")

GPU memory: 15 | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.
use_quantization_config set to: False
model_id set to: google/gemma-2b-it
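
The thresholds above follow from a simple back-of-the-envelope calculation: the memory needed for the weights is roughly the number of parameters times the bytes per parameter. The sketch below uses nominal parameter counts of 2 and 7 billion as an assumption (the exact counts differ somewhat), and the helper name `estimate_weight_memory_gib` is hypothetical.

```python
def estimate_weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GiB (2**30 bytes)."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / 2**30

# Nominal parameter counts (assumption: the real models are not exactly 2e9 / 7e9 parameters)
nominal_models = {"2B": 2e9, "7B": 7e9}

for name, n_params in nominal_models.items():
    for label, bits in [("float16", 16), ("4-bit", 4)]:
        gib = estimate_weight_memory_gib(n_params, bits)
        print(f"{name} in {label}: ~{gib:.1f} GiB for weights")
```

This only counts the weights. Activations, the key-value cache built up during generation, and framework overhead all need extra room, which is why the cut-offs sit above these raw estimates.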


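Gemma is a gated model on the Hugging Face Hub, so you will need to log in with an account that has accepted the Gemma licence before you can download it.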

In [24]:
import huggingface_hub
huggingface_hub.login()


After determining the most suitable model, we can now load it with the following:

In [25]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available

# 1. Create quantization config for smaller model loading (optional)
# Requires !pip install bitsandbytes accelerate, see: https://github.com/TimDettmers/bitsandbytes, https://huggingface.co/docs/accelerate/
# For models that require 4-bit quantization (use this if you have low GPU memory available)
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_compute_dtype=torch.float16)

# Bonus: Setup Flash Attention 2 for faster inference, default to "sdpa" or "scaled dot product attention" if it's not available
# Flash Attention 2 requires NVIDIA GPU compute capability of 8.0 or above, see: https://developer.nvidia.com/cuda-gpus
# Requires !pip install flash-attn, see: https://github.com/Dao-AILab/flash-attention
if (is_flash_attn_2_available()) and (torch.cuda.get_device_capability(0)[0] >= 8):
    attn_implementation = "flash_attention_2"
else:
    attn_implementation = "sdpa"
print(f"[INFO] Using attention implementation: {attn_implementation}")

# 2. Pick a model we'd like to use (this will depend on how much GPU memory you have available)
#model_id = "google/gemma-7b-it"
model_id = model_id # (we already set this above)
print(f"[INFO] Using model_id: {model_id}")

# 3. Instantiate tokenizer (tokenizer turns text into numbers ready for the model)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id)

# 4. Instantiate the model
llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id,
                                                 torch_dtype=torch.float16, # datatype to use, we want float16
                                                 quantization_config=quantization_config if use_quantization_config else None,
                                                 low_cpu_mem_usage=False, # use full memory
                                                 attn_implementation=attn_implementation) # which attention version to use

if not use_quantization_config: # quantization takes care of device setting automatically, so if it's not used, send model to GPU
    llm_model.to("cuda")

[INFO] Using attention implementation: sdpa
[INFO] Using model_id: google/gemma-2b-it



Let's check it out.

In [26]:
llm_model

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear(in_features=16384, out_features=2048, bias=False)
          (act_fn): GELUActivation()
        )
        (input_layernorm): GemmaRMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): GemmaRMSNorm((2048,), eps=1e-06)
      )
    )
    (norm): GemmaRMSNorm((2048,), 

### Generating text with our LLM

We can generate text with our `llm_model` instance by calling its [`generate()` method](https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/text_generation#transformers.GenerationConfig) and passing it a tokenized input. The method accepts plenty of other options alongside the input.

The tokenized input comes from passing a string of text to our `tokenizer`.

It's important to use the tokenizer that was paired with the model. A different tokenizer maps text to different token IDs, so passing its output to the model will likely give errors or strange results.

For some LLMs, there's a specific template you should pass to them for ideal outputs.

For example, Gemma's instruction-tuned variants (the `-it` suffix, as in the `gemma-2b-it` model we loaded) have been trained in a dialogue fashion (instruction tuning).

In this case, our `tokenizer` has an [`apply_chat_template()` method](https://huggingface.co/docs/transformers/main/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template) which can prepare our input text in the right format for the model.

Let's try it out.

In [27]:
input_text = "What's the naming of taught postgraduate programmes with specialist pathways?"
print(f"Input text:\n{input_text}")

# Create prompt template for instruction-tuned model
dialogue_template = [
    {"role": "user",
     "content": input_text}
]

# Apply the chat template
prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                       tokenize=False, # keep as raw text (not tokenized)
                                       add_generation_prompt=True)
print(f"\nPrompt (formatted):\n{prompt}")

Input text:
What's the naming of taught postgraduate programmes with specialist pathways?

Prompt (formatted):
<bos><start_of_turn>user
What's the naming of taught postgraduate programmes with specialist pathways?<end_of_turn>
<start_of_turn>model



Notice the scaffolding around our input text. This is the turn-by-turn format the model saw during instruction tuning.
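
To see that the chat template is just string formatting, here is a minimal sketch that rebuilds the formatted prompt shown above by hand. The helper `format_gemma_turns` is hypothetical and only mirrors the printed output for simple user turns. In practice, keep using `tokenizer.apply_chat_template()`, since the real template is stored with the tokenizer and may handle cases this sketch does not.

```python
def format_gemma_turns(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Hypothetical helper mimicking the formatted prompt printed above."""
    prompt = "<bos>"
    for message in messages:
        # Each turn is wrapped in start/end markers, with the role on the first line
        prompt += f"<start_of_turn>{message['role']}\n{message['content']}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a new turn for the model so it knows it is its turn to respond
        prompt += "<start_of_turn>model\n"
    return prompt

toy_dialogue = [{"role": "user", "content": "What is the passing mark?"}]
toy_prompt = format_gemma_turns(toy_dialogue)
print(toy_prompt)
```

Setting `add_generation_prompt=True` is what appends the trailing `<start_of_turn>model` line, which cues the model to begin its answer.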

Now we can input the query of "What's the naming of taught postgraduate programmes with specialist pathways?" to the LLM and see what answer it gives us without the RAG pipeline.

In [28]:
%%time

# Tokenize the input text (turn it into numbers) and send it to GPU
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate outputs based on the tokenized input
# See generate docs: https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/text_generation#transformers.GenerationConfig
outputs = llm_model.generate(**input_ids,
                             max_new_tokens=256) # define the maximum number of new tokens to create

# Decode the output tokens to text
outputs_decoded = tokenizer.decode(outputs[0])

print(f"Input text: {input_text}\n")
print(f"Output text:\n{outputs_decoded.replace(prompt, '').replace('<bos>', '').replace('<eos>', '')}")

Input text: What's the naming of taught postgraduate programmes with specialist pathways?

Output text:
**Master's Degree Programmes with Specialist Pathways**

**1. Medicine**

* Master of Medicine (MD) in Internal Medicine
* Master of Medicine (MD) in Surgery
* Master of Medicine (MD) in Paediatrics

**2. Law**

* Master of Laws (LLM)
* Master of Legal Studies (MLS)

**3. Business Administration**

* Master of Business Administration (MBA) with a focus on a specific industry or area, such as finance, marketing, or management
* Master of Business Administration (MBA) with a focus on a specific functional area, such as finance, marketing, or management

**4. Education**

* Master of Education (MEd) in Elementary Education
* Master of Education (MEd) in Secondary Education
* Master of Education (MEd) in Special Education

**5. Science**

* Master of Science (MS) in Biology
* Master of Science (MS) in Chemistry
* Master of Science (MS) in Physics

**6. Engineering**

* Master of Engineer

While the LLM generates some output that **looks** correct, this answer is more general and less relevant to the University of Surrey. This is where RAG comes in to help make the answers more accurate.
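
A side note on the decoding step above: removing the prompt with `str.replace` only works if the decoded text reproduces the prompt exactly. For decoder-only models, `generate()` returns the prompt tokens followed by the new tokens, so an alternative is to slice off the prompt by length before decoding. The sketch below shows the idea on plain lists with made-up token IDs; `strip_prompt_tokens` is a hypothetical helper.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Keep only the token IDs that were generated after the prompt."""
    return output_ids[prompt_length:]

# Made-up token IDs standing in for real tokenizer output (hypothetical values)
toy_prompt_ids = [2, 106, 1645, 108]
toy_output_ids = toy_prompt_ids + [651, 3448, 1]

toy_new_ids = strip_prompt_tokens(toy_output_ids, len(toy_prompt_ids))
print(toy_new_ids)

# With the objects from this notebook, the same idea would look like:
# prompt_length = input_ids["input_ids"].shape[1]
# answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
```

Passing `skip_special_tokens=True` to `decode()` also drops tokens such as `<bos>` and `<eos>`, so the chained `.replace()` calls are no longer needed.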

## RAG Pipeline: Putting it together

We will now construct the pipeline to enable RAG. However, before that, let's prepare a list of queries that we want to test the pipeline with.

In [50]:
import random

query_list = [
    'Naming of taught postgraduate programmes with specialist pathways?',
    'What is the passing mark?',
    'When is the registration for taught programmes?',
    'Is there any withdrawal from registration and intermediate exit award?',
    'What if I fail to make progress in my studies?',
    'I am facing some chronic circumstances that may hinder my academical progress.',
    'What is the penalty for late Coursework Submission',
    'What is the time limits for dissertations?',
    'Are there English Entry Requirements for the University?',
    "Do I have Academic Transcripts?"
]
query = random.choice(query_list)

print(f"Query: {query}")

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices

Query: What is the passing mark?
[INFO] Time taken to get scores on 95 embeddings: 0.00008 seconds.


(tensor([0.5682, 0.5277, 0.4963], device='cuda:0'),
 tensor([ 7, 61, 18], device='cuda:0'))

We will also need to prepare a prompt formatter which will be able to provide context items to the LLM. The following is a very simple template, but you may expand on it to see how it affects results.

In [40]:
def prompt_formatter(query: str,
                     context_items: list[dict]) -> str:
    """
    Augments query with text-based context from context_items.
    """
    # Join context items into a bulleted list (one "- " item per chunk)
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])

    # Create a base prompt that combines the query with the retrieved context
    # Note: this is very customizable, e.g. you could add a few examples of the answer style you'd like.
    # We could also write this in a txt file and import it in if we wanted.
    base_prompt = f"""
    Given the User Query: {query}

    Please construct your comprehensive answer using relevant information from the following context:

    {context}
    """

    # Create prompt template for instruction-tuned model
    dialogue_template = [
        {"role": "user",
        "content": base_prompt}
    ]

    # Apply the chat template
    prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                          tokenize=False,
                                          add_generation_prompt=True)
    return prompt
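
One detail of the template above is that the triple-quoted string keeps its leading indentation, which then shows up in the prompt sent to the model. As a minimal sketch of one way to avoid that, the hypothetical helper below (`build_base_prompt`, not part of the notebook) dedents the template before filling it in. It stops before the chat-template step so that it runs without a tokenizer.

```python
import textwrap

def build_base_prompt(query: str, context_items: list[dict]) -> str:
    """Hypothetical variant of the base prompt without leading indentation."""
    # One "- " bullet per retrieved chunk
    context = "\n".join(f"- {item['sentence_chunk']}" for item in context_items)

    # Dedent the template first, then substitute, so multi-line context cannot break the dedent
    template = textwrap.dedent("""\
        Given the User Query: {query}

        Please construct your comprehensive answer using relevant information from the following context:

        {context}
        """)
    return template.format(query=query, context=context)

# Toy context items (hypothetical text standing in for real chunks)
toy_items = [{"sentence_chunk": "chunk one"}, {"sentence_chunk": "chunk two"}]
toy_base_prompt = build_base_prompt("What is the passing mark?", toy_items)
print(toy_base_prompt)
```

Whether the extra whitespace affects answer quality is something you can test by comparing outputs from both versions.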

Given a query, the `retrieve_relevant_resources` function returns the scores and indices of the most relevant chunks. We use those indices to look up the context items, which are then passed to the prompt formatter to build the final prompt for the model.

In [33]:
query = random.choice(query_list)
print(f"Query: {query}")

# Get relevant resources
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)

# Create a list of context items
context_items = [pages_and_chunks[i] for i in indices]

# Format prompt with context items
prompt = prompt_formatter(query=query,
                          context_items=context_items)
print(prompt)

Query: Naming of taught postgraduate programmes with specialist pathways?
[INFO] Time taken to get scores on 95 embeddings: 0.00010 seconds.
<bos><start_of_turn>user
Given the User Query: Naming of taught postgraduate programmes with specialist pathways?

    Please construct your concise answer based on the following information:

    - A1: Regulations for taught programmes 5  Naming of taught postgraduate programmes with specialist pathways 19. The following conventions should be followed in the naming of taught postgraduate programmes with specialist pathways:  Master’s degree in “generic title” = 180 credits, of which at least half the credits are derived from the generic modules and not more than 90 credits, including the dissertation are within a defined specialist pathway  Master’s degree in “generic title” “(specialist pathway)” = 180 credits, of which between 90 and 135 credits, which must include the dissertation, are within a defined specialist pathway  Master’s degree in

Finally, let's create a function which combines the retrieving, augmenting, and generating into a single call. In short, this is our final simple RAG pipeline.

In [34]:
def ask(query,
        temperature=0.7,
        max_new_tokens=512,
        format_answer_text=True,
        return_answer_only=True):
    """
    Takes a query, finds relevant resources/context and generates an answer to the query based on the relevant resources.
    """

    # Get just the scores and indices of top related results
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings)

    # Create a list of context items
    context_items = [pages_and_chunks[i] for i in indices]


    # Add score to context item
    for i, item in enumerate(context_items):
        item["score"] = scores[i].cpu() # return score back to CPU

    # Format the prompt with context items
    prompt = prompt_formatter(query=query,
                              context_items=context_items)

    print(prompt)

    # Tokenize the prompt
    input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Generate an output of tokens
    outputs = llm_model.generate(**input_ids,
                                 temperature=temperature,
                                 do_sample=True,
                                 max_new_tokens=max_new_tokens)

    # Turn the output tokens into text
    output_text = tokenizer.decode(outputs[0])

    if format_answer_text:
        # Replace special tokens and unnecessary help message
        output_text = output_text.replace(prompt, "").replace("<bos>", "").replace("<eos>", "").replace("Sure, here is the answer to the user query:\n\n", "")

    # Only return the answer without the context items
    if return_answer_only:
        return output_text

    return output_text, context_items
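
The `temperature` argument of `ask()` is only used because `do_sample=True`. As a rough illustration of what it controls, sampling-based generation divides the model's logits by the temperature before applying softmax, so lower values concentrate probability on the top token and higher values flatten the distribution. The sketch below uses made-up logits and a hypothetical helper name.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling them by 1/temperature."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three candidate tokens (hypothetical values)
toy_logits = [2.0, 1.0, 0.0]

for toy_temperature in (0.2, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, toy_temperature)
    print(f"temperature={toy_temperature}: {[round(p, 3) for p in probs]}")
```

For question answering over regulations, a lower temperature is often a reasonable starting point, since we want answers that stay close to the retrieved context.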

Let's run the pipeline with some randomly sampled queries below:

In [64]:
query = random.choice(query_list)

print(f"Query: {query}")

# Answer query with context and return context
answer, context_items = ask(query=query,
                            temperature=0.8,
                            max_new_tokens=512,
                            return_answer_only=False)

print(f"\nAnswer:\n")
print_wrapped(answer)

Query: What is the time limits for dissertations?
[INFO] Time taken to get scores on 95 embeddings: 0.00008 seconds.
<bos><start_of_turn>user
Given the User Query: What is the time limits for dissertations?

    Please construct your comprehensive answer using relevant information from the following context:

    - A1: Regulations for taught programmes 23  Dissertation/project – Master’s programmes Submission of dissertations: time limits 134. For full-time students the University will not normally grant extensions to the submission deadline for the dissertation that would cause the student to complete their programme more than 13 months after the date of registration.  135. For part-time students following a programme over two years, the University will not normally grant extensions to the submission deadline for the dissertation that would cause the student to complete their programme more than 24.5 months after the date of registration. Nature of dissertations or equivalent work 136

We've successfully constructed a simple RAG pipeline. You can try adding more queries to the list and experimenting with the prompts.
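
To compare answers across the whole query list rather than one random query at a time, a small loop is enough. The sketch below is an assumption about how you might organise this: `run_queries` is a hypothetical helper, and `fake_ask` is a stand-in so the example runs on its own. In the notebook you would pass the real `ask` function instead.

```python
def run_queries(queries, ask_fn):
    """Call ask_fn on each query and collect the query/answer pairs."""
    results = []
    for single_query in queries:
        results.append({"query": single_query, "answer": ask_fn(single_query)})
    return results

def fake_ask(query):
    """Stand-in for the notebook's ask() (hypothetical, returns a placeholder string)."""
    return f"(answer to: {query})"

toy_queries = ["What is the passing mark?", "Do I have Academic Transcripts?"]
toy_results = run_queries(toy_queries, fake_ask)
for result in toy_results:
    print(f"Query: {result['query']}\nAnswer: {result['answer']}\n")
```

Collecting the results in a list makes it easier to rerun the same queries after changing the prompt and compare the answers side by side.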