# Create and run a local RAG pipeline from scratch


## What is RAG?

RAG stands for Retrieval-Augmented Generation.

It was introduced in the paper [_Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks_](https://arxiv.org/abs/2005.11401).

The goal of RAG is to take information and pass it to an LLM so it can generate outputs based on that information.

- **Retrieval** --> Find relevant information given a query, e.g. "What are the macronutrients and what do they do?" --> retrieves passages related to macronutrients from a nutrition textbook.

- **Augmented** --> Take the relevant information and augment our input (prompt) to an LLM with it.

- **Generation** --> Pass the augmented prompt to an LLM so it can generate the output.
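
The three steps above can be sketched end to end with a toy example. This is only an illustration under loud assumptions: retrieval here is plain word overlap rather than embeddings, and `content_words`, `retrieve`, `augment` and `generate` are hypothetical helper names, with `generate` standing in for a real LLM call.

```python
def content_words(text: str) -> set[str]:
    """Lowercase words longer than three characters, with basic punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split() if len(w) > 3}


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Retrieval: rank passages by shared words (a toy stand-in for embedding search)."""
    query_words = content_words(query)
    scored = [(len(query_words & content_words(p)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps input order on ties
    return [p for score, p in scored[:k] if score > 0]


def augment(query: str, context_passages: list[str]) -> str:
    """Augmentation: put the retrieved passages into the prompt."""
    context = "\n".join(f"- {p}" for p in context_passages)
    return f"Answer the query using only this context:\n{context}\n\nQuery: {query}\nAnswer:"


def generate(prompt: str) -> str:
    """Generation: hypothetical placeholder for a call to a local LLM."""
    return f"[LLM output for a prompt of {len(prompt)} characters]"


passages = [
    "Macronutrients are carbohydrates, proteins and fats.",
    "Vitamins are micronutrients needed in small amounts.",
    "The macronutrients provide energy for the body.",
]
query = "what are the macronutrients and what do they do?"

context = retrieve(query, passages)
prompt = augment(query, context)
answer = generate(prompt)
print(prompt)
print(answer)
```

The rest of the notebook replaces the word-overlap step with embedding similarity and the placeholder with a real model.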


In [1]:
import torch
import os
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"


## Why RAG?

The main goal of RAG is to improve the generation outputs of LLMs.

1. Prevent hallucinations - LLMs are capable of generating _good-looking_ text, but that doesn't mean it is factually correct. RAG helps an LLM ground its answers in retrieved passages that are factual.

2. Work with custom data - Many base LLMs are trained on internet-scale data. This gives them a fairly good understanding of language in general, but it also means their responses can be generic. RAG helps them generate answers based on specific data.


## Why Local?

Fun...

Privacy, speed and cost.

- Privacy -- If you have private documentation, you may not want to send your information to an API. Instead, you can set up an LLM and run it on your own hardware.

- Speed -- Whenever you use an API, you have to send data across the internet, which takes time. Running locally means we don't have to wait for that transfer.

- Cost -- If you own your hardware, the cost is already paid: there is an initial cost but little or no operational cost.

- No vendor lock-in -- If an API shuts down, you don't have to worry.


In [2]:
print(torch.backends.mps.is_available())

True



## 1. Document/Text Processing and Embedding Creation

Ingredients:

- PDF document of choice.
- Embedding model of choice.

Steps:

1. Import PDF document.
2. Process text for embedding (e.g. split into chunks of sentences).
3. Embed text chunks with embedding model.
4. Save embeddings to file for later use (embeddings will stay on file for many years or until you lose your hard drive).


In [3]:
import os 
import requests

#Get pdf path
pdf_path = "./Gift_of_Dyslexia.pdf"

# Download the PDF if it does not exist
# (disabled: the URL below points to a nutrition textbook, not to Gift_of_Dyslexia.pdf)

# if not os.path.exists(pdf_path):
#     print(f"[INFO] File doesn't exist, downloading...")

#     # The URL of the PDF you want to download
#     url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

#     # The local filename to save the downloaded file
#     filename = pdf_path

#     # Send a GET request to the URL
#     response = requests.get(url)

#     # Check if the request was successful
#     if response.status_code == 200:
#         # Open a file in binary write mode and save the content to it
#         with open(filename, "wb") as file:
#             file.write(response.content)
#         print(f"The file has been downloaded and saved as {filename}")
#     else:
#         print(f"Failed to download the file. Status code: {response.status_code}")
# else:
#     print(f"File {pdf_path} exists.")

In [4]:
import fitz
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """
    Performs minor formatting on texts.
    """
    cleaned_text = text.replace('\n', " " ).strip()

    return cleaned_text

def open_and_read_pdf(pdf_path : str)-> list[dict]:
    doc = fitz.open(pdf_path)
    pages_and_text = []
    for page_number , page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text = text)
        pages_and_text.append({
            "page_number": page_number -41,
            "page_char_count": len(text),
            "page_word_count": len(text.split(" ")),
            "page_sentence_count_row": len(text.split(". ")),
            "page_token_count": len(text)/4,
            "text":text
            })
    return pages_and_text

pages_and_text = open_and_read_pdf(pdf_path = pdf_path)
pages_and_text[:2]

0it [00:00, ?it/s]

[{'page_number': -41,
  'page_char_count': 145,
  'page_word_count': 35,
  'page_sentence_count_row': 3,
  'page_token_count': 36.25,
  'text': "THE GIFT  OF DYSLEXIA  W h y Some of the Smartest People  Can't Read and How They Can Learn  Ronald D. Davis  with Eldon M. Braun  A Perigee Book"},
 {'page_number': -40,
  'page_char_count': 650,
  'page_word_count': 137,
  'page_sentence_count_row': 4,
  'page_token_count': 162.5,
  'text': "Contents  Foreword by Dr. Joan Smith  Author's  Note  Preface  Acknowledgments  Part One What Dyslexia Really Is  Chapter 1  The Underlying Talent  Chapter 2  The Learning Disability  Chapter 3  Effects of Disorientation  Chapter 4  Dyslexia in Action  Chapter 5  Compulsive Solutions  Chapter 6  Problems with Reading  Chapter 7  Spelling Problems  Chapter 8  Math Problems  Chapter 9  Handwriting Problems  Chapter 10  The Newest Disability: A D D  Chapter 11  Clumsiness  Chapter 12  A Real Solution  Part Two  Little P . D . — A Developmental Theory  of Dy

In [5]:
import random 

random.sample(pages_and_text , k=3)

[{'page_number': 68,
  'page_char_count': 1462,
  'page_word_count': 291,
  'page_sentence_count_row': 15,
  'page_token_count': 365.5,
  'text': "Understanding  iP^ talent  Pictured thoughts are as thorough or deep as these mental  pictures are accurate in portraying the meanings of the  words that the person would use to describe the same  thoughts.  We could say pictured thoughts are of substance while  verbal thoughts are significant sound.  Intuition  The only drawback to picture thinking is that the person  doing it is not aware of the individual pictures as they  occur. It happens too fast. The incidence of awareness is  the amount of time it takes for something to register  consciously in the awareness of the individual. In humans  it is fairly consistent at V25 of a second. In other words, a  stimulus must be present for V25 of a second in order to  register in the person's consciousness.  If a stimulus is present longer than V25 second, we are  aware of it. This is called cog

In [6]:
import pandas as pd

df= pd.DataFrame(pages_and_text)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_row | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -41 | 145 | 35 | 3 | 36.25 | THE GIFT OF DYSLEXIA W h y Some of the Smart... |
| 1 | -40 | 650 | 137 | 4 | 162.5 | Contents Foreword by Dr. Joan Smith Author's... |
| 2 | -39 | 794 | 171 | 1 | 198.5 | Contents A g e s Three to Five The First Day... |
| 3 | -38 | 974 | 169 | 8 | 243.5 | Foreword During my twenty-five years of exper... |
| 4 | -37 | 1792 | 301 | 25 | 448.0 | Foreword Four different learning locks are op... |


In [7]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_row | page_token_count |
|---|---|---|---|---|---|
| count | 261.0 | 261.0 | 261.0 | 261.0 | 261.0 |
| mean | 89.0 | 1000.72 | 193.88 | 11.14 | 250.18 |
| std | 75.49 | 440.05 | 81.4 | 5.48 | 110.01 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 24.0 | 770.0 | 147.0 | 8.0 | 192.5 |
| 50% | 89.0 | 980.0 | 209.0 | 11.0 | 245.0 |
| 75% | 154.0 | 1359.0 | 260.0 | 15.0 | 339.75 |
| max | 219.0 | 2133.0 | 371.0 | 25.0 | 533.25 |


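Okay, looks like our average token count per page is about 250 (estimated as one token per four characters).

For this particular use case, it means we could embed an average whole page with the `all-mpnet-base-v2` model (this model has an input capacity of 384 tokens). The longest pages (up to roughly 533 estimated tokens) would not fit, which is one reason to split pages into smaller chunks.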
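
As a quick sanity check of that reasoning, here is a minimal sketch of the same rule of thumb. It assumes the rough "4 characters per token" estimate used in `open_and_read_pdf` (a real tokenizer will give different counts); `estimate_tokens` and `fits_in_model` are hypothetical helper names.

```python
MAX_MODEL_TOKENS = 384  # input capacity of all-mpnet-base-v2, as stated above


def estimate_tokens(text: str) -> float:
    """Rough token estimate: about 4 characters per token (heuristic, not a tokenizer)."""
    return len(text) / 4


def fits_in_model(text: str, limit: int = MAX_MODEL_TOKENS) -> bool:
    """True if the estimated token count is within the model's input capacity."""
    return estimate_tokens(text) <= limit


# Stand-in texts with the mean and max page_char_count from df.describe()
average_page = "x" * 1000
longest_page = "x" * 2133

print(estimate_tokens(average_page), fits_in_model(average_page))
print(estimate_tokens(longest_page), fits_in_model(longest_page))
```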


In [8]:
from spacy.lang.en import English

nlp = English()

# Add a sentencizer pipeline
nlp.add_pipe("sentencizer")

# Create document instance as an example 
doc = nlp("This is a sentence. This another sentence. I like elephants")
assert len(list(doc.sents)) == 3

list(doc.sents)

[This is a sentence., This another sentence., I like elephants]

In [9]:
for item in tqdm(pages_and_text):
    item["sentences"] = list(nlp(item["text"]).sents)
    
    #Make sure all sentences are strings
    
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    
    #count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/261 [00:00<?, ?it/s]

In [10]:
random.sample(pages_and_text , k=1)

[{'page_number': -1,
  'page_char_count': 822,
  'page_word_count': 142,
  'page_sentence_count_row': 8,
  'page_token_count': 205.5,
  'text': 'CHAPTER 5  Once disorientations begin to cause mistakes, the dyslexic  child becomes frustrated. Nobody likes to make mistakes,  so around the age of nine, in about third grade, the  dyslexic child begins to find, figure out and adopt  solutions to the problem. Even though this may seem like  a good thing, it is actually how the reading problem  becomes a true learning disability.  The solutions dyslexics invent don\'t solve the real  problem of distorted perceptions; they only afford  temporary relief from frustrations. They are roundabout  methods of coping with the effects of disorientation. They  ultimately slow down the learning process and form the  real learning disability.  These "solutions" are methods of doing things and  tactics for knowing or remembering things. They quickly  27  Compulsive Solutions',
  'sentences': ['CHAPTER 5  O

In [11]:
df = pd.DataFrame(pages_and_text)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_row | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 261.0 | 261.0 | 261.0 | 261.0 | 261.0 | 261.0 |
| mean | 89.0 | 1000.72 | 193.88 | 11.14 | 250.18 | 11.54 |
| std | 75.49 | 440.05 | 81.4 | 5.48 | 110.01 | 5.7 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 24.0 | 770.0 | 147.0 | 8.0 | 192.5 | 8.0 |
| 50% | 89.0 | 980.0 | 209.0 | 11.0 | 245.0 | 12.0 |
| 75% | 154.0 | 1359.0 | 260.0 | 15.0 | 339.75 | 15.0 |
| max | 219.0 | 2133.0 | 371.0 | 25.0 | 533.25 | 25.0 |


In [12]:
# define split size to turn groups of sentences into chunks

num_sentences_chunk_size = 10

# Create a function to split a list into consecutive sublists of at most split_size items

def split_list(input_list :list,
               split_size:int = num_sentences_chunk_size) -> list[list[str]]:
    return [input_list[i:i+split_size] for i in range(0, len(input_list), split_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

In [13]:
# Loop through pages and texts and split sentences into chunks

for item in tqdm(pages_and_text):
    item["sentence_chunks"] = split_list(input_list=item["sentences"])
    
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/261 [00:00<?, ?it/s]

In [14]:
random.sample(pages_and_text , k=1)

[{'page_number': 62,
  'page_char_count': 1560,
  'page_word_count': 323,
  'page_sentence_count_row': 13,
  'page_token_count': 390.0,
  'text': 'Little P.D.—A Developmental Theory of Dyslexia  A Discovery  In 1980, I was lucky enough to discover how to correct the  severe perceptual distortions  that had been  my  everyday  reality for  thirty-eight years,  I was working as a sculptor when another artist wrote and  asked me about my sculpting technique. His letter was so  filled with praise  that  I began  the  laborious process  of  composing a response. Hours later, after carefully getting my  thoughts  down,  I  discovered  that  the  letter  was  totally  illegible—-just a bunch of meaningless scrawls  that nobody  could ever read.  Months later, it occurred to me that when / wrote the letter,  I had been focusing on my creative process.  I wondered if this  was what had made my dyslexia worse.  The engineer in me  reasoned that if my dyslexia could be changed by something I  was

In [15]:
df = pd.DataFrame(pages_and_text)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_row | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 261.0 | 261.0 | 261.0 | 261.0 | 261.0 | 261.0 | 261.0 |
| mean | 89.0 | 1000.72 | 193.88 | 11.14 | 250.18 | 11.54 | 1.62 |
| std | 75.49 | 440.05 | 81.4 | 5.48 | 110.01 | 5.7 | 0.6 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 24.0 | 770.0 | 147.0 | 8.0 | 192.5 | 8.0 | 1.0 |
| 50% | 89.0 | 980.0 | 209.0 | 11.0 | 245.0 | 12.0 | 2.0 |
| 75% | 154.0 | 1359.0 | 260.0 | 15.0 | 339.75 | 15.0 | 2.0 |
| max | 219.0 | 2133.0 | 371.0 | 25.0 | 533.25 | 25.0 | 3.0 |


In [16]:
import re

#split each chunk into each item
pages_and_chunks =[]
for item in tqdm(pages_and_text):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict={}
        chunk_dict["page_number"] = item["page_number"]
        
        # Join the sentences together into a paragraph-like structure
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo 
        chunk_dict["sentence_chunk"] = joined_sentence_chunk
        
        # Get some stats on our chunks
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4  # rough estimate: 1 token ~= 4 characters
        
        pages_and_chunks.append(chunk_dict)
len(pages_and_chunks)
    

  0%|          | 0/261 [00:00<?, ?it/s]

424

In [17]:
random.sample(pages_and_chunks , k=1)

[{'page_number': 59,
  'sentence_chunk': "Being there convinces him beyond a doubt that he is lacking in intelligence. In first grade, they only hinted about his stupidity. Now it has been confirmed. If he isn't put into a special education class, P. D. might be held back a year or even two years during elementary school. Being a year or two older than the other kids might be embarrassing in the classroom, but his size and advanced development in non-academic areas may provide him with advantages in physical education, music and art, as well as recess and after-school activities. To compensate and find some form of self-esteem, P. D. may adopt any number of interests, none of which has to do with reading and writing. It could be a sport, visual 89",
  'chunk_char_count': 715,
  'chunk_word_count': 127,
  'chunk_token_count': 178.75}]

In [18]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 424.0 | 424.0 | 424.0 | 424.0 |
| mean | 89.58 | 599.29 | 103.0 | 149.82 |
| std | 74.07 | 322.91 | 54.79 | 80.73 |
| min | -41.0 | 2.0 | 1.0 | 0.5 |
| 25% | 27.0 | 321.5 | 57.75 | 80.38 |
| 50% | 90.5 | 612.0 | 109.5 | 153.0 |
| 75% | 152.25 | 839.25 | 146.0 | 209.81 |
| max | 219.0 | 1361.0 | 276.0 | 340.25 |


The row count increases from 261 (pages) to 424 (chunks) because a single page can produce several chunks. The page numbers themselves are unchanged.

Hmm, looks like some of our chunks have quite a low token count.

How about we check for samples with less than 30 tokens (about the length of a sentence) and see if they are worth keeping?


In [19]:
# Show random chunks with under 30 tokens in length
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 0.75 | Text: Ill
Chunk token count: 12.0 | Text: Draw a circle around the intersecting point. 156
Chunk token count: 6.5 | Text: What Dyslexia Really Is 16
Chunk token count: 0.75 | Text: 196
Chunk token count: 20.5 | Text: that which is here or which has been mentioned. [Give me the ball. Open the book.]


Looks like many of these are headers and footers of different pages.

They don't seem to offer too much information.

Let's filter our DataFrame/list of dictionaries to only include chunks with over 30 tokens in length.


In [20]:
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -41,
  'sentence_chunk': "THE GIFT OF DYSLEXIA W h y Some of the Smartest People Can't Read and How They Can Learn Ronald D. Davis with Eldon M. Braun A Perigee Book",
  'chunk_char_count': 139,
  'chunk_word_count': 29,
  'chunk_token_count': 34.75},
 {'page_number': -40,
  'sentence_chunk': "Contents Foreword by Dr. Joan Smith Author's Note Preface Acknowledgments Part One What Dyslexia Really Is Chapter 1 The Underlying Talent Chapter 2 The Learning Disability Chapter 3 Effects of Disorientation Chapter 4 Dyslexia in Action Chapter 5 Compulsive Solutions Chapter 6 Problems with Reading Chapter 7 Spelling Problems Chapter 8 Math Problems Chapter 9 Handwriting Problems Chapter 10 The Newest Disability: A D D Chapter 11 Clumsiness Chapter 12 A Real Solution Part Two Little P . D . —A Developmental Theory of Dyslexia Chapter 13 How Dyslexia Happens Chapter 14 The Two-Year-Old and the Kitten vii",
  'chunk_char_count': 611,
  'chunk_word_count': 98,
  'chunk_token_count'

In [21]:
random.sample(pages_and_chunks_over_min_token_len,k=1)

[{'page_number': 71,
  'sentence_chunk': 'The Gift 102 When a disorientation has occurred, the brain no longer sees what the eyes are looking at, but what the person is thinking, as though the eyes were seeing it. The brain no longer hears what the ears are hearing, but what the person is thinking, as though the ears were hearing it. The body no longer feels what its senses are feeling, but what the person is thinking, and so on. One aspect of multi-dimensional thinking is the ability of the thinker to experience thoughts as realities. Reality is what the person perceives it to be, and the disorientation alters the perception. The person\'s thoughts become the person\'s perceptions, so the thoughts are reality to that individual. A Creative Process If "necessity is the mother of invention," then multi- dimensional thinking must be its father. This concept helps us understand how Leonardo da Vinci could conceptual- ize a submarine 300 years before the invention of a device that could pu

In [22]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path = "all-mpnet-base-v2",
                                      device="mps")

# Create a list of sentences to turn into numbers
sentences = [
    "The Sentences Transformers library provides an easy and open-source way to create embeddings.",
    "Sentences can be embedded one by one or as a list of strings.",
    "Embeddings are one of the most powerful concepts in machine learning!",
    "Learn to use embeddings well and you'll be well on your way to being an AI engineer."
]


# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: The Sentences Transformers library provides an easy and open-source way to create embeddings.
Embedding: [-2.07981411e-02  3.03164814e-02 -2.01218221e-02  6.86483532e-02
 -2.55255289e-02 -8.47687386e-03 -2.07035700e-04 -6.32377341e-02
  2.81606354e-02 -3.33353058e-02  3.02634630e-02  5.30720949e-02
 -5.03526367e-02  2.62288190e-02  3.33314389e-02 -4.51578423e-02
  3.63043919e-02 -1.37109228e-03 -1.20171141e-02  1.14946561e-02
  5.04510589e-02  4.70857024e-02  2.11912952e-02  5.14607430e-02
 -2.03745961e-02 -3.58889513e-02 -6.67914515e-04 -2.94393096e-02
  4.95859236e-02 -1.05639827e-02 -1.52013991e-02 -1.31756964e-03
  4.48196866e-02  1.56023065e-02  8.60380283e-07 -1.21387048e-03
 -2.37978902e-02 -9.09456110e-04  7.34487409e-03 -2.53933924e-03
  5.23370393e-02 -4.68042940e-02  1.66215282e-02  4.71578613e-02
 -4.15599458e-02  9.01952444e-04  3.60279121e-02  3.42215039e-02
  9.68226939e-02  5.94828613e-02 -1.64984949e-02 -3.51249315e-02
  5.92519483e-03 -7.07996951e-04 -2.4103

In [23]:
embeddings[0].shape

(768,)

In [24]:
embedding = embedding_model.encode("My favourite animal is the cow") 

In [25]:
# embedding_model.to("mps")

In [26]:
# %%time
# for item in tqdm(pages_and_chunks_over_min_token_len):
#     item["embedding"]=embedding_model.encode(item["sentence_chunk"])

In [27]:
%%time

text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]
text_chunks[300]

CPU times: user 69 μs, sys: 20 μs, total: 89 μs
Wall time: 90.8 μs


"Using the above information, the student can find the optimum orientation point. The student does the procedure by slowly moving and stopping the mind's eye within the general area of the existing orientation point. This is done until perfect balance is achieved, and he or she experiences an overall feeling of well-being. Fine Tuning Procedure As in all these procedures, use your own words."

In [None]:
%%time 

# Embed all texts in batches
text_chunk_embedding = embedding_model.encode(text_chunks,
                                              batch_size=32,
                                              convert_to_tensor=True)  # note: the argument is convert_to_tensor (singular)

text_chunk_embedding

### Save embeddings to file

Since creating embeddings can be a time-consuming process (not so much in our case, but it can be for larger datasets), let's turn our `pages_and_chunks_over_min_token_len` list of dictionaries into a DataFrame and save it.


In [None]:
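# Attach each embedding to its chunk so it is written out together with the text
# (assumes text_chunk_embedding from the batch-encoding cell is in the same order as the chunks)
for item, chunk_embedding in zip(pages_and_chunks_over_min_token_len, text_chunk_embedding):
    item["embedding"] = torch.as_tensor(chunk_embedding).cpu().numpy()

# Save embeddings to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)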

In [None]:
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load.head()
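
One thing to keep in mind when saving to CSV: every cell is stored as text, so an embedding comes back as a string and has to be parsed into numbers again. The sketch below illustrates that round trip using only the standard library. It is an illustration, not the notebook's method: it encodes each vector as JSON instead of relying on NumPy's string format, and the `rows` data is made up.

```python
import csv
import io
import json

# Made-up rows standing in for chunks with (very short) embeddings
rows = [
    {"sentence_chunk": "first chunk", "embedding": [0.1, -0.2, 0.3]},
    {"sentence_chunk": "second chunk", "embedding": [0.4, 0.5, -0.6]},
]

# Write: serialize each embedding to a JSON string so it fits in one CSV cell
buffer = io.StringIO()  # in-memory stand-in for a .csv file
writer = csv.DictWriter(buffer, fieldnames=["sentence_chunk", "embedding"])
writer.writeheader()
for row in rows:
    writer.writerow({"sentence_chunk": row["sentence_chunk"],
                     "embedding": json.dumps(row["embedding"])})

# Read: the embedding column is a plain string until we parse it
buffer.seek(0)
raw_rows = list(csv.DictReader(buffer))
loaded = [{"sentence_chunk": r["sentence_chunk"],
           "embedding": json.loads(r["embedding"])} for r in raw_rows]

print(type(raw_rows[0]["embedding"]))  # str before parsing
print(loaded[0]["embedding"])          # list of floats after parsing
```

For larger datasets a binary format or a vector database is usually a better fit than CSV, since it avoids the parsing step.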

In [None]:
# similarity search is basically the embedding comparison

import random
import torch
import numpy as np
import pandas as pd

device = "mps"


#import texts and embedding df
text_chunks_and_embedding_df =pd.read_csv("text_chunks_and_embeddings_df.csv")



# Convert string embeddings back to NumPy arrays (the CSV stores them as text)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(
    lambda x: np.fromstring(x.strip("[]"), sep=" ", dtype=np.float32)
)

# Convert texts and embeddings into a list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# Convert to PyTorch tensor and move to device
embeddings = torch.tensor(np.array(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32).to(device)

print(embeddings.shape)  # Check if conversion was successful

In [None]:
text_chunks_and_embedding_df.head()

In [None]:
embeddings[0]

In [None]:
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", 
                                      device=device) # choose the device to load the model to

In [None]:
# 1. Define the query
# Note: This could be anything. But since we're working with a book about dyslexia, we'll stick with dyslexia-related queries.
query = "effects of disorientation"
print(f"Query: {query}")

# 2. Embed the query to the same numerical space as the text examples 
# Note: It's important to embed your query with the same model you embedded your examples with.
query_embedding = embedding_model.encode(query, convert_to_tensor=True)

# 3. Get similarity scores with the dot product (we'll time this for fun)
from time import perf_counter as timer

start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
end_time = timer()

print(f"Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

# 4. Get the top-k results (we'll keep this to 5)
top_results_dot_product = torch.topk(dot_scores, k=5)
top_results_dot_product 
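
To make the top-k step concrete: `torch.topk` returns the `k` largest scores together with their positions, in descending order. The pure-Python sketch below mimics that behaviour on a made-up list of scores; `top_k` is a hypothetical helper, not part of PyTorch.

```python
def top_k(scores: list[float], k: int) -> tuple[list[float], list[int]]:
    """Return the k largest scores and their indices, highest score first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [scores[i] for i in ranked], ranked


# Made-up similarity scores, one per chunk
scores = [0.12, 0.87, 0.45, 0.91, 0.30]
values, indices = top_k(scores, k=3)

print(values)   # the three highest scores
print(indices)  # positions of those scores, used to look up the matching chunks
```

The indices are what matter for retrieval: they are used to look up the matching entries in `pages_and_chunks`.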

In [None]:
# Define helper function to print wrapped text 
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

In [None]:
print(f"Query: '{query}'\n")
print("Results:")
# Loop through zipped together scores and indices from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    # Print the page number too so we can reference the textbook further (and check the results)
    print(f"Page number: {pages_and_chunks[idx]['page_number']}")
    print("\n")

In [None]:
import fitz

# Open PDF and load target page
pdf_path = "Gift_of_Dyslexia.pdf" # requires PDF to be downloaded
doc = fitz.open(pdf_path)
page = doc.load_page(5 + 41) # number of page (our doc starts page numbers on page 41)

# Get the image of the page
img = page.get_pixmap(dpi=300)

# Optional: save the image
#img.save("output_filename.png")
doc.close()

# Convert the Pixmap to a numpy array
img_array = np.frombuffer(img.samples_mv, 
                          dtype=np.uint8).reshape((img.h, img.w, img.n))

# Display the image using Matplotlib
import matplotlib.pyplot as plt
plt.figure(figsize=(13, 10))
plt.imshow(img_array)
plt.title(f"Query: '{query}' | Most relevant page:")
plt.axis('off') # Turn off axis
plt.show()

In [None]:
import torch

def dot_product(vector1, vector2):
    return torch.dot(vector1, vector2)

def cosine_similarity(vector1, vector2):
    dot_product = torch.dot(vector1, vector2)

    # Get Euclidean/L2 norm of each vector (removes the magnitude, keeps direction)
    norm_vector1 = torch.sqrt(torch.sum(vector1**2))
    norm_vector2 = torch.sqrt(torch.sum(vector2**2))

    return dot_product / (norm_vector1 * norm_vector2)

# Example tensors
vector1 = torch.tensor([1, 2, 3], dtype=torch.float32)
vector2 = torch.tensor([1, 2, 3], dtype=torch.float32)
vector3 = torch.tensor([4, 5, 6], dtype=torch.float32)
vector4 = torch.tensor([-1, -2, -3], dtype=torch.float32)

# Calculate dot product
print("Dot product between vector1 and vector2:", dot_product(vector1, vector2))
print("Dot product between vector1 and vector3:", dot_product(vector1, vector3))
print("Dot product between vector1 and vector4:", dot_product(vector1, vector4))

# Calculate cosine similarity
print("Cosine similarity between vector1 and vector2:", cosine_similarity(vector1, vector2))
print("Cosine similarity between vector1 and vector3:", cosine_similarity(vector1, vector3))
print("Cosine similarity between vector1 and vector4:", cosine_similarity(vector1, vector4))
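
The difference between the two measures is easiest to see without tensors. The sketch below (plain Python, hypothetical helper names) shows that the dot product grows with vector magnitude while cosine similarity does not, and that the two agree once the vectors are normalized to unit length. If the stored embeddings are already unit length, ranking by dot product gives the same order as ranking by cosine similarity.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: list[float]) -> float:
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a: list[float]) -> list[float]:
    """Scale a vector to unit length, keeping its direction."""
    n = norm(a)
    return [x / n for x in a]


v1 = [1.0, 2.0, 3.0]
v3 = [4.0, 5.0, 6.0]
v1_scaled = [10 * x for x in v1]  # same direction as v1, ten times longer

print(dot(v1, v3), dot(v1_scaled, v3))        # dot product changes with magnitude
print(cosine(v1, v3), cosine(v1_scaled, v3))  # cosine similarity does not
print(dot(normalize(v1), normalize(v3)))      # equals the cosine similarity
```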

### Functionizing our semantic search pipeline

Let's put all of the steps from above for semantic search into a function or two so we can repeat the workflow.


In [None]:
def retrieve_relevant_resources(query: str,
                                embeddings: torch.Tensor,
                                model: SentenceTransformer=embedding_model,
                                n_resources_to_return: int=5,
                                print_time: bool=True):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """

    # Embed the query
    query_embedding = model.encode(query, 
                                   convert_to_tensor=True) 

    # Get dot product scores on embeddings
    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()

    if print_time:
        print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

    scores, indices = torch.topk(input=dot_scores, 
                                 k=n_resources_to_return)

    return scores, indices

def print_top_results_and_scores(query: str,
                                 embeddings: torch.Tensor,
                                 pages_and_chunks: list[dict]=pages_and_chunks,
                                 n_resources_to_return: int=5):
    """
    Takes a query, retrieves most relevant resources and prints them out in descending order.

    Note: Requires pages_and_chunks to be formatted in a specific way (see above for reference).
    """
    
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings,
                                                  n_resources_to_return=n_resources_to_return)
    
    print(f"Query: {query}\n")
    print("Results:")
    # Loop through zipped together scores and indices
    for score, index in zip(scores, indices):
        print(f"Score: {score:.4f}")
        # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
        print_wrapped(pages_and_chunks[index]["sentence_chunk"])
        # Print the page number too so we can reference the textbook further and check the results
        print(f"Page number: {pages_and_chunks[index]['page_number']}")
        print("\n")

In [None]:
query = "spelling problems in dyslexia"

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices

In [None]:
# Print out the texts of the top scores
print_top_results_and_scores(query=query,
                             embeddings=embeddings)

### Checking local GPU memory availability

Let's find out what hardware we've got available and see what kind of model(s) we'll be able to load.


In [None]:
import torch
import psutil

device = "mps"

# Get system memory (Apple M-Series shares memory with CPU)
total_memory_gb = round(psutil.virtual_memory().total / (2**30), 2)

print(f"Total system memory: {total_memory_gb} GB (Shared between CPU & GPU)")

# Torch MPS does not expose memory details like CUDA
if torch.backends.mps.is_available():
    print("MPS backend is available. Memory is dynamically allocated.")
else:
    print("MPS backend is not available.")


In [None]:
# Select Gemma model based on total system memory (on Apple silicon this is shared between CPU and GPU)
if total_memory_gb < 5.1:
    print(f"Your available GPU memory is {total_memory_gb:.2f}GB, you may not have enough memory to run a Gemma LLM locally without quantization.")
    use_quantization_config = True
    model_id = "google/gemma-2b-it"
elif total_memory_gb < 8.1:
    print(f"GPU memory: {total_memory_gb:.2f}GB | Recommended model: Gemma 2B in 4-bit precision.")
    use_quantization_config = True 
    model_id = "google/gemma-2b-it"
elif total_memory_gb < 19.0:
    print(f"GPU memory: {total_memory_gb:.2f}GB | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.")
    use_quantization_config = False 
    model_id = "google/gemma-2b-it"
else:  # total_memory_gb >= 19.0
    print(f"GPU memory: {total_memory_gb:.2f}GB | Recommended model: Gemma 7B in 4-bit or float16 precision.")
    use_quantization_config = False 
    model_id = "google/gemma-7b-it"

print(f"use_quantization_config set to: {use_quantization_config}")
print(f"model_id set to: {model_id}")
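
The thresholds above follow from simple arithmetic: the weights alone need roughly (number of parameters) x (bytes per parameter). Here is a back-of-the-envelope sketch. The parameter counts are nominal round numbers taken from the model names (an assumption, since the real checkpoints are somewhat larger), and activations and the KV cache need extra memory on top of this.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1, "4bit": 0.5}


def estimate_weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for model weights only, in GB (2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


# Nominal sizes implied by the names "2b" and "7b" (assumed for illustration)
nominal_params = {"gemma-2b": 2e9, "gemma-7b": 7e9}

for name, num_params in nominal_params.items():
    for precision in ("float16", "4bit"):
        gb = estimate_weight_memory_gb(num_params, precision)
        print(f"{name} in {precision}: ~{gb:.2f} GB for weights")
```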


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available

# Determine device and best attention mechanism
def get_device_and_attention():
    if torch.cuda.is_available():
        device = torch.device("cuda")
        if is_flash_attn_2_available() and torch.cuda.get_device_capability(0)[0] >= 8:
            attn_implementation = "flash_attention_2"
        else:
            attn_implementation = "sdpa"
        print("[INFO] Using CUDA with:", attn_implementation)
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        attn_implementation = "sdpa"  # Best available for MPS
        print("[INFO] Using MPS with SDPA")
    else:
        device = torch.device("cpu")
        attn_implementation = None  # No special attention on CPU
        print("[WARNING] Using CPU, expect slower performance")
    
    return device, attn_implementation

# Get device and attention mechanism
device, attn_implementation = get_device_and_attention()

In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
%pip install -U bitsandbytes

In [None]:
torch.set_default_dtype(torch.float32)

In [None]:
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_compute_dtype=torch.float16)
quantization_config

In [None]:
if device.type == "mps":
    torch_dtype = torch.float32
else:
    torch_dtype = torch.bfloat16

# Load tokenizer and model with the appropriate settings
tokenizer = AutoTokenizer.from_pretrained(model_id , token = "")
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_id,
    torch_dtype=torch.float16,
    # quantization_config=quantization_config,
    low_cpu_mem_usage=False,
    attn_implementation=attn_implementation,
    token=""
)

if not use_quantization_config:  # quantization handles device placement automatically; otherwise move the model to the selected device
    model.to(device)
print("[INFO] Model successfully loaded on", device)

In [None]:
model

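OK, that's a bunch of layers, ranging from embedding layers to attention layers to MLP and normalization layers. The attention layer class you see depends on the attention implementation selected earlier (for example, you may see `GemmaFlashAttention2` layers when Flash Attention 2 is in use).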

The good news is that we don't have to know too much about these to use the model.

How about we get the number of parameters in our model?


In [None]:
def get_model_num_params(model: torch.nn.Module):
    return sum([param.numel() for param in model.parameters()])

get_model_num_params(model)

In [None]:
def get_model_mem_size(model: torch.nn.Module):
    """
    Get how much memory a PyTorch model takes up.

    See: https://discuss.pytorch.org/t/gpu-memory-that-model-uses/56822
    """
    # Get model parameters and buffer sizes
    mem_params = sum([param.nelement() * param.element_size() for param in model.parameters()])
    mem_buffers = sum([buf.nelement() * buf.element_size() for buf in model.buffers()])

    # Calculate various model sizes
    model_mem_bytes = mem_params + mem_buffers # in bytes
    model_mem_mb = model_mem_bytes / (1024**2) # in megabytes
    model_mem_gb = model_mem_bytes / (1024**3) # in gigabytes

    return {"model_mem_bytes": model_mem_bytes,
            "model_mem_mb": round(model_mem_mb, 2),
            "model_mem_gb": round(model_mem_gb, 2)}

get_model_mem_size(model)

In [None]:
input_text = "What are the macronutrients, and what roles do they play in the human body?"
print(f"Input text:\n{input_text}")

# Create prompt template for instruction-tuned model
dialogue_template = [
    {"role": "user",
     "content": input_text}
]

# Apply the chat template
prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                       tokenize=False, # keep as raw text (not tokenized)
                                       add_generation_prompt=True)
print(f"\nPrompt (formatted):\n{prompt}")

In [None]:

# Tokenize the input text (turn it into numbers) and send it to the selected device
input_ids = tokenizer(prompt, return_tensors="pt").to(device)
print(f"Model input (tokenized):\n{input_ids}\n")

# Generate outputs based on the tokenized input
# See generate docs: https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/text_generation#transformers.GenerationConfig 
outputs = model.generate(**input_ids, max_new_tokens=256)  # define the maximum number of new tokens to create
print(f"Model output (tokens):\n{outputs[0]}\n")


Woohoo! We just generated some text on our local GPU!

Well not just yet...

Our LLM accepts tokens in and sends tokens back out.

We can convert the output tokens to text using [`tokenizer.decode()`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode).


In [None]:
# Decode the output tokens to text
outputs_decoded = tokenizer.decode(outputs[0])
print(f"Model output (decoded):\n{outputs_decoded}\n")

In [None]:
print(f"Input text: {input_text}\n")
print(f"Output text:\n{outputs_decoded.replace(prompt, '').replace('<bos>', '').replace('<eos>', '')}")
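
The chained `.replace()` calls above can be wrapped in a small helper. The sketch below works on plain strings only; the helper name `clean_generated_text` and the toy prompt format are made up for illustration and are not the real Gemma chat template.

```python
def clean_generated_text(decoded: str,
                         prompt: str,
                         special_tokens: tuple = ("<bos>", "<eos>")) -> str:
    """Remove the echoed prompt and special tokens from decoded model output."""
    text = decoded.replace(prompt, "")
    for token in special_tokens:
        text = text.replace(token, "")
    return text.strip()

# Toy strings standing in for a formatted prompt and a decoded generation
example_prompt = "USER: Hi\nMODEL: "
example_decoded = "<bos>" + example_prompt + "Hello there!<eos>"
print(clean_generated_text(example_decoded, example_prompt))
```

An alternative worth trying is passing `skip_special_tokens=True` to `tokenizer.decode()`, which drops the special tokens at decode time.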

In [None]:
# Nutrition-style questions generated with GPT-4
gpt4_questions = [
    "What are the macronutrients, and what roles do they play in the human body?",
    "How do vitamins and minerals differ in their roles and importance for health?",
    "Describe the process of digestion and absorption of nutrients in the human body.",
    "What role does fibre play in digestion? Name five fibre containing foods.",
    "Explain the concept of energy balance and its importance in weight management."
]

# Manually created question list
manual_questions = [
    "How often should infants be breastfed?",
    "What are symptoms of pellagra?",
    "How does saliva help with digestion?",
    "What is the RDI for protein per day?",
    "water soluble vitamins"
]

query_list = gpt4_questions + manual_questions

And now let's check if our `retrieve_relevant_resources()` function works with our list of queries.


In [None]:
import random
query = random.choice(query_list)

print(f"Query: {query}")

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices
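
To make the "scores and indices" output more concrete, here is a library-free sketch of one common way to rank chunks: score every embedding against the query with a dot product and keep the top `k`. This is an assumption for illustration, not necessarily how `retrieve_relevant_resources()` is implemented; `top_k_by_dot_product` and the toy vectors are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_by_dot_product(query_embedding, embeddings, k=5):
    """Return (scores, indices) of the k embeddings most similar to the query."""
    all_scores = [dot(query_embedding, emb) for emb in embeddings]
    ranked = sorted(range(len(all_scores)),
                    key=lambda i: all_scores[i],
                    reverse=True)[:k]
    return [all_scores[i] for i in ranked], ranked

# Toy 2-dimensional "embeddings" (real ones have hundreds of dimensions)
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_query = [1.0, 0.2]
toy_scores, toy_indices = top_k_by_dot_product(toy_query, toy_embeddings, k=2)
print(toy_scores, toy_indices)
```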

In [None]:
def prompt_formatter(query: str, 
                     context_items: list[dict]) -> str:
    """
    Augments query with text-based context from context_items.
    """
    # Join context items into one bulleted list (one "- " item per chunk)
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])

    # Create a base prompt with examples to help the model
    # Note: this is very customizable, I've chosen to use 3 examples of the answer style we'd like.
    # We could also write this in a txt file and import it in if we wanted.
    base_prompt = """Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.
\nExample 1:
Query: What are the fat-soluble vitamins?
Answer: The fat-soluble vitamins include Vitamin A, Vitamin D, Vitamin E, and Vitamin K. These vitamins are absorbed along with fats in the diet and can be stored in the body's fatty tissue and liver for later use. Vitamin A is important for vision, immune function, and skin health. Vitamin D plays a critical role in calcium absorption and bone health. Vitamin E acts as an antioxidant, protecting cells from damage. Vitamin K is essential for blood clotting and bone metabolism.
\nExample 2:
Query: What are the causes of type 2 diabetes?
Answer: Type 2 diabetes is often associated with overnutrition, particularly the overconsumption of calories leading to obesity. Factors include a diet high in refined sugars and saturated fats, which can lead to insulin resistance, a condition where the body's cells do not respond effectively to insulin. Over time, the pancreas cannot produce enough insulin to manage blood sugar levels, resulting in type 2 diabetes. Additionally, excessive caloric intake without sufficient physical activity exacerbates the risk by promoting weight gain and fat accumulation, particularly around the abdomen, further contributing to insulin resistance.
\nExample 3:
Query: What is the importance of hydration for physical performance?
Answer: Hydration is crucial for physical performance because water plays key roles in maintaining blood volume, regulating body temperature, and ensuring the transport of nutrients and oxygen to cells. Adequate hydration is essential for optimal muscle function, endurance, and recovery. Dehydration can lead to decreased performance, fatigue, and increased risk of heat-related illnesses, such as heat stroke. Drinking sufficient water before, during, and after exercise helps ensure peak physical performance and recovery.
\nNow use the following context items to answer the user query:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:"""

    # Update base prompt with context items and query   
    base_prompt = base_prompt.format(context=context, query=query)

    # Create prompt template for instruction-tuned model
    dialogue_template = [
        {"role": "user",
        "content": base_prompt}
    ]

    # Apply the chat template
    prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                          tokenize=False,
                                          add_generation_prompt=True)
    return prompt
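
The augmentation step boils down to joining the retrieved chunks into a bulleted block and substituting it into a template. The sketch below shows just that part with placeholder data; `build_context_block`, `toy_template` and the chunk texts are illustrative stand-ins, and the chat-template step is left out.

```python
def build_context_block(context_items, key="sentence_chunk"):
    """Join the text of each context item into a '- ' bulleted block."""
    return "- " + "\n- ".join(item[key] for item in context_items)

# A much shorter template than the real base prompt, for illustration only
toy_template = "Context items:\n{context}\n\nUser query: {query}\nAnswer:"

toy_items = [
    {"sentence_chunk": "First retrieved chunk."},
    {"sentence_chunk": "Second retrieved chunk."},
]

toy_prompt = toy_template.format(context=build_context_block(toy_items),
                                 query="An example question?")
print(toy_prompt)
```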

In [None]:
query = random.choice(query_list)
print(f"Query: {query}")

# Get relevant resources
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
    
# Create a list of context items
context_items = [pages_and_chunks[i] for i in indices]

# Format prompt with context items
prompt = prompt_formatter(query=query,
                          context_items=context_items)
print(prompt)

In [None]:
%%time

input_ids = tokenizer(prompt, return_tensors="pt").to(device)

# Generate an output of tokens
outputs = model.generate(**input_ids,
                         temperature=0.7, # lower temperature = more deterministic outputs, higher temperature = more creative outputs
                         do_sample=True, # whether or not to use sampling, see https://huyenchip.com/2024/01/16/sampling.html for more
                         max_new_tokens=256) # how many new tokens to generate from prompt 

# Turn the output tokens into text
output_text = tokenizer.decode(outputs[0])

print(f"Query: {query}")
print(f"RAG answer:\n{output_text.replace(prompt, '')}")
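
To build some intuition for the `temperature` argument: in temperature sampling, the model's next-token scores (logits) are divided by the temperature before being turned into probabilities. Below is a minimal, library-free sketch of that idea with made-up logits. `softmax_with_temperature` is a hypothetical helper, and the real `generate()` call may apply other processing steps as well.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for temperature in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(f"temperature={temperature}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures spread it more evenly across the candidates.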

Yesssssss!!!

Our RAG pipeline is complete!

We just Retrieved, Augmented and Generated!

And all on our own local GPU!

How about we functionize the generation step to make it easier to use?

We can put a little formatting on the text being returned to make it look nice too.

And we'll make an option to return the context items if needed as well.


In [None]:
def ask(query, 
        temperature=0.7,
        max_new_tokens=512,
        format_answer_text=True, 
        return_answer_only=True):
    """
    Takes a query, finds relevant resources/context and generates an answer to the query based on the relevant resources.
    """
    
    # Get just the scores and indices of top related results
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings)
    
    # Create a list of context items
    context_items = [pages_and_chunks[i] for i in indices]

    # Add score to context item
    for i, item in enumerate(context_items):
        item["score"] = scores[i].cpu() # return score back to CPU 
        
    # Format the prompt with context items
    prompt = prompt_formatter(query=query,
                              context_items=context_items)
    
    # Tokenize the prompt
    input_ids = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate an output of tokens
    outputs = model.generate(**input_ids,
                                 temperature=temperature,
                                 do_sample=True,
                                 max_new_tokens=max_new_tokens)
    
    # Turn the output tokens into text
    output_text = tokenizer.decode(outputs[0])

    if format_answer_text:
        # Replace special tokens and unnecessary help message
        output_text = output_text.replace(prompt, "").replace("<bos>", "").replace("<eos>", "").replace("Sure, here is the answer to the user query:\n\n", "")

    # Only return the answer without the context items
    if return_answer_only:
        return output_text
    
    return output_text, context_items

In [None]:
query = random.choice(query_list)
print(f"Query: {query}")

# Answer query with context and return context 
answer, context_items = ask(query=query, 
                            temperature=0.7,
                            max_new_tokens=512,
                            return_answer_only=False)

print("Answer:\n")
print_wrapped(answer)
print("Context items:")
context_items

Local RAG workflow complete!

We've now officially got a way to Retrieve, Augment and Generate answers based on a source.

For now we can verify our answers manually by reading them and reading through the textbook.

But if you want to put this into a production system, it'd be a good idea to have some kind of evaluation of how well the pipeline works.

For example, you could use another LLM to rate the answers returned by our LLM and then use those ratings as a proxy evaluation.

However, I'll leave this and a few more interesting ideas as extensions.
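
As a possible starting point for the LLM-as-a-judge idea, here is a rough sketch of the plumbing: build a rating prompt, send it to a judge, and parse a 1-5 rating from the reply. Everything here is hypothetical (`build_judge_prompt`, `parse_rating`, `evaluate`, and the `fake_judge` stub that stands in for a real LLM call), and the prompt wording is only an example.

```python
import re

def build_judge_prompt(query: str, answer: str, context: str) -> str:
    """Create a prompt asking a judge model to rate an answer from 1 to 5."""
    return (
        "Rate how well the answer addresses the query using only the context.\n"
        "Reply with a single integer from 1 (poor) to 5 (excellent).\n\n"
        f"Context:\n{context}\n\nQuery: {query}\nAnswer: {answer}\nRating:"
    )

def parse_rating(judge_output: str, low: int = 1, high: int = 5):
    """Extract the first integer in the judge's reply, or None if invalid."""
    match = re.search(r"\b([0-9]+)\b", judge_output)
    if match is None:
        return None
    rating = int(match.group(1))
    return rating if low <= rating <= high else None

def evaluate(examples, judge_fn):
    """Average the valid ratings returned by judge_fn over all examples."""
    ratings = []
    for example in examples:
        raw = judge_fn(build_judge_prompt(**example))
        rating = parse_rating(raw)
        if rating is not None:
            ratings.append(rating)
    return sum(ratings) / len(ratings) if ratings else None

def fake_judge(prompt: str) -> str:
    """Stub judge for demonstration; replace with a call to a real LLM."""
    return "Rating: 4"

toy_examples = [{"query": "Example query?",
                 "answer": "Example answer.",
                 "context": "- Example context chunk."}]
print(evaluate(toy_examples, fake_judge))
```

In practice, `judge_fn` could wrap another call to `model.generate()` or a different model entirely, and the ratings are only a proxy that should be spot-checked by hand.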
