This notebook provides a demo for implement a RAG Pipeline to tune a pretrained LLM to answer questions using information from the first 20 pages of Harry Potter and the Sorcerer's Stone


Author: Rishika Srinivas

Credits: https://github.com/mrdbourke/simple-local-rag


Santa Cruz Artificial Intelligence Spring 2025

Load the Dependencies

In [1]:
import torch
import fitz
import pandas as pd
import re
from transformers import AutoTokenizer, AutoModelForCausalLM
import textwrap
from sentence_transformers import SentenceTransformer
from sentence_transformers import util
import tqdm as notebook_tqdm

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
PDF_PATH= "CCLSpring2025.pdf"

Load the Document into a pandas dataframe (which we'll be using as our 'vector database')



*   Chunk
*   Store in dataframe



In [3]:

def reformat_text(text: str):
  text = text.replace("\n", " ")
  return text

# load the information in the pdf into a list of dictionaries so we can
# easily store them into a dataframe
def chunk_pdf(doc,chunk_size=6):
  text_per_page = []
  # go to each page
  for page_number, page in enumerate(doc):
    sentence = ''
    accumulated_text=''
    # look at each line in the document
    for i,text in enumerate(page.get_text("text").split("\n")):

      # get rid of any header text
      if text.upper() == text:
        text=text.replace(text, " ")

      # once we've reached our chunk size, add that chunk to the dictionary which we will convert to a dataframe (this step is just so that we can see how the chunks look more easily)
      accumulated_text += text
      if i > 0 and i % chunk_size == 0:
        sentence += accumulated_text
        sentence = reformat_text(sentence)
        text_per_page.append({"Text": sentence,
                          "Page_#": page_number,

        })
        accumulated_text = ''
        sentence = ''

  return text_per_page

# See how the datatable looks after we chunk our document
doc = fitz.open(PDF_PATH)
pd.DataFrame(chunk_pdf(doc))

Unnamed: 0,Text,Page_#
0,Playfly College Esports Collegiate Chess Leagu...,0
1,By entering and participating in the Compet...,0
2,decisions regarding the interpretation of thes...,0
3,"Playfly Esports, 22 Cassatt Ave. Berwyn, PA ...",0
4,The following dates will apply to the All Di...,0
...,...,...
173,non-primary monitors and/or requesting to view...,18
174,computers or laptops using keyboard and mouse....,18
175,browser and browser tab being used to play on ...,18
176,be removed from competition at the discretion...,19


Combine everything we've done so far. Open the document, and chunk it!

In [4]:
#open the document, chunk the text, and store each chunk as an entry in the text_chunks list
doc = fitz.open(PDF_PATH)
chunk_size=5
chunks = chunk_pdf(doc,chunk_size=chunk_size)
text_chunks = []
for chunk in chunks:
  text_chunks.append(chunk['Text'])
print(f"Number of chunks: {len(text_chunks)}")
print(f"Text chunk 1 is {text_chunks[1]}")

Number of chunks: 213
Text chunk 1 is     By entering and participating in the Competition, each Student-athlete, as defined below, and 


In [6]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

embedding_model = SentenceTransformer(model_name_or_path="all-MiniLM-L6-v2",  # Use smaller SentenceTransformer model
                                      device="mps")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2", eos_token="<EOS>")
model = AutoModelForCausalLM.from_pretrained(
    "distilgpt2",  # Use smaller causal language model
    torch_dtype=torch.float16,  # Use float16 for reduced memory usage
    low_cpu_mem_usage=True,    # Enable low memory usage
    device_map="auto"         # Automatically map layers to available devices
)
model = model.to("mps")  # Ensure the model is on the GPU if available


Convert the text chunks into vector embeddings!

In [7]:
# Sentences are encoded/embedded by calling model.encode()
text_chunk_embeddings = embedding_model.encode(
    text_chunks,
    batch_size=4,  # Reduce batch size to minimize memory spikes
    convert_to_tensor=True
)

In [8]:
text_chunk_embeddings.shape # Notice 148 entries in our list and 148 in our embedding tensor

torch.Size([213, 384])

Apply cosine similarity to extract top K most relevant chunks for our given query

In [9]:
#use some similarilty metric to calculate the similarity between your query and the sentences in the df
#credit: mrdbourke/simple-local-rag demo
import torch.nn as nn
def retrieve_relevant_resources(query: str,
                                embeddings: torch.tensor, model,
                                n_resources_to_return: int=5):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """

    # Embed the query
    query_embedding = model.encode(query, convert_to_tensor=True)
    # Get dot product scores on embeddings
    cos = nn.CosineSimilarity(dim=1, eps=1e-6)

    dot_scores = cos(query_embedding, embeddings)



    scores, indices = torch.topk(input=dot_scores,
                                 k=n_resources_to_return)

    return scores, indices


Let's test this function out. Ask a query and see what relevant context is chosen!

In [10]:
#take any query and see the similarity scores and the indexes of the df that hold the relevant sentences!
query="What date does registration close for the chess tournament?"
scores, indices = retrieve_relevant_resources(query,text_chunk_embeddings,embedding_model)
print("Scores: ", scores, " Indices: ", indices)
print("Similar chunks are: ", [text_chunks[i] for i in indices])

Scores:  tensor([0.5350, 0.5304, 0.5296, 0.5176, 0.5091], device='mps:0')  Indices:  tensor([107, 106,   0,   6,   5], device='mps:0')
Similar chunks are:  ['  Schedule:  ●\u200b Registration Opens: December 13 ●\u200b Registration Closes: January 10 ●\u200b Schedule Release: January 17 ', 'LeagueSpot: Primary registration for the Competition will be done on LeagueSpot: https://esports.playflycollege.gg/league. All matches and Playoff brackets will be found on LeagueSpot.   Chess.com: All games will be played on Chess.com for Regular Season ', 'Playfly College Esports Collegiate Chess League  Spring 2025 Competition Official Rules and Regulations     ', 'Dates Location Registration Period Dec 13 - Jan 10 Online ', ' The following dates will apply to the All Divisions (defined below) of the Competition:   Stage ']


This function combines the query and context and put it into a format that enables us to pass the combination into our LLM as the prompt

In [15]:
#credit: mrdbourke/simple-local-rag demo
def prompt_formatter(query: str,
                     context) -> str:
    """
    Augments query with text-based context from context_items.
    """
    # Join context items into one dotted paragraph
    context = "\n".join(context) if isinstance(context, list) else context

    # Create a base prompt
    base_prompt = """Answer this question about the chess tournament using information only from the provided context.
      Context:
      {context}
      User query: {query}
      Answer:"""

    # Update base prompt with context items and query
    prompt = base_prompt.format(context=context, query=query)

    return prompt, context

Putting it all together!



*   Retrieve relevant context
*   Combine context and query
*   use model.generate() to pass our prompt into the model





In [16]:
#credit: mrdbourke/simple-local-rag demo
def ask(query,
        temperature=0.7,
        max_new_tokens=50,
        format_answer_text=True,
        return_answer_only=True):
    """
    Takes a query, finds relevant resources/context and generates an answer to the query based on the relevant resources.
    """

    # Get the scores and the rows (indices) of the text_chunk_embeddings that give the highest relevance scores
    scores, indices = retrieve_relevant_resources(query, text_chunk_embeddings,
                                                  embedding_model, n_resources_to_return=5)

    # We can use our text_chunks list to find the actual strings associated with those indices
    context_items = [text_chunks[i] for i in indices]
    context = " ".join(context_items)

    # Format the prompt with context items
    prompt, context = prompt_formatter(query=query,
                                       context=context)

    # Tokenize the prompt
    inputs = tokenizer(prompt, truncation=True, max_length=512, return_tensors="pt").to("mps")

    # Generate an output of tokens
    outputs = model.generate(
        **inputs,
        early_stopping=True,
        num_beams=4,
        max_new_tokens=100,  # Use max_new_tokens to control generation length
        length_penalty=2.0,
        no_repeat_ngram_size=3
    )

    # Turn the output tokens into text
    output_text = tokenizer.decode(outputs[0])
    if format_answer_text:
        # Replace special tokens and unnecessary help message
        output_text = output_text.replace(prompt, "").replace("<bos>", "").replace("Sure, here is the answer to the user query:\n\n", "")
        output_text = output_text.partition("<|im_end|>")[0]
    # Only return the answer without the context items
    if return_answer_only:
        return output_text

    return output_text, context

In [17]:
#credit: mrdbourke/simple-local-rag demo
def print_wrapped(text, wrap_length=50):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

In [18]:
# Answer query with context and return context
answer, context_items = ask(query="List out all the key deadlines of the tournament.",
                            temperature=0.2,
                            max_new_tokens=50,
                            return_answer_only=False)

print(f"Answer: {answer}\n")
print(f"Context items: {context_items}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Answer:  If you have any questions or questions about the tournament, please contact me at      (at)   http://www.youtube.com/watch?v=QQQqQqqqQQ
If you would like to contact me, please email me at [email protected] or follow me on Twitter at   https://twitter.com/#!/Patreon<|endoftext|>

Context items: ●​ Quarterfinals: March 15  ●​ Semifinals: March 22  ●​ Final and 3rd Place Match: March 29  Rescheduling: Playoff Matches may not be rescheduled  ●​ Semifinals: March 29 ●​ Final and 3rd Place Match: April 5  Rescheduling: Playoff Matches in Divisions 2 and 3 may not be rescheduled except under significant circumstances as determined by  Format:  Single elimination bracket; 11 am PT / 2 pm ET.  Schedule:  ●​ Round One: March 15 ●​ Quarterfinals: March 22  Quarterfinals round and immediately qualify to the Semifinals.  ●​ The remaining 4 Qualifying teams will be seeded as such:   Fair Play: During the Playoffs, Players in the top 3 Divisions (Divisions   The following dates will apply to