# Retrieval Augmented Generation (RAG) from Scratch

When LLMs are created, they are trained on a fixed dataset, which means they only know information available up to the time of training. But what if we want to talk to an LLM about new or ever-changing information that wasn't in its original dataset? Or what about private data? This is where RAG comes in. Using this approach, we can pass relevant and up-to-date information from external sources to the LLM in real time before it generates a response. Let's see how it's done...

![Retrieval](retrieval.png)

In [23]:
import json

def load_faqs(path: str) -> list[dict]:
    """Loads FAQs from a JSON file and adds a token count to each."""
    with open(path, "r") as file:
        data = json.load(file)  # Use a separate name so we don't shadow the file handle
    qas = []
    for qa in data:
        faq_text = f"{qa['question']} {qa['answer']}"
        token_count = len(faq_text) // 4  # Approximate the token count of the combined string: one token is roughly 4 English characters
        qas.append({'faq': faq_text, 'token_count': token_count})
    return qas


# Load FAQs from JSON file
faq_path = "faq.json"
faqs = load_faqs(faq_path)

# Print the FAQs
for faq in faqs:
    print(f"FAQ: {faq['faq']}")
    print(f"Token Count: {faq['token_count']}\n")

FAQ: What is your return policy? You can return any item within 30 days of purchase for a full refund, provided it is in its original condition.
Token Count: 34

FAQ: Do you offer free shipping? Yes, we offer free standard shipping on orders over $50.
Token Count: 21

FAQ: How can I track my order? Once your order is shipped, you will receive an email with a tracking number and a link to track your package.
Token Count: 34

FAQ: Can I change or cancel my order? You can change or cancel your order within 24 hours of placing it by contacting our customer service.
Token Count: 33

FAQ: What payment methods do you accept? We accept Visa, MasterCard, American Express, PayPal, and Apple Pay.
Token Count: 26

FAQ: Do you have a physical store location? Yes, we have several store locations. Please visit our website to find the nearest store.
Token Count: 32

FAQ: How do I use a promo code? Enter your promo code at checkout in the designated field to apply the discount.
Token Count: 26

FAQ: Wh

# Why do we care about token count?

Token count is important to think about because:
1. Embedding models can't ingest an unlimited number of tokens (more on this soon)
2. LLMs can't ingest an unlimited number of tokens either (context window)

For instance, an embedding model is trained to take an input of up to _x_ tokens and embed it into numerical space.
If the token count of the thing we want to embed is larger than what the model supports, we must split it up into chunks.
In our case, we don't need to worry about this today. But if you're working with a large PDF, say, this is something to take into account.
In the case of an LLM, GPT-4o, for example, has a max context window (input) of 128,000 tokens, so around 512,000 characters, and a max output length of 16,384 tokens.

https://platform.openai.com/tokenizer
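
To make the chunking idea concrete, here is a minimal sketch that reuses the "4 characters per token" estimate from earlier. The function names `approx_token_count` and `chunk_by_tokens` are hypothetical, and a real pipeline would count tokens with the model's own tokenizer rather than this heuristic.

```python
def approx_token_count(text: str) -> int:
    """Rough estimate: one token is about 4 English characters."""
    return len(text) // 4


def chunk_by_tokens(text: str, max_tokens: int) -> list[str]:
    """Greedily pack whole words into chunks that stay within max_tokens (estimated)."""
    max_chars = max_tokens * 4
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars or not current:
            # Still fits (or it is a single oversized word, which we keep whole)
            current = candidate
        else:
            # Adding this word would exceed the limit: close the chunk and start a new one
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


document = "word " * 100  # stand-in for a long document
chunks = chunk_by_tokens(document, max_tokens=25)
print(len(chunks), [approx_token_count(c) for c in chunks])
```

Splitting on whitespace keeps words intact, but it ignores sentence and paragraph boundaries, which usually matter for retrieval quality.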

# Processing each FAQ into an embedding

Okay, let's embed each FAQ into its own numerical representation, but what does this even mean? And why are we doing it?

While humans understand language and text, computers, as we all know, understand numbers. Let's provide them with some...
 
To "embed" a piece of text means to transform it into a vector representation within a high-dimensional space. 

"_vectors_, _high-dimensional spaces_, _what!?_"

- A **vector** is a mathematical object that has both magnitude and direction, often represented as an ordered list of numbers corresponding to coordinates in a space. In simpler terms, it's an array of numbers, e.g. [3, 4, 5]
- A **high-dimensional** space refers to a mathematical space with many dimensions, allowing it to capture complex relationships and semantic similarities between its data points. Whereas we live in three dimensions, these spaces can have thousands, which makes them impossible to comprehend intuitively. In these spaces, ***semantically*** similar data points are positioned closer to each other. For example, the numerical values for "Dog" and "Cat" might be positioned fairly close to each other, while the numerical representation for the word "Bank" might sit roughly midway between the regions representing Finance and Geography (think of a river bank).
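
As a toy illustration of "similar things sit closer together", the sketch below uses three hand-picked 3-dimensional vectors. The numbers are invented purely for illustration and do not come from any model.

```python
import math

# Hypothetical, hand-picked 3-dimensional "embeddings" (real models learn hundreds of dimensions)
toy_vectors = {
    "dog":  [0.90, 0.80, 0.10],
    "cat":  [0.85, 0.75, 0.15],
    "bank": [0.10, 0.20, 0.90],
}


def distance(a, b):
    """Euclidean (straight-line) distance between two points."""
    return math.dist(a, b)


print("dog <-> cat: ", round(distance(toy_vectors["dog"], toy_vectors["cat"]), 3))
print("dog <-> bank:", round(distance(toy_vectors["dog"], toy_vectors["bank"]), 3))
```

The "dog" and "cat" points end up much closer to each other than either is to "bank", which is the property a trained embedding model aims for.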

But which numbers do we pick...? And how do we "encode" the semantic meaning behind a given piece of text into these numbers?

# Embedding Models

Embedding models are machine learning models designed to convert data (like words or images) into vectors. They are trained using large datasets where the model adjusts the vector representations to minimize the distance between similar items and maximize the distance between dissimilar ones. So to start off, the vector representations are completely random. But as training progresses, the model starts to learn the similarities and differences between various inputs, and generates the appropriate vector in response. Let's take a look:
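
To give a feel for "minimize the distance between similar items and maximize the distance between dissimilar ones", here is a sketch of a triplet loss, one well-known training objective of this kind. It is an illustrative assumption: the model used below may well have been trained with a different objective, and `triplet_loss` is a hypothetical helper written for this example.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by at least `margin`."""
    d_pos = math.dist(anchor, positive)  # distance to the similar item
    d_neg = math.dist(anchor, negative)  # distance to the dissimilar item
    return max(0.0, d_pos - d_neg + margin)


anchor = [0.0, 0.0]

# Early in training: the similar item is far away and the dissimilar item is close
print(triplet_loss(anchor, positive=[3.0, 0.0], negative=[1.0, 0.0]))

# Later in training: the similar item is close and the dissimilar item is far away
print(triplet_loss(anchor, positive=[0.5, 0.0], negative=[3.0, 0.0]))
```

During training, the model's weights are nudged to push a loss like this down, which is what gradually moves similar inputs together and dissimilar inputs apart.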

In [24]:
# Sentence Transformers is an open-source library for interfacing with embedding models
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device="cpu")
sentences = ["This is for a tech talk", "Hello world!"]

embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding}")
    print("")

Sentence: This is for a tech talk
Embedding: [-1.85906217e-02 -7.89896250e-02 -2.90067401e-02 -1.14555974e-02
 -2.81989183e-02  1.18537704e-02  5.35872839e-02 -5.07863499e-02
 -2.27182433e-02  2.66975537e-03  6.44041747e-02 -3.70712988e-02
  1.86049100e-02  6.85815886e-02  3.68726323e-03 -5.77761009e-02
  3.97582017e-02  1.33974403e-02 -2.54959092e-02  5.75061962e-02
 -1.32698547e-02  1.59527957e-02 -3.02311834e-02 -1.65105832e-03
 -5.11745829e-03 -1.42291067e-02 -1.74166542e-02  8.79486650e-03
 -7.65029900e-03  2.66950345e-04  5.61381807e-04 -5.49411364e-02
 -6.56222051e-04 -5.15956096e-02  1.56339365e-06  1.01803159e-02
 -2.44737752e-02  9.87703353e-03 -7.02290237e-02  9.24384445e-02
  4.40662820e-03  6.48789927e-02  1.49574815e-04  7.06449598e-02
  5.08026183e-02 -3.47991362e-02  6.47578314e-02  3.67464758e-02
  6.07231911e-03  6.80860737e-03 -6.64794957e-03  1.38000473e-02
 -5.59679233e-03 -2.69164629e-02 -6.31440580e-02 -4.03364561e-03
 -3.17131057e-02  2.97694094e-02  2.82389633e

In [25]:
# Let's take a look at the first embedding
embeddings[0].shape

(768,)

So, we are now representing each sentence with 768 numbers. Another way of putting this is that each vector now represents a point in 768-dimensional space, where the dimensions together capture learned features or characteristics of the data.

# Efficiency of Vector Operations: CPU vs GPU

In the previous example, we performed the operation on the CPU. CPUs are designed for sequential operations (one after the next). Let's see how long it takes to embed all of our FAQs using this approach:

In [26]:
%%time

embedding_model.to("cpu")

# Embed each FAQ one by one
for faq in faqs:
    # Encode the FAQ text
    embedding = embedding_model.encode(faq['faq'])

CPU times: user 5.29 s, sys: 65.1 ms, total: 5.35 s
Wall time: 671 ms


Now let's try on the GPU. I have an NVIDIA RTX 3070, which is by no means the fastest consumer card out there, but let's see how it fares in comparison...

In [27]:
%%time

embedding_model.to("cuda")

# Embed each FAQ one by one
for faq in faqs:
    # Encode the FAQ text
    embedding = embedding_model.encode(faq['faq'])

CPU times: user 1.49 s, sys: 26.3 ms, total: 1.52 s
Wall time: 271 ms


Wow, ~2.5x faster by wall time (671 ms down to 271 ms). But wait, we're still doing this sequentially due to the nature of the code... Can we get some more performance out of my GPU?

In [28]:
%%time

faq_texts = [faq['faq'] for faq in faqs]
batch_embeddings = embedding_model.encode(faq_texts, batch_size=32)

CPU times: user 35.5 ms, sys: 1.18 ms, total: 36.7 ms
Wall time: 25.2 ms


Another ~11x improvement by wall time (271 ms down to 25.2 ms), for a total of ~27x compared with embedding the FAQs one by one on the CPU.

But why?

GPUs excel at tasks such as this because they are designed for parallel computing, allowing them to process thousands of vector operations simultaneously, whereas CPUs are optimized for sequential processing. For example, think of a video game and the number of polygons the GPU has to render each frame.
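
The shift from "one item per loop iteration" to "one batched operation" can be sketched with NumPy on the CPU. This only illustrates the formulation that parallel hardware exploits; it says nothing about actual GPU timings, and the random arrays are stand-ins for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
vectors = rng.random((20, 768))  # stand-in for 20 FAQ embeddings
query = rng.random(768)          # stand-in for a query embedding

# Sequential: one dot product per loop iteration
loop_scores = np.array([np.dot(v, query) for v in vectors])

# Batched: a single matrix-vector product covers all 20 rows at once
batch_scores = vectors @ query

print(batch_scores.shape)
print(np.allclose(loop_scores, batch_scores))
```

Both versions compute the same 20 scores; the batched form simply hands the whole problem to the library in one call, which is what lets parallel hardware work on many rows at the same time.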

In [29]:
# Typically to store these embeddings, we would use a Vector Database, such as Pinecone, Elasticsearch, Qdrant, etc. For now though, let's just store them in a NumPy array
import numpy as np

# Add the embedding to each FAQ object
for faq, embedding in zip(faqs, batch_embeddings):
    faq['embedding'] = embedding

# Extract the embeddings from each FAQ and store them in a NumPy array
embeddings_array = np.array([faq['embedding'] for faq in faqs])

# ***Retrieval*** Augmented Generation

The first phase of RAG is Retrieval. This includes gathering relevant data based on some input query. For this, we have a few options:

- We can do this by Similarity Search, otherwise known as Vector Search or Semantic Search. These are techniques that use embeddings to understand the meaning and context of queries and data, enabling more accurate and relevant search results by comparing vector representations rather than relying on exact keyword matches.
- Another option is Keyword Search, where based on some input query, we find relevant data that contains that input. For example if I were to search for "Apple", it would return passages that also contain "Apple".

In our case, we want to ask a question and then retrieve relevant FAQs so our LLM has the context to answer accurately. Let's go for Semantic Search, as the user might phrase their question in a way that Keyword Search wouldn't match to the passages we need.
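
To see why Keyword Search can miss relevant passages, here is a deliberately naive sketch. The `keyword_search` helper is hypothetical and uses simple case-insensitive substring matching; production keyword search typically relies on tokenization and ranking schemes rather than this.

```python
def keyword_search(query: str, passages: list[str]) -> list[str]:
    """Return passages containing every word of the query (case-insensitive substring match)."""
    terms = query.lower().split()
    return [p for p in passages if all(t in p.lower() for t in terms)]


passages = [
    "What is your return policy? You can return any item within 30 days of purchase "
    "for a full refund, provided it is in its original condition.",
    "Do you offer free shipping? Yes, we offer free standard shipping on orders over $50.",
]

print(keyword_search("return policy", passages))  # matches the first passage
print(keyword_search("send it back", passages))   # no match, despite the similar meaning
```

The second query is about returns too, but shares no distinctive keywords with the passage, which is exactly the gap Semantic Search is meant to close.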

In [30]:
# Let's take our NumPy array and convert it into a Tensor using PyTorch
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Convert the NumPy array of embeddings to a PyTorch tensor
embeddings_tensor = torch.tensor(embeddings_array).to(device)

# Print the shape of the resulting tensor to verify
embeddings_tensor.shape

torch.Size([20, 768])

This means my Tensor has 20 vectors, each with 768 elements.

But what is PyTorch, and what is a Tensor? 

PyTorch is an open-source deep learning framework that provides tools for building and training neural networks.

A Tensor is a multidimensional array similar to a NumPy array but with additional capabilities, such as running on GPUs for accelerated computations. While a vector is a 1-dimensional tensor (or array), PyTorch tensors can represent data in higher dimensions (like matrices or multidimensional arrays), which is essential for deep learning tasks.

Now that we have our embeddings stored, let's create a small Semantic Search pipeline by:
1. Defining a query string.
2. Transforming the query string into an embedding.
3. Performing a dot product or cosine similarity operation (more on this soon) between the FAQ embeddings and the query embedding.
4. Getting the top 3 matching results, with the most similar being first.

In [31]:
from sentence_transformers import SentenceTransformer, util

# It's important to embed our query using the same model we used to embed our FAQs
query_embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device=device)

# 1. Define the query
query = "item is damaged"

# 2. Embed the query using the same model we used to embed FAQs
query_embedding = query_embedding_model.encode(query, convert_to_tensor=True)

# 3. Get similarity scores with the dot product (our model's outputs are normalized; use cosine similarity if yours aren't)
dot_scores = util.dot_score(a=query_embedding, b=embeddings_tensor)

# 4. Get top-k results (3 for now)
top_faqs = torch.topk(dot_scores, k=3)

# Extract the index of the top scores
top_indices = top_faqs.indices[0].tolist()

# Retrieve and print the most relevant FAQ (the first entry in top_indices)
relevant_faq = faqs[top_indices[0]]
print(f"FAQ: {relevant_faq['faq']}")

FAQ: What should I do if I receive a damaged item? Please contact our customer service immediately with your order number and a photo of the damaged item.


Note: To use the dot product for comparison, ensure both vectors have the same shape (e.g. 768) and the same datatype (e.g. both are torch.float32).

## Similarity Measures: Dot Product and Cosine Similarity

Two of the most common similarity measures between vectors are dot product and cosine similarity.
In essence, closer vectors will have higher scores, and vectors further apart will have lower scores.
However, the scores are calculated differently and fall on different scales in each approach.

In [39]:
def normalize(vector):
    """
    To normalize a vector means to scale it so that it has a magnitude (length) of 1.
    The magnitude of a vector is the distance from the origin of the space (0, 0,..., 0) to the point where the vector ends.
    Normalization is done by dividing each component of the vector by its Euclidean (L2) norm.
    After normalization, the vector points in the same direction but has a magnitude of 1.
    """
    norm = torch.sqrt(torch.sum(vector**2))
    return vector / norm

def dot_product(vector1, vector2):
    return torch.dot(vector1, vector2)

# If the embeddings are already normalized, then don't use this, as it's just performing an unnecessary computation to try and normalize them again
def cosine_similarity(vector1, vector2):
    # Normalize the input vectors
    normalized_vector1 = normalize(vector1)
    normalized_vector2 = normalize(vector2)
    # Compute the dot product of the normalized vectors
    return torch.dot(normalized_vector1, normalized_vector2)

# Example vectors/tensors
vector1 = torch.tensor([1, 2, 3], dtype=torch.float32)
vector2 = torch.tensor([1, 2, 3], dtype=torch.float32)
vector3 = torch.tensor([4, 5, 6], dtype=torch.float32)
vector4 = torch.tensor([-1, -2, -3], dtype=torch.float32)

# Calculate dot product. Note: these vectors aren't normalized, so we're also taking into account the magnitude of each.
print("Dot product between vector1 and vector2 (not normalized):", dot_product(vector1, vector2))
print("Dot product between vector1 and vector3 (not normalized):", dot_product(vector1, vector3))
print("Dot product between vector1 and vector4 (not normalized):", dot_product(vector1, vector4))

# Cosine similarity
print("Cosine similarity between vector1 and vector2:", cosine_similarity(vector1, vector2))
print("Cosine similarity between vector1 and vector3:", cosine_similarity(vector1, vector3))
print("Cosine similarity between vector1 and vector4:", cosine_similarity(vector1, vector4))

Dot product between vector1 and vector2 (not normalized): tensor(14.)
Dot product between vector1 and vector3 (not normalized): tensor(32.)
Dot product between vector1 and vector4 (not normalized): tensor(-14.)
Cosine similarity between vector1 and vector2: tensor(1.0000)
Cosine similarity between vector1 and vector3: tensor(0.9746)
Cosine similarity between vector1 and vector4: tensor(-1.0000)


## Key Takeaways:
Think of a vector as an arrow. Its direction tells you where the arrow is pointing. Its magnitude tells you how long the arrow is.

### How Dot Product and Cosine Similarity Differ:
The two metrics differ only if the vectors are not normalized.
Dot Product takes into account both magnitude and direction of the vectors. Larger vectors (higher magnitudes) will yield a larger dot product, even if their directions are identical.
Cosine Similarity normalizes the vectors first, effectively removing the influence of magnitude and focusing purely on the directional alignment between the vectors.
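
For reference, the two measures can be written out using their standard definitions, with $\mathbf{a}$ and $\mathbf{b}$ as generic $n$-dimensional vectors:

$$
\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i
\qquad
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert_2 \, \lVert \mathbf{b} \rVert_2}
$$

As a worked check with the example vectors $[1, 2, 3]$ and $[4, 5, 6]$:

$$
\cos\big([1,2,3],\,[4,5,6]\big) = \frac{1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6}{\sqrt{1^2+2^2+3^2}\,\sqrt{4^2+5^2+6^2}} = \frac{32}{\sqrt{14}\,\sqrt{77}} \approx 0.9746
$$

The numerator is the dot product (32), and dividing by the two magnitudes is what brings the score into the range $[-1, 1]$. When both vectors already have magnitude 1, the denominator is 1 and the two measures coincide.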

In [34]:
# Get the embedding of the first FAQ
first_faq_embedding = embeddings_tensor[0]

# Calculate the L2 norm of the embedding
embedding_norm = torch.linalg.vector_norm(first_faq_embedding, ord=2)

# Check if the embedding is normalized (L2 norm should be approximately 1)
is_normalized = torch.isclose(embedding_norm, torch.tensor(1.0), atol=1e-6)

# Print the results
print(f"L2 norm of the first FAQ's embedding: {embedding_norm.item()}")
print(f"Is the embedding normalized? {'Yes' if is_normalized else 'No'}")

L2 norm of the first FAQ's embedding: 0.9999999403953552
Is the embedding normalized? Yes


# What next?

Now that we have our relevant documents, we can pass them alongside the user's query to the LLM.

![Generation](generation.png)
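
As a rough sketch of what "passing them alongside the user's query" might look like, the snippet below assembles a single prompt string from the retrieved FAQs and the query. The `build_prompt` helper and the prompt wording are assumptions made for illustration; the actual call to an LLM is left out, since it depends on the provider you use.

```python
def build_prompt(query: str, retrieved_faqs: list[str]) -> str:
    """Combine retrieved FAQ passages and the user's query into a single prompt string."""
    context = "\n".join(f"- {faq}" for faq in retrieved_faqs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


# The FAQ retrieved earlier for the query "item is damaged"
retrieved = [
    "What should I do if I receive a damaged item? Please contact our customer service "
    "immediately with your order number and a photo of the damaged item.",
]

prompt = build_prompt("item is damaged", retrieved)
print(prompt)
```

In the pipeline above, `retrieved` would be filled from `faqs` using `top_indices`, and the resulting string would be sent to the LLM as its input.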