# RAG Overview

Simple overview of RAG (Retrieval Augmented Generation) and how it works.

By the end we will have created a simple RAG system that can answer questions about methane emissions.

## Outline

- What is RAG and why?
- Building an Expert System
- Vectors
- Vector Databases
- Vector Search
- RAG System



## What is RAG?

RAG stands for Retrieval Augmented Generation. The idea is that you give an LLM references to use when answering a question, rather than relying on the LLM's internal knowledge.

To explain why and how, let's walk through a simple example. Let's build an LLM that answers questions about methane.

We will use the OpenAI API to create a methane expert.

First, here is one with no context:

In [None]:
from openai import OpenAI
from dotenv import load_dotenv
from IPython.display import display, Markdown

load_dotenv()

# The API key is added to the environment variable in .env
client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a methane expert, for answering methane related questions. Keep answers short. Make answer markdown.",
        },
        {
            "role": "user",
            "content": "What is AVO?",
        },
    ],
)
display(Markdown(completion.choices[0].message.content))

The answer does not give 'audio, visual, and olfactory'; instead, it makes something else up.

This is the main thing RAG tries to prevent.

Now we will add the Highwood glossary to the prompt and see what happens.

In [None]:
# Read the file glossary.md into a variable called glossary
with open("../glossary.md", "r") as f:
    glossary = f.read()

system_prompt = f"""
You are a methane expert, for answering methane related questions. Keep answers short. Make answer markdown.

Use only the provided references to answer the questions.

<References>
<Glossary>
{glossary}
</Glossary>
</References>
"""

# Create a client object
client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": "What is AVO?",
        },
    ],
)

display(Markdown(completion.choices[0].message.content))

Now the LLM gives an accurate answer. This is pretty much what RAG does: find references that answer the question, give them to the LLM, and tell the LLM to answer based on those references.

## The Catch
So once you have references, RAG is easy! The hard part is getting the right references.

LLMs have some limitations that make this task difficult.

### Tokens
Let's briefly talk about tokens, as understanding them is key to understanding the problem.

Tokens are the pieces an LLM breaks text into before processing it. A helpful rule of thumb is that one token generally corresponds to ~4 characters of common English text.

In [None]:
import tiktoken  # tiktoken is a Python library for converting text into tokens

enc = tiktoken.encoding_for_model("gpt-4o")
print("Hello,", enc.encode("Hello,"))
print("how", enc.encode("how"))
print("are", enc.encode("are"))
print("you?", enc.encode("you?"))
print("Hello, how are you?", enc.encode("Hello, how are you?"))

'Hello, how are you?' translates into 6 tokens. You'll notice that the tokens differ depending on whether a word is encoded on its own or as part of the full sentence. This is because the tokenizer usually treats a leading space as part of the token, so "how" on its own and " how" (with its leading space) inside the sentence get different IDs. We won't get into the details, but it's important to know that tokens are numerical IDs for pieces of text; the model learns what they mean from the context they appear in.

The main thing we need to overcome is the limited number of tokens you can send in an LLM prompt. With GPT-4o the limit is 128K tokens. That's a lot, but, for example, the subpart-w documents are > 300k tokens. So unless you have a small set of documents, you need to choose the documents that get fed to the LLM dynamically.

Methane expert has more than 10 million tokens, so you can't just load them all into the prompt.
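
As a rough illustration of this budgeting problem, here is a minimal sketch that estimates token counts with the ~4 characters per token rule of thumb and checks whether a set of documents would fit in a 128K-token prompt. The names `estimate_tokens` and `fits_in_context` and the `reserved_for_answer` parameter are hypothetical, and the character-based estimate is only an approximation; for real counts use `tiktoken`.

```python
import math

CONTEXT_LIMIT = 128_000   # GPT-4o limit quoted above, in tokens
CHARS_PER_TOKEN = 4       # rule of thumb for common English text


def estimate_tokens(text: str) -> int:
    """Roughly estimate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(documents: list[str], reserved_for_answer: int = 1_000) -> bool:
    """Return True if all documents, plus room for the answer, fit in the prompt."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total + reserved_for_answer <= CONTEXT_LIMIT


small_doc = "methane " * 1_000     # 8,000 characters, about 2,000 tokens
large_doc = "methane " * 200_000   # 1,600,000 characters, about 400,000 tokens

print(estimate_tokens(small_doc), fits_in_context([small_doc]))
print(estimate_tokens(large_doc), fits_in_context([large_doc]))
```

The small document fits and the large one does not, which is why the references have to be selected per question.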

## Search Engine
The R in RAG is 'retrieval'. This is a search engine.

This can be described as the following function.
`search_engine(question) -> documents[]`

This search engine can really be anything: a web search, a text search, a database query. All it needs to do is find a set of documents that are relevant to the question.
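
To make that interface concrete, here is a minimal sketch of the simplest kind of search engine: a plain keyword match. It is only an illustration. The `tokenize` helper, the extra `documents` and `limit` parameters, and the toy documents are all made up for this example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_engine(question: str, documents: list[str], limit: int = 2) -> list[str]:
    """Rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    scored = [(len(question_words & tokenize(doc)), doc) for doc in documents]
    # Keep only documents with at least one shared word, best match first
    matches = [pair for pair in scored if pair[0] > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in matches[:limit]]


docs = [
    "AVO stands for audio, visual, and olfactory.",
    "Methane traps heat in the atmosphere.",
    "Cats purr when they are content.",
]

print(search_engine("What is AVO?", docs))
print(search_engine("feline", docs))
```

The second query returns nothing even though one document is about cats, because keyword matching only sees the literal words and not their meaning.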

Modern LLM technology has given us a new and powerful way to do these searches. We can use a vector database to store the documents and then use a vector similarity algorithm to find the most relevant ones. This searches for things that are **semantically similar** to the question, rather than just looking for text that matches. Most RAG systems use a vector semantic search (including methane expert), so let's look into that.

A vector is a, well, vector representation of text, often called an embedding. It is similar to how LLMs represent text internally to do their fancy autocomplete. The good news is that OpenAI has an API that creates vectors from text for you!

In [None]:
response = client.embeddings.create(
    input="Here is some text!",
    model="text-embedding-3-small",
    dimensions=10
)

print(response.data[0].embedding)

Here the vector has 10 dimensions to keep the output readable. In practice, we use ~1000 dimensions.

Once text is vectorized, it's simply a matter of finding the vectors closest to each other. There are many ways to do this, but if there were only 2 dimensions, we could use a simple Euclidean distance: `d = sqrt((x1-x2)^2 + (y1-y2)^2)`.
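
As a small worked example of that distance formula, here is a sketch in plain Python that generalizes it to any number of dimensions and picks the closest of two toy points. The function name `euclidean_distance` and the sample points are made up for illustration.

```python
import math


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 1.0]
points = {"near": [2.0, 1.0], "far": [4.0, 5.0]}

for name, point in points.items():
    print(name, euclidean_distance(query, point))

# The closest point is the one with the smallest distance to the query
closest = min(points, key=lambda name: euclidean_distance(query, points[name]))
print("closest:", closest)
```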

Here we will use a more sophisticated method called the cosine similarity. The cosine similarity is a measure of how similar two vectors are. It is defined as:

cosine_similarity = dot_product / (magnitude1 * magnitude2)

The dot product is the sum of the products of the corresponding elements of the two vectors. The magnitudes are the square roots of the sum of the squares of the elements of the vectors.

The cosine similarity is a value between -1 and 1, where 1 means the vectors point in the same direction, 0 means they are perpendicular (unrelated), and -1 means they point in opposite directions.
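
To make the formula concrete, here is a minimal pure-Python sketch applied to toy 2-dimensional vectors. The name `cosine_similarity_plain` is made up to keep it separate from the NumPy version used on real embeddings.

```python
import math


def cosine_similarity_plain(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the magnitudes."""
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = math.sqrt(sum(x * x for x in a))
    magnitude_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (magnitude_a * magnitude_b)


v = [3.0, 4.0]

print(cosine_similarity_plain(v, [6.0, 8.0]))    # same direction, twice the length
print(cosine_similarity_plain(v, [-4.0, 3.0]))   # perpendicular
print(cosine_similarity_plain(v, [-3.0, -4.0]))  # opposite direction
```

Note that the first pair scores as a perfect match even though the vectors have different lengths: cosine similarity only compares direction, not magnitude.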

In [None]:
import numpy as np

def cosine_similarity(a_emb, b_emb):
    a = np.array(a_emb.embedding)
    b = np.array(b_emb.embedding)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

text_1 = "The dark feline purrs softly."
text_2 = "The black cat hums gently."
text_3 = "The golden cat purrs loudly."

response = client.embeddings.create(
    input=[text_1, text_2, text_3],
    model="text-embedding-3-small",
)

[v_1, v_2, v_3] = response.data

s_1_2 = cosine_similarity(v_1, v_2)
s_1_3 = cosine_similarity(v_1, v_3)
s_2_3 = cosine_similarity(v_2, v_3)

print(f"s_1_2 = {s_1_2}")
print(f"s_1_3 = {s_1_3}")
print(f"s_2_3 = {s_2_3}")

1 is the highest similarity score, so you can see that the first two sentences are the most similar: they don't share the same words, but their meanings are the same.

This is the advantage of the LLM vectors. Similar vectors mean the meanings are similar!

So this is all fine for an example, but how do you accomplish this at scale?

### Vector Database
A vector database is a specialized database for storing and querying vectors.

The idea is you:
1. Convert documents into vectors.
2. Store the vectors in a database.
3. Convert queries into vectors.
4. Query the database for the closest vectors to the query vector.
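
The four steps above can be sketched with a tiny in-memory store. This is only an illustration under loud assumptions: `toy_embed` is a made-up stand-in for a real embedding model (it just counts words from a small fixed vocabulary), and `ToyVectorStore` does a brute-force scan, whereas real vector databases typically rely on indexes to stay fast at scale.

```python
import math

# Hypothetical stand-in for an embedding model: counts of a tiny fixed vocabulary
VOCABULARY = ["methane", "leak", "cat", "purr", "sensor"]


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCABULARY]


def cosine(a: list[float], b: list[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


class ToyVectorStore:
    def __init__(self) -> None:
        self.rows: list[tuple[list[float], str]] = []

    def add(self, document: str) -> None:
        # Steps 1 and 2: convert the document into a vector and store it
        self.rows.append((toy_embed(document), document))

    def search(self, query: str, limit: int = 1) -> list[str]:
        # Step 3: convert the query into a vector
        query_vector = toy_embed(query)
        # Step 4: rank the stored vectors by similarity to the query vector
        ranked = sorted(
            self.rows, key=lambda row: cosine(query_vector, row[0]), reverse=True
        )
        return [document for _, document in ranked[:limit]]


store = ToyVectorStore()
store.add("a sensor detected a methane leak")
store.add("the cat began to purr")

print(store.search("methane sensor"))
```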

In practice, we implement this with the pgvector extension for PostgreSQL. But for this example we will use something called LanceDB, which is just a simple file-system based vector database.

#### Documents to Vectors
The first step in building the vector database is converting the documents into vectors, as we did in the previous example. But now let's try it with some real documents.

In [None]:
with open("../voluntary.md", "r") as f:
    voluntary = f.read()

embeddings = client.embeddings.create(
    input=[glossary, voluntary],
    model="text-embedding-3-small",
)

Okay, there was an error here: `This model's maximum context length is 8192 tokens, however you requested 38750 tokens.` Basically, the maximum number of tokens we can vectorize in one input is 8192, but we have a bigger document!

How do we deal with this? Well, here is the second catch: we need to break the documents into smaller chunks.

Aside from just getting the API to work, this is also important because it gives us small chunks of the documents to use when answering users' questions.

I kinda soft-sold this part. It's actually probably the hardest part of a RAG system. You need to split the document into smaller chunks, but you probably want to keep related parts together. So where and how do you split?

There are several strategies for splitting a document. You could do a recursive split: split in the middle, then split again, and again, until everything fits. But that way you might split in the middle of a word or sentence. You could try to split on periods, line breaks, or other characters. Really, the options here are endless.
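
One way to combine these ideas is sketched below: a recursive splitter that cuts near the middle, but prefers a paragraph break, then a line break, then a sentence end, then a space, and only falls back to a hard cut when none is found. This is a hypothetical illustration, not how the section files used here were produced. It measures chunk size in characters rather than tokens to stay dependency-free, and `split_text`, `SEPARATORS`, and `max_chars` are made-up names.

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]


def split_text(text: str, max_chars: int, separators: list[str] = SEPARATORS) -> list[str]:
    """Recursively split text until every chunk has at most max_chars characters."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]

    middle = len(text) // 2
    cut = middle  # fallback: hard split in the middle, which may break a word
    for separator in separators:
        # Look for this separator on both sides of the middle
        before = text.rfind(separator, 0, middle)
        after = text.find(separator, middle)
        cuts = [pos + len(separator) for pos in (before, after) if pos != -1]
        cuts = [c for c in cuts if c < len(text)]
        if cuts:
            # Cut just after the occurrence closest to the middle
            cut = min(cuts, key=lambda c: abs(c - middle))
            break

    return split_text(text[:cut], max_chars, separators) + split_text(
        text[cut:], max_chars, separators
    )


sample = (
    "Leaks are detected by sensors. Sensors need calibration.\n\n"
    "Reports are filed yearly. Audits follow."
)

for chunk in split_text(sample, max_chars=40):
    print(repr(chunk))
```

With a 40-character limit, the sample is first cut at the paragraph break, and the first paragraph is then cut again at the sentence boundary, so no sentence is broken in half.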

For the sake of this example, I have already split the text into sections and put them in the voluntary_sections.json and glossary_sections.json files.

In [None]:
# load the sections
import json

with open("../voluntary_sections.json", "r") as f:
    voluntary_sections = json.load(f)

with open("../glossary_sections.json", "r") as f:
    glossary_sections = json.load(f)


all_sections = [section['content'] for section in voluntary_sections] + [section['content'] for section in glossary_sections]

all_sections[0:5]

In [None]:
# now try and vectorize it again
embeddings_req = client.embeddings.create(
    input=all_sections,
    model="text-embedding-3-small",
)

embeddings = [e.embedding for e in embeddings_req.data]

print(embeddings[0])

In [None]:
import lancedb

uri = "../data/sample-lancedb"
db = lancedb.connect(uri)

# Build one row per section: its embedding vector plus the original text
data = [
    {"vector": vector, "content": content}
    for vector, content in zip(embeddings, all_sections)
]

# mode="overwrite" replaces the table if it already exists
tbl = db.create_table("me", data=data, mode="overwrite")

Now that we have the database created, querying is as simple as converting the question to a vector and then doing a similarity search.

In [None]:
question = 'explain MIQ'

question_embedding = client.embeddings.create(
    input=[question],
    model="text-embedding-3-small",
)

question_vector = question_embedding.data[0].embedding

tbl.search(question_vector).limit(1).to_list()[0]['content']

Sweet. Looks like we got the search engine working. Now let's make a function to handle the whole search process.

In [None]:
def search(query: str, results: int = 1) -> list[str]:

    question_embedding = client.embeddings.create(
        input=[query],
        model="text-embedding-3-small",
    )

    question_vector = question_embedding.data[0].embedding

    rows = tbl.search(question_vector).limit(results).to_list()

    return [row['content'] for row in rows]

search('explain MIQ')

Now we put the whole thing together: a simple function that takes a question, searches for matching documents, then sends those to the LLM to generate a response.

In [None]:
def rag(text):

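    # Retrieve a few sections rather than one, so questions that span topics have enough references
    search_results = search(text, results=3)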

    search_results_str = '\n'.join(search_results)

    system_prompt = f"""
        You are a methane expert, for answering methane related questions. Keep answers short. Make answer markdown.

        Use only the provided references to answer the questions.

        <References>
        {search_results_str}
        </References>
    """

    # Create a client object
    client = OpenAI()

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": text,
            },
        ],
    )

    display(Markdown(completion.choices[0].message.content))


In [None]:
rag("What is MIQ?")
rag("Compare Veritas and MIQ")