__RAG__

This notebook shows a simple example of Retrieval-Augmented Generation (RAG). The goal is to answer questions using user-specific information, so that a general-purpose LLM is enhanced with access to private data it was not trained on.

Retrieve 
- When a user asks a question, a private database is searched for additional relevant information

Augment
- The gathered information is then used to "augment" the original question by providing more context


Generate 
- The context + question is sent to an LLM, which in turn can generate a more accurate answer



The next cell imports packages needed for this example.

In [60]:
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline
import numpy as np

Private data can be stored in different ways. For this simple example we will store our private database as text strings. 

In [61]:
knowledge_base = [
    "Pablo once hiked Angel's Landing in Zion National Park. It is one of the deadliest hikes in the US.",
    "Pablo studied abroad in Rome, Italy where he took classical art history courses.",
    "Pablo enjoys playing tennis, pickleball, soccer and hiking. He plays tennis at the local courts every Thursday.",
    "Pablo has a pet dog, a beagle that he adopted from a shelter in 2018.",
    "Pablo enjoys reggaeton music and has seen Pitbull in concert twice.",
]

print(f"Knowledge base created with {len(knowledge_base)} documents.")

Knowledge base created with 5 documents.


To capture the meaning of the text, we need to represent it as embeddings, as seen in the previous project. For this example, we will use a pretrained model to create these embeddings.

In [62]:
# Load a pre-trained model for creating embeddings
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings for our knowledge base
knowledge_base_embeddings = embedding_model.encode(knowledge_base, convert_to_tensor=True)

print("Embeddings created for the knowledge base.")
print("Shape of the embeddings tensor:", knowledge_base_embeddings.shape)

Embeddings created for the knowledge base.
Shape of the embeddings tensor: torch.Size([5, 384])


Below are the embeddings created.

In [63]:
knowledge_base_embeddings

tensor([[ 0.0815,  0.0336, -0.0262,  ..., -0.0824, -0.0435, -0.0410],
        [ 0.0509, -0.0035,  0.0167,  ...,  0.0033, -0.0289, -0.0178],
        [ 0.1076,  0.0102,  0.0461,  ..., -0.0350, -0.0333,  0.0059],
        [-0.0010, -0.0361,  0.0553,  ...,  0.0012,  0.0890,  0.0519],
        [ 0.0871, -0.0851,  0.0171,  ..., -0.0459,  0.1151, -0.0166]])

The cell below poses the question "What type of activities does Pablo enjoy?"

We then compare the question's embedding to the embeddings of the documents in our private database. The goal is to return the document most relevant to the question.
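To make the comparison concrete, here is a minimal sketch of cosine similarity on toy 3-dimensional vectors. The vectors and the helper name `cosine_similarity` are illustrative assumptions, not values produced by the embedding model; `util.cos_sim` applies the same idea to the real 384-dimensional embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (close to 1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embeddings (made up for illustration)
question = [1.0, 0.0, 1.0]
doc_similar = [2.0, 0.0, 2.0]    # same direction as the question, different length
doc_unrelated = [0.0, 1.0, 0.0]  # orthogonal to the question

print(cosine_similarity(question, doc_similar))    # close to 1
print(cosine_similarity(question, doc_unrelated))  # close to 0
```

Because the score depends only on direction and not on vector length, a long document and a short question can still score highly when they point the same way in embedding space.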



In [64]:
# what would be a better example question to ask for this knowledge base?
user_question = "What type of activities does Pablo enjoy?"

# 1. Create an embedding for the user's question
question_embedding = embedding_model.encode(user_question, convert_to_tensor=True)

# 2. Calculate cosine similarity between the question and all knowledge base documents
cos_scores = util.cos_sim(question_embedding, knowledge_base_embeddings)[0]

# 3. Find the document with the highest score
top_result = torch.argmax(cos_scores)
retrieved_context = knowledge_base[top_result]

print(f"User Question: {user_question}")
print(f"Most relevant document found (Score: {cos_scores[top_result]:.4f}):")
print("---")
print(retrieved_context)

User Question: What type of activities does Pablo enjoy?
Most relevant document found (Score: 0.7002):
---
Pablo enjoys playing tennis, pickleball, soccer and hiking. He plays tennis at the local courts every Thursday.


This simple example shows how documents are selected based on questions in RAG.

Next we will ask the LLM the same question without providing any additional information from our private database. 

In [65]:
# The original question without any context
plain_prompt = f"Question: {user_question}\n\nAnswer:"

print("--- PLAIN PROMPT (NO RAG) ---")
print(plain_prompt)

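# Load the text-generation model; it must be defined before the first call to generator
generator = pipeline('text2text-generation', model='google/flan-t5-small', torch_dtype=torch.bfloat16)

# Generate the answer
result_no_rag = generator(plain_prompt, max_new_tokens=50, num_return_sequences=1, do_sample=False)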

print("\n--- MODEL'S ANSWER (NO RAG) ---")
print(result_no_rag[0]['generated_text'])

--- PLAIN PROMPT (NO RAG) ---
Question: What type of activities does Pablo enjoy?

Answer:

--- MODEL'S ANSWER (NO RAG) ---
scuba diving



Since the model has no information about Pablo's favorite activities, it responds with a hallucination. In this case, the model guessed that I enjoy scuba diving, which is false.


Next we will ask the LLM the same question, this time providing it with the most relevant document from our private database.

In [66]:
generator = pipeline('text2text-generation', model='google/flan-t5-small', torch_dtype=torch.bfloat16)

# Augment the prompt with the retrieved context - simplified format for T5
augmented_prompt = f"""Answer the question based on the context provided.

Context: {retrieved_context}

Question: {user_question}

Answer:"""

print("--- AUGMENTED PROMPT ---")
print(augmented_prompt)

# Generate the answer
# We set max_new_tokens to control only the generated portion
result = generator(augmented_prompt, max_new_tokens=50, num_return_sequences=1, do_sample=False)

print("\n--- MODEL'S ANSWER (WITH RAG) ---")
print(result[0]['generated_text'])

Device set to use cpu


--- AUGMENTED PROMPT ---
Answer the question based on the context provided.

Context: Pablo enjoys playing tennis, pickleball, soccer and hiking. He plays tennis at the local courts every Thursday.

Question: What type of activities does Pablo enjoy?

Answer:

--- MODEL'S ANSWER (WITH RAG) ---
playing tennis, pickleball, soccer and hiking



As shown above, the model with RAG responds correctly, suggesting that I like playing tennis, pickleball, soccer and hiking. 

## Additional Considerations

This simple example demonstrates the core concept of RAG. In applications such as ChatGPT, the retrieval corpus may include millions of files from websites, databases, PDFs, and more.

In our example we used a simple similarity function to find the most relevant document. However, there are many other approaches, such as:
  - Hybrid search (keyword + semantic)
  - Multiple retrieval steps
  - Re-ranking retrieved documents
  - Filtering by date, source, or relevance
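As one illustration of the first item, the sketch below blends a keyword-overlap score with a semantic score and returns the top results. The function names `keyword_score` and `hybrid_rank`, the weight `alpha`, and the semantic scores are all assumptions made for this example; in the notebook the semantic scores would come from `util.cos_sim`.

```python
import re
import numpy as np

def keyword_score(query, document):
    """Fraction of distinct query words that also appear in the document."""
    query_words = set(re.findall(r"\w+", query.lower()))
    doc_words = set(re.findall(r"\w+", document.lower()))
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)

def hybrid_rank(query, documents, semantic_scores, alpha=0.5, top_k=2):
    """Blend semantic and keyword scores; return the top_k (index, score) pairs."""
    keyword_scores = np.array([keyword_score(query, d) for d in documents])
    combined = alpha * np.asarray(semantic_scores, dtype=float) + (1 - alpha) * keyword_scores
    order = np.argsort(combined)[::-1][:top_k]  # highest combined score first
    return [(int(i), float(combined[i])) for i in order]

docs = [
    "Pablo enjoys playing tennis, pickleball, soccer and hiking.",
    "Pablo has a pet dog, a beagle.",
    "Pablo enjoys reggaeton music.",
]
# Placeholder semantic scores (made up); real ones would be cosine similarities
semantic = [0.30, 0.55, 0.25]

ranked = hybrid_rank("Does Pablo have a beagle?", docs, semantic)
for index, score in ranked:
    print(f"{score:.3f}  {docs[index]}")
```

Setting `alpha` closer to 1 favors the semantic score, while a value closer to 0 favors exact keyword matches.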


In our example each document contains only a sentence or two, but in real-world applications documents often include many paragraphs. RAG systems employ the following techniques to help retrieve the correct information:
  - Break large documents into smaller "chunks" (paragraphs or sections)
  - Overlap chunks to maintain context
  - Store metadata (source, date, author) with each chunk
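The chunking steps above can be sketched as follows. This is a minimal word-based version; the function name `chunk_text`, the file name `travel_notes.txt`, and the chunk sizes are hypothetical choices for illustration, and real systems often chunk by tokens, sentences, or sections instead.

```python
def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

# Hypothetical document; metadata is stored alongside each chunk
document = {"source": "travel_notes.txt", "text": " ".join(f"word{i}" for i in range(12))}
chunks = [
    {"source": document["source"], "chunk_id": i, "text": chunk}
    for i, chunk in enumerate(chunk_text(document["text"], chunk_size=5, overlap=2))
]

for chunk in chunks:
    print(chunk)
```

Each chunk would then be embedded and searched individually, and the stored metadata makes it possible to cite or filter by the original source.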

In our example, our private database is static, but real-world RAG systems typically include the following to strengthen the generated responses:
  - Continuously updated databases
  - Real-time web searches
  - Fresh information retrieval

In our example, we asked one question, retrieved one document, and provided one answer. Real applications employ the following measures to solve complex questions and utilize multiple sources:
  - Break complex questions into sub-questions
  - Multiple retrieval rounds
  - Synthesize information from multiple sources
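As a small step toward using multiple sources, the sketch below places the top-scoring documents, rather than only the single best one, into the prompt. The helper name `build_multi_context_prompt` and the similarity scores are assumptions made for illustration; the prompt layout follows the augmented prompt used earlier in this notebook.

```python
def build_multi_context_prompt(question, documents, scores, top_k=2):
    """Put the top_k highest-scoring documents into a single prompt."""
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    context = "\n".join(f"- {doc}" for _, doc in ranked[:top_k])
    return (
        "Answer the question based on the context provided.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )

docs = [
    "Pablo enjoys playing tennis, pickleball, soccer and hiking. He plays tennis at the local courts every Thursday.",
    "Pablo has a pet dog, a beagle that he adopted from a shelter in 2018.",
    "Pablo enjoys reggaeton music and has seen Pitbull in concert twice.",
]
# Placeholder similarity scores (made up); real ones would come from the retrieval step
scores = [0.62, 0.48, 0.20]

prompt = build_multi_context_prompt(
    "What sports does Pablo play, and does he have a pet?", docs, scores
)
print(prompt)
```

The resulting string can be passed to the same `generator` call used above, although a small model may struggle to combine several context passages.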