# RAG Example

To start, let's set up GPT-2 and its tokenizer.

In [1]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt_model = GPT2LMHeadModel.from_pretrained('gpt2')
gpt_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


Next, set up an embedding model that will encode our content for the vector database. We use Sentence-Transformers to create the embeddings and FAISS to store and search them.

In [2]:
from sentence_transformers import SentenceTransformer
from tqdm.autonotebook import tqdm, trange # overcome notebook issue
# sample documents
documents = [
    "RAG combines retrieval-based methods with generation-based models for improved text generation.",
    "It retrieves relevant information from a large corpus to enhance the generation process.",
    "By using FAISS, we efficiently search over the document embeddings.",
    "GPT models are commonly used for generating natural language responses.",
    "Sentence-Transformers generate high-quality document embeddings."
]

# convert documents into embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(documents)


Now we will add the document embeddings to a FAISS index.

In [3]:
import faiss
import numpy as np

d = embeddings.shape[1]  # dimension of embeddings
index = faiss.IndexFlatL2(d)  # L2 distance metric
index.add(np.array(embeddings))  # add embeddings to the index

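To make the "L2 distance metric" concrete, here is a minimal NumPy sketch of what an exact, brute-force L2 search computes. The function name `brute_force_l2_search` and the toy 2-D vectors are hypothetical and only for illustration; FAISS itself is far more optimized. Like `IndexFlatL2`, the sketch reports squared L2 distances.

```python
import numpy as np

def brute_force_l2_search(corpus, queries, k):
    """Return (squared distances, indices) of the k nearest corpus rows per query."""
    corpus = np.asarray(corpus, dtype=np.float32)
    queries = np.asarray(queries, dtype=np.float32)
    # Pairwise squared L2 distances, shape (n_queries, n_corpus)
    diffs = queries[:, None, :] - corpus[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Keep the k closest corpus vectors per query, nearest first
    order = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order

# Toy 2-D vectors standing in for real embeddings (illustrative values only)
toy_corpus = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
toy_query = np.array([[0.9, 0.1]])

toy_distances, toy_indices = brute_force_l2_search(toy_corpus, toy_query, k=2)
print(toy_indices)  # [[1 0]]
```

The return shape mirrors `index.search`: one row per query, with the nearest neighbour in the first column.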

We are now ready to ask the LLM questions.
This happens in the following steps:
1. Ask our question (_'query'_)
2. Create an embedding for our query. This will allow the vector database to search for similar documents.
3. Run the search in the vector database.
4. The search returns indices into our `documents` list, pointing at the documents the database found to be most relevant.
5. We now know what documents to pass as context to the LLM.
6. We combine our original query with the documents we identified to be relevant, and send them to the LLM.
7. The LLM will read our documents (_'context'_) and compose a response to our query.  

In [4]:
# embed our query
query = "What is Retrieval-Augmented Generation?"
query_embedding = model.encode(query)

# Run similarity search with FAISS
k = 3  # number of top results to retrieve
distances, indices = index.search(np.array([query_embedding]), k)

# Print indices of the retrieved documents
print("Indices of retrieved documents:", indices)

Indices of retrieved documents: [[1 0 3]]


FAISS returns zero-based indices ordered from nearest to farthest, so this result means that the second, first, and fourth documents, in that order, were the closest matches for our query.

We will now grab those documents from our list.

In [5]:
# Retrieve actual document text based on FAISS indices
retrieved_documents = [documents[i] for i in indices[0]]
print("Retrieved documents:", retrieved_documents)


Retrieved documents: ['It retrieves relevant information from a large corpus to enhance the generation process.', 'RAG combines retrieval-based methods with generation-based models for improved text generation.', 'GPT models are commonly used for generating natural language responses.']


We now need to combine these documents with our query so we can send them to the LLM as a single block.

In [6]:
# Combine the query and retrieved documents into a single input for GPT-2
input_text = query + "\n" + "\n".join(retrieved_documents)
print("Input for GPT-2:", input_text)


Input for GPT-2: What is Retrieval-Augmented Generation?
It retrieves relevant information from a large corpus to enhance the generation process.
RAG combines retrieval-based methods with generation-based models for improved text generation.
GPT models are commonly used for generating natural language responses.

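GPT-2 is a plain text-completion model, so it simply continues whatever input it receives. One common variation is to label the context and the question explicitly and end the prompt with an answer cue. The sketch below assumes a hypothetical helper `build_prompt`; the template wording is an illustrative choice, not something GPT-2 requires, and it is not guaranteed to improve the output.

```python
def build_prompt(question, context_docs):
    """Assemble a labelled prompt: context first, then the question, then an answer cue."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Example with two of the sample documents from this notebook
example_docs = [
    "It retrieves relevant information from a large corpus to enhance the generation process.",
    "RAG combines retrieval-based methods with generation-based models for improved text generation.",
]
example_prompt = build_prompt("What is Retrieval-Augmented Generation?", example_docs)
print(example_prompt)
```

In this notebook, `build_prompt(query, retrieved_documents)` could stand in for the string concatenation used to create `input_text`.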

With the input in hand, we tokenize it with the GPT-2 tokenizer and pass the result to the model to generate a response.

In [7]:
# GPT-2 has no padding token by default, so reuse the end-of-sequence token
gpt_tokenizer.pad_token = gpt_tokenizer.eos_token

# Tokenize the input with padding
inputs = gpt_tokenizer(input_text, return_tensors='pt', padding=True, truncation=True)

# This will provide both input_ids and attention_mask
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']

# Generate response with the attention mask and padded input
output = gpt_model.generate(
    input_ids,
    attention_mask=attention_mask,  # Add the attention mask here
    max_length=200,  # total length in tokens, prompt included
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    do_sample=True,
    pad_token_id=gpt_tokenizer.eos_token_id  # Ensure pad token is set to eos_token
)

# Decode the generated output
generated_response = gpt_tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Response:", generated_response)

Generated Response: What is Retrieval-Augmented Generation?
It retrieves relevant information from a large corpus to enhance the generation process.
RAG combines retrieval-based methods with generation-based models for improved text generation.
GPT models are commonly used for generating natural language responses. This provides a unique opportunity to use Retrieves, or Retrieve-Based Generative Learning, and to gain a better understanding of the underlying mechanisms behind the response.

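The generated response repeats the whole prompt because `generate` on a decoder-only model such as GPT-2 returns the prompt tokens followed by the continuation. To show only the model's own text, the output can be sliced at the prompt length before decoding. The sketch below illustrates the slicing on plain Python lists; the helper name `split_generated` and the token ids are hypothetical placeholders, not real GPT-2 ids.

```python
def split_generated(output_ids, prompt_length):
    """Split a generated sequence into (prompt tokens, newly generated tokens)."""
    ids = list(output_ids)
    return ids[:prompt_length], ids[prompt_length:]

# Placeholder token ids, for illustration only
prompt_ids = [101, 102, 103]
full_output = prompt_ids + [7, 8, 9]

prompt_part, new_part = split_generated(full_output, len(prompt_ids))
print(new_part)  # [7, 8, 9]

# In this notebook the same idea would look like:
#   new_ids = output[0][input_ids.shape[1]:]
#   answer = gpt_tokenizer.decode(new_ids, skip_special_tokens=True)
```

Slicing by token count avoids relying on the decoded text matching the original prompt string character for character.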