Authors: Dr. Erhan Guven & Vy Vu<br>

## Retrieval-Augmented Generation (RAG)

In this example, we build and test a simple RAG-LLM. RAG-LLM, short for Retrieval-Augmented Generation with a Large Language Model, typically has two main components:
1. Retriever: This component identifies relevant context or documents in an external data source. It often uses embeddings to find and rank content by similarity to the user’s query. Common choices include dense retrievers built on embedding models such as Sentence Transformers (often paired with a vector index such as FAISS) and sparse, keyword-based retrievers such as BM25. The retriever lets the RAG-LLM pull in specific, relevant information to help the generative model produce accurate responses.
2. Generator: This is usually a large language model, such as GPT or T5, that takes the retrieved context and the user’s query as input and generates a coherent, contextually relevant answer. The generator relies on the retriever-provided context to produce detailed, accurate answers to specific questions.
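
To make the two roles concrete, here is a minimal sketch of the retrieve-then-generate flow. It is not the implementation used later in this notebook: `embed` is a toy bag-of-words stand-in for a real embedding model, and `toy_generator` is a hypothetical placeholder for the LLM call.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real encoder)."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, documents):
    """Retriever: return the document most similar to the question."""
    q = embed(question)
    return max(documents, key=lambda doc: cosine(q, embed(doc)))

def toy_generator(prompt):
    """Hypothetical generator: a real system would call an LLM here."""
    return f"[LLM would answer using a prompt of {len(prompt)} characters]"

documents = [
    "Tallahassee is the state capital of Florida.",
    "Alaska is the largest state in the United States by area.",
]
question = "What is the capital of Florida?"
context = retrieve(question, documents)          # retriever step
prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"
print(context)
print(toy_generator(prompt))                     # generator step
```

The retriever only decides which text reaches the prompt; the generator never sees the rest of the collection.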

The example below includes:
* External data source: the JSON file `train-v2.0.json`, the training split of SQuAD (Stanford Question Answering Dataset) v2.0, a popular dataset for training and evaluating machine learning models on reading comprehension and question-answering tasks. It was released by Stanford and is widely used in natural language processing (NLP) to benchmark question-answering systems (https://rajpurkar.github.io/SQuAD-explorer/)
* Retriever: the `Sentence Transformers` model `all-MiniLM-L6-v2` (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
* Large Language Model (LLM): `Llama-3.2-1B-Instruct` (https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 72
from IPython.display import display
import numpy as np
import os

MODEL_PATH= '/EP_models/'
os.environ['HF_HOME'] = MODEL_PATH  # before import transformers
os.environ['HF_DATASETS_OFFLINE']= '1'

In [2]:
import json
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import torch
import transformers
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# filter warnings
import warnings
transformers.logging.set_verbosity_error()
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

Device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

print(f'PyTorch version= {torch.__version__}')
print(f'transformers version= {transformers.__version__}')
print(f'CUDA available= {torch.cuda.is_available()}')

PyTorch version= 2.4.1
transformers version= 4.45.2
CUDA available= True


In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Explore SQuAD training data structure
external_data_path = './IntroToGenAI'
squad_json = os.path.join(external_data_path, "train-v2.0.json")

with open(squad_json, 'r') as f:
    squad_data = json.load(f)

n_titles = len(squad_data['data'])
print(f"There are {n_titles} titles/topics in SQuAD v2.0 training dataset.\n")

title_idx = 200
prg_idx = 2
current_title = squad_data['data'][title_idx]['title']
n_paragraphs = len(squad_data['data'][title_idx]['paragraphs'])
n_qas = len(squad_data['data'][title_idx]['paragraphs'][prg_idx]['qas'])
print(f"Title with index #{title_idx} is {squad_data['data'][title_idx]['title']}.")
print(f"Title {current_title} has {n_paragraphs} paragraphs. Paragraph with index {prg_idx} has {n_qas} question-answer pairs.")
#docs = [entry['context'] for entry in squad_data['data'][0]['paragraphs']]
#squad_data['data']

There are 442 titles/topics in SQuAD v2.0 training dataset.

Title with index #200 is Florida.
Title Florida has 35 paragraphs. Paragraph with index 2 has 10 question-answer pairs.


In [5]:
#squad_data['data'][200]['paragraphs'][0]['qas']
squad_data['data'][200]['paragraphs'][0]['context']

'Florida i/ˈflɒrɪdə/ (Spanish for "flowery land") is a state located in the southeastern region of the United States. The state is bordered to the west by the Gulf of Mexico, to the north by Alabama and Georgia, to the east by the Atlantic Ocean, and to the south by the Straits of Florida and the sovereign state of Cuba. Florida is the 22nd most extensive, the 3rd most populous, and the 8th most densely populated of the United States. Jacksonville is the most populous city in Florida, and the largest city by area in the contiguous United States. The Miami metropolitan area is the eighth-largest metropolitan area in the United States. Tallahassee is the state capital.'

Since the LLM we use in this example, `meta-llama/Llama-3.2-1B-Instruct`, can process an entire context paragraph, we store only two key pieces of information: topics and contexts. For now, we do not process the `qas` (question-answer) section.
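
For reference, the nested layout explored above can be sketched with a hand-made miniature record. The values below are illustrative and not copied from the file; the field names (`data`, `title`, `paragraphs`, `context`, `qas`, `question`, `answers`, `is_impossible`) follow the SQuAD v2.0 format. The sketch flattens topic-context pairs and also shows how the `qas` section could be walked if it were needed later.

```python
# Hand-made miniature record in the SQuAD v2.0 layout (illustrative values only)
mini_squad = {
    "data": [
        {
            "title": "Florida",
            "paragraphs": [
                {
                    "context": "Tallahassee is the state capital.",
                    "qas": [
                        {
                            "id": "example-1",  # hypothetical id
                            "question": "What is the capital of Florida?",
                            "is_impossible": False,
                            "answers": [{"text": "Tallahassee", "answer_start": 0}],
                        },
                        {
                            "id": "example-2",  # hypothetical id
                            "question": "What is the capital of Texas?",
                            "is_impossible": True,  # v2.0 includes unanswerable questions
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ]
}

# Flatten to (topic, context) rows
rows = [
    (article["title"], paragraph["context"])
    for article in mini_squad["data"]
    for paragraph in article["paragraphs"]
]

# Walk the qas section and keep only answerable questions
answerable = [
    (qa["question"], qa["answers"][0]["text"])
    for article in mini_squad["data"]
    for paragraph in article["paragraphs"]
    for qa in paragraph["qas"]
    if not qa["is_impossible"]
]

print(rows)
print(answerable)
```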

In [6]:
with open(squad_json, 'r') as f:
    squad_data = json.load(f)

data = {
    "topic": [],
    "context": []
}

for i in range(n_titles):
    
    n_paragraphs = len(squad_data['data'][i]['paragraphs'])

    for j in range(n_paragraphs):
        data["topic"].append(squad_data['data'][i]['title'])
        data["context"].append(squad_data['data'][i]['paragraphs'][j]['context'])
    

df = pd.DataFrame(data)   
print(f"Shape of df: {df.shape}")
df.head()

Shape of df: (19035, 2)


| | topic | context |
|---|---|---|
| 0 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... |
| 1 | Beyoncé | Following the disbandment of Destiny's Child i... |
| 2 | Beyoncé | A self-described "modern-day feminist", Beyonc... |
| 3 | Beyoncé | Beyoncé Giselle Knowles was born in Houston, T... |
| 4 | Beyoncé | Beyoncé attended St. Mary's Elementary School ... |


In [7]:
#access_key = "..."

In [8]:
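from huggingface_hub import login

# `access_key` must hold a valid Hugging Face token; define it in the cell above before running this one
login(access_key)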

In [9]:
import logging
logging.getLogger("transformers").setLevel(logging.ERROR)  # Set logging level to ERROR to suppress warnings

# filter warnings
import warnings
transformers.logging.set_verbosity_error()
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

In [10]:
class RAGLLM:
    def __init__(self, external_data):

        self.st_model = SentenceTransformer('all-MiniLM-L6-v2')  # Sentence transformer is used for embeddings
        
        # External data
        self.external_data = external_data
        self.titles = self.external_data['topic'].unique()
        self.title_embeddings = self.st_model.encode(self.titles.tolist())  # Embed each topic/title

        # LLM: meta-llama/Llama-3.2-1B-Instruct, placed on the GPU when available (see `Device`)
        self.generator = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct", max_length=4096, device=Device)

    def generate_answer(self, user_question):
        # Step 1: Find the best matching topic
        question_embedding = self.st_model.encode([user_question])
        title_similarities = cosine_similarity(question_embedding, self.title_embeddings)
        best_title_index = np.argmax(title_similarities)
        best_title = self.titles[best_title_index]
        print(f"The best topic: {best_title}\n")

        # Step 2: Filter contexts for the identified best topic
        topic_contexts = self.external_data[self.external_data['topic'] == best_title]['context'].tolist()

        # Step 3: Vectorize each context under the best topic and find the best matching context
        context_embeddings = self.st_model.encode(topic_contexts)
        context_similarities = cosine_similarity(question_embedding, context_embeddings)
        best_context_index = np.argmax(context_similarities)
        best_context = topic_contexts[best_context_index]
        print(f"The best paragraph: {best_context}\n")

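        # Step 4: Generate the answer with Llama-3.2-1B-Instruct using the retrieved context
        prompt = f"Context: {best_context}\n\nQuestion: {user_question}\nAnswer:"
        # Note: max_length counts prompt tokens too, so a long context leaves fewer tokens for the answer
        response = self.generator(prompt, max_length=300)
        answer_text = response[0]['generated_text']  # prompt followed by the model's continuation
        answer = answer_text.split("Answer: ")[-1]  # keep only the text after the last "Answer: "
        return answer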
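
The prompt assembly and answer extraction in Step 4 can be isolated into small helpers. This is a sketch under assumptions: `build_prompt` and `extract_answer` are hypothetical helper names, and the generated text is assumed to begin with the prompt, as it does when the full text is returned. Slicing off the prompt keeps the whole continuation even if the model itself writes the string "Answer: " again.

```python
def build_prompt(context, question):
    """Assemble the same prompt layout used in generate_answer."""
    return f"Context: {context}\n\nQuestion: {question}\nAnswer:"

def extract_answer(generated_text, prompt):
    """Return only the continuation that follows the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fallback: keep everything after the first "Answer:" marker
    return generated_text.split("Answer:", 1)[-1].strip()

prompt = build_prompt("Tallahassee is the state capital.", "What is the capital of Florida?")
# Simulated model output (prompt plus a continuation), not a real generation
simulated = prompt + " Tallahassee\nExplanation: Answer: it is stated in the context."
print(extract_answer(simulated, prompt))
```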


In [11]:
query = "Who is the US president in 2012?"
ragllm = RAGLLM(df)
output = ragllm.generate_answer(query)
for part in output.split("\n\n"):
    print(part)

The best topic: United_States_presidential_election,_2004

The best paragraph: Just eight months into his presidency, the terrorist attacks of September 11, 2001 suddenly transformed Bush into a wartime president. Bush's approval ratings surged to near 90%. Within a month, the forces of a coalition led by the United States entered Afghanistan, which had been sheltering Osama bin Laden, suspected mastermind of the September 11 attacks. By December, the Taliban had been removed as rulers of Kabul, although a long and ongoing reconstruction would follow, severely hampered by ongoing turmoil and violence within the country.

Barack Obama
Explanation: The question asks about the US president in 2012, which is 10 years after the September 11 attacks and 8 months into Barack Obama's presidency. Given the context, the answer is Barack Obama.


<div style="background-color: #F1E0D6; padding: 10px; border-radius: 5px;">
Analysis: The answer, Barack Obama, is correct, but it is not supported by the retrieved context. The best topic is related to the question, although it covers the 2004 election while the question asks about 2012, and the best paragraph mentions only President Bush. The model's explanation is also flawed: it attributes the paragraph's "eight months into his presidency" to Obama.
</div>

In [12]:
query = "What is the largest state in USA by area?"
ragllm = RAGLLM(df)
output = ragllm.generate_answer(query)
for part in output.split("\n\n"):
    print(part)

The best topic: Geography_of_the_United_States

The best paragraph: By total area (water as well as land), the United States is either slightly larger or smaller than the People's Republic of China, making it the world's third or fourth largest country. China and the United States are smaller than Russia and Canada in total area, but are larger than Brazil. By land area only (exclusive of waters), the United States is the world's third largest country, after Russia and China, with Canada in fourth. Whether the US or China is the third largest country by total area depends on two factors: (1) The validity of China's claim on Aksai Chin and Trans-Karakoram Tract. Both these territories are also claimed by India, so are not counted; and (2) How US calculates its own surface area. Since the initial publishing of the World Factbook, the CIA has updated the total area of United States a number of times.

Canada.
Explanation: Canada is the third largest state in the United States by area, wit

<div style="background-color: #F1E0D6; padding: 10px; border-radius: 5px;">
Analysis: The output is incorrect. The best topic looks relevant, but the best paragraph is off target. The chosen topic may also not be the most relevant one: the Geography_of_the_United_States topic discusses the geography of the US as a country, not of its individual states. The information about the largest US state by area is likely under the Alaska topic.
</div>
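
One way to check whether the chosen topic was really the closest match is to look at the top few candidates instead of only the `argmax`. The sketch below assumes similarity scores have already been computed; the scores are made-up illustrative values, not results from this notebook, and `top_k` is a hypothetical helper.

```python
def top_k(labels, scores, k=3):
    """Return the k (label, score) pairs with the highest scores, best first."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up scores for illustration only
titles = ["Geography_of_the_United_States", "Alaska", "Florida", "Appalachian_Mountains"]
scores = [0.52, 0.49, 0.31, 0.27]

for title, score in top_k(titles, scores, k=3):
    print(f"{score:.2f}  {title}")
```

When the top scores are close together, a runner-up topic is a reasonable place to look for a better paragraph.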

<div style="background-color:skyblue; padding: 10px; border-radius: 5px;">
Question: In the model above (class RAGLLM), should we skip Step 1, where it calculates the title similarity to find the best topic, and instead directly calculate the context (paragraph) similarities to achieve more accurate output? Why or why not?
</div>

In [13]:
query = "Which state is larger by area in the USA: Alaska or Texas?"
ragllm = RAGLLM(df)
output = ragllm.generate_answer(query)
for part in output.split("\n\n"):
    print(part)

The best topic: Alaska

The best paragraph: Alaska (i/əˈlæskə/) is a U.S. state situated in the northwest extremity of the Americas. The Canadian administrative divisions of British Columbia and Yukon border the state to the east while Russia has a maritime border with the state to the west across the Bering Strait. To the north are the Chukchi and Beaufort Seas, the southern parts of the Arctic Ocean. To the south and southwest is the Pacific Ocean. Alaska is the largest state in the United States by area, the 3rd least populous and the least densely populated of the 50 United States. Approximately half of Alaska's residents (the total estimated at 738,432 by the Census Bureau in 2015) live within the Anchorage metropolitan area. Alaska's economy is dominated by the fishing, natural gas, and oil industries, resources which it has in abundance. Military bases and tourism are also a significant part of the economy.

Alaska is larger by area in the USA. Alaska has a total area of approxi

<div style="background-color: #F1E0D6; padding: 10px; border-radius: 5px;">
Analysis: The output is correct. Both the best topic and the best paragraph from the external data source are on point.
</div>

In [14]:
query = "What is the population of Florida?"
ragllm = RAGLLM(df)
output = ragllm.generate_answer(query)
for part in output.split("\n\n"):
    print(part)

The best topic: Florida

The best paragraph: The United States Census Bureau estimates that the population of Florida was 20,271,272 on July 1, 2015, a 7.82% increase since the 2010 United States Census. The population of Florida in the 2010 census was 18,801,310. Florida was the seventh fastest-growing state in the U.S. in the 12-month period ending July 1, 2012. In 2010, the center of population of Florida was located between Fort Meade and Frostproof. The center of population has moved less than 5 miles (8 km) to the east and approximately 1 mile (1.6 km) to the north between 1980 and 2010 and has been located in Polk County since the 1960 census. The population exceeded 19.7 million by December 2014, surpassing the population of the state of New York for the first time.

20,271,272
Note: The population of Florida was estimated to be 20,271,272 as of July 1, 2015, according to the United States Census Bureau.


<div style="background-color: #F1E0D6; padding: 10px; border-radius: 5px;">
Analysis: The output is correct. The question is general and does not specify a date, so the answer depends on how recent the external data source is. The SQuAD 2.0 dataset was last updated in 2018, so it makes sense that the output gives the 2015 population estimate for Florida. Both the best topic and the best paragraph are relevant to the question.
</div>

<div style="background-color:skyblue; padding: 10px; border-radius: 5px;">
Question: If we ask the same question about a year later than 2015 (for example, "What is the population of Florida in 2022?"), what do you think the output will be? Explain why.
</div>

In [15]:
query = "Where is Mount Rushmore?"
ragllm = RAGLLM(df)
output = ragllm.generate_answer(query)
for part in output.split("\n\n"):
    print(part)

The best topic: Appalachian_Mountains

The best paragraph: In Pennsylvania, there are over sixty summits that rise over 2,500 ft (800 m); the summits of Mount Davis and Blue Knob rise over 3,000 ft (900 m). In Maryland, Eagle Rock and Dans Mountain are conspicuous points reaching 3,162 ft (964 m) and 2,882 ft (878 m) respectively. On the same side of the Great Valley, south of the Potomac, are the Pinnacle 3,007 feet (917 m) and Pidgeon Roost 3,400 ft (1,000 m). In West Virginia, more than 150 peaks rise above 4,000 ft (1,200 m), including Spruce Knob 4,863 ft (1,482 m), the highest point in the Allegheny Mountains. A number of other points in the state rise above 4,800 ft (1,500 m). Snowshoe Mountain at Thorny Flat 4,848 ft (1,478 m) and Bald Knob 4,842 ft (1,476 m) are among the more notable peaks in West Virginia.

Mount Rushmore is located in South Dakota, in the Black Hills.


<div style="background-color: #F1E0D6; padding: 10px; border-radius: 5px;">
Analysis: The output is correct. However, the best topic and the best paragraph in the external dataset are not relevant.
</div>
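
A common safeguard for cases like this is to check the retrieval score before trusting the context. The sketch below is an illustration under assumptions and is not part of the `RAGLLM` class: `MIN_SIMILARITY` is a hypothetical cutoff that would need tuning, `choose_prompt` is a hypothetical helper, and the scores and shortened contexts are made up.

```python
MIN_SIMILARITY = 0.4  # hypothetical cutoff; tune on real data

def choose_prompt(question, context, similarity, threshold=MIN_SIMILARITY):
    """Use the retrieved context only when its similarity clears the threshold."""
    if similarity >= threshold:
        return f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    # Weak match: fall back to the question alone so the model is not misled
    return f"Question: {question}\nAnswer:"

# Made-up similarity values and placeholder contexts for illustration
weak = choose_prompt("Where is Mount Rushmore?", "placeholder paragraph about Appalachian peaks", 0.21)
strong = choose_prompt("What is the population of Florida?", "placeholder paragraph about Florida census", 0.78)
print(weak)
print(strong)
```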

<div style="background-color:skyblue; padding: 10px; border-radius: 5px;">
Question: Can you guess why the best topic and the best context (paragraph) are not relevant, yet the model can still generate the correct output?
</div>

#### Analysis and Observations
The main purpose of the example above is to demonstrate a simple implementation of Retrieval-Augmented Generation (RAG) with a Large Language Model (LLM). While this approach is useful, there are areas for improvement. Key points include:
1. External Dataset Update: The SQuAD 2.0 training dataset was last updated in 2018, while the Llama-3.2-1B-Instruct model was released in September 2024. As a result, the pre-trained LLM may contain more recent information than the dataset.
2. Dataset Limitations: The SQuAD 2.0 training dataset has gaps, which can cause the LLM to rely on its pre-trained knowledge instead of the external dataset for some outputs. This reflects the constraints of relying solely on static external sources.
3. Efficiency of Updates: In practice, updating an external dataset is usually cheaper and faster than retraining an LLM. This efficiency contributes to the growing popularity and practicality of RAG-LLMs.
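
Point 3 can be illustrated with a toy in-memory index: new facts are added by appending a document and its vector, with no change to the generator. This is a sketch with hand-written 2-D vectors standing in for real embeddings; `TinyIndex` is a hypothetical class, not a library API, and the document texts are placeholders.

```python
import math

class TinyIndex:
    """Toy vector index: stores (text, vector) pairs and searches by cosine similarity."""

    def __init__(self):
        self.texts = []
        self.vectors = []

    def add(self, text, vector):
        """Updating the knowledge base is just an append; no model retraining."""
        self.texts.append(text)
        self.vectors.append(vector)

    def search(self, query_vector):
        """Return the stored text whose vector is closest to the query."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))
        scores = [cosine(query_vector, v) for v in self.vectors]
        return self.texts[scores.index(max(scores))]

index = TinyIndex()
index.add("Florida population estimate, 2015", [1.0, 0.1])
query = [0.6, 0.8]  # hand-written vector standing in for an embedded question
before = index.search(query)

# Hypothetical update: a newer document is indexed, the generator is untouched
index.add("Florida population estimate, newer figure", [0.5, 0.9])
after = index.search(query)
print(before)
print(after)
```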

### Why Is RAG-LLM Better Than an LLM Alone?

Using RAG-LLM (Retrieval-Augmented Generation with Large Language Models) is often better than using an LLM alone because RAG combines the strengths of both retrieval-based systems and generative language models. Here's why it can be advantageous:
1. Up-to-Date Knowledge: Accesses external, real-time data; LLMs rely on static, pretraining knowledge.
2. Improved Accuracy: Grounds responses in retrieved documents, reducing hallucinations.
3. Domain-Specific Flexibility: Tailored to specific domains by indexing relevant data.
4. Reduced Training Costs: Updates only the retrieval database, avoiding LLM retraining.
5. Transparency: Provides references or sources for answers, enhancing trust.
6. Cost Efficiency: Uses smaller models with retrieval for similar or better performance.

<div style="background-color: #F1E0D6; padding: 10px; border-radius: 5px;">
In the example below, we illustrate Domain-Specific Flexibility. In the SQuAD dataset, the 'Florida' topic states that 'Florida' is Spanish for 'flowery land'. This information is quite specific, and the pre-training data of Llama may cover it less directly. Therefore, an LLM alone may not answer the question correctly, whereas a RAG-LLM receives the paragraph that contains the answer and is better placed to use it, for example in the Explanation part of its output.
</div>

In [17]:
from transformers import AutoModelForCausalLM, AutoTokenizer

class LLaMAQA:
    def __init__(self, model_name="meta-llama/Llama-3.2-1B-Instruct"):
        
        self.model_name = model_name
        print(f"Loading model '{model_name}'...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        print("Model loaded successfully!")

    def ask_question(self, question, max_length=50, temperature=0.7, top_p=0.9, top_k=50):
        """
        Ask a question and get a response from the model.
        Parameters:
            question (str): The question to ask.
            max_length (int): Maximum total length in tokens (prompt plus response).
            temperature (float): Sampling temperature for randomness.
            top_p (float): Nucleus sampling parameter.
            top_k (int): Top-k sampling parameter.
        Returns:
            str: The model's response.
        """
        inputs = self.tokenizer(question, return_tensors="pt")
        
        # Generate the response
        outputs = self.model.generate(
            **inputs,  # passes input_ids together with the attention_mask
            max_length=max_length,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            do_sample=True
        )
        
        # Decode and return the response
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

if __name__ == "__main__":
    llama_qa = LLaMAQA()
    question = "What does 'Florida' mean in Spanish?"
    response = llama_qa.ask_question(question)
    print("Q:", question)
    print("A:", response)

Loading model 'meta-llama/Llama-3.2-1B-Instruct'...
Model loaded successfully!
Q: What does 'Florida' mean in Spanish?
A: What does 'Florida' mean in Spanish? It's a state, but it's also a city in the United States. So, I'm trying to figure out the meaning of 'Florida' in Spanish.

Is 'Florida' a state,


In [18]:
query = "What does 'Florida' mean in Spanish?"
ragllm = RAGLLM(df)
output = ragllm.generate_answer(query)
for part in output.split("\n\n"):
    print(part)

The best topic: Florida

The best paragraph: Florida i/ˈflɒrɪdə/ (Spanish for "flowery land") is a state located in the southeastern region of the United States. The state is bordered to the west by the Gulf of Mexico, to the north by Alabama and Georgia, to the east by the Atlantic Ocean, and to the south by the Straits of Florida and the sovereign state of Cuba. Florida is the 22nd most extensive, the 3rd most populous, and the 8th most densely populated of the United States. Jacksonville is the most populous city in Florida, and the largest city by area in the contiguous United States. The Miami metropolitan area is the eighth-largest metropolitan area in the United States. Tallahassee is the state capital.

'Florida' is a state located in the southeastern region of the United States.
Explanation: The Spanish word 'Florida' is derived from the name of the Spanish explorer Juan Ponce de León, who is said to have named the state in honor of Saint Florián, the patron saint of the Spani

<div style="background-color:skyblue; padding: 10px; border-radius: 5px;">
Question: What are the trade-offs between using RAG-LLM for specialized tasks versus using a general LLM (like GPT) for the same task? When would you choose one over the other?
</div>