# Retrieval Augmented Generation (RAG) with a Large Language Model (LLM)

This notebook implements RAG for a TinyLlama model and compares several retrieval and query-prompting techniques: Multi-Query document retrieval, Hypothetical Document Embeddings (HyDE), adding contextual guidance to the prompt, and varying the number of retrieved documents.

At the end, a fine-tuned LLM is used for sentiment analysis on a few sentences, to highlight a different type of task that LLMs can perform.

## Setting up the libraries and the environment

In [1]:
!pip install datasets transformers sentence-transformers langchain langchain_community faiss-cpu torch


In [29]:
from datasets import load_dataset
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

## Data Preprocessing and Model Selection

In [None]:
# Computer Science Question/Answer dataset: https://huggingface.co/datasets/August4293/CS_QA
# Combine the question/answer pairs into documents
cs_qa_raw = load_dataset("August4293/CS_QA", split="train")
documents = [f"Q: {pair['question']} A: {pair['answer']}" for pair in cs_qa_raw]
print(len(documents))

# Generate tokens for the documents
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokens = tokenizer(documents, truncation=True, padding=True, return_tensors="pt")
print(tokenizer.decode(tokens['input_ids'][0], skip_special_tokens=True))
print(tokens['input_ids'].shape)

# Chunk the data before vectorizing and storing
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunked_docs = splitter.create_documents(documents)
chunks = [doc.page_content for doc in chunked_docs]
print("Number of chunks:", len(chunks))

# Embed the chunks; the embeddings are stored in a FAISS vector index below
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = embedding_model.encode(chunks)
print(embeddings.shape)

# Create and save index
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings.astype(np.float32))
index_doc_map = {i: doc for i, doc in enumerate(chunked_docs)}
faiss.write_index(index, "csqa_index.faiss")

798
Q: What is supervised learning? A: Supervised learning is a machine learning paradigm where the algorithm learns from labeled training data, making predictions or decisions based on input-output pairs.
torch.Size([798, 78])
Number of chunks: 798
(798, 384)


## Implementing RAG using LangChain for different queries

The main components of RAG with LangChain are as follows:
- Query Translation:
  - Augment the query to make it more useful for retrieval.
  - Rephrase, break down, or abstract the query, or generate hypothetical documents.
- Indexing:
  - Store vector embeddings of the documents efficiently in a vector store.
  - Offline step that stores documents for later retrieval.
  - Important because vector embeddings make retrieving relevant documents efficient.
- Retrieval:
  - Retrieve data that is similar to the translated query based on its embedding.
  - Includes ranking by relevance to select the most relevant information for responding to the query.
- Generation:
  - Generate a response given the retrieved information and the query.
  - The generated response can also be fed back in a feedback loop to inform further generation and improve the response.

Other components exist such as:
- Routing:
  - Involves deciding on which data stores to query for information given the translated prompt.
- Query construction:
  - Involves constructing queries for the chosen data stores involved with RAG, based on the translated prompt.
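
As a rough illustration of the retrieval step, the sketch below does by hand what an exact L2 index does: it compares a query vector against every stored vector and returns the `k` closest. The function name `l2_top_k` and the toy vectors are made up for this example; FAISS `IndexFlatL2` likewise reports squared L2 distances, but is far more optimized.

```python
import numpy as np

def l2_top_k(query_vec, doc_vecs, k):
    """Return (squared distances, indices) of the k stored vectors closest to query_vec."""
    diffs = doc_vecs - query_vec              # broadcast the query over every row
    sq_dists = np.sum(diffs ** 2, axis=1)     # squared Euclidean distance per document
    order = np.argsort(sq_dists)[:k]          # smallest distance first
    return sq_dists[order], order

# Toy 2-D "embeddings" (hypothetical values, for illustration only)
doc_vecs = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]], dtype=np.float32)
query_vec = np.array([0.9, 0.1], dtype=np.float32)

dists, idxs = l2_top_k(query_vec, doc_vecs, k=2)
print(idxs)   # document 1 is closest, then document 0
```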

In [None]:
# Load model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32, device_map="auto")

In [None]:
# RAG pipeline
def rag_response(query, index, embedding_model, llm_model, llm_tokenizer, index_to_doc_map, top_k=3):
    """
    Generate a response using the RAG pattern.

    Args:
        query: The user's question
        index: FAISS index
        embedding_model: Model to create embeddings
        llm_model: Language model for generation
        llm_tokenizer: Tokenizer for the language model
        index_to_doc_map: Mapping from index positions to document chunks
        top_k: Number of documents to retrieve

    Returns:
        response: The generated response
    """
    # Create query embedding
    query_embedding = embedding_model.encode([query]).astype(np.float32)

    # Get top k docs
    distances, indices = index.search(query_embedding, top_k)
    retrieved_docs = [index_to_doc_map[idx] for idx in indices[0]]

    # Generate context from retrieved documents
    context = "\n\n".join([doc.page_content for doc in retrieved_docs])

    # Create prompt for the model
    prompt = f"""<|system|>
You are a helpful AI assistant. Answer the question based only on the provided context.
If you don't know the answer based on the context, say "I don't have enough information to answer this question."

Context:
{context}
<|user|>
{query}
<|assistant|>"""

    # Generate response
    input_ids = llm_tokenizer(prompt, return_tensors="pt").input_ids.to(llm_model.device)

    generation_config = {
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.95,
        "do_sample": True
    }

    # Generate the output
    with torch.no_grad():
        output = llm_model.generate(
            input_ids=input_ids,
            **generation_config
        )

    # Decode the output
    generated_text = llm_tokenizer.decode(output[0], skip_special_tokens=True)

    # Extract the model's response
    response = generated_text.split("<|assistant|>")[-1].strip()

    return response

In [None]:
# Relevant queries to the CompSci dataset
queries = [
    "What is reinforcement learning in machine learning?",
    "What are some differences between reinforcement learning and supervised learning?",
    "What are some benefits of reinforcement learning over deep reinforcement learning?"
]

for question in queries:
    print(f"Question: {question}\n")

    response = rag_response(
        query=question,
        index=index,
        embedding_model=embedding_model,
        llm_model=model,
        llm_tokenizer=tokenizer,
        index_to_doc_map=index_doc_map,
        top_k=5
    )

    print(f"Response: {response}\n")
    print("=" * 50)

Question: What is reinforcement learning in machine learning?



Response: Reinforcement learning is a type of machine learning that is used to train a decision-making agent to learn from its experiences and make decisions based on the rewards it receives. The agent receives feedback in the form of rewards or penalties based on its actions. Reinforcement learning is commonly used in applications such as robotics, finance, and games.

Question: What are some differences between reinforcement learning and supervised learning?

Response: Here are some differences between reinforcement learning and supervised learning:

1. Learning Goals: Reinforcement learning aims to learn the optimal policy for a given task, which is different from supervised learning that aims to learn a specific label for each input-output pair.

2. Input-Output Pair: Reinforcement learning learns based on the current state, whereas supervised learning learns based on the current input-output pair.

3. Strategy or Policy: Reinforcement learning focuses on the policy, while supervis

The LLM was prompted with 3 queries. The strengths and weaknesses of each response are analyzed below.
- What is reinforcement learning in machine learning?
  - This question closely matches questions and information in the CS/QA dataset that I am using to augment the LLM. The definition that the LLM provides has no clear flaws and gives a general description of reinforcement learning, likely because the query is so similar to the retrieved documents.
- What are some differences between reinforcement learning and supervised learning?
  - The first point is that reinforcement learning learns an optimal policy for a given task, while supervised learning learns labels for input-output pairs. This is true but quite vague. Still, it correctly identifies the difference in learning goals.
  - It also says that reinforcement learning learns from state while supervised learning learns from input-output pairs. This is also vague; I would say that reinforcement learning learns primarily from the reward it receives for the actions it takes in a state. Given how short the answers in the CS/QA dataset are, the LLM is likely reaching because it simply does not have the information to answer the question properly. That said, it is not completely incorrect.
  - It then makes some partially true statements, such as that reinforcement learning focuses on policy whereas supervised learning focuses on strategy.
  - It struggles and hallucinates a bit when comparing the 2 ML methods. It says that reinforcement learning uses labelled data and supervised learning uses unlabelled data, which is clearly wrong. This is the distinction between supervised learning and unsupervised learning. I believe it matched the query to a document that explains the difference between supervised and unsupervised learning, and got confused there.
- What are some benefits of reinforcement learning over deep reinforcement learning?
  - For this query, the LLM made some valid points and some hallucinations. For example, it states that reinforcement learning is more interpretable, which is definitely true. It also states that reinforcement learning is better at complex tasks, which is quite a general statement, but it is historically true that RL methods that are specific to a task tend to perform better than general Deep RL methods. However, it also states that reinforcement learning is more general than Deep RL, which is simply not true, as RL often needs task-specific feature engineering to work properly.
  - Overall, the LLM may have found some vague information on RL and Deep RL to compare, but it seems likely that it is mistakenly using information about Deep Learning in general and comparing that to reinforcement learning or to other types of machine learning.

## Modify and evaluate the different components of RAG

In [None]:
# Multi-Query document retrieval:
def multi_query_rag_response(query, index, embedding_model, llm_model, llm_tokenizer, index_to_doc_map, top_k=3):
    # Multi-query augmentation:
    # build additional variants of the query from fixed templates
    query_variants = [query, f"What does {query} mean?", f"Reword {query} in a computer science context"]
    all_embeddings = embedding_model.encode(query_variants).astype(np.float32)

    # Retrieve top_k documents for every variant in one batched search,
    # then drop duplicates while keeping the retrieval order
    _, idxs = index.search(all_embeddings, top_k)
    retrieved_indices = list(dict.fromkeys(int(i) for i in idxs.ravel()))
    retrieved_docs = [index_to_doc_map[idx] for idx in retrieved_indices]

    # SAME AS BASELINE RAG vvvvvvvvvvvvvvvvvvvv
    context = "\n\n".join([doc.page_content for doc in retrieved_docs])
    prompt = f"""<|system|>
You are a helpful AI assistant. Answer the question based only on the provided context.
If you don't know the answer based on the context, say "I don't have enough information to answer this question."

Context:
{context}
<|user|>
{query}
<|assistant|>"""

    # Generate response
    input_ids = llm_tokenizer(prompt, return_tensors="pt").input_ids.to(llm_model.device)

    generation_config = {
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.95,
        "do_sample": True
    }

    # Generate the output
    with torch.no_grad():
        output = llm_model.generate(
            input_ids=input_ids,
            **generation_config
        )

    # Decode the output
    generated_text = llm_tokenizer.decode(output[0], skip_special_tokens=True)

    # Extract the model's response
    response = generated_text.split("<|assistant|>")[-1].strip()

    return response

# HyDE RAG:
def hypthetical_doc_rag_response(query, index, embedding_model, llm_model, llm_tokenizer, index_to_doc_map, top_k=3):
    # Generate a hypothetical document with the baseline RAG pipeline, then use that document's embedding to search for real documents
    hypo_doc_query = f"Write a short explanation or summary for the following question: {query}"
    hypo_doc = rag_response(hypo_doc_query, index, embedding_model, llm_model, llm_tokenizer, index_to_doc_map, top_k)
    query_embedding = embedding_model.encode([hypo_doc]).astype(np.float32)

    # SAME AS BASELINE RAG vvvvvvvvvvvvvvvvvvvv
    # Get top k docs
    distances, indices = index.search(query_embedding, top_k)
    retrieved_docs = [index_to_doc_map[idx] for idx in indices[0]]

    # Generate context from retrieved documents
    context = "\n\n".join([doc.page_content for doc in retrieved_docs])

    # Create prompt for the model
    prompt = f"""<|system|>
You are a helpful AI assistant. Answer the question based only on the provided context.
If you don't know the answer based on the context, say "I don't have enough information to answer this question."

Context:
{context}
<|user|>
{query}
<|assistant|>"""

    # Generate response
    input_ids = llm_tokenizer(prompt, return_tensors="pt").input_ids.to(llm_model.device)

    generation_config = {
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.95,
        "do_sample": True
    }

    # Generate the output
    with torch.no_grad():
        output = llm_model.generate(
            input_ids=input_ids,
            **generation_config
        )

    # Decode the output
    generated_text = llm_tokenizer.decode(output[0], skip_special_tokens=True)

    # Extract the model's response
    response = generated_text.split("<|assistant|>")[-1].strip()

    return response

In [23]:
for question in queries:
    print(f"Question: {question}\n")

    mq_response = multi_query_rag_response(
        query=question,
        index=index,
        embedding_model=embedding_model,
        llm_model=model,
        llm_tokenizer=tokenizer,
        index_to_doc_map=index_doc_map,
        top_k=5
    )

    print(f"Multi-Query Response: {mq_response}\n")

    hyde_response = hypthetical_doc_rag_response(
        query=question,
        index=index,
        embedding_model=embedding_model,
        llm_model=model,
        llm_tokenizer=tokenizer,
        index_to_doc_map=index_doc_map,
        top_k=5
    )

    print(f"HyDE Response: {hyde_response}\n")
    print("=" * 50)

Question: What is reinforcement learning in machine learning?



Multi-Query Response: Reinforcement learning is a type of machine learning that uses a feedback loop to learn and improve an agent's behavior in an environment. The agent receives feedback in the form of rewards or penalties based on its actions, which can help the agent make better decisions. Reinforcement learning is a subset of machine learning, and it is often used in domains where there are a lot of decision-making or control tasks that require a high level of automation or precision.

HyDE Response: Reinforcement learning is a type of machine learning (ML) methodology that enables an ML model to learn from experience and make decisions based on the rewards or punishments associated with those decisions. In reinforcement learning, an agent takes actions in an environment to maximize a reward signal over time, which is then used to update its decision-making process. Reinforcement learning can be applied in various domains, such as navigation, control, and decision-making.

Questio

Both Multi-Query (MQ) document retrieval and Hypothetical Document Embeddings (HyDE) performed at least as well as the baseline, and their responses were often more specific and more accurate. For example, both go into more depth when describing RL: MQ describes the feedback loop, and HyDE describes the agent's interactions with the environment quite well. Unfortunately, both methods still hallucinated at times. I believe this is more a limit of the dataset combined with the specificity of the questions than of the methods themselves. MQ can gather up to 3x as many documents as the baseline or HyDE, so it is typically able to include more information than HyDE. That said, HyDE seems to retrieve the highest-quality information, likely because matching a hypothetical document against stored documents (document-to-document similarity) improves the documents retrieved for the LLM to contextualize a query. Finally, both MQ and HyDE hallucinated less than the baseline on the more difficult final 2 questions.
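
Because the query variants often retrieve overlapping chunks, Multi-Query gathers at most 3 × `top_k` distinct documents rather than exactly that many. A minimal sketch of an order-preserving merge, using made-up index lists in place of real FAISS results (the helper name `merge_ranked` is hypothetical):

```python
def merge_ranked(result_lists):
    """Merge several ranked lists of chunk ids, dropping duplicates but keeping first-seen order."""
    merged = []
    seen = set()
    for results in result_lists:
        for chunk_id in results:
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged

# Hypothetical top-3 results for the original query and two rephrasings
per_query_results = [[12, 7, 40], [7, 12, 3], [40, 55, 7]]
merged = merge_ranked(per_query_results)
print(merged)       # [12, 7, 40, 3, 55]
print(len(merged))  # 5 distinct chunks instead of 9
```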

In [None]:
# Modified baseline RAG, prompt template giving more guidance
def prompt_rag_response(query, index, embedding_model, llm_model, llm_tokenizer, index_to_doc_map, top_k=3):
    # Create query embedding
    query_embedding = embedding_model.encode([query]).astype(np.float32)

    # Get top k docs
    distances, indices = index.search(query_embedding, top_k)
    retrieved_docs = [index_to_doc_map[idx] for idx in indices[0]]

    # Generate context from retrieved documents
    context = "\n\n".join([doc.page_content for doc in retrieved_docs])

    # Create prompt for the model
    prompt = f"""<|system|>
You are a Computer Science information retrieval assistant. Answer the question based on the provided context.
If you don't know the answer based on the context, say "I don't have enough information to answer this question."
If you have enough information to give some context to the user, but not enough to make an exhaustive list, then simply give a short list that you are confident in.
The questions will be about machine learning. Make sure you do not mix up concepts such as supervised learning, unsupervised learning and reinforcement learning, which are all different.

Context:
{context}
<|user|>
{query}
<|assistant|>"""

    # Generate response
    input_ids = llm_tokenizer(prompt, return_tensors="pt").input_ids.to(llm_model.device)

    generation_config = {
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.95,
        "do_sample": True
    }

    # Generate the output
    with torch.no_grad():
        output = llm_model.generate(
            input_ids=input_ids,
            **generation_config
        )

    # Decode the output
    generated_text = llm_tokenizer.decode(output[0], skip_special_tokens=True)

    # Extract the model's response
    response = generated_text.split("<|assistant|>")[-1].strip()

    return response

for question in queries:
    print(f"Question: {question}\n")

    response = prompt_rag_response(
        query=question,
        index=index,
        embedding_model=embedding_model,
        llm_model=model,
        llm_tokenizer=tokenizer,
        index_to_doc_map=index_doc_map,
        top_k=5
    )

    print(f"Response: {response}\n")
    print("=" * 50)

Question: What is reinforcement learning in machine learning?



Response: Reinforcement learning (RL) in machine learning is a type of machine learning algorithm that allows an agent to learn how to take actions in an environment to maximize a reward signal over time. In RL, an agent receives feedback in the form of rewards or penalties based on its actions. The agent's goal is to learn to take actions that lead to the highest possible reward.

Question: What are some differences between reinforcement learning and supervised learning?

Response: 1. Type of feedback: Reinforcement learning uses feedback in the form of rewards or penalties, while supervised learning uses feedback in the form of labels.

2. Decision-making process: Reinforcement learning determines the optimal actions for a given state based on the current state and rewards, while supervised learning selects the optimal actions for a given state based on the training data.

3. Learning algorithm: Reinforcement learning uses a policy gradient algorithm, while supervised learning uses a

The additional guidance in the prompt improved the responses to the reinforcement learning questions considerably compared with the other methods so far. The description of RL is similar to the other methods' responses, and every method so far has answered it correctly. The next 2 questions have been more difficult for the methods explored so far. The response to the second question contained some hallucinations, with the LLM stating that supervised learning selects actions, but overall it did quite well. It even brought up gradient descent vs. Monte Carlo methods and model interpretability, which are both valid points. Finally, the last question was answered quite well. The question is meant to trap the LLM, since Deep RL and RL each have their advantages and disadvantages. The LLM recognized this and listed both the advantages and disadvantages of RL and Deep RL without any major hallucinations.
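
The baseline and the guided variant differ only in the system message, so the prompt can be assembled by one small helper. This is a sketch rather than the notebook's own code: `build_prompt` and the shortened system strings are illustrative, and the tag layout simply mirrors the prompts used in the functions above.

```python
def build_prompt(system_message, context, query):
    """Assemble a chat-style prompt with the same tag layout used in this notebook."""
    return (
        "<|system|>\n"
        f"{system_message}\n\n"
        "Context:\n"
        f"{context}\n"
        "<|user|>\n"
        f"{query}\n"
        "<|assistant|>"
    )

# Shortened, illustrative system messages (not the full text used in the cells above)
BASELINE_SYSTEM = "You are a helpful AI assistant. Answer the question based only on the provided context."
GUIDED_SYSTEM = (
    "You are a Computer Science information retrieval assistant. "
    "Do not mix up supervised, unsupervised and reinforcement learning."
)

prompt = build_prompt(GUIDED_SYSTEM, "Q: ... A: ...", "What is reinforcement learning?")
print(prompt)
```

Keeping the system message as a parameter makes it easy to compare prompt variants while holding retrieval and generation settings fixed.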

In [None]:
# Test different amounts of documents retrieved.
for question in queries:
    print(f"Question: {question}\n")
    for k in [1, 5, 10]:
        print(f"k = {k}")

        response = rag_response(
            query=question,
            index=index,
            embedding_model=embedding_model,
            llm_model=model,
            llm_tokenizer=tokenizer,
            index_to_doc_map=index_doc_map,
            top_k=k
        )

        print(f"Response: {response}\n")
        print("=" * 50)

Question: What is reinforcement learning in machine learning?

k = 1


Response: Reinforcement learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment. The agent receives feedback in the form of rewards or penalties based on its actions. In reinforcement learning, the agent's goal is to maximize the total reward it receives by taking actions that lead to the highest possible reward. In other words, the agent's goal is to maximize its expected cumulative reward.

k = 5
Response: Reinforcement learning (RL) is a machine learning technique that enables machines to learn and improve their behavior through interactions with the environment. In RL, an agent or machine receives feedback in the form of rewards or punishments, which it uses to update its policy or actions based on its current state and actions. This process is repeated continuously until the agent learns to perform optimally in the environment.

k = 10
Response: Reinforcement learning is a type of machine learning that is based on the

Increasing the number of retrieved documents seems to improve the responses up to a point, after which they become relatively nonsensical. With just 1 document, the LLM can answer simple questions such as defining RL, but it falls short on questions that ask for a deeper comparison. For example, it did not recognize that supervised learning was meant in the general sense, and instead treated it as some sort of offline RL or produced other hallucinations. As the document count increased, the answers to the comparative questions contained more information, but at 10 documents the LLM seemed to add information that was less relevant to the question. This was unhelpful and made the response worse than the 5-document response.

In all, the baseline LLM with 5 retrieved documents gives a decent description of RL, but its performance degrades on the more complex comparative questions, where it often hallucinates.

The Multi-Query technique performed better than the baseline in that it seemed able to add more information to its responses. This decreased hallucinations and added more context for the reader. The HyDE technique performed best of the three. It seemed to gather higher-quality context to base its response on, and it hallucinated less than the baseline. It also included valid points slightly more often than MQ.

The prompt engineering technique, where extra guidance in the prompt makes the LLM more task specific, provided the largest jump in quality over the baseline. It answered the RL description question, as expected, and it also answered the second question with only a few pain points. Finally, it navigated the final query without falling into the trap of the query's implication that one of RL or Deep RL is simply better than the other, when they are generally used in different cases or involve completely different methods.

Finally, the number of documents retrieved had the least pronounced effect of the methods. This is likely because, while more context is provided, the documents become less relevant as more are retrieved. I believe that retrieving more documents would be useful if combined with a similarity threshold that the documents must meet, to avoid telling the LLM that a document is relevant when in reality it is not.
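
A distance cutoff of the kind suggested above could look roughly like this. The helper `filter_by_distance` and the cutoff value are assumptions for illustration; the arrays mimic the `(distances, indices)` pair that `index.search` returns, where a smaller squared L2 distance means a closer match. A sensible cutoff would have to be tuned on the actual embeddings.

```python
import numpy as np

def filter_by_distance(distances, indices, max_distance):
    """Keep only retrieved ids whose distance to the query is at most max_distance."""
    keep = distances[0] <= max_distance      # row 0 holds the results for the single query
    return indices[0][keep].tolist()

# Hypothetical search output for one query with top_k = 5
distances = np.array([[0.31, 0.42, 0.88, 1.35, 1.60]], dtype=np.float32)
indices = np.array([[17, 203, 5, 640, 92]])

kept = filter_by_distance(distances, indices, max_distance=1.0)
print(kept)  # [17, 203, 5]
```

If the filtered list comes back empty, the pipeline could fall back to the "I don't have enough information" answer instead of passing weak context to the LLM.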

## Using a pretrained LLM for Sentiment Analysis

The model takes a text string and outputs a label and a score. The label indicates whether the model judges the text to be a positive or negative opinion. The score indicates how confident the model is in its prediction.

I am using the distilbert-base-uncased-finetuned-sst-2-english model, which is distilbert-base-uncased fine-tuned for sentiment analysis using Supervised Fine-Tuning (SFT). It is different from the TinyLlama LLM used in the previous code.

Reference: https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english
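
The label and score come from a softmax over the classifier's two output logits. The sketch below reproduces that last step with NumPy on made-up logits; the function name `logits_to_prediction` is hypothetical, and the label order (NEGATIVE, POSITIVE) is assumed to match this model's configuration.

```python
import numpy as np

LABELS = ["NEGATIVE", "POSITIVE"]  # assumed label order

def logits_to_prediction(logits):
    """Convert raw logits to (label, confidence) with a numerically stable softmax."""
    logits = np.asarray(logits, dtype=np.float64)
    exp = np.exp(logits - logits.max())   # subtract the max to avoid overflow
    probs = exp / exp.sum()
    best = int(np.argmax(probs))
    return LABELS[best], float(probs[best])

label, score = logits_to_prediction([-2.0, 3.0])   # made-up logits
print(label, round(score, 4))  # POSITIVE 0.9933
```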

In [None]:
# Hugging Face DistilBERT model, fine-tuned with SFT for sentiment analysis
# The pipeline handles preprocessing and converts logits into a label and a confidence score.
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

sentences = [
    "This course was incredibly insightful and well-structured.",
    "I don't like the way this algorithm was explained.",
    "The model performance was excellent even on difficult examples.",
    "This is the worst textbook I’ve ever used.",
    "It’s okay, but could use more examples and clarity.",
    "Reinforcement Learning is my least favourite ML category.",
    "Reinforcement Learning is my favourite ML category."
]

# Sentiment analysis
for text in sentences:
    result = classifier(text)[0]
    label = result['label']
    score = result['score']
    print(f"Text: {text}\nPredicted Sentiment: {label} (Confidence: {score:.4f})\n{'-'*60}")


Device set to use cuda:0


Text: This course was incredibly insightful and well-structured.
Predicted Sentiment: POSITIVE (Confidence: 0.9999)
------------------------------------------------------------
Text: I don't like the way this algorithm was explained.
Predicted Sentiment: NEGATIVE (Confidence: 0.9983)
------------------------------------------------------------
Text: The model performance was excellent even on difficult examples.
Predicted Sentiment: POSITIVE (Confidence: 0.9997)
------------------------------------------------------------
Text: This is the worst textbook I’ve ever used.
Predicted Sentiment: NEGATIVE (Confidence: 0.9998)
------------------------------------------------------------
Text: It’s okay, but could use more examples and clarity.
Predicted Sentiment: POSITIVE (Confidence: 0.8375)
------------------------------------------------------------
Text: Reinforcement Learning is my least favourite ML category.
Predicted Sentiment: NEGATIVE (Confidence: 0.9996)
--------------------------