# Retrieval-Augmented Generation (RAG)

## Overview
This notebook presents a complete workflow for building a RAG system that enhances Large Language Models with external knowledge. You'll implement the core components that allow LLMs to access and use information beyond their training data!

**What You'll Build:**
- A text chunking function for document processing
- An embedding generation system using OpenAI's API
- A similarity-based retrieval mechanism
- A prompt augmentation pipeline for context-aware responses


## Install Packages and Setup Variables


In [None]:
!pip install -q openai==1.59.8 cohere==5.13.8 tiktoken==0.8.0


In [None]:
import os

# Set "OPENAI_API_KEY" and "GOOGLE_API_KEY" in the Python environment. They are read later by the OpenAI and Gemini clients.

os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"
os.environ["GOOGLE_API_KEY"] =  "<YOUR_GOOGLE_API_KEY>"



# from google.colab import userdata

# os.environ["OPENAI_API_KEY"] = userdata.get('openai_api_key')
# os.environ["GOOGLE_API_KEY"] =  userdata.get('Google_api_key')

In [None]:
# False: Generate the embedding for the dataset. (Associated cost with using OpenAI endpoint)
# True: Load the dataset that already has the embedding vectors.
load_embedding = False

## Download Dataset (CSV)


The dataset includes several articles from the TowardsAI blog, which provide an in-depth explanation of the LLaMA2 model.


In [None]:
!wget https://raw.githubusercontent.com/AlaFalaki/tutorial_notebooks/main/data/mini-llama-articles.csv
!wget https://raw.githubusercontent.com/AlaFalaki/tutorial_notebooks/main/data/mini-llama-articles-with_embeddings.csv

## Read File and Chunk Text

### Exercise 1: Implement the text chunking function

Text chunking is crucial in RAG systems because:
- Language models have token limits
- Smaller chunks allow for more precise retrieval
- Overlapping chunks can preserve context at boundaries

Your task: Implement a function that splits text into chunks of a specified size.
Consider:
- How to handle the last chunk if it's smaller than chunk_size
- Whether to add overlap between chunks (optional enhancement)
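
As a small illustration of the slicing behaviour the hints rely on (a toy sketch, not the exercise solution; `sample` and `size` are hypothetical names), a Python slice that runs past the end of a string is simply truncated, so a short final piece needs no special handling:

```python
# Toy illustration: slicing past the end of a string is safe in Python.
sample = "abcdefghij"  # 10 characters (hypothetical sample text)
size = 4               # hypothetical chunk size

# range(start, stop, step) yields the start offset of each piece.
offsets = list(range(0, len(sample), size))

# The final slice asks for sample[8:12], but only 2 characters remain.
pieces = [sample[i : i + size] for i in offsets]

print(offsets)  # [0, 4, 8]
print(pieces)   # ['abcd', 'efgh', 'ij']
```

Adding overlap would amount to advancing by less than `size` on each step, at the cost of storing some characters twice.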



In [None]:
# Split the input text into chunks of specified size.
def split_into_chunks(text, chunk_size=1024):
    """
    Split the input text into chunks of specified size.

    Parameters:
    - text (str): The input text to be split
    - chunk_size (int): The maximum size of each chunk in characters

    Returns:
    - chunks (list): A list of text chunks

    TODO: Implement this function
    Hint: Use a loop to iterate through the text in steps of chunk_size
    Think about: How would you extract a substring from position i to i+chunk_size?
    """
    chunks = []

    # TODO: Your implementation here
    # Step 1: Create a loop that goes from 0 to len(text) in steps of chunk_size
    # Step 2: In each iteration, extract a substring from current position to current position + chunk_size
    # Step 3: Append each chunk to the chunks list
    # Step 4: Return the chunks list

    pass  # Remove this line when implementing

In [None]:
import csv

chunks = []

# Load the file as a CSV
with open("./mini-llama-articles.csv", mode="r", encoding="utf-8") as file:
    csv_reader = csv.reader(file)

    for idx, row in enumerate(csv_reader):
        if idx == 0:
            continue  # Skip header row
        chunks.extend(split_into_chunks(row[1]))

In [None]:
print("number of articles:", idx)
print("number of chunks:", len(chunks))

In [None]:
import pandas as pd

# Convert the list of chunks to a pandas DataFrame
df = pd.DataFrame(chunks, columns=["chunk"])

df.keys()

## Generate Embedding
### Exercise 2: Implement the embedding generation function

Embeddings are numerical representations of text that capture semantic meaning:
* OpenAI's models convert text into high-dimensional vectors (1536 dimensions for text-embedding-3-small)
* Similar texts have similar vector representations
* The distance between vectors indicates semantic similarity

Your task: Implement a function that converts text to embeddings using OpenAI's API.

Important considerations:
* Clean the text by removing newlines (they can affect embedding quality)
* Use try-except to handle potential API errors gracefully
* The OpenAI API might fail due to rate limits or network issues
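
Because rate limits and network failures are often temporary, one option beyond a plain try-except is to retry the call a few times before giving up. The sketch below is an illustration under assumptions, not part of the exercise: `call_with_retries`, `max_attempts`, and `base_delay` are hypothetical names, and real code would catch narrower exception types than `Exception`.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0):
    """Call fn(); on an exception, wait and retry. Return None if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:  # assumption: broad catch, for illustration only
            print(f"Attempt {attempt + 1} failed: {exc}")
            if attempt < max_attempts - 1:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                time.sleep(base_delay * 2**attempt)
    return None
```

The wrapped callable would be whatever performs the raw API request, for example a `lambda` around the embeddings call. Returning `None` on failure matches the convention suggested for `get_embedding`, so callers need to check for it.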

API details:
* Use client.embeddings.create() method
* Parameters: input=[your_text], model="text-embedding-3-small"
* Extract embedding from: response.data[0].embedding

Reference: https://platform.openai.com/docs/guides/embeddings




In [None]:
from openai import OpenAI

client = OpenAI()


# Define a function that converts text to an embedding vector using OpenAI's text-embedding-3-small model.
def get_embedding(text):
    """
    Convert a text to embedding vector using OpenAI's embedding model.

    Parameters:
    - text (str): The input text to convert to embedding

    Returns:
    - embedding (list): A list of floats representing the embedding vector

    TODO: Implement this function
    Important considerations:
    1. Clean the text by removing newlines (they can affect embedding quality)
    2. Use try-except to handle potential API errors gracefully
    3. The OpenAI embeddings.create() method requires:
       - input: a list containing your text
       - model: use "text-embedding-3-small" for this exercise
    4. The response contains the embedding in res.data[0].embedding

    API Documentation: https://platform.openai.com/docs/guides/embeddings
    """

    # TODO: Your implementation here
    # Step 1: Add try-except block for error handling
    # Step 2: Inside try block, remove newlines from text using .replace()
    # Step 3: Call client.embeddings.create() with appropriate parameters
    # Step 4: Extract and return the embedding from the response
    # Step 5: In except block, return None if an error occurs

    pass  # Remove this line when implementing

In [None]:
from tqdm.notebook import tqdm
import numpy as np

# Generate embedding
if not load_embedding:
    print("Generating embeddings...")
    embeddings = []
    for index, row in tqdm(df.iterrows()):
        # df.at[index, 'embedding'] = get_embedding( row['chunk'] )
        embeddings.append(get_embedding(row["chunk"]))

    embeddings_values = pd.Series(embeddings)
    df.insert(loc=1, column="embedding", value=embeddings_values)

# Or, load the embedding from the file.
else:
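    from ast import literal_eval

    print("Loading the embedding file...")
    # Load the file as a CSV
    df = pd.read_csv("mini-llama-articles-with_embeddings.csv")
    # Parse each stored embedding string into a NumPy array.
    # literal_eval only accepts Python literals, so it is safer than eval.
    df["embedding"] = df["embedding"].apply(lambda x: np.array(literal_eval(x)))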

In [None]:
# df.to_csv('mini-llama-articles-with_embeddings.csv')

## User Question


In [None]:
# Define the user question, and convert it to embedding.
QUESTION = "How many parameters does the LLaMA2 model have?"
QUESTION_emb = get_embedding(QUESTION)

len(QUESTION_emb)

## Test Cosine Similarity


Comparing embedding vectors lets us find pieces of text that are semantically close to each other. The following sample shows how the cosine similarity metric identifies which sentence is a plausible answer to the user question: the unrelated sentence should score noticeably lower than the related one.
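
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on direction rather than magnitude. The sketch below computes it by hand with NumPy on toy 3-dimensional vectors; `cosine_sim` and the vectors are hypothetical stand-ins for real embeddings, which have far more dimensions.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


query = [1.0, 0.0, 1.0]      # toy "question" vector
related = [2.0, 0.0, 2.0]    # same direction as the query, different length
unrelated = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_sim(query, related))    # approximately 1.0
print(cosine_sim(query, unrelated))  # 0.0
```

The `cosine_similarity` function from scikit-learn used in this notebook computes the same quantity, but for whole matrices of vectors at once.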


In [None]:
BAD_SOURCE_emb = get_embedding("The sky is blue.")
GOOD_SOURCE_emb = get_embedding("LLaMA2 model has a total of 2B parameters.")

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# A sample showing how a relevant piece of text achieves a higher similarity score
# than a completely unrelated one.
print("> Bad Response Score:", cosine_similarity([QUESTION_emb], [BAD_SOURCE_emb]))
print("> Good Response Score:", cosine_similarity([QUESTION_emb], [GOOD_SOURCE_emb]))

## Calculate Cosine Similarities

### Exercise 3: Implement the retrieval logic

This is the core of the RAG system - finding the most relevant chunks to answer the user's question.

Your task: Find the indices of the top N most similar chunks.

Key concepts:
* Cosine similarity returns a 2D array of similarity scores
* Higher scores indicate more relevant chunks
* You need to sort and select the best matches

Implementation hints:
* numpy's argsort() returns indices that would sort an array
* To get descending order, reverse the result with a negative-step slice: [::-1]
* Select only the first N indices after sorting

Example:
* If scores = [0.3, 0.8, 0.5, 0.9], then np.argsort(scores) returns [0, 2, 1, 3]
* To get descending order: np.argsort(scores)[::-1] returns [3, 1, 2, 0]
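
The example above can be checked directly. This is a toy sketch on the four scores from the example; the variable names are illustrative only.

```python
import numpy as np

scores = np.array([0.3, 0.8, 0.5, 0.9])

ascending = np.argsort(scores)  # indices from lowest to highest score
descending = ascending[::-1]    # reversed: highest score first
top_2 = descending[:2]          # keep only the two best matches

print(ascending.tolist())      # [0, 2, 1, 3]
print(descending.tolist())     # [3, 1, 2, 0]
print(scores[top_2].tolist())  # [0.9, 0.8]
```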



In [None]:
# Calculate similarity between the question and each chunk
cosine_similarities = cosine_similarity([QUESTION_emb], df["embedding"].tolist())

print(f"Shape of similarities: {cosine_similarities.shape}")

number_of_chunks_to_retrieve = 3

# TODO: Your implementation here
# Step 1: Use np.argsort() on cosine_similarities[0] (the first and only row)
# Step 2: Reverse the order to get descending sort (highest scores first)
# Step 3: Select only the first 'number_of_chunks_to_retrieve' indices
# Store the result in a variable called 'indices'

indices = None  # Replace with your implementation

# Display the retrieved chunks
print("Retrieved chunks:")
for idx, item in enumerate(df.chunk[indices]):
    print(f"\n> Chunk {idx+1}")
    print(item)
    print("----")

## Augment the Prompt

This is where RAG differs from standard LLM usage. Instead of relying only on the model's training data, we augment the prompt with retrieved context.

Your task: Create an effective prompt that incorporates the retrieved chunks.

Key components of a good RAG prompt:
* Clear system instructions about using provided context
* Explicit context boundaries (START/END tags)
* The retrieved information placed before the question
* Instructions to be accurate and base answers on the context

Prompt engineering tips:
* Be explicit about what the model should do with the context
* Use clear delimiters to separate context from the question
* Instruct the model to decline if information isn't in the context
* Keep instructions concise but clear

Template structure:
```
System: [Role and constraints]
User: [Context] + [Question] + [Instructions]
```
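
One way to realize this structure is sketched below, assuming named placeholders and a small helper; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the wording of the instructions is only an example.

```python
# Hypothetical template: context is delimited by explicit tags and placed
# before the question, as recommended above.
PROMPT_TEMPLATE = (
    "Answer the question using only the information between the tags. "
    "If the answer is not in the context, say that you cannot answer.\n\n"
    "<START_OF_CONTEXT>\n{context}\n<END_OF_CONTEXT>\n\n"
    "Question: {question}\nAnswer:"
)


def build_prompt(chunks, question):
    """Join the retrieved chunks and fill them into the template."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


print(build_prompt(["First chunk.", "Second chunk."], "What is in the chunks?"))
```

Named placeholders make it harder to swap the context and the question by accident than positional `{}` placeholders do, though either works with `.format()`.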



In [None]:
import google.generativeai as genai

try:
    # TODO: Define the system_prompt
    # Should instruct the model to:
    # - Answer questions based on provided chunks
    # - Only answer AI-related questions
    # - Decline non-AI questions politely

    system_prompt = ""  # Your implementation here

    # TODO: Create the prompt template
    # Should include:
    # - Instructions about using the context between <START_OF_CONTEXT> and <END_OF_CONTEXT>
    # - Placeholders for context and question (use {} for .format())
    # - Instructions to be accurate and concise
    # Example structure:
    # "Read the following...<START_OF_CONTEXT>\n{}\n<END_OF_CONTEXT>\n\nQuestion: {}\nAnswer:"

    prompt = ""  # Your implementation here

    # TODO: Format the prompt with retrieved chunks and question
    # Hint: Use "".join(df.chunk[indices]) to combine chunks
    # formatted_prompt = prompt.format(???, ???)

    model = genai.GenerativeModel(model_name="gemini-1.5-flash", system_instruction=system_prompt)

    # Uncomment the following lines after implementing the prompt
    # result = model.generate_content(formatted_prompt, request_options={"timeout": 1000})
    # res = result.text
    # print(res)

except Exception as e:
    print(f"An error occurred: {e}")

In [None]:
print(res)

## How Augmenting the Prompt Can Address Knowledge Cutoff Limitations and Hallucinations

In [None]:
# Consider this as a retrieved chunk
# https://ai.meta.com/blog/meta-llama-3-1/
Example_chunk = """
Introducing Llama 3.1 Llama 3.1 405B is the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation. With the release of the 405B model, we’re poised to supercharge innovation—with unprecedented opportunities for growth and exploration. We believe the latest generation of Llama will ignite new applications and modeling paradigms, including synthetic data generation to enable the improvement and training of smaller models, as well as model distillation—a capability that has never been achieved at this scale in open source.
As part of this latest release, we’re introducing upgraded versions of the 8B and 70B models. These are multilingual and have a significantly longer context length of 128K, state-of-the-art tool use, and overall stronger reasoning capabilities. This enables our latest models to support advanced use cases, such as long-form text summarization, multilingual conversational agents, and coding assistants. We’ve also made changes to our license, allowing developers to use the outputs from Llama models—including the 405B—to improve other models. True to our commitment to open source, starting today, we’re making these models available to the community for download on llama.meta.com and Hugging Face and available for immediate development on our broad ecosystem of partner platforms. Model evaluations
For this release, we evaluated performance on over 150 benchmark datasets that span a wide range of languages. In addition, we performed extensive human evaluations that compare Llama 3.1 with competing models in real-world scenarios. Our experimental evaluation suggests that our flagship model is competitive with leading foundation models across a range of tasks, including GPT-4, GPT-4o, and Claude 3.5 Sonnet. Additionally, our smaller models are competitive with closed and open models that have a similar number of parameters.
Model Architecture As our largest model yet, training Llama 3.1 405B on over 15 trillion tokens was a major challenge. To enable training runs at this scale and achieve the results we have in a reasonable amount of time, we significantly optimized our full training stack and pushed our model training to over 16 thousand H100 GPUs, making the 405B the first Llama model trained at this scale.
To address this, we made design choices that focus on keeping the model development process scalable and straightforward. We opted for a standard decoder-only transformer model architecture with minor adaptations rather than a mixture-of-experts model to maximize training stability.
We adopted an iterative post-training procedure, where each round uses supervised fine-tuning and direct preference optimization. This enabled us to create the highest quality synthetic data for each round and improve each capability’s performance.
Compared to previous versions of Llama, we improved both the quantity and quality of the data we use for pre- and post-training. These improvements include the development of more careful pre-processing and curation pipelines for pre-training data, the development of more rigorous quality assurance, and filtering approaches for post-training data.
"""

In [None]:
QUESTION = "How many parameters does the LLaMA 3.1 model have?"

# Formulating the system prompt
system_prompt = (
        "You are an assistant and expert in answering questions from chunks of content. "
        "Only answer AI-related questions; otherwise, say that you cannot answer this question."
    )
# Combining the system prompt with the user's question
prompt = (
        "Read the following information, which might contain the context you require to answer the question. You can use the information between the <START_OF_CONTEXT> and <END_OF_CONTEXT> tags. Here is the content:\n\n<START_OF_CONTEXT>\n{}\n<END_OF_CONTEXT>\n\n"
        "Please provide an informative and accurate answer to the following question based on the available context. Be concise and take your time. \nQuestion: {}\nAnswer:"
    )
prompt = prompt.format(Example_chunk, QUESTION)

model = genai.GenerativeModel(model_name= "gemini-1.5-flash", system_instruction=system_prompt)

#Gemini API call
result = model.generate_content(prompt,request_options={"timeout": 1000},)
res = result.text
print(res)

## Without Augmentation


Test the Gemini API on the same question without adding any retrieved documents. In this case, the LLM can rely only on the knowledge from its training data.


In [None]:
QUESTION = "How many parameters does the LLaMA 3.1 model have?"

# Formulating the system prompt
system_prompt = "You are an assistant and expert in answering questions."

# Combining the system prompt with the user's question
prompt = "Be concise and take your time to answer the following question. \nQuestion: {}\nAnswer:"
prompt = prompt.format(QUESTION)

model = genai.GenerativeModel(model_name= "gemini-1.5-flash", system_instruction=system_prompt)

#Gemini API call
result = model.generate_content(prompt,request_options={"timeout": 1000},)
res = result.text

In [None]:
print(res)

## Reflection & Next Steps

### Key Questions to Consider

1. **Chunking Strategy**: How might different chunk sizes affect retrieval quality? What about overlapping chunks?

2. **Embedding Quality**: Why do we remove newlines before generating embeddings? How might preprocessing affect results?

3. **Retrieval Precision**: We used top-3 chunks. How would you determine the optimal number of chunks to retrieve?

4. **Prompt Engineering**: How does the structure of your augmented prompt affect the model's response quality?

5. **Limitations**: What are potential drawbacks of this basic RAG approach? How might you improve it?

### Bonus Challenges

If you finish early, try these enhancements:

1. **Overlap Implementation**: Modify the chunking function to include overlap between chunks
   ```python
   def split_into_chunks_with_overlap(text, chunk_size=1024, overlap=128):
       # Implement sliding window approach
       ...
   ```

2. **Similarity Threshold**: Instead of top-N, retrieve chunks above a similarity threshold
   ```python
   threshold = 0.7
   relevant_indices = np.where(cosine_similarities[0] > threshold)[0]
   ```

3. **Reranking**: Implement a secondary ranking step after initial retrieval using a different similarity metric

4. **Evaluation**: Create a function to evaluate answer quality with and without RAG using metrics like:
   * Answer relevance
   * Information completeness
   * Factual accuracy

### Future Enhancements

This basic RAG implementation can be enhanced with:
* Vector databases (Pinecone, Weaviate, ChromaDB)
* More sophisticated chunking strategies
* Hybrid search (combining dense and sparse retrieval)
* Query expansion and reformulation
* Multi-step reasoning with chain-of-thought
