In a simple RAG setup, we follow these steps:
1. **Data Ingestion**: Load and preprocess text data
2. **Chunking**: Break data into smaller chunks to improve retrieval performance
3. **Embedding**: Convert the text chunks into numerical representations using an embedding model
4. **Semantic Search**: Retrieve the chunks most relevant to the user query
5. **Response Generation**: Use a language model to generate a response based on the retrieved text

### 1. Set up environment

In [197]:
import fitz
import os
from dotenv import load_dotenv
import numpy as np
import json
from sklearn.metrics.pairwise import cosine_similarity
from google import genai
from google.genai import types

In [198]:
load_dotenv("../conf.env")

True

### 2. Ingest data (PDF for this example)

In [199]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts the text of every page in a PDF file and returns it as one string.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: Extracted text from the PDF.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # Get the page
        text = page.get_text("text")  # Extract text from the page
        all_text += text  # Append the extracted text to the all_text string

    return all_text  # Return the extracted text

### 3. Chunk the extracted text

In [200]:
def chunk_text(text, n, overlap):
    """
    Chunks the given text into segments of n characters with overlap.

    Args:
    text (str): The text to be chunked.
    n (int): The number of characters in each chunk.
    overlap (int): The number of overlapping characters between chunks.

    Returns:
    List[str]: A list of text chunks.
    """
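    # Guard against a zero or negative step, which would fail or silently return no chunks
    if n <= 0 or not 0 <= overlap < n:
        raise ValueError("n must be positive and overlap must satisfy 0 <= overlap < n")

    chunks = []  # Initialize an empty list to store the chunks

    # Loop through the text with a step size of (n - overlap)
    for i in range(0, len(text), n - overlap):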
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks
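
To see how the step size `n - overlap` produces overlapping windows, here is a small worked example on a short string. It repeats the slicing logic of `chunk_text` so that it runs on its own; the name `chunk_text_demo` and the sample string are illustrative only.

```python
def chunk_text_demo(text, n, overlap):
    """Same sliding-window slicing as chunk_text, repeated so this example runs standalone."""
    step = n - overlap  # each window starts `step` characters after the previous one
    return [text[i:i + n] for i in range(0, len(text), step)]

demo_chunks = chunk_text_demo("abcdefghij", n=4, overlap=2)
print(demo_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Each chunk repeats the last two characters of the previous one, and the final chunk is shorter than `n`. With the settings used in this notebook (`n=1000`, `overlap=200`), a new chunk starts every 800 characters.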

#### Set up Google GenAI Client
Get a free API key from [Google AI Studio](https://aistudio.google.com/apikey).

In [201]:
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
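
If `GEMINI_API_KEY` is missing, `os.getenv` returns `None` and the problem may only surface later, when the first API call fails. One way to fail early is a small check like the sketch below; `require_env` is a hypothetical helper, not part of the notebook or of any library.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your env file or shell")
    return value
```

It could then be used in place of the bare lookup, for example `genai.Client(api_key=require_env("GEMINI_API_KEY"))`.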

Now load the PDF, extract its text, and split it into chunks.

In [202]:
# Define the path to the PDF file
pdf_path = "../data/AI_Information.pdf"

# Extract text from the PDF file
extracted_text = extract_text_from_pdf(pdf_path)

# Chunk the extracted text into segments of 1000 characters with an overlap of 200 characters
text_chunks = chunk_text(extracted_text, 1000, 200)

# Print the number of text chunks created
print("Number of text chunks:", len(text_chunks))

# Print the first text chunk
print("\nFirst text chunk:")
print(text_chunks[0])

Number of text chunks: 42

First text chunk:
Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past few decades, advancements in computing power and data availability 
have significantly accelerated the development and deployment of AI. 
Historical Context 
The idea of artificial intelligence has existed for centuries, often depicted in myths and fiction. 
However, the formal field of AI research began in the mid-20th century. The Dartmouth Workshop 
in 1956 is widely considered the birthplace of AI. Early AI research focused on problem-solving 
and 

### 4. Embedding
Convert the text chunks into numerical vectors, which allow for efficient similarity search.

In [203]:
def create_embeddings(text, model="gemini-embedding-001"):
    """
    Generates embedding vectors for a list of input texts using the Gemini embedding API.

    Args:
        text (List[str]): A list of input strings to embed.
        model (str): Name of the Gemini embedding model to use.

    Returns:
        List[np.ndarray]: A list of embedding vectors, each corresponding to an input string.
    """

    result = [
        np.array(e.values) for e in client.models.embed_content(
            model=model,
            contents=text, 
            config=types.EmbedContentConfig(task_type="SEMANTIC_SIMILARITY")).embeddings
    ]
    return result  # list of 1-D np.ndarray, one per input string

# Embed every chunk; `response` holds one vector per chunk
response = create_embeddings(text_chunks)
response

[array([-0.00805921,  0.00604082,  0.01368964, ..., -0.00378978,
        -0.00951595, -0.00788247], shape=(3072,)),
 array([-0.00541291, -0.00768946,  0.00840041, ..., -0.00555763,
        -0.00453059, -0.00575172], shape=(3072,)),
 array([-0.01510609, -0.00370133, -0.01130207, ..., -0.01065998,
         0.0138528 , -0.00942225], shape=(3072,)),
 array([-0.01703906, -0.00132903, -0.01255116, ..., -0.01783094,
         0.00196428, -0.01987451], shape=(3072,)),
 array([ 0.00496949,  0.00703917,  0.00059291, ..., -0.01075656,
        -0.00148964, -0.00847893], shape=(3072,)),
 array([ 0.00363039, -0.00663219,  0.00391538, ...,  0.00074985,
        -0.00317055, -0.01557343], shape=(3072,)),
 array([-0.00655024, -0.00751401,  0.00925364, ..., -0.01057247,
         0.00491098, -0.00145009], shape=(3072,)),
 array([-0.01434592,  0.00530639,  0.0129325 , ..., -0.01475381,
         0.00640732,  0.00202144], shape=(3072,)),
 array([-0.0254624 , -0.0206279 , -0.0010777 , ..., -0.01374053,
       

In [215]:
embeddings_matrix = response  # list of 1-D arrays, one per chunk, each of shape (3072,)
similarity_matrix = cosine_similarity(embeddings_matrix)

# cosine_similarity expects 2-D input: a list of 1-D arrays, or np.stack(list) of shape (n, d), both work,
# but a pair of bare 1-D arrays of shape (d,) does not

# Print the pairwise similarities among the first five chunks
for i in range(5):
    for j in range(i + 1, 5):
        similarity = similarity_matrix[i, j]
        print(f"Similarity between '{i}' and '{j}': {similarity:.4f}")

Similarity between '0' and '1': 0.9061
Similarity between '0' and '2': 0.8539
Similarity between '0' and '3': 0.8081
Similarity between '0' and '4': 0.8296
Similarity between '1' and '2': 0.8759
Similarity between '1' and '3': 0.8110
Similarity between '1' and '4': 0.8435
Similarity between '2' and '3': 0.8729
Similarity between '2' and '4': 0.8194
Similarity between '3' and '4': 0.8738
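
The scores above are cosine similarities: the dot product of two vectors divided by the product of their lengths. The sketch below computes the same quantity in plain Python on small made-up vectors, without scikit-learn, purely to make the formula concrete. The function name `cosine` and the sample vectors are illustrative, not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors: (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])                 # perpendicular vectors
print(round(same_direction, 6), round(orthogonal, 6))
```

Vectors pointing the same way score close to 1 regardless of their length, and perpendicular vectors score 0, which is why the measure suits comparing embeddings of texts with different lengths.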


Implement semantic search to retrieve the chunks most similar to a query.

In [None]:
def semantic_search(query, text_chunks, embeddings, k=5):
    """
    Performs semantic search on the text chunks using the given query and embeddings.

    Args:
    query (str): The query for the semantic search.
    text_chunks (List[str]): A list of text chunks to search through.
    embeddings (List[np.ndarray]): A list of embedding vectors, one per text chunk.
    k (int): The number of top relevant text chunks to return. Default is 5.

    Returns:
    List[str]: A list of the top k most relevant text chunks based on the query.
    """
    # Create an embedding for the query; create_embeddings returns a list, so take its single element
    query_embedding = create_embeddings(query)[0].reshape(1, -1)  # shape (1, d)
    similarity_scores = []  # Initialize a list to store similarity scores

    # Calculate similarity scores between the query embedding and each text chunk embedding
    for i, chunk_embedding in enumerate(embeddings):
        # cosine_similarity needs 2-D inputs and returns a (1, 1) array; [0, 0] extracts the scalar
        similarity_score = cosine_similarity(query_embedding, np.array(chunk_embedding).reshape(1, -1))[0, 0]
        similarity_scores.append((i, similarity_score))  # Append the index and similarity score

    # Sort the similarity scores in descending order
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    # Get the indices of the top k most similar text chunks
    top_indices = [index for index, _ in similarity_scores[:k]]
    # Return the top k most relevant text chunks
    return [text_chunks[index] for index in top_indices]
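
When only the top k results are needed, a full sort of every score is not required. As a sketch of that idea, the example below ranks toy 2-D vectors with `heapq.nlargest` from the standard library. The vectors, chunk labels, and the `top_k` helper are made up for illustration and stand in for real embeddings.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose vectors are most similar to query_vec, best first."""
    scored = ((cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs))
    best = heapq.nlargest(k, scored)  # keeps only the k highest (score, index) pairs
    return [chunks[idx] for _, idx in best]

chunks = ["about cats", "about dogs", "about cars"]
chunk_vecs = [[1.0, 0.1], [0.9, 0.3], [0.0, 1.0]]  # toy stand-ins for embeddings
result = top_k([1.0, 0.0], chunk_vecs, chunks, k=2)
print(result)  # ['about cats', 'about dogs']
```

For a few dozen chunks, as in this notebook, the difference from a full sort is negligible; the approach matters more as the collection grows.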


Run a query on the extracted chunks.

In [228]:
# Load the validation data from a JSON file
with open('../data/val.json') as f:
    data = json.load(f)

# Extract a query from the validation data (the entry at index 3)
query = data[3]['question']
# print(create_embeddings(query)[0]) # :list[one NDArray(x,)]
# Perform semantic search to find the top 2 most relevant text chunks for the query
top_chunks = semantic_search(query, text_chunks, response, k=2)

# Print the query
print("Query:", query)

# Print the top 2 most relevant text chunks
for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")

Query: How does AI contribute to personalized medicine?
Context 1:
ent by analyzing medical images, predicting 
patient outcomes, and assisting in treatment planning. AI-powered tools enhance accuracy, 
efficiency, and patient care. 
Drug Discovery and Development 
AI accelerates drug discovery and development by analyzing biological data, predicting drug 
efficacy, and identifying potential drug candidates. AI-powered systems reduce the time and cost 
of bringing new treatments to market. 
Personalized Medicine 
AI enables personalized medicine by analyzing individual patient data, predicting treatment 
responses, and tailoring interventions. Personalized medicine enhances treatment effectiveness 
and reduces adverse effects. 
Robotic Surgery 
AI-powered robotic surgery systems assist surgeons in performing complex procedures with 
greater precision and control. These systems enhance dexterity, reduce invas

### 5. Generate response

In [232]:
# Define the system prompt for the AI assistant
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I don't have enough information to answer that.'"

def generate_response(system_prompt, user_message, model="gemini-2.5-flash"):
    """
    Generates a response from the AI model based on the system prompt and user message.

    Args:
    system_prompt (str): The system prompt to guide the AI's behavior.
    user_message (str): The user's message or query.
    model (str): The model to be used for generating the response. Default is "gemini-2.5-flash".

    Returns:
    GenerateContentResponse: The response from the AI model.
    """
    response = client.models.generate_content(
        model=model,
        config=types.GenerateContentConfig(
            system_instruction=system_prompt,
            temperature=1
            ),
        contents=user_message
    )

    return response

# Create the user prompt based on the top chunks
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}"

# Generate AI response
ai_response = generate_response(system_prompt, user_prompt)
print(ai_response.text)

AI contributes to personalized medicine by analyzing individual patient data, predicting treatment responses, and tailoring interventions. This approach enhances treatment effectiveness and reduces adverse effects.
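
To make the layout of the user prompt explicit, here is a sketch that wraps the same formatting in a helper and prints the result for two short placeholder chunks. `build_user_prompt` is a hypothetical name, and the chunks and question are made up.

```python
SEPARATOR = "====================================="  # divider copied from the notebook's prompt format

def build_user_prompt(chunks, question):
    """Number each retrieved chunk, then append the question at the end."""
    blocks = [
        f"Context {i + 1}:\n{chunk}\n{SEPARATOR}\n"
        for i, chunk in enumerate(chunks)
    ]
    return "\n".join(blocks) + f"\nQuestion: {question}"

prompt = build_user_prompt(["First chunk.", "Second chunk."], "What is in the chunks?")
print(prompt)
```

Placing the question after the context blocks keeps it adjacent to the point where the model starts generating; that ordering is a design choice rather than a requirement.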


### Evaluate

In [233]:
# Define the system prompt for the evaluation system
evaluate_system_prompt = "You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5."

# Create the evaluation prompt by combining the user query, AI response, true response, and evaluation system prompt
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.text}\nTrue Response: {data[3]['ideal_answer']}\n{evaluate_system_prompt}"

# Generate the evaluation response using the evaluation system prompt and evaluation prompt
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)

# Print the evaluation response
print(evaluation_response.text)

Score: 1
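
The judge model replies with free text such as `Score: 1`, so aggregating results over many questions requires turning that text into a number. A minimal sketch, assuming the reply contains the word "score" followed by one of the three allowed values (0, 0.5, 1); `parse_score` is a hypothetical helper, and real replies may need more robust handling.

```python
import re

# Matches "score", an optional ":" or "=", then 0, 0.5, or 1 (optionally written 0.0 / 1.0).
# The lookahead rejects values such as 0.7 or 10 that merely start with an allowed score.
SCORE_PATTERN = re.compile(
    r"score\s*[:=]?\s*(0\.5|1(?:\.0)?|0(?:\.0)?)(?!\.?\d)",
    re.IGNORECASE,
)

def parse_score(reply):
    """Return the score found in the reply as a float, or None if no allowed score is present."""
    match = SCORE_PATTERN.search(reply)
    return float(match.group(1)) if match else None

print(parse_score("Score: 1"))    # 1.0
print(parse_score("Score: 0.5"))  # 0.5
print(parse_score("no verdict"))  # None
```

Returning `None` for unparseable replies lets the caller decide whether to retry, skip, or count the item as a failure.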


In [238]:
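# Cross-check the judge's score: embed the generated answer and the ideal answer, then compare them
ai_gen = create_embeddings(ai_response.text)  # list with one array of shape (d,)
ideal_re = create_embeddings(data[3]['ideal_answer'])  # list with one array of shape (d,)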

# ai_gen, ideal_re
cosine_similarity(ai_gen, ideal_re)

array([[0.99075595]])