### Why?
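Chunk size directly affects retrieval accuracy in a Retrieval-Augmented Generation (RAG) pipeline: small chunks can cut a sentence off before the relevant detail, while large chunks pack more unrelated text into each embedding. The goal is to balance retrieval performance with response quality.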

### Procedure
- Extracting text from a PDF.
- Splitting text into chunks of varying sizes.
- Creating embeddings for each chunk.
- Retrieving relevant chunks for a query.
- Generating a response using retrieved chunks.
- Evaluating faithfulness and relevancy.
- Comparing results for different chunk sizes.


### Set up env

In [118]:
import fitz
import os
from dotenv import load_dotenv
import numpy as np
import json
from openai import OpenAI

import time
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
from sklearn.metrics.pairwise import cosine_similarity

In [119]:
load_dotenv("../conf.env")
client = OpenAI(
    api_key=os.getenv("GEMINI_API_KEY"),
    base_url=os.getenv("GEMINI_BASE_URL")
)

### Extract text

In [120]:
def extract_text_from_pdf(pdf_path):
    mypdf = fitz.open(pdf_path)
    all_text = ""

    for page in mypdf:
        all_text += page.get_text("text") + " "
    
    return all_text

pdf_path = "../data/AI_Information.pdf"

extracted_text = extract_text_from_pdf(pdf_path)

print(extracted_text[:500])

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past f


### Chunking

In [121]:
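def chunk_text(text, n, step):
    """Split text into chunks of n characters, starting a new chunk every `step` characters."""
    chunks = []

    # The third argument is the loop step, so consecutive chunks overlap by n - step characters
    for i in range(0, len(text), step):
        chunks.append(text[i:i + n])

    return chunks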

# define different chunk sizes to evaluate
chunk_sizes = [128, 256, 512]

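# create dictionary to store text chunks for each chunk size
# (the third argument is the step between chunk starts: size // 5, so neighbouring chunks overlap by about 80%)
text_chunks_dict = {size: chunk_text(extracted_text, size, size // 5) for size in chunk_sizes}  # dict[size, list of chunks]

# print number of chunks created for each chunk size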
for size, chunks in text_chunks_dict.items():
    print(f"Chunk size: {size}, number of chunks: {len(chunks)}")

Chunk size: 128, number of chunks: 1341
Chunk size: 256, number of chunks: 658
Chunk size: 512, number of chunks: 329
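In `chunk_text`, the third argument sets the step of the `range` loop, so it controls how far apart chunk starts are rather than how many characters neighbouring chunks share. That is why the chunk counts above track `size // 5` and not the chunk size itself. The following is a minimal sketch, assuming a hypothetical helper `sliding_chunks` (not part of this notebook) that takes the overlap explicitly and derives the step from it:

```python
import math


def sliding_chunks(text, size, overlap):
    """Split text into chunks of `size` characters.

    Consecutive chunks share exactly `overlap` characters
    (hypothetical helper, shown for illustration only).
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap  # distance between chunk starts
    return [text[i:i + size] for i in range(0, len(text), stride)]


def expected_chunk_count(text_len, stride):
    """Number of chunks a loop with the given stride produces."""
    return math.ceil(text_len / stride)


sample = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
chunks = sliding_chunks(sample, size=10, overlap=2)
print(chunks)
print(expected_chunk_count(len(sample), stride=8))
```

With this signature, an overlap of 20% of the chunk size would be written as `sliding_chunks(text, size, size // 5)`, which yields far fewer chunks than the step-based call used in the cell above.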


### Create embeddings for each chunk

In [122]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(os.getenv("LOCAL_EMBED_EN")).to(device)
tokenizer = AutoTokenizer.from_pretrained(os.getenv("LOCAL_EMBED_EN"))

def embed(text):
    if isinstance(text, str): text = [text] # wrap a single string in a one-element list
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt").to(device) # tokenize and move tensors to the device
    with torch.no_grad(): 
        output = model(**inputs) # forward pass without gradient tracking
        embedding = F.normalize(output.last_hidden_state[:, 0, :], p=2, dim=1) # take the first-token (typically [CLS]) vector and L2-normalize it
    return embedding.cpu().numpy() # NumPy array of shape (num_texts, hidden_dim)
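Because `embed` L2-normalizes every vector, the cosine similarity between two embeddings reduces to a plain dot product. A small NumPy sketch of that identity, using made-up 2-D vectors and the hypothetical helpers `l2_normalize` and `cosine` (neither appears in the notebook):

```python
import numpy as np


def l2_normalize(vectors):
    """Scale each row to unit L2 norm (rows with near-zero norm are left almost unchanged)."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def cosine(a, b):
    """Textbook cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Illustrative vectors only; real embeddings in this notebook have 768 dimensions
raw = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
unit = l2_normalize(raw)

dot_scores = unit @ unit[0]                                  # dot products of unit vectors
cos_scores = np.array([cosine(row, raw[0]) for row in raw])  # cosine on the raw vectors

print(dot_scores)
print(cos_scores)
```

Both arrays agree up to floating-point error, so a single matrix product over the normalized chunk embeddings is enough to score every chunk against a query.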

In [123]:
text_chunks_dict.items()

dict_items([(128, ['Understanding Artificial Intelligence \nChapter 1: Introduction to Artificial Intelligence \nArtificial intelligence (AI) refers t', 'Intelligence \nChapter 1: Introduction to Artificial Intelligence \nArtificial intelligence (AI) refers to the ability of a digita', 'Introduction to Artificial Intelligence \nArtificial intelligence (AI) refers to the ability of a digital computer or computer-co', 'l Intelligence \nArtificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot \nto perfor', 'l intelligence (AI) refers to the ability of a digital computer or computer-controlled robot \nto perform tasks commonly associat', 's to the ability of a digital computer or computer-controlled robot \nto perform tasks commonly associated with intelligent being', 'ital computer or computer-controlled robot \nto perform tasks commonly associated with intelligent beings. The term is frequently', '-controlled robot \nto perform tasks commonly 

In [124]:
from tqdm import tqdm
chunk_embeddings_dict = {size: embed(chunks) for size, chunks in tqdm(text_chunks_dict.items(), desc="Generating Embeddings")} # dict[size, ndarray of shape (num_chunks, dim)]


Generating Embeddings: 100%|██████████| 3/3 [00:12<00:00,  4.25s/it]


In [125]:
chunk_embeddings_dict # dict[size:array[embeddings]]

{128: array([[ 0.01779863,  0.02884399, -0.01863427, ..., -0.00918702,
          0.02004975,  0.00086159],
        [-0.00407528,  0.0249798 , -0.01202046, ..., -0.02553442,
          0.00013637, -0.00388858],
        [-0.00017993,  0.02555388,  0.00591069, ..., -0.02416707,
          0.00151111, -0.0009307 ],
        ...,
        [ 0.01139831,  0.02028845, -0.01382658, ..., -0.02051813,
          0.00317977,  0.02439051],
        [ 0.0154119 ,  0.0205931 ,  0.00378093, ..., -0.00332296,
         -0.00038543, -0.00156751],
        [ 0.00521575, -0.01485795,  0.01499774, ..., -0.00450879,
          0.02869543,  0.00327276]], shape=(1341, 768), dtype=float32),
 256: array([[ 0.01029931,  0.03366722,  0.001278  , ..., -0.01744429,
         -0.00084792, -0.00511596],
        [ 0.01369525,  0.02772786,  0.006009  , ..., -0.02886634,
          0.00439644,  0.01221239],
        [-0.00860512,  0.03000513,  0.01027754, ..., -0.01381614,
          0.00780758,  0.01017344],
        ...,
        [ 

In [126]:
def retrieve_relevant_chunks(query, text_chunks, chunk_embeddings, k=5):
    """
    Retrieves the top-k most relevant text chunks.
    
    Args:
    query (str): User query.
    text_chunks (List[str]): List of text chunks.
    chunk_embeddings (np.ndarray): Chunk embeddings, shape (num_chunks, dim).
    k (int): Number of top chunks to return.
    
    Returns:
    List[str]: Most relevant text chunks, most similar first.
    """
    # Embed the query and ensure it has shape (1, dim)
    query_embedding = np.array(embed(query)).reshape(1, -1)

    # Compute cosine similarity against all chunks in one call -> shape (num_chunks,)
    similarities = cosine_similarity(query_embedding, chunk_embeddings)[0]

    # Get top-k indices, highest similarity first
    top_indices = np.argsort(similarities)[-k:][::-1]

    # Return top-k text chunks
    return [text_chunks[i] for i in top_indices]
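The retrieval step is a top-k selection over similarity scores. As a sketch under the assumption that both the query and the chunk embeddings are already L2-normalized (as `embed` produces them), the scoring can be one matrix product, and `np.argpartition` avoids sorting the full score array. `top_k_indices` is a hypothetical name, and the vectors below are toy data:

```python
import numpy as np


def top_k_indices(query_vec, chunk_matrix, k=5):
    """Return indices of the k rows most similar to query_vec, best first.

    Assumes query_vec (dim,) and chunk_matrix (num_chunks, dim) are
    L2-normalized, so the dot product equals cosine similarity.
    """
    scores = chunk_matrix @ query_vec
    k = min(k, len(scores))
    if k <= 0:
        return np.array([], dtype=int)
    # Select the k largest scores without a full sort, then order just those k
    candidates = np.argpartition(scores, -k)[-k:]
    return candidates[np.argsort(scores[candidates])[::-1]]


# Toy unit vectors for illustration
chunk_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]])
query_vec = np.array([0.6, 0.8])

print(top_k_indices(query_vec, chunk_matrix, k=2))
```

For a few thousand chunks the difference from a full `np.argsort` is negligible; the partial selection mainly matters once the index grows much larger.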


In [127]:
# Load the validation data from a JSON file
with open('../data/val.json') as f:
    data = json.load(f)

# Extract the query at index 3 (the fourth entry) from the validation data
query = data[3]['question']

# Retrieve relevant chunks for each chunk size
retrieved_chunks_dict = {size: retrieve_relevant_chunks(query, text_chunks_dict[size], chunk_embeddings_dict[size]) for size in chunk_sizes}

# Print retrieved chunks for chunk size 128
print(retrieved_chunks_dict[128])

['Medicine \nAI enables personalized medicine by analyzing individual patient data, predicting treatment \nresponses, and tailoring ', ' bringing new treatments to market. \nPersonalized Medicine \nAI enables personalized medicine by analyzing individual patient dat', 'to market. \nPersonalized Medicine \nAI enables personalized medicine by analyzing individual patient data, predicting treatment \n', 'uce the time and cost \nof bringing new treatments to market. \nPersonalized Medicine \nAI enables personalized medicine by analyzi', 's. AI-powered systems reduce the time and cost \nof bringing new treatments to market. \nPersonalized Medicine \nAI enables persona']


In [128]:
# Define the system prompt for the AI assistant
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"

def generate_response(query, system_prompt, retrieved_chunks, model=os.getenv("GEMINI_GEN_MODEL")):
    """
    Generates an AI response based on retrieved chunks.

    Args:
    query (str): User query.
    system_prompt (str): System instruction that restricts the answer to the context.
    retrieved_chunks (List[str]): List of retrieved text chunks.
    model (str): AI model.

    Returns:
    str: AI-generated response.
    """
    # Combine retrieved chunks into a single context string
    context = "\n".join([f"Context {i+1}:\n{chunk}" for i, chunk in enumerate(retrieved_chunks)])
    
    # Create the user prompt by combining the context and the query
    user_prompt = f"{context}\n\nQuestion: {query}"

    # Generate the AI response using the specified model
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )

    # Return the content of the AI response
    return response.choices[0].message.content

# Generate AI responses for each chunk size
ai_responses_dict = {size: generate_response(query, system_prompt, retrieved_chunks_dict[size]) for size in chunk_sizes}

# Print the responses for chunk sizes 128, 256, and 512
print(ai_responses_dict[128])
print(ai_responses_dict[256])
print(ai_responses_dict[512])

AI enables personalized medicine by analyzing individual patient data, predicting treatment responses, and tailoring.
AI contributes to personalized medicine by analyzing individual patient data, predicting treatment responses, and tailoring interventions.
AI contributes to personalized medicine by analyzing individual patient data, predicting treatment responses, and tailoring interventions. This enhances treatment effectiveness and reduces adverse effects.


In [129]:
# Define evaluation scoring system constants
SCORE_FULL = 1.0     # Complete match or fully satisfactory
SCORE_PARTIAL = 0.5  # Partial match or somewhat satisfactory
SCORE_NONE = 0.0     # No match or unsatisfactory

In [130]:
# Define strict evaluation prompt templates
FAITHFULNESS_PROMPT_TEMPLATE = """
Evaluate the faithfulness of the AI response compared to the true answer.
User Query: {question}
AI Response: {response}
True Answer: {true_answer}

Faithfulness measures how well the AI response aligns with facts in the true answer, without hallucinations.

INSTRUCTIONS:
- Score STRICTLY using only these values:
    * {full} = Completely faithful, no contradictions with true answer
    * {partial} = Partially faithful, minor contradictions
    * {none} = Not faithful, major contradictions or hallucinations
- Return ONLY the numerical score ({full}, {partial}, or {none}) with no explanation or additional text.
"""

In [131]:
RELEVANCY_PROMPT_TEMPLATE = """
Evaluate the relevancy of the AI response to the user query.
User Query: {question}
AI Response: {response}

Relevancy measures how well the response addresses the user's question.

INSTRUCTIONS:
- Score STRICTLY using only these values:
    * {full} = Completely relevant, directly addresses the query
    * {partial} = Partially relevant, addresses some aspects
    * {none} = Not relevant, fails to address the query
- Return ONLY the numerical score ({full}, {partial}, or {none}) with no explanation or additional text.
"""


In [None]:
def evaluate_response(question, response, true_answer):
        """
        Evaluates the quality of an AI-generated response based on faithfulness and relevancy.

        Args:
        question (str): The user's original question.
        response (str): The AI-generated response being evaluated.
        true_answer (str): The correct answer used as ground truth.

        Returns:
        Tuple[float, float]: A tuple containing (faithfulness_score, relevancy_score).
                                                Each score is one of: 1.0 (full), 0.5 (partial), or 0.0 (none).
        """
        # Format the evaluation prompts
        faithfulness_prompt = FAITHFULNESS_PROMPT_TEMPLATE.format(
                question=question, 
                response=response, 
                true_answer=true_answer,
                full=SCORE_FULL,
                partial=SCORE_PARTIAL,
                none=SCORE_NONE
        )
        
        relevancy_prompt = RELEVANCY_PROMPT_TEMPLATE.format(
                question=question, 
                response=response,
                full=SCORE_FULL,
                partial=SCORE_PARTIAL,
                none=SCORE_NONE
        )

        # Request faithfulness evaluation from the model
        faithfulness_response = client.chat.completions.create(
                model=os.getenv("GEMINI_GEN_MODEL"),
                temperature=0,
                messages=[
                        {"role": "system", "content": "You are an objective evaluator. Return ONLY the numerical score."},
                        {"role": "user", "content": faithfulness_prompt}
                ]
        )
        
        # Request relevancy evaluation from the model
        relevancy_response = client.chat.completions.create(
                model=os.getenv("GEMINI_GEN_MODEL"),
                temperature=0,
                messages=[
                        {"role": "system", "content": "You are an objective evaluator. Return ONLY the numerical score."},
                        {"role": "user", "content": relevancy_prompt}
                ]
        )
        
        # Extract scores and handle potential parsing errors
        try:
                faithfulness_score = float(faithfulness_response.choices[0].message.content.strip())
        except ValueError:
                print("Warning: Could not parse faithfulness score, defaulting to 0")
                faithfulness_score = 0.0
                
        try:
                relevancy_score = float(relevancy_response.choices[0].message.content.strip())
        except ValueError:
                print("Warning: Could not parse relevancy score, defaulting to 0")
                relevancy_score = 0.0

        return faithfulness_score, relevancy_score

# True answer for the validation entry at index 3
true_answer = data[3]['ideal_answer']

# Evaluate the responses for chunk sizes 256, 128, and 512
faithfulness, relevancy = evaluate_response(query, ai_responses_dict[256], true_answer)
faithfulness2, relevancy2 = evaluate_response(query, ai_responses_dict[128], true_answer)
faithfulness3, relevancy3 = evaluate_response(query, ai_responses_dict[512], true_answer)

# print the evaluation scores
print(f"Faithfulness Score (Chunk Size 256): {faithfulness}")
print(f"Relevancy Score (Chunk Size 256): {relevancy}")

print("\n")

print(f"Faithfulness Score (Chunk Size 128): {faithfulness2}")
print(f"Relevancy Score (Chunk Size 128): {relevancy2}")

print("\n")

print(f"Faithfulness Score (Chunk Size 512): {faithfulness3}")
print(f"Relevancy Score (Chunk Size 512): {relevancy3}")


Faithfulness Score (Chunk Size 256): 1.0
Relevancy Score (Chunk Size 256): 1.0


Faithfulness Score (Chunk Size 128): 0.5
Relevancy Score (Chunk Size 128): 1.0


In [135]:
print(f"Faithfulness Score (Chunk Size 512): {faithfulness3}")
print(f"Relevancy Score (Chunk Size 512): {relevancy3}")

Faithfulness Score (Chunk Size 512): 1.0
Relevancy Score (Chunk Size 512): 1.0


In [134]:
ideal_res = embed(data[3]['ideal_answer'])
# Compare each response with the ideal answer; output order is chunk size 128, 256, 512
for size, response in ai_responses_dict.items():
    ai_res = embed(response)
    print(cosine_similarity(ai_res, ideal_res))

[[0.97333324]]
[[0.9651215]]
[[0.98304504]]
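The last step of the procedure, comparing results across chunk sizes, can be sketched as a small aggregation. The scores below are the faithfulness and relevancy values reported above for the single query at index 3; `summarize` is a hypothetical helper, and with only one query per size the averages are not enough to rank chunk sizes:

```python
from statistics import mean


def summarize(scores_by_size):
    """Average (faithfulness, relevancy) pairs for each chunk size."""
    summary = {}
    for size, pairs in scores_by_size.items():
        summary[size] = {
            "faithfulness": mean(f for f, _ in pairs),
            "relevancy": mean(r for _, r in pairs),
        }
    return summary


# One (faithfulness, relevancy) pair per evaluated query; here a single query per size
scores_by_size = {
    128: [(0.5, 1.0)],
    256: [(1.0, 1.0)],
    512: [(1.0, 1.0)],
}

summary = summarize(scores_by_size)
for size, s in sorted(summary.items()):
    print(f"size={size}: faithfulness={s['faithfulness']:.2f}, relevancy={s['relevancy']:.2f}")
```

Appending one pair per entry of the validation file to each list would turn this into a comparison over the whole validation set.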
