### Note
Adapted from this repo. It uses the OpenAI-compatible API only; the Google GenAI SDK is not used.

### Why?
Text chunking is an essential step in Retrieval-Augmented Generation (RAG), where large text bodies are divided into meaningful segments to improve retrieval accuracy. Unlike fixed-length chunking, semantic chunking splits text based on the content similarity between sentences.

### Breakpoint Methods
- **Percentile**: Compute the Xth percentile of the similarities between consecutive sentences and split wherever the similarity falls below that value
- **Standard Deviation**: Split where the similarity falls more than X standard deviations below the mean
- **Interquartile Range (IQR)**: Use the interquartile range (Q3 - Q1) to set the cut-off and split where the similarity falls below Q1 - 1.5 * IQR
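
To make the three rules concrete, here is a minimal sketch on made-up similarity scores. The values in `sims` and the chosen parameters (25th percentile, 1 standard deviation) are assumptions for illustration only, not the settings used later in this notebook.

```python
import numpy as np

# Toy similarity scores between consecutive sentences (made-up values).
sims = np.array([0.90, 0.88, 0.60, 0.91, 0.87, 0.89, 0.55, 0.92])

# Percentile: the cut-off is the Xth percentile of the scores (X = 25 here).
pct_cut = np.percentile(sims, 25)

# Standard deviation: the cut-off sits X standard deviations below the mean (X = 1 here).
std_cut = sims.mean() - 1.0 * sims.std()

# IQR: the cut-off is the usual lower outlier fence, Q1 - 1.5 * IQR.
q1, q3 = np.percentile(sims, [25, 75])
iqr_cut = q1 - 1.5 * (q3 - q1)

# For each rule, collect the gaps whose similarity is below the cut-off.
splits_by_method = {}
for name, cut in [("percentile", pct_cut), ("std", std_cut), ("iqr", iqr_cut)]:
    splits_by_method[name] = [i for i, s in enumerate(sims) if s < cut]
    print(f"{name}: cut-off={cut:.3f}, split after sentences {splits_by_method[name]}")
```

On this toy data all three rules pick the same two dips (gaps 2 and 6). On real text they usually disagree, which is why the parameter needs tuning per method.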

### 1. Set up env

In [3]:
import fitz
from dotenv import load_dotenv
import os
import numpy as np
import json
import time
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

### 2. Extract text

In [4]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts the text of every page in a PDF file and returns it as one string.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: Extracted text from the PDF.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page in mypdf:
        all_text += page.get_text("text") + " "

    return all_text.strip()  # Return the extracted text

# Define the path to the PDF file
pdf_path = "../data/AI_Information.pdf"

# Extract text from the PDF file
extracted_text = extract_text_from_pdf(pdf_path)

# Print the first 500 characters of the extracted text
print(extracted_text[:500])

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past f


### 3. Create sentence-level embedding

We split the text into sentences and generate an embedding for each one.

The embed function uses a local pretrained model because Gemini limits requests per minute.

This model runs smoothly on my GTX 1650 with 4 GB of VRAM.

In [5]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained("BAAI/bge-base-en").to(device)
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en")

def embed(text):
    # Accept a single string or a list of strings
    if isinstance(text, str):
        text = [text]
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model(**inputs)
        # Use the [CLS] token embedding and L2-normalize it
        embedding = F.normalize(output.last_hidden_state[:, 0, :], p=2, dim=1)
    return embedding.cpu().numpy()
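
Passing every sentence to the model in one call can exhaust memory on a small GPU for longer documents. A minimal sketch of batching follows. `embed_in_batches` and `fake_embed` are hypothetical names; `fake_embed` is only a stand-in so the block runs without loading a model, and any function that maps a list of strings to a 2-D array can replace it.

```python
import numpy as np

def embed_in_batches(texts, embed_fn, batch_size=32):
    """Embed a non-empty list of texts in fixed-size batches and stack the results."""
    parts = [
        embed_fn(texts[i:i + batch_size])
        for i in range(0, len(texts), batch_size)
    ]
    # Each part has shape (batch, dim); stacking gives (len(texts), dim).
    return np.vstack(parts)

def fake_embed(batch):
    """Stand-in for a real embedding function: one 2-D vector per text."""
    return np.array([[len(t), 1.0] for t in batch], dtype=np.float32)

vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(vectors.shape)
```

With the notebook's own function this would be called as `embed_in_batches(sentences, embed)`. The batch size of 32 is an arbitrary starting point, not a measured optimum.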


In [6]:
# Splitting text into sentences
sentences = extracted_text.split(". ")  # Optional: use nltk/sentencepiece for more accurate splitting

# Get all embeddings
embeddings = embed(sentences)

print(f"Generated {len(embeddings)} sentence embeddings.")

Generated 257 sentence embeddings.
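
Splitting on `". "` misses sentences that end in `?` or `!` and drops the trailing period. As a hedged alternative that needs no extra dependency, the sketch below uses a regular expression. `split_sentences` is a hypothetical helper, and it still mis-splits abbreviations such as "e.g. ", so a dedicated tokenizer remains the more accurate option.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings that appear when the input is empty.
    return [p for p in parts if p]

sample = "AI is broad. Is it new? Not really! It dates to 1956."
print(split_sentences(sample))
```

Because each sentence keeps its own punctuation here, chunks built from these sentences should be joined with a single space rather than `". "`.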


In [7]:
embeddings

array([[ 0.01306901,  0.04015021, -0.00235185, ..., -0.02536607,
         0.00107017, -0.00399907],
       [-0.00877151,  0.0327035 ,  0.00057982, ..., -0.02138391,
        -0.00518402,  0.00479695],
       [-0.01059035,  0.00150866, -0.00066208, ..., -0.01109738,
         0.00259597, -0.00476295],
       ...,
       [ 0.01444942,  0.03102358, -0.00610783, ..., -0.01105188,
         0.00620361,  0.00885018],
       [ 0.02855043, -0.02881697,  0.0228119 , ..., -0.01229273,
         0.02225733,  0.00549865],
       [ 0.01170838, -0.00375156, -0.00367259, ..., -0.01511175,
        -0.00050225, -0.00193483]], shape=(257, 768), dtype=float32)

### 4. Calculate similarities
We compute cosine similarity between consecutive sentences.

In [8]:
similarities = [cosine_similarity(embeddings[i].reshape(1, -1), embeddings[i + 1].reshape(1, -1)) for i in range(len(embeddings) - 1)]
similarities

[array([[0.78351605]], dtype=float32),
 array([[0.79600334]], dtype=float32),
 array([[0.84577465]], dtype=float32),
 array([[0.8563843]], dtype=float32),
 array([[0.86224794]], dtype=float32),
 array([[0.85106635]], dtype=float32),
 array([[0.80903834]], dtype=float32),
 array([[0.8385702]], dtype=float32),
 array([[0.8805547]], dtype=float32),
 array([[0.9036343]], dtype=float32),
 array([[0.8707457]], dtype=float32),
 array([[0.7875351]], dtype=float32),
 array([[0.7268816]], dtype=float32),
 array([[0.86637866]], dtype=float32),
 array([[0.8530915]], dtype=float32),
 array([[0.89408314]], dtype=float32),
 array([[0.8129629]], dtype=float32),
 array([[0.80566347]], dtype=float32),
 array([[0.8363713]], dtype=float32),
 array([[0.76574343]], dtype=float32),
 array([[0.8858226]], dtype=float32),
 array([[0.8135253]], dtype=float32),
 array([[0.77825916]], dtype=float32),
 array([[0.8125247]], dtype=float32),
 array([[0.81830573]], dtype=float32),
 array([[0.87223226]], dtype=float32),
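
The list above holds one `(1, 1)` array per sentence pair. If plain floats are more convenient, the same quantity can be computed in one vectorized step. This is a sketch with a hypothetical helper name, `consecutive_similarities`, shown on toy 2-D vectors; it assumes no embedding is the zero vector.

```python
import numpy as np

def consecutive_similarities(embeddings):
    """Cosine similarity between each row and the next one, as a 1-D array."""
    emb = np.asarray(embeddings, dtype=np.float64)
    # Normalize rows so the dot product equals the cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    # Row-wise dot product of row i with row i + 1.
    return np.sum(emb[:-1] * emb[1:], axis=1)

toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sims = consecutive_similarities(toy)
print(sims)
```

For N embeddings the result has N - 1 entries, one per gap between neighbouring sentences.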

### 5. Implementing Semantic Chunking
We implement three different methods for finding breakpoints.

In [9]:
def compute_breakpoints(similarities, method="percentile", threshold=90):
    """
    Computes chunking breakpoints based on similarity drops.

    Args:
    similarities (List[float]): List of similarity scores between sentences.
    method (str): 'percentile', 'standard_deviation', or 'interquartile'.
    threshold (float): Threshold value (percentile for 'percentile', std devs for 'standard_deviation'; ignored for 'interquartile').

    Returns:
    List[int]: Indices where chunk splits should occur.
    """
    # Determine the threshold value based on the selected method
    if method == "percentile":
        # Calculate the Xth percentile of the similarity scores
        threshold_value = np.percentile(similarities, threshold)
    elif method == "standard_deviation":
        # Calculate the mean and standard deviation of the similarity scores
        mean = np.mean(similarities)
        std_dev = np.std(similarities)
        # Set the threshold value to mean minus X standard deviations
        threshold_value = mean - (threshold * std_dev)
    elif method == "interquartile":
        # Calculate the first and third quartiles (Q1 and Q3)
        q1, q3 = np.percentile(similarities, [25, 75])
        # Set the threshold value using the IQR rule for outliers
        threshold_value = q1 - 1.5 * (q3 - q1)
    else:
        # Raise an error if an invalid method is provided
        raise ValueError("Invalid method. Choose 'percentile', 'standard_deviation', or 'interquartile'.")

    # Identify indices where similarity drops below the threshold value
    return [i for i, sim in enumerate(similarities) if sim < threshold_value]

# Compute breakpoints using the percentile method with a threshold of 90.
# Every gap whose similarity is below the 90th percentile becomes a breakpoint.
breakpoints = compute_breakpoints(similarities, method="percentile", threshold=90)
breakpoints


[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 65,
 66,
 67,
 68,
 69,
 70,
 72,
 73,
 74,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99,
 100,
 101,
 102,
 103,
 104,
 106,
 107,
 108,
 110,
 111,
 112,
 113,
 114,
 115,
 116,
 117,
 118,
 119,
 120,
 121,
 122,
 124,
 125,
 126,
 128,
 130,
 131,
 132,
 133,
 134,
 135,
 136,
 137,
 138,
 140,
 141,
 142,
 143,
 144,
 145,
 146,
 148,
 149,
 150,
 152,
 154,
 155,
 156,
 158,
 160,
 161,
 162,
 163,
 164,
 165,
 166,
 167,
 168,
 170,
 171,
 172,
 173,
 175,
 176,
 177,
 178,
 180,
 181,
 182,
 183,
 184,
 185,
 186,
 188,
 189,
 190,
 191,
 192,
 193,
 194,
 195,
 196,
 198,
 199,
 200,
 201,
 202,
 203,
 204,
 206,
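
With `threshold=90`, the percentile rule above keeps every gap whose similarity is below the 90th percentile, so most gaps become breakpoints. A common variant works on the drops instead: convert each similarity to a distance (1 - similarity) and split only where the distance exceeds the Xth percentile. The sketch below is that variant, not the method used in this notebook; `breakpoints_from_drops` and the toy scores are assumptions for illustration.

```python
import numpy as np

def breakpoints_from_drops(similarities, percentile=90):
    """Return gap indices whose distance (1 - similarity) exceeds the given percentile."""
    # ravel() also flattens a list of (1, 1) arrays into a 1-D vector.
    sims = np.asarray(similarities, dtype=np.float64).ravel()
    distances = 1.0 - sims
    cut = np.percentile(distances, percentile)
    return [i for i, d in enumerate(distances) if d > cut]

toy_sims = [0.90, 0.88, 0.60, 0.91, 0.87, 0.89, 0.55, 0.92, 0.90, 0.86]
print(breakpoints_from_drops(toy_sims, percentile=80))
```

On the toy scores only the two clear dips (gaps 2 and 6) are selected. Roughly (100 - X) percent of the gaps end up as breakpoints with this rule.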


### 6. Splitting Text into Semantic Chunks

We split the text based on computed breakpoints.


In [10]:
def split_into_chunks(sentences, breakpoints):
    """
    Splits sentences into semantic chunks.

    Args:
    sentences (List[str]): List of sentences.
    breakpoints (List[int]): Indices where chunking should occur.

    Returns:
    List[str]: List of text chunks.
    """
    chunks = []  # Initialize an empty list to store the chunks
    start = 0  # Initialize the start index

    # Iterate through each breakpoint to create chunks
    for bp in breakpoints:
        # Append the chunk of sentences from start to the current breakpoint
        chunks.append(". ".join(sentences[start:bp + 1]) + ".")
        start = bp + 1  # Update the start index to the next sentence after the breakpoint

    # Append the remaining sentences as the last chunk
    chunks.append(". ".join(sentences[start:]))
    return chunks  # Return the list of chunks

# Create chunks using the split_into_chunks function
text_chunks = split_into_chunks(sentences, breakpoints)

# Print the number of chunks created
print(f"Number of semantic chunks: {len(text_chunks)}")

# Print the chunks to verify the result (this prints the whole list)
print("\nFirst text chunk:")
print(text_chunks)


Number of semantic chunks: 231

First text chunk:
['Understanding Artificial Intelligence \nChapter 1: Introduction to Artificial Intelligence \nArtificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot \nto perform tasks commonly associated with intelligent beings.', 'The term is frequently applied to \nthe project of developing systems endowed with the intellectual processes characteristic of \nhumans, such as the ability to reason, discover meaning, generalize, or learn from past \nexperience.', 'Over the past few decades, advancements in computing power and data availability \nhave significantly accelerated the development and deployment of AI.', '\nHistorical Context \nThe idea of artificial intelligence has existed for centuries, often depicted in myths and fiction.', '\nHowever, the formal field of AI research began in the mid-20th century.', 'The Dartmouth Workshop \nin 1956 is widely considered the birthplace of AI.', 'Early AI resea
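
How breakpoint indices map to chunks is easier to see on a tiny input. A breakpoint `i` means the split falls after sentence `i`. The sketch below uses a hypothetical helper, `chunks_from_breakpoints`, which differs from `split_into_chunks` in two assumed ways: it joins with a space (the toy sentences keep their periods) and it skips an empty trailing chunk.

```python
def chunks_from_breakpoints(sentences, breakpoints):
    """Group sentences into chunks, ending a chunk after each breakpoint index."""
    chunks, start = [], 0
    for bp in sorted(breakpoints):
        chunks.append(" ".join(sentences[start:bp + 1]))
        start = bp + 1
    # Add whatever remains after the last breakpoint, if anything.
    if start < len(sentences):
        chunks.append(" ".join(sentences[start:]))
    return chunks

toy_sentences = ["A.", "B.", "C.", "D.", "E."]
print(chunks_from_breakpoints(toy_sentences, [1, 3]))
```

With breakpoints at 1 and 3, five sentences become three chunks: sentences 0-1, 2-3, and 4.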

### 7. Creating Embeddings for Semantic Chunks

We create embeddings for each chunk for later retrieval.

In [11]:
def create_embeddings(text_chunks):
    """
    Creates embeddings for each text chunk.

    Args:
    text_chunks (List[str]): List of text chunks.

    Returns:
    List[np.ndarray]: List of embedding vectors.
    """
    # Generate embeddings for each text chunk using the embed function
    return [embed(chunk) for chunk in text_chunks]

# Create chunk embeddings using the create_embeddings function
chunk_embeddings = create_embeddings(text_chunks)


In [12]:
chunk_embeddings

[array([[ 1.02993045e-02,  3.36671807e-02,  1.27800961e-03,
         -1.37690641e-02,  6.06240891e-02,  4.31134785e-03,
          5.27668819e-02,  3.52631249e-02, -2.00078115e-02,
         -2.63740662e-02, -3.90928835e-02,  2.12220773e-02,
         -5.81960753e-02,  3.55584435e-02, -1.17446305e-02,
          7.46376589e-02,  5.82213029e-02, -1.45262689e-03,
         -9.72324796e-03, -4.43250127e-03, -3.04036010e-02,
         -1.72888152e-02,  2.09294111e-02,  3.06009073e-02,
          3.82794626e-02, -4.76491358e-03,  1.12352055e-02,
          1.85596887e-02, -5.93509264e-02, -4.32772824e-04,
          5.64121380e-02, -3.39476317e-02, -2.21382361e-02,
         -3.59113365e-02, -2.37848293e-02, -2.86272564e-03,
         -1.05305817e-02, -2.96872724e-02, -9.33797657e-03,
         -4.54072747e-03, -2.71239821e-02,  7.87651259e-03,
         -2.79891305e-02,  1.30511094e-02, -5.15639782e-02,
          1.50566213e-02, -7.76670501e-02,  5.41709997e-02,
          1.19998073e-02, -1.70478541e-0

### 8. Performing Semantic Search

We implement cosine similarity to retrieve the most relevant chunks.

In [13]:
def semantic_search(query, text_chunks, chunk_embeddings, k=5):
    """
    Finds the most relevant text chunks for a query.

    Args:
        query (str): Search query.
        text_chunks (List[str]): List of text chunks.
        chunk_embeddings (List[np.ndarray]): List of chunk embeddings.
        k (int): Number of top results to return.

    Returns:
        List[str]: Top-k relevant chunks.
    """
    # Embed the query
    query_embedding = embed(query)

    # Ensure query_embedding has correct shape (1, dim)
    query_embedding = np.array(query_embedding).reshape(1, -1)

    # Compute cosine similarities
    similarities = np.array([
        cosine_similarity(query_embedding, emb.reshape(1, -1))[0][0]  # scalar similarity
        for emb in chunk_embeddings
    ])

    # Get top-k indices
    top_indices = np.argsort(similarities)[-k:][::-1]

    # Return top-k text chunks
    return [text_chunks[i] for i in top_indices]
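
The loop above calls `cosine_similarity` once per chunk. With more chunks, a single matrix product is usually faster. The sketch below shows the idea on toy 2-D vectors. `top_k_indices` is a hypothetical helper, and it assumes the chunk embeddings have been stacked into one `(n_chunks, dim)` matrix, for example with `np.vstack(chunk_embeddings)`.

```python
import numpy as np

def top_k_indices(query_vec, chunk_matrix, k=5):
    """Return the indices and cosine scores of the k rows most similar to the query."""
    q = np.asarray(query_vec, dtype=np.float64).reshape(-1)
    m = np.asarray(chunk_matrix, dtype=np.float64)
    # Normalize both sides so the dot product is the cosine similarity.
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    scores = m @ q
    # Sort descending and keep the first k positions.
    order = np.argsort(scores)[::-1][:k]
    return order.tolist(), scores[order].tolist()

toy_chunks = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
idx, scores = top_k_indices([1.0, 0.2], toy_chunks, k=2)
print(idx, scores)
```

Returning the scores alongside the indices also makes it possible to drop weak matches with a minimum-similarity cut-off.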


In [14]:
# Load the validation data from a JSON file
with open('../data/val.json') as f:
    data = json.load(f)

# Extract the first query from the validation data
query = data[0]['question']

# Get top 2 relevant chunks
top_chunks = semantic_search(query, text_chunks, chunk_embeddings, k=2)

# Print the query
print(f"Query: {query}")

# Print the top 2 most relevant text chunks
for i, chunk in enumerate(top_chunks):
    print(f"Context {i+1}:\n{chunk}\n{'='*40}")


Query: What is 'Explainable AI' and why is it considered important?
Context 1:

Explainable AI (XAI) 
Explainable AI (XAI) aims to make AI systems more transparent and understandable.
Context 2:

Transparency and Explainability 
Transparency and explainability are essential for building trust in AI systems.


### 9. Generating a Response Based on Retrieved Chunks

In [15]:
load_dotenv("../conf.env")
client = OpenAI(
    api_key=os.getenv("GEMINI_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

In [16]:
# Define the system prompt for the AI assistant
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"

def generate_response(system_prompt, user_message, model="gemini-2.5-flash"):
    """
    Generates a response from the AI model based on the system prompt and user message.

    Args:
    system_prompt (str): The system prompt to guide the AI's behavior.
    user_message (str): The user's message or query.
    model (str): The model to be used for generating the response. Default is "gemini-2.5-flash".

    Returns:
    ChatCompletion: The response object from the AI model.
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response

# Create the user prompt based on the top chunks
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}"

# Generate AI response
ai_response = generate_response(system_prompt, user_prompt)


In [17]:
ai_response_text = ai_response.choices[0].message.content
ai_response_text

'Explainable AI (XAI) aims to make AI systems more transparent and understandable. It is considered important because transparency and explainability are essential for building trust in AI systems.'

### 10. Evaluation

In [18]:
# Define the system prompt for the evaluation system
evaluate_system_prompt = "You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5."

# Create the evaluation prompt by combining the user query, AI response, true response, and evaluation system prompt
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response_text}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"

# Generate the evaluation response using the evaluation system prompt and evaluation prompt
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)

# Print the evaluation response
print(evaluation_response.choices[0].message.content)

Score: 0.5
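
To average scores over the whole validation set, the judge's free-text reply has to be turned into a number. A minimal sketch follows. `parse_score` is a hypothetical helper that assumes the reply contains the score as its first number, as in the output above, and rejects anything other than 0, 0.5, or 1.

```python
import re

ALLOWED_SCORES = (0.0, 0.5, 1.0)

def parse_score(text):
    """Return the first number in the reply if it is a valid score, else None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return None
    value = float(match.group(0))
    return value if value in ALLOWED_SCORES else None

print(parse_score("Score: 0.5"))
print(parse_score("The response is unrelated."))
```

Returning `None` for unparseable replies keeps malformed judge output from silently counting as a zero.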


In [19]:
print(embed(ai_response_text).shape)

(1, 768)


In [20]:
embed(data[0]['ideal_answer']).shape

(1, 768)

In [21]:
ai_gen = embed(ai_response_text) # (1,x)
ideal_re = embed(data[0]['ideal_answer']) # (1,x)
print(ai_gen.shape, ideal_re.shape)

cosine_similarity(ai_gen, ideal_re)

(1, 768) (1, 768)


array([[0.99459505]], dtype=float32)