# **BONUS CHALLENGE: LangGraph RAG + RAGAS Evaluation**

Complete pipeline to:
1. Build a baseline RAG system using naive chunking
2. Evaluate it with RAGAS
3. Implement semantic chunking
4. Compare evaluation metrics between both systems


## **Step 1: Baseline LangGraph RAG with Naive Retrieval**

In [2]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

In [None]:
from typing_extensions import TypedDict, List

from langchain_openai import ChatOpenAI
from langchain_qdrant import QdrantVectorStore
from langchain_openai import OpenAIEmbeddings  # langchain.embeddings.OpenAIEmbeddings is deprecated
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.documents import Document
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langgraph.graph import StateGraph, START

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# Step 1: Load and preprocess documents
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
print(f"# of chunks: {len(chunks)}")

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
qdrant = QdrantClient(":memory:")
collection_name = "naive_chunking"

qdrant.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)

vectorstore = QdrantVectorStore(
    client=qdrant,
    collection_name=collection_name,
    embedding=embeddings,
)
_ = vectorstore.add_documents(documents=chunks)

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
llm = ChatOpenAI(model="gpt-4.1-nano")

# Step 2: LangGraph State Definition
class GraphState(TypedDict):
    question: str
    context: List[Document]
    response: str

# Step 3: Retrieval Node
def retrieve(state):
    retrieved_docs = retriever.invoke(state["question"])
    return {"context": retrieved_docs}

# Step 4: Augmentation
RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

# Step 5: Generation Node
def generate(state):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
    response = llm.invoke(messages)
    return {"response": response.content}

# Step 6: Build LangGraph
graph_builder = StateGraph(GraphState).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
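
To make the `chunk_size=1000, chunk_overlap=200` settings above concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line boundaries, which this sketch ignores. The helper name `sliding_window_chunks` is hypothetical and not part of LangChain.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Illustrative input: 2,500 characters of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
pieces = sliding_window_chunks(sample)
print([len(p) for p in pieces])  # → [1000, 1000, 900]
```

Each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. That repetition is what lets a sentence straddling a boundary survive intact in at least one chunk.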


## **Step 2: Baseline Evaluation using RAGAS Metrics**

In [15]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator
from ragas import EvaluationDataset
from ragas import evaluate, RunConfig
from ragas.metrics import (
    faithfulness, answer_relevancy, context_precision, context_recall, answer_correctness
)

# Generate testset from source documents
print("Generating dataset...")
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(embeddings=embeddings)

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)
print("Dataset generated.")


# Run inference and update answers
print("Executing RAG chain...")
for test_row in dataset:
    response = graph.invoke({"question": test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["response"]
    test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
print("RAG chain executed successfully.")

# RAG evaluation
print("Executing RAGAS evaluation...")
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
baseline_evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
custom_run_config = RunConfig(timeout=360)
baseline_ragas_report = evaluate(
    dataset=baseline_evaluation_dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_correctness
    ],
    llm=evaluator_llm,
    run_config=custom_run_config
)

print("RAGAS Results using Naive Retrieval:")
print(baseline_ragas_report)


Generating dataset...



unable to apply transformation: Error code: 500 - {'error': {'message': 'The server had an error while processing your request. Sorry about that!', 'type': 'server_error', 'param': None, 'code': None}}



Dataset generated.
Executing RAG chain...
RAG chain executed successfully.
Executing RAGAS evaluation...



Exception raised in Job[19]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1 in organization org-ULKD0BfHsBV4junY2FxMu8EA on tokens per min (TPM): Limit 30000, Used 30000, Requested 1476. Please try again in 2.952s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[5]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1 in organization org-ULKD0BfHsBV4junY2FxMu8EA on tokens per min (TPM): Limit 30000, Used 29831, Requested 2495. Please try again in 4.652s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[18]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1 in organization org-ULKD0BfHsBV4junY2FxMu8EA on tokens per min (TPM): Limit 30000, Used 29870, Reques

RAGAS Results using Naive Retrieval:
{'faithfulness': 0.8681, 'answer_relevancy': 0.7868, 'context_precision': 0.5333, 'context_recall': 0.8701, 'answer_correctness': 0.7673}
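
The `RateLimitError` lines in the log above show that some evaluation jobs hit the tokens-per-minute limit. One common mitigation is to retry with exponential backoff. The sketch below is generic and independent of RAGAS; `call_with_backoff` and `flaky` are hypothetical names, and the delays are illustrative. RAGAS's `RunConfig` also appears to expose retry and concurrency settings, which are worth checking in its documentation before hand-rolling a wrapper.

```python
import random
import time


def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # in real code, catch only the rate-limit error type
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)


# Demo with a function that fails twice before succeeding.
attempts = {"count": 0}

def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

waits = []
result = call_with_backoff(flaky, sleep=waits.append)  # record delays instead of sleeping
print(result, len(waits))  # → ok 2
```

The `sleep` parameter is injected only so the demo runs instantly; in practice the default `time.sleep` would be used.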


## **Step 3: LangGraph RAG with Semantic Chunking**

In [None]:
from typing_extensions import TypedDict, List
from langchain_openai import ChatOpenAI
from langchain_qdrant import QdrantVectorStore
from langchain_openai import OpenAIEmbeddings  # langchain.embeddings.OpenAIEmbeddings is deprecated
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.documents import Document
from langchain_community.document_loaders import DirectoryLoader

from langgraph.graph import StateGraph, START

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import tiktoken
import re
import time

# ================================
# Step 1: Semantic Chunking Logic
# ================================
semantic_embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

# Tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))

def clean_sentences(sentences: list[str], min_words: int = 3) -> list[str]:
    cleaned = []
    for s in sentences:
        # Replace newlines, non-breaking spaces, HTML entities
        s = s.replace("\n\n", " ").replace("\xa0", " ").replace("&nbsp;", " ")
        s = re.sub(r"\s+", " ", s)  # Collapse any repeated whitespace
        s = s.strip()

        # Skip empty or too-short sentences
        if len(s.split()) >= min_words:
            cleaned.append(s)
    return cleaned

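def get_embeddings(sentences: list[str], max_retries: int = 5) -> list[list[float]]:
    # Retry transient API failures a bounded number of times instead of looping forever
    for attempt in range(1, max_retries + 1):
        try:
            return semantic_embedding_model.embed_documents(sentences)
        except Exception as e:
            if attempt == max_retries:
                raise
            print(f"Retrying due to error: {e}")
            time.sleep(2)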

def semantic_chunk(sentences, threshold=0.85, max_sentences=15, max_tokens=1024):
    """
    Adjusted semantic chunking logic to reduce the number of granular chunks.
    
    Parameters:
    - sentences: List of sentences to be chunked.
    - threshold: Similarity threshold for including a sentence in the current chunk.
    - max_sentences: Maximum number of sentences in a chunk.
    - max_tokens: Maximum number of tokens in a chunk.
    
    Returns:
    - List of text chunks.
    """
    # Clean the input sentences
    sentences = clean_sentences(sentences)
    if not sentences:
        return []  # nothing left to chunk after cleaning

    # Embed cleaned sentences
    embeddings = get_embeddings(sentences)

    chunks = []
    current_chunk = [sentences[0]]
    current_vectors = [embeddings[0]]

    for i in range(1, len(sentences)):
        proposed_chunk = " ".join(current_chunk + [sentences[i]])
        total_tokens = count_tokens(proposed_chunk)

        current_avg = np.mean(current_vectors, axis=0)
        sim = cosine_similarity([embeddings[i]], [current_avg])[0][0]

        # Enforce both limits: a chunk may exceed neither max_sentences nor max_tokens
        if sim > threshold and len(current_chunk) < max_sentences and total_tokens <= max_tokens:
            current_chunk.append(sentences[i])
            current_vectors.append(embeddings[i])
        else:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
            current_vectors = [embeddings[i]]

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

# ================================
# Step 2: Load and Preprocess HTML
# ================================
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

# Extract text and break into semantic chunks
all_chunks = []
for doc in docs:
    # Break raw content into sentences (basic split, can improve)
    raw_text = doc.page_content
    sentences = re.split(r'(?<=[.!?]) +', raw_text.strip())
    chunks = semantic_chunk(sentences)
    for chunk in chunks:
        all_chunks.append(Document(page_content=chunk, metadata=doc.metadata))

print(f"# of chunks: {len(all_chunks)}")

# ================================
# Step 3: Vector DB Setup (Qdrant)
# ================================
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
qdrant = QdrantClient(":memory:")
collection_name = "semantic_chunking"

qdrant.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)

vectorstore = QdrantVectorStore(
    client=qdrant,
    collection_name=collection_name,
    embedding=embeddings,
)
_ = vectorstore.add_documents(documents=all_chunks)

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
llm = ChatOpenAI(model="gpt-4.1-nano")

# ================================
# Step 4: LangGraph State Definition
# ================================
class GraphState(TypedDict):
    question: str
    context: List[Document]
    response: str

# ================================
# Step 5: Nodes (Retrieve + Generate)
# ================================
def retrieve(state):
    retrieved_docs = retriever.invoke(state["question"])
    return {"context": retrieved_docs}

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""
rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

def generate(state):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(
        question=state["question"],
        context=docs_content
    )
    response = llm.invoke(messages)
    return {"response": response.content}

# ================================
# Step 6: Build LangGraph
# ================================
graph_builder = StateGraph(GraphState).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


# of chunks: 211
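
The grouping rule inside `semantic_chunk` can be illustrated on toy data without calling an embedding API. The sketch below uses hand-made 2-D vectors in place of real embeddings (an assumption made purely for illustration) and keeps only the similarity test: a sentence joins the current chunk when its cosine similarity to the chunk's mean vector exceeds the threshold. The sentence and token limits are omitted, and `greedy_semantic_groups` is a hypothetical name.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mean_vector(vectors: list[list[float]]) -> list[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def greedy_semantic_groups(items, vectors, threshold=0.85):
    """Group consecutive items whose vector stays close to the running chunk mean."""
    if not items:
        return []
    groups = [[items[0]]]
    current_vectors = [vectors[0]]
    for item, vec in zip(items[1:], vectors[1:]):
        if cosine(vec, mean_vector(current_vectors)) > threshold:
            groups[-1].append(item)
            current_vectors.append(vec)
        else:
            groups.append([item])  # similarity dropped: start a new chunk
            current_vectors = [vec]
    return groups


sentences = ["cats purr", "cats nap", "stocks fell", "bonds rose"]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(greedy_semantic_groups(sentences, toy_vectors))
# → [['cats purr', 'cats nap'], ['stocks fell', 'bonds rose']]
```

Raising the threshold makes the grouping stricter. With a threshold close to 1, nearly every sentence becomes its own chunk, which is one plausible reason a high setting such as 0.85 can yield many small chunks on real text.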


## **Step 4: Semantic Chunking Evaluation using RAGAS Metrics**

In [88]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator
from ragas import EvaluationDataset
from ragas import evaluate, RunConfig
from ragas.metrics import (
    faithfulness, answer_relevancy, context_precision, context_recall, answer_correctness
)

# Generate testset from source documents
print("Generating dataset...")
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(embeddings=embeddings)

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)
print("Dataset generated.")


# Run inference and update answers
print("Executing RAG chain...")
for test_row in dataset:
    response = graph.invoke({"question": test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["response"]
    test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
print("RAG chain executed successfully.")

# RAG evaluation
print("Executing RAGAS evaluation...")
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
semantic_chunk_evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
custom_run_config = RunConfig(timeout=360)
semantic_chunk_ragas_report = evaluate(
    dataset=semantic_chunk_evaluation_dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_correctness
    ],
    llm=evaluator_llm,
    run_config=custom_run_config
)

print("RAGAS Results using Semantic Chunking:")
print(semantic_chunk_ragas_report)

Generating dataset...



Dataset generated.
Executing RAG chain...
RAG chain executed successfully.
Executing RAGAS evaluation...



Exception raised in Job[14]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1 in organization org-ULKD0BfHsBV4junY2FxMu8EA on tokens per min (TPM): Limit 30000, Used 29545, Requested 2085. Please try again in 3.26s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[10]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1 in organization org-ULKD0BfHsBV4junY2FxMu8EA on tokens per min (TPM): Limit 30000, Used 29927, Requested 2767. Please try again in 5.388s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[24]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1 in organization org-ULKD0BfHsBV4junY2FxMu8EA on tokens per min (TPM): Limit 30000, Used 29862, Reques

RAGAS Results using Semantic Chunking:
{'faithfulness': 0.5406, 'answer_relevancy': 0.9578, 'context_precision': 0.7441, 'context_recall': 0.5790, 'answer_correctness': 0.6453}
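
Because some evaluation jobs above raised `RateLimitError`, the per-sample scores behind these averages may contain missing values. The assumption here is that a failed job is recorded as `NaN` rather than as a score. A minimal sketch of reporting each mean together with the number of samples that actually contributed follows. The scores in the example are made up for illustration, and `nan_aware_mean` is a hypothetical helper.

```python
import math


def nan_aware_mean(scores: list[float]) -> tuple[float, int]:
    """Return (mean of non-NaN scores, number of non-NaN scores)."""
    valid = [s for s in scores if not math.isnan(s)]
    if not valid:
        return float("nan"), 0
    return sum(valid) / len(valid), len(valid)


# Hypothetical per-sample faithfulness scores; NaN marks a failed job.
example_scores = [0.5, 1.0, float("nan"), 0.75, float("nan")]
mean, n_valid = nan_aware_mean(example_scores)
print(f"mean={mean:.2f} over {n_valid} of {len(example_scores)} samples")
# → mean=0.75 over 3 of 5 samples
```

Reporting the valid-sample count next to each average makes it visible when a metric rests on only a handful of the test questions.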


## **Step 5: Comparison between Naive Chunking and Semantic Chunking**

In [89]:
import plotly.graph_objects as go

# Convert RAGAS reports to DataFrames
naive_scores_df = baseline_ragas_report.to_pandas()
semantic_scores_df = semantic_chunk_ragas_report.to_pandas()

# Metrics to include
selected_metrics = [
    "faithfulness",
    "answer_relevancy",
    "context_precision",
    "context_recall",
    "answer_correctness"
]

# Compute average scores for each metric
naive_values = naive_scores_df[selected_metrics].mean().tolist()
semantic_values = semantic_scores_df[selected_metrics].mean().tolist()

# Create grouped bar chart with values displayed
fig = go.Figure(data=[
    go.Bar(
        name='Naive Chunking',
        x=selected_metrics,
        y=naive_values,
        text=[f"{v:.2f}" for v in naive_values],
        textposition="auto"
    ),
    go.Bar(
        name='Semantic Chunking',
        x=selected_metrics,
        y=semantic_values,
        text=[f"{v:.2f}" for v in semantic_values],
        textposition="auto"
    )
])

fig.update_layout(
    title='RAGAS Metric Comparison (Averaged): Naive vs Semantic Chunking',
    yaxis=dict(title='Score', range=[0, 1]),
    barmode='group'
)

fig.show()

| Metric                 | Naive Chunking | Semantic Chunking | Higher Score | Insight                                                                  |
| ---------------------- | -------------- | ----------------- | ------------ | ------------------------------------------------------------------------ |
| **Faithfulness**       | 0.87           | 0.54              | ✅ Naive      | Naive-chunk answers were better supported by the retrieved context       |
| **Answer Relevancy**   | 0.79           | 0.96              | ✅ Semantic   | Semantic chunking raised answer relevancy by 0.17                        |
| **Context Precision**  | 0.53           | 0.74              | ✅ Semantic   | Semantic chunking led to tighter, more precise retrieval                 |
| **Context Recall**     | 0.87           | 0.58              | ✅ Naive      | Naive chunking covered more of the relevant context overall              |
| **Answer Correctness** | 0.77           | 0.65              | ✅ Naive      | Naive chunking scored 0.12 higher on correctness of the final answers    |
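
The differences in the table can be computed directly from the two aggregate results printed earlier. The snippet below copies those reported values into plain dictionaries (the variable names are illustrative) and prints the change from naive to semantic chunking for each metric.

```python
# Aggregate scores copied from the two RAGAS reports printed above
naive = {"faithfulness": 0.8681, "answer_relevancy": 0.7868, "context_precision": 0.5333,
         "context_recall": 0.8701, "answer_correctness": 0.7673}
semantic = {"faithfulness": 0.5406, "answer_relevancy": 0.9578, "context_precision": 0.7441,
            "context_recall": 0.5790, "answer_correctness": 0.6453}

# Positive delta: semantic chunking scored higher; negative: naive scored higher
deltas = {metric: round(semantic[metric] - naive[metric], 4) for metric in naive}
for metric, delta in deltas.items():
    winner = "semantic" if delta > 0 else "naive"
    print(f"{metric:<20} {delta:+.4f}  ({winner} higher)")
```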


**Summary:**

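* **Semantic chunking scored higher** on:

  * *Answer relevancy* (0.79 → 0.96) and *context precision* (0.53 → 0.74), likely because retrieval draws on smaller, semantically focused chunks.

* **Naive chunking scored higher** on:

  * *Faithfulness* (0.87 vs 0.54), *context recall* (0.87 vs 0.58), and *answer correctness* (0.77 vs 0.65), suggesting that longer, overlapping chunks better preserve the surrounding context an answer needs.

* **Caveat**:

  * Each run generated its own test set with `TestsetGenerator`, and some evaluation jobs in both runs failed with rate-limit errors, so these differences are indicative rather than a controlled comparison.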