# CE 26 - Evaluating LLM Applications Like a Statistician

The code below stands up a RAG pipeline over three documents containing reports on public transit and safe streets (feel free to add your own documents!). They're minimally processed and may give messy results, but the answers these sources support should be more nuanced than what we see in benchmark datasets like HotPotQA.

The code creates an in-memory vector store. You will need an OpenAI API key with money in the account. Running the whole thing should only cost a few cents. If you prefer another LLM source, Haystack makes this easy--just replace the embedder and generator and the pipeline will still work!

Once the documents are embedded, we can run questions through them. We then generate synthetic questions with an LLM and test them with the metrics we covered in the course.

Hopefully I've taken care of the annoying parts of getting the pipeline working. Employing measurement is an exercise for you, the learner.

_AI Disclosure: I heavily employed Claude-4 in generating these scripts, and have designated AI generated functions as such_

In [134]:
import numpy as np
import pandas as pd
import PyPDF2
import re
import os

from haystack.document_stores.in_memory import InMemoryDocumentStore
from datasets import load_dataset
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack import Pipeline
from openai import OpenAI
from pydantic import BaseModel
from bert_score import score as bertscore
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RougeScore
from ragas.metrics import BleuScore
from sklearn.decomposition import PCA
from dotenv import load_dotenv
from typing import List

# Load variables from a local .env file (if present) before reading the key
load_dotenv()
openai_api_key = os.getenv('OPENAI_API_KEY')


In [3]:
def process_pdfs_to_documents(pdf_paths: List[str]) -> List[Document]:
    """
    AI generated
    Process a list of PDF files, chunk them into 3-sentence chunks, 
    and create Haystack Document objects.
    
    Args:
        pdf_paths: List of paths to PDF files
        
    Returns:
        List of Haystack Document objects
    """
    documents = []
    
    for pdf_path in pdf_paths:
        try:
            # Extract text from PDF
            with open(pdf_path, 'rb') as file:
                pdf_reader = PyPDF2.PdfReader(file)
                text = ""
                
                # Extract text from all pages
                for page in pdf_reader.pages:
                    # Guard against pages with no extractable text
                    text += (page.extract_text() or "") + " "
            
            # Clean up the text
            text = re.sub(r'\s+', ' ', text.strip())
            
            # Split into sentences using regex
            sentences = re.split(r'(?<=[.!?])\s+', text)
            sentences = [s.strip() for s in sentences if s.strip()]
            
            # Create 3-sentence chunks
            chunks = []
            for i in range(0, len(sentences), 3):
                chunk = ' '.join(sentences[i:i+3])
                if chunk:  # Only add non-empty chunks
                    chunks.append(chunk)
            
            # Create Document objects for each chunk
            for i, chunk in enumerate(chunks):
                doc = Document(
                    content=chunk,
                    meta={
                        "source": pdf_path,
                        "chunk_id": i,
                        "total_chunks": len(chunks)
                    }
                )
                documents.append(doc)
                
        except Exception as e:
            print(f"Error processing {pdf_path}: {str(e)}")
            continue
    
    return documents
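
To see what the chunking step does before pointing it at real PDFs, here is a minimal sketch that applies the same regex split and 3-sentence grouping to a made-up string. `chunk_sentences` and `toy_text` are illustrative names and are not part of the pipeline above.

```python
import re
from typing import List


def chunk_sentences(text: str, sentences_per_chunk: int = 3) -> List[str]:
    """Split on sentence-ending punctuation, then group into fixed-size chunks."""
    # Collapse runs of whitespace, as the PDF processing function does
    text = re.sub(r"\s+", " ", text.strip())
    # Split after '.', '!' or '?' when followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


# Made-up text, purely for illustration
toy_text = (
    "Buses are common. Rail is faster. Walking is free. "
    "Cycling is growing. Dr. Lee agrees."
)
toy_chunks = chunk_sentences(toy_text)
for n, chunk in enumerate(toy_chunks):
    print(n, chunk)
```

Note that the regex treats "Dr." as the end of a sentence, so the second chunk holds three regex "sentences" but only two real ones. Tables of contents and abbreviations in the real PDFs will produce similar oddities, which is one reason some retrieved chunks look messy.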

In [4]:
# Define the paths to the PDF files
pdf_files = [
    "./data/2023 National Transit Summaries and Trends_1.2.pdf",
    "./data/Best-Complete-Streets-Policies-2025.pdf", 
    "./data/Dangerous-By-Design-2024_5.30.pdf"
]

# Process the PDFs and create Document objects
pdf_docs = process_pdfs_to_documents(pdf_files)

print(f"Successfully processed {len(pdf_docs)} document chunks from {len(pdf_files)} PDF files")
print(f"Sample chunk: {pdf_docs[0].content[:200]}..." if pdf_docs else "No documents processed")

Successfully processed 625 document chunks from 3 PDF files
Sample chunk: Office of Budget and Policy National Transit Summaries and Trends 2023 Edition Primary C ontributors Alexus Cook Amanda Walton -Hawthorne Andrew Gogolin Chandrashekar Machiraju Chelsea Champlin Declan...


In [None]:
def make_rag_pipeline(
        top_k=10,
        embedding_model = "text-embedding-3-large",
        generator_model="gpt-4o",
        pdf_docs=pdf_docs,
        ):
    """Create a RAG pipeline with the specified parameters."""

    # Create an in-memory document store
    document_store = InMemoryDocumentStore()

    # Define the embedder we'll be using - Haystack has different functions
    # for document and string embedding, so we'll define both
    doc_embedder = OpenAIDocumentEmbedder(
            model=embedding_model
            # API key will be read from environment variable
        )
    text_embedder = OpenAITextEmbedder(
            model=embedding_model
            # API key will be read from environment variable
        )

    # Embed the documents and write them to the document store
    docs_with_embeddings = doc_embedder.run(pdf_docs)

    document_store.write_documents(docs_with_embeddings["documents"])

    # Define the retriever
    retriever = InMemoryEmbeddingRetriever(document_store, top_k=top_k)

    # Define the generator instructions
    template = [
        ChatMessage.from_user(
            """
    Given the following information, answer the question. 
    If the answer is not in the documents, say "I don't know".

    Context:
    {% for document in documents %}
        {{ document.content }}
    {% endfor %}

    Question: {{question}}
    Answer:
    """
        )
    ]

    prompt_builder = ChatPromptBuilder(template=template)

    # Define the generator
    chat_generator = OpenAIChatGenerator(model=generator_model)

    # Build the pipeline
    rag_pipeline = Pipeline()
    # Add components to your pipeline
    rag_pipeline.add_component("text_embedder", text_embedder)
    rag_pipeline.add_component("retriever", retriever)
    rag_pipeline.add_component("prompt_builder", prompt_builder)
    rag_pipeline.add_component("llm", chat_generator)

    # Now, connect the components to each other
    rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
    rag_pipeline.connect("retriever", "prompt_builder")
    rag_pipeline.connect("prompt_builder.prompt", "llm.messages")

    return rag_pipeline, document_store


In [39]:
def rag_answer(question, rag_pipeline, verbose=False):
    """ AI generated
    Run the RAG pipeline with verbose output.
    """
    # Get the response from the RAG pipeline
    response = rag_pipeline.run(
        {"text_embedder": {"text": question}, "prompt_builder": {"question": question}},
        include_outputs_from={"retriever", "prompt_builder", "llm"}
    )

    # Extract the answer from the response
    rag_answer = response["llm"]["replies"][0].text

    # Retrieve the documents used for answering the question
    retrieved_docs = response["retriever"]["documents"]
    document_indexes = [(doc.meta.get('source', 'Unknown'), 
                         doc.meta.get('chunk_id', 'Unknown'))
                        for doc in retrieved_docs]
    if verbose:
        # Display the retrieved chunks
        print("=" * 80)
        print(f"QUESTION: {question}")
        print("=" * 80)
        print("\nRETRIEVED CHUNKS:")
        print("-" * 50)

        # Now we can access the retriever output
        print(f"Retrieved {len(retrieved_docs)} chunks:")
        for i, doc in enumerate(retrieved_docs, 1):
            print(f"\n[CHUNK {i}]")
            print(f"Source: {doc.meta.get('source', 'Unknown')}")
            print(f"Chunk ID: {doc.meta.get('chunk_id', 'Unknown')}")
            print(f"Content: {doc.content}")
            print("-" * 50)

        print(f"\nFINAL ANSWER:")
        print("=" * 50)
        print(rag_answer)
        print("=" * 80)
    else:
        print(rag_answer)
    return rag_answer, document_indexes
    


In [129]:
rag_pipeline, rag_documents = make_rag_pipeline()

rag_answer("What is  Complete Streets?", rag_pipeline, verbose=True)
print("\n" + "=" * 80 + "\n")
rag_answer("What is  Complete Streets?", rag_pipeline, verbose=False)

Calculating embeddings: 20it [00:12,  1.54it/s]


QUESTION: What is  Complete Streets?

RETRIEVED CHUNKS:
--------------------------------------------------
Retrieved 10 chunks:

[CHUNK 1]
Source: ./data/Best-Complete-Streets-Policies-2025.pdf
Chunk ID: 12
Content: Complete Streets is an approach to planning, designing, building, operating, and maintaining streets that enables safe access for all people who need to use them, including pedestrians, bicyclists, motorists, and transit riders. Adopting a Complete Streets policy is an indicator that policymakers, practitioners, and neighbors alike recognize that current roadway designs are not working and that a pivot is needed. Complete Streets provide a number of important benefits, including better health, economic growth, a sense of place, and overall quality of life.
--------------------------------------------------

[CHUNK 2]
Source: ./data/Best-Complete-Streets-Policies-2025.pdf
Chunk ID: 10
Content: The Complete Streets approach involves practical and tangible changes—such as putt

('Complete Streets is an approach to planning, designing, building, operating, and maintaining streets that enables safe access for all people who need to use them, including pedestrians, bicyclists, motorists, and transit riders. It involves both practical changes, such as adding sidewalks, raised crosswalks, and safe bicycle infrastructure, as well as less visible changes like public accountability systems and evaluation processes. The approach aims to improve health, economic growth, quality of life, and safety by making streets safe and accessible for all users.',
 [('./data/Best-Complete-Streets-Policies-2025.pdf', 12),
  ('./data/Best-Complete-Streets-Policies-2025.pdf', 10),
  ('./data/Best-Complete-Streets-Policies-2025.pdf', 13),
  ('./data/Best-Complete-Streets-Policies-2025.pdf', 4),
  ('./data/Best-Complete-Streets-Policies-2025.pdf', 7),
  ('./data/Best-Complete-Streets-Policies-2025.pdf', 11),
  ('./data/Best-Complete-Streets-Policies-2025.pdf', 100),
  ('./data/Best-Comp

In [125]:
def make_synthetic_eval(n_samples=10, rag_documents=rag_documents):
    """Generate synthetic evaluation questions and answers
    based on the RAG documents."""
    class SyntheticQuestion(BaseModel):
        question: str
        gt_answer: str
        reference: str

    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    synthetic_eval = []

    chunks = np.random.choice(list(rag_documents.storage.keys()), size=n_samples, replace=True)
    for chunk in chunks:
        body = {}
        body["content"]=rag_documents.storage[chunk].content
        body["source"]=rag_documents.storage[chunk].meta.get('source', 'Unknown')
        body["chunk_id"]=rag_documents.storage[chunk].meta.get('chunk_id', 'Unknown')

        prompt = f"""Generate 1 synthetic question based on the following text. Generate questions 
            that are relevant to the text and can be answered using the information provided, and would
            be suitable for evaluating the performance of a RAG system.
            Text: {body["content"]}
            respond format:
            {{'question': 'Your question here', 
            'gt_answer': 'The answer to the question here',
            'reference': 'text from the chunk supporting the answer'}}.
            Do not include any additional text or explanations. Only the JSON response.
            If the text does not contain enough information to answer a question, make
            the question and answer both 'NA'
            """
        
        response = client.responses.parse(
            model="gpt-4o",
            input=[
                {
                    "role": "system",
                    "content": "You are a helpful assistant that generates synthetic questions and answers based on provided text.",
                },
                {"role": "user", "content": prompt},
            ],
            text_format=SyntheticQuestion,
        )

        if response.output_parsed.question != "NA":
            body["question"] = response.output_parsed.question
            body["gt_answer"] = response.output_parsed.gt_answer
            body["reference"] = response.output_parsed.reference
            synthetic_eval.append(body)

    return pd.DataFrame(synthetic_eval)

In [126]:
synth_df = make_synthetic_eval(n_samples=100)
print(synth_df.shape)
synth_df.head(5)


(82, 6)


| | content | source | chunk_id | question | gt_answer | reference |
|---|---|---|---|---|---|---|
| 0 | Each one of these deaths was a person who left... | ./data/Dangerous-By-Design-2024_5.30.pdf | 16 | What are the three hidden costs of the crisis ... | The trauma of survivors, the heavy direct and ... | The trauma of survivors is just one of three h... |
| 1 | Despite a clear Complete Streets vision, furth... | ./data/Best-Complete-Streets-Policies-2025.pdf | 93 | What is important for a new Complete Streets p... | It is important that a new policy addresses al... | It is important that a new policy addresses al... |
| 2 | ................................ ............ ... | ./data/2023 National Transit Summaries and Tre... | 98 | What section of the 2023 National Transit Summ... | Table of Contents | 2023 National Transit Summaries & Trends Table... |
| 3 | Bus (MB) – A transit mode using rubber -tired ... | ./data/2023 National Transit Summaries and Tre... | 215 | What is a characteristic of a Bus Rapid Transi... | Operates over 50 percent of its route in a sep... | Bus Rapid Transit (RB) –A Fixed -Route Bus sys... |
| 4 | However, it also means that these policies are... | ./data/Best-Complete-Streets-Policies-2025.pdf | 48 | Why is it important for jurisdictions to adopt... | To ensure the policies and actions last beyond... | It is important that Complete Streets policies... |


In [None]:
def llm_judge(answer, gt_answer, reference, question):
    """Ask GPT-4o whether the answer matches the ground truth.
    The reply is expected to be the string "1" or "0"."""
    # Create the client here: the one in make_synthetic_eval is local to that function
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    prompt = f"""You are an expert evaluator of question-answering
    systems. Compare the following answer from an AI to the reference
    answer provided by an expert human. Determine whether they are
    factually equivalent even if they are worded differently.
    Answer only 1 for same or 0 for different. Do not provide any
    other commentary.
    Question: {question}
    Reference: {reference}
    Ground Truth Answer: {gt_answer}
    Proposed Answer: {answer}
    """

    # Send the prompt to GPT-4o
    response = client.responses.create(
        model="gpt-4o",
        input=prompt
    )

    # Strip stray whitespace around the label
    return response.output[0].content[0].text.strip()

async def evaluate_generated_answer(answer, gt_answer, reference, question):
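    # Keep the `reference` argument (supporting chunk text) intact for the judge;
    # the overlap metrics compare the candidate against the ground-truth answer
    gt_text = gt_answer.strip()
    candidate = answer.strip()
    print(gt_text, candidate)
    # RAGAS BLEU and ROUGE
    sample = SingleTurnSample(response=candidate, reference=gt_text)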

    bleu = await BleuScore().single_turn_ascore(sample)
    rouge = await RougeScore(rouge_type="rouge1").single_turn_ascore(sample)

    # BERTScore F1
    # P, R, F1 = bertscore([candidate], [reference], lang="en", rescale_with_baseline=True)
    # bert_f1 = F1[0].item()
    # I am commenting this out because the hotel wifi can't handle the BERTScore
    bert_f1 = -9999

    # LLM judge
    judge_binary = llm_judge(answer, gt_answer, reference, question)


    return {
        "bleu": bleu,
        "rouge": rouge,
        "bert_f1": bert_f1,
        "llm judge": judge_binary,
    }
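
For intuition about what the ROUGE-1 number is measuring, here is a simplified sketch of a unigram-overlap F1 score. It is not the RAGAS implementation (which has its own tokenization and options); `unigram_overlap_f1` and the two example strings are made up for illustration.

```python
from collections import Counter


def unigram_overlap_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 using lowercase whitespace tokens."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each reference token can be matched at most once
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up reference and answers, purely for illustration
gt = "to ensure policies last beyond political timelines"
concise_score = unigram_overlap_f1("policies last beyond political timelines", gt)
verbose_score = unigram_overlap_f1(
    "it is important because policies last beyond political timelines and more besides",
    gt,
)
print(f"concise: {concise_score:.3f}  verbose: {verbose_score:.3f}")
```

Both answers contain the same key phrase, but the longer one scores lower because the extra words reduce precision. RAG answers tend to be wordier than short ground-truth answers, so it is worth keeping this in mind when comparing overlap metrics against the LLM judge.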

In [133]:
# BERTScore is commented out in evaluate_generated_answer, so bert_f1 is a -9999 placeholder.
# If you uncomment it, the first run will take a long time while it downloads the BERT weights
record = synth_df.iloc[5,:]

candidate, sources = rag_answer(record.question, rag_pipeline, verbose=False)

result = await evaluate_generated_answer(candidate, 
                          record.gt_answer, 
                          record.reference, 
                          record.question)
print(result)

It is important for jurisdictions to adopt strong and binding Complete Streets policies because these policies need to last beyond political timelines. Strong policies ensure that the Complete Streets approach can have lasting impacts without being easily overturned when a new administration takes over. Additionally, strong policies can lead to changes in the way people travel, promote vibrant public spaces, encourage shopping at locally owned businesses, and help neighbors connect with each other.
To ensure that these policies last beyond political timelines. It is important for jurisdictions to adopt strong and binding Complete Streets policies because these policies need to last beyond political timelines. Strong policies ensure that the Complete Streets approach can have lasting impacts without being easily overturned when a new administration takes over. Additionally, strong policies can lead to changes in the way people travel, promote vibrant public spaces, encourage shopping at

# Exercises

The remainder is left as an exercise for the reader.

Now you have all the components you need to try out different approaches to the RAG. 

Ideas to try:

1. Get the metrics for all of the observations in synth_df and apply some UQ methods (bootstrap, Bayesian posterior). You could also modify the judge function to return logits and try calibrating those.
2. Below is a weaker version of the pipeline. Run it and see how performance differs.
3. Try changing the RAG and LLM hyperparameters and observe the change.
4. See if you can get a reranker working. You can use the sentence transformers package to get a model, or even try using an LLM to rerank.
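
As a starting point for the first idea, here is a minimal sketch of two ways to put an interval around the judge's pass rate: a percentile bootstrap and a Beta posterior under a uniform prior. The function names and the `judge_labels` array are made up for illustration; in practice you would replace the array with the 0/1 labels collected from running the evaluation over synth_df.

```python
import numpy as np


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample row indices with replacement, one row of indices per replicate
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), lower, upper


def beta_posterior_interval(successes, trials, prior_a=1.0, prior_b=1.0,
                            alpha=0.05, n_draws=20000, seed=0):
    """Posterior mean and equal-tailed interval for a pass rate.

    Assumes a Beta(prior_a, prior_b) prior and a binomial likelihood,
    which gives a Beta(prior_a + successes, prior_b + failures) posterior.
    """
    rng = np.random.default_rng(seed)
    post_a = prior_a + successes
    post_b = prior_b + (trials - successes)
    draws = rng.beta(post_a, post_b, size=n_draws)
    lower, upper = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return post_a / (post_a + post_b), lower, upper


# Made-up judge labels (1 = judged equivalent), purely for illustration
judge_labels = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1])

boot_mean, boot_lo, boot_hi = bootstrap_ci(judge_labels)
post_mean, post_lo, post_hi = beta_posterior_interval(
    judge_labels.sum(), len(judge_labels)
)
print(f"bootstrap: {boot_mean:.3f} [{boot_lo:.3f}, {boot_hi:.3f}]")
print(f"posterior: {post_mean:.3f} [{post_lo:.3f}, {post_hi:.3f}]")
```

The same `bootstrap_ci` function works for continuous metrics such as BLEU or ROUGE. The Beta posterior only applies to binary outcomes like the judge label, and the uniform prior is an assumption you may want to change.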

In [None]:
# # Weak is a misnomer, ADA is a fine model, but it plus GPT 3.5 should
# # be less performant than the above
# # Once you run this, there may be errors if you try to call the other pipeline
# weak_rag_pipeline, _ = make_rag_pipeline(embedding_model="text-embedding-ada-002", generator_model="gpt-3.5-turbo")

# rag_answer("What is  Complete Streets?", weak_rag_pipeline, verbose=True)