
In [79]:
## Evaluation

In [80]:
# Welcome to day five of our AI Agents Crash Course.

# Yesterday we learned about function calling and created our first agent using Pydantic AI.
# But is this agent actually good? Today we will see how to answer this question.

# In particular, we will cover:
# - Build a logging system to track agent interactions
# - Create automated evaluation using AI as a judge
# - Generate test data automatically
# - Measure agent performance with metrics

# At the end of this lesson, you'll have a thoroughly tested agent with performance metrics.

# In this lesson, we'll use the FAQ database with text search, but the same approach applies to other use cases.

# This is going to be a long lesson, but an important one. Evaluation is critical for building reliable AI systems. Without proper evaluation, you can't tell if your changes improve or hurt performance. You can't compare different approaches. And you can't build confidence before deploying to users.

# So let's start!


In [81]:
# Logging

# The easiest thing we can do to evaluate an agent is interact with it. We ask something and look at the response. Does it make sense? For most cases, it should.

# This is a "vibe check": we interact with the agent, and if we like the results, we go ahead and deploy it.

# If we don't like something, we go back and change things:
# - Maybe our chunking method is not suitable? Maybe we need to have a bigger window size?
# - Is our system prompt good? Maybe we need more precise instructions?
# - Or we want to change something else
# And we iterate.

# It might be okay for the first MVP, but how can we make sure the result at the end is actually good?

# We need systematic evaluation. Manual testing doesn't scale - you can't manually test every possible input and scenario. With systematic evaluation, we can test hundreds or thousands of cases automatically.

# We also need to base our decisions on data. It will help us:
# - Compare different approaches
# - Track improvements
# - Identify edge cases

# We can start collecting this data ourselves: start with vibe checking, but be smart about it. We don't just test it, but also record the results.


In [82]:
!pip install uv



In [83]:
!uv pip install openai minsearch requests python-frontmatter pydantic-ai

Using Python 3.12.11 environment at: /usr
Audited 5 packages in 112ms


In [84]:
# read_repo_data comes from the first lesson. We don't need sliding_window from the second lesson here because the FAQ records don't require chunking.

import io
import zipfile
import requests
import frontmatter

def read_repo_data(repo_owner, repo_name):
  """
  Download and parse all markdown files from a github repository

  Args:
    repo_owner : Github username or organization
    repo_name: Repository name

  Returns:
    List of dictionaries containing file content and metadata
  """
  prefix = 'https://codeload.github.com'
  url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
  resp = requests.get(url)

  if resp.status_code != 200:
    raise Exception(f"Failed to download repository {repo_owner}/{repo_name}: {resp.status_code}")

  repository_data = []

  # Create a ZipFile object from the downloaded content
  zf = zipfile.ZipFile(io.BytesIO(resp.content))

  for file_info in zf.infolist():
    filename = file_info.filename
    filename_lower = filename.lower()

    if not filename_lower.endswith(('.md', '.mdx')):
      continue

    try:
      with zf.open(file_info) as f_in:
        content = f_in.read().decode('utf-8', errors='ignore')
        post = frontmatter.loads(content)
        data = post.to_dict()
        data['filename'] = filename
        repository_data.append(data)
    except Exception as e:
      print(f"Error processing {filename}: {e}")
      continue

  zf.close()
  return repository_data

In [85]:
# Let's now index this data with minsearch:

from minsearch import Index

# For DataTalksClub FAQ, it's similar, except we don't need to chunk the data. For the data engineering course, it'll look like this:

dtc_faq = read_repo_data('DataTalksClub', 'faq')

de_dtc_faq = [d for d in dtc_faq if 'data-engineering' in d['filename']]

faq_index = Index(
    text_fields=["question", "content"],
    keyword_fields=[]
)

faq_index.fit(de_dtc_faq)

query = 'Course: Can I still join the course after the start date?'
results = faq_index.search(query)
print(results)

# This is text search, also known as "lexical search". It ranks documents by how well their words match the words in the query, without considering the meaning.

[{'id': '3f1424af17', 'question': 'Course: Can I still join the course after the start date?', 'sort_order': 3, 'content': "Yes, even if you don't register, you're still eligible to submit the homework.\n\nBe aware, however, that there will be deadlines for turning in homeworks and the final projects. So don't leave everything for the last minute.", 'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md'}, {'id': '9e508f2212', 'question': 'Course: When does the course start?', 'sort_order': 1, 'content': "The next cohort starts January 13th, 2025. More info at [DTC](https://datatalks.club/blog/guide-to-free-online-courses-at-datatalks-club.html).\n\n- Register before the course starts using this [link](https://airtable.com/shr6oVXeQvSI5HuWD).\n- Join the [course Telegram channel with announcements](https://t.me/dezoomcamp).\n- Don’t forget to register in DataTalks.Club's Slack and join the channel.", 'file

In [None]:
# Here's the agent we created yesterday:

from typing import List, Any
from pydantic_ai import Agent
from google.colab import userdata
import os


def text_search(query: str) -> List[Any]:
    """
    Perform a text-based search on the FAQ index.

    Args:
        query (str): The search query string.

    Returns:
        List[Any]: A list of up to 5 search results returned by the FAQ index.
    """
    return faq_index.search(query, num_results=5)


system_prompt = """
You are a helpful assistant for a course.

Use the search tool to find relevant information from the course materials before answering questions.

If you can find specific information through search, use it to provide accurate answers.
If the search doesn't return relevant results, let the user know and provide general guidance.
"""


# Get API keys from Colab secrets
groq_api_key = userdata.get('GROQ_API_KEY')

# Set environment variables (pydantic-ai can also read from these)
os.environ['GROQ_API_KEY'] = groq_api_key

agent = Agent(
    name="faq_agent",
    instructions=system_prompt,
    tools=[text_search],
    model='groq:gemma2-9b-it'
)

question = "how do I install Kafka in Python?"
result = await agent.run(user_prompt=question)

print(result.output)


# Here's what we want to record:
# - The system prompt that we used
# - The model
# - The user query
# - The tools we use
# - The responses and the back-and-forth interactions between the LLM and our tools
# - The final response

# To keep things simple, we'll implement a basic logging system ourselves: we will just write logs to JSON files.

# You shouldn't use it in production. In practice, you will want to send these logs to some log collection system, or use specialized LLM evaluation tools like Evidently, LangWatch or Arize Phoenix.


In [88]:
# Let's extract all this information from the agent and from the run results:

from pydantic_ai.messages import ModelMessagesTypeAdapter


def log_entry(agent, messages, source="user"):
    tools = []

    for ts in agent.toolsets:
        tools.extend(ts.tools.keys())

    dict_messages = ModelMessagesTypeAdapter.dump_python(messages)

    return {
        "agent_name": agent.name,
        "system_prompt": agent._instructions,
        "provider": agent.model.system,
        "model": agent.model.model_name,
        "tools": tools,
        "messages": dict_messages,
        "source": source
    }


# This code extracts the key information from our agent:
# - the configuration (name, prompt, model)
# - available tools
# - complete message history (user input, tool calls, responses)

# We also use ModelMessagesTypeAdapter.dump_python(messages) to convert internal message format into regular Python dictionaries. This makes it easier to save it to JSON and process later.

# We also add the source parameter. It tracks where the question came from. We start with "user" but later we'll use AI-generated queries. Sometimes it may be important to tell them apart for analysis.

# This code is generic so it will work with any Pydantic AI agent. If you use a different library, you'll need to adjust this code.

In [89]:
# Let's write these logs to a folder:

import json
import secrets
from pathlib import Path
from datetime import datetime


LOG_DIR = Path('logs')
LOG_DIR.mkdir(exist_ok=True)


def serializer(obj):
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")


def log_interaction_to_file(agent, messages, source='user'):
    entry = log_entry(agent, messages, source)

    ts = entry['messages'][-1]['timestamp']
    ts_str = ts.strftime("%Y%m%d_%H%M%S")
    rand_hex = secrets.token_hex(3)

    filename = f"{agent.name}_{ts_str}_{rand_hex}.json"
    filepath = LOG_DIR / filename

    with filepath.open("w", encoding="utf-8") as f_out:
        json.dump(entry, f_out, indent=2, default=serializer)

    return filepath

# This code:
# - Creates a logs directory (if not created previously)
# - Generates unique filenames with timestamp and random hex
# - Saves complete interaction logs as JSON files
# - Handles datetime serialization (using the serializer function)
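
# To see why the serializer is needed, here is a minimal standalone sketch. The example_entry dictionary is a made-up fragment, not a real log record; real entries come from log_entry(). It shows that a datetime is written as an ISO 8601 string and can be parsed back when the log is loaded.

```python
import json
from datetime import datetime, timezone


def serializer(obj):
    # Same fallback as above: datetimes become ISO 8601 strings
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")


# Hypothetical log fragment used only for illustration
example_entry = {
    "source": "user",
    "timestamp": datetime(2025, 10, 4, 7, 18, 17, tzinfo=timezone.utc),
}

# Without default=serializer, json.dumps raises TypeError on the datetime
encoded = json.dumps(example_entry, default=serializer)
print(encoded)

# After loading, the timestamp is a plain string; fromisoformat restores it
decoded = json.loads(encoded)
restored = datetime.fromisoformat(decoded["timestamp"])
```

# Keep in mind that json.load returns timestamps as strings, so any later analysis that needs real datetimes has to parse them again.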

In [None]:
# Now we can interact with it and do some vibe checking:
question = input()
result = await agent.run(user_prompt=question)
print(result.output)
log_interaction_to_file(agent, result.new_messages())

# This handles a single interaction (re-run the cell to ask another question):
# - The user enters a question
# - The agent processes it and responds
# - The complete interaction is logged to a file

# Try these questions:
# - how do I use docker on windows?
# - can I join late and get a certificate?
# - what do I need to do for the certificate?


In [None]:
# Adding References

# When interacting with the agent, I noticed one thing: it doesn't include the reference to the original documents.

# Let's fix it by adjusting the prompt

system_prompt = """
You are a helpful assistant for a course.

Use the search tool to find relevant information from the course materials before answering questions.

If you can find specific information through search, use it to provide accurate answers.

Always include references by citing the filename of the source material you used.
When citing the reference, replace "faq-main" by the full path to the GitHub repository: "https://github.com/DataTalksClub/faq/blob/main/"
Format: [LINK TITLE](FULL_GITHUB_LINK)

If the search doesn't return relevant results, let the user know and provide general guidance.
""".strip()

# Create another version of agent, let's call it faq_agent_v2
agent = Agent(
    name="faq_agent_v2",
    instructions=system_prompt,
    tools=[text_search],
    model='groq:gemma2-9b-it'
)

# Now we can interact with it and do some vibe checking:
question = input()
result = await agent.run(user_prompt=question)
print(result.output)
log_interaction_to_file(agent, result.new_messages())

# question - can I join late and get a certificate?

In [92]:
# LLM as a Judge

# You can ask your colleagues to also do a "vibe check", but make sure you record the data. Often collecting 10-20 examples and manually inspecting them is enough to understand how your model is doing.

# Don't be afraid of putting manual work into evaluation. Manual evaluation will help you understand edge cases, learn what good responses look like and think of evaluation criteria for automated checks later.

# For example, I manually inspected the output and noticed that references are missing. So we will later add it as one of the checks.

# So, in our case, we can have the following checks:
# - Does the agent follow the instructions?
# - Given the question, does the answer make sense?
# - Does it include references?
# - Did the agent use the available tools?

# We don't have to evaluate this manually. Instead, we can delegate this to AI. This technique is called "LLM as a Judge".

# The idea is simple: we use one LLM to evaluate the outputs of another LLM. This works because LLMs are good at following detailed evaluation criteria.


In [None]:
# Our system prompt for the judge (we'll call it "evaluation agent" because it sounds cooler) can look like this:

evaluation_prompt = """
Use this checklist to evaluate the quality of an AI agent's answer (<ANSWER>) to a user question (<QUESTION>).
We also include the entire log (<LOG>) for analysis.

For each item, check if the condition is met.

Checklist:

- instructions_follow: The agent followed the user's instructions (in <INSTRUCTIONS>)
- instructions_avoid: The agent avoided doing things it was told not to do
- answer_relevant: The response directly addresses the user's question
- answer_clear: The answer is clear and correct
- answer_citations: The response includes proper citations or sources when required
- completeness: The response is complete and covers all key aspects of the request
- tool_call_search: Is the search tool invoked?

Output true/false for each check and provide a short explanation for your judgment.
""".strip()


# Since we expect a very well defined structure of the response, we can use structured output.

# We can define a Pydantic class with the expected response structure, and the LLM will produce output that matches this schema exactly.

# This is how we do it:

from pydantic import BaseModel

class EvaluationCheck(BaseModel):
    check_name: str
    justification: str
    check_pass: bool

class EvaluationChecklist(BaseModel):
    checklist: list[EvaluationCheck]
    summary: str


# This code defines the structure we expect from our evaluation:
# - Each check has a name, justification, and pass/fail result
# - The overall evaluation includes a list of checks and a summary

# Note that justification comes before check_pass. This makes the LLM reason about the answer before giving the final judgment, which typically leads to better evaluation quality.

# With Pydantic AI, we make the output follow the specified class by passing the output_type parameter:

eval_agent = Agent(
    name='eval_agent',
    model='groq:llama-3.1-8b-instant',
    instructions=evaluation_prompt,
    output_type=EvaluationChecklist
)


# Usually it's a good idea to evaluate the results of one model (in our case, "gemma2-9b-it") with another model (here, "llama-3.1-8b-instant").
# A different model can catch mistakes, reduce self-bias, and give a second opinion. This makes evaluations more reliable.

# We have the instructions, and we have the agent. In order to run the agent, it needs input. We'll start with a template:

user_prompt_format = """
<INSTRUCTIONS>{instructions}</INSTRUCTIONS>
<QUESTION>{question}</QUESTION>
<ANSWER>{answer}</ANSWER>
<LOG>{log}</LOG>
""".strip()

# We use XML markup because it makes the input clearer for LLMs. XML tags help the model see the structure and boundaries of the different sections in the prompt.

# Let's fill it in. First, define a helper function for loading JSON log files:

def load_log_file(log_file):
    with open(log_file, 'r') as f_in:
        log_data = json.load(f_in)
        log_data['log_file'] = log_file
        return log_data

# We also add the filename in the result - it'll help us with tracking later.

# Now let's use it:

log_record = load_log_file('logs/faq_agent_v2_20251004_071817_8d1665.json')

instructions = log_record['system_prompt']
question = log_record['messages'][0]['parts'][0]['content']
answer = log_record['messages'][-1]['parts'][0]['content']
log = json.dumps(log_record['messages'])

user_prompt = user_prompt_format.format(
    instructions=instructions,
    question=question,
    answer=answer,
    log=log
)


# The user input is ready and we can test it!

result = await eval_agent.run(user_prompt, output_type=EvaluationChecklist)

checklist = result.output
print(checklist.summary)

for check in checklist.checklist:
    print(check)


# This code:
# - Loads a saved interaction log
# - Extracts the key components (instructions, question, answer, full log)
# - Formats them into the evaluation prompt
# - Runs the evaluation agent
# - Prints the results



In [94]:
# Note that we're putting the entire conversation log into the prompt, which is not really necessary. We can reduce it to make it less verbose.

# For example, like this:

def simplify_log_messages(messages):
    log_simplified = []

    for m in messages:
        parts = []

        for original_part in m['parts']:
            part = original_part.copy()
            kind = part['part_kind']

            # pop(key, None) tolerates missing keys, which may vary between library versions
            if kind == 'user-prompt':
                part.pop('timestamp', None)
            if kind == 'tool-call':
                part.pop('tool_call_id', None)
            if kind == 'tool-return':
                part.pop('tool_call_id', None)
                part.pop('metadata', None)
                part.pop('timestamp', None)
                # Replace actual search results with placeholder to save tokens
                part['content'] = 'RETURN_RESULTS_REDACTED'
            if kind == 'text':
                part.pop('id', None)

            parts.append(part)

        message = {
            'kind': m['kind'],
            'parts': parts
        }

        log_simplified.append(message)
    return log_simplified

# We make it simpler:
# - remove timestamps and IDs that aren't needed for evaluation
# - replace actual search results with a placeholder
# - keep only the essential structure

# This is helpful because it reduces the number of tokens we send to the evaluation model, which lowers the costs and speeds up evaluation.

In [None]:
# Let's put everything together

async def evaluate_log_record(eval_agent, log_record):
    messages = log_record['messages']

    instructions = log_record['system_prompt']
    question = messages[0]['parts'][0]['content']
    answer = messages[-1]['parts'][0]['content']

    log_simplified = simplify_log_messages(messages)
    log = json.dumps(log_simplified)

    user_prompt = user_prompt_format.format(
        instructions=instructions,
        question=question,
        answer=answer,
        log=log
    )

    result = await eval_agent.run(user_prompt, output_type=EvaluationChecklist)
    return result.output


log_record = load_log_file('logs/faq_agent_v2_20251004_071817_8d1665.json')
eval1 = await evaluate_log_record(eval_agent, log_record)

print(eval1)

# We know how to log our data and how to run evals on our logs.
# Great. But how do we get more data to get a better understanding of the performance of our model?


In [97]:
# Data Generation:

# We can ask AI to help. What if we used it to generate more questions? Let's do that.

# We can sample some records from our database. Then for each record, ask an LLM to generate a question based on the record.
# We use this question as input to our agent and log the answers.

# Let's start by defining the question generator:

question_generation_prompt = """
You are helping to create test questions for an AI agent that answers questions about a data engineering course.

Based on the provided FAQ content, generate realistic questions that students might ask.

The questions should:

- Be natural and varied in style
- Range from simple to complex
- Include both specific technical questions and general course questions

Generate one question for each record.
""".strip()

class QuestionsList(BaseModel):
    questions: list[str]

question_generator = Agent(
    name="question_generator",
    instructions=question_generation_prompt,
    model='groq:llama-3.1-8b-instant',
    output_type=QuestionsList
)

# This prompt is designed for our specific use case (data engineering course FAQ). You should adjust it for your project.

# We will send it a bunch of records, and it will generate a question from each of them.

# Note: we use a simple way of generating questions. We can use a more complex approach where we also track the source (filename) of the question. If we do it, we can later check if this file was retrieved and cited in the answer. But we won't do it today to make things simpler.
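
# For reference, the source-tracking idea could look roughly like this. It is a minimal sketch and not part of today's pipeline: pair_questions_with_sources and source_was_cited are hypothetical helpers, the example record and answer are made up, and the sketch assumes the generator returns exactly one question per record in the same order.

```python
def pair_questions_with_sources(records, questions):
    """Attach the source filename to each generated question.

    Assumes one question per record, in the same order as the records.
    """
    if len(records) != len(questions):
        raise ValueError("expected exactly one question per record")

    return [
        {"question": q, "source_file": r["filename"]}
        for r, q in zip(records, questions)
    ]


def source_was_cited(answer, source_file, repo_prefix="faq-main/"):
    """Check whether the answer mentions the path of the source file."""
    # The agent is asked to replace "faq-main" with the GitHub URL,
    # so we compare only the part of the path after that prefix
    relative_path = source_file.removeprefix(repo_prefix)
    return relative_path in answer


# Hypothetical record and answer, used only for illustration
records = [{"filename": "faq-main/_questions/example/001_hypothetical.md"}]
questions = ["Can I join late?"]

pairs = pair_questions_with_sources(records, questions)
answer = (
    "Yes, you can. See [FAQ]"
    "(https://github.com/DataTalksClub/faq/blob/main/_questions/example/001_hypothetical.md)"
)
cited = source_was_cited(answer, pairs[0]["source_file"])
print(cited)
```

# A substring check like this is deliberately crude. It only tells us that the expected path appears somewhere in the answer.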



In [None]:
# Now let's sample 10 records from our dataset using Python's built-in random.sample function:

import random

sample = random.sample(de_dtc_faq, 10)
prompt_docs = [d['content'] for d in sample]
prompt = json.dumps(prompt_docs)

result = await question_generator.run(prompt)
questions = result.output.questions

print(questions)

In [None]:
# Now we simply iterate over the questions, ask our agent, and log the results:

from tqdm.auto import tqdm

for q in tqdm(questions):
    print(q)

    result = await agent.run(user_prompt=q)
    print(result.output)

    log_interaction_to_file(
        agent,
        result.new_messages(),
        source='ai-generated'
    )

    print()

# We can repeat it multiple times until we have enough data. Around 100 should be good for a start, but today we can just continue with the 10 log records we already generated.

# Using AI for generating test data is quite powerful. It can help us get data faster and sometimes cover edge cases we won't think about.

# There are limitations too:
# - AI-generated questions might not reflect real user behavior
# - They may miss important edge cases that only real users encounter
# - They may not capture the full complexity of real user queries

# The logs are ready, so we can run evaluation on them with our evaluation agent.


In [None]:
# First, collect all the AI-generated logs for the v2 agent:


eval_set = []

for log_file in LOG_DIR.glob('*.json'):
    if 'faq_agent_v2' not in log_file.name:
        continue

    log_record = load_log_file(log_file)
    if log_record['source'] != 'ai-generated':
        continue

    eval_set.append(log_record)

eval_results = []

for log_record in tqdm(eval_set):
    eval_result = await evaluate_log_record(eval_agent, log_record)
    eval_results.append((log_record, eval_result))

# This code:
# - Loops through each AI-generated log
# - Runs our evaluation agent on it
# - Stores both the original log and evaluation result

# There are ways to speed this up, but we won't cover them in detail here. For example, you can try this:
# - Don't ask for justification - this makes evaluation faster but may slightly lower its quality
# - Parallelize execution - you can ask ChatGPT how to do this with async/await
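
# The parallel option can be sketched with asyncio alone. This is a minimal sketch under assumptions: evaluate_all is a hypothetical helper, fake_evaluate is a stand-in for a real LLM call, and the default max_concurrency of 5 is an arbitrary value that you would tune to your provider's rate limits.

```python
import asyncio


async def evaluate_all(records, evaluate_fn, max_concurrency=5):
    """Run evaluate_fn on every record, at most max_concurrency at a time.

    Returns (record, result) pairs in the same order as records.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def evaluate_one(record):
        # The semaphore caps the number of calls in flight at once
        async with semaphore:
            return await evaluate_fn(record)

    # gather returns results in the same order as its inputs
    results = await asyncio.gather(*(evaluate_one(r) for r in records))
    return list(zip(records, results))


async def fake_evaluate(record):
    # Stand-in for a real LLM call, used only to make this sketch runnable
    await asyncio.sleep(0.01)
    return {"question": record["question"], "passed": True}
```

# In the notebook, this could replace the loop above with something like: eval_results = await evaluate_all(eval_set, lambda r: evaluate_log_record(eval_agent, r)). If any single evaluation raises, gather propagates the exception, so you may want to add error handling for long runs.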


In [107]:
# The results are collected, but we need to display them and also calculate some statistics. The best tool for doing this is Pandas. We should already have it because minsearch depends on it.

# But we can make it an explicit dependency:

!uv pip install pandas

Using Python 3.12.11 environment at: /usr
Audited 1 package in 250ms


In [None]:
# Our data is not ready to be converted to a Pandas DataFrame. We first need to transform it a little. Let’s do it:

rows = []

for log_record, eval_result in eval_results:
    messages = log_record['messages']

    row = {
        'file': log_record['log_file'].name,
        'question': messages[0]['parts'][0]['content'],
        'answer': messages[-1]['parts'][0]['content'],
    }

    checks = {c.check_name: c.check_pass for c in eval_result.checklist}
    row.update(checks)

    rows.append(row)

# This code:
# - Extracts key information from each log (file, question, answer)
# - Converts the evaluation checks into a dictionary format

# Now each row is a simple key-value dictionary, so we can create a DataFrame:

import pandas as pd

df_evals = pd.DataFrame(rows)

# We can look at individual records and see which checks are False.

# But it's also useful to look at the overall stats:

df_evals.mean(numeric_only=True)

# This calculates the average pass rate for each check:

# instructions_follow    0.3
# instructions_avoid     1.0
# answer_relevant        1.0
# answer_clear           1.0
# answer_citations       0.3
# completeness           0.7
# tool_call_search       1.0

# This tells us:
# - Only 30% of responses follow instructions completely
# - All responses avoid forbidden actions (good!)
# - All responses are relevant and clear (great!)
# - Only 30% include proper citations (needs improvement)
# - 70% of responses are complete
# - All responses use the search tool (as expected)

# For us, the most important check is answer_relevant. This tells us whether the agent actually answers the user's question. If this score were low, it would mean that our agent is not ready.

# We now know how to evaluate our agent. What can we do with it now?

# Many things:
# - Decide if this quality is good enough for deployment
# - Evaluate different chunking approaches and search
# - See if changing a prompt leads to any improvements.

# The algorithm is simple:
# - Collect data for evaluation and keep this dataset fixed
# - Run different versions of your agent for this dataset
# - Compare key metrics to decide which version is better
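
# The comparison step can be sketched without any libraries. This is a minimal sketch, not part of the original lesson: pass_rates and compare_versions are hypothetical helpers, and the rows are made-up data in the same shape as the rows list we built for the DataFrame.

```python
def pass_rates(rows, check_names):
    """Share of rows where each check passed (missing checks count as failed)."""
    if not rows:
        return {name: 0.0 for name in check_names}
    return {
        name: sum(bool(row.get(name, False)) for row in rows) / len(rows)
        for name in check_names
    }


def compare_versions(rows_a, rows_b, check_names):
    """Per-check pass rates for two agent versions and their difference."""
    rates_a = pass_rates(rows_a, check_names)
    rates_b = pass_rates(rows_b, check_names)
    return {
        name: {
            "a": rates_a[name],
            "b": rates_b[name],
            "diff": rates_b[name] - rates_a[name],
        }
        for name in check_names
    }


# Hypothetical evaluation rows for two versions on the same fixed questions
rows_v1 = [
    {"answer_relevant": True, "answer_citations": False},
    {"answer_relevant": True, "answer_citations": False},
]
rows_v2 = [
    {"answer_relevant": True, "answer_citations": True},
    {"answer_relevant": True, "answer_citations": False},
]

comparison = compare_versions(rows_v1, rows_v2, ["answer_relevant", "answer_citations"])
for name, stats in comparison.items():
    print(f"{name}: {stats['a']:.2f} -> {stats['b']:.2f} ({stats['diff']:+.2f})")
```

# With only a handful of questions, small differences are mostly noise, so treat them with caution until the evaluation set is larger.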

# Evaluation is a very powerful tool and we should use it when possible.



In [None]:
# Evaluating functions and tools

# Also, we can (and should) evaluate our tools separately from evaluating the agent.

# If it's code, we need to cover it with unit and integration tests.

# We also have the search function, which we can evaluate using standard information retrieval metrics. For example:
# - Precision and Recall: Precision is the share of retrieved results that are relevant; recall is the share of relevant documents that were retrieved
# - Hit Rate: Percentage of queries that return at least one relevant result
# - MRR (Mean Reciprocal Rank): Reflects the position of the first relevant result in the ranking
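
# Precision and recall for a single query can be sketched as follows. This is a minimal sketch: precision_recall_at_k is a hypothetical helper, the document ids are made up, and it assumes the retrieved list contains no duplicates.

```python
def precision_recall_at_k(retrieved, relevant, k=5):
    """Precision@k and recall@k for a single query.

    retrieved: ranked list of document ids returned by the search
    relevant: collection of document ids known to be relevant
    """
    top_k = retrieved[:k]
    relevant = set(relevant)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)

    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: 2 of the 4 retrieved documents are relevant,
# and 2 of the 3 relevant documents were retrieved
precision, recall = precision_recall_at_k(
    retrieved=["a", "b", "c", "d"],
    relevant={"a", "c", "e"},
)
print(precision, recall)
```

# For an FAQ where each question usually has one relevant document, recall and hit rate tend to tell a similar story, while precision is capped by the number of results we request.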

# This is how we can implement hit rate and MRR calculation in Python:

def evaluate_search_quality(search_function, test_queries):
    results = []

    for query, expected_docs in test_queries:
        search_results = search_function(query, num_results=5)

        # Calculate hit rate
        relevant_found = any(doc['filename'] in expected_docs for doc in search_results)

        # Calculate MRR
        for i, doc in enumerate(search_results):
            if doc['filename'] in expected_docs:
                mrr = 1 / (i + 1)
                break
        else:
            mrr = 0

        results.append({
            'query': query,
            'hit': relevant_found,
            'mrr': mrr
        })
    return results
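
# The function above returns one row per query. To get the overall hit rate and MRR, we average those rows. A minimal sketch, where summarize_search_quality is a hypothetical helper and example_results is made-up data in the format returned by evaluate_search_quality:

```python
def summarize_search_quality(results):
    """Average the per-query 'hit' and 'mrr' values into overall metrics."""
    if not results:
        return {"hit_rate": 0.0, "mrr": 0.0}

    n = len(results)
    return {
        "hit_rate": sum(1 for r in results if r["hit"]) / n,
        "mrr": sum(r["mrr"] for r in results) / n,
    }


# Hypothetical per-query results used only for illustration
example_results = [
    {"query": "q1", "hit": True, "mrr": 1.0},   # relevant doc at rank 1
    {"query": "q2", "hit": True, "mrr": 0.5},   # relevant doc at rank 2
    {"query": "q3", "hit": False, "mrr": 0.0},  # nothing relevant in the top 5
    {"query": "q4", "hit": True, "mrr": 0.5},   # relevant doc at rank 2
]

summary = summarize_search_quality(example_results)
print(summary)
```

# Note that evaluate_search_quality calls search_function(query, num_results=5), so it expects something like faq_index.search rather than our text_search tool, which takes only the query.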

# We won't do it today, but these ideas and the code will be useful when you implement a real agent project with search.

# It's useful because it helps us make informed decisions about:
# - When to use text vs. vector vs. hybrid search
# - What are the best parameters for our search

# You can ask ChatGPT to learn more about information retrieval evaluation metrics.

# This was a very long lesson, but an important one. We finished it and evaluated our agent. It’s good for deployment, so tomorrow we’ll create a UI for it and deploy it to the internet.
