# Day 5: Evaluation

https://docs.google.com/document/d/1laKd7zWd9P6vgM06gIQMX7_EHdaPy43o5qIfz_6-zgE/edit?tab=t.0#heading=h.6f40algdx96s

In [3]:
from utils.ingest import read_repo_data
from minsearch import Index

# DataTalksClub FAQ
dtc_faq = read_repo_data('DataTalksClub', 'faq')

de_dtc_faq = [d for d in dtc_faq if 'data-engineering' in d['filename']]

faq_index = Index(
    text_fields=["question", "content"],
    keyword_fields=[]
)

faq_index.fit(de_dtc_faq)

<minsearch.minsearch.Index at 0x71224214efc0>

In [1]:
from typing import List, Any
from pydantic_ai import Agent


def text_search(query: str) -> List[Any]:
    """
    Perform a text-based search on the FAQ index.

    Args:
        query (str): The search query string.

    Returns:
        List[Any]: A list of up to 5 search results returned by the FAQ index.
    """
    return faq_index.search(query, num_results=5)


system_prompt = """
You are a helpful assistant for a  course. 

Use the search tool to find relevant information from the course materials before answering questions.

If you can find specific information through search, use it to provide accurate answers.
If the search doesn't return relevant results, let the user know and provide general guidance.
"""

from pydantic_ai import Agent

agent = Agent(
    name="faq_agent",
    instructions=system_prompt,
    tools=[text_search],
    model='groq:openai/gpt-oss-20b'
)


In [5]:
question = "how do I install Kafka in Python?"
result = await agent.run(user_prompt=question)

In [6]:
print(result.output)

Below is a quick‚Äëstart guide for getting a **Kafka client** up and running in Python.  
If you actually need to run a Kafka **broker** on your machine, the steps are a little different (see the note at the end).

---

## 1. Install the Python Kafka client library

There are two popular client libraries:

| Library | Typical use case | Install command |
|---------|------------------|-----------------|
| `confluent-kafka` | High‚Äëperformance, native C library (recommended for production) | `pip install confluent-kafka` |
| `kafka-python` | Pure‚ÄëPython, easier to install in minimal environments | `pip install kafka-python` |

> **Tip:**  `confluent-kafka` is the library that the course‚Äôs example code uses.  It is fast and is the one shown in the FAQ entry *‚ÄúPython Kafka: Installing dependencies for python3 06‚Äëstreaming/python/avro_example/producer.py‚Äù*.

### Example

```bash
# Create (or activate) a virtual environment first
python -m venv venv
source venv/bin/activate   # Wi

### Logging

We want to capture the following information from each agent run:

- **System prompt** used  
- **Model** chosen  
- **User query**  
- **Tools** invoked  
- **LLM ‚Üî tool interactions** (inputs/outputs)  
- **Final response**  

---

#### Simple Approach
- Implement a minimal logging system.  
- Save logs as **JSON files** for easy review.  

---

#### ‚ö†Ô∏è Important
- This is **not production-ready**.  
- In practice, send logs to a log collection system or use specialized LLM evaluation tools such as:  
  - **Evidently**  
  - **LangWatch**  
  - **Arize Phoenix**  

---

#### Example JSON Log Structure
```json
{
  "agent_name": "support_agent",
  "system_prompt": "You are a helpful assistant.",
  "provider": "groq",
  "model": "openai/gpt-oss-20b",
  "tools": ["text_search"],
  "messages": [
    {
      "role": "user",
      "content": "What is scaffolding in education?"
    },
    {
      "role": "assistant",
      "content": "Scaffolding in education means providing structured support..."
    }
  ],
  "source": "user"
}
```




In [4]:
from pydantic_ai.messages import ModelMessagesTypeAdapter


def log_entry(agent, messages, source="user"):
    tools = []

    for ts in agent.toolsets:
        tools.extend(ts.tools.keys())

    # Convert internal message format into regular Python dictionaries
    dict_messages = ModelMessagesTypeAdapter.dump_python(messages)

    return {
        "agent_name": agent.name,
        "system_prompt": agent._instructions,
        "provider": agent.model.system,
        "model": agent.model.model_name,
        "tools": tools,
        "messages": dict_messages,
        "source": source
    }


In [5]:
import json
import secrets
from pathlib import Path
from datetime import datetime

# Create logs directory if it doesn't exist
LOG_DIR = Path('logs')
LOG_DIR.mkdir(exist_ok=True)


# Custom serializer for datetime objects
def serializer(obj):
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")


def log_interaction_to_file(agent, messages, source='user'):
    entry = log_entry(agent, messages, source)

    # Generate a unique filename with timestamp and random hex
    ts = entry['messages'][-1]['timestamp']
    ts_str = ts.strftime("%Y%m%d_%H%M%S")
    rand_hex = secrets.token_hex(3)

    filename = f"{agent.name}_{ts_str}_{rand_hex}.json"
    filepath = LOG_DIR / filename

    # Save the completed logs to a JSON file
    with filepath.open("w", encoding="utf-8") as f_out:
        json.dump(entry, f_out, indent=2, default=serializer)

    return filepath

In [9]:
# Create a simple interaction loop
question = input()
result = await agent.run(user_prompt=question)
print(result.output)
log_interaction_to_file(agent, result.new_messages())

**Yes ‚Äì you can still earn a certificate even if you join the cohort late.**  

The key requirements are:

| Requirement | What you need to do |
|-------------|---------------------|
| **Be part of a live cohort** | You must register for a scheduled cohort (self‚Äëpaced tracks do not award certificates). |
| **Submit the peer‚Äëreviewed capstone projects on time** | These are the only assignments that are required for certification. As long as you turn in the capstone projects by their deadlines and receive peer reviews, you are eligible. |
| **No need to do the homeworks if you join late** | The FAQ specifically states: ‚ÄúYou do **not** need to do the homeworks if you join late, for example.‚Äù |

---

### How to get the certificate once you‚Äôre in

1. **Join the cohort** ‚Äì sign up via the course enrollment page for the edition you‚Äôre interested in (e.g., 2025: https://courses.datatalks.club/de-zoomcamp-2025/enrollment).
2. **Complete the capstone(s)** ‚Äì submit your projects

PosixPath('logs/faq_agent_20251002_111315_0a47e6.json')

In [45]:
result.new_messages()

[ModelRequest(parts=[UserPromptPart(content='can I join late and get a certificate?', timestamp=datetime.datetime(2025, 10, 2, 11, 57, 9, 637966, tzinfo=datetime.timezone.utc))], instructions='You are a helpful assistant for a course.  \n\nUse the search tool to find relevant information from the course materials before answering questions.  \n\nIf you can find specific information through search, use it to provide accurate answers.\n\nAlways include references by citing the filename of the source material you used.  \nWhen citing the reference, replace "faq-main" by the full path to the GitHub repository: "https://github.com/DataTalksClub/faq/blob/main/"\nFormat: [LINK TITLE](FULL_GITHUB_LINK)\n\nIf the search doesn\'t return relevant results, let the user know and provide general guidance.'),
 ModelResponse(parts=[ThinkingPart(content='We need to find relevant info from course materials. Search query: "join late certificate".'), ToolCallPart(tool_name='text_search', args='{"query":"j

### Adding References

Add references to the original documents in the agent‚Äôs responses.

In [6]:
system_prompt = """
You are a helpful assistant for a course.  

Use the search tool to find relevant information from the course materials before answering questions.  

If you can find specific information through search, use it to provide accurate answers.

Always include references by citing the filename of the source material you used.  
When citing the reference, replace "faq-main" by the full path to the GitHub repository: "https://github.com/DataTalksClub/faq/blob/main/"
Format: [LINK TITLE](FULL_GITHUB_LINK)

If the search doesn't return relevant results, let the user know and provide general guidance.  
""".strip()

# Create another version of agent, let's call it faq_agent_v2
agent = Agent(
    name="faq_agent_v2",
    instructions=system_prompt,
    tools=[text_search],
    model='groq:openai/gpt-oss-20b'
)


In [47]:
question = "can I join late and get a certificate?"
result = await agent.run(user_prompt=question)
print(result.output)
log_interaction_to_file(agent, result.new_messages())

You can indeed join a cohort late and still earn a certificate‚Äîprovided you finish the **peer‚Äëreviewed capstone projects on time**.  
The course does not require you to complete the homework assignments if you‚Äôre a late‚Äëarriving student. Just make sure you stay on top of the capstone deadlines and participate in the required peer reviews while the cohort is running.

> **Reference**:  
> *[Certificate: Do I need to do the homeworks to get the certificate?](https://github.com/DataTalksClub/faq/blob/main/faq-main/_questions/data-engineering-zoomcamp/general/014_3774a79c13_certificate-do-i-need-to-do-the-homeworks-to-get-t.md)*


PosixPath('logs/faq_agent_v2_20251002_115904_1cd9e6.json')

In [48]:
result.new_messages()

[ModelRequest(parts=[UserPromptPart(content='can I join late and get a certificate?', timestamp=datetime.datetime(2025, 10, 2, 11, 59, 3, 812232, tzinfo=datetime.timezone.utc))], instructions='You are a helpful assistant for a course.  \n\nUse the search tool to find relevant information from the course materials before answering questions.  \n\nIf you can find specific information through search, use it to provide accurate answers.\n\nAlways include references by citing the filename of the source material you used.  \nWhen citing the reference, replace "faq-main" by the full path to the GitHub repository: "https://github.com/DataTalksClub/faq/blob/main/"\nFormat: [LINK TITLE](FULL_GITHUB_LINK)\n\nIf the search doesn\'t return relevant results, let the user know and provide general guidance.'),
 ModelResponse(parts=[ThinkingPart(content='The user asks: "can I join late and get a certificate?" They likely refer to a course. We need to search relevant information from the course material

### Evaluation

- **Vibe Check**:  
  - A quick manual review to get a sense of how well the model is performing.  
  - Collect ~10‚Äì20 examples for inspection.  
  - Helps identify edge cases and define future evaluation criteria.  

- **LLM as a Judge**:  
  - Use one LLM to evaluate another‚Äôs outputs.  
  - Checks can include:  
    - Following instructions  
    - Answer quality/relevance  
    - References included  
    - Proper tool usage  
  - Automates evaluation once criteria are defined.  


In [None]:
evaluation_prompt = """
Use this checklist to evaluate the quality of an AI agent's answer (<ANSWER>) to a user question (<QUESTION>).
We also include the entire log (<LOG>) for analysis.

For each item, check if the condition is met. 

Checklist:

- instructions_follow: The agent followed the user's instructions (in <INSTRUCTIONS>)
- instructions_avoid: The agent avoided doing things it was told not to do  
- answer_relevant: The response directly addresses the user's question  
- answer_clear: The answer is clear and correct  
- answer_citations: The response includes proper citations or sources when required  
- completeness: The response is complete and covers all key aspects of the request
- tool_call_search: The agent invoked search tool (in <LOG>)

Only fill true/false and justification. Do not include any explanation or extra fields.
""".strip()

#### Structured Output

- Use structured output when expecting a well-defined response format.  
- Define a **Pydantic class** with the expected schema.  
- The LLM will generate output that matches the schema exactly, ensuring consistency and easier parsing.  


In [None]:
from pydantic import BaseModel, Field

In [239]:
class EvaluationCheck(BaseModel):
    check_name: str = Field(description="The name of the check from the provided checklist.")
    justification: str = Field(description="A brief explanation for the check_pass result.")
    check_pass: bool = Field(description="True if the condition is met, False otherwise.")

class EvaluationChecklist(BaseModel):
    checklist: list[EvaluationCheck]
    summary: str = Field(description="A concise final summary of the agent's performance.")


In [None]:
# Create an evaluation agent with structured output
eval_agent = Agent(
    name='eval_agent',
    model='groq:llama-3.3-70b-versatile', # Use the model that can handle structured output
    instructions=evaluation_prompt,
    output_type=EvaluationChecklist
)


In [247]:
# Input format for the evaluation agent - use XML-like tags for better understanding
user_prompt_format = """
<INSTRUCTIONS>{instructions}</INSTRUCTIONS>
<QUESTION>{question}</QUESTION>
<ANSWER>{answer}</ANSWER>
<LOG>{log}</LOG>
""".strip()

In [248]:
# Helper function to load log file
def load_log_file(log_file):
    with open(log_file, 'r') as f_in:
        log_data = json.load(f_in)
        log_data['log_file'] = log_file
        return log_data

In [249]:
# Loads a saved interaction log
log_record = load_log_file('./logs/faq_agent_v2_20251002_115904_1cd9e6.json')

# Extracts the key components (instructions, question, answer, full log)
instructions = log_record['system_prompt'][0]
question = log_record['messages'][0]['parts'][0]['content']
answer = log_record['messages'][-1]['parts'][-1]['content']
log = json.dumps(log_record['messages'])

print("Instructions:", instructions)
print("Question:", question)
print("Answer:", answer)
print("Log:", log[:500], "...")  # Print only the first 500 characters

Instructions: You are a helpful assistant for a course.  

Use the search tool to find relevant information from the course materials before answering questions.  

If you can find specific information through search, use it to provide accurate answers.

Always include references by citing the filename of the source material you used.  
When citing the reference, replace "faq-main" by the full path to the GitHub repository: "https://github.com/DataTalksClub/faq/blob/main/"
Format: [LINK TITLE](FULL_GITHUB_LINK)

If the search doesn't return relevant results, let the user know and provide general guidance.
Question: can I join late and get a certificate?
Answer: You can indeed join a cohort late and still earn a certificate‚Äîprovided you finish the **peer‚Äëreviewed capstone projects on time**.  
The course does not require you to complete the homework assignments if you‚Äôre a late‚Äëarriving student. Just make sure you stay on top of the capstone deadlines and participate in the requ

In [250]:
# Formats them into the evaluation prompt
user_prompt = user_prompt_format.format(
    instructions=instructions,
    question=question,
    answer=answer,
    log=log
)

print("User Prompt:", user_prompt[:500], "...")  # Print only the first 500 characters

User Prompt: <INSTRUCTIONS>You are a helpful assistant for a course.  

Use the search tool to find relevant information from the course materials before answering questions.  

If you can find specific information through search, use it to provide accurate answers.

Always include references by citing the filename of the source material you used.  
When citing the reference, replace "faq-main" by the full path to the GitHub repository: "https://github.com/DataTalksClub/faq/blob/main/"
Format: [LINK TITLE](F ...


In [251]:
# Runs the evaluation agent
result = await eval_agent.run(user_prompt, output_type=EvaluationChecklist)

checklist = result.output
print(checklist.summary)

for check in checklist.checklist:
    print(check)

The assistant provided a clear and accurate answer with proper citations, followed the instructions, and used the search tool effectively.
check_name='instructions_follow' justification='The assistant followed the instructions by searching for relevant information and providing a reference.' check_pass=True
check_name='instructions_avoid' justification='The assistant avoided doing things it was told not to do.' check_pass=True
check_name='answer_relevant' justification="The response directly addresses the user's question." check_pass=True
check_name='answer_clear' justification='The answer is clear and correct.' check_pass=True
check_name='answer_citations' justification='The response includes proper citations and sources.' check_pass=True
check_name='completeness' justification='The response is complete and covers all key aspects of the request.' check_pass=True
check_name='tool_call_search' justification='The agent invoked the search tool.' check_pass=True


#### Reduce prompt verbosity: 

Keep the prompt concise by including only the necessary context, rather than the entire conversation log.

- remove timestamps and IDs that aren't needed for evaluation
- remove thinking part that are not necessary
- replace actual search results with a placeholder
- keep only the essential structure

This reduces the number of tokens we send to the evaluation model, which lowers the costs and speeds up evaluation.

In [189]:
def simplify_log_messages(messages):
    log_simplified = []

    for m in messages:
        parts = []
    
        for original_part in m['parts']:
            part = original_part.copy()
            kind = part['part_kind']
            
            if kind == 'thinking':
                part['content'] = 'THINKING_REDACTED'
            if kind == 'user-prompt':
                del part['timestamp']
            if kind == 'tool-call':
                del part['tool_call_id']
            if kind == 'tool-return':
                del part['tool_call_id']
                del part['metadata']
                del part['timestamp']
                # Replace actual search results with placeholder to save tokens
                part['content'] = 'RETURN_RESULTS_REDACTED'
            if kind == 'text':
                del part['id']
    
            parts.append(part)
    
        message = {
            'kind': m['kind'],
            'parts': parts
        }
    
        log_simplified.append(message)
    return log_simplified

In [190]:
simple_log = simplify_log_messages(log_record['messages'])
print(json.dumps(simple_log, indent=2))  # Print only

[
  {
    "kind": "request",
    "parts": [
      {
        "content": "can I join late and get a certificate?",
        "part_kind": "user-prompt"
      }
    ]
  },
  {
    "kind": "response",
    "parts": [
      {
        "content": "THINKING_REDACTED",
        "id": null,
        "signature": null,
        "provider_name": null,
        "part_kind": "thinking"
      },
      {
        "tool_name": "text_search",
        "args": "{\"query\":\"join late certificate course\"}",
        "part_kind": "tool-call"
      }
    ]
  },
  {
    "kind": "request",
    "parts": [
      {
        "tool_name": "text_search",
        "content": "RETURN_RESULTS_REDACTED",
        "part_kind": "tool-return"
      }
    ]
  },
  {
    "kind": "response",
    "parts": [
      {
        "content": "THINKING_REDACTED",
        "id": null,
        "signature": null,
        "provider_name": null,
        "part_kind": "thinking"
      },
      {
        "content": "You can indeed join a cohort late and s

In [191]:
async def evaluate_log_record(eval_agent, log_record):
    messages = log_record['messages']

    instructions = log_record['system_prompt']
    question = messages[0]['parts'][0]['content']
    answer = messages[-1]['parts'][-1]['content']

    log_simplified = simplify_log_messages(messages)
    log = json.dumps(log_simplified)

    user_prompt = user_prompt_format.format(
        instructions=instructions,
        question=question,
        answer=answer,
        log=log
    )

    result = await eval_agent.run(user_prompt, output_type=EvaluationChecklist)
    return result.output 


log_record = load_log_file('./logs/faq_agent_v2_20251002_115904_1cd9e6.json')
eval1 = await evaluate_log_record(eval_agent, log_record)

In [192]:
print(eval1)

checklist=[EvaluationCheck(check_name='instructions_follow', justification='Used the search tool and gave a reference, but the citation still contains the placeholder "faq-main" instead of the full repository path as instructed.', check_pass=False), EvaluationCheck(check_name='instructions_avoid', justification='No prohibited actions were taken.', check_pass=True), EvaluationCheck(check_name='answer_relevant', justification="The response directly answers the user's question about joining late and obtaining a certificate.", check_pass=True), EvaluationCheck(check_name='answer_clear', justification='The answer is concise, understandable, and accurately reflects the policy.', check_pass=True), EvaluationCheck(check_name='answer_citations', justification='Citation format is incorrect; it still includes "faq-main" instead of the full path replacement required.', check_pass=False), EvaluationCheck(check_name='completeness', justification='All key points (late join, certificate eligibility, c

### Data Generation

- Use AI to generate additional questions from sample records in the database.  
- For each record:  
  1. Ask an LLM to create a question based on the record.  
  2. Use the generated question as input to the agent and log the response.  
- Current approach is simple; more advanced methods could also track the source file for later verification.  
- Adjust prompts according to your specific project or use case.  

In [87]:
question_generation_prompt = """
You are helping to create test questions for an AI agent that answers questions about a data engineering course.

Based on the provided FAQ content, generate realistic questions that students might ask.

The questions should:

- Be natural and varied in style
- Range from simple to complex
- Include both specific technical questions and general course questions

Generate one question for each record.
""".strip()

class QuestionsList(BaseModel):
    questions: list[str]

question_generator = Agent(
    name="question_generator",
    instructions=question_generation_prompt,
    model='groq:openai/gpt-oss-20b',
    output_type=QuestionsList
)


In [88]:
import random

# Randomly sample 10 records from the FAQ data
sample = random.sample(de_dtc_faq, 10)
prompt_docs = [d['content'] for d in sample]

In [89]:
prompt_docs

['For example, when running `JsonConsumer.java`, you might see:\n\n```\nConsuming form kafka started\n\nRESULTS:::0\n\nRESULTS:::0\n\nRESULTS:::0\n```\n\nOr when running `JsonProducer.java`, you might encounter:\n\n```\nException in thread "main" java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.SaslAuthenticationException: Authentication failed\n```\n\n**Solution:**\n\n1. Ensure the `StreamsConfig.BOOTSTRAP_SERVERS_CONFIG` in the scripts located at `src/main/java/org/example/` (e.g., `JsonConsumer.java`, `JsonProducer.java`) is pointing to the correct server URL (e.g., `europe-west3` vs `europe-west2`).\n\n2. Verify that the cluster key and secrets are updated in `src/main/java/org/example/Secrets.java` (`KAFKA_CLUSTER_KEY` and `KAFKA_CLUSTER_SECRET`).',
 'GitHub Codespaces offers you computing Linux resources with many pre-installed tools (Docker, Docker Compose, Python).\n\nYou can also open any GitHub repository in a GitHub Codespace.',
 'Before you can develo

In [90]:
# Generate questions
prompt = json.dumps(prompt_docs)

result = await question_generator.run(prompt)
questions = result.output.questions

In [91]:
questions

["I keep seeing 'RESULTS:::0' when I run JsonConsumer.java, and I get a SaslAuthenticationException with JsonProducer.java‚Äîwhat steps should I take to correctly set up the bootstrap server and cluster credentials?",
 'How do I launch a GitHub Codespace for my data engineering project, and what pre-installed tools (like Docker, Python, etc.) are available in that environment?',
 'Before I can develop a dbt data model, what configuration steps do I need to perform, and how do I set up both a development and a deployment environment to run the jobs?',
 'My GCP VM is filling up while backfilling data. What practical methods can I use to free up disk space, especially regarding Anaconda, Kestra files, and PostgreSQL data?',
 'I‚Äôm behind a network restriction that blocks Google, and my Terraform runs fail. How can I configure my VPN or proxy settings so that the terminal program respects the system proxy?',
 'The NYC taxi data I download is a .csv.gz file, but my script expects a .csv fi

In [92]:
from tqdm.auto import tqdm

# Use the generated questions as input to the agent and log the response
for q in tqdm(questions):
    print(q)

    result = await agent.run(user_prompt=q)
    print(result.output)

    log_interaction_to_file(
        agent,
        result.new_messages(),
        source='ai-generated'
    )

    print()

  0%|          | 0/10 [00:00<?, ?it/s]

I keep seeing 'RESULTS:::0' when I run JsonConsumer.java, and I get a SaslAuthenticationException with JsonProducer.java‚Äîwhat steps should I take to correctly set up the bootstrap server and cluster credentials?
The two symptoms you‚Äôre seeing are both caused by an incorrect Kafka connection configuration.

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| `JsonConsumer.java` prints `RESULTS:::0` repeatedly | Consumer is connecting to the wrong broker (or no broker). | ‚Ä¢ Open `src/main/java/org/example/JsonConsumer.java` (and `JsonProducer.java`).<br>‚Ä¢ Verify that `StreamsConfig.BOOTSTRAP_SERVERS_CONFIG` points to the **correct** broker URL (e.g., `europe-west3` if that‚Äôs where your cluster lives). |
| `JsonProducer.java` throws `SaslAuthenticationException: Authentication failed` | The SASL credentials (key/secret) used by the producer don‚Äôt match the cluster‚Äôs credentials. | ‚Ä¢ Open `src/main/java/org/example/Secrets.java`.<br>‚Ä¢ Update `KAFKA_CLUSTER

- Repeat generation until sufficient data is collected (e.g., ~100 examples).  
- For now, we can use the 10 existing log records.  
- Benefits of AI-generated data:  
  - Faster data creation  
  - Can cover edge cases that might be overlooked  
- Limitations:  
  - May not reflect real user behavior  
  - Might miss edge cases only real users encounter  
  - May not capture full complexity of real queries 

In [93]:
# Collect all AI-generated logs for evaluation
eval_set = []

for log_file in LOG_DIR.glob('*.json'):
    if 'faq_agent_v2' not in log_file.name:
        continue

    log_record = load_log_file(log_file)
    if log_record['source'] != 'ai-generated':
        continue

    eval_set.append(log_record)

In [94]:
len(eval_set)

10

In [252]:
eval_results = []

for log_record in tqdm(eval_set):
    eval_result = await evaluate_log_record(eval_agent, log_record)
    print(eval_result)
    eval_results.append((log_record, eval_result)) # Store both log and evaluation result

  0%|          | 0/10 [00:00<?, ?it/s]

checklist=[EvaluationCheck(check_name='instructions_follow', justification="The agent followed the user's instructions by providing a detailed answer and citing the source material.", check_pass=True), EvaluationCheck(check_name='instructions_avoid', justification='The agent avoided doing things it was told not to do.', check_pass=True), EvaluationCheck(check_name='answer_relevant', justification="The response directly addresses the user's question about configuring VPN or proxy settings for Terraform.", check_pass=True), EvaluationCheck(check_name='answer_clear', justification='The answer is clear and correct, providing step-by-step instructions and examples.', check_pass=True), EvaluationCheck(check_name='answer_citations', justification='The response includes proper citations and sources, such as the GitHub repository link.', check_pass=True), EvaluationCheck(check_name='completeness', justification='The response is complete and covers all key aspects of the request, including commo

In [253]:
# Transform results into a tabular format for analysis
rows = []

for log_record, eval_result in eval_results:
    messages = log_record['messages']

    row = {
        'file': log_record['log_file'].name,
        'question': messages[0]['parts'][0]['content'],
        'answer': messages[-1]['parts'][0]['content'],
    }

    checks = {c.check_name: c.check_pass for c in eval_result.checklist}
    row.update(checks)

    rows.append(row)

In [256]:
import pandas as pd

df_evals = pd.DataFrame(rows)
df_evals.head()

Unnamed: 0,file,question,answer,instructions_follow,instructions_avoid,answer_relevant,answer_clear,answer_citations,completeness,tool_call_search
0,faq_agent_v2_20251002_170308_aa2d8d.json,I‚Äôm behind a network restriction that blocks G...,We have a relevant answer. Provide guidance ci...,True,True,True,True,True,True,True
1,faq_agent_v2_20251002_170415_dcb37b.json,My dbt profile's port is being read as a strin...,"We found the relevant answer: ""When configurin...",True,True,True,True,True,True,True
2,faq_agent_v2_20251002_170132_d8333c.json,How do I launch a GitHub Codespace for my data...,We have the relevant answer: question 023_5b4f...,True,True,True,True,True,True,True
3,faq_agent_v2_20251002_170249_7f475b.json,My GCP VM is filling up while backfilling data...,We found relevant answer. Need to answer user ...,True,True,True,True,True,True,True
4,faq_agent_v2_20251002_170221_515204.json,"Before I can develop a dbt data model, what co...","We need to answer user: ""Before I can develop ...",True,True,True,True,True,True,True


In [257]:
df_evals.mean(numeric_only=True)

instructions_follow    1.0
instructions_avoid     1.0
answer_relevant        1.0
answer_clear           1.0
answer_citations       0.9
completeness           1.0
tool_call_search       1.0
dtype: float64

### Evaluating Functions and Tools

- Tools should be evaluated **separately from the agent**.  
- For code tools: use **unit and integration tests**.  
- For search functions: evaluate with **information retrieval metrics**, such as:  
  - **Precision & Recall**: fraction of relevant results retrieved vs. missed  
  - **Hit Rate**: % of queries returning at least one relevant result  
  - **MRR (Mean Reciprocal Rank)**: position of the first relevant result in ranking  
- Implement Hit Rate and MRR calculations in Python for automated evaluation. 

In [270]:
def evaluate_search_quality(search_function, test_queries):
    results = []
    
    for query, expected_docs in test_queries:
        search_results = search_function(query)
        
        # Calculate hit rate
        relevant_found = any(doc['filename'] in expected_docs for doc in search_results)
        
        # Calculate MRR
        for i, doc in enumerate(search_results):
            if doc['filename'] in expected_docs:
                mrr = 1 / (i + 1)
                break
        else:
            mrr = 0
            
        results.append({
            'query': query,
            'hit': relevant_found,
            'mrr': mrr
        })
    return results

In [272]:
sample[:2]

[{'id': 'cd8a62fc55',
  'question': 'Java Kafka: When running the producer/consumer/etc java scripts, no results retrieved or no message sent',
  'sort_order': 24,
  'content': 'For example, when running `JsonConsumer.java`, you might see:\n\n```\nConsuming form kafka started\n\nRESULTS:::0\n\nRESULTS:::0\n\nRESULTS:::0\n```\n\nOr when running `JsonProducer.java`, you might encounter:\n\n```\nException in thread "main" java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.SaslAuthenticationException: Authentication failed\n```\n\n**Solution:**\n\n1. Ensure the `StreamsConfig.BOOTSTRAP_SERVERS_CONFIG` in the scripts located at `src/main/java/org/example/` (e.g., `JsonConsumer.java`, `JsonProducer.java`) is pointing to the correct server URL (e.g., `europe-west3` vs `europe-west2`).\n\n2. Verify that the cluster key and secrets are updated in `src/main/java/org/example/Secrets.java` (`KAFKA_CLUSTER_KEY` and `KAFKA_CLUSTER_SECRET`).',
  'filename': 'faq-main/_questions/d

In [273]:
tests_queries = [(item['question'], [item['filename']]) for item in sample]
tests_queries[:2]

[('Java Kafka: When running the producer/consumer/etc java scripts, no results retrieved or no message sent',
  ['faq-main/_questions/data-engineering-zoomcamp/module-6/024_cd8a62fc55_java-kafka-when-running-the-producerconsumeretc-ja.md']),
 ('Environment - Is GitHub codespaces an alternative to using cli/git bash to ingest the data and create a docker file?',
  ['faq-main/_questions/data-engineering-zoomcamp/general/023_5b4fb0c0a8_environment-is-github-codespaces-an-alternative-to.md'])]

In [275]:
test_results = evaluate_search_quality(text_search, tests_queries)

In [276]:
hit_rate = sum(r['hit'] for r in test_results) / len(test_results) if test_results else 0
avg_mrr = sum(r['mrr'] for r in test_results) / len(test_results) if test_results else 0

In [277]:
print(f"Hit Rate: {hit_rate:.2f}")
print(f"Average MRR: {avg_mrr:.2f}")

Hit Rate: 1.00
Average MRR: 1.00
