In [3]:
# Load environment variables and create the client
from dotenv import load_dotenv
from anthropic import Anthropic
load_dotenv()

client = Anthropic()
model = "claude-3-5-haiku-latest"


In [9]:
#helper functions

def add_user_message(messages, text):
    user_message = {"role":"user", "content":text}
    messages.append(user_message)

def add_assistant_message(messages, text):
    assistant_message = {"role": "assistant", "content": text}
    messages.append(assistant_message)

def chat(messages, system=None, temperature=1.0, stop_sequences=None):
    params = {
        "model": model,
        "max_tokens": 1000,
        "messages": messages,
        "temperature": temperature,
    }
    # Only send optional parameters when they are provided
    if stop_sequences:
        params["stop_sequences"] = stop_sequences
    if system:
        params["system"] = system

    message = client.messages.create(**params)
    return message.content[0].text

In [10]:
import json


def generate_dataset():
    prompt = """
Generate an evaluation dataset for prompt evaluation. The dataset will be used to evaluate prompts
that generate Python, JSON, or Regex specifically for AWS-related tasks. Generate an array of JSON objects,
each representing a task that requires Python, JSON, or a Regex to complete.

Example output:
```json
[
    {
        "task": "Description of task"
    }
]
```

- Focus on tasks that can be solved by writing a single Python function, a single JSON object, or a regular expression.
- Focus on tasks that do not require writing much code.

Please generate 3 objects.
""".strip()

    messages = []
    add_user_message(messages, prompt)
    add_assistant_message(messages, "```json")

    text = chat(messages, stop_sequences=["```"])
    return json.loads(text)
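
The evaluation cells further down read the dataset from `dataset.json`, so the generated dataset has to be written to disk first. A minimal sketch of that step, assuming `generate_dataset()` returns a JSON-serializable list of task dictionaries. The `save_dataset` helper and the sample task are hypothetical, used here so the snippet runs without an API call:

```python
import json


def save_dataset(dataset, path="dataset.json"):
    """Write the dataset to disk so later cells can load it (hypothetical helper)."""
    with open(path, "w") as f:
        json.dump(dataset, f, indent=2)


# Stand-in for the value returned by generate_dataset(); the task text is made up.
sample_dataset = [
    {"task": "Write a regex that matches an AWS region name such as us-east-1"}
]
```

In the notebook this would likely be called as `save_dataset(generate_dataset())`, which makes one API request and writes the result to `dataset.json`.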

#### Building the Core Functions
The evaluation pipeline consists of three main functions, each with a specific responsibility. Let's start with the simplest one: the function that handles individual prompts.

##### The run_prompt Function
This function takes a test case and merges it with our prompt template:

In [None]:
def run_prompt(test_case):
    """Merges the prompt and test case input, then returns the result"""
    prompt = f"""
Please solve the following task:

{test_case["task"]}
"""
    messages = []
    add_user_message(messages, prompt)
    output = chat(messages)
    return output


Right now, we're keeping the prompt extremely simple. We're not including any formatting instructions, so Claude will likely return more verbose output than we need. We'll refine this later as we iterate on our prompt design.


##### The run_test_case Function
This function orchestrates running a single test case and grading the result:

In [None]:
def run_test_case(test_case):
    """
    Calls run_prompt and grades the result
    """
    output = run_prompt(test_case)
    # TODO: grading

    score = 10

    return {
        "output": output,
        "test_case": test_case,
        "score": score
    }
    

For now, we're using a hardcoded score of 10. The grading logic is where we'll spend significant time in upcoming sections, but this placeholder lets us test the overall pipeline.

##### The run_eval Function
This function coordinates the entire evaluation process:

In [None]:
def run_eval(dataset):
    """
    Loads the dataset and calls run_test_case with each case
    """
    results = []
    for test_case in dataset:
        results.append(run_test_case(test_case))
    return results

This function processes every test case in our dataset and collects all the results into a single list.

##### Running the Evaluation
To execute our evaluation pipeline, we load our dataset and run it through our functions:

In [None]:
with open("dataset.json", "r") as f:
    dataset = json.load(f)

results = run_eval(dataset)

The first time you run this, expect it to take some time - even with Claude Haiku, it can take around 30 seconds to process a full dataset. We'll cover optimization techniques later.
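
One possible optimization, sketched here under the assumption that test cases are independent and that your API rate limits allow a few requests in flight at once, is to run the cases in a thread pool. The `run_eval_concurrent` name, its `run_case` parameter, and the `max_workers` default are hypothetical, and a stub stands in for `run_test_case` so the snippet runs without an API call:

```python
from concurrent.futures import ThreadPoolExecutor


def run_eval_concurrent(dataset, run_case, max_workers=3):
    """Run every test case in a thread pool (hypothetical variant of run_eval)."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the input dataset
        return list(executor.map(run_case, dataset))


def fake_run_test_case(test_case):
    """Stub with the same return shape as run_test_case; makes no API call."""
    return {"output": "stub output", "test_case": test_case, "score": 10}


concurrent_results = run_eval_concurrent(
    [{"task": "first"}, {"task": "second"}], fake_run_test_case
)
```

In the notebook, passing the real `run_test_case` as `run_case` would send several requests at once. Threads suit this workload because each call spends most of its time waiting on the network.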

##### Examining the Results
The evaluation returns a list of dictionaries, one per test case. Printing it with `json.dumps` shows it as a JSON array of result objects:

In [None]:
print(json.dumps(results, indent=2))

Each result contains three key pieces of information:

* output: The complete response from Claude
* test_case: The original test case that was processed
* score: The evaluation score (currently hardcoded)

As you can see in the output, Claude generates quite verbose responses since we haven't provided specific formatting instructions yet. This is exactly the kind of issue we'll address as we refine our prompts.
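
Because every result carries a numeric `score`, a run can be summarized with a single number. A minimal sketch, assuming the result shape described above. The `average_score` helper and the sample results are hypothetical:

```python
from statistics import mean


def average_score(results):
    """Return the mean of the 'score' field across all results (hypothetical helper)."""
    return mean(result["score"] for result in results)


# Made-up results with the same shape that run_eval returns
sample_results = [
    {"output": "...", "test_case": {"task": "first"}, "score": 10},
    {"output": "...", "test_case": {"task": "second"}, "score": 10},
]

print(f"Average score: {average_score(sample_results)}")
```

With the hardcoded score the average is not yet informative, but the same helper would work unchanged once real grading is in place.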


#### What We've Accomplished
At this point, we've successfully built the core evaluation pipeline. We can take our dataset, process it through Claude, and collect structured results. The major missing piece is the grading system - that hardcoded score of 10 needs to be replaced with actual evaluation logic.

This pipeline represents the foundation of most AI evaluation systems. While it may seem simple, you've just built the majority of what an eval pipeline actually does. The complexity comes in the details - better prompts, sophisticated grading, and performance optimizations.