# Lesson 28: Creating Datasets for AI Evals

In this lesson, we'll explore how to create an evaluation dataset for Brown, the writing workflow.

**Learning Objectives:**

- Understand the structure and format of evaluation datasets for article generation
- Learn how to use the `EvalDataset` and `EvalSample` entities to load and manage evaluation data
- Upload evaluation datasets to Opik for tracking and analysis

> [!NOTE]
> 💡 Remember that you can also run `brown` as a standalone Python package by going to `lessons/writing_workflow/` and following the instructions from there. We have a script at `lessons/writing_workflow/scripts/brown_create_eval_dataset.py` that you can use to upload datasets to Opik as well.

## 1. Setup

First, we run two IPython magic commands that automatically reload Python packages whenever they change:

In [1]:
%load_ext autoreload
%autoreload 2

### Set Up Python Environment

To set up your Python virtual environment using `uv` and load it into the Notebook, follow the step-by-step instructions from the `Course Admin` lesson from the beginning of the course.

**TL;DR:** Make sure the selected kernel points to your `uv` virtual environment.

### Configure Opik

To configure Opik, follow the step-by-step instructions in the `Course Admin` lesson.

Here is a quick checklist of what you need to run this notebook:

1.  Get your API key from [Opik](https://www.comet.com/site/products/opik/).
2.  From the root of your project, run: `cp .env.example .env`
3.  Within the `.env` file, fill in the `OPIK_API_KEY` variable.

Now, the code below will load the key from the `.env` file:

In [2]:
from utils import env

env.load(required_env_vars=["OPIK_API_KEY"])

Environment variables loaded from `/Users/pauliusztin/Documents/01_projects/TAI/agentic-ai-engineering-course/.env`
Environment variables loaded successfully.


### Import Key Packages

In [3]:
import nest_asyncio
from utils import pretty_print

nest_asyncio.apply()  # Allow nested async usage in notebooks

### Download Required Files

First, let's download the configs folder:

In [4]:
%%capture

!rm -rf configs
!curl -L -o configs.zip https://raw.githubusercontent.com/iusztinpaul/agentic-ai-engineering-course-data/main/data/configs.zip
!unzip configs.zip
!rm -rf configs.zip

Now, we need to download the inputs folder containing the dataset files:

In [5]:
%%capture

!rm -rf inputs
!curl -L -o inputs.zip https://raw.githubusercontent.com/iusztinpaul/agentic-ai-engineering-course-data/main/data/inputs.zip
!unzip inputs.zip
!rm -rf inputs.zip

Let's verify what we downloaded:

In [6]:
%ls

[1m[36mconfigs[m[m/        [1m[36minputs[m[m/         notebook.ipynb


In [7]:
from pathlib import Path

INPUTS_DIR = Path("inputs")

print(f"Inputs directory exists: {INPUTS_DIR.exists()}")

Inputs directory exists: True


In [8]:
EVALS_DATASET_DIR = Path("inputs/evals")

print(f"Examples directory exists: {EVALS_DATASET_DIR.exists()}")


Examples directory exists: True


## 2. Exploring The Evals Dataset Dir

In [9]:
import json

metadata_path = EVALS_DATASET_DIR / "dataset" / "metadata.json"
with open(metadata_path) as f:
    metadata = json.load(f)

pretty_print.wrapped(json.dumps(metadata, indent=4), title="Evals Dataset Metadata")


-------------------------------------- Evals Dataset Metadata --------------------------------------
  [
    {
        "name": "Lesson 2: Workflows vs. Agents",
        "directory": "data/02_workflows_vs_agents",
        "article_guideline_path": "article_guideline.md",
        "research_path": "research.md",
        "ground_truth_article_path": "article_ground_truth.md"
    },
    {
        "name": "Lesson 3: Context Engineering",
        "directory": "data/03_context_engineering"
    },
    {
        "name": "Lesson 4: Structured Outputs",
        "directory": "data/04_structured_outputs",
        "is_few_shot_example": true
    },
    {
        "name": "Lesson 5: Workflow Patterns",
        "directory": "data/05_workflow_patterns"
    },
    {
        "name": "Lesson 6: Tools",
        "directory": "data/06_tools"
    },
    {
        "name": "Lesson 7: Planning and Reasoning",
        "directory": "data/07_reasoning_planning",
        "is_few_shot_example": true
    },
   

Let's take a deeper look at our dataset, starting with its overall structure:

In [10]:
data_dir = EVALS_DATASET_DIR / "dataset" / "data"

pretty_print.wrapped(
    json.dumps(
        {
            "dataset_directory": str(EVALS_DATASET_DIR),
            "metadata_file": str(metadata_path),
            "data_directory": str(data_dir),
            "article_samples": len(list(data_dir.iterdir())),
        },
        indent=4,
    ),
    title="Evals Dataset Data Directory",
)

----------------------------------- Evals Dataset Data Directory -----------------------------------
  {
    "dataset_directory": "inputs/evals",
    "metadata_file": "inputs/evals/dataset/metadata.json",
    "data_directory": "inputs/evals/dataset/data",
    "article_samples": 10
}
----------------------------------------------------------------------------------------------------


Now, let's look at each sample individually:

In [11]:
pretty_print.wrapped("ARTICLE SAMPLES", indent=38)

for sample_dir in sorted(data_dir.iterdir()):
    if sample_dir.is_dir():
        files = [f.name for f in sample_dir.iterdir() if f.is_file()]
        print(f"{sample_dir.name}/")
        for f in sorted(files):
            print(f"  - {f}")

----------------------------------------------------------------------------------------------------
                                      ARTICLE SAMPLES
----------------------------------------------------------------------------------------------------
02_workflows_vs_agents/
  - article_ground_truth.md
  - article_guideline.md
  - research.md
03_context_engineering/
  - article_ground_truth.md
  - article_guideline.md
  - research.md
04_structured_outputs/
  - article_generated.md
  - article_ground_truth.md
  - article_guideline.md
  - research.md
05_workflow_patterns/
  - article_ground_truth.md
  - article_guideline.md
  - research.md
06_tools/
  - article_ground_truth.md
  - article_guideline.md
  - research.md
07_reasoning_planning/
  - article_generated.md
  - article_ground_truth.md
  - article_guideline.md
  - research.md
08_react_practice/
  - article_ground_truth.md
  - article_guideline.md
  - research.md
09_RAG/
  - article_ground_truth.md
  - article_

## 3. Uploading The Evals Dataset To Opik

Next, we briefly walk through the code that uploads the dataset described above to Opik. The code is minimal, so we keep the walkthrough short.

### 3.1 The EvalSample Entity

The `EvalSample` entity is a Pydantic model that represents a single evaluation sample containing all the data needed for article generation evaluation.

Source: `brown.evals.dataset`
```python
class EvalSample(BaseModel):
    name: str
    directory: Path
    article_guideline: str
    research: str
    ground_truth_article: str
    is_few_shot_example: bool = False
```

Each sample contains:
- `name`: A human-readable identifier for the sample
- `directory`: The path where the sample files are located
- `article_guideline`: The writing guidelines in markdown format
- `research`: The research/source material in markdown format
- `ground_truth_article`: The reference article to compare against
- `is_few_shot_example`: Whether this sample is used for few-shot learning instead of evaluation

### 3.2 The EvalDataset Entity

The `EvalDataset` entity is a Pydantic model that represents a collection of evaluation samples along with dataset metadata.

Source: `brown.evals.dataset`
```python
class EvalDataset(BaseModel):
    name: str
    description: str
    samples: list[EvalSample]

    @classmethod
    def load_dataset(cls, directory: Path, name: str, description: str) -> Self:
        metadata_file = directory / "metadata.json"
        if not metadata_file.exists():
            raise FileNotFoundError(f"Metadata file not found: {metadata_file}")

        with metadata_file.open() as f:
            metadata = json.load(f)

        samples = []
        for sample_metadata in metadata:
            sample_dir = directory / sample_metadata["directory"]

            article_guideline = cls._load_markdown_file(
                sample_dir / sample_metadata.get("article_guideline_path", DEFAULT_ARTICLE_GUIDELINE_PATH)
            )
            research = cls._load_markdown_file(sample_dir / sample_metadata.get("research_path", DEFAULT_RESEARCH_PATH))
            ground_truth_article = cls._load_markdown_file(
                sample_dir / sample_metadata.get("ground_truth_article_path", DEFAULT_GROUND_TRUTH_ARTICLE_PATH)
            )

            sample = EvalSample(
                name=sample_metadata["name"],
                directory=sample_metadata["directory"],
                is_few_shot_example=sample_metadata.get("is_few_shot_example", False),
                article_guideline=article_guideline,
                research=research,
                ground_truth_article=ground_truth_article,
            )
            samples.append(sample)

        return cls(name=name, description=description, samples=samples)
```

The `load_dataset` class method:
- Reads the `metadata.json` file from the specified directory
- Iterates through each sample entry and loads the corresponding markdown files
- Creates `EvalSample` instances for each entry
- Returns a fully populated `EvalDataset` ready for use
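The excerpt above references `_load_markdown_file` and the `DEFAULT_*_PATH` constants without showing them. The following is a minimal, standard-library sketch of the same loading pattern, not the actual `brown` implementation. The helper body, the default file names (inferred from the directory listing earlier in this lesson), and the `load_samples` and `build_toy_dataset` functions are all assumptions made for illustration, and plain dictionaries stand in for `EvalSample`.

```python
import json
import tempfile
from pathlib import Path

# Assumed defaults, inferred from the file names in the dataset listing above.
DEFAULT_ARTICLE_GUIDELINE_PATH = "article_guideline.md"
DEFAULT_RESEARCH_PATH = "research.md"
DEFAULT_GROUND_TRUTH_ARTICLE_PATH = "article_ground_truth.md"


def load_markdown_file(path: Path) -> str:
    """Hypothetical stand-in for `EvalDataset._load_markdown_file`."""
    if not path.exists():
        raise FileNotFoundError(f"Markdown file not found: {path}")
    return path.read_text(encoding="utf-8")


def load_samples(directory: Path) -> list[dict]:
    """Read metadata.json and resolve each sample's markdown files."""
    metadata = json.loads((directory / "metadata.json").read_text(encoding="utf-8"))
    samples = []
    for entry in metadata:
        sample_dir = directory / entry["directory"]
        # Per-sample paths are optional; fall back to the defaults when absent.
        guideline = entry.get("article_guideline_path", DEFAULT_ARTICLE_GUIDELINE_PATH)
        research = entry.get("research_path", DEFAULT_RESEARCH_PATH)
        ground_truth = entry.get("ground_truth_article_path", DEFAULT_GROUND_TRUTH_ARTICLE_PATH)
        samples.append(
            {
                "name": entry["name"],
                "is_few_shot_example": entry.get("is_few_shot_example", False),
                "article_guideline": load_markdown_file(sample_dir / guideline),
                "research": load_markdown_file(sample_dir / research),
                "ground_truth_article": load_markdown_file(sample_dir / ground_truth),
            }
        )
    return samples


def build_toy_dataset(root: Path) -> Path:
    """Create a one-sample dataset on disk so the sketch runs anywhere."""
    sample_dir = root / "data" / "01_toy"
    sample_dir.mkdir(parents=True)
    (sample_dir / "article_guideline.md").write_text("# Guideline", encoding="utf-8")
    (sample_dir / "research.md").write_text("# Research", encoding="utf-8")
    (sample_dir / "article_ground_truth.md").write_text("# Article", encoding="utf-8")
    metadata = [{"name": "Toy sample", "directory": "data/01_toy"}]
    (root / "metadata.json").write_text(json.dumps(metadata), encoding="utf-8")
    return root


with tempfile.TemporaryDirectory() as tmp:
    toy_samples = load_samples(build_toy_dataset(Path(tmp)))

print(toy_samples[0]["name"], toy_samples[0]["is_few_shot_example"])
```

Because the toy metadata entry omits the optional path keys and the `is_few_shot_example` flag, the sketch exercises the same fallback behavior that `load_dataset` relies on through `dict.get`.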


### 3.3 The upload_dataset Function

The `upload_dataset` function uploads the evaluation dataset to the Opik observability platform for tracking and analysis.

Source: `brown.observability.dataset`
```python
def upload_dataset(evaluation_dataset: "EvalDataset") -> None:
    samples = evaluation_dataset.model_dump(mode="json")["samples"]
    eval_samples = [sample for sample in samples if not sample["is_few_shot_example"]]
    logger.info(f"Uploading `{len(eval_samples)}/{len(samples)}` evaluation samples to Opik.")
    training_samples = [sample for sample in samples if sample["is_few_shot_example"]]
    logger.info(f"The following `{len(training_samples)}/{len(samples)}` samples will be used for training or as few-shot examples:")
    for sample in training_samples:
        logger.info(f"- `{sample['name']}`")

    opik_utils.update_or_create_dataset(
        name=evaluation_dataset.name,
        description=evaluation_dataset.description,
        items=eval_samples,
    )
```

The function:
- Separates samples into evaluation samples and few-shot examples based on the `is_few_shot_example` flag
- Uploads only the evaluation samples to Opik. The few-shot examples are used by the LLM judge, so computing metrics on them would leak data.
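The split itself can be tried in isolation on plain dictionaries. The sketch below mirrors the two list comprehensions from `upload_dataset`; the `split_samples` helper is a hypothetical name introduced here, and the toy records reuse sample names from the metadata shown earlier.

```python
def split_samples(samples: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate evaluation samples from few-shot examples (hypothetical helper)."""
    eval_samples = [s for s in samples if not s.get("is_few_shot_example", False)]
    few_shot_samples = [s for s in samples if s.get("is_few_shot_example", False)]
    return eval_samples, few_shot_samples


toy = [
    {"name": "Lesson 2: Workflows vs. Agents", "is_few_shot_example": False},
    {"name": "Lesson 4: Structured Outputs", "is_few_shot_example": True},
    {"name": "Lesson 7: Planning and Reasoning", "is_few_shot_example": True},
]

eval_samples, few_shot_samples = split_samples(toy)

print(f"Evaluation samples: {len(eval_samples)}/{len(toy)}")
for sample in few_shot_samples:
    print(f"Held out as few-shot example: {sample['name']}")
```

Every sample lands in exactly one of the two lists, so the two counts always add up to the dataset size, which is what the `8/10` and `2/10` log messages later in this lesson reflect.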

The `update_or_create_dataset` function, in turn, handles updating the dataset on Opik:

Source: `brown.observability.opik_utils`
```python
import opik

def update_or_create_dataset(name: str, description: str, items: list[dict]) -> opik.Dataset:
    """
    Update an existing dataset or create a new one if it doesn't exist.

    Args:
        name: The name of the dataset to update or create.
        description: The description of the dataset.
        items: The items to insert into the dataset.

    Returns:
        opik.Dataset: The updated or created dataset.
    """

    client = opik.Opik()
    dataset = client.get_or_create_dataset(name=name, description=description)
    dataset.clear()

    dataset.insert(items)

    return dataset
```

This is a simple function that gets or creates a dataset on Opik based on its name. Then it clears the dataset and reinserts all the items. As our dataset is small, doing this reinsertion every time works fine, making it a good strategy to avoid duplicates.
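To see why clearing before inserting keeps the remote dataset in sync with the local one, here is a small sketch that uses an in-memory stand-in instead of Opik. `InMemoryDataset` is hypothetical and only imitates the `clear` and `insert` method names. Its `insert` appends blindly, and how the real SDK treats repeated inserts is not modelled here.

```python
class InMemoryDataset:
    """Hypothetical stand-in for a remote dataset; this is not the Opik API."""

    def __init__(self) -> None:
        self.items: list[dict] = []

    def clear(self) -> None:
        self.items = []

    def insert(self, items: list[dict]) -> None:
        self.items.extend(items)


def clear_and_reinsert(dataset: InMemoryDataset, items: list[dict]) -> None:
    """Same strategy as `update_or_create_dataset`: wipe, then insert everything."""
    dataset.clear()
    dataset.insert(items)


def append_only(dataset: InMemoryDataset, items: list[dict]) -> None:
    """Naive alternative: insert without clearing first."""
    dataset.insert(items)


first_upload = [{"name": "a"}, {"name": "b"}]
second_upload = [{"name": "a"}]  # sample "b" was removed locally

synced = InMemoryDataset()
naive = InMemoryDataset()
for upload in (first_upload, second_upload):
    clear_and_reinsert(synced, upload)
    append_only(naive, upload)

print("clear and reinsert:", [item["name"] for item in synced.items])
print("append only:", [item["name"] for item in naive.items])
```

With the stand-in, the clear-and-reinsert copy always equals the latest local upload, while the append-only copy keeps the stale sample and a repeated one.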

### 3.4 Loading and Uploading the Dataset

Now let's use the `EvalDataset` entity to load our evaluation dataset and upload it to Opik. First, let's reference our dataset directory:


In [12]:
EVALS_DATASET_DIR

PosixPath('inputs/evals')

In [13]:
INPUT_EVALS_DATASET_DIR = EVALS_DATASET_DIR / "dataset"
DATASET_NAME = "brown-course-lessons"
DATASET_DESCRIPTION = "Brown evaluation dataset on course lessons format."


Now, let's load the dataset:

In [14]:
from brown.evals.dataset import EvalDataset
from brown.observability import upload_dataset
from loguru import logger

dataset = EvalDataset.load_dataset(INPUT_EVALS_DATASET_DIR, name=DATASET_NAME, description=DATASET_DESCRIPTION)

pretty_print.wrapped(
    {
        "dataset_name": dataset.name,
        "dataset_description": dataset.description,
        "dataset_samples": len(dataset.samples),
    },
    title="Dataset Metadata",
)


----------------------------------------- Dataset Metadata -----------------------------------------
  {
  "dataset_name": "brown-course-lessons",
  "dataset_description": "Brown evaluation dataset on course lessons format.",
  "dataset_samples": 10
}
----------------------------------------------------------------------------------------------------


Finally, let's upload the dataset to Opik:


In [15]:
logger.info(f"Uploading dataset to Opik: `{dataset.name}`")
upload_dataset(dataset)
logger.success(f"Successfully uploaded dataset to Opik: `{dataset.name}`")

2025-12-23 14:47:11.863 | INFO     | __main__:<module>:1 - Uploading dataset to Opik: `brown-course-lessons`
2025-12-23 14:47:11.864 | INFO     | brown.observability.dataset:upload_dataset:25 - Uploading `8/10` evaluation samples to Opik.
2025-12-23 14:47:11.864 | INFO     | brown.observability.dataset:upload_dataset:27 - The following `2/10` samples will be used for training or as few-shot examples:
2025-12-23 14:47:11.865 | INFO     | brown.observability.dataset:upload_dataset:29 - - `Lesson 4: Structured Outputs`
2025-12-23 14:47:11.865 | INFO     | brown.observability.dataset:upload_dataset:29 - - `Lesson 7: Planning and Reasoning`
2025-12-23 14:47:15.640 | SUCCESS  | __main__:<module>:

## 4. Conclusion

In this lesson, we learned how to create and manage evaluation datasets for the Brown writing workflow.

In the next lesson, we'll use this dataset to run evaluations and measure the quality of Brown.

### Practice Ideas

1. Extend the dataset with more diverse article samples.
2. Replace the dataset with a new set of articles that follow your own format instead of our course lesson format.
3. Build your own AI evals dataset for a different data format, such as social media posts.
4. Update the `update_or_create_dataset` to stop `clearing` the dataset entirely when adding new items by introducing a mechanism to detect dataset item duplicates.
5. Use Opik to version the dataset when changing it in any way, such as adding, removing or changing dataset samples.
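One possible starting point for idea 4 is to fingerprint each item by its content and insert only the items whose fingerprint is not already present. The sketch below is an assumption-laden illustration using the standard library only. `item_fingerprint` and `select_new_items` are hypothetical helpers, and fetching the existing items from Opik is left out on purpose.

```python
import hashlib
import json


def item_fingerprint(item: dict) -> str:
    """Hash the canonical JSON form of an item so key order does not matter."""
    canonical = json.dumps(item, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def select_new_items(existing: list[dict], incoming: list[dict]) -> list[dict]:
    """Return incoming items that are neither in `existing` nor repeated in `incoming`."""
    seen = {item_fingerprint(item) for item in existing}
    new_items = []
    for item in incoming:
        fingerprint = item_fingerprint(item)
        if fingerprint not in seen:
            seen.add(fingerprint)
            new_items.append(item)
    return new_items


existing_items = [{"name": "a", "research": "x"}]
incoming_items = [
    {"research": "x", "name": "a"},  # same content as the existing item, different key order
    {"name": "b", "research": "y"},
    {"name": "b", "research": "y"},  # repeated within the same batch
]

to_insert = select_new_items(existing_items, incoming_items)
print([item["name"] for item in to_insert])
```

A content hash treats any edited sample as a new item, so a complete solution would also need a rule for replacing or removing the outdated version.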

