<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Dataset_Filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://build.withpi.ai"><font size="4">Copilot</font></a>

# Dataset Filtering

This notebook demonstrates how to load a dataset and then filter it with a scorer. You'll need a Hugging Face token to pull our sample dataset, or you can bring your own data.

## Install and initialize SDK

You'll need a `WITHPI_API_KEY` from https://build.withpi.ai/account/keys. Add it to your notebook secrets (the key symbol in the left sidebar).

Run the cell below to install the packages and load the SDK.

In [None]:
%%capture

%pip install withpi withpi-utils datasets

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

pi = PiClient()

## Set up scoring system

Let's say we're building an AI to generate stories in the style of Aesop's Fables. In the spirit of test-driven development, we first need to decide what we're looking for from our system. Initialize a Scoring System as a list of questions, plus a `score` function that attaches a total score to each example:

In [None]:
scoring_spec = [{'question': q} for q in [
    "Does the response contain a clear beginning, middle, and end?",
    "Does the story follow a logical progression of events?",
    "Does the story resolve the conflict in a satisfying manner?",
    "Is the life lesson clearly conveyed in the story?",
    "Is the life lesson relevant to the input provided by the user?"
]]

def score(example):
    example["score"] = pi.scoring_system.score(
        llm_input=example["input"],
        llm_output=example["output"],
        scoring_spec=scoring_spec,
    ).total_score
    return example

## Load Sample Data

We have published a small dataset of examples to Hugging Face. To load it, you'll need a notebook secret named `HF_TOKEN`, which you can create at https://huggingface.co/settings/tokens.


In [None]:
from datasets import load_dataset

aesop_dataset = load_dataset("withpi/aesop", split="train")
print(aesop_dataset)

Dataset({
    features: ['input', 'output'],
    num_rows: 23
})


## Score

Let's first score all the examples. In the next step we'll keep only the high-scoring ones, which can be used as training examples later on.

In [None]:
aesop_dataset = aesop_dataset.map(score)

display(aesop_dataset)

Dataset({
    features: ['input', 'output', 'score'],
    num_rows: 23
})
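
Before picking a cutoff, it can help to look at how the scores are distributed. The following is a minimal sketch, not part of the Pi SDK: `summarize_scores` is a hypothetical helper, and the sample values are made up for illustration. In this notebook you would likely pass `aesop_dataset["score"]` instead.

```python
import statistics

def summarize_scores(scores, threshold=0.95):
    """Return basic statistics for a collection of scores, plus how many exceed the threshold."""
    scores = list(scores)
    return {
        "count": len(scores),
        "min": min(scores),
        "median": statistics.median(scores),
        "max": max(scores),
        # Strictly greater than the threshold, matching the filter used later on.
        "kept": sum(1 for s in scores if s > threshold),
    }

# Hypothetical scores for illustration only.
sample_scores = [0.99, 0.97, 0.91, 0.62, 0.98]
summary = summarize_scores(sample_scores)
print(summary)
```

If very few examples clear the threshold, or nearly all of them do, that is a hint to adjust the cutoff or revisit the questions.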

## Filter

Now let's compare a good example and a bad one to see if the scores make sense.

In [None]:
from withpi_utils.colab import pretty_print_responses


good_examples = aesop_dataset.filter(lambda example: example["score"] > 0.95)
bad_examples = aesop_dataset.filter(lambda example: example["score"] <= 0.95)

pretty_print_responses(good_examples.take(1)[0]["output"], bad_examples.take(1)[0]["output"])

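
The split above can also be sketched in plain Python, which makes the threshold logic explicit and guards against an empty group (indexing the first element of an empty selection would fail). This is an illustrative sketch under stated assumptions: `partition_by_score` and `first_output` are hypothetical helpers, and the rows are invented. Since iterating a Hugging Face `Dataset` yields dicts, passing `aesop_dataset` in place of `rows` should work the same way.

```python
def partition_by_score(examples, threshold=0.95):
    """Split examples into (good, bad) lists using a strict `>` comparison."""
    good, bad = [], []
    for example in examples:
        (good if example["score"] > threshold else bad).append(example)
    return good, bad

def first_output(examples, default="(no examples in this group)"):
    """Return the first example's output, or a placeholder if the group is empty."""
    return examples[0]["output"] if examples else default

# Hypothetical rows for illustration only.
rows = [
    {"input": "patience", "output": "Story A", "score": 0.98},
    {"input": "honesty", "output": "Story B", "score": 0.71},
    {"input": "kindness", "output": "Story C", "score": 0.95},
]

good, bad = partition_by_score(rows)
print(first_output(good))
print(first_output(bad))
```

Note that an example scoring exactly 0.95 lands in the bad group, because the comparison is strict.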

## Next Steps

Look at the examples above and see if you agree with the scorer's assessment. Try different questions or different data.

You can use the cleaner dataset in training or fine-tuning workflows as you see fit.
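
Many training pipelines accept JSON Lines, so one possible hand-off is to write the kept examples to a `.jsonl` file. The sketch below uses only the standard library. `write_jsonl`, the output filename, and the sample record are assumptions for illustration. The `datasets` library also provides `Dataset.to_json`, which may be more convenient if you are working with `good_examples` directly.

```python
import json
import os
import tempfile

def write_jsonl(examples, path, fields=("input", "output")):
    """Write one JSON object per line, keeping only the given fields. Returns the row count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            record = {key: example[key] for key in fields}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count

# Hypothetical kept example; in this notebook you would likely pass good_examples.
kept = [{"input": "patience", "output": "Story A", "score": 0.98}]

# Hypothetical output location in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "aesop_filtered.jsonl")
written = write_jsonl(kept, out_path)
print(f"Wrote {written} example(s) to {out_path}")
```

The score column is dropped here on the assumption that a fine-tuning job only needs the input and output pairs. Pass a different `fields` tuple to keep it.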