<a href="https://colab.research.google.com/github/zach-2pir/docs/blob/main/colabs/Dataset_Filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://build.withpi.ai"><font size="4">Copilot</font></a>

# Dataset Filtering

This notebook demonstrates how to load a dataset and filter it with a scorer. You'll need a Hugging Face token to pull our sample dataset, or you can bring your own data.

## Install and initialize SDK

You'll need a `WITHPI_API_KEY` from https://build.withpi.ai/account. Add it to your notebook secrets (the key icon in the left sidebar).

Run the cell below to install the packages and load the SDK.

In [1]:
%%capture

%pip install withpi withpi-utils datasets

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

pi = PiClient()

## Load Sample Data

Let's say we want to generate "TLDR"s for Reddit posts. We have published a small dataset with some examples on Hugging Face. To load it, you'll need a notebook secret named `HF_TOKEN`, which you can create at https://huggingface.co/settings/tokens.


In [2]:
from datasets import load_dataset

tldr_dataset = load_dataset("withpi/tldr", split="train").select(range(100))

print(tldr_dataset)

Dataset({
    features: ['prompt', 'completion'],
    num_rows: 100
})


## Score

First, let's score every example. Later we will keep only the high-scoring ones, which can then serve as training examples.

In [6]:
scoring_spec = [{'question': q} for q in [
  "Is the TLDR between 1 and 3 sentences long?",
  "Is the TLDR concise and to the point?",
  "Does the TLDR state the important points of the post?",
  "Does the TLDR avoid including personal opinions?",
  "Does the TLDR make sense on its own without needing to refer to the original post?",
]]

def score(example):
  example["score"] = pi.scoring_system.score(
    scoring_spec=scoring_spec,
    llm_input=example["prompt"],
    llm_output=example["completion"],
  ).total_score
  return example

tldr_dataset = tldr_dataset.map(score)

display(tldr_dataset)

Dataset({
    features: ['prompt', 'completion', 'score'],
    num_rows: 100
})
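
To check the per-row scoring pattern without spending API calls, it can help to dry-run it on a couple of in-memory rows first. The sketch below is only an illustration: `add_score` and `length_ratio_scorer` are hypothetical names, and the stand-in scorer is a toy heuristic that has nothing to do with how the Pi scorer computes `total_score`.

```python
def add_score(example, scorer):
    """Return a copy of the row with a 'score' key added by `scorer`."""
    scored = dict(example)  # copy so the input row is left unchanged
    scored["score"] = scorer(example["prompt"], example["completion"])
    return scored


def length_ratio_scorer(prompt, completion):
    """Hypothetical stand-in scorer, NOT the Pi API.

    Returns a float in [0, 1] that is higher when the completion is
    much shorter than the prompt.
    """
    if not prompt:
        return 0.0
    ratio = len(completion) / len(prompt)
    return max(0.0, min(1.0, 1.0 - ratio))


# Two made-up rows with the same columns as the sample dataset.
rows = [
    {"prompt": "x" * 100, "completion": "y" * 10},
    {"prompt": "x" * 100, "completion": "y" * 90},
]

scored_rows = [add_score(row, length_ratio_scorer) for row in rows]
for row in scored_rows:
    print(round(row["score"], 2))
```

Once the shape of the output looks right, the real `score` function above can be swapped back in.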

## Filter

Now let's keep only the examples that score above 0.75.

In [9]:
filtered = tldr_dataset.filter(lambda example: example["score"] > 0.75)

display(filtered)

Dataset({
    features: ['prompt', 'completion', 'score'],
    num_rows: 43
})
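
The 0.75 cutoff is a judgment call, so it may be worth seeing how many rows survive at a few different thresholds before settling on one. The sketch below assumes the scores are available as a plain list of floats; `kept_per_threshold` is a hypothetical helper and the scores are made up for illustration.

```python
def kept_per_threshold(scores, thresholds):
    """Map each threshold to the number of scores strictly above it."""
    return {t: sum(1 for s in scores if s > t) for t in thresholds}


# Made-up scores; in the notebook these would come from the "score" column.
example_scores = [0.2, 0.5, 0.75, 0.8, 0.95]

counts = kept_per_threshold(example_scores, [0.5, 0.75, 0.9])
print(counts)  # {0.5: 3, 0.75: 2, 0.9: 1}
```

Note that the comparison is strict, matching the filter above: a row scoring exactly 0.75 is dropped.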

## Next Steps

Look at the datasets above and see whether you agree with the scorer's assessment. Try different questions or different data.
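
One way to spot-check the scorer is to read the lowest- and highest-scored rows side by side. The sketch below is a minimal helper under stated assumptions: `extremes` is a hypothetical name, and `rows` is assumed to be any iterable of dicts with a `score` key, such as the rows of the scored dataset. The sample rows here are made up.

```python
def extremes(rows, n=3, key="score"):
    """Return (lowest, highest) lists of up to n rows each, ordered by `key`.

    `lowest` is sorted ascending and `highest` descending, so the most
    extreme row comes first in both lists.
    """
    ordered = sorted(rows, key=lambda row: row[key])
    lowest = ordered[:n]
    highest = ordered[max(0, len(ordered) - n):][::-1]
    return lowest, highest


# Made-up rows for illustration only.
sample_rows = [
    {"completion": "a", "score": 0.1},
    {"completion": "b", "score": 0.9},
    {"completion": "c", "score": 0.5},
    {"completion": "d", "score": 0.7},
]

lowest, highest = extremes(sample_rows, n=2)
for row in lowest + highest:
    print(row["score"], row["completion"])
```

If the low-scored rows look fine to you, or the high-scored ones look poor, that is a hint to revise the scoring questions.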

You can use the cleaner dataset in training or fine-tuning workflows as you see fit.