<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Dataset_Import.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://build.withpi.ai"><font size="4">Copilot</font></a>

# Dataset Filtering

This notebook demonstrates how to load a dataset and filter it with a scorer. You'll need a Hugging Face token to pull our sample dataset, or you can bring your own data.

## Install and initialize SDK

You'll need a `WITHPI_API_KEY` from https://build.withpi.ai/account. Add it to your notebook secrets (the key icon in the left sidebar).

Run the cell below to install the packages and load the SDK.

In [1]:
%%capture

%pip install withpi withpi-utils datasets tqdm litellm pandas numpy

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

pi = PiClient()
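
The cell above depends on `google.colab`, which is only importable inside Colab. If you run this notebook elsewhere, a fallback along the following lines may help. This is a minimal sketch, not part of the Pi SDK: `get_secret` is a hypothetical helper name, and it assumes the key has been exported as an environment variable when Colab is unavailable.

```python
import os


def get_secret(name):
    """Return a secret from Colab's secret store if available, else from the environment.

    `get_secret` is a hypothetical helper for illustration, not an SDK function.
    """
    try:
        # Only importable when running inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        return userdata.get(name)

    # Outside Colab: fall back to a plain environment variable.
    value = os.environ.get(name)
    if value is None:
        raise KeyError(f"{name} is not set in the environment")
    return value


# Usage (uncomment once the secret is available):
# os.environ["WITHPI_API_KEY"] = get_secret("WITHPI_API_KEY")
```

The same helper could be reused for `HF_TOKEN` later in the notebook.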


## Load Sample Data

We'll keep using a pre-built scoring spec with sample inputs, but feel free to bring your own.

You'll need a secret named `HF_TOKEN`, which you can create at https://huggingface.co/settings/tokens.


In [2]:
# @title Scoring Spec
from withpi_utils.colab import load_scoring_spec_from_web, display_scoring_spec

tldr_scoring_spec = load_scoring_spec_from_web(
    "https://raw.githubusercontent.com/withpi/cookbook-withpi/refs/heads/main/scoring_specs/tldr.json"
)

display_scoring_spec(tldr_scoring_spec)

In [3]:
# @title Dataset
from datasets import load_dataset

tldr_dataset = load_dataset("withpi/tldr", split="train").select(range(100))

print(tldr_dataset)

Dataset({
    features: ['prompt', 'completion'],
    num_rows: 100
})


## Score and Filter

First, score every example. Then keep only the high-scoring ones, which can later be used as training examples.

In [6]:
# @title Let's SCORE
from tqdm.notebook import tqdm
import pandas as pd

scores = []
for row in tqdm(tldr_dataset):
    scores.append(
        pi.scoring_system.score(
            scoring_spec=tldr_scoring_spec,
            llm_input=row["prompt"],
            llm_output=row["completion"],
        )
    )

df = pd.DataFrame({
    "prompt": tldr_dataset["prompt"],
    "completion": tldr_dataset["completion"],
    "score": [score.total_score for score in scores]
})

display(df)

print(df["score"].describe())


index,prompt,completion,score
0,SUBREDDIT: r/relationships\n\nTITLE: I (f/22) ...,I still have contact with an old ex's friends...,0.477148
1,SUBREDDIT: r/loseit\n\nTITLE: SV & NSV! Keepin...,"Progress is still happening, even when you th...",0.492656
2,SUBREDDIT: r/relationships\n\nTITLE: Me [19F] ...,My skin is scarred badly; what could I do/say...,0.467617
3,SUBREDDIT: r/personalfinance\n\nTITLE: Priorit...,$14k in student debt (all <5%) and need to sa...,0.687031
4,SUBREDDIT: r/relationships\n\nTITLE: My[25m] g...,"GF is a meanie-bo-beanie when I'm nice, and a...",0.376352
...,...,...,...
95,SUBREDDIT: r/relationships\n\nTITLE: My [30 F]...,Boyfriend and I are considering moving into t...,0.932813
96,SUBREDDIT: r/relationships\n\nTITLE: Me[19M] p...,GF of 5 years hasn't come to visit me in 2 we...,0.992812
97,SUBREDDIT: r/relationships\n\nTITLE: Am I bein...,Boyfriend would rather have family members he...,0.835020
98,SUBREDDIT: r/relationships\n\nTITLE: My boyfri...,Dating a crazy guy who thinks I don't pay eno...,0.519944


count    100.000000
mean       0.641030
std        0.250002
min        0.058519
25%        0.463174
50%        0.661348
75%        0.866289
max        0.992812
Name: score, dtype: float64


In [7]:
# @title Filter out low scores

# Keep only examples with a score above 0.75.
filtered_df = df[df["score"] > 0.75]

display(filtered_df)

print(f"\nNumber of examples with score > 0.75: {len(filtered_df)}")

index,prompt,completion,score
6,SUBREDDIT: r/relationships\n\nTITLE: Is it wei...,Gf said she almost didn't date me because I w...,0.924883
9,SUBREDDIT: r/relationships\n\nTITLE: Me [20/F]...,how do I deny sex with my boyfriend of 2.5 ye...,0.850234
12,SUBREDDIT: r/relationships\n\nTITLE: Me [ 20/F...,I've found myself attracted to a man who is n...,0.944766
21,SUBREDDIT: r/legaladvice\n\nTITLE: Contacting ...,My mom hid me from my dad by falsifying DNA t...,0.896172
22,SUBREDDIT: r/relationships\n\nTITLE: I [19M] h...,Been seeing/talking to girl for more than 3 m...,0.929453
24,SUBREDDIT: r/relationships\n\nTITLE: I [28 F] ...,"Confronted borderline mother, now feel guilty...",0.975781
28,SUBREDDIT: r/relationships\n\nTITLE: My [23M] ...,GF's comments on IG are a bit too much for my...,0.864609
29,SUBREDDIT: r/relationships\n\nTITLE: Am I [25/...,"LDR boyfriend has been texting less and less,...",0.962344
33,SUBREDDIT: r/tifu\n\nTITLE: TIFU by forgetting...,i left my lube in the shower for a couple day...,0.915
34,SUBREDDIT: r/relationships\n\nTITLE: My mom [5...,I want my mom to stop tricking me and my sist...,0.982344



Number of examples with score > 0.75: 40
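
The 0.75 cutoff is a judgment call: in the summary statistics above the 75th percentile is about 0.87, so a 0.75 threshold keeps more than the top quarter (40 of 100 here). Before settling on a value, it can help to see how many examples survive at a few candidate thresholds. The sketch below uses made-up scores and a hypothetical helper, `kept_at_threshold`, purely for illustration.

```python
def kept_at_threshold(scores, threshold):
    """Count scores strictly above `threshold` (the same rule as the filter cell)."""
    return sum(1 for s in scores if s > threshold)


# Hypothetical scores for illustration; these are not the notebook's real results.
sample_scores = [0.06, 0.38, 0.47, 0.49, 0.52, 0.69, 0.84, 0.93, 0.99]

for threshold in (0.5, 0.75, 0.9):
    kept = kept_at_threshold(sample_scores, threshold)
    print(f"threshold {threshold:.2f}: keeps {kept} of {len(sample_scores)}")
```

In the notebook you could pass `df["score"].tolist()` instead of `sample_scores`.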


## Next Steps

With the cleaner dataset in hand, you can move on to training or fine-tuning workflows as you see fit.
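
Many training pipelines accept JSON Lines input, so one possible next step is to serialize the filtered rows. The following is a minimal sketch under that assumption; `to_jsonl` is a hypothetical helper, and the records are placeholders standing in for `filtered_df.to_dict("records")`.

```python
import json


def to_jsonl(records):
    """Serialize an iterable of dicts to JSON Lines text, one object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


# Placeholder rows for illustration; in the notebook, use
# filtered_df.to_dict("records") instead.
records = [
    {"prompt": "SUBREDDIT: r/example ...", "completion": "A short summary.", "score": 0.92},
    {"prompt": "SUBREDDIT: r/example ...", "completion": "Another summary.", "score": 0.81},
]

jsonl_text = to_jsonl(records)
print(jsonl_text, end="")
```

To persist the result, write `jsonl_text` to a file of your choice. pandas can also do this directly with `filtered_df.to_json(path, orient="records", lines=True)`.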