<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Dataset_Import.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://build.withpi.ai"><font size="4">Copilot</font></a>

# Dataset Filtering

This notebook demonstrates how to load a dataset and filter it with a scorer. You'll need a Hugging Face token to pull our sample dataset, or you can bring your own data.

## Install and initialize SDK

You'll need a `WITHPI_API_KEY` from https://build.withpi.ai/account. Add it to your notebook secrets (the key icon in the left sidebar).

Run the cell below to install the packages and load the SDK.

In [1]:
%%capture

%pip install withpi withpi-utils datasets tqdm litellm pandas numpy

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

pi = PiClient()


## Load Sample Data

We'll use a pre-built scoring spec with sample inputs, but feel free to bring your own.

You'll also need a Hugging Face token stored as a notebook secret named `HF_TOKEN`. You can get one from https://huggingface.co/settings/tokens


In [2]:
# @title Scoring Spec
from withpi_utils.colab import load_scoring_spec_from_web, display_scoring_spec

tldr_scoring_spec = load_scoring_spec_from_web(
    "https://raw.githubusercontent.com/withpi/cookbook-withpi/refs/heads/main/scoring_specs/tldr.json"
)

display_scoring_spec(tldr_scoring_spec)

In [3]:
# @title Dataset
from datasets import load_dataset

tldr_dataset = load_dataset("withpi/tldr", split="train").select(range(100))

print(tldr_dataset)

Dataset({
    features: ['prompt', 'completion'],
    num_rows: 100
})


## Scoring and Filtering

Let's first score every example, then keep only the high-scoring ones, which can later be used as training examples.

In [4]:
# @title Let's SCORE
from tqdm.notebook import tqdm
import pandas as pd

scores = []
for row in tqdm(tldr_dataset):
    scores.append(
        pi.scoring_system.score(
            scoring_spec=tldr_scoring_spec,
            llm_input=row["prompt"],
            llm_output=row["completion"],
        )
    )

df = pd.DataFrame({
    "prompt": tldr_dataset["prompt"],
    "completion": tldr_dataset["completion"],
    "score": [score.total_score for score in scores]
})

print(df)

print(df["score"].describe())


                                               prompt  \
0   SUBREDDIT: r/relationships\n\nTITLE: I (f/22) ...   
1   SUBREDDIT: r/loseit\n\nTITLE: SV & NSV! Keepin...   
2   SUBREDDIT: r/relationships\n\nTITLE: Me [19F] ...   
3   SUBREDDIT: r/personalfinance\n\nTITLE: Priorit...   
4   SUBREDDIT: r/relationships\n\nTITLE: My[25m] g...   
..                                                ...   
95  SUBREDDIT: r/relationships\n\nTITLE: My [30 F]...   
96  SUBREDDIT: r/relationships\n\nTITLE: Me[19M] p...   
97  SUBREDDIT: r/relationships\n\nTITLE: Am I bein...   
98  SUBREDDIT: r/relationships\n\nTITLE: My boyfri...   
99  SUBREDDIT: r/relationships\n\nTITLE: Me [22 M]...   

                                           completion     score  
0    I still have contact with an old ex's friends...  0.477148  
1    Progress is still happening, even when you th...  0.492656  
2    My skin is scarred badly; what could I do/say...  0.467617  
3    $14k in student debt (all <5%) and need to sa.
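
The loop above sends one scoring request at a time. If those calls dominate the runtime, a thread pool is one way to overlap them. The sketch below is only an illustration: `fake_score` is a hypothetical stand-in for the `pi.scoring_system.score` call, the rows are invented, and whether the Pi client can safely be shared across threads is an assumption you would need to verify.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-in for a network-bound scoring call; not part of the Pi SDK.
def fake_score(row: dict) -> float:
    # Toy heuristic: completions that are short relative to the prompt score higher.
    return round(1.0 - len(row["completion"]) / (len(row["prompt"]) + 1), 3)


# Invented rows with the same keys as the dataset above.
rows = [
    {"prompt": "a long post about something", "completion": "short"},
    {"prompt": "another fairly long post", "completion": "also short"},
]

# executor.map preserves input order, so scores[i] belongs to rows[i].
with ThreadPoolExecutor(max_workers=4) as executor:
    scores = list(executor.map(fake_score, rows))

print(scores)
```

Because `executor.map` returns results in input order, the scores can be placed in a DataFrame column next to the prompts without extra bookkeeping.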

In [5]:
# @title Filter out low scores

# Keep only the examples with a score above 0.75
filtered_df = df[df["score"] > 0.75]

print(filtered_df)

print("\nNumber of examples with score > 0.75: {}".format(len(filtered_df)))

                                               prompt  \
6   SUBREDDIT: r/relationships\n\nTITLE: Is it wei...   
9   SUBREDDIT: r/relationships\n\nTITLE: Me [20/F]...   
12  SUBREDDIT: r/relationships\n\nTITLE: Me [ 20/F...   
21  SUBREDDIT: r/legaladvice\n\nTITLE: Contacting ...   
22  SUBREDDIT: r/relationships\n\nTITLE: I [19M] h...   
24  SUBREDDIT: r/relationships\n\nTITLE: I [28 F] ...   
28  SUBREDDIT: r/relationships\n\nTITLE: My [23M] ...   
29  SUBREDDIT: r/relationships\n\nTITLE: Am I [25/...   
33  SUBREDDIT: r/tifu\n\nTITLE: TIFU by forgetting...   
34  SUBREDDIT: r/relationships\n\nTITLE: My mom [5...   
35  SUBREDDIT: r/dating_advice\n\nTITLE: Is it a d...   
37  SUBREDDIT: r/relationships\n\nTITLE: Should me...   
39  SUBREDDIT: r/needadvice\n\nTITLE: Much needed ...   
40  SUBREDDIT: r/relationships\n\nTITLE: I [15 F] ...   
44  SUBREDDIT: r/AskReddit\n\nTITLE: What are some...   
45  SUBREDDIT: r/relationships\n\nTITLE: I [21 M] ...   
47  SUBREDDIT: r/relationships\
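
The fixed 0.75 cutoff used above is one option. When the score distribution shifts between datasets, a percentile-based cutoff can be easier to reason about. Here is a minimal sketch, assuming a DataFrame with a `score` column like `df`; the toy values and the 60th-percentile choice are invented for illustration.

```python
import pandas as pd

# Toy scores standing in for df["score"]; these values are made up.
toy_df = pd.DataFrame({
    "completion": ["a", "b", "c", "d", "e"],
    "score": [0.10, 0.40, 0.60, 0.80, 0.90],
})

# Use the 60th percentile of the scores as the cutoff instead of a fixed number.
cutoff = toy_df["score"].quantile(0.6)
top_df = toy_df[toy_df["score"] >= cutoff]

print(f"cutoff={cutoff:.2f}, kept {len(top_df)} of {len(toy_df)} rows")
```

A percentile cutoff keeps a roughly constant fraction of the data, while a fixed threshold keeps a constant quality bar; which one fits depends on how the filtered examples will be used.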

## Next Steps

You can now use the cleaner dataset in training or fine-tuning workflows as you see fit.
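
As one possible hand-off to a training pipeline, the filtered rows can be written as JSON Lines. The sketch below assumes a DataFrame shaped like `filtered_df`; the toy row and the `filtered_tldr.jsonl` file name are hypothetical, and your training tool may expect different field names.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for filtered_df from the filtering cell.
toy_filtered_df = pd.DataFrame({
    "prompt": ["SUBREDDIT: r/example\n\nTITLE: Example post"],
    "completion": ["A short summary."],
    "score": [0.81],
})

# Hypothetical output location; adjust to wherever your pipeline reads from.
output_path = os.path.join(tempfile.gettempdir(), "filtered_tldr.jsonl")

# Write one JSON object per line, keeping only the text columns.
toy_filtered_df[["prompt", "completion"]].to_json(
    output_path, orient="records", lines=True
)

# Read the file back to confirm that every line parses as a record.
with open(output_path) as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records), sorted(records[0].keys()))
```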