<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Dataset_Import.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://play.withpi.ai"><font size="4">Technique Catalog</font></a>

# Dataset Filtering

This colab is the companion to the Dataset Filtering playground.

It shows how to load a dataset with Hugging Face Datasets, score each example against a scoring spec, and keep only the high-scoring examples.

## Install and initialize SDK

Connect to a regular CPU Python 3 runtime. You won't need GPUs for this notebook.

You'll need a WITHPI_API_KEY from https://play.withpi.ai. Add it to your notebook secrets (the key icon in the left sidebar).

Run the cell below to install the packages and load the SDK.

In [None]:
%%capture

%pip install withpi withpi-utils datasets tqdm litellm

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

client = PiClient()

## Load a Scoring Spec and a Dataset

We'll keep using a pre-built scoring spec with sample inputs, but feel free to bring your own.


In [None]:
# @title Load Scoring Spec
from withpi_utils.colab import load_scoring_spec_from_web, display_scoring_spec

tldr_scoring_spec = load_scoring_spec_from_web(
    "https://raw.githubusercontent.com/withpi/cookbook-withpi/refs/heads/main/contracts/tldr.json"
)

display_scoring_spec(tldr_scoring_spec)

In [None]:
# @title Load dataset
from datasets import load_dataset

tldr_dataset = load_dataset("withpi/tldr", split="train").select(range(100))

print(tldr_dataset)

Dataset({
    features: ['prompt', 'completion'],
    num_rows: 100
})


## Scoring and Filtering

Let's first score every example, then keep only the high-scoring ones, which can be used as training examples later on.

In [None]:
# @title Let's SCORE
from tqdm import tqdm
import pandas as pd

scores = []
for row in tqdm(tldr_dataset):
    scores.append(
        client.scoring_system.score(
            scoring_spec=tldr_scoring_spec,
            llm_input=row["prompt"],
            llm_output=row["completion"],
        )
    )

df = pd.DataFrame({
    "prompt": tldr_dataset["prompt"],
    "completion": tldr_dataset["completion"],
    "score": [score.total_score for score in scores]
})

display(df)

print(df["score"].describe())


100%|██████████| 100/100 [00:24<00:00,  4.16it/s]


| | prompt | completion | score |
|---|---|---|---|
| 0 | SUBREDDIT: r/relationships\n\nTITLE: I (f/22) ... | I still have contact with an old ex's friends... | 0.142178 |
| 1 | SUBREDDIT: r/loseit\n\nTITLE: SV & NSV! Keepin... | Progress is still happening, even when you th... | 0.486253 |
| 2 | SUBREDDIT: r/relationships\n\nTITLE: Me [19F] ... | My skin is scarred badly; what could I do/say... | 0.463388 |
| 3 | SUBREDDIT: r/personalfinance\n\nTITLE: Priorit... | $14k in student debt (all <5%) and need to sa... | 0.634564 |
| 4 | SUBREDDIT: r/relationships\n\nTITLE: My[25m] g... | GF is a meanie-bo-beanie when I'm nice, and a... | 0.408148 |
| ... | ... | ... | ... |
| 95 | SUBREDDIT: r/relationships\n\nTITLE: My [30 F]... | Boyfriend and I are considering moving into t... | 0.696544 |
| 96 | SUBREDDIT: r/relationships\n\nTITLE: Me[19M] p... | GF of 5 years hasn't come to visit me in 2 we... | 0.797807 |
| 97 | SUBREDDIT: r/relationships\n\nTITLE: Am I bein... | Boyfriend would rather have family members he... | 0.828730 |
| 98 | SUBREDDIT: r/relationships\n\nTITLE: My boyfri... | Dating a crazy guy who thinks I don't pay eno... | 0.451902 |


count    100.000000
mean       0.571067
std        0.169740
min        0.142178
25%        0.469460
50%        0.529180
75%        0.670404
max        0.966709
Name: score, dtype: float64


In [None]:
# @title Filter out low scores

# Keep examples having score > 0.75
filtered_df = df[(df['score'] > 0.75)]

display(filtered_df)

print("\nNumber of examples with score > 0.75: {}".format(len(filtered_df)))

| | prompt | completion | score |
|---|---|---|---|
| 12 | SUBREDDIT: r/relationships\n\nTITLE: Me [ 20/F... | I've found myself attracted to a man who is n... | 0.774515 |
| 17 | SUBREDDIT: r/relationships\n\nTITLE: I [16 M] ... | Saw a friend had self-harm scars and want to ... | 0.779744 |
| 24 | SUBREDDIT: r/relationships\n\nTITLE: I [28 F] ... | Confronted borderline mother, now feel guilty... | 0.780141 |
| 29 | SUBREDDIT: r/relationships\n\nTITLE: Am I [25/... | LDR boyfriend has been texting less and less,... | 0.949005 |
| 33 | SUBREDDIT: r/tifu\n\nTITLE: TIFU by forgetting... | i left my lube in the shower for a couple day... | 0.93256 |
| 34 | SUBREDDIT: r/relationships\n\nTITLE: My mom [5... | I want my mom to stop tricking me and my sist... | 0.95097 |
| 39 | SUBREDDIT: r/needadvice\n\nTITLE: Much needed ... | I have an adopted 18 year old sister without ... | 0.883518 |
| 50 | SUBREDDIT: r/relationships\n\nTITLE: How do I ... | Boyfriend doesn't think of himself as very at... | 0.798727 |
| 59 | SUBREDDIT: r/AskReddit\n\nTITLE: Girl had nip ... | Girl had a nip slip at prom, photo was taken ... | 0.882157 |
| 60 | SUBREDDIT: r/relationships\n\nTITLE: Pride onl... | Girlfriend on break sleeps with another guy (... | 0.805248 |



Number of examples with score > 0.75: 17
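
The 0.75 cutoff is a judgment call. As a rough sketch, you can count how many examples would survive at several candidate thresholds before committing to one. The helper name `count_survivors` and the sample scores below are illustrative assumptions, not part of the Pi SDK or this dataset's actual results.

```python
def count_survivors(scores, thresholds):
    """Return {threshold: number of scores strictly above it}."""
    return {t: sum(1 for s in scores if s > t) for t in thresholds}


# Illustrative scores only; in this notebook you would pass df["score"].tolist().
sample_scores = [0.14, 0.49, 0.46, 0.63, 0.41, 0.70, 0.80, 0.83, 0.45, 0.95]

for threshold, kept in count_survivors(sample_scores, [0.5, 0.75, 0.9]).items():
    print(f"score > {threshold}: {kept} of {len(sample_scores)} examples kept")
```

A higher threshold keeps fewer but better-scored examples, so the right value depends on how much training data you need.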


## Next Steps

Now that you have a set of high-scoring examples, you can use them to fine-tune an LLM such as Llama-3.1-8B.
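
One way to hand the filtered examples to a fine-tuning pipeline is to write them out as JSON Lines. The sketch below uses only the standard library. The helper name `write_jsonl`, the output file name, and the sample records are assumptions for illustration, and the record format your fine-tuning tool expects may differ.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write an iterable of dicts to `path`, one JSON object per line."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Placeholder records; in this notebook you could instead pass
# filtered_df[["prompt", "completion"]].to_dict("records").
sample_records = [
    {"prompt": "example prompt 1", "completion": "example summary 1"},
    {"prompt": "example prompt 2", "completion": "example summary 2"},
]

output_path = os.path.join(tempfile.gettempdir(), "tldr_filtered.jsonl")
written = write_jsonl(sample_records, output_path)
print(f"Wrote {written} examples to {output_path}")
```

Dropping the `score` column before export keeps the file limited to the prompt and completion pairs that a training job would typically consume.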