# Prepare the dataset

We load the Yelp review dataset (`Yelp/yelp_review_full`), in which every review is labeled with its star rating. The raw labels run from 0 to 4, corresponding to 1 to 5 stars.

We will use this dataset to tune a prompt that classifies a review into one of these rating classes with as high an accuracy as possible.

https://huggingface.co/datasets/Yelp/yelp_review_full

In [2]:
pip install 'datasets<3' --quiet

Note: you may need to restart the kernel to use updated packages.


In [3]:
from datasets import load_dataset

dataset = load_dataset("Yelp/yelp_review_full", split="train")
dataset

Dataset({
    features: ['label', 'text'],
    num_rows: 650000
})

In [4]:
CLASSES = sorted(set(dataset['label']))  # sort so the class order is deterministic
CLASSES

[0, 1, 2, 3, 4]

### CONSTANTS

In [5]:
N = 200  # Number of samples per class (label)
TRAIN_TEST_SPLIT = 0.75  # Fraction of the full dataset held out from training (passed as test_size)
TEST_VAL_SPLIT = 0.7  # Fraction of the held-out data used for test; the remainder is validation
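
As a quick sanity check on these constants, the sketch below works out how many rows each split receives. It assumes the balanced sample used in this notebook (5 classes with N = 200 rows each) and that the held-out share is rounded up, as scikit-learn's splitter does. `split_sizes` is a hypothetical helper written for illustration, not part of `datasets`.

```python
import math

N = 200
NUM_CLASSES = 5
TRAIN_TEST_SPLIT = 0.75
TEST_VAL_SPLIT = 0.7


def split_sizes(total, held_out_fraction, test_fraction):
    """Return (train, val, test) row counts for a two-stage split."""
    # Stage 1: hold out a fraction of all rows; the rest is the training set.
    held_out = math.ceil(total * held_out_fraction)
    train = total - held_out
    # Stage 2: a fraction of the held-out rows becomes test; the rest is validation.
    test = math.ceil(held_out * test_fraction)
    val = held_out - test
    return train, val, test


train_n, val_n, test_n = split_sizes(N * NUM_CLASSES, TRAIN_TEST_SPLIT, TEST_VAL_SPLIT)
print(train_n, val_n, test_n)  # 250 225 525
```

With these values only a quarter of the sample is used for training, and just over half of all rows end up in the test split.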

In [6]:
import pandas as pd
from datasets import Dataset

# Convert the dataset to a pandas DataFrame
df = pd.DataFrame(dataset)
# Add an id column
df = df.reset_index(drop=False).rename(columns={'index': 'id'})
# Shift the labels from 0-4 to 1-5 so that they match the star ratings
df['label'] = df['label'] + 1
# Group by the label column
grouped = df.groupby("label")
# Sample N records from each label (N is defined in the constants above)
sampled_df = grouped.apply(lambda x: x.sample(n=N, random_state=42)).reset_index(drop=True)
# Convert the sampled DataFrame back to a Hugging Face dataset
sampled_dataset = Dataset.from_pandas(sampled_df)
# Print the value counts of the label column
print(sampled_df['label'].value_counts())
# Shuffle the dataset (no seed is set, so the order changes between runs)
shuffled_dataset = sampled_dataset.shuffle()


label
1    200
2    200
3    200
4    200
5    200
Name: count, dtype: int64
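
A balanced sample like this can also be drawn with pandas' `DataFrameGroupBy.sample`, which avoids calling `apply` on the groups. The sketch below uses a small made-up DataFrame (`toy_df`, `N_TOY`, and the review texts are illustrative, not the Yelp data). It may pick different rows than the `apply` version, although the per-class counts come out the same.

```python
import pandas as pd

# Toy data with unequal class sizes (6, 4 and 5 rows).
toy_df = pd.DataFrame({
    "label": [1] * 6 + [2] * 4 + [3] * 5,
    "text": [f"review {i}" for i in range(15)],
})

N_TOY = 3  # rows to keep per label in this toy example

# Draw N_TOY rows from every label group in one call.
balanced = (
    toy_df.groupby("label")
    .sample(n=N_TOY, random_state=42)
    .reset_index(drop=True)
)
print(balanced["label"].value_counts().sort_index())
```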



Select only the relevant columns and rename them to match the `ClassificationDataClass` class.

In [8]:
selected_cols_dataset = shuffled_dataset.select_columns(['label', 'text','id'])
renamed_dataset = selected_cols_dataset.rename_column('label', 'class_name').rename_column('text', 'question')
full_dataset = renamed_dataset
full_dataset.to_csv('full_dataset.csv')
full_dataset

Dataset({
    features: ['class_name', 'question', 'id'],
    num_rows: 1000
})

In [9]:
# Split the dataset into train and held-out sets (25% train, 75% held out)
train_test_split = full_dataset.train_test_split(test_size=TRAIN_TEST_SPLIT)
# Further split the held-out set into validation and test sets (30% validation, 70% test)
val_test_split = train_test_split['test'].train_test_split(test_size=TEST_VAL_SPLIT)

train = train_test_split['train']
train.to_csv('train.csv')
val = val_test_split['train']
val.to_csv('val.csv')
test = val_test_split['test']
test.to_csv('test.csv')


397555

In [10]:
train.to_pandas()["class_name"].value_counts()

class_name
1    60
4    55
5    48
3    44
2    43
Name: count, dtype: int64

In [11]:
val.to_pandas()["class_name"].value_counts()

class_name
1    50
5    50
2    48
3    41
4    36
Name: count, dtype: int64

In [12]:
test.to_pandas()["class_name"].value_counts()

class_name
3    115
2    109
4    109
5    102
1     90
Name: count, dtype: int64
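
The counts above show that a purely random split leaves the classes unevenly represented (for example, 60 versus 43 rows in the training set). If balanced splits matter, one option is a stratified split, which divides each class separately. The following is a minimal pure-Python sketch of that idea, not the method used in this notebook; `stratified_split` and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, label_key, held_out_fraction, seed=42):
    """Split rows so that every label contributes the same share to both parts."""
    rng = random.Random(seed)

    # Group the rows by their label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    kept, held_out = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Hold out the same fraction from each label group.
        cut = round(len(group) * held_out_fraction)
        held_out.extend(group[:cut])
        kept.extend(group[cut:])

    # Shuffle again so the rows are not ordered by label.
    rng.shuffle(kept)
    rng.shuffle(held_out)
    return kept, held_out


# Toy stand-in for the 1000-row balanced sample: 200 rows for each of 5 classes.
rows = [{"id": i, "class_name": 1 + i % 5} for i in range(1000)]
train_rows, held_out_rows = stratified_split(rows, "class_name", 0.75)
print(Counter(r["class_name"] for r in train_rows))
```

Here every class contributes exactly 50 rows to the training part and 150 to the held-out part. `Dataset.train_test_split` also accepts a `stratify_by_column` argument, but as far as I know it requires the column to be a `ClassLabel` feature, so the integer `class_name` column would need to be converted first.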