# Prepare the classification dataset for finetuning

We use an Amazon review dataset from https://amazon-reviews-2023.github.io/.

However, the full dataset is very large, so we use only the beauty reviews.
We want our sentiment detection model to distinguish only good from bad
sentiment. Therefore, we keep only reviews with a rating of `1` or `5`.
Because there are many more good ratings than bad ones, we balance the
dataset by sampling the same number of good and bad reviews. 5,000 reviews
per class are enough, so we end up with 10,000 reviews, which are saved in
an `.xz`-compressed JSON file.


In [None]:
import pandas as pd

In [None]:
# One JSON object per line; pandas infers the xz compression from the file extension
df = pd.read_json("All_Beauty.jsonl.xz", lines=True)

In [None]:
len(df)
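
If the file does not fit comfortably in memory, one option is to read it in chunks and filter while reading. The following is only a sketch on a tiny made-up file: the toy data, the file name `toy_reviews.jsonl.xz`, and the chunk size are assumptions for illustration, not part of the original notebook.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the real review file (hypothetical data).
toy = pd.DataFrame({
    "rating": [5, 1, 3, 5, 4, 1],
    "title": list("abcdef"),
    "text": list("uvwxyz"),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_reviews.jsonl.xz")
    toy.to_json(path, orient="records", lines=True)

    # Read two lines at a time and keep only 1- and 5-star rows.
    kept = []
    with pd.read_json(path, lines=True, chunksize=2) as reader:
        for chunk in reader:
            kept.append(chunk[chunk["rating"].isin([1, 5])])

extremes = pd.concat(kept, ignore_index=True)
print(extremes["rating"].tolist())
```

For the real file, a much larger chunk size (for example 100,000 lines) would be a more natural choice.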

## Dataset is too large, so select only reviews with very good or very bad ratings

In [None]:
good = df[df["rating"] == 5]

In [None]:
len(good)

In [None]:
bad = df[df["rating"] == 1]

In [None]:
len(bad)

## Sample 5,000 good and 5,000 bad reviews (increase if you have a fast GPU)

In [None]:
total = pd.concat([bad.sample(5000, random_state=42), good.sample(5000, random_state=42)])
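
`DataFrame.sample` raises an error if it is asked for more rows than are available, which matters when the per-class size is changed or a smaller category is used. A minimal sketch of a guarded version, using made-up toy data and a hypothetical `n_per_class` variable:

```python
import pandas as pd

# Toy data (hypothetical): six 5-star reviews, two 1-star reviews.
toy = pd.DataFrame({
    "rating": [5, 5, 5, 5, 5, 5, 1, 1],
    "text": list("abcdefgh"),
})

good = toy[toy["rating"] == 5]
bad = toy[toy["rating"] == 1]

# Never ask for more rows than the smaller class can supply.
n_per_class = min(5000, len(good), len(bad))

balanced = pd.concat([
    bad.sample(n_per_class, random_state=42),
    good.sample(n_per_class, random_state=42),
])

# Both classes end up with the same number of rows.
print(balanced["rating"].value_counts())
```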

In [None]:
# Prepend the review title to the review body, separated by a blank line
total["text"] = total["title"] + "\n\n" + total["text"]
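
Concatenating string columns with `+` yields a missing value whenever either part is missing. Whether that occurs in this dataset is not checked here. The sketch below uses made-up rows to show the effect and one possible way to keep such reviews:

```python
import pandas as pd

# Toy rows (hypothetical); the second review has no title.
toy = pd.DataFrame({
    "title": ["Great", None],
    "text": ["Works well.", "Broke after a day."],
})

# Plain concatenation: a missing title makes the whole result missing.
naive = toy["title"] + "\n\n" + toy["text"]

# Filling missing parts with an empty string keeps the review body;
# strip() removes the leading blank lines left by an empty title.
safe = (toy["title"].fillna("") + "\n\n" + toy["text"].fillna("")).str.strip()

print(naive.isna().tolist())
print(safe.tolist())
```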

## Save to a compressed JSON file

In [None]:
total[["text", "rating"]].to_json("10000_All_Beauty.json.xz", orient="records")
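
To check that the saved file can be loaded again, a round trip is a quick test. The sketch below uses a temporary directory and two made-up rows instead of the real output file; pandas infers the xz compression from the `.xz` extension when writing and when reading.

```python
import os
import tempfile

import pandas as pd

# Two made-up rows in the same layout as the saved file.
toy = pd.DataFrame({
    "text": ["Great\n\nWorks well.", "Bad\n\nBroke after a day."],
    "rating": [5, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_reviews.json.xz")
    toy.to_json(path, orient="records")

    # Load the compressed file back into a DataFrame.
    restored = pd.read_json(path, orient="records")

print(restored.shape)
print(list(restored.columns))
```

For the real file, the same `pd.read_json` call with `"10000_All_Beauty.json.xz"` as the path should return a DataFrame with the `text` and `rating` columns.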