# Preparing datasets

In this notebook, we're going to download the [Amazon US reviews](https://huggingface.co/datasets/amazon_us_reviews) dataset, explore it a bit, process it, and push it back to the Hugging Face Hub.

```datasets``` documentation: https://huggingface.co/docs/datasets/index

# 1 - Setup

To push a dataset to the Hugging Face Hub, we need to install Git Large File Storage (LFS):

1) Git LFS setup:

In a terminal:

```
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
sudo yum install git-lfs -y
git lfs install
```
Then, add ```*.csv filter=lfs diff=lfs merge=lfs -text``` to ```.gitattributes```, so that CSV files will also be managed with Git LFS.

2) Install the [Hugging Face CLI](https://github.com/huggingface/huggingface_hub)

```pip -q install huggingface_hub```

3) Log in to the Hub with your Hub credentials

```huggingface-cli login```
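
The ```.gitattributes``` rule from step 1 can also be added from Python, which is convenient inside a notebook. The sketch below is only an illustration: the helper name ```ensure_lfs_rule``` and its default path are assumptions, not part of Git LFS or ```huggingface_hub```. It appends the rule only when it is missing, so running it twice is harmless.

```python
from pathlib import Path

# Rule from step 1: track CSV files with Git LFS.
LFS_RULE = "*.csv filter=lfs diff=lfs merge=lfs -text"


def ensure_lfs_rule(path=".gitattributes", rule=LFS_RULE):
    """Append `rule` to the file at `path` unless it is already there.

    Hypothetical helper for illustration. Returns True if the file was changed.
    """
    attributes = Path(path)
    existing = attributes.read_text().splitlines() if attributes.exists() else []
    if rule in (line.strip() for line in existing):
        return False
    # Keep one rule per line, whatever the file ended with before.
    attributes.write_text("\n".join(existing + [rule]) + "\n")
    return True
```

Call ```ensure_lfs_rule()``` from the root of the cloned dataset repository. Running ```git lfs track "*.csv"``` in a terminal has the same effect.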

In [None]:
%%sh
pip -q install datasets huggingface_hub --upgrade

In [None]:
import datasets

print(datasets.__version__)

# 2 - Loading and exploring

In [None]:
from datasets import load_dataset

dataset = load_dataset('amazon_us_reviews', 'Shoes_v1_00', split='train')
print('{:.2f} GB'.format(dataset.size_in_bytes/1024/1024/1024))
print(dataset.shape)

The dataset is pretty large. We definitely don't need (or want) that much to begin with. Let's work with 10% only.

In [None]:
dataset = load_dataset('amazon_us_reviews', 'Shoes_v1_00', split='train[:10%]')
print(dataset.shape)
print(dataset.column_names)

Let's take a look at the first entry in the dataset.

In [None]:
dataset[0]

There are a lot of columns that we don't need right now. Let's just keep ```review_body``` and ```star_rating```.

In [None]:
dataset = dataset.remove_columns(['marketplace', 'customer_id', 'review_id', 'product_id', 'product_parent', 'product_title', 'product_category', 'helpful_votes', 'total_votes', 'vine', 'verified_purchase', 'review_headline', 'review_date'])
dataset[0]

Let's check that we don't have any unexpected ```star_rating``` values.

In [None]:
dataset.unique('star_rating')

Now, let's see what the distribution of star ratings is. This also shows how easily we can convert a dataset to pandas, in order to use well-known functions.

In [None]:
dataset.to_pandas().value_counts('star_rating')
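
For a quick count without converting to pandas, ```collections.Counter``` from the standard library also works on a column of values. In the sketch below, the ```star_ratings``` list is a made-up stand-in for the values of the ```star_rating``` column.

```python
from collections import Counter

# Made-up ratings standing in for the values of the 'star_rating' column.
star_ratings = [5, 4, 5, 1, 3, 5, 4, 2, 5, 4]

counts = Counter(star_ratings)
total = sum(counts.values())

# Print classes from most to least frequent, with their share of the data.
for stars, count in counts.most_common():
    print(f"{stars} stars: {count} reviews ({count / total:.0%})")
```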

The dataset is quite unbalanced: 4-star and 5-star ratings dominate. Let's rebalance the dataset, and keep at most 20,000 reviews in each class.

In [None]:
import pandas as pd

dataset_pd = dataset.to_pandas()
# Keep at most the first 20,000 reviews for each star rating (1 to 5)
dataset_pd_balanced = pd.concat(
    [dataset_pd[dataset_pd['star_rating']==stars][:20000] for stars in range(1,6)]
)

In [None]:
dataset_pd_balanced.value_counts('star_rating')
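
The per-class cap can also be expressed with ```groupby()``` and ```head()```, which keeps at most ```n``` rows per group and leaves smaller classes untouched. This is a sketch on a made-up DataFrame, not the notebook's data, and ```max_per_class``` is an illustrative name.

```python
import pandas as pd

# Made-up, unbalanced ratings: three 5-star, two 4-star, one 1-star review.
toy_pd = pd.DataFrame({
    "review_body": ["a", "b", "c", "d", "e", "f"],
    "star_rating": [5, 5, 5, 4, 4, 1],
})

max_per_class = 2  # the notebook uses 20,000

# head(n) keeps the first n rows of each group, in their original order.
toy_balanced = toy_pd.groupby("star_rating").head(max_per_class)

print(toy_balanced["star_rating"].value_counts().sort_index())
```

Classes with fewer rows than the cap stay smaller, so the result is only balanced if every class has at least that many rows.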

Now, we can switch back to the Hugging Face dataset format.

In [None]:
dataset_balanced = datasets.Dataset.from_pandas(dataset_pd_balanced, preserve_index=False)
print(dataset_balanced)

Transformer classification models expect integer class labels that start at 0, so let's decrement all star ratings using the ```map()``` function in ```datasets```.

In [None]:
# Class labels must start at 0

def decrement_stars(row):
    return {
        'star_rating': row['star_rating']-1
    }

In [None]:
dataset_balanced = dataset_balanced.map(decrement_stars)
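
On larger datasets, ```map()``` is usually faster with ```batched=True```, where the function receives a dictionary of lists instead of a single row. A minimal sketch of a batched variant follows. The name ```decrement_stars_batched``` is illustrative, and a plain dictionary stands in for a batch.

```python
def decrement_stars_batched(batch):
    # With batched=True, each column arrives as a list of values.
    return {
        'star_rating': [stars - 1 for stars in batch['star_rating']]
    }

# A plain dict standing in for one batch of rows.
example_batch = {'review_body': ['ok', 'great'], 'star_rating': [3, 5]}
print(decrement_stars_batched(example_batch))
```

It would be applied with ```dataset_balanced.map(decrement_stars_batched, batched=True)```.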

Next, let's rename columns to what the model expects.

In [None]:
dataset_balanced = dataset_balanced.rename_column('star_rating', 'labels')
dataset_balanced = dataset_balanced.rename_column('review_body', 'text')

In [None]:
dataset_balanced[0]

As usual, we split the dataset for training and validation. Let's set 10% aside.

In [None]:
dataset_split = dataset_balanced.train_test_split(test_size=0.1, shuffle=True, seed=59)

In [None]:
dataset_split

# 3 - Saving the dataset and pushing it to the hub

We first save the dataset to disk in Hugging Face (Apache Arrow) format, then push it to the Hub.

In [None]:
dataset_split.save_to_disk('data/')

In [None]:
dataset_split.push_to_hub(repo_id='amazon-shoe-reviews')

The dataset is now visible at https://huggingface.co/datasets/juliensimon/amazon-shoe-reviews.

Of course, we could also use a Git workflow:

```
huggingface-cli repo create -y amazon-shoe-reviews --type dataset

git clone https://huggingface.co/datasets/juliensimon/amazon-shoe-reviews
    
cd amazon-shoe-reviews

cp -r ../data/* .

git add .

git commit -m 'Initial version'

git push
```



It's good practice to describe datasets in a dataset card: language(s), task types, etc. You can easily create one by clicking on "Add a dataset card" on the dataset page. 

Alternatively, you can add a README.md file to the dataset repository. This file should follow a well-defined format described at https://huggingface.co/docs/datasets/upload_dataset#create-a-dataset-card.

Finally, let's also save the dataset in CSV format. That may come in handy later.

In [None]:
import os

os.makedirs('data_csv', exist_ok=True)
dataset_split['train'].to_csv('data_csv/amazon_shoe_reviews_train.csv', header=True, index=False)
dataset_split['test'].to_csv('data_csv/amazon_shoe_reviews_test.csv', header=True, index=False)
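
Review text often contains commas, quotes, and line breaks, so the exported files should be read with a CSV-aware parser rather than by splitting lines. The sketch below writes a made-up two-row file with the ```text``` and ```labels``` columns and reads it back with the standard library's ```csv``` module. The file name is a placeholder.

```python
import csv
import os
import tempfile

# Made-up rows; the first review contains a comma and a line break.
rows = [
    {"text": "Great fit, very comfy.\nWould buy again.", "labels": 4},
    {"text": "Too small", "labels": 0},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")
    # newline='' lets the csv module handle line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "labels"])
        writer.writeheader()
        writer.writerows(rows)
    with open(path, newline="", encoding="utf-8") as f:
        loaded = list(csv.DictReader(f))

# Values come back as strings, so labels need an explicit int() conversion.
print(len(loaded), int(loaded[0]["labels"]))
```

The same files can also be loaded as a Hugging Face dataset with ```load_dataset('csv', data_files=...)```.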

In the next notebook, we're going to use the dataset to train a first model.