<a href="https://colab.research.google.com/github/thiagoscerqueira/fiap_techchallenge_03/blob/main/preprocess_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Dataset preprocessing**

This notebook reads the `trn.json` file of the "Amazon titles 1.3MM" dataset, cleans the data, and writes a CSV file for later fine-tuning.



# **Installing the required libraries**

In [None]:
!pip install datasets
!pip install huggingface_hub


In [None]:
from datasets import load_dataset
import pandas as pd

**Load the Amazon dataset, which is stored as a dataset on the Hugging Face Hub**

In [None]:
dataset = load_dataset('thiagoscerqueira/amazontitlestrn')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


trn.json:   0%|          | 0.00/1.89G [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2248619 [00:00<?, ? examples/s]

# **Data cleaning**

Here we apply a filter-and-transform function to each entry of the dataset.
The transformation keeps only the "title" and "content" fields, stripped of leading and trailing whitespace.
Entries with an empty title or an empty description are discarded.

In [None]:
def filter_and_transform(example):
    title = example['title'].strip()
    content = example['content'].strip()
    if title and content:  # Check if both fields are non-empty
        return {
            'title': title,
            'content': content
        }
    return None  # Skip rows with empty 'title' or 'content'
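
One caveat with the function above: `.strip()` raises an `AttributeError` if a field is `None`, and a `KeyError` is raised if a key is missing. Whether the real dataset contains such rows is not known here, so the following is only a minimal sketch of a defensive variant. The name `filter_and_transform_safe` and the sample rows are hypothetical.

```python
def filter_and_transform_safe(example):
    """Hypothetical defensive variant of filter_and_transform.

    Missing or None fields are treated as empty strings so that .strip()
    never raises. Whether such rows exist in the dataset is an assumption.
    """
    title = (example.get("title") or "").strip()
    content = (example.get("content") or "").strip()
    if title and content:
        return {"title": title, "content": content}
    return None  # Row is dropped


# Hand-made sample rows (not taken from the dataset).
samples = [
    {"title": "  Desk lamp ", "content": "Adjustable LED lamp.\n"},
    {"title": "Notebook", "content": "   "},
    {"title": None, "content": "Orphan description"},
    {"content": "No title key at all"},
]

kept = [row for row in map(filter_and_transform_safe, samples) if row is not None]
print(kept)
```

Only the first sample survives, because it is the only one with a non-empty title and a non-empty content after stripping.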

In [None]:
transformed_dataset = dataset['train'].filter(lambda example: filter_and_transform(example) is not None)

In [None]:
print(transformed_dataset)

Dataset({
    features: ['title', 'content'],
    num_rows: 1390403
})


In [None]:
transformed_dataset = transformed_dataset.map(filter_and_transform, remove_columns=dataset['train'].column_names)

Map:   0%|          | 0/1390403 [00:00<?, ? examples/s]

In [None]:
print(transformed_dataset)

Dataset({
    features: ['title', 'content'],
    num_rows: 1390403
})


**Load the cleaned dataset into a pandas DataFrame, which makes it easy to generate a CSV file from it**

In [None]:
# Convert via the dataset's own exporter instead of iterating row by row.
df = transformed_dataset.to_pandas()

**This cell writes the CSV from the pandas DataFrame, saves it to Google Drive, and then uploads it to the Hugging Face Hub.**

In [None]:
from google.colab import drive
from huggingface_hub import HfApi, login

# Mount Google Drive so the CSV persists beyond the Colab session.
drive.mount('/content/drive')

csv_file_path = '/content/drive/MyDrive/datasets/transformed_dataset.csv'

# Write only the two cleaned columns, without the pandas index.
df.to_csv(csv_file_path, index=False, columns=['title', 'content'])

print(f"Dataset successfully transformed and saved to {csv_file_path}.")

# Authenticate with the Hugging Face Hub (prompts for a token).
login()

# Upload the CSV to the dataset repository over HTTP. repo_type='dataset'
# is required because the default repository type is 'model'.
repo_id = 'thiagoscerqueira/amazontitlestrn'
api = HfApi()
api.upload_file(
    path_or_fileobj=csv_file_path,
    path_in_repo='transformed_dataset.csv',
    repo_id=repo_id,
    repo_type='dataset',
    commit_message='Add transformed_dataset.csv',
)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Dataset successfully transformed and saved to /content/drive/MyDrive/datasets/transformed_dataset.csv.
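
Product descriptions often contain commas, quotes, and line breaks, so it can be worth confirming that such text survives a CSV round trip before relying on the exported file. The sketch below swaps in the standard-library `csv` module instead of pandas to stay dependency-free. The two rows and the `sample.csv` file name are hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical rows: one description contains a comma, quotes and a newline.
rows = [
    {"title": "Desk lamp", "content": 'Adjustable, "warm" light.\nTwo brightness levels.'},
    {"title": "Notebook", "content": "A5, 80 pages"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")

    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "content"])
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back and collect the header and all rows.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames
        restored = list(reader)

print(header)
print(restored == rows)
```

The same kind of check can be run against the real file by reading a handful of rows back and comparing them with the corresponding DataFrame rows.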
