<a href="https://colab.research.google.com/github/renato-penna/fiap-tech-challenge-fase03/blob/main/data_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Setup and Imports

In [33]:
import pandas as pd
import numpy as np
from google.colab import drive
import os, json

### Mount Google Drive and Define File Paths

In [34]:
drive.mount('/content/drive')
json_path = '/content/drive/MyDrive/Fiap/trn.json'
filtered_jsonl = "/content/drive/MyDrive/Fiap/trn_filtered.jsonl"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Install ijson Library

In [35]:
!pip install ijson



### Filter and Save Data in Chunks

In [32]:
df_iter = pd.read_json(json_path, lines=True, chunksize=100000)
for i, chunk in enumerate(df_iter):
    # keep only the columns that are needed
    filtered = chunk[["title", "content"]]

    # write as JSON Lines: overwrite on the first chunk, then append so that
    # everything ends up in a single file (`mode` in to_json needs pandas >= 2.1)
    mode = "w" if i == 0 else "a"
    filtered.to_json(
        filtered_jsonl,
        orient="records",
        lines=True,
        force_ascii=False,
        mode=mode
    )
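
The same column filter can also be written without pandas, which may be useful where the `mode` argument of `to_json` is not available. The block below is a minimal sketch using only the standard library; `filter_jsonl` and the demo records are hypothetical names and made-up data for illustration, not part of the original notebook.

```python
import json
import os
import tempfile


def filter_jsonl(src_path, dst_path, keys=("title", "content")):
    """Copy src_path to dst_path line by line, keeping only `keys` in each record."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            slim = {key: record.get(key) for key in keys}
            # ensure_ascii=False plays the role of force_ascii=False in pandas
            dst.write(json.dumps(slim, ensure_ascii=False) + "\n")
            kept += 1
    return kept


# Toy demonstration on a temporary file (the records are made up).
tmp_dir = tempfile.mkdtemp()
demo_src = os.path.join(tmp_dir, "demo.jsonl")
demo_dst = os.path.join(tmp_dir, "demo_filtered.jsonl")
with open(demo_src, "w", encoding="utf-8") as f:
    f.write(json.dumps({"uid": "1", "title": "A", "content": "text", "extra": 1}) + "\n")
    f.write(json.dumps({"uid": "2", "title": "B", "content": ""}) + "\n")

written = filter_jsonl(demo_src, demo_dst)
with open(demo_dst, encoding="utf-8") as f:
    demo_rows = [json.loads(line) for line in f]
```

Because only one line is held in memory at a time, this variant does not need a chunk size; the trade-off is that it is likely slower than the vectorized pandas path on large files.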

### Load Filtered Data and Clean

In [36]:
df = pd.read_json(filtered_jsonl, lines=True)

# drop rows where title or content is null or empty
df = df[df["title"].notna()]          # drop NaN in title
df = df[df["title"].str.strip() != ""]  # drop empty or whitespace-only strings in title
df = df[df["content"].notna()]          # drop NaN in content
df = df[df["content"].str.strip() != ""]  # drop empty or whitespace-only strings in content

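# save the cleaned data; the default orient="columns" does not support
# index=False, so the original row labels are written along with the data
df.to_json("/content/drive/MyDrive/Fiap/trnTreaded.json")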
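
To make the cleaning rule concrete, here is a small plain-Python sketch of the same condition applied to a few made-up records. The helper names `is_blank` and `drop_blank_records` are hypothetical; the notebook itself does this with pandas masks.

```python
def is_blank(value):
    """True for None or for a string that is empty or whitespace-only."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def drop_blank_records(records, fields=("title", "content")):
    """Keep only records where every field in `fields` has non-blank text."""
    return [r for r in records if not any(is_blank(r.get(f)) for f in fields)]


# Made-up records covering each case the cleaning step removes.
sample = [
    {"title": "Mog's Kittens", "content": "A picture book."},
    {"title": "   ", "content": "Title is only spaces."},
    {"title": "No content", "content": None},
    {"title": "Empty content", "content": ""},
]
cleaned = drop_blank_records(sample)
```

Only the first record survives, since each of the others has a blank title or a blank content.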

### Load Processed Data and Display Head

In [37]:
df = pd.read_json("/content/drive/MyDrive/Fiap/trnTreaded.json")
df.head()

|    | title | content |
|---:|---|---|
| 0 | Girls Ballet Tutu Neon Pink | High quality 3 layer ballet tutu. 12 inches in... |
| 3 | Mog's Kittens | Judith Kerr&#8217;s best&#8211;selling adventu... |
| 7 | Girls Ballet Tutu Neon Blue | Dance tutu for girls ages 2-8 years. Perfect f... |
| 12 | The Prophet | In a distant, timeless place, a mysterious pro... |
| 13 | Rightly Dividing the Word | --This text refers to thePaperbackedition. |


### Display Data Shape and Null Counts

In [38]:
df.shape
total = len(df)
null_title = df["title"].isna().sum() + (df["title"].astype(str).str.strip() == "").sum()
null_content = df["content"].isna().sum() + (df["content"].astype(str).str.strip() == "").sum()

print("Total rows:", total)
print("Rows with null/empty title:", null_title)
print("Rows with null/empty content:", null_content)

Total rows: 1390403
Rows with null/empty title: 0
Rows with null/empty content: 0


### Create Prompt/Completion and Save for Fine-tuning

In [39]:
# build the prompt/completion pairs in English
df["prompt"] = "Question: " + df["title"] + "\nAnswer:"
df["completion"] = df["content"]

fine_tune_jsonl = "/content/drive/MyDrive/Fiap/trn_finetune.jsonl"
df[["prompt", "completion"]].to_json(
    fine_tune_jsonl,
    orient="records",
    lines=True,
    force_ascii=False
)

print("✅ Fine-tuning dataset saved to:", fine_tune_jsonl)
print("Number of training examples:", len(df))

✅ Fine-tuning dataset saved to: /content/drive/MyDrive/Fiap/trn_finetune.jsonl
Number of training examples: 1390403
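
Before training, it can help to spot-check that each line of the output has the expected shape. The sketch below rebuilds one record with the same template and validates a serialized line; `build_example` and `check_example` are hypothetical helpers, and the sample text is shortened from the preview above.

```python
import json


def build_example(title, content):
    """Return one prompt/completion record using the template from the cell above."""
    return {"prompt": "Question: " + title + "\nAnswer:", "completion": content}


def check_example(line):
    """Parse one JSONL line and verify it has the expected keys and template."""
    record = json.loads(line)
    return (
        set(record) == {"prompt", "completion"}
        and record["prompt"].startswith("Question: ")
        and record["prompt"].endswith("\nAnswer:")
        and bool(record["completion"].strip())
    )


example = build_example("The Prophet", "In a distant, timeless place")
line = json.dumps(example, ensure_ascii=False)
is_valid = check_example(line)
```

In practice, `check_example` could be run over the first few lines of `trn_finetune.jsonl` rather than over a hand-built record.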
