# TECH CHALLENGE
Tech Challenge é o projeto que engloba os conhecimentos obtidos em todas as disciplinas presentes na fase. Esta é uma atividade que, em princípio, deve ser desenvolvida em grupo. É importante atentar-se ao prazo de entrega, uma vez que essa atividade é obrigatória, valendo 90% da nota de todas as disciplinas da fase.

**O PROBLEMA**

No Tech Challenge desta fase, você precisa executar o fine-tuning de um foundation model (Llama, BERT, MISTRAL etc.), utilizando o dataset "The AmazonTitles-1.3MM". O modelo treinado deverá:


*   Receber perguntas com um contexto obtido por meio do arquivo json “trn.json”
*   A partir do prompt formado pela pergunta do usuário sobre o título do produto, o modelo deverá gerar uma resposta baseada na pergunta do usuário trazendo como resultado do aprendizado do fine-tuning os dados da sua descrição que está contido dentro do dataset.

### Fluxo de trabalho atualizado:
1. Escolha do Dataset:
Descrição: o The AmazonTitles-1.3MM consiste em consultas textuais reais de usuários e títulos associados de produtos relevantes encontrados na Amazon e suas descrições, medidos por ações implícitas ou explícitas dos usuários.

2. Preparação do Dataset:
Faça o download do dataset AmazonTitles-1.3MM e utilize o arquivo “trn.json”. Nele, você utilizará as colunas “title” e “content”, que contêm título e descrição respectivamente. Prepare os prompts para o fine-tuning garantindo que estejam organizados de maneira adequada para o treinamento do modelo escolhido. Limpe e pré-processe os dados conforme necessário para o modelo escolhido.

3. Chamada do Foundation Model
Importe o foundation model que será utilizado e faça um teste apresentando o resultado atual do modelo antes do treinamento (para que se obtenha uma base de análise após o fine-tuning), e então será possível avaliar a diferença do resultado gerado.

4. Execução do Fine-Tuning:
Execute o fine-tuning do foundation model selecionado (por exemplo,
BERT, GPT, Llama) utilizando o dataset preparado. Documente o processo de fine-tuning, incluindo os parâmetros utilizados e qualquer ajuste específico realizado no modelo.

5. Geração de Respostas:
Configure o modelo treinado para receber perguntas dos usuários. O modelo deverá gerar uma resposta baseada na pergunta do usuário e nos dados provenientes do fine-tuning, incluindo as fontes fornecidas. O que esperamos para o entregável?



*   Documento detalhando o processo de seleção e preparação do dataset.
*   Descrição do processo de fine-tuning do modelo, com detalhes dos parâmetros e ajustes utilizados. Código-fonte do processo de fine- tuning.
*   Um vídeo demonstrando o modelo treinado gerando respostas a partir de perguntas do usuário e utilizando o contexto obtido por meio treinamento com o fine-tuning.

In [1]:

#rodar esse a parte
# Montando volume compartilhado no google drive para salvar os arquivos localmente
# Isso nos permite reuso em diversos pontos desse notebook
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


### Instalação de Pacotes

In [2]:
#rodar esse a parte
# Instalando pacotes necessários.
# Decidimos utilizar o unsloth para acelerar o processo de fine-tuning, já que o dataset é muito grande
!pip install bert-tensorflow
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install xformers==0.0.27 "trl<0.9.0" peft accelerate bitsandbytes
!pip install transformers datasets

Collecting bert-tensorflow
  Downloading bert_tensorflow-1.0.4-py2.py3-none-any.whl.metadata (619 bytes)
Downloading bert_tensorflow-1.0.4-py2.py3-none-any.whl (64 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/64.4 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m64.4/64.4 kB[0m [31m6.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: bert-tensorflow
Successfully installed bert-tensorflow-1.0.4
Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-2tzbko79/unsloth_1546d97fabd64bf197e91155cda10907
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-2tzbko79/unsloth_1546d97fabd64bf197e91155cda10907
  Resolved https://github.com/unslothai/unsloth.git to commit f1951c0f6d3e1f184af93

# Preparação do Dataset

Nos códigos abaixos:



*   Fazemos o download do dataset a partir do google drive
*   Extraímos o arquivo trn.json, que contém todos os dados que nós queremos
*   Processamos e limpamos o dataset, preparando para o fine-tuning



### Download do Dataset

In [3]:
!gdown 12zH4mL2RX8iSvH0VCNnd3QxO4DzuHWnK

Downloading...
From (original): https://drive.google.com/uc?id=12zH4mL2RX8iSvH0VCNnd3QxO4DzuHWnK
From (redirected): https://drive.google.com/uc?id=12zH4mL2RX8iSvH0VCNnd3QxO4DzuHWnK&confirm=t&uuid=cfec92b9-0c6b-424d-bea1-c9b874af97f2
To: /content/LF-Amazon-1.3M.raw.zip
100% 890M/890M [00:09<00:00, 93.3MB/s]


### Extração do Dataset

In [4]:
import zipfile
zip_path = '/content/LF-Amazon-1.3M.raw.zip'
extract_to = '/content/drive/MyDrive/FIAP/content/datasets'
zipfile.ZipFile(zip_path, 'r').extractall(extract_to)

#REALIZAR O CHUNK DO DATASET EM 50K CADA

In [5]:
import pandas as pd
import os
drive = '/content/drive/MyDrive/FIAP'
# Lê o arquivo JSON compactado em pedaços de 100.000 linhas, armazena os dados em um DataFrame.
datasets_path = f'{drive}/content/datasets/LF-Amazon-1.3M'
trn_filepath = f'{datasets_path}/trn.json.gz'

chunks_dir = f'{drive}/content/datasets/chunks/'

# Cria o diretório de chunks se não existir
os.makedirs(chunks_dir, exist_ok=True)

# Define o tamanho do chunk
chunk_size = 100000

chunks = pd.read_json(trn_filepath, lines=True, chunksize=chunk_size)

# Itera sobre cada pedaço e salva os chunks
for i, chunk_df in enumerate(chunks):
    # Define o caminho para o novo arquivo
    chunk_filepath = f'{chunks_dir}chunk_{i}.json.gz'

    # Salva o chunk em um arquivo JSON comprimido
    chunk_df.to_json(chunk_filepath, orient='records', lines=True, compression='gzip')

    print(f'Salvo: {chunk_filepath}')



Salvo: /content/drive/MyDrive/FIAP/content/datasets/chunks/chunk_0.json.gz
Salvo: /content/drive/MyDrive/FIAP/content/datasets/chunks/chunk_1.json.gz
Salvo: /content/drive/MyDrive/FIAP/content/datasets/chunks/chunk_2.json.gz
Salvo: /content/drive/MyDrive/FIAP/content/datasets/chunks/chunk_3.json.gz
Salvo: /content/drive/MyDrive/FIAP/content/datasets/chunks/chunk_4.json.gz
Salvo: /content/drive/MyDrive/FIAP/content/datasets/chunks/chunk_5.json.gz
Salvo: /content/drive/MyDrive/FIAP/content/datasets/chunks/chunk_6.json.gz
Salvo: /content/drive/MyDrive/FIAP/content/datasets/chunks/chunk_7.json.gz
Salvo: /content/drive/MyDrive/FIAP/content/datasets/chunks/chunk_8.json.gz
Salvo: /content/drive/MyDrive/FIAP/content/datasets/chunks/chunk_9.json.gz
Salvo: /content/drive/MyDrive/FIAP/content/datasets/chunks/chunk_10.json.gz
Salvo: /content/drive/MyDrive/FIAP/content/datasets/chunks/chunk_11.json.gz
Salvo: /content/drive/MyDrive/FIAP/content/datasets/chunks/chunk_12.json.gz
Salvo: /content/drive/

#CRIAR FUNÇÃO DE LIMPEZA

In [6]:
import re
import nltk
nltk.download('wordnet')
from nltk.corpus import stopwords
from wordcloud import STOPWORDS
from nltk.stem.wordnet import WordNetLemmatizer

import warnings
warnings.filterwarnings('ignore')

def limpeza_texto(text):
    # regex para limpar o texto
    whitespace = re.compile(r"\s+")                            # encontrando espaços em branco
    user = re.compile(r"(?i)@[a-z0-9_]+")                      # encontrar menções de usuários, exemplo @usuario
    text = whitespace.sub(' ', text)                           # substitui espaços em branco por ' '
    text = user.sub('', text)                                  # remove todas as menções de usuário encontradas no texto
    text = re.sub(r"\[[^()]*\]","", text)                      # remove o conteúdo dentro de colchetes, incluindo os colchetes
    text = re.sub("\d+", "", text)                             # remove todos os dígitos numéricos do texto
    text = re.sub(r'[^\w\s]','',text)                          # remove todos os caracteres que não são palavras (letras e números) ou espaços em branco.
    text = re.sub(r"(?:@\S*|#\S*|http(?=.*://)\S*)", "", text) # remove menções de usuário, hashtags e URLs.
    text = text.lower()                                        # texto para minusculo

    # removendo as stop words
    text = [word for word in text.split() if word not in list(STOPWORDS)]

    # word lemmatization
    sentence = []
    for word in text:
        lemmatizer = WordNetLemmatizer()
        sentence.append(lemmatizer.lemmatize(word,'v'))

    return ' '.join(sentence)

[nltk_data] Downloading package wordnet to /root/nltk_data...


#REALIZAR A LIMPEZA DOS CHUNKS

In [7]:
import json
chunks_dir_clear = f'{drive}/content/datasets/chunks_clear/'

# Cria o diretório de chunks_clear se não existir
os.makedirs(chunks_dir_clear, exist_ok=True)


#pipeline limpeza e chunck json limpo

for filename in os.listdir(chunks_dir):
    if filename.endswith('.json.gz'):
        # Caminho completo do arquivo de entrada
        file_path = os.path.join(chunks_dir, filename)

        # Lê o arquivo JSON em pedaços (chunks)
        json_reader = pd.read_json(file_path, lines=True, chunksize=50000)
        df_trn = next(json_reader)

        # Aplicar a limpeza
        df_trn = df_trn[['title', 'content']]
        df_trn['title'] = df_trn['title'].str.strip()
        df_trn['content'] = df_trn['content'].str.strip()
        df_trn = df_trn[df_trn['title'] != '']
        df_trn = df_trn[df_trn['content'] != '']
        df_trn = df_trn.dropna(subset=['title', 'content'])
        df_trn['title_clear'] = df_trn['title'].apply(limpeza_texto)
        df_trn['content_clear'] = df_trn['content'].apply(limpeza_texto)

        # Caminho para o arquivo limpo
        output_filename = f'cleaned_{filename}'
        output_file_path = os.path.join(chunks_dir_clear, output_filename)

        # Salva o dataset processado e limpo em um arquivo JSON
        df_trn.to_json(output_file_path, orient='records', lines=True, compression='gzip')

        print(f'Arquivo limpo salvo em: {output_file_path}')


Arquivo limpo salvo em: /content/drive/MyDrive/FIAP/content/datasets/chunks_clear/cleaned_chunk_0.json.gz
Arquivo limpo salvo em: /content/drive/MyDrive/FIAP/content/datasets/chunks_clear/cleaned_chunk_1.json.gz
Arquivo limpo salvo em: /content/drive/MyDrive/FIAP/content/datasets/chunks_clear/cleaned_chunk_2.json.gz
Arquivo limpo salvo em: /content/drive/MyDrive/FIAP/content/datasets/chunks_clear/cleaned_chunk_3.json.gz
Arquivo limpo salvo em: /content/drive/MyDrive/FIAP/content/datasets/chunks_clear/cleaned_chunk_4.json.gz
Arquivo limpo salvo em: /content/drive/MyDrive/FIAP/content/datasets/chunks_clear/cleaned_chunk_5.json.gz
Arquivo limpo salvo em: /content/drive/MyDrive/FIAP/content/datasets/chunks_clear/cleaned_chunk_6.json.gz
Arquivo limpo salvo em: /content/drive/MyDrive/FIAP/content/datasets/chunks_clear/cleaned_chunk_7.json.gz
Arquivo limpo salvo em: /content/drive/MyDrive/FIAP/content/datasets/chunks_clear/cleaned_chunk_8.json.gz
Arquivo limpo salvo em: /content/drive/MyDrive

#CRIAR/CARREGAR A BASE DO MODELO

In [8]:
#rodar esse a parte
#base modelo treino
import torch

#from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel, is_bfloat16_supported
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

model_dir = f'{drive}/content/datasets/model/'

# Cria o diretório de model se não existir
os.makedirs(model_dir, exist_ok=True)

#df_trn  pegar cada arquivo cleaned e jogar aqui no df


max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit", # Decidimos utilizar o modelo Llama
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model = FastLanguageModel.for_inference(model)


model = FastLanguageModel.get_peft_model(
    model,
    r = 8, #16,
    target_modules = ["q_proj", "k proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 8,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None
)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.9: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/198 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

Unsloth: You added custom modules, but Unsloth hasn't optimized for this.
Beware - your finetuning might be noticeably slower!


Not an error, but Unsloth cannot patch Attention layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2024.9 patched 32 layers with 0 QKV layers, 32 O layers and 32 MLP layers.


#CARREGAR CHUNKS LIMPOS E SEPARAR EM TEST AND TRAIN NO FORMATO DO PROMPT

In [10]:
# pipeline para treino
from datasets import load_dataset, Dataset

for filename in os.listdir(chunks_dir_clear):
    if filename.endswith('.json.gz'):
        # Caminho completo do arquivo de entrada
        file_path = os.path.join(chunks_dir_clear, filename)

        # Lê o arquivo JSON em pedaços (chunks)
        json_reader = pd.read_json(file_path, lines=True, chunksize=100000)
        df_trn = next(json_reader)

        df_trn['input'] = df_trn.apply(
            lambda row: f"DESCRIBE THIS PRODUCT:\n[|Title|]{row['title_clear']}[|Title|]\n[|Description|]{row['content_clear']}[|Description|]",
            axis=1
        )


        instruction_list = ["DESCRIBE THIS PRODUCT."] * len(df_trn)

        input_list = df_trn['title_clear'].tolist()
        output_list = df_trn['content_clear'].tolist()

        data = {
            "instruction": instruction_list,
            "input": input_list,
            "output": output_list
        }

        prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

        ### Instruction:
        {}

        ### Input:
        {}

        ### Response:
        {}"""

        EOS_TOKEN = tokenizer.eos_token

        def formatting_prompts_func(data):
            instructions = data["instruction"]
            inputs = data["input"]
            outputs = data["output"]
            texts = []

            for instruction, input, output in zip(instructions, inputs, outputs):
                text = prompt.format(instruction, input, output) + EOS_TOKEN
                texts.append(text)
            return { "text": texts, }


        output_filename = f'model_{filename}'
        output_file_path = os.path.join(model_dir, output_filename)

        # Salva o dataset processado e limpo em um arquivo JSON
        #df = pd.DataFrame(data)
        #print(f'Arquivo limpo salvo em: {output_file_path}')
        #dataset = load_dataset("json", data_files=f"{drive}/datasets/model/model_{i}_formatted.json", split = "train")
        dataset = formatting_prompts_func(data)
        df = pd.DataFrame(dataset)


        df.to_json(output_file_path, orient='records', lines=True, compression='gzip')
        dataset = Dataset.from_pandas(df)

        # Divida o Dataset em conjuntos de treinamento e teste
        split_dataset = dataset.train_test_split(test_size=0.1, seed=42)

        # Defina os caminhos de saída
        output_filename_train = f'model_train_{filename}'
        output_file_path_train = os.path.join(model_dir, output_filename_train)

        output_filename_test = f'model_test_{filename}'
        output_file_path_test = os.path.join(model_dir, output_filename_test)

        # Salve o Dataset de treinamento em JSON comprimido
        train_dataset = split_dataset['train']
        train_df = train_dataset.to_pandas()
        train_df.to_json(output_file_path_train, orient='records', lines=True, compression='gzip')

        # Salve o Dataset de teste em JSON comprimido
        test_dataset = split_dataset['test']
        test_df = test_dataset.to_pandas()
        test_df.to_json(output_file_path_test, orient='records', lines=True, compression='gzip')



        print(f'Arquivo model salvo em: {output_file_path}')
        # with open(f'{model_dir}{output_file_path}.json', 'w') as f:
        #     json.dump(dataset.to_dict(), f, indent=4)




Arquivo model salvo em: /content/drive/MyDrive/FIAP/content/datasets/model/model_cleaned_chunk_0.json.gz
Arquivo model salvo em: /content/drive/MyDrive/FIAP/content/datasets/model/model_cleaned_chunk_1.json.gz
Arquivo model salvo em: /content/drive/MyDrive/FIAP/content/datasets/model/model_cleaned_chunk_2.json.gz
Arquivo model salvo em: /content/drive/MyDrive/FIAP/content/datasets/model/model_cleaned_chunk_3.json.gz
Arquivo model salvo em: /content/drive/MyDrive/FIAP/content/datasets/model/model_cleaned_chunk_4.json.gz
Arquivo model salvo em: /content/drive/MyDrive/FIAP/content/datasets/model/model_cleaned_chunk_5.json.gz
Arquivo model salvo em: /content/drive/MyDrive/FIAP/content/datasets/model/model_cleaned_chunk_6.json.gz
Arquivo model salvo em: /content/drive/MyDrive/FIAP/content/datasets/model/model_cleaned_chunk_7.json.gz
Arquivo model salvo em: /content/drive/MyDrive/FIAP/content/datasets/model/model_cleaned_chunk_8.json.gz
Arquivo model salvo em: /content/drive/MyDrive/FIAP/con

#CRIAR OS CHUNKS DE TREINO TOKENIZADO UTILIZANDO OS CHUNKS SPLITADOS EM TRAIN E TEST

In [11]:

import os
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import TrainingArguments
from trl import SFTTrainer

drive = '/content/drive/MyDrive/FIAP'
model_dir = f'{drive}/content/datasets/model/'

treino_dir = f'{drive}/content/datasets/treino/'

# Cria o diretório de treino se não existir
os.makedirs(treino_dir, exist_ok=True)

# Configurações do Trainer (ajuste conforme necessário)
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=60,
    learning_rate=2e-4,
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    output_dir="outputs"
)



all_files = os.listdir(model_dir)

# Filtra os arquivos que começam com 'model_' e terminam com '.json.gz'
filtered_files = [f for f in all_files if f.startswith('model_') and f.endswith('.json.gz')]

# Conta o número total de arquivos
total_files = len(filtered_files)

# Calcula o número de chunks (cada chunk tem 3 arquivos)
num_chunks = total_files // 3  # Divida por 3 porque cada chunk tem 3 arquivos
for i in range(num_chunks):
#for i in range(num_chunks):  # Ajuste `num_chunks` conforme o número de chunks disponíveis
    # Define paths
    train_file = f'{model_dir}model_train_cleaned_chunk_{i}.json.gz'
    test_file = f'{model_dir}model_test_cleaned_chunk_{i}.json.gz'



    # Carrega datasets
    train_df = pd.read_json(train_file, compression='gzip', lines=True)
    test_df = pd.read_json(test_file, compression='gzip', lines=True)

    train_dataset = Dataset.from_pandas(train_df)
    test_dataset = Dataset.from_pandas(test_df)

    print(train_dataset)

    # Instancia o Trainer
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        dataset_text_field="text",
        max_seq_length=max_seq_length,
        packing=False,
        dataset_num_proc=1,
        args=training_args
    )

    # Treine o modelo
    trainer.train()

    # Salve o modelo e o tokenizer
    model_save_path = f'{treino_dir}model_chunk_{i}/'
    os.makedirs(model_save_path, exist_ok=True)
    model.save_pretrained(model_save_path)
    tokenizer.save_pretrained(model_save_path)

    print(f'Modelo e tokenizer salvos em: {model_save_path}')

    # Se desejar, você pode adicionar uma etapa de limpeza aqui para liberar memória ou outros procedimentos necessários

print("Treinamento concluído para todos os chunks.")


Dataset({
    features: ['text'],
    num_rows: 34582
})


Map:   0%|          | 0/34582 [00:00<?, ? examples/s]

Map:   0%|          | 0/3843 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 34,582 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,4.8742
2,4.9014
3,4.9759
4,4.8916
5,4.8231
6,4.8304
7,4.3367
8,4.2208
9,4.051
10,4.0906


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_0/
Dataset({
    features: ['text'],
    num_rows: 29860
})


Map:   0%|          | 0/29860 [00:00<?, ? examples/s]

Map:   0%|          | 0/3318 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 29,860 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,3.5097
2,3.3429
3,3.4902
4,3.5616
5,3.6068
6,2.6842
7,3.4168
8,3.4011
9,3.27
10,3.3273


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_1/
Dataset({
    features: ['text'],
    num_rows: 27369
})


Map:   0%|          | 0/27369 [00:00<?, ? examples/s]

Map:   0%|          | 0/3042 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 27,369 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,3.2556
2,3.7247
3,2.7289
4,3.3701
5,2.796
6,2.7099
7,2.9185
8,3.1975
9,2.7623
10,3.844


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_2/
Dataset({
    features: ['text'],
    num_rows: 21759
})


Map:   0%|          | 0/21759 [00:00<?, ? examples/s]

Map:   0%|          | 0/2418 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 21,759 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,3.3523
2,3.1859
3,2.9221
4,2.9783
5,2.8827
6,2.9318
7,2.5078
8,2.9151
9,2.6338
10,2.6269


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_3/
Dataset({
    features: ['text'],
    num_rows: 29053
})


Map:   0%|          | 0/29053 [00:00<?, ? examples/s]

Map:   0%|          | 0/3229 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 29,053 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,3.5446
2,2.7679
3,3.0778
4,3.009
5,2.9806
6,3.2286
7,2.8723
8,3.6082
9,2.7282
10,2.324


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_4/
Dataset({
    features: ['text'],
    num_rows: 27512
})


Map:   0%|          | 0/27512 [00:00<?, ? examples/s]

Map:   0%|          | 0/3057 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 27,512 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,2.9779
2,3.0802
3,3.1467
4,2.8699
5,3.3717
6,2.8351
7,2.8909
8,3.0852
9,2.4698
10,3.2507


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_5/
Dataset({
    features: ['text'],
    num_rows: 23394
})


Map:   0%|          | 0/23394 [00:00<?, ? examples/s]

Map:   0%|          | 0/2600 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 23,394 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,2.8625
2,3.1917
3,2.9356
4,2.769
5,2.4329
6,2.3483
7,2.9977
8,2.4834
9,3.3691
10,2.8719


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_6/
Dataset({
    features: ['text'],
    num_rows: 36643
})


Map:   0%|          | 0/36643 [00:00<?, ? examples/s]

Map:   0%|          | 0/4072 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 36,643 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,2.4045
2,2.5721
3,2.3038
4,2.5695
5,2.6372
6,2.5045
7,2.7609
8,2.6526
9,2.5544
10,2.3432


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_7/
Dataset({
    features: ['text'],
    num_rows: 36108
})


Map:   0%|          | 0/36108 [00:00<?, ? examples/s]

Map:   0%|          | 0/4013 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 36,108 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,2.3965
2,2.265
3,2.4028
4,1.8473
5,1.9735
6,2.0916
7,2.3444
8,2.0685
9,2.5869
10,2.6552


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_8/
Dataset({
    features: ['text'],
    num_rows: 31283
})


Map:   0%|          | 0/31283 [00:00<?, ? examples/s]

Map:   0%|          | 0/3476 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 31,283 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,2.4143
2,2.735
3,2.7995
4,2.299
5,2.4069
6,1.8426
7,2.0911
8,2.363
9,2.7564
10,2.2055


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_9/
Dataset({
    features: ['text'],
    num_rows: 31056
})


Map:   0%|          | 0/31056 [00:00<?, ? examples/s]

Map:   0%|          | 0/3451 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 31,056 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,2.2925
2,2.0968
3,1.6698
4,2.1917
5,2.012
6,2.0844
7,2.4223
8,2.1271
9,1.9584
10,1.8208


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_10/
Dataset({
    features: ['text'],
    num_rows: 30922
})


Map:   0%|          | 0/30922 [00:00<?, ? examples/s]

Map:   0%|          | 0/3436 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 30,922 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,2.4064
2,2.2527
3,2.3425
4,2.0236
5,2.0666
6,2.2093
7,1.7061
8,2.1314
9,2.9044
10,2.5506


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_11/
Dataset({
    features: ['text'],
    num_rows: 29841
})


Map:   0%|          | 0/29841 [00:00<?, ? examples/s]

Map:   0%|          | 0/3316 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 29,841 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,2.3542
2,2.563
3,2.7705
4,2.4398
5,2.3128
6,2.2679
7,2.1136
8,2.095
9,2.0994
10,2.2602


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_12/
Dataset({
    features: ['text'],
    num_rows: 29649
})


Map:   0%|          | 0/29649 [00:00<?, ? examples/s]

Map:   0%|          | 0/3295 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 29,649 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,1.9769
2,2.2421
3,2.232
4,2.3725
5,2.5932
6,1.9108
7,2.1826
8,2.0859
9,2.0341
10,2.8786


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_13/
Dataset({
    features: ['text'],
    num_rows: 25618
})


Map:   0%|          | 0/25618 [00:00<?, ? examples/s]

Map:   0%|          | 0/2847 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 25,618 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,2.4729
2,2.3655
3,2.0925
4,1.7554
5,2.3518
6,1.9276
7,1.734
8,2.323
9,2.7636
10,2.8405


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_14/
Dataset({
    features: ['text'],
    num_rows: 27952
})


Map:   0%|          | 0/27952 [00:00<?, ? examples/s]

Map:   0%|          | 0/3106 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 27,952 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,1.8618
2,2.47
3,2.1301
4,2.3995
5,2.036
6,2.7969
7,2.4455
8,2.1191
9,1.9542
10,2.2491


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_15/
Dataset({
    features: ['text'],
    num_rows: 25848
})


Map:   0%|          | 0/25848 [00:00<?, ? examples/s]

Map:   0%|          | 0/2872 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 25,848 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,2.6333
2,2.4403
3,2.0944
4,2.0041
5,2.1681
6,1.9919
7,2.5247
8,2.5217
9,2.3893
10,2.5711


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_16/
Dataset({
    features: ['text'],
    num_rows: 22227
})


Map:   0%|          | 0/22227 [00:00<?, ? examples/s]

Map:   0%|          | 0/2470 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 22,227 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,2.1393
2,2.146
3,2.2802
4,2.729
5,2.7097
6,2.786
7,2.3815
8,2.2005
9,2.1079
10,2.3755


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_17/
Dataset({
    features: ['text'],
    num_rows: 25641
})


Map:   0%|          | 0/25641 [00:00<?, ? examples/s]

Map:   0%|          | 0/2850 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 25,641 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,2.8473
2,2.0652
3,2.1876
4,2.0054
5,2.0105
6,2.0678
7,2.0223
8,2.5328
9,2.3026
10,2.6423


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_18/
Dataset({
    features: ['text'],
    num_rows: 21690
})


Map:   0%|          | 0/21690 [00:00<?, ? examples/s]

Map:   0%|          | 0/2411 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 21,690 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,2.2484
2,2.2574
3,2.0809
4,2.5976
5,2.7066
6,2.3637
7,2.3187
8,2.4644
9,2.0272
10,2.3446


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_19/
Dataset({
    features: ['text'],
    num_rows: 21864
})


Map:   0%|          | 0/21864 [00:00<?, ? examples/s]

Map:   0%|          | 0/2430 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 21,864 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,2.1245
2,2.083
3,2.612
4,2.3898
5,2.2986
6,2.5578
7,1.761
8,2.5256
9,2.7926
10,2.2574


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_20/
Dataset({
    features: ['text'],
    num_rows: 21205
})


Map:   0%|          | 0/21205 [00:00<?, ? examples/s]

Map:   0%|          | 0/2357 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 21,205 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,2.003
2,2.1676
3,2.4185
4,2.1665
5,2.2642
6,2.2689
7,2.6881
8,2.9117
9,2.5733
10,2.2441


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_21/
Dataset({
    features: ['text'],
    num_rows: 22496
})


Map:   0%|          | 0/22496 [00:00<?, ? examples/s]

Map:   0%|          | 0/2500 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 22,496 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 19,660,800


Step,Training Loss
1,2.5628
2,2.1111
3,2.0559
4,1.915
5,2.8613
6,3.2853
7,2.0053
8,2.7443
9,2.3442
10,2.9605


Modelo e tokenizer salvos em: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_22/
Treinamento concluído para todos os chunks.


#CONCATENAR OS CHUNKS TREINO E DEIXAR 1 MODEL APENAS PARA A INFERÊNCIA

In [14]:
import os
import json

drive = '/content/drive/MyDrive/FIAP'
treino_dir = f'{drive}/content/datasets/treino/'
full_dir = f'{drive}/content/datasets/treino_full/'


# Cria o diretório de treino se não existir
os.makedirs(full_dir, exist_ok=True)
# Defina os nomes dos arquivos JSON a serem mesclados
json_files = ['special_tokens_map.json', 'tokenizer.json', 'tokenizer_config.json']

def merge_json_files(json_paths):
    merged_data = {}
    for json_path in json_paths:
        with open(json_path, 'r') as file:
            data = json.load(file)
            merged_data = dict(merged_data, **data)  # Mesclar os dados, substituindo chaves se necessário
    return merged_data

def process_chunks(chunk_dirs):
    for chunk_dir in chunk_dirs:
        chunk_path = os.path.join(treino_dir, chunk_dir)
        if os.path.isdir(chunk_path):
            json_paths = [os.path.join(chunk_path, file) for file in json_files]
            if all(os.path.exists(path) for path in json_paths):
                # Mesclar arquivos JSON do chunk
                merged_data = merge_json_files(json_paths)

                # Crie o diretório de saída se não existir
                os.makedirs(full_dir, exist_ok=True)

                # Salvar arquivos mesclados com sufixo _complete no novo diretório
                for json_file in json_files:
                    complete_path = os.path.join(full_dir, f'{json_file.replace(".json", "_complete.json")}')
                    with open(complete_path, 'w') as file:
                        json.dump(merged_data, file, indent=4)

                print(f'Merged and saved: {chunk_path}')
            else:
                print(f'Missing one or more JSON files in {chunk_path}')

# Obtenha a lista de diretórios de chunks
chunk_dirs = [d for d in os.listdir(treino_dir) if os.path.isdir(os.path.join(treino_dir, d))]

# Processar todos os chunks
process_chunks(chunk_dirs)

print("Mesclagem concluída. Arquivos salvos em:", full_dir)


Merged and saved: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_0
Merged and saved: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_1
Merged and saved: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_2
Merged and saved: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_3
Merged and saved: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_4
Merged and saved: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_5
Merged and saved: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_6
Merged and saved: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_7
Merged and saved: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_8
Merged and saved: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_9
Merged and saved: /content/drive/MyDrive/FIAP/content/datasets/treino/model_chunk_10
Merged and saved: /content/drive/MyDrive/FIAP/content/datasets/treino/model

In [None]:
# eval_results = trainer.evaluate()
# print(f"Acurácia do modelo: {eval_results.get('eval_accuracy', 'Não disponível'):.2f}")

## Inferência com o Modelo Treinado

Realiza a inferência com o modelo treinado, gerando uma descrição do produto de entrada fornecido e exibindo a saída gerada.

In [None]:
from transformers import TextStreamer

drive = '/content/drive/MyDrive/FIAP'
full_dir = f'{drive}/content/datasets/treino_full/'
prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = full_dir,
        max_seq_length = 2048,
        dtype = None,
        load_in_4bit = True,
    )
    FastLanguageModel.for_inference(model)

inputs = tokenizer([
    prompt.format(
        "DESCRIBE THIS PRODUCT.",
        "Apple A Day: The Myths, Misconceptions and Truths About the Foods We Eat",
        "",
    )
], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
outputs = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)
# tokenizer.batch_decode(outputs)