# 🧠 LLM Training - Guarani → Portuguese Translation
This notebook trains a small LLM (GPT-2) to translate Guarani words into Portuguese.
The steps are: loading the dataset, descriptive statistics, data augmentation, tokenization, training, and testing.

In [11]:
# ✅ Imports
import pandas as pd
import random
import json
import torch
from transformers import (
    GPT2LMHeadModel, GPT2Tokenizer,
    Trainer, TrainingArguments,
    DataCollatorForLanguageModeling
)
from datasets import Dataset
from google.colab import drive
# ✅ Load data from Google Drive
def carregar_dados():
    drive.mount('/content/drive')
    df = pd.read_json("/content/drive/MyDrive/Doutorado Unesp/assistente-guarani/data/dados_treinamento_guarani.json")
    print(f"📊 Total de exemplos no dataset: {len(df)}")
    display(df.head())
    return df

df = carregar_dados()
# ✅ Statistics
inputs = df['input'].tolist()
outputs = df['output'].tolist()
print(f"📈 Comprimento médio dos inputs: {sum(len(i) for i in inputs)/len(inputs):.2f} caracteres")
print(f"📈 Comprimento médio dos outputs: {sum(len(o) for o in outputs)/len(outputs):.2f} caracteres")
# ✅ Data augmentation: instruction templates for each translation direction
instructions_pt = [
    "Traduza para o português:", "Tradução em português:", "O que significa em português:",
    "Traduza do Guarani para o Português:", "Como se diz em português:", "Converta para o português:",
    "Passe para o português:", "A tradução portuguesa é:"
]
instructions_gua = [
    "Traduza para o guarani:", "Versão em Guarani:", "O que significa em Guarani:",
    "Tradução em Guarani:", "Como se diz em Guarani:", "Com respeito à cultura Guarani",
    "Em guarani, significa:",
]

augmented_data = []

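# Each row yields 5 forward and 5 inverse examples. The forward templates assume
# row["input"] is Guarani and row["output"] is Portuguese; the rows printed by
# df.head() have Portuguese inputs, so the direction is worth checking per row.
for _, row in df.iterrows():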
    for _ in range(5):
        inst = random.choice(instructions_pt)
        augmented_data.append({
            "instruction": inst,
            "input": row["input"],
            "output": row["output"]
        })
        inst_inv = random.choice(instructions_gua)
        augmented_data.append({
            "instruction": inst_inv,
            "input": row["output"],
            "output": row["input"]
        })

print(f"✅ Dataset aumentado: {len(augmented_data)} exemplos")
# ✅ Tokenization
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
📊 Total de exemplos no dataset: 648


Unnamed: 0,instruction,input,output
0,Como se diz 'filho (do pai)' em Guarani?,filho (do pai),txera’y
1,Como se diz 'filha (da mãe)' em Guarani?,filha (da mãe),txememby
2,Como se diz 'filha (do pai)' em Guarani?,filha (do pai),txeradjy
3,Como se diz 'filho de criação' em Guarani?,filho de criação,txera’y rami
4,Como se diz 'filha de criação' em Guarani?,filha de criação,txeradjy rami


📈 Comprimento médio dos inputs: 13.13 caracteres
📈 Comprimento médio dos outputs: 16.94 caracteres
✅ Dataset aumentado: 6480 exemplos
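
The augmentation above draws templates with an unseeded `random.choice` and runs before any train/test split, so every source row ends up in the training data. Below is a minimal sketch of one alternative: hold out some source rows first, then augment only the rest with a seeded generator. The helper `dividir_e_aumentar`, its parameters, and the placeholder rows are assumptions for illustration and are not part of the notebook.

```python
import random

def dividir_e_aumentar(rows, templates_fwd, templates_inv,
                       n_variantes=5, frac_teste=0.1, seed=42):
    """Hypothetical helper: hold out rows first, then augment only the remainder."""
    rng = random.Random(seed)  # local seeded RNG, so reruns give the same result
    rows = list(rows)
    rng.shuffle(rows)
    n_teste = max(1, int(len(rows) * frac_teste))
    teste, treino = rows[:n_teste], rows[n_teste:]

    aumentado = []
    for row in treino:
        for _ in range(n_variantes):
            # forward direction: input -> output
            aumentado.append({"instruction": rng.choice(templates_fwd),
                              "input": row["input"], "output": row["output"]})
            # inverse direction: output -> input
            aumentado.append({"instruction": rng.choice(templates_inv),
                              "input": row["output"], "output": row["input"]})
    return aumentado, teste

# Placeholder rows, for illustration only
rows = [{"input": f"src{i}", "output": f"tgt{i}"} for i in range(20)]
treino_aumentado, teste = dividir_e_aumentar(rows, ["A:"], ["B:"])
print(len(treino_aumentado), len(teste))  # 180 2
```

Splitting before augmenting keeps the held-out rows, in both directions, out of the training examples.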


In [12]:

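def format_and_tokenize(example):
    # Build one training string: "<instruction> <input> <output>"
    prompt = example["instruction"] + " " + example["input"]
    target = example["output"]
    full_text = prompt + " " + target
    # Pad or truncate every example to 128 tokens
    tokens = tokenizer(full_text, truncation=True, padding="max_length", max_length=128)
    # Causal LM: labels are a copy of the inputs (the model shifts them internally)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens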

dataset = Dataset.from_list(augmented_data)
dataset_tokenized = dataset.map(format_and_tokenize)
# ✅ Training
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

training_args = TrainingArguments(
    output_dir="./guarani_model",
    per_device_train_batch_size=8,
    num_train_epochs=5,
    logging_steps=10,
    save_steps=100,
    save_total_limit=1,
    logging_dir="./logs",
    run_name="guarani_training_run",  # Add unique run_name to avoid the warning
    report_to="none"  # Disable W&B and other logging integrations
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_tokenized,
    tokenizer=tokenizer,
    data_collator=data_collator
)

trainer.train()
trainer.save_model("/content/drive/MyDrive/Doutorado Unesp/assistente-guarani/data/guarani_gpt2")





Step,Training Loss
10,5.151
20,3.9999
30,3.8361
40,3.3511
50,3.1123
60,2.9546
70,2.8558
80,2.8563
90,2.6515
100,2.6717
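
In the training cell, `labels` is a straight copy of the padded `input_ids`, and the pad token shares its id with EOS. As far as I can tell, `DataCollatorForLanguageModeling(mlm=False)` rebuilds the labels from `input_ids` and sets every position equal to `pad_token_id` to -100, which would also hide EOS from the loss, so the model may never be trained to stop after the translation. Below is a minimal sketch, on plain lists of token ids, of labels that ignore the prompt and the padding but keep one EOS after the target. The names `montar_exemplo` and `IGNORAR` and the ids in the usage line are illustrative assumptions; -100 is the default ignore index of PyTorch's cross-entropy loss.

```python
IGNORAR = -100  # positions with this label are skipped by the loss

def montar_exemplo(prompt_ids, target_ids, eos_id, pad_id, max_length):
    """Hypothetical helper: returns input_ids, attention_mask and labels of equal length."""
    ids = (list(prompt_ids) + list(target_ids) + [eos_id])[:max_length]
    # Prompt positions are ignored; target tokens and the final EOS are learned
    labels = ([IGNORAR] * len(prompt_ids) + list(target_ids) + [eos_id])[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
        "labels": labels + [IGNORAR] * n_pad,  # padding is ignored as well
    }

exemplo = montar_exemplo([10, 11, 12], [20, 21], eos_id=99, pad_id=0, max_length=8)
print(exemplo["labels"])  # [-100, -100, -100, 20, 21, 99, -100, -100]
```

Labels built this way would need a collator that leaves `labels` untouched, for example `default_data_collator` from transformers; that pairing is an assumption, not something the notebook does.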


In [13]:
# ✅ Model tests
def testar_modelo(prompt_input):
    input_text = "Traduza para o português: " + prompt_input
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)
    # Sampling is enabled, so the output changes from run to run
    output = model.generate(input_ids, max_new_tokens=30, do_sample=True)
    print(f"> Input: {prompt_input}")
    print("> Tradução:", tokenizer.decode(output[0], skip_special_tokens=True))

# Test examples
testar_modelo("Djety-mbowé")
testar_modelo("Kamby")
testar_modelo("Eí")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


> Input: Djety-mbowé
> Tradução: Traduza para o português: Djety-mbowé banha! Banha ky-re'ẽ! Vai com Deus Petapeó larva do che! Vai com Deus




> Input: Kamby
> Tradução: Traduza para o português: Kamby leite Kamby leite Como viviam Nhaningarekhe'ẽ. Como viviam Nhe'
> Input: Eí
> Tradução: Traduza para o português: Eí mel de abelhas até. Vicho animal aldeia Nossa flor é muito bela. Upegwi adj
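
`generate` returns the prompt followed by the continuation, which is why each output line above begins by echoing the prompt. Below is a small sketch for keeping only the continuation. The helper `extrair_continuacao` is a hypothetical name; it strips the prompt only when it is a prefix, whereas `str.replace` would also delete any later occurrence of the same text.

```python
def extrair_continuacao(texto_gerado: str, prompt: str) -> str:
    """Hypothetical helper: drop the echoed prompt from the start of the decoded text."""
    if texto_gerado.startswith(prompt):
        texto_gerado = texto_gerado[len(prompt):]
    return texto_gerado.strip()

saida = "Traduza para o português: Kamby leite Kamby leite"
print(extrair_continuacao(saida, "Traduza para o português: Kamby"))  # leite Kamby leite
```

Slicing the generated token ids at the prompt length before decoding would be an equivalent option.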


In [14]:
#!pip install evaluate sacrebleu
import evaluate
import numpy as np

# Load the metric
bleu = evaluate.load("sacrebleu")

# Generate predictions for the evaluation sample
# (drawn from the same df used for training, so this is not a held-out test set)
samples = df.sample(50, random_state=42)  # or use part of `augmented_data`

referencias = []
predicoes = []

for _, row in samples.iterrows():
    entrada = row["input"]
    referencia = row["output"]

    input_text = "Traduza para o português: " + entrada
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)
    output = model.generate(input_ids, max_new_tokens=30, do_sample=False)
    saida = tokenizer.decode(output[0], skip_special_tokens=True)

    # Post-processing: keep only the translation
    traducao = saida.replace(input_text, "").strip()

    referencias.append([referencia])
    predicoes.append(traducao)

# Compute BLEU
bleu_result = bleu.compute(predictions=predicoes, references=referencias)
print(f"📘 BLEU score: {bleu_result['score']:.2f}")


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

📘 BLEU score: 7.86
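
BLEU is built on n-gram overlap (up to 4-grams by default in sacreBLEU), so it tends to be harsh and noisy when references are only one or two words long, like the rows shown earlier. Below is a sketch of two simpler complements using only the standard library: exact match after normalization, and a character-level similarity from `difflib`. The helper names and the normalization rule are assumptions, and both functions take flat lists of strings rather than the nested reference lists passed to `bleu.compute`.

```python
import difflib
import unicodedata

def normalizar(texto):
    """Unicode-normalize, lowercase and collapse whitespace before comparing."""
    texto = unicodedata.normalize("NFC", texto).casefold()
    return " ".join(texto.split())

def exact_match(predicoes, referencias):
    """Fraction of predictions identical to their reference after normalization."""
    acertos = sum(normalizar(p) == normalizar(r) for p, r in zip(predicoes, referencias))
    return acertos / len(referencias)

def similaridade_media(predicoes, referencias):
    """Mean difflib ratio (0 to 1) between each prediction and its reference."""
    ratios = [difflib.SequenceMatcher(None, normalizar(p), normalizar(r)).ratio()
              for p, r in zip(predicoes, referencias)]
    return sum(ratios) / len(ratios)

preds = ["Leite", "mel"]
refs = ["leite", "mel de abelhas"]
print(exact_match(preds, refs))  # 0.5
print(similaridade_media(preds, refs))
```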


In [15]:
def calcular_perplexity(model, dataset_tokenized, n=100):
    model.eval()
    losses = []
    # Convert to a dict with one list per key
    batches = dataset_tokenized.select(range(n)).to_dict()
    for i in range(n):
        input_ids = batches['input_ids'][i]
        labels = batches['labels'][i]
        inputs = torch.tensor([input_ids]).to(model.device)
        # Labels still include the padding positions (pad token == EOS),
        # so the loss is averaged over the full padded sequence
        labels = torch.tensor([labels]).to(model.device)
        with torch.no_grad():
            outputs = model(inputs, labels=labels)
        losses.append(outputs.loss.item())
    mean_loss = np.mean(losses)
    perplexity = np.exp(mean_loss)
    print(f"🔢 Perplexity: {perplexity:.2f}")
    return perplexity

# Call:
calcular_perplexity(model, dataset_tokenized)


🔢 Perplexity: 77259878116.51


np.float64(77259878116.50635)
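
Perplexity is the exponential of the mean negative log-likelihood per token, so the result depends on which tokens enter the mean. In the cell above the labels passed to the model are the padded `input_ids`, so most positions in each 128-token example are padding, and that may be what pushes the value so high. Below is a sketch with made-up per-token losses that shows the effect of averaging only over real tokens. The function `perplexidade` and the numbers are illustrative, not measurements from this model.

```python
import math

def perplexidade(losses_por_token, mascara):
    """exp(mean negative log-likelihood) over positions where mascara == 1."""
    validos = [loss for loss, m in zip(losses_por_token, mascara) if m == 1]
    return math.exp(sum(validos) / len(validos))

# Made-up losses: 3 real tokens followed by 5 padding positions predicted badly
losses = [2.0, 1.5, 2.5, 20.0, 20.0, 20.0, 20.0, 20.0]
mascara = [1, 1, 1, 0, 0, 0, 0, 0]

print(perplexidade(losses, mascara))             # real tokens only: exp(2.0)
print(perplexidade(losses, [1] * len(losses)))   # padding included: exp(13.25)
```

In the notebook's setting, the matching change would be to set `labels` to -100 wherever the attention mask is 0 before calling the model.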

In [18]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# ✅ Path to the model saved on Google Drive
caminho_modelo = "/content/drive/MyDrive/Doutorado Unesp/assistente-guarani/data/guarani_gpt2"

# ✅ Load model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(caminho_modelo)
model = GPT2LMHeadModel.from_pretrained(caminho_modelo)
model.eval()

# ✅ Move the model to the GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# ✅ Translation function
# Note: input_ids here is the global left over from an earlier cell, so this
# mask need not match the length of the prompt built inside traduzir_guarani
attention_mask = torch.ones_like(input_ids)
# Configure pad_token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

def traduzir_guarani(texto_guarani):
    prompt = "Traduza para o português: " + texto_guarani
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    output = model.generate(input_ids,
                            attention_mask=attention_mask,
                            max_length=50,
                            num_beams=4,
                            max_new_tokens=30,
                            do_sample=True,
                            early_stopping=True,
                            no_repeat_ngram_size=2,
                            temperature=0.5)
    traducao = tokenizer.decode(output[0], skip_special_tokens=True)
    return traducao

# ✅ Interactive loop
print("🔡 Digite uma palavra/frase em Guarani (ou 'sair' para encerrar):")
while True:
    entrada = input("📝 Guarani: ")
    if entrada.lower() in ["sair", "exit", "quit"]:
        print("👋 Encerrando.")
        break
    resposta = traduzir_guarani(entrada)
    print("📘 Tradução:", resposta)


🔡 Digite uma palavra/frase em Guarani (ou 'sair' para encerrar):
📝 Guarani: menino


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=30) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


📘 Tradução: Traduza para o português: menino nandadadurururanuraniruraniruiruirirunir uniriru uniru uiru Uiru Uniru Unioniru Universal
📝 Guarani: Bom dia


📘 Tradução: Traduza para o português: Bom dia j jjjajajjawajawawadawadaawdawbyawandawwereandandwereadandareandad


KeyboardInterrupt: Interrupted by user
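
The cell above builds `attention_mask` once, outside `traduzir_guarani`, from whatever `input_ids` an earlier cell left in memory, and it passes both `max_length` and `max_new_tokens`, which triggers the warning. Below is a sketch of assembling the arguments per call instead. The helper `montar_kwargs_geracao` is hypothetical; every key is an existing `generate` parameter, but the values chosen are only a starting point.

```python
def montar_kwargs_geracao(encoded, pad_token_id, max_new_tokens=15):
    """Hypothetical helper: `encoded` is the dict-like result of
    tokenizer(prompt, return_tensors="pt")."""
    return {
        "input_ids": encoded["input_ids"],
        "attention_mask": encoded["attention_mask"],  # mask built from this prompt
        "max_new_tokens": max_new_tokens,             # a single length limit
        "num_beams": 4,
        "do_sample": False,                           # deterministic beam search
        "no_repeat_ngram_size": 2,
        "early_stopping": True,
        "pad_token_id": pad_token_id,
    }

# Intended use (not executed here):
#   encoded = tokenizer(prompt, return_tensors="pt").to(device)
#   output = model.generate(**montar_kwargs_geracao(encoded, tokenizer.eos_token_id))
```

Turning sampling off keeps the beam search reproducible, which makes outputs easier to compare between runs.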

In [19]:
def traduzir_corrigido(entrada):
    input_text = f"Traduza para o português: {entrada}"
    # The tokenizer returns CPU tensors; the model was moved to `device` in the previous cell
    inputs = tokenizer(input_text, return_tensors="pt")

    output = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=15,
        temperature=1.0,
        repetition_penalty=1.5,
        no_repeat_ngram_size=2,
        do_sample=True,
        top_p=0.8
    )

    result = tokenizer.decode(output[0], skip_special_tokens=True)
    return result.replace(input_text, "").strip()

# Teste
print(traduzir_corrigido("Kamby"))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
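
The error above reports tensors on both `cuda:0` and the CPU: the model was moved to `device` in the previous cell, while the tokenizer output in `traduzir_corrigido` stays on the CPU. Below is a sketch of a helper that moves every tensor in the encoded batch to the model's device before generation. The name `mover_para` is hypothetical, and the code only assumes that each value has a `.to(device)` method, as PyTorch tensors do.

```python
def mover_para(batch, device):
    """Return a new dict with every value moved via its .to(device) method."""
    return {nome: valor.to(device) for nome, valor in batch.items()}

# Intended use inside traduzir_corrigido (not executed here):
#   inputs = mover_para(tokenizer(input_text, return_tensors="pt"), model.device)
#   output = model.generate(**inputs, max_new_tokens=15)
```

Calling `.to(model.device)` directly on the tokenizer's `BatchEncoding` should have the same effect.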