# Notebook de referência

Nome:

## Instruções:


Treinar e medir a acurácia de um modelo BERT (ou variantes) para classificação binária usando o dataset do IMDB (20k/5k amostras de treino/validação).

Importante:
- Deve-se implementar o próprio laço de treinamento.
- Implementar o acumulo de gradiente.

Dicas:
- BERT geralmente costuma aprender bem uma tarefa com poucas épocas (de 3 a 5 épocas). Se tiver demorando mais de 5 épocas para chegar em 80% de acurácia, ajuste os hiperparametros.

- Solução para erro de memória:
  - Usar bfloat16 permite quase dobrar o batch size

Opcional:
- Pode-se usar a função trainer da biblioteca Transformers/HuggingFace para verificar se seu laço de treinamento está correto. Note que ainda assim é obrigatório implementar o laço próprio.

# Fixando a seed

In [1]:
import random
import torch
import torch.nn.functional as F
import numpy as np

# model
from transformers import DistilBertTokenizer, DistilBertModel

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
random.seed(123)
np.random.seed(123)
torch.manual_seed(123)

<torch._C.Generator at 0x7fb174151b50>

## Preparando Dados

Primeiro, fazemos download do dataset:

In [3]:
!wget -nc http://files.fast.ai/data/aclImdb.tgz
!tar -xzf aclImdb.tgz

File ‘aclImdb.tgz’ already there; not retrieving.



## Carregando o dataset

Criaremos uma divisão de treino (20k exemplos) e validação (5k exemplos) artificialmente.

In [4]:
import os

max_valid = 5000

def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        with open(os.path.join(folder, path)) as f:
            texts.append(f.read())
    return texts

x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')
x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg
y_train = [True] * len(x_train_pos) + [False] * len(x_train_neg)
y_test = [True] * len(x_test_pos) + [False] * len(x_test_neg)

# Embaralhamos o treino para depois fazermos a divisão treino/valid.
c = list(zip(x_train, y_train))
random.shuffle(c)
x_train, y_train = zip(*c)

x_valid = x_train[-max_valid:]
y_valid = y_train[-max_valid:]
x_train = x_train[:-max_valid]
y_train = y_train[:-max_valid]

print(len(x_train), 'amostras de treino.')
print(len(x_valid), 'amostras de desenvolvimento.')
print(len(x_test), 'amostras de teste.')

print('3 primeiras amostras treino:')
for x, y in zip(x_train[:3], y_train[:3]):
    print(y, x[:100])

print('3 últimas amostras treino:')
for x, y in zip(x_train[-3:], y_train[-3:]):
    print(y, x[:100])

print('3 primeiras amostras validação:')
for x, y in zip(x_valid[:3], y_test[:3]):
    print(y, x[:100])

print('3 últimas amostras validação:')
for x, y in zip(x_valid[-3:], y_valid[-3:]):
    print(y, x[:100])

20000 amostras de treino.
5000 amostras de desenvolvimento.
25000 amostras de teste.
3 primeiras amostras treino:
False It really doesn't matter that Superman comic books are unbelievably naive and their target is ten ye
False I don't care how many nominations this junk got for best this and that, this movie stunk. I didn't k
True This is not my favorite WIP ("Women in Prison"), but it is one of the most famous films in the sub-g
3 últimas amostras treino:
False I recently watched this film at the 30'Th Gothenburg Film Festival, and to be honest it was on of th
True "The Gingerbread Man is the first thriller I've ever done!"  Robert Altman <br /><br />In 1955 Char
True I will begin by saying I am very pleased with this climax of the Bourne trilogy. Please, oh please d
3 primeiras amostras validação:
True In the trivia section for Pet Sematary, it mentions that George Romero (director of two Stephen King
True Watching Cliffhanger makes me nostalgic for the early '90s, a time when virtu

# Overview no dataset

In [5]:
print(y_test[1])
print(x_test[1])

True
A Must See!<br /><br />Excellent positive African-American Love Story. This movie had reminded me of watching the old black and white movies with my dad. More true to life characters looking for love, being in love, and loosing it. Old story fresh view. Larenz Tate was so Cary Grant in style as the character may have been in a clumsey situation, but the actor kept him from looking silly and like a cardboard cut out. Nia Long has always been a favorite of mine she is sweet even when she is tough, almost like a Kathrine Hepburn. This is one of his best work and showing that he is better than always playing an angry black man<br /><br />This movie is a classic, superb acting, well written, a real love story set in Chicago, what more can you ask for?<br /><br />SuperB Black Love Story<br /><br />Amsterdam, Holland


In [6]:
print(len(x_test))
print(len(x_train))

25000
20000


# Model

In [7]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
import torch

# Tokenizer do DistilBERT
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

def encode_reviews(tokenizer, reviews, max_length):
    input_ids = []
    attention_masks = []

    for review in reviews:
        encoded_review = tokenizer.encode_plus(
            review,
            max_length=max_length,
            add_special_tokens=True,
            return_token_type_ids=False,
            pad_to_max_length=True,
            return_attention_mask=True,
            return_tensors='pt',
            truncation=True
        )

        input_ids.append(encoded_review['input_ids'])
        attention_masks.append(encoded_review['attention_mask'])

    return torch.cat(input_ids, dim=0), torch.cat(attention_masks, dim=0)

# Definindo um comprimento máximo
max_length = 512

# Codificando os dados
x_train_ids, x_train_masks = encode_reviews(tokenizer, x_train, max_length)
x_valid_ids, x_valid_masks = encode_reviews(tokenizer, x_valid, max_length)

# Correção: Converte rótulos booleanos para inteiros e depois para tensores de longos
y_train = torch.tensor([int(label) for label in y_train], dtype=torch.long)
y_valid = torch.tensor([int(label) for label in y_valid], dtype=torch.long)

# Criação do TensorDataset
train_dataset = TensorDataset(x_train_ids, x_train_masks, y_train)
valid_dataset = TensorDataset(x_valid_ids, x_valid_masks, y_valid)




# Dataset em Tensor

In [8]:
# Criação do TensorDataset
train_dataset = TensorDataset(x_train_ids, x_train_masks, y_train)
valid_dataset = TensorDataset(x_valid_ids, x_valid_masks, y_valid)

# Criação do DataLoader
batch_size = 16  # Ajuste de acordo com a capacidade da sua GPU

train_dataloader = DataLoader(
    train_dataset,
    sampler=RandomSampler(train_dataset),
    batch_size=batch_size
)

validation_dataloader = DataLoader(
    valid_dataset,
    sampler=SequentialSampler(valid_dataset),
    batch_size=batch_size
)


# Chamando o modelo

In [9]:
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2,  # Para classificação binária
    output_attentions=False,
    output_hidden_states=False,
)

# Enviando modelo para GPU, se disponível
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

# Treinamento

In [11]:
from transformers import AdamW
from tqdm import tqdm

# Otimizador
optimizer = AdamW(model.parameters(), lr=2e-5)

# Número de épocas e passos de acumulação de gradiente
epochs = 4
grad_accumulation_steps = 4  # Ajuste conforme necessário

# Função de treinamento
def train(model, train_dataloader, optimizer, epochs, grad_accumulation_steps):
    model.train()
    for epoch in range(epochs):
        total_loss = 0

        # Adiciona a barra de progresso do tqdm aqui
        progress_bar = tqdm(train_dataloader, desc=f"Epoch {epoch + 1}")
        
        for step, batch in enumerate(progress_bar):
            b_input_ids, b_input_mask, b_labels = batch
            b_input_ids = b_input_ids.to(device)
            b_input_mask = b_input_mask.to(device)
            b_labels = b_labels.to(device)

            outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)
            loss = outputs[0]
            total_loss += loss.item()

            loss.backward()

            if (step + 1) % grad_accumulation_steps == 0:
                optimizer.step()
                model.zero_grad()

            # Atualiza a barra de progresso com informações da perda atual
            progress_bar.set_postfix({'loss': total_loss / (step + 1)})

        print(f'Epoch {epoch + 1} | Average Loss: {total_loss / len(train_dataloader)}')

# Chamada da função de treinamento
train(model, train_dataloader, optimizer, epochs, grad_accumulation_steps)


Epoch 1: 100%|██████████| 1250/1250 [05:30<00:00,  3.78it/s, loss=0.236]


Epoch 1 | Average Loss: 0.23572263590097428


Epoch 2: 100%|██████████| 1250/1250 [05:31<00:00,  3.77it/s, loss=0.157]


Epoch 2 | Average Loss: 0.15693349307477475


Epoch 3: 100%|██████████| 1250/1250 [05:31<00:00,  3.77it/s, loss=0.0955]


Epoch 3 | Average Loss: 0.09549005347788334


Epoch 4: 100%|██████████| 1250/1250 [05:31<00:00,  3.77it/s, loss=0.0583]

Epoch 4 | Average Loss: 0.058291692516207694





# Avaliação

In [12]:
def evaluate(model, validation_dataloader):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for batch in validation_dataloader:
            b_input_ids, b_input_mask, b_labels = batch
            b_input_ids = b_input_ids.to(device)
            b_input_mask = b_input_mask.to(device)
            b_labels = b_labels.to(device)

            outputs = model(b_input_ids, attention_mask=b_input_mask)
            predictions = torch.argmax(outputs[0], dim=1)
            correct += torch.sum(predictions == b_labels).item()
            total += b_labels.size(0)

    return correct / total

accuracy = evaluate(model, validation_dataloader)
print(f'Accuracy on validation set: {accuracy}')


Accuracy on validation set: 0.922
