## SETUP

In [None]:
!pip install --upgrade pip
!pip install -r requirements.txt
!pip install -r other-requirements.txt

Looking in indexes: https://download.pytorch.org/whl/cu129


In [None]:
import kagglehub
import numpy as np
import pandas as pd
import torch
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    pipeline,
)

### ENV VARIABLES

In [None]:
AMOUNT_OF_REVIEWS_TO_CLASSIFY = 300
NUM_TRAIN_EPOCHS = 3  # 2 or 3 epochs
WARMUP_STEPS = 500  # Warmup steps for the learning-rate scheduler
WEIGHT_DECAY = 0.01  # Weight decay for the optimizer
TRAIN_SAMPLE_SIZE = 4000  # Number of reviews sampled for fine-tuning (before the train/test split)
MODEL_CHECKPOINT = "distilbert-base-uncased"  # Lightweight model for fine-tuning: DistilBERT trains quickly and performs well
KAGGLE_DATASET = "lakshmi25npathi/imdb-dataset-of-50k-movie-reviews"  # Kaggle dataset

### DATASET - IMDB MOVIE REVIEWS TO CLASSIFY

In [None]:
# Download dataset IMDB reviews
path = kagglehub.dataset_download(KAGGLE_DATASET)

print("Path to dataset files:", path)
df = pd.read_csv(path + "/IMDB Dataset.csv")    
df.head(20)


Path to dataset files: C:\Users\rafas\.cache\kagglehub\datasets\lakshmi25npathi\imdb-dataset-of-50k-movie-reviews\versions\1


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


### SET GPU CARD

In [None]:
!nvidia-smi
print(f"GPU available? {torch.cuda.is_available()}")
print(f"GPU name: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")

# Use the first GPU ("cuda:0") when CUDA is available; otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

print(f"Selected device: {device}")

Fri Nov 14 01:39:39 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 577.00                 Driver Version: 577.00         CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4050 ...  WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   45C    P3             14W /   60W |       0MiB /   6141MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli",
                      device=device)  # Use GPU if available




Device set to use cuda:0


#### USE THE BART LARGE MNLI MODEL TO CLASSIFY MOVIE REVIEWS AS POSITIVE/NEGATIVE

In [6]:
model_classification_results = []

for index, row in df.head(AMOUNT_OF_REVIEWS_TO_CLASSIFY).iterrows():
    sequence_to_classify = row['review']
    candidate_labels = ['positive', 'negative']
    result = classifier(sequence_to_classify, candidate_labels)
    print(f"Review {index+1} classification:")
    print(result['labels'][0], "with score", result['scores'][0])
    
    model_classification_results.append({
        'review': sequence_to_classify,
        'predicted_label': result['labels'][0],
        'score': result['scores'][0]
    })


Review 1 classification:
positive with score 0.5315369367599487
Review 2 classification:
positive with score 0.9874507784843445
Review 3 classification:
positive with score 0.9743620753288269
Review 4 classification:
negative with score 0.9753220081329346
Review 5 classification:
positive with score 0.8379409313201904
Review 6 classification:
positive with score 0.9739865660667419


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Review 7 classification:
positive with score 0.7787741422653198
Review 8 classification:
negative with score 0.9267164468765259
Review 9 classification:
negative with score 0.8084461092948914
Review 10 classification:
positive with score 0.9694986343383789
Review 11 classification:
negative with score 0.8875367045402527
Review 12 classification:
negative with score 0.5397162437438965
Review 13 classification:
negative with score 0.8973174691200256
Review 14 classification:
negative with score 0.9752565622329712
Review 15 classification:
positive with score 0.9951791167259216
Review 16 classification:
negative with score 0.9860522150993347
Review 17 classification:
negative with score 0.8709798455238342
Review 18 classification:
negative with score 0.9806430339813232
Review 19 classification:
positive with score 0.5455807447433472
Review 20 classification:
negative with score 0.9686130881309509
Review 21 classification:
positive with score 0.8618015050888062
Review 22 classification:
ne
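
The warning in the output above ("You seem to be using the pipelines sequentially on GPU") appears because the loop sends one review per call. As a minimal sketch, the reviews could be sent in chunks instead. The helper `classify_in_batches` and the stub `fake_classifier` are hypothetical names introduced only for illustration; the stub mimics the output shape of the zero-shot pipeline (a dict with `labels` and `scores` sorted by score) so the example runs without downloading a model.

```python
from typing import Callable, Dict, List, Sequence


def classify_in_batches(
    classify: Callable[[List[str], List[str]], List[Dict]],
    texts: Sequence[str],
    candidate_labels: List[str],
    batch_size: int = 8,
) -> List[Dict]:
    """Send texts to `classify` in chunks and keep the top label and score."""
    results = []
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start:start + batch_size])
        # One call per chunk instead of one call per review
        for text, output in zip(chunk, classify(chunk, candidate_labels)):
            results.append({
                "review": text,
                "predicted_label": output["labels"][0],
                "score": output["scores"][0],
            })
    return results


def fake_classifier(texts, candidate_labels):
    """Hypothetical stand-in for the pipeline: labels a text positive if it contains 'good'."""
    outputs = []
    for text in texts:
        is_positive = "good" in text.lower()
        labels = ["positive", "negative"] if is_positive else ["negative", "positive"]
        outputs.append({"sequence": text, "labels": labels, "scores": [0.9, 0.1]})
    return outputs


demo = classify_in_batches(
    fake_classifier,
    ["Good film", "Dull plot", "So good"],
    ["positive", "negative"],
    batch_size=2,
)
print([r["predicted_label"] for r in demo])
```

With the notebook's own objects, passing `classifier` and `list(df['review'].head(AMOUNT_OF_REVIEWS_TO_CLASSIFY))` to the helper should produce entries with the same keys as `model_classification_results`. How much this speeds things up depends on the GPU and on review length.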

#### LIST WRONG CLASSIFICATIONS

In [None]:
counter_incorrect = 0
for index, row in df.head(AMOUNT_OF_REVIEWS_TO_CLASSIFY).iterrows():
    predicted = model_classification_results[index]['predicted_label']
    actual_label = row['sentiment']
    if predicted != actual_label:
        # print(f"Review {index+1} - Predicted: {predicted}, Actual: {actual_label}")
        counter_incorrect += 1
        
error_percentage = (counter_incorrect / AMOUNT_OF_REVIEWS_TO_CLASSIFY) * 100
print(f"Total incorrect predictions: {counter_incorrect} - Error percentage: {error_percentage:.2f}%")


Total incorrect predictions: 30 - Error percentage: 10.00%


#### METRICS - BASELINE

In [None]:
# 1. Get the true labels (ground truth)
true_labels = list(df.head(AMOUNT_OF_REVIEWS_TO_CLASSIFY)['sentiment'])

# 2. Get the predicted labels
predicted_labels = [item['predicted_label'] for item in model_classification_results]

# 3. Classification report
print("--- ZERO-SHOT MODEL METRICS (Baseline) ---")
print(classification_report(true_labels, predicted_labels, target_names=['negative', 'positive']))

--- ZERO-SHOT MODEL METRICS (Baseline) ---
              precision    recall  f1-score   support

    negative       0.88      0.94      0.91       161
    positive       0.93      0.85      0.89       139

    accuracy                           0.90       300
   macro avg       0.90      0.90      0.90       300
weighted avg       0.90      0.90      0.90       300
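
To make the report easier to read, the sketch below recomputes precision, recall, and F1 from raw counts. The counts are an assumption: they are inferred from the rounded report together with the 30 errors reported earlier (152 of 161 negatives and 118 of 139 positives classified correctly), not read from the notebook's predictions.

```python
def per_class_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 for one class, rounded like classification_report."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": round(precision, 2),
        "recall": round(recall, 2),
        "f1": round(f1, 2),
    }


# Inferred counts (assumption, see the paragraph above)
neg_correct, neg_missed = 152, 9    # true negatives predicted negative / positive
pos_correct, pos_missed = 118, 21   # true positives predicted positive / negative

# A missed positive is a false "negative" prediction, and vice versa
negative = per_class_metrics(tp=neg_correct, fp=pos_missed, fn=neg_missed)
positive = per_class_metrics(tp=pos_correct, fp=neg_missed, fn=pos_missed)
accuracy = (neg_correct + pos_correct) / (neg_correct + neg_missed + pos_correct + pos_missed)

print("negative:", negative)
print("positive:", positive)
print(f"accuracy: {accuracy:.2f}")
```

Under these counts the zero-shot model misses positive reviews (21) more often than negative ones (9), which is what the lower recall for the positive class reflects.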



### FINE-TUNING
#### --- Data Preparation ---

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

sample_df = df.sample(n=TRAIN_SAMPLE_SIZE, random_state=42) 

# Map 'positive' to 1 and 'negative' to 0
sample_df['label_num'] = sample_df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

# Split the data into training and test sets
X = list(sample_df['review'])
y = list(sample_df['label_num'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tokenize (convert the text into numbers the model understands)
train_encodings = tokenizer(X_train, truncation=True, padding=True, max_length=256)
test_encodings = tokenizer(X_test, truncation=True, padding=True, max_length=256)
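
To illustrate what `truncation=True`, `padding=True`, and `max_length` do to a batch, here is a toy sketch in plain Python. It is not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the token ids are made up, and `pad_id=0` is an assumption for illustration. The real tokenizer also keeps special tokens in place when truncating, whereas this version simply cuts.

```python
from typing import Dict, List


def pad_and_truncate(
    batch: List[List[int]], max_length: int, pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Cut each sequence to max_length, then pad all to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)

    input_ids, attention_mask = [], []
    for ids in truncated:
        padding = width - len(ids)
        input_ids.append(ids + [pad_id] * padding)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = pad_and_truncate([[101, 7, 8, 9, 10, 102], [101, 7, 102]], max_length=4)
print(encoded["input_ids"])       # [[101, 7, 8, 9], [101, 7, 102, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Because long IMDB reviews are cut at `max_length=256` tokens, the model only ever sees the beginning of those reviews.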

In [None]:
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Collect every encoding field ('input_ids', 'attention_mask', etc.) for this example
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Add the label
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create the training and test datasets
train_dataset = IMDbDataset(train_encodings, y_train)
test_dataset = IMDbDataset(test_encodings, y_test)

#### TRAINING

In [None]:
# Load the pretrained model (with a classification "head" on top)
# num_labels=2 prepares it to classify between two classes (positive/negative)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_CHECKPOINT, num_labels=2)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=NUM_TRAIN_EPOCHS,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=WARMUP_STEPS,
    weight_decay=WEIGHT_DECAY,
    logging_dir='./logs',
    logging_steps=10
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
10,0.6899
20,0.6914
30,0.6871
40,0.69
50,0.6837
60,0.6753
70,0.6585
80,0.6267
90,0.5521
100,0.481


TrainOutput(global_step=600, training_loss=0.3375835294524829, metrics={'train_runtime': 153.8754, 'train_samples_per_second': 62.388, 'train_steps_per_second': 3.899, 'total_flos': 635843513548800.0, 'train_loss': 0.3375835294524829, 'epoch': 3.0})
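
The `global_step=600` above can be checked by hand, and it puts `WARMUP_STEPS = 500` in perspective. The sketch below assumes the `Trainer` defaults that the notebook does not override (a base learning rate of 5e-5 and a linear schedule with warmup); `linear_warmup_lr` is a hypothetical helper that mirrors that schedule, not the library's code.

```python
import math

TRAIN_ROWS = 3200        # 4000 sampled reviews minus the 800-review test split
BATCH_SIZE = 16          # per_device_train_batch_size, single GPU
NUM_TRAIN_EPOCHS = 3
WARMUP_STEPS = 500
BASE_LR = 5e-5           # assumption: Trainer default learning rate

steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_TRAIN_EPOCHS
warmup_share = WARMUP_STEPS / total_steps


def linear_warmup_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear ramp from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"share of training spent in warmup: {warmup_share:.0%}")
for step in (10, 100, 250, 500, 600):
    lr = linear_warmup_lr(step, BASE_LR, WARMUP_STEPS, total_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

Under these assumptions, the learning rate is still ramping up for most of the run. That may explain why the logged loss stays near 0.69 (chance level for two classes) during the first steps. A smaller warmup value would be worth trying, although the notebook does not test this.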

### METRICS - AFTER FINE-TUNING

In [None]:
# 1. Run predictions on the test set
predictions = trainer.predict(test_dataset)

# 2. Predictions come out as logits; take the argmax to get the final label (0 or 1)
predicted_labels_tuned = np.argmax(predictions.predictions, axis=1)

# 3. Get the true labels of the test set
true_labels_tuned = y_test

# 4. Generate the classification report
print("--- FINE-TUNED MODEL METRICS ---")
print(classification_report(true_labels_tuned, predicted_labels_tuned, target_names=['negative', 'positive']))

--- FINE-TUNED MODEL METRICS ---
              precision    recall  f1-score   support

    negative       0.89      0.90      0.90       392
    positive       0.90      0.90      0.90       408

    accuracy                           0.90       800
   macro avg       0.90      0.90      0.90       800
weighted avg       0.90      0.90      0.90       800
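
The step from logits to labels can be shown on a few made-up values. This is a plain-Python sketch of what `np.argmax(..., axis=1)` does row by row, plus a softmax to turn logits into probabilities. The logits below are invented for illustration and are not outputs of the trained model.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: List[float]) -> int:
    """Index of the largest value (first one wins on ties, like np.argmax)."""
    return max(range(len(values)), key=values.__getitem__)


ID_TO_LABEL = {0: "negative", 1: "positive"}  # same mapping as label_num

example_logits = [[2.0, -1.0], [-0.5, 1.5], [0.1, 0.1]]  # made-up values
probabilities = [softmax(row) for row in example_logits]
labels = [ID_TO_LABEL[argmax(row)] for row in example_logits]

for row, probs, label in zip(example_logits, probabilities, labels):
    print(row, "->", [round(p, 3) for p in probs], "->", label)
```

The argmax of the logits and the argmax of the softmax probabilities always agree, so the softmax is only needed when a confidence score is wanted alongside the label.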



## CONCLUSION

The goal of this project was to compare the performance of a zero-shot classification model (baseline) against a fine-tuned model (specialist) on a sentiment analysis task.

### Results:

| MODEL | STRATEGY | BASE MODEL | ACCURACY | F1-SCORE (WEIGHTED) |
| -------- | -------- | ----------- | -------- | ------------------- |
| Baseline | Zero-Shot | ***facebook/bart-large-mnli*** | 90% | 0.90 |
| Challenger | Fine-Tuning | ***distilbert-base-uncased*** | 90% | 0.90 |

### Analysis:

Surprisingly, the fine-tuned model (DistilBERT, 90% accuracy) did not outperform the zero-shot model (BART-Large, also 90%). Note that the two figures come from different evaluation sets: the first 300 reviews for the baseline and an 800-review held-out split for the fine-tuned model.
This suggests that binary sentiment classification (positive/negative) is a task where large-scale NLI models such as BART-Large (406M parameters) already generalize extremely well.
Even after training a specialist model (DistilBERT, 66M parameters) on 3,200 reviews (80% of the 4,000-review sample), it only matched the performance of the larger model, which received no task-specific training. This demonstrates the power of modern foundation models.
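
Since both accuracies are estimated from finite samples (300 and 800 reviews), it helps to see how wide the uncertainty around each 90% is. The sketch below uses the normal-approximation (Wald) interval for a proportion. This is one common choice and an assumption of the example, not something the notebook computes; `accuracy_interval` is a hypothetical helper.

```python
import math
from typing import Tuple


def accuracy_interval(accuracy: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% interval for an accuracy measured on n examples."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return (accuracy - half_width, accuracy + half_width)


baseline = accuracy_interval(0.90, 300)    # zero-shot, 300 reviews
fine_tuned = accuracy_interval(0.90, 800)  # fine-tuned, 800 reviews

print(f"zero-shot : {baseline[0]:.3f} to {baseline[1]:.3f}")
print(f"fine-tuned: {fine_tuned[0]:.3f} to {fine_tuned[1]:.3f}")
```

The intervals come out at roughly 0.866 to 0.934 for the baseline and 0.879 to 0.921 for the fine-tuned model. They overlap almost entirely, so the data supports "comparable performance" rather than a precise tie. Evaluating both models on the same test split would give a more direct comparison.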

### Note on CPU vs. GPU

Running **BART-Large** inference and **DistilBERT** training on the GPU reduced execution time by a factor of 15 compared with running the same tasks on the CPU. This makes clear how important this kind of hardware is for NLP workloads.