<a href="https://colab.research.google.com/github/rafael-rosa/mack-modelos-linguagem-generativos/blob/main/movie_review_classif.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## FINAL PROJECT FOR THE COURSE ***MODELOS DE LINGUAGEM E GENERATIVOS*** (Language and Generative Models)
#### Prof. Rogério de Oliveira
Students:
+ `Gildo Manzi da Silva - RA: 10329658`
+ `Rafael da Silva Rosa - RA: 10746329`
+ `Rogério Goussain Labat - RA: 10746326`

#### 🎬 Goal: Identify a movie's genre from its plot 🎭
#### 📊 Data sources: **IMDB (movie IDs) and OMDB (API used to fetch the plots)**

## 1️⃣ SETUP

In [1]:
!pip install --upgrade pip
!pip install -r https://raw.githubusercontent.com/rafael-rosa/mack-modelos-linguagem-generativos/main/requirements.txt
!pip install -r https://raw.githubusercontent.com/rafael-rosa/mack-modelos-linguagem-generativos/main/other-requirements.txt

Looking in indexes: https://download.pytorch.org/whl/cu129


#### 1️⃣.1️⃣ IMPORTS

In [None]:
import pandas as pd
import torch
import torch.nn as nn
from transformers import pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import numpy as np
import evaluate
from ydata_profiling import ProfileReport




#### 1️⃣.2️⃣ ENV VARIABLES

In [None]:
AMOUNT_OF_PLOTS_TO_CLASSIFY = 500
NUM_TRAIN_EPOCHS = 10 # Number of epochs for fine-tuning
WARMUP_STEPS = 500 # Warmup steps for the learning-rate schedule
WEIGHT_DECAY = 0.01 # Weight decay for the optimizer
TRAIN_SAMPLE_SIZE = 1500 # Number of rows sampled for fine-tuning (later split into train and test)
MODEL_CHECKPOINT = "distilbert-base-uncased" # Lightweight model for fine-tuning: DistilBERT trains quickly and performs well.
DATASET = "movies_dataset/movie_plots_dataset.csv" # Forward slashes work on Windows, Linux and Colab

#### 1️⃣.3️⃣ DATASET - IMDB MOVIE PLOTS TO CLASSIFY

In [4]:
df = pd.read_csv(DATASET)    
# df.head(20)

# Delete column Unnamed: 0 if exists
if 'Unnamed: 0' in df.columns:
    df = df.drop(columns=['Unnamed: 0'])

profile = ProfileReport(df, title="IMDB Dataset Profiling Report", explorative=True)
profile.to_notebook_iframe()


In [16]:
# Get list of unique genres
unique_genres = df['genre'].unique().tolist()
print("Unique genres:", unique_genres)

NUMBER_OF_GENRES = len(unique_genres)
print("Number of unique genres:", NUMBER_OF_GENRES)

# Delete duplicated rows
df = df.drop_duplicates()
df.shape

Unique genres: ['DOCUMENTARY', 'COMEDY', 'DRAMA', 'SHORT', 'WESTERN', 'THRILLER', 'ANIMATION', 'MUSIC', 'CRIME', 'SCI-FI', 'HORROR', 'TALK-SHOW', 'FAMILY', 'ACTION', 'MYSTERY', 'BIOGRAPHY', 'REALITY-TV', 'NEWS', 'FANTASY', 'ROMANCE', 'MUSICAL', 'SPORT', 'HISTORY', 'GAME-SHOW', 'ADVENTURE', 'WAR', 'ADULT']
Number of unique genres: 27


(3057, 2)

#### 1️⃣.4️⃣ SET GPU CARD

In [6]:
!nvidia-smi
print(f"GPU available? {'✅' if torch.cuda.is_available() else '❌'}")
print(f"GPU name: {'🎮 ' + torch.cuda.get_device_name(0) if torch.cuda.is_available() else '⚠️ None'}")

# Verify if the GPU (CUDA) is available
# If it is, use "cuda:0" (the first GPU)
# If not, use "cpu"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

print(f"Selected device: 🖥️ {device}")

Mon Dec  1 01:30:43 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 577.00                 Driver Version: 577.00         CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4050 ...  WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   47C    P8              1W /   60W |       0MiB /   6141MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [7]:
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli",
                      device=device)   # Use GPU if available

Device set to use cuda:0


## 2️⃣ Zero-Shot Classification
#### 2️⃣.1️⃣ USE ***BART LARGE MNLI*** MODEL TO DETERMINE MOVIE GENRE

In [8]:
reviews = df['plot'][:AMOUNT_OF_PLOTS_TO_CLASSIFY].tolist()
print(f"Classifying {len(reviews)} plots...")

model_classification_results = []
candidate_labels = unique_genres

def data():
    for i in reviews:
        yield i

# Batch processing using the pipeline for efficiency
for result in classifier(data(), candidate_labels, batch_size=2):
    print("classification:", result['labels'][0], "with score", result['scores'][0])
    model_classification_results.append({
        'review': result['sequence'],
        'predicted_label': result['labels'][0],
        'score': result['scores'][0]
    })

Classifying 500 plots...
classification: ADVENTURE with score 0.41323086619377136
classification: ACTION with score 0.1754855364561081
classification: FAMILY with score 0.4149571359157562
classification: THRILLER with score 0.5509849786758423
classification: ANIMATION with score 0.6133695244789124
classification: MUSIC with score 0.1956285685300827
classification: MYSTERY with score 0.6379823088645935
classification: ACTION with score 0.14777158200740814
classification: ACTION with score 0.2798418402671814
classification: HISTORY with score 0.14951208233833313
classification: NEWS with score 0.1462828516960144
classification: DRAMA with score 0.1194235160946846
classification: DOCUMENTARY with score 0.20527265965938568
classification: MUSICAL with score 0.517494797706604
classification: FAMILY with score 0.5148017406463623
classification: DOCUMENTARY with score 0.7006697058677673
classification: FAMILY with score 0.29807785153388977
classification: DOCUMENTARY with score 0.590815007686

#### 2️⃣.2️⃣ ERROR PERCENTAGE

In [9]:
# Compare predictions with the ground truth by position. Pairing by position
# (instead of by DataFrame index label) stays correct even when
# drop_duplicates() leaves gaps in the index.
true_genres = df['genre'].head(AMOUNT_OF_PLOTS_TO_CLASSIFY).tolist()
counter_incorrect = sum(
    result['predicted_label'] != actual
    for result, actual in zip(model_classification_results, true_genres)
)

error_percentage = (counter_incorrect / len(true_genres)) * 100
print(f"Total incorrect predictions: {counter_incorrect} - Error percentage: {error_percentage:.2f}%")

Total incorrect predictions: 426 - Error percentage: 85.20%


#### 2️⃣.3️⃣ METRICS - BASELINE

In [10]:
# 1. Get the true labels (ground truth)
true_labels = list(df.head(AMOUNT_OF_PLOTS_TO_CLASSIFY)['genre'])

# 2. Get the predicted labels
predicted_labels = [item['predicted_label'] for item in model_classification_results]

# 3. Classification report
# classification_report lists string labels in sorted order, so the row names
# must be sorted too. unique_genres is in order of appearance and, passed as
# target_names, would attach the wrong genre name to each row.
genre_names = sorted(unique_genres)
print("--- ZERO-SHOT MODEL METRICS (Baseline) ---")
print(classification_report(true_labels, predicted_labels, labels=genre_names, zero_division=0))

--- ZERO-SHOT MODEL METRICS (Baseline) ---
              precision    recall  f1-score   support

      ACTION       0.02      0.60      0.04         5
       ADULT       0.50      0.03      0.05        35
   ADVENTURE       0.00      0.00      0.00         2
   ANIMATION       0.00      0.00      0.00         7
   BIOGRAPHY       0.00      0.00      0.00         1
      COMEDY       0.68      0.14      0.23        96
       CRIME       0.42      0.45      0.43        22
 DOCUMENTARY       0.56      0.31      0.40        78
       DRAMA       0.67      0.07      0.13       139
      FAMILY       0.00      0.00      0.00         3
     FANTASY       0.00      0.00      0.00         3
   GAME-SHOW       1.00      0.60      0.75         5
     HISTORY       0.00      0.00      0.00         1
      HORROR       0.00      0.00      0.00        11
       MUSIC       0.00      0.00      0.00         3
     MUSICAL       0.00      0.00      0.00         1
     MYSTERY       0.09      0.2

## 3️⃣ FINE-TUNING
#### 3️⃣.1️⃣ DATA PREPARATION

In [30]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

sample_df = df.sample(n=TRAIN_SAMPLE_SIZE, random_state=42) 

# Convert genre labels to numerical values
sample_df['label_num'] = sample_df['genre'].astype('category').cat.codes

# Split the data into Train and Test
X = list(sample_df['plot'])
y = list(sample_df['label_num'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Tokenize (convert text into numbers the model understands)
train_encodings = tokenizer(X_train, truncation=True, padding=True, max_length=256)
test_encodings = tokenizer(X_test, truncation=True, padding=True, max_length=256)
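
The `label_num` column above comes from pandas category codes. For plain string labels, those codes follow the sorted order of the distinct genre names, not the order in which genres first appear in the file. The sketch below reproduces that mapping in pure Python on a hypothetical toy list (the `genres` values and the `label2id` / `id2label` names are illustrative assumptions, not part of the notebook):

```python
# Hypothetical genre column; the notebook reads the real one from sample_df['genre'].
genres = ["DRAMA", "COMEDY", "DOCUMENTARY", "DRAMA", "THRILLER", "COMEDY"]

# Sorting the distinct names mirrors the order pandas assigns to category codes
# when the labels are plain strings.
genre_names = sorted(set(genres))

# Two-way mapping between genre names and integer class ids.
label2id = {name: idx for idx, name in enumerate(genre_names)}
id2label = {idx: name for name, idx in label2id.items()}

# Encode the column the same way .astype('category').cat.codes would.
encoded = [label2id[g] for g in genres]

print(genre_names)  # distinct genres in sorted order
print(encoded)      # integer class id per row
```

Keeping such a mapping around makes it possible to turn predicted class ids back into genre names, and to label report rows in the same order the ids were assigned.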

In [None]:
## Addressing class imbalance
## Weighted loss function

# Compute the weights from the class frequencies in the training set
class_weights = compute_class_weight(
    class_weight="balanced", # rare classes get a high weight, common classes a low weight
    classes=np.unique(y_train), 
    y=y_train
)

# Convert to a PyTorch tensor and move it to the selected device (GPU if available)
weights_tensor = torch.tensor(class_weights, dtype=torch.float).to(device)

print("Class weights:", weights_tensor)

Class weights: tensor([ 1.8519,  1.7544,  8.3333,  3.0303, 33.3333,  0.1984,  1.5152,  0.1684,
         0.1481,  2.5641, 16.6667,  8.3333,  8.3333,  1.5152, 11.1111, 11.1111,
         4.1667,  8.3333,  0.3831,  2.7778,  8.3333,  1.0101, 16.6667,  3.0303,
         1.9608, 16.6667, 11.1111], device='cuda:0')
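
The "balanced" heuristic used above sets each class weight to `n_samples / (n_classes * class_count)`. With 900 training rows (60% of the 1,500 sampled) and 27 classes, a class with a single example gets 900 / (27 × 1) ≈ 33.33, and a class with 225 examples gets about 0.148, which is consistent with the largest and smallest weights printed. A minimal pure-Python sketch of the formula, using hypothetical toy labels and a helper name (`balanced_class_weights`) that is not part of the notebook:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Toy labels (hypothetical): class 0 is common, class 2 is rare.
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(toy_labels)

for cls, weight in weights.items():
    print(f"class {cls}: weight {weight:.4f}")
```

Rare classes end up with weights above 1 and frequent classes below 1, so each class contributes roughly the same total mass to the loss.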


In [32]:
sample_df.head(10)

| | plot | genre | label_num |
| --- | --- | --- | --- |
| 1765 | THE FILM COMPANION TEAMS BRINGS YOU THE LATEST... | TALK-SHOW | 23 |
| 203 | WHEN LEIGHTON TURNS DOWN THE FREEZERS, IN A BI... | COMEDY | 5 |
| 1357 | IN HIDDEN BASEMENTS, BEDROOMS AND BARS ACROSS ... | DOCUMENTARY | 7 |
| 1452 | THE SALE OF THE STADIUM IS ON THE BALANCE. LUC... | DRAMA | 8 |
| 1578 | A JOURNEY THROUGH THE ART OF LOVE AND DEATH. | DOCUMENTARY | 7 |
| 102 | A FUNERAL BRINGS TOGETHER DIFFERENT SIDES OF D... | DRAMA | 8 |
| 2838 | A WIDOWED MOTHER TAKES A JOB AS AN OVERNIGHT S... | THRILLER | 24 |
| 1819 | SHE IS THE OCEAN - A FULL-LENGTH DOCUMENTARY A... | DOCUMENTARY | 7 |
| 485 | LIISA ARRIVES ON AN ISLAND TO SERVE AS A NURSE... | DRAMA | 8 |
| 1437 | IN THEIR FINAL CHALLENGE BEFORE FASHION WEEK, ... | REALITY-TV | 18 |


In [33]:
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Get all 'input_ids', 'attention_mask', etc.
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Add label
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, y_train)
test_dataset = IMDbDataset(test_encodings, y_test)

#### 3️⃣.2️⃣ TRAINING

In [None]:
accuracy_metric = evaluate.load("accuracy")

# Used by trainer to compute metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

# Load the pre-trained model (with a classification "head" on top)
# The num_labels tells it to prepare to classify between the number of genres
model = AutoModelForSequenceClassification.from_pretrained(MODEL_CHECKPOINT, num_labels=NUMBER_OF_GENRES)

# A better practice is to evaluate the model at the end of each epoch and keep only the best one.
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=NUM_TRAIN_EPOCHS,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=WARMUP_STEPS,
    weight_decay=WEIGHT_DECAY,
    logging_dir='./logs',
    logging_steps=10,
    eval_strategy="epoch",            # Evaluate at the end of every epoch
    save_strategy="epoch",            # Save a checkpoint at the end of every epoch
    load_best_model_at_end=True       # Reload the best checkpoint when training ends
)

# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=test_dataset,
#     compute_metrics=compute_metrics
# )

# The default Hugging Face Trainer does not accept class weights natively, so we override compute_loss
class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        # Extract the labels and the model outputs
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        
        # Cross-entropy loss that uses the class weights computed earlier
        loss_fct = nn.CrossEntropyLoss(weight=weights_tensor)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

trainer = WeightedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy |
| --- | --- | --- | --- |
| 1 | 3.3016 | 3.280371 | 0.081667 |
| 2 | 3.2944 | 3.243977 | 0.163333 |
| 3 | 3.1503 | 3.163981 | 0.165 |
| 4 | 2.9361 | 3.054405 | 0.313333 |
| 5 | 2.6755 | 2.862642 | 0.406667 |
| 6 | 2.4052 | 2.703629 | 0.376667 |
| 7 | 1.8362 | 2.554795 | 0.438333 |
| 8 | 1.4656 | 2.332684 | 0.408333 |
| 9 | 0.8014 | 2.483413 | 0.491667 |
| 10 | 0.3825 | 2.454453 | 0.508333 |


TrainOutput(global_step=570, training_loss=2.2584907109277292, metrics={'train_runtime': 184.0016, 'train_samples_per_second': 48.913, 'train_steps_per_second': 3.098, 'total_flos': 596369060352000.0, 'train_loss': 2.2584907109277292, 'epoch': 10.0})
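
The `WeightedTrainer` used for this run swaps in a class-weighted cross-entropy. With mean reduction, that loss scales each example's negative log-likelihood by the weight of its true class and divides by the sum of those weights. The sketch below spells out the arithmetic in pure Python on hypothetical toy logits; it illustrates the formula and is not the notebook's training code (the function name and inputs are assumptions):

```python
import math


def weighted_cross_entropy(logits, labels, class_weights):
    """Class-weighted mean cross-entropy over a batch.

    logits: list of per-example score lists, labels: list of true class ids,
    class_weights: list with one weight per class.
    """
    weighted_sum, weight_total = 0.0, 0.0
    for scores, y in zip(logits, labels):
        # Numerically stable log-sum-exp of the scores.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        nll = log_z - scores[y]  # negative log-probability of the true class
        weighted_sum += class_weights[y] * nll
        weight_total += class_weights[y]
    return weighted_sum / weight_total


# Toy batch (hypothetical): example 0 is classified confidently, example 1 is a coin flip.
toy_logits = [[2.0, 0.0], [0.0, 0.0]]
toy_labels = [0, 1]

plain = weighted_cross_entropy(toy_logits, toy_labels, [1.0, 1.0])
weighted = weighted_cross_entropy(toy_logits, toy_labels, [1.0, 3.0])
print(f"unweighted: {plain:.4f}  weighted: {weighted:.4f}")
```

Raising the weight of class 1 pulls the batch loss toward the poorly predicted class-1 example, which is the effect the class weights are meant to have on rare genres.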

### 3️⃣.3️⃣ METRICS - AFTER FINE-TUNING

In [35]:
# Make predictions on the test set
predictions = trainer.predict(test_dataset)

# Predictions come out as logits; the argmax gives the predicted class id (0 to 26)
predicted_labels_tuned = np.argmax(predictions.predictions, axis=1)

# Get the true labels from the test set
true_labels_tuned = y_test

# label_num comes from pandas category codes, which follow the sorted category
# order, so the row names must come from the same categories. unique_genres is
# in order of appearance and would attach the wrong genre name to each row.
genre_names = list(sample_df['genre'].astype('category').cat.categories)

# Generate the classification report
print("--- FINE-TUNED MODEL METRICS ---")
print(classification_report(
    true_labels_tuned, predicted_labels_tuned,
    labels=list(range(len(genre_names))), target_names=genre_names, zero_division=0
))

--- FINE-TUNED MODEL METRICS ---
              precision    recall  f1-score   support

      ACTION       0.17      0.20      0.18         5
       ADULT       0.33      0.43      0.38        14
   ADVENTURE       0.00      0.00      0.00         2
   ANIMATION       0.00      0.00      0.00         6
   BIOGRAPHY       0.00      0.00      0.00         1
      COMEDY       0.52      0.52      0.52       123
       CRIME       0.15      0.22      0.18         9
 DOCUMENTARY       0.68      0.69      0.68       127
       DRAMA       0.43      0.09      0.15       133
      FAMILY       0.25      0.33      0.29         9
     FANTASY       0.00      0.00      0.00         2
   GAME-SHOW       0.25      0.67      0.36         3
     HISTORY       0.60      1.00      0.75         3
      HORROR       0.28      0.45      0.34        22
       MUSIC       0.50      0.75      0.60         4
     MUSICAL       0.00      0.00      0.00         1
     MYSTERY       0.67      0.40      0.50     


## 4️⃣ CONCLUSION


This project compared two Natural Language Processing (NLP) approaches to **movie genre classification** from plot synopses:
1.  **Zero-Shot Learning (Baseline):** the `facebook/bart-large-mnli` model (406M parameters), used without any training on this dataset.
2.  **Transfer Learning & Fine-Tuning:** the `distilbert-base-uncased` model (66M parameters), trained specifically on this dataset.



### Results:

| MODEL | STRATEGY | BASE MODEL | ACCURACY | F1-SCORE (WEIGHTED) | INFERENCE TIME |
| -------- | -------- | ----------- | -------- | ------------------- | ------------------- |
| Baseline | Zero-Shot | ***facebook/bart-large-mnli*** | **15%** | 0.15 | High (slow) |
| Challenger | Fine-Tuning | ***distilbert-base-uncased*** | **41%** | 0.41 | Low (fast) |
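
The weighted F1 column is taken here to be the support-weighted average of the per-class F1 scores, which is what scikit-learn reports as `weighted avg`. Because a few large genres dominate the support, this figure can sit well above the unweighted (macro) average. A minimal sketch on hypothetical numbers (the scores and supports below are illustrative, not taken from the reports):

```python
def weighted_f1(per_class_f1, support):
    """Average per-class F1, weighting each class by its number of true examples."""
    total = sum(support)
    return sum(f1 * n for f1, n in zip(per_class_f1, support)) / total


def macro_f1(per_class_f1):
    """Plain average of per-class F1, every class counted equally."""
    return sum(per_class_f1) / len(per_class_f1)


# Hypothetical three-class report: two frequent classes and one rare, missed class.
f1_scores = [0.60, 0.50, 0.00]
supports = [120, 70, 10]

print(f"weighted F1: {weighted_f1(f1_scores, supports):.3f}")
print(f"macro F1:    {macro_f1(f1_scores):.3f}")
```

On an imbalanced dataset like this one, reading both averages together shows whether a model does well only on the frequent genres.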

### Analysis:

The multiclass classification task exposed the limitations of the zero-shot approach and the strength of fine-tuning:

1.  **The Ambiguity Challenge:** With 27 possible classes, the boundary between genres such as "Action", "Adventure", and "Crime" is thin. The zero-shot model, being a generalist, tends to get confused by overlapping themes. The fine-tuned model, on the other hand, learned the specific nuances of how *this dataset* defines each genre.

2.  **Computational Efficiency:** The zero-shot approach required the model to process each synopsis against every candidate label, making inference significantly slower. The fine-tuned model (DistilBERT), besides being architecturally lighter (66M vs. 406M parameters), classifies in a single forward pass, which makes it well suited to production environments.
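
To make the efficiency point concrete: an NLI-based zero-shot classifier scores one (premise, hypothesis) pair per candidate label, so the number of sequences the model has to score grows with the number of genres. The sketch below builds those pairs and counts them for the sizes used in this project. The template string is, as far as we recall, the default of the Hugging Face zero-shot pipeline and should be treated as an assumption; the helper name and the sample plot are hypothetical.

```python
def build_nli_pairs(plot, candidate_labels, hypothesis_template="This example is {}."):
    """Pair one plot (premise) with one hypothesis sentence per candidate label."""
    return [(plot, hypothesis_template.format(label)) for label in candidate_labels]


# Hypothetical plot and a small label subset, just to show the pair structure.
sample_plot = "A retired detective takes one last case."
sample_labels = ["ACTION", "COMEDY", "DRAMA"]
pairs = build_nli_pairs(sample_plot, sample_labels)
for premise, hypothesis in pairs:
    print(f"{premise!r} -> {hypothesis!r}")

# Sizes from this project: 500 plots scored against 27 genres.
num_plots, num_genres = 500, 27
zero_shot_sequences = num_plots * num_genres   # one NLI pair per (plot, genre)
fine_tuned_sequences = num_plots               # one sequence per plot

print(zero_shot_sequences, fine_tuned_sequences)
```

Under these assumptions the zero-shot baseline scores 27 times as many sequences as the fine-tuned classifier, on top of using a larger model.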

### Final Verdict
For complex multiclass classification tasks with domain-specific definitions, **fine-tuning is indispensable**. Although zero-shot is a powerful tool for rapid prototyping and "cold start" situations (no data), it cannot compete with the accuracy and efficiency of a trained specialist model (even a smaller one) when labeled data is available.

### Note on CPU vs. GPU

Running the **BART-Large** model and training the **DistilBERT** model on a GPU cut execution time by a factor of 15 compared with running the same tasks on a CPU. This makes clear how important this kind of hardware is for NLP work.