---
# Install the required packages

If needed, install the following packages:

In [1]:
!pip install datasets transformers imbalanced-learn evaluate



---
# Imports

In [1]:
import numpy as np
import pandas as pd

from datasets import Dataset, load_dataset

from transformers import DistilBertForSequenceClassification
from transformers import DistilBertTokenizer
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorWithPadding

from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


import evaluate
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report


# Write your code here. Add as many boxes as you need.



---
# Laboratory Exercise - Run Mode (8 points)

## Introduction

The primary objective of this laboratory assignment is to fine-tune a pre-trained language model to detect toxic sentences (binary classification).

The dataset contains two attributes: 
- `text`: The sentence to be classified as toxic or non-toxic
- `label`: A 0/1 indicator of whether the given sentence is toxic

**Note: You are required to perform this laboratory assignment on your local machine.**

# Read the data

The code for reading the dataset is provided. Just run the following two cells.

**DO NOT MODIFY IT! Just analyse how the data is read, because in the future this part will not be provided.**

In [2]:
dataset = load_dataset(
    'csv', 
    data_files={'train': '../datasets/train.tsv', 'val': '../datasets/val.tsv','test': '../datasets/test.tsv'},
    delimiter='\t'
)
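
The `delimiter='\t'` argument tells the CSV loader that the columns are separated by tabs. As a minimal sketch of what that means, the snippet below parses a tab-separated string with Python's standard `csv` module. The two sample rows are made up, and the assumption that the files start with a `text`/`label` header row is inferred from the feature names, not confirmed by the files themselves.

```python
import csv
import io

# Hypothetical two-row sample in the assumed layout: header row, then text<TAB>label.
sample_tsv = "text\tlabel\nhave a nice day\t0\nyou are an idiot\t1\n"

# DictReader uses the header row for the keys and splits each line on tabs.
rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
for row in rows:
    row["label"] = int(row["label"])  # csv yields strings; cast the target to int

print(rows[0])
```

`load_dataset('csv', ...)` does the same kind of parsing per split and additionally infers the column types.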

**The prediction target column MUST be named 'label' in the dataset!**

See the dataset structure:

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1000
    })
    val: Dataset({
        features: ['text', 'label'],
        num_rows: 3130
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 3132
    })
})

---
# Natural Language Processing

## Generate the Tokenizer and Data Collator

For this lab, you will use `DistilBertTokenizer` and `DataCollatorWithPadding`.

In [4]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
collator = DataCollatorWithPadding(tokenizer=tokenizer)

## Tokenize the dataset

To reduce the amount of computation, set the `max_length` parameter to 15.

In [5]:
# Write your code here. Add as many boxes as you need.
dataset = dataset.map(lambda x: tokenizer(x['text'], truncation=True, padding=True, max_length=15), batched=True)
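
To get a feel for what truncating to `max_length=15` does, here is a toy sketch that splits on whitespace. It is not the DistilBERT WordPiece tokenizer: the helper `truncate_and_pad` and its `[PAD]` handling are hypothetical and only illustrate the shape of the result. The real tokenizer also adds special tokens such as `[CLS]` and `[SEP]`, which count toward the limit.

```python
# Toy illustration of truncation and padding; NOT the real DistilBERT tokenizer.
def truncate_and_pad(text, max_length=15, pad_token="[PAD]"):
    """Split on whitespace, cut to max_length tokens, then pad up to max_length."""
    tokens = text.split()[:max_length]
    attention_mask = [1] * len(tokens)          # 1 marks a real token
    n_pad = max_length - len(tokens)
    return tokens + [pad_token] * n_pad, attention_mask + [0] * n_pad

short_tokens, short_mask = truncate_and_pad("this is a short sentence")
long_tokens, long_mask = truncate_and_pad(" ".join(["word"] * 40))

print(len(short_tokens), sum(short_mask))  # 15 positions, 5 of them real tokens
print(len(long_tokens), sum(long_mask))    # 15 positions, all real (25 words dropped)
```

Anything beyond the limit is discarded, so a small `max_length` trades accuracy on long sentences for speed.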

## Define the model

The required model for this lab is the `DistilBertForSequenceClassification`.

In [6]:
# Write your code here. Add as many boxes as you need.
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Define the training arguments

To lower the compute time, I recommend using the following parameters:
- `per_device_train_batch_size=128`
- `per_device_eval_batch_size=128`
- **`num_train_epochs=1`**

In [7]:
# Write your code here. Add as many boxes as you need.
training_args = TrainingArguments(
    output_dir="trainer",
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    metric_for_best_model="f1",  # only matters when checkpoints are compared, e.g. with load_best_model_at_end=True
    num_train_epochs=1
)


## Load the metrics

Load the most suitable metric for this specific problem.

In [8]:
# Write your code here. Add as many boxes as you need.
metric = evaluate.load('f1')

### Define the function to compute the metrics

In [9]:
# Write your code here. Add as many boxes as you need.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels, average="weighted")
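
To make `average="weighted"` concrete, the sketch below recomputes a weighted F1 score by hand with NumPy on made-up logits. The helpers `f1_for_class` and `weighted_f1` are illustrative names, not part of `evaluate`; they only show the arithmetic: per-class F1 scores averaged with the class supports as weights.

```python
import numpy as np

def f1_for_class(y_true, y_pred, cls):
    """F1 of one class, treating that class as the positive label."""
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by how often each class occurs."""
    classes, counts = np.unique(y_true, return_counts=True)
    scores = [f1_for_class(y_true, y_pred, c) for c in classes]
    return float(np.average(scores, weights=counts))

# Made-up logits for four examples (columns: class 0, class 1).
logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]])
labels = np.array([0, 1, 1, 1])
predictions = np.argmax(logits, axis=-1)  # [0, 1, 0, 1]

print(predictions, round(weighted_f1(labels, predictions), 4))  # [0 1 0 1] 0.7667
```

Here class 0 scores 2/3 and class 1 scores 0.8; weighting by supports 1 and 3 gives 23/30.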

## Generate the Trainer object

In [10]:
# Write your code here. Add as many boxes as you need.
trainer = Trainer(
    model,
    training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['val'],
    compute_metrics=compute_metrics,
    data_collator=collator,
)


## Train the model

Use the trainer to train the model.

In [11]:
# Write your code here. Add as many boxes as you need.
trainer.train()

Step,Training Loss


TrainOutput(global_step=8, training_loss=0.6829401254653931, metrics={'train_runtime': 12.7901, 'train_samples_per_second': 78.186, 'train_steps_per_second': 0.625, 'total_flos': 3880880820000.0, 'train_loss': 0.6829401254653931, 'epoch': 1.0})
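
The `global_step=8` in the output follows from the dataset size and the batch size. A quick check, assuming a single device and no gradient accumulation:

```python
import math

num_train_rows = 1000   # from the DatasetDict printout
batch_size = 128        # per_device_train_batch_size
num_epochs = 1          # num_train_epochs

# The last, smaller batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 8
```

With so few steps, the logging interval is probably never reached (the default `logging_steps` is 500, as far as I know), which would explain why the loss table above has no rows.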

---
# Evaluate the model

## Generate predictions for the test set

In [12]:
# Write your code here. Add as many boxes as you need.
predictions = trainer.predict(dataset['test'])

## Extract the predictions (class 0 or 1) from the logits

In [13]:
# Write your code here. Add as many boxes as you need.
preds = np.argmax(predictions.predictions, axis=-1)
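
Taking the `argmax` of the logits directly works because softmax preserves the ordering of the values within each row. A small sketch with made-up logits (the `softmax` helper is written here for illustration):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up logits for two sentences (columns: class 0, class 1).
example_logits = np.array([[1.2, -0.4], [-0.3, 0.8]])
probs = softmax(example_logits)

print(probs.round(3))
# Same predicted classes either way.
print(np.argmax(probs, axis=-1), np.argmax(example_logits, axis=-1))
```

The probabilities are only needed if you want a confidence score or a decision threshold other than 0.5.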

## Analyze the performance of the model

In [14]:
# Write your code here. Add as many boxes as you need.
print(classification_report(dataset['test']['label'], preds))


              precision    recall  f1-score   support

           0       0.83      0.66      0.73      1566
           1       0.72      0.86      0.78      1566

    accuracy                           0.76      3132
   macro avg       0.77      0.76      0.76      3132
weighted avg       0.77      0.76      0.76      3132



# Laboratory Exercise - Bonus Task (+ 2 points)

Implement a simple machine learning pipeline to classify whether a given text is **toxic** or not. Use TF-IDF vectorization to convert the text into numerical features and train a `MultinomialNB` model. If needed, use `RandomUnderSampler()`. Compare the results with the transformer model.

In [15]:
# Write your code here. Add as many boxes as you need.
pipeline = make_pipeline(
    TfidfVectorizer(),
    RandomUnderSampler(),
    MultinomialNB()
)
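
The `RandomUnderSampler()` step balances the classes by discarding rows of the majority class at random. The sketch below shows the idea with NumPy on toy labels. The function `random_undersample` is a hypothetical stand-in, not the imbalanced-learn implementation.

```python
import numpy as np

def random_undersample(labels, seed=0):
    """Return sorted row indices that keep an equal number of rows per class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()  # size of the smallest class
    kept = [
        rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(kept))

toy_labels = [0] * 8 + [1] * 2          # imbalanced: eight 0s, two 1s
idx = random_undersample(toy_labels)
print(np.bincount(np.asarray(toy_labels)[idx]))  # [2 2]
```

Inside an imbalanced-learn pipeline the sampler is applied only during `fit`, so the test set is predicted as-is. If the training labels are already balanced, the step changes nothing.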

---

In [16]:
pipeline.fit(dataset['train']['text'], dataset['train']['label'])

In [17]:
preds_nb = pipeline.predict(dataset['test']['text'])

In [18]:
print(classification_report(dataset['test']['label'], preds_nb))

              precision    recall  f1-score   support

           0       0.84      0.71      0.77      1566
           1       0.75      0.87      0.81      1566

    accuracy                           0.79      3132
   macro avg       0.80      0.79      0.79      3132
weighted avg       0.80      0.79      0.79      3132
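
For the comparison the task asks for, the scores from the two classification reports can be placed side by side. The numbers below are copied from the reports above; the dictionary layout and the model names used as keys are just one way to organise them.

```python
# Scores copied from the two classification reports (test set, 3132 rows).
reports = {
    "DistilBERT (1 epoch)": {"accuracy": 0.76, "macro_f1": 0.76, "f1_class_0": 0.73, "f1_class_1": 0.78},
    "TF-IDF + MultinomialNB": {"accuracy": 0.79, "macro_f1": 0.79, "f1_class_0": 0.77, "f1_class_1": 0.81},
}

metrics = ["accuracy", "macro_f1", "f1_class_0", "f1_class_1"]
header = "metric".ljust(12) + "".join(name.rjust(26) for name in reports)
print(header)
for m in metrics:
    print(m.ljust(12) + "".join(f"{scores[m]:>26.2f}" for scores in reports.values()))

# Pick the model with the higher macro F1 in this particular run.
best = max(reports, key=lambda name: reports[name]["macro_f1"])
print("Higher macro F1:", best)
```

In this run the Naive Bayes baseline is ahead by about three points on every metric. A plausible reason, stated as a hypothesis only, is that the transformer saw just 1000 training sentences for a single epoch with inputs truncated to 15 tokens. Both pipelines involve randomness, so the exact numbers may differ between runs.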

