---
# Install the required packages

If needed, install the following packages:

In [3]:
# !pip install transformers[torch]

In [4]:
# !pip install torch

In [5]:
# !pip install datasets transformers imbalanced-learn evaluate

---
# Imports

In [7]:
from datasets import load_dataset

# Write your code here. Add as many boxes as you need.

---
# Laboratory Exercise - Run Mode (8 points)

## Introduction

The primary objective of this laboratory assignment is to fine-tune a pre-trained language model to detect toxic sentences (binary classification).

The dataset contains two attributes: 
- `text`: The sentence to be classified as toxic or non-toxic
- `label`: 0/1 indicator of whether the given sentence is toxic (1 = toxic)

**Note: You are required to perform this laboratory assignment on your local machine.**

# Read the data

The dataset reading is given. Just run the following 2 cells.

**DO NOT MODIFY IT! Just analyse how the data reading was performed, as in the future this part won't be given.**

In [11]:
dataset = load_dataset(
    'csv', 
    data_files={'train': 'data/train.tsv', 'val': 'data/val.tsv','test': 'data/test.tsv'},
    delimiter='\t'
)

**The prediction target column MUST be named 'label' in the dataset!**
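Since the reading step will not be given in the future, it helps to see the file layout that the call above assumes: a header row naming the columns, followed by tab-separated values. The sketch below reproduces that layout with the standard library only. The two rows are hypothetical, since the contents of `data/train.tsv` are not shown in this notebook.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical rows; the real data files are not shown in this notebook.
rows = [
    {"text": "have a nice day", "label": 0},
    {"text": "you are an idiot", "label": 1},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.tsv"

    # Write a header row followed by tab-separated values.
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label"], delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back; the header row supplies the column names.
    with path.open(newline="", encoding="utf-8") as f:
        loaded = list(csv.DictReader(f, delimiter="\t"))

print(list(loaded[0].keys()))
print(loaded)
```

Note that `csv.DictReader` returns every value as a string, so the labels come back as `"0"` and `"1"` here. If your own file uses a different name for the target column, rename it to `label` before training.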

See the dataset structure:

In [13]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1000
    })
    val: Dataset({
        features: ['text', 'label'],
        num_rows: 3130
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 3132
    })
})

---
# Natural Language Processing

## Generate the Tokenizer and Data Collator

For the purposes of this lab you will be using `DistilBertTokenizer` and `DataCollatorWithPadding`.

In [16]:
# Write your code here. Add as many boxes as you need.
from transformers import DistilBertTokenizer, DataCollatorWithPadding

## Tokenize the dataset

To reduce the amount of computation, set the `max_length` parameter to 15.

In [18]:
# Write your code here. Add as many boxes as you need.
def tokenize(sample):
    return tokenizer(sample["text"], truncation=True, padding="max_length", max_length=15)

In [19]:
checkpoint = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(checkpoint)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, max_length=15)
tokenized_dataset = dataset.map(tokenize, batched=True)


In [20]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
    val: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 3130
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 3132
    })
})
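The two new columns come from truncating or padding every sentence to exactly `max_length` tokens. As a rough illustration of that idea (not the tokenizer's actual implementation), the sketch below works on plain lists of token ids. The helper `pad_or_truncate` and the ids are hypothetical, and the real tokenizer also keeps the special `[CLS]` and `[SEP]` tokens when truncating, which this toy version ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]          # truncation: drop anything past max_length
    n_pad = max_length - len(kept)         # padding needed to reach max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Hypothetical token ids, chosen only for illustration.
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask tells the model which positions are padding and should be ignored. A small `max_length` keeps the computation cheap, but any sentence longer than the limit loses its tail.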

## Define the model

The required model for this lab is the `DistilBertForSequenceClassification`.

In [22]:
# Write your code here. Add as many boxes as you need.
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Define the training arguments

To lower the compute time, I recommend using the following parameters:
- per_device_train_batch_size=128
- per_device_eval_batch_size=128
- **num_train_epochs=1**

In [24]:
# Write your code here. Add as many boxes as you need.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./trainer",
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=1,
    eval_strategy="epoch",
    metric_for_best_model="f1"
)

## Load the metrics

Load the metric best suited to this specific problem.

In [26]:
# Write your code here. Add as many boxes as you need.
import evaluate
import numpy as np

metric = evaluate.load("f1")

### Define the function to compute the metrics

In [28]:
# Write your code here. Add as many boxes as you need.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels, average="weighted")
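To make the `average="weighted"` argument less opaque, here is a minimal pure-Python sketch of a weighted F1 score: compute F1 per class, then average the per-class scores weighted by how many true examples each class has. The helper names and the toy labels are made up for illustration; in the lab itself the `evaluate` metric does this work.

```python
def f1_for_class(labels, preds, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == cls and p == cls)
    fp = sum(1 for y, p in pairs if y != cls and p == cls)
    fn = sum(1 for y, p in pairs if y == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(labels, preds):
    """Average of per-class F1 scores, weighted by class support."""
    total = len(labels)
    return sum(
        f1_for_class(labels, preds, cls) * labels.count(cls) / total
        for cls in sorted(set(labels))
    )


# Toy example: 3 true negatives-class items, 5 true positive-class items.
toy_labels = [0, 0, 0, 1, 1, 1, 1, 1]
toy_preds = [0, 1, 0, 1, 1, 0, 1, 1]

print(f1_for_class(toy_labels, toy_preds, 0))
print(f1_for_class(toy_labels, toy_preds, 1))
print(weighted_f1(toy_labels, toy_preds))
```

In this toy case the per-class scores are 2/3 and 0.8, and the weighted average is 0.75. When the classes are balanced, the weighted and macro averages coincide.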

## Generate the Trainer object

In [30]:
# Write your code here. Add as many boxes as you need.
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["val"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)

## Train the model

Use the trainer to train the model.

In [32]:
# Write your code here. Add as many boxes as you need.
trainer.train()



| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | No log | 0.580849 | 0.747515 |




TrainOutput(global_step=8, training_loss=0.6514556407928467, metrics={'train_runtime': 28.3374, 'train_samples_per_second': 35.289, 'train_steps_per_second': 0.282, 'total_flos': 3880880820000.0, 'train_loss': 0.6514556407928467, 'epoch': 1.0})
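The step count in the output can be checked by hand. The sketch below assumes a single device and no gradient accumulation, so one optimizer step corresponds to one batch; the row counts and batch size are the ones used in this notebook.

```python
import math

train_rows = 1000   # rows in the train split
val_rows = 3130     # rows in the val split
batch_size = 128    # per_device_train_batch_size / per_device_eval_batch_size
epochs = 1

# The last batch is smaller than 128, so round up.
train_steps = math.ceil(train_rows / batch_size) * epochs
eval_steps = math.ceil(val_rows / batch_size)

print(train_steps)  # optimizer steps for the whole run
print(eval_steps)   # batches in one evaluation pass
```

This gives 8 training steps, which matches `global_step=8`. It probably also explains the "No log" entry under Training Loss: as far as I know, `TrainingArguments` logs the training loss every 500 steps by default, and the run ends long before that. Setting a smaller `logging_steps` should make the training loss appear.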

In [64]:
trainer.evaluate()



{'eval_loss': 0.580848753452301,
 'eval_f1': 0.7475153401285056,
 'eval_runtime': 9.1295,
 'eval_samples_per_second': 342.844,
 'eval_steps_per_second': 2.738,
 'epoch': 1.0}

---
# Evaluate the model

## Generate predictions for the test set

In [66]:
# Write your code here. Add as many boxes as you need.
predictions = trainer.predict(tokenized_dataset["test"])



In [68]:
predictions

PredictionOutput(predictions=array([[ 0.07671159, -0.11954911],
       [-0.31016994,  0.21741846],
       [-0.06474358, -0.04352585],
       ...,
       [-0.02573019, -0.12475275],
       [ 0.03070219, -0.11817786],
       [-0.18656023, -0.1088573 ]], dtype=float32), label_ids=array([1, 1, 1, ..., 0, 0, 0], dtype=int64), metrics={'test_loss': 0.5808983445167542, 'test_f1': 0.7456725540500201, 'test_runtime': 10.1502, 'test_samples_per_second': 308.566, 'test_steps_per_second': 2.463})

## Extract the predictions (class 0 or 1) from the logits

In [70]:
# Write your code here. Add as many boxes as you need.
logits, labels = predictions.predictions, predictions.label_ids
preds = np.argmax(logits, axis=-1)

## Analyze the performance of the model

In [74]:
# Write your code here. Add as many boxes as you need.
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

print(classification_report(labels, preds))

              precision    recall  f1-score   support

           0       0.86      0.60      0.71      1566
           1       0.69      0.90      0.78      1566

    accuracy                           0.75      3132
   macro avg       0.78      0.75      0.75      3132
weighted avg       0.78      0.75      0.75      3132
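The numbers in the report are tied together by simple formulas, which is a useful sanity check when reading it. The sketch below recomputes the F1 scores and the accuracy from the rounded precision, recall and support values printed above, so the results are approximate.

```python
# Rounded values copied from the classification report above.
report = {
    0: {"precision": 0.86, "recall": 0.60, "support": 1566},
    1: {"precision": 0.69, "recall": 0.90, "support": 1566},
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_scores = {
    cls: round(f1(row["precision"], row["recall"]), 2) for cls, row in report.items()
}

# Recall * support approximates the number of correctly classified rows per class.
correct = {cls: row["recall"] * row["support"] for cls, row in report.items()}
total = sum(row["support"] for row in report.values())
accuracy = sum(correct.values()) / total

print(f1_scores)
print(round(accuracy, 2))
```

The recomputed values agree with the report (0.71 and 0.78 for F1, 0.75 for accuracy). The recall column also shows where the errors are: roughly 40% of the non-toxic sentences are flagged as toxic, while only about 10% of the toxic ones are missed.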



# Laboratory Exercise - Bonus Task (+ 2 points)

Implement a simple machine learning pipeline to classify whether a given text is **toxic** or not. Use TF-IDF vectorization to convert the text into numerical features and train a `MultinomialNB` model. If needed, use `RandomUnderSampler()`. Compare the results with the transformer model.
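As a reminder of what TF-IDF computes, here is a minimal pure-Python sketch on three made-up documents. It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is, to my knowledge, what scikit-learn's `TfidfVectorizer` does with its default settings. The helper names are hypothetical and the tokenization is a plain whitespace split.

```python
import math
from collections import Counter

# Made-up documents, for illustration only.
docs = ["you are great", "you are awful", "awful awful person"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in tokenized for term in set(doc))


def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf(doc):
    """L2-normalized TF-IDF weights for one tokenized document."""
    raw = {term: count * idf(term) for term, count in Counter(doc).items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}


first_vector = tfidf(tokenized[0])
print({term: round(weight, 3) for term, weight in first_vector.items()})
```

In the first document, "great" receives a higher weight than "you" or "are" because it appears in only one document. That is the point of the IDF factor: terms that occur everywhere carry less information about the class.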

In [107]:
# # Write your code here. Add as many boxes as you need.
# from sklearn.feature_extraction.text import TfidfVectorizer
# from imblearn.under_sampling import RandomUnderSampler

# tv = TfidfVectorizer(binary=False, norm='l2', use_idf=False, smooth_idf=False, lowercase=True, stop_words='english', token_pattern=r'(?u)\b[A-Za-z]+\b', min_df=1, max_df=1.0, max_features=None, ngram_range=(1, 1))
# train_tfidf = tv.fit_transform(dataset["train"]["text"])

In [119]:
# undersampler = RandomUnderSampler(sampling_strategy='auto', random_state=42)
# train_undersample_features, train_undersample_label = undersampler.fit_resample(train_tfidf, dataset["train"]["label"])

In [129]:
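from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline

# Note: use_idf=False disables the IDF weighting, so the features are L2-normalized term frequencies.
vectorizer = TfidfVectorizer(
    binary=False, norm='l2', use_idf=False, smooth_idf=False,
    lowercase=True, stop_words='english', token_pattern=r'(?u)\b[A-Za-z]+\b',
    min_df=1, max_df=1.0, max_features=None, ngram_range=(1, 1),
)
# The imblearn pipeline applies the undersampler only while fitting, not while predicting.
model = make_pipeline(vectorizer, RandomUnderSampler(sampling_strategy='auto', random_state=42), MultinomialNB())
model.fit(dataset["train"]["text"], dataset["train"]["label"])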
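For intuition about the undersampling step, the sketch below shows the basic idea in plain Python: keep a random subset of each class so that every class ends up with as many rows as the smallest one. The function `random_undersample` and the toy data are hypothetical, and this is not how `imblearn` implements it internally.

```python
import random
from collections import Counter, defaultdict


def random_undersample(texts, labels, seed=42):
    """Randomly drop rows so that every class matches the smallest class size."""
    rng = random.Random(seed)

    # Group row indices by class.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    n_min = min(len(indices) for indices in by_class.values())

    # Sample n_min rows from each class without replacement.
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_min))
    kept.sort()  # preserve the original row order

    return [texts[i] for i in kept], [labels[i] for i in kept]


# Toy data: four rows of class 0, two rows of class 1.
toy_texts = ["a", "b", "c", "d", "e", "f"]
toy_labels = [0, 0, 0, 0, 1, 1]

balanced_texts, balanced_labels = random_undersample(toy_texts, toy_labels)
print(Counter(balanced_labels))
```

If the training split is already balanced, this step leaves the data unchanged, which is why the task says to use it only if needed. Checking the label counts of the training split first tells you whether it has any effect.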

In [131]:
y_pred_bonus = model.predict(dataset["test"]["text"])

In [139]:
from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(dataset["test"]["label"], y_pred_bonus))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.88      0.76      0.90      0.82      0.83      0.67      1566
          1       0.79      0.90      0.76      0.84      0.83      0.69      1566

avg / total       0.84      0.83      0.83      0.83      0.83      0.68      3132
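For the requested comparison, the sketch below puts the F1 scores of both models side by side. The values are copied from the two reports in this notebook (rounded to two decimals), so the differences are approximate.

```python
# F1 scores copied from the two reports above (rounded values).
transformer_f1 = {"class 0": 0.71, "class 1": 0.78, "average": 0.75}
naive_bayes_f1 = {"class 0": 0.82, "class 1": 0.84, "average": 0.83}

# Positive difference means the Naive Bayes pipeline scored higher.
differences = {
    key: round(naive_bayes_f1[key] - transformer_f1[key], 2) for key in transformer_f1
}

print(f"{'':10}{'DistilBERT':>12}{'NaiveBayes':>12}{'diff':>8}")
for key in transformer_f1:
    print(f"{key:10}{transformer_f1[key]:>12.2f}{naive_bayes_f1[key]:>12.2f}{differences[key]:>8.2f}")
```

In this run the Naive Bayes pipeline scores higher on every row, with the largest gap on the non-toxic class. Keep in mind that the transformer was trained for a single epoch on 1000 rows with `max_length=15`, so this result reflects the reduced-compute setup and should not be read as a general ranking of the two approaches.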



---