### EPFL CS-433 - Machine Learning Project 2
#### CERN - Zenodo: Adaptable Spam Filter Modelling for Digital Scientific Research Repository 
Training a DistilBERT model for English spam detection.

Authors: Luka Secilmis, Yanis De Busschere, Thomas Ecabert

In [None]:
# Install required packages
!pip install transformers
!pip install pandas
!pip install numpy
!pip install datasets
!pip install scikit-learn
!pip install torch

In [3]:
# Import required packages
import transformers
import pandas as pd
import numpy as np
import datasets
from sklearn.model_selection import train_test_split
import torch

In [None]:
# Set the hardware accelerator to GPU under Edit > Notebook settings (in Google Colab)
# Check whether a GPU is available; it significantly speeds up fine-tuning
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('GPU:', torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print('No GPU available, do not train')

GPU: NVIDIA RTX A5000


## Import pre-processed data
Run script *feat-eng-esc.py* to generate the pre-processed data.

In [6]:
# Load the processed dataset
df = pd.read_csv('dataset-esc.csv')
df = df.dropna()

In [7]:
print(df.shape)

(83778, 2)


In [8]:
df.head()

| | description | label |
|---|---|---|
| 0 | FonePaw iPhone Data Recovery features in recov... | 1 |
| 1 | FonePaw iOS Transfer is mainly designed to tra... | 1 |
| 2 | This is my first upload | 1 |
| 3 | Lost photos from iPhone can be recovered with ... | 1 |
| 4 | I can’t play WLMP file directly, what player d... | 1 |


In [9]:
df['label'].value_counts()

0    55847
1    27931
Name: label, dtype: int64
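
The counts above show roughly a 2:1 imbalance between the two labels. As a reference point for the accuracy figures reported later, here is a minimal sketch of what a trivial classifier that always predicts the most frequent label would score. It uses only the counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Label counts copied from the value_counts() output above
label_counts = {0: 55847, 1: 27931}

total = sum(label_counts.values())
# Share of each label in the full dataset
proportions = {label: count / total for label, count in label_counts.items()}
# Always predicting the most frequent label yields that label's share as accuracy
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = proportions[majority_label]

print(f"Total rows: {total}")
print(f"Majority label: {majority_label}, baseline accuracy: {baseline_accuracy:.4f}")
```

Any model worth keeping should clear this baseline comfortably, which is also why accuracy alone is not reported at the end of the notebook.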

## Training Set-up

In [10]:
# Split data into train (80%) and test (20%) sets
# Note: stratify on 'label' so that each set keeps the label proportions of the full dataset
train, test = train_test_split(df, test_size=0.2, stratify=df['label'], random_state=42)
# Keep an untouched copy of the test split for the final evaluation
test_en = test.copy()
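
To illustrate what stratifying on the label does, the sketch below splits each label group separately so that both subsets keep the original proportions. It is a simplified pure-Python illustration under stated assumptions, not scikit-learn's implementation; `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with per-label proportions preserved."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    # Shuffle and cut each label group on its own
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with a 2:1 imbalance, similar to the dataset
toy_labels = [0] * 200 + [1] * 100
train_idx, test_idx = stratified_split(toy_labels)
test_share = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_share)
```

In the toy run, label 1 makes up one third of the test subset, exactly as in the full list.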

In [None]:
# Load pre-trained tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

def tokenize_(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)
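
With `padding="max_length"` and `truncation=True`, every example ends up with the same number of token ids: longer inputs are cut and shorter ones are padded, with an attention mask marking the real tokens. The toy sketch below shows only that shape logic. The token ids are made up, `pad_and_truncate` is a hypothetical helper, and the real tokenizer additionally handles sub-word splitting and special tokens.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Cut token_ids to max_length, right-pad the rest, and build the attention mask."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}

short = pad_and_truncate([101, 1188, 1110, 102], max_length=6)  # padded
long = pad_and_truncate(list(range(1, 11)), max_length=6)       # truncated
print(short)
print(long)
```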

In [None]:
# Convert dataframes into a DatasetDict for the tokenizer
train = pd.DataFrame({
     "label" : [int(x) for x in train['label'].tolist()],
     "text" : [str(x) for x in train['description'].tolist()]
})

test = pd.DataFrame({
     "label" : [int(x) for x in test['label'].tolist()],
     "text" : [str(x) for x in test['description'].tolist()]
})

In [None]:
dataset = datasets.DatasetDict({"train":datasets.Dataset.from_dict(train),"test":datasets.Dataset.from_dict(test)})
dataset_tokenized = dataset.map(tokenize_, batched=True)

In [None]:
# Import pre-trained DistilBERT model for binary classification
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", num_labels=2)

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier

In [None]:
# TrainingArguments holds all the tunable hyperparameters as well as flags for activating different training options
training_args = transformers.TrainingArguments(output_dir="test_trainer", save_total_limit=3)
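
The `save_total_limit=3` argument caps how many checkpoints are kept on disk: once a fourth one is written, the oldest is removed. The sketch below simulates that rotation, assuming a checkpoint is written every 500 steps. It is a simplified illustration and not the `Trainer` implementation; `rotate_checkpoints` is a hypothetical helper.

```python
from collections import deque

def rotate_checkpoints(steps, save_total_limit):
    """Simulate saving a checkpoint at each step while keeping only the newest ones."""
    kept = deque()
    deleted = []
    for step in steps:
        kept.append(f"checkpoint-{step}")
        # Once the limit is exceeded, drop the oldest checkpoint
        while len(kept) > save_total_limit:
            deleted.append(kept.popleft())
    return list(kept), deleted

kept, deleted = rotate_checkpoints(range(500, 2501, 500), save_total_limit=3)
print("kept:", kept)
print("deleted:", deleted)
```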

In [None]:
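# Metric used to compute and report accuracy
# Note: load_metric is deprecated in recent releases of datasets in favor of the separate evaluate library
from datasets import load_metric
metric = load_metric("accuracy")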

In [None]:
# Transformers models return logits, not class labels
# Take the argmax over the class dimension to convert the logits to predictions before passing them to the metric
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
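
The conversion from logits to predictions can be checked by hand on a few rows. The following is a minimal pure-Python sketch of the same argmax-then-accuracy logic; the logits and labels are made up for illustration, and the helper names are hypothetical.

```python
def argmax_predictions(logits):
    """Return, for each row of logits, the index of the largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# One row of two logits per example (class 0, class 1)
toy_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 3.2]]
toy_labels = [0, 1, 1, 1]
toy_predictions = argmax_predictions(toy_logits)
print(toy_predictions, accuracy(toy_predictions, toy_labels))
```

No softmax is needed here, since softmax is monotonic and leaves the position of the largest logit unchanged.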

In [None]:
# Trainer object with model, training arguments, training and test datasets, and evaluation function
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
    compute_metrics=compute_metrics,
)

In [None]:
# Finally, we fine-tune our model
trainer.train() # Train

***** Running training *****
  Num examples = 67022
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 25134
  Number of trainable parameters = 65783042
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.


Step,Training Loss
500,0.1407
1000,0.0958
1500,0.0989
2000,0.0738
2500,0.0541
3000,0.0761
3500,0.0813
4000,0.0815
4500,0.0745
5000,0.0757


Saving model checkpoint to test_trainer/checkpoint-500
Configuration saved in test_trainer/checkpoint-500/config.json
Model weights saved in test_trainer/checkpoint-500/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-1000
Configuration saved in test_trainer/checkpoint-1000/config.json
Model weights saved in test_trainer/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-1500
Configuration saved in test_trainer/checkpoint-1500/config.json
Model weights saved in test_trainer/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-2000
Configuration saved in test_trainer/checkpoint-2000/config.json
Model weights saved in test_trainer/checkpoint-2000/pytorch_model.bin
Deleting older checkpoint [test_trainer/checkpoint-500] due to args.save_total_limit
Saving model checkpoint to test_trainer/checkpoint-2500
Configuration saved in test_trainer/checkpoint-2500/config.json
Model weights saved in test_trainer/ch

TrainOutput(global_step=25134, training_loss=0.052604342072317574, metrics={'train_runtime': 3743.4451, 'train_samples_per_second': 53.711, 'train_steps_per_second': 6.714, 'total_flos': 2.6634689978167296e+16, 'train_loss': 0.052604342072317574, 'epoch': 3.0})
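
The figures in the training log are consistent with each other, which can be verified with a few lines of arithmetic. The sketch below recomputes the step count and throughput from the numbers printed above; the variable names are illustrative.

```python
import math

num_examples = 67022       # "Num examples" in the training log
batch_size = 8             # "Total train batch size"
num_epochs = 3             # "Num Epochs"
train_runtime = 3743.4451  # 'train_runtime' in seconds, from TrainOutput

# The last batch of each epoch is partial, hence the ceiling
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The recomputed total matches the `global_step=25134` reported by `TrainOutput`.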

In [None]:
# Evaluate model on test set
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 16756
  Batch size = 8
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.


{'eval_loss': 0.05530316382646561,
 'eval_accuracy': 0.9875865361661494,
 'eval_runtime': 98.4031,
 'eval_samples_per_second': 170.279,
 'eval_steps_per_second': 21.29,
 'epoch': 3.0}
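
An accuracy of about 98.76% is easier to interpret as a count of misclassified examples. The sketch below derives that count, and re-derives the throughput figures, from the evaluation output above; the variable names are illustrative.

```python
import math

num_eval = 16756                    # "Num examples" in the evaluation log
eval_accuracy = 0.9875865361661494  # 'eval_accuracy'
eval_runtime = 98.4031              # 'eval_runtime', in seconds
batch_size = 8                      # "Batch size"

# Accuracy is correct / total, so the number of correct predictions is recoverable
num_correct = round(eval_accuracy * num_eval)
num_errors = num_eval - num_correct
eval_steps = math.ceil(num_eval / batch_size)

print(num_correct, num_errors)
print(round(num_eval / eval_runtime, 3), round(eval_steps / eval_runtime, 2))
```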

In [None]:
# Save model
trainer.save_model("model-esc")

Saving model checkpoint to model-esc
Configuration saved in model-esc/config.json
Model weights saved in model-esc/pytorch_model.bin


## Performance on Test Set: English Spam Classifier

In [12]:
from transformers import pipeline
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Load the fine-tuned model and its tokenizer
model = AutoModelForSequenceClassification.from_pretrained('model-esc')
model_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

# Create a pipeline to facilitate the use of the model for classification
classifier = pipeline("text-classification", model=model, tokenizer=model_tokenizer)

In [13]:
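# Ground-truth labels from the untouched copy of the test split
y_true = test_en['label'].tolist()
# Classify every description; the pipeline returns one dict per input with a 'label' and a 'score'
y_predict = classifier(test_en['description'].astype(str).tolist(), padding=True, truncation=True)
# Map the pipeline's string labels back to integers (LABEL_1 -> 1, anything else -> 0)
y_predict = [1 if pred['label'] == 'LABEL_1' else 0 for pred in y_predict]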
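
Because the fine-tuned model was saved without custom label names, the pipeline reports the default names `LABEL_0` and `LABEL_1`. The sketch below shows the mapping step on a mocked pipeline output; the scores are made up, and `to_int_label` is a hypothetical helper that parses the trailing integer instead of comparing against a fixed string.

```python
# Mock of the list a text-classification pipeline returns (scores are made up)
mock_outputs = [
    {"label": "LABEL_1", "score": 0.998},
    {"label": "LABEL_0", "score": 0.973},
    {"label": "LABEL_1", "score": 0.641},
]

def to_int_label(prediction):
    """Parse the trailing integer from a default 'LABEL_<id>' name."""
    return int(prediction["label"].rsplit("_", 1)[1])

mock_predictions = [to_int_label(pred) for pred in mock_outputs]
print(mock_predictions)
```

Parsing the id generalizes to more than two classes, whereas the comparison used in the notebook is sufficient for the binary case.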

In [14]:
# Compute accuracy
from sklearn.metrics import accuracy_score
print(f'Accuracy: {100 * accuracy_score(y_true, y_predict)} %')
print()
# Compute macro-averaged F1 score
from sklearn.metrics import f1_score
print(f'F1 Score: {100 * f1_score(y_true, y_predict, average="macro")} %')
print()
# Compute confusion matrix (rows: true labels, columns: predicted labels)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true, y_predict)  # named cm so the imported function is not shadowed
# True positive rate: share of label-1 examples predicted as 1
print('True positive rate: ', 100 * cm[1][1]/(cm[1][1] + cm[1][0]), '%')
# True negative rate: share of label-0 examples predicted as 0
print('True negative rate: ', 100 * cm[0][0]/(cm[0][0] + cm[0][1]), '%')

Accuracy: 98.75865361661495 %

F1 Score: 98.59991612467391 %

True positive rate:  97.61904761904762 %
True negative rate:  99.32855863921218 %
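
The two rates printed above are the recall on each class, and the macro F1 score averages the per-class F1 scores with equal weight. The sketch below derives all of these from the four cells of a binary confusion matrix. The counts are toy values chosen for illustration, not the notebook's results, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive accuracy, per-class recall and macro F1 from confusion-matrix cells."""
    tpr = tp / (tp + fn)  # recall on label 1
    tnr = tn / (tn + fp)  # recall on label 0
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    f1_neg = 2 * tn / (2 * tn + fn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "true_positive_rate": tpr,
        "true_negative_rate": tnr,
        "macro_f1": (f1_pos + f1_neg) / 2,
    }

# Toy counts laid out as in sklearn's confusion matrix: [[tn, fp], [fn, tp]]
toy_cm = [[8, 2], [1, 9]]
metrics = binary_metrics(tn=toy_cm[0][0], fp=toy_cm[0][1], fn=toy_cm[1][0], tp=toy_cm[1][1])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```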
