### EPFL CS-433 - Machine Learning Project 2
#### CERN - Zenodo: Adaptable Spam Filter Modelling for Digital Scientific Research Repository 
Fine-tuning a DistilBERT model for multilingual spam detection.

Authors: Luka Secilmis, Yanis De Busschere, Thomas Ecabert

In [None]:
# Install required packages
!pip install transformers
!pip install pandas
!pip install numpy
!pip install datasets
!pip install scikit-learn
!pip install torch

In [1]:
# Import required packages
import transformers
import pandas as pd
import numpy as np
import datasets
from sklearn.model_selection import train_test_split
import torch

In [2]:
# In Google Colab, set the hardware accelerator to GPU under Edit > Notebook settings
# Check whether a GPU is available; it speeds up fine-tuning significantly
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, do not train')

GPU: NVIDIA RTX A5000


## Import pre-processed data
Run the script *feat-eng-msc.py* to generate the pre-processed data (*dataset-msc.csv*).

In [3]:
# Load the processed dataset
df = pd.read_csv('dataset-msc.csv')
df = df.dropna()

In [4]:
print(df.shape)

(90618, 2)


In [5]:
df.head()

Unnamed: 0,description,label
0,Masih bingung dalam hal makanan penambah berat...,1
1,FonePaw iPhone Data Recovery features in recov...,1
2,FonePaw iOS Transfer is mainly designed to tra...,1
3,This is my first upload,1
4,Doyantoto merupakan sebuah website Agen Togel ...,1


In [6]:
df['label'].value_counts()

0    52834
1    37784
Name: label, dtype: int64

## Training Set-up

In [7]:
# Split data into train and test sets
# Note: stratify on 'label' to preserve the same proportions of labels in each set as observed in the original dataset
train, test = train_test_split(df, test_size=0.2, stratify=df[['label']], random_state=42)
test_multi = test.copy()
test_no_en = test.copy()
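
As a quick sanity check on the split above, the sketch below recomputes the expected train and test sizes from the row and label counts printed earlier. It assumes scikit-learn rounds a fractional test size up to the next whole row; the results should match the `Num examples` values in the training and evaluation logs further down.

```python
import math

# Figures printed earlier in this notebook
n_total = 90618
n_ham, n_spam = 52834, 37784
test_size = 0.2

# Assumption: a fractional test size is rounded up to a whole number of rows
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test

spam_share = n_spam / n_total
print(f"train rows: {n_train}, test rows: {n_test}")
print(f"spam share overall: {spam_share:.4f}")
# With stratification, each split should hold roughly this share of spam
print(f"expected spam rows in test: about {round(spam_share * n_test)}")
```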

In [8]:
# Load pre-trained tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

def tokenize_(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)
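
To illustrate what `padding="max_length"` and `truncation=True` do to each example, here is a minimal pure-Python sketch. The helper `pad_or_truncate` and the token ids are hypothetical; the real tokenizer also keeps the special tokens in place when truncating, which this sketch ignores. The padding id 0 matches `pad_token_id` in the saved model config shown later.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop tokens past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)        # padding: fill the remainder
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

# A short sequence is padded, a long one is cut (hypothetical token ids)
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with the same length, so batches can be stacked into a single tensor; the attention mask tells the model which positions are padding.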


In [9]:
# Convert DataFrames into a DatasetDict for the tokenizer
train = pd.DataFrame({
     "label" : [int(x) for x in train['label'].tolist()],
     "text" : [str(x) for x in train['description'].tolist()]
})

test = pd.DataFrame({
     "label" : [int(x) for x in test['label'].tolist()],
     "text" : [str(x) for x in test['description'].tolist()]
})

In [10]:
dataset = datasets.DatasetDict({"train":datasets.Dataset.from_dict(train),"test":datasets.Dataset.from_dict(test)})
dataset_tokenized = dataset.map(tokenize_, batched=True)


In [11]:
# Load the pre-trained DistilBERT model with a binary classification head
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-multilingual-cased", num_labels=2)


Some weights of the model checkpoint at distilbert-base-multilingual-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['pre_classifier.weight', 'classif

In [12]:
# TrainingArguments holds the tunable hyperparameters and the flags for the different training options
# Only the output directory and the checkpoint limit are set here; everything else keeps its default
training_args = transformers.TrainingArguments(output_dir="test_trainer", save_total_limit=3)
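
The effect of `save_total_limit=3` can be seen in the training log further down, where `checkpoint-500` is deleted once `checkpoint-2000` has been written. The sketch below simulates that rotation in plain Python. The function `simulate_checkpoints` is hypothetical, and the 500-step save interval is an assumption read off the log rather than something set in this notebook.

```python
def simulate_checkpoints(total_steps, save_steps=500, save_total_limit=3):
    """Mimic 'save every save_steps, keep only the newest save_total_limit'."""
    kept, events = [], []
    for step in range(save_steps, total_steps + 1, save_steps):
        kept.append(step)
        events.append(f"save checkpoint-{step}")
        # Drop the oldest checkpoints once the limit is exceeded
        while len(kept) > save_total_limit:
            events.append(f"delete checkpoint-{kept.pop(0)}")
    return kept, events

kept, events = simulate_checkpoints(total_steps=2500)
for line in events:
    print(line)
print("still on disk:", kept)
```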

In [13]:
# function to compute and report metrics
from datasets import load_metric
metric = load_metric("accuracy")


In [14]:
# Transformers models return logits, so before passing them to the metric
# we convert the logits to predicted class ids with argmax
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
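
For readers who want to see the logits-to-accuracy step without any library, here is a small pure-Python sketch of the same idea. The logits and labels are made-up illustration values, and `accuracy_from_logits` is a hypothetical helper, not part of the notebook.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    # One row of logits per example, one column per class
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return predictions, correct / len(labels)

# Hypothetical logits for four texts (column 0 = not spam, column 1 = spam)
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 3.2]]
labels = [0, 1, 1, 1]
predictions, acc = accuracy_from_logits(logits, labels)
print(predictions, acc)
```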

In [15]:
# Trainer object with model, training arguments, training and test datasets, and evaluation function
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
    compute_metrics=compute_metrics,
)

In [16]:
# Finally, we fine-tune our model
trainer.train()  # Training took about 1h10 (train_runtime of roughly 4168 s) on an NVIDIA RTX A5000 GPU

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 72494
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 27186
  Number of trainable parameters = 135326210


Step,Training Loss
500,0.1752
1000,0.0969
1500,0.0896
2000,0.0874
2500,0.0853
3000,0.084
3500,0.0892
4000,0.0883
4500,0.0944
5000,0.0658


Saving model checkpoint to test_trainer/checkpoint-500
Configuration saved in test_trainer/checkpoint-500/config.json
Model weights saved in test_trainer/checkpoint-500/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-1000
Configuration saved in test_trainer/checkpoint-1000/config.json
Model weights saved in test_trainer/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-1500
Configuration saved in test_trainer/checkpoint-1500/config.json
Model weights saved in test_trainer/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-2000
Configuration saved in test_trainer/checkpoint-2000/config.json
Model weights saved in test_trainer/checkpoint-2000/pytorch_model.bin
Deleting older checkpoint [test_trainer/checkpoint-500] due to args.save_total_limit
Saving model checkpoint to test_trainer/checkpoint-2500
Configuration saved in test_trainer/checkpoint-2500/config.json
Model weights saved in test_trainer/ch

TrainOutput(global_step=27186, training_loss=0.05157783799180919, metrics={'train_runtime': 4167.7462, 'train_samples_per_second': 52.182, 'train_steps_per_second': 6.523, 'total_flos': 2.880927479450419e+16, 'train_loss': 0.05157783799180919, 'epoch': 3.0})
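
The numbers in the training log can be cross-checked with a little arithmetic. The sketch below recomputes the total number of optimization steps and the throughput from the values printed above, assuming the last partial batch of each epoch is kept rather than dropped.

```python
import math

# Values taken from the training log and TrainOutput above
num_examples = 72494
batch_size = 8
num_epochs = 3
train_runtime_s = 4167.7462

# Assumption: the final, smaller batch of each epoch still counts as a step
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.3f}")
print(f"runtime: {train_runtime_s / 60:.1f} min")
```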

In [17]:
# Evaluate model on test set
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 18124
  Batch size = 8


{'eval_loss': 0.05935591459274292,
 'eval_accuracy': 0.9881372765393953,
 'eval_runtime': 98.1932,
 'eval_samples_per_second': 184.575,
 'eval_steps_per_second': 23.077,
 'epoch': 3.0}
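
To make the accuracy figure more tangible, the sketch below converts it back into a count of misclassified test examples, using only the values reported above.

```python
import math

# Values taken from the evaluation output above
n_eval = 18124
eval_accuracy = 0.9881372765393953
batch_size = 8
eval_runtime_s = 98.1932

n_correct = round(eval_accuracy * n_eval)
n_wrong = n_eval - n_correct
eval_steps = math.ceil(n_eval / batch_size)

print(f"correct: {n_correct}, misclassified: {n_wrong}")
print(f"eval steps: {eval_steps}, steps/s: {eval_steps / eval_runtime_s:.3f}")
```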

In [18]:
# Save model
trainer.save_model("msc-model")

Saving model checkpoint to msc-model
Configuration saved in msc-model/config.json
Model weights saved in msc-model/pytorch_model.bin


## Performance on Test Set: Multilingual Spam Classifier

In [20]:
from transformers import pipeline
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Load the fine-tuned model and the tokenizer of the base model
model = AutoModelForSequenceClassification.from_pretrained('msc-model')
model_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

# Create a pipeline to facilitate the use of the model for classification
classifier = pipeline("text-classification", model=model, tokenizer=model_tokenizer)

loading configuration file msc-model/config.json
Model config DistilBertConfig {
  "_name_or_path": "msc-model",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "vocab_size": 119547
}

loading weights file msc-model/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at msc-model.
If your task is similar to the task the

In [21]:
y_true = test_multi['label'].tolist()
y_predict = classifier(test_multi['description'].map(lambda x: str(x)).tolist(), padding=True, truncation=True)
y_predict = [1 if pred['label'] == 'LABEL_1' else 0 for pred in y_predict]

Disabling tokenizer parallelism, we're using DataLoader multithreading already
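
The text-classification pipeline returns one dictionary per input, with a `label` string and a `score`. Since no custom label names were configured, the labels appear as `LABEL_0` and `LABEL_1`. The sketch below shows the mapping to 0/1 on hand-written output; `fake_output` and `to_binary` are hypothetical names and the scores are invented.

```python
def to_binary(pipeline_output, positive_label="LABEL_1"):
    """Map pipeline label strings to 1 (spam) or 0 (not spam)."""
    return [1 if pred["label"] == positive_label else 0 for pred in pipeline_output]

# Hand-written stand-in for what the classifier returns (scores are invented)
fake_output = [
    {"label": "LABEL_1", "score": 0.998},
    {"label": "LABEL_0", "score": 0.973},
    {"label": "LABEL_1", "score": 0.641},
]
binary = to_binary(fake_output)
print(binary)
```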


In [22]:
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Accuracy
print(f'Accuracy: {100 * accuracy_score(y_true, y_predict)} %')
print()
# Macro-averaged F1 score
print(f'F1 Score: {100 * f1_score(y_true, y_predict, average="macro")} %')
print()
# Confusion matrix: rows are true labels, columns are predicted labels
# Stored as `cm` so that the imported confusion_matrix function is not shadowed
cm = confusion_matrix(y_true, y_predict)

# The two values below are rates: the share of spam (label 1) and of non-spam (label 0) classified correctly
print('True positives: ', 100 * cm[1][1]/(cm[1][1] + cm[1][0]), '%')
print('True negatives: ', 100 * cm[0][0]/(cm[0][0] + cm[0][1]), '%')

Accuracy: 98.81372765393954 %

F1 Score: 98.77898987594016 %

True positives:  98.26650787349477 %
True negatives:  99.20507239519259 %
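
The sketch below shows how these four figures relate to the cells of a confusion matrix. The notebook does not print the raw counts, so the counts used here are an assumption: they were back-solved from the reported rates and the test-set size of 18124, and should be read as a consistency check rather than as recorded output.

```python
# Assumed counts (not printed in the notebook), back-solved from the reported rates
tn, fp = 10483, 84    # true label 0: predicted 0, predicted 1
fn, tp = 131, 7426    # true label 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
tpr = tp / (tp + fn)   # share of spam correctly flagged (recall on label 1)
tnr = tn / (tn + fp)   # share of non-spam correctly passed (recall on label 0)

# Per-class F1, then the unweighted mean (macro average)
f1_spam = 2 * tp / (2 * tp + fp + fn)
f1_ham = 2 * tn / (2 * tn + fp + fn)
macro_f1 = (f1_spam + f1_ham) / 2

print(f"Accuracy: {100 * accuracy} %")
print(f"F1 Score: {100 * macro_f1} %")
print(f"True positive rate: {100 * tpr} %")
print(f"True negative rate: {100 * tnr} %")
```

Under these assumed counts, 131 spam records slip through and 84 legitimate records are flagged, which adds up to the 215 errors implied by the accuracy.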


## Performance on Test Set: Spam Classifier on Non-English Texts

In [23]:
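# Filter out all the English texts
# Note: ftlangdetect comes from the fasttext-langdetect package, which is not installed in the first cell
import re
from ftlangdetect import detect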
CLEANING_REGEX = re.compile(r'[^a-zA-Z0-9\s]', re.MULTILINE)
def detect_lang(descr):
    d = CLEANING_REGEX.sub('', str(descr))
    d = d.replace('\r', ' ').replace('\n', ' ')
    lang = detect(d)['lang']
    return lang

test_no_en['lang'] = test_no_en['description'].map(lambda x: detect_lang(x))
test_no_en = test_no_en[test_no_en['lang'] != 'en']
test_no_en = test_no_en.drop(columns=['lang'])
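
It is worth seeing what the cleaning step passes on to the language detector. The sketch below applies the same regular expression and newline replacement to a few made-up strings (the samples and the helper name `clean_for_detection` are hypothetical). Because only ASCII letters, digits and whitespace are kept, accented letters are dropped and text in non-Latin scripts is reduced to whitespace, which may affect the detected language for such records.

```python
import re

# Same pattern as in the cell above
CLEANING_REGEX = re.compile(r'[^a-zA-Z0-9\s]', re.MULTILINE)

def clean_for_detection(descr):
    """Cleaning applied before language detection (detection itself omitted)."""
    d = CLEANING_REGEX.sub('', str(descr))
    return d.replace('\r', ' ').replace('\n', ' ')

# Made-up sample descriptions
samples = ["Hello, world!\nBye", "Agen Togel: résumé", "Привет мир"]
cleaned = [clean_for_detection(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```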



In [24]:
y_true = test_no_en['label'].tolist()
y_predict = classifier(test_no_en['description'].map(lambda x: str(x)).tolist(), padding=True, truncation=True)
y_predict = [1 if pred['label'] == 'LABEL_1' else 0 for pred in y_predict]

In [25]:
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Accuracy
print(f'Accuracy: {100 * accuracy_score(y_true, y_predict)} %')
print()
# Macro-averaged F1 score
print(f'F1 Score: {100 * f1_score(y_true, y_predict, average="macro")} %')
print()
# Confusion matrix: rows are true labels, columns are predicted labels
# Stored as `cm` so that the imported confusion_matrix function is not shadowed
cm = confusion_matrix(y_true, y_predict)
# The two values below are rates: the share of spam (label 1) and of non-spam (label 0) classified correctly
print('True positives: ', 100 * cm[1][1]/(cm[1][1] + cm[1][0]), '%')
print('True negatives: ', 100 * cm[0][0]/(cm[0][0] + cm[0][1]), '%')

Accuracy: 98.31041257367387 %

F1 Score: 97.48498193971817 %

True positives:  98.50746268656717 %
True negatives:  97.57009345794393 %
