## **Spam Classification with BERT**

The aim of this project is to build a spam detection model that predicts whether a message is spam or not. We fine-tune a BERT (Bidirectional Encoder Representations from Transformers) model for this task, using the Hugging Face Transformers library.

**Dataset**

The dataset is the SMS Spam Collection, available at https://www.kaggle.com/uciml/sms-spam-collection-dataset.

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English SMS messages, each tagged as either ham (legitimate) or spam.

In [None]:
#!pip install transformers

: 

**Importing the required libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow import keras 
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model

import transformers
from transformers import BertTokenizer, TFBertModel

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

import warnings
warnings.filterwarnings("ignore")

: 

In [None]:
nltk.download('stopwords')

: 

In [None]:
cd '/content/drive/My Drive/moje pliki/data'

: 

**First observations:**

In [None]:
df = pd.read_csv('spam.csv', encoding='latin-1')
df.head()

: 

In [None]:
df.shape

: 

In [None]:
df.info()

: 


### **Data preparation**

Remove unnecessary variables:


In [None]:
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)

: 

Rename columns:

In [None]:
df.rename(columns={'v1': 'Class', 'v2': 'Text'}, inplace=True)

: 

In [None]:
df.head()

: 

We encode the "Class" column as 0 and 1:

- spam = 1
- ham = 0



In [None]:
df['Class'] = df['Class'].map({'ham':0, 'spam':1})
df.head()

: 

In [None]:
df['Text'][2]

: 

In [None]:
# Checking for any missing values
df.isna().sum()

: 

### **EDA and Data Visualization**

We check the distribution of the Class variable:

In [None]:
df['Class'].value_counts()

: 

In [None]:
sns.set(style = "darkgrid" , font_scale = 1.2)
sns.countplot(x=df.Class).set_title("Number of ham and spam messages")
plt.show()

: 

In [None]:
df.describe()

: 

The target variable is imbalanced: "ham" messages far outnumber "spam" ones.

**SMS Distribution**

Now we check the percentage of spam SMS and ham SMS messages:

In [None]:
sms = df["Class"].value_counts(sort=True)
sms.plot(kind="pie", labels=["ham", "spam"], autopct="%1.0f%%")

plt.title("SMS messages Distribution")
plt.ylabel("")
plt.show()

: 

About 87% of these SMS messages are ham (legitimate) and 13% are spam.
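
With a split this uneven, accuracy alone can be misleading. The library-free sketch below uses a hypothetical sample of 100 labels that mirrors the 87% / 13% split (the list `labels` is an assumption for illustration, not the real data). It shows that a trivial classifier which always predicts ham would already reach 87% accuracy.

```python
from collections import Counter

# Hypothetical sample mirroring the 87% ham / 13% spam split (0 = ham, 1 = spam).
labels = [0] * 87 + [1] * 13

counts = Counter(labels)
shares = {cls: n / len(labels) for cls, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(shares)
print(majority_class, baseline_accuracy)
```

Any model trained on this data should therefore be judged against that baseline rather than against 50%.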


Length of text messages:

In [None]:
df['length'] = df.Text.apply(len)
df.head()

: 

In [None]:
plt.figure(figsize=(8, 5))
df[df.Class == 0].length.plot(bins=35, kind='hist', color='blue', label='Ham', alpha=0.6)
df[df.Class == 1].length.plot(kind='hist', color='red', label='Spam', alpha=0.6)
plt.legend()
plt.xlabel("Messages Length");

: 

Now let's see whether message length differs between spam and ham:

In [None]:
_, ax = plt.subplots(figsize=(10, 4))
sns.kdeplot(df.loc[df.Class == 0, "length"], fill=True, label="Ham", clip=(-50, 250), ax=ax)
sns.kdeplot(df.loc[df.Class == 1, "length"], fill=True, label="Spam", ax=ax)
ax.set(
    xlabel="Length",
    ylabel="Density",
    title="Length of messages.",
)
ax.legend(loc="upper right")
plt.show()

: 

As we can see, spam messages tend to be longer than ham messages (which is expected, since they contain more words) and typically have around 150 characters.

### **Text Pre-processing**

In the next step we clean the text, remove stop words, and apply stemming to each message:


In [None]:
stop_words = stopwords.words('english')
print(stop_words[::10])

porter = PorterStemmer()

: 

In [None]:
def clean_text(words):
    """Keep letters only, lowercase the text, and collapse repeated whitespace."""
    words = re.sub("[^a-zA-Z]"," ", words)
    text = words.lower().split()
    return " ".join(text)

def remove_stopwords(text):
    """Remove English stop words from the text."""
    text = [word.lower() for word in text.split() if word.lower() not in stop_words]
    return " ".join(text)

def stemmer(stem_text):
    """Apply Porter stemming to every word in the text."""
    stem_text = [porter.stem(word) for word in stem_text.split()]
    return " ".join(stem_text)

: 
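
To make the effect of the cleaning steps concrete, here is a small self-contained sketch. It mirrors the cleaning and stop-word logic above, but the function names (`demo_clean_text`, `demo_remove_stopwords`), the four-word stop list, and the sample message are all assumptions made up for illustration; stemming is left out so that the example does not depend on NLTK.

```python
import re

# Tiny hypothetical stop-word subset; the notebook uses the full NLTK English list.
demo_stop_words = {"you", "have", "a", "now"}

def demo_clean_text(text):
    """Replace every non-letter with a space, lowercase, and collapse whitespace."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return " ".join(letters_only.lower().split())

def demo_remove_stopwords(text):
    """Drop the words that appear in the demo stop-word set."""
    return " ".join(w for w in text.split() if w not in demo_stop_words)

# Made-up message in the style of an SMS spam text.
sample = "WINNER!! You have won a £1000 prize, call 09061701461 now."
cleaned = demo_clean_text(sample)
filtered = demo_remove_stopwords(cleaned)

print(cleaned)   # winner you have won a prize call now
print(filtered)  # winner won prize call
```

Note that digits, currency symbols, and punctuation disappear entirely, so signals such as phone numbers or prize amounts are no longer visible to the model after this step.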

In [None]:
df['Text'] = df['Text'].apply(clean_text)
df['Text'] = df['Text'].apply(remove_stopwords)
df['Text'] = df['Text'].apply(stemmer)

: 

In [None]:
df.head()

: 

In [None]:
print(df['Text'].apply(lambda x: len(x.split(' '))).sum())

: 

### **BERT model**

BERT (Bidirectional Encoder Representations from Transformers) is a pretrained NLP model developed by Google AI. It is a bidirectionally trained Transformer, a popular attention-based architecture used for language modelling, so it can capture a deeper sense of language context and flow than single-direction language models. Instead of predicting the next word in a sequence, BERT uses a technique called Masked LM (MLM): it randomly masks words in the sentence and then tries to predict them. Unlike earlier language models, it takes both the previous and the next tokens into account at the same time. This contrasts with earlier LSTM-based models, which combined separate left-to-right and right-to-left passes over the text sequence.
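
The masking idea can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual pretraining code: the function name `mask_tokens`, the whitespace tokenization, and the sample sentence are assumptions, and the original recipe (which selects about 15% of tokens and sometimes keeps the token or swaps in a random one instead of `[MASK]`) is reduced here to plain replacement.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets); targets maps position -> original token."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")   # hide the token from the model
            targets[i] = tok          # the model must predict this token
        else:
            masked.append(tok)
    return masked, targets

# Made-up sentence, split on whitespace for simplicity.
tokens = "free entry in a weekly competition to win tickets".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)

print(masked)
print(targets)
```

During pretraining the model sees `masked` and is trained to recover the entries of `targets`, using the unmasked words on both sides of each gap.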

The first step is to tokenize our dataset. Tokenization will allow us to feed batches of sequences into the model at the same time.

To tokenize our dataset we have to choose a pre-trained model. We load the tokenizer of the base model (`bert-base-uncased`) from the Hugging Face Transformers library, so that it matches the BERT model loaded below.

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer

: 

Now we have to load the BERT model. Many different BERT models are available in the Transformers library. We use the `TFBertModel` class with the `bert-base-uncased` weights.

In [None]:
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

: 

Splitting the data into train and test sets:

In [None]:
X = df['Text']
y = df['Class']

: 

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state = 0)

: 
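
The split above is purely random, so the spam share in the test set can drift away from the overall 13%. `train_test_split` accepts a `stratify` argument (for example `stratify=y`) to keep class proportions equal in both parts. The library-free sketch below illustrates the idea only; `stratified_split` is a hypothetical helper and the 870 / 130 label counts are an assumption chosen to mirror the 87% / 13% split, not the real data.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # same fraction for every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 870 ham (0) and 130 spam (1).
labels = [0] * 870 + [1] * 130
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=0)
spam_share_test = sum(labels[i] for i in test_idx) / len(test_idx)

print(len(train_idx), len(test_idx), round(spam_share_test, 2))  # 800 200 0.13
```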

The following function encodes our dataset with the BERT tokenizer. We use a maximum sequence length of 64 tokens (`maxlen`).


In [None]:
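def encode(text, maxlen):
  """Tokenize each message and return padded input IDs and attention masks."""
  input_ids=[]
  attention_masks=[]

  for row in text:
    encoded = tokenizer.encode_plus(
        row,
        add_special_tokens=True,
        max_length=maxlen,
        padding='max_length',   # replaces the deprecated pad_to_max_length=True
        truncation=True,        # cut longer messages so every row has exactly maxlen tokens
        return_attention_mask=True,
    )
    input_ids.append(encoded['input_ids'])
    attention_masks.append(encoded['attention_mask'])

  return np.array(input_ids),np.array(attention_masks)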


: 
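
What padding, truncation, and the attention mask amount to can be shown without the tokenizer. The sketch below is an illustration under assumptions: `pad_or_truncate` is a hypothetical helper, and the word-piece IDs passed to it are made-up numbers. The special-token IDs are those of the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD] in bert-base-uncased

def pad_or_truncate(token_ids, maxlen):
    """Add [CLS]/[SEP], truncate to maxlen, pad with [PAD], and build the mask."""
    body = token_ids[: maxlen - 2]             # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)      # 1 = real token
    n_pad = maxlen - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad  # 0 = padding

# Short message (made-up IDs): padded up to maxlen.
ids, mask = pad_or_truncate([2489, 4443, 2663], maxlen=8)
print(ids)   # [101, 2489, 4443, 2663, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]

# Long message (made-up IDs): truncated down to maxlen.
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), maxlen=8)
print(long_ids)
```

Because every row ends up with exactly `maxlen` entries, the rows can be stacked into a rectangular array, and the mask tells the model which positions to ignore.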

Based on this, the encodings for our training and test datasets are generated as follows:

In [None]:
X_train_input_ids, X_train_attention_masks = encode(X_train.values, maxlen=64)
X_test_input_ids, X_test_attention_masks = encode(X_test.values, maxlen=64)

: 

#### **Build the model**

We build a model on top of BERT by adding two Dense layers with a Dropout layer between them.

In [None]:
def build_model(bert_model):
   input_word_ids = tf.keras.Input(shape=(64,),dtype='int32')
   attention_masks = tf.keras.Input(shape=(64,),dtype='int32')

   # BERT returns (sequence output, pooled output); index 1 is the pooled [CLS] representation
   bert_outputs = bert_model([input_word_ids,attention_masks])
   output = bert_outputs[1]
   output = tf.keras.layers.Dense(32,activation='relu')(output)
   output = tf.keras.layers.Dropout(0.2)(output)
   output = tf.keras.layers.Dense(1,activation='sigmoid')(output)   # probability of spam

   model = tf.keras.models.Model(inputs = [input_word_ids,attention_masks], outputs = output)
   # `learning_rate` replaces the deprecated `lr` argument
   model.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])

   return model

: 

In [None]:
model = build_model(bert_model)
model.summary()

: 

We set class weights for the loss function to adjust for the class imbalance. The 'spam' class is given 8 times the weight of the 'ham' class.

We train the model for 5 epochs:

In [None]:
class_weight = {0: 1, 1: 8}

: 
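
For comparison with the hand-picked weights, a common heuristic sets each class weight to `n_samples / (n_classes * class_count)`. The sketch below applies it to hypothetical counts of 870 ham and 130 spam messages, chosen only to mirror the 87% / 13% split; `balanced_class_weights` is a hypothetical helper, not part of the notebook.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}

# Hypothetical class counts: 0 = ham, 1 = spam.
counts = {0: 870, 1: 130}
weights = balanced_class_weights(counts)
ratio = weights[1] / weights[0]   # how much more a spam example counts than a ham one

print(weights)
print(round(ratio, 2))  # 6.69
```

Under these assumed counts the heuristic gives spam roughly 6.7 times the weight of ham, so the factor of 8 used here leans a little further toward catching spam.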

In [None]:
history = model.fit(
    [X_train_input_ids, X_train_attention_masks],
    y_train,
    batch_size=32,
    epochs=5,
    validation_data=([X_test_input_ids, X_test_attention_masks], y_test),
    class_weight=class_weight)

: 

Visualization of training:

In [None]:
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()
  
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")

: 

Evaluation on the test set:

In [None]:
loss, accuracy = model.evaluate([X_test_input_ids, X_test_attention_masks], y_test)
print('Test accuracy :', accuracy)

: 
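
Since the classes are imbalanced, precision and recall for the spam class are worth checking alongside accuracy. The sketch below shows how they follow from thresholded sigmoid outputs. It is library-free and runs on made-up values: `binary_report`, `y_true`, and `y_prob` are assumptions for illustration, and in the notebook the probabilities would come from `model.predict(...)`. The `confusion_matrix` and `classification_report` functions imported earlier compute the same quantities.

```python
def binary_report(y_true, y_prob, threshold=0.5):
    """Threshold probabilities and return confusion counts plus basic metrics."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # flagged messages that are spam
        "recall": tp / (tp + fn) if tp + fn else 0.0,     # spam messages that were caught
    }

# Made-up labels (1 = spam) and predicted spam probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.2, 0.05, 0.7, 0.3, 0.1, 0.9, 0.4, 0.8, 0.2]

report = binary_report(y_true, y_prob)
print(report)
```

In this toy case accuracy is 0.8 while one of the three spam messages is missed, which is the kind of gap that accuracy alone hides.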

In [None]:
# Save the model weights
model.save_weights('bert_model')

: 

### **Summary**

For our analysis we used a pretrained BERT model to solve our classification problem. After training, the model achieved an accuracy of 98% on the test set, which is a very good result compared with the Machine Learning models we used previously (e.g. Logistic Regression).


: 