# Email Classification Project Using BERT

Using BERT to generate number vectors based on sentences, pass the number vectors through a simple neural network to overall classify emails as 'spam' or 'not spam' emails. BERT contains the steps of 'preprocess' and 'encode' to aid in generating word vectors in 768 length.

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text

In [2]:
import pandas as pd

df = pd.read_csv("spam.csv", encoding='ISO-8859-1') #fixed unicode error thrown while reading
# get rid of empty trailing columns 
df.drop(df.columns[[2, 3, 4]], axis = 1, inplace = True)
df.head(5)

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
# overview of data
df.groupby('Category').describe()

Unnamed: 0_level_0,Message,Message,Message,Message
Unnamed: 0_level_1,count,unique,top,freq
Category,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
ham,4825,4516,"Sorry, I'll call later",30
spam,747,653,Please call our customer service representativ...,4


In [4]:
# there is some imbalance in the dataset between 'spam' & 'ham' count
df['Category'].value_counts()

ham     4825
spam     747
Name: Category, dtype: int64

In [5]:
747/4825

0.15481865284974095

Currently as shown, only 15% of the total dataset are spam emails. To handle this imbalance,can use a downsampling method that will look at the 4825 not spam emails, randomly sample 747 of them so that the dataset will become balanced. Unfortunately, it does mean that the remaining of not spam email data is discarded.

In [6]:
# spliting the dataset into their two new df based on their Category
df_spam = df[df['Category']=='spam']
df_spam.shape # spam df

(747, 2)

In [7]:
df_ham = df[df['Category']=='ham']
df_ham.shape # not spam df

(4825, 2)

In [8]:
df_ham_downsampled = df_ham.sample(df_spam.shape[0])
df_ham_downsampled.shape # scaled down ham df random samples equal to spam df size

(747, 2)

In [9]:
df_balanced = pd.concat([df_spam, df_ham_downsampled])
df_balanced.shape # adding two df together

(1494, 2)

In [10]:
df_balanced['Category'].value_counts() # now have balanced data

spam    747
ham     747
Name: Category, dtype: int64

In [11]:
df_balanced.sample(5) # double checking balanced data with 5 random samples

Unnamed: 0,Category,Message
3124,spam,1st wk FREE! Gr8 tones str8 2 u each wk. Txt N...
992,ham,Up to Ì_... ÌÏ wan come then come lor... But i...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3997,spam,We tried to call you re your reply to our sms ...
1029,ham,"Its good, we'll find a way"


In [12]:
# Create new column that identies spam as 1, not spam 0
df_balanced['spam'] = df_balanced['Category'].apply(lambda x: 1 if x=='spam'else 0)
df_balanced.sample(10)

Unnamed: 0,Category,Message,spam
2490,ham,Dun b sad.. It's over.. Dun thk abt it already...,0
849,spam,Today's Offer! Claim ur å£150 worth of discoun...,1
1690,spam,Sunshine Quiz Wkly Q! Win a top Sony DVD playe...,1
2297,ham,Draw va?i dont think so:),0
2022,spam,U can WIN å£100 of Music Gift Vouchers every w...,1
3338,ham,Babe !!!! I LOVE YOU !!!! *covers your face in...,0
2871,ham,See you there!,0
4947,spam,"Hi this is Amy, we will be sending you a free ...",1
3272,ham,Just finished eating. Got u a plate. NOT lefto...,0
4965,spam,URGENT! We are trying to contact U. Todays dra...,1


In [13]:
# split dataset
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df_balanced['Message'], df_balanced['spam'], stratify=df_balanced['spam'])
# straify used to make distribution of train and test sets equal

In [14]:
x_train.head(5)

5076    Guy, no flash me now. If you go call me, call ...
2263    Not heard from U4 a while. Call 4 rude chat pr...
4434    Don't b floppy... b snappy & happy! Only gay c...
1429    For sale - arsenal dartboard. Good condition b...
2129           Mine here like all fr china then so noisy.
Name: Message, dtype: object

In [15]:
# Downloading BERT base pretrained model from Tensorflow library
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/4")

Now that the BERT pretrained model has been downloaded from the Tensorflow library, it can be used to convert the sentences under the Message column into scaled numeric vectors to be used for classification in neural network.

In [16]:
def get_sentence_embedding(sentences):
    preprocessed_text = bert_preprocess(sentences)
    return bert_encoder(preprocessed_text)['pooled_output'] 
# returns vectors 768 size for each sentence passed through function

get_sentence_embedding([
    "Is that seriously how you spell his name?",
    "Oops, I'll let you know when my roommate's done"
])

<tf.Tensor: shape=(2, 768), dtype=float32, numpy=
array([[-0.688842  ,  0.48915437,  0.99993664, ...,  0.9999838 ,
        -0.8203347 ,  0.99689585],
       [-0.6138372 ,  0.38423482,  0.99970055, ...,  0.9998746 ,
        -0.29772863,  0.96822155]], dtype=float32)>

In [17]:
# more testing for get_sentence_embedding function
e = get_sentence_embedding([
    "banana",
    "grapes",
    "mango",
    "elon musk",
    "tim cook",
    "ariana grande"
])

In [18]:
e

<tf.Tensor: shape=(6, 768), dtype=float32, numpy=
array([[-0.83494806,  0.6283766 ,  0.999989  , ...,  0.9999964 ,
         0.47890285,  0.9889488 ],
       [-0.6465627 ,  0.2901052 ,  0.99960643, ...,  0.99989045,
        -0.7451261 ,  0.96996635],
       [-0.3773091 ,  0.23452356,  0.99860287, ...,  0.99963367,
        -0.82702005,  0.8739308 ],
       [-0.7615175 ,  0.46662855,  0.9999174 , ...,  0.9999793 ,
        -0.3928189 ,  0.9895317 ],
       [-0.66849387,  0.47295344,  0.99984246, ...,  0.99993926,
        -0.34909454,  0.97035396],
       [-0.73349375,  0.49436417,  0.9998204 , ...,  0.9999513 ,
        -0.68891346,  0.98477036]], dtype=float32)>

In [19]:
# using cosine similarity to measure/compare how similar vectors are related
# if two vectors point in the same direction, cosine_sim will be closer to 1
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity([e[0]], [e[1]]) #calculates similarity e index 0 and 1 (two fruits)

array([[0.840076]], dtype=float32)

In [20]:
cosine_similarity([e[1]], [e[4]])

array([[0.94274414]], dtype=float32)

# Building the Model

In [21]:
# BERT layers
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)

# Neural Network layers - Functional model
l= tf.keras.layers.Dropout(0.1, name='dropout')(outputs['pooled_output'])
l = tf.keras.layers.Dense(1, activation='sigmoid', name='output')(l)

# Construct final model
model = tf.keras.Model(inputs=[text_input], outputs=[l])

In [22]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text (InputLayer)              [(None,)]            0           []                               
                                                                                                  
 keras_layer (KerasLayer)       {'input_mask': (Non  0           ['text[0][0]']                   
                                e, 128),                                                          
                                 'input_type_ids':                                                
                                (None, 128),                                                      
                                 'input_word_ids':                                                
                                (None, 128)}                                                  

In [23]:
METRICS = [
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall')
] # prints accuracy, precision, recall with each epoch

model.compile(optimizer='adam',
             loss='binary_crossentropy',
             metrics=METRICS)

In [24]:
model.fit(x_train, y_train, epochs=10)

Epoch 1/10


InvalidArgumentError: Graph execution error:

indices[2979] = 29646 is not in [0, 28996)
	 [[{{node word_embeddings/Gather}}]] [Op:__inference_train_function_70815]

In [None]:
x_train.shape

In [None]:
y_train.shape

# Predictions on the Model

Building prediction test on the model to compare to actual results. The model has a threhold of 0.5, once the model predicts an example above this threshold it will be classified as spam. Anyting under will be labeled as not spam.

In [None]:
y_predicted = model.predict(x_test)
y_predicted = y_predicted.flatten()

In [None]:
import numpy as np

y_predicted = np.where(y_predicted > 0.5, 1, 0)
# set predicted values > 0.5 to 1, otherwise 0
y_predicted

# Visualizing the Model

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
# building confusion matrix array
cm = confusion_matrix(y_test, y_predicted)
cm

In [None]:
# better visual of confusion matrix(cm)
from matplotlib import pyplot as plt
import seaborn as sn

sn.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('True')

# Classification report

In [None]:
print(classification_report(y_test, y_predicted))

In [None]:
# Inference on model
reviews =[
    'Enter for a chance to win $5000, hurry, offer valid until may 31, 2022',
    'Save up to $25 on Uber Eats, get your free meal before it is gone',
    'Enjoy $1 medium chili cheese tots or fries with our new app, hurry this deal ends april 17',
    'Hello Sam, Are you coming for a cricket game tomorrow',
    'I look forward to seeing you again next meeting'
]
model.predict(reviews)

As shown in the array above, the model predicted accurately with the spam email examples provdied as the first three examples scored above 0.5 the model will categorize them correctly as spam and the last two examples accurately scored below 0.5, so they will be not spam. 