# Email Fraud Detector: BERT Model Build

#### Ross Willett

## File Introduction

In this file, a BERT transformer loaded from [TensorFlow Hub](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4), with additional layers added on top of it for email classification, will be trained. A BERT transformer will be used as the base for a transfer learning model for two reasons. First, BERT represents text with learned embeddings that place words with similar meanings close together in vector space. This means the model does not have to rely on specific words to identify a fraudulent email, which should make it more generalizable. Second, BERT takes into account the context of each word in relation to the other words in the input (up to 128 tokens with the preprocessor used here). This allows the model to express the context of the text numerically, which should improve its ability to classify an email as fraudulent or not.
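
To illustrate what "close together in vector space" means, the sketch below compares hand-made three-dimensional vectors using cosine similarity. The vectors and the words attached to them are invented for illustration only; they are not real BERT embeddings, which have 768 dimensions in the model used in this file.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only (not real BERT embeddings)
toy_embeddings = {
    "payment": [0.9, 0.1, 0.2],
    "transfer": [0.8, 0.2, 0.3],
    "meeting": [0.1, 0.9, 0.1],
}

# Words given similar vectors score close to 1; dissimilar ones score lower
sim_related = cosine_similarity(toy_embeddings["payment"], toy_embeddings["transfer"])
sim_unrelated = cosine_similarity(toy_embeddings["payment"], toy_embeddings["meeting"])
print(sim_related > sim_unrelated)
```

A classifier built on such representations can respond similarly to "payment" and "transfer" even if only one of them appeared in the fraudulent training emails.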

## Preparing the Model

Before the model can be built and trained, the data needs to be prepared.

In [4]:
# Import data manipulation libraries
import pandas as pd
import numpy as np

# Model selection libraries
from sklearn.model_selection import train_test_split

# Model Evaluation Libraries
from sklearn.metrics import accuracy_score

# Import TensorFlow and Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Import TensorFlow Hub and TensorFlow Text (required libraries for the pre-trained BERT model)
import tensorflow_hub as hub
import tensorflow_text  # Not referenced directly; importing it registers the ops the BERT preprocessor needs

# Import library for saving files
import joblib

In [5]:
# Import warnings and suppress them
import warnings
warnings.filterwarnings('ignore')

In [6]:
# Configure Pandas to show all columns / rows
pd.options.display.max_columns = 2000
pd.options.display.max_rows = 2000
# Set column max width larger
pd.set_option('display.max_colwidth', 200)

In [7]:
# Load X remainder
X_remainder = pd.read_csv('./data/X_remainder.csv')
# Load X test
X_test = pd.read_csv('./data/X_test.csv')
# Load y remainder
y_remainder = pd.read_csv('./data/y_remainder.csv')
# Load y test
y_test = pd.read_csv('./data/y_test.csv')

## Building the Model

Now that the data has been appropriately separated, the model can be built and trained. First, the BERT preprocessor and the pre-trained BERT encoder need to be loaded from TensorFlow Hub (preprocessor located [here](https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3), encoder located [here](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4)). The BERT model used is a comparatively compact version of BERT (BERT-Base), which allows for faster training and testing. It consists of 12 hidden layers (i.e. Transformer blocks), a hidden size of 768, and 12 attention heads.

For the purposes of this project, the BERT encoder will be frozen so that only the layers added on top of its output are trained. The encoder has already been trained on a large text corpus and already produces appropriate representations of the relationships between words. Because that corpus is likely better suited to establishing these relationships than the email data, the pre-trained weights will not be adjusted.

On top of the pooled output of the BERT model, a 128-node layer with ReLU activation will be added, followed by a 1-node layer with sigmoid activation. The pooled output is used because this task classifies the email as a whole and does not need a separate output for each token in the sequence, which makes the pooled output the most applicable value to pass into the other layers. The trainable ReLU layer allows the outputs from the BERT model to be adapted before they are passed to the sigmoid layer. The one-node sigmoid layer outputs a value between 0 and 1, which will be used to classify the email as fraudulent or not.
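
As a rough illustration of what the two added layers compute, the sketch below runs a forward pass through a ReLU layer and a sigmoid output node in plain Python. The sizes (3 inputs, 2 hidden nodes), the weights, the 0.5 decision threshold, and the treatment of the positive class as "fraudulent" are all assumptions made for the example; the real model uses 768 pooled values, 128 hidden nodes, and learned weights.

```python
import math

def relu(values):
    """Element-wise ReLU: negative values become 0."""
    return [max(0.0, v) for v in values]

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def classify(pooled, hidden_w, hidden_b, out_w, out_b, threshold=0.5):
    """ReLU layer followed by a single sigmoid node, then a threshold."""
    hidden = relu(dense(pooled, hidden_w, hidden_b))
    probability = sigmoid(dense(hidden, [out_w], [out_b])[0])
    return probability, probability >= threshold

# Toy stand-in for the pooled BERT output and made-up weights (illustrative only)
pooled = [0.5, -1.0, 2.0]
hidden_w = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
hidden_b = [0.0, 0.0]
out_w = [1.0, 1.0]
out_b = -1.0

probability, is_fraud = classify(pooled, hidden_w, hidden_b, out_w, out_b)
print(round(probability, 3), is_fraud)
```

In the actual model, Keras performs these computations and only the weights of these two layers are updated during training.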

In [8]:
# Load the BERT preprocessor (tokenizer) from TensorFlow Hub
preprocessor = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
# Load the pre-trained BERT encoder from TensorFlow Hub with its weights frozen
encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4", trainable=False)



In [9]:
# Instantiate the input layer of the model
text_input = layers.Input(shape=(), dtype=tf.string)
# Pass the input layer to the BERT tokenizing layer
encoder_inputs = preprocessor(text_input)
# Pass the tokenized input to the BERT encoder
outputs = encoder(encoder_inputs)
# Get the pooled output of the BERT encoder
pooled_output = outputs["pooled_output"]
# Pass the pooled output of the BERT encoder to a 128 node relu layer
relu_layer = layers.Dense(128, activation='relu')(pooled_output)
# Pass the relu layer to a 1 node sigmoid layer for classification
output = layers.Dense(1, activation='sigmoid')(relu_layer)

In [10]:
# Instantiate the model
bert_model = tf.keras.Model(inputs=text_input, outputs=output)

Now that the model has been built, the layers within this neural network can be examined.

In [13]:
# Examine the layers of the model
bert_model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None,)]            0           []                               
                                                                                                  
 keras_layer (KerasLayer)       {'input_mask': (Non  0           ['input_1[0][0]']                
                                e, 128),                                                          
                                 'input_word_ids':                                                
                                (None, 128),                                                      
                                 'input_type_ids':                                                
                                (None, 128)}                                                  

From the output of the model summary we can see the large number of parameters associated with this model (109,580,802) and the details surrounding each layer of both the BERT encoder and additional layers added on top of it.
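
The total can be cross-checked with a little arithmetic. The sketch below counts the parameters of the two added dense layers (weights plus biases) and subtracts them from the total reported by the summary. Because the encoder was loaded with `trainable=False`, only the parameters of the added layers should be updated during training. The helper name `dense_param_count` is introduced here for illustration.

```python
def dense_param_count(n_inputs, n_units):
    """Parameters in a dense layer: one weight per input per unit, plus one bias per unit."""
    return n_inputs * n_units + n_units

bert_hidden_size = 768  # size of the pooled BERT output

relu_params = dense_param_count(bert_hidden_size, 128)  # 98,432
sigmoid_params = dense_param_count(128, 1)              # 129
head_params = relu_params + sigmoid_params              # 98,561 trainable parameters

total_params = 109_580_802                   # total reported by bert_model.summary()
frozen_params = total_params - head_params   # 109,482,241 in the frozen BERT encoder

print(f"Trainable: {head_params:,}  Frozen: {frozen_params:,}")
```

Less than 0.1% of the model's parameters are therefore trained on the email data.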

Next, the model will be compiled to establish how it will be optimized and how its performance will be evaluated. The model will use the Adam optimizer, as this is a fairly standard optimization method for neural networks. The loss function will be BinaryCrossentropy, since this is the standard loss function for binary classification problems in neural networks. Finally, the metrics reported for the model are accuracy and recall, to establish a baseline of model performance.
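
To make the loss and metrics concrete, the sketch below computes binary cross-entropy, binary accuracy, and recall by hand in plain Python. The labels and predicted probabilities are toy values, and the clipping constant `eps` and the 0.5 threshold are assumptions chosen for the example; this is not the Keras implementation.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    return sum(int(p == y) for p, y in zip(preds, y_true)) / len(y_true)

def recall(y_true, y_prob, threshold=0.5):
    """Share of actual positives that were predicted positive."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    true_pos = sum(1 for p, y in zip(preds, y_true) if p == 1 and y == 1)
    false_neg = sum(1 for p, y in zip(preds, y_true) if p == 0 and y == 1)
    return true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0

# Toy labels (1 = fraudulent) and predicted probabilities, for illustration only
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]

loss = binary_cross_entropy(y_true, y_prob)
acc = binary_accuracy(y_true, y_prob)
rec = recall(y_true, y_prob)
print(round(loss, 4), acc, rec)
```

Recall is worth tracking alongside accuracy here because a missed fraudulent email (a false negative) lowers recall even when overall accuracy remains high.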

In [17]:
# Compile the BERT model using the adam optimizer, binary cross entropy loss and record binary accuracy and recall
bert_model.compile(
    # Optimizer
    optimizer=keras.optimizers.Adam(),
    # Loss function to minimize
    loss=keras.losses.BinaryCrossentropy(),
    # Metrics used to evaluate the model
    metrics=[keras.metrics.BinaryAccuracy(), keras.metrics.Recall()]
)

Now the model is ready to be trained. It will be trained for 5 epochs, with a 20% validation split used to measure performance. Only 5 epochs are used in this file because of the processing power and time required to train this model. Models trained for more epochs will be trained on other, more powerful machines, but the process is the same.
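
It may help to be explicit about how a validation split is carved out. Keras documents that `validation_split` holds out the last fraction of the samples provided, before any shuffling. The sketch below imitates that behaviour on a toy list; the function name and the sample data are hypothetical and the code is not taken from Keras itself.

```python
import math

def split_like_validation_split(samples, validation_split=0.2):
    """Hold out the LAST fraction of the samples, without shuffling first."""
    split_at = int(math.floor(len(samples) * (1.0 - validation_split)))
    return samples[:split_at], samples[split_at:]

# Ten placeholder emails, for illustration only
emails = [f"email_{i}" for i in range(10)]
train_part, val_part = split_like_validation_split(emails)

print(len(train_part), len(val_part))
print(val_part)
```

Because the held-out rows come from the end of the data, the validation score is only representative if the rows are not ordered by class or source. Data produced by a shuffled split, such as the default behaviour of `train_test_split`, should satisfy this.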

In [20]:
# Fit the model
history = bert_model.fit(X_remainder['content'], y_remainder, epochs=5, verbose=1, validation_split=0.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


The fitted model can now be saved for further use and the history of the fitted model saved for further examination.

In [None]:
# Save the model in the TensorFlow SavedModel format
bert_model.save('bert_model_5_relu_sig', save_format='tf')

In [None]:
# Save the model history to a pickle file
joblib.dump(history.history, 'bert_model_5_relu_sig_hist.pkl')

## Other Builds with the BERT Model

Using the BERT model as a base, several builds were attempted using Amazon SageMaker. The variations and results for these models are shown in the table below. Note that all models used the BERT uncased preprocessor located [here](https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3) and a single output node with sigmoid activation.

| Model Number | BERT Transformer Used            | ReLU Activation Layer Nodes | Epochs | Validation Accuracy |
|--------------|----------------------------------|-----------------------------|--------|---------------------|
| 1            | bert_en_uncased_L-12_H-768_A-12  | 128                         | 5      | 95.5%               |
| 2            | bert_en_uncased_L-12_H-768_A-12  | 128                         | 20     | 97.5%               |
| 3            | bert_en_uncased_L-12_H-768_A-12  | 256                         | 20     | 95.7%               |
| 4            | bert_en_uncased_L-12_H-768_A-12  | 0                           | 20     | 95.2%               |
| 5            | bert_en_uncased_L-24_H-1024_A-16 | 128                         | 20     | 89.4%               |

As demonstrated by the table above, the model with the highest validation accuracy (97.5%) used the "bert_en_uncased_L-12_H-768_A-12" TensorFlow Hub module with a 128-node ReLU layer and was trained for 20 epochs. This will be the model used for final evaluation and testing.
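
The selection can also be expressed programmatically. The sketch below records the five builds from the table as plain dictionaries (the key names are chosen here for illustration) and picks the one with the highest validation accuracy.

```python
# Results copied from the table of builds; accuracy values are percentages
builds = [
    {"model": 1, "bert": "bert_en_uncased_L-12_H-768_A-12", "relu_nodes": 128, "epochs": 5, "val_accuracy": 95.5},
    {"model": 2, "bert": "bert_en_uncased_L-12_H-768_A-12", "relu_nodes": 128, "epochs": 20, "val_accuracy": 97.5},
    {"model": 3, "bert": "bert_en_uncased_L-12_H-768_A-12", "relu_nodes": 256, "epochs": 20, "val_accuracy": 95.7},
    {"model": 4, "bert": "bert_en_uncased_L-12_H-768_A-12", "relu_nodes": 0, "epochs": 20, "val_accuracy": 95.2},
    {"model": 5, "bert": "bert_en_uncased_L-24_H-1024_A-16", "relu_nodes": 128, "epochs": 20, "val_accuracy": 89.4},
]

# Choose the build with the highest validation accuracy
best = max(builds, key=lambda build: build["val_accuracy"])
print(best["model"], best["bert"], best["relu_nodes"], best["epochs"])
```

Selecting on validation accuracy alone ignores recall, so a fuller comparison might also weigh the recall recorded during training.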

## Conclusions

Using transfer learning with a BERT transformer model should improve the general applicability and performance of fraudulent email identification. This model uses a compact BERT transformer from TensorFlow Hub as a base, with a 128-node ReLU layer on top of its pooled output and a 1-node sigmoid layer for the final output. This model will undergo further testing and evaluation in subsequent files.