# Phishing Detection Using BERT

## Authors
- Eli Fajardo (NVIDIA)
- Gorkem Batmaz (NVIDIA)
- Bartley Richardson, PhD (NVIDIA)

## Table of Contents 
* Introduction
* List of datasets used
* Reading in the datasets
* Initialize CLX Phishing Detection and BERT model
* Training - CLAIR FRAUDULENT EMAILS dataset
* Evaluation of CLAIR Test Set
* Training with the SPAM_ASSASSIN dataset
* Evaluation of the SPAM_ASSASSIN Test Set
* Training with CLAIR+SPAM_ASSASSIN datasets
* Evaluation of the Test Set of CLAIR+SPAM_ASSASSIN Datasets
* Training with all three datasets CLAIR+SPAM_ASSASSIN+ENRON
* Evaluation of the Test Set of CLAIR+SPAM_ASSASSIN+ENRON Datasets
* References

## Introduction
Phishing is a method used by fraudsters and hackers to obtain sensitive information from email users by pretending to be a legitimate institution or person.
Various machine learning methods are used to detect and filter phishing and spam emails.
In this notebook, we show how to train a *BERT language model and analyse its performance on various datasets. We fine-tune a pre-trained BERT model with a classification layer using the HuggingFace library.
*BERT stands for Bidirectional Encoder Representations from Transformers. The paper can be found [here](https://arxiv.org/pdf/1810.04805.pdf).
This notebook will be updated with a much faster GPU tokenizer.
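
To make "a classification layer" concrete, the following is a minimal sketch of what such a layer computes on top of an encoder output: a linear map followed by a softmax over the two classes. It is not the HuggingFace or CLX implementation. The names `softmax` and `classify`, the 3-dimensional `pooled` vector, and the weight values are all illustrative assumptions (the hidden size of `bert-base` is 768).

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(pooled, weights, bias):
    """Linear layer (one weight row per class) followed by softmax."""
    logits = [
        sum(w * h for w, h in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

# Hypothetical stand-in for the encoder's pooled output for one email.
pooled = [0.5, -1.0, 2.0]
# Hypothetical weights: row 0 scores "benign", row 1 scores "phishing".
weights = [[0.1, 0.2, -0.3], [-0.2, 0.1, 0.4]]
bias = [0.0, 0.0]

probs = classify(pooled, weights, bias)
print(probs)  # two class probabilities
```

During fine-tuning, both the encoder weights and this small layer are updated so that the probability assigned to the correct label increases.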

## Datasets used
* [CLAIR-Fraudulent E-mail Corpus](https://www.kaggle.com/rtatman/fraudulent-email-corpus)
* [SPAM_ASSASSIN Dataset](https://spamassassin.apache.org/old/publiccorpus/)
* [Enron Emails](https://www.cs.cmu.edu/~enron/)

### Required Libraries

In [1]:
import cudf
from cuml.preprocessing.model_selection import train_test_split
from clx.analytics.phishing_detector import PhishingDetector

## Reading the files

In [2]:
dfclair = cudf.read_csv("Phishing_Dataset_Clair-Collection.tsv", delimiter='\t', header=None, names=['label', 'email'])  # CLAIR dataset

In [3]:
dfspam = cudf.read_csv("200_20021010_spam_.tsv", delimiter='\t', header=None, names=['label', 'email'])  # Phishing emails of the SPAM_ASSASSIN dataset

In [4]:
dfeasyham = cudf.read_csv("200_1010_easy_ham_.tsv", delimiter='\t', header=None, names=['label', 'email'])  # Benign emails of the SPAM_ASSASSIN dataset

In [5]:
dfhardham = cudf.read_csv("200_1010_hard_ham_.tsv", delimiter='\t', header=None, names=['label', 'email'])  # Benign emails of the SPAM_ASSASSIN dataset that are easily confused with phishing emails

In [6]:
dfenron = cudf.read_csv("enron10000.tsv", delimiter='\t', header=None, names=['label', 'email'])  # Benign Enron emails

In [7]:
# The files contain the first 200 words of each email. The model uses only the first 128 words.
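
The truncation described in the comment above can be sketched as follows. This is only an illustration that assumes "words" means whitespace-separated tokens; BERT tokenizers actually split text into WordPiece sub-word tokens, so the real cut-off inside the model may fall at a different position. The helper `first_n_words` is a hypothetical name, not part of CLX.

```python
def first_n_words(text: str, n: int = 128) -> str:
    """Keep only the first n whitespace-separated words of text."""
    return " ".join(text.split()[:n])

# Toy email body with 200 words, mirroring the files used in this notebook.
email = " ".join(f"word{i}" for i in range(200))

truncated = first_n_words(email)
print(len(truncated.split()))  # 128
```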

## Initialize/Load BERT model

In [8]:
phish_detect = PhishingDetector()
phish_detect.init_model()

# init_model can also load a pre-trained model when given the path to a model directory

## Training - CLAIR FRAUDULENT EMAILS DATASET

Split the dataset into training and test sets

In [9]:
X_train, X_test, y_train, y_test = train_test_split(dfclair, 'label', train_size=0.8)
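
For readers unfamiliar with this call, the sketch below shows the idea of an 80/20 split on a toy table using plain Python. It is not cuML's implementation: `split_rows` is a hypothetical helper, and the shuffle-then-cut strategy and the fixed seed are assumptions made for illustration.

```python
import random

def split_rows(rows, label_key, train_size=0.8, seed=0):
    """Shuffle rows, then split them into train/test features and labels."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)    # shuffle a copy, reproducibly
    cut = int(len(shuffled) * train_size)    # number of training rows
    train, test = shuffled[:cut], shuffled[cut:]

    def features(row):
        # Everything except the label column.
        return {k: v for k, v in row.items() if k != label_key}

    X_train = [features(r) for r in train]
    X_test = [features(r) for r in test]
    y_train = [r[label_key] for r in train]
    y_test = [r[label_key] for r in test]
    return X_train, X_test, y_train, y_test

# Toy stand-in for a (label, email) table with 10 rows.
rows = [{"label": i % 2, "email": f"message {i}"} for i in range(10)]
X_train, X_test, y_train, y_test = split_rows(rows, "label", train_size=0.8)
print(len(X_train), len(X_test))  # 8 2
```

The test rows are held out from training so that the evaluation measures performance on emails the model has not seen.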

In [10]:
phish_detect.train_model(X_train, y_train, epochs=1)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Train loss: 0.09543371412349733


Epoch: 100%|██████████| 1/1 [01:07<00:00, 67.54s/it]

Validation Accuracy: 0.996875





## Evaluation of CLAIR Test Set

## Training with SPAM_ASSASSIN dataset

Merge the SPAM_ASSASSIN files into a single dataset

In [11]:
df_assassin = cudf.concat([dfhardham,dfeasyham,dfspam], ignore_index=True)

Split the dataset into training and test sets

In [12]:
X_train, X_test, y_train, y_test = train_test_split(df_assassin, 'label', train_size=0.8)

In [13]:
phish_detect.train_model(X_train, y_train, epochs=1)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Train loss: 0.512865782235608


Epoch: 100%|██████████| 1/1 [00:18<00:00, 18.64s/it]

Validation Accuracy: 0.8462370242214533





## Evaluation of the SPAM_ASSASSIN Test Set

In [14]:
phish_detect.evaluate_model(X_test, y_test)

0.8547655068078669
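
Assuming the number printed above is classification accuracy (consistent with the "Validation Accuracy" line in the training output), it is the fraction of test emails whose predicted label matches the true label. A minimal sketch with made-up labels follows; the `accuracy` function is a hypothetical helper, not the CLX implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels: 1 = phishing, 0 = benign.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]   # one phishing email is missed
print(accuracy(y_true, y_pred))  # 0.8
```

Accuracy alone can hide class imbalance, so it is worth checking how many phishing and benign emails the test set contains when comparing scores across datasets.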

## Training with CLAIR+SPAM_ASSASSIN datasets

Merge the two datasets and split them into training and test sets

In [15]:
df_total = cudf.concat([dfhardham,dfeasyham,dfspam,dfclair],ignore_index=True)

In [16]:
X_train, X_test, y_train, y_test = train_test_split(df_total, 'label', train_size=0.8)

In [17]:
phish_detect.train_model(X_train, y_train, epochs=1)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Train loss: 0.06557311948014759


Epoch: 100%|██████████| 1/1 [01:24<00:00, 84.91s/it]

Validation Accuracy: 0.9939123376623377





## Evaluation of the Test Set of CLAIR+SPAM_ASSASSIN Datasets

In [18]:
phish_detect.evaluate_model(X_test, y_test)

0.9957321076822062

## Training with all three datasets (CLAIR+SPAM_ASSASSIN+ENRON)

Merge all the datasets, split them into training and test sets, and then tokenize the emails

In [19]:
df_total = cudf.concat([dfhardham,dfeasyham,dfspam,dfclair,dfenron],ignore_index=True)

In [20]:
X_train, X_test, y_train, y_test = train_test_split(df_total, 'label', train_size=0.8)

In [21]:
phish_detect.train_model(X_train, y_train, epochs=1)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Train loss: 0.020746011122955137


Epoch: 100%|██████████| 1/1 [02:22<00:00, 142.62s/it]

Validation Accuracy: 0.9980314960629921





## Evaluation of the Test Set of CLAIR+SPAM_ASSASSIN+ENRON Datasets

In [22]:
phish_detect.evaluate_model(X_test, y_test)

0.9982164090368609

## References
* https://github.com/huggingface/transformers/tree/master/examples#
* https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/
* https://github.com/ThilinaRajapakse/pytorch-transformers-classification
* https://mccormickml.com/2019/07/22/BERT-fine-tuning/