A project in three parts:
- A tool that uses ML to detect emails that may contain phishing scams.
- A webservice that provides an endpoint for the ML tool.
- A proof-of-concept service that monitors a Gmail inbox, sending new messages to the webservice and recording the results.
See /phish_detector. This is a sequential neural network written in
Python using the TensorFlow library. It was trained on the dataset
https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset.
The raw emails are first converted to matrices of Term Frequency-Inverse Document Frequency (TF-IDF) features, which measure how significant individual words are within an individual email and across the corpus. The term frequency measures how often a term appears in an email. The inverse document frequency measures how rare or common a term is across the entire corpus.
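As a rough illustration of the weighting described above, the sketch below computes TF-IDF by hand in plain Python. It assumes lowercase whitespace tokenisation and the textbook formula tf * log(N / df). The project itself may use a library vectoriser with different smoothing and normalisation, and the function name tfidf and the sample emails are made up for this example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumption: textbook TF-IDF, tf(t, d) * log(N / df(t)), with
    lowercase whitespace tokenisation. Library implementations usually
    add smoothing and normalisation, so their numbers will differ.
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenised:
        df.update(set(tokens))

    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up sample emails, for illustration only.
emails = [
    "verify your account now",
    "your meeting notes",
    "your invoice is attached",
]
scores = tfidf(emails)
```

In this toy corpus "your" appears in every email, so its inverse document frequency is log(3 / 3) = 0 and it carries no weight, while "verify" appears in only one email and gets a positive weight there.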
The array of TF-IDF features is fed into a neural network with this structure:
Model: "sequential"
+---------------------+-----------------+-----------------+
| Layer (type) | Output Shape | Param # |
+---------------------+-----------------+-----------------+
| dense (Dense) | (None, 128) | 640,128 |
+---------------------+-----------------+-----------------+
| dropout (Dropout) | (None, 128) | 0 |
+---------------------+-----------------+-----------------+
| dense_1 (Dense) | (None, 64) | 8,256 |
+---------------------+-----------------+-----------------+
| dropout_1 (Dropout) | (None, 64) | 0 |
+---------------------+-----------------+-----------------+
| dense_2 (Dense) | (None, 1) | 65 |
+---------------------+-----------------+-----------------+
Test Loss: 0.0241, Test Accuracy: 0.9964
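The parameter counts in the model summary follow from the usual dense-layer formula, inputs * units + units (weights plus one bias per unit). Working backwards from the first layer's 640,128 parameters suggests an input width of 5,000 TF-IDF features. That figure is inferred from the table rather than stated by the project, and dense_params is a helper name chosen for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Parameters in a fully connected layer: weights plus one bias per unit."""
    return n_inputs * n_units + n_units


# Input width inferred from the first layer: (640,128 - 128) / 128.
n_features = (640_128 - 128) // 128

layer_units = [128, 64, 1]  # dense, dense_1, dense_2 (dropout layers add none)
counts = []
n_inputs = n_features
for units in layer_units:
    counts.append(dense_params(n_inputs, units))
    n_inputs = units

total = sum(counts)
```

The computed counts match the Param # column for the three dense layers.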
To do:
- Evaluate for overfitting on other corpora.
- Add subject and URL features.
See /webservice. This is a Flask app that provides a single endpoint,
/check, which accepts POST data with a field named email and returns
1 (phishing scam) or 0 (ham).
Run the app:
$ flask --app webservice run --debug
Send a request:
$ curl -X POST http://127.0.0.1:5000/check -d "email=value"
$ curl -X POST http://127.0.0.1:5000/check --data-urlencode "email@data/emails/good0.txt"
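The same request can be made from Python. The following is a minimal sketch using only the standard library. It assumes the service is running at the address shown in the curl commands and that the response body is the bare label; the function names build_check_request and check_email are invented for this example.

```python
import urllib.parse
import urllib.request

CHECK_URL = "http://127.0.0.1:5000/check"  # address used in the curl examples


def build_check_request(email_text, url=CHECK_URL):
    """Build a form-encoded POST request with a single `email` field."""
    body = urllib.parse.urlencode({"email": email_text}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def check_email(email_text, url=CHECK_URL):
    """Send the email and return the raw response text.

    Assumption: the body is the bare label, "1" or "0".
    """
    with urllib.request.urlopen(build_check_request(email_text, url)) as resp:
        return resp.read().decode("utf-8").strip()


# Building the request does not need the service to be running.
request = build_check_request("Dear user, verify your account")
```

With the app running, check_email(open("data/emails/good0.txt").read()) would send the contents of that file. URL-encoding the field matters because raw emails often contain characters such as & and = that would otherwise break the form body.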
See /gmail_app/app/gmail.py. The script runs as a daemon, using the Gmail
API to periodically poll a Gmail account. The first time the script runs,
it launches a browser asking the user to allow access to the account.
Because the app is in testing, only listed users can run it.
New messages are sent to the webservice and the results are logged.
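The poll, classify, and log cycle described above might look roughly like the sketch below. It is not the code in gmail.py: the Gmail and HTTP calls are replaced by placeholder callables, and the names poll_once and run_forever, the logger name, and the 60-second interval are all assumptions made for illustration.

```python
import logging
import time

log = logging.getLogger("gmail_app.sketch")  # hypothetical logger name


def poll_once(fetch_new_messages, classify, seen_ids):
    """Classify messages not seen before; return (id, label) pairs.

    fetch_new_messages() -> iterable of (message_id, text) pairs
    classify(text) -> label from the webservice (1 or 0)
    Both callables are placeholders for the real Gmail and HTTP code.
    """
    results = []
    for message_id, text in fetch_new_messages():
        if message_id in seen_ids:
            continue
        seen_ids.add(message_id)
        label = classify(text)
        log.info("message %s classified as %s", message_id, label)
        results.append((message_id, label))
    return results


def run_forever(fetch_new_messages, classify, interval_seconds=60):
    """Poll at a fixed interval; the 60-second default is an assumption."""
    seen_ids = set()
    while True:
        try:
            poll_once(fetch_new_messages, classify, seen_ids)
        except Exception:
            log.exception("poll failed; retrying after the interval")
        time.sleep(interval_seconds)


# Stand-ins to show the flow without touching Gmail or the webservice.
inbox = [("m1", "verify your account"), ("m2", "lunch tomorrow?")]
seen = set()
first = poll_once(lambda: inbox, lambda text: int("verify" in text), seen)
second = poll_once(lambda: inbox, lambda text: int("verify" in text), seen)
```

Tracking message IDs that have already been handled keeps a message from being classified twice when it is still in the inbox on the next poll, which is why the second call above returns nothing.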
Start the service:
$ python gmail_app/app/gmail.py &
Watch the log:
$ tail -f gmail_app/logs/access.log
By default, access and error logs are in gmail_app/logs/. Control this
location, the frequency of calls to the Gmail API, the location of the
webservice, and one or two other things by editing the file gmail_app.ini.
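For orientation, the sketch below shows how settings of this kind can be read with Python's configparser. The section name and all key names are hypothetical, since the actual contents of gmail_app.ini are not reproduced here; check the file itself for the real names.

```python
import configparser

# Hypothetical contents: the real gmail_app.ini may use other
# section and key names.
EXAMPLE_INI = """
[gmail_app]
log_dir = gmail_app/logs/
poll_interval_seconds = 60
webservice_url = http://127.0.0.1:5000/check
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)  # use config.read("gmail_app.ini") for a file

settings = config["gmail_app"]
log_dir = settings.get("log_dir")
poll_interval = settings.getint("poll_interval_seconds", fallback=60)
webservice_url = settings.get("webservice_url")
```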