## Motivation {.nonincremental}

![](https://learn.microsoft.com/answers/storage/attachments/8587-tomsn.png){height=70% width=70% .center}
 
- As someone who uses the internet on a daily basis, I've gotten my fair share of phishing emails.
- I wanted to see if there was a way to detect "phishy" websites using machine learning.
- Having a tool that tells you if a website is a phishing website would be a huge benefit to both individuals and organizations.



::: {.notes}
My motivation for this project is that, because I'm online every day, I regularly receive phishing emails. When I was a kid, I would also "help" my parents with the phishing training for their jobs, which built my digital awareness at a young age. Recognizing a phishing attack is an important skill, especially in a professional career, which is why large companies devote resources to training their employees to spot one. A tool that classifies those websites for users could make phishing training unnecessary, which would be a huge benefit to both individuals and large corporations.
:::

## ML Problem and Technical Approach
The goal of this project is to find the machine learning model that detects phishing websites with the highest accuracy, based on features of the website and its URL.

**Technical approach**  

1. Gather and preprocess the dataset
2. Train each model on the training set and evaluate it on the testing set
3. Store the accuracies for later comparison

:::{.notes}
The goal of this project is to find the most accurate machine learning model for classifying a website as "phishy" or legitimate based on its features. To do this, I gathered and preprocessed the dataset and split it into training and testing sets. I then fit each model on the training set, measured its accuracy on the testing set, and stored those accuracies to be compared later.
:::

## Website Data{.smaller}

- The dataset comes from Kaggle
- It contains the domain and extracted features of 10000 websites, each labeled as either phishy (1) or legitimate (0)

. . .

```{python}
#| code-fold: true
import pandas as pd

df = pd.read_csv('data/test.csv')

# shuffle the data
df = df.sample(frac=1).reset_index(drop=True)
# index the rows by domain name
df.set_index("Domain", inplace=True)
df.head()
```

| Domain | Have_IP | Have_At | URL_Length | URL_Depth | Redirection | https_Domain | TinyURL | Prefix/Suffix | DNS_Record | Web_Traffic | Domain_Age | Domain_End | iFrame | Mouse_Over | Right_Click | Web_Forwards | Label |
| :------ | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: |
| tobogo.net | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| teat09.com | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| depositphotos.com | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| superuser.com | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| web.de | 0 | 0 | 1 | 6 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |


- The features fall into 3 main groups
    - Address bar based features
    - Domain based features
    - HTML/JavaScript based features

- I randomly split the dataset into training and testing sets using an 80/20 split
    - Training set: 8000 entries
    - Testing set: 2000 entries
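
A minimal sketch of the 80/20 split, assuming scikit-learn's `train_test_split` (the slides do not say which tool was used). The DataFrame is synthetic stand-in data with only two of the feature columns, and `random_state` is an assumption added for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 10000-row dataset (binary features, binary Label).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 2, size=(10_000, 3)),
    columns=["Have_IP", "Have_At", "Label"],
)

# Separate the features from the target column.
X = df.drop(columns="Label")
y = df["Label"]

# test_size=0.2 holds out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```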

:::{.notes}
The dataset I used came from Kaggle. It contains the domain name and extracted features of 10000 websites: 5000 are phishing websites and the other 5000 are legitimate. The features fall into 3 main groups: address bar based, domain based, and HTML/JS based. I shuffled the entries and then split the data 80/20, with 8000 entries used for training and 2000 for testing.
:::

## Models Used
::: {.nonincremental}
- Decision Tree
    - Tried multiple depths and found that `max_depth = 5` gave the best results
- Random Forest
    - Tried multiple depths and found that `max_depth = 5` gave the best results
- Binary Logistic Regression
    - Used 1000 iterations
- SVM
    - Used Linear kernel with regularization parameter `C = 1`
:::
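
The four configurations could be written as follows. This is a sketch that assumes scikit-learn estimators; the slides name the models and hyperparameters but not the library, and every setting not listed on the slide is left at its default.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hyperparameters taken from the slide; everything else is a library default.
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5),
    "Random Forest": RandomForestClassifier(max_depth=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear", C=1.0),
}
```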

:::{.notes}
Because the dataset labels each website with a binary value, 1 for phishy and 0 for legitimate, this is a classification task. The four models that I decided to use are...
:::

## Model Evaluation & Results{.smaller}
- Each model is fit on the training set and then evaluated on the testing set; its accuracy on both sets is stored and compared once all models have been evaluated
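
The fit-and-score loop can be sketched as below. This assumes scikit-learn; `evaluate_models` is a hypothetical helper name, the data is synthetic from `make_classification`, and only two models are shown to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(models, X_train, X_test, y_train, y_test):
    """Fit each model and return its train/test accuracy in percent."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = {
            "train": 100 * accuracy_score(y_train, model.predict(X_train)),
            "test": 100 * accuracy_score(y_test, model.predict(X_test)),
        }
    return results


# Synthetic stand-in data, not the phishing dataset.
X, y = make_classification(n_samples=1000, n_features=16, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = evaluate_models(
    {
        "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    },
    X_train, X_test, y_train, y_test,
)

# Print the stored accuracies, best testing accuracy first.
for name, scores in sorted(
    results.items(), key=lambda item: item[1]["test"], reverse=True
):
    print(f"{name}: train {scores['train']:.2f}%, test {scores['test']:.2f}%")
```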

:::{.panel-tabset}

### Graph
![](figures/output.svg)


### Table

| ML Model | Training Accuracy (%) | Testing Accuracy (%) |
| :------: | :------: | :------: |
| Random Forest | 81.90 | 82.35 |
| Decision Tree | 81.21 | 81.80 |
| SVM | 79.98 | 81.15 |
| Logistic Regression | 79.81 | 80.95 |

:::

Of all the models, the Random Forest had the highest training and testing accuracy:  

- Training Accuracy of 81.9%
- Testing Accuracy of 82.35%

Some models, such as the SVM and Logistic Regression, have a testing accuracy about 1.2 percentage points higher than their training accuracy, which may indicate slight underfitting.
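
The gap between testing and training accuracy can be checked directly from the table. A small sketch using only the reported values:

```python
# Accuracy values in percent, copied from the results table: (training, testing).
reported = {
    "Random Forest": (81.90, 82.35),
    "Decision Tree": (81.21, 81.80),
    "SVM": (79.98, 81.15),
    "Logistic Regression": (79.81, 80.95),
}

# Testing minus training accuracy, in percentage points.
gaps = {name: round(test - train, 2) for name, (train, test) in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} percentage points")
```

Every model scores slightly higher on the testing set, with the largest gaps for the SVM and Logistic Regression.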


:::{.notes}
Once the data had been split into training and testing sets, each model was fit on the training set and evaluated on the testing set. Each model's training and testing accuracy was then stored for comparison, which produced the graph shown here. As you can see from the table, the Random Forest classifier had the highest overall accuracy, with 81.9% on the training set and 82.35% on the testing set. I also noticed that models like the Linear SVM and the Binary Logistic Regression didn't have particularly balanced training and testing accuracies, which could indicate an issue with the fit.
:::

## What's Next?
- Build a new dataset using entries from open data sources like PhishTank
- Try deep learning models 
    - Neural Networks
    - Multilayer perceptrons
- Build a browser extension based on these models that can classify the website that the user is currently on
    - Similar to how Norton labels websites as dangerous with its extension