# AI Phishing URL Detector in Python
source: https://www.activestate.com/blog/phishing-url-detection-with-python-and-ml/

## 1 - Identifying Fraudulent URLs
A **fraudulent domain**, or phishing domain, is one whose URL imitates a legitimate site in order to deceive users. Typically, the URL:
* is misspelled
* points to the wrong top-level domain
* combines a valid and a fraudulent URL
* is incredibly long
* is just an IP address
* has a low [pagerank](https://en.wikipedia.org/wiki/PageRank)
* has a recently registered domain, and/or
* ranks poorly on the [Alexa Top 1 Million Sites](https://www.alexa.com/topsites)

### Dataset and Criteria
For this project, I will use a [dataset from The University of California, Irvine](https://archive.ics.uci.edu/ml/datasets/Phishing+Websites#) identifying fraudulent vs. valid URLs to train my model. This is an old dataset, but the features are still valid today. The features fall into four main categories:
1. *Address Bar-Based* - features extracted from the URL itself. This may include:
    * URL length > 54 characters
    * contains an IP address
    * uses a URL shortening service like TinyURL or Bitly
    * employs redirection
    * adds a prefix or suffix separated by "-" to the domain
    * has a sub-domain or multiple sub-domains
    * existence of HTTPS
    * domain registration age
    * favicon loading from a different domain
    * using a non-standard port
2. *Abnormal Features* may include:
    * loading images in the body of the page from a different URL
    * minimal use of meta tags
    * the use of a Server Form Handler (SFH), which processes data sent to an HTML form
    * submitting information to email
    * an abnormal URL
3. *HTML and JavaScript-Based Features* may include:
    * website forwarding
    * status bar customization, normally using JavaScript, to display a fake URL
    * disabling the ability to right-click so users can’t view page source code
    * using pop-up windows
    * iFrame redirection
4. *Domain-Based Features* may include:
    * unusually young domains
    * suspicious DNS record
    * low volume of website traffic
    * PageRank (95% of phishing webpages have no PageRank)
    * whether the site has been indexed by Google
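
To make the address bar-based checks more concrete, here is a minimal sketch of how a few of them could be computed from a raw URL with the standard library. The function name `address_bar_features`, the dictionary keys, and the example URL are all hypothetical; the dataset itself ships with these features already encoded, so this is only an illustration of the idea and not how the dataset was built.

```python
import ipaddress
from urllib.parse import urlparse


def address_bar_features(url: str) -> dict:
    """Compute a few illustrative address bar-based checks for a URL.

    Hypothetical helper: the keys and thresholds mirror the list above,
    not the exact encoding used in the UCI dataset.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # Is the host a literal IP address rather than a domain name?
    try:
        ipaddress.ip_address(host)
        has_ip = True
    except ValueError:
        has_ip = False

    return {
        "long_url": len(url) > 54,            # URL length > 54 characters
        "has_ip_address": has_ip,             # host is just an IP address
        "has_at_symbol": "@" in url,          # "@" can hide the real host
        "hyphen_in_host": "-" in host,        # prefix/suffix separated by "-"
        "dots_in_host": host.count("."),      # rough proxy for sub-domain depth
        "non_standard_port": parsed.port not in (None, 80, 443),
    }


example = address_bar_features(
    "http://192.168.10.5:8080/secure-login/update?account=verify&session=1234567890"
)
for name, value in example.items():
    print(f"{name}: {value}")
```

Features such as domain age or favicon origin need WHOIS lookups or the page's HTML, so they are left out of this sketch.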

## 2 - Building A Decision Tree
For my machine learning algorithm, I will need a decision tree classifier to help me determine whether a URL is valid or not. I downloaded the UC Irvine dataset and explored its contents. The feature list contains:
* having_IP_Address  { -1,1 }
* URL_Length   { 1,0,-1 }
* Shortining_Service { 1,-1 }
* having_At_Symbol   { 1,-1 }
* double_slash_redirecting { -1,1 }
* Prefix_Suffix  { -1,1 }
* having_Sub_Domain  { -1,0,1 }
* SSLfinal_State  { -1,1,0 }
* Domain_registeration_length { -1,1 }
* Favicon { 1,-1 }
* port { 1,-1 }
* HTTPS_token { -1,1 }
* Request_URL  { 1,-1 }
* URL_of_Anchor { -1,0,1 }
* Links_in_tags { 1,-1,0 }
* SFH  { -1,1,0 }
* Submitting_to_email { -1,1 }
* Abnormal_URL { -1,1 }
* Redirect  { 0,1 }
* on_mouseover  { 1,-1 }
* RightClick  { 1,-1 }
* popUpWidnow  { 1,-1 }
* Iframe { 1,-1 }
* age_of_domain  { -1,1 }
* DNSRecord   { -1,1 }
* web_traffic  { -1,0,1 }
* Page_Rank { -1,1 }
* Google_Index { 1,-1 }
* Links_pointing_to_page { 1,0,-1 }
* Statistical_report { -1,1 }

The Result designates whether the URL is valid or not.
* Result  { -1,1 }

where -1 denotes an invalid URL and 1 is a valid URL.

Now for the code. First, I'll load the required modules:

In [None]:
# to perform operations on dataset
import pandas as pd
import numpy as np
from scipy.io.arff import loadarff
# ML model
## dev note: install the scikit-learn package rather than sklearn, as the sklearn PyPI package is deprecated
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# evaluation
from sklearn import metrics
from sklearn.metrics import confusion_matrix
# visualization
import matplotlib.pyplot as pyplt
import seaborn as sns
from sklearn.tree import export_graphviz

I took a moment to install any libraries that weren't on my machine, which was all of them, as this is my first time working with ML in my own environment. Next is reading the dataset. I had to make some edits to the original, as the dataset was an `.arff` file:

In [None]:
raw_data = loadarff('Training Dataset.arff')
df = pd.DataFrame(raw_data[0])
dot_file = '.../tree.dot'
confusion_matrix_file = '.../confusion_matrix.png'
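
One thing to watch for: `loadarff` returns nominal attributes as byte strings (for example `b'-1'`), not integers. Whether a classifier accepts those directly may depend on the library versions installed, so decoding and casting to `int` up front is a cautious option. The sketch below uses a tiny in-memory ARFF sample that I made up for illustration (three rows, three columns); the names `sample_arff` and `sample_df` are hypothetical and stand in for the real file and `df`.

```python
import io

import pandas as pd
from scipy.io.arff import loadarff

# Made-up ARFF sample that mimics the dataset's value encoding
sample_arff = io.StringIO(
    "@relation phishing_sample\n"
    "@attribute having_IP_Address {-1,1}\n"
    "@attribute URL_Length {1,0,-1}\n"
    "@attribute Result {-1,1}\n"
    "@data\n"
    "-1,1,-1\n"
    "1,0,1\n"
    "1,-1,1\n"
)

raw_sample, meta = loadarff(sample_arff)
sample_df = pd.DataFrame(raw_sample)

# Nominal values arrive as bytes; decode each column, then cast to int
sample_df = sample_df.apply(lambda col: col.str.decode("utf-8")).astype(int)

print(sample_df)
print(sample_df.dtypes)
```

The same `apply(...).astype(int)` line should carry over to the full DataFrame, assuming every column in the file is nominal with integer-like values, as the feature list above suggests.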

And then printing the first few rows:

In [None]:
print(df.head())

`df.head()` prints the first 5 rows across 31 columns: one column for each of the 30 features discussed above, plus `Result`.

## 3 - Train the Model
The first step in training a machine learning model is to split the dataset into testing and training data:

In [None]:
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
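
When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows for testing. The sketch below checks that on randomly generated stand-in data (the `demo_` names and the data itself are hypothetical, not the phishing dataset) and also shows the optional `stratify` argument, which keeps the class proportions similar in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in data using the same {-1, 0, 1} value encoding
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(
    rng.choice([-1, 0, 1], size=(100, 4)),
    columns=[f"feature_{i}" for i in range(4)],
)
demo_y = pd.Series(rng.choice([-1, 1], size=100), name="Result")

# Default split: 75% training rows, 25% testing rows
demo_Xtrain, demo_Xtest, demo_ytrain, demo_ytest = train_test_split(
    demo_X, demo_y, random_state=0
)
print(demo_Xtrain.shape, demo_Xtest.shape)

# Optional: explicit test size plus stratification on the label
strat_Xtrain, strat_Xtest, strat_ytrain, strat_ytest = train_test_split(
    demo_X, demo_y, test_size=0.2, stratify=demo_y, random_state=0
)
print(strat_Xtrain.shape, strat_Xtest.shape)
```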

Since every feature takes one of a few discrete values (-1, 0, or 1) and the target is binary, classifiers such as a Decision Tree, a Random Forest, or Logistic Regression are all reasonable choices. In this case, the OP (see source) chose to work with a Decision Tree because it’s straightforward and often performs well on this kind of categorical data.

In [None]:
model = DecisionTreeClassifier()
model.fit(Xtrain, ytrain)
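
The choice between the three candidate algorithms can also be checked empirically. The following is a rough sketch, not part of the original walkthrough: it runs 5-fold cross-validation for each model on synthetic data that I generated with the same {-1, 0, 1} encoding. The data, the labeling rule, and the `demo_` and `candidates` names are assumptions for illustration, so the scores say nothing about the real phishing dataset; swapping in the real `X` and `y` would give a meaningful comparison.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 rows, 30 features in {-1, 0, 1}
rng = np.random.default_rng(0)
demo_features = rng.choice([-1, 0, 1], size=(300, 30))
# Arbitrary labeling rule so that the features carry some signal
demo_labels = np.where(demo_features[:, :5].sum(axis=1) > 0, 1, -1)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Mean accuracy over 5 folds for each candidate model
mean_scores = {
    name: cross_val_score(clf, demo_features, demo_labels, cv=5).mean()
    for name, clf in candidates.items()
}

for name, score in mean_scores.items():
    print(f"{name}: {score:.3f}")
```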