# Phishing URL Detection


From Wikipedia, https://en.wikipedia.org/wiki/Phishing

> Phishing is the fraudulent attempt to obtain sensitive information such as usernames, passwords and credit card details, often for malicious reasons, by disguising as a trustworthy entity in an electronic communication. The word is a neologism created as a homophone of fishing due to the similarity of using a bait in an attempt to catch a victim. The annual worldwide impact of phishing could be as high as US$5 billion.
>
> Phishing is typically carried out by email spoofing or instant messaging, and it often directs users to enter personal information at a fake website, the look and feel of which are identical to the legitimate site, the only difference being the URL of the website in concern. Communications purporting to be from social web sites, auction sites, banks, online payment processors or IT administrators are often used to lure victims. Phishing emails may contain links to websites that distribute malware.
>
> Phishing is an example of social engineering techniques used to deceive users, and it exploits weaknesses in current web security. Attempts to deal with the growing number of reported phishing incidents include legislation, user training, public awareness, and technical security measures.

Here's an example of a real phishing email sent in 2011 by attackers looking to get login credentials for Facebook users:

<pre>
LAST WARNING : Your account is reported to have violated the policies that are considered annoying or insulting Facebook users.

Until we system will disable your account within 24 hours if you do not do the reconfirmation.

Please confirm your account below:

[ Link Removed ]

Thanks.
The Facebook Team
Copyright facebook © 2011 Inc. All rights reserved.
</pre>

A victim clicking on the phishing link would be taken to a site that looks like a fairly convincing copy of the Facebook login screen.

<img src="images/Not_Facebook.png">

Here are some examples of the links used in emails sent by the attackers running this phishing campaign:

**Note**: These links may be dangerous to your computer. Our practice will be to "neuter" links by wrapping certain characters with square brackets so that you cannot click on these links, or accidentally copy/paste them into your browser.

**CAUTION: DO NOT CLICK ON OR VISIT THESE LINKS!!**
<pre>
http[:]//team-welcome[.]at[.]ua/facebook-support[.]html
http[:]//reportedpages[.]at[.]ua/facebook-support-account[.]html
http[:]//www[.]facebooks[.]cloud/PayPlls[.]CEanada[.]tNZnZZlR3ZdyZZ-5RkZZDRTZZBy
http[:]//www[.]greenaura[.]net/appz[.]westpac/westpac[.]appz/login[.]php
http[:]//www[.]irastrum[.]com/wp-admin/mail[.]yahoo[.]com/
http[:]//appleid[.]apple[.]com-subscriptions[.]manager508158125[.]kevinfoley[.]com
</pre>
**CAUTION: DO NOT CLICK ON OR VISIT THESE LINKS!!**
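
The "neutering" convention described above can be sketched in a few lines of Python. This is only an illustration: the function names `defang` and `refang` are our own, and the example uses the reserved `example.com` domain rather than a real phishing link.

```python
def defang(url):
    """Wrap every ':' and '.' in square brackets so the URL is not clickable."""
    return url.replace(':', '[:]').replace('.', '[.]')


def refang(url):
    """Undo defang() so the URL can be analyzed as an ordinary string."""
    return url.replace('[:]', ':').replace('[.]', '.')


safe = defang('http://example.com/login.html')
print(safe)
```

Any feature extraction on neutered links should call something like `refang` first, so that the brackets do not distort character counts.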


Something smells a little phishy about these links. On close inspection, a human could probably decide fairly quickly whether a link was really sent by Facebook. But billions of people get hundreds or thousands of emails every day! How can defenders keep up with the onslaught from the phishers?

## The Problem

We want to use methods from Machine Learning to build a computer program that will automatically flag links it thinks are phishing attempts. We can do this by studying the problem, looking at data, and learning a decision rule.

The dataset we will be using is named "Phishing_Mitre_Dataset_Summer_of_AI.csv". 

#### Analyze the data, build features, or use the existing features in the data to build a model, and report your findings. We will use the F1 score to evaluate the final models using a test set that we have set aside. 
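
The F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R), computed here with phishing (label 1) as the positive class. The following plain-Python sketch shows the arithmetic on a small made-up set of labels; the helper names are our own, and in the notebook itself `sklearn.metrics.f1_score` would normally be used instead.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_score_binary(y_true, y_pred, positive=1):
    """F1 for binary labels, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return f1_from_counts(tp, fp, fn)


# Made-up labels for illustration only (1 = phishing, 0 = not phishing).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = f1_score_binary(y_true, y_pred)
print(score)
```

Unlike accuracy, F1 ignores true negatives, so a model cannot score well simply by labeling everything as not phishing.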

## Explore Some Data

Let's take a look at the provided features on our set of URLs. 

* Create Age (in Months): The age of the domain, in months. If the value is -1, that information is not available or the domain has been deleted.
* Expiry Age (in Months): The number of months until the domain expires. If the value is negative, that information is not available or the domain has been deleted.
* Update Age (in Days): The number of days since the domain was last updated. If the value is -1, that information is not available or the domain has been deleted.
* URL: The URL of the website. Three periods have been added to the end of each URL so that it cannot be clicked.
* Label: Whether the website is a phishing link. 0 denotes a website that is not a phishing link; 1 denotes a website that is a phishing link.
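
Since the task allows building new features, one option is to derive simple lexical features from the URL column. The sketch below is an assumption about what such features might look like, not part of the provided dataset: the function name `url_features` and the particular features are illustrative choices. It strips the three trailing periods described above before parsing.

```python
from urllib.parse import urlparse


def url_features(url):
    """Return a few hand-crafted lexical features for one URL string."""
    # The dataset appends three periods to each URL; remove them if present.
    if url.endswith('...'):
        url = url[:-3]
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        'url_length': len(url),
        'host_dots': host.count('.'),       # many subdomains can be suspicious
        'host_hyphens': host.count('-'),    # e.g. brand names glued with hyphens
        'path_depth': parsed.path.count('/'),
        'has_at_symbol': int('@' in url),
    }


features = url_features('http://www.mazda.co.jp/...')
print(features)
```

Each dictionary could then be turned into extra DataFrame columns alongside the three age features.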

In [50]:
import os
import csv

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image

import sklearn
from sklearn import metrics, model_selection, svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, LogisticRegression, Ridge
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, SVR, LinearSVC, LinearSVR

In [2]:
# A forward slash works on every platform and avoids the invalid '\P' escape sequence
df = pd.read_csv('Training Data/Phishing_Mitre_Dataset_Summer_of_AI.csv')
df

Unnamed: 0,create_age(months),expiry_age(months),update_age(days),URL,Label
0,-1,-1,-1,http://account-google-com.ngate.my/c44cca40176...,1
1,212,16,663,http://www.coffeespecialties.com/...,0
2,-1,-1,-1,http://black.pk/wp-content/2013/04/bp.postale/...,1
3,198,6,186,http://atomicsoda.com/manutd...,0
4,240,24,1684,http://bostoncoffeecake.com/...,0
...,...,...,...,...,...
4794,52,8,118,http://aridfoods.com/V4/MGen/F97a8a294cf7c5e90...,1
4795,-1,-1,-1,http://www.mazda.co.jp/...,0
4796,-1,-1,-1,http://www.fotografaemsaopaulo.com.br/wp-admin...,1
4797,-1,-1,-1,http://agenda.wehrensarl.ch/libraries/joomla/h...,1


In [66]:
def get_labels(df):
    """
    Returns a numpy array of the n classification labels (0 = not phishing, 1 = phishing).
    
    df: Pandas dataframe of data with a 'Label' column
    """
    labels = df['Label'].to_numpy()
    return labels


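def standardize_features(X, means, stdevs):
    """
    Returns X standardized feature by feature: each column has its mean subtracted
    and is then divided by its standard deviation.
    
    X: nxd numpy array with n feature vectors, each of d features
    means: length-d array of per-feature means
    stdevs: length-d array of per-feature standard deviations (must be nonzero)
    """
    
    X = np.subtract(X, means)
    X = np.divide(X, stdevs)
    
    return X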


def add_column(df, column_to_add, column_label):
    """
    Returns: a copy of df with one extra column named column_label
    column_to_add must be a list (or array) with one entry per row of df
    """
    
    df_copy = df.copy(deep=True)
    df_copy[column_label] = column_to_add
    return df_copy
    
def grid_search_svm(x_array, labels):
    """
    Inputs the training feature vectors and labels and outputs a trained SVM model
    (a fitted GridSearchCV object).
    """
    
    # split into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(x_array, labels, test_size = 0.2)
    
    # grid search over parameters
    # param_grid = {'kernel' : ['rbf'], 'C' : [0.5, 1, 2, 4, 32],  'gamma' : [0.0001, 0.001, 0.01, 0.1, 0.9]}
    param_grid = {'kernel' : ['rbf'], 'C' : [0.5, 1],  'gamma' : [0.1, 0.9]}
    svm1_grid = GridSearchCV(svm.SVC(probability=True), param_grid, cv=10, n_jobs=-1)
    
    # run the grid search; GridSearchCV then refits the best parameters on all of X_train
    svm1_grid.fit(X_train,y_train)
    
    # predict the labels of the validation set
    pred=svm1_grid.predict(X_val)
    
    # print the best parameters and accuracy on the validation set
    # (accuracy_score expects the true labels first, then the predictions)
    print(svm1_grid.best_params_)
    print('Accuracy = ', accuracy_score(y_val, pred))
    
    # return the trained model
    return svm1_grid

def grid_search_randfor(x_array, labels):
    """
    Inputs the training feature vectors and labels and outputs a trained random forest model
    (a fitted GridSearchCV object).
    """
    
    # split into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(x_array, labels, test_size = 0.2)
    
    # grid search over parameters
    # 'auto' is no longer accepted for max_features in scikit-learn 1.3 and later
    # (for classifiers it meant the same as 'sqrt'), so it is left out of the grid
    # param_grid = {'max_features' : ['sqrt', 'log2'], 'max_depth' : [13, 14, 15], 'criterion' : ['gini', 'entropy'], 'n_estimators' : [100, 200, 300] }
    param_grid = {'max_features' : ['sqrt', 'log2'] } 
    randfor_grid = GridSearchCV(RandomForestClassifier(), param_grid)
    
    # run the grid search; GridSearchCV then refits the best parameters on all of X_train
    randfor_grid.fit(X_train,y_train)
    
    # predict the labels of the validation set
    pred=randfor_grid.predict(X_val)
    
    # print the best parameters and accuracy on the validation set
    # (accuracy_score expects the true labels first, then the predictions)
    print(randfor_grid.best_params_)
    print('Accuracy = ', accuracy_score(y_val, pred))
    
    # return the trained model
    return randfor_grid

def create_submission_csv(pred, name):
    """
    Writes the predictions to the CSV file `name`, one row per row of the test file.
    
    NOTE: this helper reads 'test_baseline_no_label.csv' from the working directory and
    expects it to have 'country' and 'date' columns. It raises FileNotFoundError if that
    file is missing.
    """
    
    # used code from here: https://www.geeksforgeeks.org/python-save-list-to-csv/
    
    # read every column as text so that country and date can be concatenated
    df_kaggle_test = pd.read_csv('test_baseline_no_label.csv', encoding='unicode_escape', dtype=str)
    
    # build one identifier per row, e.g. '<country> <date>'
    country_id_list = [country + ' ' + date
                       for country, date in zip(df_kaggle_test['country'], df_kaggle_test['date'])]
    
    # pair each identifier with its prediction
    submission_list = [[country_id, p] for country_id, p in zip(country_id_list, pred)]
        
    fields = ['country_id', 'next_week_increase_decrease'] 
    with open(name, 'w', newline='') as f:
        # using csv.writer method from CSV package
        write = csv.writer(f)
      
        write.writerow(fields)
        write.writerows(submission_list)

In [58]:
# Pre-process the data - drop the Label and URL columns and keep the labels separately
df_features = df.copy(deep=True)
df_features = df_features.drop(columns = ['Label', 'URL'])
labels = get_labels(df)

print(labels)
df_features

[1 0 1 ... 1 1 0]


Unnamed: 0,create_age(months),expiry_age(months),update_age(days)
0,-1,-1,-1
1,212,16,663
2,-1,-1,-1
3,198,6,186
4,240,24,1684
...,...,...,...
4794,52,8,118
4795,-1,-1,-1
4796,-1,-1,-1
4797,-1,-1,-1
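
Rows such as 0 and 2 above hold -1 in every column, which per the feature descriptions means the domain information is unavailable rather than a real age. One possible way to expose that to a model, sketched here in plain Python under our own naming (`add_missing_flags` and the `_missing` suffix are hypothetical), is to add a 0/1 flag per field:

```python
def add_missing_flags(row):
    """Return a copy of a feature row with one 0/1 'missing' flag per field."""
    flagged = dict(row)
    for name, value in row.items():
        # Negative values are the dataset's marker for unavailable information.
        flagged[name + '_missing'] = int(value < 0)
    return flagged


# Values taken from rows 0 and 1 of the table above.
unknown = {'create_age(months)': -1, 'expiry_age(months)': -1, 'update_age(days)': -1}
known = {'create_age(months)': 212, 'expiry_age(months)': 16, 'update_age(days)': 663}

print(add_missing_flags(unknown))
print(add_missing_flags(known))
```

Whether such flags help is an empirical question; tree-based models can often separate the -1 values on their own.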


In [59]:
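# add more features to the dataset
# NOTE: this cell only demonstrates add_column by appending the labels as a placeholder column.
# df_more_features should not be used for training, because 'added_col' is the label itself.

df_more_features = add_column(df_features, labels, 'added_col')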
df_more_features

Unnamed: 0,create_age(months),expiry_age(months),update_age(days),added_col
0,-1,-1,-1,1
1,212,16,663,0
2,-1,-1,-1,1
3,198,6,186,0
4,240,24,1684,0
...,...,...,...,...
4794,52,8,118,1
4795,-1,-1,-1,0
4796,-1,-1,-1,1
4797,-1,-1,-1,1


In [63]:
x_array = df_features.to_numpy(dtype=float)

means = np.mean(x_array,axis=0)
stdevs = np.std(x_array, axis=0) 
x_array = standardize_features(x_array, means, stdevs)
x_array

array([[-0.993857  , -0.66217274, -0.44202413],
       [ 1.29993724,  0.18648314,  0.63204101],
       [-0.993857  , -0.66217274, -0.44202413],
       ...,
       [-0.993857  , -0.66217274, -0.44202413],
       [-0.993857  , -0.66217274, -0.44202413],
       [ 0.86917776,  2.7823717 ,  0.76468159]])
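
The cell above computes the means and standard deviations over every row, including rows that the grid search functions later hold out for validation. A common alternative, sketched below in plain Python so the arithmetic is visible, is to compute the statistics on the training rows only and reuse them for held-out rows. The helper names are our own, and the sample rows are rows 0, 1, 3 and 4 of the table shown earlier.

```python
from statistics import fmean, pstdev


def fit_standardizer(rows):
    """Return per-column means and population standard deviations."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Use 1.0 for constant columns so the division below never hits zero.
    stdevs = [pstdev(col) or 1.0 for col in columns]
    return means, stdevs


def apply_standardizer(rows, means, stdevs):
    """Standardize rows with statistics that were computed elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, stdevs)] for row in rows]


train = [[-1, -1, -1], [212, 16, 663], [198, 6, 186]]
held_out = [[240, 24, 1684]]

means, stdevs = fit_standardizer(train)
train_std = apply_standardizer(train, means, stdevs)
held_out_std = apply_standardizer(held_out, means, stdevs)
print(held_out_std)
```

`pstdev` is the population standard deviation, which matches the default behavior of `np.std` used in the notebook.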

In [64]:
svm_grid = grid_search_svm(x_array, labels)
for_grid = grid_search_randfor(x_array, labels)

{'C': 0.5, 'gamma': 0.9, 'kernel': 'rbf'}
Accuracy =  0.8166666666666667
{'max_features': 'sqrt'}
Accuracy =  0.859375




In [67]:
svm_predict = svm_grid.predict(x_array)
for_predict = for_grid.predict(x_array)

create_submission_csv(svm_predict, 'trial_svm')
create_submission_csv(for_predict, 'trial_forest')



FileNotFoundError: [Errno 2] File b'test_baseline_no_label.csv' does not exist: b'test_baseline_no_label.csv'
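
The error above arises because `create_submission_csv` looks for `test_baseline_no_label.csv` and its `country` and `date` columns, none of which belong to the phishing dataset. A minimal sketch of a writer suited to this dataset follows. The function name `write_predictions` and the output header names are assumptions, since no submission format is specified here.

```python
import csv
import io


def write_predictions(stream, urls, preds):
    """Write one (URL, predicted label) row per example to an open text stream."""
    if len(urls) != len(preds):
        raise ValueError('urls and preds must have the same length')
    writer = csv.writer(stream)
    writer.writerow(['URL', 'predicted_label'])  # hypothetical header names
    for url, pred in zip(urls, preds):
        writer.writerow([url, int(pred)])


# Demonstration with an in-memory buffer and one URL from the table above.
buffer = io.StringIO()
write_predictions(buffer, ['http://www.mazda.co.jp/...'], [0])
print(buffer.getvalue())
```

To write a real file, the same function can be called inside `with open('trial_svm.csv', 'w', newline='') as f:` with `list(df['URL'])` and the model's predictions as arguments.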