# Phishing URL Prediction

## Table of Contents

1. [Problem Statement](#section1)<br/>
    - 1.1 [Introduction](#section101)<br/>
    - 1.2 [Data source and data set](#section102)<br/>
2. [Load the packages and data](#section2)<br/>
    - 2.1 [Description about each column in the dataset](#section201)<br/>
3. [Data profiling](#section3)<br/>
    - 3.1 [Initial Profiling Observations](#section3.1)<br/>
    - 3.2 [Final Profiling observations](#section3.2)<br/>
4. [Model evaluation](#section4)<br/>
    - 4.1 [Loading the train and test datasets](#section4.1)<br/>
    - 4.2 [Logistic regression](#section4.2)<br/>
    - 4.3 [Decision tree with default values](#section4.3)<br/>
    - 4.4 [Decision tree with Grid Search CV](#section4.4)<br/>
    - 4.5 [RF with best hyper parameters](#section4.5)<br/>
    - 4.6 [Naive Bayes](#section4.6)<br/>
    - 4.7 [Stochastic Gradient Descent](#section4.7)<br/>
    - 4.8 [K-Nearest Neighbours](#section4.8)<br/>
    - 4.9 [SVM](#section4.9)<br/>
    - 4.10 [Ensemble Bagging - voting classifier](#section4.10)<br/>
    - 4.11 [AdaBoost](#section4.11)<br/>
    - 4.12 [Gradient Boosting](#section4.12)<br/>
    - 4.13 [XGBoost](#section4.13)<br/>
    - 4.14 [Conclusion and model selection](#section4.14)<br/>
5. [Model Deployment](#section5)<br/>

<a id=section1></a> 
## 1. Problem Statement 

"Detecting the phishing URL using the machine learning algorithms - The openness of the Web exposes opportunities for criminals to upload malicious content. In fact, despite extensive research, email based spam filtering techniques are unable to protect other web services. Therefore, a counter measure must be taken that generalizes across web services to protect the user from phishing host URLs."

<a id=section101></a> 
### 1.1. Introduction
Phishing is popular among attackers, since it is easier to trick someone into clicking a malicious link which seems legitimate than trying to break through a computer’s defense systems. The malicious links within the body of the message are designed to make it appear that they go to the spoofed organization using that organization’s logos and other legitimate contents.

<a id=section102></a> 
### 1.2. Data source and dataset

__a__. How was it collected? 

- This data set was collected from the UCI Machine Learning Repository

- __Description__: "This data set was prepared by UCI by combining 31 different attributes that are useful in classifying a website URL as phishing or not"

__b__. Is it a sample? If yes, was it properly sampled?
- No

<a id=section2></a> 
## 2. Load the packages and data 

#### Run this line if you don't have scikit-learn installed.
```python
!pip install scikit-learn
```

In [None]:
from sklearn import tree
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import pickle

import numpy as np
import pandas as pd
import sys
import warnings # Ignore warning related to pandas_profiling
warnings.filterwarnings('ignore')

In [None]:
#Load the actual data into the dataframe
ds = pd.read_csv("dataset.csv")
ds.head()

<a id=section201></a> 
### 2.1. Description about each column in the data set

#### Parameters in dataset

Each row in the dataset contains the items below, separated by commas.

- having_IP_Address { -1,1 }
- URL_Length { 1,0,-1 }
- Shortining_Service { 1,-1 }
- having_At_Symbol { 1,-1 }
- double_slash_redirecting { -1,1 }
- Prefix_Suffix { -1,1 }
- having_Sub_Domain { -1,0,1 }
- SSLfinal_State { -1,1,0 }
- Domain_registeration_length { -1,1 }
- Favicon { 1,-1 }
- port { 1,-1 }
- HTTPS_token { -1,1 }
- Request_URL { 1,-1 }
- URL_of_Anchor { -1,0,1 }
- Links_in_tags { 1,-1,0 }
- SFH { -1,1,0 }
- Submitting_to_email { -1,1 }
- Abnormal_URL { -1,1 }
- Redirect { 0,1 }
- on_mouseover { 1,-1 }
- RightClick { 1,-1 }
- popUpWidnow { 1,-1 }
- Iframe { 1,-1 }
- age_of_domain { -1,1 }
- DNSRecord { -1,1 }
- web_traffic { -1,0,1 }
- Page_Rank { -1,1 }
- Google_Index { 1,-1 }
- Links_pointing_to_page { 1,0,-1 }
- Statistical_report { -1,1 }
- Result { -1,1 }
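
The value sets listed above can be checked programmatically. The following is a minimal sketch, not part of the original analysis: the tiny `sample` frame, the `allowed_values` mapping (only three columns are shown) and the helper `find_invalid_columns` are hypothetical names used for illustration.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset.
sample = pd.DataFrame({
    "having_IP_Address": [-1, 1, 1, -1],
    "URL_Length": [1, 0, -1, 1],
    "Result": [-1, 1, 1, -1],
})

# Allowed values copied from the column list above (subset only).
allowed_values = {
    "having_IP_Address": {-1, 1},
    "URL_Length": {1, 0, -1},
    "Result": {-1, 1},
}

def find_invalid_columns(frame, allowed):
    """Return {column: unexpected values} for columns that break the spec."""
    invalid = {}
    for column, valid in allowed.items():
        unexpected = set(frame[column].unique().tolist()) - valid
        if unexpected:
            invalid[column] = unexpected
    return invalid

invalid_columns = find_invalid_columns(sample, allowed_values)
print(invalid_columns)  # an empty dict means every value is in its allowed set
```

On the real data, the same helper could be called with the full mapping for all 31 columns.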

<a id=section3></a> 
## 3. Data Profiling

Review the data types and sample data to understand which variables we are dealing with.<br>
Which variables need to be transformed in some way before they can be analyzed?

In [None]:
#Map column names onto the dataset so that each value can easily be related to its name
ds_columns=["having_IP_Address","URL_Length","Shortining_Service","having_At_Symbol","double_slash_redirecting","Prefix_Suffix",
            "having_Sub_Domain","SSLfinal_State","Domain_registeration_length","Favico","port","HTTPS_token","Request_URL",
            "URL_of_Anchor","Links_in_tags","SFH","Submitting_to_email","Abnormal_URL","Redirect","on_mouseover","RightClick",
            "popUpWindow","Iframe", "age_of_domain","DNSRecord","web_traffic","Page_Rank","Google_Index",
            "Links_pointing_to_page","Statistical_report","Result"]

indexed_ds=pd.DataFrame(ds.values,columns=ds_columns)
indexed_ds.head()

In [None]:
# we are not using Favicon,port,Request_URL,URL_of_Anchor,Links_in_tags,SFH,web_traffic and Statistical_report
#dataset size
indexed_ds.shape

In [None]:
indexed_ds.dtypes

In [None]:
#check for NAN/missing values in the dataset
indexed_ds.isnull().values.any()

In [None]:
#find the correlation between different variables
indexed_ds.corr()['Result']

In [None]:
#Correlation map to see how features are correlated with Results
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt 

color = sns.color_palette()
sns.set_style('darkgrid')
corrmat = indexed_ds.corr()
plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, vmax=0.9, square=True)

<a id=section3.1></a> 
### 3.1. Observations
 - All 31 columns are of integer type
 - The correlation between "Result" and Iframe (-0.003362), Favico (-0.000231) and popUpWindow (0.000136) is almost zero, so we can drop them.
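
The manual selection above could also be expressed as a threshold on the absolute correlation with the target. This is a minimal sketch on made-up data: the `toy` frame, its column names and the `threshold` of 0.01 are assumptions chosen for illustration, not values taken from the notebook.

```python
import pandas as pd

# Toy data: "uninformative" is constructed to have zero correlation with "Result".
toy = pd.DataFrame({
    "informative":   [1, 1, -1, 1] * 25,
    "uninformative": [1, -1, 1, -1] * 25,
    "Result":        [1, 1, -1, -1] * 25,
})

threshold = 0.01  # assumed cut-off for "almost zero"

# Correlation of every feature with the target, excluding the target itself.
correlation_with_result = toy.corr()["Result"].drop("Result")

# Features whose absolute correlation falls below the threshold.
low_correlation = correlation_with_result[
    correlation_with_result.abs() < threshold
].index.tolist()

reduced = toy.drop(columns=low_correlation)
print(low_correlation)
print(reduced.columns.tolist())
```

A near-zero linear correlation does not rule out a non-linear relationship, so the threshold is best treated as a quick filter rather than a definitive feature-selection rule.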


In [None]:
del indexed_ds["Favico"]
del indexed_ds["Iframe"]
del indexed_ds['popUpWindow']

In [None]:
#del indexed_ds["Favico"]
#del indexed_ds["port"]
#del indexed_ds["Request_URL"]
#del indexed_ds["URL_of_Anchor"]
#del indexed_ds["Links_in_tags"]
#del indexed_ds["SFH"]
indexed_ds.shape

<a id=section4></a>
## 4. Model evaluation
- We will use the __accuracy score__ to evaluate the different ML classification algorithms

<a id=section4.1></a>
### 4.1 Loading the train and test datasets

In [None]:
y_col = 'Result'
y = indexed_ds[y_col]

#Configure X as other columns except Result
X = indexed_ds.iloc[:,:-1]

#Split the data into training and test sets in an 80:20 ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(X_train.shape)
print(y_train.shape)

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

def calculate_error(test, pred):
    """
    Calculate the accuracy, precision, recall and F-score for test and predict data
    """
    print('Accuracy score for test data is:', accuracy_score(test,pred))
    print(pd.DataFrame(confusion_matrix(test,pred,labels=[-1, 1]),index=['Actual:-1','Actual:1'],columns=['Pred:-1','Pred:1']))
    print(classification_report(test, pred))
    #print(np.count_nonzero(test==1), np.count_nonzero(test==-1))
    #print(np.count_nonzero(pred==1), np.count_nonzero(pred==-1))

<a id=section4.2></a>
### 4.2 Logistic regression

In [None]:
#Logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

lr =  LogisticRegression()
lr.fit(X_train, y_train)
y_pred=lr.predict(X_test)

calculate_error(y_test,y_pred)

#### Observation
- In the report above, the accuracy matches the F-score, which suggests the model performs comparably on both classes
- Higher precision means fewer false positives
- Higher recall means fewer false negatives
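
The relationship between the confusion matrix and the reported metrics can be made concrete with a small worked example. The counts below are hypothetical, and treating label 1 as the positive class is an assumption made for this illustration.

```python
# Hypothetical confusion-matrix counts, treating label 1 as the positive class.
tp, fp, fn, tn = 90, 10, 5, 95

# Share of all predictions that are correct.
accuracy = (tp + tn) / (tp + fp + fn + tn)
# Share of predicted positives that are truly positive (penalises false positives).
precision = tp / (tp + fp)
# Share of actual positives that were found (penalises false negatives).
recall = tp / (tp + fn)
# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

When the two kinds of error are roughly balanced, as in these counts, accuracy and F-score end up close to each other.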

<a id=section4.3></a>
### 4.3 Decision tree with default values

In [None]:
#DT with default values
from sklearn import tree
from sklearn.model_selection import GridSearchCV

dt = tree.DecisionTreeClassifier(random_state = 0)
dt.fit(X_train, y_train)  

y_pred = dt.predict(X_test) 

calculate_error(y_test,y_pred)

#### Observation
- In the report above, the accuracy matches the F-score, which suggests the model performs comparably on both classes
- Higher precision means fewer false positives
- Higher recall means fewer false negatives

<a id=section4.4></a>
### 4.4 Decision tree with Grid Search CV

In [None]:
#DT with GridSearch CV
grid_model = tree.DecisionTreeClassifier(random_state = 0)  

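# 'sqrt' considers sqrt(n_features) features at each split
param_grid = {'min_samples_split':range(2,10,1), 'max_features':[0.5,0.8,0.9,'sqrt']}
# refit=True retrains the best estimator on the whole training set; candidates are ranked by the default accuracy score
grid_search = GridSearchCV(grid_model,param_grid,cv=10, refit=True)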
grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)

calculate_error(y_test,y_pred)

#### Observation
- In the report above, the accuracy matches the F-score, which suggests the model performs comparably on both classes
- Higher precision means fewer false positives
- Higher recall means fewer false negatives
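
To see which hyperparameters a grid search selected, the fitted search object can be inspected. The sketch below runs on synthetic data from `make_classification` rather than the phishing dataset, and the names `X_demo`, `y_demo` and `demo_search` as well as the smaller grid are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# A reduced grid to keep the example fast.
param_grid = {"min_samples_split": [2, 5, 10], "max_features": [0.5, "sqrt"]}
demo_search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
demo_search.fit(X_demo, y_demo)

# Best parameter combination and its mean cross-validated accuracy.
print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))
```

`best_score_` is the mean cross-validated score on the training folds, so it is expected to differ somewhat from the accuracy measured on the held-out test set.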

<a id=section4.5></a>
### 4.5 RF with best hyper parameters

In [None]:
#RF with best hyper parameters
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rf_model = RandomForestClassifier()

# 'sqrt' considers sqrt(n_features) features at each split
param_grid = {
    'n_estimators':[10,50,100,200], 'min_samples_split':range(2,10,1), 'max_features':[0.5,0.8,0.9,'sqrt']}

n_iter_search = 20
search = RandomizedSearchCV(estimator=rf_model, param_distributions=param_grid, return_train_score=True,n_iter=n_iter_search, cv=5, n_jobs=-1)

# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")
            

from time import time
start = time()
search.fit(X_train, y_train)
print("RandomizedSearchCV took %.2f seconds for %d candidate"
      " parameter settings." % ((time() - start), n_iter_search))
report(search.cv_results_)

print("model Test score: %.3f" % search.score(X_test, y_test))
search.best_params_

In [None]:
clf = RandomForestClassifier(**search.best_params_)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
#print("model score: %.3f" % clf.score(X_test, y_test))
calculate_error(y_test,y_pred)

#### Observation
- In the report above, the accuracy matches the F-score, which suggests the model performs comparably on both classes
- Higher precision means fewer false positives
- Higher recall means fewer false negatives

<a id=section4.6></a>
### 4.6 Naive Bayes

In [None]:
#Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb =  GaussianNB()
nb.fit(X_train, y_train)
y_pred=nb.predict(X_test)
#print('Accuracy score for test data is:', accuracy_score(y_test,y_pred))
calculate_error(y_test,y_pred)

#### Observation
- In the report above, the accuracy is higher than the F-score, which suggests the model does not perform equally well on both classes
- Precision varies widely between the two predicted classes (-1 and 1) compared with the other models
- Recall is 1 for the class "-1", which means the model is biased towards predicting "-1"; as a result, the number of false negatives increases
- For the class "1", recall is below 0.5


<a id=section4.7></a>
### 4.7 Stochastic Gradient Descent

In [None]:
#Stochastic Gradient Descent
from sklearn.linear_model import SGDClassifier
sgd =  SGDClassifier(loss='modified_huber', shuffle=True,random_state=101)
sgd.fit(X_train, y_train)
y_pred=sgd.predict(X_test)
#print('Accuracy score for test data is:', accuracy_score(y_test,y_pred))
calculate_error(y_test,y_pred)

#### Observation
- In the report above, the accuracy matches the F-score, which suggests the model performs comparably on both classes
- Higher precision means fewer false positives
- Higher recall means fewer false negatives

<a id=section4.8></a>
### 4.8 K-Nearest Neighbours

In [None]:
#K-Nearest Neighbours
from sklearn.neighbors import KNeighborsClassifier
error_rate = []
for i in range(1,51):    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

plt.figure(figsize=(8,4))
plt.plot(range(1,51),error_rate,color='darkred', marker='o',markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

In [None]:
#We choose K=7 because the error-rate curve has an elbow there
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train,y_train)
y_pred=knn.predict(X_test)
print('Accuracy score for test data is:', accuracy_score(y_test,y_pred))
calculate_error(y_test,y_pred)

#### Observation
- In the report above, the accuracy matches the F-score, which suggests the model performs comparably on both classes
- Higher precision means fewer false positives
- Higher recall means fewer false negatives
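
The error-rate curve above is computed on the test set, so the test data influences the choice of K. One alternative, sketched here under assumptions, is to pick K by cross-validation on the training data only. The synthetic data from `make_classification`, the candidate list `k_values` and the names `cv_error` and `best_k` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

k_values = list(range(1, 16, 2))  # odd K values avoid ties in binary voting
cv_error = []
for k in k_values:
    # Mean accuracy over 5 folds, converted to an error rate.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5)
    cv_error.append(1 - scores.mean())

# K with the lowest cross-validated error.
best_k = k_values[int(np.argmin(cv_error))]
print(best_k)
```

The test set then stays untouched until the final evaluation of the chosen K.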

<a id=section4.9></a>
### 4.9 SVM

In [None]:
#SVM
from sklearn.svm import SVC

svm =  SVC(kernel="rbf", C=0.025,random_state=101)
svm.fit(X_train, y_train)
y_pred=svm.predict(X_test)
#print('Accuracy score for test data is:', accuracy_score(y_test,y_pred))
calculate_error(y_test,y_pred)

#### Observation
- In the report above, the accuracy matches the F-score, which suggests the model performs comparably on both classes
- Higher precision means fewer false positives
- Higher recall means fewer false negatives

<a id=section4.10></a>
### 4.10 Ensemble Bagging - voting classifier

In [None]:
#Ensemble Bagging - voting classifier
from sklearn.ensemble import VotingClassifier
model1 = LogisticRegression(random_state=1)
model2 = tree.DecisionTreeClassifier(random_state=1)
model3 = RandomForestClassifier(**search.best_params_)

en_voting = VotingClassifier(estimators=[('lr', model1),('dt', model2), ('rf',model3)], voting='hard')
en_voting.fit(X_train,y_train)

y_pred=en_voting.predict(X_test)
calculate_error(y_test,y_pred)
#print('Accuracy score for test data is:', accuracy_score(y_test,y_pred))

#### Observation
- In the report above, the accuracy matches the F-score, which suggests the model performs comparably on both classes
- Higher precision means fewer false positives
- Higher recall means fewer false negatives

<a id=section4.11></a>
### 4.11 AdaBoost

In [None]:
#AdaBoost
from sklearn.ensemble import AdaBoostClassifier
ada_boost = AdaBoostClassifier(random_state=1)
ada_boost.fit(X_train, y_train)

y_pred=ada_boost.predict(X_test)
#print('Accuracy score for test data is:', accuracy_score(y_test,y_pred))
calculate_error(y_test,y_pred)

#### Observation
- In the report above, the accuracy matches the F-score, which suggests the model performs comparably on both classes
- Higher precision means fewer false positives
- Higher recall means fewer false negatives

<a id=section4.12></a>
### 4.12 Gradient Boosting

In [None]:
#Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
gra_boost= GradientBoostingClassifier(learning_rate=0.01,random_state=1)
gra_boost.fit(X_train, y_train)

y_pred=gra_boost.predict(X_test)
#print('Accuracy score for test data is:', accuracy_score(y_test,y_pred))
calculate_error(y_test,y_pred)

#### Observation
- In the report above, the accuracy matches the F-score, which suggests the model performs comparably on both classes
- Higher precision means fewer false positives
- Higher recall means fewer false negatives

<a id=section4.13></a>
### 4.13 XGBoost

In [None]:
#XGBoost
import xgboost as xgb
xg_boost=xgb.XGBClassifier(random_state=1,learning_rate=0.01)
xg_boost.fit(X_train, y_train)

y_pred=xg_boost.predict(X_test)
#print('Accuracy score for test data is:', accuracy_score(y_test,y_pred))
calculate_error(y_test,y_pred)

#### Observation
- In the report above, the accuracy matches the F-score, which suggests the model performs comparably on both classes
- Higher precision means fewer false positives
- Higher recall means fewer false negatives

<a id=section4.14></a>
### 4.14 Conclusion and model selection

- Of all the classification algorithms above, the random forest with hyperparameter tuning and the ensemble voting classifier achieved the highest accuracy score, 97%
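
A side-by-side comparison like the one summarised above can be produced by fitting several models in a loop and collecting their test accuracy. This is a minimal sketch on synthetic data: the three models, their settings and the names `models`, `scores`, `X_tr`, `X_te` are assumptions for illustration, and the printed numbers have no relation to the 97% reported for the phishing dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset, split 80:20 with a fixed seed.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each model on the same split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

# Print the models from best to worst.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Using one fixed split for every model keeps the comparison fair, since all scores are measured on the same test rows.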

<a id=section5></a> 
## 5. Model Deployment
- Deploy the random forest model tuned with RandomizedSearchCV using the Python pickle module
- A REST interface is provided to test the model
- Run deploy.ipynb to access the REST interface for prediction

In [None]:
import pickle
#Use a context manager so the file is closed after writing
with open("model.pkl","wb") as f:
    pickle.dump(clf, f)
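
A serving process would load the pickled model again before predicting. The sketch below shows that round trip under assumptions: it trains a small stand-in model on synthetic data, writes it to a temporary file named `demo_model.pkl` (a hypothetical name, not the notebook's `model.pkl`), and checks that the reloaded model gives the same predictions.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
demo_model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    # Serialize the fitted model to disk.
    with open(model_path, "wb") as f:
        pickle.dump(demo_model, f)
    # Load it back, as a serving process would do at start-up.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = (loaded_model.predict(X_demo) == demo_model.predict(X_demo)).all()
print(same_predictions)
```

The loading side needs a scikit-learn version compatible with the one used for training, since pickles of estimators are not guaranteed to load across versions.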