# Phishing Site Detection using KNN

###### The data contains attributes encoded in the following format

    SFH's type is nominal, range is ('1', '-1', '0')
    popUpWidnow's type is nominal, range is ('-1', '0', '1')
    SSLfinal_State's type is nominal, range is ('1', '-1', '0')
    Request_URL's type is nominal, range is ('-1', '0', '1')
    URL_of_Anchor's type is nominal, range is ('-1', '0', '1')
    web_traffic's type is nominal, range is ('1', '0', '-1')
    URL_Length's type is nominal, range is ('1', '-1', '0')
    age_of_domain's type is nominal, range is ('1', '-1')
    having_IP_Address's type is nominal, range is ('0', '1')
    Result's type is nominal, range is ('0', '1', '-1')

    The data coding is as follows:
    "1"  - Legitimate Website
    "0"  - Suspicious
    "-1" - Malicious Website

##### Step 1: Import the required libraries and load the data

In [1]:
import numpy as np
import pandas as pd
from scipy.io.arff import loadarff
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


In [8]:
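# loadarff returns a (data, meta) tuple: a NumPy record array plus attribute metadata
phishing_data_raw = loadarff("PhishingData.arff")


In [70]:
# Keep only the record array (element 0); the metadata is not used below
phishing_data_array = np.array(phishing_data_raw[0])

In [13]:
# Nominal ARFF values are loaded as byte strings, so convert every column to numbers
phishing_data_frame = pd.DataFrame(phishing_data_array).apply(pd.to_numeric)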
phishing_data_frame.head()

Unnamed: 0,having_IP_Address,URL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,Favicon,...,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
0,-1,1,1,1,-1,-1,-1,-1,-1,1,...,1,1,-1,-1,-1,-1,1,1,-1,-1
1,1,1,1,1,1,-1,0,1,-1,1,...,1,1,-1,-1,0,-1,1,1,1,-1
2,1,0,1,1,1,-1,-1,-1,-1,1,...,1,1,1,-1,1,-1,1,0,-1,-1
3,1,0,1,1,1,-1,-1,-1,1,1,...,1,1,-1,-1,1,-1,1,-1,1,-1
4,1,0,-1,1,1,-1,1,1,-1,1,...,-1,1,-1,-1,0,-1,1,1,1,1
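The loading steps above can be tried without the real file. The following is a minimal sketch that assumes a hypothetical two-attribute ARFF document held in memory (`toy_arff`, `toy_frame` and the attribute values are illustrative, not taken from `PhishingData.arff`). It decodes the byte strings explicitly before casting to integers, which is one alternative to `pd.to_numeric`.

```python
import io

import pandas as pd
from scipy.io.arff import loadarff

# Hypothetical miniature ARFF file standing in for PhishingData.arff
toy_arff = io.StringIO(
    "@relation toy\n"
    "@attribute SFH {1,-1,0}\n"
    "@attribute Result {0,1,-1}\n"
    "@data\n"
    "1,-1\n"
    "0,1\n"
    "-1,1\n"
)

# loadarff accepts a file name or a file-like object
data, meta = loadarff(toy_arff)

# Nominal values arrive as bytes (for example b'-1'), one column per attribute
toy_frame = pd.DataFrame(data)

# Decode the bytes to text, then cast every column to integers
toy_frame = toy_frame.apply(lambda col: col.str.decode("utf-8")).astype(int)

print(meta.names())
print(toy_frame)
```

Decoding explicitly makes the bytes-to-integer step visible, which can help when a column contains an unexpected category.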


##### Step 2: Extract the predictor and target variables, then fit and cross-validate a baseline KNN model

In [20]:
col_names = list(phishing_data_frame)
print("Column_names: " + ", ".join(col_names))

Column_names: having_IP_Address, URL_Length, Shortining_Service, having_At_Symbol, double_slash_redirecting, Prefix_Suffix, having_Sub_Domain, SSLfinal_State, Domain_registeration_length, Favicon, port, HTTPS_token, Request_URL, URL_of_Anchor, Links_in_tags, SFH, Submitting_to_email, Abnormal_URL, Redirect, on_mouseover, RightClick, popUpWidnow, Iframe, age_of_domain, DNSRecord, web_traffic, Page_Rank, Google_Index, Links_pointing_to_page, Statistical_report, Result


In [16]:
labels = phishing_data_frame["Result"].values
print(labels)

[-1 -1 -1 ... -1 -1 -1]


In [21]:
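# Note: the "- 2" drops Statistical_report as well as Result, leaving 29 features.
# col_names[:-1] would keep all 30; the results in this notebook use the 29-feature slice.
predictor_cols = col_names[0:len(col_names) - 2]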
print("Predictor_columns: "  +  ", ".join(predictor_cols))

Predictor_columns: having_IP_Address, URL_Length, Shortining_Service, having_At_Symbol, double_slash_redirecting, Prefix_Suffix, having_Sub_Domain, SSLfinal_State, Domain_registeration_length, Favicon, port, HTTPS_token, Request_URL, URL_of_Anchor, Links_in_tags, SFH, Submitting_to_email, Abnormal_URL, Redirect, on_mouseover, RightClick, popUpWidnow, Iframe, age_of_domain, DNSRecord, web_traffic, Page_Rank, Google_Index, Links_pointing_to_page
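The slice above stops two positions before the end of the column list, which is why `Statistical_report` is missing from the printed predictors. The sketch below shows the difference on a hypothetical three-column frame (`toy` and its values are made up for illustration) and compares it with dropping the target by name.

```python
import pandas as pd

# Hypothetical miniature frame with the last three columns of the real data
toy = pd.DataFrame({
    "Links_pointing_to_page": [1, 0, -1],
    "Statistical_report": [-1, 1, 1],
    "Result": [-1, -1, 1],
})
cols = list(toy)

dropped_two = cols[0:len(cols) - 2]    # same slice as in the notebook
dropped_one = cols[:-1]                # removes only the last column
by_name = toy.drop(columns="Result")   # removes the target regardless of position

print(dropped_two)
print(dropped_one)
print(list(by_name))
```

Dropping by name is usually the safer choice, because it does not depend on where the target column sits.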


In [22]:
predictors = phishing_data_frame[predictor_cols].values
predictors

array([[-1,  1,  1, ..., -1,  1,  1],
       [ 1,  1,  1, ..., -1,  1,  1],
       [ 1,  0,  1, ..., -1,  1,  0],
       ...,
       [ 1, -1,  1, ..., -1,  1,  0],
       [-1, -1,  1, ..., -1,  1,  1],
       [-1, -1,  1, ..., -1, -1,  1]], dtype=int64)

In [30]:
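# Baseline model: 10 nearest neighbours, with votes weighted by inverse distance
clf_knn = KNeighborsClassifier(n_neighbors = 10, weights = 'distance')

# Fit on the full data set; printing the estimator shows its parameters
clf_knn = clf_knn.fit(predictors, labels)
print(clf_knn)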

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='distance')
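With `weights='distance'`, each neighbour's vote is weighted by the inverse of its distance to the query point, so a single very close neighbour can outvote several distant ones. The sketch below illustrates this on a hypothetical one-dimensional data set (`X_toy`, `y_toy` and `query` are invented for the example).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One legitimate point (label 1) close to the query, two malicious points (-1) further away
X_toy = np.array([[0.0], [2.0], [3.0]])
y_toy = np.array([1, -1, -1])
query = np.array([[0.5]])

uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
weighted = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

# Uniform voting: two votes for -1 against one vote for 1
uniform_pred = uniform.predict(query)[0]

# Distance weighting: 1/0.5 = 2.0 for label 1 against 1/1.5 + 1/2.5 (about 1.07) for label -1
weighted_pred = weighted.predict(query)[0]

print(uniform_pred, weighted_pred)
```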


In [24]:
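# cross_val_score clones the estimator and refits it on each of the 5 folds,
# so the fit on the full data above is not reused here
score_knn = cross_val_score(clf_knn, predictors, labels, cv = 5)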

In [28]:
print("Cross validation score :" + str(score_knn))

Cross validation score :[0.98101266 0.97377939 0.96291271 0.93846154 0.92171946]


In [29]:
print("Cross validation Mean score :" + str(score_knn.mean()))

Cross validation Mean score :0.9555771496112235
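The fold scores above fall steadily from 0.98 to 0.92. With an integer `cv` and a classifier, scikit-learn uses stratified folds without shuffling, so one possible cause (an assumption, not verified here) is that the rows of the file are ordered in some way. The sketch below shows how shuffled stratified folds can be passed instead, using synthetic data from `make_classification` as a stand-in for the phishing features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the real notebook would pass predictors and labels
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=0)

knn = KNeighborsClassifier(n_neighbors=10, weights="distance")

# Shuffle before splitting so that any ordering in the rows cannot shape the folds
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(knn, X_syn, y_syn, cv=folds)

print("Fold scores:", scores)
print("Mean: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```

Reporting the standard deviation next to the mean gives a quick sense of how much the folds disagree.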


##### Step 3: Split the data into training and test sets

In [34]:
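# 70/30 split; no random_state is set, so the split (and the grid search result) can change between runs
X_train, X_test, y_train, y_test = train_test_split(predictors, labels, test_size = 0.3)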
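The split above does not set `random_state`, so it differs between runs. A minimal sketch of a reproducible, class-balanced split follows; the arrays `X_demo` and `y_demo` are hypothetical, and the use of `stratify` is an addition that the notebook itself does not make.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, 8 phishing (-1) and 12 legitimate (1) labels
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([-1] * 8 + [1] * 12)

def make_split():
    # random_state fixes the shuffle; stratify keeps the class ratio similar in both parts
    return train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=123, stratify=y_demo
    )

X_tr, X_te, y_tr, y_te = make_split()
X_tr2, X_te2, y_tr2, y_te2 = make_split()

print(X_tr.shape, X_te.shape)
print("Same split on both calls:", np.array_equal(X_te, X_te2))
```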

##### Step 4: Apply a grid search to find the best hyperparameters

In [31]:
from sklearn.model_selection import GridSearchCV

In [37]:
param_grid = {'leaf_size' : [10, 20, 30, 40, 50],
             'n_neighbors' : [5, 7, 9, 11, 13, 15, 17, 19]}

In [38]:
# 10-fold cross-validation over all 40 combinations (5 leaf sizes x 8 neighbour counts)
knn_grid = GridSearchCV(clf_knn, param_grid, cv = 10)
knn_grid_model = knn_grid.fit(X_train, y_train)

In [40]:
knn_grid.best_params_

{'leaf_size': 10, 'n_neighbors': 11}

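###### GridSearchCV selects leaf_size = 10 and n_neighbors (k) = 11 as the best hyperparameters. Note that leaf_size only controls the speed and memory use of the tree-based neighbour search, so it is not expected to change the predictions.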
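In scikit-learn, `leaf_size` is passed to the BallTree or KDTree used to find neighbours and affects construction and query speed rather than which neighbours are found. The sketch below, on synthetic data from `make_classification` (an assumption standing in for the phishing features), searches over `n_neighbors` only and then checks that two leaf sizes give the same predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the real notebook would use X_train and y_train
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

# Search over k only, since k is the parameter that changes the predictions
grid = GridSearchCV(
    KNeighborsClassifier(weights="distance"),
    {"n_neighbors": [5, 7, 9, 11]},
    cv=5,
)
grid.fit(X_syn, y_syn)
print("Best k:", grid.best_params_["n_neighbors"])
print("Best mean CV accuracy: %.3f" % grid.best_score_)

# Compare predictions for two leaf sizes with the tree algorithm forced on
preds = [
    KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree", leaf_size=size)
    .fit(X_syn, y_syn)
    .predict(X_syn)
    for size in (10, 50)
]
same_predictions = np.array_equal(preds[0], preds[1])
print("Same predictions for leaf_size 10 and 50:", same_predictions)
```

`best_score_` reports the mean cross-validated accuracy of the winning combination, which is worth printing alongside `best_params_`.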

##### Step 5: With k chosen by cross-validation, train and evaluate the final model

In [49]:
col_names_ml = list(phishing_data_frame)
print("Column_names :" + ", ".join(col_names_ml))

Column_names :having_IP_Address, URL_Length, Shortining_Service, having_At_Symbol, double_slash_redirecting, Prefix_Suffix, having_Sub_Domain, SSLfinal_State, Domain_registeration_length, Favicon, port, HTTPS_token, Request_URL, URL_of_Anchor, Links_in_tags, SFH, Submitting_to_email, Abnormal_URL, Redirect, on_mouseover, RightClick, popUpWidnow, Iframe, age_of_domain, DNSRecord, web_traffic, Page_Rank, Google_Index, Links_pointing_to_page, Statistical_report, Result


In [58]:
labels_ml = phishing_data_frame["Result"].values

In [53]:
# Note: the "- 2" drops Statistical_report as well as Result, so 29 features are used
predictor_col_ml = col_names_ml[0: len(col_names_ml) - 2]

print("Predictor Columns :" + ", ".join(predictor_col_ml))

Predictor Columns :having_IP_Address, URL_Length, Shortining_Service, having_At_Symbol, double_slash_redirecting, Prefix_Suffix, having_Sub_Domain, SSLfinal_State, Domain_registeration_length, Favicon, port, HTTPS_token, Request_URL, URL_of_Anchor, Links_in_tags, SFH, Submitting_to_email, Abnormal_URL, Redirect, on_mouseover, RightClick, popUpWidnow, Iframe, age_of_domain, DNSRecord, web_traffic, Page_Rank, Google_Index, Links_pointing_to_page


In [56]:
predictors_ml = phishing_data_frame[predictor_col_ml].values

In [59]:
X_train, X_test, y_train, y_test = train_test_split(predictors_ml, labels_ml, test_size = 0.3, random_state = 123)

In [60]:
# Final model with the hyperparameters selected by the grid search
clf_knn_ml = KNeighborsClassifier(n_neighbors = 11, leaf_size = 10, weights = 'distance')

In [61]:
clf_knn_ml = clf_knn_ml.fit(X_train, y_train)

In [67]:
pred = clf_knn_ml.predict(X_test)

In [69]:
print(classification_report(y_test, pred))

             precision    recall  f1-score   support

         -1       0.96      0.94      0.95      1508
          1       0.95      0.96      0.96      1809

avg / total       0.95      0.95      0.95      3317



In [73]:
print(accuracy_score(y_test, pred) * 100)

95.14621646065721
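`confusion_matrix` is imported in Step 1 but never called. The sketch below shows how it relates to the accuracy score, using short hypothetical label lists (`y_true_demo`, `y_pred_demo`) rather than the notebook's `y_test` and `pred`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels: -1 = phishing, 1 = legitimate
y_true_demo = [-1, -1, -1, 1, 1, 1, 1, -1]
y_pred_demo = [-1, -1, 1, 1, 1, -1, 1, -1]

# Rows are true classes, columns are predicted classes, both in the order [-1, 1]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[-1, 1])
print(cm)

# Accuracy is the diagonal (correct predictions) divided by the total count
accuracy_from_cm = cm.trace() / cm.sum()
accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(accuracy_from_cm, accuracy)
```

In the notebook itself, the same call would be `confusion_matrix(y_test, pred, labels=[-1, 1])`. The off-diagonal cells then show how many phishing sites were missed and how many legitimate sites were flagged.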


###### With an accuracy of about 95.1% on the held-out test set (3,317 sites), the model separates phishing sites from genuine ones reasonably well.