## The Website Phishing data contains 10 columns: nine features and the `Result` label.
## Each feature helps to identify whether a website is legitimate (1), suspicious (0), or phishy (-1).  
# Applying a **Decision Tree** to classify whether a website is phishy or not.



In [None]:
import numpy as np
import pandas as pd


In [None]:
df = pd.read_csv('../input/website-phishing-data-set/Website Phishing.csv')

In [None]:
df.head(10)

Name and Meaning of Each Column

The columns in the dataframe represent different features related to phishing. The values associated with each one of the columns are described below:

1. SFH (Server Form Handler) - After the user submits information, the website sends it to a server to process the data. Phishy websites usually leave the SFH field blank or redirect to another domain.

2. Pop up Window - Legitimate websites don't use pop up windows to validate users' information.

3. SSL final state - Reliable webpages use the HTTPS protocol; malicious websites may use HTTPS with an untrustworthy certificate or not use it at all.

4. Request URL - Malicious websites usually load the page content from a different URL than the original website URL.

5. URL of anchor - Malicious websites usually have links that point to different webpages.

6. Web traffic - Legitimate websites usually have a higher number of visits than malicious ones, which tend to be short-lived and attract little traffic.

7. URL length - URLs longer than 75 characters are considered a sign of a phishy website.

8. Age of domain - Websites that have existed for less than a year are considered suspicious.

9. Having IP Address - The presence of IP address in the website URL is associated with malicious websites.
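As an illustration of how two of these rules could be checked on a raw URL, here is a minimal sketch. The helper names `is_long_url` and `has_ip_host` are hypothetical and are not part of the dataset or this notebook; they return plain booleans rather than the dataset's own encoding, which is not reproduced here.

```python
import ipaddress
from urllib.parse import urlparse


def is_long_url(url, threshold=75):
    """Return True when the URL is longer than `threshold` characters."""
    return len(url) > threshold


def has_ip_host(url):
    """Return True when the host part of the URL is a literal IP address."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        # The host is a domain name (or missing), not an IP address.
        return False


print(is_long_url("http://example.com/"))       # short URL
print(has_ip_host("http://192.168.0.1/login"))  # IP address used as host
```

The 75-character threshold comes from the description above; everything else is an assumption made for the example.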

In [None]:
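# Result encoding: 1 = legitimate, 0 = suspicious, -1 = phishy
suspicious_count = len(df[df.Result==0])
phishy_count = len(df[df.Result==-1])
legitimate_count = len(df[df.Result==1])

In [None]:
print("Count of Legitimate Websites = ", legitimate_count)
print("Count of Suspicious Websites = ", suspicious_count)
print("Count of Phishy Websites = ", phishy_count)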
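The same class counts can be obtained in one step with `value_counts`. The sketch below uses a small made-up `Result` column (the name `toy` and its values are assumptions for illustration, not the real data) and maps the numeric labels to names using the encoding stated at the top of the notebook.

```python
import pandas as pd

# Made-up stand-in for the real Result column.
toy = pd.DataFrame({"Result": [-1, -1, 0, 1, 1, 1]})

# Encoding from the notebook header: 1 legitimate, 0 suspicious, -1 phishy.
label_names = {-1: "Phishy", 0: "Suspicious", 1: "Legitimate"}

counts = toy["Result"].map(label_names).value_counts()
print(counts)
```

On the real dataframe the equivalent call would be `df["Result"].map(label_names).value_counts()`.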

Exploring Data

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df.plot.hist(subplots=True, layout=(5,5), figsize=(15, 15), bins=20)

In [None]:
#Correlation Matrix
df.corr()
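The full correlation matrix can be hard to scan. One possible way to read it is to keep only the `Result` column and rank the features by absolute correlation. The sketch below does this on a tiny made-up frame (`toy` and its values are assumptions, chosen only so the example runs).

```python
import pandas as pd

# Made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "SFH":         [1, -1,  1, -1, 0,  1],
    "web_traffic": [1,  0, -1,  1, 0, -1],
    "Result":      [1, -1,  1, -1, 0,  1],
})

# Correlation of every feature with the label, strongest first.
corr_with_target = (
    toy.corr()["Result"]
    .drop("Result")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Correlation only captures linear association, so a low value here does not mean a feature is useless to a decision tree.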

In [None]:
df.info()

Model Training

In [None]:
x = df.drop('Result',axis=1).values 
y = df['Result'].values

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.40,random_state=10)

print("Training set has {} samples.".format(x_train.shape[0]))
print("Testing set has {} samples.".format(x_test.shape[0]))
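If the classes are imbalanced, a plain random split can leave the smallest class under-represented in the test set. A possible variation, sketched here on synthetic arrays (`X_toy` and `y_toy` are made up), is to pass `stratify` so both sets keep the original class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 9 features with values in {-1, 0, 1}.
rng = np.random.default_rng(0)
X_toy = rng.integers(-1, 2, size=(100, 9))
# 50 phishy, 10 suspicious, 40 legitimate labels.
y_toy = np.repeat([-1, 0, 1], [50, 10, 40])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.40, random_state=10, stratify=y_toy
)

# Each class keeps its 40% share in the test set.
print({label: int((y_te == label).sum()) for label in (-1, 0, 1)})
```

This is a variation on the split used above, not what the notebook itself does.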

In [None]:
from sklearn import tree

In [None]:
# Fix random_state so the fitted tree is reproducible between runs
model_tree = tree.DecisionTreeClassifier(random_state=10)
model = model_tree.fit(x_train, y_train)
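An unconstrained decision tree can overfit the training set. One way to get a steadier estimate is k-fold cross-validation with a depth limit. The sketch below runs on synthetic data (`X_toy`, `y_toy`, and the choice `max_depth=5` are assumptions for illustration, not tuned values).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: the label simply copies the first feature,
# so a shallow tree can learn it perfectly.
rng = np.random.default_rng(0)
X_toy = rng.integers(-1, 2, size=(200, 9))
y_toy = X_toy[:, 0]

clf = DecisionTreeClassifier(max_depth=5, random_state=10)
scores = cross_val_score(clf, X_toy, y_toy, cv=5)  # accuracy per fold
print(scores.mean())
```

On the real data, `x_train` and `y_train` would take the place of the synthetic arrays.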

In [None]:
features = ('SFH', 'popUpWidnow', 'SSLfinal_State', 'Request_URL', 'URL_of_Anchor',
            'web_traffic', 'URL_Length', 'age_of_domain', 'having_IP_Address')
# Class names follow the sorted label order: -1, 0, 1
name = ('Phishing', 'Suspicious', 'Legitimate')

In [None]:
#Decision Tree
import graphviz 
dot_data = tree.export_graphviz(model, out_file=None, 
                      feature_names = features, class_names =  name,
                      filled=True, rounded=True,  
                      special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 
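When Graphviz is not installed, scikit-learn's `export_text` offers a plain-text view of the same kind of tree. The sketch below fits a tiny tree on made-up data (`X_toy`, `y_toy`) and reuses two of the feature names only as labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: the label depends only on the first column.
X_toy = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
y_toy = np.array([1, 1, -1, -1])

clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Print the learned rules as indented text.
rules = export_text(clf, feature_names=["SFH", "SSLfinal_State"])
print(rules)
```

For the notebook's model, the call would be `export_text(model, feature_names=list(features))`.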

Model Testing

In [None]:
from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
 
#Test the model using testing data
predictions = model.predict(x_test)

In [None]:
from sklearn.metrics import confusion_matrix,classification_report
confusion_matrix(y_test,predictions)
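The raw confusion matrix has no row or column labels. A minimal sketch of one way to label it is shown below, using made-up predictions (`y_true` and `y_pred` are assumptions) and a fixed label order so that rows and columns always mean the same thing.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for illustration.
y_true = [-1, -1, 0, 1, 1, 1]
y_pred = [-1, 0, 0, 1, 1, -1]

labels = [-1, 0, 1]
names = ["Phishy", "Suspicious", "Legitimate"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f"true {n}" for n in names],
    columns=[f"pred {n}" for n in names],
)
print(cm_df)
```

Passing `labels` explicitly keeps the layout stable even if a class is missing from the test predictions.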

In [None]:
print("f1 score is ",f1_score(y_test,predictions,average='weighted'))
print("The accuracy of Decision Tree Algorithm on testing data is: ",100.0 *accuracy_score(y_test,predictions))
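The notebook imports `matthews_corrcoef` and `classification_report` without calling them. A short sketch of how they could be used follows, again on made-up predictions (`y_true`, `y_pred`); on the real data `y_test` and `predictions` would take their place.

```python
from sklearn.metrics import classification_report, matthews_corrcoef

# Made-up labels and predictions for illustration.
y_true = [-1, -1, 0, 1, 1, 1]
y_pred = [-1, 0, 0, 1, 1, -1]

# Per-class precision, recall and F1.
report = classification_report(
    y_true,
    y_pred,
    labels=[-1, 0, 1],
    target_names=["Phishy", "Suspicious", "Legitimate"],
    zero_division=0,
)
print(report)

# Matthews correlation coefficient: a single score in [-1, 1].
mcc = matthews_corrcoef(y_true, y_pred)
print("MCC:", mcc)
```

Per-class scores are useful here because overall accuracy can hide poor performance on the small suspicious class.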