## Detection and Prediction of Phishing Websites using Classification Mining Techniques

**Mofleh Al-diabat**\
Department of Computer Science - Al Albayt University\
International Journal of Computer Applications (0975 – 8887) - Volume 147 – No.5, August 2016

*ABSTRACT*\
Phishing is a serious web security problem that involves mimicking legitimate websites to deceive online users and steal their sensitive information. Phishing can be seen as a typical classification problem in data mining, where the classifier is constructed from a large number of website features. There is high demand for identifying the best set of features, that is, the set that most improves the predictive accuracy of the classifiers when mined. This paper investigates feature selection, aiming to determine the most effective set of features in terms of classification performance. We compare two known feature selection methods in order to determine the smallest set of features for phishing detection using data mining. Experimental tests on a data set with a large number of features have been carried out using the Information Gain and Correlation Features set methods. Further, two data mining algorithms, namely C4.5 and IREP, have been trained on different sets of selected features to show the pros and cons of the feature selection process. We have been able to identify new knowledge in the form of rules that show vital correlations among significant features.
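
The Information Gain ranking mentioned in the abstract can be approximated in scikit-learn with mutual information, which is the same quantity when both the feature and the class label are discrete. The sketch below is not the paper's experiment: it runs on a synthetic two-feature data set, and the helper name `rank_features_by_information_gain` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def rank_features_by_information_gain(X, y):
    """Rank discrete features by mutual information with the class label."""
    # discrete_features=True treats every column as categorical,
    # which matches features coded as -1 / 0 / 1
    gains = mutual_info_classif(X, y, discrete_features=True)
    return pd.Series(gains, index=X.columns).sort_values(ascending=False)


# Synthetic toy data (assumption: stands in for real website features)
rng = np.random.default_rng(0)
n_rows = 500
informative = rng.choice([-1, 0, 1], size=n_rows)
noise = rng.choice([-1, 0, 1], size=n_rows)

toy_X = pd.DataFrame({"informative": informative, "noise": noise})
toy_y = (informative == 1).astype(int)  # the label depends only on one feature

ranking = rank_features_by_information_gain(toy_X, toy_y)
print(ranking)
```

Features near the bottom of such a ranking are candidates for removal before training, which is the kind of comparison the paper describes.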

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, log_loss

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from xgboost import XGBClassifier

In [2]:
phishing_data = pd.read_csv('phishing_data.csv')
phishing_data.head()

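| Unnamed: 0 | SFH | popUpWindow | SSLFinal_State | Request_URL | URL_of_Anchor | Web_Traffic | URL_Length | Age_of_Domain | Having_IP_Address | Result |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -1 | 1 | -1 | -1 | 1 | 1 | 1 | 0 | 0 |
| 1 | -1 | -1 | -1 | -1 | -1 | 0 | 1 | 1 | 1 | 1 |
| 2 | 1 | -1 | 0 | 0 | -1 | 0 | -1 | 1 | 0 | 1 |
| 3 | 1 | 0 | 1 | -1 | -1 | 0 | 1 | 1 | 0 | 0 |
| 4 | -1 | -1 | 1 | -1 | 0 | 0 | -1 | 1 | 0 | 1 |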


In [3]:
X = phishing_data.drop('Result', axis=1)
y = phishing_data['Result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
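
The preview shows an `Unnamed: 0` column, which usually means a row index was written to the CSV. Dropping only `Result` leaves that column in `X` as a feature. A minimal sketch of one way to avoid this, assuming the column really is a saved index (the notebook does not say), is to pass `index_col=0` when reading. The five rows below are copied from the preview, and the `sample` names are illustrative.

```python
import io

import pandas as pd

# First five rows from the preview; the header's first field is left empty,
# which is what DataFrame.to_csv writes for an unnamed index.
sample_csv = io.StringIO(
    ",SFH,popUpWindow,SSLFinal_State,Request_URL,URL_of_Anchor,"
    "Web_Traffic,URL_Length,Age_of_Domain,Having_IP_Address,Result\n"
    "0,1,-1,1,-1,-1,1,1,1,0,0\n"
    "1,-1,-1,-1,-1,-1,0,1,1,1,1\n"
    "2,1,-1,0,0,-1,0,-1,1,0,1\n"
    "3,1,0,1,-1,-1,0,1,1,0,0\n"
    "4,-1,-1,1,-1,0,0,-1,1,0,1\n"
)

# index_col=0 turns the first CSV column into the row index instead of a feature
sample = pd.read_csv(sample_csv, index_col=0)

X_sample = sample.drop(columns="Result")
y_sample = sample["Result"]

print(list(X_sample.columns))
```

The scores reported in this notebook were produced with the column left in, so rerunning with this change would likely give slightly different numbers.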

In [4]:
names = ["Logistic Regression", "Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost", "XGBoost", "Naive Bayes", "QDA"]
scores = []

classifiers = [
    LogisticRegression(),
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=0.001, solver='lbfgs', learning_rate='adaptive', max_iter=1000),
    AdaBoostClassifier(),
    XGBClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

for classifier in classifiers:
    # Single-step pipeline: behaves the same as fitting the classifier directly
    pipe = Pipeline(steps=[('classifier', classifier)])
    pipe.fit(X_train, y_train)

    # Score once on the held-out test set and reuse the value
    score = pipe.score(X_test, y_test)
    scores.append(score)

    print(classifier)
    print("model score: %.3f" % score)
    print("\n -----------------------------------------------------------------------------------")

# Collect the test accuracies into a table, one row per classifier
scores_df = pd.DataFrame(list(zip(names, scores)), columns=['Classifier', 'Accuracy Score'])

LogisticRegression()
model score: 0.844

 -----------------------------------------------------------------------------------
KNeighborsClassifier(n_neighbors=3)
model score: 0.882

 -----------------------------------------------------------------------------------
SVC(C=0.025, kernel='linear')
model score: 0.808

 -----------------------------------------------------------------------------------
SVC(C=1, gamma=2)
model score: 0.867

 -----------------------------------------------------------------------------------
GaussianProcessClassifier(kernel=1**2 * RBF(length_scale=1))
model score: 0.844

 -----------------------------------------------------------------------------------
DecisionTreeClassifier(max_depth=5)
model score: 0.838

 -----------------------------------------------------------------------------------
RandomForestClassifier(max_depth=5, max_features=1, n_estimators=10)
model score: 0.832

 -----------------------------------------------------------------------------------
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=12, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
model score: 0.900

 -----------------------------------------------------------------------------------
GaussianNB()
model score: 0.823

 -----------------------------------------------------------------------------------
QuadraticDiscriminantAnalysis()
model score: 0.847

 -----------------------------------------------------------------------------------
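
Each score above comes from a single 75/25 split, so it depends on which rows happened to land in the test set. A hedged sketch of a more stable comparison is stratified k-fold cross-validation. The example uses a synthetic data set from `make_classification` and three of the classifiers, so its numbers say nothing about the phishing data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the phishing features (assumption: not the real data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=9, n_informative=5, random_state=42
)

demo_classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Nearest Neighbors": KNeighborsClassifier(3),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

# Five stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for name, clf in demo_classifiers.items():
    fold_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_results[name] = (fold_scores.mean(), fold_scores.std())
    print("%s: %.3f +/- %.3f" % (name, fold_scores.mean(), fold_scores.std()))
```

The standard deviation across folds gives a rough sense of how much of a gap between two classifiers could be split-to-split noise.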


In [5]:
print(scores_df.sort_values(by='Accuracy Score', ascending=False))

             Classifier  Accuracy Score
7            Neural Net        0.908555
9               XGBoost        0.899705
1     Nearest Neighbors        0.882006
3               RBF SVM        0.867257
11                  QDA        0.846608
0   Logistic Regression        0.843658
4      Gaussian Process        0.843658
5         Decision Tree        0.837758
8              AdaBoost        0.837758
6         Random Forest        0.831858
10          Naive Bayes        0.823009
2            Linear SVM        0.808260
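
The top scores are close together, so it helps to estimate how precisely a single test set can measure accuracy. The sketch below computes a normal-approximation (Wald) interval. The test-set size of 339 is an assumption: every score in the table is consistent with a whole number of correct predictions out of 339, but the notebook never prints the size. The function name `accuracy_interval` is hypothetical.

```python
import math


def accuracy_interval(accuracy, n_test, z=1.96):
    """Normal-approximation (Wald) interval for an accuracy measured on n_test samples."""
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n_test)
    # Clip to [0, 1] because the approximation can overshoot near the edges
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


N_TEST = 339  # assumption: inferred from the scores, not stated in the notebook

for name, acc in [("Neural Net", 0.908555), ("XGBoost", 0.899705), ("Linear SVM", 0.808260)]:
    low, high = accuracy_interval(acc, N_TEST)
    print("%-12s %.3f  [%.3f, %.3f]" % (name, acc, low, high))
```

Under that assumption the half-width for the best score is roughly 0.03, which is larger than the gap between the top two classifiers, so the ordering near the top of the table should be read with caution.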
