# Importing Required Libraries

- **pandas**: For data processing and CSV file I/O (e.g. `pd.read_csv`)
- **NumPy**: For linear algebra
- **Matplotlib** and **seaborn**: For data visualization
- **sklearn.model_selection**: For splitting data into train and test sets, and for grid search
- **sklearn.linear_model.LogisticRegression**: For logistic regression
- **sklearn.metrics**: For evaluation metrics

In [None]:
import pandas as pd
import numpy as np
import copy
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression  # For Logistic Regression
from sklearn.ensemble import RandomForestClassifier  # For RFC
from sklearn.svm import SVC  # For SVM
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import (accuracy_score, auc, classification_report,
                             confusion_matrix, f1_score, matthews_corrcoef,
                             roc_curve)
from sklearn.model_selection import GridSearchCV
sns.set(style="ticks", color_codes=True)

## Loading the complete data into a pandas DataFrame

In [None]:
df = pd.read_csv("../input/phishing-data/combined_dataset.csv")
df.head()

The columns are described as follows:
- Domain: The URL itself.
- Ranking: The page ranking.
- isIp: Whether the link contains an IP address.
- valid: Fetched from Google's WHOIS API; indicates the current status of the URL's registration.
- activeDuration: Also from the WHOIS API; the time elapsed since registration.
- urlLen: The length of the URL.
- is@: 1 if the link contains an '@' character.
- isredirect: 1 if the link contains multiple consecutive dashes, which suggests it may be a redirect.
- haveDash: Whether the domain name contains any dashes.
- domainLen: The length of the domain name alone.
- noOfSubdomain: The number of subdomains present in the URL.
- Labels: 0 -> legitimate website, 1 -> phishing/spam link

In [None]:
# isnull() is an alias of isna(), so a single call is enough
df.isna().sum()
#df.info()

- The dataset contains no null values.

In [None]:
df.describe()

In [None]:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='label', data=df)

## Preparation of Data

In [None]:
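# Features: every column except the target and the raw URL string
X = df.drop(['label', 'domain'], axis=1)
Y = df.label
# Hold out 40% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.40)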
print("Training set has {} samples.".format(x_train.shape[0]))
print("Testing set has {} samples.".format(x_test.shape[0]))
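
The split above is reshuffled on every run because no seed is set. As a minimal sketch, assuming a small synthetic frame in place of the phishing data (the `demo_*` names, the seed `42` and the 70/30 class ratio are all illustrative assumptions), `random_state` makes the split repeatable and `stratify` keeps the class ratio the same in both parts:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the phishing data: two features and a binary label.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "urlLen": rng.integers(10, 200, size=100),
    "domainLen": rng.integers(3, 40, size=100),
    "label": np.repeat([0, 1], [70, 30]),  # imbalanced on purpose
})

demo_X = demo_df.drop(columns=["label"])
demo_y = demo_df["label"]

# random_state fixes the shuffle; stratify preserves the class ratio in both parts.
dx_train, dx_test, dy_train, dy_test = train_test_split(
    demo_X, demo_y, test_size=0.40, random_state=42, stratify=demo_y
)

print("Training rows:", len(dx_train), "share of label 1:", dy_train.mean())
print("Testing rows:", len(dx_test), "share of label 1:", dy_test.mean())
```

Whether stratification matters for this dataset depends on how balanced the `label` column is, which the count plot above shows.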


# Logistic regression
![title](https://miro.medium.com/max/700/1*dm6ZaX5fuSmuVvM4Ds-vcg.jpeg)
![title](https://miro.medium.com/max/700/1*UgYbimgPXf6XXxMy2yqRLw.png)

In [None]:
def LogReg(x_train, y_train, x_test, y_test):
    LogReg1=LogisticRegression()
    #Train the model using training data 
    LogReg1.fit(x_train,y_train)
    #Test the model using testing data
    y_pred_log1 = LogReg1.predict(x_test)
    cm=confusion_matrix(y_test,y_pred_log1)
    sns.heatmap(cm,annot=True)
    print("f1 score is ",f1_score(y_test,y_pred_log1,average='weighted'))
    print("matthews correlation coefficient is ",matthews_corrcoef(y_test,y_pred_log1))
    print("The accuracy of Logistic Regression on testing data is: ",100.0 *accuracy_score(y_test,y_pred_log1))
    print( classification_report(y_test,y_pred_log1))
    print(cm)

In [None]:
LogReg(x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test)

In [None]:
# Normalizing continuous variables
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range = (0,1))
scaler.fit(x_train)
X_train = scaler.transform(x_train)
X_test = scaler.transform(x_test)

In [None]:
LogReg(x_train=X_train, y_train=y_train, x_test=X_test, y_test=y_test)

- Normalizing the continuous variables increased the accuracy.
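
The scaler above is fitted on the training rows only and then applied to both sets, which keeps test-set statistics out of training. One way to bundle that pattern is a scikit-learn pipeline. The sketch below is illustrative only: it uses synthetic data from `make_classification` rather than the phishing dataset, and the `Xs*`, `ys*` and `scaled_logreg` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the phishing dataset).
Xs, ys = make_classification(n_samples=300, n_features=6, random_state=0)
Xs[:, 0] *= 1000  # put one feature on a much larger scale than the rest
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.40, random_state=0
)

# The pipeline fits the scaler on the training data only, then fits the classifier
# on the scaled result; predict/score reuse the same fitted scaler.
scaled_logreg = make_pipeline(MinMaxScaler(feature_range=(0, 1)), LogisticRegression())
scaled_logreg.fit(Xs_train, ys_train)

print("Test accuracy:", scaled_logreg.score(Xs_test, ys_test))
```

A pipeline like this can also be passed directly to `GridSearchCV`, so the scaler is refitted inside each cross-validation fold.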

### L1 and L2 Regularization Methods
- Ridge regression (L2) adds the squared magnitude of each coefficient as a penalty term to the loss function.
- Lasso regression (Least Absolute Shrinkage and Selection Operator, L1) adds the absolute value of each coefficient's magnitude as a penalty term to the loss function.

##### penalty: Used to specify the norm used in the penalization (for regularization).
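
To make the difference between the two penalties concrete, here is a minimal sketch on synthetic data (not the phishing dataset). The `liblinear` solver and `C=0.05` are illustrative assumptions: `liblinear` is used because it supports both penalties, and a small `C` means strong regularization. L1 tends to drive some coefficients exactly to zero, while L2 only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 20 features, of which only 4 carry signal; the rest are noise.
Xr, yr = make_classification(
    n_samples=400, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

# C is the inverse of the regularization strength.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(Xr, yr)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(Xr, yr)

l1_zeros = int(np.sum(l1_model.coef_ == 0))
l2_zeros = int(np.sum(l2_model.coef_ == 0))
print("Coefficients set exactly to zero with L1:", l1_zeros)
print("Coefficients set exactly to zero with L2:", l2_zeros)
```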

In [None]:
def LogReg22(x_train, y_train, x_test, y_test):
    # penalty='l2' is scikit-learn's default penalty for LogisticRegression
    LogReg2=LogisticRegression(penalty='l2')
    #Train the model using training data 
    LogReg2.fit(x_train,y_train)
    #Test the model using testing data
    y_pred_log2 = LogReg2.predict(x_test)
    cm=confusion_matrix(y_test,y_pred_log2)
    sns.heatmap(cm,annot=True)
    print("f1 score is ",f1_score(y_test,y_pred_log2,average='weighted'))
    print("matthews correlation coefficient is ",matthews_corrcoef(y_test,y_pred_log2))
    print("The accuracy of Logistic Regression on testing data is: ",100.0 *accuracy_score(y_test,y_pred_log2))
    print( classification_report(y_test,y_pred_log2))
    print(cm)

In [None]:
LogReg22(x_train=X_train, y_train=y_train, x_test=X_test, y_test=y_test)

- Overall accuracy decreased, but the confidence interval increased for both classes 0 and 1.

In [None]:
def LogReg33(x_train, y_train, x_test, y_test):
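    # Note: this also uses penalty='l2'; an L1 run would need penalty='l1'
    # with a solver that supports it, e.g. 'liblinear' or 'saga'
    LogReg3=LogisticRegression(penalty='l2')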
    #Train the model using training data 
    LogReg3.fit(x_train,y_train)
    #Test the model using testing data
    y_pred_log3 = LogReg3.predict(x_test)
    cm=confusion_matrix(y_test,y_pred_log3)
    sns.heatmap(cm,annot=True)
    print("f1 score is ",f1_score(y_test,y_pred_log3,average='weighted'))
    print("matthews correlation coefficient is ",matthews_corrcoef(y_test,y_pred_log3))
    print("The accuracy of Logistic Regression on testing data is: ",100.0 *accuracy_score(y_test,y_pred_log3))
    print( classification_report(y_test,y_pred_log3))
    print(cm)

In [None]:
LogReg33(x_train=X_train, y_train=y_train, x_test=X_test, y_test=y_test)

In [None]:
# Note: the multi_class argument is deprecated since scikit-learn 1.5
LogReg1 = LogisticRegression(random_state=0, multi_class='multinomial', solver='newton-cg')

### Grid Search

In [None]:
# Create first pipeline for base without reducing features.

#pipe = Pipeline([('classifier' , LogisticRegression())])
# pipe = Pipeline([('classifier', RandomForestClassifier())])

# Create param grid.

param_grid = [
    {'random_state': list(range(0, 10, 2)),
     'solver': ['newton-cg', 'liblinear']}]

# Create grid search object
logreg = LogisticRegression()

clf = GridSearchCV(logreg, param_grid = param_grid, cv = 5, verbose=True, n_jobs=-1)

# Fit on data

best_clf = clf.fit(X_train, y_train)

In [None]:
best_clf.best_params_
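
Beyond `best_params_`, a fitted `GridSearchCV` object also exposes the score of every candidate through `cv_results_`. The sketch below runs on synthetic data, and its grid is an assumption for illustration: it searches `C` alongside `solver`, whereas the grid above searches `random_state` and `solver`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the phishing dataset).
Xg, yg = make_classification(n_samples=200, n_features=6, random_state=0)

# Hypothetical grid: 2 solvers x 3 values of C = 6 candidates.
demo_grid = {"solver": ["newton-cg", "liblinear"], "C": [0.1, 1.0, 10.0]}
demo_search = GridSearchCV(LogisticRegression(), param_grid=demo_grid, cv=5)
demo_search.fit(Xg, yg)

# One row per candidate, with its mean cross-validated accuracy and rank.
results = pd.DataFrame(demo_search.cv_results_)
columns = ["param_solver", "param_C", "mean_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))

print("Best parameters:", demo_search.best_params_)
print("Best mean CV accuracy:", demo_search.best_score_)
```

Comparing `mean_test_score` across rows shows whether the best candidate is clearly ahead or within noise of the others.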

In [None]:
def LogReg44(x_train, y_train, x_test, y_test):
    LogReg4=LogisticRegression(random_state=0, solver='newton-cg')
    #Train the model using training data 
    LogReg4.fit(x_train,y_train)
    #Test the model using testing data
    y_pred_log4 = LogReg4.predict(x_test)
    cm=confusion_matrix(y_test,y_pred_log4)
    sns.heatmap(cm,annot=True)
    print("f1 score is ",f1_score(y_test,y_pred_log4,average='weighted'))
    print("matthews correlation coefficient is ",matthews_corrcoef(y_test,y_pred_log4))
    print("The accuracy of Logistic Regression on testing data is: ",100.0 *accuracy_score(y_test,y_pred_log4))
    print( classification_report(y_test,y_pred_log4))
    print(cm)

In [None]:
LogReg44(x_train=X_train, y_train=y_train, x_test=X_test, y_test=y_test)

In [None]:
# Grid search for the random forest classifier (RFC)
param_grid = [
    {'n_estimators': list(range(10, 101, 10)),
     'max_features': list(range(2, 10, 1))},
]

# Create grid search object
rfc = RandomForestClassifier()

clf = GridSearchCV(rfc, param_grid = param_grid, cv = 5, verbose=True, n_jobs=-1)

# Fit on data

best_clf = clf.fit(X_train, y_train)

In [None]:
best_clf.best_params_
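
The tuned model does not have to be rebuilt by hand from `best_params_`. With the default `refit=True`, `GridSearchCV` retrains the best candidate on the full training set and exposes it as `best_estimator_`. A minimal sketch on synthetic data follows; the reduced grid and the `Xf*`, `yf*`, `rf_search` and `tuned_rfc` names are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the phishing dataset).
Xf, yf = make_classification(n_samples=200, n_features=6, random_state=0)
Xf_train, Xf_test, yf_train, yf_test = train_test_split(
    Xf, yf, test_size=0.40, random_state=0
)

# Hypothetical, deliberately small grid.
small_grid = {"n_estimators": [10, 50], "max_features": [2, 4]}
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid=small_grid, cv=3
)
rf_search.fit(Xf_train, yf_train)

# best_estimator_ is already refitted on all of Xf_train with the best parameters.
tuned_rfc = rf_search.best_estimator_
print("Chosen parameters:", rf_search.best_params_)
print("Held-out accuracy:", tuned_rfc.score(Xf_test, yf_test))
```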

In [None]:
def RFC(x_train, y_train, x_test, y_test):
    #create RFC object
    RFClass1 = RandomForestClassifier(max_depth=3, n_estimators=100)
    #Train the model using training data 
    RFClass1.fit(x_train,y_train)

    #Test the model using testing data
    y_pred_rfc1 = RFClass1.predict(x_test)

    cm=confusion_matrix(y_test,y_pred_rfc1)
    sns.heatmap(cm,annot=True)
    print("f1 score is ",f1_score(y_test,y_pred_rfc1,average='weighted'))
    print("matthews correlation coefficient is ",matthews_corrcoef(y_test,y_pred_rfc1))
    print("The accuracy of the random forest classifier on testing data is: ",100.0 *accuracy_score(y_test,y_pred_rfc1))
    print( classification_report(y_test,y_pred_rfc1))
    print(cm)

In [None]:
RFC(x_train=X_train, y_train=y_train, x_test=X_test, y_test=y_test)
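
The evaluation functions in this notebook differ only in the estimator they build, so they could be folded into a single helper that accepts any classifier. The sketch below is one possible shape, not the notebook's own code: `evaluate_model` is a hypothetical name, the data is synthetic, and the heatmap is left out so the example runs without a plotting backend.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef)
from sklearn.model_selection import train_test_split


def evaluate_model(model, x_train, y_train, x_test, y_test):
    """Fit `model` on the training data and return the metrics used in this notebook."""
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return {
        "f1_weighted": f1_score(y_test, y_pred, average="weighted"),
        "mcc": matthews_corrcoef(y_test, y_pred),
        "accuracy_pct": 100.0 * accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }


# Synthetic stand-in data (not the phishing dataset).
Xe, ye = make_classification(n_samples=300, n_features=8, random_state=0)
Xe_train, Xe_test, ye_train, ye_test = train_test_split(
    Xe, ye, test_size=0.40, random_state=0
)

candidates = [
    ("logreg", LogisticRegression()),
    ("rfc", RandomForestClassifier(max_depth=3, n_estimators=100, random_state=0)),
]
for name, model in candidates:
    scores = evaluate_model(model, Xe_train, ye_train, Xe_test, ye_test)
    print(name, {k: v for k, v in scores.items() if k != "confusion_matrix"})
    print(scores["confusion_matrix"])
```

Returning the metrics as a dictionary, rather than only printing them, makes it straightforward to collect the results of several models into one comparison table.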