# Objective: Based on the given features, classify a website link as legitimate or phishing

# Importing Required Libraries

- **Pandas**: For data processing and CSV file I/O (e.g. pd.read_csv)
- **Numpy**: For linear algebra
- **Matplotlib** and **Seaborn**: For data visualization
- **sklearn.model_selection**: For splitting data into train and test sets
- **sklearn.linear_model.LogisticRegression**: For Logistic Regression
- **sklearn.ensemble.RandomForestClassifier**: For the Random Forest classifier
- **sklearn.svm.SVC**: For the support vector machine
- **sklearn.metrics**: Evaluation metrics

In [None]:
import pandas as pd
import numpy as np
import copy
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression  # For Logistic Regression
from sklearn.ensemble import RandomForestClassifier  # For RFC
from sklearn.svm import SVC  # For SVM
from sklearn.metrics import (accuracy_score, auc, classification_report,
                             confusion_matrix, f1_score, matthews_corrcoef,
                             roc_curve)
sns.set(style="ticks", color_codes=True)

## Loading the complete data into a pandas DataFrame

In [None]:
df = pd.read_csv("../input/phishing-data/combined_dataset.csv")

In [None]:
df.head()

The features are described as follows:
- domain: The URL itself.
- Ranking: The page ranking.
- isIp: Whether the link contains an IP address.
- valid: Fetched from Google's whois API; indicates the current status of the URL's registration.
- activeDuration: Also from the whois API; the time elapsed since registration.
- urlLen: The length of the URL.
- is@: 1 if the link contains an '@' character.
- isredirect: 1 if the link contains multiple consecutive dashes, which may indicate a redirect.
- haveDash: Whether the domain name contains any dashes.
- domainLen: The length of the domain name alone.
- noOfSubdomain: The number of subdomains present in the URL.
- label: 0 -> legitimate website, 1 -> phishing/spam link.
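
To make the lexical features above more concrete, here is a minimal sketch of how a few of them could be derived from a raw URL. The function `extract_url_features` and its exact rules (in particular the subdomain count) are assumptions for illustration; the dataset's own extraction logic is not shown in this notebook and may differ.

```python
import re
from urllib.parse import urlparse

def extract_url_features(url):
    """Derive a few lexical features from a URL string (illustrative only)."""
    host = urlparse(url).hostname or ""
    # Treat a host made of four dot-separated number groups as an IP address.
    is_ip = re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None
    return {
        "isIp": int(is_ip),
        "urlLen": len(url),
        "is@": int("@" in url),
        "isredirect": int("--" in url),   # multiple dashes together
        "haveDash": int("-" in host),
        "domainLen": len(host),
        # Rough assumption: every label left of "name.tld" counts as a subdomain.
        "noOfSubdomain": 0 if is_ip else max(host.count(".") - 1, 0),
    }

features = extract_url_features("http://login.secure-pay.example.com/a--b")
print(features)
```

The whois-based features (valid, activeDuration) and Ranking need external lookups, so they are left out of this sketch.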

# EDA

In [None]:
# isna() is an alias of isnull(), so one call is enough
df.isnull().sum()

- There are no null values.

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
sns.countplot(x='label', data=df)

- The target classes are distributed in an approximately 4:6 ratio.
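
The ratio read off the count plot can also be checked numerically. The sketch below uses a small made-up label column as a stand-in for `df['label']`; which class holds which share in the toy data is arbitrary and not taken from the real dataset.

```python
import pandas as pd

# Toy stand-in for df['label']; in the notebook, use the loaded DataFrame instead.
labels = pd.Series([0, 0, 0, 0, 1, 1, 1, 1, 1, 1], name="label")

# normalize=True returns proportions instead of raw counts.
class_share = labels.value_counts(normalize=True).sort_index()
print(class_share)
```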

### Correlation matrix

In [None]:
# numeric_only=True skips the text column `domain` (required on pandas 2.x)
df.corr(numeric_only=True)['label'].sort_values()

### Heat Map

In [None]:
plt.figure(figsize=(8, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True)

- activeDuration and valid have very low correlation with the target, so they could be ignored.
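
If one wanted to act on this observation, features could be filtered by their absolute correlation with the target. The sketch below does this on a toy frame; the column names `strong` and `weak` and the cut-off of 0.1 are hypothetical, and Pearson correlation only captures linear relationships, so a low value does not prove a feature is useless.

```python
import pandas as pd

toy = pd.DataFrame({
    "label":  [0, 1, 0, 1, 0, 1],
    "strong": [0, 1, 0, 1, 0, 1],   # perfectly correlated with the label
    "weak":   [1, 1, 2, 2, 3, 3],   # uncorrelated with the label
})

threshold = 0.1  # hypothetical cut-off, not taken from the notebook
corr_with_label = toy.corr()["label"].drop("label").abs()
low_corr = corr_with_label[corr_with_label < threshold].index.tolist()

reduced = toy.drop(columns=low_corr)
print(low_corr)
```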

In [None]:
#sns.pairplot(df)

## Preparation of Data

### Feature Selection
- All features are taken into account; only the `domain` text column and the target `label` are dropped from the inputs.

In [None]:
X = df.drop(['label', 'domain'], axis=1)
Y = df.label

- Split the data into training and testing sets: 60% train, 40% test

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.40)
print("Training set has {} samples.".format(x_train.shape[0]))
print("Testing set has {} samples.".format(x_test.shape[0]))
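
Because the split above has no fixed seed, the sample assignment (and every score that follows) changes from run to run. A minimal sketch of a reproducible, class-balanced variant is shown here on toy arrays; `random_state=42` and the toy data are assumptions, not values from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with the same roughly 4:6 class ratio as the target.
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.array([0] * 20 + [1] * 30)

# stratify keeps the class ratio equal in both parts; random_state fixes the shuffle.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.40, stratify=y_toy, random_state=42
)
print(len(x_tr), len(x_te), y_tr.mean(), y_te.mean())
```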

## Model Training

## 1.1 Logistic Regression
  - The target has only two values (0 or 1), so logistic regression is a natural first model.

In [None]:
# Create the logistic regression object
LogReg = LogisticRegression()

# Train the model on the training data
LogReg.fit(x_train, y_train)

# Predict on the testing data
y_pred = LogReg.predict(x_test)
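
Logistic regression can converge slowly or poorly when features sit on very different scales, which is plausible here for columns like a page ranking next to 0/1 flags. The sketch below standardizes features inside a pipeline before fitting. The synthetic data from `make_classification`, the inflated first column, and `max_iter=1000` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the first column is blown up to mimic a large-scale feature.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_demo[:, 0] *= 1e6

xtr, xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)

# The scaler is fitted on the training part only, then applied to the test part.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(xtr, ytr)
demo_pred = scaled_logreg.predict(xte)
print(scaled_logreg.score(xte, yte))
```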

##### Model Testing
- Matthews correlation coefficient (`matthews_corrcoef`)
- Accuracy (`accuracy_score`)
- F1 score (`f1_score`)
- Confusion matrix (`confusion_matrix`)

In [None]:
cm = confusion_matrix(y_test, y_pred)
# fmt='d' shows the counts as plain integers instead of scientific notation
sns.heatmap(cm, annot=True, fmt='d')

In [None]:
print("The weighted F1 score is ", f1_score(y_test, y_pred, average='weighted'))
print("The Matthews correlation coefficient is ", matthews_corrcoef(y_test, y_pred))
print("The accuracy of Logistic Regression on the testing data is: ", 100.0 * accuracy_score(y_test, y_pred))
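
To see how these scores relate to the confusion matrix, the sketch below recomputes F1 and MCC by hand from the four cell counts of a small made-up prediction and compares them with scikit-learn. Note that it uses the binary F1 for class 1, whereas the cell above reports the weighted average over both classes. The labels are invented for illustration.

```python
import math
from sklearn.metrics import confusion_matrix, f1_score, matthews_corrcoef

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat  = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# F1 = 2TP / (2TP + FP + FN)
f1_manual = 2 * tp / (2 * tp + fp + fn)

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
mcc_manual = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f1_manual, f1_score(y_true, y_hat))
print(mcc_manual, matthews_corrcoef(y_true, y_hat))
```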

## 1.2 Logistic Regression
  - With parameter adjustment

In [None]:
# Note: `multi_class` is deprecated since scikit-learn 1.5; omit it on newer versions
LogReg1 = LogisticRegression(random_state=0, multi_class='multinomial', solver='newton-cg')
# Train the model on the training data
LogReg1.fit(x_train, y_train)

# Predict on the testing data
y_pred_log = LogReg1.predict(x_test)

cm = confusion_matrix(y_test, y_pred_log)
sns.heatmap(cm, annot=True, fmt='d')
print("The weighted F1 score is ", f1_score(y_test, y_pred_log, average='weighted'))
print("The Matthews correlation coefficient is ", matthews_corrcoef(y_test, y_pred_log))
print("The accuracy of Logistic Regression on the testing data is: ", 100.0 * accuracy_score(y_test, y_pred_log))

In [None]:
# Use positive-class probabilities so the ROC curve covers many thresholds
y_score_log = LogReg1.predict_proba(x_test)[:, 1]
fpr, tpr, thresh = roc_curve(y_test, y_score_log)
roc_auc = auc(fpr, tpr)  # area under the ROC curve

# Plot ROC curve for Logistic Regression
plt.plot(fpr, tpr, 'orange', label='Logistic Regression (AUC = {:.3f})'.format(roc_auc))
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend(loc='lower right')

## 2: Random Forest Classifier

In [None]:
#create RFC object
RFClass = RandomForestClassifier()
#Train the model using training data 
RFClass.fit(x_train,y_train)

#Test the model using testing data
y_pred_rfc = RFClass.predict(x_test)

cm = confusion_matrix(y_test, y_pred_rfc)
sns.heatmap(cm, annot=True, fmt='d')
print("The weighted F1 score is ", f1_score(y_test, y_pred_rfc, average='weighted'))
print("The Matthews correlation coefficient is ", matthews_corrcoef(y_test, y_pred_rfc))
print("The accuracy of the Random Forest classifier on the testing data is: ", 100.0 * accuracy_score(y_test, y_pred_rfc))

In [None]:
# Use positive-class probabilities so the ROC curve covers many thresholds
y_score_rfc = RFClass.predict_proba(x_test)[:, 1]
fpr, tpr, thresh = roc_curve(y_test, y_score_rfc)
roc_auc = auc(fpr, tpr)  # area under the ROC curve

# Plot ROC curve for the Random Forest classifier
plt.plot(fpr, tpr, 'orange', label='Random Forest (AUC = {:.3f})'.format(roc_auc))
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend(loc='lower right')

## 3: SVM

In [None]:
#create SVM object

svc = SVC()

svc.fit(x_train,y_train)
y_pred_svc = svc.predict(x_test)

cm = confusion_matrix(y_test, y_pred_svc)
sns.heatmap(cm, annot=True, fmt='d')
print("The weighted F1 score is ", f1_score(y_test, y_pred_svc, average='weighted'))
print("The Matthews correlation coefficient is ", matthews_corrcoef(y_test, y_pred_svc))
print("The accuracy of SVC on the testing data is: ", 100.0 * accuracy_score(y_test, y_pred_svc))

In [None]:
# SVC has no probabilities by default; decision_function gives a continuous score
y_score_svc = svc.decision_function(x_test)
fpr, tpr, thresh = roc_curve(y_test, y_score_svc)
roc_auc = auc(fpr, tpr)  # area under the ROC curve

# Plot ROC curve for SVC
plt.plot(fpr, tpr, 'orange', label='SVM (AUC = {:.3f})'.format(roc_auc))
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend(loc='lower right')

In [None]:
print("The accuracy of Logistic Regression on the testing data is: ", 100.0 * accuracy_score(y_test, y_pred_log))
print("The accuracy of the Random Forest classifier on the testing data is: ", 100.0 * accuracy_score(y_test, y_pred_rfc))
print("The accuracy of SVC on the testing data is: ", 100.0 * accuracy_score(y_test, y_pred_svc))
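
The per-model metric cells above repeat the same three calls. One possible way to compare the models side by side is to collect the scores in a single table, as sketched below. The helper `summarize`, the synthetic data, and the fixed seeds are assumptions for illustration; in the notebook the existing `x_train`, `x_test`, `y_train`, `y_test` would be used instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def summarize(name, y_true, y_hat):
    """Return the three scores used in this notebook as one table row."""
    return {
        "model": name,
        "accuracy": accuracy_score(y_true, y_hat),
        "f1_weighted": f1_score(y_true, y_hat, average="weighted"),
        "mcc": matthews_corrcoef(y_true, y_hat),
    }

# Synthetic stand-in data so the sketch runs on its own.
X_s, y_s = make_classification(n_samples=300, n_features=6, random_state=1)
xtr, xte, ytr, yte = train_test_split(X_s, y_s, test_size=0.4, random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}
rows = [summarize(name, yte, m.fit(xtr, ytr).predict(xte)) for name, m in models.items()]
summary = pd.DataFrame(rows).set_index("model")
print(summary)
```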