<a href="https://colab.research.google.com/github/nyp-sit/aiup/blob/main/day1-am/Lab02b_Phishing_Prediction_Answer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://www.nyp.edu.sg/content/dam/nyp/logo.png" width='200'/>

Welcome to the lab! Before we get started, here are a few pointers on Colab notebooks.

1. The notebook is composed of cells; cells can contain code which you can run, or they can hold text and/or images which are there for you to read.

2. You can execute code cells by clicking the ```Run``` icon in the menu, or with one of these keyboard shortcuts: ```Shift-Enter``` (run and advance) or ```Ctrl-Enter``` (run and stay in the current cell).

3. To interrupt cell execution, click the ```Stop``` button on the running cell, or open the ```Runtime``` menu and select ```Interrupt execution```.
    

# Phishing Prediction Exercise using K-Nearest Neighbour (Answer)
In this lab, we will be working with a Phishing Dataset to train a K-Nearest Neighbour (KNN) classifier.

Some parts require your input, and some blanks marked with **None** are for you to fill in.

This lab is very similar to the Malware Prediction lab, except that it uses a different dataset.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Exercise 1: Read the csv file

In [None]:
!wget https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/aiup/day1-am/phishing_dataset.csv

In [None]:
# Ex1a: Load the data - answer
file_name = 'phishing_dataset.csv'
df = pd.read_csv(file_name, index_col='index')

### Exercise 2: Preview and process the data

In [None]:
df.head(5)

In [None]:
df.tail(5)

In [None]:
# Print the shape (Get the number of rows and cols)
df.shape

In [None]:
# Ex2a: Get the column names -- answer
df.columns

In [None]:
# Ex2b: display the correlation of the dataset -- answer
df.corr()

In [None]:
# Checking for duplicates and removing them
df.drop_duplicates(inplace = True)

In [None]:
# Show the new shape (number of rows & columns)
df.shape

In [None]:
# Show the number of missing (NAN, NaN, na) data for each column
df.isnull().sum()

In [None]:
# List each distinct value of Result and the number of records with that value
df["Result"].value_counts()

In [None]:
# Ex2c: Use a statistical graph to visualise the data above -- answer
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x="Result", data=df)
plt.show()

### Exercise 3: Identify the features and label

In [None]:
# Ex3a: Define x (the features: every column except the label) -- answer
x = df.drop(["Result"], axis=1)  # axis=0 drops rows (by index label), axis=1 drops columns

In [None]:
x

In [None]:
# Ex3b: Define y (the label column) -- answer
y = df["Result"]
y

### Exercise 4: Choose and train the model
We will be using the K-Nearest Neighbours classifier (`KNeighborsClassifier`) for this exercise.
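
To build intuition for what the classifier does, here is a minimal sketch of the KNN idea in plain Python: measure the distance from a new point to every training point, keep the `k` closest, and take a majority vote over their labels. The function name `knn_predict` and the toy data are made up for illustration and are not part of the lab. With its default settings, scikit-learn's `KNeighborsClassifier` also uses Euclidean distance and an unweighted vote, but it is far more efficient and may break ties differently.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up): label 1 clusters near (1, 1), label -1 near (-1, -1)
toy_points = [(1, 1), (1, 0.8), (0.9, 1.1), (-1, -1), (-0.8, -1), (-1.1, -0.9)]
toy_labels = [1, 1, 1, -1, -1, -1]

print(knn_predict(toy_points, toy_labels, query=(0.7, 0.9), k=3))
print(knn_predict(toy_points, toy_labels, query=(-0.9, -0.7), k=3))
```

Each query lands in the cluster it sits closest to, so the first call votes for label 1 and the second for label -1.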

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [None]:
# Ex4a: split the data (80% train, 20% test, keeping the class ratio of y in both) -- answer
x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=True, test_size=0.2, stratify=y)
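
The `stratify=y` argument asks `train_test_split` to keep the class proportions of `y` the same in the training and testing sets. A small sketch on made-up labels shows the effect; `toy_x`, `toy_y` and `random_state=0` are assumptions for illustration only and do not come from the phishing dataset.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 records of one class, 20 of the other
toy_y = [1] * 80 + [-1] * 20
toy_x = [[i] for i in range(len(toy_y))]

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.2, shuffle=True, stratify=toy_y, random_state=0
)

# With stratify, both splits keep the original 80/20 class ratio
print(Counter(y_tr))
print(Counter(y_te))
```

Without `stratify`, a random split of an imbalanced dataset can leave the smaller class under-represented in the test set, which makes the test score less reliable.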

In [None]:
# Ex4b: train the model -- answer
model=KNeighborsClassifier(n_neighbors=5)
model.fit(x_train,y_train)

In [None]:
pred=model.predict(x_test)
pred

In [None]:
# Ex4c: display the model score -- answer
model.score(x_test,y_test)

In [None]:
result=pd.DataFrame({
    "Actual_Value":y_test,
    "Predict_Value":pred
})
result

### Exercise 5: Evaluate the model and display the reports




In [None]:
# Ex5a: Evaluate the model using the training data -- answer
pred = model.predict(x_train)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score

In [None]:
# Ex5b: Display classification report -- answer
print('Classification Report: \n',classification_report(y_train ,pred ))

In [None]:
# Ex5c: Display Confusion Matrix and accuracy -- answer
print('Confusion Matrix: \n',confusion_matrix(y_train,pred))
print()
print('Accuracy: ', accuracy_score(y_train,pred))
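
The numbers in the classification report can be derived by hand from the confusion matrix. The sketch below uses made-up counts (not results from this lab) and a hypothetical helper `metrics_from_confusion`. It assumes a binary problem and the layout scikit-learn uses for `confusion_matrix`: rows are actual classes, columns are predicted classes, so the matrix reads `[[tn, fp], [fn, tp]]`.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive accuracy, precision and recall for the positive class from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
    precision = tp / (tp + fp)                  # of the predicted positives, how many are right
    recall = tp / (tp + fn)                     # of the actual positives, how many were found
    return accuracy, precision, recall

# Made-up counts, laid out as [[tn, fp], [fn, tp]]
accuracy, precision, recall = metrics_from_confusion(tn=90, fp=10, fn=5, tp=95)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The helper does not guard against division by zero, which would occur if the model never predicted the positive class or if the data had no positive records.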

In [None]:
# Ex5c (optional): Plot the Confusion Matrix for easy visualisation -- answer
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
title = "Confusion matrix for training data"
disp = ConfusionMatrixDisplay.from_estimator(model, x_train, y_train,
                                             display_labels=None,
                                             cmap=plt.cm.Blues,
                                             values_format='d')
disp.ax_.set_title(title)

print(title)

plt.show()

In [None]:
# Ex5d: repeat Ex5a-5c on testing data -- answer
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
pred = model.predict(x_test)
print('Classification Report: \n', classification_report(y_test, pred))
print('Confusion Matrix: \n', confusion_matrix(y_test, pred))
print()
print('Accuracy: ', accuracy_score(y_test, pred))
print()
title = "Confusion matrix for testing data"
disp = ConfusionMatrixDisplay.from_estimator(model, x_test, y_test,
                                             display_labels=None,
                                             cmap=plt.cm.Blues,
                                             values_format='d')
disp.ax_.set_title(title)

print(title)

plt.show()