<img src="https://www.nyp.edu.sg/content/dam/nyp/logo.png" width='200'/>

Welcome to the lab! Before we get started, here are a few pointers on Colab notebooks.

1. The notebook is composed of cells; cells can contain code which you can run, or they can hold text and/or images which are there for you to read.

2. You can execute code cells by clicking the ```Run``` icon in the menu, or via the following keyboard shortcuts ```Shift-Enter``` (run and advance) or ```Ctrl-Enter``` (run and stay in the current cell).

3. To interrupt cell execution, click the ```Stop``` button on the toolbar or navigate to the ```Runtime``` menu (```Kernel``` in classic Jupyter) and select the interrupt option.
    

# Malware Prediction using KNN

In this lab, we will be working with a Malware Dataset to train a K-Nearest-Neighbor model.



To start, we need to import the relevant libraries to help us with data manipulation and visualisation.


*   **pandas** is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool. 
*   **matplotlib** is a comprehensive library for creating static, animated, and interactive visualizations in Python.
*   **seaborn** is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

You can import the libraries at the beginning or at a later stage, just before you need them.



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

You will often see \# in the code. It marks a comment, which can be used to explain the code or to temporarily exclude a line from execution.

\# this is a comment

Let's download the malware_dataset.

In [None]:
!wget https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/day1-am/malware_dataset_full.csv

In [None]:
#Load the data
file_name = 'malware_dataset_full.csv' # this is the dataset csv file


After downloading the file, let's use the pandas library to read it.

**pd.read_csv** requires the path of the file and returns a DataFrame.
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types, much like a table in an Excel file.

**df.head** will return the first few records depending on the number indicated in the parentheses.

In [None]:
df = pd.read_csv(file_name)
df.head(5)

Guess what df.tail does?


In [None]:
df.tail(5)

**df.shape** returns a tuple that indicates the number of rows and columns in the dataframe.

In [None]:
# Print the shape (Get the number of rows and cols)
df.shape

**df.columns** returns the column names.

As the list is very long, most of the column names will be hidden in the output. However, you can still use df.columns.to_list() to display all of them.


In [None]:
# Get the column names
df.columns
# df.columns.to_list()


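There are 1000 columns in the dataset. How do we find out how the columns correlate with one another, especially with CLASS?

Let's make use of pandas **df.corr()** to find out.

Correlation is a statistical measure of how strongly two variables are related. Values close to 1 or -1 indicate a strong positive or negative linear relationship, while values close to 0 indicate little or no linear relationship.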

In [None]:
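# Correlation is only defined for numeric columns, so select those first
# (recent pandas versions raise an error on text columns such as NAME)
df.select_dtypes(include="number").corr()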
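
The full correlation table is too large to read when there are many columns. A minimal sketch of how to pull out only the correlation of each column with CLASS is shown below. The toy dataframe and its feature names (`feature_a`, `feature_b`, `feature_c`) are made up for illustration and are not from the malware dataset.

```python
import pandas as pd

# Toy data with hypothetical column names (not the real malware dataset)
toy = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5, 6],
    "feature_b": [6, 5, 4, 3, 2, 1],
    "feature_c": [1, 0, 1, 0, 1, 0],
    "CLASS":     [0, 0, 0, 1, 1, 1],
})

# Keep only the CLASS column of the correlation table,
# and drop the trivial correlation of CLASS with itself
corr_with_class = toy.corr()["CLASS"].drop("CLASS").sort_values()
print(corr_with_class)
```

Here `feature_a` rises with CLASS (positive correlation) and `feature_b` falls with it (negative correlation), while `feature_c` is only weakly related.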

Some datasets may contain duplicate rows, and we can drop them by using drop_duplicates.

By default, inplace=False: the method returns a new dataframe with the duplicates removed and leaves the original dataframe intact.

With inplace=True, the duplicates are dropped directly from the original dataframe and nothing is returned.
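
The difference between the two settings can be sketched on a tiny dataframe. The file names and values below are made up purely for illustration.

```python
import pandas as pd

# Toy data: the first and third rows are identical
toy = pd.DataFrame({"NAME": ["a.exe", "b.exe", "a.exe"], "CLASS": [1, 0, 1]})

# inplace=False (the default): a new dataframe is returned, toy is unchanged
deduped = toy.drop_duplicates()
print(len(toy), len(deduped))   # 3 2

# inplace=True: toy itself is modified and the call returns None
returned = toy.drop_duplicates(inplace=True)
print(returned, len(toy))       # None 2
```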


In [None]:
#Checking for duplicates and removing them
df.drop_duplicates(inplace = True)

# dfnew = df.drop_duplicates(inplace=False)

After dropping the duplicates, let's check whether the number of rows has changed.

In [None]:
#Show the new shape (number of rows & columns)
df.shape

We can use df.isnull() to check whether any data is missing, so that we can preprocess it if necessary.

In [None]:
#Show the number of missing (NAN, NaN, na) data for each column
df.isnull().sum()
# df.isnull().any()
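
As a small illustration of what this check reports, the sketch below counts missing values in a toy dataframe and then drops the incomplete rows. The column names `size` and `entropy` are hypothetical, and dropping rows is only one possible way to handle missing data.

```python
import numpy as np
import pandas as pd

# Toy data with two missing values (hypothetical columns)
toy = pd.DataFrame({
    "size":    [10.0, np.nan, 30.0],
    "entropy": [0.1, 0.2, np.nan],
    "CLASS":   [0, 1, 1],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# One simple option: keep only the rows that have no missing values
rows_without_gaps = toy.dropna()
print(rows_without_gaps)
```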

The dataset contains records that are malware (1) as well as records that are benign (0). Let's take a look at how many records of each type there are.

In [None]:
# list the CLASS and the number of records with it
df["CLASS"].value_counts()

For ease of visualisation, we can also display it in a statistical graphic.

In [None]:
# Pass the column by keyword; recent seaborn versions expect x= explicitly
sns.countplot(x=df["CLASS"])
plt.show()

df.head(5)

As seen in the earlier part of the lesson, KNN needs a set of input features (x) and a label (y). Looking at the dataset, we can see that "NAME" is only an identifier and is not a useful feature.

**Features** are the descriptive attributes, and the **label** is what you're attempting to predict or forecast.

Thus, for x, we will use all the feature columns, which means all columns except "CLASS" and "NAME".
For y, we will use just the label, which is "CLASS".

In [None]:
x = df.drop(['CLASS', 'NAME'], axis=1) #axis = 0 (drop by index), axis = 1 (drop by columns)
x.head()

In [None]:
y=df["CLASS"]
y

For this exercise, we will be using KNN to perform malware prediction. Let's import the classifier and the helper for splitting the data.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

A dataset consists of many records, which we can split 80-20 so that 80% of the data is used for training and the remaining 20% is used to test the accuracy of our model. Passing stratify=y keeps the ratio of malware to benign records the same in both parts.

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y, shuffle=True, test_size=0.2, stratify=y)

In [None]:
y_train.value_counts()

In [None]:
y_test.value_counts()
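
To see what stratification does, here is a minimal sketch on made-up data with a 70/30 class balance. The names `toy_x`, `toy_y` and the `random_state` value are assumptions for this illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 records, 70 benign (0) and 30 malware (1)
toy_x = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series([0] * 70 + [1] * 30, name="CLASS")

# stratify=toy_y keeps the 70/30 ratio in both parts;
# random_state only makes the shuffle repeatable
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)

print(ty_train.value_counts(normalize=True))
print(ty_test.value_counts(normalize=True))
```

Both parts should show the same 0.7 / 0.3 proportions as the full toy dataset.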

Remember that in KNN, K is a hyperparameter that is used to tune our model. In this example there are 2 classes, so typical values of K are 3, 5, 7, etc.

Any idea why 3, 5 and 7 are good choices?
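
One way to think about it: KNN classifies a record by a majority vote among its K nearest neighbours. With 2 classes, an odd K can never produce a tied vote. The sketch below illustrates just the voting step; `majority_vote` is a hypothetical helper written for this explanation, and scikit-learn resolves ties internally rather than returning None.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most common label, or None if the top two labels tie."""
    counts = Counter(neighbor_labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tied vote: no clear winner
    return counts[0][0]

# K = 3: two votes for malware (1), one for benign (0)
print(majority_vote([1, 0, 1]))      # 1

# K = 4: two votes each, so the vote is tied
print(majority_vote([1, 0, 1, 0]))   # None
```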

Next, let's train the model with the training data: x_train, y_train.

In [None]:
model = KNeighborsClassifier(n_neighbors=3)
model.fit(x_train,y_train)

**model.score** returns the mean accuracy on the given test data and labels.

In [None]:
model.score(x_test,y_test)
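
"Mean accuracy" here simply means the fraction of test records whose predicted label matches the actual label. The sketch below checks this on synthetic data from `make_classification`; the toy variable names and parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2-class data (not the malware dataset)
toy_x, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)

toy_model = KNeighborsClassifier(n_neighbors=3).fit(tx_train, ty_train)

# Fraction of correct predictions, computed by hand
manual_accuracy = np.mean(toy_model.predict(tx_test) == ty_test)

# The same number as reported by score()
reported_score = toy_model.score(tx_test, ty_test)

print(manual_accuracy, reported_score)
```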


We can now use the model to classify whether a record is malware or not. Let's send in the 20% test data that we have kept aside.

In [None]:
pred=model.predict(x_test)
pred

To check whether the predictions above are correct, let's compare them with y_test.

In [None]:
y_test

However, the two outputs above are very difficult to compare.

We can use a dataframe to put the data side by side for easy comparison.

In [None]:
result=pd.DataFrame({
    "Actual_Value":y_test,
    "Predict_Value":pred
})
result
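
Once the actual and predicted values sit side by side, a boolean filter can show only the rows where they disagree. This is a small sketch on made-up values; `toy_result` stands in for the `result` dataframe above.

```python
import pandas as pd

# Toy stand-in for the result dataframe (values are made up)
toy_result = pd.DataFrame({
    "Actual_Value":  [1, 0, 1, 1, 0],
    "Predict_Value": [1, 0, 0, 1, 1],
})

# Keep only the rows where the prediction differs from the actual label
mismatches = toy_result[toy_result["Actual_Value"] != toy_result["Predict_Value"]]
print(mismatches)
print(len(mismatches), "of", len(toy_result), "rows misclassified")
```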

There are approximately 180 rows in the test data, and it is too tedious to check the accuracy row by row.

**sklearn.metrics** provides a classification report, a confusion matrix and an accuracy score that we can use to check how accurate this model is when tested on our test data.



## **Classification report** shows the Precision, Recall, F1-Score and Support.

**Precision** is defined as the ratio of true positives to the sum of true and false positives.
Precision = True Positive / (True Positive + False Positive)

**Recall** is defined as the ratio of true positives to the sum of true positives and false negatives.
Recall = True Positive / (True Positive + False Negative)

The **F1 score** is the harmonic mean of precision and recall.
F1 score = 2 x Precision x Recall / (Precision + Recall)

**Support** is the number of actual occurrences of the class in the specified dataset.
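
The formulas above can be checked by hand on a small made-up example and compared with scikit-learn's own functions. The `actual` and `predicted` lists below are invented for illustration, with malware (1) treated as the positive class.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels: 4 malware (1) and 6 benign (0) records
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # malware correctly flagged
fp = sum(a == 0 and p == 1 for a, p in pairs)  # benign wrongly flagged
fn = sum(a == 1 and p == 0 for a, p in pairs)  # malware that was missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.75 0.75 0.75

# scikit-learn gives the same values for the positive class
print(precision_score(actual, predicted),
      recall_score(actual, predicted),
      f1_score(actual, predicted))
```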

In [None]:
#Evaluate the model on the training data set
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score
pred = model.predict(x_train)
print('Classification Report: \n',classification_report(y_train ,pred ))


Next let's look at the confusion matrix and accuracy.

In [None]:
print('Confusion Matrix: \n',confusion_matrix(y_train,pred))
print()
print('Accuracy: ', accuracy_score(y_train,pred))
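
For a 2-class problem, scikit-learn's confusion matrix has the actual classes as rows and the predicted classes as columns, so the four cells are the true negatives, false positives, false negatives and true positives. A minimal sketch on made-up labels shows how to unpack them and how accuracy follows from them.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels: 4 malware (1) and 6 benign (0) records
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

cm = confusion_matrix(actual, predicted)
print(cm)

# Row 0 = actual benign, row 1 = actual malware;
# ravel() flattens the 2x2 matrix row by row
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 5 1 1 3

# Accuracy is the share of correct predictions (the diagonal of the matrix)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy, accuracy_score(actual, predicted))  # 0.8 0.8
```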

To make the confusion matrix easier to understand, we can plot it for easier visualisation.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator (scikit-learn 1.0 and later) replaces it
title = "Confusion matrix for training data"
disp = ConfusionMatrixDisplay.from_estimator(model, x_train, y_train,
                                             cmap=plt.cm.Blues,
                                             values_format='d')
disp.ax_.set_title(title)

print(title)

plt.show()

Last but not least, let's evaluate the model on the testing data.

In [None]:
#Evaluate the model on the testing data
pred = model.predict(x_test)
print(classification_report(y_test ,pred ))
print('Confusion Matrix: \n',confusion_matrix(y_test,pred))
print()
print('Accuracy: ', accuracy_score(y_test,pred))

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator (scikit-learn 1.0 and later) replaces it
title = "Confusion matrix for testing data"
disp = ConfusionMatrixDisplay.from_estimator(model, x_test, y_test,
                                             cmap=plt.cm.Blues,
                                             values_format='d')
disp.ax_.set_title(title)

print(title)
print(disp.confusion_matrix)

plt.show()