# (4.2) K-Nearest Neighbor implementation

https://towardsdatascience.com/how-to-build-knn-from-scratch-in-python-5e22b8920bd2

In [1]:
import sys
sys.path.append("..")
from Functions.UNSW_DF import *

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from math import *
from decimal import Decimal

In [2]:
train, test = DF_preprocessed_traintest()

Reading Preprocessed CSV Files..
	 Train Shape:  	 (175341, 54)
	 Test Shape:  	 (82332, 54)
Dataset Loaded!


In [3]:
X = train.drop(["label"], axis=1)
y = train["label"]

In [4]:
# Cast every feature column to float so the distance arithmetic sees a single numeric type
X = X.astype(float)
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175341 entries, 0 to 175340
Data columns (total 53 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   service_http         175341 non-null  float64
 1   service_others       175341 non-null  float64
 2   dtcpb                175341 non-null  float64
 3   state_others         175341 non-null  float64
 4   sload                175341 non-null  float64
 5   proto_ospf           175341 non-null  float64
 6   ct_dst_sport_ltm     175341 non-null  float64
 7   synack               175341 non-null  float64
 8   sbytes               175341 non-null  float64
 9   service_ftp-data     175341 non-null  float64
 10  is_sm_ips_ports      175341 non-null  float64
 11  service_-            175341 non-null  float64
 12  ct_src_ltm           175341 non-null  float64
 13  ct_src_dport_ltm     175341 non-null  float64
 14  ct_dst_ltm           175341 non-null  float64
 15  service_ssh      

## Algorithm (Pseudo-code)

In [5]:
# TODO: (1) Define a function to calculate the distance between two points
# TODO: (2) Use the distance function to get the distance between a test point and all known data points
# TODO: (3) Sort distance measurements to find the points closest to the test point (i.e., find the nearest neighbors)
# TODO: (4) Use majority class labels of those closest points to predict the label of the test point
# TODO: (5) Repeat steps 1 through 4 until all test data points are classified

## 1. Define a function to calculate the distance between two points

First, I define a function called minkowski_distance that takes two data points (a and b) and a Minkowski power parameter p, and returns the distance between the two points. The function follows the Minkowski formula I mentioned earlier. Because p is an adjustable parameter, I can choose Manhattan distance (p=1), Euclidean distance (p=2), or a higher-order Minkowski distance.
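
For reference, the standard Minkowski distance of order $p$ between two points $a$ and $b$ can be written as below. The symbols $n$ (number of features), $a_i$, and $b_i$ are notation chosen here for illustration.

$$
d_p(a, b) = \left( \sum_{i=1}^{n} \lvert a_i - b_i \rvert^{p} \right)^{1/p}
$$

With $p = 1$ the sum reduces to the Manhattan distance $\sum_i \lvert a_i - b_i \rvert$, and with $p = 2$ it gives the Euclidean distance.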

In [None]:
def minkowski_distance(a, b, p=1):
    # Accumulate |a_i - b_i|**p over paired coordinates
    distance = 0.0
    for a_i, b_i in zip(a, b):
        distance += abs(float(a_i) - float(b_i)) ** p

    # Take the p-th root to obtain the Minkowski distance
    return distance ** (1 / p)

minkowski_distance(a=X.iloc[0], b=X.iloc[1], p=1)

In [6]:
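def my_p_root(value, root):
    # Exponent 1/root turns the power into a root
    my_root_value = 1 / float(root)
    # Use Decimal arithmetic and round the result to 3 decimal places
    return round(Decimal(value) ** Decimal(my_root_value), 3)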

In [7]:
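def my_minkowski_distance(x, y, p_value):
    # Sum |a - b|**p over paired coordinates, then take the p-th root.
    # zip() stops at the shorter input, so x and y should have the same length.
    return my_p_root(sum(pow(abs(a - b), p_value) for a, b in zip(x, y)), p_value)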
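
As a quick sanity check on the definition, here is a minimal NumPy sketch evaluated on a point pair whose distances are easy to verify by hand. The name `minkowski_np` is hypothetical and not part of the notebook. Unlike the versions above, it rejects inputs of different lengths and does not round the result.

```python
import numpy as np

def minkowski_np(a, b, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("a and b must have the same shape")
    # Sum |a_i - b_i|**p over all coordinates, then take the p-th root
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

manhattan = minkowski_np([0, 0], [3, 4], p=1)  # |3| + |4|
euclidean = minkowski_np([0, 0], [3, 4], p=2)  # sqrt(9 + 16)
print(manhattan, euclidean)
```

The two printed values should match the Manhattan and Euclidean distances computed by hand for the points (0, 0) and (3, 4).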

## 2. Use the distance function to get distance between a test point and all known data points

For step 2, I repeat the distance calculation (using my_minkowski_distance) for every labeled point in X and store the results in a DataFrame.

In [8]:
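# Define an arbitrary test point
# NOTE: test_pt has 4 values while X has 53 columns; zip() inside
# my_minkowski_distance stops at the shorter input, so only the first
# 4 features of each row enter the distance
test_pt = [4.8, 2.7, 2.5, 0.7]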

# Calculate distance between test_pt and all points in X
distances = []

for i in X.index:
    distances.append(my_minkowski_distance(test_pt, X.iloc[i], p_value=1))
    
df_dists = pd.DataFrame(data=distances, index=X.index, columns=['dist'])
df_dists.head()

     dist
0   9.789
1   9.143
2   9.227
3  10.642
4   9.955
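
The loop above calls the distance function once per row, and because zip() stops at the shorter input it does not complain when the test point has fewer values than X has columns. The sketch below is one possible vectorized alternative on toy data, assuming NumPy broadcasting is acceptable here. The function name `distances_to_all` and the toy arrays are hypothetical. It checks the dimensions explicitly and computes all distances in one step.

```python
import numpy as np

def distances_to_all(points, query, p=1):
    """Minkowski distance from `query` to every row of `points`."""
    points = np.asarray(points, dtype=float)
    query = np.asarray(query, dtype=float)
    if points.ndim != 2 or query.shape != (points.shape[1],):
        raise ValueError("query must have one value per column of points")
    # Broadcasting subtracts the query from every row at once
    return np.sum(np.abs(points - query) ** p, axis=1) ** (1.0 / p)

toy_points = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
toy_dists = distances_to_all(toy_points, [0.0, 0.0], p=1)
print(toy_dists)
```

Raising on a shape mismatch is a design choice: it surfaces a test point with the wrong number of features immediately instead of silently comparing only the leading columns.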


## 3. Sort distance measurements to find the points closest to the test point

In step 3, I use the pandas .sort_values() method to sort by distance in ascending order and keep only the first 5 rows, i.e., the 5 nearest neighbors.

In [9]:
# Find the 5 nearest neighbors
df_nn = df_dists.sort_values(by=['dist'], axis=0)[:5]
df_nn

         dist
101086  7.244
128885  7.244
53147   7.244
97314   7.244
70350   7.244
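
Sorting every distance is more work than needed when only the k smallest matter. As a hedged alternative, the sketch below uses `np.argpartition` to pick the k smallest values and then orders just those. The helper name `k_nearest_indices` and the toy distances are hypothetical.

```python
import numpy as np

def k_nearest_indices(dists, k):
    """Positions of the k smallest distances, ordered from nearest to farthest."""
    dists = np.asarray(dists, dtype=float)
    if not 1 <= k <= dists.size:
        raise ValueError("k must be between 1 and the number of distances")
    # argpartition moves the k smallest values into the first k slots (unordered)
    candidates = np.argpartition(dists, k - 1)[:k]
    # Order only those k candidates by distance
    return candidates[np.argsort(dists[candidates], kind="stable")]

toy = [9.8, 7.2, 10.6, 7.9, 9.1]
nearest = k_nearest_indices(toy, k=3)
print(nearest)
```

When several points share the same distance, as in the output above, which of the tied points end up among the k nearest depends on the selection method, so results can differ between this sketch and .sort_values().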


## 4. Use majority class labels of those closest points to predict the label of the test point

For this step, I use collections.Counter to count the labels of the nearest neighbors. I then use the .most_common() method to return the most frequently occurring label. Note: if two or more labels tie for "most common", the one that the Counter() object encountered first is returned.

In [10]:
from collections import Counter

# Create counter object to track the labels
counter = Counter(y[df_nn.index])

# Get most common label of all the nearest neighbors
counter.most_common()[0][0]

1
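
The tie-breaking behavior described above can be checked on a small example. In this sketch the helper name `majority_label` and the label lists are hypothetical. It relies on Counter preserving insertion order (Python 3.7+) and on .most_common() keeping that order among equal counts.

```python
from collections import Counter

def majority_label(neighbor_labels):
    """Most frequent label; ties go to the label that was seen first."""
    return Counter(neighbor_labels).most_common()[0][0]

clear_vote = majority_label([1, 0, 1])    # label 1 wins 2 to 1
tied_vote = majority_label([0, 1, 1, 0])  # 2 to 2 tie, label 0 was seen first
print(clear_vote, tied_vote)
```

Since the neighbors are sorted by distance, the first label seen in a tie belongs to the nearest neighbor. With two classes, choosing an odd k avoids ties altogether.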

## 5. Repeat steps 1 through 4 until all test data points are classified

In this step, I put the code I've already written to work and write a function to classify the data using KNN. First, I perform a train_test_split on the data (75% train, 25% test). Since KNN is distance-based, the features should also be scaled, for example with StandardScaler(), before they are fed into the algorithm. Note that the scaling lines in the cell below are commented out, so this run uses the unscaled features.

Additionally, to avoid data leakage, it is good practice to scale the features after the train_test_split has been performed. First, fit the scaler on the training set only (scaler.fit_transform(X_train)), and then apply the same fitted scaler to the test set (scaler.transform(X_test)). This way, no information from outside the training data is used to create the model.

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data - 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Scale the X data
#scaler = StandardScaler()
#X_train = scaler.fit_transform(X_train)
#X_test = scaler.transform(X_test)
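
To make the leakage point concrete, here is a minimal NumPy sketch of the same idea as StandardScaler: the mean and standard deviation come from the training rows only and are then reused for the test rows. The function names `fit_standardizer` and `apply_standardizer` and the toy arrays are hypothetical. This is an illustration under those assumptions, not a replacement for the scikit-learn class.

```python
import numpy as np

def fit_standardizer(X_train):
    """Per-column mean and standard deviation, computed from training rows only."""
    X_train = np.asarray(X_train, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)  # population standard deviation (ddof=0)
    std[std == 0] = 1.0        # avoid division by zero for constant columns
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale any split with statistics learned from the training split."""
    return (np.asarray(X, dtype=float) - mean) / std

toy_train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
toy_test = np.array([[3.0, 12.0]])

mean, std = fit_standardizer(toy_train)
toy_train_scaled = apply_standardizer(toy_train, mean, std)
toy_test_scaled = apply_standardizer(toy_test, mean, std)
print(toy_test_scaled)
```

The test row is never passed to `fit_standardizer`, so its values cannot influence the scaling parameters.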

In [12]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 131505 entries, 72254 to 128037
Data columns (total 53 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   service_http         131505 non-null  float64
 1   service_others       131505 non-null  float64
 2   dtcpb                131505 non-null  float64
 3   state_others         131505 non-null  float64
 4   sload                131505 non-null  float64
 5   proto_ospf           131505 non-null  float64
 6   ct_dst_sport_ltm     131505 non-null  float64
 7   synack               131505 non-null  float64
 8   sbytes               131505 non-null  float64
 9   service_ftp-data     131505 non-null  float64
 10  is_sm_ips_ports      131505 non-null  float64
 11  service_-            131505 non-null  float64
 12  ct_src_ltm           131505 non-null  float64
 13  ct_src_dport_ltm     131505 non-null  float64
 14  ct_dst_ltm           131505 non-null  float64
 15  service_ssh  

In [19]:
def knn_predict(X_train, X_test, y_train, y_test, k, p):
    
    # Counter to help with label voting
    from collections import Counter
    
    # Iterate over rows as float arrays. Looping over a DataFrame directly
    # yields its column names (strings), which makes the subtraction in the
    # distance function fail with
    # "TypeError: unsupported operand type(s) for -: 'str' and 'str'"
    train_rows = np.asarray(X_train, dtype=float)
    test_rows = np.asarray(X_test, dtype=float)
    
    # Make predictions on the test data
    # Need output of 1 prediction per test data point
    # (y_test is not used for prediction; it only keeps the signature unchanged)
    y_hat_test = []

    for test_point in test_rows:
        distances = []

        for train_point in train_rows:
            distance = my_minkowski_distance(test_point, train_point, p_value=p)
            distances.append(distance)
        
        # Store distances in a dataframe
        df_dists = pd.DataFrame(data=distances, columns=['dist'], 
                                index=y_train.index)
        
        # Sort distances, and only consider the k closest points
        df_nn = df_dists.sort_values(by=['dist'], axis=0)[:k]

        # Create counter object to track the labels of k closest neighbors
        counter = Counter(y_train[df_nn.index])

        # Get most common label of all the nearest neighbors
        prediction = counter.most_common()[0][0]
        
        # Append prediction to output list
        y_hat_test.append(prediction)
        
    return y_hat_test

In [20]:
# Make predictions on test dataset, using only the training split as neighbors
y_hat_test = knn_predict(X_train, X_test, y_train, y_test, k=3, p=1)

print(y_hat_test)
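
The nested Python loops in knn_predict make one full pass over the training rows for every test point, which is slow at this dataset size. As a hedged alternative, the sketch below puts steps 2 through 4 together with NumPy on a toy dataset. The name `knn_predict_np` and all toy values are hypothetical. It operates on arrays rather than DataFrames and does not round distances.

```python
from collections import Counter

import numpy as np

def knn_predict_np(X_train, y_train, X_test, k=3, p=1):
    """Predict one label per test row by majority vote among k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for test_row in X_test:
        # Step 2: distance from this test row to every training row
        dists = np.sum(np.abs(X_train - test_row) ** p, axis=1) ** (1.0 / p)
        # Step 3: positions of the k nearest training rows
        nearest = np.argsort(dists, kind="stable")[:k]
        # Step 4: majority vote over the neighbors' labels
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common()[0][0])
    return predictions

toy_X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_y_train = [0, 0, 0, 1, 1, 1]
toy_X_test = [[0.5, 0.5], [5.5, 5.5]]

toy_pred = knn_predict_np(toy_X_train, toy_y_train, toy_X_test, k=3, p=1)
print(toy_pred)
```

Passing a DataFrame through `np.asarray` yields its rows, so the same function could be called with X_train, y_train, and X_test from the split above. The full distance computation may still take a long time on the complete dataset.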