
You can write the whole KNN classification algorithm in the following simple steps:
1. Calculate the distance from the given data point to every training data point.
2. Sort the distances in ascending order.
3. Take the top k data points. Steps 2 and 3 together ensure that we take the k nearest neighbours of the given data point.
4. Perform a majority vote on the class labels of these k data points. The majority label is the predicted class label.

We will write 3 functions to implement the above steps, and a final function to put all these together.



In [1]:
import numpy as np
import pandas as pd
from sklearn import datasets
import scipy.stats as st

from sklearn.model_selection import train_test_split

In [2]:
iris = datasets.load_iris()
list(iris.keys())

['data',
 'target',
 'frame',
 'target_names',
 'DESCR',
 'feature_names',
 'filename',
 'data_module']

In [3]:
# get data
X = iris['data']
y = iris['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(100, 4) (50, 4) (100,) (50,)


In [4]:
#Step 1: let's create a function that given a test vector calculates distances with all training data points

def calc_euclidean_distances(given_vector, train_data):
  # ord = 2 is the Euclidean (L2) norm; ord = 1 would compute the Manhattan (L1) distance instead
  list_distances = [np.linalg.norm((train_data[i] - given_vector), ord = 2) \
                    for i in range(len(train_data))]
  return list_distances
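
The per-row loop in Step 1 can also be written with NumPy broadcasting. The following is a minimal sketch, assuming the Euclidean (L2) distance is intended; the function name calc_euclidean_distances_vectorized and the toy arrays are illustrative and not part of the lecture code.

```python
import numpy as np

def calc_euclidean_distances_vectorized(given_vector, train_data):
    # Hypothetical helper: subtract the query from every row at once (broadcasting),
    # then take the L2 norm of each row.
    diffs = np.asarray(train_data) - np.asarray(given_vector)
    return np.linalg.norm(diffs, axis=1)

# Toy data (made up for illustration): three 2-D training points and one query point
train = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
query = np.array([0.0, 0.0])

dists = calc_euclidean_distances_vectorized(query, train)
print(dists)
```

This returns a NumPy array rather than a list, so the result can be passed to the sorting step unchanged.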

In [5]:
# Step 2 and 3: now we will create a function which, given a list of distances,
# sorts it and returns top k distances and their original indices

def calc_top_k(list_distances, top_k):

  zipped_sorted = sorted(zip(list_distances, range(len(list_distances)))) # by default, sorts in the ascending order
  topk_distances = [element[0] for element in zipped_sorted[0:top_k]]
  topk_dist_indices = [element[1] for element in zipped_sorted[0:top_k]]

  return topk_distances, topk_dist_indices
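
Steps 2 and 3 can alternatively be done with np.argsort, which returns the indices that would sort an array. This is a sketch under the assumption that the distances fit in a NumPy array; calc_top_k_argsort is a hypothetical name used only here.

```python
import numpy as np

def calc_top_k_argsort(list_distances, top_k):
    # Hypothetical alternative to calc_top_k: argsort gives the indices in ascending
    # order of distance; a stable sort keeps the original order among equal distances.
    distances = np.asarray(list_distances)
    order = np.argsort(distances, kind="stable")[:top_k]
    return distances[order].tolist(), order.tolist()

# Toy distances (made up for illustration)
topk_distances, topk_dist_indices = calc_top_k_argsort([0.5, 0.1, 0.9, 0.3], top_k=2)
print(topk_distances, topk_dist_indices)
```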

In [6]:
# Step 4: now we will get the class labels of these top k neighbours from the training data and take the majority vote

def get_class_label(topk_dist_indices, train_labels):
  return st.mode(train_labels[topk_dist_indices])[0]
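
Votes can tie, for example with an even k or with three classes, and how a tie is broken is a design choice. Below is a minimal sketch of a majority vote using only the standard library, assuming the neighbour labels are passed nearest first; majority_vote is a hypothetical name and the label lists are made up.

```python
from collections import Counter

def majority_vote(neighbour_labels):
    # Hypothetical helper: count each label and return the most frequent one.
    # Counter orders equal counts by first appearance, so with labels sorted
    # nearest first a tie goes to the label of the nearer neighbour.
    return Counter(neighbour_labels).most_common(1)[0][0]

clear_winner = majority_vote([2, 1, 2, 0, 2])   # label 2 has three of five votes
tie_winner = majority_vote([1, 0, 0, 1])        # 1 and 0 tie; 1 was seen first
print(clear_winner, tie_winner)
```

By contrast, scipy.stats.mode returns the smallest of the tied values, so the two approaches can disagree when a tie occurs.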

In [7]:
# wrap it up in a final function
# provide train_data, train_labels, top_k and a test vector
# it returns the class the test vector is predicted to belong to

def predict_class_label(given_vector, train_data, train_labels, top_k):

  # step 1: get all the distances of the training data points wrt the given test vector
  list_dist = calc_euclidean_distances(given_vector = given_vector,
                                       train_data = train_data)

  # steps 2 and 3: sort and get the top k data points closest to the given vector
  topk_list_dist, topk_dist_ind = calc_top_k(list_distances = list_dist, top_k = top_k)

  # step 4: use a majority vote over the classes of the top k data points closest to the given test vector
  predicted_class = get_class_label(topk_dist_indices = topk_dist_ind,
                                    train_labels = train_labels)

  return predicted_class

In [8]:
# test it out
print("test indices to choose from:", range(len(y_test)))
test_index = int(input("given test vector index:"))
predicted_class = predict_class_label(given_vector = X_test[test_index],
                                      train_data = X_train,
                                      train_labels = y_train,
                                      top_k = 5)
print(f"original class: {y_test[test_index]} \n predicted class: {predicted_class}")

test indices to choose from: range(0, 50)
given test vector index:21
original class: 2 
 predicted class: 2


In [11]:
y_pred = [predict_class_label(given_vector = X_test[i],
                                      train_data = X_train,
                                      train_labels = y_train,
                                      top_k = 5)
           for i in range(len(X_test))]

In [17]:
from sklearn.metrics import precision_score, recall_score, f1_score

print(precision_score(y_test, y_pred, average = "weighted"),
      recall_score(y_test, y_pred, average = "weighted"),
      f1_score(y_test, y_pred, average = "weighted"))

0.98125 0.98 0.98
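
As a sanity check, the hand-written classifier can be compared against scikit-learn's KNeighborsClassifier on the same split. This is a sketch, assuming the same test_size and random_state as above; the names knn, sk_pred and sk_accuracy are illustrative. The default metric of KNeighborsClassifier is Minkowski with p = 2, which is the Euclidean distance.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Recreate the same train/test split used in this notebook
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], test_size=0.33, random_state=42
)

# k = 5 neighbours, default Euclidean distance, uniform (unweighted) voting
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

sk_pred = knn.predict(X_test)
sk_accuracy = accuracy_score(y_test, sk_pred)
print(sk_accuracy)
```

Small differences from the hand-written version are possible, since the two implementations may break ties differently.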


Q: What does average = "weighted" mean here? Why is it needed?
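
As a hint, the sketch below shows one way to look at it: with average = "weighted", scikit-learn computes the score separately for each class and then averages those scores, weighting each class by its support (the number of true samples of that class). The label arrays are made up for illustration and are not the iris predictions.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up labels for a 3-class problem with unequal class sizes
y_true = np.array([0, 0, 0, 0, 1, 1, 2])
y_hat = np.array([0, 0, 0, 1, 1, 1, 2])

# average=None returns one precision value per class
per_class = precision_score(y_true, y_hat, average=None)

# Support: how many true samples each class has
support = np.bincount(y_true)

# Support-weighted mean of the per-class values
manual_weighted = np.average(per_class, weights=support)
sklearn_weighted = precision_score(y_true, y_hat, average="weighted")
print(per_class, support, manual_weighted, sklearn_weighted)
```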

#### Assignment: Write a function to calculate the confusion matrix, precision, recall, and f1-score of the predictions made by predict_class_label() on the test dataset