## (Model 4) K-Nearest Neighbours

k-Nearest Neighbors (KNN) is a versatile machine learning algorithm used for classification and regression. It operates on the idea that similar data points are likely to have similar outputs. In KNN, the output for a query point is determined by the labels of its ‘k’ nearest training points: a majority vote for classification, or an average for regression. The most common way to measure the distance between data points is the Euclidean distance.

The choice of ‘k’ is important: a smaller ‘k’ makes the model sensitive to noise, while a larger ‘k’ smooths the decision boundary and can blur the distinction between classes. Sometimes the neighbors are weighted according to their distance, giving closer neighbors more influence. To classify or predict a given data point, the distance to every training data point is computed, and the ‘k’ smallest distances and their corresponding data points are identified.
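A minimal sketch of the distance weighting mentioned above, assuming inverse-distance weights. The function name `weighted_vote` and the `eps` parameter are illustrative and are not part of this notebook's code, which uses an unweighted majority vote.

```python
def weighted_vote(distances, labels, eps=1e-9):
    """Return the label with the largest total inverse-distance weight."""
    totals = {}
    for d, label in zip(distances, labels):
        # Closer neighbours contribute larger weights; eps avoids division by zero
        totals[label] = totals.get(label, 0.0) + 1.0 / (d + eps)
    return max(totals, key=totals.get)

# One very close neighbour of class 1, two farther neighbours of class 0.
# A plain majority vote would pick 0; the weighted vote picks 1.
print(weighted_vote([0.1, 2.0, 2.5], [1, 0, 0]))  # 1
```

With these weights a single very close neighbour can outvote several distant ones, which is the intended effect of giving closer neighbours more influence.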


Euclidean Distance
$$d(P, Q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$
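As a small worked example of the formula above, the sketch below computes the distance between two made-up points, and then from one point to several rows at once. The arrays `p`, `q`, and `X_train_demo` are illustrative values, not data from this notebook.

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Differences are (3, 4, 0); the squared sum is 9 + 16 + 0 = 25; the root is 5.
d = np.sqrt(np.sum((q - p) ** 2))
print(d)  # 5.0

# Distance from p to every row of a small matrix, using NumPy broadcasting
X_train_demo = np.array([[1.0, 2.0, 3.0],
                         [4.0, 6.0, 3.0],
                         [4.0, 2.0, 3.0]])
dists = np.sqrt(np.sum((X_train_demo - p) ** 2, axis=1))
print(dists)  # [0. 5. 3.]
```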






In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
#This encoding function checks each column of the dataset; if a column holds strings, it maps every unique string to a unique integer code.
def encoding(df1):
  for col in df1.columns:
    if df1[col].dtype=='object':
      convert={} #Dictionary mapping each unique string in this column to its integer code
      flag=1
      for item in df1[col]:
        if item not in convert:
          convert[item]=flag
          flag+=1
      #Replace the entire column of strings with the corresponding integer codes.
      df1[col]=df1[col].map(convert)
  return df1

#This common function finds the most common class among the k nearest neighbours of our testing data point.
def common(labels):
    lc={} #lc is a dictionary that stores each class label and its frequency among the k nearest neighbours.
    for label in labels:
        if label in lc:
            lc[label]+=1
        else:
            lc[label]=1
    #Find the label with the highest count, store it in the mcom variable and return it.
    mcom=None
    max_count=0 #Named max_count so the built-in max() is not shadowed
    for label,count in lc.items():
        if count>max_count:
            max_count=count
            mcom=label
    return mcom

#This predict function classifies every test point with the help of the helper function _predict
def predict(X_train,y_train,X_test,k):
    y_pred=[_predict(X_train,y_train,x,k) for x in X_test]
    return np.array(y_pred) #Here we return an array of predicted labels, one per test data point.

#This is the helper function to the predict function; here we calculate the Euclidean distance from one test data point to every training data point.
def _predict(X_train,y_train,x,k):
    distances=[np.sqrt(np.sum((x_train-x)**2)) for x_train in X_train] #Calculating Euclidean distance
    k1=np.argsort(distances)[:k] #Here we find the indices of the k nearest neighbours
    klabels=[y_train[i] for i in k1] #Here we get the class labels of the k nearest points to our testing data point.
    mcom=common(klabels) #mcom holds the most common class label occurring among the k nearest neighbours
    return mcom

#We use the scaling function so that the values in all columns are normalized to the same [0, 1] range (min-max scaling).
#The KNN model works by finding the Euclidean distance from the testing point to the training points, so without scaling, columns with large ranges would dominate the distance and affect which class the point is assigned to.
def scaling(df1):
    for column in df1.columns:
        df1[column]=(df1[column]-df1[column].min())/(df1[column].max()-df1[column].min())
    return df1
df1 = encoding(df1)
X = df1.drop('Risk', axis=1)
X = scaling(X)
y = df1['Risk']
X_train,X_test,y_train,y_test=train_test_split(X.values,y.values,test_size=0.2,random_state=0)
#We test the value k = 15 in the KNN model. This means the 15 closest neighbours of each testing point will be considered when classifying it.
predictions=predict(X_train,y_train,X_test,k=15)

print(f"Model Accuracy:{accuracy_score(y_test,predictions)*100}%")

Model Accuracy:77.5%
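
The run above uses a single fixed value of k. A rough way to see how the choice of k affects accuracy is to evaluate several values on held-out data. The sketch below is self-contained and uses synthetic two-class data, not the notebook's dataset; `knn_predict`, `X_demo`, `y_demo`, and `scores` are hypothetical names introduced only for this illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k):
    """Unweighted majority-vote KNN with Euclidean distance."""
    preds = []
    for x in X_test:
        dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
        nearest = np.argsort(dists)[:k]  # indices of the k closest training points
        votes = Counter(y_train[nearest].tolist())
        preds.append(votes.most_common(1)[0][0])
    return np.array(preds)

# Synthetic data (an assumption for illustration): two Gaussian blobs in 2D
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                    rng.normal(2.0, 1.0, size=(100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

# Shuffle, then hold out 20% of the points for validation
idx = rng.permutation(len(X_demo))
train, val = idx[:160], idx[160:]

scores = {}
for k in (1, 5, 15, 25):
    pred = knn_predict(X_demo[train], y_demo[train], X_demo[val], k)
    scores[k] = float(np.mean(pred == y_demo[val]))
    print(f"k={k}: validation accuracy {scores[k]:.3f}")
```

On real data, the value of k would ideally be chosen on a validation split or by cross-validation rather than on the test set, so that the reported test accuracy stays an honest estimate.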
