# Introduction to k-Nearest Neighbors

The k-nearest neighbors (KNN) algorithm is a supervised machine learning algorithm. KNN is extremely easy to implement in its most basic form, yet it can perform quite complex classification tasks. It is a lazy learning algorithm: it has no dedicated training phase. Rather, it uses all of the training data at prediction time, when classifying a new data point or instance.

KNN can be used for both classification and regression problems. However, in industry it is more widely used for classification. To evaluate any technique, we generally look at three important aspects:

1. Ease of interpreting the output
2. Calculation time
3. Predictive power

We can implement a KNN model by following the steps below:

1. Load the data.
2. Initialise the value of k.
3. To get the predicted class, iterate over all training data points:
   1. Calculate the distance between the test data and each row of training data. Here we use Euclidean distance as our distance metric since it is the most popular choice. Other metrics that can be used include Chebyshev and cosine distance.
   2. Sort the calculated distances in ascending order.
   3. Take the top k rows from the sorted array.
   4. Find the most frequent class among these rows.
   5. Return the predicted class.
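The three distance metrics named in step 3 can be compared side by side. The following is a minimal sketch in plain Python; the function names and sample points are illustrative assumptions and are not part of the implementation used in the rest of this tutorial.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chebyshev_distance(a, b):
    """Largest absolute difference along any single coordinate."""
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus the cosine of the angle between the two vectors.

    Undefined when either vector is all zeros (division by zero).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Illustrative points, not taken from the Iris data
p = [0.0, 0.0]
q = [3.0, 4.0]
print(euclidean_distance(p, q))                  # 5.0
print(chebyshev_distance(p, q))                  # 4.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # 1.0
```

Euclidean and Chebyshev distance depend on how far apart the points are, while cosine distance depends only on the direction of the vectors, so the choice of metric can change which rows count as nearest.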

In [1]:
import pandas as pd
import numpy as np
import math
import operator

In [17]:
#### Start of STEP 1
# Importing data 
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
data = pd.read_csv(url, names=names) 
data.head() 

,sepal-length,sepal-width,petal-length,petal-width,Class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [7]:
# Defining a function which calculates euclidean distance between two data points
def euclideanDistance(data1, data2, length):
    distance = 0
    for x in range(length):
        # data2 is a row with string labels, so access its x-th value by position
        distance += np.square(data1[x] - data2.iloc[x])
    return np.sqrt(distance)

# Defining our KNN model
def knn(trainingSet, testInstance, k):
 
    distances = {}
 
    length = testInstance.shape[1]
    
    #STEP 3
    # Calculating euclidean distance between each row of training data and test data
    for x in range(len(trainingSet)):
        
        #STEP 3.1
        dist = euclideanDistance(testInstance, trainingSet.iloc[x], length)

        distances[x] = dist[0]
        
 
    #STEP 3.2
    #Sorting them on the basis of distance
    sorted_d = sorted(distances.items(), key=operator.itemgetter(1))
    
 
    neighbors = []
    
    #STEP 3.3
    #Extracting top k neighbors
    for x in range(k):
        neighbors.append(sorted_d[x][0])
    
    classVotes = {}
    
    #STEP 3.4
    # Calculating the most freq class in the neighbors
    for x in range(len(neighbors)):
        # The class label is the last value in the row; select it by position
        response = trainingSet.iloc[neighbors[x]].iloc[-1]
 
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    

    #STEP 3.5
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0], neighbors
    


# Creating a dummy testset
testSet = [[7.2, 3.6, 5.1, 2.5]]
test = pd.DataFrame(testSet)
#Start of STEP 2
# Setting number of neighbors = 1
k = 1

# Running KNN model
result,neigh = knn(data, test, k)

# Predicted class
print(result)


Iris-virginica


In [8]:
# Nearest neighbor
print(neigh)

[141]


Now we will alter the value of k and see how the prediction changes.

In [9]:
# Setting number of neighbors = 3 
k = 3 
# Running KNN model 
result,neigh = knn(data, test, k) 
# Predicted class 
print(result)

Iris-virginica


In [10]:
# 3 nearest neighbors
print(neigh)

[141, 139, 120]


In [11]:
# Setting number of neighbors = 5
k = 5
# Running KNN model 
result,neigh = knn(data, test, k) 
# Predicted class 
print(result) 

Iris-virginica


In [12]:
# 5 nearest neighbors
print(neigh)

[141, 139, 120, 145, 144]
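For this test point the predicted class is the same for k = 1, 3 and 5, but that is not guaranteed in general. The sketch below uses a small, made-up one-dimensional dataset (the points, labels and the `knn_predict` helper are all hypothetical and independent of the `knn` function above) to show the vote flipping as k grows.

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k):
    """Majority vote among the k training points closest to query."""
    # Indices of the training points, ordered by Euclidean distance to the query
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: one "A" point sits closest to the query,
# the next two points are "B", and two more "A" points lie far away
points = [[1.0], [1.2], [1.4], [3.0], [3.1]]
labels = ["A", "B", "B", "A", "A"]
query = [0.9]

for k in (1, 3, 5):
    print(k, knn_predict(points, labels, query, k))
# 1 A
# 3 B
# 5 A
```

With k = 1 the single closest point decides the class, with k = 3 the two "B" points outvote it, and with k = 5 the whole dataset votes, so the overall majority class wins.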


## Comparing our model with scikit-learn

In [15]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(data.iloc[:,0:4], data['Class'])

# Predicted class
print(neigh.predict(test))


['Iris-virginica']


In [16]:
# 3 nearest neighbors
print(neigh.kneighbors(test)[1])

[[141 139 120]]


We can see that both models predicted the same class (‘Iris-virginica’) and found the same nearest neighbors ([141, 139, 120]). Hence we can conclude that our model runs as expected.

The KNN algorithm is one of the simplest classification algorithms. Despite its simplicity, it can give highly competitive results. KNN can also be used for regression problems. The only difference from the methodology discussed above is that the prediction is the average of the nearest neighbors' target values rather than a majority vote among them.
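The regression variant can be sketched in a few lines. This is a minimal illustration on made-up data, assuming Euclidean distance and an unweighted mean; the `knn_regress` name and the sample values are hypothetical.

```python
import math

def knn_regress(points, targets, query, k):
    """Mean target value of the k training points closest to query."""
    # Indices of the training points, ordered by Euclidean distance to the query
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    nearest = order[:k]
    # Average the neighbors' targets instead of taking a majority vote
    return sum(targets[i] for i in nearest) / k

# Hypothetical training data: one feature, one numeric target
points = [[1.0], [2.0], [3.0], [10.0]]
targets = [10.0, 20.0, 30.0, 100.0]

print(knn_regress(points, targets, [2.1], 3))  # 20.0
```

The three nearest points to 2.1 are 2.0, 3.0 and 1.0, so the prediction is the mean of 20, 30 and 10. The distant point at 10.0 does not influence the result.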