# K-Nearest Neighbors (KNN)

![alt text](knn.png)

The k-nearest neighbors (KNN) algorithm is a supervised machine learning algorithm that can be used for both classification and regression, although in industry it is mainly used for classification. Two properties characterize KNN well:

- Lazy learning algorithm: KNN has no specialized training phase. It simply stores the training data and uses all of it at prediction time.

- Non-parametric learning algorithm: KNN makes no assumptions about the distribution of the underlying data.
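
To make the "lazy" property concrete, here is a minimal sketch in which fitting only memorizes the data and all distance computation happens at prediction time. The class name `LazyKNN`, its methods, and the toy data are hypothetical and exist only for illustration; they are not part of this notebook's code.

```python
import numpy as np

class LazyKNN:
    """Hypothetical minimal KNN classifier illustrating lazy learning."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only stores the data; no parameters are estimated.
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict_one(self, sample):
        # All the work happens at query time: distance to every stored point.
        diffs = self.X - np.asarray(sample, dtype=float)
        dists = np.sqrt((diffs ** 2).sum(axis=1))
        nearest = np.argsort(dists, kind="stable")[: self.k]
        # Majority vote among the k nearest labels.
        labels, counts = np.unique(self.y[nearest], return_counts=True)
        return labels[np.argmax(counts)]

# Toy data (made up): two well-separated clusters.
toy_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_y = ["a", "a", "a", "b", "b", "b"]

clf = LazyKNN(k=3).fit(toy_X, toy_y)
print(clf.predict_one([0.2, 0.1]))
print(clf.predict_one([5.5, 5.5]))
```

Because nothing is learned up front, prediction cost grows with the size of the stored training set.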

In [200]:
import numpy as np
import pandas as pd
from math import sqrt
from collections import Counter

### Standard euclidean distance

In [201]:
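def euclidean_distance(row1, row2):
    # rows contain features only, so every position is included in the sum
    distance = 0.0
    for a, b in zip(row1, row2):
        distance += (float(a) - float(b))**2
    # take the square root once, after all features have been summed
    return sqrt(distance)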
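
For comparison, the same distance can be computed without an explicit loop. This is a sketch assuming both rows are numeric and have the same length; the name `euclidean_distance_np` is hypothetical and not used elsewhere in this notebook.

```python
import numpy as np

def euclidean_distance_np(row1, row2):
    # Convert both rows to float arrays so lists and pandas Series both work.
    a = np.asarray(row1, dtype=float)
    b = np.asarray(row2, dtype=float)
    # Square root of the sum of squared per-feature differences.
    return float(np.sqrt(np.sum((a - b) ** 2)))

print(euclidean_distance_np([0, 0], [3, 4]))  # 5.0 (the 3-4-5 right triangle)
```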

### KNN Location

In [202]:
def knn(train_x, train_y, dis_func, sample, k):
    
    distances = {}
    for i in range(len(train_x)):
        d = dis_func(sample, train_x.iloc[i])  # use the distance function passed in
        distances[i] = d
    sorted_dist = sorted(distances.items(), key = lambda x : (x[1], x[0]))
    # take k nearest neighbors
    neighbors = []
    for i in range(k):
        neighbors.append(sorted_dist[i][0])
    
    #convert indices into groups
    groups = [train_y.iloc[c] for c in neighbors]
    
    #count each group in top k
    counts = Counter(groups)
    
    #max number of samples of a class
    list_values = list(counts.values())
    list_keys = list(counts.keys())
    gr = list_keys[list_values.index(max(list_values))]
    
    return gr
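
The voting step at the end of `knn` can also be written with `Counter.most_common`. The sketch below isolates just that step; the helper name `majority_vote` and the example labels are hypothetical.

```python
from collections import Counter

def majority_vote(labels):
    # most_common(1) returns a list with one (label, count) pair.
    return Counter(labels).most_common(1)[0][0]

# Made-up labels of the k=5 nearest neighbors (cylinder counts).
neighbor_labels = [8, 8, 6, 8, 4]
print(majority_vote(neighbor_labels))  # 8, which appears three times out of five
```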

In [203]:
# knn on cars data
cars = pd.read_csv('carmpg.csv')
cars.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,"""chevrolet chevelle malibu"""
1,15.0,8,350.0,165,3693,11.5,70,1,"""buick skylark 320"""
2,18.0,8,318.0,150,3436,11.0,70,1,"""plymouth satellite"""
3,16.0,8,304.0,150,3433,12.0,70,1,"""amc rebel sst"""
4,17.0,8,302.0,140,3449,10.5,70,1,"""ford torino"""


In [204]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car name      398 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB


The dataset has 9 columns.
- mpg: from 9 to 46.6
- cylinders: 4, 6 or 8
- displacement: from 68 to 455
- horsepower: from 46 to 230
- weight: 1513 to 5140 lbs
- acceleration: 8 to 24.8
- model year
- origin
- car name

In [205]:
print(cars.shape)
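# note: the filtered frame is bound to a separate name, so `cars` itself
# still contains the '?' rows and both shapes print the same
penguin = cars.loc[cars.horsepower != '?', :]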
print(cars.shape)

(398, 9)
(398, 9)
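
The `horsepower` column is read as `object` because missing values are stored as `'?'`. One possible way to handle this, sketched here on a small made-up frame rather than on `carmpg.csv`, is to coerce the column to numbers and drop the rows that fail to parse. The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows mimicking the '?' placeholder in the horsepower column.
toy = pd.DataFrame({
    "mpg": [18.0, 25.0, 21.0],
    "horsepower": ["130", "?", "95"],
})

# errors="coerce" turns unparseable entries such as '?' into NaN.
toy["horsepower"] = pd.to_numeric(toy["horsepower"], errors="coerce")

# Drop the rows where horsepower could not be parsed.
clean = toy.dropna(subset=["horsepower"]).reset_index(drop=True)

print(clean.shape)
print(clean["horsepower"].dtype)
```

With a numeric `horsepower` column, one-hot encoding is no longer needed for it, and it contributes a single feature to the distance.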


In [206]:
cars = cars.drop('origin', axis=1)
cars = cars.drop('model year', axis=1)
cars = cars.drop('car name', axis=1)
cars.isna().sum()
cars

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration
0,18.0,8,307.0,130,3504,12.0
1,15.0,8,350.0,165,3693,11.5
2,18.0,8,318.0,150,3436,11.0
3,16.0,8,304.0,150,3433,12.0
4,17.0,8,302.0,140,3449,10.5
...,...,...,...,...,...,...
393,27.0,4,140.0,86,2790,15.6
394,44.0,4,97.0,52,2130,24.6
395,32.0,4,135.0,84,2295,11.6
396,28.0,4,120.0,79,2625,18.6


### Train/Test Split

In [207]:
# randomly flag roughly 75% of the rows as training data
cars['is_train'] = np.random.uniform(0, 1, len(cars)) <= .75
train = cars[cars['is_train']]
test = cars[~cars['is_train']]

train_x = train[train.columns[:len(train.columns) - 1]] # all columns except is_train
train_x = train_x.drop('cylinders', axis=1)             # remove the label from the features
train_y = train['cylinders']                            # corresponding labels


test_x = test[test.columns[:len(test.columns) - 1]]     # all columns except is_train
test_x = test_x.drop('cylinders', axis=1)               # remove the label from the features
test_y = test['cylinders']
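
The split above is drawn from an unseeded random generator, so the train and test sets change on every run. A minimal sketch of a reproducible variant follows, using a seeded generator on a made-up frame. The seed value `42`, the frame `toy`, and the `toy_*` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with the same label column name as the cars frame.
toy = pd.DataFrame({
    "cylinders": [4, 6, 8, 4, 6, 8, 4, 8],
    "weight": [2130, 3102, 3504, 2295, 2904, 3693, 2625, 3433],
})

# A seeded generator gives the same mask on every run.
rng = np.random.default_rng(42)
mask = rng.uniform(0, 1, len(toy)) <= 0.75

toy_train, toy_test = toy[mask], toy[~mask]

# Separate features from the label.
toy_train_x, toy_train_y = toy_train.drop(columns="cylinders"), toy_train["cylinders"]
toy_test_x, toy_test_y = toy_test.drop(columns="cylinders"), toy_test["cylinders"]

print(len(toy_train), len(toy_test))
```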

In [208]:
# one-hot encode the non-numeric feature columns (horsepower is stored as strings);
# the label to classify is the number of cylinders
train_x = pd.get_dummies(train_x)
test_x = pd.get_dummies(test_x)
print(train_x.shape)
train_x.head()

(290, 82)


Unnamed: 0,mpg,displacement,weight,acceleration,horsepower_100,horsepower_102,horsepower_103,horsepower_105,horsepower_110,horsepower_112,...,horsepower_88,horsepower_90,horsepower_91,horsepower_92,horsepower_93,horsepower_95,horsepower_96,horsepower_97,horsepower_98,horsepower_?
1,15.0,350.0,3693,11.5,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,16.0,304.0,3433,12.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,17.0,302.0,3449,10.5,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,15.0,429.0,4341,10.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,14.0,454.0,4354,9.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
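
Because `pd.get_dummies` is applied to the training and test frames separately, the two frames can end up with different columns when a category appears in only one of them. A position-based distance would then compare unrelated features. One way to guard against this, sketched on made-up frames, is to reindex the test columns to match the training columns. All names in the block are hypothetical.

```python
import pandas as pd

# Made-up frames: horsepower '165' appears only in the test data.
toy_train = pd.DataFrame({"weight": [3504, 2130], "horsepower": ["130", "52"]})
toy_test = pd.DataFrame({"weight": [3693], "horsepower": ["165"]})

train_enc = pd.get_dummies(toy_train)
test_enc = pd.get_dummies(toy_test)

# The encoded frames do not share the same columns.
print(list(train_enc.columns))
print(list(test_enc.columns))

# Give the test frame exactly the training columns, in the same order.
# Columns missing from the test data are filled with 0;
# columns unseen during training are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

Note that this discards categories seen only in the test data, which is one more reason to treat `horsepower` as a number rather than a category.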


In [209]:
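# note: the prediction is for test sample 10, while the label printed
# after it belongs to test sample 11, so the two values are not a like-for-like comparison
model = knn(train_x, train_y, euclidean_distance, test_x.iloc[10], k=5)
print(model)
print(test_y.iloc[11])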

8
6


In [210]:
def get_accuracy(test_x, test_y, train_x, train_y, k):
    correct = 0
    for i in range(len(test_x)):
        sample = test_x.iloc[i]
        true_label = test_y.iloc[i]
        predicted_label_euclidean = knn(train_x, train_y, euclidean_distance, sample, k)
        if predicted_label_euclidean == true_label:
            correct += 1
    
    accuracy_euclidean = (correct / len(test_x)) * 100
    
    print("Model accuracy with Euclidean Distance is %.2f" %(accuracy_euclidean), "%")

In [211]:
get_accuracy(test_x, test_y, train_x, train_y, k=5)

Model accuracy with Euclidean Distance is 79.63 %


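On this split, KNN with k=5 classifies about 80% (79.63%) of the test samples correctly. The split is random and unseeded, so the exact figure varies between runs. Because `horsepower` is one-hot encoded, the feature matrix has 82 columns; a dataset with fewer features may yield better results.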
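
Beyond the number of features, their scale also matters for Euclidean distance: a column such as `weight` (in the thousands) contributes far more to the sum of squares than `acceleration` (in the tens). A common remedy, sketched here on made-up numbers and not evaluated on the cars data, is to standardize each feature using statistics from the training set only. The array names are hypothetical.

```python
import numpy as np

# Made-up rows: columns are weight and acceleration.
toy_train = np.array([[3504.0, 12.0], [2130.0, 24.6], [3433.0, 12.0]])
toy_test = np.array([[2790.0, 15.6]])

# Compute mean and standard deviation on the training data only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)

# Apply the same transformation to both sets.
train_scaled = (toy_train - mean) / std
test_scaled = (toy_test - mean) / std

print(train_scaled.round(2))
print(test_scaled.round(2))
```

After scaling, each feature has zero mean and unit variance on the training set, so no single column dominates the distance.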