# K Nearest Neighbors

This notebook implements the k-nearest neighbors (KNN) algorithm, one of the simplest classification algorithms. It uses the Iain Murray fruits dataset.

Algorithm<br>
Given a new observation:<br>
    1] Compute the distance between the new observation and every training item<br>
    2] Pick the K shortest distances<br>
    3] Find the most common class among these K nearest items<br>
    4] Assign the new observation to that class
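
The four steps above can be sketched on a toy problem before touching the real data. This is a minimal illustration only: the function name `knn_classify` and the 2-D points are made up for the example and are not part of the notebook's implementation.

```python
import math
from collections import Counter

def knn_classify(new_point, points, labels, k):
    """Classify new_point by majority vote among its k nearest points."""
    # 1] distance from the new observation to every known item
    distances = [math.dist(new_point, p) for p in points]
    # 2] indices of the k shortest distances
    nearest = sorted(range(len(points)), key=lambda i: distances[i])[:k]
    # 3] most common class among those k items
    votes = Counter(labels[i] for i in nearest)
    # 4] the new observation is assigned to that class
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters in 2-D
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_classify((1.1, 1.0), points, labels, k=3))
print(knn_classify((5.1, 5.0), points, labels, k=3))
```

`math.dist` requires Python 3.8 or later.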
    

### Import Libraries

In [1]:
import numpy as np
import math
import operator
from collections import Counter

### Read Data

The data is tab-separated. We use NumPy's `genfromtxt` to read it; with no `delimiter` given, it splits on any whitespace, which includes tabs.

In [2]:
data = np.genfromtxt("fruit_data_with_colors.txt", dtype=None, encoding='utf-8')
data[:5]

array([['fruit_label', 'fruit_name', 'fruit_subtype', 'mass', 'width',
        'height', 'color_score'],
       ['1', 'apple', 'granny_smith', '192', '8.4', '7.3', '0.55'],
       ['1', 'apple', 'granny_smith', '180', '8.0', '6.8', '0.59'],
       ['1', 'apple', 'granny_smith', '176', '7.4', '7.2', '0.60'],
       ['2', 'mandarin', 'mandarin', '86', '6.2', '4.7', '0.80']],
      dtype='<U16')
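
Because the header row is read as data, every column comes back as a string. As an alternative sketch (not what this notebook does), `genfromtxt` can treat the first row as column names and infer a type per column. The sample text below reuses two rows from the output above and is read through `io.StringIO` so the snippet runs without the data file.

```python
import io
import numpy as np

# Header plus two rows in the same tab-separated layout as the dataset
sample = (
    "fruit_label\tfruit_name\tfruit_subtype\tmass\twidth\theight\tcolor_score\n"
    "1\tapple\tgranny_smith\t192\t8.4\t7.3\t0.55\n"
    "2\tmandarin\tmandarin\t86\t6.2\t4.7\t0.80\n"
)

# names=True takes the field names from the first row;
# dtype=None lets NumPy infer a type for each column
table = np.genfromtxt(io.StringIO(sample), delimiter="\t", names=True,
                      dtype=None, encoding="utf-8")

print(table.dtype.names)
print(table["mass"])
```

The result is a structured array, so columns are accessed by name rather than by index.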

### Data PreProcessing

The dataset has two string columns, `fruit_name` and `fruit_subtype`. To compute the distance between two items, every feature must be numeric. `fruit_name` is just a text version of `fruit_label`, so we drop it and keep `fruit_label` as the class to predict. That leaves `fruit_subtype`, which we convert to integer codes. Finally, we split the data into a training set and a test set.

In [3]:
#drop the fruit_name column from the array
data = np.delete(data, [1], axis=1)

In [4]:
#replace strings with numbers for the fruit subtype
#data[1:,1] gives all values of the fruit_subtype column, excluding the header row
fruit_sub_uniq = np.unique(data[1:,1]) 
#enumerate yields (index, value) pairs; build the mapping with
#the string as key and its index as value
mapping = {value: index for index, value in enumerate(fruit_sub_uniq)}
data[1:,1] = [mapping[value] for value in data[1:,1]] 
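
The same string-to-integer encoding can also be obtained in one call, since `np.unique` can return the index of each element within the sorted unique values. A small sketch with made-up subtype values, offered as an alternative rather than as the notebook's method:

```python
import numpy as np

# Hypothetical subtype column
subtypes = np.array(["granny_smith", "mandarin", "granny_smith", "braeburn"])

# uniq holds the sorted unique strings; codes[i] is the position of subtypes[i] in uniq
uniq, codes = np.unique(subtypes, return_inverse=True)

print(uniq)
print(codes)
```

The codes follow alphabetical order of the strings, exactly as with the `enumerate`-based mapping.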

In [5]:
#remove the first (header) row, as it is not needed for model training
data = data[1:,:]
#changing type to float
data = data.astype(float)

In [6]:
#shuffle the rows so the fruit types are spread across the array, then split into train and test sets
np.random.shuffle(data)
test_size = 10
#the last test_size rows form the test set; column 0 is the label, the rest are features
x_train, y_train = data[:-test_size, 1:], data[:-test_size, 0]
x_test, y_test = data[-test_size:, 1:], data[-test_size:, 0]
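
Since the shuffle is unseeded, every run produces a different split and therefore different accuracy figures. If repeatable results are wanted, one option is a seeded generator, sketched below on a stand-in array. The names `demo` and `shuffled` and the seed value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # the seed value is arbitrary

# Stand-in for the dataset: 8 rows, label in column 0, features in columns 1..4
demo = np.arange(40, dtype=float).reshape(8, 5)
test_size = 2

# permutation returns a shuffled copy of the rows and leaves demo untouched
shuffled = rng.permutation(demo)
x_train, y_train = shuffled[:-test_size, 1:], shuffled[:-test_size, 0]
x_test, y_test = shuffled[-test_size:, 1:], shuffled[-test_size:, 0]

print(x_train.shape, x_test.shape)
```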

### Implementation

In [7]:
#Distance function (Euclidean distance)
def euclidean_distance(instance1, instance2):
    #sum the squared differences over all features, then take the square root
    distance = 0
    for a, b in zip(instance1, instance2):
        distance += (b - a) ** 2
    return math.sqrt(distance)
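
For larger datasets the per-element loop becomes slow. A vectorized sketch of the same Euclidean distance, computing the distance from one test instance to every training row at once, might look like this (the array `x_train_demo` is made up for the example):

```python
import numpy as np

# Hypothetical training rows and a single test instance
x_train_demo = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
test_instance = np.array([0.0, 0.0])

# Broadcasting subtracts test_instance from every row;
# norm with axis=1 gives one Euclidean distance per row
distances = np.linalg.norm(x_train_demo - test_instance, axis=1)

print(distances)
```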

In [8]:
# number of neighbors
k = 5

#define the result array in which we will store the classified values
predicted_class = []

#iterate through the test instances; for each one, compute its distance to every training instance,
#keep the k nearest neighbors and select the most frequent class among them
for test_instance in x_test:
    #distance from this test instance to each training instance
    distances = []
    for x in x_train:
        distances.append(euclidean_distance(x, test_instance))
    
    #pair each distance with the class of its training instance
    distance_with_classes = list(zip(distances, y_train))
    
    #sort by distance and keep the k nearest
    distance_with_classes = sorted(distance_with_classes, key=operator.itemgetter(0))
    distance_with_classes = distance_with_classes[:k]
    
    #count the classes among the k nearest and select the most common one
    result = Counter(label for _, label in distance_with_classes).most_common(1)
    predicted_class.append(result[0][0])
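
One detail of the voting step is worth seeing in isolation: when two classes tie, `Counter.most_common` returns the one that was encountered first. Because the neighbors are sorted by distance, a tie here goes to the class of the nearer neighbor. The labels below are made up to show this.

```python
from collections import Counter

# Labels of the 5 nearest neighbors, ordered from nearest to farthest (hypothetical)
neighbor_labels = [4, 3, 3, 4, 1]

# Classes 4 and 3 both get two votes; 4 was seen first, so it wins the tie
winner = Counter(neighbor_labels).most_common(1)[0][0]

print(winner)
```

Other implementations may break ties differently, which can lead to different predictions on the same data.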


### Comparing our model with scikit-learn
Let's compare our results with scikit-learn's `KNeighborsClassifier`, using accuracy as the metric.

In [9]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

model = KNeighborsClassifier(n_neighbors=k)
model.fit(x_train, y_train)

print("Result comparison")
print("Actual Classes :",  y_test.astype(int))

print("Sklearn result :", model.predict(x_test).astype(int))
print("Our model result :", np.array(predicted_class).astype(int))

print("")

print("Accuracy")
print("Sklearn Model :", model.score(x_test, y_test))


print("Our Model :", accuracy_score(y_test, predicted_class))


Result comparison
Actual Classes : [1 1 1 1 4 4 4 1 3 3]
Sklearn result : [1 1 1 1 4 3 3 1 1 4]
Our model result : [1 1 1 1 4 4 4 1 1 4]

Accuracy
Sklearn Model : 0.6
Our Model : 0.8
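
Accuracy is simply the fraction of test samples whose predicted class matches the actual class. The figures above can be checked by hand from the printed arrays; the variable names below are chosen for this sketch only.

```python
import numpy as np

# Arrays copied from the output above
actual = np.array([1, 1, 1, 1, 4, 4, 4, 1, 3, 3])
sklearn_pred = np.array([1, 1, 1, 1, 4, 3, 3, 1, 1, 4])
our_pred = np.array([1, 1, 1, 1, 4, 4, 4, 1, 1, 4])

# The mean of a boolean array is the fraction of True values
sklearn_acc = np.mean(sklearn_pred == actual)
our_acc = np.mean(our_pred == actual)

print(sklearn_acc, our_acc)
```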


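On this split our model scores 0.8 against 0.6 for the sklearn model ;) With only 10 test samples, that gap is just two predictions (the 6th and 7th test samples), so it says little about which model is better in general. Both use Euclidean distance with the same k, so the difference most likely comes from how ties in the vote are broken.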