# Introduction 
K-Nearest Neighbour (KNN) can be used for both classification and regression problems. In industry, however, it is more widely used for classification.
 
In this notebook, we build the KNN algorithm from scratch and apply it to the Iris dataset. We then compare our results with the KNeighborsClassifier from sklearn.

We can implement KNN with the following steps:
* Load the data.
* Initialize the value of k.
* Iterate over all the test data points.
* To get the predicted class of each test point, iterate over all the training data points:
    1. Calculate the distance between the test point and each row of the training data. Here we use Euclidean distance as our distance metric, since it’s the most popular choice. Other metrics that can be used include Manhattan, Chebyshev, and cosine distance.
    2. Sort the calculated distances in ascending order.
    3. Take the top k rows from the sorted array.
    4. Find the most frequent class among these rows.
    5. Return the predicted class.
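
To make the choice of metric in step 1 concrete, here is a minimal, dependency-free sketch of the four distances named above. The function names are illustrative and are not part of the notebook's implementation; the two points are the ones used later as the dummy test set.

```python
import math

def euclidean(a, b):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # largest absolute coordinate difference
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity; assumes neither vector is all zeros
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a = [7.2, 3.6, 5.1, 2.5]
b = [7.5, 3.8, 5.3, 2.8]
for metric in (euclidean, manhattan, chebyshev, cosine_distance):
    print(metric.__name__, metric(a, b))
```

Note that cosine distance compares only the direction of the two vectors, so it can rank neighbours quite differently from the other three metrics.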

# Implementation
You can download the Iris dataset from [here](https://gist.githubusercontent.com/gurchetan1000/ec90a0a8004927e57c24b20a6f8c8d35/raw/fcd83b35021a4c1d7f1f1d5dc83c07c8ffc0d3e2/iris.csv).

In [None]:
#Importing libraries
import pandas as pd
import numpy as np
import math
import operator
import os
print(os.listdir("../input"))


In [None]:
# importing data
data=pd.read_csv('../input/Iris.csv',index_col='Id')
data.head()

In [None]:
data.shape

In [None]:
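# Defining a function to calculate the Euclidean distance between two data points
# data1 and data2 are pandas Series (rows); only the first `length` values are compared
def euclideanDistance(data1,data2,length):
    distance=0
    for x in range(length):
        # .iloc indexes by position, so this works whatever the column labels are
        distance+=np.square(data1.iloc[x]-data2.iloc[x])
    return np.sqrt(distance)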

# Defining our KNN model
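def knn(train,test,k):
    # train: DataFrame with the feature columns first and the class label in the last column
    # test: DataFrame holding only the feature columns, in the same order as in train
    length=test.shape[1]
    result=[]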
    # calculating euclideanDistance between each row of training data and test data
    for y in range(len(test)):
        # maps the position of each training row to its distance from the current test row
        distances={}
        for x in range(len(train)):
            dist=euclideanDistance(test.iloc[y],train.iloc[x],length)
            distances[x]=dist
           
        #sorting them on the basis of distance
        sorted_d=sorted(distances.items(),key=operator.itemgetter(1))
    
        neighbors=[]
    
        # Extracting the top k neighbors (positions of the closest training rows)
        for x in range(k):
            neighbors.append(sorted_d[x][0])
        
        classvotes={}
    
        # count how often each class occurs among the neighbors
        for x in range(len(neighbors)):
            # the class label is in the last column; .iloc selects it by position
            response=train.iloc[neighbors[x],-1]
        
            if response in classvotes:
                classvotes[response]+=1
            else:
                classvotes[response]=1
         
        # the class with the most votes is the prediction for this test row
        sortedvotes=sorted(classvotes.items(),key=operator.itemgetter(1),reverse=True)
        result.append(sortedvotes[0][0])
    return result
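
For comparison, the same procedure (compute distances, sort, take the k nearest, majority vote) can be sketched more compactly with only the standard library. This is an illustrative variant under stated assumptions, not the notebook's implementation: the name `knn_predict` and the toy rows and labels below are made up for the example, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k):
    # pair every training label with its Euclidean distance to the query
    pairs = [(math.dist(row, query), label)
             for row, label in zip(train_rows, train_labels)]
    # sort by distance only, ascending
    pairs.sort(key=lambda pair: pair[0])
    # labels of the k nearest training rows
    top_k = [label for _, label in pairs[:k]]
    # majority vote among the k nearest labels
    return Counter(top_k).most_common(1)[0][0]

# hypothetical toy data: two well-separated groups
train_rows = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_rows, train_labels, [1.1, 1.0], k=3))
print(knn_predict(train_rows, train_labels, [5.0, 5.0], k=3))
```

When two classes receive the same number of votes, `Counter.most_common` returns the one encountered first, which here is the class of the nearer neighbour. Choosing an odd k reduces the chance of ties in two-class problems.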

    

In [None]:
# creating a dummy test set
testset=[[7.2,3.6,5.1,2.5],[7.5,3.8,5.3,2.8]]
test=pd.DataFrame(testset)
# setting no. of neighbors
k=3
#Running our model
result=knn(data,test,k)
print(result)


# Comparing our model with scikit-learn

In [None]:
from sklearn.neighbors import KNeighborsClassifier
neigh=KNeighborsClassifier(n_neighbors=3)
neigh.fit(data.iloc[:,0:4],data['Species'])

print(neigh.predict(test))

* Both models predict the same class for the two test points, which suggests that our implementation agrees with scikit-learn on these inputs.
