# KNN-Python from Scratch

# # Here we implement the KNN algorithm from scratch, without using any machine-learning libraries

[KNN Introduction](https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761)
> K Nearest Neighbours (KNN) is a simple algorithm that stores all the available cases and classifies new data based on a similarity measure. It is mostly used to classify a data point based on how its neighbours are classified.

# # # How KNN works
* Suppose the data falls into 10 classes.
* You want to predict which class a new data point belongs to.
* This is the problem KNN solves.

Confused!?
Let me explain 

* KNN measures the distance between the new data point and every available data point.
* The 'K' in KNN is the number of nearest neighbours to consult; consider k = 5.
* Collect the five nearest data points based on the distances measured.
* The new data point is classified by a majority vote among its five neighbours. For example, if four of the five nearest neighbours belong to the 4th class (among the 10 classes), the new data point is assigned to the 4th class.
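
The steps above can be sketched on a tiny toy example before moving to real data. The points, labels and `k = 3` below are made up for illustration, and `math.dist` (Python 3.8+) stands in for the distance function written later in this notebook.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Hypothetical labelled points: ((x, y), class_label)
samples = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]
new_point = (1.1, 1.0)
k = 3

# 1. Measure the distance from the new point to every stored sample.
by_distance = sorted(samples, key=lambda s: dist(new_point, s[0]))

# 2. Keep the k closest samples.
nearest = by_distance[:k]

# 3. Let the neighbours vote; the most common label wins.
votes = Counter(label for _, label in nearest)
predicted = votes.most_common(1)[0][0]
print(predicted)  # prints A: all three nearest neighbours belong to class A
```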

Got the idea!



* The best value of k depends on the dataset you are working with
* Recommendation (a common rule of thumb)
    * k = sqrt(N)
    * where N is the total number of samples
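
A minimal sketch of this rule of thumb. The helper name `suggest_k` is hypothetical, and bumping an even result to the next odd number is an added assumption (an odd k makes tied votes less likely, though it only rules them out when there are two classes). Treat the result as a starting point, not a final answer.

```python
from math import sqrt

def suggest_k(n_samples):
    """Rule-of-thumb starting value for k: round sqrt(N), then force it odd."""
    k = max(1, round(sqrt(n_samples)))
    # Assumption: prefer an odd k so that tied votes are less likely.
    if k % 2 == 0:
        k += 1
    return k

print(suggest_k(105))  # sqrt(105) is about 10.25 -> 10 -> bumped to 11
```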

# Let's Start 

Here we are going to work with
[iris-data](https://www.kaggle.com/uciml/iris)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# # Load the data

In [None]:
data = pd.read_csv('../input/iris-data/Iris.csv')
data.head(5)

# # Visualising the different species (classes) we are going to work with

In [None]:
Species = sorted(set(data['Species']))          #sorted, so the species order is the same on every run
Specie1 = data[data['Species']==Species[0]]
Specie2 = data[data['Species']==Species[1]]
Specie3 = data[data['Species']==Species[2]]

In [None]:
import matplotlib.pyplot as plt
plt.scatter(Specie1['PetalLengthCm'], Specie1['PetalWidthCm'], label=Species[0])
plt.scatter(Specie2['PetalLengthCm'], Specie2['PetalWidthCm'], label=Species[1])
plt.scatter(Specie3['PetalLengthCm'], Specie3['PetalWidthCm'], label=Species[2])
plt.xlabel('PetalLengthCm')
plt.ylabel('PetalWidthCm')
plt.legend()
plt.title('Different Species Visualization')

* Independent Variables: ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
* Dependent Variables: ['Species']

# # *Now our task is to predict which species a new data point belongs to, based on sepalLength, sepalWidth, petalLength and petalWidth*

# # # Preprocessing Data

Removing the Id column, which is not needed for prediction

In [None]:
req_data = data.iloc[:,1:]
req_data.head(5)

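Shuffling the data. The rows in Iris.csv are grouped by species, so without shuffling, the 70/30 split below would put only one species in the test set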

In [None]:
shuffle_index = np.random.permutation(req_data.shape[0])        #shuffling the row index of our dataset
req_data = req_data.iloc[shuffle_index]
req_data.head(5)
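
The shuffle above gives a different order on every run, so the accuracy reported later will vary from run to run. A minimal sketch of a reproducible shuffle, assuming NumPy's `default_rng` generator is available; the seed value 42 and `n_rows = 10` are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(seed=42)     # the seed value is an arbitrary choice

n_rows = 10                              # stand-in for req_data.shape[0]
shuffle_index = rng.permutation(n_rows)  # same order every time the seed is the same
print(shuffle_index)
```

The resulting index can be used exactly as in the cell above (`req_data.iloc[shuffle_index]`). pandas also offers `req_data.sample(frac=1, random_state=42)` as a one-line alternative.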

* Using 70% of the data as training data
* The remaining 30% will be our test data

In [None]:
train_size = int(req_data.shape[0]*0.7)

In [None]:
train_df = req_data.iloc[:train_size,:] 
test_df = req_data.iloc[train_size:,:]
train = train_df.values
test = test_df.values
y_true = test[:,-1]
print('Train_Shape: ',train_df.shape)
print('Test_Shape: ',test_df.shape)

# KNN in 3 Steps
> 1 Measure distance (Euclidean Distance or Manhattan Distance)

> 2 Get nearest neighbours

> 3 Predict the class

# # Step 1

Measuring Distance using Euclidean Distance
>Mathematical formula: $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

In [None]:
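from math import sqrt
def euclidean_distance(x_test, x_train):
    #both rows are expected to end with the class label, so the last element is skipped
    distance = 0
    for i in range(len(x_test)-1):
        distance += (x_test[i]-x_train[i])**2      #sum of squared feature differences
    return sqrt(distance)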
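
Step 1 also mentions Manhattan distance, which for two points is $d = |x_2 - x_1| + |y_2 - y_1|$. A minimal sketch follows, using the same convention as `euclidean_distance` (the last element of each row is the label and is skipped). The function name and the two sample rows are illustrative, not part of the original notebook or taken from the dataset.

```python
def manhattan_distance(x_test, x_train):
    """Sum of absolute feature differences; the last element (the label) is skipped."""
    distance = 0
    for i in range(len(x_test) - 1):
        distance += abs(x_test[i] - x_train[i])
    return distance

# Illustrative rows: four feature values followed by a label
a = [5.0, 3.5, 1.5, 0.5, "class-1"]
b = [6.0, 2.5, 3.5, 1.5, "class-2"]
print(manhattan_distance(a, b))  # 1.0 + 1.0 + 2.0 + 1.0 = 5.0
```

To try it, swap it in for `euclidean_distance` inside `get_neighbors`; the rest of the pipeline stays the same.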

# # Step 2

Getting the nearest neighbours

In [None]:
def get_neighbors(x_test, x_train, num_neighbors):
    distances = []
    data = []
    for i in x_train:
        distances.append(euclidean_distance(x_test,i))
        data.append(i)
    distances = np.array(distances)
    data = np.array(data)
    sort_indexes = distances.argsort()             #argsort() returns the indices that would sort the distances in ascending order
    data = data[sort_indexes]                      #reordering the rows by those indices, so the nearest neighbours come first
    return data[:num_neighbors]
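
The loop above computes one distance at a time. As a sketch of an alternative, the same neighbours can be found with NumPy array operations. The name `get_neighbors_vectorized` is hypothetical, and the code assumes every row ends with the class label, as the rows in this notebook do.

```python
import numpy as np

def get_neighbors_vectorized(x_test, x_train, num_neighbors):
    """Return the num_neighbors rows of x_train closest to x_test.

    Assumes every row ends with the class label.
    """
    x_train = np.asarray(x_train, dtype=object)
    features = x_train[:, :-1].astype(float)        # drop the label column
    query = np.asarray(x_test[:-1], dtype=float)    # drop the label slot
    # Euclidean distance from the query to every training row at once
    distances = np.sqrt(((features - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:num_neighbors]
    return x_train[nearest]

rows = [[0.0, 0.0, "a"], [1.0, 1.0, "a"], [5.0, 5.0, "b"], [0.1, 0.1, "a"]]
print(get_neighbors_vectorized([0.0, 0.0, "?"], rows, 2))
```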

# # Step 3

Predicting the class to which our new data point belongs

In [None]:
def prediction(x_test, x_train, num_neighbors):
    classes = []
    neighbors = get_neighbors(x_test, x_train, num_neighbors)
    for i in neighbors:
        classes.append(i[-1])
    predicted = max(classes, key=classes.count)              #taking the most repeated class
    return predicted
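
The majority vote can also be written with `collections.Counter` from the standard library. This is a sketch of an equivalent approach, not a change to the function above; `majority_vote` is a hypothetical name. When two classes receive the same number of votes, the one that appears first in the list wins, which matches the behaviour of `max(classes, key=classes.count)`.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

print(majority_vote(["setosa", "virginica", "virginica", "setosa", "virginica"]))  # prints virginica
```

Because the neighbours are sorted by distance, a tie is broken in favour of the tied class whose neighbour is closest.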

In [None]:
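def predict_classifier(x_test):
    #classifies one new sample against the whole dataset with k = 5
    #x_test needs a placeholder in the last position, because euclidean_distance skips the last element
    classes = []
    neighbors = get_neighbors(x_test, req_data.values, 5)
    for i in neighbors:
        classes.append(i[-1])
    predicted = max(classes, key=classes.count)
    print(predicted)
    return predicted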

Measuring the accuracy, so that we know how accurately our model predicts new data samples

In [None]:
def accuracy(y_true, y_pred):
    num_correct = 0
    for i in range(len(y_true)):
        if y_true[i]==y_pred[i]:
            num_correct+=1
    accuracy = num_correct/len(y_true)
    return accuracy
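
A single accuracy number does not show which species get confused with each other. As a small sketch, the (true, predicted) pairs can be counted with the standard library. The helper name `confusion_counts` and the demo labels are made up for illustration.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

# Made-up labels, only to show the output format
y_true_demo = ["setosa", "versicolor", "virginica", "virginica"]
y_pred_demo = ["setosa", "versicolor", "versicolor", "virginica"]

counts = confusion_counts(y_true_demo, y_pred_demo)
for (true_label, pred_label), n in sorted(counts.items()):
    print(f"true={true_label:<12} predicted={pred_label:<12} count={n}")
```

Pairs where the two labels differ are the misclassifications.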

Predicting test data

In [None]:
y_pred = []
for i in test:
    y_pred.append(prediction(i, train, 5))
y_pred

> Evaluating model performance

In [None]:
acc = accuracy(y_true, y_pred)     #a separate name, so the accuracy() function is not overwritten

In [None]:
acc

* We are getting pretty good accuracy
* If you are getting low accuracy, tune the value of k (the number of nearest neighbours)
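
One way to tune k is to compute the accuracy for several candidate values and compare them. The sketch below is generic: `accuracy_for_each_k` is a hypothetical helper that takes any function with the same signature as `prediction(x_test, x_train, num_neighbors)`.

```python
def accuracy_for_each_k(predict, train, test, y_true, candidate_ks):
    """Return {k: accuracy} for every k in candidate_ks.

    predict(row, train, k) is expected to return one predicted label.
    """
    scores = {}
    for k in candidate_ks:
        y_pred = [predict(row, train, k) for row in test]
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
        scores[k] = correct / len(y_true)
    return scores
```

With the objects defined in this notebook, a call might look like `scores = accuracy_for_each_k(prediction, train, test, y_true, range(1, 16, 2))`, and `max(scores, key=scores.get)` then gives the best-scoring k. Note that choosing k on the test set makes the reported accuracy optimistic; holding out a separate validation split for this step is cleaner.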

In [None]:
test_df.insert(5, 'Predicted_Species', y_pred, allow_duplicates=False)     #add the predictions as a new last column

In [None]:
test_df.sample(5)