# KNN (K-Nearest Neighbors)

Nearest neighbors algorithms (NNAs) are conceptually simple: to classify a datum with given feature values, find the data point with the most similar feature values and assign the datum to that point's class. NNAs can also be used to predict missing feature values.

The most common NNA is the k-Nearest Neighbors algorithm, which identifies the k nearest neighbors of the query. An object is classified by a **majority vote of its neighbors**: it is assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbor.
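The majority vote described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how sklearn implements it: the function name `knn_predict`, the Euclidean distance, and the toy points are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # count the labels among those neighbors and return the most common one
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy data (hypothetical): class 0 clustered low, class 1 clustered high
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (2.0, 2.0), k=3))  # prints 0
print(knn_predict(points, labels, (6.0, 7.0), k=3))  # prints 1
```

With k = 1 the function reduces to returning the label of the single closest point.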

<img src="https://miro.medium.com/max/700/1*rmdr7RsUPOWranwOuuIl7w.png">

## Important Parameter
- n_neighbors: number of neighbors to consider (default is 5)
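To see why this parameter matters, here is a small sketch on made-up one-dimensional data (the points and the query are assumptions chosen for illustration). A single class-1 point sits right next to the query, so k = 1 follows it, while larger k values are outvoted by the surrounding class-0 points.

```python
from sklearn.neighbors import KNeighborsClassifier

# toy 1-D data (hypothetical): one class-1 point is closest to the query,
# but most of the nearby points belong to class 0
X = [[0.0], [1.0], [2.0], [3.0], [3.4], [9.0], [10.0]]
y = [0, 0, 0, 0, 1, 1, 1]
query = [[3.5]]

predictions = {}
for k in (1, 3, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    predictions[k] = int(model.predict(query)[0])

print(predictions)  # prints {1: 1, 3: 0, 5: 0}
```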

# Example KNN

Here we use the same diabetes dataset so that these results can be compared with the other models later today :)

## Loading the Data

In [1]:
#imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

In [2]:
# loading the Pima Indians diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)

#  'preg': number of pregnancies  
#  'plas': plasma glucose concentration 
#  'pres': blood pressure 
#  'skin': skin thickness
#  'test': Insulin
#  'mass': BMI
#  'pedi': diabetes pedigree function
#  'age': age
#  'class': '0' means does not have diabetes and '1' means has diabetes

data.head()

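|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|------|------|------|------|------|------|------|-----|-------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |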


## Splitting Testing/Training Data

In [3]:
# columns we will use to make predictions with
x_cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

# column that we want to predict
y_col = 'class'

# 80-20 split of dataset
test_size = 0.2
x_training, x_testing, y_training, y_testing = train_test_split(data[x_cols], data[y_col], test_size=test_size, random_state=0)

## Creating Model

In [15]:
# creating a model with sklearn's k nearest neighbors; try different values of n_neighbors to improve the accuracy!
knn = KNeighborsClassifier(n_neighbors=5)

# training/fitting model with training data
knn.fit(x_training, y_training)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')
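`StandardScaler` is imported above but never applied. Because KNN relies on distances, a feature with a large numeric range can drown out the others, so scaling is often worth trying. The sketch below uses synthetic data (not the diabetes dataset) with one informative feature and one large-range noise feature; the data, variable names, and the use of `make_pipeline` are assumptions for illustration, and no claim is made about how scaling changes the diabetes results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data: a small-range feature that separates the classes and
# a large-range feature that is unrelated to the class
rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)
informative = rng.normal(loc=2 * y - 1, scale=0.3)  # centered at -1 or +1
noise = rng.normal(loc=0, scale=10_000, size=n)     # pure noise, huge range
X = np.column_stack([informative, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# same classifier, with and without standardizing the features first
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(X_train, y_train)

raw_acc = raw_knn.score(X_test, y_test)
scaled_acc = scaled_knn.score(X_test, y_test)
print("Unscaled accuracy:", raw_acc)
print("Scaled accuracy:  ", scaled_acc)
```

Putting the scaler inside a pipeline means it is fitted on the training data only and then reused on the testing data, which avoids leaking test statistics into training.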

## Evaluating Model

In [16]:
# printing accuracy (in percent) on the training & testing data
y_train_pred = knn.predict(x_training)
print("Training Accuracy is ", accuracy_score(y_training, y_train_pred) * 100)
y_test_pred = knn.predict(x_testing)
print("Testing Accuracy is ", accuracy_score(y_testing, y_test_pred) * 100)

Training Accuracy is  81.10749185667753
Testing Accuracy is  79.87012987012987


## Notes

**Advantages**
- A good and quick starting point/baseline classifier
- Non-parametric (makes no assumption about how the data is distributed, so it can be used with data that does not fit a normal distribution)

**Disadvantages**
- Degrades with high-dimensional data (because the distances to the closest and furthest neighbors become nearly the same)
- No obvious way to handle non-numeric features, since the algorithm needs a distance between data points
- Doesn't handle data with skewed class distribution well (if one class is extremely dominant in the training data, it will dominate the "voting majority" for classifying new data)
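The first disadvantage can be illustrated with a rough experiment: draw random points in a unit cube and compare the nearest and farthest distances from a random query. This is only a sketch under assumed settings (uniform random data, 200 points, Euclidean distance, a hypothetical helper named `nearest_to_farthest_ratio`); a ratio close to 1 means the "nearest" neighbor is barely nearer than any other point.

```python
import math
import random

def nearest_to_farthest_ratio(n_dims, n_points=200, seed=0):
    """Ratio of the nearest to the farthest distance from a random query."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    # distance from the query to each of n_points uniformly random points
    distances = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return min(distances) / max(distances)

ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for n_dims, ratio in ratios.items():
    print(n_dims, round(ratio, 3))
```

In low dimensions the ratio is small (the nearest point is much closer than the farthest), while in high dimensions it approaches 1.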