# K-Nearest Neighbor


**What is KNN?**

- Supervised Algorithm

- Makes predictions based on how close a new data point is to known data points.

- Considered a **lazy algorithm**: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the K nearest neighbors of each point.

- Predictions are made for a new data point by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances. For regression problems, this might be the mean of the output variable. For classification problems, this might be the mode (or most common) class value.

- It is important to define a metric to measure how similar data instances are. Euclidean distance can be used if attributes are all on the same scale (or you convert them to the same scale).
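
To make the bullets above concrete, here is a minimal from-scratch sketch of the idea: measure Euclidean distance, keep the K closest rows, then take the mode (classification) or the mean (regression). The function names and the toy data are made up for illustration; this is not how scikit-learn implements KNN internally.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Straight-line distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_nearest(train_X, train_y, point, k):
    """Return the targets of the k training rows closest to `point`."""
    ranked = sorted(zip(train_X, train_y), key=lambda row: euclidean(row[0], point))
    return [target for _, target in ranked[:k]]

def knn_classify(train_X, train_y, point, k=3):
    """Classification: mode (most common class) of the k nearest neighbors."""
    return Counter(k_nearest(train_X, train_y, point, k)).most_common(1)[0][0]

def knn_regress(train_X, train_y, point, k=3):
    """Regression: mean of the target over the k nearest neighbors."""
    neighbors = k_nearest(train_X, train_y, point, k)
    return sum(neighbors) / len(neighbors)

# Toy, made-up data: two features per row, already on the same scale
toy_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
toy_labels = ['a', 'a', 'a', 'b', 'b']
toy_values = [10.0, 12.0, 11.0, 50.0, 52.0]

print(knn_classify(toy_X, toy_labels, (1.1, 1.0), k=3))
print(knn_regress(toy_X, toy_values, (1.1, 1.0), k=3))
```

Note that every prediction sorts the whole training set by distance, which is why nothing happens at "fit" time and all the cost is paid at prediction time.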



![knn2.jpg](attachment:knn2.jpg)

https://www.kdnuggets.com/2016/01/implementing-your-own-knn-using-python.html


**Pros**

1. Simple to implement

2. Robust to noisy training data, provided K is large enough to average out the noise

3. Effective if training data is large

4. Performs calculations "just in time", i.e. when a prediction is needed (as opposed to ahead of time)

5. Training instances can be updated and curated over time to keep predictions accurate.

**Cons**

1. Need to determine the value of K

2. The computation cost is high: each prediction requires computing the distance from the new instance to every training sample, so you need to hang on to your entire training dataset.

3. Distance can break down in very high dimensions, negatively affecting performance. This is known as the "curse of dimensionality". To alleviate it, only use those input variables that are most relevant to predicting the output variable.
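
Here is a small sketch of what "distance breaks down" means in practice. It draws random points in a unit cube and compares the nearest and farthest distance from one query point. The function name `distance_contrast` and the choice of uniform random data are assumptions made for this illustration.

```python
import math
import random

def distance_contrast(n_dims, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from one random query point.

    A ratio near 0 means the nearest neighbor is much closer than the
    farthest point; a ratio near 1 means all points are about equally far,
    so "nearest" carries little information.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, point))))
    return min(distances) / max(distances)

# The ratio should creep toward 1 as the number of dimensions grows
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(distance_contrast(n_dims), 3))
```

As the ratio approaches 1, the K "nearest" neighbors are barely nearer than anything else, which is why dropping irrelevant features tends to help.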

In [None]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

from pydataset import data

In [None]:
# read Iris data from pydataset
df = data('iris')

# convert column names to lowercase, replace '.' in column names with '_'
df.columns = [col.lower().replace('.', '_') for col in df]

df.head()

## Train Validate Test

Now we'll do our train/validate/test split:

- We'll do exploration and train our model on the `train` data

- We tune our model on `validate`, since it will be out-of-sample until we use it.

- And keep the `test` nice and safe and separate, for our final out-of-sample dataset, to see how well our tuned model performs on new data.
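
The helper defined just below holds out 20% of the rows for `test`, then 30% of what remains for `validate`. Here is a quick sketch of the arithmetic for the 150-row iris data. The function `split_sizes` is hypothetical, and it assumes the held-out portion of each step is rounded up.

```python
import math

def split_sizes(n_rows, test_frac=0.2, validate_frac=0.3):
    """Row counts for a two-step train/validate/test split.

    Assumption: the held-out part of each step is rounded up and the
    remainder goes to the other side of the split.
    """
    n_test = math.ceil(n_rows * test_frac)
    n_train_validate = n_rows - n_test
    n_validate = math.ceil(n_train_validate * validate_frac)
    n_train = n_train_validate - n_validate
    return n_train, n_validate, n_test

n_train, n_validate, n_test = split_sizes(150)  # iris has 150 rows
print(n_train, n_validate, n_test)
print([round(n / 150, 2) for n in (n_train, n_validate, n_test)])
```

So the overall proportions come out to roughly 56% train, 24% validate, and 20% test.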

In [None]:
from sklearn.model_selection import train_test_split

def train_validate_test_split(df, target, seed=123):
    '''
    This function takes in a dataframe, the name of the target variable
    (for stratification purposes), and an integer for setting a seed,
    and splits the data into train, validate and test.
    '''
    train_validate, test = train_test_split(df, test_size=0.2, 
                                            random_state=seed, 
                                            stratify=df[target])
    train, validate = train_test_split(train_validate, test_size=0.3, 
                                       random_state=seed,
                                       stratify=train_validate[target])
    return train, validate, test

In [None]:
# split into train, validate, test
train, validate, test = train_validate_test_split(df, target='species', seed=123)

In [None]:
# create X & y versions of train, validate, and test, where y is a series with just the target variable and X holds all the features.

X_train = train.drop(columns=['species'])
y_train = train.species

X_validate = validate.drop(columns=['species'])
y_validate = validate.species

X_test = test.drop(columns=['species'])
y_test = test.species

## Train Model

**Create KNN Object**


In [None]:
# weights = ['uniform', 'distance']
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')

**Fit the model** 

Fit the model to the training data. 

In [None]:
knn.fit(X_train, y_train)

**Make Predictions**

Classify each flower by its estimated species. 

In [None]:
# predict on X_train
y_pred = knn.predict(X_train)

In [None]:
y_pred

**Estimate Probability**

Estimate the probability of each species, using the training data. 

In [None]:
# calculate probabilities (if you need them)
y_pred_proba = knn.predict_proba(X_train)

In [None]:
# look at first 10 values
y_pred_proba[:10]
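
Where do those probabilities come from? With `weights='uniform'`, each of the K neighbors gets one vote, so the probability for a class is just its share of the votes. With `weights='distance'`, closer neighbors count more (each vote is weighted by the inverse of its distance). Here is a minimal sketch of both; the function `vote_shares` and the neighbor values are hypothetical.

```python
from collections import defaultdict

def vote_shares(neighbor_labels, neighbor_distances=None):
    """Class probabilities from the K nearest neighbors of one point.

    With no distances, every neighbor gets one vote (like weights='uniform').
    With distances, each vote is weighted by 1 / distance (like weights='distance').
    """
    if neighbor_distances is None:
        weights = [1.0] * len(neighbor_labels)
    else:
        # assumes no neighbor sits at distance zero
        weights = [1.0 / d for d in neighbor_distances]
    totals = defaultdict(float)
    for label, weight in zip(neighbor_labels, weights):
        totals[label] += weight
    total_weight = sum(weights)
    return {label: w / total_weight for label, w in totals.items()}

# Hypothetical neighbors of a single flower, K = 5
nbr_labels = ['versicolor', 'versicolor', 'virginica', 'virginica', 'virginica']
nbr_distances = [0.1, 0.2, 0.5, 0.5, 1.0]

print(vote_shares(nbr_labels))
print(vote_shares(nbr_labels, nbr_distances))
```

In this made-up case the two weighting schemes disagree: a plain majority vote favors virginica, while the distance-weighted vote favors versicolor because the two closest neighbors are versicolor.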

## Evaluate Model

**Compute the Accuracy**

In [None]:
print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))

**Create a confusion matrix**

In [None]:
print(confusion_matrix(y_train, y_pred))

**Create a classification report** 

In [None]:
pd.DataFrame(classification_report(y_train, y_pred, output_dict=True))
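
To see what these metrics are counting, here is a from-scratch sketch of a confusion matrix (rows are actual classes, columns are predicted classes) and the per-class numbers that appear in a classification report. The helper names and the six labels are made up for illustration; they are not results from the model above.

```python
def confusion_counts(y_true, y_hat, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_hat):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

def per_class_report(matrix, labels):
    """Precision, recall, f1-score and support for each class."""
    report = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted_as = sum(row[i] for row in matrix)  # column total
        support = sum(matrix[i])                      # row total
        precision = true_pos / predicted_as if predicted_as else 0.0
        recall = true_pos / support if support else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {'precision': precision, 'recall': recall,
                         'f1-score': f1, 'support': support}
    return report

# Made-up labels for illustration only
toy_actual = ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica']
toy_predicted = ['setosa', 'setosa', 'versicolor', 'virginica', 'virginica', 'virginica']
classes = ['setosa', 'versicolor', 'virginica']

toy_matrix = confusion_counts(toy_actual, toy_predicted, classes)
toy_accuracy = sum(toy_matrix[i][i] for i in range(len(classes))) / len(toy_actual)
print(toy_matrix)
print(round(toy_accuracy, 2))
print(per_class_report(toy_matrix, classes))
```

Accuracy is the diagonal of the matrix divided by the total count; precision reads down a column, and recall reads across a row.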

_______________________________

Let's look at how this works manually and through simple visual classification. 
We start with 3 *labeled* samples here. 

In [None]:
# first_nearest_neighbor
import pandas as pd
samples = pd.DataFrame({'a': [5.7, 5.5, 6.3], 
                        'b': [2.6, 3.5, 2.8], 
                        'c': [3.5, 1.3, 5.1], 
                        'd': [1.0, 0.2, 1.5], 
                        'target': ['versicolor', 'setosa', 'virginica']
                       })


samples

We then add 3 new unlabeled observations. 
For each observation, we look to the labeled samples to see which one it is closest to, in order to find the "1st-Nearest Neighbor". 

In [None]:
new_obs = pd.DataFrame([[6.3, 2.8, 5.1, 1.4], 
                       [6.25, 2.77, 5.09, 1.35], 
                       [5.5, 3.5, 1.29, 0.3]], 
                        columns = ['a', 'b', 'c', 'd'])

new_obs

It is pretty clear which samples each new observation corresponds to. So we add those predictions. 

In [None]:
pred_target = pd.DataFrame(['virginica', 'virginica', 'setosa'], columns=['pred_target'])
pd.concat([new_obs, pred_target], axis=1)

This is what K-Nearest Neighbors does for us, except that it uses the distance formula to compute the actual distances and find the K labeled samples with the shortest distances. Of those K samples, which species is most common? That is, what is the mode of those neighbors? 
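
We can check the eyeballed predictions by actually applying the distance formula. This sketch re-enters the same values from the `samples` and `new_obs` tables as plain tuples and picks the single closest labeled sample (K = 1); the helper names are made up for illustration.

```python
import math

# Same values as the `samples` and `new_obs` tables above
labeled = [
    ((5.7, 2.6, 3.5, 1.0), 'versicolor'),
    ((5.5, 3.5, 1.3, 0.2), 'setosa'),
    ((6.3, 2.8, 5.1, 1.5), 'virginica'),
]
unlabeled = [
    (6.3, 2.8, 5.1, 1.4),
    (6.25, 2.77, 5.09, 1.35),
    (5.5, 3.5, 1.29, 0.3),
]

def euclidean(p, q):
    """Square root of the summed squared differences across features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def first_nearest_neighbor(point):
    """Label of the single closest labeled sample (K = 1)."""
    _, label = min(labeled, key=lambda row: euclidean(row[0], point))
    return label

manual_preds = [first_nearest_neighbor(obs) for obs in unlabeled]
print(manual_preds)  # ['virginica', 'virginica', 'setosa']
```

The computed labels match the ones we assigned by eye.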

## Validate Model

**Evaluate on Out-of-Sample data**

Compute the accuracy of the model when run on the validate dataset. 

In [None]:
# predict on X_validate 
y_pred = knn.predict(X_validate)
y_pred

In [None]:
print('Accuracy of KNN classifier on validate set: {:.2f}'
     .format(knn.score(X_validate, y_validate)))

## What is the best value of k?

In [1]:
metrics = []

# loop through different values of k
for k in range(1, 20):
            
    # define the thing
    knn = KNeighborsClassifier(n_neighbors=k)
    
    # fit the thing (remember: only fit on training data)
    knn.fit(X_train, y_train)
    
    # use the thing (calculate accuracy)
    train_accuracy = knn.score(X_train, y_train)
    validate_accuracy = knn.score(X_validate, y_validate)
    
    output = {
        "k": k,
        "train_accuracy": train_accuracy,
        "validate_accuracy": validate_accuracy
    }
    
    metrics.append(output)

# make a dataframe
results = pd.DataFrame(metrics)

# plot the data
results.set_index('k').plot(figsize = (16,9))
plt.ylim(0.90, 1)
plt.ylabel('Accuracy')
plt.xticks(np.arange(0,20,1))
plt.grid()
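
Besides reading the plot, we can pick a value of k programmatically. One reasonable rule (an assumption, not the only sensible choice) is to take the highest validate accuracy and break ties by the smallest gap between train and validate accuracy. The function `pick_best_k` is hypothetical, and the numbers below are made up to show the shape of the input; they are not results from this notebook.

```python
def pick_best_k(metrics):
    """Choose k with the highest validate accuracy.

    Ties go to the smallest gap between train and validate accuracy.
    This tie-break rule is an assumption, not the only sensible choice.
    """
    best = max(
        metrics,
        key=lambda m: (
            m['validate_accuracy'],
            -abs(m['train_accuracy'] - m['validate_accuracy']),
        ),
    )
    return best['k']

# Made-up numbers in the same shape as `metrics` above, for illustration only
example_metrics = [
    {'k': 1, 'train_accuracy': 1.00, 'validate_accuracy': 0.94},
    {'k': 5, 'train_accuracy': 0.97, 'validate_accuracy': 0.97},
    {'k': 9, 'train_accuracy': 0.96, 'validate_accuracy': 0.97},
]
print(pick_best_k(example_metrics))
```

Choosing on `validate` rather than `train` matters here: k = 1 scores perfectly on the data it memorized, which says little about new data.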


## Exercises

Continue working in your `model` file with the titanic dataset. 

1. Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample)

1. Evaluate your results using the model score, confusion matrix, and classification report.

1. Print and clearly label the following:  Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

1. Run through steps 1-3 setting k to 10

1. Run through steps 1-3 setting k to 20

1. What are the differences in the evaluation metrics?  Which performs better on your in-sample data? Why?

1. Which model performs best on our out-of-sample data from `validate`?
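
For the rates listed in exercise 3, here is a sketch of how each one falls out of the four cells of a binary confusion matrix. The function `binary_rates` and the counts are hypothetical, not results from the titanic data. For binary labels, scikit-learn's documentation shows unpacking the cells with `tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()`, which is one way to get these inputs.

```python
def binary_rates(tp, fp, tn, fn):
    """Rates derived from the four cells of a binary confusion matrix."""
    return {
        'accuracy': (tp + tn) / (tp + fp + tn + fn),
        'true_positive_rate': tp / (tp + fn),   # same as recall
        'false_positive_rate': fp / (fp + tn),
        'true_negative_rate': tn / (tn + fp),
        'false_negative_rate': fn / (fn + tp),
        'precision': tp / (tp + fp),
    }

# Hypothetical counts for illustration only
rates = binary_rates(tp=40, fp=10, tn=80, fn=20)
for name, value in rates.items():
    print(f'{name}: {value:.2f}')
```

A handy sanity check: the true positive and false negative rates sum to 1, as do the true negative and false positive rates.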