## Tell me who your neighbors are, I will tell you who you are! 

Who would have thought that machine learning algorithms could be as simple as that! KNN follows this simple analogy. K-Nearest Neighbours is a supervised machine learning algorithm that can be used for classification or regression, but is most commonly used for classification. 
To make a prediction, the KNN algorithm doesn't fit a predictive model to a training dataset the way logistic or linear regression does. Take a moment and let that sink in!

*In KNN, there is no actual learning phase*

This is one of the reasons why it's also known as the lazy learner algorithm.

**Then how does KNN make a prediction?**

It uses the entire dataset and classifies a new data point based on similarity. That means when we get an observation that isn't part of the dataset, the algorithm looks for the K data points closest to that observation. It then uses the outputs of these K instances, or neighbors, to make the prediction.

**How does it work in case of regression and classification?**

If KNN is used for a regression problem, the mean (or median) of the target variables of the K closest observations will be used for predictions. 
If KNN is used for a classification problem, it’s the mode of the target class of the K closest observations that will be used for predictions.
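
The two prediction rules above can be sketched from scratch as follows. This is only a minimal illustration, not sklearn's implementation: the helper name `knn_predict`, the toy arrays, and the choice of Euclidean distance are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one point from its k nearest training points (hypothetical helper)."""
    X_train = np.asarray(X_train, dtype=float)
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    if task == "regression":
        # Regression: average the neighbours' target values
        return float(np.mean(neighbor_targets))
    # Classification: most common class among the neighbours (the mode)
    return Counter(neighbor_targets).most_common(1)[0][0]


# Toy data (made up for illustration): two small clusters
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_class = ["no", "no", "no", "yes", "yes", "yes"]
y_value = [10.0, 12.0, 11.0, 50.0, 52.0, 48.0]

label = knn_predict(X_toy, y_class, [1.1, 1.0], k=3)
value = knn_predict(X_toy, y_value, [5.1, 5.0], k=3, task="regression")
print(label, value)
```

Notice that nothing is "fitted" here: every call measures the distance to all stored points, which is exactly the lazy behaviour described earlier.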

![KNNImage](https://static.javatpoint.com/tutorial/machine-learning/images/k-nearest-neighbor-algorithm-for-machine-learning2.png)

And now, you know how KNN works! You've mastered another supervised machine learning algorithm!
Uh, no, not yet... there's more to know, hang on for a bit, yeah?

So far we've understood how KNN is supposed to work. Let's now understand the components that go into the algorithm:

* Calculating Similarity - what facets do I consider when deciding whether A is similar to B? 
* K i.e., how many neighbours should I look at for similarities before I make a prediction?


### Calculating Similarity:

Similarity is measured as the distance between the points we know and the point we want to classify. Hence, in KNN, we calculate the distance of the test point from all the observations in the training dataset and then find its K nearest neighbors. This is done for every test point, and that is how the algorithm finds similarities in the data. 
To get the most out of KNN, we need to make sure the distance metric provided to it is the right one. 

This [Medium article](https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d) nicely explains the various distance metrics available, and the paper [Effects of Distance Measure Choice on KNN Classifier Performance - A Review](https://arxiv.org/pdf/1708.04321.pdf) explains how the choice of distance metric can affect the model's performance. 

In this notebook, we're going to understand these components using the sklearn library. In particular, we'll look at how selecting the right distance metric affects the model's performance metric.

### K:

This is an important parameter, so it needs to be chosen carefully. To select the value of K that suits your data, we run the KNN algorithm multiple times with different values of K. We then choose the K that reduces the number of errors while maintaining the algorithm's ability to make correct predictions on new data.

Here's where things get tricky:

Decreasing the value of K makes our predictions less stable. Why do we say that? Consider K = 1. KNN is a discriminative algorithm since it models the conditional probability of a sample belonging to a given class. With 1 neighbour, you estimate that probability for the test point from a single sample: its closest neighbor. This is very sensitive to all sorts of distortions like noise, outliers, mislabelled data, and so on.

Conversely, as we increase the value of K, our predictions become more stable thanks to majority voting or averaging, so we are more likely to make correct predictions, up to a certain point. Beyond that point, we start to see an increasing number of errors, because the neighbourhood grows so large that it includes points that are not really similar to the observation. That is when we know we've pushed the value of K too far.

Hence it's necessary to make sure we pick the right value for K, keeping the errors in check. In the notebook, we'll see how to check the errors at various values of K and pick the one that's best.
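
As a rough preview of that idea, here is a small sketch of sweeping K and keeping the value with the lowest error. Two things are assumptions made for this example and are not what the rest of the notebook does: the data is a synthetic stand-in generated with `make_classification`, and 5-fold cross-validation is swapped in for the single train/test split used later.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: NOT the bank-marketing dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)

k_values = list(range(1, 16))
cv_error = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds, converted to an error rate
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_error.append(1.0 - scores.mean())

# Look the best K up in k_values rather than using the raw list index,
# since the list index starts at 0 while K starts at 1
best_k = k_values[int(np.argmin(cv_error))]
print("Lowest cross-validated error:", min(cv_error), "at K =", best_k)
```

Averaging over several folds makes the chosen K less dependent on one particular split, at the price of fitting and scoring the model several times per value of K.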

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

In [None]:
df = pd.read_csv('../input/bank-marketing/bank-additional-full.csv', delimiter=';')

The purpose of this notebook is to understand KNN. So to make things simpler, let's only consider the below feature set:
    
* age               
* job               
* marital           
* education         
* default   - has credit in default? (categorical: 'no', 'yes', 'unknown')        
* housing - has housing loan? (categorical: 'no', 'yes', 'unknown')  
* loan  - has personal loan? (categorical: 'no', 'yes', 'unknown')            
* campaign - number of contacts performed during this campaign and for this client (numeric, includes last contact)    
* pdays - number of days that passed by after the client was last contacted (numeric; 999 means client not previously contacted) 
* previous -  number of contacts performed before this campaign and for this client (numeric)         
* poutcome - outcome of the previous marketing campaign (categorical:'failure','nonexistent','success')         

In [None]:
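# .copy() gives an independent DataFrame, so the in-place encoding later
# does not trigger pandas' SettingWithCopyWarning
df = df[['age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'campaign', 'pdays',
         'previous', 'poutcome', 'y']].copy()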
df.info()

Based on these features, the target `y` answers the question: has the client subscribed to a term deposit? For this case, the model performance metric we'll be looking at is accuracy. But keep in mind that you can change the metric to match what your problem statement needs the solution to be optimized for, and view the graphs accordingly.


We see a few categorical features in our dataset. We'll need to encode them before we scale the data and feed it to the model.

In [None]:
### Encoding Categorical Features

objfeatures = df.select_dtypes(include="object").columns
le = preprocessing.LabelEncoder()

for feat in objfeatures:
    df[feat] = le.fit_transform(df[feat].astype(str))

In [None]:
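### Normalize Data
# Separate the features from the target; the keyword form `columns=` is used
# because the positional axis argument is not accepted by recent pandas versions
X = df.drop(columns='y')
y = df['y']

# Standardize each feature to zero mean and unit variance so that
# no single feature dominates the distance calculation
X = preprocessing.StandardScaler().fit_transform(X.astype(int))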

In [None]:
### train/test split and train

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

We're going to run the experiment changing the below two parameters of sklearn's KNN:

* k
* metric

We're going to keep `weights='distance'`, which weights points by the inverse of their distance. In this case, closer neighbors of a query point have a greater influence than neighbors that are further away.
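
To see what this setting changes, here is a tiny sketch on made-up one-dimensional data (the arrays and variable names are assumptions for illustration only). One class-0 point sits close to the query and two class-1 points sit further away, so plain majority voting and inverse-distance voting disagree.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data: one class-0 point near the query, two class-1 points further away
X_toy = np.array([[0.0], [1.9], [2.1]])
y_toy = np.array([0, 1, 1])
query = np.array([[0.5]])

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
distance_knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

# Uniform: every neighbour gets one vote, so class 1 wins 2 to 1
uniform_pred = int(uniform_knn.predict(query)[0])
# Distance: votes are 1/distance, so class 0 gets 1/0.5 = 2.0
# while class 1 gets 1/1.4 + 1/1.6, which is about 1.34
distance_pred = int(distance_knn.predict(query)[0])
print("uniform:", uniform_pred, "distance:", distance_pred)
```

With inverse-distance weights, a single very close neighbour can outvote several distant ones, which is the behaviour we rely on in the experiments below.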

Note that the default distance metric in sklearn is Minkowski, a generalized distance metric. It takes a power parameter `p`, whose default is 2, which means that the default distance metric is Euclidean. Here's what the distance metric becomes as we change the value of `p`:

p = 1, Manhattan Distance

p = 2, Euclidean Distance

p = ∞, Chebyshev Distance
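
The three cases can be checked by hand with a small sketch of the Minkowski formula, d(a, b) = (Σ |aᵢ − bᵢ|^p)^(1/p). The helper `minkowski_distance` is a hypothetical function written for this illustration, not part of sklearn.

```python
import numpy as np


def minkowski_distance(a, b, p):
    """Minkowski distance between two vectors; p=np.inf gives the Chebyshev limit."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    if np.isinf(p):
        # As p grows, the largest coordinate difference dominates the sum
        return float(diff.max())
    return float((diff ** p).sum() ** (1.0 / p))


a, b = [0.0, 0.0], [3.0, 4.0]
d_manhattan = minkowski_distance(a, b, 1)       # |3| + |4|
d_euclidean = minkowski_distance(a, b, 2)       # sqrt(3^2 + 4^2)
d_chebyshev = minkowski_distance(a, b, np.inf)  # max(|3|, |4|)
print(d_manhattan, d_euclidean, d_chebyshev)
```

In sklearn, `p=1` and `p=2` are set through the `p` argument of `KNeighborsClassifier`, while the Chebyshev distance is selected with `metric='chebyshev'`.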



### Task 1 -  Use Euclidean Distance and Find Optimum K

We find the value of K, keeping the distance metric as Euclidean and calculating the error at each value of K. 

In [None]:
error_rate = []
for i in range(1,40):
    knn = KNeighborsClassifier(weights='distance', n_neighbors=i).fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='black', linestyle='dashed', 
         marker='o',markerfacecolor='yellow', markersize=6)
plt.title('Error Rate vs. K Value with Distance Metric: Euclidean')
plt.xlabel('K')
plt.ylabel('Error Rate')
# K starts at 1 but list indices start at 0, so add 1 to report the right K
print("Minimum error:", min(error_rate), "at K =", error_rate.index(min(error_rate)) + 1)

In [None]:
acc = []
# Will take some time
for i in range(1,40):
    neigh = KNeighborsClassifier(weights='distance', n_neighbors=i).fit(X_train,y_train)
    yhat = neigh.predict(X_test)
    acc.append(metrics.accuracy_score(y_test, yhat))
    
plt.figure(figsize=(10,6))
plt.plot(range(1,40),acc,color='black', linestyle='dashed', 
         marker='o',markerfacecolor='yellow', markersize=6)
plt.title('Accuracy vs. K Value with Distance Metric: Euclidean')
plt.xlabel('K')
plt.ylabel('Accuracy')
# K starts at 1 but list indices start at 0, so add 1 to report the right K
print("Maximum accuracy:", max(acc), "at K =", acc.index(max(acc)) + 1)

Here we see that with K=30, that is, taking the 30 closest points or neighbours, we get a reasonable accuracy of 88% on the test set.

Let's repeat this exercise with Manhattan Distance.

### Task 2 - Use Manhattan Distance and Find Optimum K

In [None]:
error_rate = []
for i in range(1,40):
    knn = KNeighborsClassifier(weights='distance', n_neighbors=i, p=1).fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='black', linestyle='dashed', 
         marker='o',markerfacecolor='yellow', markersize=6)
plt.title('Error Rate vs. K Value with Distance Metric: Manhattan')
plt.xlabel('K')
plt.ylabel('Error Rate')
# K starts at 1 but list indices start at 0, so add 1 to report the right K
print("Minimum error:", min(error_rate), "at K =", error_rate.index(min(error_rate)) + 1)

In [None]:
acc = []
for i in range(1,40):
    neigh = KNeighborsClassifier(weights='distance', n_neighbors=i, p=1).fit(X_train,y_train)
    yhat = neigh.predict(X_test)
    acc.append(metrics.accuracy_score(y_test, yhat))
    
plt.figure(figsize=(10,6))
plt.plot(range(1,40),acc,color='black', linestyle='dashed', 
         marker='o',markerfacecolor='yellow', markersize=6)
plt.title('Accuracy vs. K Value with Distance Metric: Manhattan')
plt.xlabel('K')
plt.ylabel('Accuracy')
# K starts at 1 but list indices start at 0, so add 1 to report the right K
print("Maximum accuracy:", max(acc), "at K =", acc.index(max(acc)) + 1)

This shows us that the value of K is crucial and that the best K depends on the distance metric. It is always better to choose the distance metric that is best suited to the dataset at hand.
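
Since K and the distance metric interact, one option is to search over both at once. The sketch below uses sklearn's `GridSearchCV` for this. It is an alternative to the manual loops above, not the method this notebook follows, and the synthetic data and the grid ranges are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: NOT the bank-marketing dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)

# Try every combination of K and Minkowski power (p=1 Manhattan, p=2 Euclidean)
param_grid = {"n_neighbors": list(range(1, 16)), "p": [1, 2]}
search = GridSearchCV(
    KNeighborsClassifier(weights="distance"),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
print("Best parameters:", best_params, "with accuracy", search.best_score_)
```

The reported K here is the actual `n_neighbors` value, so there is no index-to-K conversion to get wrong.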

Now we have a clear understanding of KNN and also know how to work with sklearn's version. Let's look at the pros and cons of KNN before we sign off!

**Pros:**

* It is extremely easy to implement and to understand
* It requires no training phase before making predictions, so there is essentially no fitting cost compared with algorithms that must train a model first. The work is deferred to prediction time instead.
* There are only two parameters required to implement KNN i.e. the value of K and the distance function 

**Cons:**

* The KNN algorithm doesn't work well with high-dimensional data or large datasets, because calculating the distance to every data point across every dimension becomes expensive.
* Also, as it doesn't train a model, it keeps all the observations in memory to be able to make its predictions, which is a challenge when working with large datasets.

Happy Learning!