#### K-Nearest Neighbors

The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near each other.

![k-Nearest-Neighbour](./img/knn.png)

Notice in the image above that most of the time, similar data points are close to each other. 

The KNN algorithm hinges on this assumption being true enough for the algorithm to be useful. 

KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) with some mathematics we might have learned in our childhood: calculating the distance between points on a graph.
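As a concrete illustration of "distance between points on a graph", here is a minimal sketch of the straight-line (Euclidean) distance. The helper name `euclidean_distance` and the two sample points are hypothetical and chosen only for illustration. Euclidean distance is one common choice, not the only metric KNN can use.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two points given as coordinate sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Square the per-coordinate differences, sum them, and take the square root.
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Two hypothetical points on a 2-D graph.
p = (1.0, 2.0)
q = (4.0, 6.0)

d = euclidean_distance(p, q)
print(d)  # 5.0, since the differences (3, 4) form a 3-4-5 right triangle
```

The smaller this number, the more "similar" KNN considers the two points.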

#### When do we use the KNN algorithm?

KNN can be used for both classification and regression predictive problems. 

However, in industry it is more widely used for classification problems.
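To make the two use cases concrete, the sketch below fits scikit-learn's `KNeighborsClassifier` and `KNeighborsRegressor` on the same tiny dataset. The data, the labels `"low"`/`"high"`, and the query point are hypothetical and invented for this example.

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical one-feature training data (not from the wine dataset).
X_train = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
labels = ["low", "low", "low", "high", "high", "high"]  # class targets
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]               # numeric targets

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, labels)
reg = KNeighborsRegressor(n_neighbors=3).fit(X_train, values)

query = [[2.5]]
predicted_label = clf.predict(query)[0]  # majority class among the 3 nearest points
predicted_value = reg.predict(query)[0]  # mean target of the 3 nearest points

print(predicted_label, predicted_value)
```

With these default settings, the classifier takes a vote among the neighbors while the regressor averages their target values. For the query `2.5` the three nearest training points are `2.0`, `3.0`, and `1.0`.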

##### How does the KNN algorithm work?

Let’s take a simple case to understand this algorithm. The following is a spread of red circles (RC) and green squares (GS):

![k-Nearest-Neighbour](./img/knn-scenario1.png)

You intend to find out the class of the blue star (BS). BS can be either RC or GS and nothing else.

The “K” in the KNN algorithm is the number of nearest neighbors we wish to take a vote from. Let’s say K = 3. Hence, we will now draw a circle with BS as the center, just big enough to enclose only three data points on the plane. Refer to the following diagram for more details:

![k-Nearest-Neighbour](./img/knn-scenario2.png)

The three closest points to BS are all RC. Hence, with a good level of confidence, we can say that BS should belong to the class RC. Here, the choice is obvious because all three votes from the closest neighbors went to RC. The choice of the parameter K is crucial in this algorithm. Next, we will look at the factors to consider when choosing the best K.
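The voting procedure just described can be sketched in a few lines of plain Python. The coordinates below are hypothetical and are not read from the figures. The function name `knn_vote` is also an assumption made for this illustration, not part of any library.

```python
import math
from collections import Counter

def knn_vote(points, labels, query, k):
    """Return the majority label among the k points nearest to `query`."""
    # Rank training points by Euclidean distance to the query point.
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    # Collect the labels of the k closest points: these are the "votes".
    votes = [labels[i] for i in order[:k]]
    winner = Counter(votes).most_common(1)[0][0]
    return winner, votes

# Hypothetical red circles (RC) and green squares (GS).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (5.0, 5.0), (5.5, 4.5), (6.0, 5.5)]
labels = ["RC", "RC", "RC", "GS", "GS", "GS"]

blue_star = (2.0, 2.0)
winner, votes = knn_vote(points, labels, blue_star, k=3)
print(winner, votes)  # RC ['RC', 'RC', 'RC']
```

An odd K is often preferred for two-class problems because it avoids tied votes. This sketch does not handle ties in any special way.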

In [4]:
#import data
import pandas as pd

df = pd.read_csv('./data/winequality-red.csv')
df.head()

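| Unnamed: 0 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |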


In [31]:
#import knn

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

#define model
knn = KNeighborsRegressor(n_neighbors=4)

In [32]:
#assign x and y

y = df.quality
x = df.drop('quality',axis=1)

#split train and test
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

#fit the model
knn.fit(x_train,y_train)

KNeighborsRegressor(n_neighbors=4)

In [33]:
preds = knn.predict(x_test)

In [35]:
preds

array([5.75, 5.  , 6.25, 5.5 , 6.  , 5.5 , 5.25, 5.25, 5.5 , 5.5 , 7.25,
       4.75, 6.  , 6.  , 5.75, 7.  , 5.5 , 5.5 , 6.5 , 5.5 , 5.75, 5.75,
       6.  , 6.  , 5.75, 6.  , 6.  , 5.5 , 5.  , 6.  , 5.  , 5.25, 5.25,
       5.5 , 5.5 , 5.  , 6.  , 5.5 , 5.75, 5.5 , 5.75, 5.  , 5.5 , 5.  ,
       6.  , 5.5 , 6.5 , 6.  , 5.  , 5.25, 5.  , 5.25, 5.25, 6.75, 5.  ,
       5.25, 6.25, 6.25, 6.  , 5.5 , 5.75, 6.75, 5.75, 5.  , 5.5 , 5.75,
       6.25, 5.25, 5.5 , 5.5 , 6.  , 5.  , 5.75, 5.5 , 6.  , 5.25, 5.25,
       5.  , 5.25, 5.25, 5.25, 6.25, 6.  , 5.25, 5.75, 5.25, 5.25, 5.5 ,
       6.  , 5.  , 5.75, 5.25, 5.  , 5.5 , 5.  , 5.75, 5.5 , 5.75, 5.5 ,
       5.25, 5.25, 5.  , 5.75, 6.  , 6.  , 6.25, 5.75, 5.75, 5.  , 5.  ,
       6.25, 5.5 , 6.25, 5.  , 6.25, 6.25, 6.  , 5.5 , 5.  , 6.5 , 5.25,
       5.25, 5.75, 5.25, 5.  , 5.5 , 5.75, 5.5 , 5.  , 6.  , 5.75, 5.75,
       5.5 , 6.25, 5.25, 5.75, 5.25, 5.5 , 5.5 , 6.  , 5.75, 5.25, 6.5 ,
       5.25, 5.5 , 5.25, 6.  , 6.25, 5.5 , 5.  , 5.
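
The predictions shown above are all multiples of 0.25. That is what one would expect when a regressor with `n_neighbors=4` and the default uniform weighting averages four integer targets. The sketch below checks this behaviour on a small hypothetical dataset (not the wine data) by comparing `predict` with a manual average over the neighbors returned by `kneighbors`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical one-feature data with integer targets, standing in for quality scores.
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y = np.array([5, 6, 6, 6, 8])

model = KNeighborsRegressor(n_neighbors=4).fit(X, y)
query = np.array([[1.4]])

# kneighbors returns the distances and row indices of the 4 nearest training points.
_, idx = model.kneighbors(query)
manual = y[idx[0]].mean()        # average the neighbors' targets by hand
pred = model.predict(query)[0]   # the model's own prediction

print(manual, pred)  # 5.75 5.75
```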

In [34]:
mean_absolute_error(y_test,preds)

0.584375
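
The error above comes from a single choice, `n_neighbors=4`. One common way to compare candidate values of K is cross-validation. The sketch below assumes synthetic data from `make_regression` rather than the wine dataset, and the candidate list `(1, 2, 4, 8, 16)` is arbitrary. It illustrates the approach and does not claim which K is best for the data used earlier.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption; replace with your own x and y).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

mae_by_k = {}
for k in (1, 2, 4, 8, 16):
    model = KNeighborsRegressor(n_neighbors=k)
    # scikit-learn reports negated MAE so that larger is always better.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    mae_by_k[k] = -scores.mean()

# Pick the K with the lowest average MAE across the 5 folds.
best_k = min(mae_by_k, key=mae_by_k.get)
print(mae_by_k, best_k)
```

A very small K tends to follow noise in the training data, while a very large K smooths predictions toward the overall mean, so the lowest cross-validated error usually lies somewhere in between.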