# 8 - K-Nearest Neighbors (KNN)

**K-Nearest Neighbors (K-NN) is one of the simplest machine learning algorithms and is based on the supervised learning technique.**


**The K-NN algorithm assumes that similar cases lie close to each other: it compares a new case with the available cases and puts it into the category whose members it most resembles.**


**The K-NN algorithm stores all the available data and classifies a new data point based on its similarity to the stored points. This means that when new data appears, it can be assigned to a well-suited category using the K-NN algorithm.**


**The K-NN algorithm can be used for regression as well as classification, but it is mostly used for classification problems.**

Step-1: Select the number K of neighbors.

Step-2: Calculate the Euclidean distance from the new data point to every point in the training data.

Step-3: Take the K nearest neighbors according to the calculated Euclidean distances.

Step-4: Among these K neighbors, count the number of data points in each category.

Step-5: Assign the new data point to the category that has the largest number of neighbors.

Step-6: Completed.
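
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation scikit-learn uses; the function name `knn_classify` and the toy data are made up for this example.

```python
from collections import Counter

import numpy as np


def knn_classify(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]

    # Steps 4-5: count the labels among the neighbors, return the most common
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters
X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]]
y_toy = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(X_toy, y_toy, [2.0, 2.0], k=3))  # A
print(knn_classify(X_toy, y_toy, [8.0, 9.0], k=3))  # B
```

An odd K is often chosen for two-class problems so that the vote cannot end in a tie.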

**NOTE: For regression, the new data point is not assigned to a category. Its estimated value is the average of the target values of the k nearest points around it.**
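
For the regression case, the same neighbor search applies and only the last step changes. The sketch below assumes unweighted averaging; `knn_regress` and the toy values are hypothetical names and data chosen for illustration.

```python
import numpy as np


def knn_regress(X_train, y_train, x_new, k=3):
    """Predict a value for x_new as the mean target of its k nearest points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)

    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)

    # Average the targets of the k closest points
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()


# Toy data (hypothetical): one feature, one target
X_toy = [[1.0], [2.0], [3.0], [10.0]]
y_toy = [100.0, 200.0, 300.0, 1000.0]

# The three nearest points to 2.2 are 2.0, 3.0 and 1.0
print(knn_regress(X_toy, y_toy, [2.2], k=3))  # 200.0
```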

# Get dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("Hitters.csv")
data = df.copy()
data.head()

Unnamed: 0,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,League,Division,PutOuts,Assists,Errors,Salary,NewLeague
0,293,66,1,30,29,14,1,293,66,1,30,29,14,A,E,446,33,20,,A
1,315,81,7,24,38,39,14,3449,835,69,321,414,375,N,W,632,43,10,475.0,N
2,479,130,18,66,72,76,3,1624,457,63,224,266,263,A,W,880,82,14,480.0,A
3,496,141,20,65,78,37,11,5628,1575,225,828,838,354,N,E,200,11,3,500.0,N
4,321,87,10,39,42,30,2,396,101,12,48,46,33,N,E,805,40,4,91.5,N


In [2]:
# Check null values
data.isnull().values.any()

True

In [3]:
# Clear null values
data = data.dropna()
data.isnull().values.any()

False

# Create Dummy Variables

In [4]:
dummies = pd.get_dummies(data[["League","Division","NewLeague"]])
dummies.head()

Unnamed: 0,League_A,League_N,Division_E,Division_W,NewLeague_A,NewLeague_N
1,0,1,0,1,0,1
2,1,0,0,1,1,0
3,0,1,1,0,0,1
4,0,1,1,0,0,1
5,1,0,0,1,1,0


In [5]:
y = data["Salary"]

X_pre = data.drop(["Salary","League","Division","NewLeague"],axis=1).astype("float64")
X = pd.concat([X_pre,dummies[["League_N","Division_W","NewLeague_N"]]],axis=1)
X.head()

Unnamed: 0,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,PutOuts,Assists,Errors,League_N,Division_W,NewLeague_N
1,315.0,81.0,7.0,24.0,38.0,39.0,14.0,3449.0,835.0,69.0,321.0,414.0,375.0,632.0,43.0,10.0,1,1,1
2,479.0,130.0,18.0,66.0,72.0,76.0,3.0,1624.0,457.0,63.0,224.0,266.0,263.0,880.0,82.0,14.0,0,1,0
3,496.0,141.0,20.0,65.0,78.0,37.0,11.0,5628.0,1575.0,225.0,828.0,838.0,354.0,200.0,11.0,3.0,1,0,1
4,321.0,87.0,10.0,39.0,42.0,30.0,2.0,396.0,101.0,12.0,48.0,46.0,33.0,805.0,40.0,4.0,1,0,1
5,594.0,169.0,4.0,74.0,51.0,35.0,11.0,4408.0,1133.0,19.0,501.0,336.0,194.0,282.0,421.0,25.0,0,1,0
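
Only one dummy column per categorical variable is kept above, because each variable has two levels and the second column is just the complement of the first. As a possible shortcut, `pd.get_dummies` accepts `drop_first=True`, which drops the first level of each variable. The sketch below uses a small made-up frame rather than the Hitters data.

```python
import pandas as pd

# Toy frame (hypothetical) with the same three categorical columns
toy = pd.DataFrame({
    "League": ["A", "N", "N"],
    "Division": ["E", "W", "E"],
    "NewLeague": ["A", "N", "A"],
})

full = pd.get_dummies(toy)                      # two columns per variable
reduced = pd.get_dummies(toy, drop_first=True)  # one column per variable

print(full.shape[1])          # 6
print(list(reduced.columns))  # ['League_N', 'Division_W', 'NewLeague_N']
```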


# Split train and test

In [6]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=33)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(210, 19)
(210,)
(53, 19)
(53,)


# Import model

In [7]:
from sklearn.neighbors import KNeighborsRegressor

knn_model = KNeighborsRegressor().fit(X_train,y_train)
print(f"Default k = {knn_model.n_neighbors}")

Default k = 5
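
Because KNN relies on Euclidean distance, features measured on large scales (such as `CAtBat`) contribute far more to the distance than 0/1 dummy columns. The model above is fitted on unscaled features. One common option, shown here only as a sketch on synthetic data, is to standardize the features inside a pipeline; whether this helps on the Hitters data is not tested here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): one small-scale and one large-scale feature
X_demo = np.column_stack([
    rng.normal(size=100),
    rng.normal(scale=1000.0, size=100),
])
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)

# The scaler is fitted on the training data only, then KNN sees scaled features
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
scaled_knn.fit(X_demo, y_demo)

print(scaled_knn.predict(X_demo[:3]).shape)  # (3,)
```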


# Prediction

In [8]:
from sklearn.metrics import mean_squared_error

y_pred = knn_model.predict(X_test)
mse = mean_squared_error(y_test,y_pred)
rmse = np.sqrt(mse)

print(f"MSE Loss Value = {mse}")
print(f"RMSE Loss Value = {rmse}")

MSE Loss Value = 54560.525481256605
RMSE Loss Value = 233.58194596598557
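
To make the two metrics concrete, the sketch below computes them by hand on three made-up values. MSE is the mean of the squared errors, and RMSE is its square root, which puts the error back in the units of the target (here, the units of `Salary`).

```python
import numpy as np

# Made-up true and predicted values for illustration
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])

# Squared errors are 100, 100 and 900, so the mean is 1100 / 3
mse_manual = np.mean((y_true - y_hat) ** 2)
rmse_manual = np.sqrt(mse_manual)

print(mse_manual)
print(rmse_manual)
```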


# Model Tuning

In [10]:
# Hyperparameter --> k : number of neighbors

from sklearn.model_selection import GridSearchCV

knn = KNeighborsRegressor()

knn_params = {"n_neighbors":np.arange(1,25,1)}

# 10-fold cross-validation; with no `scoring` argument, candidates are ranked by the regressor's default score (R^2)

knn_cv_model = GridSearchCV(knn,knn_params,cv=10)
knn_cv_model.fit(X_train,y_train)

GridSearchCV(cv=10, estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24])})
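
The search above ranks each k by the estimator's default score, which for a regressor is R^2. If the goal is to select k by the same metric that is reported later, one possible variant is to pass `scoring="neg_root_mean_squared_error"`. The sketch below runs on synthetic data from `make_regression`, so its numbers say nothing about the Hitters results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (hypothetical stand-in for X_train, y_train)
X_demo, y_demo = make_regression(
    n_samples=120, n_features=4, noise=10.0, random_state=0
)

search = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": np.arange(1, 11)},
    cv=5,
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
)
search.fit(X_demo, y_demo)

print(search.best_params_["n_neighbors"])  # k with the lowest cross-validated RMSE
print(-search.best_score_)                 # that RMSE, with the sign flipped back
```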

In [13]:
knn_cv_model.best_params_
print(f"Optimum k = {knn_cv_model.best_params_['n_neighbors']}")

Optimum k = 8


# Tuned Model

In [14]:
knn_tuned = KNeighborsRegressor(n_neighbors=8).fit(X_train,y_train)

# Prediction with tuned model
y_pred_tuned = knn_tuned.predict(X_test)
mse_tuned = mean_squared_error(y_test,y_pred_tuned)
rmse_tuned = np.sqrt(mse_tuned)

print(f"MSE Tuned Loss Value = {mse_tuned}")
print(f"RMSE Tuned Loss Value = {rmse_tuned}")

MSE Tuned Loss Value = 54156.66461550295
RMSE Tuned Loss Value = 232.71584521794588
