The **k-nearest neighbors (KNN)** algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both **classification** and **regression** problems. The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other. 
KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) with some mathematics we might have learned in our childhood: calculating the distance between points on a graph.
There are other ways of calculating distance, and one way might be preferable depending on the problem we are solving. However, the straight-line distance (also called the Euclidean distance) is a popular and familiar choice.
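
To make the idea concrete, here is a minimal sketch of two distance measures and of how a KNN regressor could turn distances into a prediction. The function names (`euclidean_distance`, `manhattan_distance`, `knn_predict`) and the toy points are my own assumptions for illustration; this is not how scikit-learn implements it internally.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (one alternative to Euclidean)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k=3):
    """Average the targets of the k training points closest to the query."""
    # Sort training indices by their Euclidean distance to the query point.
    order = sorted(range(len(train_X)),
                   key=lambda i: euclidean_distance(train_X[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

# Toy data, invented for illustration only.
toy_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
toy_y = [1.0, 2.0, 3.0, 100.0]

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(manhattan_distance([0.0, 0.0], [3.0, 4.0]))  # 7.0
print(knn_predict(toy_X, toy_y, [0.1, 0.1], k=3))  # 2.0
```

The far-away point with target 100 does not influence the prediction, because only the three nearest neighbors are averaged.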

The data we will use for regression resembles the mtcars data set in form. I have previously worked on artificial neural networks with mtcars data.

[Neural Network - Predict to Acceleration "R Application"](https://www.kaggle.com/hamzatanc/neural-network-arabalar-n-h-zlanmas-n-n-tahmini)

In [None]:
import pandas as pd

cars_data = pd.read_csv("../input/car-data/car_data.csv", index_col = "car_name")

In [None]:
cars_data.shape

The car data set consists of 392 observations and 7 variables. Because the K nearest neighbors algorithm works well on small data sets, the small size of the data is not a concern here.

In [None]:
cars_data.head()

In [None]:
cars_data.describe().T

When I look at the quartiles of the data, I do not notice any unusual concentration of values.

In [None]:
cars_data.groupby("cylinders").count()

In this study, I want to filter by the number of cylinders. For this reason, I will use only the data of the "4 cylinder" vehicles, which are the most frequent group.

In [None]:
cars_data = cars_data[cars_data.cylinders == 4]
cars_data = cars_data.drop("cylinders", axis = 1)
cars_data.head()

In [None]:
cars_data.corr()

Our aim in the study is to predict the **displacement** variable with the KNN model. So "displacement" is our **dependent** variable. We can often tell which variables affect the performance of vehicles from life experience. We can think of this experience as **domain knowledge**, which has an important place in data science. I used the correlation table together with this experience while determining the independent variables. When determining the independent variables, we should make sure that their correlations with the dependent variable are large and that the correlations among the independent variables themselves are small.
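
The selection rule above can be sketched on a small, made-up example. The columns `x1`, `x2`, `x3`, and `target` and the helper `pearson` are hypothetical and only stand in for the real variables; with pandas the same numbers would come from `DataFrame.corr()`.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Made-up columns, for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]      # nearly a copy of x1
x3 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0]   # unrelated to the others
target = [2.1, 3.9, 6.2, 8.0, 9.9, 12.1]

# Correlation of each candidate with the target: larger is better.
for name, col in [("x1", x1), ("x2", x2), ("x3", x3)]:
    print(name, "vs target:", round(pearson(col, target), 3))

# Correlation between candidates: a value near 1 means they carry
# almost the same information, so keeping both adds little.
print("x1 vs x2:", round(pearson(x1, x2), 3))
```

Here `x3` would be dropped for its weak relation to the target, and only one of `x1` and `x2` would be kept because they are almost identical.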

I select the variables which I will use in the K Nearest Neighbors Regression model as "knn_regression_data".

In [None]:
knn_regression_data = cars_data.loc[:,["horsepower","weight", "mpg","displacement"]]
knn_regression_data.head()

**Normalization** is a technique often applied as part of data preparation for machine learning. The goal of normalization is to bring the values of the numeric columns in the dataset onto a common scale, without distorting differences in the ranges of values or losing information. I use a min-max normalizer, which linearly rescales every feature to the [0,1] interval. This is done by shifting the values of each feature so that the minimal value is 0, and then dividing by the new maximal value (which is the difference between the original maximal and minimal values).

![](http://bilgisayarkavramlari.sadievrenseker.com/wp-content/uploads/2012/01/normallesme6.png)
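
As a small worked example of the formula, the sketch below rescales two made-up columns. The helper name `min_max_scale` and the values are assumptions for illustration, and the function assumes the column is not constant (otherwise the denominator would be zero).

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    # Shift so the minimum becomes 0, then divide by the original range.
    return [(v - lo) / (hi - lo) for v in values]

# Made-up columns on very different scales.
weight = [1800.0, 2200.0, 2600.0]
mpg = [20.0, 30.0, 40.0]

print(min_max_scale(weight))  # [0.0, 0.5, 1.0]
print(min_max_scale(mpg))     # [0.0, 0.5, 1.0]
```

After scaling, both columns contribute on the same footing to a distance calculation, which matters for KNN because the unscaled weight values would otherwise dominate.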

In [None]:
import numpy as np

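# Use the DataFrame's own column-wise min/max; np.min on a DataFrame can reduce to a single scalar in newer pandas
knn_regression_data = (knn_regression_data - knn_regression_data.min())/(knn_regression_data.max() - knn_regression_data.min())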
knn_regression_data.describe().T

When we examine the distribution of our normalized data, we see that the minimum value is equal to 0 and the maximum value is equal to 1.

The data to be used in the regression model should be numerical. When we look at the types of the variables with "dtypes", we see that they are all float.

In [None]:
knn_regression_data.dtypes

In [None]:
knn_independent = knn_regression_data.drop("displacement", axis = 1)
knn_dependent = knn_regression_data["displacement"] # the variable to predict

In [None]:
from sklearn.model_selection import train_test_split

independent_train, independent_test, dependent_train, dependent_test = train_test_split(
    knn_independent, 
    knn_dependent, 
    test_size = 0.10, 
    random_state = 20)

In [None]:
from sklearn.neighbors import KNeighborsRegressor

knn_model = KNeighborsRegressor().fit(independent_train, dependent_train)
predicted_values = knn_model.predict(independent_test)

In [None]:
predict_df = pd.DataFrame({"Dependent_Test" : dependent_test, "Dependent_Predicted" : predicted_values})
predict_df.head()

We normalized the data earlier, so the values are in the range 0-1. To see the predictions in the original units, I applied the reverse of the normalization process with the code below.
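
The reverse step multiplies by the original range and adds the original minimum back. A quick round trip on made-up numbers (the names `original`, `scaled`, and `restored` are mine) shows that this recovers the starting values:

```python
# Made-up values in original units, for illustration only.
original = [68.0, 97.0, 156.0]
lo, hi = min(original), max(original)

# Forward: min-max normalization to [0, 1].
scaled = [(v - lo) / (hi - lo) for v in original]

# Reverse: multiply by the range, then add the minimum back.
restored = [s * (hi - lo) + lo for s in scaled]

# The round trip should match up to floating-point rounding.
print(all(abs(r - o) < 1e-9 for r, o in zip(restored, original)))  # True
```

Note that the minimum and maximum must be the ones that were used in the forward step, otherwise the restored values are shifted or stretched.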

In [None]:
predict_df = (predict_df*(np.max(cars_data.displacement) - np.min(cars_data.displacement))) + np.min(cars_data.displacement)
predict_df.head()

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# scikit-learn metrics take the arguments in the order (y_true, y_pred)
mse = mean_squared_error(predict_df.Dependent_Test, predict_df.Dependent_Predicted)
print("Mean Squared Error = ", mse)
print("Root Mean Squared Error = ", np.sqrt(mse))

If we want to examine the success of the model with statistical methods, we can look at the MSE value. In statistics, the mean squared error (MSE) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values. The root mean squared error (RMSE) is its square root and is expressed in the same units as the dependent variable.

![](https://veribilimcisi.files.wordpress.com/2017/07/83buy.png)
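
The formula can be checked by hand on three made-up observations. The names `y_true` and `y_pred` below are illustrative and are not the notebook's variables.

```python
import math

# Made-up actual and predicted values.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4.
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

mse = sum(squared_errors) / len(squared_errors)  # (1 + 0 + 4) / 3
rmse = math.sqrt(mse)

print("MSE :", round(mse, 4))   # 1.6667
print("RMSE:", round(rmse, 4))  # 1.291
```

Because the errors are squared, the single error of 2 contributes four times as much as the error of 1, so MSE penalizes large misses more heavily.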

In [None]:
# r2_score is not symmetric: the true values must come first
r2_score(predict_df.Dependent_Test, predict_df.Dependent_Predicted)

R-squared is at most 1 and is commonly stated as a percentage. An R-squared of 100% means that all of the variation in the dependent variable is explained by the independent variable(s). A value of 0 means the model does no better than always predicting the mean of the observed values, and on test data the value can even be negative when the model does worse than that.
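
To see how R-squared behaves, here is a minimal sketch that computes it from its definition, 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name `r2` and the numbers are my own; scikit-learn's `r2_score(y_true, y_pred)` follows the same definition for a single target.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]

print(r2(y_true, [1.0, 2.0, 3.0]))  # 1.0, perfect predictions
print(r2(y_true, [2.0, 2.0, 2.0]))  # 0.0, always predicting the mean
print(r2(y_true, [1.0, 2.0, 6.0]))  # -3.5, worse than the mean

# Swapping the arguments changes the result, because the total sum of
# squares is computed around the mean of the first argument.
print(round(r2([1.0, 2.0, 6.0], y_true), 3))  # 0.357
```

The last two lines use the same pair of lists and still disagree, which is why the true values should always be passed first.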

## Parameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV
import numpy as np

knn_params = {"n_neighbors" : np.arange(1,11,1)}
knn = KNeighborsRegressor()
knn_cv_model = GridSearchCV(knn, knn_params, cv = 10)
knn_cv_model.fit(independent_train, dependent_train)

In [None]:
knn_cv_model.best_params_["n_neighbors"]

As a result of the parameter tuning process, we determined that the optimum number of neighbors (k) is 9.

In [None]:
knn_model = KNeighborsRegressor(n_neighbors = knn_cv_model.best_params_["n_neighbors"]).fit(independent_train, dependent_train)
predicted_values = knn_model.predict(independent_test)

In [None]:
predict_df = pd.DataFrame({"Dependent_Test" : dependent_test, "Dependent_Predicted" : predicted_values})

In [None]:
predict_df = (predict_df*(np.max(cars_data.displacement) - np.min(cars_data.displacement))) + np.min(cars_data.displacement)
predict_df.head()

In [None]:
print("Mean Squared Error = ", mean_squared_error(predict_df.Dependent_Test, predict_df.Dependent_Predicted))
print("Root Mean Squared Error = ", np.sqrt(mean_squared_error(predict_df.Dependent_Test, predict_df.Dependent_Predicted)))

We see that the MSE value decreases when the optimum parameter is used.

In [None]:
r2_score(predict_df.Dependent_Test, predict_df.Dependent_Predicted)

In [None]:
from sklearn.model_selection import cross_val_score

MSE = []
MSE_CV = []

for k in range(1, 11):
    knn_model = KNeighborsRegressor(n_neighbors = k).fit(independent_train, dependent_train)
    y_pred = knn_model.predict(independent_test)
    # MSE on the held-out test set; scikit-learn expects (y_true, y_pred)
    mse = mean_squared_error(dependent_test, y_pred)
    # 10-fold cross-validated MSE on the training data (the scorer is negated, so flip the sign)
    mse_cv = -1 * cross_val_score(knn_model, independent_train, dependent_train, cv = 10,
                                  scoring = "neg_mean_squared_error").mean()
    MSE.append(mse)
    MSE_CV.append(mse_cv)
    print("k =", k, "MSE :", mse, "MSE_CV:", mse_cv)

In [None]:
import matplotlib.pyplot as plt

plt.plot(np.arange(1,11,1), MSE, label = "Test MSE")
plt.plot(np.arange(1,11,1), MSE_CV, label = "Cross-validated MSE")
plt.xlabel("Value of K for KNN")
plt.ylabel("Mean Squared Error")
plt.legend();