<a href="https://colab.research.google.com/github/sadullahmath/LinearRegresyon/blob/master/K_Nearest_Neighbors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# K-Nearest Neighbors with GridSearchCV to Find the Optimal Number of Neighbors

In [0]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [0]:
# load data
housing_df = pd.read_csv('https://raw.githubusercontent.com/vishalv91/capstoneproject-realestate/master/HousingData.csv')

In [0]:
# drop null values
housing_df = housing_df.dropna()

In [0]:
# declare X and y
X = housing_df.iloc[:,:-1]
y = housing_df.iloc[:, -1]

In [0]:
def regression_model(model):
  """Fit `model` on a random 80/20 split of X and y and print the test RMSE."""
  # Create training and test sets (no random_state, so the split changes on every call)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
  # Fit the regressor to the training data
  model.fit(X_train, y_train)
  # Predict on the test data
  y_pred = model.predict(X_test)
  # Compute and print RMSE
  rmse = np.sqrt(mean_squared_error(y_test, y_pred))
  print("Root Mean Squared Error: {}".format(rmse))

In [6]:
regression_model(LinearRegression())

Root Mean Squared Error: 6.275747570591628


In [7]:
regression_model(LinearRegression())

Root Mean Squared Error: 4.244219810135847


In [8]:
regression_model(LinearRegression())

Root Mean Squared Error: 4.5046996428171


The scores differ because each call splits the data into a different training set and test set. `train_test_split` shuffles the rows randomly, and since no `random_state` is passed, each model is fit on a different training set and scored against a different test set.
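One way to make the split repeatable is to pass a fixed `random_state` to `train_test_split`. The sketch below is only an illustration: it uses synthetic data from `make_regression` as a stand-in for the housing set, and `rmse_for_seed`, `X_demo`, and `y_demo` are hypothetical names that do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the housing data used above)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

def rmse_for_seed(seed):
    """Fit LinearRegression on a split fixed by `seed` and return the test RMSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return np.sqrt(mean_squared_error(y_test, y_pred))

# Same seed -> same split -> same score on every call
first = rmse_for_seed(2)
second = rmse_for_seed(2)
print(np.isclose(first, second))

# A different seed produces a different split, so the score will generally change
print(rmse_for_seed(3))
```

Fixing the seed makes a single run reproducible, but the score still depends on which seed was picked; cross-validation, used in the next cells, averages over several splits instead.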

In [0]:
from sklearn.model_selection import GridSearchCV

In [0]:
# candidate values for n_neighbors: 20 evenly spaced floats from 1.0 to 20.0
neighbors = np.linspace(1, 20, 20)

In [0]:
# n_neighbors must be an integer, so cast the float array
k = neighbors.astype(int)

In [0]:
param_grid = {'n_neighbors': k}

In [0]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()

In [0]:
knn_tuned = GridSearchCV(knn, param_grid, cv=5, scoring='neg_mean_squared_error')

In [16]:
knn_tuned.fit(X, y)

GridSearchCV(cv=5, error_score=nan,
             estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30,
                                           metric='minkowski',
                                           metric_params=None, n_jobs=None,
                                           n_neighbors=5, p=2,
                                           weights='uniform'),
             iid='deprecated', n_jobs=None,
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='neg_mean_squared_error', verbose=0)

In [17]:
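best_params = knn_tuned.best_params_
print("Best n_neighbors: {}".format(best_params))
# best_score_ is the mean negated MSE across folds; flip the sign and take the root to get RMSE
score = knn_tuned.best_score_
rmse = np.sqrt(-score)
print("Best score: {}".format(rmse))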

Best n_neighbors: {'n_neighbors': 7}
Best score: 8.516767055977628
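
Because `scoring='neg_mean_squared_error'` reports negated MSE values, the per-candidate scores stored in `cv_results_` can be converted to RMSE the same way as `best_score_`. The following is a minimal sketch on synthetic data from `make_regression`, not the housing set; `X_demo`, `y_demo`, `demo_grid`, and `search` are names introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption: not the housing data used above)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Same grid as above: n_neighbors from 1 to 20
demo_grid = {'n_neighbors': np.arange(1, 21)}
search = GridSearchCV(KNeighborsRegressor(), demo_grid, cv=5,
                      scoring='neg_mean_squared_error')
search.fit(X_demo, y_demo)

# mean_test_score holds the mean negated MSE per candidate; flip the sign, then take the root
rmse_per_k = np.sqrt(-search.cv_results_['mean_test_score'])
for params, rmse in zip(search.cv_results_['params'], rmse_per_k):
    print(params['n_neighbors'], round(rmse, 3))

# The candidate with the lowest RMSE is the one reported in best_params_
best_k = search.cv_results_['params'][int(np.argmin(rmse_per_k))]['n_neighbors']
print(best_k == search.best_params_['n_neighbors'])
```

Printing the full curve shows how sharply the error changes around the selected value, which a single `best_score_` number does not reveal.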
