# Finding the Best K for KNN Regression

We will use the California housing dataset to illustrate how to find the best ``k`` value for KNN regression. The dataset was derived from the 1990 U.S. census, and each row describes one census block group.

## Import and Load the Dataset

In [None]:
from sklearn.datasets import fetch_california_housing

# as_frame=True returns the data as a pandas DataFrame, alongside other metadata
california_housing = fetch_california_housing(as_frame=True)

# Select only the dataframe part and assign it to the df variable
df = california_housing.frame

## Explore the Dataset

Take a peek at the first few rows of the data.

In [None]:
import pandas as pd
df.head()

## Preprocessing the Dataset

We want to predict the median house value. To do so, we assign ``MedHouseVal`` to ``y`` and all other columns to ``X`` by dropping ``MedHouseVal``.

In [None]:
y = df['MedHouseVal']
X = df.drop(['MedHouseVal'], axis=1)

## Splitting Data into Train and Test Sets

We use 75% of the data for training and 25% for testing. To make the evaluation reproducible, set ``random_state`` to the provided ``SEED``.

In [None]:
from sklearn.model_selection import train_test_split

SEED = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=SEED)
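
To see what ``test_size`` and ``random_state`` do, here is a minimal sketch on made-up data. The arrays ``X_demo`` and ``y_demo`` are hypothetical stand-ins, not the housing data. It splits the same data twice with the same seed and checks that both splits select identical rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 3 features each
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

SEED = 42
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=SEED)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=SEED)

# With the same seed, every returned subset matches element for element
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape, same_split)
```

With 100 samples, 75 rows go to the training subset and 25 to the test subset. Changing the seed would generally select different rows.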

## Feature Scaling both Train and Test Sets

To perform feature scaling, we import ``StandardScaler``, instantiate it, fit it on the training data only (to prevent leakage from the test set), and then transform both the train and test sets.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
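
The effect of fitting on the training data only can be checked with a small sketch on synthetic data. The ``*_demo`` arrays are hypothetical and randomly generated. After scaling, the training features have mean 0 and standard deviation 1, while the test features are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic features drawn from the same distribution
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

scaler_demo = StandardScaler()
scaler_demo.fit(X_train_demo)  # mean and std are learned from train only
X_train_scaled = scaler_demo.transform(X_train_demo)
X_test_scaled = scaler_demo.transform(X_test_demo)

# Train statistics are 0 and 1 up to floating-point error
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
# Test statistics are close to, but not exactly, 0 and 1
print(X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0))
```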

## Finding the Best K for KNN Regression

We will calculate the ``mean_absolute_error`` (MAE) of the test-set predictions for every K value from 1 to 39. Print the best K value (the one with the lowest MAE) at the end.
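
For reference, MAE is the average of the absolute differences between the true and predicted values. The sketch below uses made-up numbers (``y_true_demo`` and ``y_pred_demo`` are hypothetical) to compare a manual computation with ``mean_absolute_error``.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted values
y_true_demo = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_demo = np.array([2.5, 5.0, 4.0, 8.0])

# Absolute errors are 0.5, 0.0, 1.5 and 1.0, so their mean is 0.75
mae_manual = np.mean(np.abs(y_true_demo - y_pred_demo))
mae_sklearn = mean_absolute_error(y_true_demo, y_pred_demo)
print(mae_manual, mae_sklearn)
```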

In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor

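error = []

# Calculate the test-set MAE for each K from 1 to 39
for k in range(1, 40):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train, y_train)
    pred_k = knn.predict(X_test)
    error.append(mean_absolute_error(y_test, pred_k))

# error[0] corresponds to K=1, so add 1 to the index of the lowest MAE
print(error.index(min(error)) + 1)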