1. Recap
==


In the last mission, we focused on increasing the number of attributes the model uses. We saw that adding more attributes generally lowered the model's error, because the model can better identify the living spaces in the training set that are most similar to the ones in the test set. However, we also observed that using all of the available features didn't automatically improve the model's accuracy, and that some of the features were probably not relevant for similarity ranking. We learned that selecting relevant features, rather than simply increasing the number of features used, is the right lever for improving a model's accuracy.

In this mission, we'll focus on the impact of increasing <span style="background-color: #F9EBEA; color:##C0392B">k</span>, the number of nearby neighbors the model uses to make predictions. We exported both the training set (<span style="background-color: #F9EBEA; color:##C0392B">train_df</span>) and the test set (<span style="background-color: #F9EBEA; color:##C0392B">test_df</span>) from the previous missions to CSV files, <span style="background-color: #F9EBEA; color:##C0392B">dc_airbnb_train.csv</span> and <span style="background-color: #F9EBEA; color:##C0392B">dc_airbnb_test.csv</span>, respectively. Let's read both of these CSV files into DataFrames.

<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**: 

1. Read <span style="background-color: #F9EBEA; color:##C0392B">dc_airbnb_train.csv</span> into a DataFrame and assign it to <span style="background-color: #F9EBEA; color:##C0392B">train_df</span>.
2. Read <span style="background-color: #F9EBEA; color:##C0392B">dc_airbnb_test.csv</span> into a DataFrame and assign it to <span style="background-color: #F9EBEA; color:##C0392B">test_df</span>.

2. Hyperparameter optimization
==

When we vary the features that are used in the model, we're affecting the data that the model uses. On the other hand, varying the k value affects the behavior of the model independently of the actual data that's used when making predictions. In other words, we're impacting how the model performs without trying to change the data that's used.
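
To make this concrete, here is a minimal sketch of how k alone changes a prediction while the data stays fixed. The `knn_predict` helper and the toy values are assumptions invented for illustration; they are a hand-rolled stand-in for a k-nearest neighbors regressor on a single feature, not the model or the data used in this mission.

```python
def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training points closest to `query`."""
    # Rank training rows by absolute distance to the query (one feature only).
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


# Toy data made up for this sketch (not from the Airbnb data set).
train_x = [1.0, 2.0, 3.0, 10.0]
train_y = [10.0, 20.0, 60.0, 100.0]

# Same training data and same query; only k differs.
preds = {k: knn_predict(train_x, train_y, 2.1, k) for k in (1, 3)}
print(preds)  # prints {1: 20.0, 3: 30.0}
```

With k set to 1 the prediction copies the single closest row, while with k set to 3 it averages the three closest rows, so the result changes even though no data was added or removed.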

Values that affect the behavior and performance of a model but are unrelated to the data it uses are referred to as **hyperparameters**. The process of finding the optimal hyperparameter value is known as hyperparameter optimization. A simple but common [hyperparameter optimization](https://en.wikipedia.org/wiki/Hyperparameter_optimization) technique is known as [grid search](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search), which involves:

- selecting a subset of the possible hyperparameter values,
- training a model using each of these hyperparameter values,
- evaluating each model's performance,
- selecting the hyperparameter value that resulted in the lowest error value.

Grid search essentially boils down to evaluating the model performance at different k values and selecting the k value that resulted in the lowest error. While grid search can take a long time when working with large datasets, the data we're working with in this mission is small and this process is relatively quick.
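
The four steps above can be sketched end to end as follows. This is a sketch under stated assumptions: the arrays `X_train`, `y_train`, `X_test`, and `y_test` hold synthetic data generated only so the block runs on its own, and the names `best_index` and `best_k` are illustrative. Only `KNeighborsRegressor`, `mean_squared_error`, and `numpy` calls are real library APIs.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption for illustration, not the Airbnb set).
rng = np.random.RandomState(1)
weights = np.array([50.0, 30.0, 20.0, 5.0])
X_train = rng.normal(size=(200, 4))
y_train = X_train @ weights + rng.normal(scale=10.0, size=200)
X_test = rng.normal(size=(50, 4))
y_test = X_test @ weights + rng.normal(scale=10.0, size=50)

# Step 1: select a subset of the possible hyperparameter values.
hyper_params = [1, 2, 3, 4, 5]
mse_values = []

for k in hyper_params:
    # Step 2: train a model using the current hyperparameter value.
    knn = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
    knn.fit(X_train, y_train)
    # Step 3: evaluate the model's performance on the held-out rows.
    mse_values.append(mean_squared_error(y_test, knn.predict(X_test)))

# Step 4: select the value that resulted in the lowest error.
best_index = int(np.argmin(mse_values))
best_k = hyper_params[best_index]
print(best_k, mse_values[best_index])
```

Keeping `hyper_params` and `mse_values` as parallel lists means the position of the smallest error maps directly back to the k value that produced it.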

Let's confirm that grid search will work quickly for the dataset we're working with by first observing how the model performance changes as we increase the k value from <span style="background-color: #F9EBEA; color:##C0392B">1</span> to <span style="background-color: #F9EBEA; color:##C0392B">5</span>. If you recall, we set <span style="background-color: #F9EBEA; color:##C0392B">5</span> as the <span style="background-color: #F9EBEA; color:##C0392B">k</span> value for the last 2 missions. Let's use the features from the last mission that resulted in the best model accuracy:

- <span style="background-color: #F9EBEA; color:##C0392B">accommodates</span>
- <span style="background-color: #F9EBEA; color:##C0392B">bedrooms</span>
- <span style="background-color: #F9EBEA; color:##C0392B">bathrooms</span>
- <span style="background-color: #F9EBEA; color:##C0392B">number_of_reviews</span>

<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**: 

1. Create a list containing the integer values **1**, **2**, **3**, **4**, and **5**, in that order, and assign to <span style="background-color: #F9EBEA; color:##C0392B">hyper_params</span>.
2. Create an empty list and assign to <span style="background-color: #F9EBEA; color:##C0392B">mse_values</span>.
3. Use a **for loop** to iterate over <span style="background-color: #F9EBEA; color:##C0392B">hyper_params</span> and in each iteration:
    - Instantiate a <span style="background-color: #F9EBEA; color:##C0392B">KNeighborsRegressor</span> object with the following parameters:
        - <span style="background-color: #F9EBEA; color:##C0392B">n_neighbors</span>: the current value for the iterator variable,
        - <span style="background-color: #F9EBEA; color:##C0392B">algorithm</span>: brute
    - Fit the instantiated k-nearest neighbors model to the following columns from <span style="background-color: #F9EBEA; color:##C0392B">train_df</span>:
        - <span style="background-color: #F9EBEA; color:##C0392B">accommodates</span>
        - <span style="background-color: #F9EBEA; color:##C0392B">bedrooms</span>
        - <span style="background-color: #F9EBEA; color:##C0392B">bathrooms</span>
        - <span style="background-color: #F9EBEA; color:##C0392B">number_of_reviews</span>
    - Use the trained model to make predictions on the same columns from <span style="background-color: #F9EBEA; color:##C0392B">test_df</span> and assign to <span style="background-color: #F9EBEA; color:##C0392B">predictions</span>.
    - Use the **mean_squared_error** function to calculate the MSE value between <span style="background-color: #F9EBEA; color:##C0392B">predictions</span> and the <span style="background-color: #F9EBEA; color:##C0392B">price</span> column from <span style="background-color: #F9EBEA; color:##C0392B">test_df</span>.
    - Append the MSE value to <span style="background-color: #F9EBEA; color:##C0392B">mse_values</span>.
4. Display <span style="background-color: #F9EBEA; color:##C0392B">mse_values</span> using the **print()** function.

In [4]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Features that gave the best model accuracy in the last mission.
features = ['accommodates', 'bedrooms', 'bathrooms', 'number_of_reviews']
hyper_params = [1, 2, 3, 4, 5]
mse_values = []

for k in hyper_params:
    # Train a brute-force k-nearest neighbors model with the current k.
    knn = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
    knn.fit(train_df[features], train_df['price'])
    predictions = knn.predict(test_df[features])
    # mean_squared_error expects the true values first, then the predictions.
    mse = mean_squared_error(test_df['price'], predictions)
    mse_values.append(mse)

print(mse_values)

In [1]:
import pandas as pd

train_df = pd.read_csv('dc_airbnb_train.csv')
test_df = pd.read_csv('dc_airbnb_test.csv')

In [2]:
train_df.head()

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,minimum_nights,maximum_nights,number_of_reviews
0,-0.596544,-0.249467,-0.439151,-0.546858,125.0,-0.341375,-0.016604,4.57965
1,-0.596544,-0.249467,0.412923,-0.546858,85.0,-0.341375,-0.016603,1.159275
2,-1.095499,-0.249467,-1.291226,-0.546858,50.0,-0.341375,-0.016573,-0.482505
3,-0.596544,-0.249467,-0.439151,-0.546858,209.0,0.487635,-0.016584,-0.448301
4,4.393004,4.507903,1.264998,2.829956,215.0,-0.065038,-0.016553,0.646219


In [3]:
test_df.head()

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,price,minimum_nights,maximum_nights,number_of_reviews
0,-0.596544,-1.43881,-0.439151,-0.546858,105.0,-0.341375,-0.016548,-0.243079
1,0.90032,0.939875,1.264998,0.297345,309.0,0.487635,-0.016594,-0.243079
2,-0.596544,-0.249467,2.117072,-0.546858,55.0,-0.341375,-0.016573,0.714626
3,-0.596544,-0.249467,-0.439151,-0.546858,180.0,-0.341375,-0.016573,-0.448301
4,-0.596544,-0.249467,-0.439151,-0.546858,130.0,-0.341375,-0.016573,-0.448301
