# Workflow & Hyperparameter Optimization

In [0]:
import pandas as pd
import seaborn as sns
import numpy as np

👇 Import the house price data set. We will only keep numerical features for the sake of simplicity.

Your goal is to fit the best KNN regressor. In particular, how many "neighbors" (K in KNN) should you consider to best predict house prices?

In [0]:
# Load raw data
data = pd.read_csv('https://wagon-public-datasets.s3.amazonaws.com/houses_train_raw.csv')

# Only keep numerical columns and rows without NaN
data = data.select_dtypes(include=np.number).dropna()

data

In [0]:
X = data.drop(columns=['SalePrice'])
y = data['SalePrice']

## 1. Train/Test split

👇 Split the data to create `X_train`, `X_test`, `y_train` and `y_test`. Use:
- `test_size=0.3`
- `random_state=0` to compare with your buddy
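
As a minimal sketch of such a split, here is `train_test_split` applied to synthetic data. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the house-price data, so the shapes will differ from yours.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for features and target (assumption, not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)

# Hold out 30% of the rows; a fixed random_state makes the split reproducible
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

print(X_train_demo.shape, X_test_demo.shape)
```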

## 2. Scaling

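Scaling matters for KNN because predictions depend on distances between observations: without scaling, features with large ranges dominate the distance computation.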

❓ _Standard-Scale_ your training set
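
One possible way to do this, sketched on synthetic arrays (`X_train_demo` and `X_test_demo` are hypothetical names): the scaler learns its mean and standard deviation from the training set only, and the same statistics are reused later for the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the train and test features (assumption)
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(70, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(30, 3))

scaler = StandardScaler()
# Learn mean and std on the training set only, then transform it
X_train_scaled = scaler.fit_transform(X_train_demo)
# Reuse the training statistics on the test set (no refitting)
X_test_scaled = scaler.transform(X_test_demo)

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

After scaling, each training column has mean 0 and standard deviation 1; the test columns will only be approximately so, which is expected.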

## 3. Baseline KNN model

❓ Cross-validate (5 folds) a simple KNN regressor that considers only the closest neighbor, and compute its mean CV score
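
A rough sketch of this baseline on synthetic data from `make_regression` (an assumption, so the score is unrelated to the house-price data). For a regressor, `cross_validate` uses the estimator's default score, which is R².

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (assumption, not the house-price dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# KNN that looks only at the single closest neighbor
baseline = KNeighborsRegressor(n_neighbors=1)

# 5-fold cross-validation; "test_score" holds one R² value per fold
cv_results = cross_validate(baseline, X_demo, y_demo, cv=5)
baseline_score = cv_results["test_score"].mean()

print(baseline_score)
```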

## 4. Grid search

Let's use sklearn `GridSearchCV` to find the best KNN hyperparameter `n_neighbors`.
- Start with a coarse-grained approach, with `n_neighbors` = [1,5,10,20,50]
- 5-fold cross-validate each combination
- Use `n_jobs` to parallelize the search and reduce computation time
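
The steps above could be sketched as follows. The data is synthetic (an assumption), so the best value found here says nothing about the house-price data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (assumption, not the house-price dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Coarse grid over the number of neighbors
param_grid = {"n_neighbors": [1, 5, 10, 20, 50]}

# cv=5 cross-validates each candidate; n_jobs=-1 uses all available CPU cores
search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5, n_jobs=-1)
search.fit(X_demo, y_demo)

print(search.best_params_)  # best candidate from the grid
print(search.best_score_)   # its mean cross-validated R²
```

`best_params_` and `best_score_` are the attributes to read once the search has been fitted.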

❓ According to the grid search, what is the optimal K value?

❓ What is the best score the optimal K value produced?

We now have an idea about where the best k lies, but some of the values we did not try could be better!

Re-run a fine-grained grid search with k values around your previous best value

❓ What is the `best_score` and `best_k` you find?

In [0]:
best_score = ?
best_k = ?

#### 🧪 Test your code

In [0]:
from nbresult import ChallengeResult
result = ChallengeResult('knn',
                         best_k=best_k,
                         best_score=best_score)
result.write()
print(result.check())

### Visual check

☝️ This problem is actually simple enough to perform a grid search manually.
- Loop manually over all values of k from 1 to 50 and store the mean cv-scores of each model in a list.
- Plot the score as a function of k to visually find the best k
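
A minimal sketch of such a manual loop, again on synthetic data (an assumption). The names `k_values`, `mean_scores` and `best_k_demo` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (assumption, not the house-price dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

k_values = list(range(1, 51))  # k from 1 to 50 inclusive
mean_scores = []
for k in k_values:
    # 5-fold cross-validated R² for this value of k
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=k), X_demo, y_demo, cv=5)
    mean_scores.append(scores.mean())

# Pick the k with the highest mean score
best_k_demo = max(zip(mean_scores, k_values))[1]
print(best_k_demo)
```

With matplotlib available, `plt.plot(k_values, mean_scores)` would draw the score as a function of k.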

❓ Can you guess what makes GridSearchCV a better option than such a manual loop?
 
<details>
    <summary>Answer</summary>

- Sklearn's `n_jobs=-1` allows you to parallelize the search across all CPU cores
- What if you had multiple hyper-parameters to co-optimize?
</details>

## 5. Multiple params

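`KNeighborsRegressor` supports various _distance metrics_ via the hyper-parameter `p`, the power of the Minkowski distance (`p=1` is the Manhattan distance, `p=2` the Euclidean distance) [see docs](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)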

❓ Use GridSearchCV to search for the best `k` and `p` at the same time: try all combinations for `k` = [1, 5, 10, 20, 50] and `p` = [1, 2, 3].
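
A two-parameter search might look like the sketch below. It runs on synthetic data (an assumption), so the winning combination is purely illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (assumption, not the house-price dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Every combination of the two lists is tried
param_grid = {
    "n_neighbors": [1, 5, 10, 20, 50],
    "p": [1, 2, 3],
}

search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```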

❓ How many models did you train overall?

<details>
    <summary>Hint</summary>

Many more than 15. Think twice :)
    <details>
    <summary>Answer</summary>

75 models: 15 hyper-parameter combinations × 5 cross-validation folds
    </details>
</details>
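
The count can be checked with `ParameterGrid`, as in this small sketch (the variable names are hypothetical). Note that `GridSearchCV` also refits the best candidate once on the whole training data when `refit=True` (the default), which this count leaves out.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_neighbors": [1, 5, 10, 20, 50],
    "p": [1, 2, 3],
}

n_candidates = len(ParameterGrid(param_grid))  # number of hyper-parameter combinations
n_folds = 5                                    # value passed as cv
n_cv_fits = n_candidates * n_folds             # one fit per combination per fold

print(n_candidates, n_cv_fits)
```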

❓ What are the best parameters and the best score?

## 6. Random Search

Now let's see whether a random search can find a better combination with the same number of model fits.
Use `RandomizedSearchCV` to
- Randomly sample `k` from a uniform `randint(1,50)` distribution
- Sample `p` from a list [1,2,3]
- Use the correct `n_iter` and `cv` to fit exactly the same number of models as in your previous GridSearchCV.
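
A possible sketch of this random search on synthetic data (an assumption). It assumes `randint` refers to `scipy.stats.randint`, whose upper bound is exclusive, so `randint(1, 50)` draws integers from 1 to 49.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (assumption, not the house-price dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

param_distributions = {
    "n_neighbors": randint(1, 50),  # integers 1..49, upper bound excluded
    "p": [1, 2, 3],                 # a list is sampled uniformly
}

# n_iter candidates × cv folds = number of cross-validation fits
search = RandomizedSearchCV(
    KNeighborsRegressor(),
    param_distributions,
    n_iter=15,
    cv=5,
    random_state=0,  # fixed seed so the sampled candidates are reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```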

## 7. Generalization

👇 This is your final chance to fine-tune your model
- Refine your `RandomizedSearchCV` if you wish
- Choose your best model hyper-params and instantiate it
- Re-fit it on the __entire__ train set

👇 The time has come to discover our model's performance on the **unseen** test set `X_test`. Compute the r2 score for the test set and save it as `r2_test`.
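
The final evaluation could be sketched as below, on synthetic data and with hypothetical hyper-parameters (`n_neighbors=5`, `p=2` are placeholders, not the tuned values). The test set is transformed with the scaler fitted on the train set.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption, not the house-price dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Fit the scaler on the train set only
scaler = StandardScaler().fit(X_train_demo)

# Placeholder hyper-parameters; use the ones selected by your own search
model = KNeighborsRegressor(n_neighbors=5, p=2)
model.fit(scaler.transform(X_train_demo), y_train_demo)  # entire train set

# Evaluate once on the unseen test set
y_pred_demo = model.predict(scaler.transform(X_test_demo))
r2_demo = r2_score(y_test_demo, y_pred_demo)

print(r2_demo)
```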

❓ Would you consider the optimized model to generalize well?

<details><summary>Answer</summary>

The test score may be slightly lower than the cross-validated score on the train set, probably by no more than 5%. This can be due to
- A non-representative train/test split
- Too few cross-validation folds, which leads to overfitting during the model-tuning phase. The more folds you use, the more robustly your findings generalize, but you can't increase `cv` too much on a small dataset, as each fold would not keep enough observations to be representative.
- Our dataset is very small, so our hyperparameter optimization is extremely dependent on (and overfits) our train/test split. Always make sure your dataset is much bigger than the total number of hyperparameter combinations you are trying out!
    
</details>

#### 🧪 Test your code 

In [0]:
from nbresult import ChallengeResult
result = ChallengeResult('r2', 
                         r2_test=r2_test)
result.write()
print(result.check())

## 🏁 Congratulations! Please push the exercise once completed