### Lab:  Grid Search & Hyperparameter Tuning

Welcome! Today's lab combines several of the concepts covered in Unit 3 into one cohesive workflow:

 - Random Forests
 - Tuning model hyperparameters with a grid search
 - Using custom loss functions to keep track of how you're doing

#### Step 1a:  Load in the training and the test set

In [1]:
# your answer here
import pandas as pd
import numpy as np
train = pd.read_csv('../data/iowa_housing/train.csv')
test  = pd.read_csv('../data/iowa_housing/test.csv')

#### Step 1b: Create the `y` variable as the log of `SalePrice`, remove `SalePrice` from the training set, and drop the `Id` column from both datasets.

In [2]:
# your answer here
y = np.log(train['SalePrice'])
train.drop('SalePrice', axis=1, inplace=True)
test_id = test['Id']
train.drop('Id', axis=1, inplace=True)
test.drop('Id', axis=1, inplace=True)
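
Because the target is now `log(SalePrice)`, anything the model predicts will also be on the log scale and needs `np.exp` to get back to dollars. A minimal sketch of the round trip, using made-up prices rather than values from the dataset:

```python
import numpy as np
import pandas as pd

# hypothetical sale prices, for illustration only
prices = pd.Series([100000.0, 200000.0, 400000.0])

log_prices = np.log(prices)     # the scale the model is trained on
recovered = np.exp(log_prices)  # undo the transform to get dollars back

# each price is double the previous one, so the log-scale steps are equal
log_steps = log_prices.diff().dropna()
```

Equal ratios in price become equal differences in log price, which is one reason the log target tends to be easier to model for skewed prices.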

#### Step 2: Fill in the missing values (Completed For You)

All of the code is listed here so you can see how it works. It uses the variables `train` and `test` to refer to the training and test sets you loaded. If you named them something different, adjust the code accordingly.

In [3]:
# just run this code
# columns in the training set that contain at least one missing value
train_empty = train.loc[:, train.isnull().sum() > 0]
# grab the column names
cols = train_empty.columns.tolist()
# fill the categorical garage columns with 'None' ('NA' or 'Other' could also work)
train[['GarageType', 'GarageFinish']] = train[['GarageType', 'GarageFinish']].fillna('None')
test[['GarageType', 'GarageFinish']]  = test[['GarageType', 'GarageFinish']].fillna('None')

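# GarageYrBlt is a numeric column, so we fill it with 0 instead
# (assigning the result back avoids pandas' warning about inplace fillna on a selected column)
train['GarageYrBlt'] = train['GarageYrBlt'].fillna(0)
test['GarageYrBlt'] = test['GarageYrBlt'].fillna(0)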

# finding the values to use in the training set
ms_mode   = train['MSZoning'].mode()[0]
gcarsmean = train['GarageCars'].mean()

# and applying them to the test set
test['MSZoning'] = test['MSZoning'].fillna(ms_mode)
test['GarageCars'] = test['GarageCars'].fillna(gcarsmean)
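
The pattern in the last two blocks is to learn fill values from the training set only and then apply them to the test set. A small sketch of the same idea on toy data; `toy_train`, `toy_test`, and `fill_values` are hypothetical names and the values are made up:

```python
import numpy as np
import pandas as pd

# toy frames standing in for train/test (values are made up)
toy_train = pd.DataFrame({'MSZoning': ['RL', 'RL', 'RM'],
                          'GarageCars': [1.0, 2.0, 3.0]})
toy_test = pd.DataFrame({'MSZoning': ['RM', np.nan],
                         'GarageCars': [np.nan, 2.0]})

# learn the fill values from the training data only
fill_values = {
    'MSZoning': toy_train['MSZoning'].mode()[0],
    'GarageCars': toy_train['GarageCars'].mean(),
}

# fillna accepts a {column: value} dict, so one call covers both columns
toy_test = toy_test.fillna(fill_values)
```

Taking the statistics from the training set keeps information about the test set out of the preprocessing.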

#### Step 3: Make A Pipeline For a Random Forest

Use the following steps:

  - OrdinalEncoder
  - OneHotEncoder
  - RandomForest
  
**Note:** Do you understand why we're not scaling our data?

In [15]:
# your answer here
# mapping for the ordinal columns -- you can just use this to speed things up
garage_mapping = {
    'None': 0, # no garage
    'Unf' : 1, # unfinished garage
    'RFn' : 2, # partially finished garage
    'Fin' : 3  # finished garage
}

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from category_encoders import OrdinalEncoder, OneHotEncoder

# to be used with the ordinal encoder
mapping = {
    'col': 'GarageFinish',
    'mapping': garage_mapping
}

# initialize everything
rf = RandomForestRegressor()
ore = OrdinalEncoder(cols=['GarageFinish'], mapping=[mapping])
ohe = OneHotEncoder()

# make the pipe
pipe = make_pipeline(ore, ohe, rf)
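
On the note about scaling: tree-based models split on thresholds, so multiplying a feature by a positive constant moves the thresholds but should not change which rows end up in which leaf. The sketch below illustrates this on synthetic data (not the housing data); all names in it are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# synthetic integer-valued features and a target, for illustration only
X_toy = rng.integers(0, 100, size=(200, 3)).astype(float)
y_toy = 2.0 * X_toy[:, 0] + rng.normal(size=200)

# same seed and same rows; the second forest sees every feature multiplied by 1000
rf_raw = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_toy, y_toy)
rf_scaled = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_toy * 1000, y_toy)

preds_raw = rf_raw.predict(X_toy)
preds_scaled = rf_scaled.predict(X_toy * 1000)

# the two sets of predictions are expected to match
same_predictions = np.allclose(preds_raw, preds_scaled)
```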

#### Step 4: Import `mean_squared_error` and `make_scorer` from the metrics module, and wrap `mean_squared_error` in a scorer that can be used in cross-validation.

**Hint:** Set the argument `greater_is_better` to `False` for the `make_scorer` function.

In [39]:
# your answer here
from sklearn.metrics import mean_squared_error, make_scorer

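# squared=False is passed through to mean_squared_error, so the metric is RMSE;
# greater_is_better=False negates it, so a higher score still means a better model
loss_function = make_scorer(mean_squared_error, greater_is_better=False, squared=False)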
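
Recent scikit-learn releases deprecated and later removed the `squared` argument of `mean_squared_error`, so the cell above may raise an error depending on your installed version. One version-independent sketch is to compute the square root yourself; `rmse` and `rmse_scorer` are hypothetical names, not part of the lab.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error, computed without the `squared` keyword."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

# greater_is_better=False makes the scorer return the negated RMSE
rmse_scorer = make_scorer(rmse, greater_is_better=False)

# tiny check with a model that always predicts the training mean
X_toy = np.zeros((3, 1))
y_toy = np.array([1.0, 2.0, 3.0])
dummy = DummyRegressor(strategy='mean').fit(X_toy, y_toy)
score = rmse_scorer(dummy, X_toy, y_toy)
```

A scorer built this way can be passed as `scoring=` in place of `loss_function`.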

#### Step 5: Set Up Your Grid Search

Do the following:

 - Create a dictionary of values to test the following parameters:
   - `min_samples_leaf`: 1, 5, 10, 25
   - `max_features`: 0.3, 0.4, 0.5, 0.6, 0.7, 0.8
   - `n_estimators`: 10, 50, 100
 - Initialize an instance of GridSearchCV with 5 folds, and the loss function from step 4

In [40]:
# your answer here
from sklearn.model_selection import GridSearchCV

params = {
    'randomforestregressor__min_samples_leaf': [1, 5, 10, 25],
    'randomforestregressor__max_features': [0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    'randomforestregressor__n_estimators': [10, 50, 100]
}

grid = GridSearchCV(estimator=pipe, param_grid=params, cv=5, scoring=loss_function)
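
Two details of this cell are worth checking. `make_pipeline` names each step after its lowercased class name, which is where the `randomforestregressor__` prefix comes from, and the grid size is the product of the list lengths. The sketch below uses a stand-in pipeline built only from scikit-learn classes; `demo_pipe` and `demo_params` are hypothetical names.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# stand-in pipeline; scikit-learn's OneHotEncoder is only a placeholder first step
demo_pipe = make_pipeline(OneHotEncoder(), RandomForestRegressor())
step_names = list(demo_pipe.named_steps)

# same grid as in the lab
demo_params = {
    'randomforestregressor__min_samples_leaf': [1, 5, 10, 25],
    'randomforestregressor__max_features': [0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    'randomforestregressor__n_estimators': [10, 50, 100],
}

# 4 * 6 * 3 parameter combinations, each fitted once per fold
n_candidates = len(ParameterGrid(demo_params))
n_fits = n_candidates * 5
```

Counting the fits up front gives a rough idea of how long the search in the next step will take.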

#### Step 6: Fit your grid search, which wraps the pipeline you created in step 3, on the training data.

In [41]:
grid.fit(train, y)

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('ordinalencoder',
                                        OrdinalEncoder(cols=['GarageFinish'],
                                                       drop_invariant=False,
                                                       handle_missing='value',
                                                       handle_unknown='value',
                                                       mapping=[{'col': 'GarageFinish',
                                                                 'mapping': {'Fin': 3,
                                                                             'None': 0,
                                                                             'RFn': 2,
                                                                             'Unf': 1}}],
                                                       return_df=True,
                                     

#### Step 7: Which combination gave you the best results?

In [42]:
# your answer here
grid.best_params_

{'randomforestregressor__max_features': 0.3,
 'randomforestregressor__min_samples_leaf': 1,
 'randomforestregressor__n_estimators': 50}
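
Because the scorer was built with `greater_is_better=False`, `best_score_` is the negated loss, and the search refits the winning configuration on all the data by default. A minimal sketch on synthetic data; `demo_grid`, `rmse`, and the toy arrays are hypothetical and not the lab's objects:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

rng = np.random.default_rng(0)
# synthetic regression data, for illustration only
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy[:, 0] + rng.normal(scale=0.1, size=60)

demo_grid = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    param_grid={'min_samples_leaf': [1, 5]},
    cv=3,
    scoring=make_scorer(rmse, greater_is_better=False),
)
demo_grid.fit(X_toy, y_toy)

best_rmse = -demo_grid.best_score_      # flip the sign to read it as an RMSE
best_model = demo_grid.best_estimator_  # already refit on all rows (refit=True is the default)
```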

#### Bonus

**B1: Among the parameters that you searched over, which had the strongest association with better validation scores?**

In [43]:
# your answer here
grid_results = pd.DataFrame(grid.cv_results_)

In [45]:
# more trees give better scores on average (scores are negated RMSE, so closer to 0 is better)
grid_results.groupby('param_randomforestregressor__n_estimators')['mean_test_score'].mean()

param_randomforestregressor__n_estimators
10    -0.161912
50    -0.156856
100   -0.156112
Name: mean_test_score, dtype: float64

In [46]:
# fewer samples per leaf did better
grid_results.groupby('param_randomforestregressor__min_samples_leaf')['mean_test_score'].mean()

param_randomforestregressor__min_samples_leaf
1    -0.147804
5    -0.154002
10   -0.159136
25   -0.172231
Name: mean_test_score, dtype: float64

In [47]:
# fairly modest difference between values
grid_results.groupby('param_randomforestregressor__max_features')['mean_test_score'].mean()

param_randomforestregressor__max_features
0.3   -0.158835
0.4   -0.158030
0.5   -0.157850
0.6   -0.158044
0.7   -0.157912
0.8   -0.159087
Name: mean_test_score, dtype: float64
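
One rough way to compare the three parameters is to look at the spread between the best and worst group mean for each. The sketch below copies the group means from the three outputs above; `group_means` and `spread` are hypothetical names, and the spread is only a coarse summary that ignores interactions between parameters.

```python
import pandas as pd

# group means copied from the three outputs above
group_means = {
    'n_estimators': pd.Series({10: -0.161912, 50: -0.156856, 100: -0.156112}),
    'min_samples_leaf': pd.Series({1: -0.147804, 5: -0.154002,
                                   10: -0.159136, 25: -0.172231}),
    'max_features': pd.Series({0.3: -0.158835, 0.4: -0.158030, 0.5: -0.157850,
                               0.6: -0.158044, 0.7: -0.157912, 0.8: -0.159087}),
}

# spread = best group mean minus worst group mean, per parameter
spread = pd.Series({name: means.max() - means.min()
                    for name, means in group_means.items()})
spread = spread.sort_values(ascending=False)
```

On these numbers `min_samples_leaf` has the largest spread, followed by `n_estimators`, with `max_features` a distant third.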

**B2: Which 5 variables were most important in predicting the housing price?**

In [48]:
# set the forest to the best parameters found in step 7, then refit on all the training data
pipe.named_steps['randomforestregressor'].set_params(n_estimators=50, min_samples_leaf=1, max_features=0.3)
pipe.fit(train, y)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['GarageFinish'], drop_invariant=False,
                                handle_missing='value', handle_unknown='value',
                                mapping=[{'col': 'GarageFinish',
                                          'mapping': {'Fin': 3, 'None': 0,
                                                      'RFn': 2, 'Unf': 1}}],
                                return_df=True, verbose=0)),
                ('onehotencoder',
                 OneHotEncoder(cols=['MSZoning', 'Neighborhood', 'GarageType'],
                               drop_invariant=Fal...
                 RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                       criterion='mse', max_depth=None,
                                       max_features=0.3, max_leaf_nodes=None,
                                       max_samples=None,
                                       min_impurity_decrease=0.0

In [54]:
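# build a dataframe of importances, labelled with the encoded column names
# (newer category_encoders releases may expose this method as get_feature_names_out())
importances = pd.DataFrame({
    'columns': pipe.named_steps['onehotencoder'].get_feature_names(),
    'importance': pipe.named_steps['randomforestregressor'].feature_importances_
})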

importances.sort_values(by='importance', ascending=False)[:5]

| | columns | importance |
|---:|:---|---:|
| 32 | OverallQual | 0.2678 |
| 38 | GrLivArea.1 | 0.162412 |
| 35 | GrLivArea | 0.11088 |
| 34 | YearBuilt | 0.094305 |
| 36 | 1stFlrSF | 0.064643 |
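
An alternative way to rank importances is a pandas `Series` indexed by column name, combined with `nlargest`. The sketch below runs on synthetic data in which only one column drives the target; the column names and the `model` variable are hypothetical and unrelated to the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# synthetic data: only 'signal' drives the target
X_toy = pd.DataFrame(rng.normal(size=(300, 4)),
                     columns=['signal', 'noise_a', 'noise_b', 'noise_c'])
y_toy = 5.0 * X_toy['signal'] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# index the importances by column name, then keep the largest ones
top = pd.Series(model.feature_importances_, index=X_toy.columns).nlargest(2)
```

The importances of a fitted forest sum to 1, so each value can be read as a share of the total.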
