### Lab:  Random Forests & Regularized Linear Models -- Solutions

Welcome! Today's lab blends together several techniques we've been building up throughout Unit 3:

 - Cross validating the parameters of different models
 - Regularized Linear Models (Ridge & Lasso)
 - And our newly added technique:  the Random Forest!  
 
We'll continue working on the Ames dataset, and see if we can implement the different methods we've discussed so far.

#### Step 1a:  Load in the training and the test set

In [4]:
# your answer here
import pandas as pd
import numpy as np
train = pd.read_csv('../data/iowa_housing/train.csv')
test  = pd.read_csv('../data/iowa_housing/test.csv')

#### Step 1b: Create the `y` variable from the log of `SalePrice`, remove `SalePrice` from the training set, and drop the `Id` column from both datasets.

In [5]:
# your answer here
y = np.log(train['SalePrice'])
train.drop('SalePrice', axis=1, inplace=True)
test_id = test['Id']
train.drop('Id', axis=1, inplace=True)
test.drop('Id', axis=1, inplace=True)

#### Step 2: Fill in the missing values, using the techniques discussed so far

**Note:** Like last time, you can copy & paste the answers from the solutions manual if the class is running out of time and/or you feel like you already understand the main points behind this step.

In [6]:
# your answer here
# columns in the training set that contain missing values
train_empty = train.loc[:, train.isnull().sum() > 0]
# grab the column names
cols = train_empty.columns.tolist()
# fill missing garage categories with 'None' ('NA' or 'Other' would also work)
train[['GarageType', 'GarageFinish']] = train[['GarageType', 'GarageFinish']].fillna('None')
test[['GarageType', 'GarageFinish']]  = test[['GarageType', 'GarageFinish']].fillna('None')

# we'll use this for GarageYrBlt since it's a numeric column
train['GarageYrBlt'] = train['GarageYrBlt'].fillna(0)
test['GarageYrBlt']  = test['GarageYrBlt'].fillna(0)

# finding the values to use in the training set
ms_mode   = train['MSZoning'].mode()[0]
gcarsmean = train['GarageCars'].mean()

# and applying them to the test set
test['MSZoning']   = test['MSZoning'].fillna(ms_mode)
test['GarageCars'] = test['GarageCars'].fillna(gcarsmean)

#### Step 3: Make Pipelines For Both a Random Forest and One of Ridge/Lasso

Use the following steps for each one:

 - **Linear Models:**
  - OrdinalEncoder (make sure to specify what columns you want this to apply to)
  - OneHotEncoder
  - StandardScaler
  - Ridge/Lasso
  
 - **Random Forests:**
  - OrdinalEncoder
  - OneHotEncoder
  - RandomForest
  
**Note:** Do you understand why we're using different steps for each one?
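
As a hint for the question above, here is a minimal sketch on synthetic data (not the Ames set; every name in it is made up for illustration). It rescales one feature by 1000 and compares how much a Ridge model and a single decision tree change their predictions. A random forest is built from trees like this one, so the same reasoning should carry over.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Same data, but the first feature is expressed in units 1000x larger
X_rescaled = X * np.array([1000.0, 1.0])

# Ridge: the penalty acts on coefficient size, so rescaling a feature
# changes how strongly that feature is shrunk
ridge_a = Ridge(alpha=100).fit(X, y)
ridge_b = Ridge(alpha=100).fit(X_rescaled, y)
ridge_gap = np.abs(ridge_a.predict(X) - ridge_b.predict(X_rescaled)).max()

# Tree: splits depend only on the ordering of values within a feature,
# which a positive rescaling does not change
tree_a = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
tree_b = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_rescaled, y)
tree_gap = np.abs(tree_a.predict(X) - tree_b.predict(X_rescaled)).max()

print('largest change in Ridge predictions:', ridge_gap)
print('largest change in tree predictions: ', tree_gap)
```

The Ridge predictions should shift noticeably while the tree predictions should stay essentially the same, which is why only the linear pipeline includes a `StandardScaler`.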

In [18]:
# your answer here
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from category_encoders import OrdinalEncoder, OneHotEncoder
from sklearn.pipeline import make_pipeline

# mapping for the ordinal columns
garage_mapping = {
    'None': 0, # no garage
    'Unf' : 1, # unfinished garage
    'RFn' : 2, # partially finished garage
    'Fin' : 3  # finished garage
}

col_mapping = {
    'col': 'GarageFinish',
    'mapping': garage_mapping
}

# Pipeline components
sc    = StandardScaler()
ore   = OrdinalEncoder(cols=['GarageFinish'], mapping=[col_mapping])
ohe   = OneHotEncoder()
ridge = Ridge()
rf    = RandomForestRegressor()

lm_pipe = make_pipeline(ore, ohe, sc, ridge)
rf_pipe = make_pipeline(ore, ohe, rf)

#### Step 4a: Cross Validate The Best Version of Alpha for Your Linear Model

This will follow a process very similar to the previous class.

 - create your list of alphas using `np.logspace`
 - using a for-loop, do the following:
  - set the value of alpha for your model to the current one
  - use cross validation to arrive at the validation score
  - log the value of the validation score & alpha to a list
  
 When you're done you should have a list that looks something like this:
 
  `[(0.86747374, .001), (0.8547574, 0.1), (0.8573584, 1)]`
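
The loop described above can be sketched on stand-in data as follows. This is an illustration under assumptions, not the lab solution: the data is synthetic, the pipeline contains only a scaler and a Ridge model, and `demo_pipe`, `demo_alphas`, and `demo_scores` are hypothetical names. It also shows one way to set `alpha` by step name rather than by position in the pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the Ames set)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

demo_pipe = make_pipeline(StandardScaler(), Ridge())

# 7 values spaced evenly on a log scale, from 10**-3 up to 10**3
demo_alphas = np.logspace(-3, 3, 7)

demo_scores = []
for alpha in demo_alphas:
    # make_pipeline names each step after its lowercased class name,
    # so the Ridge step can be addressed as 'ridge'
    demo_pipe.set_params(ridge__alpha=alpha)
    scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5)
    demo_scores.append((np.mean(scores), alpha))

print(demo_scores)
```

Addressing the step by name keeps the loop working even if the order of the pipeline steps changes later.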

In [19]:
# your answer here
from sklearn.model_selection import cross_val_score

alphas       = np.logspace(-3, 3, 7)
ridge_scores = []

# for loop to cross validate alpha
for alpha in alphas:
    lm_pipe.steps[3][1].set_params(alpha=alpha)
    scores = cross_val_score(estimator=lm_pipe, X=train, y=y, cv=10)
    ridge_scores.append((np.mean(scores), alpha))

In [20]:
# and 100 wins
max(ridge_scores)

(0.8616642766670427, 100.0)
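
A short aside on why `max()` works here: Python compares tuples element by element, so with `(score, alpha)` pairs the highest score wins, and `alpha` only matters if two scores tie exactly. The values below are illustrative, rounded from the example list earlier in this step.

```python
# Illustrative (score, alpha) pairs, not real results
demo_scores = [(0.8675, 0.001), (0.8548, 0.1), (0.8574, 1.0)]

# Tuples are compared on the first element first, so this picks the best score
best_score, best_alpha = max(demo_scores)

print(best_score, best_alpha)
```

This is the reason the score is stored first in each tuple; with the order reversed, `max()` would simply return the largest `alpha`.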

#### Step 4b: Set the value of alpha to the one that gave you the best cross validation results

In [22]:
# your answer here
lm_pipe.steps[3][1].set_params(alpha=100)

Ridge(alpha=100, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)

#### Step 5a: Cross Validate The Ideal Value for `max_features` For Your Random Forest

This will follow a process very similar to the previous step.

 - create a list of values that ranges from 0.2 to 0.9
 - using a for-loop, do the following:
  - set the value of `max_features` for your model to the current one
  - use cross validation to arrive at the validation score
  - log the value of the validation score & `max_features` to a list
  
 When you're done you should have a list that looks something like this:
 
  `[(0.86747374, .2), (0.8547574, 0.3), (0.8573584, 0.4)]`
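
To make the meaning of a fractional `max_features` concrete, here is a small sketch on synthetic data with 10 features (the data and variable names are assumptions for illustration). As far as I know, scikit-learn treats a float as the fraction of features each tree considers at every split, and each fitted tree exposes the resulting count as `max_features_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with 10 features (an assumption, not the Ames set)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 10))
y_demo = X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=120)

forest = RandomForestRegressor(n_estimators=10, max_features=0.5, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree records how many features it considers per split
features_per_split = forest.estimators_[0].max_features_

print(features_per_split)
```

With 10 features and `max_features=0.5`, each split should draw its candidates from 5 randomly chosen features. Smaller fractions make the trees less correlated with one another, at the cost of weaker individual trees.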

In [25]:
# your answer here
feature_fractions = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
rf_scores         = []

# do the cross validation
for fraction in feature_fractions:
    rf_pipe.steps[2][1].set_params(max_features=fraction)
    scores = cross_val_score(estimator=rf_pipe, X=train, y=y, cv=10)
    rf_scores.append((np.mean(scores), fraction))

In [26]:
# get the best values
max(rf_scores)

(0.8700499374632763, 0.5)

#### Step 5b: Set the value of `max_features` in your pipeline to the best one found in your cross validation results

In [27]:
# your answer here
rf_pipe.steps[2][1].set_params(max_features=0.5)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features=0.5, max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

#### Step 6: Fit your cross validated pipelines on your **entire** training set

In [28]:
# your answer here
lm_pipe.fit(train, y)
rf_pipe.fit(train, y)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['GarageFinish'], drop_invariant=False,
                                handle_missing='value', handle_unknown='value',
                                mapping=[{'col': 'GarageFinish',
                                          'mapping': {'Fin': 3, 'None': 0,
                                                      'RFn': 2, 'Unf': 1}}],
                                return_df=True, verbose=0)),
                ('onehotencoder',
                 OneHotEncoder(cols=['MSZoning', 'Neighborhood', 'GarageType'],
                               drop_invariant=Fal...
                 RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                       criterion='mse', max_depth=None,
                                       max_features=0.5, max_leaf_nodes=None,
                                       max_samples=None,
                                       min_impurity_decrease=0.0

#### Step 7: Make predictions on your test set using each of your pipelines.

**Note:** The models were trained on the log of `SalePrice`, so apply `np.exp()` to the predictions if you want them in their original units.
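
A quick sanity check of that round trip, using made-up prices rather than values from the dataset:

```python
import numpy as np

# Made-up sale prices for illustration
prices = np.array([100000.0, 208500.0, 350000.0])

log_prices = np.log(prices)      # the scale the models are trained on
recovered  = np.exp(log_prices)  # back to the original units

print(np.allclose(recovered, prices))
```

Because `np.exp` is the inverse of `np.log`, predictions made on the log scale map back to dollar amounts this way.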

In [30]:
# your answer here
lm_preds = np.exp(lm_pipe.predict(test))
rf_preds = np.exp(rf_pipe.predict(test))

#### Bonus: Put your predictions into a dataframe

In [42]:
# your answer here
# test_id was saved in Step 1b, before the `Id` column was dropped
rf_submission = pd.DataFrame({
    'Id': test_id,
    'SalePrice': rf_preds
})

lm_submission = pd.DataFrame({
    'Id': test_id,
    'SalePrice': lm_preds
})

In [43]:
# this would create a submission file
rf_submission.to_csv('rf_submission.csv', index=False)
lm_submission.to_csv('lm_submission.csv', index=False)