### Lab: KFold, Regularization, & Pipelines

Welcome!  This lab is going to introduce us to some very important aspects of data processing and model building.  

Specifically, it's going to go over the following:

 - **KFold Cross Validation:** This is a more thorough way of choosing your validation set to give you a better idea of how your model might perform under various circumstances within your training data.
 - **Regularization:** This is an evergreen technique for dealing with models that are overfit (i.e., they score noticeably higher on training data than on test data).  Regularized linear models are often much better prepared to handle messy data & outliers.
 - **Pipelines:** (Time permitting!) This is an underappreciated aspect of the Scikit-Learn API that allows you to chain together multiple data processing steps, making it much easier to test different models and work seamlessly between your training & test sets.

**Note:** This lab builds on the one from the last class.  As such, it might be easier to keep working in your previous lab notebook to answer these questions.  It assumes you already have your data processed from the Iowa housing lab.  

The questions are listed here just to make the separation of concerns easier.

### Question 1: How Does Your Validation Score Differ Using KFold Cross Validation?

Take a look at the validation score you got from your previous exercise.  

This time, run your model through KFold cross validation using `cross_val_score`.  Is your average validation score appreciably different?  What were your highest and lowest values across the folds?

What if you changed your number of folds?  Try using 5, 10, & 25 folds.
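
As a rough illustration of the comparison being asked for, here is a minimal sketch that runs `cross_val_score` with 5, 10, and 25 folds. It uses a synthetic dataset from `make_regression` as a stand-in for your processed housing data, so `X_demo`, `y_demo`, and `fold_summaries` are assumed names for illustration only, and the scores will not match yours.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the processed housing data (an assumption, not the lab's data)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

fold_summaries = {}
for n_folds in (5, 10, 25):
    # shuffle the rows before splitting so each fold is a random slice of the data
    cv = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    # for regressors, cross_val_score reports R^2 on each held-out fold by default
    scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv)
    fold_summaries[n_folds] = (scores.mean(), scores.min(), scores.max())
    print(f"{n_folds:>2} folds: mean={scores.mean():.3f}  min={scores.min():.3f}  max={scores.max():.3f}")
```

With more folds, each validation fold holds fewer rows, so the spread between the lowest and highest fold score tends to widen even when the average stays similar.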

In [1]:
# your answer here
# these are the steps from the previous lab
import pandas as pd
import numpy as np
train = pd.read_csv('../data/iowa_housing/train.csv')
test  = pd.read_csv('../data/iowa_housing/test.csv')

# your answer here
y = train['SalePrice']
train.drop('SalePrice', axis=1, inplace=True)
train.drop('Id', axis=1, inplace=True)
test.drop('Id', axis=1, inplace=True)

train_empty = train.loc[:, train.isnull().sum() > 0]
# grab the columns
cols = train_empty.columns.tolist()
# fill with the appropriate value  -- NA, Other, could also work
train[['GarageType', 'GarageFinish']] = train[['GarageType', 'GarageFinish']].fillna('None')
test[['GarageType', 'GarageFinish']]  = test[['GarageType', 'GarageFinish']].fillna('None')

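# we'll use this for GarageYrBlt since it's a numeric column
# (assigning the result back avoids pandas' chained-assignment warning with inplace=True)
train['GarageYrBlt'] = train['GarageYrBlt'].fillna(0)
test['GarageYrBlt']  = test['GarageYrBlt'].fillna(0)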

# finding the values to use in the training set
ms_mode   = train['MSZoning'].mode()[0]
gcarsmean = train['GarageCars'].mean()

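# and applying them to the test set
# (assigning the result back avoids pandas' chained-assignment warning with inplace=True)
test['MSZoning']   = test['MSZoning'].fillna(ms_mode)
test['GarageCars'] = test['GarageCars'].fillna(gcarsmean)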

# your code here
# we'll assume GarageFinish is ordinal, i.e. a finished garage > an unfinished garage
garage_mapping = {
    'None': 0, # no garage
    'Unf' : 1, # unfinished garage
    'RFn' : 2, # partially finished garage
    'Fin' : 3  # finished garage
}

train['GarageFinish'] = train['GarageFinish'].map(garage_mapping)
test['GarageFinish']  = test['GarageFinish'].map(garage_mapping)

# MSSubClass is really a category, more so than a true number
# so we'll add it to the list of items to be encoded
train['MSSubClass'] = train['MSSubClass'].astype(str)
test['MSSubClass']  = test['MSSubClass'].astype(str)

# concatenate and encode
master = pd.concat([train, test])
master = pd.get_dummies(master)

# drop the MSSubClass_150 dummy column
master.drop('MSSubClass_150', axis=1, inplace=True)

# and split back apart
train  = master.iloc[:1460].copy()
test   = master.iloc[1460:].copy()

# save these values, to use on both your training and test set
train_means = train.mean()
train_stds  = train.std()

# standardize the training set
train -= train_means
train /= train_stds

# and do the same for the test set
test -= train_means
test /= train_stds

### Question 2: Updating Your Model With Ridge & Lasso Regression

Instead of using Linear Regression, import `Ridge` or `Lasso`, and use cross validation to find the ideal value of alpha.  

Some basic tips:

For values of alpha, try this:  `alphas = np.logspace(-4, 4, 9)`

Then write a `for` loop that generically goes like this:

`for value in alphas:`

 1. set the value of alpha to the current value using the `set_params()` method
 2. pass the instance of `Ridge` or `Lasso` into `cross_val_score`
 3. using a tuple, append the average of all results from step 2 to a list, along with the value of alpha
    
When you're finished, you should have a list that has 9 tuples inside it, each one with the average cross validation score as well as the value of alpha associated with it.
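
The loop described above might look something like the following minimal sketch. It runs on a synthetic dataset from `make_regression` rather than the housing data, so `X_demo`, `y_demo`, and `results` are assumed names, and the best alpha it finds says nothing about the best alpha for your model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the processed housing data (an assumption, not the lab's data)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

alphas = np.logspace(-4, 4, 9)   # 1e-4, 1e-3, ..., 1e4
ridge = Ridge()
results = []

for value in alphas:
    # 1) set alpha to the current value
    ridge.set_params(alpha=value)
    # 2) cross validate the model with that alpha
    scores = cross_val_score(ridge, X_demo, y_demo, cv=10)
    # 3) store (average score, alpha) as a tuple
    results.append((scores.mean(), value))

# tuples compare on their first element, so max() returns the best average score
best_score, best_alpha = max(results)
print(best_score, best_alpha)
```

Putting the score first in each tuple is a small convenience: `max(results)` or `sorted(results)` then ranks by score without any extra key function.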

In [None]:
# your answer here

**Bonus:** In Scikit-Learn, cross validation is sometimes built into algorithms automatically.  Luckily, this is the case for `Ridge` and `Lasso`.  If you're inclined to take a look at the `RidgeCV` and `LassoCV` estimators, you can basically combine what we just did into one step.

**Note:** These aren't always available, and they don't always work in the same way, so remember that they won't always be an option.

**RidgeCV:** https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html
**LassoCV:** https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html
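
For a sense of how the one-step version reads, here is a minimal sketch using `RidgeCV` on a synthetic dataset. The data and the names `X_demo`, `y_demo`, and `ridge_cv` are assumptions for illustration; on the housing data you would pass your own training set and `y`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic stand-in for the processed housing data (an assumption, not the lab's data)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

alphas = np.logspace(-4, 4, 9)

# RidgeCV tries every alpha with 10-fold cross validation, then refits with the best one
ridge_cv = RidgeCV(alphas=alphas, cv=10)
ridge_cv.fit(X_demo, y_demo)

print(ridge_cv.alpha_)   # the alpha that scored best during cross validation
```

After fitting, `ridge_cv` behaves like a regular fitted `Ridge` model, so `ridge_cv.predict(...)` works as usual.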

### Question 3: Building A Pipeline

Let's try building some pipelines to test out different versions of our models more easily.  

For this one, we are going to start fresh a little bit to get the hang of using our pipelines, and to go through the entire process.

So...

**a)** Reload the training and test sets

 - create a new variable for `y`, and set it equal to the log of `SalePrice`
 - create a variable for the `Id` column in the test set -- this will be reused later

In [2]:
# your answer here
import pandas as pd
import numpy as np
train = pd.read_csv('../data/iowa_housing/train.csv')
test  = pd.read_csv('../data/iowa_housing/test.csv')

y = np.log(train['SalePrice'])
train.drop('SalePrice', axis=1, inplace=True)
test_id = test['Id']
train.drop('Id', axis=1, inplace=True)
test.drop('Id', axis=1, inplace=True)

**b)** Fill in the missing data on training & test

**Note:** If you feel like you have a good handle on this, you can just copy and paste from your previous solutions or the lab manual.  

If you have the time and think you need extra practice, feel free to try to re-create the results on your own; just be mindful of time.

In [3]:
# your answer here
train_empty = train.loc[:, train.isnull().sum() > 0]
# grab the columns
cols = train_empty.columns.tolist()
# fill with the appropriate value  -- NA, Other, could also work
train[['GarageType', 'GarageFinish']] = train[['GarageType', 'GarageFinish']].fillna('None')
test[['GarageType', 'GarageFinish']]  = test[['GarageType', 'GarageFinish']].fillna('None')

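# we'll use this for GarageYrBlt since it's a numeric column
# (assigning the result back avoids pandas' chained-assignment warning with inplace=True)
train['GarageYrBlt'] = train['GarageYrBlt'].fillna(0)
test['GarageYrBlt']  = test['GarageYrBlt'].fillna(0)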

# finding the values to use in the training set
ms_mode   = train['MSZoning'].mode()[0]
gcarsmean = train['GarageCars'].mean()

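# and applying them to the test set
# (assigning the result back avoids pandas' chained-assignment warning with inplace=True)
test['MSZoning']   = test['MSZoning'].fillna(ms_mode)
test['GarageCars'] = test['GarageCars'].fillna(gcarsmean)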

**c)** Reclassify the `MSSubClass` column as a string

In [4]:
# your answer here
train['MSSubClass'] = train['MSSubClass'].astype(str)
test['MSSubClass']  = test['MSSubClass'].astype(str)

**d)** Create Your Pipeline!


Initialize instances for each of the following items:

 - An ordinal encoder for the `GarageFinish` column (be careful about the mapping dictionary here)
 - A categorical encoder for your nominal columns
 - The standard scaler
 - Lasso or Ridge regression, with the cross validated value of alpha from the previous exercise

In [20]:
# define the mapping first, since the ordinal encoder below refers to it
mapping = {
    'col': 'GarageFinish',
    'mapping': {
        'None': 1,
        'Unf':  2,
        'RFn':  3,
        'Fin':  4
    }
}

In [21]:
# your answer here
import category_encoders as ce
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

sc    = StandardScaler()
ore   = ce.OrdinalEncoder(cols=['GarageFinish'], mapping=[mapping])
ohe   = ce.OneHotEncoder()
ridge = Ridge()

In [28]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

pipe = make_pipeline(ore, ohe, sc, ridge)

In [25]:
pipe.fit(train, y)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['GarageFinish'], drop_invariant=False,
                                handle_missing='value', handle_unknown='value',
                                mapping=[{'col': 'GarageFinish',
                                          'mapping': {'Fin': 4, 'None': 1,
                                                      'RFn': 3, 'Unf': 2}}],
                                return_df=True, verbose=0)),
                ('onehotencoder',
                 OneHotEncoder(cols=['MSSubClass', 'MSZoning', 'Neighborhood',
                                     'GarageType'],
                               drop_invariant=False, handle_missing='value',
                               handle_unknown='value', return_df=True,
                               use_cat_names=False, verbose=0)),
                ('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('r

In [27]:
np.exp(pipe.predict(test))

array([117757.19893444, 153765.15486921, 170628.65553705, ...,
       150764.80850924, 117678.20922137, 226557.41950875])

In [29]:
cross_val_score(estimator=pipe, X=train, y=y, cv=10)

array([0.87437439, 0.91234571, 0.91290423, 0.82657394, 0.86676275,
       0.86542806, 0.8769858 , 0.89144589, 0.72653906, 0.87891162])

In [38]:
alphas = np.logspace(-3, 3, 7)
cv_scores = []

for alpha in alphas:
    pipe.steps[-1][1].set_params(alpha=alpha)
    scores = cross_val_score(estimator=pipe, X=train, y=y, cv=10)
    cv_scores.append((np.mean(scores), alpha))
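
The loop above reaches the final estimator through `pipe.steps[-1][1]`. Another way to do the same thing is Scikit-Learn's `<step name>__<parameter>` syntax on the pipeline itself. The sketch below shows that on a smaller pipeline and a synthetic dataset; `demo_pipe`, `demo_scores`, `X_demo`, and `y_demo` are assumed names for illustration, and the step name `ridge` is the lowercase class name that `make_pipeline` assigns automatically.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (an assumption, not the lab's data)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

# make_pipeline names each step after its class: 'standardscaler' and 'ridge'
demo_pipe = make_pipeline(StandardScaler(), Ridge())

demo_scores = []
for alpha in np.logspace(-3, 3, 7):
    # '<step name>__<parameter>' sets a parameter on one step of the pipeline
    demo_pipe.set_params(ridge__alpha=alpha)
    scores = cross_val_score(estimator=demo_pipe, X=X_demo, y=y_demo, cv=10)
    demo_scores.append((np.mean(scores), alpha))

# pick the alpha with the best average score, then refit on all of the data
best_score, best_alpha = max(demo_scores)
demo_pipe.set_params(ridge__alpha=best_alpha)
demo_pipe.fit(X_demo, y_demo)
```

Addressing the step by name keeps working if you later add or reorder steps, whereas `pipe.steps[-1][1]` relies on the model always being last.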

In [42]:
pipe.steps[-1][1].set_params(alpha=100)

Ridge(alpha=100, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)

In [43]:
pipe.fit(train, y)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['GarageFinish'], drop_invariant=False,
                                handle_missing='value', handle_unknown='value',
                                mapping=[{'col': 'GarageFinish',
                                          'mapping': {'Fin': 4, 'None': 1,
                                                      'RFn': 3, 'Unf': 2}}],
                                return_df=True, verbose=0)),
                ('onehotencoder',
                 OneHotEncoder(cols=['MSSubClass', 'MSZoning', 'Neighborhood',
                                     'GarageType'],
                               drop_invariant=False, handle_missing='value',
                               handle_unknown='value', return_df=True,
                               use_cat_names=False, verbose=0)),
                ('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('r

**e)** Fit the pipeline on your training set, and predict the values on your test set

 - to get the "real" values of your prediction you would use the function `np.exp()`, since the model was fit on the log of `SalePrice`

i.e., if `pipe.predict(test)` gives you the predicted log values of your test set, then `np.exp(pipe.predict(test))` would give you the predicted housing prices in their original units.

In [44]:
# your answer here
pipe.predict(test)

array([11.67852521, 11.9448299 , 12.05936728, ..., 11.92412032,
       11.66149733, 12.3309958 ])