## Putting what we have together 

The broad steps are now quite simple _in principle:_
1. Set up the model/pipeline 
    - **New skill today: dealing with multiple data types in the pipeline using `ColumnTransformer`** <br> <br>
    
2. Optimize the model's/pipeline's parameters
    - E.g. What is the "best" strategy for imputing? How should we scale the variables? How many neighbors is the "best"? 
    - Options: `GridSearchCV`, `RandomizedSearchCV`, and sklearn has some model-specific `-CV` estimators (e.g. `LassoCV`)
    - Tip: print the pipeline to figure out how to specify the parameter keys for `GridSearchCV` <br> <br>
3. Try new combinations of X variables (which to include), X variable transformations (log, non-linear polynomials), and model types (e.g. regression vs logistic), and optimize each 
    - If you have 40 variables, there are $2^{40}$ (over a trillion) possible combinations. You can't check all of those!
    - Forward selection: 
        1. Start with empty model and add the variable that generates largest score increase (CV score, AIC, BIC, adj R2)
        2. Continue adding variables until some stopping condition is reached 
    - Backward selection is the opposite. Start with all variables and remove the least helpful. Continue until some stopping condition is reached. Function: `RFECV`.
        - Alternate backward approaches: `LassoCV` and `SelectFromModel`
    - [`sklearn.feature_selection`](https://scikit-learn.org/stable/modules/feature_selection.html) has a bunch of options and examples to show you different approaches for feature selection. Most can be used in a pipeline! :)
    
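    ```python
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LassoCV, LinearRegression
    from sklearn.pipeline import Pipeline

    # LassoCV suits a regression target; for classification, an L1-penalized
    # classifier such as LinearSVC(penalty="l1", dual=False) plays the same role
    reg = Pipeline([
                      ('feature_selection', SelectFromModel(LassoCV())),
                      ('reg', LinearRegression())
                    ])
    reg.fit(X, y)
    ```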
    <br> <br>
4. Compare all the optimized models

    ```python
    # <build list of models>
    for model in models:
        cross_validate(model, X, y, cv=cv)   # pass the folds (and scoring) as keyword args
    ```

5. Save the model as an OBJECT others can load and use quickly
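
The forward selection idea from step 3 can be sketched with scikit-learn's `SequentialFeatureSelector`, assuming scikit-learn 0.24 or later. The synthetic data and the choice to stop at 3 features are assumptions for illustration only, not the lecture's data or a recommended setting.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# synthetic data (assumption): 10 candidate features, only 3 carry signal
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=1.0, random_state=0)

# forward selection: start empty, repeatedly add the feature that most improves the CV score
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=3,   # the stopping condition
                                direction='forward',
                                scoring='r2', cv=5)
sfs.fit(X_demo, y_demo)

selected = sfs.get_support()        # boolean mask over the 10 columns
print(selected.nonzero()[0])        # positions of the columns that were kept
```

Setting `direction='backward'` runs the opposite search: start with every feature and drop the least helpful one each round.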

## Those 5 steps in pseudo code

```python
# imports
# load data

########################################################################################
# STEP 0: EDA
########################################################################################

# Obviously, explore the data and use best practices throughout. This is just pseudo code,
# not a fully fleshed out "fill in the blanks" template

########################################################################################
# STEP 1: build a pipeline with data cleaning and an estimator
########################################################################################

# after this, I quickly run pipe_modelName.fit() and pipe_modelName.predict()  
# to make sure this works before going forward, but then delete those commands

pipe_modelName = make_pipeline(<a sequence of data steps, and the last step is a model>)  

########################################################################################
# STEP 2: optimize the pipeline
########################################################################################

# this is the GridSearchCV approach - manually set up the param&value combos to try
# doc + examples: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV 

param_grid = {'stepname__paramname':[val1,val2,...,valN]} # params to try
cv = ...                                                  # what folds to use
grid = GridSearchCV(pipe_modelName, param_grid, cv=cv, ...)  # set up optimizer (cv is a keyword arg)
grid.fit(X,y)                              # fit grid like a "normal model obj"
optimal_vrs_of_model1 = grid.best_estimator_  # grid now has new attributes. save best model

# part of this optimization step is picking the best model features

########################################################################################
# STEP 3: NOW MOVING BEYOND THAT,YOU SHOULD TRY OTHER THINGS! 
#           (WHAT ARE THE ODDS YOUR FIRST PASS CAN'T BE BEAT?)
########################################################################################

# MODEL #2
# build a new pipeline (e.g. change the model type, which vars to use, how to modify
# the vars)
# and repeat the pipeline optimization. save the optimal vrs of that model.

# MODEL #3
# again...

...

# MODEL #N:
# again...

########################################################################################
# STEP 4: Compare the optimized models
########################################################################################

# In practice, I'd actually loop through the models with a for-loop and print
# the name/scores nicely, but this is just pseudo code

cross_validate(optimal_vrs_of_model1,...)   
cross_validate(optimal_vrs_of_model2,...) 
...
cross_validate(optimal_vrs_of_modelN,...) 

########################################################################################
# STEP 5: Finishing up
########################################################################################

# summarize your preferred model (print stats, visual support backing your choice)
# save the model as an OBJECT others can load and use quickly

# we will do this in a minute!
```
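
To make the pseudo code above concrete, here is a minimal runnable sketch of steps 1 through 4. Everything specific in it is an assumption for illustration: the synthetic data from `make_regression`, the three candidate models, and the tiny parameter grids.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for real data (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# steps 1 and 3: one (pipeline, param_grid) pair per candidate model
candidates = {
    'ols':   (make_pipeline(StandardScaler(), LinearRegression()),
              {'linearregression__fit_intercept': [True, False]}),
    'ridge': (make_pipeline(StandardScaler(), Ridge()),
              {'ridge__alpha': [0.1, 1, 10]}),
    'lasso': (make_pipeline(StandardScaler(), Lasso()),
              {'lasso__alpha': [0.1, 1, 10]}),
}

# step 2: optimize each pipeline and keep the best estimator (not just the best params)
optimized = {}
for name, (pipe, grid) in candidates.items():
    search = GridSearchCV(pipe, grid, cv=cv, scoring='r2')
    search.fit(X_demo, y_demo)
    optimized[name] = search.best_estimator_

# step 4: compare the optimized models on the same folds
scores = {}
for name, model in optimized.items():
    result = cross_validate(model, X_demo, y_demo, cv=cv, scoring='r2')
    scores[name] = result['test_score'].mean()
    print(f"{name:<6} mean CV R2: {scores[name]:.3f}")
```

Keeping the candidates in a dictionary means adding "MODEL #N" is one more entry rather than another copy of the optimization code.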

## New skill #1: Dealing with multiple variable types

### Simple pipelines fail on real world data

A pipeline from the last lecture was
```python
knn_pipe2 = make_pipeline(
                        SimpleImputer(strategy='mean'),
                        preprocessing.StandardScaler(),  # clean the data
                        KNeighborsClassifier()           # model
                        )
```

**The problem is that this won't work if the data has any string (data type = 'object') variables.** Real data usually has 
- numeric variables that are continuous,
- numeric variables that are categorical,
- string variables that are categorical,
- string variables to process with textual analysis,
- variables to ignore. 
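
To see the failure concretely, here is a small sketch that feeds a frame with one string column to the same kind of pipeline. The toy data and column names (`income`, `state`) are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# toy data (assumption): one numeric column with a gap, one string column
toy = pd.DataFrame({'income': [40.0, 55.0, np.nan, 80.0],
                    'state':  ['PA', 'NJ', 'PA', 'NY']})
toy_y = [0, 1, 0, 1]

simple_pipe = make_pipeline(SimpleImputer(strategy='mean'),
                            StandardScaler(),
                            KNeighborsClassifier(n_neighbors=1))

try:
    simple_pipe.fit(toy, toy_y)
    failed = False
except ValueError as err:
    # the mean imputer cannot handle the string column
    failed = True
    print("fit failed:", err)
```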

The solution is to build a pipeline that can process different variables differently. Below, I get you set up using the assignment data. :)

In [1]:
import pandas as pd
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
from sklearn import metrics

# DL data
url = 'https://github.com/LeDataSciFi/lectures-spr2020/blob/master/assignment_data/Fannie_Mae_Plus_Data.gzip?raw=true'
fannie_mae = pd.read_csv(url,compression='gzip') 

# separate out y var
y = fannie_mae['Original_Interest_Rate']
fannie_mae.drop('Original_Interest_Rate',axis=1,inplace=True)

### Set up how each data type will get dealt with

Let's start with the continuous numeric variables. Here, I just try a few variables. 

In [2]:
num_features = ['Original_UPB', 'Original_Loan_Term','Original_Debt_to_Income_Ratio']
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])


Now the categorical features. Again, just a few variables.

In [3]:
cat_features = ['Property_type', 'Loan_purpose']
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore',sparse=False))])  # scikit-learn >= 1.2 renames sparse to sparse_output


### Combine the column-specific transformations with ColumnTransformer

In [4]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)])

In [5]:
preprocessor.fit(fannie_mae)
pd.DataFrame(preprocessor.transform(fannie_mae))

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.443151 | 0.642953 | -0.990988 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.101791 | 0.642953 | -0.639975 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | -0.615504 | -1.543334 | -0.201208 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | -1.121288 | -1.543334 | -1.429754 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | -1.277621 | -1.543334 | -2.044026 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 135033 | 0.101791 | -0.814572 | 1.027337 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 135034 | -0.872994 | 0.642953 | 0.500818 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 135035 | 0.460438 | 0.642953 | -1.166494 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 135036 | -0.882190 | 0.642953 | -1.254247 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
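
One thing to notice: the output only contains the columns we listed. By default `ColumnTransformer` uses `remainder='drop'`, so anything not named in a transformer disappears. A toy sketch (made-up data and column names) comparing that with `remainder='passthrough'`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy data (assumption); 'id' is never listed in a transformer below
toy = pd.DataFrame({'size':  [1.0, 2.0, 3.0, 4.0],
                    'color': ['red', 'blue', 'red', 'blue'],
                    'id':    [101, 102, 103, 104]})

steps = [('num', StandardScaler(), ['size']),
         ('cat', OneHotEncoder(handle_unknown='ignore'), ['color'])]

ct_drop = ColumnTransformer(steps)                            # default: remainder='drop'
ct_keep = ColumnTransformer(steps, remainder='passthrough')   # keep unlisted columns as-is

out_drop = ct_drop.fit_transform(toy)
out_keep = ct_keep.fit_transform(toy)
print(out_drop.shape)   # 1 scaled column + 2 one-hot columns
print(out_keep.shape)   # the same, plus the untouched 'id' column
```

Dropping is usually what you want for identifiers and dates, which is why "variables to ignore" needs no extra code.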


### This is ready to include in a pipeline with an estimator

It's as easy as: `make_pipeline(preprocessor, model_of_your_choice())`. 

For example:

In [6]:
# get numerical col names (line 1) + post-transform categorical names (line 2)
# note: scikit-learn >= 1.0 offers get_feature_names_out(), and 1.2 removes get_feature_names()
cols  = preprocessor.transformers_[0][2].copy()   # t_[0] is num trans, [2] item is col names, copy() so we don't change the underlying data structure!
cols += preprocessor.transformers_[1][1]['onehot']\
                     .get_feature_names(cat_features).tolist()  # t_[1] is cat trans, [1] is steps inside cat trans, get onehot, then pull the feature names

pd.DataFrame(preprocessor.transform(fannie_mae), columns=cols)

|  | Original_UPB | Original_Loan_Term | Original_Debt_to_Income_Ratio | Property_type_CO | Property_type_CP | Property_type_MH | Property_type_PU | Property_type_SF | Loan_purpose_C | Loan_purpose_P | Loan_purpose_R | Loan_purpose_U |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.443151 | 0.642953 | -0.990988 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.101791 | 0.642953 | -0.639975 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | -0.615504 | -1.543334 | -0.201208 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | -1.121288 | -1.543334 | -1.429754 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | -1.277621 | -1.543334 | -2.044026 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 135033 | 0.101791 | -0.814572 | 1.027337 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 135034 | -0.872994 | 0.642953 | 0.500818 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 135035 | 0.460438 | 0.642953 | -1.166494 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 135036 | -0.882190 | 0.642953 | -1.254247 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |


In [7]:
# combine preprocessor with estimator
pipe_reg = make_pipeline(preprocessor,
                        LinearRegression())
pipe_reg # look at it

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
                                                        

In [8]:
# could do this to check that the pipeline works
pipe_reg.fit(fannie_mae, y)
pipe_reg.predict(fannie_mae)

**That concludes step 1 of the pseudo code: the real-world pipeline is ready!**


## Optimizing the overly simple model above

Three reasons for doing this: 
- specifying `param_grid` is just a little different because the pipeline has steps with nested steps
- one more example of optimizing a model
- you'll see how I'll evaluate your final model

Optimizing this pipeline is just like the pseudo code above: set up the parameter grid, then the grid to search, then fit and save the optimized model to an object.

   

In [None]:
param_grid = {
             'columntransformer__num__imputer__strategy': ['mean', 'median','most_frequent']
             }

_Note how we accessed the column transformer, 2 underscores, then the num transformer inside it, 2 underscores,  then the imputer step, then the strategy parameter. I wouldn't have known to do this without looking at the `pipe_reg` output above._
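
If reading the printed pipeline is painful, `get_params()` lists every key spelled exactly the way `GridSearchCV` wants it. A minimal sketch, where the column names `x1` and `x2` are hypothetical placeholders:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

num_pipe = Pipeline([('imputer', SimpleImputer()),
                     ('scaler', StandardScaler())])
ct = ColumnTransformer([('num', num_pipe, ['x1', 'x2'])])   # hypothetical column names
pipe = make_pipeline(ct, LinearRegression())

# every tunable parameter of every nested step, joined with double underscores
all_keys = list(pipe.get_params().keys())

# filter to find the one we want, e.g. anything about the imputation strategy
keys = [k for k in all_keys if k.endswith('__strategy')]
print(keys)
```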

In [None]:
grid_search = GridSearchCV(pipe_reg, param_grid, cv=5,scoring='r2')
grid_search.fit(fannie_mae, y)
# grid_search.best_params_                   # examined this
opt_model_reg = grid_search.best_estimator_  # save best model to an actual model object

---

_START ASIDE: you can quickly check the model object's R2 in-sample (all of your data) and on the CV folds_

In [None]:
# how does this do insample?
print("In sample:          ",metrics.r2_score(y,
                                              opt_model_reg.predict(fannie_mae)
                                             ).round(3)) 

# lol this model generates negative R2 in the CV folds
print("Validation fold avg:",cross_validate(opt_model_reg,
                                            fannie_mae, y,
                                            scoring=['neg_mean_squared_error','r2']
                                           )
                                           ['test_r2'].mean().round(3))

Lol. You probably will want to include more variables.
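
A negative R2 is not a bug. R2 is $1 - SSE/SST$, so it drops below zero whenever the predictions do worse than always guessing the mean of y. A tiny made-up example:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])
mean_guess = np.full(3, y_true.mean())    # always predict the mean of y
bad_guess = np.array([3.0, 2.0, 1.0])     # gets the ordering backwards

r2_mean = r2_score(y_true, mean_guess)    # SSE equals SST, so R2 = 0
r2_bad = r2_score(y_true, bad_guess)      # SSE = 8, SST = 2, so R2 = 1 - 4 = -3
print(r2_mean, r2_bad)
```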

_END ASIDE_

---

Now, you can drop in the code from [the assignment instructions](https://github.com/LeDataSciFi/LeDataSciFi.github.io/blob/master/assignments/asgn06_pred.md) to save this model to a file I'll evaluate. Make sure to ONLY save your best model!
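
For reference, saving a fitted pipeline as an object generally looks like the sketch below. This is a generic `joblib` example, not the assignment's code (use the linked instructions for that); the synthetic data and the filename `best_model.joblib` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# stand-in for your optimized model (assumption: synthetic data)
X_demo, y_demo = make_regression(n_samples=50, n_features=3, random_state=0)
model = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)

path = os.path.join(tempfile.mkdtemp(), 'best_model.joblib')   # hypothetical filename
joblib.dump(model, path)        # save the fitted pipeline to disk

loaded = joblib.load(path)      # later, someone else loads it and predicts right away
same = np.allclose(loaded.predict(X_demo), model.predict(X_demo))
print(same)
```

Because the whole pipeline is saved, the person loading it can pass raw data and the preprocessing comes along for free.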

## Feature Selection - Exercise / Breakout time! 

Let's break off into groups and try to select which variables to include in a model.

Groups can try to implement any of the feature selection approaches listed at the top of this page.

### Exercise 1: add 5 new continuous variables to your pipeline and see how R2 changes

In [None]:
from sklearn.pipeline import Pipeline, make_pipeline
num_features = ['Original_UPB', 'Original_Loan_Term','Original_Debt_to_Income_Ratio', 'UNRATE', 'rGDP' , 'CPIAUCSL', 'Original_LTV_(OLTV)', 'TCMR']


In [None]:
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)])

In [None]:
preprocessor.fit(fannie_mae)
pd.DataFrame(preprocessor.transform(fannie_mae))

In [None]:

# get numerical col names (line 1) + post-transform categorical names (line 2)
# note: scikit-learn >= 1.0 offers get_feature_names_out(), and 1.2 removes get_feature_names()
cols  = preprocessor.transformers_[0][2].copy()   # t_[0] is num trans, [2] item is col names, copy() so we don't change the underlying data structure!
cols += preprocessor.transformers_[1][1]['onehot']\
                     .get_feature_names(cat_features).tolist()  # t_[1] is cat trans, [1] is steps inside cat trans, get onehot, then pull the feature names

pd.DataFrame(preprocessor.transform(fannie_mae), columns=cols)

In [None]:
pipe_reg = make_pipeline(preprocessor,
                        LinearRegression())
pipe_reg

In [None]:
param_grid = {
             'columntransformer__num__imputer__strategy': ['mean', 'median','most_frequent']
             }

In [None]:
grid_search = GridSearchCV(pipe_reg, param_grid, cv=5,scoring='r2')
grid_search.fit(fannie_mae, y)
# grid_search.best_params_                   # examined this
opt_model_reg = grid_search.best_estimator_


In [None]:
print("In sample:          ",metrics.r2_score(y,
                                              opt_model_reg.predict(fannie_mae)
                                             ).round(3)) 

# lol this model generates negative R2 in the CV folds
print("Validation fold avg:",cross_validate(opt_model_reg,
                                            fannie_mae, y,
                                            scoring=['neg_mean_squared_error','r2']
                                           )
                                           ['test_r2'].mean().round(3))

### Exercise 2: add 2 new categorical variables

In [None]:
# 'Loan_indentifier' is left out: an ID column would one-hot into roughly one column per loan
cat_features = ['Property_type', 'Loan_purpose', 'Occupancy_type', 'Mortgage_Insurance_type']
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore',sparse=False))])  # scikit-learn >= 1.2 renames sparse to sparse_output

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)])

In [None]:
pipe_reg = make_pipeline(preprocessor,
                        LinearRegression())
pipe_reg

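In [None]:
param_grid = {
             'columntransformer__num__imputer__strategy': ['mean', 'median','most_frequent']
             }

In [None]:
# re-run the search so opt_model_reg reflects the new pipeline, not the one from exercise 1
grid_search = GridSearchCV(pipe_reg, param_grid, cv=5,scoring='r2')
grid_search.fit(fannie_mae, y)
opt_model_reg = grid_search.best_estimator_
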
In [None]:
print("In sample:          ",metrics.r2_score(y,
                                              opt_model_reg.predict(fannie_mae)
                                             ).round(3)) 

# lol this model generates negative R2 in the CV folds
print("Validation fold avg:",cross_validate(opt_model_reg,
                                            fannie_mae, y,
                                            scoring=['neg_mean_squared_error','r2']
                                           )
                                           ['test_r2'].mean().round(3))

### Find variables to ignore (ones that do not affect R2)

In [None]:
ignore = ['Loan_indentifier', 'Qdate', 'Origination_date']
dumb_nums = fannie_mae.select_dtypes('number').columns.to_list()
dumb_cats = [ele for ele in fannie_mae.columns if ele not in dumb_nums]
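
The same "split by dtype" idea can live inside the `ColumnTransformer` itself via `make_column_selector`, so the column lists never go stale. A sketch on toy data with made-up column names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy data (assumption)
toy = pd.DataFrame({'size':  [1.0, 2.0, 3.0],
                    'rooms': [2, 3, 4],
                    'color': ['red', 'blue', 'red']})

num_cols = make_column_selector(dtype_include='number')   # picks numeric columns at fit time
cat_cols = make_column_selector(dtype_exclude='number')   # picks everything else

ct = ColumnTransformer([('num', StandardScaler(), num_cols),
                        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)])
out = ct.fit_transform(toy)

print(num_cols(toy), cat_cols(toy))   # the selectors can also be called directly
print(out.shape)
```

The catch is the one this exercise is about: a dtype-based selector happily grabs identifiers and dates too, so drop the columns you want to ignore before fitting.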

In [None]:
num_features = [col for col in dumb_nums if col not in ignore]  # one starting point; add/remove vars
cat_features = ['Property_type', 'Loan_purpose']                # add vars (candidates: dumb_cats, minus ignore)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)])
pipe_reg = make_pipeline(preprocessor,
                        LinearRegression() # add SelectFromModel before this step
                        )
pipe_reg.fit(fannie_mae, y)   # fit the new pipeline before scoring it
print("In sample:          ",metrics.r2_score(y,
                                              pipe_reg.predict(fannie_mae)
                                             ).round(3)) 

# compare with the average R2 across the CV folds
print("Validation fold avg:",cross_validate(pipe_reg,
                                            fannie_mae, y,
                                            scoring=['neg_mean_squared_error','r2']
                                           )
                                           ['test_r2'].mean().round(3))

In [1]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# add a selection preprocessing step
pipe_reg = make_pipeline(preprocessor,
                        SelectFromModel(LassoCV()),     # turn this on/off to see diff                    
                        LinearRegression())

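
To see what the selection step actually does, here is a sketch of the same three-step pipeline on synthetic data, where only a few of the features matter by construction. The data, the `StandardScaler` standing in for `preprocessor`, and `cv=5` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data (assumption): 20 features, only 4 carry signal
X_demo, y_demo = make_regression(n_samples=300, n_features=20, n_informative=4,
                                 noise=5.0, random_state=0)

pipe = make_pipeline(StandardScaler(),                 # stand-in for the preprocessor
                     SelectFromModel(LassoCV(cv=5)),   # keep features with non-zero lasso coefs
                     LinearRegression())               # final model sees only the kept features
pipe.fit(X_demo, y_demo)

kept = pipe.named_steps['selectfrommodel'].get_support()   # boolean mask of kept features
n_coefs = len(pipe.named_steps['linearregression'].coef_)
print(f"kept {kept.sum()} of {kept.size} features")
```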