# Creating some ML pipelines

### Introduction

I thought I'd do a quick write up on how you can build some simple and effective ML pipelines using sklearn.

I discovered the pipeline/gridsearch combo a few weeks ago after sending off some of my code for review.
In sending off my code I realized that there were a few things that I had tweaked for performance, but that weren't obvious to the reviewer.

- I had median imputed some variables (continuous features), while other variables were filled by mode
- I did a large amount of feature engineering, only to use a subset of those features in my model building (they were the best I swear)

So even though I did some work to get to that reviewed copy, these experiments that I went through during the process wouldn't be easy to understand unless the reviewer looked at all my commits etc.

But then I found **pipelines / gridsearch** and all was good in the world.

**Pipelines** let you combine all your feature engineering / pre-processing / modelling into one object

**Gridsearch** then lets you test all your assumptions / hyperparameters to find out which combinations generate the best result

I haven't seen many write ups about them so I thought I'd do one myself.

**References:**

- http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
- http://scikit-learn.org/stable/modules/pipeline.html
- http://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
- https://stackoverflow.com/questions/33091376/python-what-is-exactly-sklearn-pipeline-pipeline
- http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html
- https://michelleful.github.io/code-blog/2015/06/20/pipelines/

# Libraries + MAE

In [1]:
import numpy as np
import pandas as pd
import lightgbm as lgb
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest, VarianceThreshold
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import Imputer, PolynomialFeatures, StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
import os
import time

import warnings
warnings.filterwarnings("ignore")

def MAE(y, ypred):
    # Mean absolute error: the average of |y - ypred| over all rows
    y, ypred = np.asarray(y), np.asarray(ypred)
    return np.mean(np.abs(y - ypred))
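A hand-rolled metric like this should agree with `sklearn.metrics.mean_absolute_error`. Here's a quick sanity check on made-up numbers; `mae_reference` is a hypothetical stand-in for the `MAE` helper so the snippet runs on its own, and the arrays are illustrative rather than competition data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def mae_reference(y, ypred):
    """Stand-in for the MAE helper: mean of the absolute differences."""
    y = np.asarray(y, dtype=float)
    ypred = np.asarray(ypred, dtype=float)
    return np.mean(np.abs(y - ypred))

# Toy values, purely illustrative
y_true = np.array([0.1, -0.2, 0.05, 0.0])
y_hat = np.array([0.0, -0.1, 0.05, 0.2])

ours = mae_reference(y_true, y_hat)
theirs = mean_absolute_error(y_true, y_hat)
print(ours, theirs)
```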

# Data read in and prep
### Making sure the data is in a lightGBM friendly format

In [2]:
train = pd.read_csv("../input/train_2016_v2.csv")
properties = pd.read_csv('../input/properties_2016.csv')

# Downcast float64 columns to float32 to cut memory use
for c, dtype in zip(properties.columns, properties.dtypes):
    if dtype == np.float64:
        properties[c] = properties[c].astype(np.float32)

df_train = (train.merge(properties, how='left', on='parcelid')
            .drop(['parcelid', 'transactiondate', 'propertyzoningdesc', 
                         'propertycountylandusecode', 'fireplacecnt', 'fireplaceflag'], axis=1))

train_columns = df_train.columns 
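The float64 to float32 downcast in the cell above is mostly about memory. A rough sketch of the effect on a made-up frame (the column names and sizes are only illustrative, and `astype` with a dict is just one way to do the cast):

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the properties table
toy = pd.DataFrame({
    "taxamount": np.random.rand(1000),            # float64 by default
    "bedroomcnt": np.random.rand(1000),           # float64 by default
    "parcelid": np.arange(1000, dtype=np.int64),  # integer column, left untouched
})

before = toy.memory_usage(deep=True).sum()

# Find the float64 columns once, then cast only those
float_cols = toy.select_dtypes(include="float64").columns
toy = toy.astype({c: np.float32 for c in float_cols})

after = toy.memory_usage(deep=True).sum()
print(before, after)
print(toy.dtypes)
```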

# Splitting the dataset to train/valid sets

In [3]:
# First 20,000 rows for validation, everything after that for training
valid = df_train.iloc[:20000, :]
train = df_train.iloc[20000:, :]

y_train = train['logerror'].values
y_valid = valid['logerror'].values

x_train = train.drop('logerror', axis = 1)
x_valid = valid.drop('logerror', axis = 1)

idVars = [i for e in ['id',  'flag', 'has'] for i in list(train_columns) if e in i] + ['fips', 'hashottuborspa']
countVars = [i for e in ['cnt',  'year', 'nbr', 'number'] for i in list(train_columns) if e in i]
taxVars = [col for col in train_columns if 'tax' in col and 'flag' not in col]
          
ttlVars = idVars + countVars + taxVars
dropVars = [i for e in ['census',  'tude', 'error'] for i in list(train_columns) if e in i]
contVars = [col for col in train_columns if col not in ttlVars + dropVars]

for c in x_train.dtypes[x_train.dtypes == object].index.values:
    x_train[c] = (x_train[c] == True)
    
for c in x_valid.dtypes[x_valid.dtypes == object].index.values:
    x_valid[c] = (x_valid[c] == True)   
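The variable groups above are built by substring matching on column names. Here's a small sketch of how that behaves on a hand-picked subset of names, so the output won't match the full dataset. One thing it surfaces: a name that matches a pattern and is also appended by hand shows up twice. The order-preserving de-duplication at the end is my own addition, and `match_any` is a hypothetical helper name.

```python
# A hand-picked subset of column names, for illustration only
cols = ["airconditioningtypeid", "hashottuborspa", "taxdelinquencyflag",
        "bedroomcnt", "yearbuilt", "taxamount", "latitude", "logerror"]

def match_any(columns, patterns):
    """Columns whose name contains a pattern, collected pattern by pattern."""
    return [c for p in patterns for c in columns if p in c]

# Same shape as the idVars line: pattern matches plus a manually appended name
id_like = match_any(cols, ["id", "flag", "has"]) + ["hashottuborspa"]
print(id_like)

# dict.fromkeys keeps first-seen order while dropping repeats
id_like_unique = list(dict.fromkeys(id_like))
print(id_like_unique)
```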

# The first pipeline

Since everyone is using LightGBM, I'll use that too.
Initially we'll just look at the continuous variables in model building, but we'll extend that out too.

So let's start with the easy pipeline that:

- Imputes the missing values with the median
- Selects the best 5 features
- Builds a LightGBM model

In [4]:
print(contVars)

x_train_cont = x_train[contVars]
x_valid_cont = x_valid[contVars]

In [5]:
pipeline = Pipeline(
                    [('imp', Imputer(missing_values='NaN', strategy = 'median', axis=0)),
                     ('feat_select', SelectKBest(k = 5)),
                     ('lgbm', LGBMRegressor())
                     
])

pipeline.fit(x_train_cont, y_train)   

y_pred = pipeline.predict(x_valid_cont)
print('MAE on validation set: %s' % (round(MAE(y_valid, y_pred), 5)))
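If you're on a recent scikit-learn, `Imputer` no longer exists in `sklearn.preprocessing`; `SimpleImputer` from `sklearn.impute` is the replacement. Here's a hedged sketch of the same three-step idea on synthetic data. `GradientBoostingRegressor` is swapped in for LightGBM only so the example has no extra dependency, and `f_regression` is passed explicitly to `SelectKBest` because its default, `f_classif`, is designed for classification targets. It also shows how to peek inside the fitted pipeline to see which columns survived the selection step.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)

# Synthetic data: 200 rows, 8 features, only the first two drive the target
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Knock out roughly 10% of the values so the imputer has something to do
X[rng.rand(*X.shape) < 0.1] = np.nan

sketch = Pipeline([
    ("imp", SimpleImputer(strategy="median")),
    ("feat_select", SelectKBest(score_func=f_regression, k=2)),
    ("model", GradientBoostingRegressor(random_state=0)),  # stand-in for LGBMRegressor
])
sketch.fit(X, y)

# Boolean mask over the 8 input columns: True where the column was kept
kept = sketch.named_steps["feat_select"].get_support()
print(kept)
print(sketch.predict(X[:3]))
```

Each step is fitted on the output of the previous one, so the selector only ever sees imputed data and the model only ever sees the selected columns.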

## Pipeline 2.0 - oh hai there gridsearch

But from the above code we have made a few assumptions that haven't been tested.

**We assume that:**
- Median is the best way of imputing the variables
- Only 5 variables are needed for the lowest error

We don't need to assume either of these. We can test them and find out which combination actually results in the lowest error.

In [6]:
pipeline = Pipeline(
                    [('imp', Imputer(missing_values='NaN', axis=0)),
                     ('feat_select', SelectKBest()),
                     ('lgbm', LGBMRegressor())
                     
])

parameters = {}
parameters['imp__strategy'] = ['mean', 'median', 'most_frequent']
parameters['feat_select__k'] = [5, 10]

CV = GridSearchCV(pipeline, parameters, scoring = 'neg_mean_absolute_error', n_jobs= 1)
CV.fit(x_train_cont, y_train)   

print('Best score and parameter combination = ')

print(CV.best_score_)    
print(CV.best_params_)    

y_pred = CV.predict(x_valid_cont)
print('MAE on validation set: %s' % (round(MAE(y_valid, y_pred), 5)))
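The keys in the `parameters` dict follow the pattern `<step name>__<parameter name>`, and `pipeline.get_params()` lists every name you're allowed to tune. The sketch below shows this on synthetic data, assuming a recent scikit-learn (`SimpleImputer` instead of `Imputer`) and using `Ridge` as a lightweight stand-in for the LightGBM step. With `neg_mean_absolute_error` the reported `best_score_` is the negated MAE, so values closer to zero are better.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(42)
X = rng.normal(size=(150, 6))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=150)
X[rng.rand(*X.shape) < 0.1] = np.nan  # inject some missing values

pipe = Pipeline([
    ("imp", SimpleImputer()),
    ("feat_select", SelectKBest(score_func=f_regression)),
    ("model", Ridge()),  # stand-in for LGBMRegressor
])

# Every tunable name, e.g. 'imp__strategy' or 'feat_select__k'
tunable = sorted(pipe.get_params().keys())
print([name for name in tunable if "__" in name])

param_grid = {
    "imp__strategy": ["mean", "median", "most_frequent"],
    "feat_select__k": [2, 4],
}
search = GridSearchCV(pipe, param_grid, scoring="neg_mean_absolute_error", cv=3)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # flip the sign to read it as a plain MAE
# 3 strategies x 2 values of k = 6 candidate combinations
print(len(search.cv_results_["params"]))
```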

Interesting, I never thought that imputing with the mode would come out on top.

But since we're here, let's also test which imputation policy is best for the tax variables. First I'll quickly write a column extractor that plays nicely with the pipeline.

These can look hard at first, but they definitely get easier as you write a few. 
And since we're writing some code we should probably do some testing to make sure that it works the way we think it will.

# Column extractor

## Takes a list of columns and returns a df with those cols

In [11]:
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, subset):
        self.subset = subset

    def transform(self, X, *_):
        return X.loc[:, self.subset]

    def fit(self, *_):
        return self

# Testing

Since we've already created x_train_cont I'll test the column extractor on this case

In [12]:
contExtractor = ColumnSelector(contVars)
x_train_cont_test = contExtractor.transform(x_train).head()

x_train_cont.head().equals(x_train_cont_test)
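A slightly stronger check is to confirm the transformer also behaves inside a `Pipeline` and survives `sklearn.base.clone`, which is what `GridSearchCV` does to the estimator before each fit. A small sketch on a made-up frame (the class is repeated so the snippet runs on its own, and the column names are invented):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Same idea as the selector above, repeated to keep this self-contained."""
    def __init__(self, subset):
        self.subset = subset

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.loc[:, self.subset]

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0], "c": [0, 1, 0]})

selector = ColumnSelector(["a", "b"])
picked = selector.fit_transform(toy)
print(list(picked.columns))

# clone() rebuilds the object from get_params(), so __init__ has to store
# its arguments unchanged under the same attribute names
selector_copy = clone(selector)
print(selector_copy.subset)

# Chained with a scaler: select two columns, then standardize them
scaled = Pipeline([("select", selector), ("scale", StandardScaler())]).fit_transform(toy)
print(scaled.shape)
```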

# Pipeline 3.0 - taxes

So let's use the ColumnSelector we created earlier to run the same analysis on the tax variables.

In [14]:
pipeline = Pipeline([
                    ('tax_dimension', ColumnSelector(taxVars)),
                    ('imp', Imputer(missing_values='NaN', axis=0)),
                    ('column_purge', SelectKBest()),
                    ('lgbm', LGBMRegressor())
                     
])

parameters = dict(imp__strategy=['mean', 'median', 'most_frequent'],
                  column_purge__k=[5, 2, 1])

CV = GridSearchCV(pipeline, parameters, scoring = 'neg_mean_absolute_error', n_jobs= 1)
CV.fit(x_train, y_train)   

print(CV.best_params_)    
print(CV.best_score_)    

y_pred = CV.predict(x_valid)
print('MAE on validation set: %s' % (round(MAE(y_valid, y_pred), 5)))

# Pipeline 4.0 - contVars + taxes (FeatureUnion intro)

Let's use FeatureUnion to apply different pre-processing pipelines to different types of variables

In [16]:
pipeline = Pipeline([
        
    ('unity', FeatureUnion(
        transformer_list=[

            ('cont_portal', Pipeline([
                ('selector', ColumnSelector(contVars)),
                ('cont_imp', Imputer(missing_values='NaN', strategy = 'median', axis=0)),
                ('scaler', StandardScaler())
            ])),
            ('tax_portal', Pipeline([
                ('selector', ColumnSelector(taxVars)),
                ('tax_imp', Imputer(missing_values='NaN', strategy = 'most_frequent', axis=0)),
                ('scaler', MinMaxScaler(copy=True, feature_range=(0, 3)))
            ])),
        ],
    )),
    ('column_purge', SelectKBest(k = 5)),    
    ('lgbm', LGBMRegressor()),
])

parameters = {}
parameters['column_purge__k'] = [5, 10]

grid = GridSearchCV(pipeline, parameters, scoring = 'neg_mean_absolute_error', n_jobs= 2)
grid.fit(x_train, y_train)   

print('Best score and parameter combination = ')

print(grid.best_score_)    
print(grid.best_params_)    

y_pred = grid.predict(x_valid)
print('MAE on validation set: %s' % (round(MAE(y_valid, y_pred), 5)))
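The mental model for `FeatureUnion` is that each branch transforms the same input independently and the results are stacked side by side, so the column count of the output is the sum of the branch outputs. A minimal sketch on a made-up frame, assuming a recent scikit-learn (`SimpleImputer` in place of `Imputer`); the column names and the repeated `ColumnSelector` are only there to keep the snippet self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Returns only the requested columns of a DataFrame."""
    def __init__(self, subset):
        self.subset = subset

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.loc[:, self.subset]

# Invented columns: two "continuous" ones and one "tax" one, each with a gap
toy = pd.DataFrame({
    "sqft": [1200.0, np.nan, 1800.0, 2400.0],
    "lotsize": [5000.0, 6500.0, np.nan, 8000.0],
    "taxamount": [3000.0, 4500.0, 4500.0, np.nan],
})

union = FeatureUnion([
    ("cont", Pipeline([
        ("selector", ColumnSelector(["sqft", "lotsize"])),
        ("imp", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])),
    ("tax", Pipeline([
        ("selector", ColumnSelector(["taxamount"])),
        ("imp", SimpleImputer(strategy="most_frequent")),
        ("scaler", MinMaxScaler(feature_range=(0, 3))),
    ])),
])

combined = union.fit_transform(toy)
print(combined.shape)   # 2 continuous columns + 1 tax column
print(combined[:, 2])   # the tax branch, rescaled into [0, 3]
```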

# Finished with your model and want to push it to prod?

You can use the joblib library to serialize this best case pipeline and push it to a .pkl file.

You can easily move this to your prod server, re-open it, and your pipeline will work like a charm (assuming no changes in data types, and that's a pretty big IF).

When new data comes through, you can load the pickle and use the model you built to score it.

In [17]:
from sklearn.externals import joblib
joblib.dump(grid.best_estimator_, 'rick.pkl')
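For completeness, here's the round trip: dump a fitted pipeline, load it back, and check that the reloaded copy predicts the same thing. This is a sketch on synthetic data. It imports `joblib` directly (on recent scikit-learn versions `sklearn.externals.joblib` is gone), writes to a temporary directory, and uses a plain `LinearRegression` pipeline as a stand-in for the tuned one.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])

fitted = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),  # stand-in for the tuned pipeline
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipeline.pkl")
    joblib.dump(fitted, path)      # what you'd do on the training box
    reloaded = joblib.load(path)   # what you'd do on the prod server

# The reloaded pipeline should reproduce the original predictions
same = np.allclose(fitted.predict(X), reloaded.predict(X))
print(same)
```

The pickle only stores the fitted object, not the library code, so the loading environment needs compatible versions of scikit-learn, and any custom classes such as `ColumnSelector` must be importable there.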

# Summary

In hindsight, I'd be surprised if anyone used these approaches right out of the box to win this competition.

BUT, hopefully you see some duplication in your feature engineering / modelling pipeline and that this type of set up might help you re-use some of that code for a better result in your future comps.

Miller out.

PS: If you want any of this explained just add a comment and I'll get around to it when I can.