# Introduction
Often you will need to apply several preprocessing methods before feeding data into an estimator. One approach is to write a set of custom functions that preprocess the data first. However, this makes hyperparameter search difficult, especially when the parameters that need tuning belong to the preprocessing steps, such as the number of components in a PCA or the n-gram range in a TF-IDF vectorizer.

To simplify and better manage the process, you can use FeatureUnion and Pipeline. In addition, sometimes the out-of-the-box processing functions are not enough and you want to incorporate custom functions into the pipeline. This is where writing your own *Scikit-Learn style* transformer comes into play. If your transformer conforms to the Scikit-Learn interface, you can include any custom data processing function in your pipeline.
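As a minimal sketch of why this helps with hyperparameter search, the cell below tunes the number of PCA components as part of a Pipeline. The toy data, the step names `'pca'` and `'regress'`, and the candidate grid are assumptions chosen for illustration only; they are not part of the example developed in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data; sample size and random_state are arbitrary illustrative choices.
X_search, y_search = make_regression(n_samples=200, n_features=5,
                                     n_informative=2, noise=1.0,
                                     random_state=0)

search_pipe = Pipeline([
    ('pca', PCA()),
    ('regress', LinearRegression()),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
param_grid = {'pca__n_components': [1, 2, 3, 4, 5]}

# The preprocessing parameter is tuned together with the model,
# and PCA is refit on each training fold.
search = GridSearchCV(search_pipe, param_grid, cv=3)
search.fit(X_search, y_search)

best_n_components = search.best_params_['pca__n_components']
print('Best number of components:', best_n_components)
```

Because the preprocessing step lives inside the pipeline, the search treats its parameters like any other model parameter.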

This notebook showcases how to use FeatureUnion and Pipeline, and how to write your own Scikit-Learn style transformer.

In [65]:
# custom transformer needs to inherit from these two base classes
from sklearn.base import BaseEstimator, TransformerMixin

import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LinearRegression

from sklearn.datasets import make_regression

We will use dummy data in the form of a dataframe. The dummy data has 5 features, but only 2 of them are informative: in this case, features 'a' and 'e'.

In [107]:
# coef=True also returns the true coefficients used to generate the target
features, target, coef = make_regression(n_features=5, n_informative=2, noise=1.0,
                                         coef=True, random_state=1)
df = pd.DataFrame(features, columns=['a', 'b', 'c', 'd', 'e'])
df['target'] = target

print('Underlying coefficients:', coef)

df.head()

Underlying coefficients: [  7.6188809    0.           0.           0.          20.64540589]


|   | a | b | c | d | e | target |
|---|---|---|---|---|---|---|
| 0 | 1.252868 | 0.51293 | 0.488518 | -0.298093 | -0.754398 | -5.859529 |
| 1 | 0.502741 | 1.558806 | -1.219744 | 0.109403 | 1.61695 | 34.985791 |
| 2 | -1.114871 | -0.76731 | 1.460892 | 0.674571 | 0.515414 | 2.808538 |
| 3 | 2.137828 | -0.785534 | 0.71479 | -1.755926 | 0.564383 | 28.453276 |
| 4 | -2.037201 | -1.942589 | -2.114164 | -2.506441 | -0.618037 | -28.173472 |


To create your own custom transformer class, you need to implement, at a minimum, a fit method and a transform method. If your transformer needs to "remember" a setting, store it as an attribute when the object is initialized. Here the transformer remembers the "key" value passed at initialization and uses it to select columns. The fit method has nothing to learn, so it simply returns self. The transform method is the workhorse of the class and returns the selected columns of the dataframe.

In this custom transformer example, we simply select certain column(s) from the dataframe as the X-inputs to the linear regression model, with the 'target' column as our target.

Beware that Scikit-Learn estimators require a two-dimensional input, even if you are using only a single feature to fit the model!
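The difference is easy to see on a tiny dataframe. The sketch below uses made-up data (`demo_2d` is a hypothetical name, not part of this notebook's dataset) and assumes a recent Scikit-Learn version, where a one-dimensional input is rejected with a ValueError.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data where target = 2 * a + 1.
demo_2d = pd.DataFrame({'a': [0.0, 1.0, 2.0, 3.0],
                        'target': [1.0, 3.0, 5.0, 7.0]})

one_dim = demo_2d['a']      # Series, shape (4,)
two_dim = demo_2d[['a']]    # DataFrame, shape (4, 1)
print(one_dim.shape, two_dim.shape)

# Selecting with a list of column names keeps the input two-dimensional.
model_2d = LinearRegression().fit(two_dim, demo_2d['target'])

# Selecting with a bare column name gives a 1-D input, which is rejected.
try:
    LinearRegression().fit(one_dim, demo_2d['target'])
    raised = False
except ValueError:
    raised = True
print('1-D input rejected:', raised)
```

This is why the selector defined next insists on receiving its keys as a list.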

In [48]:
class itemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        # Make sure this transformer feeds subsequent steps a two dimensional array
        # by requiring the key(s) to be in a list.
        assert isinstance(key, list), 'Key(s) selected need(s) to be in a list.'
        self.key = key

    def fit(self, df, y=None):
        # We don't need fit to do anything, simply return self.
        return self

    def transform(self, df, y=None):
        # Return the selected columns from the dataframe.
        return df[self.key]
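The two base classes provide extra behaviour for free, which a short standalone sketch can show. To keep the cell runnable on its own, it defines a compact stand-in selector under the hypothetical name `DemoSelector` and uses a made-up dataframe; neither is part of the original notebook.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical compact stand-in for the selector, for illustration only.
class DemoSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, df, y=None):
        return self

    def transform(self, df, y=None):
        return df[self.key]

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'e': [7, 8, 9]})

selector = DemoSelector(key=['a', 'e'])

# fit_transform is supplied by TransformerMixin: it calls fit, then transform.
selected = selector.fit_transform(toy)
print(selected.shape)

# get_params is supplied by BaseEstimator and reads the __init__ arguments,
# which is what lets Pipeline and grid search inspect and set them.
print(selector.get_params())
```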

Pipeline in Scikit-Learn lets you chain transforming steps in a linear series, with the final step being an estimator. 

Similarly, FeatureUnion "chains" transformers, but in parallel. It aggregates the outputs from each transformer into a new feature space. Any subsequent steps will use this newly created feature space as their inputs.
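A minimal sketch of this parallel behaviour, using two built-in transformers on random data (the data, the step names, and the choice of StandardScaler and PCA are assumptions for illustration): each transformer sees the same input, and the outputs are stacked side by side.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_union = rng.normal(size=(10, 4))

demo_union = FeatureUnion([
    ('scaled', StandardScaler()),     # outputs 4 columns
    ('pca', PCA(n_components=2)),     # outputs 2 columns
])

# Both transformers are fit on the same input; outputs are concatenated column-wise.
combined = demo_union.fit_transform(X_union)
print(combined.shape)  # (10, 6)
```

The first four columns of `combined` are the scaled features and the last two are the principal components, in the order the transformers were listed.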

In our example, since we know only 'a' and 'e' are useful, we use the custom class "itemSelector" above to select only those columns from our dataframe, with one selector per column inside a FeatureUnion. The FeatureUnion then feeds its combined output (i.e. the 'a' and 'e' columns) into the Pipeline's LinearRegression step.

In [108]:
pipe = Pipeline([
        ('union',FeatureUnion([('first',itemSelector(key=['a'])),
                               ('second',itemSelector(key=['e']))])),
        ('regress',LinearRegression())
        ])

pipe.fit(df,df['target'])
print('Intercept:', pipe.named_steps['regress'].intercept_)
print('Coefficients:', pipe.named_steps['regress'].coef_)

Intercept: 0.122657969284
Coefficients: [  7.52398081  20.58428489]


The fitted model has only 2 coefficients, which shows that our itemSelector() custom transformer, FeatureUnion, and Pipeline worked as intended: the pipeline ignored the other 3 features that were not selected.

The model recovered coefficients (7.52 and 20.58) that are very close to the underlying coefficients used to generate the data (7.62 and 20.65).