# Introduction
You will often need to apply several preprocessing methods before feeding data into an estimator. One approach is to write a set of custom functions that preprocess the data beforehand. However, this makes hyperparameter search difficult, especially when the parameters that need tuning belong to the preprocessing steps, such as the number of components in a PCA or the n-gram range in a TF-IDF vectorizer.

To simplify and better manage this process, you can use FeatureUnion and Pipeline. Sometimes the out-of-the-box processing functions are not enough and you want to incorporate custom functions into the pipeline. This is where writing your own *Scikit-Learn style* transformer comes into play. If your transformer conforms to the Scikit-Learn interface, you can include any custom data processing function in your pipeline.
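
As a rough illustration of why this helps, the sketch below tunes a preprocessing parameter through a Pipeline. The synthetic data, the step names `pca` and `regress`, and the candidate values for `n_components` are all assumptions made for this example; they are not part of the notebook's own code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption for illustration): 60 samples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=60)

demo_pipe = Pipeline([
    ('pca', PCA()),
    ('regress', LinearRegression()),
])

# Parameters of a step are addressed as '<step name>__<parameter name>'.
param_grid = {'pca__n_components': [1, 2, 3, 5]}

# The search refits the whole pipeline, preprocessing included, per candidate.
search = GridSearchCV(demo_pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the PCA step lives inside the pipeline, its parameter can be searched together with the estimator instead of being fixed in a separate preprocessing function.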

This notebook shows how to use FeatureUnion and Pipeline, and how to write your own Scikit-Learn style transformer.

In [161]:
# A custom transformer should inherit from these two base classes:
# BaseEstimator provides get_params/set_params, TransformerMixin provides fit_transform.
from sklearn.base import BaseEstimator, TransformerMixin

import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LinearRegression

We will use dummy data in the form of a dataframe: four columns of numerical data, filled with 1s, 2s, 3s, and 4s respectively. This is intentional, because the constant values make it easy to confirm that our custom transformer worked as intended in a linear regression model.

In [162]:
# Each column holds a single constant value repeated 100 times.
df = pd.DataFrame({
    'First': np.repeat(1, 100),
    'Second': np.repeat(2, 100),
    'Third': np.repeat(3, 100),
    'Fourth': np.repeat(4, 100),
})

To create your own custom transformer class, you need to implement, at a minimum, a fit and a transform method. Any setting the transformer needs to "remember" should be stored as an attribute during object initialization; values learned from the data, by contrast, are conventionally set in fit. Here the transformer stores the "key" passed at initialization and uses it to select columns. The fit method has nothing to learn, so it simply returns self. The transform method is the workhorse of the class: it returns the selected columns of the dataframe.

In this custom transformer example, we will simply select certain column(s) from the dataframe as our X-inputs into the linear regression model, with the fourth column as our target. 

Beware that in Scikit-Learn, estimators require a two-dimensional input, even if you are using only a single feature to fit the model!
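
A quick way to see the difference, as a minimal sketch on a small stand-in dataframe (the name `demo` is used only for this example): selecting with a string returns a one-dimensional Series, while selecting with a list returns a two-dimensional DataFrame.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'First': np.repeat(1, 100)})

one_dim = demo['First']      # Series: a single axis
two_dim = demo[['First']]    # DataFrame: rows and columns

print(one_dim.shape)  # (100,)
print(two_dim.shape)  # (100, 1)
```

This is why the transformer below insists on receiving its keys as a list.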

In [164]:
class itemSelector(BaseEstimator,TransformerMixin):
    def __init__(self,key):
        # Make sure this transformer is feeding subsequent steps a two dimensional array by requiring the keys
        # to be in a list.
        assert isinstance(key, list), 'Key(s) selected need(s) to be in a list.'
        self.key = key
    def fit(self,df,y=None):
        # We don't need fit to do anything, simply return self.
        return self
    def transform(self,df,y=None):
        # Return the selected columns from the dataframe.
        return df[self.key]
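
To try the transformer on its own, the sketch below restates the class so the snippet runs in isolation and applies it to a small stand-in dataframe (`demo` is a name used only here). `fit_transform` comes from TransformerMixin and `get_params` from BaseEstimator, which is the practical reason for inheriting from both.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class itemSelector(BaseEstimator, TransformerMixin):
    # Restated here so this snippet is self-contained.
    def __init__(self, key):
        assert isinstance(key, list), 'Key(s) selected need(s) to be in a list.'
        self.key = key

    def fit(self, df, y=None):
        # Nothing to learn from the data.
        return self

    def transform(self, df, y=None):
        # Selecting with a list keeps the result two-dimensional.
        return df[self.key]


demo = pd.DataFrame({'First': np.repeat(1, 100), 'Second': np.repeat(2, 100)})

selector = itemSelector(key=['First'])
selected = selector.fit_transform(demo)   # fit, then transform, in one call

print(selected.shape)         # (100, 1)
print(selector.get_params())  # {'key': ['First']}
```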

A Pipeline in Scikit-Learn lets you chain transformation steps in sequence, with the final step being an estimator.

Similarly, FeatureUnion "chains" transformers, but in parallel. Each transformer receives the same input, and FeatureUnion concatenates their outputs side by side into a new feature space. Any subsequent steps will use this newly created feature space as their inputs.
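
As a minimal sketch of this parallel behavior, the example below combines two built-in transformers, PCA and StandardScaler, on made-up data. The data, the variable names, and the step names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# Made-up input: 10 samples, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 3))

union = FeatureUnion([
    ('pca', PCA(n_components=1)),   # contributes 1 column
    ('scaled', StandardScaler()),   # contributes 3 columns
])

# Both transformers are fitted on the same input; outputs are stacked column-wise.
combined = union.fit_transform(X_demo)
print(combined.shape)  # (10, 4)
```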

In our example, we will use the above custom class "itemSelector" inside a FeatureUnion to select the 'First' and 'Third' columns from our dataframe. The FeatureUnion then feeds the new output space (i.e. the 'First' and 'Third' columns) into the Pipeline's LinearRegression step.

In [171]:
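pipe = Pipeline([
    # Select 'First' and 'Third' in parallel and place them side by side.
    ('union', FeatureUnion([
        ('first', itemSelector(key=['First'])),
        ('third', itemSelector(key=['Third'])),
    ])),
    # No intercept, so the coefficients alone must reproduce the target.
    ('regress', LinearRegression(fit_intercept=False)),
])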

We will then fit the Pipeline on our dataframe, using 'First' and 'Third' as features and 'Fourth' as our target. We do not fit an intercept, so we can confirm that the fitted coefficients, multiplied by the 'First' and 'Third' values, add up to 4.

In [172]:
pipe.fit(df,df['Fourth'])
pipe.named_steps['regress'].coef_

array([ 0.4,  1.2])

The fitted linear regression formula is **0.4 \* df['First'] + 1.2 \* df['Third'] = df['Fourth']**, which holds because our 'First' column is simply a column of 1s, our 'Third' column is a column of 3s, and our 'Fourth' column is a column of 4s. In other words:

**0.4 \* 1.0 + 1.2 \* 3.0 = 4.0**

This shows that our itemSelector() custom transformer worked as intended, with df['Second'] being completely ignored in the model fitting.
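
One subtlety worth noting: because every row of the inputs is identical, many coefficient pairs reproduce the target exactly, for example (4, 0) or (1, 1). The sketch below uses NumPy's `np.linalg.lstsq`, which returns the minimum-norm least-squares solution, to show that (0.4, 1.2) is that particular solution. It is meant as an illustration of the arithmetic, not as a description of LinearRegression's internals.

```python
import numpy as np

# Same values as the 'First' and 'Third' columns, with 'Fourth' as the target.
X = np.column_stack([np.repeat(1.0, 100), np.repeat(3.0, 100)])
y = np.repeat(4.0, 100)

# Minimum-norm least-squares solution of X @ coef = y (no intercept).
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 6))  # [0.4 1.2]

# Other coefficient pairs fit the data equally well.
for alt in ([4.0, 0.0], [1.0, 1.0]):
    assert np.allclose(X @ np.array(alt), y)
```

With real data, where the feature columns are not constant multiples of each other, the least-squares solution would be unique and this ambiguity would not arise.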