## Composite Estimators using Pipeline & FeatureUnions

<hr>

### Agenda
1. Introduction to Composite Estimators
2. Pipelines
3. TransformedTargetRegressor
4. FeatureUnions
5. ColumnTransformer
6. GridSearch on pipeline

Note: the examples were written against scikit-learn version 0.20.

<hr>

### 1. Introduction to Composite Estimators
* One or more transformers are chained to an estimator, resulting in a composite estimator.
* A composite estimator is implemented using Pipeline
* FeatureUnion concatenates the outputs of several transformers to create derived features
* Pipelines make machine learning code reusable & modular

### 2. Pipeline
* Before data is fed to a learning algorithm, missing values need to be handled.
* Further pre-processing, such as scaling or encoding, is usually needed as well.
* The output of each preprocessor is passed to the next preprocessor & finally to the estimator
* This whole process can be automated using Pipeline

<img src="https://github.com/awantik/machine-learning-slides/blob/master/pipeline-ml2.png?raw=true">

* Intermediate steps, i.e. transformers, must implement fit & transform
* The same trained pipeline can be used for prediction
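
As a minimal sketch of the points above, the following builds the same three-step text pipeline twice: once with explicit step names via `Pipeline`, once with `make_pipeline`, which derives the names from the lowercased class names. The toy sentences and labels are invented for illustration and are not taken from the horror dataset used below.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, make_pipeline

# Toy corpus and labels, invented for illustration only.
texts = [
    "the raven tapped at my chamber door",
    "a nameless horror lurked beneath the sea",
    "the raven spoke a single word",
    "ancient horror slept beneath the city",
]
labels = ["EAP", "HPL", "EAP", "HPL"]

# Explicit step names chosen by the caller.
named = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# make_pipeline names each step after its lowercased class name.
auto = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
auto_step_names = [name for name, _ in auto.steps]

# fit runs fit_transform on every transformer, then fit on the final estimator;
# predict runs transform on every transformer, then predict on the estimator.
named.fit(texts, labels)
predictions = named.predict(["the raven at the door"])
print(auto_step_names, predictions)
```

The automatically generated names matter later, because they are the prefixes used when tuning parameters of individual steps.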

#### Predicting horror author from text 

In [1]:
import pandas as pd

In [2]:
horror_train_data = pd.read_csv('data/horror-train.csv')

In [3]:
horror_train_data.head()

| | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |


In [4]:
horror_test_data= pd.read_csv('data/horror-test.csv')

In [5]:
horror_test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8392 entries, 0 to 8391
Data columns (total 2 columns):
id      8392 non-null object
text    8392 non-null object
dtypes: object(2)
memory usage: 131.2+ KB


In [6]:
horror_train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19579 entries, 0 to 19578
Data columns (total 3 columns):
id        19579 non-null object
text      19579 non-null object
author    19579 non-null object
dtypes: object(3)
memory usage: 459.0+ KB


In [7]:
horror_train_data = horror_train_data[['text','author']]

In [8]:
from sklearn.pipeline import make_pipeline

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

In [10]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [11]:
pipelines = []
for model in [LogisticRegression(), DecisionTreeClassifier(), MultinomialNB(), SVC()]:
    pipeline = make_pipeline(
              CountVectorizer(stop_words='english'),
              TfidfTransformer(),
              model)
    pipelines.append(pipeline)

In [12]:
pipelines[3].steps[2]

('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False))

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
trainX,testX,trainY,testY = train_test_split(horror_train_data.text, horror_train_data.author)

In [15]:
for pipeline in pipelines:
    pipeline.fit(trainX, trainY)

In [16]:
for pipeline in pipelines:
    print (pipeline.score(testX, testY))

0.791215526047
0.589581205312
0.808784473953
0.403268641471


In [17]:
horror_test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8392 entries, 0 to 8391
Data columns (total 2 columns):
id      8392 non-null object
text    8392 non-null object
dtypes: object(2)
memory usage: 131.2+ KB


In [18]:
results = []
for pipeline in pipelines:
    result = pipeline.predict(horror_test_data.text)
    results.append(result)

In [19]:
results

[array(['MWS', 'EAP', 'EAP', ..., 'EAP', 'MWS', 'EAP'], dtype=object),
 array(['MWS', 'EAP', 'EAP', ..., 'EAP', 'MWS', 'EAP'], dtype=object),
 array(['MWS', 'EAP', 'EAP', ..., 'EAP', 'MWS', 'HPL'], 
       dtype='<U3'),
 array(['EAP', 'EAP', 'EAP', ..., 'EAP', 'EAP', 'EAP'], dtype=object)]

In [20]:
pipelines[0].steps[0][1].transform(horror_test_data.text)

<8392x22218 sparse matrix of type '<class 'numpy.int64'>'
	with 88782 stored elements in Compressed Sparse Row format>

#### Caching transformers within a Pipeline
* The fitted state of transformers can be cached so that identical transformers are not recomputed
* This situation arises when a pipeline is subjected to GridSearch: the same transformer settings are refit for many estimator settings
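
A small, self-contained sketch of caching, assuming a synthetic dataset from `make_classification` and a PCA + SVC pipeline that are not part of this notebook. Here the `memory` argument is given a directory path as a string; the temporary directory is removed once it is no longer needed.

```python
import os
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic data, used only to keep the sketch self-contained.
X, y = make_classification(n_samples=80, n_features=10, random_state=0)

cachedir = mkdtemp()  # temporary directory that holds the cached transformers
pipe = Pipeline(
    [("pca", PCA(n_components=3)), ("svc", SVC())],
    memory=cachedir,
)

pipe.fit(X, y)

# Only the final estimator changes, so the fitted PCA step can be
# loaded from the cache instead of being recomputed.
pipe.set_params(svc__C=10).fit(X, y)
preds = pipe.predict(X)

rmtree(cachedir)  # clean up the cache directory
```

With caching enabled, the pipeline fits clones of the transformers, so the fitted objects should be inspected through `pipe.named_steps` rather than through the instances passed to the constructor.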

In [21]:
from sklearn.model_selection import GridSearchCV

In [22]:
svc_pipe =  make_pipeline(
              CountVectorizer(stop_words='english'),
              TfidfTransformer(),
              SVC())

In [23]:
dt_pipe = make_pipeline(
              CountVectorizer(stop_words='english'),
              TfidfTransformer(),
              DecisionTreeClassifier())

In [24]:
svc_pipe

Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

In [25]:
svc_pipe.steps

[('countvectorizer',
  CountVectorizer(analyzer='word', binary=False, decode_error='strict',
          dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
          lowercase=True, max_df=1.0, max_features=None, min_df=1,
          ngram_range=(1, 1), preprocessor=None, stop_words='english',
          strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
          tokenizer=None, vocabulary=None)),
 ('tfidftransformer',
  TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
 ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False))]

In [26]:
import numpy as np
params = {
    'svc__C': list(np.logspace(1,20,20))
}

In [27]:
dt_pipe.steps[2]

('decisiontreeclassifier',
 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
             max_features=None, max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, presort=False, random_state=None,
             splitter='best'))

In [28]:
params = {
    'countvectorizer__max_features':[5000,7500,10000],
    'decisiontreeclassifier__max_depth':[100,200]
}

In [29]:
gs = GridSearchCV(dt_pipe,cv=5,param_grid=params, n_jobs=-1)

In [30]:
%timeit gs.fit(trainX,trainY)

1 loop, best of 3: 53.2 s per loop


In [31]:
gs.best_params_

{'countvectorizer__max_features': 5000,
 'decisiontreeclassifier__max_depth': 200}

In [32]:
gs.best_score_

0.60249250885317351

In [34]:
! pip install joblib


In [33]:
from tempfile import mkdtemp
from shutil import rmtree
# Memory is provided by joblib; sklearn.utils does not export it in every release
from joblib import Memory

cachedir = mkdtemp()  # temporary directory that holds the cached transformers
memory = Memory(location=cachedir, verbose=0)
svc_pipe_cached = make_pipeline(
              CountVectorizer(stop_words='english'),
              TfidfTransformer(),
              SVC(), memory=memory)

In [None]:
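# params (In [28]) names a decisiontreeclassifier step, which the SVC pipeline lacks, so only the vectorizer key is reused
gs_cached = GridSearchCV(svc_pipe_cached, cv=2, param_grid={'countvectorizer__max_features': [5000, 7500, 10000]}, verbose=0)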

In [None]:
%timeit gs_cached.fit(trainX,trainY)

### 3. Transforming target in regression
* Linear models assume that dependent & independent variables are linearly related
* If the dependent variable is not normally distributed, it can be transformed towards a normal distribution, which often reduces the error
* The predictions then need to be mapped back to the original scale
* This entire process can be automated using TransformedTargetRegressor
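
To make the steps above concrete, here is a hedged sketch on synthetic data (invented for illustration, not the Boston dataset used below). It performs the target transformation by hand with a log transform, then lets `TransformedTargetRegressor` do the same work, and checks that both routes agree.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: the target is positive and right-skewed by construction.
rng = np.random.RandomState(0)
X = rng.uniform(0, 2, size=(200, 1))
y = np.exp(1.5 * X[:, 0] + rng.normal(scale=0.1, size=200))

# Manual route: transform the target, fit, then map predictions back.
manual = LinearRegression().fit(X, np.log(y))
manual_pred = np.exp(manual.predict(X))

# Automated route: the wrapper applies the log before fitting
# and the exponential after predicting.
ttr = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
ttr.fit(X, y)
ttr_pred = ttr.predict(X)

same = np.allclose(manual_pred, ttr_pred)
print(same)
```

The log transform is only one possible choice; it was picked here because its inverse is exact and easy to verify.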

In [None]:
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [None]:
boston = load_boston()

In [None]:
X = boston.data

In [None]:
y = boston.target

In [None]:
regressor = LinearRegression()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
regressor.fit(X_train, y_train)

In [None]:
print('R2 score: {0:.2f}'.format(regressor.score(X_test, y_test)))

In [None]:
pred = regressor.predict(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error, r2_score

In [None]:
mean_absolute_error(y_pred=pred, y_true=y_test)

### Convert data from non-normal distribution to normal distribution

In [None]:
from sklearn.preprocessing import QuantileTransformer

In [None]:
# pt = PowerTransformer()

In [None]:
qt = QuantileTransformer(output_distribution='normal')

In [None]:
#X_tf = pt.fit_transform(X)
#OR
X_tf = qt.fit_transform(X)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_tf, y, random_state=0)

In [None]:
regressor = LinearRegression()

In [None]:
regressor.fit(X_train, y_train)

In [None]:
print('R2 score: {0:.2f}'.format(regressor.score(X_test, y_test)))

In [None]:
pred = regressor.predict(X_test)

In [None]:
mean_absolute_error(y_pred=pred, y_true=y_test)

In [None]:
from sklearn.compose import TransformedTargetRegressor

In [None]:
regr = TransformedTargetRegressor(regressor=regressor,transformer=qt)

In [None]:
regr.fit(X_train, y_train)

In [None]:
pred = regr.predict(X_test)

In [None]:
mean_absolute_error(y_pred=pred, y_true=y_test)

In [None]:
r2_score(y_pred=pred, y_true=y_test)

#### Hyper-parameters of TransformedTargetRegressor
* regressor - the model to fit on the transformed target
* transformer - an object that supports transform & inverse_transform
* func - function that converts the target
* inverse_func - function that converts the predicted target back to the original data scale
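
A short sketch of the `transformer` option, assuming synthetic data and a `StandardScaler` chosen purely for illustration. After fitting, the wrapper exposes fitted clones as `transformer_` and `regressor_`; the objects passed to the constructor are left unfitted.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, invented for this sketch.
rng = np.random.RandomState(1)
X = rng.normal(size=(50, 2))
y = 100.0 + 20.0 * X[:, 0] + rng.normal(size=50)

regr = TransformedTargetRegressor(
    regressor=LinearRegression(), transformer=StandardScaler()
)
regr.fit(X, y)

fitted_scaler = regr.transformer_  # clone of the scaler, fitted on y
fitted_model = regr.regressor_     # clone of the regressor, fitted on scaled y
pred = regr.predict(X)             # already mapped back to the scale of y
print(fitted_scaler.mean_, pred[:3])
```

Either a transformer object or a pair of functions is supplied, not both; passing both raises a `ValueError`, and passing neither leaves the target unchanged.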

### 4. FeatureUnion
* It combines several transformer objects into one transformer
* Each transformer is applied to the same input, independently of the others
* During fitting, each transformer is fit independently (in parallel when n_jobs is set)
* During transform, the outputs are concatenated side by side
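
As a minimal illustration of the concatenation, the sketch below joins two PCA components with the single best feature picked by `SelectKBest`. The dataset is synthetic and the choice of transformers is an assumption made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic data, used only to keep the sketch self-contained.
X, y = make_classification(
    n_samples=60, n_features=6, n_informative=3, random_state=0
)

union = FeatureUnion([
    ("pca", PCA(n_components=2)),   # contributes 2 columns
    ("kbest", SelectKBest(k=1)),    # contributes 1 column
])

# Both transformers see the same X; their outputs are stacked column-wise.
combined = union.fit_transform(X, y)
print(combined.shape)
```

The number of output columns is the sum of the columns produced by each transformer, here 2 + 1.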

#### Predicting employee exit - The Pipeline & FeatureUnion Way

In [None]:
emp_data = pd.read_csv('Data/HR_comma_sep.csv.txt')

In [None]:
emp_data.head()

In [None]:
emp_data.rename(columns={'sales':'dept'}, inplace=True)

In [None]:
num_cols = ['number_project','average_montly_hours','time_spend_company']

In [None]:
bin_cols = ['Work_accident','promotion_last_5years']

In [None]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder,LabelEncoder, LabelBinarizer, MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

In [None]:
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self,key):
        self.key = key
        
    def fit(self,X,Y=None):
        return self
    
    def transform(self,X,Y=None):
        return X[self.key]

In [None]:
class MyLabelBinarizer(BaseEstimator, TransformerMixin):
    # LabelBinarizer.fit/transform take a single argument, so it cannot be
    # used directly as a Pipeline step; this wrapper adds the (X, y) signature.
    def fit(self, x, y=None):
        self.encoder_ = LabelBinarizer().fit(x)
        return self
    def transform(self, x, y=None):
        return self.encoder_.transform(x)

In [None]:
pipeline_dept = Pipeline([
    ('selector', ItemSelector('dept')),
    ('lb', MyLabelBinarizer()),
])

In [None]:
pipeline_dept.fit_transform(emp_data)

In [None]:
class MultiItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self,keys):
        self.keys = keys
        
    def fit(self,X,Y=None):
        return self
    
    def transform(self,X,Y=None):
        return X[self.keys]

In [None]:
class SalaryMapper(BaseEstimator, TransformerMixin):
    
    def fit(self,X,Y=None):
        return self
    
    def transform(self,X,Y=None):
        # ordinal encoding: map salary bands to increasing integers
        db = {'low':1,'medium':2,'high':3}
        r = X.str.strip().map(db)
        return r.values.reshape(-1,1)

In [None]:
pipeline_salary = Pipeline([
    ('selector',ItemSelector('salary')),
    ('sm',SalaryMapper())
])

In [None]:
pipeline_numbers = Pipeline([
    ('selector',MultiItemSelector(num_cols)),
    ('scaling', MinMaxScaler())
])

In [None]:
pipeline_bin = Pipeline([
    ('selector',MultiItemSelector(bin_cols))
])

In [None]:
fu = FeatureUnion([
    ('dept_pipe',pipeline_dept),
    ('salary_pipe',pipeline_salary),
    ('numbers_pipe',pipeline_numbers),
    ('bin_pipe',pipeline_bin)
])

In [None]:
pipeline = Pipeline([
    ('union',fu),
    #('feature_selector',SelectKBest(k=15)),
    ('classifier',RandomForestClassifier(n_estimators=10))
])

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
trainX,testX, trainY,testY = train_test_split(emp_data.drop('left',axis=1), emp_data.left)

In [None]:
pipeline.fit(trainX,trainY)

In [None]:
pipeline.predict(testX)

In [None]:
pipeline.score(testX,testY)

### 5. ColumnTransformer (Beta stage)
* Datasets consist of heterogeneous types of columns
* ColumnTransformer is an easy way to map columns to pipelines
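
A tiny sketch of the column-to-transformer mapping, assuming a hand-made DataFrame whose column names and values are invented for illustration. Numeric columns are scaled and the categorical column is one-hot encoded, and the results are placed side by side.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hand-made frame, invented for this sketch.
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.93, 53.10],
    "sex": ["male", "female", "female", "male"],
})

ct = ColumnTransformer([
    ("num", StandardScaler(), ["age", "fare"]),  # 2 scaled columns
    ("cat", OneHotEncoder(), ["sex"]),           # 1 column per category
])

out = ct.fit_transform(df)
# The result may be sparse or dense depending on its density.
out = out.toarray() if hasattr(out, "toarray") else np.asarray(out)
print(out.shape)
```

Columns that are not listed in any entry are dropped by default.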

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [None]:
titanic_data = pd.read_csv('Data/titanic-train.csv.txt', index_col='PassengerId')

In [None]:
titanic_data.head()

In [None]:
num_cols = ['Age','Fare']
cat_cols = ['Embarked','Sex','Pclass']

In [None]:
pipeline_num = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaling',StandardScaler())
])

In [None]:
pipeline_cat = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoding', OneHotEncoder(handle_unknown='ignore'))
])

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', pipeline_num, num_cols),
        ('cat', pipeline_cat, cat_cols)])

In [None]:
pipeline = Pipeline(steps=[('preprocessor',preprocessor),
                ('classifier',RandomForestClassifier(n_estimators=10))])

In [None]:
X = titanic_data.drop('Survived',axis=1)

In [None]:
Y = titanic_data.Survived

In [None]:
trainX,testX,trainY,testY = train_test_split(X,Y)

In [None]:
pipeline.fit(trainX,trainY)

In [None]:
pipeline.score(testX,testY)

### 6. GridSearch for pipelines
* Pipelines consist of a combination of transformers & estimators
* The hyper-parameters of both transformers & estimators can be tuned together, each addressed as step__parameter
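
The parameter names accepted by a grid search can be listed with `get_params()`. The sketch below, on synthetic data with a scaler and a logistic regression chosen for illustration, lists them and then tunes one parameter from each step.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to keep the sketch self-contained.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Every key listed here is a valid entry for param_grid.
tunable = sorted(pipe.get_params().keys())

grid = {
    "standardscaler__with_mean": [True, False],   # transformer parameter
    "logisticregression__C": [0.1, 1.0, 10.0],    # estimator parameter
}
search = GridSearchCV(pipe, grid, cv=3)
search.fit(X, y)
best = search.best_params_
print(best)
```

Nested composites follow the same rule with one prefix per level, which is how a key such as `preprocessor__num__imputer__strategy` is formed.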

In [None]:
pipeline.steps

In [None]:
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__n_estimators': [10,15,20],
}

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# iid is omitted: the argument was removed in scikit-learn 0.24
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(trainX,trainY)

In [None]:
grid_search.score(testX,testY)

In [None]:
grid_search.best_params_