## Composite Estimators using Pipeline & FeatureUnions

<hr>

### Agenda
1. Introduction to Composite Estimators
2. Pipelines
3. TransformedTargetRegressor
4. FeatureUnions
5. ColumnTransformer
6. GridSearch on pipeline

PS: scikit-learn version 0.20

<hr>

### 1. Introduction to Composite Estimators
* One or more transformers chained to a final estimator form a composite estimator.
* Composite estimators are implemented using Pipeline.
* FeatureUnion concatenates the outputs of several transformers to create derived features.
* Pipelines make machine learning code reusable & modular.
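
As a rough illustration of the FeatureUnion point above, the sketch below concatenates the output of two transformers column-wise. The toy array `X_small` and the step names `scaled` and `pca` are assumptions made up for this example, not part of the notebook's data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 samples, 3 features
X_small = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.5],
    [7.0, 8.5, 9.0],
    [10.0, 11.0, 12.5],
])

# Each transformer is fitted on the same input; outputs are stacked side by side
union = FeatureUnion([
    ("scaled", StandardScaler()),   # 3 scaled columns
    ("pca", PCA(n_components=1)),   # 1 derived column
])
X_union = union.fit_transform(X_small)

print(X_union.shape)  # 3 + 1 = 4 columns
```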

### 2. Pipeline
* Before data is fed to a learning algorithm, missing values need to be handled.
* Other pre-processing steps, such as scaling or encoding, usually need to be applied as well.
* The output of each preprocessor is passed to the next preprocessor & finally to the estimator.
* This whole process can be automated using Pipeline.

<img src="https://github.com/awantik/machine-learning-slides/blob/master/pipeline-ml2.png?raw=true">

* Intermediate steps, i.e. transformers, must implement fit & transform; the final estimator only needs fit.
* The same trained pipeline can be used for prediction.
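
A minimal sketch of the flow described above, assuming a tiny made-up numeric dataset (`X_demo`, `y_demo`) and hypothetical step names. Missing values are imputed, features are scaled, and the result is passed to the estimator, all through one `fit` call.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with missing values (np.nan)
X_demo = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [8.0, np.nan]])
y_demo = np.array([0, 0, 1, 1])

# Every step except the last must implement fit & transform
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

pipe.fit(X_demo, y_demo)          # fits each transformer, then the estimator
preds = pipe.predict(X_demo)      # reuses the fitted transformers
print(preds)
```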

#### Predicting horror author from text 

In [None]:
import pandas as pd

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving horror-train.csv to horror-train.csv
User uploaded file "horror-train.csv" with length 3295644 bytes


In [None]:
horror_train_data = pd.read_csv('horror-train.csv')

In [None]:
horror_train_data.head()

| Unnamed: 0 | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |


In [None]:
uploaded = files.upload()
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving horror-test.csv to horror-test.csv
User uploaded file "horror-test.csv" with length 1351241 bytes


In [None]:
horror_test_data= pd.read_csv('horror-test.csv')

In [None]:
horror_test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8392 entries, 0 to 8391
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      8392 non-null   object
 1   text    8392 non-null   object
dtypes: object(2)
memory usage: 131.2+ KB


In [None]:
horror_train_data = horror_train_data[['text','author']]

In [None]:
from sklearn.pipeline import make_pipeline

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [None]:
pipelines = []
for model in [LogisticRegression(), DecisionTreeClassifier(), MultinomialNB(), SVC()]:
    pipeline = make_pipeline(
              CountVectorizer(stop_words='english'),
              TfidfTransformer(),
              model)
    pipelines.append(pipeline)

In [None]:
pipelines[3].steps[2]

('svc',
 SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
     decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
     max_iter=-1, probability=False, random_state=None, shrinking=True,
     tol=0.001, verbose=False))

In [None]:
pipelines[2].steps[1]

('tfidftransformer',
 TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True))

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
trainX,testX,trainY,testY = train_test_split(horror_train_data.text, horror_train_data.author)

In [None]:
for pipeline in pipelines:
    pipeline.fit(trainX, trainY)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [None]:
for pipeline in pipelines:
    print(pipeline.score(testX, testY))

0.7928498467824311
0.5873340143003064
0.8081716036772216
0.7842696629213484


In [None]:
results = []
for pipeline in pipelines:
    result = pipeline.predict(horror_test_data.text)
    results.append(result)

In [None]:
results

[array(['MWS', 'EAP', 'EAP', ..., 'EAP', 'MWS', 'HPL'], dtype=object),
 array(['MWS', 'EAP', 'HPL', ..., 'EAP', 'MWS', 'EAP'], dtype=object),
 array(['MWS', 'EAP', 'EAP', ..., 'EAP', 'MWS', 'HPL'], dtype='<U3'),
 array(['MWS', 'EAP', 'EAP', ..., 'EAP', 'MWS', 'HPL'], dtype=object)]

In [None]:
pipelines[0].steps[0][1].transform(horror_test_data.text)

<8392x22210 sparse matrix of type '<class 'numpy.int64'>'
	with 88740 stored elements in Compressed Sparse Row format>

#### Caching transformers within a Pipeline
* The fitted state of transformers can be cached to avoid recomputing them.
* This is useful when a pipeline is passed to GridSearch, where the same transformers would otherwise be refitted for many parameter combinations.
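
The sketch below shows one way the two points above can fit together, assuming a small invented corpus (`docs`, `labels`) rather than the horror dataset. The `memory` argument of `make_pipeline` takes a directory path for the cache, and grid-search parameters are addressed as `<step name>__<parameter>`.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 4 documents per class
docs = [
    "the raven cried in the dark night",
    "a dark tomb and a cold night",
    "the black cat stared from the dark wall",
    "a cold heart beat under the dark floor",
    "spring flowers bloom in the bright garden",
    "the bright sun warmed the green garden",
    "green fields and bright spring mornings",
    "a warm sun over the spring flowers",
]
labels = ["dark"] * 4 + ["bright"] * 4

cachedir = mkdtemp()  # temporary directory that holds cached transformers
cached_pipe = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(),
    MultinomialNB(),
    memory=cachedir,
)

# Only the classifier parameter varies, so fitted transformers can be reused
search = GridSearchCV(
    cached_pipe,
    param_grid={"multinomialnb__alpha": [0.1, 1.0]},
    cv=2,
)
search.fit(docs, labels)
pred = search.predict(["a dark and cold night"])
print(search.best_params_, pred)

rmtree(cachedir)  # remove the cache once it is no longer needed
```

Caching pays off mainly when the transformers are expensive to fit; on a toy corpus like this one the overhead of writing to disk can outweigh the savings.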

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
svc_pipe =  make_pipeline(
              CountVectorizer(stop_words='english'),
              TfidfTransformer(),
              SVC())

In [None]:
dt_pipe = make_pipeline(
              CountVectorizer(stop_words='english'),
              TfidfTransformer(),
              DecisionTreeClassifier())

In [None]:
svc_pipe

Pipeline(memory=None,
         steps=[('countvectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words='english', strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=Non...one)),
                ('tfidftransformer',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('svc',
                 SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None,
                     coef0=0.0, decision_function_shape='

In [None]:
svc_pipe.steps[2]

('svc',
 SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
     decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
     max_iter=-1, probability=False, random_state=None, shrinking=True,
     tol=0.001, verbose=False))

In [None]:
import numpy as np
params = {
    'svc__C': list(np.logspace(1,20,20))
}

In [None]:
dt_pipe.steps[2]

('decisiontreeclassifier',
 DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                        max_depth=None, max_features=None, max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort='deprecated',
                        random_state=None, splitter='best'))

In [None]:
params = {
    'countvectorizer__max_features':[5000,7500,10000],
    'decisiontreeclassifier__max_depth':[100,200]
}

In [None]:
gs = GridSearchCV(dt_pipe,cv=5,param_grid=params, n_jobs=-1)

In [None]:
%timeit gs.fit(trainX,trainY)

1 loop, best of 5: 50.2 s per loop


In [None]:
gs.best_params_

{'countvectorizer__max_features': 10000,
 'decisiontreeclassifier__max_depth': 200}

In [None]:
gs.best_score_

0.6008577029518155

In [None]:
# joblib provides the Memory class used for caching; the PyPI package
# "sklearn.utils" is a separate project, not part of scikit-learn.
! pip install joblib

In [None]:
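from tempfile import mkdtemp
from shutil import rmtree
# Use joblib's Memory directly rather than importing it via sklearn.utils
from joblib import Memory

# Cache fitted transformers in a temporary directory;
# call rmtree(cachedir) once the cache is no longer needed
cachedir = mkdtemp()
memory = Memory(location=cachedir, verbose=0)
svc_pipe_cached = make_pipeline(
              CountVectorizer(stop_words='english'),
              TfidfTransformer(),
              SVC(), memory=memory)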



In [None]:
import joblib
joblib.Memory



In [None]:
params = {
    'svc__C': list(np.logspace(1,20,20))
}

In [None]:
gs_cached = GridSearchCV(svc_pipe_cached,cv=5,param_grid=params, verbose=0)

In [None]:
%timeit gs_cached.fit(trainX,trainY)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  **fit_params_steps[name])

KeyboardInterrupt: ignored

### 3. Transforming target in regression
* For linear regression, the dependent & independent variables should be linearly related.
* If the dependent variable is not normally distributed, it can be transformed toward a normal distribution to reduce the prediction error.
* The predictions then need to be mapped back to the original scale of the target.
* This entire process can be automated using TransformedTargetRegressor.
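
To make the steps above concrete, here is a hand-rolled sketch of what gets automated, assuming synthetic data (`X_demo`, `y_demo`) in which the target is exactly exponential in the feature. The log transform is chosen for this toy case only; it is not the transformation used later in the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: y is exponential in X, so log(y) is linear in X
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 2, size=(100, 1))
y_demo = np.exp(1.5 * X_demo[:, 0] + 0.5)

model = LinearRegression()
model.fit(X_demo, np.log(y_demo))          # 1. fit on the transformed target
pred_log = model.predict(X_demo)           # 2. predict in the transformed space
pred_demo = np.exp(pred_log)               # 3. map predictions back to the original scale

print(np.allclose(pred_demo, y_demo))
```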

In [None]:
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [None]:
boston = load_boston()

In [None]:
X = boston.data

In [None]:
y = boston.target

In [None]:
regressor = LinearRegression()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [None]:
print('R2 score: {0:.2f}'.format(regressor.score(X_test, y_test)))

R2 score: 0.64


In [None]:
pred = regressor.predict(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error, r2_score

In [None]:
mean_absolute_error(y_pred=pred, y_true=y_test)

3.6683301481357256

In [None]:
r2_score(y_pred=pred,y_true=y_test )

0.635463843320211

### Convert data from non-normal distribution to normal distribution

In [None]:
from sklearn.preprocessing import QuantileTransformer

In [None]:
# pt = PowerTransformer()

In [None]:
qt = QuantileTransformer(output_distribution='normal')

In [None]:
#X_tf = pt.fit_transform(X)
#OR
X_tf = qt.fit_transform(X)


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_tf, y, random_state=0)

In [None]:
regressor = LinearRegression()

In [None]:
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [None]:
print('R2 score: {0:.2f}'.format(regressor.score(X_test, y_test)))

R2 score: 0.66


In [None]:
pred = regressor.predict(X_test)

In [None]:
mean_absolute_error(y_pred=pred, y_true=y_test)

3.6327621516037127

In [None]:
from sklearn.compose import TransformedTargetRegressor

In [None]:
regr = TransformedTargetRegressor(regressor=regressor,transformer=qt)

In [None]:
regr.fit(X_train, y_train)


TransformedTargetRegressor(check_inverse=True, func=None, inverse_func=None,
                           regressor=LinearRegression(copy_X=True,
                                                      fit_intercept=True,
                                                      n_jobs=None,
                                                      normalize=False),
                           transformer=QuantileTransformer(copy=True,
                                                           ignore_implicit_zeros=False,
                                                           n_quantiles=1000,
                                                           output_distribution='normal',
                                                           random_state=None,
                                                           subsample=100000))

In [None]:
pred = regr.predict(X_test)

In [None]:
mean_absolute_error(y_pred=pred, y_true=y_test)

3.349020126293933

In [None]:
r2_score(y_pred=pred, y_true=y_test)

0.7099444634911461

#### Hyper-parameters of TransformedTargetRegressor
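* regressor - the regressor instance to fit on the transformed target
* transformer - an object that supports transform & inverse_transform; cannot be combined with func & inverse_func
* func - function applied to the target before fitting
* inverse_func - function that converts the predicted target back to the original data scale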
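
The `func` and `inverse_func` parameters can be tried out with a sketch like the following, which assumes synthetic data (`X_toy`, `y_toy`) with an exponential target. `np.log` and `np.exp` are picked only because they suit this toy target.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: the target grows exponentially with the feature
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 2, size=(100, 1))
y_toy = np.exp(1.5 * X_toy[:, 0] + 0.5)

# func is applied to y before fitting; inverse_func is applied to predictions
ttr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
ttr.fit(X_toy, y_toy)
pred_toy = ttr.predict(X_toy)   # already back on the original scale

print(ttr.regressor_.coef_)     # coefficients learned in log space
```

The fitted inner model is available as `regressor_`, which is handy for inspecting coefficients in the transformed space.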