In [1]:
print("Composite Estimators using Pipeline & FeatureUnions")

Composite Estimators using Pipeline & FeatureUnions


#### Introduction to Composite Estimators
- A composite estimator chains one or more transformers with a final estimator.
- Pipeline implements this chaining: each step's output becomes the next step's input.
- FeatureUnion concatenates the outputs of several transformers into a single derived feature set.
- Pipelines make machine learning code reusable & modular.

#### 1. Pipeline
- Before data is fed to a learning algorithm, missing values must be handled.
- Other pre-processing steps (scaling, encoding, vectorizing) are usually needed as well.
- The output of each preprocessor is passed to the next preprocessor & finally to the estimator.
- Pipeline automates this whole sequence.
- Intermediate steps, i.e. transformers, must implement fit & transform; the final step only needs fit.
- The same trained pipeline can be used for prediction.
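
As a minimal sketch of the points above, the cell below chains an imputer, a scaler and a classifier on a tiny made-up array (X_toy, y_toy and toy_pipe are illustrative names, not part of the horror dataset). One fit call trains every step in order, and predict pushes new rows through the same fitted steps.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with missing values (made up for illustration)
X_toy = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan],
                  [5.0, 6.0], [0.5, 1.0], [6.0, 7.0]])
y_toy = np.array([0, 0, 1, 1, 0, 1])

# make_pipeline names each step after its lowercased class name
toy_pipe = make_pipeline(SimpleImputer(strategy='mean'),
                         StandardScaler(),
                         LogisticRegression())

# fit: impute -> scale -> train the classifier, in that order
toy_pipe.fit(X_toy, y_toy)

# predict reuses the fitted imputer and scaler on unseen rows
preds = toy_pipe.predict(np.array([[np.nan, 2.5]]))
print(list(toy_pipe.named_steps))
```

The step names printed at the end are the prefixes used later when addressing parameters in a grid search.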

###### Predicting horror author from text

In [2]:
import pandas as pd

In [3]:
horror_train_data = pd.read_csv('Data/horror-train.csv')

In [4]:
horror_train_data.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [5]:
horror_test_data= pd.read_csv('Data/horror-test.csv')

In [6]:
horror_test_data.head()

Unnamed: 0,id,text
0,id02310,"Still, as I urged our leaving Ireland with suc..."
1,id24541,"If a fire wanted fanning, it could readily be ..."
2,id00134,And when they had broken down the frail door t...
3,id27757,While I was thinking how I should possibly man...
4,id04081,I am not sure to what limit his knowledge may ...


In [7]:
horror_test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8392 entries, 0 to 8391
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      8392 non-null   object
 1   text    8392 non-null   object
dtypes: object(2)
memory usage: 131.3+ KB


In [8]:
horror_train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19579 entries, 0 to 19578
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      19579 non-null  object
 1   text    19579 non-null  object
 2   author  19579 non-null  object
dtypes: object(3)
memory usage: 459.0+ KB


In [9]:
horror_train_data = horror_train_data[['text','author']]

In [10]:
from sklearn.pipeline import make_pipeline

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

In [13]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [14]:
pipelines = []
for model in [LogisticRegression(), DecisionTreeClassifier(), MultinomialNB(), LinearSVC()]:
    pipeline = make_pipeline(
              CountVectorizer(stop_words='english'),
              TfidfTransformer(),
              model)
    pipelines.append(pipeline)

In [15]:
pipelines[1].steps[2]

('decisiontreeclassifier', DecisionTreeClassifier())

In [16]:
from sklearn.model_selection import train_test_split

In [17]:
trainX,testX,trainY,testY = train_test_split(horror_train_data.text, horror_train_data.author)

In [18]:
for pipeline in pipelines:
    pipeline.fit(trainX, trainY)

In [19]:
for pipeline in pipelines:
    print (pipeline.score(testX, testY))

0.7940755873340143
0.5942798774259448
0.807150153217569
0.8014300306435138


In [20]:
horror_test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8392 entries, 0 to 8391
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      8392 non-null   object
 1   text    8392 non-null   object
dtypes: object(2)
memory usage: 131.3+ KB


In [21]:
results = []
for pipeline in pipelines:
    result = pipeline.predict(horror_test_data.text)
    results.append(result)

In [22]:
results

[array(['MWS', 'EAP', 'EAP', ..., 'EAP', 'MWS', 'HPL'], dtype=object),
 array(['MWS', 'EAP', 'HPL', ..., 'EAP', 'MWS', 'EAP'], dtype=object),
 array(['MWS', 'EAP', 'EAP', ..., 'EAP', 'MWS', 'HPL'], dtype='<U3'),
 array(['MWS', 'EAP', 'EAP', ..., 'EAP', 'MWS', 'HPL'], dtype=object)]

In [23]:
pipelines[0].steps[0][1].transform(horror_test_data.text)

<8392x22141 sparse matrix of type '<class 'numpy.int64'>'
	with 88765 stored elements in Compressed Sparse Row format>
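
The cell above applies only the first step (the CountVectorizer). To apply every transformer that precedes the classifier, a fitted pipeline can be sliced. This is a small sketch on made-up sentences and labels (docs, labels and text_pipe are illustrative, not rows from the dataset).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus; the labels only mimic the dataset's author codes
docs = ["the raven croaked at midnight",
        "ancient things slept beneath the sea",
        "a creature of lightning and grief",
        "the raven watched the sea"]
labels = ["EAP", "HPL", "MWS", "EAP"]

text_pipe = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
text_pipe.fit(docs, labels)

# text_pipe[:-1] is a sub-pipeline holding every step except the classifier,
# so transform returns the TF-IDF features the classifier actually sees
features = text_pipe[:-1].transform(["the raven slept"])
vocab_size = len(text_pipe.named_steps['countvectorizer'].vocabulary_)
print(features.shape, vocab_size)
```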

##### Caching transformers within a Pipeline
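- A pipeline can cache its fitted transformers so that they are not recomputed when the same step is fit again on the same data with the same parameters.
- This happens often in a grid search: when only the final estimator's parameters vary, the transformers upstream would otherwise be refit for every candidate.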
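
A compact sketch of this idea, assuming synthetic data from make_classification and a PCA step standing in for the text vectorizers (all names here are illustrative). Only the classifier's C varies, so the cached PCA fit for a given training fold can be reused across candidates.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

cache_dir = mkdtemp()  # temporary directory that holds the cached transformers
try:
    cached_pipe = Pipeline(
        [('pca', PCA(n_components=5)),
         ('clf', LogisticRegression(max_iter=1000))],
        memory=cache_dir,  # a path string or a joblib.Memory object
    )
    # Only the final estimator is tuned; the PCA step is identical per fold
    search = GridSearchCV(cached_pipe, {'clf__C': [0.1, 1.0, 10.0]}, cv=3)
    search.fit(X_demo, y_demo)
    best_C = search.best_params_['clf__C']
finally:
    rmtree(cache_dir)  # remove the cache once the search is finished
```

Caching pays off when fitting the transformers is expensive relative to reading them back from disk; for cheap transformers the overhead can outweigh the gain.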

In [24]:
from sklearn.model_selection import GridSearchCV

In [25]:
svc_pipe =  make_pipeline(
              CountVectorizer(stop_words='english'),
              TfidfTransformer(),
              LinearSVC())

In [26]:
dt_pipe = make_pipeline(
              CountVectorizer(stop_words='english'),
              TfidfTransformer(),
              DecisionTreeClassifier())

In [27]:
svc_pipe

In [28]:
svc_pipe.steps

[('countvectorizer', CountVectorizer(stop_words='english')),
 ('tfidftransformer', TfidfTransformer()),
 ('linearsvc', LinearSVC())]

In [29]:
import numpy as np
params = {
    'linearsvc__C': list(np.logspace(1,20,20))
}

In [30]:
dt_pipe.steps[2]

('decisiontreeclassifier', DecisionTreeClassifier())

In [31]:
params = {
    'countvectorizer__max_features':[5000,7500,10000],
    'decisiontreeclassifier__max_depth':[100,200]
}

In [32]:
gs = GridSearchCV(dt_pipe,cv=5,param_grid=params, n_jobs=-1)

In [33]:
gs.fit(trainX,trainY)

In [34]:
gs.best_params_

{'countvectorizer__max_features': 10000,
 'decisiontreeclassifier__max_depth': 200}

In [35]:
gs.best_score_

0.6026287041495382

In [37]:
from tempfile import mkdtemp
from shutil import rmtree
#from sklearn.utils import Memory
from joblib import Memory  # Use joblib instead of sklearn.utils.Memory

cachedir = mkdtemp()
memory = Memory(location=cachedir, verbose=0)
svc_pipe_cached =  make_pipeline(
              CountVectorizer(stop_words='english'),
              TfidfTransformer(),
              LinearSVC(), memory = memory)

In [38]:
# `params` was overwritten in In [31] with DecisionTreeClassifier keys, which
# this LinearSVC pipeline rejects; restate the LinearSVC grid from In [29].
svc_params = {
    'linearsvc__C': list(np.logspace(1,20,20))
}
gs_cached = GridSearchCV(svc_pipe_cached,cv=2,param_grid=svc_params, verbose=0)

In [39]:
%timeit gs_cached.fit(trainX,trainY)
rmtree(cachedir)  # clean up the temporary cache directory

#### 2. Transforming target in regression
- Linear models assume that the dependent variable is linearly related to the independent variables.
- If the dependent variable is not normally distributed, transforming it towards a normal distribution can reduce the prediction error.
- Predictions then need to be mapped back to the original scale.
- TransformedTargetRegressor automates this entire process.

In [43]:
# load_boston was removed in scikit-learn 1.2 (ethical concerns with the dataset); use the California housing data instead
#from sklearn.datasets import load_boston

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [44]:
housing = fetch_california_housing()

In [45]:
X = housing.data

In [46]:
y = housing.target

In [47]:
regressor = LinearRegression()

In [48]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [49]:
regressor.fit(X_train, y_train)

In [50]:
print('R2 score: {0:.2f}'.format(regressor.score(X_test, y_test)))

R2 score: 0.59


In [51]:
pred = regressor.predict(X_test)

In [52]:
from sklearn.metrics import mean_absolute_error, r2_score

In [53]:
mean_absolute_error(y_pred=pred, y_true=y_test)

0.5368950735045277

### Convert data from non-normal distribution to normal distribution

In [54]:
from sklearn.preprocessing import PowerTransformer,QuantileTransformer

In [55]:
pt = PowerTransformer()

In [56]:
qt = QuantileTransformer(output_distribution='normal')

In [57]:
#X_tf = pt.fit_transform(X)
#OR
X_tf = qt.fit_transform(X)

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X_tf, y, random_state=0)

In [59]:
regressor = LinearRegression()

In [60]:
regressor.fit(X_train, y_train)

In [61]:
print('R2 score: {0:.2f}'.format(regressor.score(X_test, y_test)))

R2 score: 0.59


In [62]:
pred = regressor.predict(X_test)

In [63]:
mean_absolute_error(y_pred=pred, y_true=y_test)

0.5496656773369633
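
The cells above fit QuantileTransformer on the full X before splitting, so the test rows influence the learned quantiles. A minimal sketch of the alternative, assuming synthetic data from make_regression in place of the housing data (variable names are illustrative): placing the transformer inside a pipeline means it is fit on the training split only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer

# Synthetic stand-in for the housing features and target
X_syn, y_syn = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_syn, y_syn, random_state=0)

# The transformer is a pipeline step, so fit() only ever sees the training rows
qt_pipe = make_pipeline(
    QuantileTransformer(n_quantiles=100, output_distribution='normal'),
    LinearRegression(),
)
qt_pipe.fit(Xs_train, ys_train)

# score() transforms the test rows with the quantiles learned from training data
r2_syn = qt_pipe.score(Xs_test, ys_test)
print('R2 score: {0:.2f}'.format(r2_syn))
```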

In [64]:
from sklearn.compose import TransformedTargetRegressor

In [65]:
regr = TransformedTargetRegressor(regressor=regressor,transformer=qt)

In [66]:
regr.fit(X_train, y_train)

In [67]:
pred = regr.predict(X_test)

In [68]:
mean_absolute_error(y_pred=pred, y_true=y_test)

0.5382148400838282

In [69]:
r2_score(y_pred=pred, y_true=y_test)

0.568796248787967


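##### Hyper-parameters of TransformedTargetRegressor
- regressor - the model instance to fit on the transformed target
- transformer - an object that supports transform & inverse_transform
- func - function applied to the target before fitting
- inverse_func - function that maps predictions back to the original data scale
- transformer and func / inverse_func are alternatives; they cannot be set together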
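
The cells above use the transformer argument. As a sketch of the function-based alternative, the example below passes np.log and np.exp on a made-up target that is exactly log-linear in the feature (X_demo, y_demo and log_regr are illustrative names). The regressor is trained on log(y), and predict returns values on the original scale.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 2, size=(200, 1))
# Noise-free target with log(y) = 1.5 * x + 0.5
y_demo = np.exp(1.5 * X_demo[:, 0] + 0.5)

log_regr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,          # applied to y before fitting
    inverse_func=np.exp,  # applied to predictions to undo func
)
log_regr.fit(X_demo, y_demo)

pred_demo = log_regr.predict(X_demo)               # already back on the original scale
slope = np.ravel(log_regr.regressor_.coef_)[0]     # slope learned in log space
```

The fitted inner model is available as regressor_, which is useful for inspecting coefficients in the transformed space.
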
#### 3. FeatureUnion
- It combines several transformer objects into one transformer.
- Each transformer is fit independently on the same input data; with n_jobs set, the fits can run in parallel.
- During transform, each transformer is applied to the same input & the outputs are concatenated side by side.
- Example: predicting employee exit, the Pipeline & FeatureUnion way
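
Before the employee example, a minimal sketch of the concatenation itself, assuming the iris data and two arbitrary transformers (PCA and SelectKBest are chosen only for illustration). The width of the result is the sum of the widths each transformer produces.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X_iris, y_iris = load_iris(return_X_y=True)

union = FeatureUnion([
    ('pca', PCA(n_components=2)),   # contributes 2 columns
    ('kbest', SelectKBest(k=1)),    # contributes 1 column
])

# Both transformers see the same input; their outputs are stacked column-wise
combined = union.fit_transform(X_iris, y_iris)
print(combined.shape)  # (150, 3)
```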

In [70]:
emp_data = pd.read_csv('https://raw.githubusercontent.com/zekelabs/data-science-complete-tutorial/master/Data/HR_comma_sep.csv.txt')

In [71]:
emp_data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


In [72]:
emp_data.rename(columns={'sales':'dept'}, inplace=True)

In [73]:
num_cols = ['number_project','average_montly_hours','time_spend_company']

In [74]:
bin_cols = ['Work_accident','promotion_last_5years']

In [75]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder,LabelEncoder, LabelBinarizer, MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest

In [76]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

In [77]:
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self,key):
        self.key = key
        
    def fit(self,X,Y=None):
        return self
    
    def transform(self,X,Y=None):
        return X[self.key]

In [78]:
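class MyLabelBinarizer(TransformerMixin):
    # LabelBinarizer.fit/transform accept a single argument, but Pipeline
    # passes (X, y); this wrapper adapts the signature.
    def __init__(self, *args, **kwargs):
        self.encoder = LabelBinarizer(*args, **kwargs)
    def fit(self, x, y=None):
        self.encoder.fit(x)
        return self
    def transform(self, x, y=None):
        return self.encoder.transform(x)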

In [79]:
pipeline_dept = Pipeline([
    ('selector', ItemSelector('dept')),
    ('lb', MyLabelBinarizer()),
])

In [80]:
pipeline_dept.fit_transform(emp_data)

array([[0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 1, 0]])

In [81]:
class MultiItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self,keys):
        self.keys = keys
        
    def fit(self,X,Y=None):
        return self
    
    def transform(self,X,Y=None):
        return X[self.keys]

In [82]:
class SalaryMapper(BaseEstimator, TransformerMixin):
    
    def fit(self,X,Y=None):
        return self
    
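    def transform(self,X,Y=None):
        # Encode the ordinal salary bands as integers
        db = {'low':1,'medium':2,'high':3}
        print (type(X))  # debug: confirms the selector passes a pandas Series
        # map avoids the downcasting warning that replace can emit on recent pandas versions
        r = X.str.strip().map(db)
        return r.values.reshape(-1,1)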

In [83]:
pipeline_salary = Pipeline([
    ('selector',ItemSelector('salary')),
    ('sm',SalaryMapper())
])

In [84]:
pipeline_numbers = Pipeline([
    ('selector',MultiItemSelector(num_cols)),
    ('scaling', MinMaxScaler())
])

In [85]:
pipeline_bin = Pipeline([
    ('selector',MultiItemSelector(bin_cols))
])


In [86]:
fu = FeatureUnion([
    ('dept_pipe',pipeline_dept),
    ('salary_pipe',pipeline_salary),
    ('numbers_pipe',pipeline_numbers),
    ('bin_pipe',pipeline_bin)
])

In [87]:
pipeline = Pipeline([
    ('union',fu),
    #('feature_selector',SelectKBest(k=15)),
    ('classifier',RandomForestClassifier(n_estimators=10))
])


In [88]:
from sklearn.model_selection import train_test_split

In [89]:
trainX,testX, trainY,testY = train_test_split(emp_data.drop('left',axis=1), emp_data.left)

In [90]:
pipeline.fit(trainX,trainY)

<class 'pandas.core.series.Series'>




In [91]:
pipeline.predict(testX)

<class 'pandas.core.series.Series'>




array([0, 0, 1, ..., 0, 0, 0], dtype=int64)

In [92]:
pipeline.score(testX,testY)

<class 'pandas.core.series.Series'>




0.9664

#### 4. ColumnTransformer
- Datasets often consist of heterogeneous column types (numeric, categorical, text).
- ColumnTransformer maps each group of columns to its own transformer or pipeline & concatenates the results.
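
A minimal sketch of that mapping on a four-row made-up frame (toy, ct and the column names are illustrative): the numeric column is scaled, the categorical column is one-hot encoded, and the results are joined into one array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    'age': [22.0, 38.0, 26.0, 35.0],
    'sex': ['male', 'female', 'female', 'male'],
})

ct = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age']),  # 1 output column
        ('cat', OneHotEncoder(), ['sex']),   # 2 output columns, one per category
    ],
    sparse_threshold=0,  # always return a dense array, for easy inspection
)

out = ct.fit_transform(toy)
print(out.shape)  # (4, 3)
```

Columns that are not listed in any transformer are dropped by default; the remainder argument controls that behaviour.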

In [93]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [94]:
titanic_data = pd.read_csv('https://raw.githubusercontent.com/zekelabs/data-science-complete-tutorial/master/Data/titanic-train.csv.txt', index_col='PassengerId')

In [95]:
titanic_data.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [96]:
num_cols = ['Age','Fare']
cat_cols = ['Embarked','Sex','Pclass']

In [97]:
pipeline_num = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaling',StandardScaler())
])

In [98]:
pipeline_cat = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoding', OneHotEncoder(handle_unknown='ignore'))
])

In [99]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', pipeline_num, num_cols),
        ('cat', pipeline_cat, cat_cols)])

In [100]:
pipeline = Pipeline(steps=[('preprocessor',preprocessor),
                ('classifier',RandomForestClassifier(n_estimators=10))])

In [101]:
X = titanic_data.drop('Survived',axis=1)

In [102]:
Y = titanic_data.Survived

In [103]:
trainX,testX,trainY,testY = train_test_split(X,Y)

In [104]:
pipeline.fit(trainX,trainY)

In [105]:
pipeline.score(testX,testY)

0.7937219730941704

#### 5. GridSearch for pipelines
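- Pipelines consist of a combination of transformers & estimators.
- The hyper-parameters of both the transformers & the final estimator can be tuned together in one grid search.
- Parameters are addressed as <step name>__<parameter name>; nested steps chain further names with double underscores.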
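
A short sketch of the naming scheme, using a hypothetical two-level pipeline (prep, demo_pipe and the step names are illustrative). get_params lists every addressable key, set_params accepts the same keys that a param_grid would, and a key whose step name does not exist is rejected with a ValueError.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

prep = Pipeline([('imputer', SimpleImputer()), ('scaler', StandardScaler())])
demo_pipe = Pipeline([('prep', prep), ('clf', LogisticRegression())])

# Every key listed here is valid in a GridSearchCV param_grid
tunable = sorted(demo_pipe.get_params().keys())
print([name for name in tunable if name.endswith('strategy') or name.endswith('__C')])

# Nested step names are joined with double underscores
demo_pipe.set_params(prep__imputer__strategy='median', clf__C=0.5)

# A step name that is not in the pipeline is rejected
try:
    demo_pipe.set_params(tree__max_depth=3)
    bad_key_rejected = False
except ValueError:
    bad_key_rejected = True
```

Checking get_params().keys() before building a grid is a quick way to catch a grid written for a different pipeline.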

In [106]:
pipeline.steps

[('preprocessor',
  ColumnTransformer(transformers=[('num',
                                   Pipeline(steps=[('imputer',
                                                    SimpleImputer(strategy='median')),
                                                   ('scaling',
                                                    StandardScaler())]),
                                   ['Age', 'Fare']),
                                  ('cat',
                                   Pipeline(steps=[('imputer',
                                                    SimpleImputer(fill_value='missing',
                                                                  strategy='constant')),
                                                   ('encoding',
                                                    OneHotEncoder(handle_unknown='ignore'))]),
                                   ['Embarked', 'Sex', 'Pclass'])])),
 ('classifier', RandomForestClassifier(n_estimators=10))]

In [107]:
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__n_estimators': [10,15,20],
}

In [108]:
from sklearn.model_selection import GridSearchCV

In [112]:
# The iid argument was removed from GridSearchCV in scikit-learn 0.24, so it is omitted below
#grid_search = GridSearchCV(pipeline, param_grid, cv=5, iid = False)

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(trainX,trainY)

In [113]:
grid_search.score(testX,testY)

0.7892376681614349

In [114]:
grid_search.best_params_

{'classifier__n_estimators': 15,
 'preprocessor__num__imputer__strategy': 'mean'}