# Pipeline Fun
In this notebook I'll walk through the Pipeline class and the make_pipeline function in scikit-learn.

In [1]:
import pandas as pd
import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import Binarizer, PolynomialFeatures, \
StandardScaler, FunctionTransformer, OneHotEncoder, LabelEncoder
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion, \
make_union
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

#### Iris Dataset

In [2]:
# Import iris dataset and see shape and head
df_iris = pd.read_csv('./iris.csv')

print(df_iris.shape)
df_iris.head()

(150, 5)


|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [3]:
# Setup feature matrix and target vector
X = df_iris.drop('species', axis=1)
y = df_iris.species

# Label encode target vector and print label mapping
le = LabelEncoder()
le.fit(y)
le_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print('Label Encoder Mapping:', le_mapping)
y = le.transform(y)

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,\
                                                   random_state=1337)

# Use standard scaler
# ss = StandardScaler()
# X_train = ss.fit_transform(X_train)
# X_test = ss.transform(X_test)

Label Encoder Mapping: {'setosa': 0, 'versicolor': 1, 'virginica': 2}


#### Pipeline Class Walkthrough 
We'll walk through scikit-learn's Pipeline class below. First we'll define a feature extractor class that extracts a given column from a dataframe and returns it as a 2D array.

In [4]:
# Create Feature Extractor class
class FeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return X[[self.column]].values
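
As a quick sanity check, the sketch below runs the extractor on a small, hypothetical dataframe (`toy` is made up for illustration, not the iris data). The class is repeated so the block runs on its own. The point is that `transform` returns a 2D array with one column, which is the shape downstream scikit-learn transformers expect.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FeatureExtractor(BaseEstimator, TransformerMixin):
    """Same extractor as above, repeated so this block is self-contained."""
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Double brackets keep the result 2D: shape (n_rows, 1)
        return X[[self.column]].values


# Hypothetical toy frame, not the iris data
toy = pd.DataFrame({'petal_length': [1.4, 4.7, 5.9],
                    'petal_width': [0.2, 1.4, 2.1]})

extracted = FeatureExtractor('petal_length').fit_transform(toy)
print(extracted.shape)
```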

We'll now define the modeling steps in our pipeline. For this example we'll take the petal length feature, bin it into a dummy variable where the cutoff is the median petal length, then predict the flower species using that one feature.
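
To make the binning step concrete, here is a minimal sketch of what Binarizer does to a single column. The values and the 4.4 threshold are hypothetical stand-ins for the petal lengths and their median.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical petal lengths, shaped as one column
lengths = np.array([[1.4], [4.4], [4.5], [5.9]])

# Values strictly greater than the threshold become 1; the rest become 0
binarized = Binarizer(threshold=4.4).fit_transform(lengths)
print(binarized.ravel())
```

Note that a value equal to the threshold maps to 0, so roughly half of the rows end up on each side when the threshold is the median.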

In [5]:
# Define modeling steps
modeling_steps = [
    ('extract-petal_length', FeatureExtractor('petal_length')),
    ('cut_off_at_median', Binarizer(threshold=X_train['petal_length'].median())),
    ('predict_using_knn', KNeighborsClassifier())
]

print(modeling_steps)

[('extract-petal_length', FeatureExtractor(column='petal_length')), ('cut_off_at_median', Binarizer(copy=True, threshold=4.4)), ('predict_using_knn', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))]


In [6]:
# Instantiate Pipeline object with steps and fit
model_1 = Pipeline(modeling_steps)
model_1.fit(X_train, y_train)

Pipeline(steps=[('extract-petal_length', FeatureExtractor(column='petal_length')), ('cut_off_at_median', Binarizer(copy=True, threshold=4.4)), ('predict_using_knn', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))])

With this Pipeline object, we can pass in different dataframes and make predictions, as long as each one contains the 'petal_length' column.
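
It can help to see roughly what a Pipeline does when it is fitted and used for prediction. The sketch below uses hypothetical toy data and a simpler two-step pipeline (StandardScaler followed by KNeighborsClassifier, not the exact steps above) and compares it with doing the same work by hand.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one feature, two well-separated classes
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(0, 1, (20, 1)), rng.normal(5, 1, (20, 1))])
y_toy = np.array([0] * 20 + [1] * 20)

# Pipeline version
pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsClassifier())])
pipe.fit(X_toy, y_toy)

# Manual version: fit_transform each transformer, then fit the estimator
scaler = StandardScaler()
knn = KNeighborsClassifier()
knn.fit(scaler.fit_transform(X_toy), y_toy)

# At predict time, only transform (never re-fit) before predicting
manual_pred = knn.predict(scaler.transform(X_toy))
same = np.array_equal(pipe.predict(X_toy), manual_pred)
print(same)
```

The practical benefit is that the pipeline remembers which steps must only be transformed at prediction time, so there is less room to accidentally re-fit a scaler on test data.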

In [7]:
# Score on training and test set
print('Training score:', model_1.score(X_train, y_train))
print('Test score:', model_1.score(X_test, y_test))

Training score: 0.533333333333
Test score: 0.5


In [8]:
# Get first 5 predictions
print(model_1.predict(X_test)[0:5])

[1 1 2 1 2]


#### Abalone Dataset
We'll build another Pipeline for the abalone dataset. The data can be found at http://archive.ics.uci.edu/ml/datasets/Abalone.

In [9]:
# CSV doesn't contain column header so we'll have to create a list
# of the columns given in the url
abalone_cols = ['sex', 'length', 'diameter', 'height', 'whole_weight',\
               'shucked_weight', 'viscera_weight', 'shell_weight', 'rings']

df_abalone = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',\
                names=abalone_cols)

print(df_abalone.shape)
df_abalone.head()

(4177, 9)


|   | sex | length | diameter | height | whole_weight | shucked_weight | viscera_weight | shell_weight | rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |


In [10]:
# Check for Nans and datatypes
df_abalone.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
sex               4177 non-null object
length            4177 non-null float64
diameter          4177 non-null float64
height            4177 non-null float64
whole_weight      4177 non-null float64
shucked_weight    4177 non-null float64
viscera_weight    4177 non-null float64
shell_weight      4177 non-null float64
rings             4177 non-null int64
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


For this example, let's try to predict whether or not the abalone in question is above the average age in our sample, using the number of rings as a proxy for age.

In [11]:
# Set up feature matrix and target vector
X = df_abalone.drop('rings', axis=1)
y = df_abalone.rings

# Binarize target vector: 1 if rings exceed the sample mean, else 0
y = (y > y.mean()).astype(int)

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,\
                                                   random_state=1337)

In [12]:
# Summary statistics for the binarized rings target
print(y_train.describe())

print('')
print('baseline: {}%'.format(round(y_train.mean()*100, 2)))

count    3341.000000
mean        0.503741
std         0.500061
min         0.000000
25%         0.000000
50%         1.000000
75%         1.000000
max         1.000000
Name: rings, dtype: float64

baseline: 50.37%
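
The baseline above uses `y_train.mean()`, which equals the majority-class share only because class 1 happens to be the larger class in the training set. A slightly more general sketch, using a hypothetical toy target in place of `y_train`, takes the largest class share instead:

```python
import pandas as pd

# Hypothetical binary target, standing in for y_train
y_toy = pd.Series([1, 0, 1, 1, 0, 1, 0, 1])

# Share of each class; the larger share is the majority-class baseline
class_shares = y_toy.value_counts(normalize=True)
baseline = class_shares.max()
print('baseline: {}%'.format(round(baseline * 100, 2)))
```

This is the accuracy a model would get by always predicting the most common class, which is the number a real model needs to beat.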


In [13]:
# Modeling steps
modeling_steps = [
    ('extract_diameter', FeatureExtractor('diameter')),
    ('create_polynomials', PolynomialFeatures(3, include_bias=False)),
    ('standardize', StandardScaler()),
    ('predict', LogisticRegression())
]

# Instantiate Pipeline
pipe = Pipeline(modeling_steps)

# Fit pipeline object
pipe.fit(X_train, y_train)

Pipeline(steps=[('extract_diameter', FeatureExtractor(column='diameter')), ('create_polynomials', PolynomialFeatures(degree=3, include_bias=False, interaction_only=False)), ('standardize', StandardScaler(copy=True, with_mean=True, with_std=True)), ('predict', LogisticRegression(C=1.0, class_weight=None, dual...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [14]:
# Score against test set
pipe.score(X_test, y_test)

0.72368421052631582

In [15]:
pred = pipe.predict(X_test)

cf_matrix = pd.DataFrame(confusion_matrix(y_test, pred), \
                        columns=['Predicted 0', 'Predicted 1'], \
                        index=['Actual 0', 'Actual 1'])

print(cf_matrix)
print('')
print(classification_report(y_test, pred))

          Predicted 0  Predicted 1
Actual 0          309          129
Actual 1          102          296

             precision    recall  f1-score   support

          0       0.75      0.71      0.73       438
          1       0.70      0.74      0.72       398

avg / total       0.73      0.72      0.72       836



Considering that the model only uses polynomial features of a single column (diameter), it performed fairly well! We beat the 50.37% training baseline by roughly 22 percentage points (72.37% test accuracy), and the precision and recall scores are fairly balanced.

#### make_pipeline()
We'll now walk through the make_pipeline function in scikit-learn. The big difference between this function and the Pipeline class is that we can skip naming the steps of the pipeline; the names are generated for us.
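
To see what happens to the step names, the sketch below builds the same three steps both ways and prints the names. The explicit names are reused from the Pipeline example above; the FeatureExtractor step is left out only to keep the block self-contained.

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression

# Pipeline: we choose the step names ourselves
named = Pipeline([
    ('create_polynomials', PolynomialFeatures(3, include_bias=False)),
    ('standardize', StandardScaler()),
    ('predict', LogisticRegression())
])

# make_pipeline: names are derived from the lowercased class names
auto_named = make_pipeline(
    PolynomialFeatures(3, include_bias=False),
    StandardScaler(),
    LogisticRegression()
)

explicit_names = [name for name, _ in named.steps]
generated_names = [name for name, _ in auto_named.steps]
print(explicit_names)
print(generated_names)
```

The generated names matter later if you want to reach into a step, for example through `named_steps` or when setting parameters by name.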

In [16]:
# Create pipeline object
pipe = make_pipeline(
    FeatureExtractor('diameter'),
    PolynomialFeatures(3, include_bias=False),
    StandardScaler(),
    LogisticRegression()
)

# Fit pipe object
pipe.fit(X_train, y_train)

# Score against test set
pipe.score(X_test, y_test)

0.72368421052631582

We got the same score as with the Pipeline class, which was expected.

If we wanted to one-hot encode a categorical column, we would need to create a 'categorical extractor' class.

In [17]:
class CategoricalExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column
        self.values = None
        
    def _create_values(self, indices):
        return {ind: i+1 for i, ind in enumerate(indices)}
    
    def _apply_values(self, row_val):
        return self.values.get(row_val, 0)
        
    def fit(self, X, y=None):
        self.values = self._create_values(X[self.column].value_counts().index)
        return self 
    
    def transform(self, X, y=None):
        col = X[self.column].apply(self._apply_values)
        return col.values.reshape(-1, 1)

We can now use this class in place of FeatureExtractor and pass its output directly to OneHotEncoder.

In [18]:
# Show unique categories of sex and show dummified matrix
# Note: scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`
pipe = make_pipeline(
    CategoricalExtractor('sex'),
    OneHotEncoder(sparse=False, handle_unknown='ignore')
)

print(X_train['sex'].value_counts())
pipe.fit(X_train)
print(pipe.transform(X_train)[0:5, :])

M    1228
I    1071
F    1042
Name: sex, dtype: int64
[[ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]]
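
One detail worth illustrating is what happens to a category that was not seen during fitting. The sketch below uses a condensed copy of the extractor (storing the mapping in a `values_` attribute, which is a naming choice made here) and hypothetical toy data with an unseen label 'X'. It relies on the encoder's default sparse output and converts it with `.toarray()`, and it assumes a recent scikit-learn release.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder


class CategoricalExtractor(BaseEstimator, TransformerMixin):
    """Condensed copy of the class above so the block runs on its own."""
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Most frequent category gets code 1, the next gets 2, and so on
        counts = X[self.column].value_counts()
        self.values_ = {cat: i + 1 for i, cat in enumerate(counts.index)}
        return self

    def transform(self, X, y=None):
        # Categories not seen during fit fall back to code 0
        codes = X[self.column].map(lambda v: self.values_.get(v, 0))
        return codes.values.reshape(-1, 1)


train = pd.DataFrame({'sex': ['M', 'M', 'M', 'I', 'I', 'F']})
new = pd.DataFrame({'sex': ['F', 'X']})  # 'X' is a hypothetical unseen label

pipe = make_pipeline(CategoricalExtractor('sex'),
                     OneHotEncoder(handle_unknown='ignore'))
pipe.fit(train)

# Default output is a SciPy sparse matrix, so convert it for display
encoded = pipe.transform(new).toarray()
print(encoded)
```

Because the unseen label is mapped to code 0, which the encoder never saw, `handle_unknown='ignore'` turns it into a row of zeros instead of raising an error.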


#### FeatureUnion() 
Pipeline is great for manipulating a single column, but what if we want to manipulate multiple columns (which is the case more often than not)?

FeatureUnion to the rescue!

In [19]:
# pipeline for petal length
petal_length_pipe = make_pipeline(
    FeatureExtractor('petal_length'),
    Binarizer(threshold=df_iris['petal_length'].mean())
)

# pipeline for petal width
petal_width_pipe = make_pipeline(
    FeatureExtractor('petal_width'),
    PolynomialFeatures(2, include_bias=False)
    
)

# Create feature union for the 2 pipelines above
fu = FeatureUnion([
    ('petal_length_transformer', petal_length_pipe),
    ('petal_width_transformer', petal_width_pipe)
])

# Fit and transform feature union
fu.fit(df_iris)
fu.transform(df_iris)[0:5, :]

array([[ 0.  ,  0.2 ,  0.04],
       [ 0.  ,  0.2 ,  0.04],
       [ 0.  ,  0.2 ,  0.04],
       [ 0.  ,  0.2 ,  0.04],
       [ 0.  ,  0.2 ,  0.04]])
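
Conceptually, a FeatureUnion fits each transformer on the same input and places the outputs side by side. The sketch below checks this against a manual `np.hstack` on a hypothetical single-column array; the threshold of 1.0 is arbitrary.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import Binarizer, PolynomialFeatures

# Hypothetical single-column input
X_toy = np.array([[0.2], [1.4], [2.1]])

binarize = Binarizer(threshold=1.0)
poly = PolynomialFeatures(2, include_bias=False)

fu_toy = FeatureUnion([('binarize', binarize), ('poly', poly)])
union_out = fu_toy.fit_transform(X_toy)

# Manual equivalent: transform separately, then stack columns side by side
manual_out = np.hstack([binarize.fit_transform(X_toy),
                        poly.fit_transform(X_toy)])

print(union_out.shape)
print(np.allclose(union_out, manual_out))
```

One binarized column plus two polynomial columns gives three columns in total, matching the layout of the iris output above.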

Just as Pipeline has make_pipeline, FeatureUnion has a helper function, make_union, that removes some of the boilerplate code.

In [20]:
fu = make_union(
    petal_length_pipe,
    petal_width_pipe
)

fu.fit(df_iris)
fu.transform(df_iris)[0:5, :]
# As expected it's identical to the above feature union

array([[ 0.  ,  0.2 ,  0.04],
       [ 0.  ,  0.2 ,  0.04],
       [ 0.  ,  0.2 ,  0.04],
       [ 0.  ,  0.2 ,  0.04],
       [ 0.  ,  0.2 ,  0.04]])

Let's try creating another feature union, but with the abalone dataset.

In [21]:
length_mean = X_train.length.mean()

length_pipe = make_pipeline(
    FeatureExtractor('length'),
    Binarizer(threshold=length_mean)
)

diameter_pipe = make_pipeline(
    FeatureExtractor('diameter'),
    PolynomialFeatures(2, include_bias=False)
)

height_pipe = make_pipeline(
    FeatureExtractor('height'),
    Binarizer(threshold=0.1)
)

fu_abalone = make_union(
    length_pipe,
    diameter_pipe,
    height_pipe
)

fu_abalone.fit(X_train)
fu_abalone.transform(X_train)[0:5, :]

array([[ 0.      ,  0.26    ,  0.0676  ,  0.      ],
       [ 0.      ,  0.295   ,  0.087025,  0.      ],
       [ 1.      ,  0.42    ,  0.1764  ,  1.      ],
       [ 1.      ,  0.45    ,  0.2025  ,  1.      ],
       [ 1.      ,  0.43    ,  0.1849  ,  1.      ]])

As expected, the transformation yields 4 columns: binarized length, 2 columns of polynomial features for diameter, and binarized height.
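
The two middle columns come from PolynomialFeatures: with degree 2 and no bias term, a single input column x expands to x and x squared. A small sketch, using the two diameters from the first rows of the output above:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two diameters taken from the first rows of the output above
diameters = np.array([[0.26], [0.295]])

poly = PolynomialFeatures(2, include_bias=False)
expanded = poly.fit_transform(diameters)

# Column 0 is x, column 1 is x squared
print(expanded)
```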

The power of pipelines and feature unions is that you can chain them together. We'll build a chained example next.

In [22]:
# Create pipelines for OHE sex column and polynomial features with
# diameter column
sex_pipe = make_pipeline(
    CategoricalExtractor('sex'),
    OneHotEncoder(sparse=False, handle_unknown='ignore')
)

diameter_pipe = make_pipeline(
    FeatureExtractor('diameter'),
    PolynomialFeatures(2, include_bias=False)
)

# Create feature union for pipelines and extract height and length
feature_transformers = make_union(
    sex_pipe,
    diameter_pipe,
    FeatureExtractor('height'),
    FeatureExtractor('length')
)

# Create modeling pipeline with feature union, standard scaler, and
# random forest classifier
modeling_pipe = make_pipeline(
    feature_transformers,
    StandardScaler(),
    RandomForestClassifier()
)

# Fit modeling pipeline
modeling_pipe.fit(X_train, y_train)

Pipeline(steps=[('featureunion', FeatureUnion(n_jobs=1,
       transformer_list=[('pipeline-1', Pipeline(steps=[('categoricalextractor', CategoricalExtractor(column='sex')), ('onehotencoder', OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
       handle_unknown='ignore', n_values='au...imators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False))])

In [23]:
# Score against test set
modeling_pipe.score(X_test, y_test)

0.6925837320574163

In [24]:
# Predict test data
pred = modeling_pipe.predict(X_test)

# Print confusion matrix and classification report
cf_matrix = confusion_matrix(y_test, pred)
print(pd.DataFrame(cf_matrix, columns=['Predicted 0', 'Predicted 1'],\
            index=['Actual 0', 'Actual 1']))
print(classification_report(y_test, pred))

          Predicted 0  Predicted 1
Actual 0          307          131
Actual 1          126          272
             precision    recall  f1-score   support

          0       0.71      0.70      0.70       438
          1       0.67      0.68      0.68       398

avg / total       0.69      0.69      0.69       836



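All that work and we received a slightly worse score (about 0.69 versus 0.72) than when we predicted with just the diameter as a polynomial feature! Keep in mind that the random forest is unseeded here (random_state=None), so its score will shift a little from run to run.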

#### Conclusion
Pipelines and feature unions give us a robust and powerful way to reproducibly transform data and make predictions. In cases where we need to repeatedly predict a target from the same set of features, pipelines can really help speed up and automate the process.