I've been trying to improve my preprocessing and modelling workflow and have successfully put together a [scikit-learn pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to process the string feature and apply an XGBoost model. I wrote this notebook to demonstrate the use of a pipeline for linking together data preprocessing and model, rather than to present a high-scoring model in this month's Tabular Playground Series competition. While scikit-learn pipelines offer amazing functionality and extensibility, they are not as easy to use as I imagined they would be: for example, naming the output columns from e.g. `OneHotEncoder` is probably more difficult than it should be.

# What is a pipeline?

A pipeline is a series of processing and/or modelling steps bundled together into one object. From the [scikit-learn pipeline documentation](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html):

> The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. 

Quite often we have a series of transformations (e.g. encoding categorical variables, power transforms for numerical variables, feature engineering, imputing missing values) that we'd like to apply to a dataset, and a candidate model (or models). Notice that preprocessing steps in scikit-learn all use `fit()` and `transform()` methods, and the machine learning estimators and models have similarly structured APIs, namely the `fit()` and `predict()` methods. The idea behind a pipeline is to use these similar APIs to link together transformations and models so that the output of one step feeds into the input of the next step in the pipeline.
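
To make the linking idea concrete, here is a minimal sketch on synthetic data. The data, the step names (`'scale'`, `'model'`) and the choice of `StandardScaler` with `LogisticRegression` are assumptions made purely for illustration and are not part of the competition workflow. It checks that predicting with the pipeline gives the same result as applying the fitted steps one after the other by hand:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = (X_demo[:, 0] > 5.0).astype(int)

demo_pipe = Pipeline(steps=[('scale', StandardScaler()),
                            ('model', LogisticRegression())])
demo_pipe.fit(X_demo, y_demo)

# The pipeline feeds the output of the scaler into the model...
demo_predictions = demo_pipe.predict(X_demo)

# ...which is the same as applying the fitted steps by hand
X_scaled = demo_pipe.named_steps['scale'].transform(X_demo)
manual_predictions = demo_pipe.named_steps['model'].predict(X_scaled)

print((demo_predictions == manual_predictions).all())
```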



In [None]:
from sklearn.pipeline import Pipeline

# This displays pipelines as topological diagrams which are a bit more informative than the 
# python text __repr__
from sklearn import set_config
set_config(display='diagram')


## Benefits

Benefits of using a pipeline:

1. Eliminating duplicated code.

    An important tenet of software engineering is ["Don't repeat yourself"](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself), or DRY. Copying and pasting code used to process training data to process the testing data is fraught with problems. Changes made to the data pre-processing workflow can be easily missed or errors made if you need to change the code in two (or more) places. Using a pipeline eliminates duplicated code by combining preprocessing steps into one class which can be easily instantiated again.

2. Avoiding [data](https://machinelearningmastery.com/data-preparation-without-data-leakage/) [leakage](https://jfrog.com/community/data-science/be-careful-from-data-leakage/) by processing training and test data separately.

    One (substandard) way to get around duplicated code is to combine the training and test data and apply the preprocessing to the combined dataset. I have done this before, and I see it quite commonly on Kaggle, but it is [not recommended](https://community.alteryx.com/t5/Data-Science/Dealing-with-Data-Leakage/ba-p/827583). Briefly, preprocessing training and test data together means test data information (e.g. distributions of features) is seen by the `fit()` method. This gives an overly optimistic test score, and the model then performs worse than expected on genuinely new data.

3. Code readability.

    By eliminating duplicated code and having everything in one place, readability of your code will be improved.

4. Allowing preprocessing choices to be tuned for better test predictions.

    Do you know what the effect is of the different preprocessing choices you've made on your model predictions? Testing the effect of preprocessing is typically done manually, by making a change and rerunning the notebook. This is slow and bulky to do, and it's very difficult to perform a proper workflow evaluation. Pipelines can automate the tuning and evaluation of your preprocessing workflow and help to answer questions like:
    
    1. Do particular preprocessing steps actually benefit the model?
    2. For categorical encoding (e.g. OneHotEncoder), does grouping smaller infrequent classes help?
    3. What parameters are best for numerical transforms (e.g. power transforms)?
    4. Does dropping certain columns help predictions?
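
As a rough sketch of the last benefit, the example below uses `GridSearchCV` to compare several choices for a single preprocessing step, including skipping it altogether with `'passthrough'`. The synthetic data, the step names and the candidate transformers are all assumptions for illustration; this is not the tuning setup used for the competition data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Synthetic, skewed, strictly positive features (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.lognormal(size=(200, 4))
y_demo = (np.log(X_demo[:, 0]) + np.log(X_demo[:, 1]) > 0).astype(int)

tuning_pipe = Pipeline(steps=[('transform', PowerTransformer()),
                              ('model', LogisticRegression())])

# Each candidate replaces the whole 'transform' step; 'passthrough' skips it
param_grid = {'transform': [PowerTransformer(method='yeo-johnson'),
                            PowerTransformer(method='box-cox'),
                            StandardScaler(),
                            'passthrough']}

search = GridSearchCV(tuning_pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

# Mean cross-validated score for each preprocessing choice
for params, score in zip(search.cv_results_['params'],
                         search.cv_results_['mean_test_score']):
    print(params['transform'], round(score, 3))
```

Because the preprocessing is refitted inside every cross-validation fold, the comparison between candidates does not leak information from the validation folds.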
    
    
# Data from the [Tabular Playground Series May 2022 competition](https://www.kaggle.com/competitions/tabular-playground-series-may-2022).



In [None]:
import numpy as np
import pandas as pd

train = pd.read_csv('/kaggle/input/tabular-playground-series-may-2022/train.csv', 
                    index_col='id')
test = pd.read_csv('/kaggle/input/tabular-playground-series-may-2022/test.csv', 
                   index_col='id')

# This example data is used to demonstrate the transformation of multiple categorical
# columns, as only one column from train/test is transformed
example_categorial_data_columns = pd.DataFrame({'a1':['gds','fff','das','fbd','ggg'],
                                                'a2':['xg','tt','fa','fd','tt']})

# A small sample of the training data set for illustrative purposes:
train_small = train.head(10).loc[:,['f_01','f_27','f_28']]
test_small = test.head(10).loc[:,['f_01','f_27','f_28']]

Many wonderful EDAs have been produced for this competition, e.g.

1. [AmbrosM](https://www.kaggle.com/code/ambrosm/tpsmay22-eda-which-makes-sense) as usual provided an amazing EDA.
2. [Ritika Gupta](https://www.kaggle.com/code/ritzig/feature-interaction-tutorial-pdp-shap-ensemble-mod) also discusses feature interactions.
3. [Kelli Belcher](https://www.kaggle.com/code/kellibelcher/tps-may-2022-eda-lgbm-neural-networks) has an EDA with LGBM feature importance.
4. [Naosher Mustakim](https://www.kaggle.com/code/naoshermustakim/comprehensive-eda-tps-may) also has a good EDA.
5. I provided an [overview](https://www.kaggle.com/code/nnjjpp/eda-may-2022-exploring-the-string-feature-f-27) of the string feature (`f_27`)

There are many more EDAs; check the "Code" section of the competition.

Briefly, the data consists of 30 features, one of which is a string of ten letters, fourteen of the remaining features are ordinal (integer-valued) variables, and the rest are continuous (real-valued) variables. The `train_small` sample data subset has the string feature and two continuous variables to demonstrate the pipeline.


In [None]:
train_small

# Data preprocessing

For this notebook the only data preprocessing that I do is splitting of the string feature and encoding the letters. Obviously other preprocessing is possible, and feature engineering through the creation of aggregated features and interaction variables has been popular and successful in this competition. 

## Splitting and encoding of the string feature

As a list of strings, the `f_27` feature can be easily separated using a Python list comprehension:


In [None]:
def g(X):
    return pd.DataFrame([list(x) for x in X])

f_27_split = g(train_small['f_27'])
f_27_split

We can then apply `OrdinalEncoder` to this to turn these letters into numerical codes, which is required for models like XGBoost.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

OrdinalEncoder().fit_transform(f_27_split)
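
To see what the encoder is doing, here is a small sketch on made-up three-letter strings (the strings are hypothetical, not competition data). Each column gets its own sorted list of categories, so the same letter can receive different codes in different positions:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy strings, split into one column per letter
letters = pd.DataFrame([list(s) for s in ['ABA', 'BCA', 'ACB']])

encoder = OrdinalEncoder()
codes = encoder.fit_transform(letters)

# One sorted category list per column; a letter's code is its position in that list
print(encoder.categories_)
print(codes)
```

Here 'B' is encoded as 1 in the first column (categories 'A', 'B') but as 0 in the second column (categories 'B', 'C').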

## Create a transformer class

In order to use this transformation in a pipeline, we need to wrap the steps up into a transformer class. We inherit a few things from `BaseEstimator` and `TransformerMixin` (scikit-learn base classes). The actual splitting and encoding is performed in the `transform` method:

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OrdinalEncoder

# The methods get_params() and set_params() are needed by the pipeline and are
# inherited from BaseEstimator, which goes last in the list of base classes.
class SplitAndEncode(TransformerMixin, BaseEstimator):
    def _split(self, X):
        # Split each string into letters (one column per letter) and stack
        # the letter columns of all input columns side by side.
        return np.column_stack([[list(x) for x in X.loc[:, col]]
                                for col in X.columns])

    def fit(self, X, y=None):
        print(f'{self.__class__.__name__}(): fit called')
        # Just save some information that could be useful in the transformation:
        self.n_features_in = X.shape[1]
        self.feature_names_in_ = X.columns
        # Calculate the maximum length of string features, which is not
        # used here but could be useful for processing variable length
        # strings:
        self.feature_lengths = X.aggregate(func=lambda y: max([len(x) for x in y]))
        # Learn the letter codes from the fitting data only, so that new data
        # is encoded consistently. Letters not seen during fit become -1.
        self.encoder_ = OrdinalEncoder(handle_unknown='use_encoded_value',
                                       unknown_value=-1)
        self.encoder_.fit(self._split(X))
        # scikit-learn expects fit to return the fitted object
        return self

    def transform(self, X, y=None):
        print(f'{self.__class__.__name__}(): transform called')
        if not hasattr(self, 'encoder_'):
            raise ValueError(f'Need to call {self.__class__.__name__}.fit() method first')
        # Here is where the action takes place: split the strings and
        # apply the codes learned in fit.
        return self.encoder_.transform(self._split(X))

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.transform(X, y)

    def inverse_transform(self, X):
        # We probably should define an inverse_transform method that combines the
        # encoded columns back to a string
        raise NotImplementedError

SplitAndEncode().fit_transform(train_small[['f_27']])

The `column_stack` operation and the iteration over the columns of `X` in the `transform` method above allow multiple categorical columns to be passed through. Although we only have one string feature in the competition data, applying the transformation to the example dataframe `example_categorial_data_columns` demonstrates this:

In [None]:
example_categorial_data_columns

In [None]:
SplitAndEncode().fit_transform(example_categorial_data_columns)

So now we have the two steps, i.e. splitting the string and encoding the letters, wrapped up in a transformer class, which looks very similar to the usual preprocessing classes (e.g. `OrdinalEncoder`, `MinMaxScaler`) in scikit-learn.

The next step is to use a `ColumnTransformer`, which allows us to combine multiple column transformations (not done here), and allow the rest of the columns to pass through unchanged. [This article](https://machinelearningmastery.com/columntransformer-for-numerical-and-categorical-data/) at machinelearningmastery.com has a lot more detail on this and I found it very useful. As usual, [the scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) is also helpful.

To set up a `ColumnTransformer`, we make a list of transformers, each of which is specified by a name, the transformer instance and a list of columns. We also define what happens to columns that we haven't specified using the `remainder` keyword. The default is to ignore these columns, but we'd like to include them untransformed.

In [None]:
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(transformers=[('separate_and_encode_categorical', # name, can be anything
                                      SplitAndEncode(), # transformer class instance
                                      ['f_27']), # list of columns to be transformed
                                     
                                     # Can include additional transformers here...
                                     
                                    ],
                       remainder='passthrough')

ct

In [None]:
ct.fit_transform(train_small)

Notice that the columns have been reordered here: the columns in the transformed data follow the order specified in `ColumnTransformer`, with the transformed `f_27` variables appearing first, then the passthrough variables (i.e. `f_01` and `f_28`). This is fine for a model, but is far from ideal for us, as we have lost the column names and this could (probably will) cause problems and confusion later on.
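
The same ordering rule applies when there is more than one transformer. The sketch below uses a made-up three-column dataframe and two standard scikit-learn transformers (the column names and transformer names are hypothetical) to show that the output follows the order of the `transformers` list, with the remainder columns last:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Hypothetical toy data: input column order is num_a, cat, num_b
toy = pd.DataFrame({'num_a': [0.0, 5.0, 10.0],
                    'cat': ['x', 'y', 'x'],
                    'num_b': [1.0, 2.0, 3.0]})

toy_ct = ColumnTransformer(transformers=[('encode', OrdinalEncoder(), ['cat']),
                                         ('scale', MinMaxScaler(), ['num_b'])],
                           remainder='passthrough')
toy_out = toy_ct.fit_transform(toy)

# Output column order: 'cat' (encoded), 'num_b' (scaled), then the remainder 'num_a'
print(toy_out)
```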

A common way of dealing with this lack of column names is to write over the columns in the dataframe with the transformed data in-place, e.g.:

> <code>data.loc[:,columns_to_be_transformed] = OrdinalEncoder().fit_transform(data.loc[:,columns_to_be_transformed])</code>

This is problematic as we are either creating two versions of the same named dataset or creating temporary variables, neither of which is good data-science practice. Also, as far as I'm aware, this workflow cannot be incorporated into a pipeline. We need to somehow incorporate column renaming into the pipeline workflow.

# Naming output feature columns within a pipeline

Scikit-learn transformers fitted on a pandas dataframe generally save a copy of the input column names in the `feature_names_in_` attribute, and the API sometimes (but not always!) has a `get_feature_names_out` method that gives us a list of feature names corresponding to the output from the transformer. It looks like [newer releases of scikit-learn](https://scikit-learn.org/stable/whats_new/v1.1.html) are rolling out `get_feature_names_out` methods more consistently across estimators and transformers, but for the moment naming output columns is a bit tricky. I found a [question/answer on Stack Overflow that describes the problem](https://stackoverflow.com/questions/61079602/how-to-get-feature-names-using-a-column-transformer), and what follows is based loosely on this.

## Implementing a feature names generator

Let's implement a `get_feature_names_out` method for our transformer class by taking the input feature names and appending the letter position index to get 'f_27_0', 'f_27_1', and so on. Note that using the `feature_names_in_` attribute means we can pass multiple columns through the transformer (if there were an additional string feature, for example) and end up with meaningful names across different feature columns.

By creating a derived class from the `SplitAndEncode` class we [inherit](https://docs.python.org/3/tutorial/classes.html#inheritance) all the previous attributes and methods, so we can just define the new method:

In [None]:
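class SplitAndEncodeWithNames(SplitAndEncode):
    # input_features is unused, but is part of the usual scikit-learn signature
    def get_feature_names_out(self, input_features=None):
        names_out = []
        for i, name_in in enumerate(self.feature_names_in_):
            # feature_lengths is indexed by column name, so use .iloc for position i
            names_out += [f'{name_in}_{j}' for j in range(self.feature_lengths.iloc[i])]
        return names_out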

split_transformer_with_names = SplitAndEncodeWithNames()
split_transformer_with_names.fit(train_small[['f_27']])
split_transformer_with_names.get_feature_names_out()

And let's try it on the example dataframe with two categorical columns:

In [None]:
split_transformer_with_names.fit(example_categorial_data_columns)
split_transformer_with_names.get_feature_names_out()

So this works for single columns as well as multiple columns by iterating through the `feature_names_in_` attribute and the `feature_lengths` attribute that we set up in the `fit` method of the `SplitAndEncode` class.

## Incorporating a get_feature_names_out method into ColumnTransformer

`ColumnTransformer` does have a `get_feature_names_out` method but it [causes problems when upstream `get_feature_names_out` methods are not properly defined](https://johaupt.github.io/blog/columnTransformer_feature_names.html). In particular, the passthrough section of the `ColumnTransformer` doesn't provide feature names. The following class derived from `ColumnTransformer` gets output feature names from the transformers and appends the passthrough names:

In [None]:
class ColumnTransformerNamed(ColumnTransformer):
    # input_features is unused, but is part of the usual scikit-learn signature
    def get_feature_names_out(self, input_features=None):
        names = []
        for name, transformer, columns in self.transformers_:
            if name == 'remainder':
                if transformer == 'passthrough':
                    names += list(self.feature_names_in_[columns])
                break
            else:
                names += list(transformer.get_feature_names_out())
        return names

ct_named = ColumnTransformerNamed(transformers=[('separate_and_encode_categorical', 
                                                 SplitAndEncodeWithNames(), ['f_27'])],
                                  remainder='passthrough')

ct_named

In [None]:
train_tr = ct_named.fit_transform(train_small)
ct_named.get_feature_names_out()

## Final preprocessing pipeline step: convert to pandas dataframe and name columns

The `get_feature_names_out` methods are properly defined for the steps, so I create another transformer that converts the numpy array to a pandas dataframe and names the columns accordingly:

In [None]:
class rename(TransformerMixin, BaseEstimator):
    def __init__(self, name_func):
        # name_func is a function that returns the list of output column names
        self.name_func = name_func
    def fit(self, X, y=None):
        self.names = self.name_func()
        # scikit-learn expects fit to return the fitted object
        return self
    def transform(self, X, y=None):
        Xpd = pd.DataFrame(X)
        Xpd.columns = self.names
        return Xpd
    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.transform(X, y)
    def get_feature_names_out(self, input_features=None):
        return self.name_func()

We can include this transformer as the next step in a pipeline, passing through a function that returns the column names (i.e. `get_feature_names_out`).

# Assembling the preprocessing steps into a pipeline

The scikit-learn `Pipeline` [constructor syntax](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) is similar to the `ColumnTransformer` syntax. We specify a list of steps for the pipeline to perform. Alternatively, we can use the [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html) function, which differs from the constructor by not needing us to name the steps.
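
The difference between the two is easiest to see side by side. In this sketch (the two transformers are chosen arbitrarily for illustration), `make_pipeline` generates the step names itself from the lowercased class names:

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Constructor: we choose the step names
named = Pipeline(steps=[('encode', OrdinalEncoder()), ('scale', MinMaxScaler())])

# make_pipeline: step names are generated automatically
auto_named = make_pipeline(OrdinalEncoder(), MinMaxScaler())

print([name for name, _ in named.steps])
print([name for name, _ in auto_named.steps])
```

I use the constructor in this notebook because explicit step names make it easier to look up steps later via `named_steps`.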

In [None]:
preprocessing_pipe = Pipeline(steps=(('tr', ct_named),
                                     ('rename', rename(ct_named.get_feature_names_out))))
preprocessing_pipe

In [None]:
%%time
train_transformed = preprocessing_pipe.fit_transform(train_small)
train_transformed

The pipeline we have made performs all the preprocessing we have specified, returns a pandas dataframe and has the columns nicely (and correctly) labelled. We can now apply the fitted pipeline to the test data:

In [None]:
%%time
preprocessing_pipe.transform(test_small)

which works as expected. Note that this workflow (fitting on the training data and only transforming the test data; notice that `fit` is not called on the test data) prevents data leakage as [described on Cross Validated](https://stats.stackexchange.com/questions/267012/difference-between-preprocessing-train-and-test-set-before-and-after-splitting), which was one of our objectives in setting up the pipeline workflow.
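
A tiny numerical sketch of why this matters, using `StandardScaler` on made-up numbers (not the competition data): the statistics learned in `fit` are untouched by `transform`, whereas fitting on the combined data lets the test values shift them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature train and test sets
train_demo = np.array([[1.0], [2.0], [3.0]])
test_demo = np.array([[10.0], [20.0]])

# Fit on the training data only, then transform the test data
scaler = StandardScaler().fit(train_demo)
scaler.transform(test_demo)
print(scaler.mean_)   # still the training mean

# Fitting on the combined data leaks test information into the statistics
leaky_scaler = StandardScaler().fit(np.vstack([train_demo, test_demo]))
print(leaky_scaler.mean_)
```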

# Fitting, transforming and predicting with the pipeline

The hard part is done and now we can use the pipeline to preprocess the data, fit the model, and use it to make predictions. To do this, just include a model as the final step. The preprocessed data will be used as an input into the model. Here we use XGBoost as an example.

In [None]:
from xgboost import XGBClassifier
num_xgb_ests = 150
preprocessing_and_model_pipe = Pipeline(steps=(('tr', ct_named),
                                               ('rename', 
                                                rename(ct_named.get_feature_names_out)),
                                               ('xgboost', 
                                                XGBClassifier(n_estimators = num_xgb_ests,
                                                             objective = 'binary:logistic'))))
preprocessing_and_model_pipe

Calling `fit` on the pipeline runs `fit_transform` on the preprocessing steps and `fit` on the final model.
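
One way to convince yourself of this is with a do-nothing transformer that records which of its methods are called. The `LoggingTransformer` class and the `call_log` list below are hypothetical helpers written only for this sketch, and `LogisticRegression` stands in for the model:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

call_log = []  # shared log, purely for illustration

class LoggingTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that passes data through and logs method calls."""
    def fit(self, X, y=None):
        call_log.append('fit')
        return self
    def transform(self, X):
        call_log.append('transform')
        return X

X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])

log_pipe = Pipeline(steps=[('log', LoggingTransformer()),
                           ('model', LogisticRegression())])

# Fitting the pipeline fits AND transforms with the preprocessing step
log_pipe.fit(X_demo, y_demo)
calls_during_fit = list(call_log)

# Predicting only transforms with the preprocessing step
call_log.clear()
log_pipe.predict(X_demo)
calls_during_predict = list(call_log)

print(calls_during_fit, calls_during_predict)
```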

In [None]:
%%time
preprocessing_and_model_pipe.fit(X = train.drop('target', axis=1),
                                 y = train['target'])

Predictions are made using the pipeline `predict` method:

In [None]:
%%time 
predictions = preprocessing_and_model_pipe.predict(X=test)

submission = pd.DataFrame({'id': test.index,
                           'target': predictions})
submission.to_csv('submission.csv', index=False)

## Model evaluation

For model evaluation purposes, we plot the feature importances from the XGBoost model. We can use the column names from the `get_feature_names_out` method from within the pipeline via the `named_steps` attribute of the pipeline. 

In [None]:
# Function courtesy of Tyrion Lannister-lzy:
# https://www.kaggle.com/code/tyrionlannisterlzy/xgboost-dnn-ensemble-lb-0-980

def plot_feature_importance(importance, names, model_type, max_features = 10):
    #Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    #Create a DataFrame using a Dictionary
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)

    #Sort the DataFrame in order of decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    fi_df = fi_df.head(max_features)

    #Define size of bar plot
    plt.figure(figsize=(8,6))

    #Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    #Add chart labels
    plt.title(model_type + ' feature importance plot')
    plt.xlabel('IMPORTANCE')
    plt.ylabel('FEATURE NAMES')

import seaborn as sns
import matplotlib.pyplot as plt
plot_feature_importance(preprocessing_and_model_pipe.named_steps['xgboost'].feature_importances_,
                        preprocessing_and_model_pipe.named_steps['rename'].get_feature_names_out(),
                        f'XGBoost Classifier, {num_xgb_ests} estimators', max_features = 25)

# Further directions

- Implementation of the `inverse_transform` method
- Allowing variable length strings (the `SplitAndEncode.feature_lengths` attribute can be used for this)
- Incorporation of pipeline into a hyperparameter tuning workflow. Note that parameters for the internals of the pipeline (for example preprocessing steps) can be specified according to the [nested parameters](https://scikit-learn.org/stable/modules/compose.html#nested-parameters) section of the documentation.
- Work out how to create new derived feature columns (aggregates, interactions, etc.)
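
As a starting point for the first item, here is a rough sketch of how the encoding could be reversed outside of the transformer class, assuming fixed-length strings. The toy strings are made up, and this is not yet wired into `SplitAndEncode`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical fixed-length strings
strings = pd.Series(['CAB', 'ABC', 'BBA'])
split = np.array([list(s) for s in strings])

encoder = OrdinalEncoder().fit(split)
encoded = encoder.transform(split)

# Reverse the two steps: decode each column back to letters, then join each row
decoded_letters = encoder.inverse_transform(encoded)
recovered = [''.join(row) for row in decoded_letters]

print(recovered)
```

A full `inverse_transform` method would also need to split the encoded columns back into groups, one group per original string column, before joining.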

# Conclusions

- Pipelines are an important and useful part of the data-science workflow, with the benefits of creating reproducible and more readable code, minimising data leakage, and offering the ability to tune parameters of the preprocessing step. 
- The use of a `ColumnTransformer` allows different columns to be processed differently.
- Preserving column names in a pipeline can be a bit tricky, but is possible by using (and possibly modifying) `get_feature_names_out` methods.

Comments are welcome. Please let me know if you have any suggestions on how to improve things, or if you have used pipelines in this competition or elsewhere.