## Pipeline and column transformer example

First I import the libraries used in this example.

In [1]:
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.compose import ColumnTransformer

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

Next, load the data and inspect the columns and values to get an overview. The data comes from a Kaggle competition called "Spaceship Titanic".

In [2]:
df = pd.read_csv("train.csv")

df.head()

,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Next I split the target off from the original dataframe and save it separately. I also drop two other features to keep this example simple.

In [3]:
df_y = df["Transported"]

# Remove the target and the two features that are not used in this example.
df.drop(columns=["Transported", "Name", "PassengerId"], inplace=True)

I then set up some lists that I'm going to use later in the script: the metrics to report, and the names of the categorical and numerical features.

In [4]:
scores = ["accuracy", "f1"]
cat_scores = ["HomePlanet", "CryoSleep", "Cabin", "Destination", "VIP"]
num_scores = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

Here is the first example of a pipeline. `Pipeline` is a class whose instances specify which steps are to be taken and in which order. Its argument is a list of (name, estimator) tuples; every step except the last must be a transformer, and the last step is typically the model. In `pipe_rf` below, the first step, named "encode", uses the scikit-learn OneHotEncoder with its parameters, the second step, named "impute", uses a KNN imputer, and so on.

I can then pass the entire pipeline to cross_validate in the place where a model would normally be specified, together with the features, the target data and the scores I want returned.
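
To make the "steps in order" idea concrete, here is a minimal sketch on made-up numeric data (the array and the name `toy_pipe` are illustrative, not part of the Kaggle example). It assumes a pipeline made only of transformers, and checks that fitting it gives the same result as chaining the two steps by hand.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

# Made-up numeric data with two missing values (illustrative only).
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [4.0, 40.0],
              [5.0, 50.0]])

# A pipeline made only of transformers: impute first, then scale.
toy_pipe = Pipeline([("impute", KNNImputer(n_neighbors=2)),
                     ("scale", MaxAbsScaler())])
out_pipe = toy_pipe.fit_transform(X)

# The same two steps chained by hand, in the same order.
imputed = KNNImputer(n_neighbors=2).fit_transform(X)
out_manual = MaxAbsScaler().fit_transform(imputed)

print(np.allclose(out_pipe, out_manual))  # True
print(list(toy_pipe.named_steps))         # ['impute', 'scale']
```

The step names are what `named_steps` exposes, which is handy for inspecting a fitted step later.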

In [5]:
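# Note: `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions.
pipe_rf = Pipeline([("encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),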
                   ("impute", KNNImputer()),
                   ("scale", MaxAbsScaler()),
                   ("classify", RandomForestClassifier(n_jobs=-1))])

results = cross_validate(pipe_rf, df, df_y, scoring=scores)

In [6]:
print(results["test_accuracy"])
print(results["test_f1"])

[0.73893042 0.73778033 0.73145486 0.75086306 0.75719217]
[0.71660424 0.72161172 0.69377049 0.71343481 0.73690773]
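
The two arrays above come from the dictionary that cross_validate returns. As a rough sketch of its layout, the snippet below runs the same call on synthetic data from `make_classification` (not the Kaggle data) with a small hypothetical pipeline named `toy_pipe`. Each requested metric shows up under a `test_` prefix with one value per fold, which is why five numbers are printed per metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic binary classification data (not the Kaggle data).
X, y = make_classification(n_samples=100, random_state=0)

toy_pipe = Pipeline([("scale", MaxAbsScaler()),
                     ("classify", RandomForestClassifier(n_estimators=10, random_state=0))])

# cv is left at its default, which is 5 folds.
toy_results = cross_validate(toy_pipe, X, y, scoring=["accuracy", "f1"])

print(sorted(toy_results))                 # ['fit_time', 'score_time', 'test_accuracy', 'test_f1']
print(toy_results["test_accuracy"].shape)  # (5,)
```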


The problem with `pipe_rf` is that it applies every step to every column. In our example, every unique value in the numerical features gets its own category, which bloats the number of dimensions in the dataset.
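
A small sketch of that bloat, using a made-up single-column dataframe (the values are illustrative only): one-hot encoding a numeric column produces one output column per distinct value.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up spending values; four of the five are distinct.
toy = pd.DataFrame({"RoomService": [0.0, 109.0, 43.0, 0.0, 303.0]})

encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(toy)

# One input column becomes one output column per distinct value.
print(toy["RoomService"].nunique())  # 4
print(encoded.shape)                 # (5, 4)
```

On a real spending column with many distinct amounts, the number of generated columns grows accordingly.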

When we want to apply certain transformations only to specific columns, we can use a column transformer. Like the pipeline, it takes a list of steps, but each step also names the columns it applies to. In our case we want different steps for different groups of columns, i.e. one set of steps for the numerical features and one set for the categorical features.
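
As a minimal sketch on made-up data (the column names `spend`, `planet` and `age` are illustrative only), the column transformer below scales one column and one-hot encodes another. It also shows a detail that is easy to miss: columns that are not listed in any step are dropped by default, unless `remainder="passthrough"` is set. In the lists defined earlier, Age appears in neither `cat_scores` nor `num_scores`, so with the default setting it would not reach the classifier.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder

# Made-up data; the column names are illustrative only.
toy = pd.DataFrame({"spend": [0.0, 50.0, 100.0, 25.0],
                    "planet": ["Earth", "Europa", "Earth", "Europa"],
                    "age": [39.0, 24.0, 58.0, 33.0]})

steps = [("num", MaxAbsScaler(), ["spend"]),
         ("cat", OneHotEncoder(handle_unknown="ignore"), ["planet"])]

# Default remainder="drop": "age" is not listed, so it is discarded.
out_default = ColumnTransformer(steps).fit_transform(toy)
print(out_default.shape)  # (4, 3): 1 scaled column + 2 one-hot columns

# remainder="passthrough" keeps unlisted columns unchanged.
out_pass = ColumnTransformer(steps, remainder="passthrough").fit_transform(toy)
print(out_pass.shape)     # (4, 4)
```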

To do that we set up two pipelines with the desired steps for each group. We then use those pipelines as the transformers in the column transformer and pass in the lists we created earlier with the names of the numerical and categorical features. Finally we put the column transformer in a last pipeline together with the classifier and pass it to the cross_validate function as we did before.

In [7]:
num_transformer = Pipeline([("num_imputer", KNNImputer()),
                            ("num_scaler", MaxAbsScaler())])

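# Note: `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions.
col_transformer = Pipeline([("encoding", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),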
                           ("imputer", KNNImputer())])

preprocessing = ColumnTransformer([("num_cols", num_transformer, num_scores),
                                  ("cat_cols", col_transformer, cat_scores)])

pipe_rf2 = Pipeline([("preprocessing", preprocessing),
                     ("classifier", RandomForestClassifier(n_jobs=-1))])

results2 = cross_validate(pipe_rf2, df, df_y, scoring=scores)

In [8]:
print(results2["test_accuracy"])
print(results2["test_f1"])

[0.78953422 0.78033353 0.7826337  0.79919448 0.79056387]
[0.79530201 0.78848283 0.77580071 0.79434296 0.79318182]


As we can see, both accuracy and F1 increased in every fold when using the column transformer. We can also see that the dataframe is left unchanged by the pipelines, since the pipeline steps work on copies of the data. This is useful if we want to try different versions of the data in the future.

In [15]:
df.head()

,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0
1,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0
2,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0
3,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0
4,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0
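
The same check can be made programmatically rather than by eye. The sketch below uses a made-up two-column dataframe and a hypothetical `toy_pipe`, and relies on the default `copy=True` of both transformers: the transformed output has no missing values, while the input frame still compares equal to a copy taken beforehand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

# Made-up data with a missing value (illustrative only).
toy = pd.DataFrame({"Spa": [0.0, 549.0, np.nan, 3329.0],
                    "VRDeck": [0.0, 44.0, 49.0, 193.0]})
before = toy.copy(deep=True)

toy_pipe = Pipeline([("impute", KNNImputer(n_neighbors=2)),
                     ("scale", MaxAbsScaler())])
transformed = toy_pipe.fit_transform(toy)

# The output is imputed and scaled, but the input frame still has its NaN.
print(np.isnan(transformed).any())  # False
print(toy.equals(before))           # True
```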
