# The ColumnTransformer

By default, a scikit-learn transformer applies its transformation to every column it is given. To apply different transformations to different columns, use the `ColumnTransformer` meta-estimator. Let's begin by selecting continuous, nominal, and ordinal features.

In [None]:
import pandas as pd
import numpy as np
housing = pd.read_csv('../data/housing_sample.csv')
X = housing[['Neighborhood', 'Exterior1st', 'GrLivArea', 'GarageArea', 'HeatingQC']]
y = housing['SalePrice']
X.head()

### Different types of data needing different transformations

Our input data contains continuous, nominal, and ordinal columns, and each type needs its own transformation. As we saw in the last chapter, passing the five-column dataset above to a transformer such as a `OneHotEncoder` instance transforms every column, including those we did not want transformed. To apply different transformations to different columns, you'll need the `ColumnTransformer`.

### Create a list of 3-item tuples - name, transformer, columns


The `ColumnTransformer` requires that you instantiate it with a list of 3-item tuples. The first value of the tuple is a string called the **name**. This will be used if you refer to the transformer during a grid search. The second value of the tuple is the actual **transformer**. In this example, we will be doing one-hot encoding. The last value in the tuple is the list of **columns** to apply the transformation to. A separate three-item tuple will be created for each group of columns needing to be transformed.

Let's begin by transforming only the nominal features `Neighborhood` and `Exterior1st`. Since we have just one transformation group, we'll create a list containing a single three-item tuple. The **name** of this transformation group is 'nom' (short for nominal). The **transformer** is the `OneHotEncoder` instance `ohe`. The **columns** are a list of the two column names to be transformed (`nom_cols`). Our three-item tuple for this transformation group is `('nom', ohe, nom_cols)`.

After the `OneHotEncoder` transformer is instantiated and the list of columns declared, the list of three-item tuples needed for the `ColumnTransformer` is created.

In [None]:
from sklearn.preprocessing import OneHotEncoder
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2; use that name on newer releases
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
nom_cols = ['Neighborhood', 'Exterior1st']
transformers = [('nom', ohe, nom_cols)]

### Instantiate the `ColumnTransformer`

After creating the list of three-item tuples, we can instantiate the `ColumnTransformer`, which is located in the `compose` module.

In [None]:
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(transformers)

Let's fit and transform our input data and output the first row. Notice that the result is a NumPy array rather than a DataFrame.

In [None]:
X_t = ct.fit_transform(X)
X_t[:1]

### What happens to the other columns?

Only the columns provided in the three-item tuple get transformed. The other columns are dropped by default. Let's output the shape and verify that the number of columns equals the total number of unique values in those two columns.

In [None]:
X_t.shape

In [None]:
X[['Neighborhood', 'Exterior1st']].nunique()
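The same check can be run on a small frame without the CSV file. This is a minimal sketch: the `toy` rows and the `ct_toy` name are hypothetical stand-ins for the housing sample, and `sparse_threshold=0` is used here only to ask the `ColumnTransformer` for a dense array regardless of how the encoder is configured.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows standing in for the housing sample
toy = pd.DataFrame({
    'Neighborhood': ['NAmes', 'CollgCr', 'NAmes', 'OldTown'],
    'Exterior1st': ['VinylSd', 'VinylSd', 'HdBoard', 'Wd Sdng'],
    'GrLivArea': [1710, 1262, 1786, 1717],
    'GarageArea': [548, 460, 608, 642],
    'HeatingQC': ['Ex', 'TA', 'Gd', 'Fa'],
})
nom_cols = ['Neighborhood', 'Exterior1st']

# sparse_threshold=0 makes the ColumnTransformer return a dense array
ct_toy = ColumnTransformer(
    [('nom', OneHotEncoder(handle_unknown='ignore'), nom_cols)],
    sparse_threshold=0,
)
X_toy = ct_toy.fit_transform(toy)

# One output column per unique category; the other three columns are dropped
n_unique = toy[nom_cols].nunique().sum()
print(type(X_toy).__name__, X_toy.shape, n_unique)
```

Each nominal column has three distinct values in this toy frame, so the transformed array has six columns and none of the remaining three input columns survive.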

### Get new feature names

Use the `get_feature_names` method to get the column names of the transformed array. (scikit-learn 1.0 introduced `get_feature_names_out` as its replacement, and `get_feature_names` was removed in 1.2.) The first 10 features are output below. Notice that the name of the transformer, 'nom', is prepended to each feature name and separated from the remainder of the name by two underscores.

In [None]:
ct.get_feature_names()[:10]

### Keep the remaining columns

Alternatively, we can keep the remaining columns unchanged in the result by setting the `remainder` parameter to 'passthrough'. We reinstantiate our `ColumnTransformer` and output the first row of the transformed data with the non-transformed columns kept.

In [None]:
ct = ColumnTransformer(transformers, remainder='passthrough')
X_t = ct.fit_transform(X)
X_t[:1]

In [None]:
X_t.shape

There are three more columns than there were when we encoded just the two string columns. We can look at the last three columns to verify that they have been passed through the transformer without any transformation.

In [None]:
X_t[:5, -3:]
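A small self-contained sketch of the same idea follows. The `toy` frame is hypothetical and keeps only numeric passthrough columns so the comparison stays simple; `sparse_threshold=0` is again used only to get a dense result.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows: one nominal column and two numeric columns
toy = pd.DataFrame({
    'Neighborhood': ['NAmes', 'CollgCr', 'NAmes', 'OldTown'],
    'GrLivArea': [1710, 1262, 1786, 1717],
    'GarageArea': [548, 460, 608, 642],
})

ct_toy = ColumnTransformer(
    [('nom', OneHotEncoder(handle_unknown='ignore'), ['Neighborhood'])],
    remainder='passthrough',
    sparse_threshold=0,  # return a dense array
)
X_toy = ct_toy.fit_transform(toy)

# The encoded columns come first; passed-through columns are appended at the end
passed = X_toy[:, -2:]
unchanged = np.array_equal(passed, toy[['GrLivArea', 'GarageArea']].to_numpy())
print(X_toy.shape, unchanged)
```

The last two columns hold the original `GrLivArea` and `GarageArea` values, which shows that passed-through columns are placed after the transformed ones.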

### Get new column names - NotImplementedError

In the scikit-learn version used here, `get_feature_names` is not implemented when `remainder='passthrough'`, so the call below raises a `NotImplementedError`. Newer releases (1.0 and later) provide `get_feature_names_out`, which also covers passed-through columns.

In [None]:
ct.get_feature_names()

## Add transformation group to scale the continuous features

We can add a new transformation group that scales the continuous features. Let's extend our list of transformers by adding a new three-item tuple to it. We give it the **name** 'con' (short for continuous), use the `StandardScaler` instance `ss` as the **transformer**, and list the continuous **columns** `GrLivArea` and `GarageArea`. One feature, `HeatingQC`, does not appear in either transformation group, so it continues to be dropped. We output the first row and the shape after the transformation.

In [None]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
nom_cols = ['Neighborhood', 'Exterior1st']
con_cols = ['GrLivArea', 'GarageArea']
transformers = [('nom', ohe, nom_cols), ('con', ss, con_cols)]
ct = ColumnTransformer(transformers)
X_t = ct.fit_transform(X)
X_t[:1]

In [None]:
X_t.shape
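To see what the second group contributes, here is a minimal sketch on hypothetical toy rows. It checks that the scaled columns, which come after the one-hot columns because the output follows the order of the transformer list, have mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows standing in for the housing sample
toy = pd.DataFrame({
    'Neighborhood': ['NAmes', 'CollgCr', 'NAmes', 'OldTown'],
    'GrLivArea': [1710, 1262, 1786, 1717],
    'GarageArea': [548, 460, 608, 642],
    'HeatingQC': ['Ex', 'TA', 'Gd', 'Fa'],
})

transformers_toy = [
    ('nom', OneHotEncoder(handle_unknown='ignore'), ['Neighborhood']),
    ('con', StandardScaler(), ['GrLivArea', 'GarageArea']),
]
# sparse_threshold=0 makes the ColumnTransformer return a dense array
ct_toy = ColumnTransformer(transformers_toy, sparse_threshold=0)
X_toy = ct_toy.fit_transform(toy)

# Output order follows the transformer list: 3 one-hot columns, then 2 scaled columns
scaled = X_toy[:, -2:]
print(X_toy.shape)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

`HeatingQC` is absent from the result because it belongs to no group and `remainder` is left at its default of dropping unlisted columns.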

### Add transformation group for ordinal encoding

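Let's add one more column group for the ordinal columns. In this dataset, only `HeatingQC` is ordinal. The `order` list names the quality levels from worst ('Po', poor) to best ('Ex', excellent), so the encoder assigns them the integers 0 through 4. We transform all three column groupings and output the shape.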

In [None]:
from sklearn.preprocessing import OrdinalEncoder
order = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]
oe = OrdinalEncoder(categories=order)  # pass the category order by keyword
ord_cols = ['HeatingQC']
transformers = [('nom', ohe, nom_cols),
                ('con', ss, con_cols),
                ('ord', oe, ord_cols)]
ct = ColumnTransformer(transformers)
X_t = ct.fit_transform(X)
X_t.shape
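The effect of the explicit category order can be checked in isolation. This is a minimal sketch with a hypothetical one-column frame named `quality`; it is not taken from the housing data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Categories listed from worst to best; position in the list becomes the code
order = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]
oe_toy = OrdinalEncoder(categories=order)

# Hypothetical column of heating quality ratings
quality = pd.DataFrame({'HeatingQC': ['Ex', 'TA', 'Gd', 'Fa', 'Po']})
codes = oe_toy.fit_transform(quality).ravel()
print(codes)  # 'Ex' -> 4, 'TA' -> 2, 'Gd' -> 3, 'Fa' -> 1, 'Po' -> 0
```

Without the `categories` argument the encoder would sort the labels alphabetically, which would not reflect the quality ranking.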

The `ColumnTransformer` splits each of the column groupings into its own dataset, applies the particular transformation to each grouping and then combines all the columns back together again.

![][0]

[0]: images/columntransformer_basic.png
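The split, transform, and recombine steps can be sketched by hand and compared with the `ColumnTransformer` result. The `toy` frame and the `*_toy` and `*_part` names are hypothetical; `sparse_threshold=0` only forces a dense result so the two arrays can be compared directly.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical toy rows standing in for the housing sample
toy = pd.DataFrame({
    'Neighborhood': ['NAmes', 'CollgCr', 'NAmes', 'OldTown'],
    'Exterior1st': ['VinylSd', 'VinylSd', 'HdBoard', 'Wd Sdng'],
    'GrLivArea': [1710, 1262, 1786, 1717],
    'GarageArea': [548, 460, 608, 642],
    'HeatingQC': ['Ex', 'TA', 'Gd', 'Fa'],
})
nom_cols = ['Neighborhood', 'Exterior1st']
con_cols = ['GrLivArea', 'GarageArea']
ord_cols = ['HeatingQC']
order = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]

# By hand: transform each column group separately, then stack side by side
nom_part = OneHotEncoder().fit_transform(toy[nom_cols]).toarray()
con_part = StandardScaler().fit_transform(toy[con_cols])
ord_part = OrdinalEncoder(categories=order).fit_transform(toy[ord_cols])
manual = np.hstack([nom_part, con_part, ord_part])

# With the ColumnTransformer: the same three groups in the same order
ct_toy = ColumnTransformer(
    [('nom', OneHotEncoder(), nom_cols),
     ('con', StandardScaler(), con_cols),
     ('ord', OrdinalEncoder(categories=order), ord_cols)],
    sparse_threshold=0,  # return a dense array
)
auto = ct_toy.fit_transform(toy)
same = np.allclose(manual, auto)
print(auto.shape, same)
```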

## Machine learning after transforming

Now that we have successfully transformed each set of columns, we can perform machine learning. To do this, we build a short pipeline where the first step completes the column transformations and the second step does the machine learning. In this case, we use a decision tree as our machine learning model. Remember that you must create a list of two-item tuples (name, estimator) to instantiate the pipeline.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
dtr = DecisionTreeRegressor()
steps = [('ct', ct), ('dtr', dtr)]
pipe = Pipeline(steps)

Because the last step is a machine learning model, this pipeline must be trained using the `fit` method. Calling the `fit` method passes the data into the `ColumnTransformer`, which independently learns how to transform each of the three column groupings. It then transforms each column grouping, returning a single array that is used for training the `DecisionTreeRegressor`.

In [None]:
pipe.fit(X, y);

In [None]:
pipe.predict(X)
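When `predict` is called, the pipeline reuses the transformations learned during `fit`. The sketch below illustrates this on hypothetical toy data (`toy_X`, `toy_y`, and the 'Unseen' neighborhood are made up): with `handle_unknown='ignore'`, a category that was not present during training is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training rows and sale prices
toy_X = pd.DataFrame({
    'Neighborhood': ['NAmes', 'CollgCr', 'NAmes', 'OldTown'],
    'GrLivArea': [1710, 1262, 1786, 1717],
})
toy_y = pd.Series([208500, 181500, 223500, 140000])

ct_toy = ColumnTransformer(
    [('nom', OneHotEncoder(handle_unknown='ignore'), ['Neighborhood']),
     ('con', StandardScaler(), ['GrLivArea'])],
    sparse_threshold=0,  # return a dense array
)
pipe_toy = Pipeline([('ct', ct_toy), ('dtr', DecisionTreeRegressor(random_state=0))])
pipe_toy.fit(toy_X, toy_y)

# A new row whose neighborhood never appeared in the training data
new_row = pd.DataFrame({'Neighborhood': ['Unseen'], 'GrLivArea': [1500]})
encoded = pipe_toy.named_steps['ct'].transform(new_row)
pred = pipe_toy.predict(new_row)
print(encoded)  # the three one-hot columns are all zero for the unseen category
print(pred)
```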

Let's evaluate this pipeline using cross-validation. For a regressor, `cross_val_score` reports R² by default.

In [None]:
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
kf = KFold(n_splits=5, shuffle=True, random_state=123)
cross_val_score(pipe, X, y, cv=kf).mean()

## Create a pipeline within the `ColumnTransformer`

Our first `ColumnTransformer` applied a single transformation to each set of columns. We can take this further and apply any number of transformations to a distinct set of columns by using a pipeline within the `ColumnTransformer`. The diagram below shows this process.

![][1]

Let's begin by creating a pipeline containing two steps (imputation and standardization) for the continuous features.

[1]: images/columntransformer_pipeline.png

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
con_si = SimpleImputer(strategy='mean')
con_ss = StandardScaler()
con_steps = [('si', con_si),('ss', con_ss)]
con_pipe = Pipeline(con_steps)
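A minimal sketch of what this two-step pipeline does to missing values, using a hypothetical frame `con_toy` with one gap per column: the imputer fills each gap with the column mean, and the scaler then maps that mean to 0.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical continuous columns, each with one missing value
con_toy = pd.DataFrame({
    'GrLivArea': [1000.0, np.nan, 2000.0, 1500.0],
    'GarageArea': [400.0, 500.0, np.nan, 600.0],
})

con_pipe_toy = Pipeline([
    ('si', SimpleImputer(strategy='mean')),  # step 1: fill gaps with the column mean
    ('ss', StandardScaler()),                # step 2: standardize each column
])
out = con_pipe_toy.fit_transform(con_toy)

# The imputed entries equal the column mean, so they are scaled to 0
print(np.isnan(out).any())
print(out[1, 0], out[2, 1])
```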

We also create two-step pipelines (imputation and encoding) for the categorical columns, one for the nominal columns and one for the ordinal column.

In [None]:
nom_si = SimpleImputer(strategy='most_frequent')
nom_ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
nom_steps = [('si', nom_si), ('ohe', nom_ohe)]
nom_pipe = Pipeline(nom_steps)

ord_si = SimpleImputer(strategy='most_frequent')
ord_oe = OrdinalEncoder(categories=order)  # pass the category order by keyword
ord_steps = [('si', ord_si), ('oe', ord_oe)]
ord_pipe = Pipeline(ord_steps)

Finally, we prepare a column transformer to pass the two continuous features through the continuous pipeline, the two nominal features through the nominal pipeline, and the ordinal feature through the ordinal pipeline.

In [None]:
transformers = [('con', con_pipe, con_cols),
                ('nom', nom_pipe, nom_cols),
                ('ord', ord_pipe, ord_cols)]
ct = ColumnTransformer(transformers)

We can now transform all of the columns in our dataset at once. The first row is output below.

In [None]:
X_t = ct.fit_transform(X)
X_t[:1]

In [None]:
X_t.shape

### Create a final pipeline to do machine learning

Now that we can apply separate transformations to separate groups of columns, we can pass this result to a machine learning model. To connect the ColumnTransformer to the machine learning estimator, we use a pipeline. We use a random forest with 100 trees and a max depth of 5 as our model.

![][2]

[2]: images/columntransformer_pipeline_ml.png

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=100, max_depth=5)
steps = [('ct', ct), ('rfr', rfr)]
pipe_final = Pipeline(steps)

Finally, we can get cross-validated scores.

In [None]:
cross_val_score(pipe_final, X, y, cv=kf).mean()

### All in one cell

It might be helpful to see all the steps used to set up the final pipeline in one place, without the imports.

In [None]:
# continuous pipeline
con_si = SimpleImputer(strategy='mean')
con_ss = StandardScaler()
con_steps = [('si', con_si), ('ss', con_ss)]
con_pipe = Pipeline(con_steps)

# nominal pipeline
nom_si = SimpleImputer(strategy='most_frequent')
nom_ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
nom_steps = [('si', nom_si), ('ohe', nom_ohe)]
nom_pipe = Pipeline(nom_steps)

# ordinal pipeline
ord_si = SimpleImputer(strategy='most_frequent')
ord_oe = OrdinalEncoder(categories=order)  # pass the category order by keyword
ord_steps = [('si', ord_si), ('oe', ord_oe)]
ord_pipe = Pipeline(ord_steps)

# ColumnTransformer setup
nom_cols = ['Neighborhood', 'Exterior1st']
con_cols = ['GrLivArea', 'GarageArea']
ord_cols = ['HeatingQC']

transformers = [('con', con_pipe, con_cols),
                ('nom', nom_pipe, nom_cols),
                ('ord', ord_pipe, ord_cols)]
ct = ColumnTransformer(transformers)

# Final pipeline
rfr = RandomForestRegressor(n_estimators=100, max_depth=5)
steps = [('ct', ct), ('rfr', rfr)]
pipe_final = Pipeline(steps)

## Grid searching the final pipeline

Tuning the random forest within this pipeline requires us to refer to the hyperparameter by preceding it with the name of the step ('rfr') followed by two underscores.

Tuning the hyperparameters of the transformers takes more work. For instance, to tune the `strategy` of the `SimpleImputer` for the continuous columns, we have to go through the `ColumnTransformer` ('ct') and the continuous pipeline ('con') before finally reaching the `SimpleImputer` ('si'), joining each name with two underscores to get `ct__con__si__strategy`. Below, we search three random forest hyperparameters and the strategy of continuous imputation.

In [None]:
grid = {'rfr__max_depth': range(4, 10),
        'rfr__min_samples_leaf': [10, 20],
        'rfr__max_features': [.5, .7],
        'ct__con__si__strategy': ['mean', 'median']}
gs = GridSearchCV(pipe_final, grid, cv=kf, n_jobs=-1)
gs.fit(X, y);
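If a nested parameter name is hard to work out, `get_params` lists every name the grid search will accept. The sketch below rebuilds a reduced version of the pipeline under hypothetical `*_toy` names (only the continuous group is included) and needs no data, because parameter names are available before fitting.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Reduced, hypothetical version of the final pipeline (continuous group only)
con_pipe_toy = Pipeline([('si', SimpleImputer(strategy='mean')),
                         ('ss', StandardScaler())])
ct_toy = ColumnTransformer([('con', con_pipe_toy, ['GrLivArea', 'GarageArea'])])
pipe_toy = Pipeline([('ct', ct_toy),
                     ('rfr', RandomForestRegressor(n_estimators=100))])

# Every key returned here is a valid name for a grid search dictionary
param_names = pipe_toy.get_params().keys()
imputer_params = sorted(p for p in param_names if p.startswith('ct__con__si__'))
print(imputer_params)
print('rfr__max_depth' in param_names)
```

A misspelled key in the grid makes `GridSearchCV` raise an error when fitting, so checking names this way first can save a long run.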

In [None]:
gs.best_params_

In [None]:
gs.best_score_

## Exercises

### Exercise 1

<span style="color:green; font-size:16px">Create different pipelines and grid search them to find the best parameters.</span>