# 6. The ColumnTransformer

The `ColumnTransformer` is a meta-estimator introduced in scikit-learn version 0.20. It greatly eases the transition from pandas to scikit-learn. The `ColumnTransformer` allows us to:

* Apply different transformations to different columns of data
* Use a pandas DataFrame as input data
* Create a pipeline connecting all of our steps

## The previous workflow
Before the `ColumnTransformer` existed, there was no direct way to apply different transformations to different columns of data. This was a major gap, as string columns are processed differently than numeric columns. One of the most common and frustrating scenarios was having a pandas DataFrame with both string and numeric columns and wanting to do machine learning on it in scikit-learn.

### Common workarounds
Because scikit-learn offered no way to prepare a pandas DataFrame with a mix of string and numeric columns for machine learning, a number of workarounds were built.

* The pandas `get_dummies` function - One-hot encodes string columns (and ignores the numeric columns), but it is not an estimator, does not integrate with a pipeline, and cannot be used on unseen data to produce the same encoding.
* scikit-learn's `MultiLabelBinarizer` - This does one-hot encode strings but was originally built for target variables and not input data. Importantly, you cannot apply it to just a subset of the data.
* The [sklearn_pandas][1] library - The integration between pandas and scikit-learn was poor enough that an entirely new package was built. It does provide an alternative solution to the integration problems, but the `ColumnTransformer` now solves these issues within scikit-learn itself.

The `ColumnTransformer`, paired with the upgraded `OneHotEncoder`, makes the transition from pandas much easier and more robust, and gives us a single obvious path forward.
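
To make the `get_dummies` limitation concrete, here is a minimal sketch on made-up data (the two small DataFrames are illustrative and are not the housing data used in this chapter). It shows that `get_dummies` derives its columns from whatever data it is handed, while `OneHotEncoder` learns the categories once during `fit` and reuses them in `transform`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the "new" data happens to lack the 'NAmes' category
train = pd.DataFrame({'Neighborhood': ['NAmes', 'Gilbert', 'NAmes']})
new = pd.DataFrame({'Neighborhood': ['Gilbert', 'Gilbert']})

# get_dummies builds its columns from the data passed in, so the widths differ
train_dummies = pd.get_dummies(train)
new_dummies = pd.get_dummies(new)
print(train_dummies.shape[1], new_dummies.shape[1])

# OneHotEncoder remembers the categories seen in fit and applies them in transform
ohe = OneHotEncoder()
ohe.fit(train)
new_encoded = ohe.transform(new).toarray()
print(new_encoded.shape)
```

The encoded new data keeps one column per category learned from the training data, which is what a downstream model fitted on the training encoding expects.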

[1]: https://github.com/scikit-learn-contrib/sklearn-pandas

## Processing string and numeric columns separately
String columns are processed differently than numeric columns and need separate pipelines to handle the transformations for each. For instance, we might want to impute missing values with a constant for string columns and the mean for numeric.

In [None]:
import pandas as pd
hs = pd.read_csv('data/housing_sample.csv')
hs.head()

In [None]:
hs.isna().sum()

In [None]:
y = hs.pop('SalePrice').values

## Using the `ColumnTransformer`
The major purpose of the `ColumnTransformer` is to apply a transformation to a specific subset of the columns. Other transformers, by contrast, are applied to the entire input data. Let's say we only want to fill in missing values for the string columns.

To get started, we must create a list of three-item tuples. The tuple items are the name of the transformer (as a string), the instantiated transformer, and the list of columns to apply the transformation to. The `ColumnTransformer` is built to be used with pandas DataFrames, so you can pass your DataFrame directly to it.

Below, we create a list with a single three-item tuple to impute missing values in the 'Neighborhood' and 'Exterior1st' string columns.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy='constant', fill_value='MISSING')
string_cols = ['Neighborhood', 'Exterior1st']
transformers = [('impute', si, string_cols)]

We can now instantiate the `ColumnTransformer` with our list of transformations and then fit and return the transformed data.

In [None]:
ct = ColumnTransformer(transformers)
X_filled = ct.fit_transform(hs)
X_filled

### Columns dropped - numpy array returned
By default, all columns not listed in the transformers are dropped. Also, the result is a numpy array, so we are no longer working in pandas.
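
If you would rather keep working with a DataFrame, one option is to wrap the returned array yourself. The sketch below uses a made-up three-row DataFrame (not the housing data) and relies on the fact that this imputer returns the same columns in the same order as it received them.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing values in the string columns
df = pd.DataFrame({'Neighborhood': ['NAmes', np.nan, 'Gilbert'],
                   'Exterior1st': ['VinylSd', 'Wd Sdng', np.nan],
                   'GrLivArea': [1500, 1200, 1800]})

string_cols = ['Neighborhood', 'Exterior1st']
si = SimpleImputer(strategy='constant', fill_value='MISSING')
ct = ColumnTransformer([('impute', si, string_cols)])

# GrLivArea is not listed, so it is dropped; the result is a numpy array
X_filled = ct.fit_transform(df)
print(type(X_filled), X_filled.shape)

# Rebuild a DataFrame, reusing the column names and the original index
X_df = pd.DataFrame(X_filled, columns=string_cols, index=df.index)
print(X_df)
```

Reusing the input column names only works here because the imputer neither adds nor removes columns. A transformer such as `OneHotEncoder` changes the number of columns, so this shortcut would not apply to it.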

### Keep the remaining columns
The `remainder` parameter controls what happens to the columns not specified in the list of transformers. It defaults to the string 'drop'. Setting it to 'passthrough' keeps the other columns in the returned array.

In [None]:
ct = ColumnTransformer(transformers, remainder='passthrough')
ct.fit_transform(hs)[:5]
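
One detail worth knowing is where the passed-through columns end up. The sketch below, on a made-up two-column DataFrame, suggests that the transformed columns come first and the remainder columns are appended after them, regardless of their order in the input.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data: the numeric column comes first in the DataFrame
df = pd.DataFrame({'GrLivArea': [1500, 1200, 1800],
                   'Neighborhood': ['NAmes', np.nan, 'Gilbert']})

si = SimpleImputer(strategy='constant', fill_value='MISSING')
ct = ColumnTransformer([('impute', si, ['Neighborhood'])],
                       remainder='passthrough')
X = ct.fit_transform(df)

# The imputed 'Neighborhood' column is now first, followed by 'GrLivArea'
print(X)
```

Because the column order can differ from the original DataFrame, it is safer not to assume positions in the output match positions in the input.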

### Imputing the numeric columns with the mean
The `ColumnTransformer` allows us to use different imputation strategies for different sets of columns. Below, we instantiate a second imputer and use it for the numeric columns. It's not necessary to use 'passthrough' as all the columns are named in the transformers.

In [None]:
string_si = SimpleImputer(strategy='constant', fill_value='MISSING')
numeric_si = SimpleImputer(strategy='mean')

string_cols = ['Neighborhood', 'Exterior1st']
numeric_cols = ['YearBuilt', 'LotFrontage', 'GrLivArea', 'GarageArea']

transformers = [('impute_string', string_si, string_cols),
                ('impute_numeric', numeric_si, numeric_cols)]

ct = ColumnTransformer(transformers)
ct.fit_transform(hs)[:5]
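
After fitting, each transformer can be retrieved by the name given in its tuple through the `named_transformers_` attribute. The sketch below uses made-up data (not the housing sample) to check which means the numeric imputer learned.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
df = pd.DataFrame({'Neighborhood': ['NAmes', np.nan, 'Gilbert', 'NAmes'],
                   'LotFrontage': [60.0, np.nan, 80.0, 70.0],
                   'GarageArea': [400.0, 500.0, np.nan, 600.0]})

transformers = [
    ('impute_string',
     SimpleImputer(strategy='constant', fill_value='MISSING'),
     ['Neighborhood']),
    ('impute_numeric',
     SimpleImputer(strategy='mean'),
     ['LotFrontage', 'GarageArea']),
]
ct = ColumnTransformer(transformers)
ct.fit(df)

# statistics_ holds the fill value learned for each numeric column
means = ct.named_transformers_['impute_numeric'].statistics_
print(means)
```

Inspecting the fitted transformers this way is a quick sanity check that each imputer was applied to the columns you intended.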

## Build separate pipelines
If we want to apply multiple transformations to different subsets of the data, we need to build a pipeline for each subset. Below, we impute missing values and one-hot encode the string columns and, separately, impute missing values and standardize the numeric columns.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# string pipeline
string_si = SimpleImputer(strategy='constant', fill_value='MISSING')
ohe = OneHotEncoder(sparse=False)  # scikit-learn >= 1.2 renames this parameter to sparse_output
steps = [('impute', string_si), ('encode', ohe)]
string_pipe = Pipeline(steps)

# numeric pipeline
numeric_si = SimpleImputer(strategy='mean')
ss = StandardScaler()
steps = [('si', numeric_si), ('standardize', ss)]
numeric_pipe = Pipeline(steps)

# columns
string_cols = ['Neighborhood', 'Exterior1st']
numeric_cols = ['YearBuilt', 'LotFrontage', 'GrLivArea', 'GarageArea']

transformers = [('string', string_pipe, string_cols), 
                ('numeric', numeric_pipe, numeric_cols)]

ct = ColumnTransformer(transformers)
X_transformed = ct.fit_transform(hs)
X_transformed
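
A fitted encoder may later be applied to rows it was not fitted on, as happens in every fold of cross-validation. With its default settings, `OneHotEncoder` raises an error when it meets a category it did not see during `fit`. The sketch below, on made-up data, shows that behavior and one possible remedy, the `handle_unknown='ignore'` option. Whether ignoring unknown categories is appropriate depends on your data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 'StoneBr' never appears in the training rows
train = pd.DataFrame({'Neighborhood': ['NAmes', 'Gilbert', 'NAmes']})
new = pd.DataFrame({'Neighborhood': ['Gilbert', 'StoneBr']})

# Default behavior: an unseen category raises a ValueError at transform time
strict = OneHotEncoder()
strict.fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError:
    raised = True
print('default encoder raised an error:', raised)

# With handle_unknown='ignore', the unseen category becomes a row of zeros
lenient = OneHotEncoder(handle_unknown='ignore')
lenient.fit(train)
encoded = lenient.transform(new).toarray()
print(encoded)
```

If cross-validation fails with an unknown-category error, passing `handle_unknown='ignore'` to the encoder in the string pipeline is one thing to try.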

## Adding machine learning
We can create one last pipeline to do machine learning on the final transformed output.

In [None]:
import warnings
warnings.filterwarnings('ignore')

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, KFold
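# mymetrics is a local helper module, not part of scikit-learn;
# the imported object must be usable as a `scoring` callable
from mymetrics import root_mean_squared_log_error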

rfr = RandomForestRegressor()
steps = [('transformers', ct), ('rfr', rfr)]
final_pipe = Pipeline(steps)

kf = KFold(n_splits=5, shuffle=True)
cross_val_score(final_pipe, hs, y, cv=kf, scoring=root_mean_squared_log_error)
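
The `root_mean_squared_log_error` object imported from `mymetrics` is not shown in this chapter. As a rough sketch of what such a scorer might look like, the function below assumes scikit-learn's callable scorer signature `(estimator, X, y)`. The name `neg_rmsle_scorer` and the toy data are hypothetical. It returns the negated error because scikit-learn's model selection tools treat larger scores as better.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import cross_val_score


def neg_rmsle_scorer(estimator, X, y):
    """Hypothetical scorer: negated root mean squared log error."""
    y_pred = estimator.predict(X)
    # mean_squared_log_error requires non-negative targets and predictions
    return -np.sqrt(mean_squared_log_error(y, y_pred))


# Toy data and a baseline model, only to show the scorer being called
X = np.arange(10).reshape(-1, 1)
y = np.arange(1, 11, dtype=float)
scores = cross_val_score(DummyRegressor(), X, y, cv=5,
                         scoring=neg_rmsle_scorer)
print(scores)
```

If the scorer in `mymetrics` returns the positive error instead, keep in mind that `GridSearchCV` would then rank the worst-performing parameter combination as the best.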

## Grid searching a pipeline of transformers
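To reach a parameter of a specific estimator nested inside this pipeline of transformers, chain the names of each step from the outside in, separating them with two underscores, and end with the parameter name. For example, `transformers__numeric__si__strategy` refers to the `strategy` parameter of the step named `si`, inside the pipeline named `numeric`, inside the step named `transformers`.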

In [None]:
from sklearn.model_selection import GridSearchCV
grid = {'transformers__numeric__si__strategy': ['mean', 'median'],
        'rfr__n_estimators': [50, 100],
        'rfr__max_depth': range(2, 6)}
gs = GridSearchCV(final_pipe, grid, cv=kf, scoring=root_mean_squared_log_error)
gs.fit(hs, y)
gs.best_params_
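
If you are unsure of the exact name of a nested parameter, `get_params` lists every name the pipeline accepts. The sketch below rebuilds a reduced version of the pipeline (numeric branch only, with a single illustrative column) so that it runs on its own, then looks up and sets the imputation strategy by its full name.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Reduced version of the pipeline above, using the same step names
numeric_pipe = Pipeline([('si', SimpleImputer(strategy='mean')),
                         ('standardize', StandardScaler())])
ct = ColumnTransformer([('numeric', numeric_pipe, ['LotFrontage'])])
final_pipe = Pipeline([('transformers', ct),
                       ('rfr', RandomForestRegressor())])

# Every key returned here is a valid key for a GridSearchCV parameter grid
param_names = sorted(final_pipe.get_params().keys())
print([name for name in param_names if name.endswith('strategy')])

# The same names work with set_params, which updates the nested estimator
final_pipe.set_params(transformers__numeric__si__strategy='median')
print(numeric_pipe.named_steps['si'].strategy)
```

Printing the full `param_names` list is a quick way to catch a misspelled step name before launching a long grid search.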

## Summary of Commands

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from mymetrics import root_mean_squared_log_error

# string pipeline
string_si = SimpleImputer(strategy='constant', fill_value='MISSING')
ohe = OneHotEncoder(sparse=False)  # scikit-learn >= 1.2 renames this parameter to sparse_output
steps = [('impute', string_si), ('encode', ohe)]
string_pipe = Pipeline(steps)

# numeric pipeline
numeric_si = SimpleImputer(strategy='mean')
ss = StandardScaler()
steps = [('si', numeric_si), ('standardize', ss)]
numeric_pipe = Pipeline(steps)

# columns
string_cols = ['Neighborhood', 'Exterior1st']
numeric_cols = ['YearBuilt', 'LotFrontage', 'GrLivArea', 'GarageArea']

transformers = [('string', string_pipe, string_cols), 
                ('numeric', numeric_pipe, numeric_cols)]

ct = ColumnTransformer(transformers)
rfr = RandomForestRegressor()
steps = [('transformers', ct), ('rfr', rfr)]
final_pipe = Pipeline(steps)

kf = KFold(n_splits=5, shuffle=True)
grid = {'transformers__numeric__si__strategy': ['mean', 'median'],
        'rfr__n_estimators': [50, 100],
        'rfr__max_depth': range(2, 6)}
gs = GridSearchCV(final_pipe, grid, cv=kf, scoring=root_mean_squared_log_error)
gs.fit(hs, y)
gs.best_params_

## Exercise
Use the `ColumnTransformer` to build separate pipelines for string and numeric columns. Build a final pipeline that adds machine learning as the last step.