---
title: "Applying transformers to columns"
format:
    revealjs:
        slide-number: true
        toc: true
        code-fold: false
        code-tools: true

---

## Introduction
Often, transformers need to be applied only to a subset of columns, rather than 
the entire dataframe. 

As an example, it does not make sense to apply a `StandardScaler` to a column 
that contains strings, and indeed doing so would raise an exception. 
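
The failure described above can be reproduced with a minimal sketch. The dataframe is the same toy example used in this section; catching both `ValueError` and `TypeError` is a defensive assumption, since the exact exception type may depend on the installed versions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"text": ["foo", "bar", "baz"], "number": [1, 2, 3]})

# Scaling the whole dataframe fails because "text" cannot be cast to float.
try:
    StandardScaler().fit_transform(df)
    failed = False
except (ValueError, TypeError) as exc:
    failed = True
    print(f"{type(exc).__name__}: {exc}")

# Scaling only the numeric column works as expected.
scaled = StandardScaler().fit_transform(df[["number"]])
print(scaled.ravel())
```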

Scikit-learn provides the `ColumnTransformer` to deal with this: 

In [7]:
#| echo: true
import pandas as pd
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({"text": ["foo", "bar", "baz"], "number": [1, 2, 3]})

categorical_columns = selector(dtype_include=object)(df)
numerical_columns = selector(dtype_exclude=object)(df)

ct = make_column_transformer(
      (StandardScaler(),
       numerical_columns),
      (OneHotEncoder(handle_unknown="ignore"),
       categorical_columns))
transformed = ct.fit_transform(df)
transformed

array([[-1.22474487,  0.        ,  0.        ,  1.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ],
       [ 1.22474487,  0.        ,  1.        ,  0.        ]])

`make_column_selector` selects columns based on their datatype, or by matching
column names against a regex. In some cases, this degree of control is 
not sufficient. 
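
As a small illustration of the two selection criteria, the sketch below uses a made-up dataframe (the column names are assumptions chosen for the example) and selects columns by name pattern, by dtype, and by both at once.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Illustrative data; the column names are hypothetical.
example = pd.DataFrame({
    "metric_0": [0.1, 0.2],
    "metric_1": [1.0, 2.0],
    "age": [30, 40],
    "sex": ["F", "M"],
})

# Select by regex on the column name.
by_name = make_column_selector(pattern="^metric_")(example)
# Select by dtype.
by_dtype = make_column_selector(dtype_include=np.number)(example)
# Both criteria must hold when they are combined.
both = make_column_selector(pattern="^metric_", dtype_include=np.number)(example)

print(by_name, by_dtype, both)
```

Each call returns a plain list of column names. Criteria can only be combined with a logical "and", which is one of the limits that more expressive selectors aim to lift.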

To address such situations, skrub implements several transformers that modify
columns from within scikit-learn pipelines. Additionally, the selectors API
makes it possible to build powerful, custom column selection filters. 

`SelectCols` and `DropCols` are transformers that can be used as part of a 
pipeline to filter columns according to the selectors API, while `ApplyToCols` and
`ApplyToFrame` replicate the `ColumnTransformer` behavior with a different syntax
and access to the selectors. 

## `ApplyToCols` and `ApplyToFrame`

### Applying a transformer to separate columns: `ApplyToCols`
In many cases, `ApplyToCols` can be a direct replacement for the `ColumnTransformer`,
as in the following example:

In [8]:
#| echo: true
import skrub.selectors as s
from sklearn.pipeline import make_pipeline
from skrub import ApplyToCols

numeric = ApplyToCols(StandardScaler(), cols=s.numeric())
string = ApplyToCols(OneHotEncoder(sparse_output=False), cols=s.string())

transformed = make_pipeline(numeric, string).fit_transform(df)
transformed

Unnamed: 0,text_bar,text_baz,text_foo,number
0,0.0,0.0,1.0,-1.224745
1,1.0,0.0,0.0,0.0
2,0.0,1.0,0.0,1.224745


In this case, we are applying the `StandardScaler` only to numeric features using 
`s.numeric()`, and `OneHotEncoder` with `s.string()`. 

Under the hood, `ApplyToCols` selects all columns that satisfy the condition specified
in `cols` (in this case, that the dtype is numeric), then clones and applies the
specified transformer (`StandardScaler`) to each column _separately_. 

::: {.callout-important}
Columns that are not selected are passed through without any change, thus string
columns are not touched by the `numeric` transformer. 
:::

Because unselected columns are passed through unchanged, several `ApplyToCols`
steps can be chained by putting them in a scikit-learn pipeline. 
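
The clone-and-apply logic can be illustrated with a rough sketch in plain pandas and scikit-learn. This is not skrub's implementation: `apply_per_column` is a hypothetical helper written for this example, and it only handles transformers that return a single output column.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.preprocessing import StandardScaler


def apply_per_column(transformer, frame, columns):
    """Hypothetical helper: fit a fresh clone of `transformer` on each column."""
    out = frame.copy()  # unselected columns are kept as they are
    fitted = {}
    for name in columns:
        est = clone(transformer)  # one independent estimator per column
        out[name] = est.fit_transform(frame[[name]]).ravel()
        fitted[name] = est
    return out, fitted


df = pd.DataFrame({
    "text": ["foo", "bar", "baz"],
    "number": [1, 2, 3],
    "other": [10.0, 20.0, 60.0],
})
scaled, fitted = apply_per_column(StandardScaler(), df, ["number", "other"])
print(scaled)
```

Each selected column gets its own fitted estimator, while the `text` column is returned untouched.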

### Applying the same transformer to multiple columns at once: `ApplyToFrame`
In some cases, it may be beneficial to apply a single transformer to a subset of 
columns all at once, rather than to each column separately. 

This example dataframe contains some patient information, and some (random) 
metrics. 

In [9]:
import pandas as pd
import numpy as np

n_patients = 20
np.random.seed(42)
df = pd.DataFrame({
    "patient_id": [f"P{i:03d}" for i in range(n_patients)],
    "age": np.random.randint(18, 80, size=n_patients),
    "sex": np.random.choice(["M", "F"], size=n_patients),
})

for i in range(5):
    df[f"metric_{i}"] = np.random.normal(loc=50, scale=10, size=n_patients)

df["diagnosis"] = np.random.choice(["A", "B", "C"], size=n_patients)
df.head()

Unnamed: 0,patient_id,age,sex,metric_0,metric_1,metric_2,metric_3,metric_4,diagnosis
0,P000,56,F,39.871689,52.088636,41.607825,50.870471,52.961203,B
1,P001,69,M,53.142473,30.403299,46.907876,47.009926,52.610553,A
2,P002,46,F,40.919759,36.71814,53.312634,50.917608,50.051135,B
3,P003,32,F,35.876963,51.968612,59.755451,30.124311,47.654129,B
4,P004,60,F,64.656488,57.384666,45.208258,47.803281,35.846293,C


With `ApplyToFrame`, it is easy to apply a decomposition algorithm such as `PCA` 
to condense the `metric_*` columns into a smaller number of features: 

In [10]:
from skrub import ApplyToFrame
from sklearn.decomposition import PCA

reduce = ApplyToFrame(PCA(n_components=2), cols=s.glob("metric_*"))

df_reduced = reduce.fit_transform(df)
df_reduced.head()

Unnamed: 0,patient_id,age,sex,diagnosis,pca0,pca1
0,P000,56,F,B,-2.647377,7.025046
1,P001,69,M,A,-2.480564,-11.246997
2,P002,46,F,B,4.27484,-5.039065
3,P003,32,F,B,14.116747,15.620615
4,P004,60,F,C,-19.073862,1.186541


### The `allow_reject` parameter
When `ApplyToCols` or `ApplyToFrame` wraps a skrub transformer, the
`allow_reject` parameter gives more flexibility. With `allow_reject=True`,
columns that the transformer cannot handle are passed through unchanged
instead of raising an exception. 

Consider this example. By default, `ToDatetime` raises a `RejectColumn` exception
when it finds a column it cannot convert to datetime. 

In [13]:
from skrub import ToDatetime
df = pd.DataFrame({
    "date": ["03 January 2023", "04 February 2023", "05 March 2023"],
    "values": [10, 20, 30]
})
df

Unnamed: 0,date,values
0,03 January 2023,10
1,04 February 2023,20
2,05 March 2023,30


By setting `allow_reject=True`, the datetime column is converted properly and 
the other column is passed through without issues. 

In [14]:
with_reject = ApplyToCols(ToDatetime(), allow_reject=True)
with_reject.fit_transform(df)

Unnamed: 0,date,values
0,2023-01-03,10
1,2023-02-04,20
2,2023-03-05,30


## Selection operations in a scikit-learn pipeline
In some situations, it may be necessary to select or remove specific columns from
a dataframe: unlike `ApplyToCols` and `ApplyToFrame`, this means removing some 
features from the original table. This can be done with `SelectCols` and `DropCols`, 
which work as their name suggests, and can take a `cols` parameter to choose
which columns to select or drop respectively.

In [15]:
from skrub import ToDatetime
df = pd.DataFrame({
    "date": ["03 January 2023", "04 February 2023", "05 March 2023"],
    "values": [10, 20, 30]
})
df

Unnamed: 0,date,values
0,03 January 2023,10
1,04 February 2023,20
2,05 March 2023,30


We can selectively choose or drop columns based on names, or more complex rules 
(see the next chapter).

In [16]:
from skrub import SelectCols
SelectCols("date").fit_transform(df)

Unnamed: 0,date
0,03 January 2023
1,04 February 2023
2,05 March 2023


In [17]:
from skrub import DropCols
DropCols("date").fit_transform(df)

Unnamed: 0,values
0,10
1,20
2,30


## Concatenating the skrub column transformers
Skrub column transformers can be concatenated by using scikit-learn pipelines.
In the following example, we first select only the column `patient_id`, then encode
it using `OneHotEncoder` and finally use `PCA` to reduce the number of dimensions.

This is done by wrapping the latter two steps in `ApplyToCols` and `ApplyToFrame` 
respectively, and then putting all transformers in order in a scikit-learn pipeline
using `make_pipeline`. 

In [18]:
from sklearn.pipeline import make_pipeline
from skrub import SelectCols

df = pd.DataFrame({
    "patient_id": [f"P{i:03d}" for i in range(n_patients)],
    "age": np.random.randint(18, 80, size=n_patients),
    "sex": np.random.choice(["M", "F"], size=n_patients),
})

select = SelectCols("patient_id")
encode = ApplyToCols(OneHotEncoder(sparse_output=False))
reduce = ApplyToFrame(PCA(n_components=2))

transform = make_pipeline(select, encode, reduce)
transform.fit_transform(df)

Unnamed: 0,pca0,pca1
0,1.451188e-17,9.393890000000001e-18
1,-0.02405452,0.9397337
2,-0.2305851,0.009374222
3,-0.05287468,0.009374222
4,-0.07954573,-0.006807817
5,-0.1389234,-0.006807817
6,0.03552354,-0.006807817
7,-0.1348056,-0.006807817
8,-0.08659518,-0.006807817
9,0.432752,-0.006807817


### The order of column transformations is important
Some care must be taken when concatenating column transformers, in particular
when selection is done on datatypes. Consider this case:

In [19]:
encode = ApplyToCols(OneHotEncoder(sparse_output=False), cols=s.string())
scale = ApplyToCols(StandardScaler(), cols=s.numeric())

We build two pipelines from these steps: the first encodes and then scales,
while the second scales first and then encodes. 

In [20]:
transform_1 = make_pipeline(encode, scale)
transform_1.fit_transform(df)

Unnamed: 0,patient_id_P000,patient_id_P001,patient_id_P002,patient_id_P003,patient_id_P004,patient_id_P005,patient_id_P006,patient_id_P007,patient_id_P008,patient_id_P009,...,patient_id_P013,patient_id_P014,patient_id_P015,patient_id_P016,patient_id_P017,patient_id_P018,patient_id_P019,age,sex_F,sex_M
0,4.358899,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,...,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-1.30157,0.904534,-0.904534
1,-0.229416,4.358899,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,...,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.709947,0.904534,-0.904534
2,-0.229416,-0.229416,4.358899,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,...,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,0.059162,-1.105542,1.105542
3,-0.229416,-0.229416,-0.229416,4.358899,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,...,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-1.479057,-1.105542,1.105542
4,-0.229416,-0.229416,-0.229416,-0.229416,4.358899,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,...,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.473298,0.904534,-0.904534
5,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,4.358899,-0.229416,-0.229416,-0.229416,-0.229416,...,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,1.893193,-1.105542,1.105542
6,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,4.358899,-0.229416,-0.229416,-0.229416,...,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,1.834031,-1.105542,1.105542
7,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,4.358899,-0.229416,-0.229416,...,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,0.473298,0.904534,-0.904534
8,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,4.358899,-0.229416,...,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.53246,-1.105542,1.105542
9,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,4.358899,...,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.229416,-0.118325,-1.105542,1.105542


In [21]:
transform_2 = make_pipeline(scale, encode)
transform_2.fit_transform(df)

Unnamed: 0,patient_id_P000,patient_id_P001,patient_id_P002,patient_id_P003,patient_id_P004,patient_id_P005,patient_id_P006,patient_id_P007,patient_id_P008,patient_id_P009,...,patient_id_P013,patient_id_P014,patient_id_P015,patient_id_P016,patient_id_P017,patient_id_P018,patient_id_P019,age,sex_F,sex_M
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.30157,1.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.709947,1.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.059162,0.0,1.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.479057,0.0,1.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.473298,1.0,0.0
5,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.893193,0.0,1.0
6,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.834031,0.0,1.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.473298,1.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.53246,0.0,1.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.118325,0.0,1.0


In `transform_1`, the features generated by the `OneHotEncoder` are subsequently
scaled by the `StandardScaler`: the new features are numeric, so `s.numeric()`
selects them in the next step. In many cases, this behavior is not desired;
`transform_2` avoids it by scaling before encoding. 