---
title: "Applying transformers to columns"
format:
    revealjs:
        slide-number: true
        toc: true
        code-fold: false
        code-tools: true

---

## Introduction
Often, transformers need to be applied only to a subset of columns, rather than 
the entire dataframe. 

As an example, it does not make sense to apply a `StandardScaler` to a column 
that contains strings, and indeed doing so would raise an exception. 

Scikit-learn provides the `ColumnTransformer` to deal with this: 

In [None]:
#| echo: true
import pandas as pd
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({"text": ["foo", "bar", "baz"], "number": [1, 2, 3]})

categorical_columns = selector(dtype_include=object)(df)
numerical_columns = selector(dtype_exclude=object)(df)

# Scale numeric columns and one-hot encode the string columns.
ct = make_column_transformer(
    (StandardScaler(), numerical_columns),
    (OneHotEncoder(handle_unknown="ignore"), categorical_columns),
)
transformed = ct.fit_transform(df)
transformed

`make_column_selector` selects columns based on their datatype, or by matching 
column names against a regular expression. In some cases, this degree of control 
is not sufficient. 
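
For reference, here is a minimal sketch of the two selection modes that `make_column_selector` offers. The dataframe `df_demo` and its column names are made up for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy dataframe, used only for this illustration.
df_demo = pd.DataFrame(
    {"metric_a": [1.0, 2.0], "metric_b": [3.0, 4.0], "label": ["x", "y"]}
)

# Select by column name, using a regular expression.
by_name = make_column_selector(pattern=r"^metric_")(df_demo)

# Select by datatype ("number" covers integer and float columns).
by_dtype = make_column_selector(dtype_include="number")(df_demo)

print(by_name)
print(by_dtype)
```

Both calls return a plain list of column names, which is what `make_column_transformer` expects next to each transformer.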

To address such situations, skrub implements several transformers that modify 
columns from within scikit-learn pipelines. Additionally, the selectors API makes
it possible to build powerful, custom-made column selection filters. 

`SelectCols` and `DropCols` are transformers that can be used as part of a 
pipeline to filter columns according to the selectors API, while `ApplyToCols` and
`ApplyToFrame` replicate the `ColumnTransformer` behavior with a different syntax
and access to the selectors. 

## `ApplyToCols` and `ApplyToFrame`

### Applying a transformer to separate columns: `ApplyToCols`
In many cases, `ApplyToCols` can be a direct replacement for the `ColumnTransformer`,
as in the following example:

In [None]:
#| echo: true
import skrub.selectors as s
from sklearn.pipeline import make_pipeline
from skrub import ApplyToCols

numeric = ApplyToCols(StandardScaler(), cols=s.numeric())
string = ApplyToCols(OneHotEncoder(sparse_output=False), cols=s.string())

transformed = make_pipeline(numeric, string).fit_transform(df)
transformed

In this case, we apply the `StandardScaler` only to numeric features using 
`s.numeric()`, and the `OneHotEncoder` only to string features using `s.string()`. 

Under the hood, `ApplyToCols` selects all columns that satisfy the condition specified
in `cols` (in this case, that the dtype is numeric), then clones the specified
transformer (`StandardScaler`) and applies one clone to each column _separately_. 
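
The "clone and apply per column" idea can be illustrated with plain scikit-learn. This is a conceptual sketch only, not skrub's actual implementation: the helper `apply_per_column` and the dataframe `df_demo` are hypothetical names introduced for this example.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.preprocessing import StandardScaler


def apply_per_column(transformer, frame, columns):
    """Fit a fresh clone of `transformer` on each selected column separately."""
    out = frame.copy()
    for name in columns:
        single = frame[[name]]  # one-column dataframe: scikit-learn expects 2D input
        fitted = clone(transformer).fit(single)
        out[name] = fitted.transform(single).ravel()
    return out


# Hypothetical toy dataframe, used only for this illustration.
df_demo = pd.DataFrame(
    {"text": ["foo", "bar", "baz"], "a": [1, 2, 3], "b": [10, 20, 40]}
)
scaled = apply_per_column(StandardScaler(), df_demo, ["a", "b"])
print(scaled)
```

Each selected column gets its own fitted scaler, and the `text` column is returned untouched.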

::: {.callout-important}
Columns that are not selected are passed through without any change, thus string
columns are not touched by the `numeric` transformer. 
:::

Because unselected columns are passed through unchanged, several `ApplyToCols` 
can be chained together by putting them in a scikit-learn pipeline. 

### Applying the same transformer to multiple columns at once: `ApplyToFrame`
In some cases, a transformer needs to see several columns at once rather than 
each column separately. `ApplyToFrame` fits a single transformer on the whole 
selected subset of columns. 

This example dataframe contains some patient information, and some (random) 
metrics. 

In [None]:
#| echo: true
import pandas as pd
import numpy as np

n_patients = 20
np.random.seed(42)
df = pd.DataFrame({
    "patient_id": [f"P{i:03d}" for i in range(n_patients)],
    "age": np.random.randint(18, 80, size=n_patients),
    "sex": np.random.choice(["M", "F"], size=n_patients),
})

for i in range(5):
    df[f"metric_{i}"] = np.random.normal(loc=50, scale=10, size=n_patients)

df["diagnosis"] = np.random.choice(["A", "B", "C"], size=n_patients)
df.head()

With `ApplyToFrame`, it is easy to apply a decomposition algorithm such as `PCA` 
to condense the `metric_*` columns into a smaller number of features: 

In [None]:
#| echo: true
from skrub import ApplyToFrame
from sklearn.decomposition import PCA

reduce = ApplyToFrame(PCA(n_components=2), cols=s.glob("metric_*"))

df_reduced = reduce.fit_transform(df)
df_reduced.head()

## Selection operations in a scikit-learn pipeline
In some situations, it may be necessary to select or remove specific columns from
a dataframe. Unlike `ApplyToCols` and `ApplyToFrame`, which transform columns, 
this removes some features from the original table. This can be done with 
`SelectCols` and `DropCols`, which work as their names suggest: each takes a `cols` 
parameter that specifies which columns to select or drop, respectively.

## Exercise: putting everything together in a scikit-learn pipeline
