[WIP] Column-wise transformations on part of a dataframe #877
One question I have is whether the TableVectorizer should output a dataframe or a numpy array (I guess the name indicates the former). If we consider the TableVectorizer is the exit door of the dataframe/skrub world and the entry into the numpy or sparse/scikit-learn part of the pipeline, an array (or sparse array) seems like a good choice. If the TableVectorizer produces a dataframe, that means passthrough of arbitrary columns and outputs of heterogeneous types are possible, but sparse outputs are not (at least not easily; AFAIK polars and Arrow don't have a way of handling sparse data out-of-the-box).
Fixes #874 and supersedes #848.
The goal of this PR is to have a way to apply column-wise transformations to a subset of a (polars or pandas) dataframe's columns.
This would allow reorganizing some of skrub's internals and, among other things, fix the issue of the TableVectorizer not applying the same transformations during `fit_transform` and during `transform`.
This PR is mostly about private utilities, or utilities that would be leveraged by more "advanced" users, not the main things that a new skrub user would see.
In the slightly longer term, some of it will be used for, or replaced by, a more high-level API that will allow building a pipeline step by step while seeing previews of the transformations applied to a subsample; that is to be discussed in a separate GitHub discussion.
Supporting both polars and pandas dataframes
Handled with single dispatch; this has already been extracted into a separate PR: #888
Applying a transformation column-by-column
Scikit-learn transformers transform a 2D array into a 2D array, but many operations on dataframes are done independently on several columns.
For example, a datetime encoder needs to remember which format (`'%Y-%m-%d'`, `'%d/%m/%Y'`, `'%m/%d/%Y'`, …) it detected for each column, a categorical encoding needs to remember a list of categories for each column, etc.
At the moment, each transformer has to come up with its own way of tracking and exposing per-column state, applying the transformations (possibly in parallel), avoiding duplicate column names in the output, and collecting the results of all the column-wise transformations into a dataframe.
This is done by the GapEncoder, MinHashEncoder, DatetimeEncoder, etc., and should be done (but is not at the moment) by the TableVectorizer for the cleaning/preprocessing steps it applies, such as dtype conversions, replacing "N/A" with nulls, counting categories, etc.
Moreover, applying a transformation to a subset of columns is only possible in the TableVectorizer or scikit-learn ColumnTransformer.
Both put all columns (including those to which we apply "passthrough") into a numpy array (which is then converted back to a dataframe).
Among other things this means all outputs are converted to a common dtype (and we get an error if that is not possible).
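A small pandas illustration of that dtype coercion (the column names and values are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"price": [1.5, 2.0], "city": ["Paris", "London"]})

# Stacking heterogeneous columns into a single numpy array forces a
# common dtype; here everything is upcast to `object`.
as_array = df.to_numpy()
print(as_array.dtype)       # object
print(df.dtypes["price"])   # float64: the dataframe kept the numeric dtype
```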
To make it easier we would need:
Single-column transformers
(we can't call them "column transformers" because there is already something called `ColumnTransformer` in scikit-learn; name suggestions welcome). They take 1D input, i.e. a single dataframe column. An example of that kind of object that already exists in skrub is the (private) `GapEncoderColumn`; the `TfidfVectorizer` in scikit-learn also only accepts 1D inputs. This PR would make the use of 1D transformers more systematic.
Their API looks like
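A sketch of such an API, built around the `__single_column_transformer__` marker described below (the example transformer, its name, and its behaviour are invented for illustration):

```python
import pandas as pd

class UpperCase:
    # Marker from this PR: signals the transformer expects a single
    # column (a 1D Series), not a 2D array or dataframe.
    __single_column_transformer__ = True

    def fit_transform(self, column):
        if not pd.api.types.is_string_dtype(column):
            # Reject columns this transformation does not apply to.
            return NotImplemented
        self.column_name_ = column.name  # per-column fitted state
        return self.transform(column)

    def transform(self, column):
        return column.str.upper()

letters = pd.Series(["a", "b"], name="letters")
out = UpperCase().fit_transform(letters)
print(out.tolist())                                  # ['A', 'B']
print(UpperCase().fit_transform(pd.Series([1, 2])))  # NotImplemented
```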
The `__single_column_transformer__` attribute indicates that the transformer operates on a single column; that marker could be replaced by a scikit-learn tag.
Why the NotImplemented option
The reason for allowing a transformer to reject a column with `NotImplemented` is that sometimes we can only find out whether the transformation is meaningful for a column by attempting to actually perform it. For example, in the TableVectorizer we don't know if a string column should be converted to datetime before trying to detect a datetime format. This produces a chicken-and-egg situation: the DatetimeEncoder must be applied only to columns that contain datetimes, but it must be applied to a column to discover whether it contains datetimes.
It is not done yet in this PR, but there could be a "strict" mode (which could be the default) in which we know we are passing a valid column, so rejecting it is not an option and results in an error.
Rather than returning `NotImplemented`, another option for rejecting a column could be to raise a specific exception, maybe called `RejectColumn`.
The advantage is that it would make it easier to implement the strict mode at the level of the `OnEachColumn` transformer described below, rather than in each 1D transformer, while easily retaining a full traceback and all the information about why the transformation failed.
Edit: after discussion with @glemaitre and @ogrisel we confirmed the choice of raising an exception rather than returning `NotImplemented`.
The drawback is that another package wanting to implement a 1D transformer would need to depend on skrub and import the exception type from skrub.
I think the advantage outweighs the drawback, so I'll change it (now or if/when we add the strict mode).
One such transformer added in this PR is `ToDatetime`.
`ToDatetime.fit_transform` accepts a single column and tries to convert it to a Datetime column.
If it can, it returns the converted column and remembers the string format it detected for that one column.
If it cannot (e.g. because the strings in a column do not represent dates), it returns `NotImplemented` to indicate that `ToDatetime` is not a transformation that should apply to the provided column.
`EncodeDatetime` works similarly, except that `fit_transform` returns a list of columns rather than just one.
Still, it accepts a single column, stores state for a single column, and returns some columns (year, month, day, etc.) that will be collected into a dataframe together with the output of other 1D transformers.
It is not done in this PR, but it would be possible to allow 1D transformers to accept "lazy columns" (a LazyFrame + column name) and to return lists of polars expressions rather than lists of columns.
This would enable lazy transforms (POC in this branch).
For some columns, we will only discover if the conversion to datetime makes sense once we try to convert them, so the transformer is allowed to reject a column.
This, plus separating datetime casting from encoding, simplifies the TableVectorizer's handling of datetimes.
As mentioned above we can add a "strict" mode where rejecting a column is not allowed.
`EncodeDatetime` is similar, but instead of just one output column it returns a list of columns:
OnEachColumn
To use the 1D transformers in a scikit-learn pipeline we need a (regular) scikit-learn transformer that applies a 1D-transformer to each column (possibly in parallel) and collects the results in a dataframe to be fed down the rest of the pipeline.
`OnEachColumn` provides that, and all the logic for mapping transformers to columns, storing them, handling column names, etc. can be handled in one place.
Columns that the univariate transformer rejects are passed through unchanged.
It is possible to restrict the columns on which the transformation is attempted:
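A toy stand-in for `OnEachColumn` (not skrub's implementation; the dataframe contents and the `Upper` transformer are invented) showing the `cols` restriction:

```python
import pandas as pd

class ToyOnEachColumn:
    """Toy stand-in for OnEachColumn: applies a 1D transformer to each
    column in `cols`, passing the other columns through unchanged."""

    def __init__(self, transformer, cols):
        self.transformer, self.cols = transformer, cols

    def fit_transform(self, df):
        self.transformers_ = {}  # one fitted transformer per column
        out = {}
        for name in df.columns:
            col = df[name]
            if name in self.cols:
                # Toy cloning; real code would use sklearn.base.clone.
                t = type(self.transformer)()
                result = t.fit_transform(col)
                if result is not NotImplemented:
                    self.transformers_[name] = t
                    out[name] = result
                    continue
            out[name] = col  # rejected or not selected: passthrough
        return pd.DataFrame(out)

class Upper:
    __single_column_transformer__ = True
    def fit_transform(self, column):
        return column.str.upper()

df = pd.DataFrame({"C": ["a", "b"], "D": ["c", "d"]})
out = ToyOnEachColumn(Upper(), cols=["C"]).fit_transform(df)
print(out)  # "C" upper-cased, "D" unchanged
```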
Above only "C" has been transformed, not "D".
We can inspect which columns were transformed:
Combining several transformations:
This PR also adds more convenient ways of selecting columns than the `cols=["B", "E"]` above, which are described later.
OnEachColumn also renames duplicate columns.
Applying a 2D transformation to a subset of columns
Sometimes we want to apply a transformation to a subset of columns, but it is still a 2D -> 2D transformation.
For example, we may want to `SelectKBest` some columns but enforce that some others should be kept.
Another transformer, `OnColumnSelection`, provides that.
Here, among "E" and "F", "F" has been selected and "E" has been dropped.
The other columns (outside of `cols`) have been passed through unchanged.
Note that unlike the current TableVectorizer or ColumnTransformer, this allows maintaining heterogeneous dtypes in the dataframe, and in particular keeping the original dtype of the columns that are passed through.
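A toy stand-in for this idea (not skrub's `OnColumnSelection`; the helper, dataframe, and target are invented) using scikit-learn's `SelectKBest` on a column subset:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

def apply_to_subset(selector, df, cols, y):
    # Fit the 2D transformer on the selected columns only.
    kept_values = selector.fit_transform(df[cols], y)
    kept = [c for c, keep in zip(cols, selector.get_support()) if keep]
    # Pass the other columns through with their original dtypes.
    out = df.drop(columns=cols)
    out[kept] = kept_values
    return out

df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],   # passed through as-is
    "E": [1.0, 0.9, 1.0, 1.1],      # weak predictor
    "F": [0.1, 0.0, 1.0, 1.1],      # strong predictor
})
y = [0, 0, 1, 1]
out = apply_to_subset(SelectKBest(f_classif, k=1), df, ["E", "F"], y)
print(list(out.columns))   # ['name', 'F']: "E" was dropped
print(out.dtypes["name"])  # object: original dtype preserved
```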
Selecting columns
The `skrub._selectors` module provides convenient ways to select the columns to which the transformations are applied, similar to `polars.selectors` and `ibis.selectors`.
The usual short-circuit rules for operators apply, so for example `s.categorical() & s.cardinality_below(2)` will not compute the cardinality of non-categorical columns.
A simple column name or list of column names is always accepted where a selector is expected:
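A toy illustration of the short-circuit behaviour (not skrub's actual selector implementation; the names mimic those above, and the call-recording list is added just to show which columns get counted):

```python
import pandas as pd

class Selector:
    """Toy selector wrapping a per-column predicate."""
    def __init__(self, predicate):
        self.predicate = predicate

    def __and__(self, other):
        # Python's `and` short-circuits, so `other` is only evaluated
        # on columns that already matched `self`.
        return Selector(lambda col: self.predicate(col) and other.predicate(col))

    def select(self, df):
        return [name for name in df.columns if self.predicate(df[name])]

def categorical():
    return Selector(lambda col: isinstance(col.dtype, pd.CategoricalDtype))

checked = []  # records which columns had their cardinality computed
def cardinality_below(n):
    def predicate(col):
        checked.append(col.name)
        return col.nunique() < n
    return Selector(predicate)

df = pd.DataFrame({
    "city": pd.Series(["Paris", "Paris", "Lyon"], dtype="category"),
    "temp": [21.0, 22.5, 19.0],
})
result = (categorical() & cardinality_below(3)).select(df)
print(result)   # ['city']
print(checked)  # ['city']: cardinality was never computed for 'temp'
```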
For convenience, the selectors also have a `use` method that builds the transformer for us.
So `OnEachColumn(EncodeDatetime(), cols=s.any_date()).fit_transform(df)` can also be written as:
By default, `use` will produce either an `OnEachColumn` or an `OnColumnSubset` depending on the type of the transformer we pass to it, but we can also force the choice.
Therefore `use` avoids importing these transformers, choosing between them, and manually constructing them, and it also makes the construction of a pipeline look more similar to a polars `select` or `with_columns` call, in which a transformation starts from the selector and then specifies an operation.
In any case, the `use` method is just for convenience, so we could remove it if we find it more confusing than helpful.
Summary
Adding columnwise transformations could help simplify the code of the TableVectorizer while fixing the issue of inconsistent transforms.
It can also help simplify other transformers that apply column-wise transformations such as the DatetimeEncoder by allowing them to focus only on transforming one column and offloading the task of mapping the transformation to relevant columns.
It avoids relying on the ColumnTransformer and hstacking all outputs into a numpy array.
If any of it is ever made public it can help advanced users build custom pipelines without being tied to the categorization of columns defined by the TableVectorizer (numeric, high-cardinality, low-cardinality, datetime).