[WIP] Column-wise transformations on part of a dataframe #877

Closed

@jeromedockes commented Jan 22, 2024

fixes #874 and supersedes #848

The goal of this PR is to have a way to apply column-wise transformations to a subset of a (polars or pandas) dataframe's columns.
This would allow reorganizing some of skrub's internals and, among other things, fix the issue of the TableVectorizer not applying the same transformations during fit_transform and during transform.

This PR is mostly about private utilities or utilities that would be leveraged by more "advanced" users, not the main things that a new skrub user would see.
In the slightly longer term, some of it will be used for, or replaced by, a more high-level API that will allow building a pipeline step by step while seeing previews of the transformations applied to a subsample; that is to be discussed in a separate GitHub discussion.

Supporting both polars and pandas dataframes

Handled with single dispatch; this has already been extracted into a separate PR: #888

Applying a transformation column-by-column

Scikit-learn transformers transform a 2D array into a 2D array, but many operations on dataframes are done independently on several columns.

  • a transformation typically does not apply to all columns (e.g. converting strings to datetimes does not apply to numerical columns).
  • separate state needs to be stored for each column. For example, a string-to-datetime conversion needs to remember the format ('%Y-%m-%d', '%d/%m/%Y', '%m/%d/%Y', …) it detected for each column, a categorical encoding needs to remember a list of categories for each column, etc.
  • transformations on 2 different columns can be done in parallel.

At the moment, each transformer has to come up with its own way of tracking and exposing per-column state, applying the transformations (possibly in parallel), avoiding duplicate column names in the output, and collecting the results of all the column-wise transformations into a dataframe.
This is done by the GapEncoder, MinHashEncoder, DatetimeEncoder, etc., and should be done (but is not at the moment) by the TableVectorizer for the cleaning/preprocessing steps it applies, such as dtype conversions, replacing "N/A" with nulls, counting categories, etc.
Moreover, applying a transformation to a subset of columns is currently only possible with the TableVectorizer or scikit-learn's ColumnTransformer.
Both put all columns (including those to which "passthrough" is applied) into a numpy array (which is then converted back to a dataframe).
Among other things, this means all outputs are converted to a common dtype (and we get an error if that is not possible).
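
To make that last point concrete, here is a minimal illustration (our sketch, not from the PR) of the common-dtype problem:

import pandas as pd
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({"num": [1.0, 2.0], "text": ["a", "b"]})
ct = ColumnTransformer([("keep", "passthrough", ["num", "text"])])
# Everything is hstacked into one numpy array, so the float column is
# upcast to the common dtype:
print(ct.fit_transform(df).dtype)  # object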

To make this easier we would need:

Single-column transformers

(we can't call them "column transformers" because there is already something called ColumnTransformer in scikit-learn; name suggestions welcome).
They take 1D input, i.e. a single dataframe column. An example of that kind of object that already exists in skrub is the (private) GapEncoderColumn; the TfidfVectorizer in scikit-learn also only accepts 1D inputs.

This PR would make the use of 1D transformers more systematic.
Their API looks like:

class TransformCol:
    __single_column_transformer__ = True

    def fit_transform(self, column):
        """
        Accepts 1 column (a pandas or polars Series,
        in the future possibly a representation of a column in a LazyFrame).

        Returns a column, a list of columns, a dataframe, or NotImplemented to
        indicate this transformation does not apply to the given column.
        """

    def transform(self, column):
        """
        Applies exactly the same transformation as fit_transform.
        Never returns NotImplemented and should not be called if fit_transform
        returned NotImplemented.
        """

The __single_column_transformer__ indicates it transforms a single column; that could be replaced by a scikit-learn tag.
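
For illustration, a wrapper can detect the tag with a simple getattr check (the helper name here is ours, not skrub's):

def is_single_column_transformer(obj):
    # An absent attribute means a regular (2D) scikit-learn transformer.
    return getattr(obj, "__single_column_transformer__", False)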

Why the NotImplemented option: the reason for allowing a transformer to reject a column with `NotImplemented` is that sometimes we can only find out whether the transformation is meaningful for a column by attempting to actually perform it. For example, in the TableVectorizer we don't know if a string column should be converted to datetime before trying to detect a datetime format. This produces a chicken-and-egg situation: the DatetimeEncoder must be applied only to columns that contain datetimes, but it must be applied to a column to discover whether it contains datetimes.

It is not done yet in this PR, but there could be a "strict" mode (which could be the default) in which we know we are passing a valid column, so rejecting it is not an option and results in an error.

Rather than returning NotImplemented, another option for rejecting a column could be to raise a specific exception, maybe called RejectColumn.
The advantage is that it would make it easier to implement the strict mode at the level of the OnEachColumn transformer described below, rather than in each 1D transformer, while easily retaining a full traceback and all the information about why the transformation failed.
Edit: after discussion with @glemaitre and @ogrisel, we confirmed the choice of raising an exception rather than returning NotImplemented.

The drawback is that another package wanting to implement a 1D transformer would need to depend on skrub and import the exception type from skrub.
I think the advantage outweighs the drawback, so I'll change it (now, or if/when we add the strict mode).
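
For illustration, a minimal sketch of what exception-based rejection could look like; the RejectColumn name comes from the suggestion above, and the ToLower transformer is a made-up example, not part of skrub:

import polars as pl


class RejectColumn(Exception):
    """Raised by a single-column transformer to reject a column during fit."""


class ToLower:
    __single_column_transformer__ = True

    def fit_transform(self, column):
        # Rejection happens here, instead of returning NotImplemented.
        if column.dtype != pl.String:
            raise RejectColumn(f"{column.name!r} is not a string column")
        return column.str.to_lowercase()

    def transform(self, column):
        # transform never rejects: fit_transform already accepted this column.
        return column.str.to_lowercase()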

One such transformer added in this PR is the ToDatetime.
ToDatetime.fit_transform accepts a single column and tries to convert it to a Datetime column.
If it can, it returns the converted column and remembers the string format it detected for that one column.
If it cannot (e.g. because the strings in the column do not represent dates), it returns NotImplemented to indicate that ToDatetime is not a transformation that should apply to the provided column.

EncodeDatetime works similarly, except that fit_transform returns a list of columns rather than just one.
Still, it accepts a single column, stores state for a single column, and returns some columns (year, month, day etc) that will be collected into a dataframe together with the output of other 1D transformers.

It is not done in this PR, but it would be possible to allow 1D transformers to accept "lazy columns" (LazyFrame + column name) and to return lists of polars expressions rather than lists of columns.
This enables lazy transforms (POC in this branch).
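
As a hypothetical sketch (not part of this PR) of what such an expression-based transform could return:

import polars as pl

def lazy_encode_datetime(col_name):
    # Return polars expressions; nothing is computed until .collect().
    dt = pl.col(col_name)
    return [
        dt.dt.year().alias(f"{col_name}_year"),
        dt.dt.month().alias(f"{col_name}_month"),
        dt.dt.day().alias(f"{col_name}_day"),
    ]

# e.g. lazy_frame.with_columns(lazy_encode_datetime("D")).collect()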

>>> import polars as pl

>>> df = pl.DataFrame({
...     "A": [1.1, 2.2, 3.3, None, 5.5],
...     "B": list("aabc") + [None],
...     "C": ["2024-02-02", "2024-02-02", None, "2025-10-11", None],
...     "D": ["01/02/1998", "10/03/2027", "11/02/2012", None, "01/01/1901"],
...     "E": "one two three four five".split(),
...     "F": " 1.0 2.0 3.0 4.0 5.0".split(),
... })
>>> df
shape: (5, 6)
┌──────┬──────┬────────────┬────────────┬───────┬─────┐
│ A    ┆ B    ┆ C          ┆ D          ┆ E     ┆ F   │
│ ---  ┆ ---  ┆ ---        ┆ ---        ┆ ---   ┆ --- │
│ f64  ┆ str  ┆ str        ┆ str        ┆ str   ┆ str │
╞══════╪══════╪════════════╪════════════╪═══════╪═════╡
│ 1.1  ┆ a    ┆ 2024-02-02 ┆ 01/02/1998 ┆ one   ┆ 1.0 │
│ 2.2  ┆ a    ┆ 2024-02-02 ┆ 10/03/2027 ┆ two   ┆ 2.0 │
│ 3.3  ┆ b    ┆ null       ┆ 11/02/2012 ┆ three ┆ 3.0 │
│ null ┆ c    ┆ 2025-10-11 ┆ null       ┆ four  ┆ 4.0 │
│ 5.5  ┆ null ┆ null       ┆ 01/01/1901 ┆ five  ┆ 5.0 │
└──────┴──────┴────────────┴────────────┴───────┴─────┘
>>> from skrub._to_datetime import ToDatetime
>>> from skrub._datetime_encoder import EncodeDatetime

>>> as_dt = ToDatetime().fit_transform(df["D"])
>>> as_dt
shape: (5,)
Series: 'D' [datetime[μs]]
[
	1998-01-02 00:00:00
	2027-10-03 00:00:00
	2012-11-02 00:00:00
	null
	1901-01-01 00:00:00
]


For some columns, we only discover whether the conversion to datetime makes sense once we try to convert them, so the transformer is allowed to reject a column.
This, together with separating datetime casting from datetime encoding, simplifies the TableVectorizer's handling of datetimes.

>>> ToDatetime().fit_transform(df["A"])
NotImplemented

A column can only be rejected during `fit`, not `transform`:

>>> to_dt = ToDatetime()
>>> to_dt.fit_transform(df["C"])
shape: (5,)
Series: 'C' [datetime[μs]]
[
	2024-02-02 00:00:00
	2024-02-02 00:00:00
	null
	2025-10-11 00:00:00
	null
]
>>> to_dt.transform(pl.Series("C", "bad bad 2021-10-20".split()))
shape: (3,)
Series: 'C' [datetime[μs]]
[
	null
	null
	2021-10-20 00:00:00
]


As mentioned above, we can add a "strict" mode in which rejecting a column is not allowed.
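
A sketch of how the wrapper could implement it, assuming the RejectColumn exception discussed earlier (names are ours):

def fit_transform_one(transformer, column, strict=False):
    try:
        return transformer.fit_transform(column)
    except RejectColumn:
        if strict:
            raise  # the full traceback explains why the column was rejected
        return column  # non-strict: pass the column through unchanged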

EncodeDatetime is similar but instead of just one output column it returns a list of columns:

>>> EncodeDatetime().fit_transform(as_dt)
[shape: (5,)
Series: 'D_year' [f32]
[
	1998.0
	2027.0
	2012.0
	null
	1901.0
], shape: (5,)
Series: 'D_month' [f32]
[
	1.0
	10.0
	11.0
	null
	1.0
], shape: (5,)
Series: 'D_day' [f32]
[
	2.0
	3.0
	2.0
	null
	1.0
], shape: (5,)
Series: 'D_total_seconds' [f32]
[
	8.836992e8
	1.8225e9
	1.3518e9
	null
	-2.1775e9
]]


OnEachColumn

To use the 1D transformers in a scikit-learn pipeline, we need a (regular) scikit-learn transformer that applies a 1D transformer to each column (possibly in parallel) and collects the results in a dataframe to be fed to the rest of the pipeline.
OnEachColumn provides that, so all the logic for mapping transformers to columns, storing them, handling column names, etc. is handled in one place.
Columns that the univariate transformer rejects are passed through unchanged.

>>> from skrub._on_each_column import OnEachColumn

>>> df
shape: (5, 6)
┌──────┬──────┬────────────┬────────────┬───────┬─────┐
│ A    ┆ B    ┆ C          ┆ D          ┆ E     ┆ F   │
│ ---  ┆ ---  ┆ ---        ┆ ---        ┆ ---   ┆ --- │
│ f64  ┆ str  ┆ str        ┆ str        ┆ str   ┆ str │
╞══════╪══════╪════════════╪════════════╪═══════╪═════╡
│ 1.1  ┆ a    ┆ 2024-02-02 ┆ 01/02/1998 ┆ one   ┆ 1.0 │
│ 2.2  ┆ a    ┆ 2024-02-02 ┆ 10/03/2027 ┆ two   ┆ 2.0 │
│ 3.3  ┆ b    ┆ null       ┆ 11/02/2012 ┆ three ┆ 3.0 │
│ null ┆ c    ┆ 2025-10-11 ┆ null       ┆ four  ┆ 4.0 │
│ 5.5  ┆ null ┆ null       ┆ 01/01/1901 ┆ five  ┆ 5.0 │
└──────┴──────┴────────────┴────────────┴───────┴─────┘
>>> OnEachColumn(ToDatetime()).fit_transform(df)
shape: (5, 6)
┌──────┬──────┬─────────────────────┬─────────────────────┬───────┬─────┐
│ A    ┆ B    ┆ C                   ┆ D                   ┆ E     ┆ F   │
│ ---  ┆ ---  ┆ ---                 ┆ ---                 ┆ ---   ┆ --- │
│ f64  ┆ str  ┆ datetime[μs]        ┆ datetime[μs]        ┆ str   ┆ str │
╞══════╪══════╪═════════════════════╪═════════════════════╪═══════╪═════╡
│ 1.1  ┆ a    ┆ 2024-02-02 00:00:00 ┆ 1998-01-02 00:00:00 ┆ one   ┆ 1.0 │
│ 2.2  ┆ a    ┆ 2024-02-02 00:00:00 ┆ 2027-10-03 00:00:00 ┆ two   ┆ 2.0 │
│ 3.3  ┆ b    ┆ null                ┆ 2012-11-02 00:00:00 ┆ three ┆ 3.0 │
│ null ┆ c    ┆ 2025-10-11 00:00:00 ┆ null                ┆ four  ┆ 4.0 │
│ 5.5  ┆ null ┆ null                ┆ 1901-01-01 00:00:00 ┆ five  ┆ 5.0 │
└──────┴──────┴─────────────────────┴─────────────────────┴───────┴─────┘


It is possible to restrict the columns on which the transformation is attempted:

>>> OnEachColumn(ToDatetime(), cols=["C"]).fit_transform(df)
shape: (5, 6)
┌──────┬──────┬─────────────────────┬────────────┬───────┬─────┐
│ A    ┆ B    ┆ C                   ┆ D          ┆ E     ┆ F   │
│ ---  ┆ ---  ┆ ---                 ┆ ---        ┆ ---   ┆ --- │
│ f64  ┆ str  ┆ datetime[μs]        ┆ str        ┆ str   ┆ str │
╞══════╪══════╪═════════════════════╪════════════╪═══════╪═════╡
│ 1.1  ┆ a    ┆ 2024-02-02 00:00:00 ┆ 01/02/1998 ┆ one   ┆ 1.0 │
│ 2.2  ┆ a    ┆ 2024-02-02 00:00:00 ┆ 10/03/2027 ┆ two   ┆ 2.0 │
│ 3.3  ┆ b    ┆ null                ┆ 11/02/2012 ┆ three ┆ 3.0 │
│ null ┆ c    ┆ 2025-10-11 00:00:00 ┆ null       ┆ four  ┆ 4.0 │
│ 5.5  ┆ null ┆ null                ┆ 01/01/1901 ┆ five  ┆ 5.0 │
└──────┴──────┴─────────────────────┴────────────┴───────┴─────┘


Above only "C" has been transformed, not "D".

We can inspect which columns were transformed:

>>> to_dt = OnEachColumn(ToDatetime()).fit(df)
>>> to_dt.transformers_
{'C': ToDatetime(), 'D': ToDatetime()}
>>> to_dt.input_to_outputs_
{'C': ['C'], 'D': ['D']}
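
Roughly, the fit_transform loop looks like this (a simplified sketch, not the actual implementation; it ignores parallelism and duplicate-name handling):

import polars as pl
from sklearn.base import clone

def fit_transform_each_column(transformer, df, cols=None):
    cols = df.columns if cols is None else cols
    transformers_, input_to_outputs_, outputs = {}, {}, []
    for name in df.columns:
        if name not in cols:
            outputs.append(df[name])  # not selected: passthrough
            continue
        trans = clone(transformer)  # a separate fitted transformer per column
        result = trans.fit_transform(df[name])
        if result is NotImplemented:
            outputs.append(df[name])  # rejected: passthrough unchanged
            continue
        result = result if isinstance(result, list) else [result]
        transformers_[name] = trans
        input_to_outputs_[name] = [out.name for out in result]
        outputs.extend(result)
    return pl.DataFrame(outputs), transformers_, input_to_outputs_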


Combining several transformations:

>>> from sklearn.pipeline import make_pipeline
>>> from skrub._to_numeric import ToNumeric
>>> from skrub._to_float import ToFloat32
>>> from sklearn.preprocessing import OrdinalEncoder

>>> make_pipeline(
...     OnEachColumn(ToNumeric()),
...     OnEachColumn(ToDatetime()),
...     OnEachColumn(EncodeDatetime()),
...     OnEachColumn(OrdinalEncoder(), cols=["B", "E"]),
...     OnEachColumn(ToFloat32())
... ).fit_transform(df)
shape: (5, 12)
┌──────┬─────┬────────┬─────────┬───┬───────┬─────────────────┬─────┬─────┐
│ A    ┆ B   ┆ C_year ┆ C_month ┆ … ┆ D_day ┆ D_total_seconds ┆ E   ┆ F   │
│ ---  ┆ --- ┆ ---    ┆ ---     ┆   ┆ ---   ┆ ---             ┆ --- ┆ --- │
│ f32  ┆ f32 ┆ f32    ┆ f32     ┆   ┆ f32   ┆ f32             ┆ f32 ┆ f32 │
╞══════╪═════╪════════╪═════════╪═══╪═══════╪═════════════════╪═════╪═════╡
│ 1.1  ┆ 0.0 ┆ 2024.0 ┆ 2.0     ┆ … ┆ 2.0   ┆ 8.836992e8      ┆ 2.0 ┆ 1.0 │
│ 2.2  ┆ 0.0 ┆ 2024.0 ┆ 2.0     ┆ … ┆ 3.0   ┆ 1.8225e9        ┆ 4.0 ┆ 2.0 │
│ 3.3  ┆ 1.0 ┆ null   ┆ null    ┆ … ┆ 2.0   ┆ 1.3518e9        ┆ 3.0 ┆ 3.0 │
│ null ┆ 2.0 ┆ 2025.0 ┆ 10.0    ┆ … ┆ null  ┆ null            ┆ 1.0 ┆ 4.0 │
│ 5.5  ┆ 3.0 ┆ null   ┆ null    ┆ … ┆ 1.0   ┆ -2.1775e9       ┆ 0.0 ┆ 5.0 │
└──────┴─────┴────────┴─────────┴───┴───────┴─────────────────┴─────┴─────┘


This PR also adds more convenient ways of selecting columns than the cols=["B", "E"] above, which are described later.

>>> from skrub import _selectors as s
>>> transformed = make_pipeline(
...     OnEachColumn(ToNumeric()),
...     OnEachColumn(ToDatetime()),
...     OnEachColumn(EncodeDatetime()),
...     OnEachColumn(OrdinalEncoder(), cols=~s.numeric()), # select by excluding dtype
...     OnEachColumn(ToFloat32())
... ).fit_transform(df)
>>> transformed
shape: (5, 12)
┌──────┬─────┬────────┬─────────┬───┬───────┬─────────────────┬─────┬─────┐
│ A    ┆ B   ┆ C_year ┆ C_month ┆ … ┆ D_day ┆ D_total_seconds ┆ E   ┆ F   │
│ ---  ┆ --- ┆ ---    ┆ ---     ┆   ┆ ---   ┆ ---             ┆ --- ┆ --- │
│ f32  ┆ f32 ┆ f32    ┆ f32     ┆   ┆ f32   ┆ f32             ┆ f32 ┆ f32 │
╞══════╪═════╪════════╪═════════╪═══╪═══════╪═════════════════╪═════╪═════╡
│ 1.1  ┆ 0.0 ┆ 2024.0 ┆ 2.0     ┆ … ┆ 2.0   ┆ 8.836992e8      ┆ 2.0 ┆ 1.0 │
│ 2.2  ┆ 0.0 ┆ 2024.0 ┆ 2.0     ┆ … ┆ 3.0   ┆ 1.8225e9        ┆ 4.0 ┆ 2.0 │
│ 3.3  ┆ 1.0 ┆ null   ┆ null    ┆ … ┆ 2.0   ┆ 1.3518e9        ┆ 3.0 ┆ 3.0 │
│ null ┆ 2.0 ┆ 2025.0 ┆ 10.0    ┆ … ┆ null  ┆ null            ┆ 1.0 ┆ 4.0 │
│ 5.5  ┆ 3.0 ┆ null   ┆ null    ┆ … ┆ 1.0   ┆ -2.1775e9       ┆ 0.0 ┆ 5.0 │
└──────┴─────┴────────┴─────────┴───┴───────┴─────────────────┴─────┴─────┘


OnEachColumn also renames duplicate columns:
>>> from sklearn.base import BaseEstimator

>>> class Transform(BaseEstimator):
...    __single_column_transformer__ = True
...
...    def fit_transform(self, col):
...        return col.rename("A")

>>> OnEachColumn(Transform(), cols=list("ABC")).fit_transform(df)
shape: (5, 6)
┌──────┬─────────────────────┬─────────────────────┬────────────┬───────┬─────┐
│ A    ┆ A__skrub_86e6d810__ ┆ A__skrub_61baa0d1__ ┆ D          ┆ E     ┆ F   │
│ ---  ┆ ---                 ┆ ---                 ┆ ---        ┆ ---   ┆ --- │
│ f64  ┆ str                 ┆ str                 ┆ str        ┆ str   ┆ str │
╞══════╪═════════════════════╪═════════════════════╪════════════╪═══════╪═════╡
│ 1.1  ┆ a                   ┆ 2024-02-02          ┆ 01/02/1998 ┆ one   ┆ 1.0 │
│ 2.2  ┆ a                   ┆ 2024-02-02          ┆ 10/03/2027 ┆ two   ┆ 2.0 │
│ 3.3  ┆ b                   ┆ null                ┆ 11/02/2012 ┆ three ┆ 3.0 │
│ null ┆ c                   ┆ 2025-10-11          ┆ null       ┆ four  ┆ 4.0 │
│ 5.5  ┆ null                ┆ null                ┆ 01/01/1901 ┆ five  ┆ 5.0 │
└──────┴─────────────────────┴─────────────────────┴────────────┴───────┴─────┘
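
The __skrub_<hex>__ suffixes above suggest the scheme: clashing names get a unique tag appended. A minimal sketch of the idea (the actual implementation may differ):

import secrets

def deduplicate(names):
    seen, out = set(), []
    for name in names:
        if name in seen:
            name = f"{name}__skrub_{secrets.token_hex(4)}__"
        seen.add(name)
        out.append(name)
    return out

# deduplicate(["A", "A"]) keeps the first "A" and gives the second a
# random 8-hex-digit tag.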


Applying a 2D transformation to a subset of columns

Sometimes we want to apply a transformation to a subset of columns, but it is still a 2D-to-2D transformation.
For example, we may want to apply SelectKBest to some columns while enforcing that others are kept.
Another transformer, OnColumnSelection, provides that.

>>> import numpy as np
>>> from sklearn.feature_selection import SelectKBest
>>> from skrub._on_column_selection import OnColumnSelection

>>> y = np.random.default_rng(0).integers(2, size=df.shape[0]).astype(bool)
>>> OnColumnSelection(SelectKBest(k=1), cols=["E", "F"]).fit_transform(transformed, y)
shape: (5, 11)
┌──────┬─────┬────────┬─────────┬───┬─────────┬───────┬─────────────────┬─────┐
│ A    ┆ B   ┆ C_year ┆ C_month ┆ … ┆ D_month ┆ D_day ┆ D_total_seconds ┆ F   │
│ ---  ┆ --- ┆ ---    ┆ ---     ┆   ┆ ---     ┆ ---   ┆ ---             ┆ --- │
│ f32  ┆ f32 ┆ f32    ┆ f32     ┆   ┆ f32     ┆ f32   ┆ f32             ┆ f32 │
╞══════╪═════╪════════╪═════════╪═══╪═════════╪═══════╪═════════════════╪═════╡
│ 1.1  ┆ 0.0 ┆ 2024.0 ┆ 2.0     ┆ … ┆ 1.0     ┆ 2.0   ┆ 8.836992e8      ┆ 1.0 │
│ 2.2  ┆ 0.0 ┆ 2024.0 ┆ 2.0     ┆ … ┆ 10.0    ┆ 3.0   ┆ 1.8225e9        ┆ 2.0 │
│ 3.3  ┆ 1.0 ┆ null   ┆ null    ┆ … ┆ 11.0    ┆ 2.0   ┆ 1.3518e9        ┆ 3.0 │
│ null ┆ 2.0 ┆ 2025.0 ┆ 10.0    ┆ … ┆ null    ┆ null  ┆ null            ┆ 4.0 │
│ 5.5  ┆ 3.0 ┆ null   ┆ null    ┆ … ┆ 1.0     ┆ 1.0   ┆ -2.1775e9       ┆ 5.0 │
└──────┴─────┴────────┴─────────┴───┴─────────┴───────┴─────────────────┴─────┘


Here, among "E" and "F", "F" has been selected and "E" has been dropped.
The other columns (outside of cols) have been passed through unchanged.
Note that, unlike the current TableVectorizer or ColumnTransformer, this allows maintaining heterogeneous dtypes in the dataframe, and in particular keeping the original dtype of the columns that are passed through.
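
Roughly, OnColumnSelection only has to split the dataframe, apply the regular 2D transformer to the selected part, and concatenate (a simplified sketch, assuming a polars input and an inner transformer that implements get_feature_names_out):

import polars as pl

def fit_transform_on_selection(transformer, df, cols, y=None):
    selected, rest = df.select(cols), df.drop(cols)
    # rest is passed through untouched, keeping its original dtypes.
    result = transformer.fit_transform(selected.to_pandas(), y)
    names = transformer.get_feature_names_out()
    out = pl.DataFrame({n: result[:, i] for i, n in enumerate(names)})
    return pl.concat([rest, out], how="horizontal")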

Selecting columns

The skrub._selectors module provides convenient ways to select columns to which the transformations are applied, similar to polars.selectors and ibis.selectors.

>>> from skrub import _selectors as s
>>> from datetime import date
>>> df = pl.DataFrame(
...     dict(
...         a=[10, 20, 30],
...         b=["a", "b", "b"],
...         c=["b", "b", "b"],
...         d=[date(2024, 2, 26), date(2024, 2, 27), None],
...         e=[1, 2, 3],
...     )
... )
>>> df
shape: (3, 5)
┌─────┬─────┬─────┬────────────┬─────┐
│ a   ┆ b   ┆ c   ┆ d          ┆ e   │
│ --- ┆ --- ┆ --- ┆ ---        ┆ --- │
│ i64 ┆ str ┆ str ┆ date       ┆ i64 │
╞═════╪═════╪═════╪════════════╪═════╡
│ 10  ┆ a   ┆ b   ┆ 2024-02-26 ┆ 1   │
│ 20  ┆ b   ┆ b   ┆ 2024-02-27 ┆ 2   │
│ 30  ┆ b   ┆ b   ┆ null       ┆ 3   │
└─────┴─────┴─────┴────────────┴─────┘
>>> OnEachColumn(OrdinalEncoder(), cols=s.string()).fit_transform(df)
shape: (3, 5)
┌─────┬─────┬─────┬────────────┬─────┐
│ a   ┆ b   ┆ c   ┆ d          ┆ e   │
│ --- ┆ --- ┆ --- ┆ ---        ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ date       ┆ i64 │
╞═════╪═════╪═════╪════════════╪═════╡
│ 10  ┆ 0.0 ┆ 0.0 ┆ 2024-02-26 ┆ 1   │
│ 20  ┆ 1.0 ┆ 0.0 ┆ 2024-02-27 ┆ 2   │
│ 30  ┆ 1.0 ┆ 0.0 ┆ null       ┆ 3   │
└─────┴─────┴─────┴────────────┴─────┘

>>> OnEachColumn(OrdinalEncoder(), cols=s.string() | s.any_date()).fit_transform(df)
shape: (3, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ c   ┆ d   ┆ e   │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 10  ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 1   │
│ 20  ┆ 1.0 ┆ 0.0 ┆ 1.0 ┆ 2   │
│ 30  ┆ 1.0 ┆ 0.0 ┆ 2.0 ┆ 3   │
└─────┴─────┴─────┴─────┴─────┘
>>> OnEachColumn(OrdinalEncoder(), cols=s.string() & s.cardinality_below(2)).fit_transform(df)
shape: (3, 5)
┌─────┬─────┬─────┬────────────┬─────┐
│ a   ┆ b   ┆ c   ┆ d          ┆ e   │
│ --- ┆ --- ┆ --- ┆ ---        ┆ --- │
│ i64 ┆ str ┆ f64 ┆ date       ┆ i64 │
╞═════╪═════╪═════╪════════════╪═════╡
│ 10  ┆ a   ┆ 0.0 ┆ 2024-02-26 ┆ 1   │
│ 20  ┆ b   ┆ 0.0 ┆ 2024-02-27 ┆ 2   │
│ 30  ┆ b   ┆ 0.0 ┆ null       ┆ 3   │
└─────┴─────┴─────┴────────────┴─────┘

The usual short-circuit rules for operators apply, so, for example, s.categorical() & s.cardinality_below(2) will not compute the cardinality of non-categorical columns.
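
For illustration, a sketch of how the & combinator can short-circuit on each column (names are ours; the actual selector implementation may differ):

class And:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def matches(self, column):
        # Python's `and` short-circuits: the right-hand predicate (possibly
        # expensive, e.g. computing a column's cardinality) only runs on
        # columns the left-hand predicate accepted.
        return self.left.matches(column) and self.right.matches(column)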

A simple column name or list of column names is always accepted where a selector is expected:

>>> OnEachColumn(OrdinalEncoder(), cols=["a", "b", "c"]).fit_transform(df)
shape: (3, 5)
┌─────┬─────┬─────┬────────────┬─────┐
│ a   ┆ b   ┆ c   ┆ d          ┆ e   │
│ --- ┆ --- ┆ --- ┆ ---        ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ date       ┆ i64 │
╞═════╪═════╪═════╪════════════╪═════╡
│ 0.0 ┆ 0.0 ┆ 0.0 ┆ 2024-02-26 ┆ 1   │
│ 1.0 ┆ 1.0 ┆ 0.0 ┆ 2024-02-27 ┆ 2   │
│ 2.0 ┆ 1.0 ┆ 0.0 ┆ null       ┆ 3   │
└─────┴─────┴─────┴────────────┴─────┘


For convenience, the selectors also have a use method that builds the transformer for us.
So OnEachColumn(EncodeDatetime(), cols=s.any_date()).fit_transform(df) can also be written as:

>>> s.any_date().use(EncodeDatetime()).fit_transform(df)
shape: (3, 8)
┌─────┬─────┬─────┬────────┬─────────┬───────┬─────────────────┬─────┐
│ a   ┆ b   ┆ c   ┆ d_year ┆ d_month ┆ d_day ┆ d_total_seconds ┆ e   │
│ --- ┆ --- ┆ --- ┆ ---    ┆ ---     ┆ ---   ┆ ---             ┆ --- │
│ i64 ┆ str ┆ str ┆ f32    ┆ f32     ┆ f32   ┆ f32             ┆ i64 │
╞═════╪═════╪═════╪════════╪═════════╪═══════╪═════════════════╪═════╡
│ 10  ┆ a   ┆ b   ┆ 2024.0 ┆ 2.0     ┆ 26.0  ┆ 1.7089e9        ┆ 1   │
│ 20  ┆ b   ┆ b   ┆ 2024.0 ┆ 2.0     ┆ 27.0  ┆ 1.7090e9        ┆ 2   │
│ 30  ┆ b   ┆ b   ┆ null   ┆ null    ┆ null  ┆ null            ┆ 3   │
└─────┴─────┴─────┴────────┴─────────┴───────┴─────────────────┴─────┘

By default, use will produce either an OnEachColumn or an OnColumnSelection depending on the type of the transformer we pass to it, but we can also force the choice:

>>> s.any_date().use(EncodeDatetime())
<Transformer: EncodeDatetime.transform(col) for col in X[any_date()]>
>>> s.string().use(SelectKBest())
<Transformer: SelectKBest.transform(X[string()])>
>>> s.string().use(OrdinalEncoder(), columnwise=True)
<Transformer: OrdinalEncoder.transform(col) for col in X[string()]>

use thus avoids having to import these transformers, choose between them, and construct them manually; it also makes building a pipeline look more like a polars select or with_columns call, in which a transformation starts from a selector and then specifies an operation.
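
A hedged sketch of what use could do under the hood (the actual code may differ):

class Selector:
    # ... other selector methods ...

    def use(self, transformer, columnwise="auto"):
        # Pick the wrapper based on the transformer kind, unless forced.
        if columnwise == "auto":
            columnwise = hasattr(transformer, "__single_column_transformer__")
        wrapper = OnEachColumn if columnwise else OnColumnSelection
        return wrapper(transformer, cols=self)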

>>> from sklearn.dummy import DummyRegressor

>>> make_pipeline(
...     s.string().use(ToNumeric()),
...     s.string().use(ToDatetime()),
...     s.any_date().use(EncodeDatetime()),
...     (~s.numeric()).use(OrdinalEncoder()),
...     s.all().use(ToFloat32()),
...     s.glob("WikiData-*").use(SelectKBest(k=5)),
...     DummyRegressor(),
... )
Pipeline(steps=[('oneachcolumn-1',
                 OnEachColumn(cols=string(), transformer=ToNumeric())),
                ('oneachcolumn-2',
                 OnEachColumn(cols=string(), transformer=ToDatetime())),
                ('oneachcolumn-3',
                 OnEachColumn(cols=any_date(), transformer=EncodeDatetime())),
                ('oncolumnselection-1',
                 OnColumnSelection(cols=~(numeric()), transformer=OrdinalEncoder())),
                ('oneachcolumn-4', OnEachColumn(transformer=ToFloat32())),
                ('oncolumnselection-2',
                 OnColumnSelection(cols=glob('WikiData-*'), transformer=SelectKBest(k=5))),
                ('dummyregressor', DummyRegressor())])


In any case, use is just a convenience, so we could remove it if it turns out to be more confusing than helpful.

Summary

Adding column-wise transformations could help simplify the code of the TableVectorizer while fixing the issue of inconsistent transforms.
It can also help simplify other transformers that apply column-wise transformations, such as the DatetimeEncoder, by letting them focus on transforming one column and offloading the task of mapping the transformation to the relevant columns.
It avoids relying on the ColumnTransformer and hstacking all outputs into a numpy array.

If any of it is ever made public, it can help advanced users build custom pipelines without being tied to the categorization of columns defined by the TableVectorizer (numeric, high-cardinality, low-cardinality, datetime).

@jeromedockes marked this pull request as draft January 22, 2024 08:49
@jeromedockes changed the title [WIP] TableVectorizer improvements [WIP] Column-wise transformations on part of a dataframe Feb 27, 2024
@jeromedockes commented Feb 28, 2024

One question I have is whether the TableVectorizer should output a dataframe or a numpy array (I guess the name indicates the former).

If we consider that the TableVectorizer is the exit door of the dataframe/skrub world and the entry into the numpy or sparse/scikit-learn part of the pipeline, an array (or sparse array) seems like a good choice.
This means the last step of the TableVectorizer's processing, which applies the user-provided final transformer, should continue to use the ColumnTransformer internally or do something very similar -- apply each transformer and hstack the 4 outputs into a numpy array.
It means all transformers must output floats and there cannot be passed-through columns of different types.
A big advantage is that we can easily have sparse outputs.
One drawback is that the output column names have to be read from the TableVectorizer itself rather than from an attribute of the output, and that we don't get output columns with a "categorical" dtype that can be recognized by HistGradientBoostingClassifier(categorical_features="from_dtype").

If the TableVectorizer produces a dataframe, passthrough of arbitrary columns and outputs of heterogeneous types are possible, but sparse outputs are not (at least not easily; AFAIK polars and Arrow don't have a way of handling sparse data out of the box).

@jeromedockes

Completed in three PRs: #888, #895 and #902.
