[WIP] Column-wise transformations on part of a dataframe #877
One question I have is whether the TableVectorizer should output a dataframe or a numpy array (I guess the name indicates the former). If we consider the TableVectorizer is the exit door of the dataframe/skrub world and the entry into the numpy or sparse/scikit-learn part of the pipeline, an array (or sparse array) seems like a good choice. If the TableVectorizer produces a dataframe, that means passthrough of arbitrary columns and outputs of heterogeneous types are possible, but sparse outputs are not (at least not easily; AFAIK polars and Arrow don't have a way of handling sparse data out-of-the-box).
Fixes #874 and supersedes #848.
The goal of this PR is to have a way to apply column-wise transformations to a subset of a (polars or pandas) dataframe's columns.
This would allow reorganizing some of skrub's internals and, among other things, fix the issue of the TableVectorizer not applying the same transformations during `fit_transform` and during `transform`.
This PR is mostly about private utilities, or utilities that would be leveraged by more "advanced" users, not the main things that a new skrub user would see.
In the slightly longer term, some of it will be used for, or replaced by, a more high-level API that will allow building a pipeline step by step while seeing previews of the transformations applied to a subsample; that is to be discussed in a separate GitHub discussion.
Supporting both polars and pandas dataframes
Handled with single dispatch; this has already been extracted into a separate PR: #888
Applying a transformation column-by-column
Scikit-learn transformers transform a 2D array into a 2D array, but many operations on dataframes are done independently on several columns.
For example, a datetime encoder needs to remember which format (`'%Y-%m-%d'`, `'%d/%m/%Y'`, `'%m/%d/%Y'`, …) it detected for each column, a categorical encoding needs to remember a list of categories for each column, etc.
At the moment, each transformer has to come up with its own way of tracking and exposing per-column state, applying the transformations (possibly in parallel), avoiding duplicate column names in the output, and collecting the results of all the column-wise transformations into a dataframe.
This is done by the GapEncoder, MinHashEncoder, DatetimeEncoder, etc., and should be done (but is not at the moment) by the TableVectorizer for the cleaning/preprocessing steps it applies, such as dtype conversions, replacing "N/A" with nulls, counting categories, etc.
Moreover, applying a transformation to a subset of columns is only possible in the TableVectorizer or scikit-learn ColumnTransformer.
Both put all columns (including those to which we apply "passthrough") into a numpy array (which is then converted back to a dataframe).
Among other things this means all outputs are converted to a common dtype (and we get an error if that is not possible).
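A small pandas illustration of that dtype coercion (the column names and values are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"price": [1.5, 2.0], "city": ["Paris", "London"]})

# Stacking heterogeneous columns into a single numpy array forces a
# common dtype; here everything is upcast to `object`.
as_array = df.to_numpy()
print(as_array.dtype)       # object
print(df.dtypes["price"])   # float64: the dataframe kept the numeric dtype
```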
To make it easier we would need:
Single-column transformers
(we can't call them "column transformers" because there is already something called `ColumnTransformer` in scikit-learn; name suggestions welcome). They take 1D input, i.e. a single dataframe column. An example of that kind of object that already exists in skrub is the (private) `GapEncoderColumn`; the `TfidfVectorizer` in scikit-learn also only accepts 1D inputs. This PR would make the use of 1D transformers more systematic.
Their API looks like
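A sketch of such an API, built around the `__single_column_transformer__` marker described below (the example transformer, its name, and its behaviour are invented for illustration):

```python
import pandas as pd

class UpperCase:
    # Marker from this PR: signals the transformer expects a single
    # column (a 1D Series), not a 2D array or dataframe.
    __single_column_transformer__ = True

    def fit_transform(self, column):
        if not pd.api.types.is_string_dtype(column):
            # Reject columns this transformation does not apply to.
            return NotImplemented
        self.column_name_ = column.name  # per-column fitted state
        return self.transform(column)

    def transform(self, column):
        return column.str.upper()

letters = pd.Series(["a", "b"], name="letters")
out = UpperCase().fit_transform(letters)
print(out.tolist())                                  # ['A', 'B']
print(UpperCase().fit_transform(pd.Series([1, 2])))  # NotImplemented
```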
The `__single_column_transformer__` attribute indicates that the transformer operates on a single column; that marker could be replaced by a scikit-learn tag.
Why the NotImplemented option
The reason for allowing a transformer to reject a column with `NotImplemented` is that sometimes we can only find out whether the transformation is meaningful for a column by attempting to actually perform it. For example, in the TableVectorizer we don't know if a string column should be converted to datetime before trying to detect a datetime format. This produces a chicken-and-egg situation: the DatetimeEncoder must be applied only to columns that contain datetimes, but it must be applied to a column to discover whether it contains datetimes.
It is not done yet in this PR, but there could be a "strict" mode (which could be the default) in which we know we are passing a valid column, so rejecting it is not an option and results in an error.
Rather than returning `NotImplemented`, another option for rejecting a column could be to raise a specific exception, maybe called `RejectColumn`.
The advantage is that it would make it easier to implement the strict mode at the level of the `OnEachColumn` transformer described below, rather than in each 1D transformer, while easily retaining a full traceback and all the information about why the transformation failed.
Edit: after discussion with @glemaitre and @ogrisel we confirmed the choice of raising an exception rather than returning `NotImplemented`.
The drawback is that another package wanting to implement a 1D transformer would need to depend on skrub and import the exception type from skrub.
I think the advantage outweighs the drawback, so I'll change it (now or if/when we add the strict mode).
One such transformer added in this PR is `ToDatetime`.
`ToDatetime.fit_transform` accepts a single column and tries to convert it to a Datetime column.
If it can, it returns the converted column and remembers the string format it detected for that one column.
If it cannot (e.g. because the strings in a column do not represent dates), it returns `NotImplemented` to indicate that `ToDatetime` is not a transformation that should apply to the provided column.
`EncodeDatetime` works similarly, except that `fit_transform` returns a list of columns rather than just one.
Still, it accepts a single column, stores state for a single column, and returns some columns (year, month, day, etc.) that will be collected into a dataframe together with the output of other 1D transformers.
It is not done in this PR, but it would be possible to allow 1D transformers to accept "lazy columns" (a LazyFrame + column name) and to return lists of polars expressions rather than lists of columns.
This would enable lazy transforms (POC in this branch).
For some columns, we will only discover if the conversion to datetime makes sense once we try to convert them, so the transformer is allowed to reject a column.
This, plus separating datetime casting from encoding, simplifies the TableVectorizer's handling of datetimes.
As mentioned above we can add a "strict" mode where rejecting a column is not allowed.
`EncodeDatetime` is similar, but instead of just one output column it returns a list of columns:
OnEachColumn
To use the 1D transformers in a scikit-learn pipeline we need a (regular) scikit-learn transformer that applies a 1D-transformer to each column (possibly in parallel) and collects the results in a dataframe to be fed down the rest of the pipeline.
`OnEachColumn` provides that, and all the logic for mapping transformers to columns, storing them, handling column names, etc. can be handled in one place.
Columns that the univariate transformer rejects are passed through unchanged.
It is possible to restrict the columns on which the transformation is attempted:
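A toy stand-in for `OnEachColumn` (not skrub's implementation; the dataframe contents and the `Upper` transformer are invented) showing the `cols` restriction:

```python
import pandas as pd

class ToyOnEachColumn:
    """Toy stand-in for OnEachColumn: applies a 1D transformer to each
    column in `cols`, passing the other columns through unchanged."""

    def __init__(self, transformer, cols):
        self.transformer, self.cols = transformer, cols

    def fit_transform(self, df):
        self.transformers_ = {}  # one fitted transformer per column
        out = {}
        for name in df.columns:
            col = df[name]
            if name in self.cols:
                # Toy cloning; real code would use sklearn.base.clone.
                t = type(self.transformer)()
                result = t.fit_transform(col)
                if result is not NotImplemented:
                    self.transformers_[name] = t
                    out[name] = result
                    continue
            out[name] = col  # rejected or not selected: passthrough
        return pd.DataFrame(out)

class Upper:
    __single_column_transformer__ = True
    def fit_transform(self, column):
        return column.str.upper()

df = pd.DataFrame({"C": ["a", "b"], "D": ["c", "d"]})
out = ToyOnEachColumn(Upper(), cols=["C"]).fit_transform(df)
print(out)  # "C" upper-cased, "D" unchanged
```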
Above only "C" has been transformed, not "D".
We can inspect which columns were transformed:
Combining several transformations:
This PR also adds more convenient ways of selecting columns than the `cols=["B", "E"]` above, which are described later.
OnEachColumn also renames duplicate columns.
Applying a 2D transformation to a subset of columns
Sometimes we want to apply a transformation to a subset of columns, but it is still a 2D -> 2D transformation.
For example, we may want to `SelectKBest` some columns but enforce that some others should be kept.
Another transformer, `OnColumnSelection`, provides that.
Here, among "E" and "F", "F" has been selected and "E" has been dropped.
The other columns (outside of `cols`) have been passed through unchanged.
Note that unlike the current TableVectorizer or ColumnTransformer, this allows maintaining heterogeneous dtypes in the dataframe, and in particular keeping the original dtype of the columns that are passed through.
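A toy stand-in for this idea (not skrub's `OnColumnSelection`; the helper, dataframe, and target are invented) using scikit-learn's `SelectKBest` on a column subset:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

def apply_to_subset(selector, df, cols, y):
    # Fit the 2D transformer on the selected columns only.
    kept_values = selector.fit_transform(df[cols], y)
    kept = [c for c, keep in zip(cols, selector.get_support()) if keep]
    # Pass the other columns through with their original dtypes.
    out = df.drop(columns=cols)
    out[kept] = kept_values
    return out

df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],   # passed through as-is
    "E": [1.0, 0.9, 1.0, 1.1],      # weak predictor
    "F": [0.1, 0.0, 1.0, 1.1],      # strong predictor
})
y = [0, 0, 1, 1]
out = apply_to_subset(SelectKBest(f_classif, k=1), df, ["E", "F"], y)
print(list(out.columns))   # ['name', 'F']: "E" was dropped
print(out.dtypes["name"])  # object: original dtype preserved
```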
Selecting columns
The `skrub._selectors` module provides convenient ways to select the columns to which the transformations are applied, similar to `polars.selectors` and `ibis.selectors`.
The usual short-circuit rules for operators apply, so for example `s.categorical() & s.cardinality_below(2)` will not compute the cardinality of non-categorical columns.
A simple column name or list of column names is always accepted where a selector is expected:
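A toy illustration of the short-circuit behaviour (not skrub's actual selector implementation; the names mimic those above, and the call-recording list is added just to show which columns get counted):

```python
import pandas as pd

class Selector:
    """Toy selector wrapping a per-column predicate."""
    def __init__(self, predicate):
        self.predicate = predicate

    def __and__(self, other):
        # Python's `and` short-circuits, so `other` is only evaluated
        # on columns that already matched `self`.
        return Selector(lambda col: self.predicate(col) and other.predicate(col))

    def select(self, df):
        return [name for name in df.columns if self.predicate(df[name])]

def categorical():
    return Selector(lambda col: isinstance(col.dtype, pd.CategoricalDtype))

checked = []  # records which columns had their cardinality computed
def cardinality_below(n):
    def predicate(col):
        checked.append(col.name)
        return col.nunique() < n
    return Selector(predicate)

df = pd.DataFrame({
    "city": pd.Series(["Paris", "Paris", "Lyon"], dtype="category"),
    "temp": [21.0, 22.5, 19.0],
})
result = (categorical() & cardinality_below(3)).select(df)
print(result)   # ['city']
print(checked)  # ['city']: cardinality was never computed for 'temp'
```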
For convenience, the selectors also have a `use` method that builds the transformer for us.
So `OnEachColumn(EncodeDatetime(), cols=s.any_date()).fit_transform(df)` can also be written as:
By default, `use` will produce either an `OnEachColumn` or an `OnColumnSubset` depending on the type of the transformer we pass to it, but we can also force the choice.
Therefore `use` avoids importing these transformers, choosing between them, and manually constructing them, and it also makes the construction of a pipeline look more similar to a polars `select` or `with_columns` call, in which a transformation starts from the selector and then specifies an operation.
In any case, the `use` method is just for convenience, so we could remove it if we find it more confusing than helpful.
Summary
Adding columnwise transformations could help simplify the code of the TableVectorizer while fixing the issue of inconsistent transforms.
It can also help simplify other transformers that apply column-wise transformations such as the DatetimeEncoder by allowing them to focus only on transforming one column and offloading the task of mapping the transformation to relevant columns.
It avoids relying on the ColumnTransformer and hstacking all outputs into a numpy array.
If any of it is ever made public it can help advanced users build custom pipelines without being tied to the categorization of columns defined by the TableVectorizer (numeric, high-cardinality, low-cardinality, datetime).