WIP: transformers that modify y #2

Merged (8 commits), Dec 27, 2018

slep001/discussion.rst: 383 additions, 0 deletions

=====================================
Transformers that modify their target
=====================================

.. topic:: **Summary**

   Transformers implement::

       self = estimator.fit(X, y=None)
       X_transform = estimator.transform(X)
       estimator.fit(X, y).transform(X) == estimator.fit_transform(X, y)

   Within a chain or processing sequence of estimators, many use cases
   require modifying y. How do we support this?

   Many of these things can be done "by hand". The question is how to
   avoid writing custom connecting logic.

.. sectnum::

.. contents:: Table of contents
   :depth: 2

Rationale
=========

Summary of the contract of transformers
----------------------------------------

* `.transform(...)` returns a data matrix X

* `.transform(...)` returns one feature vector for each sample of the input

* `.fit_transform(...)` is the same as `.fit(...).transform(...)`

Examples of use cases targeted
------------------------------

#. Over-sampling:

   #. Class rebalancing: over-sampling the minority class in an
      imbalanced dataset
   #. Data enhancement (nudging images, for instance)

#. Under-sampling:

   #. Stateless under-sampling: take one sample out of two
   #. Stateful under-sampling: apply clustering and transform to cluster
      centers
   #. Coresets: return a smaller number of samples and associated sample
      weights

#. Outlier detection:

   #. Remove outliers from the train set
   #. Create a special class 'y' for outliers

#. Completing y:

   #. Missing-data imputation on y
   #. Semi-supervised learning (related to the above)

#. Data loading / conversion:

   #. Pandas in => (X, y) out
   #. Images in => patches out
   #. Filename in => (X, y) with multiple samples (very useful in
      combination with online learning)
   #. Database query => (X, y) out

#. Aggregate statistics over multiple samples:

   #. Windowing-like functions on time series

In a sense, these sit awkwardly with scikit-learn's cross-validation
API, which knows nothing about sample structure. But the refactor of the
CV API is really helping in this regard.

____

These use cases pretty much require breaking the contract of the
transformer, as detailed above.

The intuition driving this enhancement proposal is that as
data-processing pipelines become richer and datasets grow, the use cases
above become more and more important.

Enhancements proposed
=======================

Option A: meta-estimators
---------------------------

Proposal
........

This option advocates that any transformer-like use case that wants to
modify y or the number of samples should not be a transformer but a
specific meta-estimator. A coreset object would thus look like this:

* From the user perspective::

      from sklearn.sample_shrink import BirchCoreSet
      from sklearn.ensemble import RandomForestClassifier
      estimator = BirchCoreSet(RandomForestClassifier())

* From the developer perspective::

      from sklearn.cluster import Birch

      class BirchCoreSet(Birch):

          def fit(self, X, y=None):
              # Reduce X to the Birch subcluster centers
              super(BirchCoreSet, self).fit(X)
              X_red = self.subcluster_centers_
              # Note: the logic here is incomplete, as y would also
              # need to be reduced to match X_red
              self.estimator_.fit(X_red)
              return self

Benefits
.........

#. No change to the existing API

#. The meta-estimator pattern is very powerful, and pretty much anything
is possible.

Limitations
............

The different limitations listed below are variants of the same
conceptual difficulty.

#. It is hard to have mental models and guarantees of what a
   meta-estimator does, as it is by definition very versatile.

   This is a problem both for the beginner, who needs to learn
   meta-estimators on an almost case-by-case basis, and for the advanced
   user, who needs to maintain a set of case-specific code.

#. The "estimator heap" problem.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "chain" instead of heap and stack?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"chain" is linear. But point taken, I am rephrasing a bit this paragraph.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meta-estimators I had in mind are always linear, and the new objects are always linear. FeatureUnion is the only thing that can make it a DAG, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference between fit time and predict time in the case of things like the previously discussed undersampler effectively create a conditional DAG, depending on whether you are at fit or predict time


Here the word heap is used to denote the multiple pipelines and
meta-estimators. It corresponds to what we would naturally call a
"data processing pipeline", but we use "heap" to avoid confusion with
the pipeline object.

Heaps combining many steps of pipelines and meta-estimators become
very hard to inspect and manipulate, both for the user, and for
pipeline-management (aka "heap-management") code. Currently, these
difficulties are mostly in user code, so we don't see them too much in
scikit-learn. Here are concrete examples

#. Trying to retrieve coefficients from a model estimated in a
   "heap". E.g.:

   * you know there is a lasso in your heap and you want to get its
     coef (in whatever space that resides?):
     `pipeline.named_steps['lasso'].coef_` is possible.

   * you want to retrieve the coef of the last step:
     `pipeline.steps[-1][1].coef_` is possible.

   With meta-estimators this is tricky. Solving this problem requires
   https://github.com/scikit-learn/scikit-learn/issues/2562#issuecomment-27543186
   (this enhancement proposal is not advocating solving the problem
   above, but points it out as an illustration).

#. DaskLearn has modified the logic of the pipeline to expose it as a
   computation graph. The reason it was relatively easy to do is that
   there was mostly one object to modify to do the dispatching: the
   Pipeline object.

#. A future, out-of-core "conductor" object that fits a "heap" out of
   core by connecting it to a data store would need a representation of
   the heap. For instance, when chaining random projections with Birch
   coresets and finally SGD, the user would need to specify that random
   projections are stateless, Birch needs one pass over the data, and
   SGD a few. Given this information, the conductor could orchestrate
   pulling the data from the data source and sending it to the various
   steps. Such an object is much harder to implement if the various
   steps are combined in a heap. Note that the scikit-learn pipeline can
   only implement a linear, "chain"-like set of processing steps. For
   instance, a one-vs-all scheme can never be implemented in a
   scikit-learn pipeline.

   This is not a problem in non-out-of-core settings, in the sense that
   the BirchCoreSet meta-estimator would take care of doing a pass over
   the data before feeding it to its sub-estimator.

In conclusion, meta-estimators are harder to comprehend (problem 1) and
write (problem 2).

**Member:** I would say "interface with". I think they are much easier
to write.

**Member Author:** Not clear. They need to use the delegate decorators.
Most people in machine learning have no idea what delegation is, or what
"delegate" means.

**Member:** Point taken.

That said, we will never get rid of meta-estimators. It is a very
powerful pattern. The discussion here is about extending the estimator
API a bit, to have a less pressing need for meta-estimators.

Option B: transformer-like objects that modify y
------------------------------------------------

.. topic:: **Two variants**

   1. Changing the semantics of transformers to modify y and return
      something more complex than a data matrix X

   2. Introducing new methods (and a new type of object)

   There is an emerging consensus for option 2.

**Member:** I think there is 1.a) add a new method to transformers and
1.b) replace transform.

**Member Author:** Well, given that we are pretty lax about inheritance,
and given that the mandatory / optional methods of the new interface are
still not figured out, I think that 1.a) is similar enough to 2. I have
edited the text to make it open to 1.a).

.. topic:: **`transform` modifying y**

   Variant 1 above could be implemented by allowing transform to modify
   y. However, the return signature of transform would then be unclear.

   Do we modify all transformers to return a y (with y=None for
   unsupervised transformers that are not given y)? This seems likely to
   lead to code full of surprises that is difficult to maintain, from
   the user's perspective.

   We would also lose the contract that the number of samples is
   unchanged by a transformer. This contract is very useful (e.g. for
   model selection: measuring the error for each sample).

   For these reasons, we feel new methods are necessary.

Proposal
.........

Introduce a `TransModifier` type of object with the following API
(names are discussed below):

* `X_new, y_new = estimator.fit_modify(X, y)`

* `X_new, y_new = estimator.trans_modify(X, y)`

Or:

* `X_new, y_new, sample_props = estimator.fit_modify(X, y)`

* `X_new, y_new, sample_props = estimator.trans_modify(X, y)`

Contracts (these are weaker contracts than the transformer's):

* Neither `fit_modify` nor `trans_modify` is guaranteed to keep the
  number of samples unchanged.

* `fit_modify` may not exist (questionable)
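
To make the proposed contract concrete, below is a minimal sketch of an
outlier-filtering TransModifier. The class name is hypothetical, the
two-return-value variant of the API is assumed, and IsolationForest is
just one possible outlier detector::

    import numpy as np
    from sklearn.base import BaseEstimator
    from sklearn.ensemble import IsolationForest

    class OutlierFilter(BaseEstimator):
        """Drop samples flagged as outliers (a TransModifier sketch)."""

        def fit_modify(self, X, y):
            # Fit the outlier model, then filter the training set
            self.detector_ = IsolationForest().fit(X)
            mask = self.detector_.predict(X) == 1  # +1 marks inliers
            return X[mask], np.asarray(y)[mask]

        def trans_modify(self, X, y):
            # Reuse the fitted detector on new data
            mask = self.detector_.predict(X) == 1
            return X[mask], np.asarray(y)[mask]

Note that `fit_modify` changes the number of samples at train time,
while `trans_modify` applies the same filtering to new data.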

Design questions and difficulties
..................................

Should there be a fit method?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For such estimators, it may not make sense to call fit rather than
fit_modify (for instance, for a coreset).


How does a pipeline use such an object?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In particular at test time?

#. Should there be a transform method used at test time?

#. What to do with objects that implement both `transform` and
`trans_modify`?

**Creating y in a pipeline makes error measurement harder.** For some
use cases, test time needs to modify the number of samples (for
instance, data loading from a file). However, these are by construction
a problem for e.g. `cross_val_score`, which, in supervised settings,
expects a y_true. Indeed, the problem is the following:

- To measure an error, we need y_true at the level of
  `cross_val_score` or `GridSearchCV`.

- y_true is created inside the pipeline, by the data-loading object.

It is thus unclear that the data-loading use cases can be fully
integrated in the CV framework (which is not an argument against
enabling them).

|

For our CV framework, we need the number of samples to remain
constant: for each y_pred, we need a corresponding y_true.

|

**Proposal 1**: use `transform` at `predict` time.

#. Objects implementing both `transform` and `trans_modify` are valid.

#. The pipeline's `predict` method uses `transform` on its intermediate
   steps.

The different semantics of `trans_modify` and `transform` can be very
useful, as `transform` leaves the notion of sample, and hence `y_true`,
untouched.
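
A minimal sketch of what this could look like inside a pipeline (the
`ModifyingPipeline` class and its internals are simplified assumptions
for illustration, not the actual scikit-learn Pipeline, and the
two-return-value API variant is assumed)::

    class ModifyingPipeline:
        """Sketch of a pipeline aware of trans_modify."""

        def __init__(self, steps, final_estimator):
            self.steps = steps
            self.final_estimator = final_estimator

        def fit(self, X, y):
            # At fit time, each intermediate step may change X, y and
            # the number of samples through fit_modify
            for step in self.steps:
                X, y = step.fit_modify(X, y)
            self.final_estimator.fit(X, y)
            return self

        def predict(self, X):
            # At predict time, plain transform is used, so the number
            # of samples (and hence alignment with y_true) is preserved
            for step in self.steps:
                X = step.transform(X)
            return self.final_estimator.predict(X)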

|

**Proposal 2**: modify the scoring framework.

One option is to modify the scoring framework to be able to handle these
cases: the scoring then gets the output of the chain of `trans_modify`
calls for y. This would rely on clever code in the `score` method of the
pipeline. Maybe it should be controlled by a keyword argument on the
pipeline, and turned off by default.
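
Continuing the hypothetical `ModifyingPipeline` sketch above, such a
`score` method might look as follows (`accuracy_score` stands in for
whatever scoring the pipeline actually uses)::

    from sklearn.metrics import accuracy_score

    def score(self, X, y):
        # Push (X, y) through the chain of trans_modify calls, so that
        # y stays aligned with the samples seen by the last step
        for step in self.steps:
            X, y = step.trans_modify(X, y)
        y_pred = self.final_estimator.predict(X)
        return accuracy_score(y, y_pred)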


How do we deal with sample weights and other sample properties?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This discussion feeds into the `sample_props` discussion (which should
happen in a separate enhancement proposal).

The suggestion is to have the sample properties as a dictionary of
arrays `sample_props`.

**Example use case** useful for thinking about sample properties:
coresets: given (X, y), return (X_new, y_new, weights) with a much
smaller number of samples.

This example is interesting because it shows that TransModifiers can
legitimately create sample properties.

**Proposed solution**:

TransModifiers always return (X_new, y_new, sample_props) where
sample_props can be an empty dictionary.
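
As an illustration, a coreset-style TransModifier could return cluster
centers as the new samples and cluster sizes as weights. A minimal
sketch, assuming integer class labels, using MiniBatchKMeans as a
stand-in for Birch, and following the three-return-value API above::

    import numpy as np
    from sklearn.base import BaseEstimator
    from sklearn.cluster import MiniBatchKMeans

    class KMeansCoreSet(BaseEstimator):
        """Reduce (X, y) to weighted cluster centers (a sketch)."""

        def __init__(self, n_clusters=100):
            self.n_clusters = n_clusters

        def fit_modify(self, X, y):
            km = MiniBatchKMeans(n_clusters=self.n_clusters).fit(X)
            labels = km.labels_
            X_new = km.cluster_centers_
            # Propagate y by majority vote within each cluster
            # (assumes non-empty clusters and integer class labels)
            y = np.asarray(y)
            y_new = np.array([np.bincount(y[labels == k]).argmax()
                              for k in range(self.n_clusters)])
            # Cluster sizes become the sample weights
            weights = np.bincount(labels, minlength=self.n_clusters)
            return X_new, y_new, {'sample_weight': weights}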


Naming suggestions
..................

In terms of name choice, the rationale would be to have method names
that are close to 'fit' and 'transform', to make discoverability and
readability of the code easier.

* Name of the object (referred to in the docs):

  - TransModifier
  - TransformPipe

**Member:** Tube? Chute? Hose? ^^

**Member:** Processor? Transmuter? Mutator? Transfigurer?

**Member Author:** PipeTransformer?

* Method to fit and apply on training data:

  - fit_modify
  - fit_pipe
  - pipe_fit
  - fit_filter

* Method to apply on new data:

  - trans_modify
  - transform_pipe
  - pipe_transform

**Member:** fit_process and process? If we don't have a fit function, we
actually don't need to worry about the breach-of-contract stuff.

**Member:** On second thought, users might expect these to do the same
thing::

    thing = Thing()
    X_1, y_1, sample_props_1 = thing.fit_pipe(X, y, sample_props)
    X_2, y_2, sample_props_2 = thing.transform_pipe(X, y, sample_props)

And we should choose naming such that this is not suggested.

**Member:** The thesaurus also suggests "alter", "handle", "treat",
"transmute", "compute", "run", "propagate", "mutate", "transmogrify",
"transfigure", and "replace" (that one is actually interesting!) (also
"convert", which I don't like).

**Member Author:** I would like the 'transform-like' method to contain
the word transform (or at least 'trans-something') for discoverability.
And indeed, if we don't have a fit function, we don't need to worry
about the breach-of-contract stuff.

**Member Author:** I am not sure I am following what you are saying
above.

I've been thinking about names. I think that the prefix that captures
the idea is 'trans'. It has a sexual connotation that we have to be
careful about. But maybe there is something to be found here, around it.

Brainstorming mode:

- find a new name for the "transform-like" method around this idea:
  translate, transmutate, transcript, transfix, transcend, transfer,
  transmod (for the 'fit'-like method: the same with 'fit\_' in front)

- fit_modify / trans_modify

I somewhat like this last suggestion. I find that 'TransModifier' could
be an object name. I also find that it conveys the idea that it is more
violent than 'transform', but in the same vein.

Benefits
.........

* Many use cases listed above can be implemented in scikit-learn without
  a meta-estimator, and thus will be easy to use (e.g. in a pipeline).
  Many of these are patterns that we should be encouraging.

**Member:** Easier to interface with? I guess easier to use with
GridSearchCV is true...

**Member Author:** If the pipeline hides these connections away from the
user (provided that we can come up with the right semantics), it should
be easier to use: just a question of make_pipeline.

* The API being more versatile, it will be easier to create
  application-specific code or framework wrappers (à la DaskLearn) that
  are scikit-learn compatible, and can thus be used with the
  parameter-selection framework. This will be especially true for the
  ETL (extract, transform, and load) pattern.

Limitations
............

* Introducing new methods, and a new type of estimator object: this
  enhancement would probably introduce a total of **3 new methods**,
  `fit_modify`, `trans_modify`, and `partial_fit_modify`.

* It cannot solve all possible cases, so we will not get rid of
  meta-estimators.

TODO
====

* Implement an example doing outlier filtering

* Implement an example doing data downsampling