
Cache final transformer in pipeline with memory setting #23112

Open
bmreiniger opened this issue Apr 11, 2022 · 9 comments · May be fixed by #26008

Comments

@bmreiniger (Contributor) commented:

Describe the bug

When setting the memory parameter of a transformer Pipeline (i.e., one whose last step is a transformer), the final transformer is not cached.

Discovered at https://stackoverflow.com/q/71812869/10495893.

Steps/Code to Reproduce

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
import time

class Test(BaseEstimator, TransformerMixin):
    def __init__(self, col):
        self.col = col

    def fit(self, X, y=None):
        print(self.col)
        return self

    def transform(self, X, y=None):
        for t in range(5):
            # just to slow it down / check caching.
            print(".")
            time.sleep(1)
        #print(self.col)
        return X

pipeline = Pipeline(
    [
        ("test", Test(col="this_column")),
        ("test2", Test(col="that_column")),
    ],
    memory="tmp/cache",
)

pipeline.fit(None)
pipeline.fit(None)
pipeline.fit(None)

Expected Results

this_column
.
.
.
.
.
that_column

Actual Results

this_column
.
.
.
.
.
that_column
that_column
that_column

Versions

System:
    python: 3.7.13 (default, Mar 16 2022, 17:37:17)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 21.1.3
   setuptools: 57.4.0
      sklearn: 1.0.2
        numpy: 1.21.5
        scipy: 1.4.1
       Cython: 0.29.28
       pandas: 1.3.5
   matplotlib: 3.2.2
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True
@bmreiniger bmreiniger added Bug Needs Triage Issue requires triage labels Apr 11, 2022
@thomasjpfan thomasjpfan added module:pipeline and removed Needs Triage Issue requires triage labels Apr 11, 2022
@thomasjpfan (Member) commented:

The original use case for caching is for the final step to be a classifier or regressor, where all previous steps are cached. As for this issue, the final step is a transformer, so technically it should be cached according to pipeline's docstring:

Used to cache the fitted transformers of the pipeline.

If we strictly follow this, then we would have caching behavior depending on what the final step is, which can be confusing semantically. The alternative solution is to update the docs and say that we only cache steps[:-1].
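For illustration, here is a minimal, simplified sketch of why only the non-final steps currently benefit from memory. This is not scikit-learn's actual source; _fit_transform_one_sketch and fit_like_pipeline are hypothetical names standing in for the internal helpers. The cached wrapper is only applied while iterating over steps[:-1], and the final step is fitted outside the cache:

```python
from joblib import Memory

def _fit_transform_one_sketch(transformer, X):
    # hypothetical stand-in for scikit-learn's internal _fit_transform_one:
    # fit the step, transform X, and return both
    return transformer.fit(X).transform(X), transformer

def fit_like_pipeline(steps, X, memory):
    # the cached wrapper covers only steps[:-1]
    fit_transform_one_cached = memory.cache(_fit_transform_one_sketch)
    for _name, transformer in steps[:-1]:
        X, transformer = fit_transform_one_cached(transformer, X)
    # the final step is fitted directly, outside the cache,
    # which is why it re-runs on every Pipeline.fit() call
    steps[-1][1].fit(X)
    return X
```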

What do you think @jnothman ?

@glemaitre (Member) commented:

We can first update the documentation to make it clear that we cache up to the last step (not included).

We can later think of an enhancement where we can cache all transformers of a pipeline.

windiana42 added a commit to windiana42/scikit-learn that referenced this issue Mar 28, 2023
In compose.rst and pipeline.py there are three places where pipeline caching is explained. An extra sentence was added to each, noting that the last step is currently never cached. In one place it is mentioned that this might change in the future.
@windiana42 (Contributor) commented:

Discussion with @glemaitre suggests that if the last step of a pipeline is a transformer, then fit/transform/fit_transform methods of Pipeline class should also cache the fit/transform/fit_transform call of the last step.

@windiana42 (Contributor) commented:

@glemaitre suggested that detecting transformers is most robust with isinstance(step, TransformerMixin), as opposed to duck-typing checks such as hasattr(step, "fit_transform") or hasattr(step, "transform").
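A short sketch of the suggested check (the helper name final_step_is_transformer is illustrative, not part of scikit-learn):

```python
from sklearn.base import TransformerMixin

def final_step_is_transformer(step) -> bool:
    # isinstance is more robust than duck typing: an estimator could
    # happen to define a method named "transform" without actually
    # being a transformer, whereas inheriting from TransformerMixin
    # is an explicit declaration of the transformer contract.
    return isinstance(step, TransformerMixin)
```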

@windiana42 (Contributor) commented:

I just realized that caching is currently also supported by Pipeline.fit_predict, but not by Pipeline.transform. So I will prepare the first solution for fit and fit_transform only.

windiana42 added a commit to windiana42/scikit-learn that referenced this issue Mar 28, 2023
@windiana42 windiana42 linked a pull request Mar 28, 2023 that will close this issue
@windiana42 (Contributor) commented:

This Draft PR illustrates the idea before implementing test code: https://github.com/scikit-learn/scikit-learn/pull/26008/files

windiana42 added a commit to windiana42/scikit-learn that referenced this issue Mar 29, 2023
@windiana42 (Contributor) commented:

For Pipeline.transform() I suggest a bigger change:

  1. the Pipeline object should remember whether Pipeline.fit() was called
  2. Pipeline.transform() throws an error if Pipeline.fit() was not called, and only executes step.transform() of the last step, because all previous steps have already been fit-transformed within Pipeline.fit()
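A hedged sketch of the first point above, the remembered fitted state. The class name FittedStateSketch and its structure are illustrative, not scikit-learn code; for simplicity, transform here still runs every step rather than only the last one:

```python
from sklearn.exceptions import NotFittedError

class FittedStateSketch:
    """Toy pipeline illustrating the proposed fitted-state check."""

    def __init__(self, steps):
        self.steps = steps
        self._is_fitted = False  # set to True by fit()

    def fit(self, X, y=None):
        for _name, step in self.steps:
            X = step.fit(X, y).transform(X)
        self._is_fitted = True
        return self

    def transform(self, X):
        # the proposal: refuse to transform if fit() was never called
        if not self._is_fitted:
            raise NotFittedError("Call fit() before transform().")
        for _name, step in self.steps:
            X = step.transform(X)
        return X
```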

@windiana42 (Contributor) commented:

The main problem with caching the last transformer step in Pipeline.transform() is that in Pipeline.fit_transform() we only cache _fit_transform_one as a combined operation. One conclusion could be that we should always cache fit() and transform() separately for the last transformer.
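One possible shape of that conclusion, as a hedged sketch: cache fit() and transform() of the final step as two separate memoized calls instead of the combined _fit_transform_one, so a later Pipeline.transform() could hit the cached transform result on its own. The helper names _fit_one, _transform_one, and fit_transform_final_step are hypothetical:

```python
from joblib import Memory

def _fit_one(transformer, X, y=None):
    # hypothetical cacheable helper: fit a single step
    return transformer.fit(X, y)

def _transform_one(transformer, X):
    # hypothetical cacheable helper: transform with a single step
    return transformer.transform(X)

def fit_transform_final_step(transformer, X, memory):
    # cache the two operations separately, rather than as one
    # combined fit_transform call, so the transform result can be
    # reused independently of the fit
    fitted = memory.cache(_fit_one)(transformer, X)
    Xt = memory.cache(_transform_one)(fitted, X)
    return Xt, fitted
```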

@windiana42 (Contributor) commented:

If you agree we are generally on the right track here, I would add test cases to the draft PR.

4 participants