
Cache final transformer in pipeline with memory setting #23112

Open
bmreiniger opened this issue Apr 11, 2022 · 9 comments · May be fixed by #26008

Comments

@bmreiniger (Contributor) commented:

Describe the bug

When setting the memory parameter of a transformer Pipeline (i.e., one whose last step is a transformer), the final transformer is not cached.

Discovered at https://stackoverflow.com/q/71812869/10495893.

Steps/Code to Reproduce

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
import time

class Test(BaseEstimator, TransformerMixin):
    def __init__(self, col):
        self.col = col

    def fit(self, X, y=None):
        print(self.col)
        return self

    def transform(self, X, y=None):
        for t in range(5):
            # just to slow it down / check caching.
            print(".")
            time.sleep(1)
        #print(self.col)
        return X

pipeline = Pipeline(
    [
        ("test", Test(col="this_column")),
        ("test2", Test(col="that_column")),
    ],
    memory="tmp/cache",
)

pipeline.fit(None)
pipeline.fit(None)
pipeline.fit(None)

Expected Results

this_column
.
.
.
.
.
that_column

Actual Results

this_column
.
.
.
.
.
that_column
that_column
that_column

Versions

System:
    python: 3.7.13 (default, Mar 16 2022, 17:37:17)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 21.1.3
   setuptools: 57.4.0
      sklearn: 1.0.2
        numpy: 1.21.5
        scipy: 1.4.1
       Cython: 0.29.28
       pandas: 1.3.5
   matplotlib: 3.2.2
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True
@bmreiniger bmreiniger added Bug Needs Triage Issue requires triage labels Apr 11, 2022
@thomasjpfan thomasjpfan added module:pipeline and removed Needs Triage Issue requires triage labels Apr 11, 2022
@thomasjpfan (Member) commented:

The original use case for caching is for the final step to be a classifier or regressor, where all previous steps are cached. As for this issue, the final step is a transformer, so technically it should be cached according to pipeline's docstring:

Used to cache the fitted transformers of the pipeline.

If we strictly follow this, then we would have caching behavior depending on what the final step is, which can be confusing semantically. The alternative solution is to update the docs and say that we only cache steps[:-1].
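For illustration, here is a minimal, simplified sketch of why only the non-final steps currently benefit from memory. This is not scikit-learn's actual source; _fit_transform_one_sketch and fit_like_pipeline are hypothetical names standing in for the internal helpers. The cached wrapper is only applied while iterating over steps[:-1], and the final step is fitted outside the cache:

```python
from joblib import Memory

def _fit_transform_one_sketch(transformer, X):
    # hypothetical stand-in for scikit-learn's internal _fit_transform_one:
    # fit the step, transform X, and return both
    return transformer.fit(X).transform(X), transformer

def fit_like_pipeline(steps, X, memory):
    # the cached wrapper covers only steps[:-1]
    fit_transform_one_cached = memory.cache(_fit_transform_one_sketch)
    for _name, transformer in steps[:-1]:
        X, transformer = fit_transform_one_cached(transformer, X)
    # the final step is fitted directly, outside the cache,
    # which is why it re-runs on every Pipeline.fit() call
    steps[-1][1].fit(X)
    return X
```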

What do you think @jnothman ?

@glemaitre (Member) commented:

We can first update the documentation to make it clear that we cache up to the last step (not included).

We can later think of an enhancement where we can cache all transformers of a pipeline.

windiana42 added a commit to windiana42/scikit-learn that referenced this issue Mar 28, 2023
In compose.rst and pipeline.py there are three places where pipeline caching is explained. An extra sentence was added to each, noting that the last step is currently never cached. In one place it is mentioned that this might change in the future.
@windiana42 (Contributor) commented:

Discussion with @glemaitre suggests that if the last step of a pipeline is a transformer, then fit/transform/fit_transform methods of Pipeline class should also cache the fit/transform/fit_transform call of the last step.

@windiana42 (Contributor) commented:

@glemaitre suggested that detecting transformers is most robust with isinstance(step, TransformerMixin), as opposed to duck-typing checks such as hasattr(step, "fit_transform") or hasattr(step, "transform").
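A short sketch of the suggested check (the helper name final_step_is_transformer is illustrative, not part of scikit-learn):

```python
from sklearn.base import TransformerMixin

def final_step_is_transformer(step) -> bool:
    # isinstance is more robust than duck typing: an estimator could
    # happen to define a method named "transform" without actually
    # being a transformer, whereas inheriting from TransformerMixin
    # is an explicit declaration of the transformer contract.
    return isinstance(step, TransformerMixin)
```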

@windiana42 (Contributor) commented:

I just realized that caching is currently also supported by Pipeline.fit_predict, but not by Pipeline.transform. So I will prepare the first solution for fit and fit_transform only.

windiana42 added a commit to windiana42/scikit-learn that referenced this issue Mar 28, 2023
@windiana42 windiana42 linked a pull request Mar 28, 2023 that will close this issue
@windiana42 (Contributor) commented:

This Draft PR illustrates the idea before implementing test code: https://github.com/scikit-learn/scikit-learn/pull/26008/files

windiana42 added a commit to windiana42/scikit-learn that referenced this issue Mar 29, 2023
@windiana42 (Contributor) commented:

For Pipeline.transform() I suggest a bigger change:

  1. the Pipeline object should remember whether Pipeline.fit() was called
  2. Pipeline.transform() throws an error if Pipeline.fit() was not called, and only executes step.transform() of the last step, because all previous steps have already been fit-transformed within Pipeline.fit()
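A hedged sketch of the first point above, the remembered fitted state. The class name FittedStateSketch and its structure are illustrative, not scikit-learn code; for simplicity, transform here still runs every step rather than only the last one:

```python
from sklearn.exceptions import NotFittedError

class FittedStateSketch:
    """Toy pipeline illustrating the proposed fitted-state check."""

    def __init__(self, steps):
        self.steps = steps
        self._is_fitted = False  # set to True by fit()

    def fit(self, X, y=None):
        for _name, step in self.steps:
            X = step.fit(X, y).transform(X)
        self._is_fitted = True
        return self

    def transform(self, X):
        # the proposal: refuse to transform if fit() was never called
        if not self._is_fitted:
            raise NotFittedError("Call fit() before transform().")
        for _name, step in self.steps:
            X = step.transform(X)
        return X
```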

@windiana42 (Contributor) commented:

The main problem with caching the last transformer step in Pipeline.transform() is that in Pipeline.fit_transform() we only cache _fit_transform_one as a combined operation. One conclusion could be that we should always cache fit() and transform() separately for the last transformer.
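One possible shape of that conclusion, as a hedged sketch: cache fit() and transform() of the final step as two separate memoized calls instead of the combined _fit_transform_one, so a later Pipeline.transform() could hit the cached transform result on its own. The helper names _fit_one, _transform_one, and fit_transform_final_step are hypothetical:

```python
from joblib import Memory

def _fit_one(transformer, X, y=None):
    # hypothetical cacheable helper: fit a single step
    return transformer.fit(X, y)

def _transform_one(transformer, X):
    # hypothetical cacheable helper: transform with a single step
    return transformer.transform(X)

def fit_transform_final_step(transformer, X, memory):
    # cache the two operations separately, rather than as one
    # combined fit_transform call, so the transform result can be
    # reused independently of the fit
    fitted = memory.cache(_fit_one)(transformer, X)
    Xt = memory.cache(_transform_one)(fitted, X)
    return Xt, fitted
```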

@windiana42 (Contributor) commented:

If you agree we are generally on the right track here, I would add test cases to the draft PR.

4 participants