Describe the workflow you want to enable
sklearn's Pipeline caches the output of transformers in the pipeline. The caching is based on a hash of the arguments of the function _fit_transform_one. Unfortunately, the hash changes when any of the transformer's parameters change, including parameters that don't affect the output, for example verbose (there could be others, perhaps copy?).
It would be a nice feature if there were a way to indicate to the pipeline (or if the pipeline could detect automatically) which parameters within the transformers to ignore for caching.
Use case
While developing, I tend to always set a high verbosity to understand what's happening under the hood. Once I am content with the results, I turn the verbosity off. At this point, the results are already calculated and cached, but they need to be recalculated because of the parameter change.
Describe your proposed solution
The pipeline's caching is performed in sklearn/pipeline.py (line 392 as of commit 38b39a4). It uses joblib's Memory.cache, which accepts an ignore parameter to exclude arguments from the hashing, but you can't ignore parameters within one of the arguments (the first argument to _fit_transform_one is the transformer itself, and we would like to ignore its verbose attribute).
I couldn't find any trivial solution that doesn't involve monkey-patching joblib's code or programmatically changing the verbose parameter on the transformer itself (which would lead to unexpected results for the user, for example the lack of output messages). Happy to open a PR if anyone can think of a solution.
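The limitation of ignore= can be demonstrated directly. In the sketch below (toy function names, not the actual pipeline code), ignore= successfully drops a whole argument from the cache key, but there is no analogous hook for an attribute of an argument:

```python
import tempfile
from joblib import Memory

memory = Memory(tempfile.mkdtemp(), verbose=0)
executions = []

# ignore= drops a *whole argument* from the cache key...
@memory.cache(ignore=["verbose"])
def fit_transform_like(X, verbose=0):
    executions.append(1)  # records an actual (non-cached) execution
    return [x * 2 for x in X]

fit_transform_like([1, 2, 3], verbose=0)
fit_transform_like([1, 2, 3], verbose=1)  # cache hit: verbose is ignored
print(len(executions))  # 1 -- the body ran only once

# ...but when the transformer object itself is the argument (as in
# _fit_transform_one), there is no way to ignore transformer.verbose
# inside that argument.
```

This is exactly the gap: the pipeline passes the transformer as an argument, so its verbose attribute is always hashed.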
Describe alternatives you've considered, if relevant
No response
Additional context
Similar to #23788, but presenting a new use case which, as far as I know, can't be resolved at the moment.
Indeed, we should address this issue. I don't know yet what the best option is here. Maybe creating a list of parameters to ignore? Maybe using the tag mechanism?
Probably the easiest solution, but how would you tell Memory.cache to ignore those parameters within the transformer? They are part of the transformer's hash.
Maybe using the tag mechanism
Better solution imo, especially since it would allow the creation of custom transformers with custom ignored parameters. But it still presents the issue of how to pass this information to the hashing mechanism.
Maybe a good way for this to work would be for each estimator to be able to declare which of its constructor arguments should be ignored. I think this goes beyond what the ignore= argument of joblib's Memory provides, because that only concerns itself with the arguments of the function it is caching, while we want to ignore "arguments to an argument of the function". Maybe cache_validation_callback= can help us out, because if not we might be out of luck :-/
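One way the "declare which constructor arguments to ignore" idea could be prototyped is to hash a clone of the estimator with the ignored parameters reset to a fixed value. This is a hypothetical sketch under our own names (IGNORED_FOR_CACHING, stable_hash are not sklearn or joblib API); in a real implementation the list could come from the tag mechanism:

```python
import joblib
from sklearn.base import clone
from sklearn.cluster import KMeans

# Hypothetical: parameters that never affect the fitted output. In a real
# implementation each estimator could declare this, e.g. via a tag.
IGNORED_FOR_CACHING = ("verbose",)

def stable_hash(estimator):
    """Hash an unfitted estimator, neutralizing ignored parameters."""
    canonical = clone(estimator)
    neutral = {p: None for p in IGNORED_FOR_CACHING
               if p in canonical.get_params()}
    canonical.set_params(**neutral)  # e.g. verbose=None on every clone
    return joblib.hash(canonical)

# Flipping verbose no longer changes the cache key:
print(stable_hash(KMeans(verbose=0)) == stable_hash(KMeans(verbose=1)))  # True
# ...while a parameter that does matter still does:
print(stable_hash(KMeans(n_clusters=3)) == stable_hash(KMeans(n_clusters=5)))  # False
```

The pipeline would then have to use something like stable_hash when building the Memory cache key instead of hashing the transformer directly, which is the part that doesn't fit the current ignore= machinery.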