
Pipeline et al. design issues

jnothman edited this page Dec 3, 2013 · 20 revisions

This page collates issues related to the API design of Pipelines and other meta-estimators. In general, a meta-estimator M with (primary) sub-estimator S should be more or less usable in place of S. Deficiencies in the current models mean this is not always the case. Which of these deficiencies should be fixed, and how? Other issues related to meta-estimator support (e.g. nested parameter setting) may also be relevant.

General meta-estimator issues

Duck-typing and methods (#1805, #2019)

hasattr may be used to check whether an estimator supports a particular functionality (e.g. fit_transform, predict_proba). In a meta-estimator, support for such a method is conditioned on its presence on a sub-estimator. This behaviour can be ensured using magic methods (__getattr__ or __getattribute__) or using descriptors (e.g. property): when these raise AttributeError, hasattr returns False.
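The descriptor approach can be sketched as follows. This is a minimal illustration, not the code from any PR: `Wrapper`, `Scaler` and `ProbClassifier` are hypothetical stand-ins, and only `predict_proba` is delegated.

```python
class Scaler:
    """Stand-in sub-estimator with transform but no predict_proba."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


class ProbClassifier:
    """Stand-in sub-estimator that does provide predict_proba."""

    def fit(self, X, y=None):
        return self

    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]


class Wrapper:
    """Hypothetical meta-estimator wrapping a single sub-estimator."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    @property
    def predict_proba(self):
        # Looking the method up on the sub-estimator raises AttributeError
        # when it is missing, so hasattr(wrapper, "predict_proba") is False.
        return self.estimator.predict_proba


supports_proba = hasattr(Wrapper(ProbClassifier()), "predict_proba")
lacks_proba = not hasattr(Wrapper(Scaler()), "predict_proba")
```

Each delegated method needs its own property, which is where the readability cost comes from; a `__getattr__`-based version would be shorter but would forward every missing attribute rather than a chosen set.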

PR #2019 supports common methods using property, sacrificing some readability. The question of which common methods need to be supported is a further issue.

A further concern is that in traditional estimators, hasattr gives the same answer before and after fitting. If something like GridSearchCV delegates hasattr to its best_estimator_, the check will only succeed after fitting.
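The fitting concern can be reproduced with a toy search object. The sketch below assumes delegation through a property that reads `best_estimator_`; `ToySearch` and `Classifier` are hypothetical and do not reflect GridSearchCV's actual implementation.

```python
class Classifier:
    """Stand-in estimator; predict_proba exists whether or not it is fitted."""

    def fit(self, X, y=None):
        return self

    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]


class ToySearch:
    """Hypothetical search meta-estimator that delegates to best_estimator_."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # A real search would compare candidates; here the single
        # estimator is simply fitted and stored.
        self.best_estimator_ = self.estimator.fit(X, y)
        return self

    @property
    def predict_proba(self):
        # best_estimator_ does not exist before fit, so this raises
        # AttributeError and hasattr reports the method as missing.
        return self.best_estimator_.predict_proba


search = ToySearch(Classifier())
before_fit = hasattr(search, "predict_proba")
search.fit([[0], [1]], [0, 1])
after_fit = hasattr(search, "predict_proba")
```

Here `before_fit` is False and `after_fit` is True, whereas the bare `Classifier` answers True in both states.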

Accessing fitted attributes (cf. #2561, #2568, #2630 in the context of Pipeline)

It can be cumbersome to access a fitted attribute of an estimator (e.g. in a Pipeline within GridSearchCV, this may involve gs.best_estimator_.steps[-1][1].coef_). To be interpreted with respect to the input space, this may require further transformation (e.g. Pipeline(gs.best_estimator_.steps[:-1]).inverse_transform(gs.best_estimator_.steps[-1][1].coef_)).
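One way to picture the access chain is a small helper that hides it. The helper name `final_step_attr` is hypothetical, and the objects below are plain stand-ins that mimic only the attribute layout (`best_estimator_`, `steps`, `coef_`) described above, so the sketch runs without scikit-learn.

```python
from types import SimpleNamespace


def final_step_attr(search, name):
    """Hypothetical helper: read a fitted attribute from the last pipeline step."""
    final_estimator = search.best_estimator_.steps[-1][1]
    return getattr(final_estimator, name)


# Stand-ins for a fitted search, pipeline and classifier (values are made up).
clf = SimpleNamespace(coef_=[0.25, -1.5])
pipeline = SimpleNamespace(steps=[("scale", SimpleNamespace()), ("clf", clf)])
gs = SimpleNamespace(best_estimator_=pipeline)

coef = final_step_attr(gs, "coef_")
```

The helper only shortens the lookup; mapping the result back to the input space would still need the inverse transformations of the earlier steps.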

Moreover, some fitted attributes are used by meta-estimators; AdaBoostClassifier assumes its sub-estimator has a classes_ attribute after fitting, which means that presently Pipeline cannot be used as the sub-estimator of AdaBoostClassifier. Either meta-estimators such as AdaBoostClassifier need to be configurable in how they access this attribute, or meta-estimators such as Pipeline need to make some fitted attributes of sub-estimators accessible.
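The second option, a Pipeline-like object exposing a fitted attribute of its final step, might look like the sketch below. `ToyPipeline`, `Identity` and `ToyClassifier` are hypothetical stand-ins, and forwarding only `classes_` is an assumption made for illustration.

```python
class Identity:
    """Stand-in transformer that returns its input unchanged."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


class ToyClassifier:
    """Stand-in classifier that records the labels seen during fit."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        return self


class ToyPipeline:
    """Hypothetical pipeline; not scikit-learn's Pipeline."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        # Fit and apply every transformer, then fit the final estimator.
        for _, transformer in self.steps[:-1]:
            X = transformer.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    @property
    def classes_(self):
        # Forward the fitted attribute of the final step; raises
        # AttributeError (so hasattr is False) until that step is fitted.
        return self.steps[-1][1].classes_


pipe = ToyPipeline([("id", Identity()), ("clf", ToyClassifier())])
has_classes_before = hasattr(pipe, "classes_")
pipe.fit([[0], [1], [2]], ["b", "a", "b"])
classes = pipe.classes_
```

A meta-estimator that reads `classes_` from its sub-estimator after fitting could then accept such a pipeline unchanged; the alternative of making the outer meta-estimator configurable is not shown.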

Pipeline / FeatureUnion issues

Passing parameters such as sample_weight to methods (cf. #2630)

Pipeline.get_feature_names() (#2007)

Efficiently reusing partial models/transformations during grid search (#2086)

Inconsistency between get_params and set_params treatment of sub-estimators (#1769, #1800)

Minor functionality and syntax issues:

  • constructor verbosity (#2589)
  • alternating or disabling components through set_params() (#1769)
  • retrieving a final model in input feature space (#2561, #2568)
  • heterogeneous input in FeatureUnion (#2034)
  • partitioning the FeatureUnion output space by transformer (#1952)