Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TST factorize some common test of ensemble of heterogeneous es… #15387

Merged

Conversation

glemaitre
Copy link
Member

@glemaitre glemaitre commented Oct 28, 2019

Reference Issues/PRs

A follow-up to #15375
Include a test to illustrate non-consistency between Stacking and Voting

TODO:

  • Decide what should be the correct API (Probably modify Stacking to report dropped estimator in named_estimators_)

What does this implement/fix? Explain your changes.

We merged #15375 but included a regression/inconsistent between Stacking and Voting estimators.
The documentation is not straightforward but estimators_ does not include the dropped estimators. For Stacking estimators named_estimators_ expose the same semantic while Voting estimators report the dropped estimator with the 'drop' values.

Any other comments?

@glemaitre
Copy link
Member Author

Looking at the voting tests, it seems that Voting.named_estimators_ is expected to report dropped estimators, meaning that we should change the Stacking estimators instead of the Voting.

ping @jnothman @amueller @NicolasHug @thomasjpfan @qinhanmin2014 Could you confirm that we have to change the Stacking?

@jnothman
Copy link
Member

By "expected to report dropped estimators" do you mean named_estimators_['dropped'] == 'drop'? This seems reasonable to me? Either way seems reasonable really, but conforming to existing API is best?

@thomasjpfan
Copy link
Member

I have a preference for the interface in ColumnTransformer, where the named_transformer_ maps to the 'drop' and transformers_ matches transformers when it comes to 'drop'.

In the case of Stacking* and Voting*, I would prefer estimators_ to contain 'drop' and named_estimators_ to map to 'drop' when the estimator is dropped.

@glemaitre
Copy link
Member Author

I did not consider the ColumnTransformer. It would make sense to have similar behaviour.

So we should be able to map 'drop' to a named estimator without breaking the backward compatibility. In this case, we keep #15375 and change the Stacking* to have the same behaviour.

For estimators_, this is more tricky. We would break the back-compatibility to introduce 'drop' in the list of estimators for the Voting* (so we need a change of behaviour there). We can change Stacking* since this is only in master.

So we need:

  1. change named_estimators_ in Stacking*
  2. change estimators_ in Stacking* and Voting* + announcing a change of behaviour for the latest.

This PR could focus on 1.

WDYT?

@NicolasHug
Copy link
Member

I have a preference for the interface in ColumnTransformer, where the named_transformer_ maps to the 'drop' and transformers_ matches transformers when it comes to 'drop'.

Reading the code, that's not true.

transformers_ ignores the dropped estimators, and named_transformer_ too:

        return Bunch(**{name: trans for name, trans, _
                        in self.transformers_})

@glemaitre
Copy link
Member Author

@NicolasHug Are you sure?

In [5]: import numpy as np 
   ...: from sklearn.compose import ColumnTransformer 
   ...: from sklearn.preprocessing import Normalizer 
   ...: ct = ColumnTransformer( 
   ...:     [("norm1", Normalizer(norm='l1'), [0, 1]), 
   ...:      ("norm2", Normalizer(norm='l1'), slice(2, 4))]) 
   ...: X = np.array([[0., 1., 2., 2.], 
   ...:               [1., 1., 0., 1.]]) 
   ...: # Normalizer scales each row of X to unit norm. A separate scaling 
   ...: # is applied for the two first and two last elements of each 
   ...: # row independently. 
   ...: ct.fit_transform(X)     
   ...:                                                                              
Out[5]: 
array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])

In [6]: ct.set_params(norm1='drop')                                                  
Out[6]: 
ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('norm1', 'drop', [0, 1]),
                                ('norm2', Normalizer(copy=True, norm='l1'),
                                 slice(2, 4, None))],
                  verbose=False)

In [7]: ct.fit_transform(X)                                                          
Out[7]: 
array([[0.5, 0.5],
       [0. , 1. ]])

In [8]: ct.transformers_                                                             
Out[8]: 
[('norm1', 'drop', [0, 1]),
 ('norm2', Normalizer(copy=True, norm='l1'), slice(2, 4, None))]

In [9]: ct.named_transformers_                                                       
Out[9]: {'norm1': 'drop', 'norm2': Normalizer(copy=True, norm='l1')}

norm1 is reported in both even if dropped.

@glemaitre
Copy link
Member Author

We also have another issues with estimators_ if we want to make it behave as transformers_.

transformers_ returns a list of tuple (similar to transfomers). However, estimators_ is only a list of fitted estimators and not a tuple (name, estimator).

@NicolasHug
Copy link
Member

Sorry on my phone rn so can't double check but both the code comments and the docstrings suggest otherwise. Docstrings say "fitted estimator" which inpoes9the dropped estimators aren't there

@glemaitre
Copy link
Member Author

I am confused now. If you refer to the transformers_ the documentation mentioned the following:

fitted_transformer can be an estimator, ‘drop’, or ‘passthrough’

If you refer to the estimators_ then I agree. That's why, I am asking if we should go toward making things with a similar semantic (list of tuple instead of just estimator).

@NicolasHug
Copy link
Member

NicolasHug commented Oct 29, 2019

Sorry, read too fast. I was confused that a 'drop' qualifies as a "fitted estimator" Ignore my comments ^^

@thomasjpfan
Copy link
Member

I am okay with changing Stack* to include dropped estimators in Stacking*.

@glemaitre glemaitre changed the title TST add common test for ensemble of heterogeneous estimators TST factorize some common test of ensemble of heterogeneous estimators Oct 30, 2019
@glemaitre
Copy link
Member Author

So I added the dropped estimator in named_estimators_ for stacking. I factorize the tests in common at the same time. I am unsure that there is a need for a what's new here since this maintenance and changing the behavior of an estimator which is not released yet.

We can have further discussion regarding what to do with estimators_ in Voting* and Stackign*, later on. This is not a release-critical issue apart from that we will have to deprecate in both classes if we want to make a change.

@ogrisel
Copy link
Member

ogrisel commented Oct 30, 2019

there is a need for a what's new here since this maintenance and changing the behavior of an estimator which is not released yet.

No need to update the changelog for this PR.

@glemaitre
Copy link
Member Author

ping @NicolasHug @jnothman @thomasjpfan

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this, thank you.

sklearn/ensemble/tests/test_common.py Outdated Show resolved Hide resolved
glemaitre and others added 2 commits November 6, 2019 08:57
Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thomasjpfan
Copy link
Member

Merged with master to make sure the docs are okay (the lint error comes from the renaming of the flake8_diff.sh file on master).

Will merge when tests pass.

@thomasjpfan thomasjpfan changed the title TST factorize some common test of ensemble of heterogeneous estimators TST factorize some common test of ensemble of heterogeneous es… Nov 6, 2019
@thomasjpfan thomasjpfan merged commit 1578132 into scikit-learn:master Nov 6, 2019
@glemaitre
Copy link
Member Author

Thanks for merging

@thomasjpfan
Copy link
Member

Thanks you for the PR! @glemaitre

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants