
[MRG] Pass original dataset to Stacking final estimator #15138

Merged

merged 23 commits into scikit-learn:master on Oct 30, 2019

Conversation

@jcusick13 (Contributor) commented Oct 5, 2019

Reference Issues/PRs

Fixes #15076.

What does this implement/fix? Explain your changes.

Adds an argument, pass_through, to the base Stacking estimator. This allows for concatenating the original dataset X with the output from all individual estimators for use in training the final_estimator.
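For illustration, here is a minimal numpy sketch of what that concatenation produces. The shapes and random data are made up for the example; in the real estimator the per-estimator predictions come from cross-validation, not random numbers.

```python
import numpy as np

# Made-up shapes: predictions from two base estimators for 100 samples,
# plus an original feature matrix X with 10 columns.
rng = np.random.default_rng(0)
predictions = [rng.random((100, 1)) for _ in range(2)]
X = rng.random((100, 10))

# Without passthrough, the final estimator sees only the stacked predictions.
X_meta = np.concatenate(predictions, axis=1)
assert X_meta.shape == (100, 2)

# With pass_through=True (later renamed passthrough), the original
# features are appended column-wise.
X_meta_full = np.concatenate(predictions + [X], axis=1)
assert X_meta_full.shape == (100, 12)
```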

Any other comments?

I added tests but haven't been able to check them thoroughly. I haven't even been able to get the existing tests to run, since calling (even on the master branch)

pytest sklearn/ensemble/tests/test_stacking.py

returns the below error.

_______ ERROR collecting sklearn/ensemble/tests/test_stacking.py _______ 
ImportError while importing test module '.../scikit-learn/sklearn/ensemble/tests/test_stacking.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
sklearn/ensemble/__init__.py:7: in <module>
    from .forest import RandomForestClassifier
sklearn/ensemble/forest.py:56: in <module>
    from ..tree import (DecisionTreeClassifier, DecisionTreeRegressor,
sklearn/tree/__init__.py:6: in <module>
    from .tree import DecisionTreeClassifier
sklearn/tree/tree.py:44: in <module>
    from ._tree import _build_pruned_tree_ccp
E   ImportError: cannot import name '_build_pruned_tree_ccp'

I've looked around for a little and am not too sure how to resolve this -- does anyone have any suggestions?

jcusick13 added 8 commits Oct 5, 2019
@jcusick13 (Contributor Author) commented Oct 7, 2019

Something's still off with my local build while trying to import '_build_pruned_tree_ccp', but I think things should otherwise be all set for a review now.

@jcusick13 jcusick13 changed the title [WIP] Pass original dataset to Stacking final estimator [MRG] Pass original dataset to Stacking final estimator Oct 7, 2019
@jnothman (Member) commented Oct 7, 2019

@@ -39,22 +39,25 @@ class _BaseStacking(TransformerMixin, _BaseHeterogeneousEnsemble,

     @abstractmethod
     def __init__(self, estimators, final_estimator=None, cv=None,
-                 stack_method='auto', n_jobs=None, verbose=0):
+                 stack_method='auto', n_jobs=None, verbose=0,
+                 pass_through=False):

@jnothman (Member) Oct 7, 2019

Since we use "passthrough" elsewhere, I don't think pass_through is a good idea.

@jcusick13 (Author, Contributor) Oct 7, 2019

Ah good catch, I'll update to be consistent with passthrough.

@jcusick13 (Contributor Author) commented Oct 7, 2019

Thanks for the advice on reinstalling with the editable command -- that fixed things on my end!

@@ -178,14 +185,16 @@ def test_stacking_regressor_diabetes(cv, final_estimator, predict_params):
     assert len(result) == expected_result_length

     X_trans = reg.transform(X_test)
-    assert X_trans.shape[1] == 2
+    expected_column_count = 12 if passthrough else 2
+    assert X_trans.shape[1] == expected_column_count

@thomasjpfan (Member) Oct 7, 2019

You can assert that the original values are the same as the passthrough values.

Same goes for the other asserts.

@jcusick13 (Author, Contributor) Oct 7, 2019

Sounds good. Should that be a distinct test, or part of the existing test_stacking_regressor_diabetes and test_stacking_classifier_iris?

@thomasjpfan (Member) Oct 7, 2019

I think this can be included here:

if passthrough:
    assert_allclose(...)
@@ -74,7 +77,10 @@ def _concatenate_predictions(self, predictions):
                 X_meta.append(preds[:, 1:])
             else:
                 X_meta.append(preds)
-        return np.concatenate(X_meta, axis=1)
+        if self.passthrough:

@thomasjpfan (Member) Oct 7, 2019

Nit:

if self.passthrough:
    X_meta.append(X)

return np.concatenate(X_meta, axis=1)
@jnothman (Member) left a comment

Please test behaviour with a sparse matrix X

@jcusick13 (Contributor Author) commented Oct 12, 2019

Sounds good -- the original np.concatenate was failing for sparse matrices, so I added an additional check before combining the original X matrix with interim learner predictions.

@jcusick13 (Contributor Author) commented Oct 24, 2019

@jnothman, @thomasjpfan I just wanted to follow up on this -- let me know when you guys have a chance for another review, thanks!

@thomasjpfan (Member) left a comment

We should document the behavior of transform when X is sparse. I.e. currently:

  1. If X is sparse and passthrough=False, the output of transform will be dense (the predictions).
  2. If X is sparse and passthrough=True, the output of transform will be sparse.
)
clf.fit(X_train, y_train)
X_trans = clf.transform(X_test)
assert_allclose(X_test.toarray(), X_trans[:, -10:].toarray())
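The two cases described above can be seen with a small scipy sketch. The shapes are made up, and `preds` stands in for the stacked base-estimator predictions.

```python
import numpy as np
import scipy.sparse as sparse

preds = np.ones((4, 2))            # stacked base-estimator predictions (dense)
X = sparse.csr_matrix(np.eye(4))   # sparse original features

# Case 1: passthrough=False -- only the dense predictions are returned.
out_dense = preds
assert isinstance(out_dense, np.ndarray)

# Case 2: passthrough=True with sparse X -- sparse.hstack keeps it sparse.
out_sparse = sparse.hstack([preds, X]).tocsr()
assert sparse.issparse(out_sparse)
assert out_sparse.shape == (4, 6)
```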

@thomasjpfan (Member) Oct 24, 2019

Nit: This can use:

from sklearn.utils.testing import assert_allclose_dense_sparse

assert_allclose_dense_sparse(X_test, X_trans[:, -10:])
)
clf.fit(X_train, y_train)
X_trans = clf.transform(X_test)
assert_allclose(X_test.toarray(), X_trans[:, -4:].toarray())

@thomasjpfan (Member) Oct 24, 2019

Nit: This can use:

from sklearn.utils.testing import assert_allclose_dense_sparse

assert_allclose_dense_sparse(X_test, X_trans[:, -4:])
"""Concatenate the predictions of each first layer learner.
def _concatenate_predictions(self, X, predictions):
"""Concatenate the predictions of each first layer learner and
possibly the input dataset `X`.

@thomasjpfan (Member) Oct 24, 2019

Please document the sparse behavior here.

@jcusick13 (Contributor Author) commented Oct 25, 2019

Just added the new changes -- thanks for pointing out assert_allclose_dense_sparse!

@thomasjpfan (Member) left a comment

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

if issparse(X):
return sparse_hstack(X_meta).tocsr()

@thomasjpfan (Member) Oct 25, 2019

Nit: It would most likely be clearer for future devs:

import scipy.sparse as sparse

...
if self.passthrough:
    X_meta.append(X)

hstack = sparse.hstack if sparse.issparse(X) else np.hstack
return hstack(X_meta)
jcusick13 added 2 commits Oct 25, 2019
@thomasjpfan (Member) left a comment

Otherwise LGTM

if self.passthrough:
X_meta.append(X)
if sparse.issparse(X):
return sparse.hstack(X_meta).tocsr()

@thomasjpfan (Member) Oct 25, 2019

Nit: a further simplification:

if self.passthrough:
    X_meta.append(X)

hstack = sparse.hstack if sparse.issparse(X) else np.hstack
return hstack(X_meta)

@jcusick13 (Author, Contributor) Oct 25, 2019

I tried a few ways to get this to work but kept running into errors since sparse.hstack returns a coo_matrix instead of a csr_matrix.

Is there a clever way to embed the .tocsr() function within the if/else line? I didn't want to add it as return hstack(X_meta).tocsr() since it would try to force dense matrices into csr format also.

@jnothman (Member) Oct 27, 2019

Should this be asformat(X.format)?

@jnothman (Member) Oct 27, 2019

Not sure what you mean by embedding .tocsr in the if/else line.

(I am surprised that sparse.hstack works with a mix of dense and sparse input... I'm sure the docs only mention sparse.)

@thomasjpfan (Member) Oct 28, 2019

(I am surprised that sparse.hstack works with a mix of dense and sparse input... I'm sure the docs only mention sparse.)

It is surprising. I am unsure if we should rely on this behavior.

@jcusick13 Maybe something like this will work:

if self.passthrough:
    X_meta.append(X)

if sparse.issparse(X):
    return sparse.hstack(X_meta, format=X.format)

return np.hstack(X_meta)
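A quick standalone check of this suggestion (made-up shapes; it relies on scipy's `hstack(..., format=...)` keyword) confirming that the output format follows X:

```python
import numpy as np
import scipy.sparse as sparse

X_meta = [np.ones((3, 2))]  # stands in for the stacked predictions
for fmt in ('csr', 'csc'):
    X = sparse.eye(3, format=fmt)   # made-up sparse input in a given format
    stacked = sparse.hstack(X_meta + [X], format=X.format)
    # The stacked output keeps the input matrix's sparse format.
    assert stacked.format == fmt
    assert stacked.shape == (3, 5)
```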

@jcusick13 (Author, Contributor) Oct 29, 2019

Good catch on using X.format instead of hardcoding csr.

The sparse.hstack documentation is a bit worrying with the explicit mention of sparse inputs (and the sparse statements are just as explicit in the sparse.bmat docs, which is called directly by sparse.hstack).

I took a look at the sparse.bmat source and it looks like it's converting all of its inputs into a COO matrix in a way that would work regardless of sparse vs. dense input (by just using sparse.coo_matrix()). I ran some simple tests locally and it never seemed to raise a problem --

sparse.hstack([np.ones(10), sparse.csr_matrix(np.ones(10))])
>>> <1x20 sparse matrix of type '<class 'numpy.float64'>'
	with 20 stored elements in COOrdinate format>

The current setup was passing all tests locally, but I'm definitely open to hearing your thoughts on the matter. One suggestion: we could first convert X_meta to sparse before appending the sparse X and then running sparse.hstack. Something like

if self.passthrough:
    if sparse.issparse(X):
        X_meta = [sparse.coo_matrix(p) for p in X_meta]
        X_meta.append(X)
        return sparse.hstack(X_meta, format=X.format)
    X_meta.append(X)

return np.hstack(X_meta)
doc/whats_new/v0.22.rst (review thread resolved)
@jnothman (Member) left a comment

Otherwise looking good!

sklearn/ensemble/_stacking.py (review thread resolved)

estimators=estimators, final_estimator=rf, cv=5, passthrough=True
)
clf.fit(X_train, y_train)
X_trans = clf.transform(X_test)

@jnothman (Member) Oct 29, 2019

You don't actually confirm that the output is sparse (nor that its format matches that of X).

@jcusick13 (Author, Contributor) Oct 29, 2019

Good call -- I added tests to check both a sparse output and matching format. I also adjusted one of the sparse inputs to be CSC format (instead of CSR) so that the tests run against multiple sparse format types.

@jnothman (Member) commented Oct 29, 2019

@jcusick13 (Contributor Author) commented Oct 30, 2019

All set -- parameterized CSR, CSC, and COO matrices for both stacking regressor and classifier tests.

jcusick13 added 2 commits Oct 30, 2019

@pytest.mark.parametrize(
'X',
[sparse.csc_matrix(scale(X_diabetes)),

@jnothman (Member) Oct 30, 2019

Cleaner if you just use the format string as the param: ['csc', 'csr', 'coo']. Then use asformat to convert from a single initial matrix.

@jnothman (Member) left a comment

Sorry to be nit-picking, but I think we are getting closer to a neat and idiomatic style.

@jcusick13 (Contributor Author) commented Oct 30, 2019

No worries -- I just adjusted it now, let me know what you think.

@jnothman (Member) left a comment

Happy now :)

@jnothman jnothman merged commit 132ad99 into scikit-learn:master Oct 30, 2019
19 checks passed

LGTM analysis: C/C++ - no code changes detected
LGTM analysis: JavaScript - no code changes detected
LGTM analysis: Python - no new or fixed alerts
ci/circleci: deploy - passed
ci/circleci: doc - passed
ci/circleci: doc artifact - link to 0/doc/_changed.html
ci/circleci: doc-min-dependencies - passed
ci/circleci: lint - passed
codecov/patch - 100% of diff hit (target 96.29%)
codecov/project - 97.15% (+0.86%) compared to 7c47337
scikit-learn.scikit-learn - Build #20191030.18 succeeded
scikit-learn.scikit-learn - Linux py35_conda_openblas succeeded
scikit-learn.scikit-learn - Linux py35_ubuntu_atlas succeeded
scikit-learn.scikit-learn - Linux pylatest_conda_mkl succeeded
scikit-learn.scikit-learn - Linux pylatest_pip_openblas_pandas succeeded
scikit-learn.scikit-learn - Linux32 py35_ubuntu_atlas_32bit succeeded
scikit-learn.scikit-learn - Windows py35_pip_openblas_32bit succeeded
scikit-learn.scikit-learn - Windows py37_conda_mkl succeeded
scikit-learn.scikit-learn - macOS pylatest_conda_mkl succeeded
@jnothman (Member) commented Oct 30, 2019

Thanks @jcusick13!

@jcusick13 jcusick13 deleted the jcusick13:add-stacking-passthrough-15076 branch Oct 30, 2019