
[MRG+2] Update BaggingRegressor to relax checking X for finite values #9707

Merged

merged 9 commits into scikit-learn:master on May 31, 2018

@jimmywan
Contributor

jimmywan commented Sep 7, 2017

I'd like to make this change in order to support using BaggingRegressor on top of pipelines that handle imputation.

Reference Issue

#9708

What does this implement/fix? Explain your changes.

Removes finite checking on X. This seems wholly unnecessary, as whatever regressor BaggingRegressor wraps should already handle its own consistency checking.

Relax finite checking on X in order to support using BaggingRegressor on top of pipelines that handle imputation.
@ramhiser


ramhiser commented Sep 7, 2017

+1. The call to check_X_y(..., force_all_finite=False) makes more sense when a Pipeline has a missing data imputer in it.

Maybe the check should be removed entirely so that the base estimator handles the check?
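To make the use case concrete, here is a minimal sketch of the pattern this PR enables (illustrative only: it uses SimpleImputer, the successor of the Imputer transformer available when this PR was written, and arbitrary data and hyperparameters):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0, 3.0, 5.0],
              [2.0, np.nan, 6.0],
              [2.0, 4.0, np.nan],
              [3.0, 4.0, 7.0]])
y = np.array([2.0, 3.0, 3.0, 4.0])

# Each bagged estimator is a pipeline that imputes NaN before fitting,
# so BaggingRegressor itself must not reject non-finite values in X.
pipeline = make_pipeline(SimpleImputer(), DecisionTreeRegressor())
model = BaggingRegressor(pipeline, n_estimators=5, random_state=0)
pred = model.fit(X, y).predict(X)
```

Before this change, BaggingRegressor.fit rejected the NaN values before the pipeline's imputer ever saw them.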

@jnothman

Member

jnothman commented Sep 8, 2017

We can't remove the check entirely: Bagging requires pulling out selected rows and columns, and that requires 2d data with an array-like interface. I suppose we don't need to require that it's numeric, though...? We also do not support Pandas DataFrames here, as that would require use of iloc.

Please add a test.
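As an illustrative sketch (not the actual Bagging implementation) of why a 2d, indexable X is still required: each bagged estimator is fit on a bootstrap sample of rows and a random subset of feature columns.

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

rows = np.array([0, 2, 2])  # bootstrap row indices, drawn with replacement
cols = np.array([0, 2])     # randomly selected feature columns

# Pulling out selected rows and columns needs 2d array-like indexing,
# even if the values themselves are never checked for finiteness.
X_sub = X[rows][:, cols]
```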

@jimmywan

Contributor Author

jimmywan commented Sep 8, 2017

Updated the PR to drop the type requirement and allow multiple output columns. Will work on a test case.

We also do not support Pandas DataFrames here, as that would require use of iloc.

Hmm, can you walk me through how to make this Pandas-compatible?
A pointer to existing code would be fine if it exists.

@jimmywan

Contributor Author

jimmywan commented Sep 8, 2017

Added some test cases for classification and regression, and had to make some minor tweaks to accept multi-output regression.

@amueller

Member

amueller commented Sep 8, 2017

check_array will convert a dataframe to a numpy array
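A quick illustration of that behavior (assuming pandas is installed; the column names here are arbitrary):

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
arr = check_array(df)  # the DataFrame is converted to a plain ndarray

assert isinstance(arr, np.ndarray)
```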

@jimmywan

Contributor Author

jimmywan commented Sep 8, 2017

check_array will convert a dataframe to a numpy array

So nothing else required to pass in a pandas DataFrame?

Context is that my pipeline is capable of handling the conversion of categorical string values to numerics before it hits the actual regressor/classifier at the end of the pipeline.

@amueller

Member

amueller commented Sep 8, 2017

Well, if you do that, then the pipeline will receive a numpy array and not a dataframe; but as @jnothman said, the slicing doesn't deal with dataframes, so there is no way to have dataframes in the pipeline.

Is there a reason you do the feature expansion in the pipeline and not before the bagging estimator?

@jimmywan

Contributor Author

jimmywan commented Sep 11, 2017

well if you do that then the pipeline will receive a numpy array and not a dataframe, but as @jnothman said, the slicing doesn't deal with dataframes, so there is no way to have dataframes in the pipeline.

Assuming that everything before my custom components supports things like get_feature_names or masks, my components can deal with converting numpy arrays back to DataFrames where I need them. I should be good to go.

@jimmywan

Contributor Author

jimmywan commented Sep 11, 2017

I suppose we don't need to require that it's numeric, though...?

I removed that check as well and wanted to test it, but given the lack of a built-in pipeline friendly categorical encoder, I decided not to add test cases for that particular input.

@jnothman
Member

jnothman left a comment

Thanks

@@ -280,7 +279,10 @@ def _fit(self, X, y, max_samples=None, max_depth=None, sample_weight=None):
     random_state = check_random_state(self.random_state)

     # Convert data
-    X, y = check_X_y(X, y, ['csr', 'csc'])
+    X, y = check_X_y(

@jnothman

jnothman Sep 11, 2017

Member

Perhaps add a comment that we require X to be 2d and indexable on both axes.


@jimmywan

jimmywan Sep 12, 2017

Author Contributor

Added
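A sketch of the relaxed validation this hunk introduces (hedged: the keyword was called force_all_finite at the time of this PR; newer scikit-learn releases rename it to ensure_all_finite, so the snippet tries both):

```python
import numpy as np
from sklearn.utils import check_X_y

# X must be 2d and indexable on both axes, but may contain NaN.
X = np.array([[1.0, np.nan], [2.0, 3.0], [4.0, 5.0]])
y = np.array([1.0, 2.0, 3.0])

try:
    Xc, yc = check_X_y(X, y, accept_sparse=['csr', 'csc'],
                       dtype=None, force_all_finite=False)
except TypeError:  # keyword renamed in newer scikit-learn releases
    Xc, yc = check_X_y(X, y, accept_sparse=['csr', 'csc'],
                       dtype=None, ensure_all_finite=False)
```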



def test_bagging_pipeline_with_sparse_inputs():
# Check that BaggingRegressor can accept sparse pipelines inputs


@jnothman

jnothman Sep 11, 2017

Member

What you're testing with is not what we call sparse. I presume sparse inputs are already tested too. Do you mean to test nans and multioutput (which probably belongs in a separate test function)?


@jimmywan

jimmywan Sep 12, 2017

Author Contributor

Renamed "sparse" to "missing".
I created separate tests for single- and multi-output BaggingRegressors.

# Check that BaggingRegressor can accept sparse pipelines inputs
X = [
[1, 3, 5],
[2, None, 6],


@jnothman

jnothman Sep 11, 2017

Member

I think you want np.nan rather than None?


@jimmywan

jimmywan Sep 12, 2017

Author Contributor

TBH, I generally test against both as I'm not entirely sure what happens under the covers. I know I can initialize a Pandas DataFrame either way.
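For reference, what numpy does under the covers: None survives only in object arrays, while a float dtype coerces it to nan (so np.nan is what the finiteness checks actually see):

```python
import numpy as np

a = np.array([1.0, None])               # mixed input -> object dtype, None kept
b = np.array([1.0, None], dtype=float)  # float dtype -> None coerced to nan

assert a.dtype == object
assert np.isnan(b[1])
```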


@jimmywan

jimmywan Sep 12, 2017

Author Contributor

Added separate imputers/inputs for None/np.nan, np.inf, np.NINF.

jimmywan added some commits Sep 12, 2017

Update PR based on review comments:
Add comments.
Rename things for consistency.
Add additional test case.
@jimmywan

Contributor Author

jimmywan commented Sep 13, 2017

@jnothman Updated as per your PR comments and have updated test cases as well. Please review at your earliest convenience.

@jnothman
Member

jnothman left a comment

Thanks

)
pipeline.fit(X, Y).predict(X)
bagging_regressor = BaggingRegressor(pipeline)
bagging_regressor.fit(X, Y).predict(X)


@jnothman

jnothman Sep 13, 2017

Member

should check decision_function etc also, no?


@jimmywan

jimmywan Sep 14, 2017

Author Contributor

There is no decision_function on BaggingRegressor.
I'll add it to the BaggingClassifier test.


@jimmywan

jimmywan Sep 14, 2017

Author Contributor

I actually can't add it to the test for BaggingClassifier because DecisionTreeRegressor does not implement decision_function. I don't normally work with classification, and this PR was meant to be a patch for BaggingRegressor. Feel free to add additional tests if you think it's necessary.

pipeline = make_pipeline(regressor)
assert_raises(ValueError, pipeline.fit, X, Y)
bagging_regressor = BaggingRegressor(pipeline)
assert_raises(ValueError, bagging_regressor.fit, X, Y)


@jnothman

jnothman Sep 13, 2017

Member

Might be good to check the message here, but okay.


@jimmywan

jimmywan Sep 14, 2017

Author Contributor

Since this is just a test for BaggingRegressor, I don't really want to couple the behavior of the underlying regressor to this test case. The fact that it raises an error seems sufficient to me.
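For reference, a hedged sketch of how the message could be asserted if desired, using pytest.raises(match=...) and LinearRegression as an arbitrary estimator that rejects NaN (the exact message wording varies across scikit-learn versions):

```python
import numpy as np
import pytest
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, np.nan], [2.0, 3.0], [4.0, 5.0]])
y = np.array([1.0, 2.0, 3.0])

# match= treats the string as a regex applied to the error message.
with pytest.raises(ValueError, match="NaN"):
    LinearRegression().fit(X, y)
```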

[2, np.inf, 6],
[2, np.NINF, 6],
]
Y = [


@jnothman

jnothman Sep 13, 2017

Member

I'd be okay with this test not dealing with missing data, and just being about 2d. I'd also be okay with the previous test just looping over 1d, 2d.


@jimmywan

jimmywan Sep 14, 2017

Author Contributor

Changed, though now I have to reuse y for 1D and 2D...

[2, np.inf, 6],
[2, np.NINF, 6],
]
Y = [2, 3, 3, 3, 3]


@jnothman

jnothman Sep 13, 2017

Member

lowercase y if 1d please


@jimmywan

jimmywan Sep 14, 2017

Author Contributor

Done.

)
pipeline.fit(X, Y).predict(X)
bagging_regressor = BaggingRegressor(pipeline)
bagging_regressor.fit(X, Y).predict(X)


@jnothman

jnothman Sep 13, 2017

Member

please check that the prediction shape matches the input shape


@jimmywan

jimmywan Sep 14, 2017

Author Contributor

Done.

assert_raises(ValueError, bagging_regressor.fit, X, Y)


def test_bagging_classifier_with_missing_inputs():


@jnothman

jnothman Sep 13, 2017

Member

What does this test add?


@jimmywan

jimmywan Sep 14, 2017

Author Contributor

This tests BaggingClassifier while the other tests BaggingRegressor.

@jnothman
Member

jnothman left a comment

LGTM

@jnothman jnothman changed the title Update BaggingRegressor to relax checking X for finite values [MRG+1] Update BaggingRegressor to relax checking X for finite values Sep 14, 2017

@amueller

Member

amueller commented May 22, 2018

LGTM please add a whatsnew entry.

@jnothman jnothman changed the title [MRG+1] Update BaggingRegressor to relax checking X for finite values [MRG+2] Update BaggingRegressor to relax checking X for finite values May 23, 2018

@jnothman

Member

jnothman commented May 23, 2018

Please add an entry to the change log at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

@jnothman

Member

jnothman commented May 29, 2018

Ping @jimmywan?

@jimmywan

Contributor Author

jimmywan commented May 30, 2018

please add a whatsnew entry

Merged scikit-learn:master into my branch and added requested documentation.

@jnothman jnothman merged commit a31a906 into scikit-learn:master May 31, 2018

8 checks passed

ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: python2 Your tests passed on CircleCI!
ci/circleci: python3 Your tests passed on CircleCI!
codecov/patch 100% of diff hit (target 95.2%)
codecov/project 95.2% (+<.01%) compared to f97b515
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
lgtm analysis: Python No alert changes
@jnothman

Member

jnothman commented May 31, 2018

Thanks @jimmywan!
