Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Allow forestci to work on general Bagging estimators #100

Merged
merged 6 commits into from Dec 17, 2020

Conversation

richford
Copy link
Contributor

Resolves #99

This PR adds functionality to forestci.py to inspect the "forest" estimator to see if it is a random forest (i.e. inherits from BaseForest) or a bagging estimator (i.e. inherits from BaseBagging). There are some differences in the private attributes of these classes so the distinction is necessary. When the estimator is a random forest, all of the existing code applies. When it inherits from BaseBagging, we use the .estimators_samples_ attribute for the calc_inbag function. And when calibrating inside random_forest_error, it is also necessary to randomly permute the _seeds array attribute of new_forest. I've also added some tests for these new features.

I believe this PR makes forestci work well with general bagging estimators. However, I would greatly appreciate it if @arokem, @kpolimis, @bhazelton could check my work here. Most importantly, is this sensible? I think I've made the APIs compatible but am I making a mistake in applying Wager's method to general bagging methods (and not exclusively to random forests)?

@richford
Copy link
Contributor Author

Ohh, my apologies for the style changes to the files. I'm using an auto-linter in VSCode that changes lines when I save. They should still be PEP8 compliant. If it's a deal-breaker, I can revert the style changes.

@arokem
Copy link
Contributor

arokem commented Dec 10, 2020 via email

@arokem
Copy link
Contributor

arokem commented Dec 11, 2020

The code looks fine, but I have not looked closely yet. First, to the matter of principle: can we use this method for bagging estimators that are not forest/tree algorithms?

I've reread the paper. IIUC, there is nothing in section 2 of the paper, which outlines the theory and develops the equations we use, that is specific to trees/forests. For example, equation 5 (implemented here seems to apply to bagged estimators just as well, as does the bias correction from equation 7 (implemented here. Similarly, the calibration method implemented here seems to apply broadly to "noisy variance estimates". So, I think that this is OK.

That said, I wonder whether @swager would like to weigh in here, since we rely on his work quite heavily.

Alternatively, maybe @nrs02004 wants to save us from ourselves.

I also noticed that the original R implementation of this method has been deprecated in favor of what seems to be a more general software package that includes methods for variance estimation that are related, but not identical to these we implemented here (see section 4.1 in their paper). I didn't see anything about this newer paper that suggests we should abandon the method implemented here, but it's a pretty big paper 😄

Copy link
Contributor

@arokem arokem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - now I've also looked at the code. IIUC, the relevant changes (i.e., non-formatting changes) are in calc_inbag. Is that correct? That all looks good, but I wonder whether we really need pandas to do the operation you are using pandas for. Not that I mind a pandas dependency all that much.

pd.DataFrame(index=index, data=data) for index, data in samples_and_counts
]

return pd.concat(dataframes, axis="columns").fillna(0).astype(np.int).values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I correct that this is the only place that pandas is used? Is it here just to find nan values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct that was the only pandas dependency. I was using it for fillna and then also for concat to merge on the sample indices. But I just pushed a commit that gets rid of the pandas dependency.

@arokem
Copy link
Contributor

arokem commented Dec 11, 2020

I too have no idea why the documentation build is failing

@richford
Copy link
Contributor Author

I removed the pandas dependency. The other non-style change is here but that doesn't depend on pandas.

@richford
Copy link
Contributor Author

Is the doc build failure another example of numpy/numpydoc#268

@arokem
Copy link
Contributor

arokem commented Dec 17, 2020

After a bit more discussion, I am inclined to go ahead and merge this. We can fix up the docs on a separate PR.

@arokem arokem merged commit 7f36ef7 into scikit-learn-contrib:master Dec 17, 2020
@arokem arokem mentioned this pull request Dec 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow general bagging estimators
2 participants