ENH: Allow forestci to work on general Bagging estimators #100

richford · 2020-12-10T00:47:26Z

Resolves #99

This PR adds functionality to forestci.py to inspect the "forest" estimator to see if it is a random forest (i.e. inherits from BaseForest) or a bagging estimator (i.e. inherits from BaseBagging). There are some differences in the private attributes of these classes so the distinction is necessary. When the estimator is a random forest, all of the existing code applies. When it inherits from BaseBagging, we use the .estimators_samples_ attribute for the calc_inbag function. And when calibrating inside random_forest_error, it is also necessary to randomly permute the _seeds array attribute of new_forest. I've also added some tests for these new features.
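As a rough illustration of the distinction described above, the sketch below checks which base class a fitted estimator inherits from and reads the per-estimator sample indices from a bagging model. It is not the code in this PR: `ensemble_kind` is a hypothetical helper, the toy data are made up, and the imports from `sklearn.ensemble._forest` and `sklearn.ensemble._bagging` assume a scikit-learn version that exposes the base classes in those private modules.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.ensemble._bagging import BaseBagging  # private module path (assumption)
from sklearn.ensemble._forest import BaseForest  # private module path (assumption)
from sklearn.svm import SVR


def ensemble_kind(estimator):
    """Hypothetical helper: report which family a fitted ensemble belongs to."""
    if isinstance(estimator, BaseForest):
        return "forest"
    if isinstance(estimator, BaseBagging):
        return "bagging"
    raise TypeError("expected a forest or a bagging estimator")


# Made-up regression data, only to have something to fit.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X[:, 0] + 0.1 * rng.normal(size=40)

forest = RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y)
bagger = BaggingRegressor(SVR(), n_estimators=5, random_state=0).fit(X, y)

kinds = [ensemble_kind(forest), ensemble_kind(bagger)]

# For bagging estimators, the drawn row indices are available per base estimator.
samples = bagger.estimators_samples_
```

Here `samples` is a list with one index array per base estimator, which is the input the in-bag count calculation needs.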

I believe this PR makes forestci work well with general bagging estimators. However, I would greatly appreciate it if @arokem, @kpolimis, @bhazelton could check my work here. Most importantly, is this sensible? I think I've made the APIs compatible but am I making a mistake in applying Wager's method to general bagging methods (and not exclusively to random forests)?

richford · 2020-12-10T00:56:29Z

Ohh, my apologies for the style changes to the files. I'm using an auto-linter in VSCode that changes lines when I save. They should still be PEP8 compliant. If it's a deal-breaker, I can revert the style changes.

arokem · 2020-12-10T01:04:58Z

Don’t worry about style changes. Will take a look later tonight or tomorrow

arokem · 2020-12-11T06:30:42Z

The code looks fine, but I have not looked closely yet. First, to the matter of principle: can we use this method for bagging estimators that are not forest/tree algorithms?

I've reread the paper. IIUC, there is nothing in section 2 of the paper, which outlines the theory and develops the equations we use, that is specific to trees/forests. For example, equation 5 (implemented here) seems to apply to bagged estimators just as well, as does the bias correction from equation 7 (implemented here). Similarly, the calibration method implemented here seems to apply broadly to "noisy variance estimates". So, I think that this is OK.
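To make the point about equations 5 and 7 concrete, here is a minimal NumPy sketch of the infinitesimal jackknife variance and its bias-corrected version, written so that it only needs in-bag counts and per-estimator predictions, with nothing tree-specific. It is not the forestci implementation: `ij_variance` is a hypothetical name, the toy "estimator" is just a bagged mean, and the `inbag - 1` term assumes each bootstrap sample has the same size as the training set.

```python
import numpy as np


def ij_variance(inbag, preds):
    """Infinitesimal jackknife variance (eq. 5) and bias-corrected form (eq. 7).

    inbag: (n_train, n_estimators) counts of how often each row was drawn.
    preds: (n_test, n_estimators) predictions of each base estimator.
    """
    n_train, n_estimators = inbag.shape
    # Center each test point's predictions around the ensemble mean.
    centered = preds - preds.mean(axis=1, keepdims=True)
    # Covariance between in-bag counts and predictions, per training row.
    cov = (inbag - 1) @ centered.T / n_estimators  # shape (n_train, n_test)
    v_ij = (cov ** 2).sum(axis=0)
    # Monte Carlo bias correction for a finite number of estimators.
    correction = n_train / n_estimators ** 2 * (centered ** 2).sum(axis=1)
    return v_ij, v_ij - correction


# Toy setup: 200 bootstrap replicates of a sample mean over 50 points.
rng = np.random.default_rng(0)
n_train, n_estimators = 50, 200
y = rng.normal(size=n_train)
indices = rng.integers(0, n_train, size=(n_estimators, n_train))
inbag = np.stack(
    [np.bincount(row, minlength=n_train) for row in indices], axis=1
)
mean_preds = (inbag * y[:, None]).sum(axis=0) / n_train
preds = np.vstack([mean_preds, 2 * mean_preds])  # two pretend test points

v_ij, v_ij_unbiased = ij_variance(inbag, preds)
```

The second pretend test point is the first one scaled by 2, so its variance estimate should be 4 times larger, which is a quick sanity check on the implementation.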

That said, I wonder whether @swager would like to weigh in here, since we rely on his work quite heavily.

Alternatively, maybe @nrs02004 wants to save us from ourselves.

I also noticed that the original R implementation of this method has been deprecated in favor of what seems to be a more general software package that includes methods for variance estimation that are related, but not identical, to those we implemented here (see section 4.1 in their paper). I didn't see anything about this newer paper that suggests we should abandon the method implemented here, but it's a pretty big paper 😄

arokem

OK - now I've also looked at the code. IIUC, the relevant changes (i.e., non-formatting changes) are in calc_inbag. Is that correct? That all looks good, but I wonder whether we really need pandas to do the operation you are using pandas for. Not that I mind a pandas dependency all that much.

arokem · 2020-12-11T06:32:18Z

forestci/forestci.py

+            pd.DataFrame(index=index, data=data) for index, data in samples_and_counts
+        ]
+
+        return pd.concat(dataframes, axis="columns").fillna(0).astype(np.int).values


Am I correct that this is the only place that pandas is used? Is it here just to find nan values?

You're correct that was the only pandas dependency. I was using it for fillna and then also for concat to merge on the sample indices. But I just pushed a commit that gets rid of the pandas dependency.
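For readers wondering how the `concat` and `fillna` steps can be done without pandas, one possible approach is `np.bincount` with `minlength`, which returns a zero count for rows that were never drawn. The sketch below is an assumption about the general technique, not a copy of the commit: `inbag_from_sample_indices` is a hypothetical name, and the index arrays are a toy stand-in for what `.estimators_samples_` provides.

```python
import numpy as np


def inbag_from_sample_indices(sample_indices, n_samples):
    """Count how often each training row was drawn for each estimator."""
    inbag = np.zeros((n_samples, len(sample_indices)), dtype=int)
    for col, idx in enumerate(sample_indices):
        # minlength pads with zeros for rows that never appear,
        # which plays the role of fillna(0) in the pandas version.
        inbag[:, col] = np.bincount(idx, minlength=n_samples)
    return inbag


# Toy stand-in for per-estimator bootstrap indices (3 estimators, 5 rows).
sample_indices = [
    np.array([0, 0, 1, 3, 4]),
    np.array([2, 2, 2, 3, 4]),
    np.array([0, 1, 2, 3, 4]),
]
inbag = inbag_from_sample_indices(sample_indices, n_samples=5)
```

Each column of `inbag` sums to the number of draws for that estimator, and the result is already an integer array, so no `astype(np.int)` cast is needed.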

arokem · 2020-12-11T06:36:46Z

I too have no idea why the documentation build is failing

richford · 2020-12-11T15:54:26Z

I removed the pandas dependency. The other non-style change is here but that doesn't depend on pandas.

richford · 2020-12-11T15:56:22Z

Is the doc build failure another example of numpy/numpydoc#268?

arokem · 2020-12-17T01:05:37Z

After a bit more discussion, I am inclined to go ahead and merge this. We can fix up the docs on a separate PR.

richford added 3 commits December 9, 2020 16:23

ENH: Allow BaggingClassifier and BaggingRegressor estimators

4c95aa5

Add plot_mpg_svr.py example

9793b18

Add pandas requirement

af6f88a

Refer to the openml dataset by its id, to remove ambiguity

47878d0

See: https://scikit-learn.org/stable/datasets/index.html#openml

arokem reviewed Dec 11, 2020

richford added 2 commits December 11, 2020 07:51

DEP: Remove pandas dependence

42ed91e

Merge branch 'enh/bagging' of github.com:richford/forest-confidence-i…

52947e7

…nterval into enh/bagging

arokem merged commit 7f36ef7 into scikit-learn-contrib:master Dec 17, 2020

arokem mentioned this pull request Dec 17, 2020

Overhaul documentation #101

Merged
