New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Allow forestci to work on general Bagging estimators #100
Conversation
Ohh, my apologies for the style changes to the files. I'm using an auto-linter in VSCode that changes lines when I save. They should still be PEP8 compliant. If it's a deal-breaker, I can revert the style changes. |
Don’t worry about style changes. Will take a look later tonight or tomorrow
…On Wed, Dec 9, 2020 at 4:56 PM Adam Richie-Halford ***@***.***> wrote:
Ohh, my apologies for the style changes to the files. I'm using an
auto-linter in VSCode that changes lines when I save. They should still be
PEP8 compliant. If it's a deal-breaker, I can revert the style changes.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#100 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAA46NRAAWYE3GQ2BJCISWTSUAMEXANCNFSM4UUIYJ6Q>
.
|
The code looks fine, but I have not looked closely yet. First, to the matter of principle: can we use this method for bagging estimators that are not forest/tree algorithms? I've reread the paper. IIUC, there is nothing in section 2 of the paper, which outlines the theory and develops the equations we use, that is specific to trees/forests. For example, equation 5 (implemented here seems to apply to bagged estimators just as well, as does the bias correction from equation 7 (implemented here. Similarly, the calibration method implemented here seems to apply broadly to "noisy variance estimates". So, I think that this is OK. That said, I wonder whether @swager would like to weigh in here, since we rely on his work quite heavily. Alternatively, maybe @nrs02004 wants to save us from ourselves. I also noticed that the original R implementation of this method has been deprecated in favor of what seems to be a more general software package that includes methods for variance estimation that are related, but not identical to these we implemented here (see section 4.1 in their paper). I didn't see anything about this newer paper that suggests we should abandon the method implemented here, but it's a pretty big paper 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - now I've also looked at the code. IIUC, the relevant changes (i.e., non-formatting changes) are in calc_inbag
. Is that correct? That all looks good, but I wonder whether we really need pandas to do the operation you are using pandas for. Not that I mind a pandas dependency all that much.
forestci/forestci.py
Outdated
pd.DataFrame(index=index, data=data) for index, data in samples_and_counts | ||
] | ||
|
||
return pd.concat(dataframes, axis="columns").fillna(0).astype(np.int).values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I correct that this is the only place that pandas is used? Is it here just to find nan values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're correct that was the only pandas dependency. I was using it for fillna
and then also for concat
to merge on the sample indices. But I just pushed a commit that gets rid of the pandas dependency.
I too have no idea why the documentation build is failing |
…nterval into enh/bagging
I removed the pandas dependency. The other non-style change is here but that doesn't depend on pandas. |
Is the doc build failure another example of numpy/numpydoc#268 |
After a bit more discussion, I am inclined to go ahead and merge this. We can fix up the docs on a separate PR. |
Resolves #99
This PR adds functionality to
forestci.py
to inspect the "forest" estimator to see if it is a random forest (i.e. inherits fromBaseForest
) or a bagging estimator (i.e. inherits fromBaseBagging
). There are some differences in the private attributes of these classes so the distinction is necessary. When the estimator is a random forest, all of the existing code applies. When it inherits fromBaseBagging
, we use the.estimators_samples_
attribute for thecalc_inbag
function. And when calibrating insiderandom_forest_error
, it is also necessary to randomly permute the_seeds
array attribute ofnew_forest
. I've also added some tests for these new features.I believe this PR makes forestci work well with general bagging estimators. However, I would greatly appreciate it if @arokem, @kpolimis, @bhazelton could check my work here. Most importantly, is this sensible? I think I've made the APIs compatible but am I making a mistake in applying Wager's method to general bagging methods (and not exclusively to random forests)?