Wrong equation for standard error of the mean #17815

Open
PavloFesenko opened this issue Jul 2, 2020 · 3 comments

Comments

@PavloFesenko

PavloFesenko commented Jul 2, 2020

Describe the issue linked to the documentation

In the User Guide for permutation importance, the second code block uses 2 * r.importances_std[i] to estimate a 95% confidence interval. As far as I understand, it should use 2 * the standard error of the mean instead of 2 * the standard deviation. The standard error of the mean is the standard deviation divided by the square root of the sample size, which is r.importances.shape[1] in this case. Moreover, NumPy's standard deviation uses ddof=0 (delta degrees of freedom) by default, while the most common definition of the sample standard deviation uses ddof=1 ("corrected sample standard deviation").
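To make the quantities above concrete, here is a minimal sketch. The `importances` array is a made-up stand-in for `r.importances` (shape `(n_features, n_repeats)`), not output from a real model; only `np.std` and its `ddof` parameter are real API.

```python
import numpy as np

# Hypothetical stand-in for r.importances: one row per feature,
# one column per permutation repeat.
importances = np.array([
    [0.20, 0.22, 0.18, 0.21, 0.19],
    [0.01, -0.02, 0.03, 0.00, -0.01],
])
n_repeats = importances.shape[1]

# NumPy's default (ddof=0) divides by n; ddof=1 divides by n - 1.
std_ddof0 = np.std(importances, axis=1)
std_ddof1 = np.std(importances, axis=1, ddof=1)

# Standard error of the mean: corrected std divided by sqrt(n).
sem = std_ddof1 / np.sqrt(n_repeats)

# The same value can be recovered from a ddof=0 std, because
# std_ddof1 = std_ddof0 * sqrt(n / (n - 1)).
sem_from_ddof0 = std_ddof0 / np.sqrt(n_repeats - 1)

print("std (ddof=0):", std_ddof0)
print("std (ddof=1):", std_ddof1)
print("sem:         ", sem)
```

With only a handful of repeats the ddof correction is noticeable, while the division by the square root of the sample size shrinks the interval much more as the number of repeats grows.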

Suggest a potential alternative/fix

  1. If the intention was to return the standard error of the mean (which makes more sense to me), then I would recommend replacing the returned value importances_std with importances_sem = scipy.stats.sem(importances, axis=1). Note that scipy.stats.sem() uses ddof=1 by default, as mentioned above.

  2. If the intention was to return the standard deviation, then the expression in the documentation should be 2 * r.importances_std[i] / np.sqrt(r.importances.shape[1]), and the delta degrees of freedom for importances_std should be set to 1, like this: importances_std = np.std(importances, axis=1, ddof=1).
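The two alternatives lead to the same filter, which can be compared against the current heuristic in a small sketch. The `importances` values below are invented for illustration, and the variable names `keep_std` and `keep_sem` are hypothetical; the std-based filter uses NumPy's default ddof=0, as described above.

```python
import numpy as np

# Hypothetical stand-in for r.importances, shape (n_features, n_repeats).
importances = np.array([
    # noisy, but consistently positive on average
    [0.05, 0.11, -0.01, 0.08, 0.02, 0.07, 0.03, 0.09, 0.01, 0.05],
    # strong feature with little spread
    [0.300, 0.310, 0.290, 0.305, 0.295, 0.300, 0.310, 0.290, 0.300, 0.300],
    # pure noise around zero
    [0.010, -0.020, 0.015, -0.010, 0.005, -0.015, 0.020, -0.005, 0.000, 0.000],
])
n_repeats = importances.shape[1]
mean = importances.mean(axis=1)

# Current heuristic: mean - 2 * std > 0, with std computed using ddof=0.
std = np.std(importances, axis=1)
keep_std = mean - 2 * std > 0

# Proposed alternative: mean - 2 * SEM > 0, with SEM from the ddof=1 std.
sem = np.std(importances, axis=1, ddof=1) / np.sqrt(n_repeats)
keep_sem = mean - 2 * sem > 0

for i in range(importances.shape[0]):
    print(f"feature {i}: mean={mean[i]:.3f} std={std[i]:.3f} "
          f"sem={sem[i]:.3f} keep_std={keep_std[i]} keep_sem={keep_sem[i]}")
```

In this toy data the first feature is dropped by the standard deviation filter but kept by the standard error filter, while the noise feature is dropped by both.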

@adrinjalali
Member

Ping @thomasjpfan

@thomasjpfan
Member

The User Guide update was added in #16402

> to estimate 95% confidence interval.

In the case of the user guide, it does not explicitly state that it is estimating a confidence interval. It is using the 2 * standard deviation as a heuristic to filter out features. One standard deviation is shown when printing out the mean +/- std.

> As far as I understand, instead of 2 * standard deviation it should be 2 * standard error of the mean. The standard error of the mean is calculated as the standard deviation divided by the square root of the sample size which is r.importances.shape[1] in this case.

It may be worth it to switch to using standard error instead of standard deviation. This would extend to *SearchCVs.

> Moreover, the difference for degrees of freedom in the Numpy standard deviation by default is ddof=0 while the most common definition of the standard deviation uses ddof=1 ("corrected sample standard deviation").

I would agree with using ddof=1 to have an unbiased estimate. (Although I do not see this explicitly being done in ML).

Related to #1940

@PavloFesenko
Author

> In the case of the user guide, it does not explicitly state that it is estimating a confidence interval. It is using the 2 * standard deviation as a heuristic to filter out features. One standard deviation is shown when printing out the mean +/- std.

I agree that a 95% confidence interval isn't explicitly mentioned, but in my opinion the heuristic of 2 * standard deviation is quite illogical here. Mean +/- 2 * standard deviation gives the interval that contains about 95% of the individual values from the distribution. Mean +/- 2 * standard error of the mean gives the interval that contains about 95% of the mean values from the sampling distribution. The latter makes sense for filtering out features, because the mean should be statistically significantly different from 0; otherwise the feature doesn't have any impact on the model from a statistical point of view.
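The distinction between the two intervals can be checked with a quick simulation. This is only a sketch under stated assumptions: importances are drawn from a normal distribution with made-up parameters `mu` and `sigma`, which real permutation importances need not follow.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.05, 0.04            # hypothetical true mean and spread
n_repeats, n_trials = 30, 10_000  # repeats per experiment, experiments

# Each row is one simulated experiment with n_repeats importance values.
samples = rng.normal(mu, sigma, size=(n_trials, n_repeats))
sample_means = samples.mean(axis=1)

# Share of individual values inside mu +/- 2 * std.
frac_values = np.mean(np.abs(samples - mu) <= 2 * sigma)

# Share of sample means inside mu +/- 2 * SEM.
sem = sigma / np.sqrt(n_repeats)
frac_means = np.mean(np.abs(sample_means - mu) <= 2 * sem)

# Share of sample means inside the much wider mu +/- 2 * std interval.
frac_means_wide = np.mean(np.abs(sample_means - mu) <= 2 * sigma)

print(f"values within 2 std: {frac_values:.3f}")
print(f"means within 2 sem:  {frac_means:.3f}")
print(f"means within 2 std:  {frac_means_wide:.3f}")
```

Under these assumptions both of the first two fractions come out close to 95%, while essentially every sample mean falls inside the 2 * standard deviation interval, which is why that interval is a very conservative test of whether the mean differs from 0.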

Thanks for considering it, I really appreciate your feedback! 😊
