Wrong equation for standard error of the mean #17815
Comments
Ping @thomasjpfan
The User Guide update was added in #16402
In the case of the user guide, it does not explicitly state that it is estimating a confidence interval. It uses 2 * standard deviation as a heuristic to filter out features. One standard deviation is shown when printing out the
It may be worth it to switch to using standard error instead of standard deviation. This would extend to
I would agree with using Related to #1940
I agree that a 95% confidence interval isn't explicitly mentioned, but in my opinion the heuristic of 2 * standard deviation is quite illogical here. Mean +/- 2 * standard deviation gives the interval that contains about 95% of the individual values. Mean +/- 2 * standard error of the mean gives the interval that contains about 95% of the sample means from the sampling distribution. The latter makes sense for filtering out features, because the mean importance should be statistically significantly different from 0; otherwise the feature has no impact on the model from a statistical point of view. Thanks for considering it, I really appreciate your feedback! 😊
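To make the distinction above concrete, here is a small simulation sketch using only the Python standard library. It assumes normally distributed values; the population parameters, the sample size, and the number of trials are arbitrary choices for illustration and do not come from scikit-learn.

```python
import math
import random
import statistics

random.seed(0)          # fixed seed so the sketch is repeatable
MU, SIGMA = 0.0, 1.0    # assumed population mean and standard deviation
N = 25                  # assumed number of values per sample
TRIALS = 2000           # assumed number of repeated samples

# Share of individual values inside mean +/- 2 * standard deviation.
values = [random.gauss(MU, SIGMA) for _ in range(10_000)]
share_values = sum(abs(v - MU) <= 2 * SIGMA for v in values) / len(values)

# Share of sample means inside mean +/- 2 * standard error of the mean.
sem = SIGMA / math.sqrt(N)
means = [
    statistics.fmean([random.gauss(MU, SIGMA) for _ in range(N)])
    for _ in range(TRIALS)
]
share_means = sum(abs(m - MU) <= 2 * sem for m in means) / TRIALS

# The +/- 2 * standard deviation band is far too wide for sample means.
share_means_wide = sum(abs(m - MU) <= 2 * SIGMA for m in means) / TRIALS

print(f"values within 2 * std: {share_values:.3f}")
print(f"means within 2 * sem:  {share_means:.3f}")
print(f"means within 2 * std:  {share_means_wide:.3f}")
```

Both of the first two shares should land near 0.95, while nearly every sample mean falls inside the wider 2 * standard deviation band, which is why that band is a weak test of whether a mean differs from 0.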
Describe the issue linked to the documentation
In the User Guide for permutation importance, the second code block uses `2 * r.importances_std[i]` to estimate a 95% confidence interval. As far as I understand, it should use 2 * standard error of the mean instead of 2 * standard deviation. The standard error of the mean is the standard deviation divided by the square root of the sample size, which is `r.importances.shape[1]` in this case. Moreover, NumPy's standard deviation uses `ddof=0` by default, while the most common definition of the sample standard deviation uses `ddof=1` ("corrected sample standard deviation").
Suggest a potential alternative/fix
If the intention was to return the standard error of the mean (which makes more sense to me), then I would recommend replacing the returned value `importances_std` with `importances_sem = scipy.stats.sem(importances, axis=1)`. Note that `scipy.stats.sem()` uses `ddof=1` by default, as mentioned above.
If the intention was to return the standard deviation, then the expression in the documentation should be `2 * r.importances_std[i] / np.sqrt(r.importances.shape[1])`, and the degrees of freedom in `importances_std` should be adjusted like this: `importances_std = np.std(importances, axis=1, ddof=1)`.
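As a rough illustration of the two options, the sketch below computes both quantities with NumPy on a made-up array of shape `(n_features, n_repeats)`. The helper name `importances_sem` and the numbers are hypothetical and are not part of scikit-learn; with `ddof=1` the helper should agree with what `scipy.stats.sem(importances, axis=1)` returns.

```python
import numpy as np


def importances_sem(importances, ddof=1):
    """Standard error of the mean for each feature (row) of a 2-D array."""
    importances = np.asarray(importances, dtype=float)
    n_repeats = importances.shape[1]
    return np.std(importances, axis=1, ddof=ddof) / np.sqrt(n_repeats)


# Hypothetical importances: 2 features, 5 permutation repeats each.
importances = np.array([
    [0.10, 0.20, 0.30, 0.40, 0.50],
    [-0.02, 0.01, 0.00, 0.03, -0.02],
])

means = importances.mean(axis=1)
std0 = np.std(importances, axis=1)           # NumPy default, ddof=0
std1 = np.std(importances, axis=1, ddof=1)   # corrected sample standard deviation
sem = importances_sem(importances)           # std1 / sqrt(n_repeats)

# The filtering heuristic with each measure of spread.
keep_std = means - 2 * std1 > 0
keep_sem = means - 2 * sem > 0

print("kept with 2 * std:", keep_std)
print("kept with 2 * sem:", keep_sem)
```

In this made-up data the first feature is dropped by the 2 * standard deviation rule but kept by the 2 * standard error rule, while the second feature, whose mean is about 0, is dropped by both.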