Wrong equation for standard error of the mean #17815

PavloFesenko · 2020-07-02T14:49:23Z

Describe the issue linked to the documentation

In the User Guide for permutation importance, the second code block uses 2 * r.importances_std[i] to estimate a 95% confidence interval. As far as I understand, it should use 2 * standard error of the mean instead of 2 * standard deviation. The standard error of the mean is the standard deviation divided by the square root of the sample size, which is r.importances.shape[1] in this case. Moreover, NumPy's standard deviation defaults to ddof=0 (delta degrees of freedom), while the most common definition of the standard deviation uses ddof=1 ("corrected sample standard deviation").
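For reference, the quantities discussed above can be written out as follows. This is only a restatement of the standard definitions; the symbols are introduced here for illustration: x_j is one feature's importance in the j-th repeat, n is the number of repeats (r.importances.shape[1]), and the bar denotes the mean over repeats.

```latex
% Uncorrected standard deviation (NumPy default, ddof=0)
s_0 = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left(x_j - \bar{x}\right)^2}

% Corrected sample standard deviation (ddof=1)
s_1 = \sqrt{\frac{1}{n-1} \sum_{j=1}^{n} \left(x_j - \bar{x}\right)^2}

% Standard error of the mean, based on the corrected estimate
\mathrm{SEM} = \frac{s_1}{\sqrt{n}} = \frac{s_0}{\sqrt{n-1}}
```

The last equality follows from s_1 = s_0 * sqrt(n / (n - 1)).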

Suggest a potential alternative/fix

If the intention was to return the standard error of the mean (which makes more sense to me), then I would recommend replacing the returned value importances_std with importances_sem=scipy.stats.sem(importances, axis=1). Note that scipy.stats.sem() uses ddof=1 by default, as mentioned above.
If the intention was to return the standard deviation, then the expression in the documentation should be 2 * r.importances_std[i] / np.sqrt(r.importances.shape[1]), and the degrees of freedom in importances_std should be adjusted like this: importances_std=np.std(importances, axis=1, ddof=1).
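A minimal NumPy-only sketch of the two alternatives above, under stated assumptions: the array below is made-up data that only mimics the (n_features, n_repeats) layout of r.importances, and the variable names are illustrative rather than part of scikit-learn's API. With its default ddof=1, scipy.stats.sem(importances, axis=1) should match sem_from_raw.

```python
import numpy as np

# Hypothetical raw importances, shape (n_features, n_repeats).
# Made-up numbers, laid out like r.importances.
importances = np.array([
    [0.20, 0.24, 0.18, 0.22],
    [0.01, -0.02, 0.03, 0.00],
])
n_repeats = importances.shape[1]

# What np.std gives by default (ddof=0).
std_ddof0 = np.std(importances, axis=1)

# Corrected sample standard deviation (ddof=1).
std_ddof1 = np.std(importances, axis=1, ddof=1)

# Alternative 1: compute the standard error of the mean from the raw values.
sem_from_raw = std_ddof1 / np.sqrt(n_repeats)

# Alternative 2: keep a standard deviation and rescale it where it is used.
# If only the ddof=0 value is available, dividing by sqrt(n - 1)
# yields the same standard error.
sem_from_ddof0 = std_ddof0 / np.sqrt(n_repeats - 1)

print("std (ddof=0):", std_ddof0)
print("std (ddof=1):", std_ddof1)
print("SEM:         ", sem_from_raw)
```

Both routes give the same standard error; the choice is mainly about which quantity the API returns and which one the documentation applies the factor of 2 to.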


adrinjalali · 2020-07-05T12:30:18Z

Ping @thomasjpfan

thomasjpfan · 2020-07-05T16:27:11Z

The User Guide update was added in #16402

> to estimate 95% confidence interval.

In the case of the user guide, it does not explicitly state that it is estimating a confidence interval. It is using the 2 * standard deviation as a heuristic to filter out features. One standard deviation is shown when printing out the mean +/- std.

> As far as I understand, instead of 2 * standard deviation it should be 2 * standard error of the mean. The standard error of the mean is calculated as the standard deviation divided by the square root of the sample size which is r.importances.shape[1] in this case.

It may be worth it to switch to using standard error instead of standard deviation. This would extend to *SearchCVs.

> Moreover, the difference for degrees of freedom in the Numpy standard deviation by default is ddof=0 while the most common definition of the standard deviation uses ddof=1 ("corrected sample standard deviation").

I would agree with using ddof=1 to have an unbiased estimate. (Although I do not see this explicitly being done in ML).

Related to #1940

PavloFesenko · 2020-07-06T06:42:32Z

> In the case of the user guide, it does not explicitly state that it is estimating a confidence interval. It is using the 2 * standard deviation as a heuristic to filter out features. One standard deviation is shown when printing out the mean +/- std.

I agree that a 95% confidence interval isn't explicitly mentioned, but in my opinion the heuristic of 2 * standard deviation is quite illogical here. Mean +/- 2 * standard deviation gives the interval that contains roughly 95% of the individual values (assuming they are approximately normally distributed). Mean +/- 2 * standard error of the mean gives the interval that contains roughly 95% of the mean values from the sampling distribution. The latter makes sense for filtering out features, because the mean should be significantly different from 0; otherwise the feature has no impact on the model from a statistical point of view.
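To make the difference between the two filters concrete, here is a small sketch on synthetic numbers. Everything in it is an assumption made for illustration: the three features, their means and spreads, and the 30 repeats are invented, and the values are built deterministically (alternating above and below the mean) rather than taken from a real permutation_importance run.

```python
import numpy as np

# Three hypothetical features with 30 repeats each.
# Values alternate between mean - scale and mean + scale,
# so each row has exactly the chosen mean.
means = np.array([0.30, 0.05, 0.00])
scales = np.array([0.05, 0.04, 0.04])
pattern = np.tile([-1.0, 1.0], 15)  # 30 repeats
importances = means[:, None] + scales[:, None] * pattern[None, :]

n_repeats = importances.shape[1]
imp_mean = importances.mean(axis=1)
imp_std = importances.std(axis=1, ddof=1)
imp_sem = imp_std / np.sqrt(n_repeats)

# Heuristic as in the user guide: mean - 2 * std > 0.
keep_std = imp_mean - 2 * imp_std > 0

# Variant discussed here: mean - 2 * SEM > 0.
keep_sem = imp_mean - 2 * imp_sem > 0

for i in range(len(means)):
    print(
        f"feature {i}: {imp_mean[i]:.3f} +/- {imp_std[i]:.3f} "
        f"(SEM {imp_sem[i]:.4f}) "
        f"keep_std={keep_std[i]} keep_sem={keep_sem[i]}"
    )
```

In this constructed example the middle feature has a small but consistent positive mean: it is dropped by the 2 * standard deviation rule and kept by the 2 * standard error rule, while the zero-mean feature is dropped by both.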

Thanks for considering it, I really appreciate your feedback! 😊

PavloFesenko added the Documentation label Jul 2, 2020

adrinjalali added the module:inspection label Jul 9, 2020
