Misconception of the median absolute deviation in stats? #11090
Comments
hi @tcjansen, in statistics, for better or worse, many things are compared against the Gaussian distribution. Here is a reference: I agree that this should perhaps be better documented to make it more clear to the user. Line 2767 in f3ef128
Are you interested in opening a pull request to clarify the documentation? |
@rlucas7 In my opinion, the documentation is not the place to correct for a misleading function name. With the default |
@u55 Yes, these are my thoughts exactly. |
@u55 and @tcjansen thanks for your comments. The original proposal for the function was for consistency with an existing function in the The name is overloaded in statistics (as are many names) to describe the loss function as well as the robust measure of scale which explains the confusion. The implementation here is for using a scaling factor for the robust measure of scale for a Gaussian as the default and using the value 1 as you both propose would be for the loss function calculation, or perhaps a robust measure of scale for some other probability distribution where the constant 1 provides a consistent estimator. You can read the original proposal: And you can read the pull request here: And the corresponding function in the A couple of concerns here would be:
Given the consideration of 1 and 2, perhaps an alternative is to make the scale argument a required rather than an optional value and not have a default? This would remove the ambiguity and confusion on the part of the end user. |
iqr has the options 'raw' (the default) and 'normal'. Whether one scaling or the other is more convenient really depends on the context and usage. For scipy it's ambiguous; for scipy.stats, the normal-distribution-scaled mad is more appropriate if these are considered as dispersion measures. However, I think it would be better if iqr and mad follow the same pattern for the default. |
@josef-pkt thanks for your comments, they are very helpful. I agree it is ambiguous w/scipy relative to other related packages.
Good call out, I hadn't considered the defaults of that method. I see what you mean:
I agree that the interface here should be consistent with existing methods inside the package that are also robust measures of scale. So the change would be to make the https://github.com/scipy/scipy/blob/master/scipy/stats/stats.py#L2736-L2740 @tcjansen and @u55 would either of you be interested in opening a pull request to implement the change? |
I disagree with this justification for the current scipy API for the following reasons:
Apart from changing the default value of |
mad is in In statistics, I find unscaled iqr and mad pretty meaningless or useless, because the raw values don't have a scale that can be compared to anything else. |
@josef-pkt I'm not arguing that the MAD doesn't have a statistical interpretation. I am simply arguing that multiplying it by a magic number means that it is not the MAD anymore, so calling it the |
In R, https://www.rdocumentation.org/packages/stats/versions/3.6.1/topics/mad is scaled. "Although practicality beats purity." Zen 9 I like the iqr way of specifying "normal" It's still a We have a lot of "magic" numbers in (Aside: This reminds me of some function in scipy.special, where it also wasn't obvious whether the functions were scaled or not and, if scaled, by what. This was a long time ago, when the scipy.special docs were half a sentence per function. Is there a |
Most likely this needs a replacement statsmodels -> R i.e. the function looks closer to the one in R than the one in statsmodels, except default behavior is the same in all 3. |
Another comparison data point: in MATLAB, the mad function is unscaled (although it returns the Mean or Median Absolute Deviation depending on the value of an optional parameter). I like the NIST/Dataplot usage of

    import scipy.stats
    from numpy import abs, mean, median, pi, sqrt

    def mad(x):
        """Median Absolute Deviation"""
        return median(abs(x - median(x)))

    def nmad(x):
        """Normalized Median Absolute Deviation"""
        return mad(x) / scipy.stats.norm.ppf(3/4.)  # approximately mad(x)/0.6745

"Explicit is better than implicit." Zen 2

One could even include:

    def aad(x):
        """Average Absolute Deviation"""
        return mean(abs(x - mean(x)))

    def naad(x):
        """Normalized Average Absolute Deviation"""
        return aad(x) / sqrt(2/pi)  # approximately aad(x)/0.79788

although I admit that |
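To see why the two normalizing constants in the comment above are chosen, here is a small simulation using only the Python standard library. It is a sketch under assumptions: the sample size, seed, and `sigma` are arbitrary illustration values, and the variable names are not from any library. For Gaussian data, both normalized quantities should come out close to the true standard deviation, while the raw MAD is noticeably smaller.

```python
import random
from math import pi, sqrt
from statistics import NormalDist, mean, median

rng = random.Random(0)  # fixed seed so the run is repeatable
sigma = 2.0             # true standard deviation of the simulated data
sample = [rng.gauss(0.0, sigma) for _ in range(20000)]

# Median absolute deviation and its normal-consistent rescaling.
med = median(sample)
mad = median(abs(v - med) for v in sample)
nmad = mad / NormalDist().inv_cdf(0.75)  # divide by about 0.6745

# Average absolute deviation and its normal-consistent rescaling.
avg = mean(sample)
aad = mean(abs(v - avg) for v in sample)
naad = aad / sqrt(2 / pi)                # divide by about 0.79788

# nmad and naad should both land near sigma; mad alone should not.
print(mad, nmad, naad)
```

The raw values are still useful as distribution-free summaries; the rescaling only matters when the result is meant to be read as an estimate of a Gaussian standard deviation.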
To add even more confusion, astropy has |
Okay, so this is a bit of a mess, every package/language does it differently. For Changing behavior of So instead I propose:
|
@rgommers thanks for the guidance here. I think I follow what you mean. Here is how I understand it:
Note that the tests in the linked PR are updated for the I'm not super clear on what you mean on how to handle deprecation of |
…11431)

* The new function `median_abs_deviation` replaces the deprecated `median_absolute_deviation`. The `scale` parameter of the new function divides the basic MAD calculation; the default value for `scale` is 1. For convenience, `scale` may be the string "normal", which uses the value `special.ndtri(0.75)` for the scale.
* Deprecate `median_absolute_deviation`. There are issues with the `scale` argument that are difficult to fix with a backward-compatible change to the function's API:
  * To compute the result, the basic MAD is *multiplied* by `scale`, unlike in `iqr`, where the basic result is divided by `scale`.
  * The default value for `scale`, 1.4826, is an approximation to the reciprocal of the normal quantile function at 0.75, so it was unnecessarily imprecise. The main problem with the default, though, was that it did not match users' expectations. Most users will likely expect the function to compute the unscaled MAD, which corresponds to `scale` equal to 1.
* In `iqr`, the default value of `scale` is changed to 1.0, instead of the string "raw". The use of the string "raw" to mean 1.0 is deprecated.

Closes gh-11090.

Co-authored-by: Warren Weckesser <warren.weckesser@gmail.com>
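The change in convention described in the commit message above (divide by `scale`, with "normal" as a shorthand) can be sketched in plain Python. This is an illustration only, not SciPy's implementation: `mad_sketch` is a hypothetical name, and the standard library's `NormalDist().inv_cdf(0.75)` stands in for `special.ndtri(0.75)`.

```python
from statistics import NormalDist, median

def mad_sketch(values, scale=1.0):
    """Illustrative only: the basic MAD divided by `scale`."""
    if scale == "normal":
        # Standard normal quantile at 0.75, about 0.6745.
        scale = NormalDist().inv_cdf(0.75)
    center = median(values)
    return median(abs(v - center) for v in values) / scale

data = [2.0, 4.0, 6.0, 8.0, 50.0]

unscaled = mad_sketch(data)                # default scale of 1: the plain MAD
normal = mad_sketch(data, scale="normal")  # MAD divided by about 0.6745
old_style = 1.4826 * unscaled              # deprecated convention: multiply

print(unscaled, normal, old_style)
```

Dividing by the exact quantile and multiplying by the rounded constant 1.4826 agree only to a few decimal places, which is the imprecision the commit message mentions.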
As far as I know, the median absolute deviation (MAD) is simply defined to be
MAD = median(|x - median(x)|)
However, `stats.median_absolute_deviation` is misleading in that it returns `1.4826 * MAD`, which is really an approximation for the standard deviation of a Gaussian-distributed data set, not the MAD itself.
I would suggest setting the default scalar to 1 (currently 1.4826) to avoid further confusion.
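The distinction raised in this report can be checked with a few lines of standard-library Python. This is a minimal sketch, not SciPy's code; `raw_mad` and `NORMAL_FACTOR` are names made up for the illustration, and the data are arbitrary.

```python
from statistics import NormalDist, median

def raw_mad(values):
    """Unscaled median absolute deviation: median(|x - median(x)|)."""
    center = median(values)
    return median(abs(v - center) for v in values)

# Reciprocal of the standard normal quantile at 0.75; this is where the
# constant 1.4826 comes from.
NORMAL_FACTOR = 1.0 / NormalDist().inv_cdf(0.75)

data = [1, 2, 3, 4, 100]

print(raw_mad(data))                  # the MAD as defined above
print(NORMAL_FACTOR * raw_mad(data))  # estimate of sigma under normality
```

The first number is the MAD itself; the second is a different quantity that coincides with the standard deviation only in expectation, and only for Gaussian data.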