Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Misconception of the median absolute deviation in stats? #11090

Closed
tcjansen opened this issue Nov 19, 2019 · 15 comments · Fixed by #11431
Closed

Misconception of the median absolute deviation in stats? #11090

tcjansen opened this issue Nov 19, 2019 · 15 comments · Fixed by #11431
Labels
Documentation Issues related to the SciPy documentation. Also check https://github.com/scipy/scipy.org scipy.stats
Milestone

Comments

@tcjansen
Copy link

As far as I know, the median absolute deviation (MAD) is simply defined to be
MAD = median(|x - median(x)|)
However, stats.median_absolute_deviation is misleading in that it returns 1.4826 * MAD, which is really an approximation for the standard deviation of a gaussian distributed data set, not the MAD itself.
I would suggest setting the default scalar to 1 (currently 1.4826) to avoid further confusion..!

@rlucas7
Copy link
Member

rlucas7 commented Nov 19, 2019

hi @tcjansen, in statistics, for better or worse, many things are compared against the Gaussian distribution. Here is a reference:
https://en.wikipedia.org/wiki/Median_absolute_deviation#Relation_to_standard_deviation

I agree that this should perhaps be better documented to make it more clear to the user.
You can override the default value by changing the scale= argument from the default.
Here is a link to where that exists in the code:

def median_absolute_deviation(x, axis=0, center=np.median, scale=1.4826,

Are you interested in opening a pull request to clarify the documentation?

@peterbell10 peterbell10 added Documentation Issues related to the SciPy documentation. Also check https://github.com/scipy/scipy.org scipy.stats labels Nov 19, 2019
@u55
Copy link

u55 commented Nov 21, 2019

@rlucas7 In my opinion, the documentation is not the place to correct for a misleading function name. With the default scale=1.4826 value, this is clearly not the MAD! I shouldn't have to read the documentation to figure out if a function named median_absolute_deviation actually returns the median absolute deviation. Either the default value should be changed to scale=1.0, or the name of the function should be changed to median_absolute_deviation_approximation_to_gaussian_std, which seems silly to me.

@tcjansen
Copy link
Author

@u55 Yes, these are my thoughts exactly.

@rlucas7
Copy link
Member

rlucas7 commented Nov 23, 2019

@u55 and @tcjansen thanks for your comments. The original proposal for the function was for consistency with an existing function in the statsmodels package which is used to provide a robust standard error estimate. One of the statsmodels maintainers noted in the issue (linked beneath) that there were other packages using statsmodels as a dependency solely for the mad function, the abbreviation used in that package.

The name is overloaded in statistics (as are many names) to describe the loss function as well as the robust measure of scale which explains the confusion. The implementation here is for using a scaling factor for the robust measure of scale for a Gaussian as the default and using the value 1 as you both propose would be for the loss function calculation, or perhaps a robust measure of scale for some other probability distribution where the constant 1 provides a consistent estimator.

You can read the original proposal:
#9635

And you can read the pull request here:
#9637

And the corresponding function in the statsmodels package here:
https://github.com/statsmodels/statsmodels/blob/fa16d00b8e66ee75a8045856ece7e7ac449631db/statsmodels/robust/scale.py#L18-L49
which describes the similar functionality.

A couple of concerns here would be:

  1. Making this change would have different defaults from statsmodels and might confuse other users who are using the method as a robust scale estimate. Or who are familiar with the statsmodels implementation.
  2. The function is already out in 1.3.X release so systems which took a dependency on that default would receive a breaking change if there was a switch.
  3. I think the usual approach is to issue a warning for 2 or more releases and then make the change. At least that is what we usually do for deprecations.

Given the consideration of 1 and 2, perhaps an alternative is to make the scale argument a required rather than an optional value and not have a default?

This would remove the ambiguity and confusion on the part of the end user.
This would also give us a chance to change the name, I regret the full spelling as it is a lot of keystrokes. I'd rather have a name like med_abs_dev() instead.

@josef-pkt
Copy link
Member

iqr has options 'raw' as default and 'normal'

It really depends on the context and usage whether one or the other scaling is more convenient.
As dispersion measures or robust scale measures it is convenient to have var, iqr and mad on the same scale. That's the main usage in statsmodels robust, and the defaults are for that use case.

For scipy it's ambiguous, for scipy.stats normal distribution scaled mad is more appropriate if they are considered as dispersion measures.
If it were in numpy, then mad would just be raw MAD,

However, I think it would be better if iqr and mad follow the same pattern for the default.

@rlucas7
Copy link
Member

rlucas7 commented Nov 23, 2019

@josef-pkt thanks for your comments, they are very helpful. I agree it is ambiguous w/scipy relative to other related packages.

iqr has options 'raw' as default and 'normal'

Good call out, I hadn't considered the defaults of that method. I see what you mean:
https://github.com/scipy/scipy/blob/master/scipy/stats/stats.py#L2601-L2602

However, I think it would be better if iqr and mad follow the same pattern for the default.

I agree that the interface here should be consistent with existing methods inside the package that are also robust measures of scale. So the change would be to make the scale argument in median_absolute_deviation() be consistent with iqr(), maintain the name and switch the default to 'raw' but also allow values to be specified by user, similar to what iqr does in scipy here:

https://github.com/scipy/scipy/blob/master/scipy/stats/stats.py#L2736-L2740

@tcjansen and @u55 would either of you be interested in opening a pull request to implement the change?

@u55
Copy link

u55 commented Nov 24, 2019

@rlucas7

  1. Making this change would have different defaults from statsmodels and might confuse other users who are using the method as a robust scale estimate. Or who are familiar with the statsmodels implementation.

I disagree with this justification for the current scipy API for the following reasons:

  1. Scipy has a much larger user base than statsmodels, and users unfamiliar with statsmodels are not going to expect the MAD to be multiplied by a magic number by default.
  2. A poor API choice in statsmodels is not a good reason for scipy to replicate that bad design.
  3. The current scipy implementation of median_absolute_deviation is multiplied by its scale parameter, making it inconsistent with statsmodels.robust.scale.mad, which is divided by its c parameter.
  4. Scipy's default scale=1.4826 value is a poor approximation to the exact value 1/c = 1/scipy.stats.norm.ppf(3/4.) that statsmodels.robust.scale.mad uses, again making it inconsistent with statsmodels.

So the change would be to make the scale argument in median_absolute_deviation() be consistent with iqr()

Apart from changing the default value of scale to 'raw', which I am in favor of, making scale in median_absolute_deviation consistent with iqr is even more of a breaking change since, currently, iqr is divided by its scale value and median_absolute_deviation is multiplied by its scale value. Making them consistent at this point, even with a warning cycle, seems just too confusing. If you want to make them consistent, I wonder if it wouldn't be better to just deprecate the entire function in favor of a differently-named function with a better API.

@josef-pkt
Copy link
Member

mad is in scipy.stats, and not in numpy or scipy.special, so it's perfectly valid to have mainly a statistical interpretation for it, or defaults for the main usage in statistics.

In statistic, I find unscaled iqr and mad pretty meaningless or useless, because the raw values don't have a scale that can be compared to anything else.
(e.g. use iqr as robust scale estimator in KDE)

@u55
Copy link

u55 commented Nov 24, 2019

@josef-pkt I'm not arguing that the MAD doesn't have a statistical interpretation. I am simply arguing that multiplying it by a magic number means that it is not the MAD anymore, so calling it the median_absolute_deviation is misleading! "Main usage" must always take a back seat to the principle of least surprise. Statistics users who want to use the MAD to estimate the Gaussian standard deviation are more likely to see a function named scipy.stats.median_absolute_deviation and multiply it by 1.4826 themselves (erroneously applying the scaling twice) not realizing that scipy is trying to be too cute. An optional scaling parameter that changes the meaning of the function should be a convenience that one has to opt-in to, not a default that one has to opt-out of.

@josef-pkt
Copy link
Member

josef-pkt commented Nov 24, 2019

in R https://www.rdocumentation.org/packages/stats/versions/3.6.1/topics/mad is scaled
NIST/Dataplot has mad and scaled mad
I cannot figure out from the SAS documentation what they are doing, the basic mad function is "raw"
edit new SAS docs: MAD computes both and raw by default
https://documentation.sas.com/?docsetId=imlug&docsetTarget=imlug_langref_sect259.htm&docsetVersion=14.2&locale=en
Stata doesn't seem to have it, user provided mad use scaled

"Although practicality beats purity." Zen 9

I like the iqr way of specifying "normal"
In statsmodels and R you have to specify a number. It is easy to specify 1 for unscaled, but I wouldn't want to have to find the number for "normal" each time I use it.

It's still a mad Python function that allows us to compute scaled and unscaled mad. The default is just defined for the most common use cases in statistics.
i.e. I wouldn't change it in statsmodels to something that satisfies a "pure" definition by default.
But I'm indifferent to the default in scipy.stats. IMO both defaults can be justified in stats.

We have a lot of "magic" numbers in statsmodels.robust to calibrate it to the normal case. I only looked up a few of those as developer or user.

(Aside: This reminds me of some function in scipy.special, where it also wasn't obvious whether the functions were scaled or not and if scaled by what. This was a long time ago when the scipy.special docs where half a sentence per function. Is there a 2 pi in hermite polynomials? i.e. are they scaled for normal distribution or "raw". AFAIR)

@josef-pkt
Copy link
Member

Making this change would have different defaults from statsmodels and might confuse other users who are using the method as a robust scale estimate. Or who are familiar with the statsmodels implementation.

Most likely this needs a replacement statsmodels -> R

i.e. the function looks closer to the one in R than the one in statsmodels, except default behavior is the same in all 3.

@u55
Copy link

u55 commented Nov 24, 2019

Another comparison data point: in MATLAB, the mad function is unscaled (although it returns the Mean or Median Absolute Deviation depending on the value of an optional parameter).

I like the NIST/Dataplot usage of MAD and MADN, or the SAS usage of MAD and NMAD. If the scaled version is truly the most common usage, I would support deprecating median_absolute_deviation in favor of two new functions:

def mad(x):
    """Median Absolute Deviation"""
    return median(abs(x-median(x)))

def nmad(x):
    """Normalized Median Absolute Deviation"""
    return mad(x)/scipy.stats.norm.ppf(3/4.) # approximately mad(x)/0.6745

"Explicit is better than implicit." Zen 2

One could even include:

def aad(x):
    """Average Absolute Deviation"""
    return mean(abs(x-mean(x)))

def naad(x):
    """Normalized Average Absolute Deviation"""
    return aad(x)/sqrt(2/pi) # approximately aad(x)/0.79788

although I admit that aad looks a lot like add.

@u55
Copy link

u55 commented Nov 24, 2019

To add even more confusion, astropy has astropy.stats.median_absolute_deviation which is unscaled, and astropy.stats.mad_std which is scaled.

@rgommers
Copy link
Member

rgommers commented Mar 8, 2020

Okay, so this is a bit of a mess, every package/language does it differently. For scipy.stats I'd say R is the most important reference, however the most convincing argument here is consistency with iqr. So I'm +1 for making a change.

Changing behavior of median_absolute_deviation is a bad idea, because it will silently change numerical values for people who already use the current version. Even with a FutureWarning (that'd be the correct one rather than DeprecationWarning) we don't want to do that.

So instead I propose:

  • introduce a new function median_abs_deviation(..., scale='raw')
  • deprecate median_absolute_deviation
  • add a .. warning:: box in the docstring that explains the inconsistency in scaling between Statsmodels, R, Matlab, AstroPy et al.

@rlucas7
Copy link
Member

rlucas7 commented Mar 8, 2020

@rgommers thanks for the guidance here. I think I follow what you mean. Here is how I understand it:

  1. The changed function in PR:
    Make median_absolute_deviation scale argument aligned w/iqr #11431
    keeps the signature but changes the name from median_absolute_deviation to median_abs_deviation (a new function), I'll add this into the namespace.

Note that the tests in the linked PR are updated for the scale=raw change so I'll move those over to the new function name too.

I'm not super clear on what you mean on how to handle deprecation of median_absolute_deviation but that comment is over in existing PR so I'll comment over there.
I'll keep the old tests as well (those for median_absolute_deviation function). With appropriate changes for deprecation.

WarrenWeckesser added a commit that referenced this issue May 25, 2020
…11431)

* The new function `median_abs_deviation` replaces the deprecated
  `median_absolute_deviation`.  The `scale` parameter of the new
  function divides the basic MAD calculation; the default value
  for `scale` is 1.  For convenience, `scale` may be the string
  "normal", which uses the value `special.ndtri(0.75)` for the scale.
* Deprecate `median_absolution_deviation`.  There are issues with the
  `scale` argument that are difficult to fix with a backward-compatible
  change to the function's API:
  * To compute the result, the basic MAD is *multiplied* by `scale`,
    unlike in `iqr`, where the basic result is divided by `scale`.
  * The default value for scale, 1.4826, is an approximation to the
    normal quantile function at 0.75, so was unnecessarily imprecise.
    The main problem with the default, though, was it did not match
    user's expectations.  Most users will likely expect the function
    to compute the unscaled MAD, which corresponds to `scale` equal
    to 1.
* In `iqr`, the default value of `scale` is changed to 1.0, instead
  of the string "raw". The use of the string "raw" to mean 1.0 is
  deprecated.

Closes gh-11090.

Co-authored-by: Warren Weckesser <warren.weckesser@gmail.com>
@tylerjereddy tylerjereddy added this to the 1.5.0 milestone May 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation Issues related to the SciPy documentation. Also check https://github.com/scipy/scipy.org scipy.stats
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants