The statsmodel median absolute deviation function robust.mad in not correct in my opinion. It returns the median. The function stand_mad returns the mad. Maybe this is intentional.
Here is some example code with its output.
import numpy as np
import statsmodels.robust.scale as sm
from scipy.stats import norm as Gaussian
x = np.random.normal(mu, sigma, 10000)
print 'mu, sigma ', mu, sigma
print 'statsmodel MAD: ', sm.mad(x)
print 'statsmodel standardized MAD: ', sm.stand_mad(x)
c = Gaussian.ppf(3.0/4.0)
print 'MAD to rms ratio: ', c, 1.0/c
m = np.median(x)
mymad = np.median(np.fabs(m - x) / c)
print 'my MAD: ', mymad
print 'median(x), median(x)/c: ', np.median(x), np.median(x)/c
mu, sigma 100.0 15.0
statsmodel MAD: 148.312355816
statsmodel standardized MAD: 15.1501297196
MAD to rms ratio: 0.674489750196 1.48260221851
my MAD: 15.1501297196
median(x), median(x)/c: 100.035163825 148.312355816
mu, sigma are the test gaussian data with mean 100 and sigma=15.0
mad returns the median
stand_mad returns the sigma
I propose that stand_mad is deprecated and mad is replaced with the code
mad is used when mu is assumed to be zero, The main current use is in RLM applied to the residuals, when the mean is automatically zero *), mad is currently the default for RLM
std_mad is the usual standalone definition, when we don't have an assumption on mu.
*) actually, after checking, I'm not sure because it's applied to resid not the weighted residuals wresid.
I don't know if there are better names, Skipper implemented this from Huber's book.
These functions were leftover from Jonathan's work IIRC. I'll have to check on the proposal vs. what we need for RLM, but I'm sympathetic to the confusion.
2 bug reports means we need to do something here
I had a statsmodels.stats.robust_descriptive in work (before I got distracted with release problems), that focuses on robust measures for skew and kurtosis, but could/should also get robust standard deviation.
(barely related: a collection of measures comparing two arrays http://statsmodels.sourceforge.net/devel/tools.html#measure-for-fit-performance-eval-measures
I don't remember why it ended up in tools
A PR to change this to standard (non-confusing) usage would be very welcome.
Looking at the code, we already have stand_mad
Subtracting the median is the only difference between mad and stand_mad, AFAICS.
Yes. stand_mad is what you should use if want traditional MAD on an existing univariate random variable. Our mad is used internally on the residuals in RLM for a consistent estimator for the std error of the regression IIRC (and is also what's used by default in rlm in MASS in R and likely in Huber's book for RLM).
Probably we just combine both via a keyword argument of mad (with a deprecation warning). E.g., like in R,
and we change RLM internally to mad(x, 0) or whatever.
We need to check if the call to RLM doesn't get messier with an extra keyword.
Improving the docstring of mad to point to stand_mad and to mention the difference might be enough to direct users to the function that they want. ?
The keyword wouldn't go to RLM. The user doesn't need to control this. I was just thinking internally to change the call to mad(resid, center=0). And change mad to take a variable, a center, and a correction number in addition to just the variable and correction keyword now. Then we just get rid of stand_mad.
Yes, looks fine.
I just checked: RLM uses a string 'mad' or 'stand_mad' which is independent of the call signature for the mad function.
Yeah, I'm just removing the ability to use stand_mad.
REF: Deprecate stand_mad. Add center keyword to mad. Closes #658.