robust.mad is not being computed correctly or is non-standard definition; it returns the median #658

Closed
richardgmcmahon opened this Issue Feb 10, 2013 · 10 comments

Projects

None yet

3 participants

@richardgmcmahon

Hello,

The statsmodel median absolute deviation function robust.mad in not correct in my opinion. It returns the median. The function stand_mad returns the mad. Maybe this is intentional.

Here is some example code with its output.

import numpy as np
import statsmodels.robust.scale as sm
from scipy.stats import norm as Gaussian

mu=100.0
sigma=15.0
x = np.random.normal(mu, sigma, 10000)

print 'mu, sigma ', mu, sigma

print 'statsmodel MAD: ', sm.mad(x)
print 'statsmodel standardized MAD: ', sm.stand_mad(x)

x = np.asarray(x)

c = Gaussian.ppf(3.0/4.0)

print 'MAD to rms ratio: ', c, 1.0/c

m = np.median(x)
mymad = np.median(np.fabs(m - x) / c)

print 'my MAD: ', mymad
print 'median(x), median(x)/c: ', np.median(x), np.median(x)/c

mu, sigma 100.0 15.0
statsmodel MAD: 148.312355816
statsmodel standardized MAD: 15.1501297196
MAD to rms ratio: 0.674489750196 1.48260221851
my MAD: 15.1501297196
median(x), median(x)/c: 100.035163825 148.312355816

mu, sigma are the test gaussian data with mean 100 and sigma=15.0

mad returns the median
stand_mad returns the sigma

I propose that stand_mad is deprecated and mad is replaced with the code
in stand_mad

@josef-pkt
Member

My reading:

mad is used when mu is assumed to be zero, The main current use is in RLM applied to the residuals, when the mean is automatically zero *), mad is currently the default for RLM

std_mad is the usual standalone definition, when we don't have an assumption on mu.

*) actually, after checking, I'm not sure because it's applied to resid not the weighted residuals wresid.

I don't know if there are better names, Skipper implemented this from Huber's book.

@jseabold
Member

These functions were leftover from Jonathan's work IIRC. I'll have to check on the proposal vs. what we need for RLM, but I'm sympathetic to the confusion.

@josef-pkt
Member

2 bug reports means we need to do something here

I had a statsmodels.stats.robust_descriptive in work (before I got distracted with release problems), that focuses on robust measures for skew and kurtosis, but could/should also get robust standard deviation.

(barely related: a collection of measures comparing two arrays http://statsmodels.sourceforge.net/devel/tools.html#measure-for-fit-performance-eval-measures
I don't remember why it ended up in tools
)

@jseabold
Member
jseabold commented Oct 9, 2013

A PR to change this to standard (non-confusing) usage would be very welcome.

@josef-pkt
Member

Looking at the code, we already have stand_mad

Subtracting the median is the only difference between mad and stand_mad, AFAICS.

@jseabold
Member
jseabold commented Oct 9, 2013

Yes. stand_mad is what you should use if want traditional MAD on an existing univariate random variable. Our mad is used internally on the residuals in RLM for a consistent estimator for the std error of the regression IIRC (and is also what's used by default in rlm in MASS in R and likely in Huber's book for RLM).

Probably we just combine both via a keyword argument of mad (with a deprecation warning). E.g., like in R,

mad(x, center=np.median)

and we change RLM internally to mad(x, 0) or whatever.

(where's np.nanmedian?)

@josef-pkt
Member

We need to check if the call to RLM doesn't get messier with an extra keyword.

Improving the docstring of mad to point to stand_mad and to mention the difference might be enough to direct users to the function that they want. ?

@jseabold
Member
jseabold commented Oct 9, 2013

The keyword wouldn't go to RLM. The user doesn't need to control this. I was just thinking internally to change the call to mad(resid, center=0). And change mad to take a variable, a center, and a correction number in addition to just the variable and correction keyword now. Then we just get rid of stand_mad.

@josef-pkt
Member

Yes, looks fine.
I just checked: RLM uses a string 'mad' or 'stand_mad' which is independent of the call signature for the mad function.

@jseabold
Member
jseabold commented Oct 9, 2013

Yeah, I'm just removing the ability to use stand_mad.

@jseabold jseabold added a commit to jseabold/statsmodels that referenced this issue Oct 9, 2013
@jseabold jseabold REF: Deprecate stand_mad. Add center keyword to mad. Closes #658. 9415057
@jseabold jseabold added a commit to jseabold/statsmodels that referenced this issue Oct 10, 2013
@jseabold jseabold REF: Deprecate stand_mad. Add center keyword to mad. Closes #658. dd5cb3e
@jseabold jseabold closed this in 9dbefc9 Oct 23, 2013
@PierreBdR PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014
@jseabold jseabold REF: Deprecate stand_mad. Add center keyword to mad. Closes #658. 4a9b24f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment