Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

robust.mad is not being computed correctly or is non-standard definition; it returns the median #658

Closed
richardgmcmahon opened this issue Feb 10, 2013 · 10 comments

Comments

Projects
None yet
3 participants
@richardgmcmahon
Copy link

commented Feb 10, 2013

Hello,

The statsmodel median absolute deviation function robust.mad in not correct in my opinion. It returns the median. The function stand_mad returns the mad. Maybe this is intentional.

Here is some example code with its output.

import numpy as np
import statsmodels.robust.scale as sm
from scipy.stats import norm as Gaussian

mu=100.0
sigma=15.0
x = np.random.normal(mu, sigma, 10000)

print 'mu, sigma ', mu, sigma

print 'statsmodel MAD: ', sm.mad(x)
print 'statsmodel standardized MAD: ', sm.stand_mad(x)

x = np.asarray(x)

c = Gaussian.ppf(3.0/4.0)

print 'MAD to rms ratio: ', c, 1.0/c

m = np.median(x)
mymad = np.median(np.fabs(m - x) / c)

print 'my MAD: ', mymad
print 'median(x), median(x)/c: ', np.median(x), np.median(x)/c

mu, sigma 100.0 15.0
statsmodel MAD: 148.312355816
statsmodel standardized MAD: 15.1501297196
MAD to rms ratio: 0.674489750196 1.48260221851
my MAD: 15.1501297196
median(x), median(x)/c: 100.035163825 148.312355816

mu, sigma are the test gaussian data with mean 100 and sigma=15.0

mad returns the median
stand_mad returns the sigma

I propose that stand_mad is deprecated and mad is replaced with the code
in stand_mad

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Feb 10, 2013

My reading:

mad is used when mu is assumed to be zero, The main current use is in RLM applied to the residuals, when the mean is automatically zero *), mad is currently the default for RLM

std_mad is the usual standalone definition, when we don't have an assumption on mu.

*) actually, after checking, I'm not sure because it's applied to resid not the weighted residuals wresid.

I don't know if there are better names, Skipper implemented this from Huber's book.

@jseabold

This comment has been minimized.

Copy link
Member

commented Feb 10, 2013

These functions were leftover from Jonathan's work IIRC. I'll have to check on the proposal vs. what we need for RLM, but I'm sympathetic to the confusion.

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Oct 9, 2013

2 bug reports means we need to do something here

I had a statsmodels.stats.robust_descriptive in work (before I got distracted with release problems), that focuses on robust measures for skew and kurtosis, but could/should also get robust standard deviation.

(barely related: a collection of measures comparing two arrays http://statsmodels.sourceforge.net/devel/tools.html#measure-for-fit-performance-eval-measures
I don't remember why it ended up in tools
)

@jseabold

This comment has been minimized.

Copy link
Member

commented Oct 9, 2013

A PR to change this to standard (non-confusing) usage would be very welcome.

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Oct 9, 2013

Looking at the code, we already have stand_mad

Subtracting the median is the only difference between mad and stand_mad, AFAICS.

@jseabold

This comment has been minimized.

Copy link
Member

commented Oct 9, 2013

Yes. stand_mad is what you should use if want traditional MAD on an existing univariate random variable. Our mad is used internally on the residuals in RLM for a consistent estimator for the std error of the regression IIRC (and is also what's used by default in rlm in MASS in R and likely in Huber's book for RLM).

Probably we just combine both via a keyword argument of mad (with a deprecation warning). E.g., like in R,

mad(x, center=np.median)

and we change RLM internally to mad(x, 0) or whatever.

(where's np.nanmedian?)

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Oct 9, 2013

We need to check if the call to RLM doesn't get messier with an extra keyword.

Improving the docstring of mad to point to stand_mad and to mention the difference might be enough to direct users to the function that they want. ?

@jseabold

This comment has been minimized.

Copy link
Member

commented Oct 9, 2013

The keyword wouldn't go to RLM. The user doesn't need to control this. I was just thinking internally to change the call to mad(resid, center=0). And change mad to take a variable, a center, and a correction number in addition to just the variable and correction keyword now. Then we just get rid of stand_mad.

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Oct 9, 2013

Yes, looks fine.
I just checked: RLM uses a string 'mad' or 'stand_mad' which is independent of the call signature for the mad function.

@jseabold

This comment has been minimized.

Copy link
Member

commented Oct 9, 2013

Yeah, I'm just removing the ability to use stand_mad.

jseabold added a commit to jseabold/statsmodels that referenced this issue Oct 9, 2013

jseabold added a commit to jseabold/statsmodels that referenced this issue Oct 10, 2013

@jseabold jseabold closed this in 9dbefc9 Oct 23, 2013

jseabold added a commit that referenced this issue Oct 23, 2013

Merge pull request #1107 from jseabold/robust-mad-refactor
REF: Deprecate stand_mad. Add center keyword to mad. Closes #658.

PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014

PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014

Merge pull request statsmodels#1107 from jseabold/robust-mad-refactor
REF: Deprecate stand_mad. Add center keyword to mad. Closes statsmodels#658.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.