BUG: rlm errors on missing values #2083

Closed
aimboden opened this issue Nov 7, 2014 · 11 comments

@aimboden

aimboden commented Nov 7, 2014

Hello,

I just upgraded to statsmodels v0.6.0 and found my code was not running as expected compared to v0.5.0. After some digging, I narrowed the error down to the following problem with rlm in the formula API. Since the missing kwarg is set to 'drop' by default, I'm guessing this is a bug.

import numpy as np  # needed for np.nan below
import pandas as pd
import statsmodels.formula.api as smf

d = {'Foo': [1, 2, 10, 149], 'Bar': [1, 2, 3, np.nan]}
df = pd.DataFrame(d)
mod = smf.rlm('Foo ~ Bar', data=df)

which raises the following exception:

  File "statsmodels\base\model.py", line 150, in from_formula
    mod = cls(endog, exog, *args, **kwargs)

  File "statsmodels\robust\robust_linear_model.py", line 117, in __init__
    missing=missing, **kwargs)

  File "statsmodels\base\model.py", line 60, in __init__
    **kwargs)

  File "statsmodels\base\model.py", line 84, in _handle_data
    data = handle_data(endog, exog, missing, hasconst, **kwargs)

  File "statsmodels\base\data.py", line 539, in handle_data
    **kwargs)

  File "statsmodels\base\data.py", line 61, in __init__
    **kwargs)

  File "statsmodels\base\data.py", line 198, in handle_missing
    nan_mask = missing_idx | _nan_rows(*combined)

  File "statsmodels\base\data.py", line 47, in _nan_rows
    return reduce(_nan_row_maybe_two_inputs, arrs).squeeze()

TypeError: reduce() of empty sequence with no initial value

aimboden changed the title from "Regression: rlm errors on missing values" to "BUG: rlm errors on missing values" on Nov 7, 2014

@josef-pkt
Member

Thanks for the report (unfortunately 0.6 is out)

It looks like there is no protection against an empty combined.
However, I don't understand (yet) why this works for ols but not for rlm. It should use exactly the same code in the generic data handling.

using arrays also works correctly

import numpy as np
import statsmodels.api as sm

# d is the dict from the original report
endog = d['Foo']
exog = np.column_stack((np.ones(len(endog)), d['Bar']))
mod_np = sm.RLM(endog, exog, missing='drop')
>>> patsy.__version__
'0.3.0'

@josef-pkt
Member

The fix looks like it should specifically check for empty combined

>>> smd._nan_rows(*[])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\base\data.py", line 47, in _nan_rows
    return reduce(_nan_row_maybe_two_inputs, arrs).squeeze()
TypeError: reduce() of empty sequence with no initial value

aside: passing a single empty list (rather than no arguments) doesn't raise an exception in reduce

>>> smd._nan_rows([])
array([], dtype=bool)
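
The failure mode and the suggested guard can be reproduced without statsmodels. The sketch below is a minimal, pure-Python stand-in and not the actual `_nan_rows` implementation: the helper name `nan_rows`, its `nobs` parameter, and the list-based masks are assumptions made for illustration.

```python
from functools import reduce
import math


def _row_has_nan(row):
    # True if any float entry in the row is NaN.
    return any(isinstance(v, float) and math.isnan(v) for v in row)


def _or_masks(m1, m2):
    # Element-wise OR of two boolean row masks.
    return [a or b for a, b in zip(m1, m2)]


def nan_rows(*arrs, nobs=0):
    # Hypothetical stand-in: row-wise NaN mask across several 2-D lists.
    if not arrs:
        # Guard for the empty case: without it, reduce() has nothing to
        # start from and raises TypeError.
        return [False] * nobs
    masks = [[_row_has_nan(row) for row in arr] for arr in arrs]
    return reduce(_or_masks, masks)


# Unguarded reduce over an empty sequence fails, as in the traceback.
try:
    reduce(_or_masks, [])
    raised = False
except TypeError:
    raised = True
```

With no extra arrays, `nan_rows(nobs=3)` returns an all-False mask instead of raising, so a precomputed missing index could still be OR-ed with it.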

@josef-pkt
Member

Looks like OLS always adds weights:

(adding a raise to get to the right spot with pdb)


  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\base\data.py", line 197, in handle_missing
    raise(ValueError)
ValueError
locals().keys()
['value_array', 'combined_names', 'missing', 'endog', 'combined_2d', 'combined', 'key', 'kwargs', 'none_array_names', 'combined_2d_names', 'missing_idx', 'exog', 'cls']
(Pdb) missing_idx
array([False, False, False,  True], dtype=bool)
(Pdb) combined
(array([ 1.,  1.,  1.,  1.]),)
(Pdb) combined_names
['weights']
(Pdb) kwargs
{'weights': array([ 1.,  1.,  1.,  1.])}
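
The pdb output shows the mismatch: `missing_idx` already flags the fourth row, while the `weights` array still has four entries. One way to picture the step that brings them back in line is to apply the mask to every extra array. This is only a sketch under that assumption; `apply_missing_mask` is a hypothetical helper, not a statsmodels function, and plain lists stand in for the NumPy arrays.

```python
def apply_missing_mask(extra_arrays, missing_idx):
    # Keep only the rows that are not flagged as missing.
    return {
        name: [v for v, is_missing in zip(values, missing_idx) if not is_missing]
        for name, values in extra_arrays.items()
    }


# Values copied from the pdb session, as plain lists.
missing_idx = [False, False, False, True]
kwargs = {"weights": [1.0, 1.0, 1.0, 1.0]}
trimmed = apply_missing_mask(kwargs, missing_idx)
```

Here `trimmed["weights"]` has three entries, matching the three rows that survive the missing-value handling. With no extra arrays the helper simply returns an empty dict, which is the rlm case.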

@josef-pkt
Member

Yes, that's completely broken

glm, poisson from_formula raise the same exception

mod = smf.glm('Foo ~ Bar', data=df)
mod = smf.poisson('Foo ~ Bar', data=df)

Including exposure raises another exception (same for offset).
I don't know yet why this doesn't work. exposure and offset should be handled as extra arrays.

>>> mod = smf.poisson('Foo ~ Bar', data=df, exposure=np.ones(4))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\base\model.py", line 150, in from_formula
    mod = cls(endog, exog, *args, **kwargs)
  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\discrete\discrete_model.py", line 710, in __init__
    self._check_inputs(offset, exposure, endog) # attaches if needed
  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\discrete\discrete_model.py", line 728, in _check_inputs
    raise ValueError("exposure is not the same length as endog")
ValueError: exposure is not the same length as endog

it works without formula:

>>> mod = sm.Poisson(df['Foo'], sm.add_constant(df['Bar']), data=df, exposure=np.ones(4))
>>> 

@josef-pkt
Member

Looks like glm has the same issue with exposure

mod = smf.glm('Foo ~ Bar', data=df, offset=np.ones(len(df)))
raises
ValueError: offset is not the same length as endog

@josef-pkt
Member

In CountModel.__init__ we call _check_inputs on offset and exposure before going through the missing value handling.

The following looks a bit ugly (having to attach twice), but it works for me

i.e. go through super first and then check_inputs. (super might be raising an exception already if there is a length mismatch - haven't tried yet)

        self.offset = offset
        self.exposure = exposure
        super(CountModel, self).__init__(endog, exog, missing=missing,
                offset=self.offset, exposure=self.exposure, **kwargs)
        self._check_inputs(self.offset, self.exposure, endog) # attaches if needed

first self. attaching is not necessary, I guess

@josef-pkt
Member

simplified

        super(CountModel, self).__init__(endog, exog, missing=missing,
                offset=offset, exposure=exposure, **kwargs)
        self._check_inputs(self.offset, self.exposure, endog) # attaches if needed

the selfs in the call to _check_inputs are needed
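
The ordering argument can be illustrated with a toy class hierarchy. This is a sketch only: `ToyBase` and `ToyCountModel` are hypothetical classes, the row dropping is a simplified stand-in for the generic data handling, and the check is run against `self.endog` (an assumption; the snippet above passes `endog`).

```python
import math


class ToyBase:
    # Hypothetical base class: drops rows where endog is NaN and applies
    # the same row selection to any extra arrays passed as keywords.
    def __init__(self, endog, missing="none", **extra):
        keep = [True] * len(endog)
        if missing == "drop":
            keep = [not math.isnan(y) for y in endog]
        self.endog = [y for y, k in zip(endog, keep) if k]
        for name, values in extra.items():
            if values is not None and len(values) == len(keep):
                values = [v for v, k in zip(values, keep) if k]
            setattr(self, name, values)


class ToyCountModel(ToyBase):
    def __init__(self, endog, exposure=None, missing="none"):
        # Missing-value handling first, length check second.
        super().__init__(endog, missing=missing, exposure=exposure)
        self._check_inputs(self.exposure, self.endog)

    @staticmethod
    def _check_inputs(exposure, endog):
        if exposure is not None and len(exposure) != len(endog):
            raise ValueError("exposure is not the same length as endog")


mod = ToyCountModel([1.0, 2.0, 10.0, float("nan")],
                    exposure=[1.0, 1.0, 1.0, 1.0], missing="drop")
```

Because the check runs on the attached (already trimmed) attributes, a valid four-element exposure passes after one row is dropped, while an exposure of genuinely different length still raises the ValueError.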

@josef-pkt
Member

Incorrect missing-value handling in offset and exposure is not really a regression bug.
AFAIU, that never worked before, but was supposed to be fixed by the change that caused the regression bug.

@jseabold
Member

jseabold commented Nov 7, 2014

I'm fixing these in #2084.

@josef-pkt
Member

I'll leave it for now and review #2084 when you've finished the changes.

@jseabold
Member

jseabold commented Nov 7, 2014

I fixed everything mentioned in here. Let me know any other issues. We'll need a consolidated overhaul for extra data handling to move it all up the class hierarchy at some point. I'm worried about all the special casing that's going in GLM, Discrete, GEE, MixedLM, etc. for formulas and extra arrays, though some of it is unavoidable.

I have also been meaning to document for developers what goes on at a high level in the data handling with the super call. It was clear from recent additions (MixedLM, etc.) that the magic is not clear to anyone else.

jseabold added a commit that referenced this issue Nov 7, 2014
BUG: Correct issue if patsy handles missing. Closes #2083.
@josef-pkt josef-pkt added this to the 0.6.1 milestone Feb 17, 2015