BUG: rlm errors on missing values #2083

Closed
Gimli510 opened this Issue Nov 7, 2014 · 11 comments

@Gimli510
Gimli510 commented Nov 7, 2014

Hello,

I just upgraded to statsmodels v. 0.6.0 and found that my code no longer runs as it did under v. 0.5.0. After some digging, I narrowed the error down to the following problem with rlm in the formula API. Since the missing kwarg defaults to 'drop', I'm guessing this is a bug.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# The last row has a missing value in the regressor
d = {'Foo': [1, 2, 10, 149], 'Bar': [1, 2, 3, np.nan]}
df = pd.DataFrame(d)
mod = smf.rlm('Foo ~ Bar', data=df)

which raises the following Exception

  File "statsmodels\base\model.py", line 150, in from_formula
    mod = cls(endog, exog, *args, **kwargs)

  File "statsmodels\robust\robust_linear_model.py", line 117, in __init__
    missing=missing, **kwargs)

  File "statsmodels\base\model.py", line 60, in __init__
    **kwargs)

  File "statsmodels\base\model.py", line 84, in _handle_data
    data = handle_data(endog, exog, missing, hasconst, **kwargs)

  File "statsmodels\base\data.py", line 539, in handle_data
    **kwargs)

  File "statsmodels\base\data.py", line 61, in __init__
    **kwargs)

  File "statsmodels\base\data.py", line 198, in handle_missing
    nan_mask = missing_idx | _nan_rows(*combined)

  File "statsmodels\base\data.py", line 47, in _nan_rows
    return reduce(_nan_row_maybe_two_inputs, arrs).squeeze()

TypeError: reduce() of empty sequence with no initial value
@Gimli510 Gimli510 changed the title from Regression: rlm errors on missing values to BUG: rlm errors on missing values Nov 7, 2014
@josef-pkt
Member

Thanks for the report (unfortunately 0.6 is out)

It looks like there is no protection against an empty combined.
However, I don't understand (yet) why this works for ols but not for rlm. Both should use exactly the same code in the generic data handling.

using arrays also works correctly

endog = d['Foo']
exog = np.column_stack((np.ones(len(endog)), d['Bar']))
mod_np = sm.RLM(endog, exog, missing='drop')
>>> patsy.__version__
'0.3.0'
@josef-pkt
Member

The fix looks like it should specifically check for empty combined

>>> smd._nan_rows(*[])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\base\data.py", line 47, in _nan_rows
    return reduce(_nan_row_maybe_two_inputs, arrs).squeeze()
TypeError: reduce() of empty sequence with no initial value

aside: an empty list doesn't raise an exception in reduce

>>> smd._nan_rows([])
array([], dtype=bool)
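
A minimal sketch of the kind of guard meant here, written in plain Python rather than against the statsmodels internals. The name nan_rows_sketch and its n_rows parameter are hypothetical and only illustrate the idea: reduce() needs at least one item (or an initial value), so the empty case has to be handled before calling it.

```python
from functools import reduce
import math


def nan_rows_sketch(*columns, n_rows=0):
    """Return one boolean per row, True where any column has a NaN.

    Hypothetical stand-in for a row-wise NaN mask; not the statsmodels code.
    """
    if not columns:
        # Guard: with nothing to combine, no row is flagged.
        # Without this, reduce() raises TypeError on the empty sequence.
        return [False] * n_rows
    masks = [[math.isnan(value) for value in column] for column in columns]
    # Combine the per-column masks with a row-wise "or".
    return reduce(
        lambda left, right: [a or b for a, b in zip(left, right)], masks
    )


empty_case = nan_rows_sketch(n_rows=4)            # no extra arrays at all
one_column = nan_rows_sketch([1.0, float("nan")])  # a single extra array
```

Passing an initial value to reduce() would be another way to avoid the exception; the explicit check is used here only because it makes the empty case visible.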
@josef-pkt
Member

Looks like OLS always adds weights:

(adding a raise to get to the right spot with pdb)


  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\base\data.py", line 197, in handle_missing
    raise(ValueError)
ValueError
locals().keys()
['value_array', 'combined_names', 'missing', 'endog', 'combined_2d', 'combined', 'key', 'kwargs', 'none_array_names', 'combined_2d_names', 'missing_idx', 'exog', 'cls']
(Pdb) missing_idx
array([False, False, False,  True], dtype=bool)
(Pdb) combined
(array([ 1.,  1.,  1.,  1.]),)
(Pdb) combined_names
['weights']
(Pdb) kwargs
{'weights': array([ 1.,  1.,  1.,  1.])}
@josef-pkt
Member

Yes, that's completely broken

glm, poisson from_formula raise the same exception

mod = smf.glm('Foo ~ Bar', data=df)
mod = smf.poisson('Foo ~ Bar', data=df)

Including exposure raises a different exception (same for offset).
I don't know yet why this doesn't work; exposure and offset should be handled as extra arrays.

>>> mod = smf.poisson('Foo ~ Bar', data=df, exposure=np.ones(4))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\base\model.py", line 150, in from_formula
    mod = cls(endog, exog, *args, **kwargs)
  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\discrete\discrete_model.py", line 710, in __init__
    self._check_inputs(offset, exposure, endog) # attaches if needed
  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\discrete\discrete_model.py", line 728, in _check_inputs
    raise ValueError("exposure is not the same length as endog")
ValueError: exposure is not the same length as endog

it works without formula:

>>> mod = sm.Poisson(df['Foo'], sm.add_constant(df['Bar']), data=df, exposure=np.ones(4))
>>> 
@josef-pkt
Member

Looks like glm has the same issue, shown here with offset

mod = smf.glm('Foo ~ Bar', data=df, offset=np.ones(len(df)))
raises
ValueError: offset is not the same length as endog

@josef-pkt
Member

In CountModel.__init__ we call _check_inputs on offset and exposure before going through the missing value handling.

The following looks a bit ugly (having to attach twice), but it works for me

i.e. go through super first and then check_inputs. (super might be raising an exception already if there is a length mismatch - haven't tried yet)

        self.offset = offset
        self.exposure = exposure
        super(CountModel, self).__init__(endog, exog, missing=missing,
                offset=self.offset, exposure=self.exposure, **kwargs)
        self._check_inputs(self.offset, self.exposure, endog) # attaches if needed

The first attaching to self is not necessary, I guess

@josef-pkt
Member

simplified

        super(CountModel, self).__init__(endog, exog, missing=missing,
                offset=offset, exposure=exposure, **kwargs)
        self._check_inputs(self.offset, self.exposure, endog) # attaches if needed

the selfs in the call to _check_inputs are needed
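
A rough sketch of why the order matters, using hypothetical stand-in classes rather than the real statsmodels hierarchy. It assumes, as the pdb output earlier in the thread suggests, that the formula path has already dropped the NaN row from endog and exog and passes a row mask (missing_idx) along, while exposure still has its full length. BaseModelSketch, CheckBeforeSuper, CheckAfterSuper and check_same_length are illustrative names only.

```python
def check_same_length(extra, endog, name):
    """Raise if an optional extra array does not match endog in length."""
    if extra is not None and len(extra) != len(endog):
        raise ValueError(name + " is not the same length as endog")


class BaseModelSketch:
    """Hypothetical base class that trims extra arrays with a row mask."""

    def __init__(self, endog, exog, missing_idx=None, **extra):
        self.endog, self.exog = endog, exog
        for name, values in extra.items():
            if values is not None and missing_idx is not None:
                # Keep only the rows that were not dropped earlier.
                values = [v for v, drop in zip(values, missing_idx) if not drop]
            setattr(self, name, values)


class CheckBeforeSuper(BaseModelSketch):
    """Checks the untrimmed exposure, so the lengths disagree."""

    def __init__(self, endog, exog, exposure=None, **kwargs):
        check_same_length(exposure, endog, "exposure")
        super().__init__(endog, exog, exposure=exposure, **kwargs)


class CheckAfterSuper(BaseModelSketch):
    """Checks self.exposure, which the base class has already trimmed."""

    def __init__(self, endog, exog, exposure=None, **kwargs):
        super().__init__(endog, exog, exposure=exposure, **kwargs)
        check_same_length(self.exposure, endog, "exposure")


endog = [1.0, 2.0, 10.0]            # NaN row already dropped
exog = [1.0, 2.0, 3.0]
exposure = [1.0, 1.0, 1.0, 1.0]     # still full length
missing_idx = [False, False, False, True]

model = CheckAfterSuper(endog, exog, exposure=exposure, missing_idx=missing_idx)
```

In this sketch the check has to read self.exposure after the super call, because only the attribute set by the base class reflects the dropped row; the local exposure argument is still the untrimmed input.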

@josef-pkt
Member

The incorrect missing handling in offset and exposure is not really a regression bug.
AFAIU, it never worked before, but it was supposed to be fixed by the change that caused the regression bug.

@jseabold
Member
jseabold commented Nov 7, 2014

I'm fixing these in #2084.

@josef-pkt
Member

I'll leave it for now and review #2084 when you have finished the changes.

@jseabold
Member
jseabold commented Nov 7, 2014

I fixed everything mentioned in here. Let me know any other issues. We'll need a consolidated overhaul for extra data handling to move it all up the class hierarchy at some point. I'm worried about all the special casing that's going in GLM, Discrete, GEE, MixedLM, etc. for formulas and extra arrays, though some of it is unavoidable.

I have also been meaning to document for developers what goes on at a high level in the data handling with the super call. It was clear from recent additions (MixedLM, etc.) that the magic is not clear to anyone else.

@jseabold jseabold closed this in e5ba50f Nov 7, 2014
@Gimli510 Gimli510 referenced this issue in winpython/winpython Dec 2, 2014: release 2014-12 follow-up #29 (Closed, 7 of 7 tasks complete)
@josef-pkt josef-pkt added this to the 0.6.1 milestone Feb 17, 2015