BUG: rlm errors on missing values #2083

aimboden · 2014-11-07T09:12:06Z

Hello,

I just upgraded to statsmodels v0.6.0 and found my code was not running as expected compared to v0.5.0. After some digging, I narrowed the error down to the following problem with rlm in the formula API. Since the missing kwarg is set to 'drop' by default, I'm guessing this is a bug.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

d = {'Foo': [1, 2, 10, 149], 'Bar': [1, 2, 3, np.nan]}
df = pd.DataFrame(d)
mod = smf.rlm('Foo ~ Bar', data=df)

which raises the following Exception

  File "statsmodels\base\model.py", line 150, in from_formula
    mod = cls(endog, exog, *args, **kwargs)

  File "statsmodels\robust\robust_linear_model.py", line 117, in __init__
    missing=missing, **kwargs)

  File "statsmodels\base\model.py", line 60, in __init__
    **kwargs)

  File "statsmodels\base\model.py", line 84, in _handle_data
    data = handle_data(endog, exog, missing, hasconst, **kwargs)

  File "statsmodels\base\data.py", line 539, in handle_data
    **kwargs)

  File "statsmodels\base\data.py", line 61, in __init__
    **kwargs)

  File "statsmodels\base\data.py", line 198, in handle_missing
    nan_mask = missing_idx | _nan_rows(*combined)

  File "statsmodels\base\data.py", line 47, in _nan_rows
    return reduce(_nan_row_maybe_two_inputs, arrs).squeeze()

TypeError: reduce() of empty sequence with no initial value

josef-pkt · 2014-11-07T13:30:05Z

Thanks for the report (unfortunately 0.6 is out)

It looks like there is no protection against an empty combined.
However, I don't understand (yet) why this works for ols but not for rlm. It should use exactly the same code in the generic data handling.

using arrays also works correctly

import statsmodels.api as sm

endog = d['Foo']
exog = np.column_stack((np.ones(len(endog)), d['Bar']))
mod_np = sm.RLM(endog, exog, missing='drop')

>>> patsy.__version__
'0.3.0'

josef-pkt · 2014-11-07T13:48:18Z

It looks like the fix should specifically check for an empty combined

>>> smd._nan_rows(*[])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\base\data.py", line 47, in _nan_rows
    return reduce(_nan_row_maybe_two_inputs, arrs).squeeze()
TypeError: reduce() of empty sequence with no initial value

aside: an empty list passed as a single argument doesn't raise an exception in reduce

>>> smd._nan_rows([])
array([], dtype=bool)
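
The TypeError above is plain Python behaviour and can be reproduced without statsmodels: `_nan_rows(*[])` passes zero arrays, while `_nan_rows([])` passes one (empty) array. Below is a minimal sketch, assuming per-array row masks are OR-ed together pairwise with reduce; `combine_masks` and its `n_rows` parameter are hypothetical names for illustration, not statsmodels API, and this is not necessarily how the fix was written.

```python
from functools import reduce
import operator


def combine_masks(masks, n_rows):
    """OR together per-array missing-row masks (lists of bool).

    Hypothetical helper, not statsmodels code. Passing an explicit initial
    value lets reduce() return an all-False mask when `masks` is empty.
    """
    initial = [False] * n_rows
    return reduce(lambda a, b: [x or y for x, y in zip(a, b)], masks, initial)


# Without an initial value, reduce() fails on an empty sequence.
try:
    reduce(operator.or_, [])
except TypeError as err:
    print("TypeError:", err)

# With an initial value, the empty case is well defined.
print(combine_masks([], 4))
print(combine_masks([[False, False, False, True]], 4))
```

An explicit check for an empty sequence before calling reduce would work just as well; the initial value is only one of the two obvious options.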

josef-pkt · 2014-11-07T14:27:49Z

Looks like OLS always adds weights:

(adding a raise to get to the right spot with pdb)


  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\base\data.py", line 197, in handle_missing
    raise(ValueError)
ValueError
locals().keys()
['value_array', 'combined_names', 'missing', 'endog', 'combined_2d', 'combined', 'key', 'kwargs', 'none_array_names', 'combined_2d_names', 'missing_idx', 'exog', 'cls']
(Pdb) missing_idx
array([False, False, False,  True], dtype=bool)
(Pdb) combined
(array([ 1.,  1.,  1.,  1.]),)
(Pdb) combined_names
['weights']
(Pdb) kwargs
{'weights': array([ 1.,  1.,  1.,  1.])}
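
One way to read the pdb output above: when patsy has already handled missing rows in endog and exog, only extra keyword arrays such as weights are left to scan for NaNs. The following is a simplified model of that bookkeeping under that assumption, not the statsmodels implementation; `collect_combined` and its parameters are hypothetical names.

```python
def collect_combined(endog, exog, patsy_handled, **extra_arrays):
    """Return the tuple of arrays that still need a NaN scan.

    Hypothetical, simplified model of the bookkeeping; not statsmodels code.
    """
    # If the formula layer already dropped rows from endog/exog,
    # they are not scanned again.
    combined = () if patsy_handled else (endog, exog)
    # Extra arrays (weights, offset, exposure, ...) are added when given.
    combined += tuple(a for a in extra_arrays.values() if a is not None)
    return combined


endog = [1.0, 2.0, 10.0]
exog = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]

# A model that always passes weights leaves one array to scan.
with_weights = collect_combined(endog, exog, True, weights=[1.0] * 4)
# A model that passes no extra arrays leaves nothing to scan, which is
# the case that reached reduce() with an empty sequence.
without_extras = collect_combined(endog, exog, True)

print(len(with_weights), len(without_extras))
```

Under this reading, OLS only avoids the error because its weights array keeps the tuple non-empty.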

josef-pkt · 2014-11-07T15:02:14Z

Yes, that's completely broken

glm, poisson from_formula raise the same exception

mod = smf.glm('Foo ~ Bar', data=df)
mod = smf.poisson('Foo ~ Bar', data=df)

Including exposure raises a different exception (same for offset).
I don't know yet why this doesn't work. exposure and offset should be handled as extra arrays

>>> mod = smf.poisson('Foo ~ Bar', data=df, exposure=np.ones(4))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\base\model.py", line 150, in from_formula
    mod = cls(endog, exog, *args, **kwargs)
  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\discrete\discrete_model.py", line 710, in __init__
    self._check_inputs(offset, exposure, endog) # attaches if needed
  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\discrete\discrete_model.py", line 728, in _check_inputs
    raise ValueError("exposure is not the same length as endog")
ValueError: exposure is not the same length as endog

it works without formula:

>>> mod = sm.Poisson(df['Foo'], sm.add_constant(df['Bar']), data=df, exposure=np.ones(4))
>>>

josef-pkt · 2014-11-07T15:09:18Z

Looks like glm has the same issue, shown here with offset

mod = smf.glm('Foo ~ Bar', data=df, offset=np.ones(len(df)))
raises
ValueError: offset is not the same length as endog

josef-pkt · 2014-11-07T15:34:18Z

In CountModel.__init__ we call _check_inputs on offset and exposure before going through the missing-value handling.

The following looks a bit ugly (having to attach twice), but it works for me

i.e. go through super first and then check_inputs. (super might be raising an exception already if there is a length mismatch - haven't tried yet)

        self.offset = offset
        self.exposure = exposure
        super(CountModel, self).__init__(endog, exog, missing=missing,
                offset=self.offset, exposure=self.exposure, **kwargs)
        self._check_inputs(self.offset, self.exposure, endog) # attaches if needed

The first self. attaching is not necessary, I guess

josef-pkt · 2014-11-07T15:37:32Z

simplified

        super(CountModel, self).__init__(endog, exog, missing=missing,
                offset=offset, exposure=exposure, **kwargs)
        self._check_inputs(self.offset, self.exposure, endog) # attaches if needed

the selfs in the call to _check_inputs are needed
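
The ordering point can be illustrated with a toy class pair. This is a sketch under stated assumptions, not the statsmodels classes: `ToyBase`, `ToyCount`, and the way `missing_idx` is applied are hypothetical stand-ins. It assumes the formula layer has already dropped rows from endog, so an extra array like offset only has a matching length after the base class has filtered it.

```python
class ToyBase:
    """Hypothetical stand-in for the generic data handling; not statsmodels code."""

    def __init__(self, endog, missing_idx=None, **extra):
        self.endog = list(endog)
        for name, values in extra.items():
            if values is not None and missing_idx is not None:
                # Drop the rows that were already removed from endog.
                values = [v for v, gone in zip(values, missing_idx) if not gone]
            setattr(self, name, values)


class ToyCount(ToyBase):
    def __init__(self, endog, offset=None, missing_idx=None):
        # Run the missing-value handling first ...
        super().__init__(endog, missing_idx=missing_idx, offset=offset)
        # ... then validate the filtered attribute, not the raw argument.
        self._check_inputs(self.offset, self.endog)

    @staticmethod
    def _check_inputs(offset, endog):
        if offset is not None and len(offset) != len(endog):
            raise ValueError("offset is not the same length as endog")


# endog arrives with the NaN row already dropped; offset still has 4 rows.
model = ToyCount([1.0, 2.0, 10.0], offset=[0.0] * 4,
                 missing_idx=[False, False, False, True])
print(len(model.endog), len(model.offset))
```

Checking the raw `offset` argument instead of `self.offset` would compare 4 rows against 3 and raise, which matches the remark that the selfs are needed.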

josef-pkt · 2014-11-07T15:40:56Z

Incorrect missing-value handling in offset and exposure is not really a regression bug.
AFAIU, that never worked before, but was supposed to be fixed by the change that caused the regression bug.

jseabold · 2014-11-07T15:42:14Z

I'm fixing these in #2084.

josef-pkt · 2014-11-07T15:53:27Z

I'll leave it for now and review #2084 when you've finished the changes.

jseabold · 2014-11-07T16:01:37Z

I fixed everything mentioned in here. Let me know of any other issues. We'll need a consolidated overhaul for extra data handling to move it all up the class hierarchy at some point. I'm worried about all the special casing that's going into GLM, Discrete, GEE, MixedLM, etc. for formulas and extra arrays, though some of it is unavoidable.

I have also been meaning to document for developers what goes on at a high level in the data handling with the super call. It was clear from recent additions (MixedLM, etc.) that the magic is not clear to anyone else.

BUG: Correct issue if patsy handles missing. Closes #2083.

aimboden changed the title ~~Regression: rlm errors on missing values~~ BUG: rlm errors on missing values Nov 7, 2014

josef-pkt added type-bug prio-high comp-robust labels Nov 7, 2014

jseabold added a commit to jseabold/statsmodels that referenced this issue Nov 7, 2014

BUG: combined is an empty tuple if patsy handles missing. Closes stat…

9026859

…smodels#2083.

josef-pkt mentioned this issue Nov 7, 2014

BUG: Correct issue if patsy handles missing. Closes #2083. #2084

Merged

jseabold closed this as completed in e5ba50f Nov 7, 2014

jseabold added a commit that referenced this issue Nov 7, 2014

Merge pull request #2084 from jseabold/fix-2083

c8e980d

BUG: Correct issue if patsy handles missing. Closes #2083.

aimboden mentioned this issue Dec 2, 2014

release 2014-12 follow-up winpython/winpython#29

Closed

7 tasks

jseabold added a commit that referenced this issue Dec 2, 2014

Backport PR #2084: BUG: Correct issue if patsy handles missing. Closes …

bfd5ed5

…#2083.

josef-pkt added this to the 0.6.1 milestone Feb 17, 2015

josef-pkt mentioned this issue Feb 22, 2017

REF: raise in has constant check if exog is not finite. #3498

Merged
