nan with categorical in formula #805

Closed
jseabold opened this Issue May 3, 2013 · 28 comments

Projects

None yet

4 participants

@jseabold
Member
jseabold commented May 3, 2013

It looks like we will need to use Patsy's NAAction sooner rather than later. Consider the following

import statsmodels.api as sm

from statsmodels.formula.api import ols

dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()

mod = ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
res = mod.fit()
print res.summary()

# NaNs a category now
df = dta.data
mod = ols(formula='Lottery ~ Literacy + Wealth + Region', data=df, 
          missing='drop')
res = mod.fit()
print res.summary()

Simple fix? Is NA support in patsy in a released version? We could also just move up our missing handling to before the formula handling.

cc @njsmith

@josef-pkt
Member

Related: What happens if a category disappears because of nans?
MNLogit might be handling it correctly for endog because the conversion to dummy variable, wendog, happens in initialize, after nan rows have been dropped.

@njsmith
Contributor
njsmith commented May 9, 2013

Patsy infers levels for categorical variables before applying nan removal.
(In fact the former happens in design_matrix_builders, and the latter in
build_design_matrices.)

I would argue this is correct actually, and MNLogit is wrong. If I have a
level in my data that's not estimable, I want to see that, not have it just
silently disappear. This is the whole point of having factors as a special
data type in R - because the levels that actually appear in any given
vector may be different than the full set of levels that could appear in
principle for that measurement.
On 9 May 2013 08:34, "Josef Perktold" notifications@github.com wrote:

Related: What happens if a category disappears because of nans?
MNLogit might be handling it correctly for endog because the conversion to
dummy variable, wendog, happens in initialize, after nan rows have been
dropped.


Reply to this email directly or view it on GitHubhttps://github.com/statsmodels/statsmodels/issues/805#issuecomment-17661944
.

@josef-pkt
Member

I never tried MNLogit with nans in any package

However, my impression is that R and others (Stata, SAS?) supress category levels without observations by default. I think I read comments about this in the SAS or Stata documentation, but don't remember details.

see comment for pandas, and my experience with R factors for cohen's kappa wrt missing categories
http://jpktd.blogspot.ca/2013/03/inter-rater-reliability-cohen-kappa-and_268.html?showComment=1367557172718#c3881544432421261243
SAS drops missing category levels by default in all statistical tests.
(but that's missing category levels in data, not because of missing values)

we could issue a warning, but if we don't delete, then we cannot estimate with MLE, and in linear models we get singular exog and unidentified parameters.

Aside: I was working with Poisson with a categorical regressor in my robust branch for maximum trimmed likelihood, and it fails because of singular exog/Hessian in MLE.

@josef-pkt
Member

with patsy 0.1, I haven't updated yet and don't know how this has changed

looks like we need to coordinate on this
missing='drop' in statsmodels comes too late, patsy breaks before

form='y ~ C(x)'
affair_model = logit(form, my_data_dict, missing='drop')
affair_result = affair_model.fit()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "hanae_logit_singular.py", line 36, in <module>
    affair_model = logit(form, my_data_dict, missing='drop')
  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2\statsmodels\statsmodels\base\model.py", line 98, in from_formula
    endog, exog = handle_formula_data(data, None, formula)
  File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2\statsmodels\statsmodels\formula\formulatools.py", line 47, in handle_formula_data
    return dmatrices(formula, Y, 2, return_type='dataframe')
  File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\highlevel.py", line 283, in dmatrices
    return_type)
  File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\highlevel.py", line 147, in _do_highlevel_design
    builders = _try_incr_builders(formula_like, data_iter_maker, eval_env)
  File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\highlevel.py", line 61, in _try_incr_builders
    data_iter_maker)
  File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\build.py", line 693, in design_matrix_builders
    data_iter_maker)
  File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\build.py", line 443, in _examine_factor_types
    value = factor.eval(factor_states[factor], data)
  File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\eval.py", line 431, in eval
    return self._eval(memorize_state["eval_code"], memorize_state, data)
  File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\eval.py", line 414, in _eval
    return self._eval_env.eval(code, inner_namespace=inner_namespace)
  File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\eval.py", line 121, in eval
    + self._namespaces))
  File "<string>", line 1, in <module>
  File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\categorical.py", line 201, in transform
    return Categorical.from_sequence(data, levels, **kwargs)
  File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\categorical.py", line 71, in from_sequence
    "expected levels" % (entry,))
patsy.PatsyError: Error converting data to categorical: object nan does not match any of the expected levels
@njsmith
Contributor
njsmith commented Aug 19, 2013

0.1 doesn't have any special missing data handling, but this ought to work
with 0.2.

There is this bug report I haven't had time to look into yet, though:
pydata/patsy#22

-n
On 19 Aug 2013 18:03, "Josef Perktold" notifications@github.com wrote:

with patsy 0.1, I haven't updated yet and don't know how this has changed

looks like we need to coordinate on this
missing='drop' in statsmodels comes too late, patsy breaks before

form='y ~ C(x)'
affair_model = logit(form, my_data_dict, missing='drop')
affair_result = affair_model.fit()

Traceback (most recent call last):
File "", line 1, in
File "hanae_logit_singular.py", line 36, in
affair_model = logit(form, my_data_dict, missing='drop')
File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2\statsmodels\statsmodels\base\model.py", line 98, in from_formula
endog, exog = handle_formula_data(data, None, formula)
File "e:\josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2\statsmodels\statsmodels\formula\formulatools.py", line 47, in handle_formula_data
return dmatrices(formula, Y, 2, return_type='dataframe')
File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\highlevel.py", line 283, in dmatrices
return_type)
File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\highlevel.py", line 147, in _do_highlevel_design
builders = _try_incr_builders(formula_like, data_iter_maker, eval_env)
File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\highlevel.py", line 61, in _try_incr_builders
data_iter_maker)
File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\build.py", line 693, in design_matrix_builders
data_iter_maker)
File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\build.py", line 443, in _examine_factor_types
value = factor.eval(factor_states[factor], data)
File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\eval.py", line 431, in eval
return self._eval(memorize_state["eval_code"], memorize_state, data)
File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\eval.py", line 414, in _eval
return self._eval_env.eval(code, inner_namespace=inner_namespace)
File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\eval.py", line 121, in eval
+ self._namespaces))
File "", line 1, in
File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\categorical.py", line 201, in transform
return Categorical.from_sequence(data, levels, **kwargs)
File "e:\josef\eclipsegworkspace\charlton\patsy\patsy\categorical.py", line 71, in from_sequence
"expected levels" % (entry,))
patsy.PatsyError: Error converting data to categorical: object nan does not match any of the expected levels


Reply to this email directly or view it on GitHubhttps://github.com/statsmodels/statsmodels/issues/805#issuecomment-22886940
.

@josef-pkt
Member

Even if patsy now handles this case automatically, then that still leaves us with the problem that, if patsy removes rows, then other arrays will not match anymore, like weights in WLS.

@vincentarelbundock
Member

We could use indices:

import patsy
import numpy as np
import pandas as pd

dat = pd.DataFrame(np.random.random((3,3)), columns=['a', 'b', 'c'])
dat.a[1] = None
w = pd.Series(range(3), index=dat.index)

x = patsy.dmatrices('a ~ b', dat, return_type='dataframe')
w.ix[x[0].index]
@vincentarelbundock
Member
In [10]: print dat
          a         b         c
0  0.589011  0.150032  0.582764
1       NaN  0.227665  0.965785
2  0.329215  0.483675  0.723749

In [11]: print w
0    0
1    1
2    2 
dtype: int64

In [12]: x = patsy.dmatrices('a ~ b', dat, return_type='dataframe')

In [13]: print w.ix[x[0].index]
0    0
2    2
dtype: int64
@josef-pkt
Member

However, that only works with pandas data, not if data is just a dict_like.

@njsmith
Contributor
njsmith commented Aug 19, 2013

It works fine if the data is just a dict_like... Try it :-)
On 19 Aug 2013 19:13, "Josef Perktold" notifications@github.com wrote:

However, that only works with pandas data, not if data is just a dict_like.


Reply to this email directly or view it on GitHubhttps://github.com/statsmodels/statsmodels/issues/805#issuecomment-22892302
.

@josef-pkt
Member

dict_like doesn't work in response to Vincent's solution
http://nbviewer.ipython.org/urls/gist.github.com/josef-pkt/6273095/raw/e5b64b21bc8d573817cc690b16c6e853c0999876/patsy_missing.ipynb

(trying out my shiny new ipython in winpython)

@njsmith
Contributor
njsmith commented Aug 20, 2013

I'm not sure what the notebook is showing -- I guess you aren't actually
looking at the index that patsy gives back to tell you which rows have been
dropped? My point was that if you pass in a dict to patsy with
return_type="dataframe", then it will invent an index (just like pandas
does when you just coerce a dict to a dataframe), and it invents this index
before doing missing-value handling, so the index on the final DataFrame
will correctly indicate which rows have been dropped, and let you drop the
appropriate rows from your weights matrix.

On Mon, Aug 19, 2013 at 8:38 PM, Josef Perktold notifications@github.comwrote:

dict_like doesn't work in response to Vincent's solution

http://nbviewer.ipython.org/urls/gist.github.com/josef-pkt/6273095/raw/e5b64b21bc8d573817cc690b16c6e853c0999876/patsy_missing.ipynb

(trying out my shiny new ipython in winpython)


Reply to this email directly or view it on GitHubhttps://github.com/statsmodels/statsmodels/issues/805#issuecomment-22898403
.

@jseabold
Member

Yes, there's no code that interacts with patsy for missing values. It needs to be written. You can do

res1 = smf.wls('y~x0+x1', data=data, weights=weights, missing='drop').fit()

But this won't work in the case of categorical with NaNs I don't believe.

@jseabold
Member

It might though, I'd need to look.

@josef-pkt
Member

res1 = smf.wls('y~x0+x1', data=data, weights=weights, missing='drop').fit()

also breaks because the missing handling in statsmodels assumes the extra arrays have the same shape[0] as the data returned from patsy.

@njsmith The current implementation (interaction between patsy and statsmodels) is broken if there are any extra arrays.
However, as you explain, if we can get the index back, then we can adjust the other arrays, or we could remove all nan rows before calling patsy.

in either case, this is a prime candidate for 0.5.1

@jseabold jseabold added this to the 0.5.1 milestone Mar 18, 2014
@jseabold
Member

Set 0.5.1 milestone for this.

@jseabold
Member

The issue now is that we need to pass off something to patsy, optionally from the user, to tell it how to deal with missing data. It will probably make most sense to rewrite our missing data handling to do this instead of what we have.

I don't really think it's super clean, because I'd prefer to keep missing data-handling out of patsy, since we may want to re-use it elsewhere where we don't want to have to go through patsy. We may just have some duplicate code.

@njsmith
Contributor
njsmith commented Mar 19, 2014

At least consider the option of using patsy's api to missing data as what
you expose to the user everywhere (I.e. let them pass NAAction objects as
NA_action args, and then you can use it to feed either patsy itself.or your
own code? Patsy's a dep after all... and NAAction seems as good an api for
requesting NA handling as another...
On 19 Mar 2014 04:35, "Skipper Seabold" notifications@github.com wrote:

The issue now is that we need to pass off something to patsy, optionally
from the user, to tell it how to deal with missing data. It will probably
make most sense to rewrite our missing data handling to do this instead of
what we have.

I don't really think it's super clean, because I'd prefer to keep missing
data-handling out of patsy, since we may want to re-use it elsewhere where
we don't want to have to go through patsy. We may just have some duplicate
code.


Reply to this email directly or view it on GitHubhttps://github.com/statsmodels/statsmodels/issues/805#issuecomment-38017136
.

@josef-pkt
Member

Moving this to 0.6.0 as prio-high
see related for GEE #1877

IMO, we just turn of missing value handling by patsy in from_formula for 0.6, and let statsmodels do all the missing value handling.

About categorical with all zero columns: I looked at some examples in another issue. There won't be any difference in whether patsy or statsmodels does it.
We need to handle zero columns as part of our singular exog handling, which is currently using pinv in RegressionModels and raises exceptions in discrete, and unclear in GLM (where it doesn't raise an exception because WLS uses pinv) and RLM.

@josef-pkt josef-pkt modified the milestone: 0.5.1, 0.6 Aug 12, 2014
@josef-pkt josef-pkt added the prio-high label Aug 12, 2014
@josef-pkt
Member

we still need to handle #1220 separately, for arrays or pandas.Series that come late

@jseabold
Member
jseabold commented Oct 9, 2014

#2034 starts to address this.

@jseabold
Member
jseabold commented Oct 9, 2014

I'm really starting to think that patsy is not the place to handle missing values given that a lot of our 'extra' arrays don't go through patsy. Unfortunately, there's no way to turn off missing value handling in patsy. Working around this is going to make some things much more complex than they otherwise would be.

@josef-pkt
Member

I thought we could just turn off patsy's nan handling completely. I didn't know that that's not possible.

One problem I thought about is what patsy is doing with missing in categorical variables. Do we get a dummy column for NaN?

@jseabold
Member
jseabold commented Oct 9, 2014

That's what this issue is about right? The way it is now, no we don't. It gets dropped, which I think is the right thing to do. If you want to keep the NaN category, you could change NaNs to another placeholder. See latest commit in #2034.

Re: turning off patsy, no, we can't. I monkey-patched NAAction to do nothing, but then you run into this. The current workaround is hackish, but I think it will get the job done for the release and we can continue to polish going forward.

@njsmith
Contributor
njsmith commented Oct 9, 2014

There's no need to monkeypatch NAAction, just pass your own version in.

I don't understand the problem you ran into with doing this, though -- the
line you link to will only cause a problem if you're actually passing in
categorical NaNs. I would have thought that if you're taking responsibility
for handling missing values, then it's your job to get rid of those NaNs
first?

On Thu, Oct 9, 2014 at 4:16 PM, Skipper Seabold notifications@github.com
wrote:

That's what this issue is about right? The way it is now, no we don't. It
gets dropped, which I think is the right thing to do. If you want to keep
the NaN category, you could change NaNs to another placeholder. See latest
commit in #2034 #2034.

Re: turning off patsy, no, we can't. I monkey-patched NAAction to do
nothing, but then you run into this
https://github.com/pydata/patsy/blob/master/patsy/build.py#L270. The
current workaround is hackish, but I think it will get the job done for the
release and we can continue to polish going forward.


Reply to this email directly or view it on GitHub
#805 (comment)
.

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

@njsmith
Contributor
njsmith commented Oct 9, 2014

...Oh, sorry, no, I get it. You want to generate a design matrix that
contains NaNs, and then apply missing value elimination to the design
matrix. Which is really the only thing you can do, so fine.

I don't really know what we should even do here. I guess we could spit
out a row of NaNs if we encounter a NaN in an input categorical? I'm a
little worried about whether this will cause problems in the future when we
get proper NA support, and want to stop treating NaN as if it were NA. Plus
NaN is a weird value to find in a categorical in the first place, given
that these are usually strings :-/.

Enhancing patsy so it sees the "extra" arrays seems like a more viable
long-term solution then piling on workaround hacks... is that really
difficult somehow?

On Thu, Oct 9, 2014 at 5:01 PM, Nathaniel Smith njs@pobox.com wrote:

There's no need to monkeypatch NAAction, just pass your own version in.

I don't understand the problem you ran into with doing this, though -- the
line you link to will only cause a problem if you're actually passing in
categorical NaNs. I would have thought that if you're taking responsibility
for handling missing values, then it's your job to get rid of those NaNs
first?

On Thu, Oct 9, 2014 at 4:16 PM, Skipper Seabold notifications@github.com
wrote:

That's what this issue is about right? The way it is now, no we don't. It
gets dropped, which I think is the right thing to do. If you want to keep
the NaN category, you could change NaNs to another placeholder. See latest
commit in #2034 #2034.

Re: turning off patsy, no, we can't. I monkey-patched NAAction to do
nothing, but then you run into this
https://github.com/pydata/patsy/blob/master/patsy/build.py#L270. The
current workaround is hackish, but I think it will get the job done for the
release and we can continue to polish going forward.


Reply to this email directly or view it on GitHub
#805 (comment)
.

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

@jseabold
Member
jseabold commented Oct 9, 2014

I had to monkey-patch either way so I can get back the missing_mask. I don't want to write my own NAAction, given yours does what we need except this.

From the pandas perspective, NaNs in a Categorical or string is possible. Otherwise, from numpy we have object arrays. As to why they're there. Well, data can be funny.

There's a few difficulties from our perspective handing this in patsy, none insurmountable long term AFAICT.

  1. we want to release imminently and need to have a fix for this given our patsy version requirement, and the current hacks work.
  2. We create some arrays after formula handling. E.g.,
y, X = dmatrices()
if weights is None: weights = np.ones(len(y))

So it's possible users have passed weights, which we need missing value handling on, but we might also create them based off of y and if y had missing values then the missing index and the weights are no longer conformable before processing for missing values. So the NA-handling would have to be stateful in some sense.

If you want to try to support this in patsy, let's go for it, but I won't be submitting a PR.

Re: long-term NA handling (*chirp* *chirp*). I'm inclined to worry about it when we get there and work with what we have now. It's both nice and unfortunate that pandas has chosen to move on with NA-handling outside numpy (and I guess blaze too (?)), but it is what is for now.

@josef-pkt
Member

Enhancing patsy so it sees the "extra" arrays seems like a more viable
long-term solution then piling on workaround hacks... is that really
difficult somehow?

That sounds very difficult to me, and too much to pile on a formula

First we need the same missing handling also for the non-formula interface.

Second, it's easier this way to keep track of the different arrays that are involved in a model.
For example, we have or get now models that will have two formulas, like beta regression right now where we have an exog/design matrix/ formula for each parameter.
We need to be able to have a central missing handling that takes care of all the different parts (preferably without hiding it completely away)

@jseabold jseabold closed this in b1460a6 Oct 10, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment