ENH: accept non-int definitions of cluster groups #1437

Merged
merged 2 commits into from Apr 4, 2014

Projects

None yet

4 participants

@toobaz
Contributor
toobaz commented Feb 28, 2014

Currently, "group" is assumed to be an array of ints. There is no particular reason why this must be so (i.e. a string variable "country" being the code for a nation).

@josef-pkt
Member

I need to look at this more carefully. The change looks innocent enough, but I would like to keep the group conversion in the wrapper code.

For the conversion to int, using return_index in np.unique is the fastest way, AFAICT. That conversion is in group-utils.

I think the np.unique in here is supposed to be optimized away, if we use the group utilities in the wrapper class. (is duplicate and only needed for small sample correction IIRC)

@toobaz
Contributor
toobaz commented Mar 1, 2014

I didn't know about the optional arguments to np.unique, this is fixed in the last commit.

... but I'm sorry I wasn't able to follow the rest of your comment. If you expect some more changes from me in this branch, please point me at the meaning of "wrapper code".

@coveralls

Coverage Status

Coverage remained the same when pulling aef7ee5 on toobaz:nonint_clusters into c299e3f on statsmodels:master.

@josef-pkt
Member

about "wrappers"

My first version to include it in a model is in the linear regression Results
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/regression/linear_model.py#L1538

From a quick check, it seems that it might also assume integer in cluster option.

Similar code will be needed in several different model classes, but I haven't gotten around yet to see how this can be rewritten so it also works for other models, like the MLE models in discrete.

I'm still only partway to get the pieces to fit together https://github.com/josef-pkt/statsmodels/compare/REF_covtype_fit
I'm still running into problems when I try to work on this #1418

What I wanted to say to this PR is: it looks fine, however eventually all this should go into the more general parts, so that the pure sandwich functions don't have to worry about any of this, and we don't need to have code duplication for this.
That's my plan, however group handling is still in flux across several PRs.
Maybe we can also keep the string to int conversion in there as option when users want to call the sandwiches directly.

I'm a bit unclear, because it's not clear yet to me either.

@josef-pkt josef-pkt added the PR label Mar 11, 2014
@jseabold
Member
jseabold commented Apr 4, 2014

This looks ok to merge? We can give all this a look over for optimization at some point, but piecemeal is better than wait for perfect.

@jseabold jseabold merged commit 0c4cc15 into statsmodels:master Apr 4, 2014

1 check passed

default The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment