ENH: accept non-int definitions of cluster groups #1437

merged 2 commits into from Apr 4, 2014




toobaz commented Feb 28, 2014

Currently, "group" is assumed to be an array of ints. There is no particular reason why this must be so (e.g. a string variable "country" holding the code for a nation would be a natural group definition).


I need to look at this more carefully. The change looks innocent enough, but I would like to keep the group conversion in the wrapper code.

For the conversion to int, using return_index in np.unique is the fastest way, AFAICT. That conversion is in group-utils.

I think the np.unique in here is supposed to be optimized away if we use the group utilities in the wrapper class (it is a duplicate and only needed for the small-sample correction, IIRC).
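For readers following along, here is a minimal sketch of the np.unique conversion discussed above. It is not the code from this PR; the `country` array is made-up data. Note that the integer group codes come from the `return_inverse` option, while `return_index` (mentioned above) gives the position of the first occurrence of each label; both are requested here for comparison.

```python
import numpy as np

# Made-up example data: a string-valued cluster variable.
country = np.array(["FR", "DE", "FR", "IT", "DE", "FR"])

# labels: sorted unique values
# first_idx: index of the first occurrence of each label (return_index)
# codes: integer code of each observation's label (return_inverse)
labels, first_idx, codes = np.unique(
    country, return_index=True, return_inverse=True
)

# Number of clusters, e.g. for a small-sample correction.
n_groups = len(labels)

# labels[codes] reconstructs the original array.
roundtrip = labels[codes]
```

The `codes` array is what integer-only code downstream would consume, and `n_groups` falls out of the same call, so a second np.unique is not needed.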

toobaz commented Mar 1, 2014

I didn't know about the optional arguments to np.unique, this is fixed in the last commit.

... but I'm sorry I wasn't able to follow the rest of your comment. If you expect some more changes from me in this branch, please point me at the meaning of "wrapper code".


Coverage Status

Coverage remained the same when pulling aef7ee5 on toobaz:nonint_clusters into c299e3f on statsmodels:master.


about "wrappers"

My first version to include it in a model is in the linear regression Results

From a quick check, it seems that it might also assume integers in the cluster option.

Similar code will be needed in several different model classes, but I haven't gotten around yet to seeing how this can be rewritten so it also works for other models, like the MLE models in discrete.

I'm still only partway to getting the pieces to fit together https://github.com/josef-pkt/statsmodels/compare/REF_covtype_fit
I'm still running into problems when I try to work on this #1418

What I wanted to say about this PR is: it looks fine, however eventually all this should go into the more general parts, so that the pure sandwich functions don't have to worry about any of this, and we don't need to have code duplication for this.
That's my plan, however group handling is still in flux across several PRs.
Maybe we can also keep the string to int conversion in there as an option for when users want to call the sandwiches directly.

I'm a bit unclear, because it's not clear yet to me either.
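One possible reading of the plan above, as a rough sketch only: the label-to-int conversion lives in one shared helper, and the numerical routine only ever sees integer codes. Both `as_int_groups` and `cluster_sums` are hypothetical names invented for this illustration, not statsmodels functions, and `cluster_sums` is just a stand-in for the per-cluster aggregation a sandwich function would do.

```python
import numpy as np

def as_int_groups(group):
    """Hypothetical helper: map arbitrary group labels to codes 0..n_groups-1."""
    group = np.asarray(group)
    labels, codes = np.unique(group, return_inverse=True)
    return codes, len(labels)

def cluster_sums(values, group):
    """Hypothetical stand-in for a routine that aggregates within clusters.

    The conversion happens once, up front; the aggregation itself
    works only with integer codes.
    """
    codes, n_groups = as_int_groups(group)
    return np.bincount(codes, weights=np.asarray(values, dtype=float),
                       minlength=n_groups)

# String labels and non-contiguous integer labels go through the same path.
sums_str = cluster_sums([1, 2, 3, 4], ["a", "b", "a", "b"])
sums_int = cluster_sums([1, 2, 3, 4], [10, 20, 10, 20])
```

With this split, a wrapper class could call the helper once and pass the codes on, while a user calling the lower-level function directly still gets the conversion.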

@josef-pkt josef-pkt added the PR label Mar 11, 2014
jseabold commented Apr 4, 2014

This looks ok to merge? We can give all this a look over for optimization at some point, but piecemeal is better than waiting for perfect.

@jseabold jseabold merged commit 0c4cc15 into statsmodels:master Apr 4, 2014

1 check passed

default The Travis CI build passed