Currently, "group" is assumed to be an array of ints. There is no particular reason why this must be so (e.g. a string variable "country" holding the code for a nation should work just as well).
ENH: accept non-int definitions of cluster groups
I need to look at this more carefully. The change looks innocent enough, but I would like to keep the group conversion in the wrapper code.
For the conversion to int, using return_inverse in np.unique is the fastest way, AFAICT. That conversion is in the group utilities.
I think the np.unique call in here is supposed to be optimized away once we use the group utilities in the wrapper class. (It is a duplicate and only needed for the small-sample correction, IIRC.)
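For reference, a minimal sketch of that conversion. The option of np.unique that returns one integer code per observation is return_inverse; the "country" array here is made-up illustration data, not anything from the PR.

```python
import numpy as np

# Hypothetical string-valued cluster variable (illustration only).
country = np.array(["DE", "FR", "DE", "IT", "FR", "DE"])

# labels: sorted unique values; codes: for each observation, the
# position of its value in labels, i.e. an integer group id in
# the range 0..n_groups-1.
labels, codes = np.unique(country, return_inverse=True)
n_groups = len(labels)

# The original values can be recovered from the codes.
roundtrip = labels[codes]
```

The codes are contiguous and start at 0, so they can be used directly as an index into per-group arrays.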
ENH: use np.unique for indexing clusters
I didn't know about the optional arguments to np.unique; this is fixed in the last commit.
... but I'm sorry, I wasn't able to follow the rest of your comment. If you expect more changes from me in this branch, please explain what you mean by "wrapper code".
Coverage remained the same when pulling aef7ee5 on toobaz:nonint_clusters into c299e3f on statsmodels:master.
My first version to include it in a model is in the linear regression Results.
From a quick check, it seems that it might also assume integers in the cluster option.
Similar code will be needed in several different model classes, but I haven't yet gotten around to seeing how this can be rewritten so that it also works for other models, like the MLE models in discrete.
I'm still only partway to getting the pieces to fit together: https://github.com/josef-pkt/statsmodels/compare/REF_covtype_fit
I'm still running into problems when I try to work on this #1418
What I wanted to say about this PR is: it looks fine. However, eventually all this should go into the more general parts, so that the pure sandwich functions don't have to worry about any of this and we don't need duplicate code for it.
That's my plan; however, group handling is still in flux across several PRs.
Maybe we can also keep the string-to-int conversion in there as an option for when users want to call the sandwiches directly.
I'm being a bit unclear because it's not yet clear to me either.
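A rough sketch of what such an optional conversion could look like in a function that is called directly. The name sum_by_cluster and the convert flag are hypothetical and are not the existing sandwich API; the function only illustrates summing per-observation rows within clusters.

```python
import numpy as np

def sum_by_cluster(scores, group, convert=True):
    """Sum the rows of `scores` within each cluster (hypothetical helper).

    convert=True accepts arbitrary labels (e.g. strings) and maps them
    to integer codes; convert=False assumes `group` already holds
    integer codes 0..n_groups-1, as a wrapper would provide.
    """
    scores = np.asarray(scores, dtype=float)
    group = np.asarray(group)
    if convert:
        # Map labels to integer codes in sorted label order.
        _, codes = np.unique(group, return_inverse=True)
    else:
        codes = group
    n_groups = codes.max() + 1
    out = np.zeros((n_groups,) + scores.shape[1:])
    # Unbuffered add, so repeated codes accumulate correctly.
    np.add.at(out, codes, scores)
    return out

scores = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
by_label = sum_by_cluster(scores, ["b", "a", "b"])
by_code = sum_by_cluster(scores, [1, 0, 1], convert=False)
```

With the flag, a wrapper that has already converted the groups can skip the extra np.unique call, while direct callers can still pass strings.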
This looks OK to merge? We can give all this a look over for optimization at some point, but piecemeal is better than waiting for perfect.