Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: accept non-int definitions of cluster groups #1437

Merged
merged 2 commits into from Apr 4, 2014

Conversation

Projects
None yet
4 participants
@toobaz
Copy link
Contributor

toobaz commented Feb 28, 2014

Currently, "group" is assumed to be an array of ints. There is no particular reason why this must be so (i.e. a string variable "country" being the code for a nation).

@josef-pkt

This comment has been minimized.

Copy link
Member

josef-pkt commented Mar 1, 2014

I need to look at this more carefully. The change looks innocent enough, but I would like to keep the group conversion in the wrapper code.

For the conversion to int, using return_index in np.unique is the fastest way, AFAICT. That conversion is in group-utils.

I think the np.unique in here is supposed to be optimized away, if we use the group utilities in the wrapper class. (is duplicate and only needed for small sample correction IIRC)

@toobaz

This comment has been minimized.

Copy link
Contributor Author

toobaz commented Mar 1, 2014

I didn't know about the optional arguments to np.unique, this is fixed in the last commit.

... but I'm sorry I wasn't able to follow the rest of your comment. If you expect some more changes from me in this branch, please point me at the meaning of "wrapper code".

@coveralls

This comment has been minimized.

Copy link

coveralls commented Mar 1, 2014

Coverage Status

Coverage remained the same when pulling aef7ee5 on toobaz:nonint_clusters into c299e3f on statsmodels:master.

@josef-pkt

This comment has been minimized.

Copy link
Member

josef-pkt commented Mar 1, 2014

about "wrappers"

My first version to include it in a model is in the linear regression Results
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/regression/linear_model.py#L1538

From a quick check, it seems that it might also assume integer in cluster option.

Similar code will be needed in several different model classes, but I haven't gotten around yet to see how this can be rewritten so it also works for other models, like the MLE models in discrete.

I'm still only partway to get the pieces to fit together https://github.com/josef-pkt/statsmodels/compare/REF_covtype_fit
I'm still running into problems when I try to work on this #1418

What I wanted to say to this PR is: it looks fine, however eventually all this should go into the more general parts, so that the pure sandwich functions don't have to worry about any of this, and we don't need to have code duplication for this.
That's my plan, however group handling is still in flux across several PRs.
Maybe we can also keep the string to int conversion in there as option when users want to call the sandwiches directly.

I'm a bit unclear, because it's not clear yet to me either.

@josef-pkt josef-pkt added the PR label Mar 11, 2014

@jseabold

This comment has been minimized.

Copy link
Member

jseabold commented Apr 4, 2014

This looks ok to merge? We can give all this a look over for optimization at some point, but piecemeal is better than wait for perfect.

jseabold added a commit that referenced this pull request Apr 4, 2014

Merge pull request #1437 from toobaz/nonint_clusters
ENH: accept non-int definitions of cluster groups

@jseabold jseabold merged commit 0c4cc15 into statsmodels:master Apr 4, 2014

1 check passed

default The Travis CI build passed
Details

PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this pull request Sep 2, 2014

Merge pull request statsmodels#1437 from toobaz/nonint_clusters
ENH: accept non-int definitions of cluster groups
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.