WIP: Add Panel Data models #1133

Closed
wants to merge 72 commits into from


jseabold
Member

I think this is ready to at least start talking about. There are still a few TODOs in the source, particularly around making sure that two-way effects are handled correctly. Stata doesn't offer much in the way of two-way effects, since it assumes that N >> T for most panel models.

This supersedes #690, which can be closed but referred to for more information.

@vincentarelbundock
Contributor

I've been thinking about this a little bit, and have now convinced myself that forcing users to use an xtset-like function to prepare data would save us a lot of under-the-hood trouble.

What do you think about it?
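
For concreteness, an xtset-like preparation step could look roughly like the sketch below. The function name `xtset`, its signature, and its return value are hypothetical and are not part of this PR or of statsmodels; the sketch only assumes long-format data in a pandas DataFrame.

```python
import pandas as pd


def xtset(df, panel, time):
    """Hypothetical xtset-like helper (not a statsmodels function).

    Indexes long-format data by (panel, time), sorts it, rejects repeated
    (panel, time) pairs, and reports whether the panel is balanced.
    """
    out = df.set_index([panel, time]).sort_index()
    if out.index.has_duplicates:
        raise ValueError("repeated (panel, time) pairs")
    # Balanced means every panel unit is observed the same number of times.
    counts = out.groupby(level=panel).size()
    balanced = bool(counts.nunique() == 1)
    return out, balanced


df = pd.DataFrame({
    "firm": [2, 1, 1, 2],
    "year": [2000, 2001, 2000, 2001],
    "y": [3.0, 2.0, 1.0, 4.0],
})
indexed, balanced = xtset(df, "firm", "year")
```

Doing the validation once, up front, is what would let the model classes assume a sorted, duplicate-free index.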

@josef-pkt
Member

About xtset: I have suggested this or something similar before (there might be an open issue).
One problem I have with Stata in regular usage is that it only allows one dataset to be active, which is a pain when I try to prepare 3 examples at the same time. (I thought that was related to having separate index and data.)

My general opinion (not having looked at this branch in a while, except for Poisson):
A more restrictive structure on the data, and a separate "xt-index", will be useful. But I think "forcing" too restrictive a structure will prevent usages that are not typical for the area for which it was initially written.
For example, GEE (PR) allows for a multi-dimensional (continuous) time-index so that it can also be used as spatial index.
In most applications for panel/longitudinal data that I looked at recently (microeconometrics), "time" is not calendar time, just an integer (discrete) or float (continuous) event time index.
(If I understand now correctly, the SUR model in sysreg is the same as a balanced short panel with unrestricted covariance matrix, if we flip time and cross-section index.)

Of course, a common sub-case is the standard (macro-) panel, with calendar time, a cross-section, and two-way effects.

@vincentarelbundock
Contributor

Right, but then you can have different data prep functions, like Stata's stset for survival-time data. There could be quite a bit of code reuse between these data prep functions too. It just seems ugly to handle all sorts of data input in the model classes.

@jseabold
Member Author

Part of this PR was unifying the data handling so that it works for any panel data model, not just the linear case, and so that it's handled in this base class. The way it works now (which I think is unchanged from before, just more general) is that you can either give time and panel to any panel data model, in which case these are separate indices, or you can give y and X where the index is a MultiIndex that has time and panel as the respective levels.

https://github.com/statsmodels/statsmodels/pull/1133/files#diff-8ab5d9484c849d2418de300970ad5b58R84
https://github.com/jseabold/statsmodels/blob/6a6b01bc8ef9a79aa3cf115b8e0a399c1e1f22cb/statsmodels/panel/base/data.py
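
As a rough illustration of why the two input forms carry the same information, the pandas-only sketch below builds a MultiIndex from separate time and panel arrays and then recovers them from the index. The variable names, the level names, and the level order are assumptions made for the example; it does not call any of the model classes in this PR.

```python
import numpy as np
import pandas as pd

# Form 1: data plus separate time and panel indices (long format).
panel = np.repeat(["a", "b", "c"], 2)
time = np.tile([2000, 2001], 3)
y_values = np.arange(6.0)

# Form 2: the same data with the indices folded into a MultiIndex.
# Level names and order here are illustrative assumptions.
index = pd.MultiIndex.from_arrays([time, panel], names=["time", "panel"])
y = pd.Series(y_values, index=index, name="y")

# The separate indices can be recovered from the MultiIndex levels,
# so a shared data-handling layer can normalize either input form.
time_back = y.index.get_level_values("time").to_numpy()
panel_back = y.index.get_level_values("panel").to_numpy()
```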

It makes sense for something like survival models (stset), where you might have different kinds of censoring, etc., because the information there can affect the estimation. But I'm not sure what we'd gain in the panel case. I'm open to this change if it will make some things easier, but I don't see how yet.

All of the potential code reuse is in your groupings class, which, I agree, should be reusable for the survival models, though it may take a bit more work to generalize. I was just looking at them again last weekend.

@jseabold
Member Author

Note that groupings is now attached to the model.data attribute only, and no longer to the models as well.

@jseabold
Member Author

I also see that the data changes have partially broken older cases (or revealed bugs).

@coveralls

Coverage Status

Coverage remained the same when pulling 6d40ebe on jseabold:panel-vincent into 3b7082c on statsmodels:master.

@coveralls

Coverage Status

Coverage remained the same when pulling dbd8a00 on jseabold:panel-vincent into 3b7082c on statsmodels:master.

@coveralls

Coverage Status

Coverage remained the same when pulling f3881db on jseabold:panel-vincent into 3b7082c on statsmodels:master.

'''Apply to a sub-group of observations'''
n = subset.shape[0]
B = np.ones((n,n)) / n
out = subset - chain_dot(np.diag(theta[position]), B, subset)
Member


I guess this could be replaced with something that avoids (n, n) arrays, unless subset is always small.
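
One way this might be done: multiplying by `B = np.ones((n, n)) / n` just replaces every row with the column means, and multiplying by a diagonal matrix just scales rows, so both can be expressed with broadcasting. The sketch below assumes `subset` is a 2-D `(n, k)` array and that `theta[position]` is a 1-D array of length `n`; the function names and the `theta_g` argument are hypothetical, and `@` stands in for `chain_dot`.

```python
import numpy as np


def demean_dense(subset, theta_g):
    """Reference version mirroring the snippet above; builds (n, n) arrays."""
    n = subset.shape[0]
    B = np.ones((n, n)) / n
    return subset - np.diag(theta_g) @ B @ subset


def demean_broadcast(subset, theta_g):
    """Same result without (n, n) arrays, using O(n * k) memory."""
    # B @ subset is the vector of column means repeated on every row.
    group_mean = subset.mean(axis=0, keepdims=True)  # shape (1, k)
    # diag(theta_g) @ M scales row i of M by theta_g[i].
    return subset - np.asarray(theta_g)[:, None] * group_mean


rng = np.random.default_rng(0)
subset = rng.normal(size=(5, 3))
theta_g = rng.uniform(size=5)
same = np.allclose(demean_dense(subset, theta_g),
                   demean_broadcast(subset, theta_g))
```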

@josef-pkt
Member

I just had a quick look to see whether it can be merged soon, so that we would have everything together to start comparing GEE, Panel, and others.

@jseabold
Member Author

I just realized that the handle_data subclass abstraction is in this PR and not master. I'm going to make a PR with just this change in master, because I think it's going to be generally useful. E.g., with survival models as well.

@jseabold
Member Author

Rebased after merge of #1421.


4 participants