
DES: Use MultiIndex to make params *always* 1-dimensional #3652

Closed
jbrockmendel opened this issue May 10, 2017 · 5 comments

@jbrockmendel
Contributor

Discussion topic:

Over in #3651 we've been troubleshooting some problems caused in MNLogit because some methods are not expecting params (or bse or tvalues or ...) to be 2-dimensional.

In cases with 2-dimensional params I consistently have to go back and check which dimension is which. If we cast these into a pd.DataFrame internally (except for in the optimization step), we could have the labels and could attach names to the indexes, e.g. "Exog" and "Eq#".

Then any time a method assumes params are 1-dimensional, we have stack and unstack to move between the 1-d and 2-d layouts.
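As a sketch of that stack/unstack round trip in plain pandas (the level names "Exog" and "Eq#" and the variable labels are illustrative, not existing statsmodels behavior):

```python
import numpy as np
import pandas as pd

# Hypothetical 2-d params block: rows are exog terms, columns are equations.
params = pd.DataFrame(
    np.arange(6.0).reshape(3, 2),
    index=pd.Index(["const", "x1", "x2"], name="Exog"),
    columns=pd.Index([0, 1], name="Eq#"),
)

# A method that assumes 1-d params can work on the stacked view ...
flat = params.stack()  # Series with an ("Exog", "Eq#") MultiIndex
assert flat.ndim == 1 and flat.shape == (6,)

# ... and the 2-d layout is recovered losslessly afterwards.
roundtrip = flat.unstack("Eq#")
assert roundtrip.equals(params)
```

With named levels, "which dimension is which" is answered by the labels instead of by re-reading the docstring.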

A secondary motivation is cases that don't raise errors but make it easy to shoot myself in the foot, like irf.cov(), which returns an array with shape (steps+1, neqs**2, neqs**2). (The deprecation of pd.Panel notwithstanding,) ideally those latter two dimensions should have indexes that behave a lot like a pd.MultiIndex.
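A sketch of what labeled axes for those two neqs**2 dimensions could look like; the variable names and the response/impulse level names are assumptions for illustration, not existing irf.cov() behavior:

```python
import numpy as np
import pandas as pd

# Hypothetical 3-equation system; cov_slice stands in for one
# (neqs**2, neqs**2) slice of irf.cov() at a single horizon step.
names = ["y1", "y2", "y3"]
cov_slice = np.eye(len(names) ** 2)

# A MultiIndex makes each of the neqs**2 positions self-describing:
# a (response, impulse) pair instead of a flat integer position.
idx = pd.MultiIndex.from_product([names, names], names=["response", "impulse"])
labeled = pd.DataFrame(cov_slice, index=idx, columns=idx)

# Look up an entry by label rather than computing i * neqs + j by hand.
assert labeled.loc[("y2", "y1"), ("y2", "y1")] == 1.0
assert labeled.loc[("y2", "y1"), ("y1", "y3")] == 0.0
```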

@josef-pkt
Member

see also the thread here https://mail.python.org/pipermail/numpy-discussion/2017-February/076474.html

The trade-off is that we might get some conveniences or improvements by using pandas inside the models, but if we mix data structures, then it is difficult, or a lot of work, to keep track of small differences in behavior, e.g. iterating over a pandas DataFrame yields columns, while iterating over a numpy array yields rows (axis 0).

So far the trade-off was in favor of using only numpy inside the models, and restricting pandas to the interface or where using it really has a large advantage.

Also, the core parts of the models are just "math"; they don't care what the variables mean, and carrying around labels (column and index names) is mostly irrelevant to the math and code.

(pandas was initially designed to be user-friendly rather than developer-friendly, with two-letter names that I never remembered and too much magic second-guessing the user, besides the differences in behavior.)

Note: this is a recurrent issue because we always face this decision and trade-off, and there are recurrent discussions about what to do in specific cases. The trend is going towards more pandas, also because more users and potential contributors are familiar with pandas than with pure numpy.

In these specific cases, I don't think using pandas is worth the pain.
MNLogit is already something of an outlier in the model class hierarchy, and if we start to add pandas inside, it will become even more so, and we will be even less able to write generic code that MNLogit can share.

@jbrockmendel
Contributor Author

see also the thread here https://mail.python.org/pipermail/numpy-discussion/2017-February/076474.html

That's a pretty compelling case. My understanding is that the pandas team intends to address the indexing "warts" in 2.0, but that's still distant-horizon.

So far the trade-off was in favor of using only numpy inside the models, and restricting pandas to the interface or where using it really has a large advantage.

What happens to your opinion if we focus the suggestion on the Results classes and not Model classes? A lot of what I'm thinking of boils down to applying wrap operations early instead of late.

An idea I toyed with yesterday, with little success, was to patch attach_columns to check whether the output was already a pandas object and, if so, just pass it through. The upside is that it could allow easier fixes for special cases like yesterday's, and possibly a smooth transition if the early-wrapping idea above were adopted. The downside is that this could lead to snowflakes piling up.
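A standalone sketch of that pass-through check. The real attach_columns is a method on statsmodels' results-wrapper machinery with a different signature; this free function and its argument names are purely illustrative:

```python
import numpy as np
import pandas as pd

def attach_columns(result, names):
    # If an earlier step already wrapped the output in a labeled pandas
    # object, keep those labels instead of re-wrapping it. This is the
    # "snowflake" risk: each such case bypasses the common wrapping path.
    if isinstance(result, (pd.Series, pd.DataFrame)):
        return result
    return pd.Series(np.asarray(result), index=names)

wrapped = attach_columns(np.array([1.0, 2.0]), ["const", "x1"])
passed = attach_columns(pd.Series([3.0], index=["x1"]), ["const", "x1"])
assert list(wrapped.index) == ["const", "x1"]
assert list(passed.index) == ["x1"]  # pre-labeled object passes through
```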

@josef-pkt
Member

The same as for the models applies to the results classes; those are essentially part of the core computation for the models. That Results are separate from Models is mostly internal housekeeping (Results hold fit results; the model doesn't know the results of fit) and enables code reuse and inheritance.

As you seem to have discovered, a results instance is the wrapped instance, and results._results is the original instance with numpy arrays (if the data is pandas or similar). This is very convenient and useful because users get wrapped results, while internal code uses the numpy-based _results.

The main reason is that we still want to use numpy for follow-up or post-estimation computation, like the Wald tests, conf_int and similar.
In some cases where we have post-estimation code outside of the results class, we have code that accesses the original results like:

def post_estimation(results, ...):
    # accept either a wrapped results instance or the bare _results
    if hasattr(results, '_results'):
        results = results._results
A minor detail illustrates what we need to watch out for when mixing pandas and numpy: pandas' var and std use ddof=1 by default, while numpy uses ddof=0.
If we don't check that and don't always force ddof, then our results differ depending on the type of the object.
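A minimal demonstration of that mismatch, and of forcing ddof to make the two containers agree:

```python
import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0, 4.0])
s = pd.Series(x)

# Same data, different defaults: numpy divides by n, pandas by n - 1.
assert np.var(x) == 1.25                  # ddof=0: 5 / 4
assert np.isclose(s.var(), 5 / 3)         # ddof=1: 5 / 3

# Forcing ddof explicitly removes the container-type dependence.
assert np.isclose(np.var(x, ddof=1), s.var(ddof=1))
```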

I'm very risk averse in large refactorings of existing models, unless there is a very strong reason. I think it's better to introduce new approaches in the fringes and with new model or other classes where a new pattern promises some advantages. When we have some experience and if it turns out to be useful for the generic case, then large scale or core code refactoring is feasible.

(That's also true for many generic enhancements. E.g., the still-existing code duplication in the sandwich cov_type is there because it was expanded in several rounds until we settled on the fit keyword pattern and had unit tests for all main classes for its use case.)

@jbrockmendel
Contributor Author

I know the cov behavior is just an example, but for reference, it looks like they may add a kwarg to allow overriding that: pandas-dev/pandas#16227

@josef-pkt
Member

The mental effort of having to think about ddof and similar still remains. And it won't be "fun" to debug why the precision of the results is low (although I'm quite used to debugging degrees-of-freedom corrections against Stata).
