SummaryTable is brutally cumbersome #414

Open
jseabold opened this issue Aug 2, 2012 · 5 comments

@jseabold
Member

jseabold commented Aug 2, 2012

Having added summary tables to AR, ARIMA, and now marginal effects for discrete choice variables, I have to say that doing this is brutally difficult and rather annoying. There's got to be a better way or a refactor that could sort this out.

@vincentarelbundock
Contributor

I was playing around with some tables today and I must say I agree with your comment. In terms of defining a vision for how to move forward with this, what's the desired feature set? Would it be sufficient to have a couple of helper functions that populate a pandas DataFrame, and then use DataFrame.to_string() or DataFrame.to_html()?
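A minimal sketch of what such a helper might look like, assuming pandas is installed. The function name `params_frame`, the column labels, and the numbers are all hypothetical and only illustrate the idea; `DataFrame.to_string()` and `DataFrame.to_html()` are the real pandas methods mentioned above.

```python
import pandas as pd


def params_frame(names, params, bse):
    """Collect per-parameter results into a DataFrame (hypothetical helper)."""
    return pd.DataFrame({"coef": params, "std err": bse}, index=names)


# Made-up numbers, for illustration only.
table = params_frame(["const", "x1"], [1.25, -0.5], [0.1, 0.25])

# Render the same data as plain text and as HTML with three decimals.
text = table.to_string(float_format=lambda v: f"{v:.3f}")
html = table.to_html(float_format=lambda v: f"{v:.3f}")
print(text)
```

This covers a single homogeneous table; it says nothing yet about multi-part layouts or LaTeX output.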

Does statsmodels need more complex tables than simple arrays (e.g. multicolumn)?

@josef-pkt
Member

Look at an example, e.g. http://nbviewer.ipython.org/3484294/

results.summary() consists of 5 tables: the top and the bottom are each 2 horizontally concatenated tables, and the simpler single table with params sits in the middle.
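To make "horizontally concatenated" concrete, here is a rough pure-Python sketch that joins two already rendered text tables side by side. The helper `hstack_text` and the sample contents are hypothetical; this is not how SimpleTable does it internally.

```python
def hstack_text(left, right, sep="   "):
    """Join two multi-line text tables side by side, padding the shorter one."""
    left_lines = left.splitlines()
    right_lines = right.splitlines()
    height = max(len(left_lines), len(right_lines))
    left_lines += [""] * (height - len(left_lines))
    right_lines += [""] * (height - len(right_lines))
    # Pad every left-hand line to the same width so the right block lines up.
    width = max((len(line) for line in left_lines), default=0)
    return "\n".join(
        (a.ljust(width) + sep + b).rstrip()
        for a, b in zip(left_lines, right_lines)
    )


# Made-up contents, for illustration only.
top_left = "Dep. Variable:   y\nModel:         OLS"
top_right = "R-squared:   0.42\nNo. Obs.:      100"
print(hstack_text(top_left, top_right))
```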

discrete models are missing the regression diagnostics
http://nbviewer.ipython.org/3484274/

This is not designed for quick simple tables; pandas works much better in that case.
What I worked on with summary is to have enough control to get a nice table for the standard case. Tables that follow the same pattern are largely boilerplate. New tables with a different pattern are "work".

In a branch I made some changes to the html rendering (since align on decimal doesn't exist)
bb4dafa

Some problems where I think the greater control over rendering helps (compared to pandas, AFAIK):

  • different precision in different columns,
  • columns that have numbers like 1e-20 and 1e4 at the same time,
  • the regression summary (top and bottom table), where each line has different units.

I don't know how much control over formatting we can get with using a DataFrame. In simpler cases with a table of just numbers (which are pretty homogeneous) pandas is more convenient, but I doubt we have enough control for fancier formatting in more complex tables.

(SimpleTable also renders LaTeX)
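As an illustration of the formatting problems listed above, here is a small pure-Python sketch with per-column precision and a switch to scientific notation for very small or very large values. The names `format_number` and `format_columns` and the cut-offs `low` and `high` are assumptions made for this example, not part of statsmodels or pandas.

```python
def format_number(value, precision=3, low=1e-4, high=1e5):
    """Use fixed notation for moderate magnitudes, scientific otherwise."""
    if value == 0:
        return f"{0.0:.{precision}f}"
    if low <= abs(value) < high:
        return f"{value:.{precision}f}"
    return f"{value:.{precision}e}"


def format_columns(rows, precisions):
    """Format each column of `rows` with its own precision."""
    return [
        [format_number(v, p) for v, p in zip(row, precisions)]
        for row in rows
    ]


# Made-up numbers: the second column mixes 1e-20 with an ordinary value.
rows = [[1.0e4, 1.0e-20], [2.5, 0.0321]]
for line in format_columns(rows, precisions=[1, 3]):
    print("  ".join(cell.rjust(10) for cell in line))
```

Functions of this kind operate on one value at a time, so in principle they could be handed to whichever table class ends up doing the rendering.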

@vincentarelbundock
Contributor

Yes, flexibility does seem to matter, and LaTeX support is a big plus. But I'm not sure that pandas DataFrames are really limited to simple cases. They can basically behave like the SimpleTable building block: you can concatenate them horizontally, or stack tables with different numbers of columns vertically by printing them one after the other and forcing them to have equal width. to_string() also takes a "formatters" argument, which can apply arbitrary functions to the columns you want. So in theory we could write a pretty simple "align_float_on_decimal" function that would give us neatly formatted columns. Of course, if there are too many formatter functions to write, reinventing the wheel wouldn't be worth it.
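A rough sketch of the decimal-alignment idea, written in plain Python on a list of already formatted number strings. The name `align_on_decimal` is hypothetical. One caveat, as far as I understand the pandas API: the functions passed via "formatters" are applied to one element at a time, so an aligner that needs the width of the whole column would have to be applied to the column before rendering.

```python
def align_on_decimal(texts):
    """Pad number strings so their decimal points share one column."""
    if not texts:
        return []
    # Split each string into (integer part, ".", fractional part).
    parts = [t.partition(".") for t in texts]
    int_width = max(len(head) for head, _, _ in parts)
    frac_width = max(len(sep) + len(tail) for _, sep, tail in parts)
    return [
        head.rjust(int_width) + (sep + tail).ljust(frac_width)
        for head, sep, tail in parts
    ]


# Made-up values, for illustration only.
column = [f"{v:g}" for v in (12.5, 0.125, -3.0, 100.0)]
for cell in align_on_decimal(column):
    print(f"|{cell}|")
```

Strings in scientific notation without a decimal point (such as "1e-20") are simply right-aligned in the integer part here; handling them properly would need more code.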

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.to_string.html

@josef-pkt
Member

Lots of options.
I'm not sure you will end up with less setup and formatting code than with https://github.com/statsmodels/statsmodels/blob/master/statsmodels/tsa/arima_model.py#L1314

Two possibilities

  • try out using pandas for at least simpler tables (no LaTeX for now), or
  • try to streamline the current summary helper functions and classes.

On the second: I worked on the summary for two weeks or so, and was happy enough when I got it to work as it is now. I ran out of patience fighting with this, and didn't go back to see if it could be made more convenient or cleaner.

However, for summary() the steps

  • fetch results
  • reformat them to the correct string representation
  • stick them in a "table"

look to me like they will be necessary however we create the tables.
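The three steps can be sketched in plain Python as below. Everything here is a hypothetical stand-in: `fake_results` plays the role of a fitted model's results, and the function names, format strings, and numbers are invented for the example rather than taken from statsmodels.

```python
def fetch_results(results):
    """Step 1: pull (label, value) pairs from a results mapping."""
    return [(name, results[name]) for name in ("nobs", "rsquared", "aic")]


def to_strings(pairs, formats):
    """Step 2: turn each value into its display string."""
    return [(label, format(value, formats[label])) for label, value in pairs]


def build_table(cells, title):
    """Step 3: lay the strings out as a two-column text table."""
    label_width = max(len(label) for label, _ in cells)
    value_width = max(len(text) for _, text in cells)
    width = label_width + 2 + value_width
    lines = [title.center(width), "=" * width]
    lines += [
        label.ljust(label_width) + "  " + text.rjust(value_width)
        for label, text in cells
    ]
    return "\n".join(lines)


# Hypothetical stand-in for a fitted model's results; numbers are made up.
fake_results = {"nobs": 100, "rsquared": 0.4213, "aic": 1234.5678}
formats = {"nobs": "d", "rsquared": ".3f", "aic": ".1f"}

cells = to_strings(fetch_results(fake_results), formats)
print(build_table(cells, "Summary"))
```

Only the third step depends on the table backend, which suggests that switching between SimpleTable and a DataFrame would leave most of the setup code in place.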

For homogeneous tables like summary_frame in outliers, a pandas DataFrame is nicer because it can do the rendering and hold the data at the same time.

@vincentarelbundock
Contributor

Yeah, you're probably right. I'm still a bit curious, so if I have time I'll try to put together a minimal working example with pandas, just to get a better sense of how close to an acceptable result we can get using 40 lines of code.
