SimpleTable in Summary (e.g. OLS) is slow for large models #1385

Closed
jcjf opened this Issue Feb 13, 2014 · 29 comments

Projects

None yet

4 participants

@jcjf
Contributor
jcjf commented Feb 13, 2014

The following code is slow for me with statsmodels 0.5.0 on Python 3.3 64-bit (Windows, Christophe Gohlke's binary):

from __future__ import print_function
import numpy as np
import statsmodels.api as sm

X = np.random.rand(1000, 500)  # Or whatever is big enough to be slow
y = np.random.rand(1000)
model = sm.OLS(y, X)
results = model.fit()          # This is quite fast but..
print(results.summary())       # this will take a long time
@jseabold
Member

I've also run into the problem that SimpleTable is very, very slow before for larger tables. My advice (for now) would be not to call summary for such a large problem and wrap your own solution for looking at the statistics you want.

@jseabold
Member

I'm going to edit the title to reflect the issue as I understand it.

@josef-pkt
Member

There might be a difference in the extra statistics and in the parameter table in terms of speed.
My guess is that some of the additional test statistics that are calculated are slow.

If the params_table itself is also slow, then it is a problem with SimpleTable.

In terms of usage: What information do you want from the summary() when you have 500 parameters? It looks like a long and unwieldy printout to me.

summary() for OLS is creates three tables, We could make a partial summary for large problems.

@jseabold
Member

SimpleTable itself is pretty slow for moderately sized problems. In the past I tried to use it to do formatting for ~5000 observations so I could load it into fixed-width GAMS format. In the end I had to go with my own approach because my program slowed to a crawl here. My guess at the time was that it was due to the fact that SimpleTable creates a Python object for each cell.

Doing some timings it looks like statistics start to take non-trivial time for moderately sized parameters (on my machine between 500-1000), but a decent percentage of time is spent in table.py in all cases. Looking briefly, it seems like the .var method of arrays takes up much of the time once the parameters are 'big' for the statistics. I don't see on a quick glance where this comes from.

Closing as won't fix. Summary is a big beast. If you narrow down the bottleneck by going through it, please reopen this issue and add your findings.

@jseabold jseabold closed this Feb 13, 2014
@josef-pkt
Member

The alternative, besides improving the current code, is to make a summary_light, or see if summary2 can add additional option.

summary is supposed to give a quick overview over the results. I only ever checked which statistics would be useful, not how long it takes to create them.

@vincentarelbundock
Member

SM uses basically none of the advanced features of SimpleTable. We could replace all of it with a couple string join statements and about 20 lines of code. Would probably be much faster too.

Summay2 is close back-end agnostic, so it could keep using SimpleTable or switch to a faster option easily.

@vincentarelbundock
Member

Also, those "advanced" features of SimpleTable are not nearly as useful as they sound IMHO.

@josef-pkt
Member

to_html is used in ipython notebooks AFAIR
I don't know whether anyone uses to_latex

In spite of all the problems summary has two advantages:
it creates reasonably pretty (IMO) and readable printouts in examples, tutorials and in the interpreter, and
it's easy (for me) to write a new summary for a new model (as long as that doesn't break the pattern too much). ('Better the devil you know ...')

@vincentarelbundock
Member

Key expression being "for you" :)

Again, the goal of the new system would be that it wouldn't be much of an investment for anyone to learn it. But I feel we've had this discussion before and I don't want to insist too much.

@jseabold
Member

I tend to use pandas' to_latex and I don't bother much with to_html (I have it turned off in my pandas options for the notebooks), though both features are easily replicated outside of SimpleTable.

We've definitely been down this road in #414 and should probably continue this discussion there. I tend to side with Vincent on this. Neither of those are really advantages when you consider you can do exactly the same thing with Python's built-in string template or formatting, which are dead simple. We might have to do a bit more hard-coding, but IMO it's worth it for readability and ease of use (and possibly performance).

@jcjf
Contributor
jcjf commented Feb 13, 2014

It seems ironic that the internals of SimpleTable.__str__ is rather difficult to follow along. There appears some performance issue with propagating the formatting information (fmt_dict) from SimpleTable to Row to Cell. I haven't quite understood the overall data serialization architecture, so it's not obvious to me why Cell._get_fmt() needs to be so complicated.

For text-based table formatting, the non-header formatting information is generally per-column rather than per-row.

@josef-pkt
Member

I don't want to handle string templates and other things if I don't have to.

What I mean is that it took me maybe a minute or so to add params_table to t_test results
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/stats/contrast.py#L83
which has exactly the same formatting and content as the summary() params_table.
(It has a bug that I recently saw if there is only one parameter/hypothesis.)

For new models, copy, paste, adjust, and figure out which statistics we want to include
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/regression/gmm.py#L255
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/genmod/generalized_estimating_equations.py#L1073

It has to be considerably better than the current version before I go through learning new tools.

@jseabold
Member

There are now several people commenting on the complexity of SimpleTable. It is the new (non-standard) tool, not the built-in Python stuff that is ubiquitous. The burden has fallen on you to write these summary tables precisely because no one wants to take the time to understand what's going on when they want to quickly write stuff. If we worked with string formatting, you would still just copy-paste and be able to make additions in one minute. After an initial outlay, I think the maintenance burden would be lower, the flexibility higher, and the effects of summary tables just the same.

http://docs.python.org/2/library/string.html#format-examples
http://docs.python.org/2/library/string.html#template-strings

Your argument is like saying that SAS is simpler because it's what you know already...so everyone should just learn that.

@jcjf
Contributor
jcjf commented Feb 13, 2014

I think I've found the main problem in the existing code - the column width calculation in SimpleTable.get_colwidths is expensive (because Cell.format is rather complicated) and is getting called for every row in the table.

The table formatting code seems a bit backwards to me. Right now the SimpleTable tells its Row objects how they should be formatted and the Row objects tell their Cell objects how they should be formatted, but only the SimpleTable object knows what column widths will be necessary. Would it make sense for Cell objects and Row objects to simply lookup the appropriate formatting information from their parent (if one exists) and stringify themselves accordingly?

@josef-pkt
Member

@kshedden wrote GEE summary
I made a tiny change spreading the method name over two lines
db63ac2#diff-1535b76fc844ea73fbe14e973545b14bL1043

How do you do that in a string template?

I don't really care about SimpleTable as a backend, and the formatting options are a bit complicated and I mostly avoid them. But it's what we have, it works sufficiently well, and I'm not interested in writing a new one.

However, I think summary() itself has just the right abstraction, giving enough control over the details without getting lost in formatting details.

I'm talking about switching tools. I'm not buying a new car just because there are newer, fancier ones available, our car is fine for what we use it. And I don't spend a lot of time upgrading all packages and compilers and operating system, unless there is a really good reason.

But if you want to add a new p-value correction or a new sandwich robust covariance or add heteroscedasticity and correlation into RLM, then I'm very interested in getting the newest and greatest, or just one that works.

@jseabold
Member

As, I said, there will be some hard-coding with templates, so to do that, I'd hit enter and be done with it. It would take me 30 seconds and WYSIWYG. I wouldn't have to go back and forth at the interpreter after I spent 20-30 minutes reminding myself how SimpleTable works and fiddling to get the right result, or see if SimpleTable has changed since the last time I figured it out because there were more refactorings to make it more 'general'. Completely not worth it for me. SimpleTable is the fancy new car in your analogy. String formatting to me is the ubiquitous work horse that just gets me where I want to go. That said, I haven't prototyped anything or tried to replace summary. Maybe the next time I work on a summary method. This is just my opinion about why I think SimpleTable is so contentious. It's also how summary tables work in R, gretl, etc. No complicated abstractions.

@josef-pkt
Member

@jcjf I think only Alan Isaac really knows the SimpleTable code (and he is not on github as far as I can tell). We just made a few smaller changes to it over time.

Currently the column and table width expands if there are long string cells. So there must be some feedback between cells and the table object to calculate the widths.

statsmodels is not using the full formatting features of SimpleTable, as far as I know. If you see a way to short circuit some of the expensive parts, then we could make those changes, and propose them also to Alan.
If I understand you correctly, it might be possible to do more of the formatting and format calculations in advance, and make the cells dumber. But I only ever went through some small parts of the SimpleTable code, without understanding the overall connections.

For pure homogeneous table formatting, pandas should be much easier and faster.

@josef-pkt
Member

@jseabold to finish my analogy SimpleTable was the fancy new car a few years ago. But then I learned how to use it and now it's an old tool.

SimpleTable is just the far backend, and sometimes we have to fight with it. Like in this case, when we want to improve performance. And as I said, I don't really care how that part is implemented.

However, for writing a summary we don't have to fight with SimpleTable in the regular case.
A summary is using the API that I have written (based on Vincent Davis's code) that hides most of SimpleTable.
And that's the API that I like (except for some "warts", and some missing conveniences), top-left, top-right, params table, bottom.
params table is prefabricated, and we can choose what to put in top-left, top-right and how to format it, or we can skip top-right.
We can concatenate several params_tables for multi-equation, and so on.

@josef-pkt
Member

One possible problem if or when you start with templates:
The top tables of summary require flexible column widths across models (and in some cases across fit options). And we don't want to hard code a new template for each model.

The web people use templating packages to have enough flexibility to produce lots of different formatting and layouts.

@jseabold
Member

It'll be handled the same way it is in SimpleTable I imagine, using max column-width and alignment in the format.

Web templates are used so you don't have to write HTML that's mostly the same across pages. Formatting and layouts is mostly all handled by CSS these days.

@jcjf
Contributor
jcjf commented Feb 14, 2014

The bad performance culprit is definitely the repeated SimpleTable.get_colwidths. Any refactoring trick to store the column width information calculated after the first call should fix this issue.

The following example using monkeypatching demonstrates the above assertion:

from __future__ import print_function
import numpy as np
import statsmodels.api as sm
import statsmodels.iolib.table as table

old_widths = table.SimpleTable.get_colwidths
def new_widths(self, output_format, **fmt_dict):
    if not hasattr(self, 'widths_cache'):
        self.widths_cache = {}

    call_args = [output_format]
    for k, v in sorted(fmt_dict.items()):
        if isinstance(v, list):
            call_args.append((k, tuple(v)))
        else:
            call_args.append((k, v))
    lookup = tuple(call_args)
    try:
        return self.widths_cache[lookup]
    except KeyError:
        self.widths_cache[lookup] = old_widths(self, output_format, **fmt_dict)
        return self.widths_cache[lookup]
table.SimpleTable.get_colwidths = new_widths


X = np.random.rand(1000, 500)
y = np.random.rand(1000)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())  # Much faster now
@josef-pkt
Member

@jcjf
how much is the speedup?

a note before I forget it again:
In several other classes (not Results) we have added summary_frame to directly return a pandas.DataFrame with the same information or main table as is in the summary()
We have not added this to the Results classes yet.
summary_frame is in t-test, influence-outliers and Margins, IIRC

@josef-pkt josef-pkt reopened this Feb 14, 2014
@josef-pkt
Member

reopening, since this discussion will result in some changes, most likely.

@jcjf
Contributor
jcjf commented Feb 14, 2014

If I wrap the monkeypatching example in a function patch() with an undo function unpatch():

In [4]: %timeit patch(); str(results.summary()); unpatch()
1 loops, best of 3: 332 ms per loop

In [5]: %timeit str(results.summary())
1 loops, best of 3: 57 s per loop
@josef-pkt
Member

That's really a big improvement.
So it's really mostly SimpleTable and not so much the extra calculations in summary.

Did you look at some (smaller) examples for summary() to see if it visually changed?
You could open a pull request with this change. We have a limited test suite for SimpleTable, but only smoke tests for summary().
I can check then a few examples across models where I know we have large changes in the column and table widths.

Thanks for looking into this.

@jcjf
Contributor
jcjf commented Feb 14, 2014

I'm not intimately familiar with statsmodels or even statistics in general, so my understanding of this issue is really limited to my own very rudimentary use cases.

Apart from the monkeypatching, the way I tried to make (output_format, fmt_dict) into something hashable is still somewhat of a hack. If the technique used above seems acceptable, then a pull request should be pretty easy.

@josef-pkt
Member

SimpleTable doesn't contain any statistics, so no familiarity with it required.

I briefly looked at the code to see how complicated it is.
As far as I can tell, you are essentially adding a memoize (caching) to it, without changing the underlying calculations. This doesn't sound dangerous or too hackish to me.

Make a PR and we see how far unit tests and examples are affected.
You could rename the current get_colwidths by adding a leading underscore and add your version.

Hopefully it works on all examples, then I won't have to understand the details.

@jcjf
Contributor
jcjf commented Feb 15, 2014

The tests don't all pass for me locally, should I just rely on Travis-CI?

@josef-pkt
Member

If you open the PR, I can look. TravisCI is our first public check, then we still test across different machines and across versions of dependencies.

If there are failures with features of SimpleTable that we actually use in statsmodels, then we should fix it. I haven't tried to figure out what the hackiness with the hash in your change is, and I haven't seen test failures to guess where it may cause problems.

@josef-pkt josef-pkt added a commit that closed this issue Feb 22, 2014
@josef-pkt josef-pkt Merge branch 'memoize-colwidths_rebased'
REF performance, faster SimpleTable  closes #1385 closes #1395
f62a7b5
@josef-pkt josef-pkt closed this in f62a7b5 Feb 22, 2014
@PierreBdR PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014
@josef-pkt josef-pkt Merge branch 'memoize-colwidths_rebased'
REF performance, faster SimpleTable  closes #1385 closes #1395
347e3b2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment