SUMM/ENH/REF: performance review of GLM, fit_gradient #4625

Open
josef-pkt opened this issue May 13, 2018 · 32 comments
Comments

@josef-pkt
Member

josef-pkt commented May 13, 2018

There is still some slack in GLM, especially in fit_gradient, to improve performance in terms of both speed and memory consumption.

related issues:

...

other possibilities:

  • irls start_params for fit_gradient only needs params (see the sketch after this list):
    • we can copy params and delete irls_rslt immediately to free up memory
    • irls_rslt doesn't need to compute cov_params, especially not a robust cov_type
    • it should be possible to use skip_hessian or a return option to only get the params out of the irls fit
  • irls: add option to use moment matrices
  • memory usage in fit_irls
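
A minimal sketch of the idea in the first bullet, at the user level with synthetic data (the actual change would live inside GLM's fit_gradient path; the max_start_irls option is assumed to behave as in the current GLM.fit signature):

import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
exog = sm.add_constant(rng.rand(10000, 5))
endog = rng.poisson(np.exp(exog.sum(1) * 0.1))
model = sm.GLM(endog, exog, family=sm.families.Poisson())

irls_rslt = model.fit(method='IRLS', maxiter=25)    # IRLS run used only for start values
start_params = np.asarray(irls_rslt.params).copy()  # keep only the params
del irls_rslt                                       # drop cov_params and the rest early

# gradient-based fit seeded with the IRLS params; max_start_irls=0 is assumed
# to skip the internal IRLS warm start
rslt = model.fit(method='bfgs', start_params=start_params, max_start_irls=0)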
@josef-pkt
Member Author

@thequackdaddy
I think it should be possible, still for 0.9, to remove cov_type from the IRLS fit used for the fit_gradient start_params, and to del irls_rslt after copying the params. (That change doesn't look dangerous.)
Can you try this out, and see whether it helps memory consumption in your use cases?

@thequackdaddy
Member

@josef-pkt I'll try to test it out later today.

Another question on performance...

On my system, this line is relatively slow. It seems to only process on 1 core. Takes about 6-7 for 12 million x 50 exog.

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/genmod/generalized_linear_model.py#L479

A straight element-wise multiplication between the score_obs and exog. Is there a parallelized numpy function for this?

@thequackdaddy
Member

There are also a few lines in the hessian calculation that are slow... it would be nice to take advantage of more cores.

@josef-pkt
Copy link
Member Author

Takes about 6-7 for 12 million x 50 exog.

6-7 of what? minutes, hours?

Is there a parallelized numpy function for this?

No, not AFAIK. numpy and scipy don't parallelize computations, only the underlying linalg libraries do that. We would have to do it on our own, or use Dask.

One general issue is Fortran versus C order in those operations on large arrays.
In the old days, numpy linalg worked better on F-ordered arrays because of the underlying Fortran BLAS/LAPACK, while elementwise numpy operations worked better on C-ordered arrays.
I have no idea what the current timings are.

Another possibility: we currently don't reuse memory much, e.g. by using the out argument of numpy functions. It might improve performance if we keep memory around instead of allocating new arrays all the time. For large arrays this might have an observable effect.
(I don't know the technical details, but avoiding new allocations of large arrays might also help prevent memory fragmentation; I have no idea how relevant that is for RAM.)
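
A minimal sketch of the out-argument idea, with made-up array names and sizes; whether this pays off in practice would need profiling:

import numpy as np

rng = np.random.RandomState(0)
exog = rng.rand(1000000, 20)
buf = np.empty_like(exog)                # scratch array, allocated once

for _ in range(5):                       # stand-in for optimizer iterations
    score_factor = rng.rand(exog.shape[0])
    np.multiply(exog, score_factor[:, None], out=buf)  # writes into buf, no new large array
    score = buf.sum(axis=0)              # reduce without another large temporary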

@thequackdaddy
Member

thequackdaddy commented May 13, 2018

Sorry... I meant 6-7 seconds. It doesn't seem like a lot, but that calculation is run in every iteration of the fit_gradient optimization.

@thequackdaddy
Member

Another question... could this be a place for numba? dask/joblib seem like they would bring too much overhead for this simple calculation.

@josef-pkt
Member Author

could this be a place for numba?

Maybe. I thought about it, but I guess not for this.
numba is also slow if it has to allocate memory, and the computation itself shouldn't differ much. (I got good speed improvements with numba in my examples after I allocated and initialized the output array outside of the numba-jitted function.)

numba parallel loop was only recently reintroduced and I wouldn't bet on it yet. And it might not help in getting a contiguous output array in parallel either.

(I'm mostly guessing.)

@josef-pkt
Member Author

To change the target a bit.
For the optimization we need score and not score_obs, and score is a reduce operation.
If we compute score without constructing the full score_obs array, then we don't need the full memory. This means using numba in score, or using batched/blocked numpy summation in score, is most likely faster and will certainly use less memory.
We can still use score_factor because it's 1-dim. But we can reduce immediately when multiplying with exog.

Using numpy this would be easy to try in score, e.g. if nobs > 1 million, then sum in blocks of 100,000 (or something like that) observations.
A full numba loop wouldn't have to allocate any memory in that case.

Even simpler: score = exog.T.dot(score_factor), which should be just a plain linalg operation.
The hessian still needs one elementwise multiplication before the dot.
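
A hedged sketch of the blocked summation described above (the function name and block size are illustrative, not statsmodels code):

import numpy as np

def blocked_score(exog, score_factor, block_size=100000):
    # accumulate exog.T @ score_factor in blocks so the full
    # nobs x k_exog score_obs array is never materialized
    score = np.zeros(exog.shape[1])
    for start in range(0, exog.shape[0], block_size):
        stop = start + block_size
        score += exog[start:stop].T.dot(score_factor[start:stop])
    return score

# the even simpler variant from the comment above, a single BLAS call:
# score = exog.T.dot(score_factor)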

@josef-pkt
Member Author

about hessian:
It is possible or likely that using np.einsum is faster than the current version for large arrays.
AFAIK einsum is still slower than linalg, but it would avoid allocating the array for the elementwise multiplication.
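
For concreteness, a hedged sketch of the two variants with synthetic data (hessian_factor and exog as used elsewhere in this thread; this is not the statsmodels implementation):

import numpy as np

rng = np.random.RandomState(0)
exog = rng.rand(200000, 20)
hessian_factor = rng.rand(200000)

# current-style: elementwise multiply (allocates an nobs x k temporary), then dot
hess_dot = np.dot((hessian_factor[:, None] * exog).T, exog)

# einsum alternative: avoids the explicit temporary for the elementwise product
hess_einsum = np.einsum('j,ji,jk->ik', hessian_factor, exog, exog)

assert np.allclose(hess_dot, hess_einsum)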

@thequackdaddy
Member

So using numba to calculate score shows pretty good improvements....

[screenshot: timing comparison]

@thequackdaddy
Member

@josef-pkt

I think it should be possible, still for 0.9, to remove cov_type from the IRLS fit used for the fit_gradient start_params, and to del irls_rslt after copying the params

Honestly, this doesn't improve it that much. I'll put up a PR with this.

To be totally clear, it's the full IRLS fit that seems to chew up a bunch of memory. Maybe I should use some other wls_method? It feels like each iteration chews up slightly more RAM, but once the IRLS optimization converges and the loop exits, the RAM is freed.

@josef-pkt
Member Author

It feels like each iteration chews up slightly more RAM, but once the IRLS optimization converges and the loop exits, the RAM is freed.

I'm not sure what should be causing this.
We don't keep any results from iterations around except for the history.
Maybe something is slowing down garbage collection.

I don't know if this would help, but you could try adding a copy to params in _update_history, in case tmp_result is not immediately garbage collected because of the reference:
history['params'].append(tmp_result.params.copy())

@josef-pkt
Member Author

So using numba to calculate score shows pretty good improvements....

compare this also with just the dot product score_factor.dot(exog) or exog.T.dot(score_factor)

@thequackdaddy
Member

thequackdaddy commented May 13, 2018

Back to numba...

I thought I read somewhere that declaring types can speed things up... but when I tried, it was actually about 0.5 seconds slower. I'll just not declare types...

import numba

@numba.jit('void(float64[:], float64[:, :], float64[:])', parallel=True, nopython=True)
def score(score_factor, exog, scores):
    # accumulate the column sums of score_factor[:, None] * exog into scores in place
    nobs, param_count = exog.shape
    scores[:] = 0.
    for i in range(nobs):
        for j in range(param_count):
            scores[j] += score_factor[i] * exog[i, j]
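
For reference, a call might look like this (assuming C-contiguous float64 arrays and the jitted score function above in scope; the output array is preallocated by the caller):

import numpy as np

rng = np.random.RandomState(0)
exog = rng.rand(1000000, 50)
score_factor = rng.rand(1000000)

scores = np.empty(exog.shape[1])     # output preallocated by the caller
score(score_factor, exog, scores)    # fills scores in place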

@thequackdaddy
Member

Back to IRLS...

Why do we do this?

if maxiter > 0: # Only if iterative used
    wls_method2 = 'pinv' if wls_method == 'lstsq' else wls_method
    wls_model = lm.WLS(wlsendog, wlsexog, self.weights)
    wls_results = wls_model.fit(method=wls_method2)

As far as I can tell, this is only used to come up with normalized_cov_params.

@josef-pkt
Member Author

As far as I can tell, this is only used to come up with normalized_cov_params.

Essentially, yes. I guess the main reason to do it is backwards compatibility with the behavior from before MinimalWLS was used.
It can maybe also use a better method than what might be used during optimization.

I added an option to attach it to the results instance, because it can be used for diagnostic tests and measures, but that is currently mainly in an experimental notebook of mine.

@thequackdaddy
Member

thequackdaddy commented May 14, 2018

compare this also with just the dot product score_factor.dot(exog) or exog.T.dot(score_factor)

I think np.dot is much faster... about 0.2ms.

However, there seems to be a loss of precision.

[screenshot: np.dot timing and result]

Numba...
[screenshot: numba timing and result]

@josef-pkt
Member Author

That difference is much too large; check that score_factor and exog are both float64.

What linalg library are you using, MKL or OpenBLAS?

@thequackdaddy
Member

[screenshot: dtype check]

Both float64.

I'm using MKL. I'm using the numpy from the Intel IDP channel on anaconda. I'm certain it's MKL... Perhaps they are down-converting to something like single precision floats?

@josef-pkt
Member Author

The problem is also that we don't know which is more accurate.
The current score and the numba score both do the same aggregation, summing in the same sequence, so they will agree with each other even if they are not more accurate.

What's the relative difference in this case?
score_factor.dot(exog) / scores_current - 1

absolute difference might not mean much if magnitudes are large enough.

@thequackdaddy
Member

I don't think you'll like this...

[screenshot: relative differences]

@thequackdaddy
Member

[screenshot: comparison of score[0] and score_factor.sum()]

And it feels like dot is where the loss of precision is. Column 0 in exog is a bunch of ones, so score[0] should equal score_factor.sum().

@josef-pkt
Member Author

On linux you could cast score_factor to quad precision before computing the score_factor.sum() to check accuracy.

However, does it matter?

I forgot that you are doing this at the MLE, where the score is supposed to be zero, so relative errors will be large and the numbers will be mostly numerical noise, i.e. the result of not converging to exactly zero score.
The main consequence is that the numerical noise from the sum/dot will be about the same magnitude as a convergence criterion on the score.

Compute the score/score_factor at params * 0.98 and compare again.
I'm using this in the unit tests to stay away from the score=0 noise at the MLE params.
Away from the MLE params, rtol will be relevant for the optimization.

An application for score would be the score_test, but even there a difference of 1e-7 will not change the outcome of the test.
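
A hedged sketch of that check with synthetic data, using the score_factor method mentioned earlier in this thread (not a definitive recipe):

import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
exog = sm.add_constant(rng.rand(100000, 5))
endog = rng.poisson(np.exp(exog.sum(1) * 0.1))
model = sm.GLM(endog, exog, family=sm.families.Poisson())
res = model.fit()

params_off = res.params * 0.98                    # step away from score = 0
score_ref = model.score(params_off)               # current implementation
score_dot = model.exog.T.dot(model.score_factor(params_off))  # dot-product variant
print(np.max(np.abs(score_dot / score_ref - 1)))  # relative difference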

@thequackdaddy
Member

Compute the score/score_factor at params * 0.98 and compare again.

As always, @josef-pkt , you've hit the nail on the head.

[screenshot: relative differences at params * 0.98]

That looks pretty good to me.

@thequackdaddy
Member

thequackdaddy commented May 14, 2018

Per #4630, using np.dot in score gives a pretty cheap and easy performance improvement.

Now hessian is the clear laggard. That takes about 15 seconds--e.g., model.hessian(res.params, scale=1., observed=True). I think all the overhead is in the element-wise operations, which are not parallel.

On a quick glance at hessian, I don't see a cheap way to use np.dot there. There are a bunch of elementwise operations, but nothing that is both element-wise and a reduction.

@josef-pkt
Member Author

We can leave the hessian for 0.10.
np.einsum might help.
But this pattern, as in the hessian computation, shows up often. If we find something faster than the plain numpy (elementwise and dot) version, then I guess we should put it in a helper function.

(In the usual textbook math, the center is a diagonal matrix, but I don't think that using scipy sparse is faster, because it still needs to create the same intermediate array as the elementwise multiplication.)
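
If such a helper were added, it might look roughly like this (name and signature are hypothetical, not an existing statsmodels function):

import numpy as np

def weighted_cross_product(exog, factor, out_buf=None):
    # compute exog.T @ diag(factor) @ exog without forming the diagonal matrix;
    # a preallocated buffer can be passed in to avoid reallocating the temporary
    if out_buf is None:
        out_buf = np.empty_like(exog)
    np.multiply(exog, factor[:, None], out=out_buf)
    return out_buf.T.dot(exog)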

@thequackdaddy
Member

thequackdaddy commented May 15, 2018

I played with np.einsum... a very mind-bending function.

Currently, we use np.dot((hessian_factor[:, None] * model.exog).T, model.exog), which takes about 7 seconds on my computer.

np.einsum('j,ji,jk->ik', hessian_factor, model.exog, model.exog, optimize=True). That took around 35 seconds, which is somewhat slow.

Essentially, hessian_factor[:, None] * model.exog takes around 6.6 seconds. But the real trick to the speed here is that np.dot is so fast because it uses BLAS.

Is there a way to use the underlying BLAS operations?

@josef-pkt
Member Author

It sounds a bit strange that dot should be so fast compared to the elementwise operation (except, again, if the cost of allocating the temp array is that large).
How many processors do you have, and how many are used by dot?

AFAIK element wise operations are not directly exported from BLAS through numpy or scipy.

@thequackdaddy
Member

@josef-pkt Forgot to respond to this... I disappeared for a little longer than I thought.

This is on a 4 core xeon processor virtual machine. I don't recall the exact specs.

I read through the BLAS documentation and I don't think an element-wise approach would work. To the extent you would want to use BLAS, the suggestion I've found is to turn the vector into a diagonal matrix, which would be way, way too big.

I ran this on my home computer (a Mac) and got similar results, though not as pronounced. This is a 5,000,000 x 50 matrix for a and a 5,000,000-length vector for b. The data is from a random matrix, e.g., a = np.random.rand(5000000, 50).

In [12]: %timeit np.dot((b[:, None] * a).T, a)
3.49 s ± 202 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]: %timeit np.einsum('j,ji,jk->ik', b, a, a)
3.97 s ± 783 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit np.einsum('j,ji,jk->ik', b, a, a, optimize=True)
3.7 s ± 263 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@josef-pkt
Member Author

my computer is slower than yours
preallocating the array saves close to 40%, as long as the out array has the same C or F order as a

out = np.empty_like(a)
%timeit np.multiply(a.T, b, out=out.T); out.T.dot(a)

asides
in my numpy version einsum is much slower.
F versus C order makes only a small difference, < 5%, on my computer and numpy version.

@thequackdaddy
Member

Yes, allocating out first seems to speed it up for me too...

In [31]: %%timeit
    ...: np.multiply(a.T, b, out=out.T)
    ...: out.T.dot(a)
    ...:
2.25 s ± 413 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@thequackdaddy
Member

Ok thinking aloud again...

The problem appears to be that every time hessian is called, you have to allocate a bunch of memory for the simple hessian_factor * model.exog.T calculation. Would it make sense to keep a global buffer around for this? It won't speed up the hessian method if it's called once, but when it's called repeatedly--as it is during the scipy optimizers--it would cut that time.

Not a huge time saver, but a time saver.
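
A hedged sketch of that idea, caching the scratch array on an object rather than in a module-level global (class and method names are hypothetical, just to illustrate the reuse):

import numpy as np

class HessianBuffer(object):
    # keep one scratch array and reuse it across repeated hessian-style
    # evaluations during optimization
    def __init__(self, exog):
        self.exog = exog
        self._buf = np.empty_like(exog)      # allocated once

    def hessian_like(self, hessian_factor):
        # the elementwise step writes into the reused buffer, no new large temporary
        np.multiply(self.exog, hessian_factor[:, None], out=self._buf)
        return self._buf.T.dot(self.exog)

# usage sketch
rng = np.random.RandomState(0)
a = rng.rand(1000000, 50)
hb = HessianBuffer(a)
h = hb.hessian_like(rng.rand(1000000))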
