SUMM/ENH/REF: performance review of GLM, fit_gradient #4625

Open
josef-pkt opened this issue May 13, 2018 · 32 comments
Comments

@josef-pkt
Member

josef-pkt commented May 13, 2018

There is still some slack in GLM, especially in fit_gradient, to improve performance in terms of both speed and memory consumption.

related issues:

...

other possibilities:

  • irls start_params for fit_gradient only needs params (see the sketch after this list):
    • we can copy params and delete irls_rslt immediately to free up memory
    • irls_rslt doesn't need to compute cov_params, especially not a robust cov_type
    • it should be possible to use skip_hessian or a return option to only get the params out of the irls fit
  • irls: add option to use moment matrices
  • memory usage in fit_irls
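
A minimal sketch of the idea in the first bullet, at the user level with synthetic data (the actual change would live inside GLM's fit_gradient path; the max_start_irls option is assumed to behave as in the current GLM.fit signature):

import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
exog = sm.add_constant(rng.rand(10000, 5))
endog = rng.poisson(np.exp(exog.sum(1) * 0.1))
model = sm.GLM(endog, exog, family=sm.families.Poisson())

irls_rslt = model.fit(method='IRLS', maxiter=25)    # IRLS run used only for start values
start_params = np.asarray(irls_rslt.params).copy()  # keep only the params
del irls_rslt                                       # drop cov_params and the rest early

# gradient-based fit seeded with the IRLS params; max_start_irls=0 is assumed
# to skip the internal IRLS warm start
rslt = model.fit(method='bfgs', start_params=start_params, max_start_irls=0)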
@josef-pkt
Member Author

@thequackdaddy
I think it should be possible, still for 0.9, to remove cov_type from the IRLS fit used for the fit_gradient start_params, and to del irls_rslt after copying the params. (That change doesn't look dangerous.)
Can you try this out, and see whether it helps memory consumption in your use cases?

@thequackdaddy
Member

@josef-pkt I'll try to test it out later today.

Another question on performance...

On my system, this line is relatively slow. It seems to only process on 1 core. Takes about 6-7 for 12 million x 50 exog.

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/genmod/generalized_linear_model.py#L479

A straight element-wise multiplication between the score_obs and exog. Is there a parallelized numpy function for this?

@thequackdaddy
Member

There are also a few lines in the hessian calculation that are slow... it would be nice to take advantage of more cores.

@josef-pkt
Copy link
Member Author

Takes about 6-7 for 12 million x 50 exog.

6-7 of what? minutes, hours?

Is there a parallelized numpy function for this?

No, not AFAIK. numpy and scipy don't parallelize computations, only the underlying linalg libraries do that. We would have to do it on our own, or use Dask.

One general issue is Fortran versus C order in those operations on large arrays.
In the old days, numpy linalg worked better on F-ordered arrays because of the underlying Fortran BLAS/LAPACK, while elementwise numpy operations worked better on C-ordered arrays.
I have no idea what the current timings are.

Another possibility: we currently don't reuse memory much, e.g. by using the out argument of numpy functions. It might improve performance if we keep memory around instead of allocating new arrays all the time. For large arrays this might have an observable effect.
(I don't know the technical details, but avoiding new allocations of large arrays might also help prevent memory fragmentation; I have no idea how relevant that is for RAM.)
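
A minimal sketch of the out-argument idea, with made-up array names and sizes; whether this pays off in practice would need profiling:

import numpy as np

rng = np.random.RandomState(0)
exog = rng.rand(1000000, 20)
buf = np.empty_like(exog)                # scratch array, allocated once

for _ in range(5):                       # stand-in for optimizer iterations
    score_factor = rng.rand(exog.shape[0])
    np.multiply(exog, score_factor[:, None], out=buf)  # writes into buf, no new large array
    score = buf.sum(axis=0)              # reduce without another large temporary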

@thequackdaddy
Member

thequackdaddy commented May 13, 2018

Sorry... I meant 6-7 seconds. It doesn't seem like a lot, but that calculation is run in every iteration of the fit_gradient optimization.

@thequackdaddy
Member

Another question... could this be a place for numba? dask/joblib seem like they would bring too much overhead for this simple calculation.

@josef-pkt
Member Author

could this be a place for numba?

Maybe. I thought about it, but I guess not for this.
numba is also slow if it has to allocate memory, and the computation itself shouldn't differ much. (I got good speed improvements with numba in my examples after I allocated and initialized the output array outside of the numba-jitted function.)

numba parallel loop was only recently reintroduced and I wouldn't bet on it yet. And it might not help in getting a contiguous output array in parallel either.

(I'm mostly guessing.)

@josef-pkt
Member Author

To change the target a bit.
For the optimization we need score and not score_obs, and score is a reduce operation.
If we compute score without constructing the full score_obs array, then we don't need the full memory. This means using numba in score, or using batched/blocked numpy summation in score, is most likely faster and will certainly use less memory.
We can still use score_factor because it's 1-dim. But we can reduce immediately when multiplying with exog.

Using numpy this would be easy to try in score, e.g. if nobs > 1 million, then sum in blocks of 100,000 (or something like that) observations.
A full numba loop wouldn't have to allocate any memory in that case.

Even simpler: score = exog.T.dot(score_factor), which should be just a plain linalg operation.
The hessian still needs one elementwise multiplication before the dot.
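
A hedged sketch of the blocked summation described above (the function name and block size are illustrative, not statsmodels code):

import numpy as np

def blocked_score(exog, score_factor, block_size=100000):
    # accumulate exog.T @ score_factor in blocks so the full
    # nobs x k_exog score_obs array is never materialized
    score = np.zeros(exog.shape[1])
    for start in range(0, exog.shape[0], block_size):
        stop = start + block_size
        score += exog[start:stop].T.dot(score_factor[start:stop])
    return score

# the even simpler variant from the comment above, a single BLAS call:
# score = exog.T.dot(score_factor)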

@josef-pkt
Member Author

about hessian:
It is possible or likely that using np.einsum is faster than the current version for large arrays.
AFAIK einsum is still slower than linalg, but it would avoid allocating the array for the elementwise multiplication.
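
For concreteness, a hedged sketch of the two variants with synthetic data (hessian_factor and exog as used elsewhere in this thread; this is not the statsmodels implementation):

import numpy as np

rng = np.random.RandomState(0)
exog = rng.rand(200000, 20)
hessian_factor = rng.rand(200000)

# current-style: elementwise multiply (allocates an nobs x k temporary), then dot
hess_dot = np.dot((hessian_factor[:, None] * exog).T, exog)

# einsum alternative: avoids the explicit temporary for the elementwise product
hess_einsum = np.einsum('j,ji,jk->ik', hessian_factor, exog, exog)

assert np.allclose(hess_dot, hess_einsum)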

@thequackdaddy
Member

So using numba to calculate score shows pretty good improvements....

[screenshot: timing comparison]

@thequackdaddy
Member

@josef-pkt

I think it should be possible, still for 0.9, to remove cov_type from the IRLS fit used for the fit_gradient start_params, and to del irls_rslt after copying the params

Honestly, this doesn't improve it that much. I'll put up a PR with this.

To be totally clear, it's the full IRLS fit that seems to chew up a bunch of memory. Maybe I should use some other wls_method? It feels like each iteration chews up slightly more RAM, but once the IRLS optimization converges and the loop exits, the RAM is freed.

@josef-pkt
Member Author

It feels like each iteration chews up slightly more RAM, but once the IRLS optimization converges and the loop exits, the RAM is freed.

I'm not sure what should be causing this.
We don't keep any results from iterations around except for the history.
Maybe something is slowing down garbage collection.

I don't know if this would help, but you could try adding a copy to params in _update_history, in case tmp_result is not immediately garbage collected because of the reference:
history['params'].append(tmp_result.params.copy())

@josef-pkt
Member Author

So using numba to calculate score shows pretty good improvements....

compare this also with just the dot product score_factor.dot(exog) or exog.T.dot(score_factor)

@thequackdaddy
Member

thequackdaddy commented May 13, 2018

Back to numba...

I thought I read somewhere that declaring types can speed things up... but when I tried, it was actually about 0.5 seconds slower. I'll just not declare types...

import numba

@numba.jit('void(float64[:], float64[:, :], float64[:])', parallel=True, nopython=True)
def score(score_factor, exog, scores):
    # accumulate the column sums of score_factor[:, None] * exog into scores in place
    nobs, param_count = exog.shape
    scores[:] = 0.
    for i in range(nobs):
        for j in range(param_count):
            scores[j] += score_factor[i] * exog[i, j]
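
For reference, a call might look like this (assuming C-contiguous float64 arrays and the jitted score function above in scope; the output array is preallocated by the caller):

import numpy as np

rng = np.random.RandomState(0)
exog = rng.rand(1000000, 50)
score_factor = rng.rand(1000000)

scores = np.empty(exog.shape[1])     # output preallocated by the caller
score(score_factor, exog, scores)    # fills scores in place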

@thequackdaddy
Member

Back to IRLS...

Why do we do this?

if maxiter > 0: # Only if iterative used
    wls_method2 = 'pinv' if wls_method == 'lstsq' else wls_method
    wls_model = lm.WLS(wlsendog, wlsexog, self.weights)
    wls_results = wls_model.fit(method=wls_method2)

As far as I can tell, this is only used to come up with normalized_cov_params.

@josef-pkt
Member Author

As far as I can tell, this is only used to come up with normalized_cov_params.

Essentially, yes. I guess the main reason to do it is backwards compatibility with the behavior from before MinimalWLS was used.
It can maybe also use a better method than what might be used during optimization.

I added an option to attach it to the results instance, because it can be used for diagnostic tests and measures, but that is currently mainly in an experimental notebook of mine.

@thequackdaddy
Member

thequackdaddy commented May 14, 2018

compare this also with just the dot product score_factor.dot(exog) or exog.T.dot(score_factor)

I think np.dot is much faster... about 0.2ms.

However, there seems to be a loss of precision.

[screenshot: np.dot timing and result]

Numba...
[screenshot: numba timing and result]

@josef-pkt
Member Author

That difference is much too large; check that score_factor and exog are both float64.

What linalg library are you using, MKL or OpenBLAS?

@thequackdaddy
Member

[screenshot: dtype check]

Both float64.

I'm using MKL. I'm using the numpy from the Intel IDP channel on anaconda. I'm certain it's MKL... Perhaps they are down-converting to something like single precision floats?

@josef-pkt
Member Author

The problem is also that we don't know which is more accurate.
The current score and the numba score both do the same aggregation, summing in the same sequence, so they will agree with each other even if they are not more accurate.

What's the relative difference in this case?
score_factor.dot(exog) / scores_current - 1

absolute difference might not mean much if magnitudes are large enough.

@thequackdaddy
Member

I don't think you'll like this...

[screenshot: relative differences]

@thequackdaddy
Member

[screenshot: comparison of score[0] and score_factor.sum()]

And it feels like dot is where the loss of precision is. Column 0 in exog is a bunch of ones, so score[0] should equal score_factor.sum().

@josef-pkt
Member Author

On linux you could cast score_factor to quad precision before computing the score_factor.sum() to check accuracy.

However, does it matter?

I forgot that you are doing this at the MLE, where the score is supposed to be zero, so relative errors will be large and the numbers will be mostly numerical noise, i.e. the result of not converging to exactly zero score.
The main consequence is that the numerical noise from the sum/dot will be about the same magnitude as a convergence criterion on the score.

Compute the score/score_factor at params * 0.98 and compare again.
I'm using this in the unit tests to stay away from the score=0 noise at the MLE params.
Away from the MLE params, rtol will be relevant for the optimization.

An application for score would be the score_test, but even there a difference of 1e-7 will not change the outcome of the test.
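
A hedged sketch of that check with synthetic data, using the score_factor method mentioned earlier in this thread (not a definitive recipe):

import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
exog = sm.add_constant(rng.rand(100000, 5))
endog = rng.poisson(np.exp(exog.sum(1) * 0.1))
model = sm.GLM(endog, exog, family=sm.families.Poisson())
res = model.fit()

params_off = res.params * 0.98                    # step away from score = 0
score_ref = model.score(params_off)               # current implementation
score_dot = model.exog.T.dot(model.score_factor(params_off))  # dot-product variant
print(np.max(np.abs(score_dot / score_ref - 1)))  # relative difference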

@thequackdaddy
Member

Compute the score/score_factor at params * 0.98 and compare again.

As always, @josef-pkt , you've hit the nail on the head.

[screenshot: relative differences at params * 0.98]

That looks pretty good to me.

@thequackdaddy
Member

thequackdaddy commented May 14, 2018

Per #4630, using np.dot in score gives a pretty cheap and easy performance improvement.

Now hessian is the clear laggard. That takes about 15 seconds--e.g., model.hessian(res.params, scale=1., observed=True). I think all the overhead is in the element-wise operations, which are not parallel.

On a quick glance at hessian, I don't see a cheap way to use np.dot there. There are a bunch of elementwise operations, but nothing that is both element-wise and a reduction.

@josef-pkt
Member Author

We can leave the hessian for 0.10.
np.einsum might help.
But this pattern, as in the hessian computation, shows up often. If we find something faster than the plain numpy (elementwise and dot) version, then I guess we should put it in a helper function.

(In the usual textbook math, the center is a diagonal matrix, but I don't think that using scipy sparse is faster, because it still needs to create the same intermediate array as the elementwise multiplication.)
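
If such a helper were added, it might look roughly like this (name and signature are hypothetical, not an existing statsmodels function):

import numpy as np

def weighted_cross_product(exog, factor, out_buf=None):
    # compute exog.T @ diag(factor) @ exog without forming the diagonal matrix;
    # a preallocated buffer can be passed in to avoid reallocating the temporary
    if out_buf is None:
        out_buf = np.empty_like(exog)
    np.multiply(exog, factor[:, None], out=out_buf)
    return out_buf.T.dot(exog)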

@thequackdaddy
Member

thequackdaddy commented May 15, 2018

I played with np.einsum... a very mind-bending function.

Currently, we use np.dot((hessian_factor[:, None] * model.exog).T, model.exog), which takes about 7 seconds on my computer.

np.einsum('j,ji,jk->ik', hessian_factor, model.exog, model.exog, optimize=True). That took around 35 seconds, which is somewhat slow.

Essentially, hessian_factor[:, None] * model.exog takes around 6.6 seconds. But the real trick to the speed here is that np.dot is so fast because it uses BLAS.

Is there a way to use the underlying BLAS operations?

@josef-pkt
Member Author

It sounds a bit strange that dot should be so fast compared to the elementwise operation (except, again, if the cost of allocating the temp array is that large).
How many processors do you have, and how many are used by dot?

AFAIK element wise operations are not directly exported from BLAS through numpy or scipy.

@thequackdaddy
Member

@josef-pkt Forgot to respond to this... I disappeared for a little longer than I thought.

This is on a 4 core xeon processor virtual machine. I don't recall the exact specs.

I read through the BLAS documentation and I don't think an element-wise approach would work. To the extent you would want to use BLAS, the suggestion I've found is to turn the vector into a diagonal matrix, which would be way, way too big.

I ran this on my home computer (a Mac) and got similar results, though not as pronounced. This is a 5,000,000 x 50 matrix for a and a 5,000,000-length vector for b. The data is from a random matrix, e.g., a = np.random.rand(5000000, 50).

In [12]: %timeit np.dot((b[:, None] * a).T, a)
3.49 s ± 202 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]: %timeit np.einsum('j,ji,jk->ik', b, a, a)
3.97 s ± 783 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit np.einsum('j,ji,jk->ik', b, a, a, optimize=True)
3.7 s ± 263 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@josef-pkt
Member Author

my computer is slower than yours
preallocating the array saves close to 40%, as long as the out array has the same C or F order as a

out = np.empty_like(a)
%timeit np.multiply(a.T, b, out=out.T); out.T.dot(a)

asides
in my numpy version einsum is much slower.
F versus C order makes only a small difference, < 5%, on my computer and numpy version.

@thequackdaddy
Member

Yes, allocating out first seems to speed it up for me too...

In [31]: %%timeit
    ...: np.multiply(a.T, b, out=out.T)
    ...: out.T.dot(a)
    ...:
2.25 s ± 413 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@thequackdaddy
Member

Ok thinking aloud again...

The problem appears to be that every time hessian is called, you have to allocate a bunch of memory for the simple hessian_factor * model.exog.T calculation. Would it make sense to keep a global buffer around for this? It won't speed up the hessian method if it's called once, but when it's called repeatedly--as it is during the scipy optimizers--it would cut that time.

Not a huge time saver, but a time saver.
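
A hedged sketch of that idea, caching the scratch array on an object rather than in a module-level global (class and method names are hypothetical, just to illustrate the reuse):

import numpy as np

class HessianBuffer(object):
    # keep one scratch array and reuse it across repeated hessian-style
    # evaluations during optimization
    def __init__(self, exog):
        self.exog = exog
        self._buf = np.empty_like(exog)      # allocated once

    def hessian_like(self, hessian_factor):
        # the elementwise step writes into the reused buffer, no new large temporary
        np.multiply(self.exog, hessian_factor[:, None], out=self._buf)
        return self._buf.T.dot(self.exog)

# usage sketch
rng = np.random.RandomState(0)
a = rng.rand(1000000, 50)
hb = HessianBuffer(a)
h = hb.hessian_like(rng.rand(1000000))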
