SUMM/ENH/REF: performance review of GLM, fit_gradient #4625
@thequackdaddy
@josef-pkt I’ll try to test it out later today. Another question on performance... In my system, this line is relatively slow. It seems to only process on 1 core. It takes about 6-7 for a 12 million x 50 exog: a straight element-wise multiplication between the score_obs and exog. Is there a parallelized numpy function for this?
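For reference, the slow line described here boils down to a broadcasted element-wise product. A minimal sketch with made-up data, scaled down from the 12-million-row case (the variable names mirror the statsmodels ones but the data is random):

```python
import numpy as np

rng = np.random.default_rng(0)
nobs, k = 100_000, 50          # scaled-down stand-in for 12 million x 50
exog = rng.standard_normal((nobs, k))
score_factor = rng.standard_normal(nobs)

# The slow line is essentially this broadcasted elementwise product,
# which allocates a full nobs x k intermediate array:
score_obs = score_factor[:, None] * exog

# The score itself is just the column sum of score_obs:
score = score_obs.sum(0)
```

The cost is dominated by allocating and filling the `nobs x k` intermediate, which is memory-bound rather than compute-bound, so extra cores help less than one might hope.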
There are also a few lines in the hessian calculation that are slow... it would be nice to take advantage of more cores.
6-7 of what? minutes, hours?
No, not AFAIK. numpy and scipy don't parallelize computations; only the underlying linalg libraries do that. We would have to do it on our own, or use Dask. One general issue is Fortran versus C order in those operations on large arrays. Another possibility: we currently don't reuse memory much, e.g. using the …
Sorry... I meant 6-7 seconds. It doesn’t seem like a lot, but that calculation is used in each iteration of the fit_gradient optimization.
Another question... could this be a place for numba? dask/joblib seem like they bring too much overhead for this simple calculation.
Maybe. I thought about it, but I guess not for this. numba's parallel loop was only recently reintroduced and I wouldn't bet on it yet. And it might not help in getting a contiguous output array in parallel either. (I'm mostly guessing.)
To change the target a bit: using numpy this would be easy to try in score, e.g. if nobs > 1 million, then sum in blocks of 100,000 (or something like that) observations. Even simpler: `score = exog.T.dot(score_factor)`, which should be just a plain linalg operation.
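The blocking idea could be sketched roughly like this (the function name, block size, and structure are illustrative, not statsmodels' actual implementation):

```python
import numpy as np

def score_blocked(score_factor, exog, block=100_000):
    """Sum score contributions in blocks of observations, limiting the
    size of intermediate arrays (a sketch of the blocking suggestion;
    the name and default block size are illustrative)."""
    nobs, k = exog.shape
    score = np.zeros(k)
    for start in range(0, nobs, block):
        stop = min(start + block, nobs)
        # each block is a small gemv, so no nobs x k intermediate appears
        score += exog[start:stop].T.dot(score_factor[start:stop])
    return score
```

The "even simpler" variant, `exog.T.dot(score_factor)`, is a single BLAS gemv call and avoids the intermediate array entirely.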
About the hessian:
Honestly, this doesn't improve it that much. I'll file a PR with this. To be totally clear, it's the full IRLS fit that seems to chew up a bunch of memory. Maybe I should use some other …
I'm not sure what could be causing this. I don't know if this would help, but you could try adding a copy to params in …
Compare this also with just the dot product.
Back to numba... I thought I read somewhere that declaring types can speed things up, but when I tried it, it was actually about 0.5 seconds slower. I'll just not declare types...

```python
import numba

@numba.jit('void(float64[:], float64[:, :], float64[:])',
           parallel=True, nopython=True)
def score(score_factor, exog, scores):
    nobs, param_count = exog.shape
    scores[:] = 0.
    for i in range(nobs):
        for j in range(param_count):
            scores[j] += score_factor[i] * exog[i, j]
```
Back to IRLS... Why do we do this? (statsmodels/statsmodels/genmod/generalized_linear_model.py, lines 1146 to 1149 at 3f44cbe)
As far as I can tell, this is only used to come up with …
Essentially, yes. I guess the main reason to do it is backwards compatibility from before the use of MinimalWLS. I added an option to attach it to the results instance, because it can be used for diagnostic tests and measures, but that is currently mainly in an experimental notebook of mine.
That difference is much too large; check that score_factor and exog are both float64. Which linalg library are you using, MKL or OpenBLAS?
The problem is also that we don't know which is more accurate. What's the relative difference in this case? The absolute difference might not mean much if the magnitudes are large enough.
On Linux you could cast score_factor to quad precision before computing the score_factor.sum() to check accuracy. However, does it matter? I forgot that you are doing this at the MLE, where the score is supposed to be zero, so relative errors will be large and the numbers will be mostly numerical noise, i.e. we don't converge to exactly zero score. Compute the score/score_factor at params * 0.98 and compare again. An application for score would be the score_test, but even there a diff of 1e-7 will not change the outcome of the test result.
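The higher-precision check could look like this (a sketch with random data; on Linux `np.longdouble` is typically 80-bit extended precision rather than true quad, but that is still enough to serve as a reference for a float64 sum):

```python
import numpy as np

rng = np.random.default_rng(1)
score_factor = rng.standard_normal(1_000_000)

# Higher-precision reference sum: cast to longdouble first.
# (On Linux/x86 longdouble is 80-bit extended precision; on some
# platforms it is just float64, so this check is platform-dependent.)
ref = score_factor.astype(np.longdouble).sum()

# Error of the plain float64 sum relative to the reference:
err = abs(np.float64(score_factor.sum()) - np.float64(ref))
```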
As always, @josef-pkt, you've hit the nail on the head.

> Compute the score/score_factor at params * 0.98 and compare again.

That looks pretty good to me.
Per #4630, using … Now the hessian is the clear laggard. That takes about 15 seconds, e.g. … On a quick glance at …
We leave the hessian for 0.10. (In the usual textbook math, the center is a diagonal matrix, but I don't think that using scipy sparse is faster, because it still needs to create the same intermediate array as the elementwise multiplication.)
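For context, the textbook form is H = X' diag(w) X, but the diagonal is applied elementwise in practice. A sketch with made-up data (`w` stands in for the IRLS working weights; the names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
nobs, k = 10_000, 5
exog = rng.standard_normal((nobs, k))
w = rng.random(nobs) + 0.1      # positive weights, e.g. IRLS working weights

# Materializing diag(w) would need an nobs x nobs matrix, so the center
# is applied elementwise, at the cost of an nobs x k intermediate:
hess = exog.T.dot(w[:, None] * exog)

# A scipy.sparse diagonal, e.g.
#   from scipy import sparse
#   hess2 = exog.T @ sparse.diags(w) @ exog
# gives the same result but still has to build an nobs x k intermediate,
# which is why it is not expected to be faster.
```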
I played with np.einsum... a very mind-bending program. Currently, we use …
Essentially, … Is there a way to use the underlying BLAS operations?
It sounds a bit strange, but AFAIK element-wise operations are not directly exported from BLAS through numpy or scipy.
@josef-pkt Forgot to respond to this... I disappeared for a little longer than I thought. This is on a 4-core Xeon processor virtual machine; I don't recall the exact specs. I read through the BLAS documentation and I don't think an element-wise approach would work. To the extent you would want to use them, the suggestion I've found is to turn the vector into a diagonal matrix, which would be way, way too big. I ran this on my home computer (a Mac) with similar results, but not as pronounced. This is a 5,000,000 x 50 matrix for `a`:

```
In [12]: %timeit np.dot((b[:, None] * a).T, a)
3.49 s ± 202 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]: %timeit np.einsum('j,ji,jk->ik', b, a, a)
3.97 s ± 783 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit np.einsum('j,ji,jk->ik', b, a, a, optimize=True)
3.7 s ± 263 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
my computer is slower than yours
Asides:
Yes, allocating … helps:

```
In [31]: %%timeit
    ...: np.multiply(a.T, b, out=out.T)
    ...: out.T.dot(a)
    ...:
2.25 s ± 413 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
OK, thinking aloud again... The problem appears to be that every time … Not a huge time saver, but a time saver.
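One way to act on the allocation point above: preallocate the scratch array once and reuse it across iterations via the ufunc `out=` argument. An illustrative sketch, not statsmodels' actual code (all names are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
nobs, k = 200_000, 10
exog = rng.standard_normal((nobs, k))

# Allocate the nobs x k scratch array once, outside the loop, instead of
# letting each elementwise product allocate a fresh one per iteration:
scratch = np.empty_like(exog)

def weighted_hessian(w, exog, scratch):
    # write the elementwise product into the preallocated buffer
    np.multiply(w[:, None], exog, out=scratch)
    return exog.T.dot(scratch)

for _ in range(3):               # stand-in for IRLS iterations
    w = rng.random(nobs)
    hess = weighted_hessian(w, exog, scratch)
```

This saves one large allocation (and the associated page faults) per iteration, which matches the "not a huge time saver, but a time saver" observation.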
There is still some slack in GLM, especially fit_gradient, to improve performance in terms of speed and memory consumption.
related issues:
...
other possibilities:
- del `irls_rslt` immediately to free up memory
- `skip_hessian`, or a return option to only get the params out of the irls fit