Requiring numba? #8

HDembinski · 2020-05-13T10:13:51Z

The speed of the functions jackknife and bootstrap could be greatly enhanced with numba. While it is possible to have two implements for each function, one requiring numba and one just requiring numpy, I think it would make sense to require numba. It used to be a pain to install, but nowadays one can just pip install it and of course it is also available from anaconda.

I know numba well enough to work on this. Another option (but less preferred), would be to write the implementations in C++ and wrap the code to Python with pybind11. I can do that as well, but then the package is not pure Python anymore and has to be compiled for every target platform. This is a huge added burden, and the C++ code may not run faster than numba (I did some tests a while ago, where numba beat my C++ implemenation).

The text was updated successfully, but these errors were encountered:

dsaxton · 2020-05-13T14:39:27Z

What kind of speedups do you see from using numba? If those are significant requiring it could make sense (I'd prefer that to including C++ code).

HDembinski · 2020-05-14T15:02:21Z

In general, the speed-ups can be order of magnitute or more, even if the code already used numpy. It is hard to say without trying, but I would expect similar speed ups for this library.

So maybe the best course of action is to write some benchmarks with the current code, and then rewrite some of it to use numba, and compare. I can do that.

HDembinski · 2020-05-14T15:03:02Z

We can then decide whether numba is optional or required, ok?

HDembinski · 2020-05-21T18:11:37Z

I tried numba on jackknife, which seemed most promising for acceleration. The speedup is about 10x on my machine, depending strongly on the number of samples in the original array. The speedup is probably also fairly platform-dependent.

https://github.com/HDembinski/essays/blob/master/resample%20and%20numba.ipynb

HDembinski · 2020-05-21T18:23:09Z

Based on this evidence, I think moving to numba is benefinical. Perhaps the code can be made even faster. The jackknife in particular could be much faster in the future, as soon as numba starts to support Python callables (currently it does not). Already today, numba can call ufuncs that offer a ctypes or cffi interface. In this case, we would not need to generate N arrays with N-1 elements at once, but only one N-1 array and reuse the allocated memory N times.

HDembinski · 2020-05-21T19:27:41Z

I just updated the numba example with a better implementation.

HDembinski · 2020-05-21T19:31:57Z

Since you are ok with making breaking changes, I think the version of jackknife that returns the Nx(N-1) array should be removed. For N > 10000 this becomes dangerous as it allocates huge amount of memory. Algorithms that have a O(N^2) memory requirement on input with arbitrary N can easily bring down computers.

dsaxton · 2020-05-21T20:50:26Z

Thanks for working on this, that does look like a nice speedup.

I think I'm fine with removing the N x (N - 1) output option. Currently the jackknife is only implemented internally with a non-None statistical function so things should still work with that feature removed. It also feels a bit cleaner to have the shape of the output not depend on the input (this is also a question I have about the behavior of the bootstrap function, but it's not as problematic since you can control the number of bootstrap replications).

HDembinski · 2020-05-22T14:45:28Z

I agree. I think we should always require, for jacknife or bootstrap, that the user passes the statistical function, to avoid holding all the resampled data in memory at once. For the bootstrap it is less of a problem, agreed.

I am concerned about the balanced bootstrap. I think the implementation requires one to hold all the N x M values in memory at once.

dsaxton · 2020-05-22T14:52:50Z

I am concerned about the balanced bootstrap. I think the implementation requires one to hold all the N x M values in memory at once.

Yes, I think you are right. Off the top of my head I'm wondering if it might be reasonable to, instead of copying a permutation of all the data repeated M times, do the same for the integer positions and then use some kind of sliding window to calculate the statistical function. Or maybe that could be done in a lazy way and only "realize" the positions during the calculation?

HDembinski · 2020-05-22T16:29:44Z

I have to look into this algorithm again, but I think one has to store at least the full NxM index array at once, where N is the number of samples and M the number of replicas. If the data values are doubles, one could safe a bit by using a smaller integer to hold the indices, say int32.

HDembinski · 2020-05-29T11:31:51Z

I am closing this, since I added numba :)

dsaxton · 2020-05-29T15:43:59Z

I am closing this, since I added numba :)

Makes sense, we can open specific issues for other applications of numba in the future

HDembinski closed this as completed May 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Requiring numba? #8

Requiring numba? #8

HDembinski commented May 13, 2020

dsaxton commented May 13, 2020

HDembinski commented May 14, 2020

HDembinski commented May 14, 2020

HDembinski commented May 21, 2020

HDembinski commented May 21, 2020 •

edited

HDembinski commented May 21, 2020

HDembinski commented May 21, 2020

dsaxton commented May 21, 2020 •

edited

HDembinski commented May 22, 2020

dsaxton commented May 22, 2020

HDembinski commented May 22, 2020

HDembinski commented May 29, 2020

dsaxton commented May 29, 2020

Requiring numba? #8

Requiring numba? #8

Comments

HDembinski commented May 13, 2020

dsaxton commented May 13, 2020

HDembinski commented May 14, 2020

HDembinski commented May 14, 2020

HDembinski commented May 21, 2020

HDembinski commented May 21, 2020 • edited

HDembinski commented May 21, 2020

HDembinski commented May 21, 2020

dsaxton commented May 21, 2020 • edited

HDembinski commented May 22, 2020

dsaxton commented May 22, 2020

HDembinski commented May 22, 2020

HDembinski commented May 29, 2020

dsaxton commented May 29, 2020

HDembinski commented May 21, 2020 •

edited

dsaxton commented May 21, 2020 •

edited