performance of idiomatic expressions #107
To be pure NumPy, some of the jagged method implementations have many full-pass steps. The only performance rule I mostly followed was to never do a Python for loop that scales with the number of entries. (There are very few cases where it couldn't be avoided; I wonder if you found one.) In large part because of this, I'm developing Numba implementations of all the operations, much like your PR for CUDA. An analyst's workflow might still be a little slower for passing over the data multiple times, but each operation would individually be near maximal.

We're also hoping to mentor a GSoC student to do pre-compiled methods, to replace a one-time JIT-compilation cost for each type of structure with a one-time virtual method call at the beginning of each operation over many events.

As for this particular performance gaffe, if you can narrow it down to one operation that's taking much longer than it seems it should, I can look into it.
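The "no Python for loop that scales with the number of entries" rule can be illustrated with a toy NumPy example (hypothetical, not code from the library): the per-entry loop pays Python interpreter overhead for every element, while the vectorized form makes one full pass in compiled code.

```python
import numpy as np

def count_positive_loop(charges):
    # Python-level loop: cost grows with the number of entries in interpreted code
    total = 0
    for c in charges:
        if c > 0:
            total += 1
    return total

def count_positive_vectorized(charges):
    # One full pass over the array in compiled NumPy code, no per-entry overhead
    return int(np.count_nonzero(charges > 0))

charges = np.array([1, -1, 1, 1, -1])
assert count_positive_loop(charges) == count_positive_vectorized(charges) == 3
```

The trade-off the comment describes is that an idiomatic chain of such vectorized steps may pass over the data several times, whereas a single JIT-compiled loop makes one pass.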
@jpata if you could provide an event sample, I would be interested in trying a few different ways of expressing this, to see if another method is faster.
@nsmith- can you access the following files? I'm curious if there is a better way to write down what I attempted!
Thanks, @nsmith-, for offering to look for a work-around. I'd also like to know if one of the operations is a hundred times slower than it looks like it should be.
I ran a cProfile check and the overwhelming time is spent in
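For reference, a minimal sketch of such a cProfile check with the standard library (the workload function here is a placeholder, not the actual awkward-array expression that was profiled):

```python
import cProfile
import io
import pstats

def some_workload():
    # stand-in for the awkward-array expression being profiled
    return sum(i * i for i in range(100000))

pr = cProfile.Profile()
pr.enable()
some_workload()
pr.disable()

# sort by cumulative time to see which call dominates
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
print(next(line for line in report.splitlines() if "function calls" in line).strip())
```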
How big is the biggest cross-join? That is, for your

It could be that this is the case we were worried about with vectorizing nested loops: the scalar-code way is to iterate over the same data multiple times (

But then, maybe I'm barking up the wrong tree; maybe it's not the size of the intermediate array in this case. Another major contributor is
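Whether the intermediate pair array is the problem can be estimated from the per-event counts alone, since "choose 2" produces n·(n−1)/2 pairs per event. A sketch with made-up multiplicities (the real numbers would come from the jagged array's per-event counts):

```python
import numpy as np

# hypothetical number of muons per event
counts = np.array([2, 3, 0, 50, 4])

# size of the distinct-pair ("choose 2") intermediate, per event
pairs_per_event = counts * (counts - 1) // 2

total_pairs = int(pairs_per_event.sum())
biggest = int(pairs_per_event.max())
print(total_pairs, biggest)  # → 1235 1225
```

A single event with 50 muons already contributes 1225 pairs, so a few high-multiplicity events can dominate the size of the cross-join intermediate.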
Here is a slightly faster version, also a bit more concise:

```python
def get_os_muons_awkward_2(charge):
    pair_idx = charge.argchoose(2)
    valid_idx = pair_idx[charge[pair_idx.i0]*charge[pair_idx.i1] == -1]
    first_valid = valid_idx[:, :1]
    # Now we have to get hacky since awkward arrays are immutable
    out_muon_mask = charge.zeros_like()
    raw_i0 = (out_muon_mask.starts + first_valid.i0).flatten()
    out_muon_mask.content[raw_i0] = 1
    raw_i1 = (out_muon_mask.starts + first_valid.i1).flatten()
    out_muon_mask.content[raw_i1] = 1
    return out_muon_mask
```

which clocks in at
so a factor of 2. This is also using #102, which sorts the pairs in a way that naturally satisfies your requirement to keep the first two opposite-sign muons (i.e. the highest sum-pT pair). Timing was measured with pyinstrument:

```python
from pyinstrument import Profiler

profiler = Profiler()
profiler.start()
get_os_muons_awkward_2(charge)
get_os_muons_awkward_2(charge)
get_os_muons_awkward_2(charge)
profiler.stop()
print(profiler.output_text(unicode=True, color=True))
```
I've noticed that a somewhat-complex reduction is much faster (~250x) as a direct Numba loop over contents and offsets than as an idiomatic awkward-array haiku. The task: given a nested list of muon charges per event, mask all muons except the first two that have opposite sign. I wonder if there is a better way to state this operation in awkward, or perhaps there is some unexpected performance loss?
Awkward-array idiomatic code: runtime 1.9 s.
Direct Numba loop: runtime 7.5 ms.
CUDA: runtime ~500 µs.
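The Numba kernel itself isn't shown above; a sketch of what such a direct loop over flattened `offsets`/`content` arrays might look like follows (my reconstruction, not the code from the issue; here the "first" opposite-sign pair is taken in index order, whereas the awkward version with #102 picks the highest sum-pT pair). Decorating this with `@numba.njit` is what would give speeds of the quoted order.

```python
import numpy as np

def first_os_pair_mask(offsets, charges):
    # mask all muons except the first opposite-sign pair in each event;
    # offsets[ev]:offsets[ev+1] delimits event ev in the flat charges array
    mask = np.zeros(len(charges), dtype=np.int8)
    for ev in range(len(offsets) - 1):
        start, stop = offsets[ev], offsets[ev + 1]
        found = False
        for i in range(start, stop):
            if found:
                break
            for j in range(i + 1, stop):
                if charges[i] * charges[j] == -1:
                    mask[i] = 1
                    mask[j] = 1
                    found = True
                    break
    return mask

offsets = np.array([0, 3, 5])           # two events: 3 muons, then 2 muons
charges = np.array([1, 1, -1, -1, -1])  # flattened per-muon charges
print(first_os_pair_mask(offsets, charges))  # → [1 0 1 0 0]
```

The second event has two same-sign muons, so no muon is kept there.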
I wasn't sure what's the appropriate way to raise this, feel free to move the discussion elsewhere.