
Set up interface between Uproot and Awkward so that Awkward can be used to optimize object-reading. #96

Merged

7 commits merged into master from jpivarski/awkward-interface-to-optimize-AsObjects on Sep 12, 2020

Conversation

jpivarski
Member

@jpivarski jpivarski commented Sep 10, 2020

As discussed in issue #90.

Corresponds to scikit-hep/awkward#448.

@jpivarski
Member Author

@tamasgal The good news is that, with this PR and scikit-hep/awkward#448, the following code takes 3.6 seconds instead of 194 seconds (53× faster):

import uproot4

# read each TBasket individually (no concatenation across baskets)
branch = uproot4.open("issue-90.root:E/Evt/trks/trks.fitinf")
for i in range(branch.num_baskets):
    print(repr(branch.basket(i).array()))

In the above, we're reading each TBasket individually. The bad news is that it still takes a long time to concatenate TBaskets, because ak.concatenate has only been implemented (internally) for pairs: concatenating n arrays pairwise creates and throws away n - 2 temporary arrays of steadily growing size, for a quadratic total cost. That was fine for examples that concatenate 2 or 3 arrays at a time, but it's a problem when concatenating data from this file's 461 TBaskets.
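The quadratic cost can be made concrete with a small cost model (a sketch; the chunk count of 461 comes from this file, but the chunk length of 1000 is an arbitrary stand-in):

```python
def pairwise_copies(n_chunks, chunk_len):
    """Total elements copied when concatenating n equal-length chunks
    two at a time, building and discarding ever-larger temporaries."""
    total, acc = 0, chunk_len
    for _ in range(n_chunks - 1):
        acc += chunk_len   # size of the next temporary array
        total += acc       # elements copied to build it
    return total

def nway_copies(n_chunks, chunk_len):
    """Total elements copied by a single n-way concatenation:
    each element is copied exactly once into the final array."""
    return n_chunks * chunk_len

print(pairwise_copies(461, 1000))  # roughly 106 million element copies
print(nway_copies(461, 1000))      # 461 thousand element copies
```

The pairwise total grows as the square of the number of chunks, which is why 461 TBaskets hurts even though 2 or 3 never did.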

Clearly, ak.concatenate needs to be fixed anyway. There might already be an issue open about it. (It's been on my mind for a while...) Anyway, I'll tackle that next. We want the conclusion of the above story to be that the whole array is produced in 3.7 seconds!

@tamasgal
Contributor

Awesome! Many thanks, Jim, that's really a huge leap.

@jpivarski
Member Author

For cases that aren't covered by the new interpret-by-Awkward mechanism, how about parallelizing the pure Python interpretation?

  • parallel interpretation in 'basket_array': 259 seconds. 12 cores utilized at about 5-10% each. I'll blame the GIL. Reversing the order (putting the big TBaskets first) doesn't help—it's not about stragglers.
  • sequential interpretation in 'basket_array', concatenate NumPy dtype=O arrays: 190 seconds.
  • sequential interpretation in 'final_array' (old way): 197 seconds. Moving the interpretation from final_array into basket_array doesn't hurt. Entries that would be trimmed (because they're at the ends of the first and last basket) are now unnecessarily interpreted, but that's probably better than losing the possibility of parallelizing.
  • interpret-by-Awkward in 'basket_array': 2.5 seconds.

So the baseline of 197 seconds wasn't unnecessarily harsh. It really is an 80× speedup.
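The GIL point above can be illustrated with a minimal sketch (not Uproot's actual code; interpret_basket is a hypothetical stand-in for the pure-Python deserialization work):

```python
# Interpreting TBaskets in a thread pool: because the stand-in work is
# pure Python, the threads hold the GIL and mostly serialize, matching
# the ~5-10% per-core utilization reported above. Only work that
# releases the GIL (e.g. NumPy/Awkward kernels) scales across threads.
from concurrent.futures import ThreadPoolExecutor

def interpret_basket(raw_bytes):
    # stand-in for pure-Python, GIL-bound deserialization
    return sum(raw_bytes) % 256

baskets = [bytes(range(256)) * 100 for _ in range(12)]
with ThreadPoolExecutor(max_workers=12) as pool:
    results = list(pool.map(interpret_basket, baskets))
```

This is why moving the interpretation out of Python (the interpret-by-Awkward path) pays off so much more than adding threads to the pure-Python path.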

@jpivarski
Member Author

Making use of this will require Awkward 0.2.37 (but it won't break with earlier versions), which is being deployed now. I'll do one last test after that deployment so that GHA pulls the new Awkward from PyPI.

@jpivarski jpivarski merged commit dd90b30 into master Sep 12, 2020
@jpivarski jpivarski deleted the jpivarski/awkward-interface-to-optimize-AsObjects branch September 12, 2020 14:50
@tamasgal
Contributor

Thanks, that's really nice and helps a lot!
