
Set up interface between Uproot and Awkward so that Awkward can be used to optimize object-reading. #96

Merged

7 commits merged into master from jpivarski/awkward-interface-to-optimize-AsObjects on Sep 12, 2020

Conversation

jpivarski
Member

@jpivarski jpivarski commented Sep 10, 2020

As discussed in issue #90.

Corresponds to scikit-hep/awkward#448.

@jpivarski
Member Author

@tamasgal The good news is that, with this PR and scikit-hep/awkward#448, the following code takes 3.6 seconds instead of 194 seconds (53× faster):

import uproot4

# read each TBasket individually (no concatenation across baskets)
branch = uproot4.open("issue-90.root:E/Evt/trks/trks.fitinf")
for i in range(branch.num_baskets):
    print(repr(branch.basket(i).array()))

In the above, we're reading each TBasket individually. The bad news is that it still takes a long time to concatenate TBaskets, because ak.concatenate has only been implemented (internally) for pairs: concatenating n arrays pairwise creates and throws away n - 2 temporary arrays of steadily growing size, for a quadratic total cost. That was fine for examples that concatenate 2 or 3 arrays at a time, but it's a problem when concatenating data from this file's 461 TBaskets.
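The quadratic cost can be made concrete with a small cost model (a sketch; the chunk count of 461 comes from this file, but the chunk length of 1000 is an arbitrary stand-in):

```python
def pairwise_copies(n_chunks, chunk_len):
    """Total elements copied when concatenating n equal-length chunks
    two at a time, building and discarding ever-larger temporaries."""
    total, acc = 0, chunk_len
    for _ in range(n_chunks - 1):
        acc += chunk_len   # size of the next temporary array
        total += acc       # elements copied to build it
    return total

def nway_copies(n_chunks, chunk_len):
    """Total elements copied by a single n-way concatenation:
    each element is copied exactly once into the final array."""
    return n_chunks * chunk_len

print(pairwise_copies(461, 1000))  # roughly 106 million element copies
print(nway_copies(461, 1000))      # 461 thousand element copies
```

The pairwise total grows as the square of the number of chunks, which is why 461 TBaskets hurts even though 2 or 3 never did.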

Clearly, ak.concatenate needs to be fixed anyway. There might already be an issue open about it. (It's been on my mind for a while...) Anyway, I'll tackle that next. We want the conclusion of the above story to be that the whole array is produced in 3.7 seconds!

@tamasgal
Contributor

Awesome! Many thanks, Jim, that's really a huge leap.

@jpivarski
Member Author

For cases that aren't covered by the new interpret-by-Awkward mechanism, how about parallelizing the pure Python interpretation?

  • parallel interpretation in 'basket_array': 259 seconds. 12 cores utilized at about 5-10% each. I'll blame the GIL. Reversing the order (putting the big TBaskets first) doesn't help—it's not about stragglers.
  • sequential interpretation in 'basket_array', concatenate NumPy dtype=O arrays: 190 seconds.
  • sequential interpretation in 'final_array' (old way): 197 seconds. Moving the interpretation from final_array into basket_array doesn't hurt. Entries that would be trimmed (because they're at the ends of the first and last basket) are now unnecessarily interpreted, but that's probably better than losing the possibility of parallelizing.
  • interpret-by-Awkward in 'basket_array': 2.5 seconds.

So the baseline of 197 seconds wasn't unnecessarily harsh. It really is an 80× speedup.
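The GIL point above can be illustrated with a minimal sketch (not Uproot's actual code; interpret_basket is a hypothetical stand-in for the pure-Python deserialization work):

```python
# Interpreting TBaskets in a thread pool: because the stand-in work is
# pure Python, the threads hold the GIL and mostly serialize, matching
# the ~5-10% per-core utilization reported above. Only work that
# releases the GIL (e.g. NumPy/Awkward kernels) scales across threads.
from concurrent.futures import ThreadPoolExecutor

def interpret_basket(raw_bytes):
    # stand-in for pure-Python, GIL-bound deserialization
    return sum(raw_bytes) % 256

baskets = [bytes(range(256)) * 100 for _ in range(12)]
with ThreadPoolExecutor(max_workers=12) as pool:
    results = list(pool.map(interpret_basket, baskets))
```

This is why moving the interpretation out of Python (the interpret-by-Awkward path) pays off so much more than adding threads to the pure-Python path.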

@jpivarski
Member Author

Making use of this will require Awkward 0.2.37 (but it won't break with earlier versions), which is being deployed now. I'll do one last test after that deployment so that GHA pulls the new Awkward from PyPI.

@jpivarski jpivarski merged commit dd90b30 into master Sep 12, 2020
@jpivarski jpivarski deleted the jpivarski/awkward-interface-to-optimize-AsObjects branch September 12, 2020 14:50
@tamasgal
Contributor

Thanks, that's really nice and helps a lot!
