
fix: pandas performance on files with many branches #1086

Merged: 6 commits merged into main from ioanaif/fix-pandas-memory-issue-1070 on Jan 22, 2024

Conversation

@ioanaif (Collaborator) commented Jan 18, 2024

No description provided.

@ioanaif linked an issue Jan 18, 2024 that may be closed by this pull request
@jpivarski (Member) commented Jan 18, 2024

The example file is small; it has 1000 branches, but they're not filled with much data. Unfortunately, we can't keep large test files in our CI, so this is something that can only be tested manually.

I found a CMS NanoAOD file (which I can share with you privately). By selecting filter_typename="bool", we can get approximately 1000 branches with a simple data type that should be easy to wrap in Pandas. In main (with the file in warm cache using vmtouch),

import uproot, time
tree = uproot.open("Run2018D-DoubleMuon-Nano25Oct2019"
    "_ver2-v1-974F28EE-0FCE-4940-92B5-870859F880B1.root:Events")
tick = time.perf_counter()
for _ in tree.iterate(filter_typename="bool", step_size=100000, library="np"):
    tock = time.perf_counter()
    print(tock - tick)
    tick = tock

prints times averaging 3.4 ± 0.1 sec, the baseline cost of reading the arrays at all (using NumPy). Swapping library="np" for library="pd" prints lots and lots of warnings, but the times are 4.19 ± 0.08 sec, so the overhead of _pandas_memory_efficient is 0.8 sec.
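
For reference, the library="pd" variant of that loop looks like this; the warnings filter is only a convenience I've added for this sketch, so the timings stay readable (it reuses tree and time from the snippet above):

import warnings
import pandas as pd

with warnings.catch_warnings():
    # silence the repeated PerformanceWarning while timing
    warnings.simplefilter("ignore", pd.errors.PerformanceWarning)
    tick = time.perf_counter()
    for _ in tree.iterate(filter_typename="bool", step_size=100000, library="pd"):
        tock = time.perf_counter()
        print(tock - tick)
        tick = tock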

In ioanaif/fix-pandas-memory-issue-1070, the library="np" time is 3.0 ± 0.2 sec and the library="pd" time is 4.1 ± 0.2 sec. The Pandas PerformanceWarning is gone, but there's no noticeable impact on the actual performance.

I found another file, from issue #288, which has a lot of large, simple-typed branches that can be selected with

filter_typename=["double", "/double\[[0-9]+\]/", "bool", "/bool\[[0-9]+\]/"]

It takes main 2.50 ± 0.06 sec to read this file with library="np" and 2.89 ± 0.02 sec to read with library="pd", but in ioanaif/fix-pandas-memory-issue-1070, it takes 2.58 ± 0.07 sec to read with library="np" and 9.7 ± 0.1 sec to read with library="pd". The Pandas overhead went from 0.4 sec to 7.1 sec by introducing this PR. That's a regression.
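
As a sketch, that second benchmark follows the same pattern; the path and tree name below are placeholders, since the actual file from issue #288 isn't named here:

import uproot, time

# placeholder path and tree name, standing in for the file from issue #288
tree2 = uproot.open("issue288.root:tree")
tick = time.perf_counter()
df = tree2.arrays(
    filter_typename=["double", "/double\[[0-9]+\]/", "bool", "/bool\[[0-9]+\]/"],
    library="pd",
)
print(time.perf_counter() - tick)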

The PerformanceWarning suggests using pd.concat, not the pd.DataFrame constructor. I wonder if that is relevant. With the following diff in this PR,

--- a/src/uproot/interpretation/library.py
+++ b/src/uproot/interpretation/library.py
@@ -856,7 +856,9 @@ class Pandas(Library):
 
         elif isinstance(how, str) or how is None:
             arrays, names = _pandas_only_series(pandas, arrays, expression_context)
-            return pandas.DataFrame(data=arrays, columns=names)
+            out = pandas.concat(arrays, axis=1, ignore_index=True)
+            out.columns = names
+            return out
 
         else:
             raise TypeError(

it now takes 2.6 ± 0.1 sec for library="np" and 2.96 ± 0.08 sec for library="pd", which is no regression (0.4 sec of Pandas overhead is almost exactly what the _pandas_memory_efficient function in main had).

Going back to the NanoAOD file, it's 3.1 ± 0.2 sec with library="np" and 3.6 ± 0.2 sec with library="pd", still acceptable.

So the bottom line is that using the pandas.DataFrame constructor introduces a performance regression (while eliminating Pandas's warning!) in one out of the two cases tried, but using the pandas.concat function does not. After this comment, I'll add it as a commit here.
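
As a standalone illustration (outside uproot, with made-up column counts and sizes), the two strategies compare like this; whether the constructor path is slower here will depend on the pandas version and the shapes involved, but in the benchmarks above it clearly was:

import time
import numpy as np
import pandas as pd

names = ["branch_%d" % i for i in range(1000)]
series = [pd.Series(np.random.rand(100000)) for _ in names]

# strategy 1: hand everything to the DataFrame constructor at once
tick = time.perf_counter()
df1 = pd.DataFrame(data=dict(zip(names, series)))
print("constructor:", time.perf_counter() - tick)

# strategy 2: concatenate the Series along the column axis, then relabel
tick = time.perf_counter()
df2 = pd.concat(series, axis=1, ignore_index=True)
df2.columns = names
print("concat:", time.perf_counter() - tick)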

@jpivarski (Member) commented

After adjusting the code so that it passes all tests (54cab87), the time with the file from issue #288 is still 2.6 ± 0.1 sec with library="np" and 3.0 ± 0.1 sec with library="pd". The 7-second issue didn't come back.

... until I manually reverted the code to

pandas.DataFrame(data=arrays, columns=names)

just to be sure that the slowdown is persistent and I'm not imagining things. It definitely is: Pandas does something bad when we run its constructor. (Surely the constructor was the first thing I tried, way back when, and some bad behavior like this is what made me write _pandas_memory_efficient in the first place.)

Oh, if I run gc.disable() and then gc.collect() before the constructor, the time isn't as bad (6.0 ± 0.1 sec, rather than 9.6 ± 0.1 sec). Whatever bad thing Pandas is doing in its constructor, it involves creating a lot of short-lived objects, enough of them to keep triggering garbage collection.
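
A sketch of that experiment in the same standalone setting (the arrays are made-up stand-ins, not the uproot code path):

import gc
import time
import numpy as np
import pandas as pd

arrays = {"branch_%d" % i: pd.Series(np.random.rand(100000)) for i in range(1000)}

gc.disable()   # keep the cyclic garbage collector out of the measurement
gc.collect()   # ...after first clearing anything already pending
tick = time.perf_counter()
df = pd.DataFrame(data=arrays)
print(time.perf_counter() - tick)
gc.enable()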

@jpivarski (Member) left a comment

I think this is ready to merge. I'll let you merge it, in case you want to do any more counter-edits or tests.

@ioanaif merged commit 2fa6265 into main on Jan 22, 2024
21 checks passed
@ioanaif deleted the ioanaif/fix-pandas-memory-issue-1070 branch on January 22, 2024 10:45
Successfully merging this pull request may close these issues.

uproot.iterate throws a pandas PerformanceWarning