
fix: pandas performance on files with many branches #1086

Merged: 6 commits merged into main from ioanaif/fix-pandas-memory-issue-1070 on Jan 22, 2024

Conversation

@ioanaif (Collaborator) commented Jan 18, 2024

No description provided.

@ioanaif linked an issue Jan 18, 2024 that may be closed by this pull request
@jpivarski (Member) commented Jan 18, 2024

The example file is small; it has 1000 branches, but they're not filled with much data. Unfortunately, we can't keep large test files in our CI, so this is something that can only be tested manually.

I found a CMS NanoAOD file (which I can share with you privately). By selecting filter_typename="bool", we can get approximately 1000 branches with a simple data type that should be easy to wrap in Pandas. In main (with the file in warm cache using vmtouch),

import uproot, time
tree = uproot.open("Run2018D-DoubleMuon-Nano25Oct2019"
    "_ver2-v1-974F28EE-0FCE-4940-92B5-870859F880B1.root:Events")
tick = time.perf_counter()
for _ in tree.iterate(filter_typename="bool", step_size=100000, library="np"):
    tock = time.perf_counter()
    print(tock - tick)
    tick = tock

prints times averaging 3.4 ± 0.1 sec, the baseline cost of reading the arrays at all (using NumPy). Swapping library="np" for library="pd" prints lots and lots of warnings, but the times are 4.19 ± 0.08 sec, so the overhead of _pandas_memory_efficient is 0.8 sec.
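
For reference, the library="pd" variant of that loop looks like this; the warnings filter is only a convenience I've added for this sketch, so the timings stay readable (it reuses tree and time from the snippet above):

import warnings
import pandas as pd

with warnings.catch_warnings():
    # silence the repeated PerformanceWarning while timing
    warnings.simplefilter("ignore", pd.errors.PerformanceWarning)
    tick = time.perf_counter()
    for _ in tree.iterate(filter_typename="bool", step_size=100000, library="pd"):
        tock = time.perf_counter()
        print(tock - tick)
        tick = tock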

In ioanaif/fix-pandas-memory-issue-1070, the library="np" time is 3.0 ± 0.2 sec and the library="pd" time is 4.1 ± 0.2 sec. The Pandas PerformanceWarning is gone, but there's no noticeable impact on the actual performance.

I found another file, from issue #288, which has a lot of large, simple-typed branches that can be selected with

filter_typename=["double", "/double\[[0-9]+\]/", "bool", "/bool\[[0-9]+\]/"]

It takes main 2.50 ± 0.06 sec to read this file with library="np" and 2.89 ± 0.02 sec to read with library="pd", but in ioanaif/fix-pandas-memory-issue-1070, it takes 2.58 ± 0.07 sec to read with library="np" and 9.7 ± 0.1 sec to read with library="pd". The Pandas overhead went from 0.4 sec to 7.1 sec by introducing this PR. That's a regression.
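
As a sketch, that second benchmark follows the same pattern; the path and tree name below are placeholders, since the actual file from issue #288 isn't named here:

import uproot, time

# placeholder path and tree name, standing in for the file from issue #288
tree2 = uproot.open("issue288.root:tree")
tick = time.perf_counter()
df = tree2.arrays(
    filter_typename=["double", "/double\[[0-9]+\]/", "bool", "/bool\[[0-9]+\]/"],
    library="pd",
)
print(time.perf_counter() - tick)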

The PerformanceWarning suggests using pd.concat, not the pd.DataFrame constructor. I wonder if that is relevant. With the following diff in this PR,

--- a/src/uproot/interpretation/library.py
+++ b/src/uproot/interpretation/library.py
@@ -856,7 +856,9 @@ class Pandas(Library):
 
         elif isinstance(how, str) or how is None:
             arrays, names = _pandas_only_series(pandas, arrays, expression_context)
-            return pandas.DataFrame(data=arrays, columns=names)
+            out = pandas.concat(arrays, axis=1, ignore_index=True)
+            out.columns = names
+            return out
 
         else:
             raise TypeError(

it now takes 2.6 ± 0.1 sec for library="np" and 2.96 ± 0.08 sec for library="pd", which is no regression (0.4 sec of Pandas overhead is almost exactly what the _pandas_memory_efficient function in main had).

Going back to the NanoAOD file, it's 3.1 ± 0.2 sec with library="np" and 3.6 ± 0.2 sec with library="pd", still acceptable.

So the bottom line is that using the pandas.DataFrame constructor introduces a performance regression (while eliminating Pandas's warning!) in one out of the two cases tried, but using the pandas.concat function does not. After this comment, I'll add it as a commit here.
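
As a standalone illustration (outside uproot, with made-up column counts and sizes), the two strategies compare like this; whether the constructor path is slower here will depend on the pandas version and the shapes involved, but in the benchmarks above it clearly was:

import time
import numpy as np
import pandas as pd

names = ["branch_%d" % i for i in range(1000)]
series = [pd.Series(np.random.rand(100000)) for _ in names]

# strategy 1: hand everything to the DataFrame constructor at once
tick = time.perf_counter()
df1 = pd.DataFrame(data=dict(zip(names, series)))
print("constructor:", time.perf_counter() - tick)

# strategy 2: concatenate the Series along the column axis, then relabel
tick = time.perf_counter()
df2 = pd.concat(series, axis=1, ignore_index=True)
df2.columns = names
print("concat:", time.perf_counter() - tick)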

@jpivarski (Member) commented

After adjusting the code so that it passes all tests (54cab87), the time with the file from issue #288 is still 2.6 ± 0.1 sec with library="np" and 3.0 ± 0.1 sec with library="pd". The 7-second issue didn't come back.

... until I manually reverted the code to

pandas.DataFrame(data=arrays, columns=names)

just to be sure that the slowdown is persistent and I'm not imagining things. It definitely is: Pandas does something bad when we run its constructor. (Surely the constructor was the first thing I tried, way back when, and some bad behavior like this is what made me write _pandas_memory_efficient in the first place.)

Oh, if I run gc.disable() and then gc.collect() before the constructor, the time isn't as bad (6.0 ± 0.1 sec, rather than 9.6 ± 0.1 sec). Whatever bad thing Pandas is doing in its constructor, it involves creating a lot of short-lived objects, enough of them to keep triggering garbage collection.
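
A sketch of that experiment in the same standalone setting (the arrays are made-up stand-ins, not the uproot code path):

import gc
import time
import numpy as np
import pandas as pd

arrays = {"branch_%d" % i: pd.Series(np.random.rand(100000)) for i in range(1000)}

gc.disable()   # keep the cyclic garbage collector out of the measurement
gc.collect()   # ...after first clearing anything already pending
tick = time.perf_counter()
df = pd.DataFrame(data=arrays)
print(time.perf_counter() - tick)
gc.enable()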

@jpivarski (Member) left a comment

I think this is ready to merge. I'll let you merge it, in case you want to do any more counter-edits or tests.

@ioanaif merged commit 2fa6265 into main on Jan 22, 2024
21 checks passed
@ioanaif deleted the ioanaif/fix-pandas-memory-issue-1070 branch on January 22, 2024 10:45
Successfully merging this pull request may close these issues.

uproot.iterate throws a pandas PerformanceWarning