Memory consumption during uproot.iterate
#1041
I am interested in processing a O(2 GB) file using `uproot.iterate`.

Below, I have uploaded the memory-usage vs. time plots as generated by mprof. I am using awkward 2.4.5 and uproot 5.0.12. I copied my test script below for reference; the argument to the test script selects the function that is run (and thus profiled).

### Data Shape Complexities

Not sure if these are pertinent, but I think all of the details are helpful when talking about memory usage.
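As a quick cross-check on what mprof reports, the process's peak RSS can also be read directly from the stdlib `resource` module. This is a stdlib-only sketch, not part of the test script; note that `ru_maxrss` is in KiB on Linux but bytes on macOS, so the conversion below assumes Linux:

```python
import resource

def peak_rss_mb():
    # peak resident set size of this process; ru_maxrss is KiB on
    # Linux (bytes on macOS), so this conversion assumes Linux
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

before = peak_rss_mb()
buf = b"x" * 200_000_000  # allocate and touch ~200 MB
after = peak_rss_mb()
print(f"peak RSS grew by roughly {after - before:.0f} MB")
```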
### Runs for Reference

First, (for context), I'd like to show some plots where I don't use uproot.

**Just have numpy generate large chunks**

```python
def np_generate_chunks():
    for i in range(10):
        yield np.full(25_000_000, 1, dtype=np.int32)

@profile
def numpy():
    for events in np_generate_chunks():
        process(events)
```

**Have uproot load the entire file**

```python
def uproot_whole_file():
    with uproot.open('test.root') as f:
        events = f['LDMX_Events'].arrays(**selection_kw)
        process(events)
```

### Tests

**Naive solution from docs**

```python
def uproot_file_iterate():
    for events in uproot.iterate('test.root:LDMX_Events', **selection_kw):
        process(events)
```

**Different attempt from docs**

```python
def uproot_tree_iterate():
    with uproot.open('test.root') as f:
        for events in f['LDMX_Events'].iterate(**selection_kw):
            process(events)
```

**Try the answer from #648**

```python
def uproot_tree_iterate_no_object_cache():
    with uproot.open('test.root', object_cache=None) as f:
        for events in f['LDMX_Events'].iterate(**selection_kw):
            process(events)
```

**Force a specific step size** that (I think) should keep the chunk in memory smaller than the default.

```python
def uproot_iterate_small_chunk():
    for events in uproot.iterate('test.root:LDMX_Events', step_size=10000, **selection_kw):
        process(events)
```

### Full `test.py` Script
```python
import numpy as np
import uproot
import time

selection_kw = dict(
    filter_name = [
        'PEFF**',
        'HcalSimHits**',
        'EcalRecHits**',
        'EventHeader**'
    ]
)

@profile
def process(events):
    time.sleep(1)

@profile
def uproot_whole_file():
    """as a benchmark, let's just load the whole file into memory"""
    with uproot.open('test.root') as f:
        events = f['LDMX_Events'].arrays(**selection_kw)
        process(events)

@profile
def uproot_file_iterate():
    """naive solution, first thing I tried after looking at docs"""
    for events in uproot.iterate('test.root:LDMX_Events', **selection_kw):
        process(events)

@profile
def uproot_tree_iterate():
    """noticed docs have an iterate method on the tree itself, try that"""
    with uproot.open('test.root') as f:
        for events in f['LDMX_Events'].iterate(**selection_kw):
            process(events)

@profile
def uproot_tree_iterate_no_object_cache():
    """saw this option in the discussion
    https://github.com/scikit-hep/uproot5/discussions/648
    """
    with uproot.open('test.root', object_cache=None) as f:
        for events in f['LDMX_Events'].iterate(**selection_kw):
            process(events)

@profile
def uproot_iterate_small_chunk():
    """force a really small chunk to see an extreme"""
    for events in uproot.iterate('test.root:LDMX_Events', step_size=10000, **selection_kw):
        process(events)

def np_generate_chunks():
    for i in range(10):
        yield np.full(25_000_000, 1, dtype=np.int32)

@profile
def numpy():
    """another benchmark, just have numpy generate ~100 MB chunks for me"""
    for events in np_generate_chunks():
        process(events)

def main():
    import sys
    globals()[sys.argv[1]]()

if __name__ == '__main__':
    main()
```
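For context on the `step_size=10000` run: `uproot.iterate`'s default `step_size` is `"100 MB"` (a memory budget for the uncompressed arrays), while an integer is interpreted as a number of entries. A rough back-of-the-envelope for translating a budget into an entry count (the 4 kB-per-event figure below is made up for illustration, not measured from this file):

```python
def entries_per_step(budget_bytes, bytes_per_entry):
    # how many entries fit in a given uncompressed-memory budget
    return max(1, budget_bytes // bytes_per_entry)

# e.g. a 100 MB budget with ~4 kB of selected branches per event
print(entries_per_step(100 * 1024**2, 4096))  # 25600
```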
The default uproot source for reading files from disk uses memory mapping (`mmap`). Memory mapping is useful to avoid an extra copy of the data when reading, to share (re-use) memory between processes, and to improve the performance of random access (c.f. https://stackoverflow.com/a/6383253). However, it has some drawbacks; namely, the OS is responsible for managing the memory (and the cache is populated via page faulting). In this case, I think the behaviour you're seeing is a direct consequence of the kernel caching memory pages and choosing not to free them (c.f. https://stackoverflow.com/a/1972889).

If you want to avoid this caching, you can open the file yourself and pass the file-handle to `uproot`.
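The page-faulting behaviour described above can be seen with the stdlib `mmap` module alone, no uproot involved. This is a standalone illustration of the mechanism, not uproot code; the scratch file is a stand-in for `test.root`:

```python
import mmap
import os
import tempfile

# create an ~8 MB scratch file as a stand-in for a ROOT file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * (8 * 1024 * 1024))
    path = tmp.name

with open(path, "rb") as fh:
    with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # touching one byte per 4 kB page faults each page into the
        # kernel's page cache; the kernel (not Python) decides when to
        # evict those pages, which is why RSS-style plots keep growing
        checksum = sum(mm[i] for i in range(0, len(mm), 4096))

os.remove(path)
print(checksum)  # 0, since the file is all zero bytes
```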
Beta Was this translation helpful? Give feedback.
-
An alternative to modifying your code to pass the file-handle would be to use a different uproot source for reading the file. I would be really interested in seeing the difference in performance between the two sources!
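For that performance comparison, a small timing helper may be handy. This is a generic sketch; the commented-out lines only indicate where each uproot configuration under comparison would go, and are taken from the test script above:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # print wall-clock time for the enclosed block
    t0 = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - t0:.2f} s")

with timed("demo"):
    # e.g. one full pass with a given uproot source configuration:
    # for events in uproot.iterate('test.root:LDMX_Events', **selection_kw):
    #     process(events)
    time.sleep(0.05)
```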