
Loading multiple large files (>4GB) onto a lazy array produces deserialization error #131

Closed
chrispap95 opened this issue Oct 29, 2020 · 20 comments
Labels
bug (unverified) The problem described would be a bug, but needs to be triaged

Comments

@chrispap95

Issue description

This is an issue regarding the question posted here: https://stackoverflow.com/questions/64565208/how-can-i-load-many-very-large-files-using-uproot

When loading 3 files of about 9 GB each into a lazy array, the code crashes with a deserialization error when the event loop reaches the border between the first and second files. The code works flawlessly with small (~1 GB) files. The access method (local vs. XRootD) also doesn't seem to affect the issue. The error message is the following:

Traceback (most recent call last):
  File "scripts/plotEventShapes_lazy.py", line 44, in <module>
    tracks_x = events['Tracks.fCoordinates.fX'][ievt]
  File "/Users/chrispap/Library/Python/3.8/lib/python/site-packages/awkward1/highlevel.py", line 946, in __getitem__
    self._layout[where], self._behavior, cache=self._cache
  File "/Users/chrispap/Library/Python/3.8/lib/python/site-packages/awkward1/partition.py", line 366, in __getitem__
    return PartitionedArray.from_ext(self._ext.getitem_at(where))
  File "/Users/chrispap/Library/Python/3.8/lib/python/site-packages/uproot4/behaviors/TBranch.py", line 1621, in array
    _ranges_or_baskets_to_arrays(
  File "/Users/chrispap/Library/Python/3.8/lib/python/site-packages/uproot4/behaviors/TBranch.py", line 518, in _ranges_or_baskets_to_arrays
    uproot4.source.futures.delayed_raise(*obj)
  File "/Users/chrispap/Library/Python/3.8/lib/python/site-packages/uproot4/source/futures.py", line 37, in delayed_raise
    raise exception_value.with_traceback(traceback)
  File "/Users/chrispap/Library/Python/3.8/lib/python/site-packages/uproot4/behaviors/TBranch.py", line 444, in chunk_to_basket
    basket = uproot4.models.TBasket.Model_TBasket.read(
  File "/Users/chrispap/Library/Python/3.8/lib/python/site-packages/uproot4/model.py", line 128, in read
    self.read_members(chunk, cursor, context, file)
  File "/Users/chrispap/Library/Python/3.8/lib/python/site-packages/uproot4/models/TBasket.py", line 55, in read_members
    ) = cursor.fields(chunk, _tbasket_format2, context)
  File "/Users/chrispap/Library/Python/3.8/lib/python/site-packages/uproot4/source/cursor.py", line 168, in fields
    return format.unpack(chunk.get(start, stop, self, context))
  File "/Users/chrispap/Library/Python/3.8/lib/python/site-packages/uproot4/source/chunk.py", line 278, in get
    raise uproot4.deserialization.DeserializationError(
uproot4.deserialization.DeserializationError: while reading

    TBasket version None as uproot4.models.TBasket.Model_TBasket (? bytes)
        fNbytes: -1607368158
        fObjlen: -1243566277
        fDatime: 2634931141
        fKeylen: -27664
        fCycle: 21409

attempting to get bytes 483015:483033
outside expected range 510698:538758 for this Chunk
in file /Users/chrispap/QCD/Autumn18.QCD_HT1500to2000_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root
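
(For context on why those header fields look like garbage: the leading TBasket fields are just a handful of big-endian integers, so a read at a wrong offset decodes arbitrary bytes into nonsense. A minimal sketch of that decoding, using an approximate field layout rather than uproot4's exact internal structs:)

```python
import struct

# Approximate layout of the leading TBasket/TKey header fields (big-endian,
# simplified; uproot4 actually splits this across two internal structs):
#   fNbytes(int32) fVersion(int16) fObjlen(int32)
#   fDatime(uint32) fKeylen(int16) fCycle(int16)
TBASKET_HEADER = struct.Struct(">ihiIhh")   # 18 bytes, matching 483015:483033

def decode_header(raw):
    """Decode the leading TBasket header fields from raw bytes."""
    names = ("fNbytes", "fVersion", "fObjlen", "fDatime", "fKeylen", "fCycle")
    return dict(zip(names, TBASKET_HEADER.unpack(raw[:TBASKET_HEADER.size])))

# A well-formed header round-trips to sensible values...
good = TBASKET_HEADER.pack(1024, 3, 4096, 1600000000, 89, 1)
print(decode_header(good))

# ...but the same decoder applied to misaligned bytes produces garbage,
# e.g. negative fNbytes/fObjlen as in the error message above.
print(decode_header(b"\xa0\x2f\x55\x22" * 5))
```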

Reproduction instructions

The three very large files (9 GB each) are the following:

root://cmseos.fnal.gov//store/user/kdipetri/SUEP/Production_v0.2/2018/NTUP/Autumn18.QCD_HT1000to1500_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root
root://cmseos.fnal.gov//store/user/kdipetri/SUEP/Production_v0.2/2018/NTUP/Autumn18.QCD_HT1500to2000_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root
root://cmseos.fnal.gov//store/user/kdipetri/SUEP/Production_v0.2/2018/NTUP/Autumn18.QCD_HT2000toInf_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root

Two smaller files (1 GB) for which the code works nicely:

root://cmseos.fnal.gov//store/user/kdipetri/SUEP/Production_v0.2/2018/NTUP/PrivateSamples.SUEP_2018_mMed-1000_mDark-2_temp-2_decay-darkPho_13TeV-pythia8_n-100_0_RA2AnalysisTree.root
root://cmseos.fnal.gov//store/user/kdipetri/SUEP/Production_v0.2/2018/NTUP/PrivateSamples.SUEP_2018_mMed-1000_mDark-2_temp-2_decay-darkPhoHad_13TeV-pythia8_n-100_0_RA2AnalysisTree.root

The following code reproduces the issue using the files that I cite above:

import uproot4 as uproot
import uproot_methods
import awkward1 as ak
import numpy as np
import matplotlib.pyplot as plt

# Get the file and import using uproot
base = 'root://cmseos.fnal.gov//store/user/kdipetri/SUEP/Production_v0.2/2018/NTUP/'
#base = '/Users/chrispap/QCD/'
datasets = {
    base + 'Autumn18.QCD_HT1000to1500_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root': 'TreeMaker2/PreSelection',
    base + 'Autumn18.QCD_HT1500to2000_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root': 'TreeMaker2/PreSelection',
    base + 'Autumn18.QCD_HT2000toInf_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root': 'TreeMaker2/PreSelection',
}

events = uproot.lazy(datasets)

multiplicity = np.zeros(len(events['Tracks.fCoordinates.fX']))
for ievt in range(99000,len(events['Tracks.fCoordinates.fX'])):
    if ievt%1000 == 0:
        print("Processing event %d. Progress: %.2f%%"%(ievt,100*ievt/len(events['Tracks.fCoordinates.fX'])))
    if events['HT'][ievt] < 1200:
        multiplicity[ievt] = -1
        continue
    tracks_x = events['Tracks.fCoordinates.fX'][ievt]
    tracks_y = events['Tracks.fCoordinates.fY'][ievt]
    tracks_z = events['Tracks.fCoordinates.fZ'][ievt]
    tracks_E = np.sqrt(tracks_x**2+tracks_y**2+tracks_z**2+0.13957**2)
    tracks = uproot_methods.TLorentzVectorArray.from_cartesian(ak.to_awkward0(tracks_x),
                                                               ak.to_awkward0(tracks_y),
                                                               ak.to_awkward0(tracks_z),
                                                               ak.to_awkward0(tracks_E))
    tracks_fromPV0 = events['Tracks_fromPV0'][ievt]
    tracks_matchedToPFCandidate = events['Tracks_matchedToPFCandidate'][ievt]
    tracks = tracks[(tracks.pt > 1.) & (tracks.eta < 2.5) &
                    (ak.to_awkward0(tracks_fromPV0) >= 2) &
                    (ak.to_awkward0(tracks_matchedToPFCandidate) > 0)]
    multiplicity[ievt] = tracks.size

# Plot results
fig = plt.figure(figsize=(8,8))
ax = plt.gca()

CrossSection = events['CrossSection'][events['HT'] > 1200]
multiplicity = multiplicity[events['HT'] > 1200]

ax.hist(multiplicity, bins=100, density=True, weights=CrossSection, histtype='step', color='b')

plt.show()
@jpivarski
Member

I don't have authorization to access these files:

% xrdcp root://cmseos.fnal.gov//store/user/kdipetri/SUEP/Production_v0.2/2018/NTUP/Autumn18.QCD_HT1000to1500_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root issue-131.root
[0B/0B][100%][==================================================][0B/s]  
Run: [FATAL] Auth failed:  (source)

@chrispap95
Author

Ahh, sorry about that. You probably need to have Kerberos initialized for FNAL. I am uploading them to CERNBox; I will post a link in a few minutes.

@jpivarski
Member

I'll only need one 9 GB file and one 1 GB file, so you can save space.

@chrispap95
Author

@jpivarski
Member

Actually, I ran it on both and couldn't find a bug. Specifically, what I ran was a translation of your code to its columnar equivalent:

import numpy as np
import awkward1 as ak
import uproot4

# f = uproot4.open("~/storage/data/uproot4-big/issue-131little.root")
f = uproot4.open("~/storage/data/uproot4-big/issue-131big.root")
t = f["TreeMaker2/PreSelection"]

events = t.arrays([
    "HT",
    "CrossSection",
    "Tracks.fCoordinates.fX",
    "Tracks.fCoordinates.fY",
    "Tracks.fCoordinates.fZ",
    "Tracks_fromPV0",
    "Tracks_matchedToPFCandidate",
])  #, entry_start=99000)

cut_events = events[events.HT > 1200]
HT, CrossSection, x, y, z, num_fromPV0, matched_to_candidate = ak.unzip(cut_events)

pt = np.sqrt(x**2 + y**2 + z**2)
eta = np.arcsinh(z / np.sqrt(x**2 + y**2))

track_cut = (pt > 1) & (abs(eta) < 2.5) & (num_fromPV0 >= 2) & matched_to_candidate
multiplicity = ak.sum(track_cut, axis=1)

which ought to be a significant boost in efficiency (both speed and memory use), but it shouldn't cause or fix a bug. What you showed in StackOverflow had an actual error in deserialization. Perhaps something went wrong in the XRootD source?

(I commented out the limit on entry_start and entry_stop because I could read all events into memory, for these 7 branches. It took about a minute: watching htop, a single Python thread popped up now and then while it was accessing disk, and then there was a burst of 10 threads while it decompressed. The math took almost no time after the I/O was done. Doing the HT cut early vs. late hardly mattered, either.)

(Also, I assume that you wanted to cut on abs(eta) < 2.5, not eta < 2.5.)

Since this ran without troubles from a local file, we should suspect the XRootD source. I no longer think it was a "reading position" issue (interpreting the wrong bytes as the TBasket header), because such a thing would affect a local source the same way. Maybe all the memory you were using confused PyXRootD? (If so, this would be a bug, and a subtle one, in PyXRootD's C++.) Or maybe Uproot's XRootDSource has a bug in it? If it's the former, then using the low-resource-usage code I wrote above would at least fix the symptom; if it's the latter, you'd get the same error either way, so try switching to the above with XRootD—at least it would be a good diagnostic.

@jpivarski
Member

Oh, I should mention that this took 4.57 GB of memory; if you have less than that, you'll need to use entry_start/entry_stop to split it up. Actually, that's what uproot4.iterate is for: it repeatedly calls arrays with different entry_start/entry_stop values, though it's a little smarter than that in that it avoids re-reading TBaskets that are needed by more than one step in the iteration.
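
(The windowing that uproot4.iterate performs can be pictured as generating successive half-open (entry_start, entry_stop) ranges; a minimal pure-Python sketch of that bookkeeping, not uproot's actual implementation, which also avoids re-reading shared TBaskets:)

```python
def entry_ranges(num_entries, step_size):
    """Yield successive half-open (entry_start, entry_stop) windows over a
    TTree's entries, stepping by a fixed number of entries."""
    start = 0
    while start < num_entries:
        stop = min(start + step_size, num_entries)
        yield start, stop
        start = stop

# e.g. 100000 entries in steps of 30000:
for start, stop in entry_ranges(100000, 30000):
    print(start, stop)
# prints:
# 0 30000
# 30000 60000
# 60000 90000
# 90000 100000
```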

@chrispap95
Author

The real issue is that I don't think I can go fully columnar. In fact, I need to recluster AK15 jets, and to my knowledge the best approach so far is to use pyjet in an event loop. (Is there a way to pass jagged arrays with multiple events to pyjet or some equivalent?) My full code reproduces the error locally and looks like:

import uproot4 as uproot
import uproot_methods
import awkward1 as ak
import numpy as np
from math import pi
import matplotlib.pyplot as plt
import mplhep as hep
import pyjet
import matplotlib.colors as colors
import matplotlib.cm as cmx

plt.style.use(hep.style.ROOT)

def sphericityTensor(particles,r=2):
    s = np.zeros((3,3))
    s[0][0] = particles.x.dot(particles.x)
    s[0][1] = particles.x.dot(particles.y)
    s[0][2] = particles.x.dot(particles.z)
    s[1][0] = particles.y.dot(particles.x)
    s[1][1] = particles.y.dot(particles.y)
    s[1][2] = particles.y.dot(particles.z)
    s[2][0] = particles.z.dot(particles.x)
    s[2][1] = particles.z.dot(particles.y)
    s[2][2] = particles.z.dot(particles.z)
    s = s/particles.p.dot(particles.p)
    return s

def sphericity(s):
    s_eigvalues, s_eigvectors = np.linalg.eig(s)
    s_eigvalues = np.sort(s_eigvalues)
    sphericity = 1.5*(s_eigvalues[0]+s_eigvalues[1])
    return sphericity

def makeJets(tracks, R, p=-1):
    # Cluster AK(R) jets
    vectors = np.zeros(tracks.size, np.dtype([('pT', 'f8'), ('eta', 'f8'),
                                              ('phi', 'f8'), ('mass', 'f8')]))
    i = 0
    for track in tracks:
        vectors[i] = np.array((track.pt, track.eta, track.phi, track.mass),
                              np.dtype([('pT', 'f8'), ('eta', 'f8'),
                                        ('phi', 'f8'), ('mass', 'f8')]))
        i += 1
    sequence = pyjet.cluster(vectors, R=R, p=p)
    jets = sequence.inclusive_jets()
    return jets

def isrTagger(jets, warn=True, warnThresh=130):
    mult0 = len(jets[0])
    mult1 = len(jets[1])
    if (mult0 > warnThresh) & (mult1 > warnThresh) & warn:
        print("Warning: both multiplicities are above %d!"%warnThresh)
    elif (mult0 < warnThresh) & (mult1 < warnThresh) & warn:
        print("Warning: both multiplicities are below %d!"%warnThresh)
    if mult0 < mult1:
        return uproot_methods.TLorentzVectorArray.from_ptetaphim([jets[1].pt],
                                                                 [jets[1].eta],
                                                                 [jets[1].phi],
                                                                 [jets[1].mass])
    else:
        return uproot_methods.TLorentzVectorArray.from_ptetaphim([jets[0].pt],
                                                                 [jets[0].eta],
                                                                 [jets[0].phi],
                                                                 [jets[0].mass])

# Get the file and import using uproot
base = 'root://cmseos.fnal.gov//store/user/kdipetri/SUEP/Production_v0.2/2018/NTUP/'
#base = '/User/chrispap/QCD/'
datasets = {
    base + 'Autumn18.QCD_HT1000to1500_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root': 'TreeMaker2/PreSelection',
    base + 'Autumn18.QCD_HT1500to2000_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root': 'TreeMaker2/PreSelection',
    base + 'Autumn18.QCD_HT2000toInf_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root': 'TreeMaker2/PreSelection',
}

events = uproot.lazy(datasets)

evtShape = np.zeros(len(events['Tracks.fCoordinates.fX']))
for ievt in range(len(events['Tracks.fCoordinates.fX'])):
    if ievt%1000 == 0:
        print("Processing event %d. Progress: %.2f%%"%(ievt,100*ievt/len(events['Tracks.fCoordinates.fX'])))
    if events['HT'][ievt] < 1200:
        evtShape[ievt] = -1
        continue
    tracks_x = events['Tracks.fCoordinates.fX'][ievt]
    tracks_y = events['Tracks.fCoordinates.fY'][ievt]
    tracks_z = events['Tracks.fCoordinates.fZ'][ievt]
    tracks_E = np.sqrt(tracks_x**2+tracks_y**2+tracks_z**2+0.13957**2)
    tracks = uproot_methods.TLorentzVectorArray.from_cartesian(ak.to_awkward0(tracks_x),
                                                               ak.to_awkward0(tracks_y),
                                                               ak.to_awkward0(tracks_z),
                                                               ak.to_awkward0(tracks_E))
    tracks_fromPV0 = events['Tracks_fromPV0'][ievt]
    tracks_matchedToPFCandidate = events['Tracks_matchedToPFCandidate'][ievt]
    tracks = tracks[(tracks.pt > 1.) & (abs(tracks.eta) < 2.5) &
                    (ak.to_awkward0(tracks_fromPV0) >= 2) &
                    (ak.to_awkward0(tracks_matchedToPFCandidate) > 0)]
    jetsAK15 = makeJets(tracks, 1.5)
    isrJet = isrTagger(jetsAK15, warn=False)
    tracks_boosted_minusISR = tracks.boost(-isrJet.p3/isrJet.energy)
    s = sphericityTensor(tracks_boosted_minusISR)
    evtShape[ievt] = sphericity(s)

# Plot results
fig = plt.figure(figsize=(8,8))
ax = plt.gca()

CrossSection = events['CrossSection'][events['HT'] > 1200]
evtShape = evtShape[events['HT'] > 1200]

ax.hist(evtShape, bins=100, density=True, weights=CrossSection, histtype='step', color='b')

plt.show()
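(As an aside, the per-event sphericity computation in the script above can be written with plain NumPy outer products, independent of the uproot_methods vector classes; a sketch for one event's track momenta, with made-up random inputs:)

```python
import numpy as np

def sphericity_tensor(px, py, pz):
    """S_ij = sum_k p_ik p_jk / sum_k |p_k|^2  (the r = 2 tensor above)."""
    p = np.vstack([px, py, pz])               # shape (3, n_tracks)
    return (p @ p.T) / np.sum(px**2 + py**2 + pz**2)

def sphericity(s):
    """1.5 * (lambda_1 + lambda_2), eigenvalues sorted ascending."""
    eigvals = np.sort(np.linalg.eigvalsh(s))  # eigvalsh: s is symmetric
    return 1.5 * (eigvals[0] + eigvals[1])

rng = np.random.default_rng(0)
px, py, pz = rng.normal(size=(3, 50))
s = sphericity_tensor(px, py, pz)
print(sphericity(s))   # trace(s) == 1, so the result lies in [0, 1]
```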

@chrispap95
Author

Perhaps uproot4.iterate could solve this issue?

@jpivarski
Member

As I said above, uproot4.iterate iterates over chunks of events, so it doesn't change the for-loop vs. columnar trade-off. You can certainly use columnar techniques in some parts of your script, even if a restrictive interface prevents you from using them everywhere.

For a long time now, I've been wanting to give pyjet an interface that takes jagged arrays of Lorentz vectors, so that it wouldn't become a bottleneck by having to do Python for-loops over events. As it is, you'll need to do for loops in that one spot. Even this is much better than FastJet's own Python interface, which makes you do for loops over all the particles!
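
(That said, the structured array that pyjet.cluster consumes can be filled without the per-track Python loop in makeJets above; a NumPy sketch, with the pyjet call shown only as a comment for context:)

```python
import numpy as np

# Field layout pyjet.cluster expects for (pT, eta, phi, mass) pseudojets
PSEUDOJET_DTYPE = np.dtype([("pT", "f8"), ("eta", "f8"),
                            ("phi", "f8"), ("mass", "f8")])

def tracks_to_pseudojets(pt, eta, phi, mass):
    """Pack per-track arrays into a structured array using whole-column
    assignments instead of a per-track Python loop."""
    vectors = np.empty(len(pt), dtype=PSEUDOJET_DTYPE)
    vectors["pT"] = pt
    vectors["eta"] = eta
    vectors["phi"] = phi
    vectors["mass"] = mass
    return vectors

vecs = tracks_to_pseudojets(np.array([10.0, 25.0]), np.array([0.1, -1.2]),
                            np.array([0.5, 2.9]), np.array([0.13957, 0.13957]))
print(vecs["pT"])   # [10. 25.]
# then, per event:  sequence = pyjet.cluster(vecs, R=1.5, p=-1)
```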

@jpivarski
Member

Is this issue still happening?

uproot4.deserialization.DeserializationError: while reading

    TBasket version None as uproot4.models.TBasket.Model_TBasket (? bytes)
        fNbytes: -1607368158
        fObjlen: -1243566277
        fDatime: 2634931141
        fKeylen: -27664
        fCycle: 21409

attempting to get bytes 483015:483033
outside expected range 510698:538758 for this Chunk
in file /Users/chrispap/QCD/Autumn18.QCD_HT1500to2000_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root

I can't test the XRootD access because of permissions, but we've narrowed it down to that: the bug does not occur with local files. If you can't reproduce it now, I'll close this issue.

@jpivarski jpivarski added the bug (unverified) The problem described would be a bug, but needs to be triaged label Oct 30, 2020
@chrispap95
Author

The error message above has been produced with local files and the code that I posted when I created the issue. If you need the second large file I can upload it, too.

@chrispap95
Author

You can see that the file is local in the last line of the error message, too.

@jpivarski
Member

Is the 9 GB file (Autumn18.QCD_HT1000to1500_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root) not large enough to cause the error? Could you put a file that is large enough to cause the error on Google Drive?

(This is good, because I'd much rather debug an Uproot issue than an XRootD one!)

@chrispap95
Author

In fact, I think that you need two of them, since the issue happens exactly when the loop goes from ievt = 99999 to 100000 (i.e., when it crosses into the next file). I will post a link to the second file in a few minutes.

@jpivarski
Member

That fact will likely be important in the solution; thanks!

@jpivarski
Member

Does it work if you try going from one file to itself, or from the small file to the large file, or from the large file to the small file? In any of those configurations, I would be able to reproduce it with the two files I have.

@chrispap95
Author

Here's the link https://drive.google.com/drive/folders/1s_sX10H7pN2Xq1JK8QMuoS7i7gpSiOs2?usp=sharing
The other file is the QCD sample with HT1500to2000. I haven't tried those configurations yet; I will do it now.

@chrispap95
Author

So I tested a few different configurations, and it turns out that the issue only occurs when trying to access the first entry of the HT1500to2000 file. Could it be a bad file? That's confusing, though, because I have run code on that file before and it worked.

@jpivarski
Member

It's entirely possible that it's a bad file. It's easy to make mistakes in bookkeeping: more than half the time, when I think I've found a test case that demonstrates an error in the main codebase, it's the test case that's wrong, not the main codebase. I'm downloading it and I'll test it soon. I'm going through a huge backlog of issues—I might as well start with this one.

jpivarski added a commit that referenced this issue Oct 30, 2020
@jpivarski
Member

Yeah, there's something wrong with that file. The "Tracks.fCoordinates.fX" and "Tracks.fCoordinates.fZ" branches have this junk in the first TBasket, and the "Tracks.fCoordinates.fY" branch has a "count branch" that points to "Tracks.fCoordinates.fX". (It should be pointing to a branch of "number of items per entry", not "x" values.) I used TBranch.debug to search nearby in case the file pointer was just a little off, but I didn't see anything that looks like the beginning of a TBasket (which would include some easily recognizable strings).

So it's just that one file. Something went wrong while writing it. Also, your TBaskets are very small, like 7 entries each. Reading them would be more efficient if they were big enough that NumPy is doing more work than Python. For these 7 branches, loading 100 MB at a time (big, but can't possibly be a problem for the amount of RAM on your computer), you'd want 4000‒7000 entries per basket:

>>> branches = ["HT", "CrossSection", "Tracks.fCoordinates.fX", "Tracks.fCoordinates.fY", "Tracks.fCoordinates.fZ", "Tracks_fromPV0", "Tracks_matchedToPFCandidate"]
>>> t = uproot4.open("~/storage/data/uproot4-big/issue-131-big1.root:TreeMaker2/PreSelection")
>>> t.num_entries_for("100 MB", branches)
7657
>>> t = uproot4.open("~/storage/data/uproot4-big/issue-131-little.root:TreeMaker2/PreSelection")
>>> t.num_entries_for("100 MB", branches)
4689
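
(num_entries_for is essentially a memory-budget division: the byte budget divided by the average uncompressed bytes per entry across the requested branches. A toy illustration of that arithmetic, with made-up numbers rather than values from these files:)

```python
def entries_for_budget(budget_bytes, total_uncompressed_bytes, num_entries):
    """Roughly how many entries fit in a memory budget, given the average
    uncompressed bytes per entry over the branches of interest."""
    bytes_per_entry = total_uncompressed_bytes / num_entries
    return int(budget_bytes // bytes_per_entry)

# e.g. branches totalling ~2.6 GB uncompressed over 200000 entries,
# read with a 100 MB budget:
print(entries_for_budget(100 * 1024**2, 2.6e9, 200000))
```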

Finally, for reference, you'll probably want a procedure like the following for gathering multiplicities (and anything else) from your set of files—it's what I used in debugging:

import numpy as np
import awkward1 as ak
import uproot4

multiplicity_chunks = []

for events, report in uproot4.iterate(
    [
        "~/storage/data/uproot4-big/issue-131-little.root:TreeMaker2/PreSelection",
        "~/storage/data/uproot4-big/issue-131-big1.root:TreeMaker2/PreSelection",
    ], [
        "HT",
        "CrossSection",
        "Tracks.fCoordinates.fX",
        "Tracks.fCoordinates.fY",
        "Tracks.fCoordinates.fZ",
        "Tracks_fromPV0",
        "Tracks_matchedToPFCandidate",
    ], step_size="1 GB", report=True
):
    print(report)

    cut_events = events[events.HT > 1200]
    HT, CrossSection, x, y, z, num_fromPV0, matched_to_candidate = ak.unzip(cut_events)

    pt = np.sqrt(x**2 + y**2 + z**2)
    eta = np.arcsinh(z / np.sqrt(x**2 + y**2))

    track_cut = (pt > 1) & abs(eta < 2.5) & (num_fromPV0 >= 2) & matched_to_candidate
    multiplicity_chunk = ak.sum(track_cut, axis=1)

    multiplicity_chunks.append(multiplicity_chunk)

multiplicity = ak.concatenate(multiplicity_chunks)
print(multiplicity)

I'm going to close this because I think it's your file that's wrong. If you have any other problems, ping this issue or open a new one!

jpivarski added a commit that referenced this issue Oct 30, 2020
…#166)

* Working on issue #131.

* Add Model.is_instance and make TBranch.count_branch safer.

* Pass flake8.