Loading multiple large files (>4GB) onto a lazy array produces deserialization error #131
Comments
I don't have authorization to access these files:
Ahh, sorry about that. You probably need to have Kerberos initialized for FNAL. I am uploading them to CERNBox; I will post a link in a few minutes.
I'll only need one 9 GB and one 1 GB file, so you can save space.
Alright, I think this should work: https://drive.google.com/drive/folders/1s_sX10H7pN2Xq1JK8QMuoS7i7gpSiOs2?usp=sharing
Actually, I ran it on both and couldn't find a bug. Specifically, what I ran was a translation of your code to its columnar equivalent:

import numpy as np
import awkward1 as ak
import uproot4

# f = uproot4.open("~/storage/data/uproot4-big/issue-131little.root")
f = uproot4.open("~/storage/data/uproot4-big/issue-131big.root")
t = f["TreeMaker2/PreSelection"]

events = t.arrays([
    "HT",
    "CrossSection",
    "Tracks.fCoordinates.fX",
    "Tracks.fCoordinates.fY",
    "Tracks.fCoordinates.fZ",
    "Tracks_fromPV0",
    "Tracks_matchedToPFCandidate",
])  # , entry_start=99000

cut_events = events[events.HT > 1200]
HT, CrossSection, x, y, z, num_fromPV0, matched_to_candidate = ak.unzip(cut_events)
pt = np.sqrt(x**2 + y**2)
eta = np.arcsinh(z / np.sqrt(x**2 + y**2))
track_cut = (pt > 1) & (abs(eta) < 2.5) & (num_fromPV0 >= 2) & matched_to_candidate
multiplicity = ak.sum(track_cut, axis=1)

This ought to be a significant boost in efficiency (both speed and memory use), but it shouldn't cause or fix a bug. What you showed on StackOverflow had an actual error in deserialization. Perhaps something went wrong in the XRootD source? (I commented out the limit on entry_start and ran over all entries. Also, I assume that you wanted to cut on abs(eta) < 2.5.) Since this ran without trouble from a local file, we should suspect the XRootD source. I no longer think it was a "reading position" issue (interpreting the wrong bytes as the TBasket header), because such a thing would affect a local source the same way. Maybe all the memory you were using confused PyXRootD? (If so, this would be a bug, and a subtle one, in PyXRootD's C++.) Or maybe Uproot's XRootDSource has a bug in it? If it's the former, then using the low-resource-usage code above would at least fix the symptom; if it's the latter, you'd get the same error either way. So try the code above with XRootD; at least it would be a good diagnostic.
Oh, I should mention that this took 4.57 GB of memory; if you have less than that, you'll need to use a lower-memory approach.
The real issue is that I don't think I can go fully columnar. In fact, I need to recluster AK15 jets, and to my knowledge the best approach so far is to use pyjet in an event loop. (Is there a way to pass jagged arrays with multiple events to pyjet or some equivalent?) My full code reproduces the error locally and looks like:

import uproot4 as uproot
import uproot_methods
import awkward1 as ak
import numpy as np
from math import pi
import matplotlib.pyplot as plt
import mplhep as hep
import pyjet
import matplotlib.colors as colors
import matplotlib.cm as cmx

plt.style.use(hep.style.ROOT)

def sphericityTensor(particles, r=2):
    s = np.zeros((3, 3))
    s[0][0] = particles.x.dot(particles.x)
    s[0][1] = particles.x.dot(particles.y)
    s[0][2] = particles.x.dot(particles.z)
    s[1][0] = particles.y.dot(particles.x)
    s[1][1] = particles.y.dot(particles.y)
    s[1][2] = particles.y.dot(particles.z)
    s[2][0] = particles.z.dot(particles.x)
    s[2][1] = particles.z.dot(particles.y)
    s[2][2] = particles.z.dot(particles.z)
    s = s / particles.p.dot(particles.p)
    return s

def sphericity(s):
    s_eigvalues, s_eigvectors = np.linalg.eig(s)
    s_eigvalues = np.sort(s_eigvalues)
    return 1.5 * (s_eigvalues[0] + s_eigvalues[1])

def makeJets(tracks, R, p=-1):
    # Cluster AK(R) jets
    vectors = np.zeros(tracks.size, np.dtype([('pT', 'f8'), ('eta', 'f8'),
                                              ('phi', 'f8'), ('mass', 'f8')]))
    for i, track in enumerate(tracks):
        vectors[i] = np.array((track.pt, track.eta, track.phi, track.mass),
                              np.dtype([('pT', 'f8'), ('eta', 'f8'),
                                        ('phi', 'f8'), ('mass', 'f8')]))
    sequence = pyjet.cluster(vectors, R=R, p=p)
    return sequence.inclusive_jets()

def isrTagger(jets, warn=True, warnThresh=130):
    # Pick one of the two leading jets based on constituent multiplicity
    mult0 = len(jets[0])
    mult1 = len(jets[1])
    if (mult0 > warnThresh) and (mult1 > warnThresh) and warn:
        print("Warning: both multiplicities are above %d!" % warnThresh)
    elif (mult0 < warnThresh) and (mult1 < warnThresh) and warn:
        print("Warning: both multiplicities are below %d!" % warnThresh)
    chosen = jets[1] if mult0 < mult1 else jets[0]
    return uproot_methods.TLorentzVectorArray.from_ptetaphim([chosen.pt],
                                                             [chosen.eta],
                                                             [chosen.phi],
                                                             [chosen.mass])

# Get the file and import using uproot
base = 'root://cmseos.fnal.gov//store/user/kdipetri/SUEP/Production_v0.2/2018/NTUP/'
# base = '/User/chrispap/QCD/'
datasets = {
    base + 'Autumn18.QCD_HT1000to1500_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root': 'TreeMaker2/PreSelection',
    base + 'Autumn18.QCD_HT1500to2000_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root': 'TreeMaker2/PreSelection',
    base + 'Autumn18.QCD_HT2000toInf_TuneCP5_13TeV-madgraphMLM-pythia8_RA2AnalysisTree.root': 'TreeMaker2/PreSelection',
}
events = uproot.lazy(datasets)

evtShape = np.zeros(len(events['Tracks.fCoordinates.fX']))
for ievt in range(len(events['Tracks.fCoordinates.fX'])):
    if ievt % 1000 == 0:
        print("Processing event %d. Progress: %.2f%%"
              % (ievt, 100 * ievt / len(events['Tracks.fCoordinates.fX'])))
    if events['HT'][ievt] < 1200:
        evtShape[ievt] = -1
        continue
    tracks_x = events['Tracks.fCoordinates.fX'][ievt]
    tracks_y = events['Tracks.fCoordinates.fY'][ievt]
    tracks_z = events['Tracks.fCoordinates.fZ'][ievt]
    tracks_E = np.sqrt(tracks_x**2 + tracks_y**2 + tracks_z**2 + 0.13957**2)
    tracks = uproot_methods.TLorentzVectorArray.from_cartesian(ak.to_awkward0(tracks_x),
                                                               ak.to_awkward0(tracks_y),
                                                               ak.to_awkward0(tracks_z),
                                                               ak.to_awkward0(tracks_E))
    tracks_fromPV0 = events['Tracks_fromPV0'][ievt]
    tracks_matchedToPFCandidate = events['Tracks_matchedToPFCandidate'][ievt]
    tracks = tracks[(tracks.pt > 1.) & (abs(tracks.eta) < 2.5) &
                    (ak.to_awkward0(tracks_fromPV0) >= 2) &
                    (ak.to_awkward0(tracks_matchedToPFCandidate) > 0)]
    jetsAK15 = makeJets(tracks, 1.5)
    isrJet = isrTagger(jetsAK15, warn=False)
    tracks_boosted_minusISR = tracks.boost(-isrJet.p3 / isrJet.energy)
    s = sphericityTensor(tracks_boosted_minusISR)
    evtShape[ievt] = sphericity(s)

# Plot results
fig = plt.figure(figsize=(8, 8))
ax = plt.gca()
CrossSection = events['CrossSection'][events['HT'] > 1200]
evtShape = evtShape[events['HT'] > 1200]
ax.hist(evtShape, bins=100, density=True, weights=CrossSection, histtype='step', color='b')
plt.show()
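As an aside, the sphericityTensor/sphericity pair in the script above can be condensed. A minimal NumPy-only sketch (an editorial rewrite, not part of the original script): one matrix product replaces the nine dot products, and eigvalsh is used because the tensor is symmetric (it returns real eigenvalues already in ascending order):

```python
import numpy as np

def sphericity_tensor(px, py, pz):
    # S_ij = sum_k p_k,i * p_k,j / sum_k |p_k|^2, via a single matmul
    p = np.vstack([px, py, pz]).astype("f8")  # shape (3, n_tracks)
    return (p @ p.T) / np.sum(p * p)

def sphericity(s):
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order
    eigvals = np.linalg.eigvalsh(s)
    return 1.5 * (eigvals[0] + eigvals[1])
```

Perfectly isotropic momenta give sphericity 1; a single-direction event gives 0.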
Perhaps uproot4.iterate could solve this issue?
As I said above, uproot4.iterate iterates over chunks of events, so it doesn't change the question of for loops vs. columnar processing. You can certainly use columnar techniques in some parts of your script, even if a restrictive interface prevents you from using them everywhere. For a long time now, I've been wanting to give pyjet an interface that takes jagged arrays of Lorentz vectors, so that it wouldn't become a bottleneck by forcing Python for loops over events. As it is, you'll need to do for loops in that one spot. Even this is much better than FastJet's own Python interface, which makes you do for loops over all the particles!
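One small win is possible today: the per-track Python loop inside makeJets can be replaced by column-wise assignment into the structured array that pyjet.cluster consumes. A sketch (the field names come from the code above; `make_pyjet_input` is a hypothetical helper, and the pt/eta/phi/mass inputs are assumed to be plain per-event NumPy arrays):

```python
import numpy as np

def make_pyjet_input(pt, eta, phi, mass):
    # Fill the (pT, eta, phi, mass) structured array column by column
    # instead of one track at a time in a Python loop.
    pt = np.asarray(pt, dtype="f8")
    out = np.empty(len(pt), dtype=[("pT", "f8"), ("eta", "f8"),
                                   ("phi", "f8"), ("mass", "f8")])
    out["pT"] = pt
    out["eta"] = eta
    out["phi"] = phi
    out["mass"] = mass
    return out
```

The resulting array can be passed to pyjet.cluster exactly like the loop-built one; only the per-event clustering loop remains.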
Is this issue still happening?
I can't test the XRootD access because of permissions, but we've narrowed it down this far: the bug does not occur with local files. If you can't reproduce it now, I'll close this issue.
The error message above was produced with local files and the code that I posted when I created the issue. If you need the second large file, I can upload it, too.
You can see that the file is local in the last line of the error message, too.
Is the 9 GB file you uploaded enough to reproduce the error? (This is good, because I'd much rather debug an Uproot issue than an XRootD one!)
In fact, I think that you need two of them, since the issue happens exactly when the loop goes from ievt = 99999 to 100000 (i.e. when it moves on to the next file). I will post a link to the second file in a few minutes.
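For intuition on why the 99999 to 100000 transition is special: a lazy array over several files has to translate a global entry index into a file and a local entry. This sketch shows only the bookkeeping idea, not Uproot's actual implementation, and the entry counts below are made up for illustration:

```python
import bisect

def locate(global_entry, num_entries_per_file):
    # Cumulative stop indices, one per file
    stops, total = [], 0
    for n in num_entries_per_file:
        total += n
        stops.append(total)
    # First file whose cumulative stop is past the requested entry
    file_index = bisect.bisect_right(stops, global_entry)
    start = stops[file_index - 1] if file_index > 0 else 0
    return file_index, global_entry - start
```

With hypothetical files of 100000 and 110000 entries, entry 99999 maps to (file 0, local 99999) while entry 100000 maps to (file 1, local 0): the very first read served by the second file, which is where the deserialization error appeared.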
That fact will likely be important in the solution; thanks!
Does it work if you try going from one file to itself, or from the small file to the large file, or from the large file to the small file? In any of those configurations, I would be able to reproduce it with the two files I have.
Here's the link: https://drive.google.com/drive/folders/1s_sX10H7pN2Xq1JK8QMuoS7i7gpSiOs2?usp=sharing
So I tested a few different configurations, and it turns out that the issue only occurs when trying to access the first entry of the HT1500to2000 file. Could it be a bad file? That's confusing, though, because I have run code on that file before and it worked.
It's entirely possible that it's a bad file. It's easy to make mistakes in bookkeeping: more than half the time, when I think I've found a test case that demonstrates an error in the main codebase, it's the test case that's wrong, not the main codebase. I'm downloading it and I'll test it soon. I'm going through a huge backlog of issues; I might as well start with this one.
Yeah, there's something wrong with that file. So it's just that one file; something went wrong while writing it.

Also, your TBaskets are very small, about 7 entries each. Reading them would be more efficient if they were big enough that NumPy is doing more work than Python. For these 7 branches, loading 100 MB at a time (big, but that can't possibly be a problem for the amount of RAM on your computer), you'd want 4000 to 7000 entries per basket:

>>> branches = ["HT", "CrossSection", "Tracks.fCoordinates.fX",
...             "Tracks.fCoordinates.fY", "Tracks.fCoordinates.fZ",
...             "Tracks_fromPV0", "Tracks_matchedToPFCandidate"]
>>> t = uproot4.open("~/storage/data/uproot4-big/issue-131-big1.root:TreeMaker2/PreSelection")
>>> t.num_entries_for("100 MB", branches)
7657
>>> t = uproot4.open("~/storage/data/uproot4-big/issue-131-little.root:TreeMaker2/PreSelection")
>>> t.num_entries_for("100 MB", branches)
4689

Finally, for reference, you'll probably want a procedure like the following for gathering multiplicities (and anything else) from your set of files; it's what I used in debugging:

import numpy as np
import awkward1 as ak
import uproot4

multiplicity_chunks = []
for events, report in uproot4.iterate(
    [
        "~/storage/data/uproot4-big/issue-131-little.root:TreeMaker2/PreSelection",
        "~/storage/data/uproot4-big/issue-131-big1.root:TreeMaker2/PreSelection",
    ],
    [
        "HT",
        "CrossSection",
        "Tracks.fCoordinates.fX",
        "Tracks.fCoordinates.fY",
        "Tracks.fCoordinates.fZ",
        "Tracks_fromPV0",
        "Tracks_matchedToPFCandidate",
    ],
    step_size="1 GB", report=True,
):
    print(report)
    cut_events = events[events.HT > 1200]
    HT, CrossSection, x, y, z, num_fromPV0, matched_to_candidate = ak.unzip(cut_events)
    pt = np.sqrt(x**2 + y**2)
    eta = np.arcsinh(z / np.sqrt(x**2 + y**2))
    track_cut = (pt > 1) & (abs(eta) < 2.5) & (num_fromPV0 >= 2) & matched_to_candidate
    multiplicity_chunk = ak.sum(track_cut, axis=1)
    multiplicity_chunks.append(multiplicity_chunk)

multiplicity = ak.concatenate(multiplicity_chunks)
print(multiplicity)

I'm going to close this because I think it's your file that's wrong. If you have any other problems, ping this issue or open a new one!
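The chunked pattern above (process each chunk, append, concatenate once at the end) is independent of Uproot; a minimal NumPy-only sketch of the same accumulation idea, with a made-up processing function standing in for the per-chunk cuts:

```python
import numpy as np

def process_in_chunks(data, chunk_size, func):
    # Apply func to each chunk and concatenate once at the end, so peak
    # memory is bounded by one chunk plus the accumulated results.
    pieces = []
    for start in range(0, len(data), chunk_size):
        pieces.append(func(data[start:start + chunk_size]))
    return np.concatenate(pieces)

result = process_in_chunks(np.arange(10), 4, lambda chunk: chunk * 2)
```

Concatenating once at the end, rather than growing an array inside the loop, avoids repeated reallocation.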
Issue description
This is an issue regarding the question posted here: https://stackoverflow.com/questions/64565208/how-can-i-load-many-very-large-files-using-uproot
When loading 3 files of about 9 GB each into a lazy array, the code crashes with a deserialization error when the event loop reaches the boundary between the first and second files. The code works flawlessly with small (~1 GB) files. The access method (local vs. XRootD) also doesn't seem to affect the issue. The error message is the following:
Reproduction instructions
The three very large files (9 GB each) are the following:
Two smaller files (1 GB) for which the code works nicely:
The following code reproduces the issue using the files that I cite above: