This repository has been archived by the owner on Jan 9, 2023. It is now read-only.
If a ROOT file is loaded in chunks, the individual DataFrames will all have the same starting value for the index. If you then save these chunks to a single file (with mode='a') and load that file into a single DataFrame, its index will contain duplicate values.
import numpy as np
import root_numpy
import root_pandas

# Create the file with root_numpy directly, so the input doesn't have an index
xs = np.array(np.vstack(np.random.normal(0, 1, 100)), dtype=[('x', float)])
root_numpy.array2root(xs, 'input.root', 'tree', mode='recreate')

# Read the file in chunks and then write to the output
# Use write-mode for the first chunk, then use append mode, to make sure the output is re-created
for idx, df in enumerate(root_pandas.read_root('input.root', chunksize=10)):
    if idx == 0:
        mode = 'w'
    else:
        mode = 'a'
    df.to_root('output.root', mode=mode)

df = root_pandas.read_root('output.root')
dup_mask = df.index.duplicated()
print(dup_mask.any(), df[dup_mask].index.size)
prints True, 90.
These duplicate values are problematic when performing certain operations.
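To make "problematic" concrete, here is a minimal pandas-only sketch. It does not go through root_pandas at all; it imitates the chunked round trip by concatenating frames that each start at index 0, which is an assumption about what the reloaded DataFrame looks like. The variable names are purely illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for ten chunks of ten rows, each carrying its own 0..9 index
chunks = [pd.DataFrame({'x': np.random.normal(0, 1, 10)}) for _ in range(10)]
df = pd.concat(chunks)

# A label lookup now returns ten rows instead of one
rows_for_label_zero = df.loc[0]

# Aligning to a new index raises ValueError when labels are duplicated
try:
    df.reindex(range(100))
    reindex_failed = False
except ValueError:
    reindex_failed = True

# Same check as in the report: every label after its first occurrence
n_duplicates = int(df.index.duplicated().sum())
```

Label-based selection and index alignment are the two places where this tends to bite; purely positional access (iloc) is unaffected.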
One work-around is to set the index values by hand in the loop.
chunksize = 10
for idx, df in enumerate(root_pandas.read_root('input.root', chunksize=chunksize)):
    if idx == 0:
        writemode = 'w'
    else:
        writemode = 'a'
    # Offset the index of this chunk
    df.index += idx * chunksize
    df.to_root('output.root', mode=writemode)
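The effect of the offset can be checked without ROOT installed. The sketch below is pandas-only and assumes the chunks behave like plain DataFrames that each start at index 0; the names chunks, shifted, combined and renumbered are hypothetical. It also shows a possible alternative, renumbering once after loading, for cases where the original labels carry no meaning.

```python
import numpy as np
import pandas as pd

chunksize = 10
# Stand-in for the chunks yielded by read_root: each starts at index 0
chunks = [pd.DataFrame({'x': np.random.normal(0, 1, chunksize)}) for _ in range(10)]

# Same offset as in the loop above, applied to each chunk before combining
shifted = []
for idx, chunk in enumerate(chunks):
    chunk = chunk.copy()
    chunk.index += idx * chunksize
    shifted.append(chunk)
combined = pd.concat(shifted)

# Alternative: discard the stored labels and renumber after loading
renumbered = pd.concat(chunks).reset_index(drop=True)
```

Note that idx * chunksize only gives contiguous labels if every chunk except possibly the last holds exactly chunksize rows.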
Should root_pandas do this for us? It surprised me when I first saw it. But if you're only manipulating the chunks, i.e. not saving them to the same file, you won't encounter this problem, and maybe that's the more common use case of chunksize.