DataFrame chunks will have duplicate indices #24

alexpearce · 2016-06-20T07:48:54Z

If a ROOT file is loaded in chunks, each chunk's DataFrame index starts at the same value. If you then save these chunks to a single file (with mode='a') and load that file into a single DataFrame, its index will contain duplicate values.

import numpy as np
import root_numpy
import root_pandas

# Create the file with root_numpy directly, so the input doesn't have an index
xs = np.array(np.vstack(np.random.normal(0, 1, 100)), dtype=[('x', float)])
root_numpy.array2root(xs, 'input.root', 'tree', mode='recreate')

# Read the file in chunks and then write to the output
# Use write-mode for the first chunk, then use append mode, to make sure the output is re-created
for idx, df in enumerate(root_pandas.read_root('input.root', chunksize=10)):
    if idx == 0:
        mode = 'w'
    else:
        mode = 'a'
    df.to_root('output.root', mode=mode)

df = root_pandas.read_root('output.root')
dup_mask = df.index.duplicated()
print(dup_mask.any(), df[dup_mask].index.size)

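This prints True and 90: each of the 10 chunks carries the index values 0 to 9, so 90 of the 100 rows repeat an index value that has already appeared.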

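These duplicate values are problematic for operations that rely on unique labels: for example, label-based selection with .loc returns several rows for a single label, and reindexing raises an error.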
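To make the problem concrete without needing a ROOT file, here is a minimal sketch using pandas only. The two small DataFrames are stand-ins for chunks (they are not produced by root_pandas), and I'm assuming the duplicate index behaves the same way once the chunks are combined.

```python
import pandas as pd

# Stand-ins for two chunks whose indices both start at 0
chunks = [pd.DataFrame({'x': [0.1, 0.2, 0.3]}),
          pd.DataFrame({'x': [0.4, 0.5, 0.6]})]
df = pd.concat(chunks)

# Label-based selection returns one row per occurrence of the label
rows = df.loc[0]
print(len(rows))

# Reindexing is refused when the existing index has duplicate labels
try:
    df.reindex([0, 1, 2])
    reindex_failed = False
except ValueError:
    reindex_failed = True
print(reindex_failed)
```

Here df.loc[0] gives back two rows rather than one, and the reindex call raises a ValueError.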

One work-around is to set the index values by hand in the loop.

chunksize = 10
for idx, df in enumerate(root_pandas.read_root('input.root',
                                               chunksize=chunksize)):
    if idx == 0:
        writemode = 'w'
    else:
        writemode = 'a'
    # Offset the index of this chunk
    df.index += idx*chunksize
    df.to_root('output.root', mode=writemode)
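The offset idx*chunksize assumes every chunk before the current one had exactly chunksize rows. A variant that tracks a running row count instead might look like the sketch below. The helper name offset_indices is hypothetical (it is not part of root_pandas), it replaces each chunk's index with a fresh range rather than adding to it, and the chunks are again plain pandas stand-ins so the example runs on its own.

```python
import pandas as pd

def offset_indices(chunks):
    """Yield copies of the chunks whose indices continue from one chunk to the next."""
    offset = 0
    for chunk in chunks:
        chunk = chunk.copy()
        # Replace the chunk's index with a range starting at the running offset
        chunk.index = pd.RangeIndex(offset, offset + len(chunk))
        offset += len(chunk)
        yield chunk

# Stand-in chunks, each indexed from 0, with a shorter final chunk
chunks = [pd.DataFrame({'x': range(n)}) for n in (10, 10, 5)]
combined = pd.concat(list(offset_indices(chunks)))
print(combined.index.is_unique, combined.index.max())
```

In the loop above, the same idea would mean keeping a running offset variable and adding len(df) to it after each to_root call, instead of computing idx*chunksize.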

Should root_pandas do this for us? It surprised me when I first saw it. But if you're only manipulating the chunks, i.e. not saving them to the same file, you won't encounter this problem, and maybe that's the more common use case of chunksize.


ibab · 2016-08-11T02:16:10Z

It definitely shouldn't do this :(
I'll have a look at fixing this.

ibab closed this as completed in 2001dcc Aug 19, 2016
