This repository has been archived by the owner on Jan 9, 2023. It is now read-only.

DataFrame chunks will have duplicate indices #24

Closed
alexpearce opened this issue Jun 20, 2016 · 1 comment

Comments

@alexpearce
Contributor

If a ROOT file is loaded in chunks, each chunk's DataFrame has an index that starts from the same value. If you then save these chunks to a single file (with mode='a') and load that file into a single DataFrame, its index will contain duplicate values.

import numpy as np
import root_numpy
import root_pandas

# Create the file with root_numpy directly, so the input doesn't have an index
xs = np.array(np.vstack(np.random.normal(0, 1, 100)), dtype=[('x', float)])
root_numpy.array2root(xs, 'input.root', 'tree', mode='recreate')

# Read the file in chunks and then write to the output
# Use write-mode for the first chunk, then use append mode, to make sure the output is re-created
for idx, df in enumerate(root_pandas.read_root('input.root', chunksize=10)):
    if idx == 0:
        mode = 'w'
    else:
        mode = 'a'
    df.to_root('output.root', mode=mode)

df = root_pandas.read_root('output.root')
dup_mask = df.index.duplicated()
print(dup_mask.any(), df[dup_mask].index.size)

This prints True and 90: of the 100 rows, 90 have an index value that already appeared in an earlier chunk.

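These duplicate values are problematic for operations that rely on index labels. For example, label-based selection such as df.loc[0] returns several rows instead of one, and reindexing fails on a non-unique index.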
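To see the kind of trouble this causes without needing ROOT, here is a minimal pandas-only sketch. It assumes that ten concatenated DataFrames with default indices are a fair stand-in for what comes back from output.root; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Stand-in for the round trip: ten chunks whose index restarts at 0 each time
chunks = [pd.DataFrame({'x': np.random.normal(0, 1, 10)}) for _ in range(10)]
df = pd.concat(chunks)

# The combined index is not unique
print(df.index.is_unique)

# Selecting by label returns every row that carries that label
rows_at_zero = df.loc[0]
print(len(rows_at_zero))

# Reindexing a frame with duplicate labels raises a ValueError
try:
    df.reindex(range(100))
    reindex_failed = False
except ValueError:
    reindex_failed = True
print(reindex_failed)
```
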

One work-around is to set the index values by hand in the loop.

chunksize = 10
for idx, df in enumerate(root_pandas.read_root('input.root',
                                               chunksize=chunksize)):
    if idx == 0:
        writemode = 'w'
    else:
        writemode = 'a'
    # Offset the index of this chunk
    df.index += idx*chunksize
    df.to_root('output.root', mode=writemode)
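A variation on the same idea, sketched with pandas only: keep a running row count instead of multiplying by chunksize, which should also cope with a final chunk that is shorter than the rest. The fake_chunks generator below is a hypothetical stand-in for read_root with chunksize and is not part of root_pandas. If the original row numbers don't matter, dropping the index after loading with reset_index(drop=True) is another option.

```python
import numpy as np
import pandas as pd

def fake_chunks(n_rows, chunksize):
    """Hypothetical stand-in for chunked reading: each chunk's index restarts at 0."""
    data = np.arange(n_rows, dtype=float)
    for start in range(0, n_rows, chunksize):
        yield pd.DataFrame({'x': data[start:start + chunksize]})

offset = 0
pieces = []
for chunk in fake_chunks(95, 10):
    # Shift the index by the number of rows seen so far
    chunk.index = chunk.index + offset
    offset += len(chunk)
    pieces.append(chunk)

combined = pd.concat(pieces)
print(combined.index.is_unique)

# Alternative: ignore the per-chunk index and renumber after loading
renumbered = pd.concat(list(fake_chunks(95, 10))).reset_index(drop=True)
print(renumbered.index.is_unique)
```
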

Should root_pandas do this for us? It surprised me when I first saw it. But if you're only manipulating the chunks, i.e. not saving them to the same file, you won't encounter this problem, and maybe that's the more common use case of chunksize.

@ibab
Collaborator

ibab commented Aug 11, 2016

It definitely shouldn't do this :(
I'll have a look at fixing this.

@ibab ibab closed this as completed in 2001dcc Aug 19, 2016