tree.pandas.df() with branches==None AttributeError #102

balarsen · 2018-07-15T16:31:01Z

This seems like it crept in recently, as it used to work.

>>> import uproot
>>> tree = uproot.open("Zmumu.root")["events"]
>>> tree.pandas.df(["pt1", "eta1", "phi1", "pt2", "eta2", "phi2"])

... works

>>> import uproot
>>> tree = uproot.open("Zmumu.root")["events"]
>>> tree.pandas.df()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-6bcef7f7c748> in <module>()
      1 import uproot
      2 tree = uproot.open("Zmumu.root")["events"]
----> 3 tree.pandas.df()
      4

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/uproot/_connect/to_pandas.py in df(self, branches, entrystart, entrystop, flatten, cache, basketcache, keycache, executor, blocking)
     41     def df(self, branches=None, entrystart=None, entrystop=None, flatten=True, cache=None, basketcache=None, keycache=None, executor=None, blocking=True):
     42         import pandas
---> 43         return self._tree.arrays(branches=branches, outputtype=pandas.DataFrame, entrystart=entrystart, entrystop=entrystop, flatten=flatten, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=blocking)

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/uproot/tree.py in arrays(self, branches, outputtype, entrystart, entrystop, flatten, cache, basketcache, keycache, executor, blocking)
    498         # if blocking, return the result of that function; otherwise, the function itself
    499         if blocking:
--> 500             return wait()
    501         else:
    502             return wait

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/uproot/tree.py in wait()
    451                             array = future()
    452
--> 453                             entries = numpy.empty(len(array.content), dtype=numpy.int64)
    454                             subentries = numpy.empty(len(array.content), dtype=numpy.int64)
    455                             starts, stops = array.starts, array.stops

AttributeError: 'Strings' object has no attribute 'content'

balarsen · 2018-07-15T16:31:45Z

47:balarsen@rbsp4 tests $ python --version
Python 3.6.5

18:balarsen@rbsp4 Downloads $ pip list
Package                    Version
-------------------------- ----------------------------
alabaster                  0.7.11
appnope                    0.1.0
arrow                      0.12.1
astropy                    3.0.3
atomicwrites               1.1.5
attrs                      18.1.0
Babel                      2.6.0
backcall                   0.1.0
BayesInst                  0.3.dev22+ga7d5c0d.d20180706
bleach                     2.1.3
certifi                    2018.4.16
chardet                    3.0.4
click                      6.7
coverage                   4.5.1
cycler                     0.10.0
Cython                     0.28.4
decorator                  4.3.0
docutils                   0.14
docx                       0.2.4
entrypoints                0.2.3
et-xmlfile                 1.0.1
gitdb2                     2.0.4
GitPython                  2.1.11
h5py                       2.8.0
html5lib                   1.0.1
hypothesis                 3.66.1
idna                       2.7
imagesize                  1.0.0
ipykernel                  4.8.2
ipython                    6.4.0
ipython-genutils           0.2.0
ipywidgets                 7.2.1
jdcal                      1.4
jedi                       0.12.1
Jinja2                     2.10
joblib                     0.12.0
jsonschema                 2.6.0
jupyter                    1.0.0
jupyter-client             5.2.3
jupyter-console            5.2.0
jupyter-core               4.4.0
kiwisolver                 1.0.1
LANLpygeometry             0.1
lxml                       4.2.3
MarkupSafe                 1.0
matplotlib                 2.2.2
mistune                    0.8.3
more-itertools             4.2.0
nbconvert                  5.3.1
nbformat                   4.4.0
nose                       1.3.7
notebook                   5.6.0
numexpr                    2.6.5
numpy                      1.14.5
openpyxl                   2.5.4
packaging                  17.1
pandas                     0.23.3
pandocfilters              1.4.2
parameterized              0.6.1
parso                      0.3.1
path.py                    11.0.1
patsy                      0.5.0
pbr                        4.1.0
pexpect                    4.6.0
pickleshare                0.7.4
Pillow                     5.2.0
pip                        10.0.1
pluggy                     0.6.0
prometheus-client          0.3.0
prompt-toolkit             1.0.15
ptyprocess                 0.6.0
py                         1.5.4
Pygments                   2.2.0
pymc3                      3.4.1
pyparsing                  2.2.0
pytest                     3.6.3
pytest-cov                 2.5.1
python-dateutil            2.7.3
pytz                       2018.5
pyzmq                      17.1.0
qtconsole                  4.3.1
requests                   2.19.1
ruamel.appconfig           0.5.4
ruamel.std.argparse        0.8.1
ruamel.std.pathlib         0.6.3
scikit-learn               0.19.1
scikit-optimize            0.5.2
scipy                      1.1.0
seaborn                    0.8.1
Send2Trash                 1.5.0
setuptools                 40.0.0
setuptools-scm             2.1.0
setuptools-scm-git-archive 1.0
simplegeneric              0.8.1
six                        1.11.0
sklearn                    0.0
smmap2                     2.0.4
snakefood                  1.4
snowballstemmer            1.2.1
spacepy                    0.1.6
Sphinx                     1.7.5
sphinx-git                 10.1.1
sphinxcontrib-websupport   1.1.0
stevedore                  1.28.0
STUDIO                     0.0.0
tables                     3.4.4
terminado                  0.8.1
testpath                   0.3.1
Theano                     1.0.2
tornado                    5.1
tqdm                       4.23.4
traitlets                  4.3.2
uproot                     2.9.4
urllib3                    1.23
version-information        1.0.3
virtualenv                 16.0.0
virtualenv-clone           0.3.0
virtualenvwrapper          4.8.2
wcwidth                    0.1.7
webencodings               0.5.1
widgetsnbextension         3.2.1
xarray                     0.10.7
xlwt                       1.3.0
xmltodict                  0.11.0

balarsen · 2018-07-15T16:40:19Z

I have no idea how to go about figuring out the issue, but adding this test to tests/test_issues.py at least captures the error:

    def test_issue102(self):
        t = uproot.open("tests/samples/Zmumu.root")["events"]
        assert len(t.pandas.df(["pt1", "eta1", "phi1", "pt2", "eta2", "phi2"])) == 2304
        assert len(t.pandas.df()) == 2304

jpivarski · 2018-07-15T17:55:09Z

Thanks for catching this. I'll fix it as soon as possible. It's not a mysterious bug, but it illustrates that I need to systematize some of the special-case handling for different branch types. You're getting this error because some of your branches have string type and the DataFrame-handling code doesn't handle that case.

You can avoid this error (for now) by reading the data as arrays and converting them into a DataFrame:

df = pandas.DataFrame(columns=tree.allkeys(), data=tree.arrays())

The thing you would be missing by doing this is the ability to flatten jagged data into a DataFrame with a MultiIndex. (It's equivalent to flatten=False.)
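For readers unfamiliar with what flattening into a MultiIndex means, here is a minimal sketch in plain NumPy and pandas. It is not uproot's implementation; the `jagged` list and the column name `x` are made up for illustration, and the index level names `entry` and `subentry` are only borrowed from the variable names visible in the traceback above.

```python
import numpy
import pandas

# Hypothetical jagged column: one variable-length list of values per entry.
jagged = [[1.1, 2.2], [], [3.3], [4.4, 5.5, 6.6]]

# Number of values in each entry.
counts = numpy.array([len(x) for x in jagged], dtype=numpy.int64)

# Entry number, repeated once per value it contains.
entries = numpy.repeat(numpy.arange(len(jagged)), counts)

# Position of each value within its entry (restarts at 0 for every entry).
starts = numpy.cumsum(counts) - counts
subentries = numpy.arange(counts.sum()) - numpy.repeat(starts, counts)

# All values laid end to end.
content = numpy.concatenate([numpy.asarray(x, dtype=numpy.float64) for x in jagged])

index = pandas.MultiIndex.from_arrays([entries, subentries], names=["entry", "subentry"])
flat = pandas.DataFrame({"x": content}, index=index)
print(flat)
```

Empty entries (entry 1 here) simply contribute no rows, which is why the workaround with plain arrays cannot reproduce this layout for jagged branches.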

I'll fix it as soon as possible (but it could be a week, since I'm on vacation).

balarsen · 2018-07-15T21:38:38Z

Perfect, no problem. Thanks for the hard work! Enjoy the vacation.

balarsen · 2018-07-16T16:56:13Z

An interesting observation on some 100M plain (not jagged, etc.) ROOT files:

tree = uproot.open("./zep_hemisphere_1_95.root")['EventInfo']
%timeit df = pandas.DataFrame(columns=tree.allkeys(), data=tree.arrays())
# 1min 16s ± 552 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

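def makedf(tree):
    # Read each branch separately and collect the arrays by branch name.
    df = {}
    for k in tree.keys():
        df[k] = tree[k].array()  # key must be the variable k, not the literal 'k'
    return pandas.DataFrame(df)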
%timeit makedf(tree)
# 44.1 s ± 539 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is a pretty notable difference in speed. It seems like the first version should be the faster one, but maybe it has other stuff going on behind the scenes that slows it down.

jpivarski · 2018-07-16T17:19:35Z

That's weird. The tree.arrays method is literally each branch.array in sequence when there is no executor (parallel processing). It internally creates functions and uses them once to accommodate parallel processing, which might be a slight hit that scales with the number of branches (not the size of their contents).

The other confounding variable here is constructing the DataFrame, which has performance characteristics that are mysterious to me. My prescription of setting columns and data is because it gives Pandas all the information at once (presumably, it can make use of that information to optimize) and columns sets the order. columns is the only Pandas difference between your two examples.

If it turns out to be array versus arrays, I'll look into it with more examples. I won't change it on the basis of one example because it might not be general, especially if it means complicating the code (separate parallel and sequential cases) for the sake of a speedup.
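One way to untangle the two confounded costs is to time the read and the DataFrame construction as separate steps. The sketch below uses synthetic arrays as a stand-in for `tree.arrays()`, so it says nothing about uproot itself; `timed` and `read_arrays` are hypothetical helper names, and the branch count and length are arbitrary.

```python
import time

import numpy
import pandas


def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def read_arrays(n_branches=20, n_entries=100000):
    """Stand-in for tree.arrays(): a dict of flat arrays with made-up names."""
    rng = numpy.random.RandomState(0)
    return {"branch%d" % i: rng.normal(size=n_entries) for i in range(n_branches)}


# Stage 1: getting the arrays (with a real file, this would be the uproot read).
arrays, t_read = timed(read_arrays)

# Stage 2: building the DataFrame from arrays that are already in memory.
df, t_frame = timed(pandas.DataFrame, arrays)

print("read: %.4f s, DataFrame construction: %.4f s" % (t_read, t_frame))
```

With a real file, replacing `read_arrays` by the actual read would show which of the two stages accounts for the difference.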

balarsen · 2018-07-16T19:10:28Z

And the plot thickens: I don't understand how the heck DataFrames are constructed.

In my case pandas.DataFrame(tree.arrays()) and pandas.DataFrame(columns=tree.allkeys(), data=tree.arrays()) give the same result.

%timeit df = pandas.DataFrame(tree.arrays())
# 30.9 s ± 597 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

jpivarski · 2018-07-16T19:30:45Z

(Seeing as I'm traveling with my family, I can't try these things myself, but I can keep giving you suggestions of things to try. I don't know, however, if your physics case really needs more performance on reading these DataFrames. As a library developer, I'm on the lookout for speedups, but if your focus is on physics, the difference between a minute and half a minute isn't that significant.)

I suggested using columns because the order of the columns might be important. The dict returned by tree.arrays() might yield the same order, but not necessarily. It might also work to use an OrderedDict:

import collections
df = pandas.DataFrame(tree.arrays(outputtype=collections.OrderedDict))

When uproot fills an OrderedDict, it does so in the TTree's natural branch order.

But then again, maybe the column order doesn't matter to you. :)
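As a small illustration of the two ways to pin the column order, here is a sketch with synthetic arrays in place of a real tree. The branch names are borrowed from the example at the top of the thread and the values are made up.

```python
import collections

import numpy
import pandas

# Desired column order (deliberately not alphabetical).
branch_order = ["pt1", "eta1", "phi1"]

# Stand-in for tree.arrays(outputtype=collections.OrderedDict).
arrays = collections.OrderedDict(
    (name, numpy.arange(3, dtype=numpy.float64)) for name in branch_order
)

# Option 1: the OrderedDict carries the order into the DataFrame.
df_ordered = pandas.DataFrame(arrays)

# Option 2: a plain dict plus an explicit columns argument.
df_columns = pandas.DataFrame(data=dict(arrays), columns=branch_order)

print(list(df_ordered.columns))
print(list(df_columns.columns))
```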

balarsen · 2018-07-16T22:03:59Z

Thanks. I really only point these out as interesting things, as a way to more fully understand what is at the bottom of the whole system. It seems like the actual bug is easy to fix when you return, and the rest is something to file away deep in the brain as an "oh, I remember that" for when it comes back as an enhancement or when someone's application requires more speed than is currently there. As you point out, that is not me currently, other than being nerd driven.

My particular love of this package is driven by moving away from ROOT as early in my processing as possible, which lets me use tools I am more comfortable with. I'm not a HEP guy but a space physics guy using Geant for instrument responses.

jpivarski · 2018-07-18T10:16:31Z

I got up before everyone else and fixed the original bug that started this thread. I couldn't find the performance difference, but I don't have your file. Considering the changes that are in store for this bit of code, however, it might not be worth tuning it until after the awkward-arrays are in.

jpivarski added a commit that referenced this issue Jul 18, 2018

fixes issue #102; Pandas DataFrame handling is safe for strings and other objects that use JaggedArrays to define their structure without being jagged conceptually

eaecaa6

jpivarski added a commit that referenced this issue Jul 18, 2018

also addresses issue #102; only enter the slow branch if some columns really are jagged (as opposed to strings)

7ce9265

jpivarski closed this as completed Jul 18, 2018
