Multiple clusters support for RNTuple #682

Moelf · 2022-08-19T04:51:52Z

akako@archlinux ~/D/g/uproot4 (RNTuple)> python -c 'from pprint import pprint; import uproot as up; import skhep_testdata; filename = skhep_testdata.data_path("test_ntuple_large_bit_int64.root"); r = up.open(filename)["ntuple"]; pprint(r.arrays(entry_start=1, entry_stop=2).to_list())'
[{'one_bit': True, 'two_int64': 1}]

akako@archlinux ~/D/g/uproot4 (RNTuple)> python -c 'from pprint import pprint; import uproot as up; import skhep_testdata; filename = skhep_testdata.data_path("test_ntuple_large_bit_int64.root"); r = up.open(filename)["ntuple"]; pprint(r.arrays(entry_start=11111112, entry_stop=11111113).to_list())'
[{'one_bit': True, 'two_int64': 11111112},
 {'one_bit': True, 'two_int64': 11111113}]

The one_bit field is not being read out correctly because it is actually bit-encoded, not byte-encoded, in the serialization.
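To illustrate the difference, here is a minimal sketch (not the uproot implementation) contrasting byte-encoded and bit-packed booleans in NumPy. The sample values are made up, and the sketch uses NumPy's default bit order, since it is only meant to show why the two encodings cannot be read the same way.

```python
import numpy as np

# Eight made-up boolean values to store.
values = np.array([1, 0, 0, 1, 1, 0, 1, 0], dtype=np.uint8)

# Byte-encoded: one byte per value, so the raw buffer can be viewed as bools directly.
byte_encoded = values.tobytes()
from_bytes = np.frombuffer(byte_encoded, dtype=np.bool_)

# Bit-packed: eight values share one byte, so the buffer must be unpacked first.
bit_packed = np.packbits(values).tobytes()
from_bits = np.unpackbits(np.frombuffer(bit_packed, dtype=np.uint8)).astype(np.bool_)

# Reading bit-packed data as if it were byte-encoded yields one value instead of eight.
misread = np.frombuffer(bit_packed, dtype=np.bool_)

print(len(byte_encoded), len(bit_packed), len(misread))
```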

Moelf · 2022-08-19T05:18:50Z

Inefficient, but it already works; this read spans two clusters:

> python -c 'from pprint import pprint; import uproot as up; import skhep_testdata; filename = skhep_testdata.data_path("test_ntuple_large_bit_int64.root"); r = up.open(filename)["ntuple"]; pprint(r.arrays(entry_start=1, entry_stop=11111113)[-10:].to_list())'
[{'one_bit': False, 'two_int64': 11111104},
 {'one_bit': True, 'two_int64': 11111105},
 {'one_bit': False, 'two_int64': 11111106},
 {'one_bit': False, 'two_int64': 11111107},
 {'one_bit': True, 'two_int64': 11111108},
 {'one_bit': True, 'two_int64': 11111109},
 {'one_bit': False, 'two_int64': 11111110},
 {'one_bit': False, 'two_int64': 11111111},
 {'one_bit': True, 'two_int64': 11111112},
 {'one_bit': True, 'two_int64': 11111113}]
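As a rough sketch of what reading across clusters involves, the helper below maps a global entry range onto per-cluster local ranges. The function name `clusters_for_range`, its arguments, and the cluster boundaries in the example are all hypothetical and only for illustration; this is not the code in this PR.

```python
def clusters_for_range(cluster_starts, num_entries, entry_start, entry_stop):
    """Return (cluster_index, local_start, local_stop) for each overlapping cluster.

    cluster_starts: first global entry index of each cluster, sorted ascending.
    num_entries: total number of entries, which closes the last cluster.
    """
    edges = list(cluster_starts) + [num_entries]
    overlaps = []
    for i in range(len(edges) - 1):
        # Clip the requested range to this cluster's [first, last) entries.
        lo = max(entry_start, edges[i])
        hi = min(entry_stop, edges[i + 1])
        if lo < hi:
            overlaps.append((i, lo - edges[i], hi - edges[i]))
    return overlaps


# Made-up layout: two clusters covering entries [0, 100) and [100, 150).
print(clusters_for_range([0, 100], 150, 90, 110))
```

The per-cluster pieces would then be read separately and concatenated in order.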

Moelf · 2022-08-20T03:30:57Z

> python -c 'from pprint import pprint; import uproot as up; import skhep_testdata; filename = skhep_testdata.data_path("test_ntuple_large_bit_int64.root"); r = up.open(filename)["ntuple"]; pprint(r.arrays(entry_start=0, entry_stop=50).to_list())'
[{'one_bit': True, 'two_int64': 0},
 {'one_bit': False, 'two_int64': 1},
 {'one_bit': False, 'two_int64': 2},
 {'one_bit': False, 'two_int64': 3},
 {'one_bit': False, 'two_int64': 4},
 {'one_bit': True, 'two_int64': 5},
 {'one_bit': False, 'two_int64': 6},
 {'one_bit': False, 'two_int64': 7},
 {'one_bit': False, 'two_int64': 8},
 {'one_bit': False, 'two_int64': 9},
 {'one_bit': True, 'two_int64': 10},
 {'one_bit': False, 'two_int64': 11},

It turns out I had one extra unpackbits call at a different location, due to a wrong design decision made earlier. Now, with bitorder = "little", it works flawlessly.

Moelf · 2022-08-20T05:00:18Z

@jpivarski ready in the sense that with this:

1. we can read bit columns now
2. we can also read multiple clusters

One "problem" is that we don't have a test file for 2, because it would be at least O(10) MB; we forgot to ask the ROOT people how to set a small cluster limit.

jpivarski

It looks good to me!

I see that you've unified the read_pagedesc function by introducing an isbit flag, dividing by 1.0 in the usual case, which is okay. I personally would have just had two functions. (It was Lua's documentation that convinced me to ignore my instinctive fear of using floating point numbers for integer purposes, as long as the values remain integers or integers divided by powers of 2, and as long as performance is not an issue, which is true here.)
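The floating-point remark can be checked directly. The snippet below is only an illustration with made-up numbers (it does not reproduce read_pagedesc): integers and integers divided by a power of 2 survive a round trip through Python floats exactly, while a decimal fraction such as 0.1 does not.

```python
import math

# Integers up to 2**53 are represented exactly as 64-bit floats.
exact_integer = float(2**53) == 2**53

# Dividing by a power of 2 only shifts the exponent, so nothing is lost.
n_bits = 11111113
n_bytes_fractional = n_bits / 8          # 1388889.125, exactly representable
round_trip_ok = n_bytes_fractional * 8 == n_bits

# Whole bytes needed to hold n_bits bit-packed values.
n_bytes = math.ceil(n_bits / 8)

# Contrast: 0.1 is not a power-of-2 fraction, so arithmetic with it is inexact.
decimal_is_exact = (0.1 * 3 == 0.3)

print(exact_integer, round_trip_ok, n_bytes, decimal_is_exact)
```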

Also on style, you have quite a few one-letter variables, many of them capital letters. There's a style guide (PEP 8?) that says that variable names should start with a lower-case letter, and a one-letter capital technically violates that, though you tend to use them for types, and type and template parameters in compiled languages have traditionally been one-letter capitals.

Just as a question, how well do you control the use of dtypes (i.e. your variables and function arguments named dtype)? These can mean a few different things:

NumPy np.dtype instances, which have the most information (including, for instance, endianness and even units on temporal types) but "np.dtype" is only one Python type. The distinctions among integers, floats, booleans, etc. are different values of this type. A NumPy scalar cannot have Python type np.dtype (e.g. isinstance(np.int32(123), np.dtype) is false).
NumPy numeric types, such as np.int32 and np.float64. Each of these is a distinct Python type, and NumPy scalars can have these types (e.g. isinstance(np.int32(123), np.integer) is true).
Python types, such as int, float, and bool. (isinstance(np.int32(123), int) is false.)
Strings that represent dtypes, such as "int32" or "i4" for np.dtype(np.int32), which you don't seem to be using.
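The four meanings listed above can be told apart at the prompt. This is just a small runnable restatement of those isinstance checks, using the same np.int32(123) scalar:

```python
import numpy as np

scalar = np.int32(123)

# 1. np.dtype instance: a value that describes a type; scalars are not instances of it.
dt = np.dtype(np.int32)
is_dtype_instance = isinstance(scalar, np.dtype)

# 2. NumPy numeric type: a real Python class that scalars are instances of.
is_numpy_integer = isinstance(scalar, np.integer)

# 3. Python type: NumPy's int32 scalar is not a Python int.
is_python_int = isinstance(scalar, int)

# 4. Strings: "int32" and "i4" both name the same dtype as np.dtype(np.int32).
strings_match = np.dtype("int32") == np.dtype("i4") == dt

print(is_dtype_instance, is_numpy_integer, is_python_int, strings_match)
```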

The reason I ask is because you're using Python bool for the dtype of the boolean column, but this is different from NumPy's boolean type, which is np.bool_. They even have different numeric towers: Python's opinion is that booleans are a subtype of integers:

>>> issubclass(bool, int)
True

but NumPy thinks that booleans are distinct from integers:

>>> import numpy as np
>>> issubclass(np.bool_, np.integer)
False

(Python is wrong, by the way. Especially about truthiness, which I consider to be the worst feature of the Python language because it's not just dynamic typing, which is fine, it's actually weak typing, which is not fine. Weak typing made Perl completely unusable, and that's what sent me to Python in the first place.)

The interchangeability of these things favors hacking away at a data analysis, but it can make a mess of programming in the large, so we just need to be conscious of these distinctions and use them consistently. Personally, I use np.dtype as types in some contexts (when it's good for types to be Python values) and np.number subclasses in others (when it's good for types to be Python types), and generally don't use the Python types like bool or the string representations at all.
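One way to keep these consistent is to normalize whatever comes in to np.dtype at the boundary. The helper name `as_dtype` below is hypothetical (it is not part of uproot); the sketch only shows that Python bool, np.bool_, and the string spellings all collapse to the same np.dtype, whose `.type` is the NumPy scalar class.

```python
import numpy as np


def as_dtype(spec):
    """Normalize a Python type, NumPy scalar type, string, or np.dtype to np.dtype."""
    return np.dtype(spec)


# Different spellings of "boolean" that might be passed around as a dtype.
boolean_specs = [bool, np.bool_, "bool", "?", np.dtype(np.bool_)]
normalized = [as_dtype(spec) for spec in boolean_specs]

# All of them compare equal once normalized.
all_equal = all(dt == normalized[0] for dt in normalized)

# The corresponding NumPy scalar class, for contexts that need a Python type.
scalar_type = normalized[0].type

print(all_equal, scalar_type is np.bool_)
```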

src/uproot/models/RNTuple.py

jpivarski · 2022-08-20T16:54:06Z

Oh, and one more thing: after we drop Python 3.6, the minimum version of NumPy becomes 1.16.5, which is still too early to assume that np.unpackbits has a bitorder argument.

Fortunately, there's a cool hack:

>>> import numpy as np
>>> # the default
>>> np.unpackbits(np.array([123, 71], "u1"))
array([0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1], dtype=uint8)

>>> # is the same as bitorder="b", which we don't want
>>> np.unpackbits(np.array([123, 71], "u1"), bitorder="b")
array([0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1], dtype=uint8)

>>> # we want bitorder="l"
>>> np.unpackbits(np.array([123, 71], "u1"), bitorder="l")
array([1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0], dtype=uint8)

>>> # so don't specify a bitorder (default "b") and reverse the bits within each group of 8:
>>> np.unpackbits(np.array([123, 71], "u1")).reshape(-1, 8)[:, ::-1].reshape(-1)
array([1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0], dtype=uint8)

The bitorder flag might be faster (haven't checked), but the reshape-slice-reshape method will always work.
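Wrapped up as a function, the reshape-slice-reshape trick might look like the sketch below. The name `unpackbits_little` is hypothetical, and the trimming step at the end is an assumption about how a bit column with a non-multiple-of-8 number of entries would be handled.

```python
import numpy as np


def unpackbits_little(data):
    """Unpack uint8 bytes into bits, least-significant bit first, without bitorder."""
    data = np.asarray(data, dtype=np.uint8)
    # np.unpackbits defaults to most-significant bit first within each byte,
    # so view the result as rows of 8 bits and reverse each row.
    return np.unpackbits(data).reshape(-1, 8)[:, ::-1].reshape(-1)


packed = np.array([123, 71], dtype=np.uint8)
bits = unpackbits_little(packed)

# If only 11 of the 16 bits are real entries, drop the padding at the end.
num_entries = 11
trimmed = bits[:num_entries].astype(np.bool_)

print(bits.tolist())
```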

Moelf · 2022-08-20T22:34:07Z

@jpivarski addressed the variable name style, dtype and bitorder comment

jpivarski

It's good! Go ahead and squash-and-merge.

Moelf added 2 commits August 19, 2022 00:49

trying to work with multiple clusters

c25ff6a

fix offset

f0caf63

Moelf force-pushed the RNTuple branch from c8c2156 to f0caf63 Compare August 19, 2022 05:06

Moelf changed the title ~~[WIP] Multiple clusters (group) support for RNTuple~~ [WIP] Multiple clusters support for RNTuple Aug 19, 2022

fix multiple cluster

5d8c557

Moelf force-pushed the RNTuple branch from 7739288 to 5d8c557 Compare August 19, 2022 05:18

rename variables for more consistency

26dcc1e

Moelf force-pushed the RNTuple branch from a10fe77 to 26dcc1e Compare August 19, 2022 12:22

pre-commit-ci bot and others added 3 commits August 19, 2022 12:22

[pre-commit.ci] auto fixes from pre-commit.com hooks

1b7ae04

for more information, see https://pre-commit.ci

fix bit reading

4b43800

[pre-commit.ci] auto fixes from pre-commit.com hooks

a1b9922

Moelf added 2 commits August 20, 2022 00:50

clean up

b938457

CI go

1d8b67d

Moelf force-pushed the RNTuple branch from 7931b04 to 1d8b67d Compare August 20, 2022 04:57

[pre-commit.ci] auto fixes from pre-commit.com hooks

2f06d34

Moelf requested a review from jpivarski August 20, 2022 04:59

Moelf changed the title ~~[WIP] Multiple clusters support for RNTuple~~ Multiple clusters support for RNTuple Aug 20, 2022

Moelf marked this pull request as ready for review August 20, 2022 05:00

jpivarski approved these changes Aug 20, 2022

Moelf added 2 commits August 20, 2022 18:29

clena up; harmonize use of dtype, dtype_byte, dtype_str; reduce singl…

6838a7b

…e-letter variable

don't assume numpy has bitorder='little'

10e16c9

Moelf force-pushed the RNTuple branch from e4b9de1 to 10e16c9 Compare August 20, 2022 22:33

Moelf requested a review from jpivarski August 20, 2022 22:34

[pre-commit.ci] auto fixes from pre-commit.com hooks

64f1b00

jpivarski approved these changes Aug 21, 2022

Moelf merged commit e7e8be1 into scikit-hep:main Aug 21, 2022

Moelf mentioned this pull request Sep 12, 2022

feat: Infrastructure for writing of RNTuple (incomplete functionality) #705

Merged
