Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple clusters support for RNTuple #682

Merged
merged 13 commits into from
Aug 21, 2022
Merged

Multiple clusters support for RNTuple #682

merged 13 commits into from
Aug 21, 2022

Conversation

Moelf
Copy link
Collaborator

@Moelf Moelf commented Aug 19, 2022

akako@archlinux ~/D/g/uproot4 (RNTuple)> python -c 'from pprint import pprint; import uproot as up; import 
skhep_testdata; filename = skhep_testdata.data_path("test_ntuple_large_bit_int64.root"); r = up.open(filena
me)["ntuple"]; pprint(r.arrays(entry_start=1, entry_stop=2).to_list())'
[{'one_bit': True, 'two_int64': 1}]

akako@archlinux ~/D/g/uproot4 (RNTuple)> python -c 'from pprint import pprint; import uproot as up; import 
skhep_testdata; filename = skhep_testdata.data_path("test_ntuple_large_bit_int64.root"); r = up.open(filena
me)["ntuple"]; pprint(r.arrays(entry_start=11111112, entry_stop=11111113).to_list())'
[{'one_bit': True, 'two_int64': 11111112},
 {'one_bit': True, 'two_int64': 11111113}]

the one_bit field is not being read out correctly because it's actually bit not byte encoded in the serialization

@Moelf Moelf changed the title [WIP] Multiple clusters (group) support for RNTuple [WIP] Multiple clusters support for RNTuple Aug 19, 2022
@Moelf
Copy link
Collaborator Author

Moelf commented Aug 19, 2022

inefficient but works already, this is reading across two clusters

> python -c 'from pprint import pprint; import uproot as up; import 
skhep_testdata; filename = skhep_testdata.data_path("test_ntuple_large_bit_int64.root"); r = up.open(filena
me)["ntuple"]; pprint(r.arrays(entry_start=1, entry_stop=11111113)[-10:].to_list())'
[{'one_bit': False, 'two_int64': 11111104},
 {'one_bit': True, 'two_int64': 11111105},
 {'one_bit': False, 'two_int64': 11111106},
 {'one_bit': False, 'two_int64': 11111107},
 {'one_bit': True, 'two_int64': 11111108},
 {'one_bit': True, 'two_int64': 11111109},
 {'one_bit': False, 'two_int64': 11111110},
 {'one_bit': False, 'two_int64': 11111111},
 {'one_bit': True, 'two_int64': 11111112},
 {'one_bit': True, 'two_int64': 11111113}]

@Moelf
Copy link
Collaborator Author

Moelf commented Aug 20, 2022

> python -c 'from pprint import pprint; import uproot as up; import skhep_testdata; f
ilename = skhep_testdata.data_path("test_ntuple_large_bit_int64.root"); r = up.open(filename)["ntuple"]; pprint(r.arrays(ent
ry_start=0, entry_stop=50).to_list())'
[{'one_bit': True, 'two_int64': 0},
 {'one_bit': False, 'two_int64': 1},
 {'one_bit': False, 'two_int64': 2},
 {'one_bit': False, 'two_int64': 3},
 {'one_bit': False, 'two_int64': 4},
 {'one_bit': True, 'two_int64': 5},
 {'one_bit': False, 'two_int64': 6},
 {'one_bit': False, 'two_int64': 7},
 {'one_bit': False, 'two_int64': 8},
 {'one_bit': False, 'two_int64': 9},
 {'one_bit': True, 'two_int64': 10},
 {'one_bit': False, 'two_int64': 11},

turns out I had one extra unpackbit at a different location due to wrong design decision made earlier, now with bitorder = "little" it works flawlessly

@Moelf Moelf requested a review from jpivarski August 20, 2022 04:59
@Moelf
Copy link
Collaborator Author

Moelf commented Aug 20, 2022

@jpivarski ready in the sense that with this:

  1. we can read bits column now
  2. we can also read multiple clusters

one "problem" is we don't have test files for 2. because it would be at least O(10) MB, we forgot we ask ROOT ppl how top make small cluster limit

@Moelf Moelf changed the title [WIP] Multiple clusters support for RNTuple Multiple clusters support for RNTuple Aug 20, 2022
@Moelf Moelf marked this pull request as ready for review August 20, 2022 05:00
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me!

I see that you've unified the read_pagedesc function by introducing an isbit flag, to divide by 1.0 in the usual case, which is okay. I personally would have just had two functions. (It was Lua's documentation that convinced me to ignore my instinctive fear of using floating point numbers for integer purposes as long as the values remain in the set of integers or fractions of 2 of integers. And as long as performance is not an issue, which is true here.)

Also on style, you have quite a few one-letter variables, many of them capital letters. There's a style guide (PEP 8?) that says that variables should start with a lower-case, and a one-letter capital technically violates that, though you tend to use them for types and template parameters in compiled languages have traditionally been one-letter capitals.

Just as a question, how well do you control the use of dtypes (i.e. your variables and function arguments named dtype)? These can mean a few different things:

  • NumPy np.dtype instances, which have the most information (including, for instance, endianness and even units on temporal types) but "np.dtype" is only one Python type. The distinctions among integers, floats, booleans, etc. are different values of this type. A NumPy scalar cannot have Python type np.dtype (e.g. isinstance(np.int32(123), np.dtype) is false).
  • NumPy numeric types, such as np.int32 and np.float64. Each of these is a distinct Python type, and NumPy scalars can have these types (e.g. isinstance(np.int32(123), np.integer) is true).
  • Python types, such as int, float, and bool. (isinstance(np.int32(123), int) is false.)
  • Strings that represent dtypes, such as "int32" or "i4" for np.dtype(np.int32), which you don't seem to be using.

The reason I ask is because you're using Python bool for the dtype of the boolean column, but this is different from NumPy's boolean type, which is np.bool_. They even have different numeric towers: Python's opinion is that booleans are a subtype of integers:

>>> issubclass(bool, int)
True

but NumPy thinks that booleans are distinct from integers:

>>> issubclass(np.bool_, np.integer)
False

(Python is wrong, by the way. Especially about truthiness, which I consider to be the worst feature of the Python language because it's not just dynamic typing, which is fine, it's actually weak typing, which is not fine. Weak typing made Perl completely unusable, and that's what sent me to Python in the first place.)

The interchangeability of these things favors hacking away at a data analysis, but it can make a mess of programming in the large, so we just need to be conscious of these distinctions and use them consistently. Personally, I use np.dtype as types in some contexts (when it's good for types to be Python values) and np.number subclasses in others (when it's good for types to be Python types), and generally don't use the Python types like bool or the string representations at all.

src/uproot/models/RNTuple.py Outdated Show resolved Hide resolved
@jpivarski
Copy link
Member

Oh, and one more thing: after we drop Python 3.6, the minimum version of NumPy becomes 1.16.5, which is still too early to assume that np.unpackbits has a bitorder argument.

Fortunately, there's a cool hack:

>>> # the default
>>> np.unpackbits(np.array([123, 71], "u1"))
array([0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1], dtype=uint8)

>>> # is the same is bitorder="b", which we don't want
>>> np.unpackbits(np.array([123, 71], "u1"), bitorder="b")
array([0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1], dtype=uint8)

>>> # we want bitorder="l"
>>> np.unpackbits(np.array([123, 71], "u1"), bitorder="l")
array([1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0], dtype=uint8)

>>> # so don't specify a bitorder ("b") and rotate in groups of 8:
>>> np.unpackbits(np.array([123, 71], "u1")).reshape(-1, 8)[:, ::-1].reshape(-1)
array([1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0], dtype=uint8)

The bitorder flag might be faster (haven't checked), but the reshape-slice-reshape method will always work.

@Moelf
Copy link
Collaborator Author

Moelf commented Aug 20, 2022

@jpivarski addressed the variable name style, dtype and bitorder comment

@Moelf Moelf requested a review from jpivarski August 20, 2022 22:34
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good! Go ahead and squash-and-merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants