-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multiple clusters support for RNTuple #682
Conversation
inefficient but works already, this is reading across two clusters > python -c 'from pprint import pprint; import uproot as up; import
skhep_testdata; filename = skhep_testdata.data_path("test_ntuple_large_bit_int64.root"); r = up.open(filena
me)["ntuple"]; pprint(r.arrays(entry_start=1, entry_stop=11111113)[-10:].to_list())'
[{'one_bit': False, 'two_int64': 11111104},
{'one_bit': True, 'two_int64': 11111105},
{'one_bit': False, 'two_int64': 11111106},
{'one_bit': False, 'two_int64': 11111107},
{'one_bit': True, 'two_int64': 11111108},
{'one_bit': True, 'two_int64': 11111109},
{'one_bit': False, 'two_int64': 11111110},
{'one_bit': False, 'two_int64': 11111111},
{'one_bit': True, 'two_int64': 11111112},
{'one_bit': True, 'two_int64': 11111113}] |
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
> python -c 'from pprint import pprint; import uproot as up; import skhep_testdata; f
ilename = skhep_testdata.data_path("test_ntuple_large_bit_int64.root"); r = up.open(filename)["ntuple"]; pprint(r.arrays(ent
ry_start=0, entry_stop=50).to_list())'
[{'one_bit': True, 'two_int64': 0},
{'one_bit': False, 'two_int64': 1},
{'one_bit': False, 'two_int64': 2},
{'one_bit': False, 'two_int64': 3},
{'one_bit': False, 'two_int64': 4},
{'one_bit': True, 'two_int64': 5},
{'one_bit': False, 'two_int64': 6},
{'one_bit': False, 'two_int64': 7},
{'one_bit': False, 'two_int64': 8},
{'one_bit': False, 'two_int64': 9},
{'one_bit': True, 'two_int64': 10},
{'one_bit': False, 'two_int64': 11}, turns out I had one extra |
for more information, see https://pre-commit.ci
@jpivarski ready in the sense that with this:
one "problem" is we don't have test files for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good to me!
I see that you've unified the read_pagedesc
function by introducing an isbit
flag, to divide by 1.0 in the usual case, which is okay. I personally would have just had two functions. (It was Lua's documentation that convinced me to ignore my instinctive fear of using floating point numbers for integer purposes as long as the values remain in the set of integers or fractions of 2 of integers. And as long as performance is not an issue, which is true here.)
Also on style, you have quite a few one-letter variables, many of them capital letters. There's a style guide (PEP 8?) that says that variables should start with a lower-case, and a one-letter capital technically violates that, though you tend to use them for types and template parameters in compiled languages have traditionally been one-letter capitals.
Just as a question, how well do you control the use of dtypes (i.e. your variables and function arguments named dtype
)? These can mean a few different things:
- NumPy
np.dtype
instances, which have the most information (including, for instance, endianness and even units on temporal types) but "np.dtype
" is only one Python type. The distinctions among integers, floats, booleans, etc. are different values of this type. A NumPy scalar cannot have Python typenp.dtype
(e.g.isinstance(np.int32(123), np.dtype)
is false). - NumPy numeric types, such as
np.int32
andnp.float64
. Each of these is a distinct Python type, and NumPy scalars can have these types (e.g.isinstance(np.int32(123), np.integer)
is true). - Python types, such as
int
,float
, andbool
. (isinstance(np.int32(123), int)
is false.) - Strings that represent dtypes, such as
"int32"
or"i4"
fornp.dtype(np.int32)
, which you don't seem to be using.
The reason I ask is because you're using Python bool
for the dtype of the boolean column, but this is different from NumPy's boolean type, which is np.bool_
. They even have different numeric towers: Python's opinion is that booleans are a subtype of integers:
>>> issubclass(bool, int)
True
but NumPy thinks that booleans are distinct from integers:
>>> issubclass(np.bool_, np.integer)
False
(Python is wrong, by the way. Especially about truthiness, which I consider to be the worst feature of the Python language because it's not just dynamic typing, which is fine, it's actually weak typing, which is not fine. Weak typing made Perl completely unusable, and that's what sent me to Python in the first place.)
The interchangeability of these things favors hacking away at a data analysis, but it can make a mess of programming in the large, so we just need to be conscious of these distinctions and use them consistently. Personally, I use np.dtype
as types in some contexts (when it's good for types to be Python values) and np.number
subclasses in others (when it's good for types to be Python types), and generally don't use the Python types like bool
or the string representations at all.
Oh, and one more thing: after we drop Python 3.6, the minimum version of NumPy becomes 1.16.5, which is still too early to assume that np.unpackbits has a Fortunately, there's a cool hack: >>> # the default
>>> np.unpackbits(np.array([123, 71], "u1"))
array([0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1], dtype=uint8)
>>> # is the same is bitorder="b", which we don't want
>>> np.unpackbits(np.array([123, 71], "u1"), bitorder="b")
array([0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1], dtype=uint8)
>>> # we want bitorder="l"
>>> np.unpackbits(np.array([123, 71], "u1"), bitorder="l")
array([1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0], dtype=uint8)
>>> # so don't specify a bitorder ("b") and rotate in groups of 8:
>>> np.unpackbits(np.array([123, 71], "u1")).reshape(-1, 8)[:, ::-1].reshape(-1)
array([1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0], dtype=uint8) The |
@jpivarski addressed the variable name style, dtype and bitorder comment |
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's good! Go ahead and squash-and-merge.
the
one_bit
field is not being read out correctly because it's actually bit not byte encoded in the serialization