-
Notifications
You must be signed in to change notification settings - Fork 10
nmt: switch NS convention to big-endian #67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Current dependencies on/for this PR:
This stack of pull requests is managed by Graphite. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this changes is needed otherwise the chain would panic if two blobs with namespace 1 and 256 are submitted in this order because they are ordered from an integer perspective but as soon as those blobs are pushed in the nmt-tree it would see the bytes as BE and if the integer were converted to bytes using LE then the workspace are not in order anymore and thus the nmt-rs library would return Error, unwrapped by sugondat-nmt because from its view they were ordered
a531099
to
ecca8d6
Compare
1d56a3b
to
f385b17
Compare
sugondat-nmt/src/ns.rs
Outdated
pub fn with_namespace_id(namespace_id: u32) -> Self { | ||
Self(namespace_id) | ||
/// Convert the given big-endian 32-bit integer to a namespace ID. | ||
pub fn from_be_u32(namespace_id: u32) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why from_be_u32
? the main point of the function is that the u32 is converted to a BE and not converted from BE (because the memory representation of the input argument is agnostic to the core of the function)
I propose something like from_u32_to_be
or better to just from_u32
and treat the byes as BE internally in the library
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from_u32_be
/ from_u32_le
would be a good consistent API.
Alternatively, we can keep only from_raw_bytes
and have the caller of the API decide whether to use big-endian or little-endian.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because the memory representation of the input argument is agnostic to the core of the function
You are correct and I am really happy you get it. I spent quite some time explaining this to so-called lead developers.
That said, it's still possible to view the number as it's stored in the memory. See for example u32::to_le
and ::from_le
. They operate on the concept of target endianess and according to that we can view this input argument as in target endianess.
So what this function is, it is a convenience function that takes a value in the target endianess and extracts it's big endian representation. Semantically, it's the same operation as u32::to_be_bytes
. However, to_be_bytes
takes self: u32
and returns another type [u8; 32]
. In our case here, we need to take u32
and return self: [u8; N]
. That's how the name from_be_u32
came about!
Now,
or better to just
from_u32
We are talking about a type that is canonically represented as [u8; N]. It seems to me that it's important for the caller to know which way BE, LE or even TE (which are all reasonable guesses) the u32 is going to be treated. That's why I think it warrants to rub in the face that it's be
.
I would rather
keep only from_raw_bytes and have the caller of the API decide whether to use big-endian or little-endian.
as @rphmeier suggested, than having from_u32
. Now, why I would avoid doing that is I would like to introduce a canonical way of interpreting a number as a namespace so that we don't encourage using it both ways. Introducing, from_le_u32
would kinda do that I feel. And, as it maybe clear from the OP, I think BE should be enough for everybody.
That all said, I am not married on this naming/semantics/logic. In fact, I don't think it matters too much and I wrote all the above just for the sake of getting on the same page, for education and hopefully for posterity. So let me know if you still want to change the naming and/or add the LE variant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation! Now I see you're point and it makes more sense, of course I also don't push really hard for a specific name, the meaning of the conversion arrives anyway!
If I have to vote for something I would vote for from_u32_be
just because it seems (at least for me) more clear at first view about what it does
f385b17
to
bb990d7
Compare
The canonical representation of a namespace ID is [u8; N] (N=4 for now). However, it seems to be useful at least for debugging purposes to parse and display the namespace as an unsigned integer. Prior this change, the integer form was defined as little-endian, just because it's a default. Most of modern architectures are LE by default, wasm is LE, etc. Mostly it doesn't matter, performance differences are neglegible either way 99% of the time. I suppose the reason for this is that LE has a marginal advantage because of the things like that you don't need byte shuffle when reinterpreting value type (e.g. u32 and u16) behind the same pointer and stuff like that. However, big endian also has it's benefits. First is a human readable representation, which is important here since this is more of a developer convenience representation. Another, benefit is that BE corresponds to the byte ordering. E.g., 256 > 255 (0x0100 > 0x00ff), which doesn't hold in LE. And we do care about the ordering of namespaces, at present because the NMT must literally be built in order, but in future I also think it would be a useful property to have blobs sorted. At the same time, the performance gains of using LE are neglegible. So given all of that I propose we change the canonical integer representation to BE.
bb990d7
to
c5dccf5
Compare
The canonical representation of a namespace ID is [u8; N] (N=4 for now).
However, it seems to be useful at least for debugging purposes to
parse and display the namespace as an unsigned integer.
Prior this change, the integer form was defined as little-endian, just
because it's a default. Most of modern architectures are LE by default,
wasm is LE, etc. Mostly it doesn't matter, performance differences
are neglegible either way 99% of the time. I suppose the reason for
this is that LE has a marginal advantage because of the things like
that you don't need byte shuffle when reinterpreting value type (e.g.
u32 and u16) behind the same pointer and stuff like that.
However, big endian also has it's benefits. First is a human readable
representation, which is important here since this is more of a
developer convenience representation. Another, benefit is that
BE corresponds to the byte ordering. E.g., 256 > 255 (0x0100 > 0x00ff),
which doesn't hold in LE. And we do care about the ordering of
namespaces, at present because the NMT must literally be built in
order, but in future I also think it would be a useful property to
have blobs sorted.
At the same time, the performance gains of using LE are neglegible.
So given all of that I propose we change the canonical integer representation
to BE.