Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue for integer types conversion to and from byte arrays #52963

Closed
SimonSapin opened this Issue Aug 1, 2018 · 40 comments

Comments

Projects
None yet
@SimonSapin
Copy link
Contributor

SimonSapin commented Aug 1, 2018

A previous incarnation of this API that went though FCP in #49792, but stabilization was reverted while still in Nightly in favor of this API.

Implemented in #51919, conversions between all integer types and appropriately-sized arrays of u8, using the specified endianness:

impl $Int {
    pub fn to_ne_bytes(self) -> [u8; mem::size_of::<Self>()] {…}
    pub fn to_le_bytes(self) -> [u8; mem::size_of::<Self>()] {…}
    pub fn to_be_bytes(self) -> [u8; mem::size_of::<Self>()] {…}
    pub fn from_ne_bytes(bytes: [u8; mem::size_of::<Self>()]) -> Self {…}
    pub fn from_le_bytes(bytes: [u8; mem::size_of::<Self>()]) -> Self {…}
    pub fn from_be_bytes(bytes: [u8; mem::size_of::<Self>()]) -> Self {…}
}

ne, le, and be mean native-endian, little-endian, and big-endian respectively. Native-endian means the target platform’s endianness. Those conversions are implemented as literally transmute and nothing else (but safe). The other conversions use swap or do not swap the byte order depending on the target’s endianness.

@SimonSapin

This comment has been minimized.

Copy link
Contributor Author

SimonSapin commented Aug 7, 2018

I think this feature is ready for stabilization:

@rfcbot fcp merge

This is very soon after landing in Nightly in its current form, but the same functionality already went through FCP for stabilization in #49792 after a couple months in Nightly. The new API contains methods identical to the previous API (only renamed), plus new methods that implement what was previously documented as a typical usage of the previous API.

So there isn’t really anything new here, I expect this final period to be a formality except for possible last-minute naming bikeshed. (The methods landed with consistent abbreviation, another option that was considered was fully expansion with for example to_native_endian_bytes and from_big_endian_bytes.)

@rfcbot

This comment has been minimized.

Copy link

rfcbot commented Aug 7, 2018

Team member @SimonSapin has proposed to merge this. The next step is review by the rest of the tagged teams:

Concerns:

Once a majority of reviewers approve (and none object), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot

This comment has been minimized.

Copy link

rfcbot commented Aug 9, 2018

🔔 This is now entering its final comment period, as per the review above. 🔔

@Kimundi

This comment has been minimized.

Copy link
Member

Kimundi commented Aug 10, 2018

@rfcbot concern missing_impls

I just noticed that we are still missing half the implementation for this.

#51919 did just add the changes for signed integers, but not for unsigned.

Compare:

https://doc.rust-lang.org/nightly/std/primitive.u8.html#method.to_bytes
https://doc.rust-lang.org/nightly/std/primitive.i8.html#method.to_ne_bytes

cc @tbu-

@SimonSapin

This comment has been minimized.

Copy link
Contributor Author

SimonSapin commented Aug 14, 2018

@Kimundi I belive #53358 fixes your concern. (Could you review it?)

@shepmaster

This comment has been minimized.

Copy link
Member

shepmaster commented Aug 16, 2018

ne [...] means native-endian

I keep reading it as "not equal".

@SimonSapin

This comment has been minimized.

Copy link
Contributor Author

SimonSapin commented Aug 17, 2018

@shepmaster I agree that’s a real negative point. The other naming pattern that was considered (consistent full expansion) is rather verbose: from_native_endian_bytes, to_little_endian_bytes. Do you think it would still be preferable? Or perhaps some other set of names?

frewsxcv added a commit to frewsxcv/rust that referenced this issue Aug 17, 2018

Rollup merge of rust-lang#53358 - SimonSapin:int-bytes, r=shepmaster
`{to,from}_{ne,le,be}_bytes` for unsigned integer types

Same as rust-lang#51919 did for signed integers.

Tracking issue: rust-lang#52963
@Kimundi

This comment has been minimized.

Copy link
Member

Kimundi commented Aug 18, 2018

@SimonSapin: Yeah, seems to be all in order now.

@rfcbot resolved missing_impls

@rfcbot

This comment has been minimized.

Copy link

rfcbot commented Aug 18, 2018

🔔 This is now entering its final comment period, as per the review above. 🔔

rsdy added a commit to redsift/kdb-rs-hash that referenced this issue Aug 20, 2018

@CasualX

This comment has been minimized.

Copy link

CasualX commented Aug 29, 2018

That sounds fair, thanks for indulging my commentary.

@krdln

This comment has been minimized.

Copy link
Contributor

krdln commented Aug 30, 2018

I didn't see it mentioned in this thread, so just a small comment about the the naming of ne – before reading description, I've immediately thought about this as "network endianness" (which is basically synonymous to be) only to learn later it was expected to mean "native endianness". Perhaps we can use he for "host endianness" to avoid ambiguity? This also plays nicely with htons and ntohs-like names. Although, h and b are quite similar, so this choice has its flaws.

@SimonSapin

This comment has been minimized.

Copy link
Contributor Author

SimonSapin commented Aug 31, 2018

Ok, this is the second (different) misinterpretation of ne (the other being "not equal") reported in this thread, so maybe we should do something about it? As mentioned previously I think that this meaning of "host" only makes sense in the context of networking, but this APIs are not necessarily networking-related. But other than that, maybe he is not too bad.

@rust-lang/libs, any thoughts?

@kjetilkjeka

This comment has been minimized.

Copy link

kjetilkjeka commented Aug 31, 2018

Host (he) will give the wrong association if interpreted in the cross compilation context (host vs target). Making it he might therefore also be confusing. I think this was mentioned in some previous discussion?

If the name is being changed it should preferably be to something that will resolve all ambigouties, not change some ambigouties for different ones. Due to the abbriviation it is already expected that users will have to look up the function in the docs the first time they're using it. Because of this, less idiomatic alternatives might be OK. In these situations there are also arguments for the idiosyncratic alternative to be better, as it makes it difficult to assume things, forcing users to look it up in the docs.

What about:

  • me - machine endianness (might look to similar to ne?)
  • ae - architecture endianness
  • te - target endianness (not common to speak of target unless cross compiling as target==host otherwise)
  • pe - processor/platform endianness
  • cpue - quite descriptive but doesn't visualy look like be or ne. It might make sense that the longest alternative is the one you're least likely to use (most prone to bugs due to platform specific behavior)
@SimonSapin

This comment has been minimized.

Copy link
Contributor Author

SimonSapin commented Sep 3, 2018

"Host" makes the wrong kind of sense in cross-compilation context, since this API is using the target machine’s endianness with #[cfg(not(target_endian = "…"))]

@kjetilkjeka

This comment has been minimized.

Copy link

kjetilkjeka commented Sep 3, 2018

"Host" makes the wrong kind of sense in cross-compilation context, since this API is using the target machine’s endianness with #[cfg(not(target_endian = "…"))]

Yes this was my exact concern. That's why i suggested to change it to something other than he if a change was going to happen. I see now that my comment might be interpreted the other way as well, I'll do a minor edit to fix that.

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Sep 3, 2018

If there's confusion about the short names then I'd personally err on the side of more wordy names and expand out he to host_endian or (and similar for other names)

@kallisti5

This comment has been minimized.

Copy link
Contributor

kallisti5 commented Sep 18, 2018

aww yis. Excited for this one. It makes some clean looking code.

    let magic_bytes = header.magic.to_ne_bytes();
    if magic_bytes != ['h', 'p', 'k', 'r'] {
@CasualX

This comment has been minimized.

Copy link

CasualX commented Sep 18, 2018

@kallisti5 Did you mean to write byte character literals instead of chars? Assuming they're meant to be byte literal characters you can go even further and do it like this ;)

    let magic_bytes = header.magic.to_ne_bytes();
    if magic_bytes != *b"hpkr" {

thomcc added a commit to thomcc/rusqlite that referenced this issue Oct 8, 2018

Add a feature for storing i128 as blobs.
This is behind the `i128_blob` feature.

Blobs are stored as 16 byte big-endian values, with their most significant bit
flipped. This is so that sorting, comparison, etc all work properly, even with
negative numbers. This also allows the representation to be stable across
different computers.

It's possible that the `FromSql` implementation should handle the case that the
real value is stored in an integer. I didn't do this, but would be willing to
make the change. I don't think we should store them this way though, since I
don't think users would be able to sort/compare them sanely.

Support for `u128` is not implemented, as comparison with i128 values would work
strangely. This also is consistent with `u64` not being allowed, not that I
think that would be reason enough on it's own.

The `byteorder` crate is used if this feature is flipped, as it's quite small
and implements things more or less optimally. If/when `i128::{to,from}_be_bytes`
gets stabilized (rust-lang/rust#52963), we should
probably use that instead.
@gamozolabs

This comment has been minimized.

Copy link

gamozolabs commented Oct 9, 2018

Not sure if it's too late, but is there a reason the from routines take in an array rather than a slice? Taking a random chunk of out a slice and converting it to a number would be rather clunky as it would have to be copied to an intermediate fixed size array for holding.

In terms of performance a slice version would only have a runtime cost if used with an actual slice, since it's inline the bounds check on a fixed size array would be omit.

I ran into this today and it was a bit annoying.

@shepmaster

This comment has been minimized.

Copy link
Member

shepmaster commented Oct 9, 2018

take in an array rather than a slice

By taking in an array, you are statically guaranteed that there's enough data to make the transformation — you don't have to even allow for the error case.

Taking a random chunk of out a slice and converting it to a number would be rather clunky as it would have to be copied to an intermediate fixed size array for holding.

With TryFrom / TryInto, it's not that bad:

#![feature(int_to_from_bytes, try_from)]

use std::convert::TryInto;

fn example(bytes: &[u8]) -> i32 {
    let bytes: &[_; 4] = bytes.try_into().expect("Wrong number of bytes");
    i32::from_be_bytes(*bytes)
}
@gamozolabs

This comment has been minimized.

Copy link

gamozolabs commented Oct 9, 2018

Yeah, I agree, but it seems that a slice is just a superset of the abilities of an array.

While I agree it's cleaner and makes more sense in a standalone implementation, I don't know of many situations where I'm dealing with something I need to convert into a number did not come from a bigger vector or slice. At least for me I don't ever see using this functionality without using a wrapper of the sorts of the snippit you posted.

I guess I'm not sure how often you are dealing with packed data and are working with fixed sized arrays as the source of this data is almost always network or file streams. The only place I work with fixed sized arrays when working with files/network is when I'm casting to a structure, and at that point I'm just going to directly cast it to the u64 and skip the middle step.

The TryInto/TryFrom implentation you mention now is exactly what I expose as a basic interface in most of my tiny applications that don't warrant my safecast library. In this case the new introduction of the from_*_bytes() interfaces provide no improvement to usability compared to what I already have as I would have to write the example wrapper you posted above, which is of the same complexity and lesser portability than my existing version (lesser portability due to relying on another experimental feature).

I have a library https://github.com/gamozolabs/safecast which allows me to safely cast or copy things with runtime checks if needed. I pretty much must use this library in all of my Rust projects as almost everything I do is working with network/file/system level packed data.

To me this just happens to be a very sensitive issue as operating on binary data from safe Rust is incredibly clunky, and is the last major issue I have personally with the language. While crates like byteorder exist I do not believe a systems language should require a third party piece of code to be able to manipulate binary data in a safe way.

No hard feelings, just trying to chime in about the usability of this API from my use cases (kernel development / fuzzer development). Many of my peers (other security researchers) who also work on developing code to parse/mutate/extract information from binary packed data have had similar gripes and ultimately a few have refused to use the language for these purposes as it's not feasible for codebases that largely operate on raw binary data.

I likely will propose an RFC along the lines of my safecast library which allows for safe copying and dynamically-checked casting of PoD types that would just be exposed as another derive for structures/enums.

-B

@SimonSapin

This comment has been minimized.

Copy link
Contributor Author

SimonSapin commented Oct 9, 2018

As shepmaster suggested, what does an API taking a slice do if the input is too short? Too long? Does it panic, or ignore extra items, or do we change the return type to a Result? Taking an array avoids this question, and the concerns of how to obtain that array can be separated and left to other APIs like TryFrom.

In my opinion the main goal of this API is to replace some unsafe calls to transmute with safe method calls. It doesn’t by itself need to replace byteorder or safecast entirely, and I think it’s fine to both have this and later add some higher-level convenience APIs to the standard library.

@oconnor663

This comment has been minimized.

Copy link
Contributor

oconnor663 commented Oct 13, 2018

I'm not familiar with exactly what #[inline] means in the compiler, but I notice that byteorder also uses it, and yet I think I've seen cases where I still ended up paying the cost of bounds checks.

@SimonSapin

This comment has been minimized.

Copy link
Contributor Author

SimonSapin commented Oct 13, 2018

@oconnor663

This comment has been minimized.

Copy link
Contributor

oconnor663 commented Oct 14, 2018

Right, I meant to say, I'm not sure what makes the compiler decide when to follow the hint and when not to. I thought I'd seen an issue with byteorder::LittleEndian::read_u64 not eliding the bounds checks when it could've, but now that I look more closely at my own code, it looks like it's actually read_u64_into (which copies a bunch of ints at once) that's somehow failing to elide. So maybe not relevant to our discussion here.

gwenn added a commit to gwenn/rusqlite that referenced this issue Oct 28, 2018

Add a feature for storing i128 as blobs.
This is behind the `i128_blob` feature.

Blobs are stored as 16 byte big-endian values, with their most significant bit
flipped. This is so that sorting, comparison, etc all work properly, even with
negative numbers. This also allows the representation to be stable across
different computers.

It's possible that the `FromSql` implementation should handle the case that the
real value is stored in an integer. I didn't do this, but would be willing to
make the change. I don't think we should store them this way though, since I
don't think users would be able to sort/compare them sanely.

Support for `u128` is not implemented, as comparison with i128 values would work
strangely. This also is consistent with `u64` not being allowed, not that I
think that would be reason enough on it's own.

The `byteorder` crate is used if this feature is flipped, as it's quite small
and implements things more or less optimally. If/when `i128::{to,from}_be_bytes`
gets stabilized (rust-lang/rust#52963), we should
probably use that instead.

SimonSapin added a commit to SimonSapin/rust that referenced this issue Nov 25, 2018

SimonSapin added a commit to SimonSapin/rust that referenced this issue Nov 25, 2018

@SimonSapin

This comment has been minimized.

Copy link
Contributor Author

SimonSapin commented Nov 25, 2018

FCP to merge has completed a while ago, but there has been a number of additional comments since. However I believe not enough reasons to change the API. If you disagree, now is (past) the time to say so. Here is a stabilization PR: #56207

@nagisa

This comment has been minimized.

Copy link
Contributor

nagisa commented Nov 25, 2018

As shepmaster suggested, what does an API taking a slice do if the input is too short? Too long? Does it panic, or ignore extra items, or do we change the return type to a Result? Taking an array avoids this question, and the concerns of how to obtain that array can be separated and left to other APIs like TryFrom.

Note, that this was the same with the API for char::encode_utf8 (which was char::encode_utf8(&mut [u8; 4]) -> usize, which was later changed to its current char::encode_utf8(&mut [u8]) -> &str (panicking if the buffer is not large enough) for convenience reasons.

Not that I’ve have any preference either way.

pietroalbini added a commit to pietroalbini/rust that referenced this issue Nov 25, 2018

Rollup merge of rust-lang#56207 - SimonSapin:int_to_from_bytes, r=nagisa
Stabilize the int_to_from_bytes feature

Fixes rust-lang#52963

FCP to merge completed: rust-lang#52963 (comment)
@SimonSapin

This comment has been minimized.

Copy link
Contributor Author

SimonSapin commented Nov 25, 2018

#56216 will hopefully make it easier to use these APIs with slices.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.