New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add low-level APIs for converting integers to strings #46281

Closed
wants to merge 9 commits into
base: master
from

Conversation

Projects
None yet
9 participants
@SimonSapin
Contributor

SimonSapin commented Nov 26, 2017

This adds the following new APIs on primitive integer types, unstable under the int_to_str feature gate:

impl $Int {
    pub fn to_uppercase_str_radix(self, buffer: &mut [u8], radix: u32) -> &mut str { /* … */ }
    pub fn to_lowercase_str_radix(self, buffer: &mut [u8], radix: u32) -> &mut str { /* … */ }
    pub fn to_str(self, buffer: &mut [u8]) -> &mut str { /* … */ }
    pub const MAX_STR_LEN: usize = /* … */;
}

They are the “mirror” APIs of from_str and from_str_radix.

The &mut [u8] -> &mut str pattern is the same as char::encode_utf8: take a (possibly stack-allocated) buffer, write into it, and return a slice of it. It is maximally flexible in terms of allocation/copying strategy.

to_str is equivalent to the existing impls of the fmt::Display trait (it uses the same optimized algorithm) but does not have the overhead of creating a fmt::Formatter and otherwise invoking the string formatting machinery. This overhead was deemed high enough to make a crate dedicated to avoiding it: https://github.com/dtolnay/itoa/

It always requires a buffer of at least MAX_STR_LEN bytes, even for values that would fit in a smaller buffer. This is to avoid adding bound checks to the algorithm. Alternatives are to add bound checks, or add them only for to_str and not Display with more (internal) generics magic.

to_lowercase_str_radix and to_lowercase_str_radix are similar to the existing impls of the fmt::LowerHex, UpperHex, Octal, and Binary traits; but they support more bases and behave differently for negative integers: they use a minus sign followed by the representation of the mathematically opposite integer, whereas the formatting traits use the two’s complement. For example:

assert_eq!(format!("{:X}", -26), "FFFFFFE6");
assert_eq!((-26).to_uppercase_str_radix(&mut [0; 8], 16), "-1A");

There’s two separate methods for upper v.s. lower case to avoid a bool parameter that wouldn’t be self-explanatory at call sites, or a dedicated enum that seems like more new API surface than is warranted. An alternative might be to pick one of upper or lower case and only provide that, but that seems arbitrary since neither is obviously superior IMO.

@SimonSapin

This comment has been minimized.

Show comment
Hide comment
@SimonSapin
Contributor

SimonSapin commented Nov 26, 2017

Show outdated Hide outdated src/libcore/num/mod.rs Outdated
@rkruppe

This comment has been minimized.

Show comment
Hide comment
@rkruppe

rkruppe Nov 27, 2017

Contributor

In general I'm in favor of exposing low-level APIs like that to be able to reuse the optimized libstd implementation as building block for tasks that can't or don't want to use std::fmt. However, this part:

to_lowercase_str_radix and to_lowercase_str_radix are similar to the existing impls of the fmt::LowerHex, UpperHex, Octal, and Binary traits; but [...] behave differently for negative integers

seems worrying. I can see the desire for the different behavior, but if these APIs are supposed to be the "high level of control" equivalent of std::fmt APIs, any difference in behavior is an unexpected footgun, especially when the difference is not due to accepting arbitrary bases rather than knowing the base ahead of time.


I'm also wondering whether it could be worthwhile to wait for const generics for the *_radix APIs. In the typical use cases I've seen, the radix is virtually always constant, and for the common case of a power-of-two radix, it being constant could enable nice optimizations.

Contributor

rkruppe commented Nov 27, 2017

In general I'm in favor of exposing low-level APIs like that to be able to reuse the optimized libstd implementation as building block for tasks that can't or don't want to use std::fmt. However, this part:

to_lowercase_str_radix and to_lowercase_str_radix are similar to the existing impls of the fmt::LowerHex, UpperHex, Octal, and Binary traits; but [...] behave differently for negative integers

seems worrying. I can see the desire for the different behavior, but if these APIs are supposed to be the "high level of control" equivalent of std::fmt APIs, any difference in behavior is an unexpected footgun, especially when the difference is not due to accepting arbitrary bases rather than knowing the base ahead of time.


I'm also wondering whether it could be worthwhile to wait for const generics for the *_radix APIs. In the typical use cases I've seen, the radix is virtually always constant, and for the common case of a power-of-two radix, it being constant could enable nice optimizations.

@SimonSapin

This comment has been minimized.

Show comment
Hide comment
@SimonSapin

SimonSapin Nov 27, 2017

Contributor

For what it’s worth, I believe that the footgun is that the formatting traits use the two’s complement representation in the first place. I trace the git history of this code and found no evidence that this isn’t accidental (such as discussion of this behavior or why it is desirable). I’d propose changing that behavior, but it’s been #[stable] long enough that we probably can’t. So all that’s left to do is documenting it #46285 and moving on.

I think of of these new APIs less as the low-level equivalent of formatting traits, and more as the inverse of from_str and from_str_radix. I’ve only mentioned the formatting trait to explain why the existing functionality is not sufficient.

As to making the radix a type parameter instead of a runtime parameter, we already have a from_str_radix that uses the latter and is stable. We can always add something else later, maybe to optimize hexadecimal specifically.

Contributor

SimonSapin commented Nov 27, 2017

For what it’s worth, I believe that the footgun is that the formatting traits use the two’s complement representation in the first place. I trace the git history of this code and found no evidence that this isn’t accidental (such as discussion of this behavior or why it is desirable). I’d propose changing that behavior, but it’s been #[stable] long enough that we probably can’t. So all that’s left to do is documenting it #46285 and moving on.

I think of of these new APIs less as the low-level equivalent of formatting traits, and more as the inverse of from_str and from_str_radix. I’ve only mentioned the formatting trait to explain why the existing functionality is not sufficient.

As to making the radix a type parameter instead of a runtime parameter, we already have a from_str_radix that uses the latter and is stable. We can always add something else later, maybe to optimize hexadecimal specifically.

@rkruppe

This comment has been minimized.

Show comment
Hide comment
@rkruppe

rkruppe Nov 27, 2017

Contributor

For what it’s worth, I believe that the footgun is that the formatting traits use the two’s complement representation in the first place.

I fully agree, but if it's here to stay, internal consistency also matters.

I think of of these new APIs less as the low-level equivalent of formatting traits, and more as the inverse of from_str and from_str_radix

I didn't consider this. A valid perspective, but these new APIs are also solving the same problem as std::fmt solves, so they're also connected to std::fmt. So we're between a rock and a hard place -- either be consistent with the other integer formatting APIs, or be consistent with the inverse function. I'm almost tempted to suggest changing the behavior of std::fmt.

As to making the radix a type parameter instead of a runtime parameter, we already have a from_str_radix that uses the latter and is stable

Good point.

Contributor

rkruppe commented Nov 27, 2017

For what it’s worth, I believe that the footgun is that the formatting traits use the two’s complement representation in the first place.

I fully agree, but if it's here to stay, internal consistency also matters.

I think of of these new APIs less as the low-level equivalent of formatting traits, and more as the inverse of from_str and from_str_radix

I didn't consider this. A valid perspective, but these new APIs are also solving the same problem as std::fmt solves, so they're also connected to std::fmt. So we're between a rock and a hard place -- either be consistent with the other integer formatting APIs, or be consistent with the inverse function. I'm almost tempted to suggest changing the behavior of std::fmt.

As to making the radix a type parameter instead of a runtime parameter, we already have a from_str_radix that uses the latter and is stable

Good point.

@shepmaster

This comment has been minimized.

Show comment
Hide comment
@shepmaster

shepmaster Dec 1, 2017

Member

Slipped through assignment...

r? @dtolnay

Member

shepmaster commented Dec 1, 2017

Slipped through assignment...

r? @dtolnay

/// assert_eq!(i16::min_value().to_str(&mut [0; i16::MAX_STR_LEN]), "-32768")
/// ```
#[unstable(feature = "int_to_str", issue = /* FIXME */ "0")]
pub fn to_str(self, buffer: &mut [u8]) -> &mut str {

This comment has been minimized.

@dtolnay

dtolnay Dec 7, 2017

Member

I am not thrilled about how an inherent method to_str (which is almost never what you want) competes in rustdoc searches and autocomplete with a trait method to_string (which is almost always what you want). Would it make sense to name this in a scarier way?

@dtolnay

dtolnay Dec 7, 2017

Member

I am not thrilled about how an inherent method to_str (which is almost never what you want) competes in rustdoc searches and autocomplete with a trait method to_string (which is almost always what you want). Would it make sense to name this in a scarier way?

@dtolnay

This comment has been minimized.

Show comment
Hide comment
@dtolnay

dtolnay Dec 7, 2017

Member

It would be neat to explore how typenum could work for specifying the radix, similar to what #46281 (comment) suggested.

I am sympathetic to these being the inverse of from_str_radix and not the inverse of LowerHex etc. I think that is the right choice for what to expose.

println!("{}", <i16>::from_str_radix("-ff", 16).unwrap());
println!("{}", <i16>::from_str_radix("ffff", 16).unwrap_err());

Overall I would prefer to have these bake in a high-quality integer formatting library on crates.io first. Let's run it by the rest of the team in case someone wants to justify why these need to be exposed by std.

@rfcbot fcp close

Member

dtolnay commented Dec 7, 2017

It would be neat to explore how typenum could work for specifying the radix, similar to what #46281 (comment) suggested.

I am sympathetic to these being the inverse of from_str_radix and not the inverse of LowerHex etc. I think that is the right choice for what to expose.

println!("{}", <i16>::from_str_radix("-ff", 16).unwrap());
println!("{}", <i16>::from_str_radix("ffff", 16).unwrap_err());

Overall I would prefer to have these bake in a high-quality integer formatting library on crates.io first. Let's run it by the rest of the team in case someone wants to justify why these need to be exposed by std.

@rfcbot fcp close

@rfcbot

This comment has been minimized.

Show comment
Hide comment
@rfcbot

rfcbot Dec 7, 2017

Team member @dtolnay has proposed to close this. The next step is review by the rest of the tagged teams:

No concerns currently listed.

Once these reviewers reach consensus, this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

rfcbot commented Dec 7, 2017

Team member @dtolnay has proposed to close this. The next step is review by the rest of the tagged teams:

No concerns currently listed.

Once these reviewers reach consensus, this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@SimonSapin

This comment has been minimized.

Show comment
Hide comment
@SimonSapin

SimonSapin Dec 7, 2017

Contributor

I don’t mind renaming to_str but I don’t have a good name idea right now. Do you have a suggestion?

It would be neat to explore how typenum could work for specifying the radix

Do we really want something like typenum in the standard library?

I would prefer to have these bake in a high-quality integer formatting library on crates.io first

Hasn’t that already happened with https://crates.io/crates/itoa ?

We could also keep only the to_str (possibly renamed) part of this PR, and leave the _radix case for later.

Contributor

SimonSapin commented Dec 7, 2017

I don’t mind renaming to_str but I don’t have a good name idea right now. Do you have a suggestion?

It would be neat to explore how typenum could work for specifying the radix

Do we really want something like typenum in the standard library?

I would prefer to have these bake in a high-quality integer formatting library on crates.io first

Hasn’t that already happened with https://crates.io/crates/itoa ?

We could also keep only the to_str (possibly renamed) part of this PR, and leave the _radix case for later.

@carols10cents

This comment has been minimized.

Show comment
Hide comment
@carols10cents

carols10cents Dec 12, 2017

Member

ping @BurntSushi @Kimundi @aturon @sfackler waiting for your ticky boxes here!

Member

carols10cents commented Dec 12, 2017

ping @BurntSushi @Kimundi @aturon @sfackler waiting for your ticky boxes here!

@SimonSapin

This comment has been minimized.

Show comment
Hide comment
@SimonSapin

SimonSapin Dec 13, 2017

Contributor

@dtolnay, FCP to close immediately after stating some concerns feels premature. What do you think of not keeping the _radix method, and renaming to_str to write_to for example?

Contributor

SimonSapin commented Dec 13, 2017

@dtolnay, FCP to close immediately after stating some concerns feels premature. What do you think of not keeping the _radix method, and renaming to_str to write_to for example?

@dtolnay

Thanks for the feedback Simon! I see what you mean. I apologize for not expressing my perspective clearly on the first try. It isn't so much that I had a few concerns -- it is that I have a general sense that it would be more productive to iterate on this feature in a library than adding it here now.

Even restricted to just the base-10 method, it feels like there is enough design space that we shouldn't hope to get it right on the first try. Having it bake in a library is better for everyone involved.

  • A library would be immediately available on the stable channel, so we can migrate code that currently uses itoa or similar to collect feedback. I would be happy to migrate serde_json as long as the new one is at least as fast.
  • A library can be versioned by Cargo so we avoid breaking nightly users as we settle on the best approach.

This is similar to the path that failure is following. For something like this, I believe bringing an established high-quality library to the table can be a faster path to stability in the standard library than iterating on #[unstable] stuff.

As an example of there being more design space than I think we have explored, here is a totally different approach that brings many advantages. Compare:

let mut buf = unsafe { mem::uninitialized::<[u8; i16::MAX_STR_LEN]>() };
let s = i.write_to(&mut buf);
// wrapper around uninitialized buffer of the right size
let mut buf = NumberBuffer::new();
let s = buf.stringify(i);

Advantages

  • Easier to use correctly.
  • Impossible to use incorrectly.
  • No more awkward MAX_STR_LEN constant.
  • No bounds check that your &[u8] is big enough.
  • No ambiguity about whether the &[u8] needs to be big enough to hold any integer, or just your integer.
  • No ambiguity about alignment of the result within the buffer.
  • Cannot panic.
  • Safe code gets to use an uninitialized buffer.
  • Googleable type name.
  • Forward compatible with controlling the base in a way that allocates a buffer of exactly the right size for your base using a const fn, struct NumberBuffer<const RADIX: usize = 10>.
  • No conflict with to_string autocomplete on the primitives.
  • The &mut is provided by autoref.
  • No ambiguity about whether the method stomps on the front of your buffer as scratch space.

Disadvantages

  • No longer exposes the ability to write right-aligned into an existing buffer. Unclear if this is ever valuable in practice.
  • Other things, I'm sure.
@SimonSapin

This comment has been minimized.

Show comment
Hide comment
@SimonSapin

SimonSapin Dec 14, 2017

Contributor

As maintainer of the itoa crate, how do you feel about adding NumberBuffer (IntegerBuffer?) there?

Contributor

SimonSapin commented Dec 14, 2017

As maintainer of the itoa crate, how do you feel about adding NumberBuffer (IntegerBuffer?) there?

@dtolnay

This comment has been minimized.

Show comment
Hide comment
@dtolnay

dtolnay Dec 14, 2017

Member

I don't think I am going to have bandwidth for this in the near term. It would be better if someone else can drive this and changes are not blocked on me, so I think releasing a different library would make more sense.

Member

dtolnay commented Dec 14, 2017

I don't think I am going to have bandwidth for this in the near term. It would be better if someone else can drive this and changes are not blocked on me, so I think releasing a different library would make more sense.

@rfcbot

This comment has been minimized.

Show comment
Hide comment
@rfcbot

rfcbot Dec 19, 2017

🔔 This is now entering its final comment period, as per the review above. 🔔

rfcbot commented Dec 19, 2017

🔔 This is now entering its final comment period, as per the review above. 🔔

@bors

This comment has been minimized.

Show comment
Hide comment
@bors

bors Dec 22, 2017

Contributor

☔️ The latest upstream changes (presumably #46922) made this pull request unmergeable. Please resolve the merge conflicts.

Contributor

bors commented Dec 22, 2017

☔️ The latest upstream changes (presumably #46922) made this pull request unmergeable. Please resolve the merge conflicts.

@dtolnay

This comment has been minimized.

Show comment
Hide comment
@dtolnay

dtolnay Dec 23, 2017

Member

I will go ahead and close this. Thanks for the PR and discussion! Again I am not opposed to adding a low-level number to string API in the standard library but I believe it will be more productive to iterate on this in a crate. The nb crate name is available in case anyone wants to run with the NumberBuffer design. 😉

Member

dtolnay commented Dec 23, 2017

I will go ahead and close this. Thanks for the PR and discussion! Again I am not opposed to adding a low-level number to string API in the standard library but I believe it will be more productive to iterate on this in a crate. The nb crate name is available in case anyone wants to run with the NumberBuffer design. 😉

@dtolnay dtolnay closed this Dec 23, 2017

@dtolnay dtolnay referenced this pull request Mar 17, 2018

Open

Support other radix #15

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment