Skip to content

Conversation

@Kimundi
Copy link
Contributor

@Kimundi Kimundi commented Aug 14, 2014

The first commit improves code generation through a few changes:

  • The #[inline] attributes allow llvm to constant fold the encoding step away in certain situations. For example, code like this changes from a call to encode_utf8 in a inner loop to the pushing of a byte constant:

    let mut s = String::new();
    for _ in range(0u, 21) {
        s.push_char('a');
    }
  • Both methods changed their semantic from causing run time failure if the target buffer is not large enough to returning None instead. This makes llvm no longer emit code for causing failure for these methods.

  • A few debug assert!() calls got removed because they affected code generation due to unwinding, and where basically unnecessary with today's sound handling of char as a Unicode scalar value.

The second commit is optional. It changes the methods from regular indexing with the dst[i] syntax to unsafe indexing with dst.unsafe_mut_ref(i). This does not change code generation directly - in both cases llvm is smart enough to see that there can never be an out-of-bounds access. But it makes it emit a nounwind attribute for the function.
However, I'm not sure whether that is a real improvement, so if there is any objection to this I'll remove the commit.

This changes how the methods behave on a too small buffer, so this is a

[breaking-change]

@lilyball
Copy link
Contributor

I have no objection to the second commit, but it should include a comment explaining why it's using unsafe indexing.

Overall, I think returning 0 is better than failing. The more explicit failure we can get rid of, the better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think this would be reasonable as:

debug_assert!(code < MAX_FOUR_B, "invalid character!");
match () {
    _ if code < MAX_ONE_B => 1,
    _ if code < MAX_TWO_B => 2,
    _ if code < MAX_THREE_B => 3,
    _ => 4
}

since it's undefined behaviour for a char to be out of range. (Or even dropping the debug_assert entirely.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right that makes more sense :)

@alexcrichton
Copy link
Member

However, I'm not sure whether that is a real improvement, so if there is any objection to this I'll remove the commit.

Without concrete data showing that unsafe improves performance one way or another, this should probably be left off for now.

The other changes all look good to me though, thanks!

@Kimundi
Copy link
Contributor Author

Kimundi commented Aug 15, 2014

Okay, I removed the unsafe indexing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's undefined behavior for a char to be a surrogate character. I think we should actually just delete the debug_assert!() altogether (and remove the parenthetical from the comment).

Note that encode_utf8() doesn't bother with this test, even though a surrogate character encoding is considered invalid utf-8.

@sfackler
Copy link
Member

You could change the assert!s to debug_assert!s to keep them around without being forced to pay for the hit everywhere.

- Both can now be inlined and constant folded away
- Both can no longer cause failure
- Both now return an `Option` instead

Removed debug `assert!()`s over the valid ranges of a `char`
- It affected optimizations due to unwinding
- Char handling is now sound enought that they became uneccessary
@Kimundi
Copy link
Contributor Author

Kimundi commented Aug 16, 2014

@sfackler: That's what I did before in this PR...

@lilyball
Copy link
Contributor

But why keep them around at all? It's literally undefined behavior in Rust for a char to be invalid. No point in asserting that.

bors added a commit that referenced this pull request Aug 17, 2014
The first commit improves code generation through a few changes:
- The `#[inline]` attributes allow llvm to constant fold the encoding step away in certain situations. For example, code like this changes from a call to `encode_utf8` in a inner loop to the pushing of a byte constant:

 ```rust
let mut s = String::new();
for _ in range(0u, 21) {
        s.push_char('a');
}
```
- Both methods changed their semantic from causing run time failure if the target buffer is not large enough to returning `None` instead. This makes llvm no longer emit code for causing failure for these methods.
- A few debug `assert!()` calls got removed because they affected code generation due to unwinding, and where basically unnecessary with today's sound handling of `char` as a Unicode scalar value.

~~The second commit is optional. It changes the methods from regular indexing with the `dst[i]` syntax to unsafe indexing with `dst.unsafe_mut_ref(i)`. This does not change code generation directly - in both cases llvm is smart enough to see that there can never be an out-of-bounds access. But it makes it emit a `nounwind` attribute for the function. 
However, I'm not sure whether that is a real improvement, so if there is any objection to this I'll remove the commit.~~

This changes how the methods behave on a too small buffer, so this is a 

[breaking-change]
@bors bors closed this Aug 17, 2014
@bors bors merged commit 13079c1 into rust-lang:master Aug 17, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants