Skip to content

Fix example from_utf8_lossy to match the behavior of the real one. #145061

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

ssbr
Copy link
Contributor

@ssbr ssbr commented Aug 7, 2025

String::from_utf8_lossy("\xF0\xF0") will return "��", but the example would only push a single "�". I think that's a bit too easy for readers to miss, and the bug is easy to miss in unit tests. (In fact, I'm here right out of a code review where the same mistake existed, possibly inspired by these docs!)

Note that the documentation for utf8_chunks does not have this issue, and does the for loop: https://doc.rust-lang.org/std/primitive.slice.html#method.utf8_chunks

@rustbot
Copy link
Collaborator

rustbot commented Aug 7, 2025

r? @tgross35

rustbot has assigned @tgross35.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Aug 7, 2025
@ssbr
Copy link
Contributor Author

ssbr commented Aug 7, 2025

Eh, am I dumb? I tested it and it was fine before. Sorry for the noise.

@ssbr ssbr closed this Aug 7, 2025
@rustbot rustbot removed the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Aug 7, 2025
@tgross35
Copy link
Contributor

tgross35 commented Aug 7, 2025

While you're here, would you mind turning the where F: FnMut(&str) into mut push: F: impl FnMut(&str)? the formatting is a bit weird.

Something to actually assert the behavior wouldn't be bad either, e.g.

// We use `String` here for demonstration purposes, but `utf8_chunks`
// also works without an allocator. (If you have `String` available,
// consider using `String::from_utf8_lossy` instead)
let mut output = String::new();
let input = b"Hello \xF0\xF0\x90\x80World!";
from_utf8_lossy(input, |s| output.push_str(s));
assert_eq!(output, "Hello ��World!");

@tgross35
Copy link
Contributor

tgross35 commented Aug 7, 2025

Eh, am I dumb? I tested it and it was fine before. Sorry for the noise.

Ah, missed this too but the invalid chunk is at most one character's worth (3 bytes). So with my example, if you make the change from this PR (if !chunk.invalid().is_empty() -> for chunk in chunk.invalid()), you wind up with "Hello ����World!".

I still think it would be a good idea to assert! the output if you are interested in making that change.

@ssbr
Copy link
Contributor Author

ssbr commented Aug 7, 2025

The main, most important thing, is that actually String::from_utf8_lossy does the same thing (adds a single � per invalid block), so the code as-is is actually exactly correct -- the thing I was missing was that we get multiple invalid chunks with "", and they only get merged, presumably, if they are part of a longer invalid UTF8 sequence.

@tgross35
Copy link
Contributor

tgross35 commented Aug 7, 2025

String::from_utf8_lossy uses this exact code so it should match 🙂

for chunk in iter {
res.push_str(chunk.valid());
if !chunk.invalid().is_empty() {
res.push_str(REPLACEMENT);
}
}
. I was only suggesting that it would be good to demonstrate this behavior in the example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants