Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix string indexing #167

Merged
merged 1 commit into from Aug 7, 2019

Conversation

@phansch
Copy link
Contributor

commented Apr 4, 2019

Previously rustfix was treating the highlight indices as byte offsets.
This caused it to crash when a highlight index was pointing inside a
multi-byte character.

This fix makes sure to index into the characters of the string instead
of the bytes.

Fixes rust-lang/rust#59660

Fix string indexing
Previously rustfix was treating the highlight indices as byte offsets.
This caused it to crash when a highlight index was pointing inside a
multi-byte character.

This fix makes sure to index into the characters of the string instead
of the bytes.
@estebank

This comment has been minimized.

Copy link

commented Apr 10, 2019

@rust-lang-nursery/libs, can someone review this PR?

@BurntSushi

This comment has been minimized.

Copy link
Member

commented Apr 10, 2019

I'm not sure this makes sense to me. Why are we reporting spans using codepoint offsets? I imagine we know the encoding, so we should be using byte offsets.

@ehuss

This comment has been minimized.

Copy link
Contributor

commented Jun 1, 2019

@BurntSushi Is that concern enough to block this PR? I'm not terribly familiar with rustc's internals, but it looks like things like Loc and LineInfo are very character-offset oriented. IIUC, all these structures would need to add a byte offset field as well, and it's not clear to me how easy that would be.

@BurntSushi

This comment has been minimized.

Copy link
Member

commented Jun 2, 2019

I'm not familiar with rustc internals either. I wasn't necessarily intending to block this PR. It just seems weird to me that codepoint offsets are being used. The only place where that makes sense, I think, is when you want to represent offsets into the same text across multiple encodings. Codepoint offsets will be invariant, but byte offsets wouldn't be. The downside is that you need either more time or more memory to convert between them.

@estebank

This comment has been minimized.

Copy link

commented Jun 4, 2019

From the rustc point of view, we could provide either, but any change in this area has the potential of being a breaking change. I guess that the ideal situation would be to have both char offset and byte offset provided in the json output, but that would be overly verbose IMO. The reason we provide char offsets instead of byte offsets is because we make this conversion internally for user consumption, to avoid leaking byte offsets in the cli ui.

@BurntSushi

This comment has been minimized.

Copy link
Member

commented Jun 4, 2019

@estebank

This comment has been minimized.

Copy link

commented Jun 4, 2019

Because the original consumer of these are the users, which count chars on screen, not bytes.

We could provide bytes in the json output, but that means opening the ticket to do so in the rust repo.

@BurntSushi

This comment has been minimized.

Copy link
Member

commented Jun 4, 2019

Ah I see. Basically, if the intent is for users to read them directly, then I agree that codepoint offsets are bettee than byte offsets. But if you want a machine to read them and use them to slice the text, then we should be using byte offsets. Having both seems fine to me. (And indees, one might even want grapheme cluster offsets instead of codepoint offsets, for human consumption.)

Either way, I don't mean to raise this as a blocking issue. I'm just giving my opinion on what we should be doing. If we need to slice by codepoint to solve a pressing issue, then by all means, go for it.

@ehuss

This comment has been minimized.

Copy link
Contributor

commented Jul 16, 2019

@killercup Do you think you could take a look at this? The PR itself seems fine to me, I've tested it with a few different odd characters (like combining characters), and it seems to be ok. Overall I think it would probably be safer to have byte offsets, but I think that would require significant changes in rustc, and I'm not sure what tangible situations it would actually matter.

@alexcrichton alexcrichton merged commit 836108f into rust-lang-nursery:master Aug 7, 2019

2 checks passed

Travis CI - Pull Request Build Passed
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details

@phansch phansch deleted the phansch:fix_string_indexing branch Aug 7, 2019

@SimonSapin

This comment has been minimized.

Copy link

commented Aug 7, 2019

Because the original consumer of these are the users, which count chars on screen, not bytes.

By “characters on screen”, do you mean columns in a monospace font terminal’s grid? They’re not necessarily the same. And the count of char code points does not necessarily equal either of them, though it’s a less bad approximation than UTF-8 byte count.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.