Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix off-by-one error converting to LSP UTF8 offsets with multi-byte char #17003

Merged
merged 1 commit into from Apr 3, 2024

Conversation

krobelus
Copy link
Contributor

@krobelus krobelus commented Apr 3, 2024

On this file,

fn main() {
    let 된장 = 1;
}

when using "positionEncodings":["utf-16"] I get an "unused variable" diagnostic on the variable
name (codepoint offset range 8..10). So far so good.

When using positionEncodings":["utf-8"], I expect to get the equivalent range in bytes (LSP:
"Character offsets count UTF-8 code units (e.g bytes)."), which is 8..14, because both
characters are 3 bytes in UTF-8. However I actually get 10..14.

Looks like this is because we accidentally treat a 1-based index as an offset value: when
converting from our internal char-indices to LSP byte offsets, we look at one character to many.
This causes wrong results if the extra character is a multi-byte one, such as when computing
the start coordinate of 된장.

Fix that by actually passing an offset. While at it, fix the variable name of the line number,
which is not an offset (yet).

Originally reported at kakoune-lsp/kakoune-lsp#740

On this file,

```rust
fn main() {
    let 된장 = 1;
}
```

when using `"positionEncodings":["utf-16"]` I get an "unused variable" diagnostic on the variable
name (codepoint offset range `8..10`). So far so good.

When using `positionEncodings":["utf-8"]`, I expect to get the equivalent range in bytes (LSP:
"Character offsets count UTF-8 code units (e.g bytes)."), which is `8..14`, because both
characters are 3 bytes in UTF-8.  However I actually get `10..14`.

Looks like this is because we accidentally treat a 1-based index as an offset value: when
converting from our internal char-indices to LSP byte offsets, we look at one character to many.
This causes wrong results if the extra character is a multi-byte one, such as when computing
the start coordinate of 된장.

Fix that by actually passing an offset. While at it, fix the variable name of the line number,
which is not an offset (yet).

Originally reported at kakoune-lsp/kakoune-lsp#740
@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 3, 2024
@krobelus
Copy link
Contributor Author

krobelus commented Apr 3, 2024

Also, when trying to work around this, I found that if the client sends positionEncodings=["utf-16", "utf-8"], we still pick Utf8:

    for enc in client_encodings {
        if enc == &PositionEncodingKind::UTF8 {
            return PositionEncoding::Utf8;
        } else if enc == &PositionEncodingKind::UTF32 {
            return PositionEncoding::Wide(WideEncoding::Utf32);
        }
        // NB: intentionally prefer just about anything else to utf-16.
    }

    PositionEncoding::Wide(WideEncoding::Utf16)

why not pick the client's preferred encoding here? (i.e. whichever of UTF8/UTF16/UTF32 comes first)

@Veykril
Copy link
Member

Veykril commented Apr 3, 2024

I believe the reasoning is that we need to do less work for utf8, given rust natively handles utf8 strings.

@krobelus
Copy link
Contributor Author

krobelus commented Apr 3, 2024

Right, LSP doesn't specify (what a surprise) whether the client preference or server preference should win.
So the best workaround is for the client to send only ["utf16"] in that case.
Fortunately this is irrelevant usually.

@Veykril
Copy link
Member

Veykril commented Apr 3, 2024

Thanks!
@bors r+

@bors
Copy link
Collaborator

bors commented Apr 3, 2024

📌 Commit d24b0ba has been approved by Veykril

It is now in the queue for this repository.

@bors
Copy link
Collaborator

bors commented Apr 3, 2024

⌛ Testing commit d24b0ba with merge 8e581ac...

@bors
Copy link
Collaborator

bors commented Apr 3, 2024

☀️ Test successful - checks-actions
Approved by: Veykril
Pushing 8e581ac to master...

@bors bors merged commit 8e581ac into rust-lang:master Apr 3, 2024
11 checks passed
@BenjaminBrienen
Copy link

BenjaminBrienen commented Apr 11, 2024

Right, LSP doesn't specify (what a surprise) whether the client preference or server preference should win.
So the best workaround is for the client to send only ["utf16"] in that case.
Fortunately this is irrelevant usually.

Isn't this the exact kind of "why" that would be useful to have as a code comment?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants