Fix off-by-one error converting to LSP UTF8 offsets with multi-byte char #17003

krobelus · 2024-04-03T12:59:53Z

On this file,

    fn main() {
        let 된장 = 1;
    }

when using "positionEncodings":["utf-16"] I get an "unused variable" diagnostic on the variable
name (codepoint offset range 8..10). So far so good.

When using "positionEncodings":["utf-8"], I expect to get the equivalent range in bytes (LSP:
"Character offsets count UTF-8 code units (e.g bytes)."), which is 8..14, because both
characters are 3 bytes in UTF-8. However I actually get 10..14.
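
The expected ranges can be checked with a small standalone sketch. It is not rust-analyzer code; `column_ranges` is a hypothetical helper that counts the columns of a substring in characters, UTF-16 code units and UTF-8 bytes:

```rust
/// Column range of the first occurrence of `needle` in `line`, returned as
/// (start, end) pairs counted in chars, UTF-16 code units and UTF-8 bytes.
/// Hypothetical helper for illustration only.
fn column_ranges(line: &str, needle: &str) -> Option<[(usize, usize); 3]> {
    let start = line.find(needle)?; // `find` returns a byte offset
    let end = start + needle.len(); // `len` is also in bytes
    let (before, upto) = (&line[..start], &line[..end]);
    Some([
        (before.chars().count(), upto.chars().count()),
        (before.encode_utf16().count(), upto.encode_utf16().count()),
        (before.len(), upto.len()),
    ])
}
```

For the line `    let 된장 = 1;` and the needle `된장`, this gives `(8, 10)` in chars, `(8, 10)` in UTF-16 code units and `(8, 14)` in UTF-8 bytes.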

Looks like this is because we accidentally treat a 1-based index as an offset value: when
converting from our internal char-indices to LSP byte offsets, we look at one character too many.
This causes wrong results if the extra character is a multi-byte one, such as when computing
the start coordinate of 된장.

Fix that by actually passing an offset. While at it, fix the variable name of the line number,
which is not an offset (yet).
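
A simplified model of the off-by-one, for illustration. It is not the rust-analyzer source: all function names are hypothetical, and the "subtract 1 after converting" step in the buggy variant is an assumption chosen because it reproduces the `10..14` reported above.

```rust
/// Number of UTF-8 bytes taken up by the first `n_chars` characters of `line`.
fn utf8_prefix_len(line: &str, n_chars: usize) -> usize {
    line.chars().take(n_chars).map(char::len_utf8).sum()
}

/// Assumed model of the bug: the 1-based column is used as a character
/// count, and 1 is subtracted only after converting to bytes.
fn buggy_utf8_offset(line: &str, column_1based: usize) -> usize {
    utf8_prefix_len(line, column_1based).saturating_sub(1)
}

/// Model of the fix: turn the column into a 0-based offset first, then
/// convert that offset to bytes.
fn fixed_utf8_offset(line: &str, column_1based: usize) -> usize {
    utf8_prefix_len(line, column_1based.saturating_sub(1))
}
```

With the line `    let 된장 = 1;` and 1-based columns 9 and 11, the buggy variant yields `10..14` and the fixed one `8..14`. In this model the two agree whenever the extra character is a single byte, which is why the end coordinate (followed by a space) comes out right.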

Originally reported at kakoune-lsp/kakoune-lsp#740


krobelus · 2024-04-03T13:12:32Z

Also, when trying to work around this, I found that if the client sends positionEncodings=["utf-16", "utf-8"], we still pick Utf8:

    for enc in client_encodings {
        if enc == &PositionEncodingKind::UTF8 {
            return PositionEncoding::Utf8;
        } else if enc == &PositionEncodingKind::UTF32 {
            return PositionEncoding::Wide(WideEncoding::Utf32);
        }
        // NB: intentionally prefer just about anything else to utf-16.
    }

    PositionEncoding::Wide(WideEncoding::Utf16)

Why not pick the client's preferred encoding here (i.e. whichever of UTF8/UTF16/UTF32 comes first)?
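
To make the difference concrete, here is a self-contained sketch of both strategies. The `Encoding` enum and the string-based `parse` are stand-ins I made up so the example compiles without `lsp_types`; `server_preferred` mirrors the loop quoted above, and `client_preferred` is the alternative I am asking about.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Encoding {
    Utf8,
    Utf16,
    Utf32,
}

/// Map an LSP position encoding kind string to the stand-in enum.
fn parse(kind: &str) -> Option<Encoding> {
    match kind {
        "utf-8" => Some(Encoding::Utf8),
        "utf-16" => Some(Encoding::Utf16),
        "utf-32" => Some(Encoding::Utf32),
        _ => None,
    }
}

/// Server preference: the first utf-8 or utf-32 entry wins, utf-16 is
/// only the fallback, regardless of where it sits in the client's list.
fn server_preferred(client: &[&str]) -> Encoding {
    for kind in client {
        match parse(kind) {
            Some(Encoding::Utf8) => return Encoding::Utf8,
            Some(Encoding::Utf32) => return Encoding::Utf32,
            _ => {}
        }
    }
    Encoding::Utf16
}

/// Client preference: the first encoding in the client's list that the
/// server understands wins; utf-16 remains the fallback.
fn client_preferred(client: &[&str]) -> Encoding {
    client
        .iter()
        .find_map(|kind| parse(kind))
        .unwrap_or(Encoding::Utf16)
}
```

For `["utf-16", "utf-8"]`, `server_preferred` returns `Utf8` while `client_preferred` returns `Utf16`. Both return `Utf16` when the client sends only `["utf-16"]`.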

Veykril · 2024-04-03T13:26:56Z

I believe the reasoning is that we need to do less work for utf8, given rust natively handles utf8 strings.

krobelus · 2024-04-03T14:12:12Z

Right, LSP doesn't specify (what a surprise) whether the client preference or server preference should win.
So the best workaround is for the client to send only ["utf-16"] in that case.
Fortunately this is irrelevant usually.

Veykril · 2024-04-03T14:44:00Z

Thanks!
@bors r+

bors · 2024-04-03T14:45:23Z

📌 Commit d24b0ba has been approved by Veykril

It is now in the queue for this repository.

bors · 2024-04-03T14:58:18Z

⌛ Testing commit d24b0ba with merge 8e581ac...

bors · 2024-04-03T15:10:18Z

☀️ Test successful - checks-actions
Approved by: Veykril
Pushing 8e581ac to master...

BenjaminBrienen · 2024-04-11T17:16:47Z

> Right, LSP doesn't specify (what a surprise) whether the client preference or server preference should win.
> So the best workaround is for the client to send only ["utf-16"] in that case.
> Fortunately this is irrelevant usually.

Isn't this the exact kind of "why" that would be useful to have as a code comment?

rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 3, 2024

krobelus mentioned this pull request Apr 3, 2024

ANSI-like escaped chars are shown if Korean variable names are hovered kakoune-lsp/kakoune-lsp#740

Closed

bors merged commit 8e581ac into rust-lang:master Apr 3, 2024
11 checks passed
