Convert positions from LSP coordinates to Kakoune coordinates #98

Closed
Screwtapello opened this issue Oct 8, 2018 · 7 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed), high priority

Comments

@Screwtapello
Contributor

The LSP specification (version 3.0) says:

> A position inside a document (see Position definition below) is expressed as a zero-based line and character offset. The offsets are based on a UTF-16 string representation. So a string of the form a𐐀b the character offset of the character a is 0, the character offset of 𐐀 is 1 and the character offset of b is 3 since 𐐀 is represented using two code units in UTF-16.
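The spec's example can be reproduced with a short sketch. The helper name `utf16_offsets` is made up for illustration and is not part of kak-lsp or of any LSP library; it only assumes that code points above U+FFFF take two UTF-16 code units.

```python
def utf16_offsets(text: str) -> list[int]:
    """Return the UTF-16 code unit offset of each code point in text."""
    offsets = []
    units = 0
    for ch in text:
        offsets.append(units)
        # Code points above U+FFFF are encoded as a surrogate pair (two code units).
        units += 2 if ord(ch) > 0xFFFF else 1
    return offsets


# "a𐐀b" from the spec: 𐐀 is U+10400, an astral-plane character.
print(utf16_offsets("a\U00010400b"))  # [0, 1, 3]
```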

Meanwhile, Kakoune uses one-based line and character offsets, and seems to count 1 for every kind of character, including basic ASCII, Basic Multilingual Plane characters, astral plane characters like emoji, and individual combining characters.

Currently kak-lsp converts positions by adding 1 (converting from zero-based to one-based), but does not account for the difference between codepoints and UTF-16 code units.
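To make the mismatch concrete, here is a minimal sketch that compares the add-1 conversion with one that walks the line and counts code points. Both function names are hypothetical, and the sketch assumes the target is a one-based code point column, as described above; it is not kak-lsp's actual code.

```python
def naive_column(lsp_character: int) -> int:
    """Add 1 only: treats the UTF-16 offset as if it were a code point index."""
    return lsp_character + 1


def codepoint_column(line: str, lsp_character: int) -> int:
    """Convert a zero-based UTF-16 offset into a one-based code point column."""
    units = 0
    for index, ch in enumerate(line):
        if units >= lsp_character:
            return index + 1
        units += 2 if ord(ch) > 0xFFFF else 1
    # The offset points at or past the end of the line.
    return len(line) + 1


line = "a\U00010400b"  # a, an astral-plane character, b
# LSP reports "b" at character 3; it is the third code point on the line.
print(naive_column(3), codepoint_column(line, 3))  # 4 3
```

The two results only diverge after the first astral-plane character on a line, which is probably why the problem is easy to miss with mostly-ASCII source files.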

@ul
Collaborator

ul commented Oct 8, 2018

Good catch!

ul added the bug and high priority labels on Oct 8, 2018
@ul
Collaborator

ul commented Oct 8, 2018

Do you have any ideas how to fix it efficiently? It looks like a generic solution would require kak-lsp to track and analyze the contents of open buffers =(

@mawww Do you know of anything already implemented on the Kakoune side that might help with such a conversion?

ul added the help wanted label on Oct 8, 2018
@mawww
Contributor

mawww commented Oct 8, 2018

Argh, Microsoft, again? I thought we were friends... More seriously, it bothers me that they cannot let UTF-16 die in the MS world. UTF-8 won; everybody uses UTF-8 except to access the Win32 API... It's even stupider because the text documents themselves are expected to be transferred as UTF-8. Frankly, I view this as a bug in the LSP spec, and ideally we should lobby them to fix it, but I doubt it will get fixed anytime soon...

Kakoune uses 0-based byte coordinates for selections internally, and exposes them as 1-based byte coordinates to the external world (because user-side lines/columns are traditionally 1-based, as seen in compiler error messages, for example).
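Given byte coordinates on the Kakoune side, the conversion from an LSP position could look roughly like the sketch below. It assumes the line's text is available as a valid Python string, and it formats the result with the `line.column` syntax mentioned in this thread. The function names are hypothetical, not kak-lsp's API.

```python
def lsp_to_kak_byte_column(line: str, lsp_character: int) -> int:
    """Convert a zero-based UTF-16 offset to a one-based UTF-8 byte column."""
    units = 0
    byte_offset = 0
    for ch in line:
        if units >= lsp_character:
            break
        # Advance both counters by the size of this code point in each encoding.
        units += 2 if ord(ch) > 0xFFFF else 1
        byte_offset += len(ch.encode("utf-8"))
    return byte_offset + 1


def lsp_to_kak(lines: list[str], lsp_line: int, lsp_character: int) -> str:
    """Format a zero-based LSP position as a one-based line.byte coordinate."""
    column = lsp_to_kak_byte_column(lines[lsp_line], lsp_character)
    return f"{lsp_line + 1}.{column}"


# "b" sits at UTF-16 offset 3, after 1 + 4 bytes of UTF-8.
print(lsp_to_kak(["a\U00010400b"], 0, 3))  # 1.6
```

Note that this needs the line's contents, which is exactly the cost discussed here: something has to hold the buffer text, whether kak-lsp or Kakoune.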

I would find it really ugly for kak-lsp to have to store the buffer content itself just for that case. An alternative solution (that I am not really happy with either) would be to have a way to specify UTF-16 based coordinates to Kakoune (say :select -utf16 ...), and handle the ugly details there.

The best alternative remains to remind the LSP spec writer that there were 3 sane alternatives (utf8 byte coordinates, column coordinates or codepoint coordinates) and for some strange/historical reason they went with another one...

Yeah, I am a bit annoyed at you Microsoft 😄

Edit: Here is the discussion on the lsp side: microsoft/language-server-protocol#376

@Screwtapello
Contributor Author

Screwtapello commented Oct 8, 2018

(to be fair to Microsoft, I'm guessing this particular API decision comes from VS Code being written in JavaScript, whose spec requires UTF-16 strings, not particularly the Win32/Cocoa/Java APIs)

@Screwtapello
Contributor Author

As discussed on IRC, kak-lsp wouldn't necessarily need to cache the entire document: if you had a list of the offsets at which astral-plane characters appear, you could take each LSP coordinate and binary-search in the list to see how many astral-plane characters appear before it, and subtract that number from the offset to find the codepoint offset.
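A minimal sketch of that lookup, assuming the list holds the sorted UTF-16 offsets at which astral-plane characters start on a single line (the thread does not say which unit the list uses, so that is an assumption). Both helper names are hypothetical.

```python
from bisect import bisect_left


def astral_utf16_offsets(line: str) -> list[int]:
    """Collect the UTF-16 offsets at which astral-plane characters start."""
    offsets = []
    units = 0
    for ch in line:
        if ord(ch) > 0xFFFF:
            offsets.append(units)
            units += 2
        else:
            units += 1
    return offsets


def utf16_to_codepoint_offset(astral_offsets: list[int], utf16_offset: int) -> int:
    """Subtract one for every astral-plane character that starts before the offset."""
    astral_before = bisect_left(astral_offsets, utf16_offset)
    return utf16_offset - astral_before


index = astral_utf16_offsets("a\U00010400b\U00010400c")
print(index)                                # [1, 4]
print(utf16_to_codepoint_offset(index, 6))  # 4, the code point index of "c"
```

Only the (usually tiny) offset list has to be kept per line, and each lookup is a binary search rather than a scan of the text.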

As for finding astral-plane characters, some quick investigation with Python:

>>> "\uffff".encode('utf-8')
b'\xef\xbf\xbf'
>>> "\U00010000".encode('utf-8')
b'\xf0\x90\x80\x80'
>>> "\U0010FFFF".encode('utf-8')
b'\xf4\x8f\xbf\xbf'

... suggests that any byte whose value >= 0xf0 is the initial byte of an astral-plane character. That should be pretty easy to search for, without having to transcode anything to UTF-16 and count code units.
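The byte search could be sketched like this, assuming the input is valid UTF-8 (continuation bytes are always in the range 0x80 to 0xBF, so they never match). The function name is made up for illustration.

```python
def astral_byte_offsets(line: bytes) -> list[int]:
    """Return the byte offsets of lead bytes >= 0xF0 in a UTF-8 encoded line."""
    return [i for i, byte in enumerate(line) if byte >= 0xF0]


# a (1 byte), U+10400 (4 bytes), b (1 byte), U+1F600 (4 bytes)
data = "a\U00010400b\U0001F600".encode("utf-8")
print(astral_byte_offsets(data))  # [1, 6]
```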

> Kakoune uses 0-based byte coordinates for selections internally

Wait, so the line:column indicator in the status-bar (which seems to count codepoints) is unrelated to the line.column syntax used in ranges and selections? That seems... misleading.

@mawww
Contributor

mawww commented Oct 9, 2018

> Wait, so the line:column indicator in the status-bar (which seems to count codepoints) is unrelated to the line.column syntax used in ranges and selections? That seems... misleading.

Ranges and selections use <line>.<byte since line start>, while the indication given in the status line is <line>:<column since line start>; I'm not sure whether that is misleading. Both are displayed 1-based, while internally they are stored 0-based (only the byte ones; we do not store column information).

@krobelus
Member

Fixed by fb972fc (Use UTF-16 code unit offsets instead of code point offsets, as per LSP, 2022-09-03)
