
Advance scanner position by byte length while searching for line #3612

Conversation

vinistock
Member

@vinistock vinistock commented Jun 17, 2025

Motivation

Fixes #3494 and #2446. Second time is the charm 😅

In #3583, we fixed bytesize counting for UTF-8, but didn't advance @pos based on the byte size when finding the correct line, which meant the document would still get corrupted.
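To illustrate this class of bug, here is a minimal sketch. It is not the actual scanner code: `line_start_offsets` is a hypothetical helper written only for this example. It tracks both a character-based and a byte-based offset while skipping lines, and shows that the two diverge as soon as a skipped line contains a multibyte character.

```ruby
# Hypothetical helper for illustration only (not from the Ruby LSP codebase).
# Returns the offset of the start of `target_line`, computed two ways:
# by summing character lengths and by summing byte sizes.
def line_start_offsets(source, target_line)
  char_pos = 0
  byte_pos = 0

  source.each_line.first(target_line).each do |line|
    char_pos += line.length   # number of characters in the skipped line
    byte_pos += line.bytesize # number of bytes in the skipped line
  end

  [char_pos, byte_pos]
end

source = "\u00E9\nb\n" # "é" takes 2 bytes in UTF-8
char_pos, byte_pos = line_start_offsets(source, 1)

source.byteslice(byte_pos, 1) # => "b"  (correct start of line 1)
source.byteslice(char_pos, 1) # => "\n" (one byte short: still on line 0)
```

When offsets are used as byte offsets, advancing by character length lands before the intended line, so any edit applied there would corrupt the document.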

Implementation

I split scanning into 3 subclasses since I found the logic for each encoding to be sufficiently different. We also don't want to pay the price of checking the encoding inside the many loops.

For posterity, the spec explains that positions use code unit lengths. For each encoding, that means a different thing:

  • UTF-8: code units are equivalent to bytes. We simply work directly with byte sizes
  • UTF-16: code units are almost equivalent to code points. The main difference is that code points that require a surrogate pair (those above U+FFFF) count as length 2 and everything else as length 1
  • UTF-32: code units are the same as code points

I implemented the logic for each in subclasses.
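The three rules above can be sketched as a single function. This is an illustrative example, not the code from this PR (which uses one subclass per encoding to avoid branching inside loops): `code_unit_length` and the `:utf8`/`:utf16`/`:utf32` symbols are hypothetical names, and the input is assumed to be a UTF-8 encoded Ruby string.

```ruby
# Hypothetical helper for illustration only: counts how many code units a
# UTF-8 Ruby string occupies under each position encoding.
def code_unit_length(string, encoding)
  case encoding
  when :utf8
    # One code unit per byte
    string.bytesize
  when :utf16
    # Code points above U+FFFF need a surrogate pair, so they count as 2
    string.each_codepoint.sum { |codepoint| codepoint > 0xFFFF ? 2 : 1 }
  when :utf32
    # One code unit per code point
    string.length
  else
    raise ArgumentError, "unknown encoding: #{encoding}"
  end
end

sample = "a\u00E9\u{1F642}" # "a", "é" (U+00E9), "🙂" (U+1F642)

code_unit_length(sample, :utf8)  # => 7 (1 + 2 + 4 bytes)
code_unit_length(sample, :utf16) # => 4 (1 + 1 + 2 code units)
code_unit_length(sample, :utf32) # => 3 (3 code points)
```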

Automated Tests

Added a bunch of tests for each encoding and some edge cases, which should hopefully help us prevent further regressions.

Manual Tests

Tested on VS Code and Neovim (UTF-16 and UTF-8).

@vinistock vinistock self-assigned this Jun 17, 2025
@vinistock vinistock added the bugfix This PR will fix an existing bug label Jun 17, 2025 — with Graphite App
@vinistock vinistock added the server This pull request should be included in the server gem's release notes label Jun 17, 2025 — with Graphite App
@vinistock vinistock marked this pull request as ready for review June 17, 2025 17:29
@vinistock vinistock requested a review from a team as a code owner June 17, 2025 17:29
@vinistock vinistock force-pushed the 06-17-advance_scanner_position_by_byte_length_while_searching_for_line branch from 1a368c4 to 7b99851 Compare June 17, 2025 17:32
@vinistock vinistock requested a review from st0012 June 17, 2025 18:12
@ChallaHalla
Contributor

ChallaHalla commented Jun 17, 2025

Tested in Neovim with UTF-32 and these changes seem to fix the problem!
Screenshot 2025-06-17 at 4 56 14 PM
Screenshot 2025-06-17 at 4 55 41 PM

Seems to be good for UTF-16 and UTF-8 as well.

@kddnewton
Contributor

I thought this functionality was largely taken care of inside of Prism::Location. Is there something missing?

@vinistock
Member Author

@kddnewton Prism::Location fully handles the other direction: returning the right code units for a given AST node. This one is the reverse. We receive code unit positions from the editor and need to turn them into the right string indices to update our source representation.
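As a rough sketch of that reverse direction, the example below converts an editor position (zero-based line plus a character offset measured in code units) into a character index in a UTF-8 Ruby string. It is an assumption-laden illustration, not the scanner from this PR: `code_unit_width`, `position_to_index`, and the encoding symbols are hypothetical names, and the loop simply stops at the end of the line if the requested offset runs past it.

```ruby
# Hypothetical helpers for illustration only (not from the Ruby LSP codebase).

# Width of a single character, in code units, for the given position encoding.
def code_unit_width(char, encoding)
  case encoding
  when :utf8 then char.bytesize
  when :utf16 then char.ord > 0xFFFF ? 2 : 1
  when :utf32 then 1
  else raise ArgumentError, "unknown encoding: #{encoding}"
  end
end

# Converts (line, character) into a character index in `source`.
def position_to_index(source, line, character, encoding)
  index = 0

  # Skip whole lines until we reach the start of the requested line
  line.times do
    newline = source.index("\n", index)
    raise ArgumentError, "line #{line} is out of range" if newline.nil?

    index = newline + 1
  end

  # Consume code units on the target line, one character at a time
  remaining = character
  while remaining > 0
    char = source[index]
    break if char.nil? || char == "\n"

    remaining -= code_unit_width(char, encoding)
    index += 1
  end

  index
end

source = "\u{1F642}ab\ncd" # "🙂ab" on line 0, "cd" on line 1

# The same spot (right after the emoji) is reported differently per encoding:
position_to_index(source, 0, 4, :utf8)  # => 1
position_to_index(source, 0, 2, :utf16) # => 1
position_to_index(source, 0, 1, :utf32) # => 1
position_to_index(source, 1, 1, :utf8)  # => 5
```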

Contributor

@alexcrocha alexcrocha left a comment

LGTM 🚀

Just one small change, otherwise good to 🚢
(feel free to ignore the nits)

@vinistock vinistock force-pushed the 06-17-advance_scanner_position_by_byte_length_while_searching_for_line branch from 7b99851 to ab7bcf5 Compare June 18, 2025 18:26
And refactor each encoding scanner into a subclass
@vinistock vinistock force-pushed the 06-17-advance_scanner_position_by_byte_length_while_searching_for_line branch from ab7bcf5 to f930333 Compare June 18, 2025 18:43
@vinistock vinistock enabled auto-merge (squash) June 18, 2025 18:43
@vinistock vinistock merged commit 6b10f30 into main Jun 18, 2025
36 checks passed
@vinistock vinistock deleted the 06-17-advance_scanner_position_by_byte_length_while_searching_for_line branch June 18, 2025 19:11
Labels
bugfix This PR will fix an existing bug server This pull request should be included in the server gem's release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

neovim-lsp: Weird behavior after inserting non-ascii characters
4 participants