Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make internal encoding of locations aware of unicode #6073

Merged
merged 1 commit into from
Mar 14, 2023
Merged

Conversation

cristianoc
Copy link
Collaborator

@cristianoc cristianoc commented Mar 14, 2023

When some unicode characters are present on a line, the existing encoding of positions, based on number of bytes since line start, is incorrect. This can be seen in e.g. error messages picked up in the editor (or on the command-line).

This PR takes unicode into account. Even though the ocaml locations are byte-based, one can trick the system by encoding as pos_cnum:
(number of bytes from file start to line start) + (number of utf16 code units since line start)
Since the compiler's printer performs a subtraction, the utf16 character position is shown. Notice that editors, vscode in particular, show you something in "Col", but its internal commands expect correct utf16 character which is different.

When some unicode characters are present on a line, the existing encoding of positions, based on number of bytes since line start, is incorrect.
This can be seen in e.g. error messages picked up in the editor (or on the command-line).

This PR takes unicode into account. Even thought the ocaml locations are byte-based, one can trick the system by encoding as pos_cnum:
  (number of bytes from file start to line start) + (number of utf16 code units since line start)
Since the compiler's printer performs a subtraction, the utf16 character position is shown.
Notice that editors, vscode in particular, show you something in "Col", but its internal commands expect correct utf16 character which is different.
@cristianoc cristianoc changed the title Make intenral encoding of locations aware of unicode Make internal encoding of locations aware of unicode Mar 14, 2023
@@ -0,0 +1,12 @@

We've found a bug for you!
/.../fixtures/unicode_location.res:1:43
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

character 43

/.../fixtures/unicode_location.res:1:43

1 │ let q = "💩💩💩💩💩💩💩💩����💩" ++ ("a" ++ 3 ++ "b")
2 │ // ^ character position 33 + 10
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

33 + 10 == 43

@cristianoc cristianoc requested a review from cknitt March 14, 2023 14:31
@zth
Copy link
Collaborator

zth commented Mar 14, 2023

I think this might fix a long standing issue with semantic highlighting and my native tounge (Swedish) where using letters like åäö offsets the coloring so it ends up wrong after those letters.

@cristianoc
Copy link
Collaborator Author

I think this might fix a long standing issue with semantic highlighting and my native tounge (Swedish) where using letters like åäö offsets the coloring so it ends up wrong after those letters.

Definitely. If you have an example at hand at some point, it would be handy to make sure it does.

@zth
Copy link
Collaborator

zth commented Mar 14, 2023

I think this might fix a long standing issue with semantic highlighting and my native tounge (Swedish) where using letters like åäö offsets the coloring so it ends up wrong after those letters.

Definitely. If you have an example at hand at some point, it would be handy to make sure it does.

Out traveling, but try for example:

let str = `hej på dig ${123->Int.toString}`

That should color the interpolated part wrong.

@cristianoc
Copy link
Collaborator Author

Found one independently:

let foo = x => if ( x == "Den här positionen är avstängd, väldigt avstängd") {<div />} else { <div />}

@zth
Copy link
Collaborator

zth commented Mar 14, 2023

@cristianoc love the Swedish!

@cristianoc
Copy link
Collaborator Author

This does fix the issue, once the relevant parser files in the editor extension are updated.

@cristianoc cristianoc merged commit f21be69 into master Mar 14, 2023
cristianoc added a commit to rescript-lang/rescript-vscode that referenced this pull request Mar 14, 2023
This fixes highlighting, and other things, in presence of unicode strings.
@cristianoc cristianoc deleted the utf16 branch March 14, 2023 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants