Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[READY] Update unicode support to version 15.1 #1719

Merged
merged 9 commits into from
Dec 9, 2023

Conversation

bstaletic
Copy link
Collaborator

@bstaletic bstaletic commented Oct 30, 2023

This should bring ycmd to support unicode 15.1.

  • Parse new unicode data.
  • Implement the new rule for grapheme cluster boundaries.
  • Clean up update_unicode.py.
  • Consistently use unversioned links.
  • Update the regex submodule.
  • Clean up Word.cpp.

What is in this pull request

Brand new stuff

Unicode 15 changed the grapheme cluster boundary rules, by adding GB9c.
That thing talks about "indic conjunct break" property, from annex 44.
The data itself is DerivedCoreProperties.txt.

That has lead me to adding a new data member to our gigantic code_points object.

Bug fixes

urllib.request.urlopen returns a response whose read() (and similar) return bytes, which breaks our trusty Download( url ) helper. Yes, this was a very old bug.

Playing hide the data

Old link for the unversioned emoji-data.txt is not working, so I used the versioned one.
I have just found the unversioned one: https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-data.txt


This change is Reviewable

Fixes ##1718

@bstaletic bstaletic marked this pull request as draft October 30, 2023 20:01
Copy link
Member

@puremourning puremourning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems... ok. any clue what an indict junk is ?

Actually, I really don't care :p

Reviewed 11 of 11 files at r1, all commit messages.
Reviewable status: 0 of 2 LGTMs obtained (waiting on @bstaletic)

@bstaletic
Copy link
Collaborator Author

@DonKult I would not mind your review as well, if you are up for it.

Copy link

codecov bot commented Nov 1, 2023

Codecov Report

Merging #1719 (36a9461) into master (0607eed) will increase coverage by 0.03%.
The diff coverage is 98.05%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1719      +/-   ##
==========================================
+ Coverage   95.41%   95.45%   +0.03%     
==========================================
  Files          83       83              
  Lines        8131     8136       +5     
  Branches      165      163       -2     
==========================================
+ Hits         7758     7766       +8     
+ Misses        322      320       -2     
+ Partials       51       50       -1     

@bstaletic bstaletic marked this pull request as ready for review November 1, 2023 10:25
@bstaletic bstaletic changed the title [WIP] Parse new unicode data [READY] Parse new unicode data Nov 1, 2023
@bstaletic bstaletic changed the title [READY] Parse new unicode data [READY] Update unicode support to version 15.1 Nov 1, 2023
@bstaletic bstaletic force-pushed the unicode-15.1 branch 2 times, most recently from d474e9d to 92503cd Compare November 1, 2023 20:25
We are passing conformance tests, but the following case seems odd:

CONSONANT (LINKER|EXTEND)+ CONSONANT

We will not break the above sequence at all if we encounter at least one
LINKER code point. Strictly reading the unicode GB9c rule seems to
suggest that

CONSONANT LINKER EXTEND CONSONANT

may be broken before the second consonant, but that leads to broken
tests.
This utility is exactly the same as all of
- std::map::erase(Key)
- std::unordered_map(Key)
- absl::flat_hash_map(Key)

No need to reinvent the wheel and have codecov complain.
- Stay within 80 character limit.
- Consistently call grapheme break properties and indic conjunct break
  properties by their full names to avoid confusion.
  - Previously "property" and "break property" always referred to
    grapheme break properties, but now we also have indic conjunct
    properties.
The PropertiesAreCorrect tests check that we actually record expected
properties in our CodePoint objects.
This commit extends that to a small sample of codepoints that have indic
conjunct break propery as well.
@DonKult
Copy link
Contributor

DonKult commented Nov 3, 2023

@DonKult I would not mind your review as well, if you are up for it.

Sadly, I can't provide much more than emotional support as I know next to nothing about the internals of utf-8 nor use/need/know any scripture that would use the finer details of word splitting. I just remembered seeing an issue about this elsewhere in Debian.

I did test this branch through (with the vim plugin) and it updates unicode, builds & runs (and tests) just fine in the previous as well as last force-push version, so as far that counts as review I can approve of the changes.

(I do apply additional changes through to remove the *.inc files & drop other third-party embeds/downloads in favor of system-provided things, which hide the Download issue in update_unicode.py for me for example).

@bstaletic
Copy link
Collaborator Author

Confirming that things are not broken on your side is good to know.

(I do apply additional changes through to remove the *.inc files & drop other third-party embeds/downloads in favor of system-provided things, which hide the Download issue in update_unicode.py for me for example).

I'm curious. May I see the patches?

@DonKult
Copy link
Contributor

DonKult commented Nov 3, 2023

(I do apply additional changes through to remove the *.inc files & drop other third-party embeds/downloads in favor of system-provided things, which hide the Download issue in update_unicode.py for me for example).

I'm curious. May I see the patches?

Sure, not a secret, just not usually an upstream favorite topic. I haven't pushed unfuzzied/updated version(s) yet, but you can see them either listed in Debian's Patch tracker for what is currently in the archive or in the VCS of the ycmd packaging bits for possibly unreleased things and as some of them don't make much sense without the vim-plugin context: patches vcs. Some of them I might/should clean up and official propose some day…, there is just always another dumpster fire to deal with first. At least, that is what I claim as an excuse.

Copy link
Member

@puremourning puremourning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 5 files at r2, 16 of 16 files at r3, all commit messages.
Reviewable status: 1 of 2 LGTMs obtained (waiting on @bstaletic)

@puremourning puremourning added the Ship It! Manual override to merge a PR by maintainer label Dec 9, 2023
Copy link
Contributor

mergify bot commented Dec 9, 2023

Thanks for sending a PR!

Copy link
Contributor

mergify bot commented Dec 9, 2023

Thanks for sending a PR!

@mergify mergify bot merged commit 0a276f7 into ycm-core:master Dec 9, 2023
17 of 18 checks passed
@bstaletic bstaletic deleted the unicode-15.1 branch December 9, 2023 22:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ship It! Manual override to merge a PR by maintainer
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants