-
Notifications
You must be signed in to change notification settings - Fork 768
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[READY] Update unicode support to version 15.1 #1719
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems... ok. any clue what an indict junk is ?
Actually, I really don't care :p
Reviewed 11 of 11 files at r1, all commit messages.
Reviewable status: 0 of 2 LGTMs obtained (waiting on @bstaletic)
@DonKult I would not mind your review as well, if you are up for it. |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #1719 +/- ##
==========================================
+ Coverage 95.41% 95.45% +0.03%
==========================================
Files 83 83
Lines 8131 8136 +5
Branches 165 163 -2
==========================================
+ Hits 7758 7766 +8
+ Misses 322 320 -2
+ Partials 51 50 -1 |
d474e9d
to
92503cd
Compare
We are passing conformance tests, but the following case seems odd: CONSONANT (LINKER|EXTEND)+ CONSONANT We will not break the above sequence at all if we encounter at least one LINKER code point. Strictly reading the unicode GB9c rule seems to suggest that CONSONANT LINKER EXTEND CONSONANT may be broken before the second consonant, but that leads to broken tests.
This utility is exactly the same as all of - std::map::erase(Key) - std::unordered_map(Key) - absl::flat_hash_map(Key) No need to reinvent the wheel and have codecov complain.
- Stay within 80 character limit. - Consistently call grapheme break properties and indic conjunct break properties by their full names to avoid confusion. - Previously "property" and "break property" always referred to grapheme break properties, but now we also have indic conjunct properties.
The PropertiesAreCorrect tests check that we actually record expected properties in our CodePoint objects. This commit extends that to a small sample of codepoints that have indic conjunct break propery as well.
92503cd
to
097a63e
Compare
Sadly, I can't provide much more than emotional support as I know next to nothing about the internals of utf-8 nor use/need/know any scripture that would use the finer details of word splitting. I just remembered seeing an issue about this elsewhere in Debian. I did test this branch through (with the vim plugin) and it updates unicode, builds & runs (and tests) just fine in the previous as well as last force-push version, so as far that counts as review I can approve of the changes. (I do apply additional changes through to remove the *.inc files & drop other third-party embeds/downloads in favor of system-provided things, which hide the |
Confirming that things are not broken on your side is good to know.
I'm curious. May I see the patches? |
Sure, not a secret, just not usually an upstream favorite topic. I haven't pushed unfuzzied/updated version(s) yet, but you can see them either listed in Debian's Patch tracker for what is currently in the archive or in the VCS of the ycmd packaging bits for possibly unreleased things and as some of them don't make much sense without the vim-plugin context: patches vcs. Some of them I might/should clean up and official propose some day…, there is just always another dumpster fire to deal with first. At least, that is what I claim as an excuse. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 5 files at r2, 16 of 16 files at r3, all commit messages.
Reviewable status: 1 of 2 LGTMs obtained (waiting on @bstaletic)
Thanks for sending a PR! |
Thanks for sending a PR! |
This should bring ycmd to support unicode 15.1.
update_unicode.py
.Word.cpp
.What is in this pull request
Brand new stuff
Unicode 15 changed the grapheme cluster boundary rules, by adding GB9c.
That thing talks about "indic conjunct break" property, from annex 44.
The data itself is DerivedCoreProperties.txt.
That has lead me to adding a new data member to our gigantic
code_points
object.Bug fixes
urllib.request.urlopen
returns a response whoseread()
(and similar) returnbytes
, which breaks our trustyDownload( url )
helper. Yes, this was a very old bug.Playing hide the data
Old link for the unversioned
emoji-data.txt
is not working, so I used the versioned one.I have just found the unversioned one: https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-data.txt
This change is
Fixes ##1718