-
Notifications
You must be signed in to change notification settings - Fork 21.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update Unicode Version to 9.0.0 #27822
Conversation
0203f1f
to
21081f0
Compare
# GB13. [^RI] (RI RI)* RI X RI | ||
elsif codepoints[marker..pos].all? { |c| database.boundary[:regional_indicator] === c } && codepoints[marker..pos].count { |c| database.boundary[:regional_indicator] === c }.even? | ||
false | ||
# GB999. Any ÷ Any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have test coverage demonstrating these grapheme cluster boundary rules?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that http://www.unicode.org/Public/9.0.0/ucd/auxiliary/GraphemeBreakTest.txt covers all lines of unpack_graphemes
:
Following diff is how I got the coverage:
diff --git a/Gemfile b/Gemfile
index 2a42a4aea0..2aaf64360a 100644
--- a/Gemfile
+++ b/Gemfile
@@ -151,3 +151,5 @@ end
gem "ibm_db" if ENV["IBM_DB"]
gem "tzinfo-data", platforms: [:mingw, :mswin, :x64_mingw, :jruby]
gem "wdm", ">= 0.1.0", platforms: [:mingw, :mswin, :x64_mingw, :mswin64]
+
+gem "simplecov"
diff --git a/activesupport/cov.rb b/activesupport/cov.rb
new file mode 100644
index 0000000000..f5911a16c3
--- /dev/null
+++ b/activesupport/cov.rb
@@ -0,0 +1,7 @@
+$LOAD_PATH << "test"
+
+require "simplecov"
+
+SimpleCov.start
+
+load "test/multibyte_grapheme_break_conformance_test.rb"
$ cd activesupport
$ bundle exec ruby cov.rb
unless File.exist?(filename) | ||
$stderr.puts "Downloading #{url.split('/').last}" | ||
FileUtils.mkdir_p(File.dirname(filename)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
* Updated Unicode version to 9.0.0 | ||
|
||
*Fumiaki MATSUSHIMA* | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth noting how this affects grapheme cluster boundaries.
Notably, this will change string.mb_chars.grapheme_length
and string.mb_chars.reverse
—perhaps fixing them to work with clusters like 👩👩👧👦!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some notes 😄
9.0.0 was released on June 21, 2016 http://blog.unicode.org/2016/06/announcing-unicode-standard-version-90.html http://www.unicode.org/versions/Unicode9.0.0/ There are some changes about grapheme cluster in Unicode 9.0.0: http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules ------------ I noticed that `unpack_graphemes` returns [Other] when the argument is Other ÷ Prepend (it must be [Other, Prepend]). But in [Unicode 8.0.0's Prepend has no characters](http://www.unicode.org/reports/tr29/tr29-27.html#Prepend) so we don't have to backport following patch: ```diff should_break = + if pos == eoc + true ```
21081f0
to
bdcfdef
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r? @jeremy
"👩👩👧👦".mb_chars.reverse # => "👩👩👧👦" | ||
|
||
*Fumiaki MATSUSHIMA* | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
Summary
9.0.0 was released on June 21, 2016
http://blog.unicode.org/2016/06/announcing-unicode-standard-version-90.html
http://www.unicode.org/versions/Unicode9.0.0/
There are some changes about grapheme cluster in Unicode 9.0.0:
http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules
Other Information
I noticed that
unpack_graphemes
will return [Other] when the argument is Other ÷ Prepend(it must be [Other, Prepend]).
But in Unicode 8.0.0's Prepend has no characters
so we don't have to backport following patch:
If we support Ruby 2.4 only, we can simply replace
unpack_graphemes
with Ruby'sscan(/\X/)
😉(@nurse implemented grapheme extended cluster on 2.4! )
https://bugs.ruby-lang.org/issues/12831
See #26743 for more information about replacing
AS::Multibyte
with Ruby's feature.I'm not unicode expert so I'm very happy if you review this PR carefully 🙏