Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Unicode Version to 9.0.0 #27822

Merged
merged 1 commit into from
Jan 30, 2017
Merged

Conversation

mtsmfm
Copy link
Contributor

@mtsmfm mtsmfm commented Jan 27, 2017

Summary

9.0.0 was released on June 21, 2016

http://blog.unicode.org/2016/06/announcing-unicode-standard-version-90.html

http://www.unicode.org/versions/Unicode9.0.0/

There are some changes about grapheme cluster in Unicode 9.0.0:

http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules

Other Information

I noticed that unpack_graphemes will return [Other] when the argument is Other ÷ Prepend
(it must be [Other, Prepend]).
But in Unicode 8.0.0's Prepend has no characters
so we don't have to backport following patch:

should_break =
+ if pos == eoc
+   true
# GB3. CR X LF

If we support Ruby 2.4 only, we can simply replace unpack_graphemes with Ruby's scan(/\X/) 😉
(@nurse implemented grapheme extended cluster on 2.4! )

https://bugs.ruby-lang.org/issues/12831

def unpack_graphemes(string)
  string.scan(/\X/).map(&:codepoints)
end

See #26743 for more information about replacing AS::Multibyte with Ruby's feature.


I'm not unicode expert so I'm very happy if you review this PR carefully 🙏

# GB13. [^RI] (RI RI)* RI X RI
elsif codepoints[marker..pos].all? { |c| database.boundary[:regional_indicator] === c } && codepoints[marker..pos].count { |c| database.boundary[:regional_indicator] === c }.even?
false
# GB999. Any ÷ Any
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have test coverage demonstrating these grapheme cluster boundary rules?

Copy link
Contributor Author

@mtsmfm mtsmfm Jan 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that http://www.unicode.org/Public/9.0.0/ucd/auxiliary/GraphemeBreakTest.txt covers all lines of unpack_graphemes:

https://s3-ap-northeast-1.amazonaws.com/mtsmfm/rails/27822/coverage/index.html#40076a266c43e5b028d6d14e50d80fabb410f3b5

Following diff is how I got the coverage:

diff --git a/Gemfile b/Gemfile
index 2a42a4aea0..2aaf64360a 100644
--- a/Gemfile
+++ b/Gemfile
@@ -151,3 +151,5 @@ end
 gem "ibm_db" if ENV["IBM_DB"]
 gem "tzinfo-data", platforms: [:mingw, :mswin, :x64_mingw, :jruby]
 gem "wdm", ">= 0.1.0", platforms: [:mingw, :mswin, :x64_mingw, :mswin64]
+
+gem "simplecov"
diff --git a/activesupport/cov.rb b/activesupport/cov.rb
new file mode 100644
index 0000000000..f5911a16c3
--- /dev/null
+++ b/activesupport/cov.rb
@@ -0,0 +1,7 @@
+$LOAD_PATH << "test"
+
+require "simplecov"
+
+SimpleCov.start
+
+load "test/multibyte_grapheme_break_conformance_test.rb"
$ cd activesupport
$ bundle exec ruby cov.rb

unless File.exist?(filename)
$stderr.puts "Downloading #{url.split('/').last}"
FileUtils.mkdir_p(File.dirname(filename))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

* Updated Unicode version to 9.0.0

*Fumiaki MATSUSHIMA*

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth noting how this affects grapheme cluster boundaries.

Notably, this will change string.mb_chars.grapheme_length and string.mb_chars.reverse—perhaps fixing them to work with clusters like 👩‍👩‍👧‍👦!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some notes 😄

@jeremy jeremy added this to the 5.1.0 milestone Jan 27, 2017
9.0.0 was released on June 21, 2016

http://blog.unicode.org/2016/06/announcing-unicode-standard-version-90.html

http://www.unicode.org/versions/Unicode9.0.0/

There are some changes about grapheme cluster in Unicode 9.0.0:

http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules

------------

I noticed that `unpack_graphemes` returns [Other] when the argument is Other ÷ Prepend
(it must be [Other, Prepend]).
But in [Unicode 8.0.0's Prepend has no characters](http://www.unicode.org/reports/tr29/tr29-27.html#Prepend)
so we don't have to backport following patch:

```diff
should_break =
+ if pos == eoc
+   true
```
@amatsuda
Copy link
Member

@jeremy FYI I asked @mtsmfm to work on this because I wanted to include this in 5.1, and to me this looks all right.

Copy link
Member

@rafaelfranca rafaelfranca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r? @jeremy

"👩‍👩‍👧‍👦".mb_chars.reverse # => "👩‍👩‍👧‍👦"

*Fumiaki MATSUSHIMA*

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@jeremy jeremy merged commit 1fc9955 into rails:master Jan 30, 2017
@mtsmfm mtsmfm deleted the bump-unicode-version branch February 14, 2017 10:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants