Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Unicode Version to 9.0.0 #27822

Merged
merged 1 commit into from Jan 30, 2017
Merged

Conversation

mtsmfm
Copy link
Contributor

@mtsmfm mtsmfm commented Jan 27, 2017

Summary

9.0.0 was released on June 21, 2016

http://blog.unicode.org/2016/06/announcing-unicode-standard-version-90.html

http://www.unicode.org/versions/Unicode9.0.0/

There are some changes about grapheme cluster in Unicode 9.0.0:

http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules

Other Information

I noticed that unpack_graphemes will return [Other] when the argument is Other 梅 Prepend
(it must be [Other, Prepend]).
But in Unicode 8.0.0's Prepend has no characters
so we don't have to backport following patch:

should_break =
+ if pos == eoc
+   true
# GB3. CR X LF

If we support Ruby 2.4 only, we can simply replace unpack_graphemes with Ruby's scan(/\X/) 馃槈
(@nurse implemented grapheme extended cluster on 2.4! )

https://bugs.ruby-lang.org/issues/12831

def unpack_graphemes(string)
  string.scan(/\X/).map(&:codepoints)
end

See #26743 for more information about replacing AS::Multibyte with Ruby's feature.


I'm not unicode expert so I'm very happy if you review this PR carefully 馃檹

@mtsmfm mtsmfm force-pushed the bump-unicode-version branch 3 times, most recently from 0203f1f to 21081f0 Compare Jan 27, 2017
# GB13. [^RI] (RI RI)* RI X RI
elsif codepoints[marker..pos].all? { |c| database.boundary[:regional_indicator] === c } && codepoints[marker..pos].count { |c| database.boundary[:regional_indicator] === c }.even?
false
# GB999. Any 梅 Any
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have test coverage demonstrating these grapheme cluster boundary rules?

Copy link
Contributor Author

@mtsmfm mtsmfm Jan 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that http://www.unicode.org/Public/9.0.0/ucd/auxiliary/GraphemeBreakTest.txt covers all lines of unpack_graphemes:

https://s3-ap-northeast-1.amazonaws.com/mtsmfm/rails/27822/coverage/index.html#40076a266c43e5b028d6d14e50d80fabb410f3b5

Following diff is how I got the coverage:

diff --git a/Gemfile b/Gemfile
index 2a42a4aea0..2aaf64360a 100644
--- a/Gemfile
+++ b/Gemfile
@@ -151,3 +151,5 @@ end
 gem "ibm_db" if ENV["IBM_DB"]
 gem "tzinfo-data", platforms: [:mingw, :mswin, :x64_mingw, :jruby]
 gem "wdm", ">= 0.1.0", platforms: [:mingw, :mswin, :x64_mingw, :mswin64]
+
+gem "simplecov"
diff --git a/activesupport/cov.rb b/activesupport/cov.rb
new file mode 100644
index 0000000000..f5911a16c3
--- /dev/null
+++ b/activesupport/cov.rb
@@ -0,0 +1,7 @@
+$LOAD_PATH << "test"
+
+require "simplecov"
+
+SimpleCov.start
+
+load "test/multibyte_grapheme_break_conformance_test.rb"
$ cd activesupport
$ bundle exec ruby cov.rb

unless File.exist?(filename)
$stderr.puts "Downloading #{url.split('/').last}"
FileUtils.mkdir_p(File.dirname(filename))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

馃憤

* Updated Unicode version to 9.0.0

*Fumiaki MATSUSHIMA*

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth noting how this affects grapheme cluster boundaries.

Notably, this will change string.mb_chars.grapheme_length and string.mb_chars.reverse鈥攑erhaps fixing them to work with clusters like 馃懇鈥嶐煈┾嶐煈р嶐煈!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some notes 馃槃

@jeremy jeremy added this to the 5.1.0 milestone Jan 27, 2017
9.0.0 was released on June 21, 2016

http://blog.unicode.org/2016/06/announcing-unicode-standard-version-90.html

http://www.unicode.org/versions/Unicode9.0.0/

There are some changes about grapheme cluster in Unicode 9.0.0:

http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules

------------

I noticed that `unpack_graphemes` returns [Other] when the argument is Other 梅 Prepend
(it must be [Other, Prepend]).
But in [Unicode 8.0.0's Prepend has no characters](http://www.unicode.org/reports/tr29/tr29-27.html#Prepend)
so we don't have to backport following patch:

```diff
should_break =
+ if pos == eoc
+   true
```
@amatsuda
Copy link
Member

@jeremy FYI I asked @mtsmfm to work on this because I wanted to include this in 5.1, and to me this looks all right.

Copy link
Member

@rafaelfranca rafaelfranca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r? @jeremy

jeremy
jeremy approved these changes Jan 30, 2017
"馃懇鈥嶐煈┾嶐煈р嶐煈".mb_chars.reverse # => "馃懇鈥嶐煈┾嶐煈р嶐煈"

*Fumiaki MATSUSHIMA*

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

鉂わ笍

@jeremy jeremy merged commit 1fc9955 into rails:master Jan 30, 2017
@mtsmfm mtsmfm deleted the bump-unicode-version branch Feb 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants