Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Unicode Version to 9.0.0 #27822

Merged
merged 1 commit into from Jan 30, 2017
Merged

Update Unicode Version to 9.0.0 #27822

merged 1 commit into from Jan 30, 2017

Conversation

@mtsmfm
Copy link
Contributor

@mtsmfm mtsmfm commented Jan 27, 2017

Summary

9.0.0 was released on June 21, 2016

http://blog.unicode.org/2016/06/announcing-unicode-standard-version-90.html

http://www.unicode.org/versions/Unicode9.0.0/

There are some changes about grapheme cluster in Unicode 9.0.0:

http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules

Other Information

I noticed that unpack_graphemes will return [Other] when the argument is Other 梅 Prepend
(it must be [Other, Prepend]).
But in Unicode 8.0.0's Prepend has no characters
so we don't have to backport following patch:

should_break =
+ if pos == eoc
+   true
# GB3. CR X LF

If we support Ruby 2.4 only, we can simply replace unpack_graphemes with Ruby's scan(/\X/) 馃槈
(@nurse implemented grapheme extended cluster on 2.4! )

https://bugs.ruby-lang.org/issues/12831

def unpack_graphemes(string)
  string.scan(/\X/).map(&:codepoints)
end

See #26743 for more information about replacing AS::Multibyte with Ruby's feature.


I'm not unicode expert so I'm very happy if you review this PR carefully 馃檹

@mtsmfm mtsmfm force-pushed the mtsmfm:bump-unicode-version branch 3 times, most recently Jan 27, 2017
# GB13. [^RI] (RI RI)* RI X RI
elsif codepoints[marker..pos].all? { |c| database.boundary[:regional_indicator] === c } && codepoints[marker..pos].count { |c| database.boundary[:regional_indicator] === c }.even?
false
# GB999. Any 梅 Any

This comment has been minimized.

@jeremy

jeremy Jan 27, 2017
Member

Do we have test coverage demonstrating these grapheme cluster boundary rules?

This comment has been minimized.

@mtsmfm

mtsmfm Jan 28, 2017
Author Contributor

It seems that http://www.unicode.org/Public/9.0.0/ucd/auxiliary/GraphemeBreakTest.txt covers all lines of unpack_graphemes:

https://s3-ap-northeast-1.amazonaws.com/mtsmfm/rails/27822/coverage/index.html#40076a266c43e5b028d6d14e50d80fabb410f3b5

Following diff is how I got the coverage:

diff --git a/Gemfile b/Gemfile
index 2a42a4aea0..2aaf64360a 100644
--- a/Gemfile
+++ b/Gemfile
@@ -151,3 +151,5 @@ end
 gem "ibm_db" if ENV["IBM_DB"]
 gem "tzinfo-data", platforms: [:mingw, :mswin, :x64_mingw, :jruby]
 gem "wdm", ">= 0.1.0", platforms: [:mingw, :mswin, :x64_mingw, :mswin64]
+
+gem "simplecov"
diff --git a/activesupport/cov.rb b/activesupport/cov.rb
new file mode 100644
index 0000000000..f5911a16c3
--- /dev/null
+++ b/activesupport/cov.rb
@@ -0,0 +1,7 @@
+$LOAD_PATH << "test"
+
+require "simplecov"
+
+SimpleCov.start
+
+load "test/multibyte_grapheme_break_conformance_test.rb"
$ cd activesupport
$ bundle exec ruby cov.rb
unless File.exist?(filename)
$stderr.puts "Downloading #{url.split('/').last}"
FileUtils.mkdir_p(File.dirname(filename))

This comment has been minimized.

@jeremy

jeremy Jan 27, 2017
Member

馃憤

activesupport/CHANGELOG.md Outdated
* Updated Unicode version to 9.0.0

*Fumiaki MATSUSHIMA*

This comment has been minimized.

@jeremy

jeremy Jan 27, 2017
Member

Worth noting how this affects grapheme cluster boundaries.

Notably, this will change string.mb_chars.grapheme_length and string.mb_chars.reverse鈥攑erhaps fixing them to work with clusters like 馃懇鈥嶐煈┾嶐煈р嶐煈!

This comment has been minimized.

@mtsmfm

mtsmfm Jan 28, 2017
Author Contributor

I added some notes 馃槃

@jeremy jeremy removed the needs feedback label Jan 27, 2017
@jeremy jeremy added this to the 5.1.0 milestone Jan 27, 2017
9.0.0 was released on June 21, 2016

http://blog.unicode.org/2016/06/announcing-unicode-standard-version-90.html

http://www.unicode.org/versions/Unicode9.0.0/

There are some changes about grapheme cluster in Unicode 9.0.0:

http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules

------------

I noticed that `unpack_graphemes` returns [Other] when the argument is Other 梅 Prepend
(it must be [Other, Prepend]).
But in [Unicode 8.0.0's Prepend has no characters](http://www.unicode.org/reports/tr29/tr29-27.html#Prepend)
so we don't have to backport following patch:

```diff
should_break =
+ if pos == eoc
+   true
```
@mtsmfm mtsmfm force-pushed the mtsmfm:bump-unicode-version branch to bdcfdef Jan 28, 2017
@amatsuda
Copy link
Member

@amatsuda amatsuda commented Jan 28, 2017

@jeremy FYI I asked @mtsmfm to work on this because I wanted to include this in 5.1, and to me this looks all right.

Copy link
Member

@rafaelfranca rafaelfranca left a comment

r? @jeremy

@jeremy
jeremy approved these changes Jan 30, 2017
"馃懇鈥嶐煈┾嶐煈р嶐煈".mb_chars.reverse # => "馃懇鈥嶐煈┾嶐煈р嶐煈"

*Fumiaki MATSUSHIMA*

This comment has been minimized.

@jeremy

jeremy Jan 30, 2017
Member

鉂わ笍

@jeremy jeremy merged commit 1fc9955 into rails:master Jan 30, 2017
2 checks passed
2 checks passed
codeclimate no new or fixed issues
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@mtsmfm mtsmfm deleted the mtsmfm:bump-unicode-version branch Feb 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can鈥檛 perform that action at this time.