Remove `AS::Multibyte`'s unicode table #26743

mtsmfm · 2016-10-09T11:50:07Z

Summary

@amatsuda pointed out that AS::Multibyte carries a big Unicode table and that Ruby may already provide similar features (https://speakerdeck.com/a_matsuda/3x-rails?slide=156).

So I investigated and found that we can remove the table if we accept the following changes.

- The supported Unicode version will depend on Ruby
  - Ruby 2.2 supports Unicode 7.0.0
  - Ruby 2.3 supports Unicode 8.0.0
  - Ruby 2.4 supports Unicode 9.0.0
- Unicode case mappings don't work until Ruby 2.4
- ~~Grapheme clusters don't work perfectly~~ Ruby 2.4 works perfectly! https://bugs.ruby-lang.org/issues/12831

If we merge this PR, we'll get the following benefits:

- Reduced memory usage (about 1 MB)
- Reduced repo/gem size (about 1 MB)
- Fewer lines of code (about 400 lines)

Below I explain how I tried to remove the table and why doing so causes the changes listed above.

AS::Multibyte::Unicode's features fall into 4 main groups.

1. normalize
   - normalize
   - compose
   - decompose
   - reorder_characters (this is a public method, but it is probably only used by compose/decompose)
2. case mapping
   - upcase
   - downcase
   - swapcase
3. pack/unpack grapheme
   - pack_graphemes
   - unpack_graphemes
4. tidy bytes
   - tidy_bytes

Groups 1 to 3 use the Unicode database, so let's replace them with Ruby's own features.

normalize

Since 2.2, Ruby has String#unicode_normalize.

https://www.ruby-lang.org/en/news/2014/12/25/ruby-2-2-0-released/
https://bugs.ruby-lang.org/issues/10084

It takes similar options, so it seems we can simply replace our implementation with it.
(I tried this in #26403.)
However, Ruby 2.2 supports Unicode 7.0.0 and Ruby 2.3 supports 8.0.0.
AS::Multibyte::Unicode has its own UNICODE_VERSION, but after this change the supported version would depend on the Ruby version.
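
To make the replacement concrete, here is a minimal sketch of String#unicode_normalize with the standard normalization forms. It assumes Ruby 2.2 or later; the sample strings and variable names are illustrative and do not come from the Rails test suite.

```ruby
# Illustrative inputs (not taken from the Rails tests).
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
composed   = "\u00E9"   # precomposed "é"
ligature   = "\uFB01"   # LATIN SMALL LIGATURE FI

# :nfc composes, :nfd decomposes, :nfkc/:nfkd also fold compatibility characters.
nfc  = decomposed.unicode_normalize(:nfc)
nfd  = composed.unicode_normalize(:nfd)
nfkc = ligature.unicode_normalize(:nfkc)

p nfc == composed  # => true
p nfd.codepoints   # => [101, 769]
p nfkc             # => "fi"
```

Which form a given AS::Multibyte caller should map to is left open here; the sketch only shows what the Ruby method itself does.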

case mappings

Ruby 2.4 supports Unicode case mappings.
But Ruby 2.2 and 2.3 don't.
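
As a small illustration, the following sketch shows the core String methods handling non-ASCII letters. It assumes Ruby 2.4 or later (on 2.2 and 2.3 these methods only change ASCII letters); the sample words are illustrative.

```ruby
# Requires Ruby 2.4+; earlier versions leave non-ASCII letters untouched.
word = "\u00FCber"  # "über"

upper   = word.upcase                       # full Unicode uppercase
lower   = "\u00C0\u00C9\u00CE".downcase     # "ÀÉÎ" lowercased
swapped = "\u00DCber".swapcase              # "Über" with case swapped

p upper    # => "ÜBER"
p lower    # => "àéî"
p swapped  # => "üBER"
```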

pack/unpack grapheme

Ruby changed its regexp engine in 2.0, so we can split graphemes with /\X/.
(e.g. "\u304B\u3099\u304C\u3099\u304E\u3099".scan(/\X/))

~~But its grapheme clusters are not "true" grapheme clusters.~~
~~So some tests will fail.~~
@nurse fixed this in Ruby 2.4!
https://bugs.ruby-lang.org/issues/12831
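
A rough sketch of how the pack/unpack pair could be expressed on top of /\X/. The two method names mirror the AS::Multibyte ones but are hypothetical stand-ins written for this example, not the code in this PR.

```ruby
# Hypothetical stand-in: split a string into grapheme clusters,
# each represented as an array of codepoints.
def unpack_graphemes(string)
  string.scan(/\X/).map(&:codepoints)
end

# Hypothetical stand-in: the reverse operation, back to a UTF-8 string.
def pack_graphemes(clusters)
  clusters.flatten.pack("U*")
end

text = "\u304B\u3099\u304C\u3099\u304E\u3099"
clusters = unpack_graphemes(text)

p clusters.length                   # => 3
p pack_graphemes(clusters) == text  # => true
```

Each base character stays together with its combining mark, so the six codepoints form three clusters.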

Memory

The Unicode database consumes about 1.5 MB of memory.

require 'active_support/core_ext/string/multibyte'
require 'objspace'

# load unicode database
''.mb_chars.upcase

db = ActiveSupport::Multibyte::Unicode.send(:database)
db_memsize = db.instance_variables.map do |ivar|
  ObjectSpace.memsize_of(db.instance_variable_get(ivar))
end.inject(:+)

puts "db_memsize: #{db_memsize} Bytes"
#=> db_memsize: 1501184 Bytes

Ruby's Unicode database is smaller than AS::Multibyte::Unicode::UnicodeDatabase:

require 'objspace'

# load unicode database
''.unicode_normalize

db_memsize = UnicodeNormalize.constants.map do |const|
  ObjectSpace.memsize_of(UnicodeNormalize.const_get(const))
end.inject(:+)

puts "db_memsize: #{db_memsize} Bytes"
#=> db_memsize: 404558 Bytes

rails-bot · 2016-10-09T11:50:19Z

r? @kaspth

(@rails-bot has picked a reviewer for you, use r? to override)

kaspth · 2016-10-09T14:59:59Z

r? @amatsuda

rafaelfranca · 2016-10-09T22:43:20Z

While I like the idea, I don't think we should remove features for the
supported Ruby versions because of 1 MB of memory. I can see us merging
it when we drop support for Ruby < 2.4, but while we need to support 2.2 and
2.3 I think this extra memory usage is fine.

amatsuda · 2016-10-10T02:46:32Z

@mtsmfm Thank you for the PR and summary! This is very well done!
I suppose another benefit of this change is a boot time speedup of maybe 100-200 ms from not loading and parsing the 1 MB .dat file.

@rafaelfranca Agreed. Actually we (@mtsmfm and I) already talked about that in person, so he understands that this PR might not be fully merged right now (and maybe we need to wait until Rails 6).
But still, he wanted to show us the goal of this small project, and I think he did it perfectly :)

So the next step should be to find out what we can cherry-pick to 5.1 branch from this huge PR.

mtsmfm · 2016-10-11T16:13:40Z

@amatsuda Thank you for following up!

> So the next step should be to find out what we can cherry-pick to 5.1 branch from this huge PR.

I think it is difficult to cherry-pick anything,
because AS::Multibyte::Unicode has its own UNICODE_VERSION.
So we must either replace all methods with Ruby's features or do nothing 😢

Should I close this PR for now, or keep it open until Rails 6?

jeremy · 2017-02-13T02:06:20Z

Since the UnicodeDatabase lazy loads, perhaps we could leave it as-is for older Rubies but start switching to native implementation for Ruby 2.4+?
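
One way to read this suggestion, sketched under assumptions: dispatch on the running Ruby version and keep the table-based path only for older Rubies. The module name, constant, and legacy_upcase placeholder below are hypothetical and exist only for this illustration.

```ruby
# Hypothetical sketch of version-based dispatch; not code from this PR.
module UnicodeBackendSketch
  # True when the running Ruby has native Unicode case mapping (2.4+).
  NATIVE_CASE_MAPPING =
    Gem::Version.new(RUBY_VERSION) >= Gem::Version.new("2.4.0")

  def self.upcase(string)
    if NATIVE_CASE_MAPPING
      string.upcase          # native implementation
    else
      legacy_upcase(string)  # would use the lazily loaded table
    end
  end

  # Placeholder for the existing table-based implementation.
  def self.legacy_upcase(_string)
    raise NotImplementedError, "table-based path is omitted in this sketch"
  end
end

p UnicodeBackendSketch.upcase("\u00FCber")
```

On Ruby 2.4 or later the table is never touched, so the lazily loaded database would stay unloaded.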

mtsmfm · 2017-02-14T15:56:20Z

@jeremy Sorry, I don't understand what "switching to native implementation for Ruby 2.4+" means 🙇‍♂️
Do you mean "we should implement a new API such as AS::Multibyte2 and deprecate AS::Multibyte" ❓

> Since the UnicodeDatabase lazy loads

Ruby 2.4+ also lazy-loads its Unicode normalization database.
https://github.com/ruby/ruby/blob/v2_4_0/lib/unicode_normalize.rb#L33
(And I can't find where Ruby loads the case mapping database 💦)

mtsmfm · 2018-02-19T18:37:17Z

I think it's time to merge because Rails 6 requires Ruby 2.4+

mtsmfm · 2018-02-19T18:38:10Z

@jeremy @amatsuda Could you review?

rafaelfranca · 2018-02-19T18:43:25Z

Can you also remove bin/generate_tables?

rafaelfranca · 2018-02-19T18:46:36Z

And the unicode_tables.dat file

mtsmfm · 2018-02-19T19:00:08Z

Oops, removed and force pushed!

jeremy · 2018-02-19T19:41:44Z

🎉🎉🎉

amatsuda · 2018-02-19T21:47:57Z

👍 👍 👍

mtsmfm · 2018-02-20T01:23:01Z

😂

matthewd · 2018-02-20T01:24:43Z

🎉

Should we consider deprecating some of these APIs now? I see a lot are trivial wrappers for core methods.

rafaelfranca · 2018-02-20T05:17:44Z

👍 to deprecate some of those wrappers

frodsan · 2018-10-09T09:24:42Z

activesupport/lib/active_support/multibyte/unicode.rb

      end

      def upcase(string)
-        apply_mapping string, :uppercase_mapping
+        string.upcase
      end


@mtsmfm Hi! Do you think it's safe to do the same for #capitalize? The tests don't fail after I change it:

--- a/activesupport/lib/active_support/multibyte/chars.rb
+++ b/activesupport/lib/active_support/multibyte/chars.rb
@@ -145,7 +145,7 @@ def swapcase
       #
       #   'über'.mb_chars.capitalize.to_s # => "Über"
       def capitalize
-        (slice(0) || chars("")).upcase + (slice(1..-1) || chars("")).downcase
+        chars(@wrapped_string.capitalize)
       end

mtsmfm · 2018-10-10

I've just noticed that the result for "\uFB03" will be different...

frodsan · 2018-10-10

OK, thanks! :)

mtsmfm · 2018-10-10

Let me think about it for a while.

mtsmfm · 2018-10-10T17:20:38Z

@jeremy @rafaelfranca @amatsuda

Sorry, I've found this change breaks previous behavior in some edge cases 🙇

require "bundler/inline"

gemfile(true) do
  ruby "> 2.4.0"
  source "https://rubygems.org"
  gem "activesupport", "5.2.1"
end

require "active_support/all"

str = "\uFB03"
as_result = ActiveSupport::Multibyte::Unicode.upcase(str)
mri_result = str.upcase

puts as_result # => ﬃ
puts mri_result # => FFI

p as_result.codepoints # => [64259]
p mri_result.codepoints # => [70, 70, 73]

It seems the previous implementation doesn't follow Unicode special casing, but Ruby does.

I think the old behavior is a bug and Ruby is correct, so it should be OK to change this behavior.
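
For context, special casing covers characters whose case mapping expands to more than one character. A minimal sketch, assuming Ruby 2.4 or later; the second example character is illustrative and is not discussed elsewhere in this thread.

```ruby
# Requires Ruby 2.4+ for Unicode-aware String#upcase.
ligature = "\uFB03"  # LATIN SMALL LIGATURE FFI
sharp_s  = "\u00DF"  # LATIN SMALL LETTER SHARP S

upcased_ligature = ligature.upcase
upcased_sharp_s  = sharp_s.upcase

p upcased_ligature         # => "FFI"
p upcased_ligature.length  # => 3
p upcased_sharp_s          # => "SS"
```

A one-to-one codepoint mapping cannot produce these results, because the output is longer than the input.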

What do you think?

rafaelfranca · 2018-10-10T19:16:25Z

It seems fine to me too.

jeremy · 2018-10-10T19:27:15Z

Agreed! Following Ruby is what we intend to do, so this edge case is a good example of why 🙏

mtsmfm · 2018-10-11T01:47:05Z

So then, I think it's also ok to change Chars#capitalize @frodsan

#26743 (comment)

rails-bot assigned kaspth Oct 9, 2016

maclover7 added activesupport needs feedback labels Oct 9, 2016

rails-bot assigned amatsuda and unassigned kaspth Oct 9, 2016

mtsmfm mentioned this pull request Jan 27, 2017

Update Unicode Version to 9.0.0 #27822

Merged

mtsmfm force-pushed the remove-unicode-table branch from 62a3025 to eebf615 Compare January 27, 2017 15:45

amatsuda mentioned this pull request Feb 2, 2017

String#truncate_bytes #27319

Merged

mtsmfm mentioned this pull request Feb 19, 2017

Switch to native unicode implementation for ruby 2.4+ #28067

Closed

janlelis referenced this pull request in janlelis/uniscribe Apr 19, 2017

Do not oficially support JRuby

3baa741

janlelis mentioned this pull request Apr 20, 2017

Support grapheme detection via \X jruby/jruby#4568

Closed

mtsmfm added a commit to mtsmfm/rails that referenced this pull request Jun 28, 2017

Add backend for unicode implementation

8553236

This is first step toward deprecating non-native unicode implementation ref: - rails#26743 (comment) - rails#28067 (review)

mtsmfm mentioned this pull request Feb 17, 2018

Rails 6 requires Ruby 2.3+ #32028

Merged

jeremy added this to the 6.0.0 milestone Feb 17, 2018

mtsmfm force-pushed the remove-unicode-table branch 2 times, most recently from 6ab14e4 to c459d0a Compare February 19, 2018 16:37

mtsmfm force-pushed the remove-unicode-table branch from c459d0a to 8ffe42b Compare February 19, 2018 17:58

Remove AS::Multibyte's unicode table

c1d00d6

mtsmfm force-pushed the remove-unicode-table branch from 8ffe42b to c1d00d6 Compare February 19, 2018 18:58

rafaelfranca merged commit ffddaea into rails:master Feb 19, 2018

mtsmfm deleted the remove-unicode-table branch February 20, 2018 01:23

alimi mentioned this pull request Oct 3, 2018

String#parameterize raises Encoding::CompatibilityError for non-Unicode Strings on master #34062

Closed

jeremy mentioned this pull request Oct 8, 2018

Deprecate Unicode#downcase/upcase/swapcase. #34123

Merged

frodsan reviewed Oct 9, 2018

View reviewed changes

cpruitt mentioned this pull request Jul 14, 2019

Restore ActiveSupport::Inflector.parameterize support for non-utf-8 strings #36678

Closed
