New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove `AS::Multibyte`'s unicode table #26743

Merged
merged 1 commit into from Feb 19, 2018

Conversation

Projects
None yet
9 participants
@mtsmfm
Contributor

mtsmfm commented Oct 9, 2016

Summary

@amatsuda tells us AS::Multibyte has big Unicode table and ruby may have similar feature (https://speakerdeck.com/a_matsuda/3x-rails?slide=156).

So I investigated and noticed that we can remove the table if we accept following changes.

  • Supported Unicode version will depend on Ruby
    • Ruby 2.2 supports Unicode 7.0.0
    • Ruby 2.3 supports Unicode 8.0.0
    • Ruby 2.4 supports Unicode 9.0.0
  • Unicode case mappings doesn't work until Ruby 2.4
  • Grapheme doesn't work perfect Ruby 2.4 works perfect! https://bugs.ruby-lang.org/issues/12831

If we merge this PR, we'll have following benefits:

  • Reduce memory usage(about 1M)
  • Reduce repo/gem size(about 1M)
  • Reduce code lines(about 400 lines)

I'll show you the way I tried to remove the table and why the changes will cause.


AS::Multibyte::Unicode's features are divided into 4 main groups.

  1. normalize
  2. case mapping
  3. pack/unpack grapheme
  4. tidy bytes

1 ~ 3 use unicode database.
So let remove them.

normalize

Since 2.2, Ruby has String#unicode_normalize.

https://www.ruby-lang.org/en/news/2014/12/25/ruby-2-2-0-released/
https://bugs.ruby-lang.org/issues/10084

It has similar option so it seems that we can simply replace it.
(I tried #26403)
But Ruby 2.2 supports Unicode 7.0.0 and Ruby 2.3 supports 8.0.0.
AS::Multibyte::Unicode has its own UNICODE_VERSION but now it depends on Ruby's version.

case mappings

Ruby 2.4 supports Unicode case mappings.
But Ruby 2.2 and 2.3 don't.

pack/unpack grapheme

Since 2.0, Ruby changed its regexp engine.
So we can split grapheme with /X/.
(ex. "\u304B\u3099\u304C\u3099\u304E\u3099".scan(/\X/))

But it's grapheme cluster is not "true".
So some tests will fail.
@nurse fixed on Ruby 2.4!
https://bugs.ruby-lang.org/issues/12831

Memory

Unicode database consume 1.5 M memory.

require 'active_support/core_ext/string/multibyte'
require 'objspace'

# load unicode database
''.mb_chars.upcase

db = ActiveSupport::Multibyte::Unicode.send(:database)
db_memsize = db.instance_variables.map do |ivar|
  ObjectSpace.memsize_of(db.instance_variable_get(ivar))
end.inject(:+)

puts "db_memsize: #{db_memsize} Bytes"
#=> db_memsize: 1501184 Bytes

Ruby's Unicode database is smaller than AS::Multibyte::Unicode::UnicodeDatabase

require 'objspace'

# load unicode database
''.unicode_normalize

db_memsize = UnicodeNormalize.constants.map do |const|
  ObjectSpace.memsize_of(UnicodeNormalize.const_get(const))
end.inject(:+)

puts "db_memsize: #{db_memsize} Bytes"
#=> db_memsize: 404558 Bytes
@rails-bot

This comment has been minimized.

rails-bot commented Oct 9, 2016

r? @kaspth

(@rails-bot has picked a reviewer for you, use r? to override)

@kaspth

This comment has been minimized.

Member

kaspth commented Oct 9, 2016

@rails-bot rails-bot assigned amatsuda and unassigned kaspth Oct 9, 2016

@rafaelfranca

This comment has been minimized.

Member

rafaelfranca commented Oct 9, 2016

While I like the idea but I don't think we should remove some features for
the supported Ruby versions because of 1mb of memory. I can see we merging
it when we drop support to Ruby < 2.4 but while we need to support 2.2 and
2.3 I think this extra memory usage is fine
On Sun, 9 Oct 2016 at 12:00 Kasper Timm Hansen notifications@github.com
wrote:

r? @amatsuda https://github.com/amatsuda


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#26743 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAC66D5WNs0IpirX9kii_Qw-9l2DvBJWks5qyQFzgaJpZM4KR-2B
.

@amatsuda

This comment has been minimized.

Member

amatsuda commented Oct 10, 2016

@mtsmfm Thank you for the PR and summary! This is very well done!
I suppose another benefit that we can get from this change is maybe 100-200ms boot time speed up for not loading and parsing 1MB .dat file.

@rafaelfranca Agreed. Actually we (@mtsmfm and I) already talked about that in person, so he understands that this PR might not be fully merged right now (and maybe we need to wait until Rails 6).
But still, he wanted to show us the goal of this small project, and I think he did it perfectly :)

So the next step should be to find out what we can cherry-pick to 5.1 branch from this huge PR.

@mtsmfm

This comment has been minimized.

Contributor

mtsmfm commented Oct 11, 2016

@amatsuda Thank you for your following up!

So the next step should be to find out what we can cherry-pick to 5.1 branch from this huge PR.

I think it is difficult to cherry-pick
because AS::Multibyte::Unicode has its own UNICODE_VERSION.
So we must replace all methods with Ruby's feature or do nothing 😢

Should I close this PR at this time or keep opening until rails 6?

@jeremy

This comment has been minimized.

Member

jeremy commented Feb 13, 2017

Since the UnicodeDatabase lazy loads, perhaps we could leave it as-is for older Rubies but start switching to native implementation for Ruby 2.4+?

@mtsmfm

This comment has been minimized.

Contributor

mtsmfm commented Feb 14, 2017

@jeremy Sorry, I don't understand what "switching to native implementation for Ruby 2.4+" means 🙇
Do you mean "we should implement new API such like AS::Multibyte2 and deprecate AS::Multibyte

Since the UnicodeDatabase lazy loads

Ruby 2.4+ also lazy loads unicode normalize database.
https://github.com/ruby/ruby/blob/v2_4_0/lib/unicode_normalize.rb#L33
(And I can't find where ruby loads case mapping database 💦 )

janlelis referenced this pull request in janlelis/uniscribe Apr 19, 2017

mtsmfm added a commit to mtsmfm/rails that referenced this pull request Jun 28, 2017

Add backend for unicode implementation
This is first step toward deprecating non-native unicode implementation

ref:

- rails#26743 (comment)
- rails#28067 (review)

@jeremy jeremy added this to the 6.0.0 milestone Feb 17, 2018

@mtsmfm

This comment has been minimized.

Contributor

mtsmfm commented Feb 19, 2018

I think it's time to merge because Rails 6 requires Ruby 2.4+

@mtsmfm

This comment has been minimized.

Contributor

mtsmfm commented Feb 19, 2018

@jeremy @amatsuda Could you review?

@rafaelfranca

This comment has been minimized.

Member

rafaelfranca commented Feb 19, 2018

Can you also remove bin/generate_tables?

@rafaelfranca

This comment has been minimized.

Member

rafaelfranca commented Feb 19, 2018

And the unicode_tables.dat file

@mtsmfm

This comment has been minimized.

Contributor

mtsmfm commented Feb 19, 2018

Oops, removed and force pushed!

@rafaelfranca rafaelfranca merged commit ffddaea into rails:master Feb 19, 2018

2 checks passed

codeclimate All good!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jeremy

This comment has been minimized.

Member

jeremy commented Feb 19, 2018

🎉🎉🎉

@amatsuda

This comment has been minimized.

Member

amatsuda commented Feb 19, 2018

👍 👍 👍

@mtsmfm

This comment has been minimized.

Contributor

mtsmfm commented Feb 20, 2018

😂

@mtsmfm mtsmfm deleted the mtsmfm:remove-unicode-table branch Feb 20, 2018

@matthewd

This comment has been minimized.

Member

matthewd commented Feb 20, 2018

🎉

Should we consider deprecating some of these APIs now? I see a lot are trivial wrappers for core methods.

@rafaelfranca

This comment has been minimized.

Member

rafaelfranca commented Feb 20, 2018

👍 to deprecate some of those wrappers

end
def upcase(string)
apply_mapping string, :uppercase_mapping
string.upcase
end

This comment has been minimized.

@frodsan

frodsan Oct 9, 2018

Contributor

@mtsmfm Hi! Do you think it's safe to the the same for #capitalize? The tests don't fail after I change it:

--- a/activesupport/lib/active_support/multibyte/chars.rb
+++ b/activesupport/lib/active_support/multibyte/chars.rb
@@ -145,7 +145,7 @@ def swapcase
       #
       #  'über'.mb_chars.capitalize.to_s # => "Über"
       def capitalize
-        (slice(0) || chars("")).upcase + (slice(1..-1) || chars("")).downcase
+        chars(@wrapped_string.capitalize)
       end

This comment has been minimized.

@mtsmfm

mtsmfm Oct 10, 2018

Contributor

I've just noticed "\uFB03" will be different...

This comment has been minimized.

@frodsan

frodsan Oct 10, 2018

Contributor

ok, thanks! :)

This comment has been minimized.

@mtsmfm

mtsmfm Oct 10, 2018

Contributor

Let me think a while

@mtsmfm

This comment has been minimized.

Contributor

mtsmfm commented Oct 10, 2018

@jeremy @rafaelfranca @amatsuda

Sorry, I've found this change breaks previous behavior in some edge cases 🙇

require "bundler/inline"

gemfile(true) do
  ruby "> 2.4.0"
  source "https://rubygems.org"
  gem "activesupport", "5.2.1"
end

require "active_support/all"

str = "\uFB03"
as_result = ActiveSupport::Multibyte::Unicode.upcase(str)
mri_result = str.upcase

puts as_result # => ffi
puts mri_result # => FFI

p as_result.codepoints # => [64259]
p mri_result.codepoints # => [70, 70, 73]

It seems the previous implementation doesn't follow special casting but Ruby does.

I think it's a bug and Ruby is correct so it's ok to change this behavior though.

What do you think?

@rafaelfranca

This comment has been minimized.

Member

rafaelfranca commented Oct 10, 2018

It seems fine to me too.

@jeremy

This comment has been minimized.

Member

jeremy commented Oct 10, 2018

Agreed! Following Ruby is what we intend to do, so this edge case is a good example of why 🙏

@mtsmfm

This comment has been minimized.

Contributor

mtsmfm commented Oct 11, 2018

So then, I think it's also ok to change Chars#capitalize @frodsan

#26743 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment