Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove AS::Multibyte's unicode table #26743

Merged
merged 1 commit into from Feb 19, 2018

Conversation

mtsmfm
Copy link
Contributor

@mtsmfm mtsmfm commented Oct 9, 2016

Summary

@amatsuda tells us AS::Multibyte has big Unicode table and ruby may have similar feature (https://speakerdeck.com/a_matsuda/3x-rails?slide=156).

So I investigated and noticed that we can remove the table if we accept following changes.

  • Supported Unicode version will depend on Ruby
    • Ruby 2.2 supports Unicode 7.0.0
    • Ruby 2.3 supports Unicode 8.0.0
    • Ruby 2.4 supports Unicode 9.0.0
  • Unicode case mappings doesn't work until Ruby 2.4
  • Grapheme doesn't work perfect Ruby 2.4 works perfect! https://bugs.ruby-lang.org/issues/12831

If we merge this PR, we'll have following benefits:

  • Reduce memory usage(about 1M)
  • Reduce repo/gem size(about 1M)
  • Reduce code lines(about 400 lines)

I'll show you the way I tried to remove the table and why the changes will cause.


AS::Multibyte::Unicode's features are divided into 4 main groups.

  1. normalize
  2. case mapping
  3. pack/unpack grapheme
  4. tidy bytes

1 ~ 3 use unicode database.
So let remove them.

normalize

Since 2.2, Ruby has String#unicode_normalize.

https://www.ruby-lang.org/en/news/2014/12/25/ruby-2-2-0-released/
https://bugs.ruby-lang.org/issues/10084

It has similar option so it seems that we can simply replace it.
(I tried #26403)
But Ruby 2.2 supports Unicode 7.0.0 and Ruby 2.3 supports 8.0.0.
AS::Multibyte::Unicode has its own UNICODE_VERSION but now it depends on Ruby's version.

case mappings

Ruby 2.4 supports Unicode case mappings.
But Ruby 2.2 and 2.3 don't.

pack/unpack grapheme

Since 2.0, Ruby changed its regexp engine.
So we can split grapheme with /X/.
(ex. "\u304B\u3099\u304C\u3099\u304E\u3099".scan(/\X/))

But it's grapheme cluster is not "true".
So some tests will fail.
@nurse fixed on Ruby 2.4!
https://bugs.ruby-lang.org/issues/12831

Memory

Unicode database consume 1.5 M memory.

require 'active_support/core_ext/string/multibyte'
require 'objspace'

# load unicode database
''.mb_chars.upcase

db = ActiveSupport::Multibyte::Unicode.send(:database)
db_memsize = db.instance_variables.map do |ivar|
  ObjectSpace.memsize_of(db.instance_variable_get(ivar))
end.inject(:+)

puts "db_memsize: #{db_memsize} Bytes"
#=> db_memsize: 1501184 Bytes

Ruby's Unicode database is smaller than AS::Multibyte::Unicode::UnicodeDatabase

require 'objspace'

# load unicode database
''.unicode_normalize

db_memsize = UnicodeNormalize.constants.map do |const|
  ObjectSpace.memsize_of(UnicodeNormalize.const_get(const))
end.inject(:+)

puts "db_memsize: #{db_memsize} Bytes"
#=> db_memsize: 404558 Bytes

@rails-bot
Copy link

r? @kaspth

(@rails-bot has picked a reviewer for you, use r? to override)

@kaspth
Copy link
Contributor

kaspth commented Oct 9, 2016

r? @amatsuda

@rails-bot rails-bot assigned amatsuda and unassigned kaspth Oct 9, 2016
@rafaelfranca
Copy link
Member

While I like the idea but I don't think we should remove some features for
the supported Ruby versions because of 1mb of memory. I can see we merging
it when we drop support to Ruby < 2.4 but while we need to support 2.2 and
2.3 I think this extra memory usage is fine
On Sun, 9 Oct 2016 at 12:00 Kasper Timm Hansen notifications@github.com
wrote:

r? @amatsuda https://github.com/amatsuda


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#26743 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAC66D5WNs0IpirX9kii_Qw-9l2DvBJWks5qyQFzgaJpZM4KR-2B
.

@amatsuda
Copy link
Member

@mtsmfm Thank you for the PR and summary! This is very well done!
I suppose another benefit that we can get from this change is maybe 100-200ms boot time speed up for not loading and parsing 1MB .dat file.

@rafaelfranca Agreed. Actually we (@mtsmfm and I) already talked about that in person, so he understands that this PR might not be fully merged right now (and maybe we need to wait until Rails 6).
But still, he wanted to show us the goal of this small project, and I think he did it perfectly :)

So the next step should be to find out what we can cherry-pick to 5.1 branch from this huge PR.

@mtsmfm
Copy link
Contributor Author

mtsmfm commented Oct 11, 2016

@amatsuda Thank you for your following up!

So the next step should be to find out what we can cherry-pick to 5.1 branch from this huge PR.

I think it is difficult to cherry-pick
because AS::Multibyte::Unicode has its own UNICODE_VERSION.
So we must replace all methods with Ruby's feature or do nothing 😢

Should I close this PR at this time or keep opening until rails 6?

@jeremy
Copy link
Member

jeremy commented Feb 13, 2017

Since the UnicodeDatabase lazy loads, perhaps we could leave it as-is for older Rubies but start switching to native implementation for Ruby 2.4+?

@mtsmfm
Copy link
Contributor Author

mtsmfm commented Feb 14, 2017

@jeremy Sorry, I don't understand what "switching to native implementation for Ruby 2.4+" means 🙇‍♂️
Do you mean "we should implement new API such like AS::Multibyte2 and deprecate AS::Multibyte

Since the UnicodeDatabase lazy loads

Ruby 2.4+ also lazy loads unicode normalize database.
https://github.com/ruby/ruby/blob/v2_4_0/lib/unicode_normalize.rb#L33
(And I can't find where ruby loads case mapping database 💦 )

janlelis referenced this pull request in janlelis/uniscribe Apr 19, 2017
mtsmfm added a commit to mtsmfm/rails that referenced this pull request Jun 28, 2017
This is first step toward deprecating non-native unicode implementation

ref:

- rails#26743 (comment)
- rails#28067 (review)
@jeremy jeremy added this to the 6.0.0 milestone Feb 17, 2018
@mtsmfm mtsmfm force-pushed the remove-unicode-table branch 2 times, most recently from 6ab14e4 to c459d0a Compare February 19, 2018 16:37
@mtsmfm
Copy link
Contributor Author

mtsmfm commented Feb 19, 2018

I think it's time to merge because Rails 6 requires Ruby 2.4+

@mtsmfm
Copy link
Contributor Author

mtsmfm commented Feb 19, 2018

@jeremy @amatsuda Could you review?

@rafaelfranca
Copy link
Member

Can you also remove bin/generate_tables?

@rafaelfranca
Copy link
Member

And the unicode_tables.dat file

@mtsmfm
Copy link
Contributor Author

mtsmfm commented Feb 19, 2018

Oops, removed and force pushed!

@rafaelfranca rafaelfranca merged commit ffddaea into rails:master Feb 19, 2018
@jeremy
Copy link
Member

jeremy commented Feb 19, 2018

🎉🎉🎉

@amatsuda
Copy link
Member

👍 👍 👍

@mtsmfm
Copy link
Contributor Author

mtsmfm commented Feb 20, 2018

😂

@mtsmfm mtsmfm deleted the remove-unicode-table branch February 20, 2018 01:23
@matthewd
Copy link
Member

🎉

Should we consider deprecating some of these APIs now? I see a lot are trivial wrappers for core methods.

@rafaelfranca
Copy link
Member

👍 to deprecate some of those wrappers

end

def upcase(string)
apply_mapping string, :uppercase_mapping
string.upcase
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mtsmfm Hi! Do you think it's safe to the the same for #capitalize? The tests don't fail after I change it:

--- a/activesupport/lib/active_support/multibyte/chars.rb
+++ b/activesupport/lib/active_support/multibyte/chars.rb
@@ -145,7 +145,7 @@ def swapcase
       #
       #  'über'.mb_chars.capitalize.to_s # => "Über"
       def capitalize
-        (slice(0) || chars("")).upcase + (slice(1..-1) || chars("")).downcase
+        chars(@wrapped_string.capitalize)
       end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just noticed "\uFB03" will be different...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, thanks! :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me think a while

@mtsmfm
Copy link
Contributor Author

mtsmfm commented Oct 10, 2018

@jeremy @rafaelfranca @amatsuda

Sorry, I've found this change breaks previous behavior in some edge cases 🙇

require "bundler/inline"

gemfile(true) do
  ruby "> 2.4.0"
  source "https://rubygems.org"
  gem "activesupport", "5.2.1"
end

require "active_support/all"

str = "\uFB03"
as_result = ActiveSupport::Multibyte::Unicode.upcase(str)
mri_result = str.upcase

puts as_result # => ffi
puts mri_result # => FFI

p as_result.codepoints # => [64259]
p mri_result.codepoints # => [70, 70, 73]

It seems the previous implementation doesn't follow special casting but Ruby does.

I think it's a bug and Ruby is correct so it's ok to change this behavior though.

What do you think?

@rafaelfranca
Copy link
Member

It seems fine to me too.

@jeremy
Copy link
Member

jeremy commented Oct 10, 2018

Agreed! Following Ruby is what we intend to do, so this edge case is a good example of why 🙏

@mtsmfm
Copy link
Contributor Author

mtsmfm commented Oct 11, 2018

So then, I think it's also ok to change Chars#capitalize @frodsan

#26743 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants