Remove AS::Multibyte's unicode table #26743
r? @kaspth (@rails-bot has picked a reviewer for you, use r? to override)
r? @amatsuda
I like the idea, but I don't think we should remove some features for
@mtsmfm Thank you for the PR and summary! This is very well done! @rafaelfranca Agreed. Actually we (@mtsmfm and I) already talked about that in person, so he understands that this PR might not be fully merged right now (and maybe we need to wait until Rails 6). So the next step should be to find out what we can cherry-pick to the 5.1 branch from this huge PR.
@amatsuda Thank you for following up!
I think it is difficult to cherry-pick. Should I close this PR for now, or keep it open until Rails 6?
62a3025 to eebf615
Since the UnicodeDatabase lazy loads, perhaps we could leave it as-is for older Rubies but start switching to the native implementation for Ruby 2.4+?
@jeremy Sorry, I don't understand what "switching to native implementation for Ruby 2.4+" means 🙇‍♂️
Ruby 2.4+ also lazy-loads its Unicode normalization database.
This is first step toward deprecating non-native unicode implementation ref: - rails#26743 (comment) - rails#28067 (review)
6ab14e4 to c459d0a
c459d0a to 8ffe42b
I think it's time to merge because Rails 6 requires Ruby 2.4+
Can you also remove
And the
8ffe42b to c1d00d6
Oops, removed and force pushed!
🎉🎉🎉
👍 👍 👍
😂
🎉 Should we consider deprecating some of these APIs now? I see a lot are trivial wrappers for core methods.
👍 to deprecate some of those wrappers
  end

  def upcase(string)
-   apply_mapping string, :uppercase_mapping
+   string.upcase
  end
@mtsmfm Hi! Do you think it's safe to do the same for #capitalize? The tests don't fail after I change it:
--- a/activesupport/lib/active_support/multibyte/chars.rb
+++ b/activesupport/lib/active_support/multibyte/chars.rb
@@ -145,7 +145,7 @@ def swapcase
#
# 'über'.mb_chars.capitalize.to_s # => "Über"
def capitalize
- (slice(0) || chars("")).upcase + (slice(1..-1) || chars("")).downcase
+ chars(@wrapped_string.capitalize)
end
I've just noticed "\uFB03" will be different...
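A minimal sketch of why the two `capitalize` strategies can disagree for "\uFB03" (LATIN SMALL LIGATURE FFI), assuming Ruby 2.4+ and plain `String` methods rather than `ActiveSupport::Multibyte::Chars`. The variable names are illustrative only. Upcasing the first character and downcasing the rest applies the full uppercase mapping to the ligature, whereas `String#capitalize` is a single core call that may pick a different mapping for the first character.

```ruby
ligature = "\uFB03"

# Old strategy from the diff above: upcase the first character, downcase the rest.
# For a one-character string the "rest" is the empty string.
piecewise = ligature[0].upcase + ligature[1..-1].downcase

# New strategy: delegate to the core method.
native = ligature.capitalize

p piecewise
p native
p piecewise == native
```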
ok, thanks! :)
Let me think a while
@jeremy @rafaelfranca @amatsuda Sorry, I've found this change breaks previous behavior in some edge cases 🙇
require "bundler/inline"
gemfile(true) do
ruby "> 2.4.0"
source "https://rubygems.org"
gem "activesupport", "5.2.1"
end
require "active_support/all"
str = "\uFB03"
as_result = ActiveSupport::Multibyte::Unicode.upcase(str)
mri_result = str.upcase
puts as_result # => ﬃ
puts mri_result # => FFI
p as_result.codepoints # => [64259]
p mri_result.codepoints # => [70, 70, 73]
It seems the previous implementation doesn't follow special casing, but Ruby does. I think it's a bug and Ruby is correct, so it's OK to change this behavior. What do you think?
It seems fine to me too.
Agreed! Following Ruby is what we intend to do, so this edge case is a good example of why 🙏
So then, I think it's also ok to change
Summary
@amatsuda tells us AS::Multibyte has a big Unicode table and Ruby may have a similar feature (https://speakerdeck.com/a_matsuda/3x-rails?slide=156).
So I investigated and noticed that we can remove the table if we accept the following changes.
Grapheme doesn't work perfectly. Update: Ruby 2.4 works perfectly! https://bugs.ruby-lang.org/issues/12831
If we merge this PR, we'll have the following benefits:
I'll show how I tried to remove the table and what the changes will cause.
AS::Multibyte::Unicode's features are divided into 4 main groups.
Groups 1 ~ 3 use the Unicode database, so let's remove them.
normalize
Since 2.2, Ruby has String#unicode_normalize.
https://www.ruby-lang.org/en/news/2014/12/25/ruby-2-2-0-released/
https://bugs.ruby-lang.org/issues/10084
It has similar options, so it seems that we can simply replace it.
(I tried #26403)
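As a rough illustration of the core method mentioned above, here is a small sketch of `String#unicode_normalize` with Ruby's own form symbols (`:nfc`, `:nfd`, `:nfkc`). The sample strings and variable names are illustrative and not taken from the PR.

```ruby
# Precomposed U+00E9 versus "e" followed by the combining acute accent U+0301.
composed   = "\u00E9"
decomposed = "e\u0301"

nfc = decomposed.unicode_normalize(:nfc)  # canonical composition
nfd = composed.unicode_normalize(:nfd)    # canonical decomposition

p nfc == composed    # => true
p nfd == decomposed  # => true

# Compatibility forms also fold ligatures such as U+FB03 into plain letters.
nfkc = "\uFB03".unicode_normalize(:nfkc)
p nfkc               # => "ffi"
```

Which Unicode version backs these results depends on the running Ruby.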
But Ruby 2.2 supports Unicode 7.0.0 and Ruby 2.3 supports 8.0.0.
AS::Multibyte::Unicode has its own UNICODE_VERSION, but now it depends on Ruby's version.
case mappings
Ruby 2.4 supports Unicode case mappings.
But Ruby 2.2 and 2.3 don't.
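A short sketch of what this means in practice, assuming Ruby 2.4 or newer. The `:ascii` option restricts the mapping to ASCII letters, which roughly matches what older Rubies did by default; the sample word is illustrative.

```ruby
word = "\u00FCber"  # "über"

full  = word.upcase          # Unicode-aware on Ruby 2.4+
ascii = word.upcase(:ascii)  # only a-z are affected

p full   # => "ÜBER"
p ascii  # => "üBER"

# Special casing: one code point can expand to several.
p "\uFB03".upcase.codepoints  # => [70, 70, 73]
```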
pack/unpack grapheme
Since 2.0, Ruby changed its regexp engine.
So we can split graphemes with /\X/.
(ex. "\u304B\u3099\u304C\u3099\u304E\u3099".scan(/\X/))
But its grapheme cluster is not "true", so some tests will fail.
@nurse fixed this in Ruby 2.4!
https://bugs.ruby-lang.org/issues/12831
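To make the example above concrete, here is a minimal sketch that splits the same string into grapheme clusters with `/\X/` and joins them back. Each kana is followed by the combining mark U+3099, so six code points form three clusters. Variable names are illustrative.

```ruby
text = "\u304B\u3099\u304C\u3099\u304E\u3099"

clusters = text.scan(/\X/)  # "unpack": one element per grapheme cluster

p text.length                         # => 6 (code points)
p clusters.length                     # => 3 (grapheme clusters)
p clusters.map { |c| c.codepoints.length }  # => [2, 2, 2]

repacked = clusters.join    # "pack": back to the original string
p repacked == text          # => true
```

This simple base-plus-combining-mark case behaves the same before and after the Ruby 2.4 fix; the differences show up in more complex clusters.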
Memory
The Unicode database consumes 1.5 M of memory.
Ruby's Unicode database is smaller than AS::Multibyte::Unicode::UnicodeDatabase