Optimize unicode_normalize methods #8966

Open
wants to merge 1 commit into master from unicode-normalization-optimizations
Conversation

skryukov
Contributor
@skryukov commented Nov 20, 2023

This PR makes several changes to the implementation of the internal UnicodeNormalize module to improve performance and reduce memory allocations.

  • This PR adds an early return for ASCII-only strings. String#ascii_only? is extremely fast and helps when a user normalizes an ASCII-only string (for example, this saves a lot of time and memory in the IDNA mapping from UTS #46, which I want to propose as the default IDNA algorithm for ruby/uri, but that's a different story 😅)
  • The biggest algorithmic change is that we now use codepoints instead of chars, which saves on memory allocations (these changes affect only the cold cache)
  • Finally, it makes some small adjustments to reduce memory allocation (map + join -> each, gsub -> gsub!)
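The ASCII fast path from the first bullet can be sketched roughly like this (a hypothetical simplification; the real UnicodeNormalize module also dispatches on the normalization form and uses cached lookup tables):

```ruby
# Hypothetical sketch of the ASCII early return described above; the actual
# implementation in lib/unicode_normalize/normalize.rb is more involved.
module NormalizeSketch
  def self.normalize(string, form = :nfc)
    # ASCII characters are unchanged under every Unicode normalization
    # form, so String#ascii_only? lets us skip all further work (and all
    # intermediate allocations) for ASCII-only input.
    return string if string.ascii_only?

    # Fall back to Ruby's built-in implementation for non-ASCII input.
    string.unicode_normalize(form)
  end
end

NormalizeSketch.normalize("google.com")  # fast path, no allocation
NormalizeSketch.normalize("a\u0301")     # full NFC normalization
```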

Benchmark

The benchmark is somewhat skewed by caching, but still 😅

MacBook 13", M1, 2020, 16 GB, Ruby 3.2.2

require 'benchmark/ips'

value = "google.com"
uni = "\u0136\uAC01\u1100\u1161\u11A8\u1100\u1176\u11a8\uae30\u11a7\uae30\u11c3\uAC00\u0323\u0300\uAC00\u0323\u0300\u1100\u1161\u0323\u0300\u1100\u1161\u0323\u0300\u4EE4\u548C"

Benchmark.ips do |x|
  x.report("orig normalize ascii") {
    value.unicode_normalize
  }
  x.report("orig normalize unicode") {
    uni.unicode_normalize
  }
end
Object.send(:remove_const, 'UnicodeNormalize')

require_relative '../ruby/lib/unicode_normalize/normalize'

Benchmark.ips do |x|
  x.report("new normalize ascii") {
    UnicodeNormalize.normalize(value)
  }
  x.report("new normalize unicode") {
    UnicodeNormalize.normalize(uni)
  }
end
Warming up --------------------------------------
orig normalize ascii   100.632k i/100ms
orig normalize unicode
                        20.641k i/100ms
Calculating -------------------------------------
orig normalize ascii    981.274k (± 6.2%) i/s -      4.931M in   5.053755s
orig normalize unicode
                        226.720k (± 1.9%) i/s -      1.135M in   5.009079s
Warming up --------------------------------------
 new normalize ascii   802.191k i/100ms
new normalize unicode
                        22.688k i/100ms
Calculating -------------------------------------
 new normalize ascii      8.106M (± 0.9%) i/s -     40.912M in   5.047787s
new normalize unicode
                        226.993k (± 1.2%) i/s -      1.157M in   5.098191s

Memory profiling:

require 'memory_profiler'

value = "google.com"
uni = "\u0136\uAC01\u1100\u1161\u11A8\u1100\u1176\u11a8\uae30\u11a7\uae30\u11c3\uAC00\u0323\u0300\uAC00\u0323\u0300\u1100\u1161\u0323\u0300\u1100\u1161\u0323\u0300\u4EE4\u548C"
N = 100_000

# warm up
"".unicode_normalize

puts "Before ASCII-only"
MemoryProfiler.report do
  N.times {
    value.unicode_normalize
  }
end.pretty_print(detailed_report: false)

puts "Before Unicode"
MemoryProfiler.report do
  N.times {
    uni.unicode_normalize
  }
end.pretty_print(detailed_report: false)

Object.send(:remove_const, 'UnicodeNormalize')
require_relative '../ruby/lib/unicode_normalize/normalize'

puts "After ASCII-only"
MemoryProfiler.report do
  N.times {
    UnicodeNormalize.normalize(value)
  }
end.pretty_print(detailed_report: false)

puts "After Unicode"
MemoryProfiler.report do
  N.times {
    UnicodeNormalize.normalize(uni)
  }
end.pretty_print(detailed_report: false)
Before ASCII-only
Total allocated: 4000000 bytes (100000 objects)
Total retained:  0 bytes (0 objects)

Before Unicode
Total allocated: 306554744 bytes (2400094 objects)
Total retained:  200 bytes (5 objects)

After ASCII-only
Total allocated: 0 bytes (0 objects)
Total retained:  0 bytes (0 objects)

After Unicode
Total allocated: 306401688 bytes (2400027 objects)
Total retained:  120 bytes (3 objects)

@duerst
Member

duerst commented Nov 28, 2023

For benchmarking, I'd be very interested in the results of using test/test_unicode_normalize.rb. It runs all the tests provided in the NormalizationTest.txt file (see https://unicode.org/Public/UCD/latest/ucd/NormalizationTest.txt). I'm not sure how difficult it is to extract the actual processing part (not the part reading in the data) for a benchmark.

@skryukov
Contributor Author

@duerst, sure!

Here is a simple benchmark: I ran every normalization form over every case from NormalizationTest.txt.

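A benchmark over that file might look roughly like this (a sketch, not the script actually used; each data line in NormalizationTest.txt holds five semicolon-separated fields of space-separated hex codepoints, with `#` comments and `@Part` headers):

```ruby
require 'benchmark'

# Parse one line of NormalizationTest.txt into five strings, or nil for
# comment/header lines. Field format per UAX #15: "0044 0307" -> "D\u0307".
def parse_line(line)
  return nil if line.start_with?('#', '@')
  fields = line.split('#').first.split(';')
  return nil if fields.size < 5
  fields.first(5).map { |f| f.split.map { |cp| cp.to_i(16) }.pack('U*') }
end

# Sketch of the timing loop (path and labels assumed):
# cases = File.foreach('NormalizationTest.txt').filter_map { |l| parse_line(l) }
# Benchmark.bmbm do |x|
#   x.report('native #normalize') do
#     cases.each do |cols|
#       %i[nfc nfd nfkc nfkd].each { |form| cols.each { |s| s.unicode_normalize(form) } }
#     end
#   end
# end
```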
Results running on MacBook 13", M1, 2020, 16 GB, Ruby 3.2.2:

Rehearsal -----------------------------------------------------
native #normalize   4.193533   0.032114   4.225647 (  4.233881)
new #normalize      4.300044   0.031065   4.331109 (  4.491655)
-------------------------------------------- total: 8.556756sec

                        user     system      total        real
native #normalize   4.366779   0.033569   4.400348 (  4.482564)
new #normalize      4.122924   0.024987   4.147911 (  4.153229)
Calculating -------------------------------------
   native #normalize     1.421B memsize (   722.440k retained)
                        20.221M objects (    18.001k retained)
                        50.000  strings (    50.000  retained)
      new #normalize     1.231B memsize (   738.520k retained)
                        16.003M objects (    18.463k retained)
                        50.000  strings (    50.000  retained)
Rehearsal -------------------------------------------------------
native #normalized?   4.736180   1.333103   6.069283 (  6.614149)
new #normalized?      3.521938   1.143961   4.665899 (  4.860499)
--------------------------------------------- total: 10.735182sec

                          user     system      total        real
native #normalized?   3.710057   0.140508   3.850565 (  3.856705)
new #normalized?      3.121058   0.051271   3.172329 (  3.177075)
Calculating -------------------------------------
   native #normalize   760.305M memsize (   740.920k retained)
                        10.954M objects (    18.463k retained)
                        50.000  strings (    50.000  retained)
      new #normalize   616.285M memsize (   720.040k retained)
                         7.883M objects (    18.001k retained)
                        50.000  strings (    50.000  retained)

@skryukov force-pushed the unicode-normalization-optimizations branch 2 times, most recently from 4ee987c to 28636b9 on December 12, 2023 19:07
@skryukov force-pushed the unicode-normalization-optimizations branch from 28636b9 to 2ffa091 on December 13, 2023 08:08
@skryukov
Contributor Author

Hey @duerst!

Is there anything I can do to get this PR merged?

@nobu nobu requested a review from duerst May 31, 2024 14:18