Optimize unicode_normalize methods #8966

Open
wants to merge 1 commit into master from unicode-normalization-optimizations
Conversation

skryukov
Contributor
@skryukov commented Nov 20, 2023

This PR makes several changes to the implementation of the internal UnicodeNormalize module to improve performance and reduce memory allocations.

  • This PR adds an early return for ASCII-only strings. String#ascii_only? is extremely fast and helps when a user normalizes an ASCII-only string (for example, this saves a lot of time and memory in the IDNA mapping from UTS #46, which I want to propose as the default IDNA algorithm for ruby/uri, but that's a different story 😅)
  • The biggest algorithmic change is that we now use codepoints instead of chars, which saves on memory allocations (these changes affect only the cold cache)
  • Finally, it makes some small adjustments to reduce memory allocation (map + join -> each, gsub -> gsub!)
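The ASCII fast path from the first bullet can be sketched roughly like this (a hypothetical simplification; the real UnicodeNormalize module also dispatches on the normalization form and uses cached lookup tables):

```ruby
# Hypothetical sketch of the ASCII early return described above; the actual
# implementation in lib/unicode_normalize/normalize.rb is more involved.
module NormalizeSketch
  def self.normalize(string, form = :nfc)
    # ASCII characters are unchanged under every Unicode normalization
    # form, so String#ascii_only? lets us skip all further work (and all
    # intermediate allocations) for ASCII-only input.
    return string if string.ascii_only?

    # Fall back to Ruby's built-in implementation for non-ASCII input.
    string.unicode_normalize(form)
  end
end

NormalizeSketch.normalize("google.com")  # fast path, no allocation
NormalizeSketch.normalize("a\u0301")     # full NFC normalization
```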

Benchmark

The benchmark is somewhat skewed by caching, but still 😅

MacBook 13", M1, 2020, 16 GB, Ruby 3.2.2

require 'benchmark/ips'

value = "google.com"
uni = "\u0136\uAC01\u1100\u1161\u11A8\u1100\u1176\u11a8\uae30\u11a7\uae30\u11c3\uAC00\u0323\u0300\uAC00\u0323\u0300\u1100\u1161\u0323\u0300\u1100\u1161\u0323\u0300\u4EE4\u548C"

Benchmark.ips do |x|
  x.report("orig normalize ascii") {
    value.unicode_normalize
  }
  x.report("orig normalize unicode") {
    uni.unicode_normalize
  }
end
Object.send(:remove_const, 'UnicodeNormalize')

require_relative '../ruby/lib/unicode_normalize/normalize'

Benchmark.ips do |x|
  x.report("new normalize ascii") {
    UnicodeNormalize.normalize(value)
  }
  x.report("new normalize unicode") {
    UnicodeNormalize.normalize(uni)
  }
end
Warming up --------------------------------------
orig normalize ascii   100.632k i/100ms
orig normalize unicode
                        20.641k i/100ms
Calculating -------------------------------------
orig normalize ascii    981.274k (± 6.2%) i/s -      4.931M in   5.053755s
orig normalize unicode
                        226.720k (± 1.9%) i/s -      1.135M in   5.009079s
Warming up --------------------------------------
 new normalize ascii   802.191k i/100ms
new normalize unicode
                        22.688k i/100ms
Calculating -------------------------------------
 new normalize ascii      8.106M (± 0.9%) i/s -     40.912M in   5.047787s
new normalize unicode
                        226.993k (± 1.2%) i/s -      1.157M in   5.098191s

Memory profiling:

require 'memory_profiler'

value = "google.com"
uni = "\u0136\uAC01\u1100\u1161\u11A8\u1100\u1176\u11a8\uae30\u11a7\uae30\u11c3\uAC00\u0323\u0300\uAC00\u0323\u0300\u1100\u1161\u0323\u0300\u1100\u1161\u0323\u0300\u4EE4\u548C"
N = 100_000

# warm up
"".unicode_normalize

puts "Before ASCII-only"
MemoryProfiler.report do
  N.times {
    value.unicode_normalize
  }
end.pretty_print(detailed_report: false)

puts "Before Unicode"
MemoryProfiler.report do
  N.times {
    uni.unicode_normalize
  }
end.pretty_print(detailed_report: false)

Object.send(:remove_const, 'UnicodeNormalize')
require_relative '../ruby/lib/unicode_normalize/normalize'

puts "After ASCII-only"
MemoryProfiler.report do
  N.times {
    UnicodeNormalize.normalize(value)
  }
end.pretty_print(detailed_report: false)

puts "After Unicode"
MemoryProfiler.report do
  N.times {
    UnicodeNormalize.normalize(uni)
  }
end.pretty_print(detailed_report: false)
Before ASCII-only
Total allocated: 4000000 bytes (100000 objects)
Total retained:  0 bytes (0 objects)

Before Unicode
Total allocated: 306554744 bytes (2400094 objects)
Total retained:  200 bytes (5 objects)

After ASCII-only
Total allocated: 0 bytes (0 objects)
Total retained:  0 bytes (0 objects)

After Unicode
Total allocated: 306401688 bytes (2400027 objects)
Total retained:  120 bytes (3 objects)

@duerst
Member

duerst commented Nov 28, 2023

For benchmarking, I'd be very interested in the results of using test/test_unicode_normalize.rb. It runs all the tests provided in the NormalizationTest.txt file (see https://unicode.org/Public/UCD/latest/ucd/NormalizationTest.txt). I'm not sure how difficult it is to extract the actual processing part (not the part reading in the data) for a benchmark.

@skryukov
Contributor Author

@duerst, sure!

Here is a simple benchmark: I ran every normalization form over every case from NormalizationTest.txt.

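A benchmark over that file might look roughly like this (a sketch, not the script actually used; each data line in NormalizationTest.txt holds five semicolon-separated fields of space-separated hex codepoints, with `#` comments and `@Part` headers):

```ruby
require 'benchmark'

# Parse one line of NormalizationTest.txt into five strings, or nil for
# comment/header lines. Field format per UAX #15: "0044 0307" -> "D\u0307".
def parse_line(line)
  return nil if line.start_with?('#', '@')
  fields = line.split('#').first.split(';')
  return nil if fields.size < 5
  fields.first(5).map { |f| f.split.map { |cp| cp.to_i(16) }.pack('U*') }
end

# Sketch of the timing loop (path and labels assumed):
# cases = File.foreach('NormalizationTest.txt').filter_map { |l| parse_line(l) }
# Benchmark.bmbm do |x|
#   x.report('native #normalize') do
#     cases.each do |cols|
#       %i[nfc nfd nfkc nfkd].each { |form| cols.each { |s| s.unicode_normalize(form) } }
#     end
#   end
# end
```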
Results running on MacBook 13", M1, 2020, 16 GB, Ruby 3.2.2:

Rehearsal -----------------------------------------------------
native #normalize   4.193533   0.032114   4.225647 (  4.233881)
new #normalize      4.300044   0.031065   4.331109 (  4.491655)
-------------------------------------------- total: 8.556756sec

                        user     system      total        real
native #normalize   4.366779   0.033569   4.400348 (  4.482564)
new #normalize      4.122924   0.024987   4.147911 (  4.153229)
Calculating -------------------------------------
   native #normalize     1.421B memsize (   722.440k retained)
                        20.221M objects (    18.001k retained)
                        50.000  strings (    50.000  retained)
      new #normalize     1.231B memsize (   738.520k retained)
                        16.003M objects (    18.463k retained)
                        50.000  strings (    50.000  retained)
Rehearsal -------------------------------------------------------
native #normalized?   4.736180   1.333103   6.069283 (  6.614149)
new #normalized?      3.521938   1.143961   4.665899 (  4.860499)
--------------------------------------------- total: 10.735182sec

                          user     system      total        real
native #normalized?   3.710057   0.140508   3.850565 (  3.856705)
new #normalized?      3.121058   0.051271   3.172329 (  3.177075)
Calculating -------------------------------------
   native #normalize   760.305M memsize (   740.920k retained)
                        10.954M objects (    18.463k retained)
                        50.000  strings (    50.000  retained)
      new #normalize   616.285M memsize (   720.040k retained)
                         7.883M objects (    18.001k retained)
                        50.000  strings (    50.000  retained)

@skryukov force-pushed the unicode-normalization-optimizations branch 2 times, most recently from 4ee987c to 28636b9 on December 12, 2023 19:07
@skryukov force-pushed the unicode-normalization-optimizations branch from 28636b9 to 2ffa091 on December 13, 2023 08:08
@skryukov
Contributor Author

Hey @duerst!

Is there anything I can do to get this PR merged?

@nobu nobu requested a review from duerst May 31, 2024 14:18