Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Issues with String#tr when optimizing String#each_char. #2911

Closed
razielgn opened this Issue · 3 comments

2 participants

@razielgn

I've made this simple patch, thanks to @dbussink's hints:

diff --git a/kernel/common/string.rb b/kernel/common/string.rb
index 2581b89..fbf4a44 100644
--- a/kernel/common/string.rb
+++ b/kernel/common/string.rb
@@ -304,11 +304,12 @@ class String

   def each_char
     return to_enum :each_char unless block_given?
-    i = 0
-    n = size
-    while i < n
-      yield substring(i, 1)
-      i += 1
+
+    bytes = 0
+    while bytes < @num_bytes
+      char = find_character(bytes)
+      yield char
+      bytes += char.num_bytes
     end

     self

This benchmark reveals a 130-140% increase in performance. :metal:

# encoding: utf-8

require 'benchmark'
require 'benchmark/ips'

Benchmark.ips do |x|
  string = "сtпh€аaсnиkбs¢о" * 10 # Removed a 4 byte character because GitHub doesn't allow unicode characters above 0xffff

  x.report "String#each_char mixed UTF-8 string" do |times|
    i = 0
    while i < times
      string.each_char { |char| }
      i += 1
    end
  end
end
Before
String#each_char mixed UTF-8 string
                        37128.2 (±1.3%) i/s -     186560 in   5.025625s (cycle=3520)

After
String#each_char mixed UTF-8 string
                        88780.1 (±1.3%) i/s -     444180 in   5.004021s (cycle=8076)

The problem is, it only breaks String#tr and similar methods (tests on String#each_char are green!).
I've spent some time trying to figuring it out, without any luck.
Most of the tests break because chr.ord raises ArgumentError: invalid byte sequence in UTF-8 at https://github.com/rubinius/rubinius/blob/6bbdf254d1919915c9c7ee76aadd6b67ce4554bb/kernel/common/string.rb#L2376
When I stop at line 2374 with the debugger and inspect each_char.map(&:ord) I get no errors, so I don't understand how can it blow up after just a couple of lines.

Any ideas?

@dbussink
Owner

So if you print each character there before and after the each_char change, does that print the same characters? Does the character itself have the same / correct encoding?

@razielgn

String#tr_trans modifies the string as String#each_char is yielding chars, causing the latter to yield wrong codepoints. Executing it after the fact doesn't show the problem because it starts from the beginning.
I've written a patch in which String#tr_trans writes on another string and replaces itself with it at the end. As soon as I've cleaned it up I'll send a PR. :smile:

@dbussink
Owner

Ah, we should carefully review performance impact of that change. There are already some String#tr benchmarks in place.

@dbussink dbussink closed this in #2914
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.