Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Ruby's native Encoding functionality to implement tidy_bytes #10355

Merged
merged 1 commit into from May 8, 2013

Conversation

burke
Copy link
Contributor

@burke burke commented Apr 26, 2013

The current implementation is quite slow. This leverages some of the transcoding abilities built into Ruby 1.9 instead. It is roughly 96% faster.

The roundtrip through UTF_8_MAC here is because ruby won't let you transcode from UTF_8 to UTF_8. I chose the closest encoding I could find as an intermediate.

These test cases seem quite thorough and it passes all of them: https://github.com/rails/rails/blob/master/activesupport/test/multibyte_chars_test.rb#L630-L661

Benchmarking

require 'benchmark'

kilobyte = "1" * 1024

u = ActiveSupport::Multibyte::Unicode

res = Benchmark.realtime do
  10_000.times do
    u.tidy_bytes(kilobyte)
  end
end

puts "tidy_bytes took #{res / 10}ms"

Save as ./activesupport/bm.rb, cd to ./activesupport, and run:

ruby -I./lib -ractive_support/multibyte/unicode bm.rb

Output before proposed change:

tidy_bytes took 1.952ms

After proposed change:

tidy_bytes took 0.0668ms

@jeremy
Copy link
Member

jeremy commented Apr 26, 2013

Nice, @burke. We do this in our apps, but round-trip through UTF-16.

Check out the new String#scrub in Ruby 2.1 (trunk) that does this directly, without an intermediate encode step!

@burke
Copy link
Contributor Author

burke commented Apr 26, 2013

I tried UTF-16 at first but I ran into some weirdness with BOMs, so I switched to a different encoding rather than debugging it :). I didn't benchmark thoroughly but I remember finding this way to be a bit faster as well.

I hadn't heard about String#scrub -- good news! Finally, tools to deal with encodings sanely are starting to show up.

@@ -218,51 +218,31 @@ def compose(codepoints)
# Passing +true+ will forcibly tidy all bytes, assuming that the string's
# encoding is entirely CP1252 or ISO-8859-1.
def tidy_bytes(string, force = false)
return "" if string.empty?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May as well preserve the existing object: return string if string.empty?

@burke
Copy link
Contributor Author

burke commented May 7, 2013

We just put this in production. It seems to be working well, except that the final line fails if the input string was UTF-16. On further investigation, so does the current implementation :)

@jeremy
Copy link
Member

jeremy commented May 8, 2013

Could you rebase against master and squash to a single commit?

The previous implementation was quite slow. This leverages some of the
transcoding abilities built into Ruby 1.9 instead. It is roughly 96%
faster.

The roundtrip through UTF_8_MAC here is because ruby won't let you
transcode from UTF_8 to UTF_8. I chose the closest encoding I could
find as an intermediate.
@burke
Copy link
Contributor Author

burke commented May 8, 2013

Sure. Done.

jeremy added a commit that referenced this pull request May 8, 2013
Use Ruby's native Encoding functionality to implement `tidy_bytes`
@jeremy jeremy merged commit d77c645 into rails:master May 8, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants