New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use Ruby's native Encoding functionality to implement tidy_bytes
#10355
Conversation
|
Nice, @burke. We do this in our apps, but round-trip through UTF-16. Check out the new |
|
I tried UTF-16 at first but I ran into some weirdness with BOMs, so I switched to a different encoding rather than debugging it :). I didn't benchmark thoroughly but I remember finding this way to be a bit faster as well. I hadn't heard about |
| @@ -218,51 +218,31 @@ def compose(codepoints) | |||
| # Passing +true+ will forcibly tidy all bytes, assuming that the string's | |||
| # encoding is entirely CP1252 or ISO-8859-1. | |||
| def tidy_bytes(string, force = false) | |||
| return "" if string.empty? | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May as well preserve the existing object: return string if string.empty?
|
We just put this in production. It seems to be working well, except that the final line fails if the input string was UTF-16. On further investigation, so does the current implementation :) |
|
Could you rebase against master and squash to a single commit? |
The previous implementation was quite slow. This leverages some of the transcoding abilities built into Ruby 1.9 instead. It is roughly 96% faster. The roundtrip through UTF_8_MAC here is because ruby won't let you transcode from UTF_8 to UTF_8. I chose the closest encoding I could find as an intermediate.
|
Sure. Done. |
Use Ruby's native Encoding functionality to implement `tidy_bytes`
The current implementation is quite slow. This leverages some of the transcoding abilities built into Ruby 1.9 instead. It is roughly 96% faster.
The roundtrip through UTF_8_MAC here is because ruby won't let you transcode from UTF_8 to UTF_8. I chose the closest encoding I could find as an intermediate.
These test cases seem quite thorough and it passes all of them: https://github.com/rails/rails/blob/master/activesupport/test/multibyte_chars_test.rb#L630-L661
Benchmarking
Save as
./activesupport/bm.rb, cd to./activesupport, and run:Output before proposed change:
After proposed change: