Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Support for String#scrub #2912

Merged
merged 2 commits into from

3 participants

@YorickPeterse

Note that this is currently still a work in progress so expect plenty of Git rebases.

In the current implementation it uses Encoding::Converter (thanks to @headius for that one) which isn't exactly webscale. The options we have are either tuning the Converter API or moving the code to C++. Another option is to see if we can port over MRI's C logic to pure Ruby, but I'm not sure how well that would work.

@YorickPeterse

Related issue: #2901

@YorickPeterse

Corresponding MRI code can be found here: https://github.com/ruby/ruby/blob/trunk/string.c#L8022

@YorickPeterse

@brixen / @dbussink So in the current setup this uses the Encoding conversion API cooked up mostly by @headius. Having ran some basic I noticed that this API is waaaay slower than String#scrub in MRI. Are there any alternative APIs that we can use, or do we have to resort to using C++ for this?

@YorickPeterse

To clarify, it would be nice if we could keep this in Ruby (both because it's Ruby and for future references for other implementations). Having said that, if performance is an issue I can understand the need to move it to C++.

@YorickPeterse

Actually that implementation is quite different compared to what MRI does (as stated in the README). Some basic benchmarks also showed that it was even slower than the current implementation we have here. It was however a source of information on the behaviour of this method when I initially started looking into it.

@headius headius referenced this pull request from a commit in jruby/jruby
@headius headius Implement String#scrub (issues remain in jcodings).
* From my impl (modified) at rubinius/rubinius#2912
* Fails due to ArrayStoreException in jcodings
b287f70
YorickPeterse added some commits
@YorickPeterse YorickPeterse Experimental String#scrub with Encoding::Converter
Massive thanks to @headius for coming up with the idea of using
Encoding::Converter and providing a proof of concept. I mainly adapted this for
Rbx with some minor changes.

Note that this particular implementation is *really* slow compared to MRI, it
is however faster than the String#chars version used before. Although I'd
prefer not to we might want to move this over to C++ land if high performance
is required.
e467bb8
@YorickPeterse YorickPeterse Added specs for String#scrub. c0ae661
@headius headius referenced this pull request from a commit in jruby/jruby
@headius headius Implement String#scrub (issues remain in jcodings).
* From my impl (modified) at rubinius/rubinius#2912
* Fails due to ArrayStoreException in jcodings
29d8a05
@brixen brixen merged commit 49b6cb9 into master

1 check passed

Details default The Travis CI build passed
@YorickPeterse YorickPeterse deleted the string-scrub branch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Mar 7, 2014
  1. @YorickPeterse

    Experimental String#scrub with Encoding::Converter

    YorickPeterse authored
    Massive thanks to @headius for coming up with the idea of using
    Encoding::Converter and providing a proof of concept. I mainly adapted this for
    Rbx with some minor changes.
    
    Note that this particular implementation is *really* slow compared to MRI, it
    is however faster than the String#chars version used before. Although I'd
    prefer not to we might want to move this over to C++ land if high performance
    is required.
  2. @YorickPeterse
This page is out of date. Refresh to see the latest.
Showing with 109 additions and 0 deletions.
  1. +51 −0 kernel/common/string.rb
  2. +58 −0 spec/ruby/core/string/scrub_spec.rb
View
51 kernel/common/string.rb
@@ -1910,6 +1910,57 @@ def match(pattern, pos=0)
result
end
+ # Removes invalid byte sequences from a String, available since Ruby 2.1.
+ def scrub(replace = nil)
+ output = ''
+ input = dup
+
+ # The default replacement character is the "Unicode replacement" character.
+ # (U+FFFD).
+ if !replace and !block_given?
+ replace = "\xEF\xBF\xBD".force_encoding("UTF-8")
+ .encode(self.encoding, :undef => :replace, :replace => '?')
+ end
+
+ if replace
+ unless replace.is_a?(String)
+ raise(
+ TypeError,
+ "no implicit conversion of #{replace.class} into String"
+ )
+ end
+
+ unless replace.valid_encoding?
+ raise(
+ ArgumentError,
+ "replacement must be a valid byte sequence '#{replace.inspect}'"
+ )
+ end
+
+ replace = replace.force_encoding(Encoding::BINARY)
+ end
+
+ converter = Encoding::Converter.new(input.encoding, Encoding::BINARY)
+
+ while input.length > 0
+ result = converter.primitive_convert(input, output, output.length)
+
+ if result == :finished
+ break
+ elsif result == :undefined_conversion
+ output << converter.primitive_errinfo[3]
+ else
+ if block_given?
+ output << yield(converter.primitive_errinfo[3])
+ else
+ output << replace
+ end
+ end
+ end
+
+ return output.force_encoding(encoding)
+ end
+
def []=(index, count_or_replacement, replacement=undefined)
if undefined.equal?(replacement)
replacement = count_or_replacement
View
58 spec/ruby/core/string/scrub_spec.rb
@@ -0,0 +1,58 @@
+# -*- encoding: utf-8 -*-
+require File.expand_path("../../../spec_helper", __FILE__)
+
+ruby_version_is "2.1" do
+ describe "String#scrub with a default replacement" do
+ it "returns self for valid strings" do
+ input = "foo"
+
+ input.scrub.should == input
+ end
+
+ it "replaces invalid byte sequences" do
+ "abc\u3042\x81".scrub.should == "abc\u3042\uFFFD"
+ end
+ end
+
+ describe "String#scrub with a custom replacement" do
+ it "returns self for valid strings" do
+ input = "foo"
+
+ input.scrub("*").should == input
+ end
+
+ it "replaces invalid byte sequences" do
+ "abc\u3042\x81".scrub("*").should == "abc\u3042*"
+ end
+
+ it "replaces groups of sequences together with a single replacement" do
+ "\xE3\x80".scrub("*").should == "*"
+ end
+
+ it "raises ArgumentError for replacements with an invalid encoding" do
+ block = lambda { "foo".scrub("\xE4") }
+
+ block.should raise_error(ArgumentError)
+ end
+
+ it "raises TypeError when a non String replacement is given" do
+ block = lambda { "foo".scrub(1) }
+
+ block.should raise_error(TypeError)
+ end
+ end
+
+ describe "String#scrub with a block" do
+ it "returns self for valid strings" do
+ input = "foo"
+
+ input.scrub { |b| "*" }.should == input
+ end
+
+ it "replaces invalid byte sequences" do
+ replaced = "abc\u3042\xE3\x80".scrub { |b| "<#{b.unpack("H*")[0]}>" }
+
+ replaced.should == "abc\u3042<e380>"
+ end
+ end
+end
Something went wrong with that request. Please try again.