Permalink
Browse files

~3.5x speedup of String#blank? for empty strings

See the rationale in the comment in this patch.

To benchmark this I ran a number of variations, ultimately narrowing to

    require 'benchmark/ips'

    str = ''
    regexp = /\A[[:space:]]*\z/

    Benchmark.ips do |x|
      x.report('regexp') { regexp === str }
      x.report('empty')  { str.empty? || regexp === str }
      x.compare!
    end

This benchmark has consistently reported speedups around 3.5x:

    Calculating -------------------------------------
                  regexp    69.197k i/100ms
                   empty   115.468k i/100ms
    -------------------------------------------------
                  regexp      2. 6.3%) i/s -     13.839M
                   empty      9. 8.8%) i/s -     47.804M

    Comparison:
                   empty:  9642607.6 i/s
                  regexp:  2768351.9 i/s - 3.48x slower

Sometimes even reaching 4x.

Running the same bechmark on strings of 10 or 100 characters (with
whitespace or present) has shown a slowdown of just about 1.01/1.02.
Marginal, we seem to have a worthwhile trade-off here.
  • Loading branch information...
fxn committed Apr 20, 2016
1 parent a3c8a8e commit 697384df36a939e565b7c08725017d49dc83fe40
Showing with 6 additions and 1 deletion.
  1. +6 −1 activesupport/lib/active_support/core_ext/object/blank.rb
@@ -112,7 +112,12 @@ class String
#
# @return [true, false]
def blank?
BLANK_RE === self
# In practice, the majority of blank strings are empty. As of this writing
# checking for empty? is about 3.5x faster than matching against the regexp
# in MRI, so we call the predicate first, and then fallback.
#
# The penalty for blank strings with whitespace or present ones is marginal.
empty? || BLANK_RE === self
end
end

6 comments on commit 697384d

@kaspth

This comment has been minimized.

Member

kaspth replied Apr 20, 2016

BOOM 💪

@DefV

This comment has been minimized.

Contributor

DefV replied Apr 20, 2016

amazing

@jeremy

This comment has been minimized.

Member

jeremy replied Apr 20, 2016

🙇

@schneems

This comment has been minimized.

Member

schneems replied Apr 20, 2016

Really interesting idea. I think it's unique especially since the method is so small, but by adding non-regex logic we can get some speed bumps. Thanks for this write up.

I profiled codetriage, and it looks like the majority of calls to blank? on string are on non empty strings. This is a case where the performance depends largely on the samples. I ran some benchmarks with some randomly generated strings

require 'benchmark/ips'

EMPTYREGEX = /\A[[:space:]]*\z/

def string_generate
  str = " abcdefghijklmnopqrstuvwxyz\t".freeze
  str[rand(0..(str.length - 1))] * rand(0..23)
end

strings = 100.times.map { string_generate }

Benchmark.ips do |x|
  x.report('regexp           ') { strings.each {|str| EMPTYREGEX === str } }
  x.report('empty            ') { strings.each {|str| str.empty? || EMPTYREGEX === str } }
  x.compare!
end

# Warming up --------------------------------------
#    regexp                1.941k i/100ms
#    empty                 1.982k i/100ms
# Calculating -------------------------------------
#    regexp                19.170k (± 8.2%) i/s -     97.050k
#    empty                 18.259k (±14.6%) i/s -     91.172k

# Comparison:
#    regexp           :    19170.2 i/s
#    empty            :    18258.6 i/s - same-ish: difference falls within error

Even while "randomly" generating the strings, i'm able to tweak the numbers to favor one or the other, but both fall in the same ballpark.

I think it is a good idea, if we seperate out empty? strings earlier we can use extra logic to avoid the regex in even more places.

Running with it

Since the majority of the cases is a string that is not soely composed of whitespace characters, and the slow part of the check is the regex, we could skip the regex if we know the first character of the string is not a whitespace character.

If we did that we could see about a 2x speed boost (off of my sample set).

require 'benchmark/ips'

EMPTYREGEX = /\A[[:space:]]*\z/

def string_generate
  str = " abcdefghijklmnopqrstuvwxyz\t".freeze
  str[rand(0..(str.length - 1))] * rand(0..23)
end

strings = 100.times.map { string_generate }

SPACE_HASH = {}
SPACE_HASH[" "]  = true
SPACE_HASH["\t"] = true

Benchmark.ips do |x|
  x.report('regexp           ') { strings.each {|str| EMPTYREGEX === str } }
  x.report('empty            ') { strings.each {|str| str.empty? || EMPTYREGEX === str } }
  x.report('first char       ') { strings.each {|str| str.empty? || SPACE_HASH[str[0]] ? EMPTYREGEX === str : false } }
  x.compare!
end

# Warming up --------------------------------------
#    regexp                1.868k i/100ms
#    empty                 1.920k i/100ms
#    first char            3.423k i/100ms
# Calculating -------------------------------------
#    regexp                19.244k (± 7.5%) i/s -     97.136k
#    empty                 19.247k (± 7.3%) i/s -     97.920k
#    first char            35.327k (± 6.4%) i/s -    177.996k

# Comparison:
#    first char       :    35327.1 i/s
#    empty            :    19247.2 i/s - 1.84x slower
#    regexp           :    19243.8 i/s - 1.84x slower

Which is pretty good. Unfortunately this would mean that we need to know all the characters that ruby considers a /[[:space:]]/. I checked the source and well... https://github.com/ruby/ruby/blob/20cd25c86fd28eb1b5068d0db607e6aa33107f65/enc/unicode/name2ctype.h#L2794-L2807 I would need help translating that into ruby strings. It looks like it's worth the leg work.

Make the regex faster

If we don't want to do that we could speed up the regex string a little by using + instead of * since we know that it has at least one char (since it is not empty?). We could also reverse the logic and the regex to detect a non-whitespace character !(/\S/ === str) which would be slightly fater than +, as it would stop at the first non-whitespace character instead of continuing to iterate, however I don't know if thats a technically correct solution (i.e. is /\S/ the technical oposite of /\A[[:space:]]*\z/?

@fxn

This comment has been minimized.

Member

fxn replied Apr 20, 2016

Need to digest your comment (having dinner over here!), but just a quick answer to the last question in case it helps you continue investigating.

The difference is that [[:space:]] is a Unicode character class, whereas \s is an ASCII character class so to speak (a strict subset of the former). The complement of the Unicode class would be [[:^space:]].

@schneems

This comment has been minimized.

Member

schneems replied Apr 20, 2016

Thank you, please enjoy your meal :).

I see that \S would return the wrong value for "\u0085". I submitted a pull request with the new regex logic: #24658

I still think we could speed up the check with a hash of "space" characters. But we need to be careful. Even if we do that, we'll still benifit from a faster regex.

Please sign in to comment.