Skip to content

Commit

Permalink
~3.5x speedup of String#blank? for empty strings
Browse files Browse the repository at this point in the history
See the rationale in the comment in this patch.

To benchmark this I ran a number of variations, ultimately narrowing to

    require 'benchmark/ips'

    str = ''
    regexp = /\A[[:space:]]*\z/

    Benchmark.ips do |x|
      x.report('regexp') { regexp === str }
      x.report('empty')  { str.empty? || regexp === str }
      x.compare!
    end

This benchmark has consistently reported speedups around 3.5x:

    Calculating -------------------------------------
                  regexp    69.197k i/100ms
                   empty   115.468k i/100ms
    -------------------------------------------------
                  regexp      2. 6.3%) i/s -     13.839M
                   empty      9. 8.8%) i/s -     47.804M

    Comparison:
                   empty:  9642607.6 i/s
                  regexp:  2768351.9 i/s - 3.48x slower

Sometimes even reaching 4x.

Running the same bechmark on strings of 10 or 100 characters (with
whitespace or present) has shown a slowdown of just about 1.01/1.02.
Marginal, we seem to have a worthwhile trade-off here.
  • Loading branch information
fxn committed Apr 20, 2016
1 parent a3c8a8e commit 697384d
Showing 1 changed file with 6 additions and 1 deletion.
7 changes: 6 additions & 1 deletion activesupport/lib/active_support/core_ext/object/blank.rb
Original file line number Original file line Diff line number Diff line change
Expand Up @@ -112,7 +112,12 @@ class String
# #
# @return [true, false] # @return [true, false]
def blank? def blank?
BLANK_RE === self # In practice, the majority of blank strings are empty. As of this writing
# checking for empty? is about 3.5x faster than matching against the regexp
# in MRI, so we call the predicate first, and then fallback.
#
# The penalty for blank strings with whitespace or present ones is marginal.
empty? || BLANK_RE === self
end end
end end


Expand Down

6 comments on commit 697384d

@kaspth
Copy link
Contributor

@kaspth kaspth commented on 697384d Apr 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BOOM 💪

@DefV
Copy link
Contributor

@DefV DefV commented on 697384d Apr 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amazing

@jeremy
Copy link
Member

@jeremy jeremy commented on 697384d Apr 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙇

@schneems
Copy link
Member

@schneems schneems commented on 697384d Apr 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really interesting idea. I think it's unique especially since the method is so small, but by adding non-regex logic we can get some speed bumps. Thanks for this write up.

I profiled codetriage, and it looks like the majority of calls to blank? on string are on non empty strings. This is a case where the performance depends largely on the samples. I ran some benchmarks with some randomly generated strings

require 'benchmark/ips'

EMPTYREGEX = /\A[[:space:]]*\z/

def string_generate
  str = " abcdefghijklmnopqrstuvwxyz\t".freeze
  str[rand(0..(str.length - 1))] * rand(0..23)
end

strings = 100.times.map { string_generate }

Benchmark.ips do |x|
  x.report('regexp           ') { strings.each {|str| EMPTYREGEX === str } }
  x.report('empty            ') { strings.each {|str| str.empty? || EMPTYREGEX === str } }
  x.compare!
end

# Warming up --------------------------------------
#    regexp                1.941k i/100ms
#    empty                 1.982k i/100ms
# Calculating -------------------------------------
#    regexp                19.170k (± 8.2%) i/s -     97.050k
#    empty                 18.259k (±14.6%) i/s -     91.172k

# Comparison:
#    regexp           :    19170.2 i/s
#    empty            :    18258.6 i/s - same-ish: difference falls within error

Even while "randomly" generating the strings, i'm able to tweak the numbers to favor one or the other, but both fall in the same ballpark.

I think it is a good idea, if we seperate out empty? strings earlier we can use extra logic to avoid the regex in even more places.

Running with it

Since the majority of the cases is a string that is not soely composed of whitespace characters, and the slow part of the check is the regex, we could skip the regex if we know the first character of the string is not a whitespace character.

If we did that we could see about a 2x speed boost (off of my sample set).

require 'benchmark/ips'

EMPTYREGEX = /\A[[:space:]]*\z/

def string_generate
  str = " abcdefghijklmnopqrstuvwxyz\t".freeze
  str[rand(0..(str.length - 1))] * rand(0..23)
end

strings = 100.times.map { string_generate }

SPACE_HASH = {}
SPACE_HASH[" "]  = true
SPACE_HASH["\t"] = true

Benchmark.ips do |x|
  x.report('regexp           ') { strings.each {|str| EMPTYREGEX === str } }
  x.report('empty            ') { strings.each {|str| str.empty? || EMPTYREGEX === str } }
  x.report('first char       ') { strings.each {|str| str.empty? || SPACE_HASH[str[0]] ? EMPTYREGEX === str : false } }
  x.compare!
end

# Warming up --------------------------------------
#    regexp                1.868k i/100ms
#    empty                 1.920k i/100ms
#    first char            3.423k i/100ms
# Calculating -------------------------------------
#    regexp                19.244k (± 7.5%) i/s -     97.136k
#    empty                 19.247k (± 7.3%) i/s -     97.920k
#    first char            35.327k (± 6.4%) i/s -    177.996k

# Comparison:
#    first char       :    35327.1 i/s
#    empty            :    19247.2 i/s - 1.84x slower
#    regexp           :    19243.8 i/s - 1.84x slower

Which is pretty good. Unfortunately this would mean that we need to know all the characters that ruby considers a /[[:space:]]/. I checked the source and well... https://github.com/ruby/ruby/blob/20cd25c86fd28eb1b5068d0db607e6aa33107f65/enc/unicode/name2ctype.h#L2794-L2807 I would need help translating that into ruby strings. It looks like it's worth the leg work.

Make the regex faster

If we don't want to do that we could speed up the regex string a little by using + instead of * since we know that it has at least one char (since it is not empty?). We could also reverse the logic and the regex to detect a non-whitespace character !(/\S/ === str) which would be slightly fater than +, as it would stop at the first non-whitespace character instead of continuing to iterate, however I don't know if thats a technically correct solution (i.e. is /\S/ the technical oposite of /\A[[:space:]]*\z/?

@fxn
Copy link
Member Author

@fxn fxn commented on 697384d Apr 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to digest your comment (having dinner over here!), but just a quick answer to the last question in case it helps you continue investigating.

The difference is that [[:space:]] is a Unicode character class, whereas \s is an ASCII character class so to speak (a strict subset of the former). The complement of the Unicode class would be [[:^space:]].

@schneems
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, please enjoy your meal :).

I see that \S would return the wrong value for "\u0085". I submitted a pull request with the new regex logic: #24658

I still think we could speed up the check with a hash of "space" characters. But we need to be careful. Even if we do that, we'll still benifit from a faster regex.

Please sign in to comment.