Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
~3.5x speedup of String#blank? for empty strings
See the rationale in the comment in this patch. To benchmark this I ran a number of variations, ultimately narrowing to require 'benchmark/ips' str = '' regexp = /\A[[:space:]]*\z/ Benchmark.ips do |x| x.report('regexp') { regexp === str } x.report('empty') { str.empty? || regexp === str } x.compare! end This benchmark has consistently reported speedups around 3.5x: Calculating ------------------------------------- regexp 69.197k i/100ms empty 115.468k i/100ms ------------------------------------------------- regexp 2. 6.3%) i/s - 13.839M empty 9. 8.8%) i/s - 47.804M Comparison: empty: 9642607.6 i/s regexp: 2768351.9 i/s - 3.48x slower Sometimes even reaching 4x. Running the same bechmark on strings of 10 or 100 characters (with whitespace or present) has shown a slowdown of just about 1.01/1.02. Marginal, we seem to have a worthwhile trade-off here.
- Loading branch information
697384d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BOOM 💪
697384d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
697384d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙇
697384d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really interesting idea. I think it's unique especially since the method is so small, but by adding non-regex logic we can get some speed bumps. Thanks for this write up.
I profiled codetriage, and it looks like the majority of calls to
blank?
on string are on non empty strings. This is a case where the performance depends largely on the samples. I ran some benchmarks with some randomly generated stringsEven while "randomly" generating the strings, i'm able to tweak the numbers to favor one or the other, but both fall in the same ballpark.
I think it is a good idea, if we seperate out
empty?
strings earlier we can use extra logic to avoid the regex in even more places.Running with it
Since the majority of the cases is a string that is not soely composed of whitespace characters, and the slow part of the check is the regex, we could skip the regex if we know the first character of the string is not a whitespace character.
If we did that we could see about a 2x speed boost (off of my sample set).
Which is pretty good. Unfortunately this would mean that we need to know all the characters that ruby considers a
/[[:space:]]/
. I checked the source and well... https://github.com/ruby/ruby/blob/20cd25c86fd28eb1b5068d0db607e6aa33107f65/enc/unicode/name2ctype.h#L2794-L2807 I would need help translating that into ruby strings. It looks like it's worth the leg work.Make the regex faster
If we don't want to do that we could speed up the regex string a little by using
+
instead of*
since we know that it has at least one char (since it is notempty?
). We could also reverse the logic and the regex to detect a non-whitespace character!(/\S/ === str)
which would be slightly fater than+
, as it would stop at the first non-whitespace character instead of continuing to iterate, however I don't know if thats a technically correct solution (i.e. is/\S/
the technical oposite of/\A[[:space:]]*\z/
?697384d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to digest your comment (having dinner over here!), but just a quick answer to the last question in case it helps you continue investigating.
The difference is that
[[:space:]]
is a Unicode character class, whereas\s
is an ASCII character class so to speak (a strict subset of the former). The complement of the Unicode class would be[[:^space:]]
.697384d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, please enjoy your meal :).
I see that
\S
would return the wrong value for"\u0085"
. I submitted a pull request with the new regex logic: #24658I still think we could speed up the check with a hash of "space" characters. But we need to be careful. Even if we do that, we'll still benifit from a faster regex.