Feature #15771: Add `String#split` option to set split_type string when a single space separator #2132

Open: @284km wants to merge 1 commit into base: trunk

@284km commented Apr 16, 2019

This change will result in the following:

" a  b   c    ".split(" ")
=> ["a", "b", "c"]
" a  b   c    ".split(" ", -1)
=> ["a", "b", "c", ""]
" a  b   c    ".split(" ", literal: true)
=> ["", "a", "", "b", "", "", "c"]
" a  b   c    ".split(" ", -1, literal: true)
=> ["", "a", "", "b", "", "", "c", "", "", "", ""]

In String#split, when the separator is a single space character, the split is executed with split_type awk.
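To make the awk behavior concrete, here is a minimal sketch using only the existing `String#split` (no `literal:` option involved). It contrasts a single-space String separator with a Regexp that matches one space; the variable names are only for illustration.

```ruby
s = " a  b   c    "

# A single-space String separator triggers awk-style splitting:
# leading whitespace is skipped and each run of whitespace is one delimiter.
awk_fields = s.split(" ")

# A Regexp matching exactly one space splits at every space,
# so empty fields show up between consecutive spaces.
regexp_fields = s.split(/ /)

p awk_fields     # => ["a", "b", "c"]
p regexp_fields  # => ["", "a", "", "b", "", "", "c"]
```

The second result is the output the proposed `literal: true` is meant to produce without going through a Regexp.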

For example, the CSV library works around this as follows:
https://github.com/ruby/csv/blob/7ff57a50e81c368029fa9b664700bec4a456b81b/lib/csv/parser.rb#L508-L512

if @column_separator == " ".encode(@encoding)
  @split_column_separator = Regexp.new(@escaped_column_separator)
else
  @split_column_separator = @column_separator
end
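The same idea as a standalone sketch, for readers who want to try it outside the CSV parser. `split_separator_for` is a hypothetical helper, not part of the CSV library; it assumes the escaped separator is what `Regexp.escape` returns, and it leaves out the encoding handling.

```ruby
# Hypothetical helper mirroring the workaround above.
def split_separator_for(column_separator)
  if column_separator == " "
    # Wrap the single space in a Regexp so String#split does not use awk mode.
    Regexp.new(Regexp.escape(column_separator))
  else
    # Any other separator can stay a plain String (the fast path).
    column_separator
  end
end

space_fields = "a  b c".split(split_separator_for(" "), -1)
comma_fields = "a,,b".split(split_separator_for(","), -1)

p space_fields  # => ["a", "", "b", "c"]
p comma_fields  # => ["a", "", "b"]
```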

Unfortunately, in this case splitting with a regexp is slower than splitting with a string. For example,
in the following benchmark the regexp version is about 9 times slower.
https://github.com/284km/benchmarks_no_yatu#stringsplitstring-or-regexp

$ be benchmark-driver string_split_string-regexp.yml --rbenv '2.6.2'
Comparison:
              string:   3161117.6 i/s
              regexp:    344448.0 i/s - 9.18x  slower
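For a rough local check without benchmark-driver, a timing sketch like the one below may be enough. It is not the benchmark from the linked repository: the input line, the comma separator, and the iteration count are assumptions, and the numbers will vary by Ruby version and machine.

```ruby
# Rough timing sketch using a monotonic clock; time_split is a hypothetical helper.
def time_split(line, separator, iterations)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { line.split(separator, -1) }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

line = "aaa,bbb,ccc,ddd,eee" # sample input (assumed)
iterations = 200_000          # arbitrary count (assumed)

string_time = time_split(line, ",", iterations)  # String separator
regexp_time = time_split(line, /,/, iterations)  # equivalent Regexp separator

puts format("string: %.3fs", string_time)
puts format("regexp: %.3fs", regexp_time)
```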

So I want to add a :literal option that makes the split run with split_type string.
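The intended semantics can be sketched in plain Ruby without a Regexp. `split_literal` and its `keep_trailing:` keyword are hypothetical names used only for this illustration; the actual change is made inside `String#split` itself, and `keep_trailing: true` stands in for passing a limit of -1.

```ruby
# Hypothetical pure-Ruby emulation of a literal (non-awk) split.
def split_literal(str, sep, keep_trailing: false)
  raise ArgumentError, "separator must not be empty" if sep.empty?

  fields = []
  pos = 0
  # Cut at every occurrence of the separator, including consecutive ones.
  while (hit = str.index(sep, pos))
    fields << str[pos...hit]
    pos = hit + sep.length
  end
  fields << str[pos..-1]

  # Like String#split without a negative limit, drop trailing empty fields.
  unless keep_trailing
    fields.pop while fields.any? && fields.last.empty?
  end
  fields
end

s = " a  b   c    "
p split_literal(s, " ")
# => ["", "a", "", "b", "", "", "c"]
p split_literal(s, " ", keep_trailing: true)
# => ["", "a", "", "b", "", "", "c", "", "", "", ""]
```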


@284km force-pushed the 284km:split_space branch from ad295fe to 6d096a9 on Apr 17, 2019
