Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optimization for quote_char: nil #79

Merged
merged 14 commits into from Mar 12, 2019

Conversation

2 participants
@284km
Copy link
Contributor

284km commented Mar 1, 2019

I implement #56

This implementation uses line.split(column_separator) to parse. As a result, it got 2.7x faster for quote_char: nil case.
Also, as the number of columns to parse increases, quote_char: nil is more advantageous in terms of performance.

benchmark:
$ rake benchmark:parse_quote_char_nil

(Case of 50 columns)
Comparison:
      quote_char_nil:        76.4 i/s 
         encode_sjis:        52.9 i/s - 1.45x  slower
        encode_utf-8:        51.5 i/s - 1.48x  slower
       col_sep_space:        48.3 i/s - 1.58x  slower
  without_quote_char:        28.3 i/s - 2.70x  slower

(Case of 10 columns)
Comparison:
      quote_char_nil:       265.4 i/s 
         encode_sjis:       206.5 i/s - 1.29x  slower
        encode_utf-8:       196.9 i/s - 1.35x  slower
       col_sep_space:       172.6 i/s - 1.54x  slower
  without_quote_char:       120.9 i/s - 2.20x  slower

If col_sep is a single byte blank character " ", String#split is divided by a blank string after excluding leading and trailing blanks.
In order to support the following cases, use regular expression in String#split only when col_sep is a space character.

CSV.parse("a b  d", col_sep: " ", quote_char: nil)
#=> [["a", "b", nil, "d"]]

However, when using regular expressions, it is about 1.5 times slower than using strings. The above col_sep_space is the benchmark result. Even in this case it is faster than the existing parse implementation.

Unfortunately, This implementation currently does not support stream.

284km added some commits Mar 1, 2019

Add benchmark for quote_char: nil
benchmark:
$ rake benchmark:parse_quote_char_nil

Comparison:
      quote_char_nil:       111.1 i/s
         encode_sjis:        80.6 i/s - 1.38x  slower
        encode_utf-8:        70.6 i/s - 1.57x  slower
       col_sep_space:        69.8 i/s - 1.59x  slower
  without_quote_char:        42.6 i/s - 2.61x  slower
@kou
Copy link
Member

kou left a comment

Thanks.
Could you check my comments?

encode_utf-8: |-
CSV.parse(enc_utf8, quote_char: nil)
encode_sjis: |-
CSV.parse(enc_sjis, quote_char: nil)

This comment has been minimized.

@kou

kou Mar 8, 2019

Member

Can we remove UTF-8 and Windows-31J cases?
Should they have different behavior than ASCII case?

This comment has been minimized.

@284km

284km Mar 12, 2019

Author Contributor

I removed it because it is same behavior as ASCII. 19b7508

@@ -227,6 +227,33 @@ def line
last_line
end

def parse_quote_character_nil(&block)

This comment has been minimized.

@kou

kou Mar 8, 2019

Member

Could you stop exporting this method as a public method?

This comment has been minimized.

@284km

284km Mar 12, 2019

Author Contributor

I Changed to private. 4d38659

@@ -227,6 +227,33 @@ def line
last_line
end

def parse_quote_character_nil(&block)
while true
return nil unless value = @input.gets(@row_separator)

This comment has been minimized.

@kou

kou Mar 8, 2019

Member

We can use while value = @input.gets(@row_separator).

This comment has been minimized.

@kou

kou Mar 8, 2019

Member

@input.each_line(@row_separator) do is faster than while line = @input.gets(@row_separator).

This comment has been minimized.

@kou

kou Mar 8, 2019

Member

@input.string.each_line(@row_separator) do is faster than @input.each_line(@row_separator).

This comment has been minimized.

@284km

284km Mar 12, 2019

Author Contributor

I changed to @input.string.each_line(@row_separator) do. 33c60a3
it is the fastest.

# while value = @input.gets(@row_separator)
      quote_char_nil:       111.1 i/s
       col_sep_space:        66.3 i/s - 1.68x  slower
  without_quote_char:        44.3 i/s - 2.51x  slower

# @input.string.each_line(@row_separator) do |value|
      quote_char_nil:       119.1 i/s
       col_sep_space:        70.9 i/s - 1.68x  slower
  without_quote_char:        43.9 i/s - 2.72x  slower
columns = if @column_separator == " "
value.split(@column_end, -1)
else
value.split(@column_separator, -1)

This comment has been minimized.

@kou

kou Mar 8, 2019

Member

We can omit -1 argument.

This comment has been minimized.

@284km

284km Mar 12, 2019

Author Contributor

Can not omit -1 in this case.

irb(main):001:0> str = %Q{a,,,}
=> "a,,,"
irb(main):002:0> str.split(',')
=> ["a"]
irb(main):003:0> str.split(',', -1)
=> ["a", "", "", ""]
end
columns.each do |column|
row << (column.empty? ? nil : column)
end

This comment has been minimized.

@kou

kou Mar 8, 2019

Member

The following is faster:

row = if @column_separator == " "
  value.split(@column_end, -1)
else
  value.split(@column_separator, -1)
end
n_columns = row.size
i = 0
while i < n_columns
  row[i] = nil if row[i].empty?
  i += 1
end

This comment has been minimized.

@284km

284km Mar 12, 2019

Author Contributor

Wow, I did not notice. 5bab989

# before change
      quote_char_nil:       103.2 i/s
       col_sep_space:        70.3 i/s - 1.47x  slower
  without_quote_char:        44.3 i/s - 2.33x  slower

# after
      quote_char_nil:       142.3 i/s
       col_sep_space:        76.5 i/s - 1.86x  slower
  without_quote_char:        46.0 i/s - 3.10x  slower
@quoted_value = Regexp.new("[^".encode(@encoding) +
escaped_quote_character +
"]+".encode(@encoding))
if @options[:quote_character]

This comment has been minimized.

@kou

kou Mar 8, 2019

Member

Could you use @quote_character instead of @options[:...]?

This comment has been minimized.

@284km

284km Mar 12, 2019

Author Contributor

Yes, Fixed it. 8177020

escaped_backslash_character = Regexp.escape(@backslash_character)
@escaped_backslash = Regexp.new(escaped_backslash_character)
@escaped_quote = Regexp.new(escaped_quote_character)
@backslash_quote_character = @backslash_character + escaped_quote_character

This comment has been minimized.

@kou

kou Mar 8, 2019

Member

Could you initialize them here to collect escaped_* related variables the same code block?

if @quote_character
  escaped_quote_character = Regex.escape(@quote_character)
  @escaped_quote = Regexp.new(escaped_quote_character)
  @backslash_quote_character = @backslash_character + escaped_quote_character
end

This comment has been minimized.

@284km

284km Mar 12, 2019

Author Contributor

Fixed it. b523351

if @liberal_parsing
@unquoted_value = Regexp.new("[^".encode(@encoding) +
escaped_first_column_separator +
"\r\n]+".encode(@encoding))
else
elsif @options[:quote_character]

This comment has been minimized.

@kou

kou Mar 8, 2019

Member

Could you use @quote_character?

This comment has been minimized.

@284km

284km Mar 12, 2019

Author Contributor

Fixed. 8177020

].each do |edge_case|
assert_equal(edge_case.last, CSV.parse_line(edge_case.first, quote_char: nil))
end
end

This comment has been minimized.

@kou

kou Mar 8, 2019

Member

Could you split these 3 patterns to each test?
We should use one data in one test.

This comment has been minimized.

@284km

284km Mar 12, 2019

Author Contributor

Splited into 3. d6ccb1b

DATA
assert_nothing_raised(Exception) do
csv = CSV.parse(data, headers: "my,new,headers", quote_char: nil)
end

This comment has been minimized.

@kou

kou Mar 8, 2019

Member

Could you remove assert_nothing_raised?
The assertion is meaningless.

This comment has been minimized.

@kou

kou Mar 8, 2019

Member

We can use one assert_equal in this case:

    assert_equal(CSV::Table.new(...),
                 CSV.parse(data, headers: "my,new,headers", quote_char: nil))

This comment has been minimized.

@284km

284km Mar 12, 2019

Author Contributor

Fixed. efa8916

284km added some commits Mar 12, 2019

Remove UTF-8 and Windows-31J test cases
Because it is the same behavior as ASCII.
Performance improvement
benchmark:

```
      quote_char_nil:       111.1 i/s
       col_sep_space:        66.3 i/s - 1.68x  slower
  without_quote_char:        44.3 i/s - 2.51x  slower

      quote_char_nil:       119.1 i/s
       col_sep_space:        70.9 i/s - 1.68x  slower
  without_quote_char:        43.9 i/s - 2.72x  slower
```
@284km

This comment has been minimized.

Copy link
Contributor Author

284km commented Mar 12, 2019

benchmark result is below. 3.77x faster!
https://travis-ci.org/ruby/csv/jobs/505165753

(Case of 50 columns)
      quote_char_nil:       103.7 i/s 
       col_sep_space:        56.2 i/s - 1.84x  slower
  without_quote_char:        27.5 i/s - 3.77x  slower

(Case of 10 columns)
      quote_char_nil:       309.5 i/s 
       col_sep_space:       168.6 i/s - 1.84x  slower
  without_quote_char:       122.4 i/s - 2.53x  slower
@kou

kou approved these changes Mar 12, 2019

Copy link
Member

kou left a comment

Thanks!

@kou kou merged commit c497133 into ruby:master Mar 12, 2019

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@284km 284km deleted the 284km:add_optimization_for_quote_char_nil branch Mar 13, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.