Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix case where an \r\n row separator will be split when reading a chunk #221

Closed
wants to merge 1 commit into from

Conversation

jeremyevans
Copy link

In this case, read one more character.

This is a suboptimal fix, as it doesn't fix handling of row
separators that aren't two characters and starting with \r.
A better fix would handle all multibyte row separators.
However, as \r\n is one of the most common row separators,
I think it's useful to merge this until a more generic
solution is developed.

Fixes [Bug #18245]

In this case, read one more character.

This is a suboptimal fix, as it doesn't fix handling of row
separators that aren't two characters and starting with \r.
A better fix would handle all multibyte row separators.
However, as \r\n is one of the most common row separators,
I think it's useful to merge this until a more generic
solution is developed.

Fixes [Bug #18245]
@kou
Copy link
Member

kou commented Oct 8, 2021

Thanks! But this is not a real fix of this problem.
There is a problem in keep_start/keep_drop over @scanner switch. I'll fix it later.

Anyway, we need to improve this case. But we can specify row separator explicitly:

diff --git a/lib/csv/parser.rb b/lib/csv/parser.rb
index 0d8a157..2d76316 100644
--- a/lib/csv/parser.rb
+++ b/lib/csv/parser.rb
@@ -85,9 +85,10 @@ class CSV
     # If there is no more data (eos? = true), it returns "".
     #
     class InputsScanner
-      def initialize(inputs, encoding, chunk_size: 8192)
+      def initialize(inputs, encoding, row_separator, chunk_size: 8192)
         @inputs = inputs.dup
         @encoding = encoding
+        @row_separator = row_separator
         @chunk_size = chunk_size
         @last_scanner = @inputs.empty?
         @keeps = []
@@ -233,7 +234,7 @@ class CSV
           @last_scanner = @inputs.empty?
           true
         else
-          chunk = input.gets(nil, @chunk_size)
+          chunk = input.gets(@row_separator, @chunk_size)
           if chunk
             raise InvalidEncoding unless chunk.valid_encoding?
             @scanner = StringScanner.new(chunk)
@@ -737,6 +738,7 @@ class CSV
         chunk_size = ENV["CSV_PARSER_SCANNER_TEST_CHUNK_SIZE"] || "1"
         InputsScanner.new(inputs,
                           @encoding,
+                          @row_separator,
                           chunk_size: Integer(chunk_size, 10))
       end
     else
@@ -763,7 +765,7 @@ class CSV
             StringIO.new(sample)
           end
           inputs << @input
-          InputsScanner.new(inputs, @encoding)
+          InputsScanner.new(inputs, @encoding, @row_separator)
         end
       end
     end

@kou
Copy link
Member

kou commented Dec 24, 2021

I pushed the row separator fix. It fixes a problem with the reported data. But it's not a real fix.
We should fix it later: #230

@kou kou closed this Dec 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants