Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicated last line in CSV.foreach #279

Closed
GabrielNagy opened this issue May 11, 2023 · 14 comments
Closed

Duplicated last line in CSV.foreach #279

GabrielNagy opened this issue May 11, 2023 · 14 comments

Comments

@GabrielNagy
Copy link
Contributor

I found a bug that only reproduces with a very specific set of prerequisites (all of the following must be true):

  • file with CRLF endings
  • file with no EOL/trailling newline (removed using a hex editor since vim always adds them back)
  • file larger than 32768 bytes
  • CSV.foreach
  • option strip: true
  • option skip_lines: /\A,+\n?\z/

The following example (where original.csv is a file containing the line AAAA1234567890 ~2500 times):

CSV.foreach('original.csv', strip: true, skip_lines: /\A,+\n?\z/) do |data|
  puts data[0]
end

will print the last line duplicated:

AAAA1234567890
AAAA1234567890
AAAA1234567890
AAAA1234567890
AAAA1234567890AAAA1234567890
...

As a workaround I used CSV.parse(File.read, ...) with the same options, but I still wanted to flag this issue.

@abcdefg-1234567
Copy link

@kou
I could reproduce this bug at hand.

@kou
Copy link
Member

kou commented May 14, 2023

Thanks for your report.
It may be a bug of internal chunk based stream parser:

csv/lib/csv/parser.rb

Lines 299 to 308 in 22e62bc

chunk = input.gets(@row_separator, @chunk_size)
if chunk
raise InvalidEncoding unless chunk.valid_encoding?
# trace(__method__, :chunk, chunk)
@scanner = StringScanner.new(chunk)
if input.respond_to?(:eof?) and input.eof?
@inputs.shift
@last_scanner = @inputs.empty?
end
true

@abcdefg-1234567 Great! Do you want to work on fixing this problem?
Could you share a script that produces a CSV that reproduces this problem as the first step?

@abcdefg-1234567
Copy link

@kou
Yes, I would like to.
I used a CSV file made by a tool for spreadsheet and IDE to reproduce this problem.
So, I will make script that produces a CSV.

@abcdefg-1234567
Copy link

@kou
I came up with the following code that creates a csv.

CSV.open('test.csv', 'w') do |csv|
  2500.times do
    csv << ['AAAA1234567890']
  end
end

However, I do not know how to make the following state by code.

  • file with CRLF endings
  • file with no EOL/trailling newline

I have confirmed that the bug is reproduced when the following conditions are set manually using the IDE.
Could you please give me any ideas?

@GabrielNagy
Copy link
Contributor Author

hey @abcdefg-1234567 you can use the following:

File.open('test.csv', 'w') do |f|
  2499.times do
    f.print("AAAA1234567890\r\n")
  end
  f.print("AAAA1234567890")
end

@abcdefg-1234567
Copy link

@GabrielNagy
Thank you!

@kou
Copy link
Member

kou commented May 20, 2023

OK. Let's reduce the reproducible CSV size as much as possible as the next step for easy to debug.
If it's difficult, we can start debugging with the reproducible CSV.
#279 (comment) may help you.

@abcdefg-1234567
Copy link

I have confirmed that the bug will not reproduce if the csv is less than 2048 rows.

@abcdefg-1234567
Copy link

I have confirmed the following.

The result of "value = parse_column_value (line 1030 of parser.rb)" when @ lineno=2048 is "AAAA1234567890AAAAA1234567890".

I am also wondering if changes are needed around the adjust_last_keep method.
@kou
Could you please explain the role of this method?

@kou
Copy link
Member

kou commented Jun 12, 2023

Sure.

#adjust_last_keep was introduced for fixing https://bugs.ruby-lang.org/issues/18245 .

InputsScanner acts as logically one StringScanner with multiple inputs. (StringScanner can't work with multiple strings.)

CSV::Parser may want to push back read data. For example, if skip_lines is specified, CSV::Parser may push back read data. CSV::Parser reads a line from its scanner (CSV::Parser::Scanner or CSV::Parser::InputsScanner) to check whether the line should be skipped. If the read line isn't skip target, CSV::Parser pushes back the read line and parses the line as a CSV line. keep_start/keep_drop/keep_back/keep_end are for it.

adjust_last_keep is related to these keep_* methods.InputsScanner processes multiple inputs. So the target data (for example, one line for skip_lines) may exist in multiple inputs. For example, "# a", "bc" and "\n" are one line but they are 3 inputs. adjust_last_keep is for the situation. If we need to concatenate data from multiple inputs, adjust_last_keep does it.

I hope that this explanation helps you.

@abcdefg-1234567
Copy link

Thank you for your detailed explanation!
I will refer to this and continue the investigation.

@kou
Copy link
Member

kou commented Jun 26, 2023

Including line number in line contents will helpful:

File.open('/tmp/test.csv', 'w') do |f|
  lines = 2500.times.collect do |i|
    "A%013d" % i
  end
  f.print(lines.join("\r\n"))
end

Output with the test file:

...
A0000000002497
A0000000002498
A0000000002499A0000000002499

It seems that the last line was used twice.

@kou
Copy link
Member

kou commented Jun 26, 2023

I cloud reproduce this with the script:

ENV["CSV_PARSER_SCANNER_TEST"] = "yes"

require "csv"

csv = CSV.new("a\r\nb", row_sep: "\r\n", strip: true, skip_lines: /\A *\z/)
csv.each do |row|
  pp row
end
["a"]
["bb"]

@kou kou closed this as completed in 183635a Jun 26, 2023
matzbot pushed a commit to ruby/ruby that referenced this issue Jun 28, 2023
GitHub: fix ruby/csv#279

It's happen when:

* `keep_start`/`keep_{drop,back}` are nested.
  (e.g.: `strip: true, skip_lines: /.../`)
* Row separator is `\r\n`.
* `InputScanner` is used. (Small input doesn't use `InputScanner`)

Reported by Gabriel Nagy. Thanks!!!

ruby/csv@183635ab56
@GabrielNagy
Copy link
Contributor Author

Thanks for fixing this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants