Skip to content

Commit

Permalink
Add handling for ambiguous parsing options (#226)
Browse files Browse the repository at this point in the history
GitHub: fix GH-225

With Ruby 3.0.2 and csv 3.2.1, the file

```ruby
require "csv"
File.open("example.tsv", "w") { |f| f.puts("foo\t\tbar") }
CSV.read("example.tsv", col_sep: "\t", strip: true)
```

produces the error

```
lib/csv/parser.rb:935:in `parse_quotable_robust': TODO: Meaningful
message in line 1. (CSV::MalformedCSVError)
```

However, the CSV in this example is not malformed; instead, ambiguous
options were provided to the parser. It is not obvious (to me) whether
the string should be parsed as

- `["foo\t\tbar"]`,
- `["foo", "bar"]`,
- `["foo", "", "bar"]`, or
- `["foo", nil, "bar"]`.

This commit adds code that raises an exception when this situation is
encountered. Specifically, it checks if the column separator either ends
with or starts with the characters that would be stripped away.

This commit also adds unit tests and updates the documentation.
  • Loading branch information
adamroyjones committed Nov 18, 2021
1 parent 27c0b66 commit cc317dd
Show file tree
Hide file tree
Showing 3 changed files with 59 additions and 6 deletions.
13 changes: 7 additions & 6 deletions lib/csv.rb
Original file line number Diff line number Diff line change
Expand Up @@ -341,14 +341,14 @@
# liberal_parsing: false,
# nil_value: nil,
# empty_value: "",
# strip: false,
# # For generating.
# write_headers: nil,
# quote_empty: true,
# force_quotes: false,
# write_converters: nil,
# write_nil_value: nil,
# write_empty_value: "",
# strip: false,
# }
#
# ==== Options for Parsing
Expand All @@ -366,8 +366,9 @@
# - +header_converters+: Specifies the header converters to be used.
# - +skip_blanks+: Specifies whether blanks lines are to be ignored.
# - +skip_lines+: Specifies how comments lines are to be recognized.
# - +strip+: Specifies whether leading and trailing whitespace are
# to be stripped from fields..
# - +strip+: Specifies whether leading and trailing whitespace are to be
# stripped from fields. This must be compatible with +col_sep+; if it is not,
# then an +ArgumentError+ exception will be raised.
# - +liberal_parsing+: Specifies whether \CSV should attempt to parse
# non-compliant data.
# - +nil_value+: Specifies the object that is to be substituted for each null (no-text) field.
Expand Down Expand Up @@ -946,14 +947,14 @@ def initialize(message, line_number)
liberal_parsing: false,
nil_value: nil,
empty_value: "",
strip: false,
# For generating.
write_headers: nil,
quote_empty: true,
force_quotes: false,
write_converters: nil,
write_nil_value: nil,
write_empty_value: "",
strip: false,
}.freeze

class << self
Expand Down Expand Up @@ -1879,11 +1880,11 @@ def initialize(data,
encoding: nil,
nil_value: nil,
empty_value: "",
strip: false,
quote_empty: true,
write_converters: nil,
write_nil_value: nil,
write_empty_value: "",
strip: false)
write_empty_value: "")
raise ArgumentError.new("Cannot parse nil as CSV") if data.nil?

if data.is_a?(String)
Expand Down
23 changes: 23 additions & 0 deletions lib/csv/parser.rb
Original file line number Diff line number Diff line change
Expand Up @@ -361,6 +361,7 @@ def prepare
prepare_skip_lines
prepare_strip
prepare_separators
validate_strip_and_col_sep_options
prepare_quoted
prepare_unquoted
prepare_line
Expand Down Expand Up @@ -531,6 +532,28 @@ def prepare_separators
@not_line_end = Regexp.new("[^\r\n]+".encode(@encoding))
end

# This method verifies that there are no (obvious) ambiguities with the
# provided +col_sep+ and +strip+ parsing options. For example, if +col_sep+
# and +strip+ were both equal to +\t+, then there would be no clear way to
# parse the input.
def validate_strip_and_col_sep_options
return unless @strip

if @strip.is_a?(String)
if @column_separator.start_with?(@strip) || @column_separator.end_with?(@strip)
raise ArgumentError,
"The provided strip (#{@escaped_strip}) and " \
"col_sep (#{@escaped_column_separator}) options are incompatible."
end
else
if Regexp.new("\\A[#{@escaped_strip}]|[#{@escaped_strip}]\\z").match?(@column_separator)
raise ArgumentError,
"The provided strip (true) and " \
"col_sep (#{@escaped_column_separator}) options are incompatible."
end
end
end

def prepare_quoted
if @quote_character
@quotes = Regexp.new(@escaped_quote_character +
Expand Down
29 changes: 29 additions & 0 deletions test/csv/parse/test_strip.rb
Original file line number Diff line number Diff line change
Expand Up @@ -80,4 +80,33 @@ def test_do_not_strip_crlf
%Q{"a" ,"b " \r\n},
strip: true))
end

def test_col_sep_incompatible_true
message = "The provided strip (true) and " \
"col_sep (\\t) options are incompatible."
assert_raise_with_message(ArgumentError, message) do
CSV.parse_line(%Q{"a"\t"b"\n},
col_sep: "\t",
strip: true)
end
end

def test_col_sep_incompatible_string
message = "The provided strip (\\t) and " \
"col_sep (\\t) options are incompatible."
assert_raise_with_message(ArgumentError, message) do
CSV.parse_line(%Q{"a"\t"b"\n},
col_sep: "\t",
strip: "\t")
end
end

def test_col_sep_compatible_string
assert_equal(
["a", "b"],
CSV.parse_line(%Q{\va\tb\v\n},
col_sep: "\t",
strip: "\v")
)
end
end

0 comments on commit cc317dd

Please sign in to comment.