Parsing of quoted text with quotes inside #60

Closed
larsniemann1983 opened this issue Dec 17, 2015 · 11 comments

@larsniemann1983

Hi, is it possible to parse this correctly with some settings?

example1;"example with " inside";example3

Or do I have to escape the quote like this?

example1;"example with ""inside";example3

Thanks for your help.

@HyukjinKwon

@larsniemann1983 Were you able to parse that data?

@HyukjinKwon

@jbax I searched the issues in this library but I couldn't find the case Spark is hitting right now except for this one.
There is an Apache Spark issue open here, SPARK-14103, which I think is caused by an issue in this library.

The problem is, it can't correctly parse the data below:

"a"b    ccc ddd

This produces the following output:

[a"b    ccc ddd]     <- As a value

I think it makes sense that it should be parsed as below:

[ab], [ccc], [ddd]

If so, I think it should also be able to parse this input with the setting below:

setMaxCharsPerColumn(4)

Currently, it throws the exception below with the data and setting above.

Error processing input: Length of parsed input (5) exceeds the maximum number of characters defined in your parser settings (4). 
Hint: Number of characters processed may have exceeded limit of 4 characters per column. Use settings.setMaxCharsPerColumn(int) to define the maximum number of characters a column can have
Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
Parser Configuration: CsvParserSettings:
    Column reordering enabled=true
    Empty value=null
    Header extraction enabled=false
    Headers=null
    Ignore leading whitespaces=false
    Ignore trailing whitespaces=false
    Input buffer size=128
    Input reading on separate thread=false
    Line separator detection enabled=false
    Maximum number of characters per column=4
    Maximum number of columns=20480
    Null value=
    Number of records to read=all
    Parse unescaped quotes=true
    Row processor=none
    Selected fields=none
    Skip empty lines=true
Format configuration:
    CsvFormat:
        Comment character=\0
        Field delimiter=\t
        Line separator (normalized)=\n
        Line separator sequence=\n
        Quote character="
        Quote escape character=quote escape
        Quote escape escape character=\0, line=0, char=6. Content parsed: [year]
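For reference, a minimal sketch of a configuration that appears to reproduce this, assuming the tab-delimited sample input above and the 4-character limit from the configuration dump:

    import com.univocity.parsers.common.TextParsingException;
    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;

    CsvParserSettings settings = new CsvParserSettings();
    settings.getFormat().setDelimiter('\t');
    settings.setMaxCharsPerColumn(4); // matches the configuration dump above

    try {
        // the unescaped quote makes the parser read past the tab delimiters while
        // looking for a closing quote, so the first column grows beyond 4 characters
        new CsvParser(settings).parseLine("\"a\"b\tccc\tddd");
    } catch (TextParsingException e) {
        System.out.println(e.getMessage());
    }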

@HyukjinKwon

@jbax Spark is currently using univocity version 1.5.6. Is this fixed in newer versions? If not, is there any easy way to fix this, or is this the intended behaviour?

I would be thankful if you could help me solve this issue.

@HyukjinKwon

@larsniemann1983 Would you maybe reopen this if you were not able to solve this problem?

@jbax
Member

jbax commented Apr 3, 2016

Hi, the example you provided contains an unescaped quote (which is invalid CSV). uniVocity-parsers' CSV parser is the only parser I know of that is able to handle this sort of thing. There are a few settings that allow you to handle these cases. Let's go through your examples one by one:

First

    example1;"example with " inside";example3

The quote after the word with is not escaped. If you parse this with parseUnescapedQuotes=true:

    CsvParserSettings settings = new CsvParserSettings();
    settings.getFormat().setDelimiter(';');
    settings.setParseUnescapedQuotes(true);

    String[] values = new CsvParser(settings).parseLine("example1;\"example with \" inside\";example3");
    for(String value : values){
        System.out.println(value);
    }

The result will be:

    example1
    example with " inside
    example3

The parser will handle the unescaped quote and keep going until it finds a closing quote (i.e. a quote followed by a delimiter).

Now, if you want to force the CSV to be well formed and get an exception to ensure quotes are correctly escaped, set parseUnescapedQuotes=false, and then you will get the following:

    com.univocity.parsers.common.TextParsingException: com.univocity.parsers.common.TextParsingException - Unexpected character 'i' following quoted value of CSV field. Expecting ';'. Cannot parse CSV input.
    Internal state when error was thrown: line=0, column=1, record=0, charIndex=27, content parsed=example with 
    Parser Configuration: CsvParserSettings:
    ...
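For completeness, a small sketch of this strict mode, assuming the same ';'-delimited input as in the first example:

    import com.univocity.parsers.common.TextParsingException;
    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;

    CsvParserSettings settings = new CsvParserSettings();
    settings.getFormat().setDelimiter(';');
    settings.setParseUnescapedQuotes(false); // reject unescaped quotes instead of tolerating them

    try {
        new CsvParser(settings).parseLine("example1;\"example with \" inside\";example3");
    } catch (TextParsingException e) {
        System.out.println(e.getMessage()); // prints the error shown above
    }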

Second

    "a"b    ccc ddd

In this case, you have the unescaped quote (before b) but no closing quote anywhere. When parseUnescapedQuotes=true the parser will keep going until it finds a closing quote or the end of the input, so there's no reasonable way to determine where the value ends. We are adding a parseUnescapedQuotesUntilDelimiter setting that, when enabled, makes the parser stop reading the value as soon as a delimiter or line ending is found after the unescaped quote. That should produce the following:

    "a"b
    ccc
    ddd

This might help, but as you are parsing an input that uses tab as delimiter, you could probably use our TSV parser instead, as your input doesn't conform to CSV anyway.
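If you go that route, a minimal sketch of the TSV alternative could look like this (assuming the TsvParser and TsvParserSettings classes from the same library and the tab-delimited input above):

    import com.univocity.parsers.tsv.TsvParser;
    import com.univocity.parsers.tsv.TsvParserSettings;

    TsvParserSettings settings = new TsvParserSettings();

    // TSV has no quoting rules, so the '"' characters are kept as plain text
    String[] values = new TsvParser(settings).parseLine("\"a\"b\tccc\tddd");
    // expected values: ["a"b], [ccc], [ddd]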

I'll keep this issue open because the current version of the parser is not producing the expected result if you set parseUnescapedQuotesUntilDelimiter=true. It will be fixed in less than 24 hours.

@jbax jbax reopened this Apr 3, 2016
@jbax jbax added the bug label Apr 3, 2016
@jbax jbax added this to the 2.0.2 milestone Apr 3, 2016
@jbax jbax self-assigned this Apr 3, 2016
@HyukjinKwon

@jbax Thanks so much for your help. Let me cc @srowen just to inform him.

jbax added a commit that referenced this issue Apr 4, 2016
@jbax
Member

jbax commented Apr 4, 2016

Done. I've released a 2.0.2-SNAPSHOT that includes the fix for this. Now when parseUnescapedQuotesUntilDelimiter=true you will get the expected result.

I'll release version 2.0.2 once you confirm this is working as expected.

@jbax jbax closed this as completed Apr 4, 2016
@HyukjinKwon

@jbax Thanks. I just tested 2.0.2-SNAPSHOT with Spark and I can confirm it works fine with the input and additional setting below, producing the following output:

  • Input:
    "a"b    ccc ddd
  • Additional setting:
    setParseUnescapedQuotesUntilDelimiter(true)
  • Output:
    ["a"b], [ccc], [ddd]

@jbax
Member

jbax commented Apr 4, 2016

Thank you! Version 2.0.2 released.

@jbax
Member

jbax commented Apr 5, 2016

@HyukjinKwon For reference, here is the test code. I'm parsing Strings individually (I saw your comment on SPARK-14103):

    CsvParserSettings settings = new CsvParserSettings();
    settings.setParseUnescapedQuotesUntilDelimiter(true);
    settings.getFormat().setDelimiter('\t');

    String[] values = new CsvParser(settings).parseLine("\"a\"b\tccc\tddd");
    assertEquals(values.length, 3);
    assertEquals(values[0], "\"a\"b");
    assertEquals(values[1], "ccc");
    assertEquals(values[2], "ddd");

Version 2 comes with lots of API improvements; check the examples here.

@HyukjinKwon

@jbax Cool. Thanks!
I also just found that API and realised that Spark uses a Reader only for better performance. I think I might have to run a performance test for both cases, using a Reader and using a String.
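For illustration, a rough sketch of the two approaches being compared, assuming the Reader-based beginParsing/parseNext API and the String-based parseLine method:

    import java.io.Reader;
    import java.io.StringReader;
    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;

    CsvParserSettings settings = new CsvParserSettings();
    settings.setParseUnescapedQuotesUntilDelimiter(true);
    settings.getFormat().setDelimiter('\t');
    CsvParser parser = new CsvParser(settings);

    // (a) stream the whole input through a Reader
    Reader input = new StringReader("\"a\"b\tccc\tddd\neee\tfff\tggg"); // second row is filler for this sketch
    parser.beginParsing(input);
    String[] row;
    while ((row = parser.parseNext()) != null) {
        // process each row as it is read from the Reader
    }

    // (b) parse pre-split lines one String at a time
    String[] values = parser.parseLine("\"a\"b\tccc\tddd");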

ghost pushed a commit to dbtsai/spark that referenced this issue Apr 8, 2016
## What changes were proposed in this pull request?

This PR resolves a problem with parsing unescaped quotes in input data. For example, currently the data below:

```
"a"b,ccc,ddd
e,f,g
```

produces the data below:

- **Before**

```bash
["a"b,ccc,ddd[\n]e,f,g]  <- as a value.
```

- **After**

```bash
["a"b], [ccc], [ddd]
[e], [f], [g]
```

This PR bumps up the Univocity parser's version. This was fixed in `2.0.2`, uniVocity/univocity-parsers#60.

## How was this patch tested?

Unit tests in `CSVSuite` and `sbt/sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#12226 from HyukjinKwon/SPARK-14103-quote.
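To tie this back to the parser setting discussed above, a small sketch that reproduces the "After" behaviour at the univocity level, assuming the parseAll(Reader) method and the default ',' delimiter:

    import java.io.StringReader;
    import java.util.List;
    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;

    CsvParserSettings settings = new CsvParserSettings();
    settings.setParseUnescapedQuotesUntilDelimiter(true); // the fix discussed in this issue
    settings.setLineSeparatorDetectionEnabled(true);      // the sample input uses '\n' line endings

    List<String[]> rows = new CsvParser(settings).parseAll(new StringReader("\"a\"b,ccc,ddd\ne,f,g"));
    // expected rows: ["a"b, ccc, ddd] and [e, f, g] -- the "After" output above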