Parsing of quoted text with quotes inside #60

Closed
larsniemann1983 opened this issue Dec 17, 2015 · 11 comments

@larsniemann1983

Hi, is it possible to parse this correctly with some settings?

example1;"example with " inside";example3

Or do I have to escape the quote like this?

example1;"example with ""inside";example3

Thanks for your help.

@HyukjinKwon

@larsniemann1983 Were you able to parse that data?

@HyukjinKwon

@jbax I searched the issues in this library but I couldn't find the case Spark is hitting right now except for this one.
There is an Apache Spark issue open here, SPARK-14103, which I think is caused by an issue in this library.

The problem is, it can't correctly parse the data below:

"a"b    ccc ddd

This produces the following output:

[a"b    ccc ddd]     <- As a value

I think it makes sense that it should be parsed as below:

[ab], [ccc], [ddd]

If so, I think it should also be able to parse this input with the setting below:

setMaxCharsPerColumn(4)

Currently, it throws the exception below with the data and setting above.

Error processing input: Length of parsed input (5) exceeds the maximum number of characters defined in your parser settings (4). 
Hint: Number of characters processed may have exceeded limit of 4 characters per column. Use settings.setMaxCharsPerColumn(int) to define the maximum number of characters a column can have
Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
Parser Configuration: CsvParserSettings:
    Column reordering enabled=true
    Empty value=null
    Header extraction enabled=false
    Headers=null
    Ignore leading whitespaces=false
    Ignore trailing whitespaces=false
    Input buffer size=128
    Input reading on separate thread=false
    Line separator detection enabled=false
    Maximum number of characters per column=4
    Maximum number of columns=20480
    Null value=
    Number of records to read=all
    Parse unescaped quotes=true
    Row processor=none
    Selected fields=none
    Skip empty lines=true
Format configuration:
    CsvFormat:
        Comment character=\0
        Field delimiter=\t
        Line separator (normalized)=\n
        Line separator sequence=\n
        Quote character="
        Quote escape character=quote escape
        Quote escape escape character=\0, line=0, char=6. Content parsed: [year]
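For reference, a minimal sketch of a configuration that appears to reproduce this, assuming the tab-delimited sample input above and the 4-character limit from the configuration dump:

    import com.univocity.parsers.common.TextParsingException;
    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;

    CsvParserSettings settings = new CsvParserSettings();
    settings.getFormat().setDelimiter('\t');
    settings.setMaxCharsPerColumn(4); // matches the configuration dump above

    try {
        // the unescaped quote makes the parser read past the tab delimiters while
        // looking for a closing quote, so the first column grows beyond 4 characters
        new CsvParser(settings).parseLine("\"a\"b\tccc\tddd");
    } catch (TextParsingException e) {
        System.out.println(e.getMessage());
    }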

@HyukjinKwon

@jbax Spark is currently using univocity version 1.5.6. Is this fixed in newer versions? If not, is there any easy way to fix this, or is this the intended behaviour?

I would be thankful if you could help me solve this issue.

@HyukjinKwon

@larsniemann1983 Would you maybe reopen this if you were not able to solve this problem?

@jbax
Member

jbax commented Apr 3, 2016

Hi, the example you provided contains an unescaped quote (which is invalid CSV). uniVocity-parsers' CSV parser is the only parser I know of that is able to handle this sort of thing. There are a few settings that allow you to handle these cases. Let's go through your examples one by one:

First

    example1;"example with " inside";example3

The quote after the word with is not escaped. If you parse this with parseUnescapedQuotes=true:

    CsvParserSettings settings = new CsvParserSettings();
    settings.getFormat().setDelimiter(';');
    settings.setParseUnescapedQuotes(true);

    String[] values = new CsvParser(settings).parseLine("example1;\"example with \" inside\";example3");
    for(String value : values){
        System.out.println(value);
    }

The result will be:

    example1
    example with " inside
    example3

The parser will handle the unescaped quote and keep going until it finds a closing quote (i.e. a quote followed by a delimiter).

Now, if you want to force the CSV to be well formed and get an exception to ensure quotes are correctly escaped, set parseUnescapedQuotes=false, and then you will get the following:

    com.univocity.parsers.common.TextParsingException: com.univocity.parsers.common.TextParsingException - Unexpected character 'i' following quoted value of CSV field. Expecting ';'. Cannot parse CSV input.
    Internal state when error was thrown: line=0, column=1, record=0, charIndex=27, content parsed=example with 
    Parser Configuration: CsvParserSettings:
    ...
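For completeness, a small sketch of this strict mode, assuming the same ';'-delimited input as in the first example:

    import com.univocity.parsers.common.TextParsingException;
    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;

    CsvParserSettings settings = new CsvParserSettings();
    settings.getFormat().setDelimiter(';');
    settings.setParseUnescapedQuotes(false); // reject unescaped quotes instead of tolerating them

    try {
        new CsvParser(settings).parseLine("example1;\"example with \" inside\";example3");
    } catch (TextParsingException e) {
        System.out.println(e.getMessage()); // prints the error shown above
    }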

Second

    "a"b    ccc ddd

In this case, you have the unescaped quote (before b) but no closing quote anywhere. When parseUnescapedQuotes=true the parser will keep going until it finds a closing quote or the end of the input, so there's no reasonable way to determine where the value ends. We are adding a parseUnescapedQuotesUntilDelimiter setting that, when enabled, makes the parser stop reading the value as soon as a delimiter or line ending is found after the unescaped quote. That should produce the following:

    "a"b
    ccc
    ddd

This might help, but as you are parsing an input that uses tab as delimiter, you could probably use our TSV parser instead, as your input doesn't conform to CSV anyway.
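If you go that route, a minimal sketch of the TSV alternative could look like this (assuming the TsvParser and TsvParserSettings classes from the same library and the tab-delimited input above):

    import com.univocity.parsers.tsv.TsvParser;
    import com.univocity.parsers.tsv.TsvParserSettings;

    TsvParserSettings settings = new TsvParserSettings();

    // TSV has no quoting rules, so the '"' characters are kept as plain text
    String[] values = new TsvParser(settings).parseLine("\"a\"b\tccc\tddd");
    // expected values: ["a"b], [ccc], [ddd]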

I'll keep this issue open because the current version of the parser is not producing the expected result if you set parseUnescapedQuotesUntilDelimiter=true. It will be fixed in less than 24 hours.

@jbax jbax reopened this Apr 3, 2016
@jbax jbax added the bug label Apr 3, 2016
@jbax jbax added this to the 2.0.2 milestone Apr 3, 2016
@jbax jbax self-assigned this Apr 3, 2016
@HyukjinKwon

@jbax Thanks so much for your help. Let me cc @srowen just to inform him.

jbax added a commit that referenced this issue Apr 4, 2016
@jbax
Member

jbax commented Apr 4, 2016

Done. I've released a 2.0.2-SNAPSHOT that includes the fix for this. Now when parseUnescapedQuotesUntilDelimiter=true you will get the expected result.

I'll release version 2.0.2 once you confirm this is working as expected.

@jbax jbax closed this as completed Apr 4, 2016
@HyukjinKwon

@jbax Thanks. I just tested 2.0.2-SNAPSHOT with Spark and I can confirm it works fine with the input and additional setting below, producing the following output:

  • Input:
    "a"b    ccc ddd
  • Additional setting:
    setParseUnescapedQuotesUntilDelimiter(true)
  • Output:
    ["a"b], [ccc], [ddd]

@jbax
Member

jbax commented Apr 4, 2016

Thank you! Version 2.0.2 released.

@jbax
Member

jbax commented Apr 5, 2016

@HyukjinKwon For reference, here is the test code. I'm parsing Strings individually (I saw your comment on SPARK-14103):

    CsvParserSettings settings = new CsvParserSettings();
    settings.setParseUnescapedQuotesUntilDelimiter(true);
    settings.getFormat().setDelimiter('\t');

    String[] values = new CsvParser(settings).parseLine("\"a\"b\tccc\tddd");
    assertEquals(values.length, 3);
    assertEquals(values[0], "\"a\"b");
    assertEquals(values[1], "ccc");
    assertEquals(values[2], "ddd");

Version 2 comes with lots of API improvements; check the examples here.

@HyukjinKwon

@jbax Cool. Thanks!
I also just found that API and realised that Spark uses a Reader only for better performance. I think I might have to run a performance test for both cases, using a Reader and using a String.
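For illustration, a rough sketch of the two approaches being compared, assuming the Reader-based beginParsing/parseNext API and the String-based parseLine method:

    import java.io.Reader;
    import java.io.StringReader;
    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;

    CsvParserSettings settings = new CsvParserSettings();
    settings.setParseUnescapedQuotesUntilDelimiter(true);
    settings.getFormat().setDelimiter('\t');
    CsvParser parser = new CsvParser(settings);

    // (a) stream the whole input through a Reader
    Reader input = new StringReader("\"a\"b\tccc\tddd\neee\tfff\tggg"); // second row is filler for this sketch
    parser.beginParsing(input);
    String[] row;
    while ((row = parser.parseNext()) != null) {
        // process each row as it is read from the Reader
    }

    // (b) parse pre-split lines one String at a time
    String[] values = parser.parseLine("\"a\"b\tccc\tddd");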

ghost pushed a commit to dbtsai/spark that referenced this issue Apr 8, 2016
## What changes were proposed in this pull request?

This PR resolves a problem with parsing unescaped quotes in input data. For example, currently the data below:

```
"a"b,ccc,ddd
e,f,g
```

produces the data below:

- **Before**

```bash
["a"b,ccc,ddd[\n]e,f,g]  <- as a value.
```

- **After**

```bash
["a"b], [ccc], [ddd]
[e], [f], [g]
```

This PR bumps up the Univocity parser's version. This was fixed in `2.0.2`, uniVocity/univocity-parsers#60.

## How was this patch tested?

Unit tests in `CSVSuite` and `sbt/sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#12226 from HyukjinKwon/SPARK-14103-quote.
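To tie this back to the parser setting discussed above, a small sketch that reproduces the "After" behaviour at the univocity level, assuming the parseAll(Reader) method and the default ',' delimiter:

    import java.io.StringReader;
    import java.util.List;
    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;

    CsvParserSettings settings = new CsvParserSettings();
    settings.setParseUnescapedQuotesUntilDelimiter(true); // the fix discussed in this issue
    settings.setLineSeparatorDetectionEnabled(true);      // the sample input uses '\n' line endings

    List<String[]> rows = new CsvParser(settings).parseAll(new StringReader("\"a\"b,ccc,ddd\ne,f,g"));
    // expected rows: ["a"b, ccc, ddd] and [e, f, g] -- the "After" output above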