Parsing of quoted text with quotes inside #60
@larsniemann1983 Were you able to parse that data? |
@jbax I searched the issues in this library, but apart from this one I couldn't find the case Spark is hitting right now. The problem is that it can't correctly parse the data below:
This produces the following output:
I think it would make sense for it to parse the data as below:
In this way, I think it should also be able to parse this with the setting below:
Currently, with the data and the setting above, it throws the exception below:
Error processing input: Length of parsed input (5) exceeds the maximum number of characters defined in your parser settings (4).
Hint: Number of characters processed may have exceeded limit of 4 characters per column. Use settings.setMaxCharsPerColumn(int) to define the maximum number of characters a column can have
Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
Parser Configuration: CsvParserSettings:
Column reordering enabled=true
Empty value=null
Header extraction enabled=false
Headers=null
Ignore leading whitespaces=false
Ignore trailing whitespaces=false
Input buffer size=128
Input reading on separate thread=false
Line separator detection enabled=false
Maximum number of characters per column=4
Maximum number of columns=20480
Null value=
Number of records to read=all
Parse unescaped quotes=true
Row processor=none
Selected fields=none
Skip empty lines=true
Format configuration:
CsvFormat:
Comment character=\0
Field delimiter=\t
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character=quote escape
Quote escape escape character=\0, line=0, char=6. Content parsed: [year] |
@jbax Spark is currently using univocity version 1.5.6. Is this fixed in later versions? If not, would there be an easy way to fix it? Or is this the intended behavior? I would be grateful if you could help me solve this issue. |
@larsniemann1983 Would you maybe reopen this if you were not able to solve this problem? |
Hi, the example you provided contains an unescaped quote (which is invalid CSV). uniVocity parsers' CSV parser is the only parser I know that is able to handle this sort of thing. There are a few settings that allow you to handle these cases. Let's see your examples one by one: First
The quote after the word
The result will be:
The parser will handle the unescaped quote and keep going until it finds a closing quote (i.e. a quote followed by a delimiter). Now, if you want to force the CSV to be well formed and get an exception to ensure quotes are correctly escaped, set
Second
In this case, you have the unescaped quote (before b) but no closing quote anywhere. When
This might help, but as you are parsing an input that uses tab as delimiter, you could probably use our TSV parser instead, as your input doesn't conform to CSV anyway. I'll keep this issue open because the current version of the parser is not producing the expected result if you set |
Done. I've released a 2.0.2-SNAPSHOT that includes the fix for this. I'll release version 2.0.2 once you confirm this is working as expected. |
@jbax Thanks. I just tested
setParseUnescapedQuotesUntilDelimiter(true)
|
Thank you! Version 2.0.2 released. |
@HyukjinKwon For reference, here is the test code. I'm parsing Strings individually (saw your comment on SPARK-14103):
Version 2 comes with lots of API improvements, check the examples here. |
@jbax Cool. Thanks! |
## What changes were proposed in this pull request?

This PR resolves the problem during parsing unescaped quotes in input data. For example, currently the data below:

```
"a"b,ccc,ddd
e,f,g
```

produces a data below:

- **Before**

```bash
["a"b,ccc,ddd[\n]e,f,g]  <- as a value.
```

- **After**

```bash
["a"b], [ccc], [ddd]
[e], [f], [g]
```

This PR bumps up the Univocity parser's version. This was fixed in `2.0.2`, uniVocity/univocity-parsers#60.

## How was this patch tested?

Unit tests in `CSVSuite` and `sbt/sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#12226 from HyukjinKwon/SPARK-14103-quote.
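For anyone following along, the Before/After difference described in the PR can be approximated with a small stand-alone sketch. This is not univocity's implementation and does not call its API; the class `UntilDelimiterSketch` and its `parse` method are hypothetical names made up for this illustration. It assumes that when a quoted field turns out to contain an unescaped quote, the field is re-read as raw text up to the next delimiter or line break, which is roughly what the name `setParseUnescapedQuotesUntilDelimiter(true)` in this thread suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of univocity-parsers.
class UntilDelimiterSketch {

    // Parses the whole input into rows of fields.
    // Note: a trailing line break is not handled specially and yields one extra row
    // holding a single empty field.
    static List<List<String>> parse(String input, char delimiter) {
        List<List<String>> rows = new ArrayList<>();
        List<String> row = new ArrayList<>();
        int n = input.length();
        int i = 0;
        while (i <= n) {
            int start = i;
            StringBuilder field = new StringBuilder();
            boolean wellFormed = false;
            if (i < n && input.charAt(i) == '"') {
                i++; // skip the opening quote
                while (i < n) {
                    char c = input.charAt(i);
                    if (c != '"') {
                        field.append(c);
                        i++;
                        continue;
                    }
                    if (i + 1 < n && input.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote: one literal quote
                        i += 2;
                        continue;
                    }
                    // A lone quote closes the field only when a delimiter,
                    // a line break or the end of the input follows it.
                    wellFormed = i + 1 == n
                            || input.charAt(i + 1) == delimiter
                            || input.charAt(i + 1) == '\n';
                    i++;
                    break;
                }
            }
            if (!wellFormed) {
                // Unquoted field, or an unescaped quote was found: re-read from the
                // start of the field as raw text, up to the next delimiter or line break.
                i = start;
                field.setLength(0);
                while (i < n && input.charAt(i) != delimiter && input.charAt(i) != '\n') {
                    field.append(input.charAt(i));
                    i++;
                }
            }
            row.add(field.toString());
            if (i >= n || input.charAt(i) == '\n') {
                rows.add(row); // end of line or end of input closes the row
                row = new ArrayList<>();
            }
            i++; // step over the delimiter or line break
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(parse("\"a\"b,ccc,ddd\ne,f,g", ','));
    }
}
```

With the PR's sample data, the sketch returns two rows, `"a"b`, `ccc`, `ddd` and `e`, `f`, `g`, which matches the After output above. A well-formed quoted field such as `"x,y"` is still read as the single value `x,y`.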
Hi, is it possible to parse this correctly with some settings?
example1;"example with " inside";example3
Or do I have to quote it?
example1;"example with ""inside";example3
thx for your help
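To make the two variants in the question concrete, here is a minimal stand-alone sketch of the rule mentioned in the replies, where a closing quote is a quote followed by a delimiter. It is only an illustration under that assumption: it does not use univocity-parsers, and the class `LenientQuoteSketch` and its `parseLine` method are hypothetical names, not library API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of univocity-parsers.
class LenientQuoteSketch {

    // Splits one line into fields. A quoted field is closed only by a quote that is
    // immediately followed by the delimiter or by the end of the line. A doubled
    // quote ("") is read as one literal quote. Any other quote inside the field is
    // kept as ordinary text.
    static List<String> parseLine(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        int n = line.length();
        int i = 0;
        while (i <= n) {
            StringBuilder field = new StringBuilder();
            if (i < n && line.charAt(i) == '"') {
                i++; // skip the opening quote
                while (i < n) {
                    char c = line.charAt(i);
                    boolean atEnd = i + 1 == n;
                    if (c == '"' && (atEnd || line.charAt(i + 1) == delimiter)) {
                        i++; // closing quote
                        break;
                    }
                    if (c == '"' && line.charAt(i + 1) == '"') {
                        i++; // escaped quote: skip the first, keep the second
                    }
                    field.append(line.charAt(i));
                    i++;
                }
            } else {
                while (i < n && line.charAt(i) != delimiter) {
                    field.append(line.charAt(i));
                    i++;
                }
            }
            fields.add(field.toString());
            i++; // step over the delimiter, or past the end, which stops the loop
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("example1;\"example with \" inside\";example3", ';'));
        System.out.println(parseLine("example1;\"example with \"\"inside\";example3", ';'));
    }
}
```

Under this rule the first line gives `example1`, `example with " inside`, `example3` without any extra escaping, and the properly escaped second line gives `example1`, `example with "inside`, `example3`. The trade-off is that a literal quote directly followed by the delimiter inside a value would be read as the end of the field, so escaping remains the safer option.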