1052 Simplify the results of parse-csv #1066

michaelhkay · 2024-03-06T22:03:08Z

Changes parse-csv to deliver the results in a simpler format:

(a) the result structure is less deeply nested: one record with four entries
(b) the actual data is delivered as a sequence of arrays of strings, closely aligned with the result of csv-to-arrays

The rules in the spec have also been rearranged to reflect this, so the rules are now organised according to the values delivered for each of these four fields.

The examples in the spec are changed to reflect the new output format; in addition they have been editorially reorganized so each example is more self-contained, avoiding the need for extensive scrolling to find the values of variables referenced in each example.

Fix issue #1052

ChristianGruen

👍 I have various comments…

I liked the record names proposed in parse-csv() - simplify output #1052 (get, columns, column-index),
The type of column-index should get map(xs:string, xs:integer*),
We should get rid of the newline (by dropping row-separator or reducing CRLF/CR/LF
to LF),

…but it may be better to create a separate issue for it.

michaelhkay · 2024-03-07T16:43:58Z

Revised to address the comments. Thanks.

I have also addressed issue #1044 on row delimiter. This is now simplified to be a single character defaulting to newline: line ending normalization must be done first. Note this means that other line endings cannot be retained if they appear within quoted data; this seems a price worth paying to lose some complexity.

Fix issue #1044.

michaelhkay · 2024-03-08T09:24:30Z

Note, I renamed the option "column-names" to "header" as this seems more appropriate for the simple yes/no case, and avoids confusion with the names of entries in the result map.

I'm going to propose a couple more changes.

I propose that header should be a list of strings rather than a map from strings to integers. For the common case, it's much simpler to say header=("name", "date", "amount") than header=map{"name":1, "date":2, "amount":3}. Empty strings can still be inserted in the sequence to represent unnamed columns, so there is no loss of functionality. The rules for handling duplicate names now align with the rules for duplicates in an actual header row.
I propose merging the two options filter-columns and number-of-columns into a single option, select-columns. The semantics are the same as filter-columns; the effect of number-of-columns: 6 can be achieved by writing select-columns: (1 to 6).
I propose handling the newline problem as follows: (a) row-delimiter must be a single character, defaulting to U+0010. (b) we add an option normalize-newline=true|false, defaulting to false. If set to true, CR and CRLF sequences are replaced by NL unless they appear within a quoted string, this is done prior to testing whether a character is a row delimiter.

ChristianGruen · 2024-03-08T09:58:46Z

Note, I renamed the option "column-names" to "header" as this seems more appropriate for the simple yes/no case, and avoids confusion with the names of entries in the result map.

I like it.

I propose that header should be a list of strings rather than a map from strings to integers.

I wonder whether we should add this in the parse-csv function at all, and reduce the option to a boolean value. Wouldn’t it be straightforward to add a custom header, if required, after the parsing with map:put?

I propose merging the two options filter-columns and number-of-columns into a single option, select-columns. The semantics are the same as filter-columns; the effect of number-of-columns: 6 can be achieved by writing select-columns: (1 to 6).

+1

I propose handling the newline problem as follows: (a) row-delimiter must be a single character, defaulting to U+0010. (b) we add an option normalize-newline=true|false, defaulting to false. If set to true, CR and CRLF sequences are replaced by NL unless they appear within a quoted string, this is done prior to testing whether a character is a row delimiter.

I would really be in favor of dropping this option completely. Apache’s CSVFormat has a “record separator”, but it’s restricted to \r\n, \r and \n. Python’s corresponding option is only considered when serializing data. fn:unparsed-text-lines doesn’t have an option either what a “line delimiter” is. In practice, we’ve never encountered other delimiters than newline strings. XML has never cared about the difference between CR, CRLF and NL (it’s getting more and more irrelevant anyway), so I doubt we’ll disappointent users by keeping it simple.

ChristianGruen · 2024-03-19T17:54:17Z

@fidothe My apologies for being more insistent in the meeting than I had planned. I fully agree that users must be able to process CSV input, no matter which newlines are used; I just think we should take charge of the normalization automatically. \r is already dropped when parsing XML, when calling fn:unparsed-text and when parsing quoted strings, and it should be safe and easy to perform the same normalization when parsing CSV. For controlling newlines in the output, a serialization parameter is still discussed.

ChristianGruen approved these changes Mar 7, 2024

View reviewed changes

michaelhkay added 5 commits March 12, 2024 21:30

Simplify the results of parse-csv

8581ad0

Revised to address comments plus issue qt4cg#1044

8881f2e

Further revisions to refactor some of the supplied options to parse-csv

45e978f

Minor corrections

c445e6a

correction to schema for CSV

8e8b6ce

michaelhkay force-pushed the 1052-simplify-parse-csv branch from e855d2f to 8e8b6ce Compare March 12, 2024 21:32

michaelhkay added the Tests Added Tests have been added to the test suites label Mar 14, 2024

Simplify the $display() function for showing example results

2987d2c

michaelhkay merged commit 5997e30 into qt4cg:master Mar 19, 2024
1 check passed

michaelhkay added XQFO An issue related to Functions and Operators Feature A change that introduces a new feature Completed PR has been applied, tests written and tagged, no further action needed labels Mar 19, 2024

This was referenced Mar 19, 2024

CSV Parsing - handling line ending normalization #1103

Open

parse-csv() - simplify output #1052

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

1052 Simplify the results of parse-csv #1066

1052 Simplify the results of parse-csv #1066

michaelhkay commented Mar 6, 2024

ChristianGruen left a comment •

edited

Loading

michaelhkay commented Mar 7, 2024 •

edited

Loading

michaelhkay commented Mar 8, 2024

ChristianGruen commented Mar 8, 2024 •

edited

Loading

ChristianGruen commented Mar 19, 2024

1052 Simplify the results of parse-csv #1066

1052 Simplify the results of parse-csv #1066

Conversation

michaelhkay commented Mar 6, 2024

ChristianGruen left a comment • edited Loading

Choose a reason for hiding this comment

michaelhkay commented Mar 7, 2024 • edited Loading

michaelhkay commented Mar 8, 2024

ChristianGruen commented Mar 8, 2024 • edited Loading

ChristianGruen commented Mar 19, 2024

ChristianGruen left a comment •

edited

Loading

michaelhkay commented Mar 7, 2024 •

edited

Loading

ChristianGruen commented Mar 8, 2024 •

edited

Loading