Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1052 Simplify the results of parse-csv #1066

Merged
merged 6 commits into from
Mar 19, 2024

Conversation

michaelhkay
Copy link
Contributor

Changes parse-csv to deliver the results in a simpler format:

(a) the result structure is less deeply nested: one record with four entries
(b) the actual data is delivered as a sequence of arrays of strings, closely aligned with the result of csv-to-arrays

The rules in the spec have also been rearranged to reflect this, so the rules are now organised according to the values delivered for each of these four fields.

The examples in the spec are changed to reflect the new output format; in addition they have been editorially reorganized so each example is more self-contained, avoiding the need for extensive scrolling to find the values of variables referenced in each example.

Fix issue #1052

Copy link
Contributor

@ChristianGruen ChristianGruen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I have various comments…

  • I liked the record names proposed in parse-csv() - simplify output #1052 (get, columns, column-index),
  • The type of column-index should get map(xs:string, xs:integer*),
  • We should get rid of the newline (by dropping row-separator or reducing CRLF/CR/LF
  • to LF),

…but it may be better to create a separate issue for it.

@michaelhkay
Copy link
Contributor Author

michaelhkay commented Mar 7, 2024

Revised to address the comments. Thanks.

I have also addressed issue #1044 on row delimiter. This is now simplified to be a single character defaulting to newline: line ending normalization must be done first. Note this means that other line endings cannot be retained if they appear within quoted data; this seems a price worth paying to lose some complexity.

Fix issue #1044.

@michaelhkay
Copy link
Contributor Author

Note, I renamed the option "column-names" to "header" as this seems more appropriate for the simple yes/no case, and avoids confusion with the names of entries in the result map.

I'm going to propose a couple more changes.

  1. I propose that header should be a list of strings rather than a map from strings to integers. For the common case, it's much simpler to say header=("name", "date", "amount") than header=map{"name":1, "date":2, "amount":3}. Empty strings can still be inserted in the sequence to represent unnamed columns, so there is no loss of functionality. The rules for handling duplicate names now align with the rules for duplicates in an actual header row.

  2. I propose merging the two options filter-columns and number-of-columns into a single option, select-columns. The semantics are the same as filter-columns; the effect of number-of-columns: 6 can be achieved by writing select-columns: (1 to 6).

  3. I propose handling the newline problem as follows: (a) row-delimiter must be a single character, defaulting to U+0010. (b) we add an option normalize-newline=true|false, defaulting to false. If set to true, CR and CRLF sequences are replaced by NL unless they appear within a quoted string, this is done prior to testing whether a character is a row delimiter.

@ChristianGruen
Copy link
Contributor

ChristianGruen commented Mar 8, 2024

Note, I renamed the option "column-names" to "header" as this seems more appropriate for the simple yes/no case, and avoids confusion with the names of entries in the result map.

I like it.

  1. I propose that header should be a list of strings rather than a map from strings to integers.

I wonder whether we should add this in the parse-csv function at all, and reduce the option to a boolean value. Wouldn’t it be straightforward to add a custom header, if required, after the parsing with map:put?

  1. I propose merging the two options filter-columns and number-of-columns into a single option, select-columns. The semantics are the same as filter-columns; the effect of number-of-columns: 6 can be achieved by writing select-columns: (1 to 6).

+1

  1. I propose handling the newline problem as follows: (a) row-delimiter must be a single character, defaulting to U+0010. (b) we add an option normalize-newline=true|false, defaulting to false. If set to true, CR and CRLF sequences are replaced by NL unless they appear within a quoted string, this is done prior to testing whether a character is a row delimiter.

I would really be in favor of dropping this option completely. Apache’s CSVFormat has a “record separator”, but it’s restricted to \r\n, \r and \n. Python’s corresponding option is only considered when serializing data. fn:unparsed-text-lines doesn’t have an option either what a “line delimiter” is. In practice, we’ve never encountered other delimiters than newline strings. XML has never cared about the difference between CR, CRLF and NL (it’s getting more and more irrelevant anyway), so I doubt we’ll disappointent users by keeping it simple.

@michaelhkay michaelhkay added the Tests Added Tests have been added to the test suites label Mar 14, 2024
@michaelhkay michaelhkay merged commit 5997e30 into qt4cg:master Mar 19, 2024
1 check passed
@michaelhkay michaelhkay added XQFO An issue related to Functions and Operators Feature A change that introduces a new feature Completed PR has been applied, tests written and tagged, no further action needed labels Mar 19, 2024
@ChristianGruen
Copy link
Contributor

@fidothe My apologies for being more insistent in the meeting than I had planned. I fully agree that users must be able to process CSV input, no matter which newlines are used; I just think we should take charge of the normalization automatically. \r is already dropped when parsing XML, when calling fn:unparsed-text and when parsing quoted strings, and it should be safe and easy to perform the same normalization when parsing CSV. For controlling newlines in the output, a serialization parameter is still discussed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Completed PR has been applied, tests written and tagged, no further action needed Feature A change that introduces a new feature Tests Added Tests have been added to the test suites XQFO An issue related to Functions and Operators
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants