New function: parse-csv() #413
Additional options that I regard as relevant, if not essential: …
We also need to clarify what happens if multiple columns have the same header name. We could reuse the …
Duplicate column names: I was going to suggest that the duplicates get suffixed to make them unique, e.g. name-1, name-2.
A wonderful idea! Thank you!
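A minimal sketch of that suffixing scheme (purely illustrative, not part of any proposal):

```xquery
(: Suffix duplicate header names with their occurrence number :)
let $names := ('name', 'age', 'name')
for $name at $i in $names
let $positions := index-of($names, $name)
return
  if (count($positions) > 1)
  then $name || '-' || index-of($positions, $i)
  else $name
(: yields 'name-1', 'age', 'name-2' :)
```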
My favorite separators are US (U+001F) …
But what if the file already has those names as headers?
I guess that raises the question of whether it is still appropriate to restrict the character set of …
For allocation of namespace prefixes, we simply say "choose something that doesn't conflict with what's already there". We can either do that, or we can spell out an algorithm. Neither approach is difficult.
Sounds good. I've opened a new issue for that (#414).
The BaseX XML representation was designed to support roundtripping. We are generating a map for CSV input, with one (optional) map entry containing an array with the header names, and another one containing a sequence of arrays with the records. If we don't plan to serialize CSV data, the format proposed here is certainly more intuitive.
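For illustration, that shape looks roughly like this (invented sample values):

```xquery
map {
  "names"  : [ "name", "city" ],                     (: optional header entry :)
  "records": ( ["Anna", "Oslo"], ["Ben", "Turin"] )  (: one array per record :)
}
```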
I'm wondering whether there is a requirement to retain information about the order of columns? If there is, this could perhaps be achieved with an option for the first map in the returned sequence to map column names to integer positions.
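For example, such a hypothetical leading map would let the original column order be recovered:

```xquery
(: Hypothetical: the first item maps column names to integer positions :)
let $positions := map { 'name': 1, 'city': 2 }
return sort(map:keys($positions), (), function($k) { $positions($k) })
(: yields 'name', 'city', i.e. the original column order :)
```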
If we had a compact notation for getting the key of a value, it may already be sufficient to use the following:

```xquery
(: let $csv := csv:parse('language' || char('\n') || 'fr' || char('\n') || 'ja') :)
let $csv := (
  map { 1: 'language' },
  map { 1: 'fr' },
  map { 1: 'ja' }
)
let $header := head($csv)
(: map:key-where($header, { . = 'language' }) ? :)
let $ln := map:for-each($header, function($k, $v) { if($v = 'language') { $k } })
for $record in tail($csv)
return $record?$ln
```

Another requirement that we just added for sparse tables is to ignore empty fields (via an … option).
I'm starting to take a serious look at this (I'm the new guy at Saxonica) from an implementor's perspective, but I've also done a lot of projects making heavy use of CSV. From that point of view, some parsing options I've found I leaned on a lot in Ruby's CSV library (where I've spent most of my time): …

The other thing that immediately occurs to me is the problem of missing and extra columns, where a CSV declares a number of headers in the first record, but not all the rows in the file have the same number of columns. When there are more, your record …

Ruby's CSV library documentation is nice and thorough: https://rubydoc.info/gems/csv
👋 Hi Matt! Good to hear that you are now with Saxonica. (And sorry to all for this OT comment.)
I would very much prefer …
Thinking again about the problem of varying row lengths, especially when a header row (used to define keys for the maps returned) has fewer columns than some of the data rows: in that case, if your data row maps just use those keys, are the 'excess' columns discarded? (You might expect that all rows are the same size...) If they aren't discarded, then I guess they just get integers-as-keys. Other language implementations of this often use a Row object that acts as both a Map and an Array, which we can't do, and in those cases where you're dealing with a variable number of 'anonymous' columns, …
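A sketch of that integers-as-keys fallback (one possible policy, not agreed behaviour):

```xquery
(: Cells beyond the header width get their integer position as key :)
let $headers := ('name', 'city')
let $cells   := ('Anna', 'Oslo', 'x1', 'x2')
return map:merge(
  for $cell at $i in $cells
  return map { (if ($i le count($headers)) then $headers[$i] else $i): $cell }
)
(: map { 'name': 'Anna', 'city': 'Oslo', 3: 'x1', 4: 'x2' } :)
```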
I wonder if the parse-csv function should be as primitive as possible and just return an array of rows, each row an array of columns. It could take an optional mapping function to transform that into a map. We could provide standard mapping functions for the obvious cases: unique column headers, no extra columns, etc., but let users provide their own functions for more complicated cases.
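A hedged sketch of that layering (names and shapes invented; the literal array of arrays stands in for the primitive parser's output):

```xquery
(: A standard mapping function for the simple case: unique headers, no extras :)
let $rows := [ ['name', 'city'], ['Anna', 'Oslo'], ['Ben', 'Turin'] ]
let $row-to-map := function($headers as array(*), $row as array(*)) as map(*) {
  map:merge(
    for $i in 1 to array:size($headers)
    return map { array:get($headers, $i): array:get($row, $i) }
  )
}
for $i in 2 to array:size($rows)
return $row-to-map(array:get($rows, 1), array:get($rows, $i))
(: map { 'name': 'Anna', 'city': 'Oslo' }, map { 'name': 'Ben', 'city': 'Turin' } :)
```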
The XDM conversion we use (see #413 (comment)) was mainly designed to support bidirectional conversions. It’s not as convenient as other representations may be, but it can pretty well handle irregular CSV inputs (which are more or less the standard case in practice). The following CSV input…

```
A,A,,B
a1
a1,a2,,b,c
```

…is represented with:

```xquery
map {
  "names": [ "A", "A", "", "B" ],
  "records": ( ["a1"], ["a1", "a2", "", "b", "c"] )
}
```

The values of a specific column can be accessed as follows:

```xquery
for $A-index in array:index-of($csv?names, 'A') (: …actually »indexes of«, in this example :)
for $record in $csv?records
return array:get($record, $A-index, ())
```

If we want to circumvent the error handling caused by the strict array bounds, we could also represent the headers and single records as sequences and return the records as an array:

```xquery
let $csv := map {
  "names": ( "A", "A", "", "B" ),
  "records": [ ("a1"), ("a1", "a2", "", "b", "c") ]
}
for $A-index in index-of($csv?names, 'A')
for $record in array:members($csv?records)
return $record?value[$A-index]
(: or :)
for member $record in $csv?records
return $record[$A-index]
```

Most people still prefer our XML representations for CSV data.
There's an expectation, I think, of some special behaviour around the header row itself being part of the most-primitive functionality, but I like the idea of it being as dumb as possible at core. Without referring to the BaseX documentation (I'll have a proper look later) I'd probably lean toward something like this pseudocode:

```xquery
let $result := fn:parse-csv($my-csv-string, map { "headers": true() })
let $headers := map:get($result, 'headers') (: ['name', 'address'] :)
let $data := map:get($result, 'data') (: [ ['Christian', 'Fakestr. 1'], ['Matt', 'Fakestr. 2'] ] :)
```

with header parsing defaulting to off:

```xquery
let $result := fn:parse-csv($my-csv-string)
let $headers := map:get($result, 'headers') (: [] :)
let $data := map:get($result, 'data') (: [ ['name', 'address'], ['Christian', 'Fakestr. 1'], ['Matt', 'Fakestr. 2'] ] :)
```

WDYT?
Oops, sorry Christian, I didn't see your comment before I posted.
```xquery
let $csv := fn:parse-csv($my-csv-string, map { "headers": true() })
let $csv-getter := csv:make-getter($csv?headers) (: function(row as array(), header as item()) :)
$csv-getter($csv?data[1], 'name') (: 'Matt' :)
```

As per @ndw's suggestion, providing a standard function to produce a …
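One way a csv:make-getter helper could be sketched (the name comes from the snippet above; everything else is an assumption):

```xquery
declare function local:make-getter($headers as array(xs:string))
    as function(array(*), xs:string) as item()* {
  function($row as array(*), $header as xs:string) {
    (: return every cell whose header matches, tolerating short rows :)
    for $i in index-of(array:flatten($headers), $header)
    where $i le array:size($row)
    return array:get($row, $i)
  }
};
```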
No problem; hi Matt! The function for retrieving map entries could also be embedded in the map:

```xquery
(: without headers :)
let $csv := map {
  "records": ( ["a1"], ["a1", "a2", "", "b", "c"] ),
  "value"  : function($row, $column) { (: addressed with indexes :) }
}
return $csv?value(2, 1) (: returns "a1" :)

(: with headers :)
let $csv := map {
  "names"  : [ "A", "A", "", "B" ],
  "records": ( ["a1"], ["a1", "a2", "", "b", "c"] ),
  "value"  : function($row, $name) { (: addressed with row index and column name :) }
}
return $csv?value(2, "A") (: returns "a1" and "a2" :)
```
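For what it's worth, one way the value entry could be filled in (an assumption, not part of the proposal):

```xquery
let $names   := [ "A", "A", "", "B" ]
let $records := ( ["a1"], ["a1", "a2", "", "b", "c"] )
let $value   := function($row as xs:integer, $name as xs:string) {
  let $record := $records[$row]
  for $i in index-of(array:flatten($names), $name)
  where $i le array:size($record)
  return array:get($record, $i)
}
return $value(2, "A") (: returns "a1" and "a2" :)
```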
How about returning a map with the following fields: …

Simple use case: where you know there are headers and you know what they are, you can do …

For more complex use cases, if you need to know the order of the columns, you can look in …
I still have some concerns with the idea of automagically renaming headers. I believe it’s never a good design choice to change data to make it easier to process. Duplicate header names aren’t that rare, and if you have to work with CSV data that you can’t tailor yourself, the representation with renamed headers is not that convincing, and it is difficult to convert back. But I agree, it would be much easier to have maps for regular/consistent data.
I'm with @ChristianGruen on automagically renaming headers, at least as part of the primitive parse function: it feels like we're doing more, and possibly different, work than is needed in all cases.

Thinking about sequences vs. arrays for the headers row and body rows, it occurs to me that we can't use a sequence because we need to be able to have gaps, and if we follow the spec (at least, the note in https://www.w3.org/TR/xpath-31/#id-array-constructors) then we shouldn't be using arrays either, but integer-keyed maps, because we are expecting sparse arrays as rows. (Unless we're proposing that missing cells be the empty string in all cases.)

```
fn:parse-csv(...) #=> { "headers": array(xs:string), body: array(map(*)) }
```
On further reflection, we should represent empty cells with the empty string, so:

```
fn:parse-csv(...) #=> { "headers": array(xs:string), body: array(array(*)) }
```
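Illustrative use of that shape, borrowing the 4.0 for member syntax from earlier in the thread:

```xquery
let $result := map {
  "headers": [ "name", "city" ],
  "body"   : [ ["Anna", "Oslo"], ["Ben", ""] ]  (: empty string for a missing cell :)
}
for member $row in $result?body
return array:get($row, 2)
(: "Oslo", "" :)
```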
On further further reflection I realise that I skipped over the body being returned as a …
I think we should focus on the usability of the data structure that's returned. Let's look at a few use cases, like turning the CSV data into an HTML table with table headers, or into an XML structure where each row becomes an element and the cells become attributes named after the column header, or into a JSON structure where each row becomes an object with the cells mapping to properties of the object.

Another thing we haven't looked at much is data typing. It would be convenient sometimes to return fields as numbers or dates, without having to do a wholesale transformation after the parsing.
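To make the first use case concrete, a rough sketch assuming the names/records map shape discussed above:

```xquery
let $csv := map {
  "names"  : [ "name", "city" ],
  "records": ( ["Anna", "Oslo"], ["Ben", "Turin"] )
}
return
  <table>
    <tr>{
      for $i in 1 to array:size($csv?names)
      return <th>{ array:get($csv?names, $i) }</th>
    }</tr>{
      for $record in $csv?records
      return
        <tr>{
          for $i in 1 to array:size($record)
          return <td>{ array:get($record, $i) }</td>
        }</tr>
    }</table>
```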
My experience with building things that process CSV has been that whatever my end goal is, the actual CSV parsing part of that is the smallest bit. Reliably getting the bytes off the disk and into arrays/hashes is very important, but the structure of CSV data varies so wildly that the bulk of whatever I built was always taking the sea of arrays and making it fit whatever model I needed, so that I could then build my HTML table or JSON structure. In terms of usability, what's been important for me is ideally being able to extract headers so I can use names to address the data, but usually what I end up doing is assigning headers to the columns, imposing those on the data at parse time, and then using my names to address the data. I don't want much more than that (apart from handling separators and quoting) from the parser. If it did much more, most of the time it would just be in my way. If everyone's happy, I'll find a reasonable variety of samples from my archive of projects that represent different kinds of CSVs I've encountered, and construct use cases based on what I needed to do with them.
It would. Ruby's CSV library allows you to register callback functions for each column so that you can do exactly that, which is a nice way of handling it without burdening the parser with too much logic. In practice I've very rarely found myself using it, and instead have tended to work with a row as it is spat out of the iterator. I think I'd like to try constructing use cases and having a crack at a simpler implementation, and see how dealing with those kinds of issues feels before throwing in more machinery.
Thanks, that's always interesting. Maybe we should also decide if we want to facilitate or rule out roundtripping in the future. What do others think?
Roundtripping: we should retain sufficient information to reconstruct an equivalent CSV, but not necessarily an identical CSV; for example, there is no need to remember which fields originally appeared in quotation marks.
I put some examples together, unfortunately still unfinished (schedule clash), but hopefully enough to talk about, in this gist: https://gist.github.com/fidothe/5442c1e8770e8275334bebedbdafce28. They're all from real projects.
The CG agreed to close this issue without further action at meeting 068.
I propose a new function parse-csv() that accepts a CSV string (such as might be read from a CSV file using unparsed-text()). CSV is as defined in RFC 4180; implementations may be liberal in what they accept, and may define additional options.
An options parameter includes the option header=true|false to indicate whether the first line should be taken as containing column headings.
The result of the function is a sequence of maps, one map per row of the CSV file (excluding the header). Each map contains one entry per column, the key being taken from the column header if present, or an integer 1...n if not.
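For illustration, with these semantics a call might behave roughly as follows (the option spelling is taken from the paragraph above; the char('\n') helper is borrowed from an earlier comment in the thread):

```xquery
let $input := 'name,city' || char('\n') || 'Anna,Oslo' || char('\n') || 'Ben,Turin'
return parse-csv($input, map { 'header': true() })
(: expected result, per the proposal:
   map { 'name': 'Anna', 'city': 'Oslo' },
   map { 'name': 'Ben',  'city': 'Turin' } :)
```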