
New function: parse-csv() #413

Closed
michaelhkay opened this issue Mar 30, 2023 · 31 comments
Labels
Feature: A change that introduces a new feature
Propose Closing with No Action: The WG should consider closing this issue with no action
XQFO: An issue related to Functions and Operators

Comments

@michaelhkay
Contributor

I propose a new function parse-csv() that accepts a CSV string (such as might be read from a CSV file using unparsed-text()). CSV is as defined in RFC 4180; implementations may be liberal in what they accept, and may define additional options.

An options parameter includes the option header=true|false to indicate whether the first line should be taken as containing column headings.

The result of the function is a sequence of maps, one map per row of the CSV file (excluding the header). Each map contains one entry per column, the key being taken from the column header if present, or an integer 1...n if not.
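A minimal sketch of how the proposed function might be used. The signature, the `header` option, and the return shape follow the proposal above but are not final, so treat this as illustrative only:

```xquery
let $text := "name,city&#10;Alice,Berlin&#10;Bob,Paris"
let $rows := fn:parse-csv($text, map { "header": true() })
(: $rows would be a sequence of two maps:
   map { "name": "Alice", "city": "Berlin" },
   map { "name": "Bob",   "city": "Paris" } :)
for $row in $rows
return $row?city
```

With `header=false`, the keys would instead be the integers 1 to n, and the first line would appear as an ordinary data row.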

@michaelhkay added the XQFO and Feature labels on Mar 30, 2023
@ChristianGruen
Contributor

Additional options that I regard as relevant, if not essential:

| Option | Description | Default |
| --- | --- | --- |
| separator | Defines the character that separates the values of a single record. | `,` |
| quotes | If the option is enabled, quotes at the start and end of a value will be treated as control characters. Separators and newlines within the quotes will be adopted without change. | `true()` |
| backslashes | If the option is enabled, `\r`, `\n` and `\t` will be replaced with the corresponding control characters. All other escaped characters will be adopted as literals (e.g. `\"`). If the option is disabled, two consecutive quotes will be replaced with a single quote (unless quotes is enabled and the quote is the first or last character of a value). | `false()` |
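Taken together with the original proposal, a call using these options might look like this. None of the option names are final, so this is a hypothetical sketch:

```xquery
let $text := '"last name";"first name"&#10;"Doe";"Jane"'
return fn:parse-csv($text, map {
  "header":    true(),
  "separator": ";",
  "quotes":    true()
})
```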

@ChristianGruen
Contributor

We also need to clarify what happens if multiple columns have the same header name. We could reuse the duplicates option for that purpose (from map:merge).

@michaelhkay
Contributor Author

Duplicate column names: I was going to suggest that the duplicates get suffixed to make them unique, e.g. name-1, name-2.

@joewiz

joewiz commented Mar 30, 2023

A wonderful idea! Thank you!

@benibela

My favorite separators are US (U+001F) in place of , and RS (U+001E) in place of line breaks

Duplicate column names: I was going to suggest that the duplicates get suffixed to make them unique, e.g. name-1, name-2.

But what if the file already has those names as headers?

@michaelhkay
Contributor Author

My favorite separators are US (U+001F) in place of , and RS (U+001E) in place of line breaks

I guess that raises the question of whether it is still appropriate to restrict the character set of xs:string to that of XML 1.0. Are there any benefits in doing so?

Duplicate column names.

For allocation of namespace prefixes, we simply say "choose something that doesn't conflict with what's already there". We can either do that, or we can spell out an algorithm. Neither approach is difficult.
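One possible spelled-out algorithm, sketched as a user-level function. The function name `local:unique-keys` is hypothetical, and the suffixing scheme (suffix every member of a duplicate group) is just one of several reasonable choices:

```xquery
(: Make header names unique by suffixing every member of a duplicate group :)
declare function local:unique-keys($names as xs:string*) as xs:string* {
  for $name at $i in $names
  let $earlier := count($names[position() lt $i][. = $name])
  return
    if ($earlier eq 0 and not($names[position() gt $i] = $name))
    then $name                          (: unique: keep as-is :)
    else $name || '-' || ($earlier + 1) (: duplicate: append running number :)
};

local:unique-keys(('name', 'date', 'name'))
(: returns 'name-1', 'date', 'name-2' :)
```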

@ChristianGruen
Contributor

ChristianGruen commented Mar 31, 2023

I guess that raises the question of whether it is still appropriate to restrict the character set of xs:string to that of XML 1.0. Are there any benefits in doing so?

Sounds good. I've opened a new issue for that #414.

For allocation of namespace prefixes, we simply say "choose something that doesn't conflict with what's already there". We can either do that, or we can spell out an algorithm. Neither approach is difficult.

The BaseX XML representation was designed to support roundtripping. We are generating a map for CSV input, with one (optional) map entry containing an array with the header names, and another one containing a sequence of arrays with the records.

If we don't plan to serialize CSV data, the format proposed here is certainly more intuitive.

@michaelhkay
Contributor Author

I'm wondering whether there is a requirement to retain information about the order of columns? If there is, this could perhaps be achieved with an option for the first map in the returned sequence to map column names to integer positions.

@ChristianGruen
Contributor

ChristianGruen commented May 2, 2023

I'm wondering whether there is a requirement to retain information about the order of columns? If there is, this could perhaps be achieved with an option for the first map in the returned sequence to map column names to integer positions.

If we had a compact notation for getting the key of a value, it may already be sufficient to use header=false:

(: let $csv := csv:parse('language' || char('\n') || 'fr' || char('\n') || 'ja') :)
let $csv := (
  map { 1: 'language' },
  map { 1: 'fr' },
  map { 1: 'ja' }
)
let $header := head($csv)
(: map:key-where($header, { . = 'language' }) ? :)
let $ln := map:for-each($header, function($k, $v) { if($v = 'language') { $k } })
for $record in tail($csv)
return $record?$ln

Another requirement that we just added for sparse tables is to ignore empty fields (via a skip-empty option). It could also be the default.

@fidothe
Contributor

fidothe commented May 9, 2023

I'm starting to take a serious look at this (I'm the new guy at Saxonica) from an implementor's perspective, but I've also done a lot of projects making heavy use of CSV.

From that point of view, some parsing options I've found I leaned on a lot in Ruby's CSV library (where I've spent most of my time):

  • being able to specify row and column separators
  • being able to specify quote character
  • specifying header values independently of the input, by passing a sequence of header names to the parse function.

The other thing that immediately occurs to me is the problem of missing and extra columns, where a CSV declares a number of headers in the first record, but not all the rows in the file have the same number of columns. When there are more, your record map has no keys to associate with those columns; when there are fewer, your record map needs empty entries.

Ruby's CSV library documentation is nice and thorough: https://rubydoc.info/gems/csv

@gimsieke
Contributor

gimsieke commented May 9, 2023

👋 Hi Matt! Good to hear that you are now with Saxonica. (And sorry to all for this OT comment)

@fidothe
Contributor

fidothe commented May 9, 2023

I guess that raises the question of whether it is still appropriate to restrict the character set of xs:string to that of XML 1.0. Are there any benefits in doing so?

Sounds good. I've opened a new issue for that #414.

I would very much prefer fn:parse-csv() deal only with xs:string, and rely on fn:unparsed-text() (or something like it) to deal with the problem of turning a byte stream into text within XDM.

@fidothe
Contributor

fidothe commented May 10, 2023

Thinking again about the problem of varying row lengths, especially when a header row (used to define keys for the maps returned) has fewer columns than some of the data rows. In that case, if your data row maps just use those keys, are the 'excess' columns discarded? (you might expect that all rows are the same size...)

If they aren't discarded, then I guess they just get integers-as-keys.

Other language implementations of this often use a Row object that acts as both a Map and an Array, which we can't, and in those cases where you're dealing with a variable number of 'anonymous' columns, array:subarray() access is almost certainly more convenient. Would it be simpler to return each row as an array of 2-item arrays, and also return a map of headers to position?

@ndw
Contributor

ndw commented May 10, 2023

I wonder if the parse-csv function should be as primitive as possible and just return an array of rows, each row an array of columns. It could take an optional mapping function to transform that into a map. We could provide standard mapping functions for the obvious cases: unique column heads, no extra columns, etc. but let users provide their own functions for more complicated cases.
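A sketch of that idea, with a hand-written literal standing in for the primitive parser's result and a user-supplied mapping function (all names here are hypothetical):

```xquery
(: what a primitive parse might return: an array of row arrays :)
let $rows := [ ["code", "status"], ["A1", "ok"], ["B2", "failed"] ]
let $headers := array:head($rows)
(: a user-supplied mapping function turning one row array into a map :)
let $to-map := function($row as array(*)) as map(*) {
  map:merge(
    for $i in 1 to array:size($row)
    return map { array:get($headers, $i): array:get($row, $i) }
  )
}
for $row in array:tail($rows)?*
return $to-map($row)
(: map { "code": "A1", "status": "ok" },
   map { "code": "B2", "status": "failed" } :)
```

The standard mapping functions mentioned above would then simply be predefined values for `$to-map`.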

@ChristianGruen
Contributor

ChristianGruen commented May 10, 2023

The XDM conversion we use (see #413 (comment)) was mainly designed to support bidirectional conversions. It’s not as convenient as other representations may be, but it can pretty well handle irregular CSV inputs (which are more or less the standard case in practice). The following CSV input…

A,A,,B
a1
a1,a2,,b,c

…with "header": true() will be converted to:

map {
  "names": [ "A", "A", "", "B" ],
  "records": ( ["a1"], ["a1", "a2", "", "b", "c"] )
}

The values of a specific column can be accessed as follows:

for $A-index in array:index-of($csv?names, 'A')  (: …actually »indexes of«, in this example :)
for $record in $csv?records
return array:get($record, $A-index, ())

If we want to circumvent the error handling caused by the strict array bounds, we could also represent the headers and single records as sequences and return the records as an array:

let $csv := map {
  "names": ( "A", "A", "", "B" ),
  "records": [ ("a1"), ("a1", "a2", "", "b", "c") ]
}
for $A-index in index-of($csv?names, 'A')
for $record in array:members($csv?records)
return $record?value[$A-index]

(: or :)
for member $record in $csv?records
return $record[$A-index]

Most people still prefer our XML representations for CSV data.

@fidothe
Contributor

fidothe commented May 10, 2023

There's an expectation, I think, of some special behaviour around the header row itself being part of the most-primitive functionality, but I like the idea of it being as dumb as possible at core.

Without referring to the BaseX documentation (I'll have a proper look later) I'd probably lean toward something like this pseudocode:

let $result := fn:parse-csv($my-csv-string, map { "headers": true() })
let $headers := map:get($result, 'headers') (: ['name', 'address'] :)
let $data := map:get($result, 'data') (: [ ['Christian', 'Fakestr. 1'], ['Matt', 'Fakestr. 2']] :)

with header parsing defaulting to false():

let $result := fn:parse-csv($my-csv-string)
let $headers := map:get($result, 'headers') (: [] :)
let $data := map:get($result, 'data') (: [ ['name', 'address'], ['Christian', 'Fakestr. 1'], ['Matt', 'Fakestr. 2']] :)

WDYT?

@fidothe
Contributor

fidothe commented May 10, 2023

Ooops, sorry Christian I didn't see your comment before I posted.

@fidothe
Contributor

fidothe commented May 10, 2023

let $csv := fn:parse-csv($my-csv-string, map { "headers": true() })
let $csv-getter := csv:make-getter($csv?headers) (: function(row as array(), header as item()) :)
$csv-getter($csv?data[1], 'name') (: 'Matt' :)

As per @ndw's suggestion, providing a standard function to produce a map:get-like getter using the headers. The name is terrible, but I haven't internalised a naming convention for functions which return those kind of getter functions.

@ChristianGruen
Contributor

ChristianGruen commented May 10, 2023

Ooops, sorry Christian I didn't see your comment before I posted.

No problem; hi Matt!

The function for retrieving map entries could also be embedded in the map:

(: without headers :)
let $csv := map {
  "records": ( ["a1"], ["a1", "a2", "", "b", "c"] ),
  "value"  : function($row, $column) { (: addressed with indexes :) }
}
return $csv?value(2, 1) (: returns "a1" :)

(: with headers :)
let $csv := map {
  "names"  : [ "A", "A", "", "B" ],
  "records": ( ["a1"], ["a1", "a2", "", "b", "c"] ),
  "value"  : function($row, $name) { (: addressed with row index and column name :) } 
}
return $csv?value(2, "A")  (: returns "a1" and "a2" :)

@michaelhkay
Contributor Author

michaelhkay commented May 10, 2023

How about returning a map with the following fields:

  • "header" returns a sequence (or array) of strings containing the values of the headers as supplied;
  • "keys" returns a sequence (or array) of the keys actually used for the maps in the body, which will be the same as the header fields if they are all supplied and unique, and may contain generated names otherwise, and
  • "body" is a sequence of maps where the fields are identified by names that appear in "keys". Fields that are absent because the row has insufficient fields will be omitted from a map. If there are more fields in a body row than in the header then the relevant keys are generated, e.g. field_22.

Simple use case where you know there are headers and you know what they are, you can do

let $result := parse-csv(...)
for $row in $result?body
return <e code="{$row?code}" status="{$row?status}"/>

For more complex use cases if you need to know the order of the columns you can look in $result?keys or $result?header. If there are no headers and you want to access the fields positionally

for $row in $result?body
return <row>{
   for $key in $result?keys
   return <cell>{ $row?$key }</cell>
}</row>

@ChristianGruen
Contributor

How about returning a map with the following fields:

I still have some concerns about the idea of automagically renaming headers. I believe it’s never a good design choice to change data to make it easier to process. Duplicate header names aren’t that rare, and if you have to work with CSV data that you can’t tailor yourself, a representation with renamed headers is not that convincing, and it is difficult to convert back.

But I agree, it would be much easier to have maps for regular/consistent data.

@fidothe
Contributor

fidothe commented May 10, 2023

I'm with @ChristianGruen on automagically renaming headers, at least as part of the primitive parse function - it feels like we're doing more, and possibly different, work than is needed in all cases.

Thinking about Sequences vs Arrays for the headers row and body rows, it occurs to me that we can't use Sequence because we need to be able to have gaps, and if we follow the spec (at least, the note in https://www.w3.org/TR/xpath-31/#id-array-constructors) then we shouldn't be using arrays either, but integer-keyed maps, because we are expecting sparse arrays as rows. (Unless we're proposing that missing cells be empty strings in all cases.)

fn:parse-csv(...) #=> { "headers": array(xs:string), body: array(map(*)) }

@fidothe
Contributor

fidothe commented May 10, 2023

On further reflection, we should represent empty cells with the empty string, so:

fn:parse-csv(...) #=> { "headers": array(xs:string), body: array(array(*)) }

@fidothe
Contributor

fidothe commented May 10, 2023

On further further reflection I realise that I skipped over the body being returned as a Sequence rather than an Array. I am currently assuming that Sequence has potential implementation benefits, especially with large inputs and laziness in mind. That sounds fine to me.

@michaelhkay
Contributor Author

I think we should focus on the usability of the data structure that's returned. Let's look at a few use cases, like turning the CSV data into an HTML table with table headers, or into an XML structure where each row becomes an element and the cells become attributes named after the column header, or into a JSON structure where each row becomes an object with the cells mapping to properties of the object.

Another thing we haven't looked at much is data typing. It would be convenient sometimes to return fields as numbers or dates, without having to do a wholesale transformation after the parsing.
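In the spirit of Ruby's per-column converters (mentioned later in this thread), data typing could also be sketched as a post-processing step rather than parser machinery. The converter map and the `$type-row` helper below are hypothetical:

```xquery
(: map from column name to a conversion function :)
let $converters := map {
  "amount": xs:decimal#1,
  "date":   xs:date#1
}
(: apply the matching converter to each entry of a row map, if one exists :)
let $type-row := function($row as map(*)) as map(*) {
  map:merge(
    map:for-each($row, function($key, $value) {
      map {
        $key: if (map:contains($converters, $key))
              then $converters($key)($value)
              else $value
      }
    })
  )
}
return $type-row(map { "amount": "12.50", "date": "2023-05-10", "note": "x" })
(: "amount" becomes xs:decimal, "date" becomes xs:date, "note" stays a string :)
```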

@fidothe
Contributor

fidothe commented May 10, 2023

I think we should focus on the usability of the data structure that's returned.

My experience with building things that process CSV has been that whatever my end goal is, the actual CSV parsing part of that is the smallest bit. Reliably getting the bytes off the disk and into arrays/hashes is very important, but the structure of CSV data varies so wildly that the bulk of whatever I built was always taking the sea of arrays and making it fit whatever model I needed so that I could then build my HTML table, or JSON structure.

In terms of usability, what's been important for me is ideally being able to extract headers so I use names to address the data, but usually what I end up doing is assigning headers to the columns and imposing those on the data at parse time, and then using my names to address the data. I don't want much more than that (apart from handling separators and quoting) from the parser. If it did much more most of the time it would just be in my way.

If everyone's happy, I'll find a reasonable variety of samples from my archive of projects that represent different kinds of CSVs I've encountered, and construct use cases based on what I needed to do with them.

Another thing we haven't looked at much is data typing. It would be convenient sometimes to return fields as numbers or dates, without having to do a wholesale transformation after the parsing.

It would. Ruby's CSV library allows you to register callback functions for each column so that you can do exactly that, which is a nice way of handling it without burdening the parser with too much logic. In practice I've very rarely found myself using it, and instead have tended to work with a row as it is spit out of the iterator. I think I'd like to try constructing use cases and having a crack at a simpler implementation, and seeing how dealing with those kinds of issues feels, before throwing in more machinery.

@ChristianGruen
Contributor

If everyone's happy, I'll find a reasonable variety of samples from my archive of projects that represent different kinds of CSVs I've encountered, and construct use cases based on what I needed to do with them.

Thanks, that's always interesting.

Maybe we should also decide if we want to facilitate or rule out roundtripping in the future. What do others think?

@michaelhkay
Contributor Author

Roundtripping: we should retain sufficient information to reconstruct an equivalent CSV, but not necessarily an identical CSV, for example there is no need to remember which fields originally appeared in quotation marks.

@fidothe
Contributor

fidothe commented May 12, 2023

I put some examples together, unfortunately still unfinished (schedule clash), but hopefully enough to talk about, in this gist: https://gist.github.com/fidothe/5442c1e8770e8275334bebedbdafce28. They're all from real projects.

@michaelhkay added the PR Pending label on Jun 14, 2023
@ChristianGruen removed the PR Pending label on Jul 26, 2023
@ChristianGruen added the Tests Needed label on Jan 16, 2024
@ChristianGruen
Contributor

I propose to close this issue, as the function was already added to the spec, and as numerous new issues exist that refer to the status quo (#1043, #1044, others).

@ChristianGruen added the Propose Closing with No Action label on Feb 27, 2024
@ndw
Contributor

ndw commented Mar 5, 2024

The CG agreed to close this issue without further action at meeting 068.

@ndw closed this as completed on Mar 5, 2024
@ChristianGruen removed the Tests Needed label on Sep 19, 2024