New function: parse-csv() #413
Additional options that I regard as relevant, if not essential: …
We also need to clarify what happens if multiple columns have the same header name. We could reuse the …
Duplicate column names: I was going to suggest that the duplicates get suffixed to make them unique, e.g. name-1, name-2.
A wonderful idea! Thank you!
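A minimal sketch of that suffixing scheme (purely illustrative, not part of any proposal):

```xquery
(: Suffix duplicate header names with their occurrence number :)
let $names := ('name', 'age', 'name')
for $name at $i in $names
let $positions := index-of($names, $name)
return
  if (count($positions) > 1)
  then $name || '-' || index-of($positions, $i)
  else $name
(: yields 'name-1', 'age', 'name-2' :)
```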
My favorite separators are US (U+001F) …
But what if the file already has those names as headers?
I guess that raises the question of whether it is still appropriate to restrict the character set of …
For allocation of namespace prefixes, we simply say "choose something that doesn't conflict with what's already there". We can either do that, or we can spell out an algorithm. Neither approach is difficult.
Sounds good. I've opened a new issue for that (#414).
The BaseX XML representation was designed to support roundtripping. We are generating a map for CSV input, with one (optional) map entry containing an array with the header names, and another one containing a sequence of arrays with the records. If we don't plan to serialize CSV data, the format proposed here is certainly more intuitive.
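For illustration, that shape looks roughly like this (invented sample values):

```xquery
map {
  "names"  : [ "name", "city" ],                     (: optional header entry :)
  "records": ( ["Anna", "Oslo"], ["Ben", "Turin"] )  (: one array per record :)
}
```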
I'm wondering whether there is a requirement to retain information about the order of columns? If there is, this could perhaps be achieved with an option for the first map in the returned sequence to map column names to integer positions.
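For example, such a hypothetical leading map would let the original column order be recovered:

```xquery
(: Hypothetical: the first item maps column names to integer positions :)
let $positions := map { 'name': 1, 'city': 2 }
return sort(map:keys($positions), (), function($k) { $positions($k) })
(: yields 'name', 'city', i.e. the original column order :)
```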
If we had a compact notation for getting the key of a value, it may already be sufficient to use the following:

```xquery
(: let $csv := csv:parse('language' || char('\n') || 'fr' || char('\n') || 'ja') :)
let $csv := (
  map { 1: 'language' },
  map { 1: 'fr' },
  map { 1: 'ja' }
)
let $header := head($csv)
(: map:key-where($header, { . = 'language' }) ? :)
let $ln := map:for-each($header, function($k, $v) { if($v = 'language') { $k } })
for $record in tail($csv)
return $record?$ln
```

Another requirement that we just added for sparse tables is to ignore empty fields (via an … option).
I'm starting to take a serious look at this (I'm the new guy at Saxonica) from an implementor's perspective, but I've also done a lot of projects making heavy use of CSV. From that point of view, some parsing options I've found I leaned on a lot in Ruby's CSV library (where I've spent most of my time): …

The other thing that immediately occurs to me is the problem of missing and extra columns, where a CSV declares a number of headers in the first record, but not all the rows in the file have the same number of columns. When there are more, your record …

Ruby's CSV library documentation is nice and thorough: https://rubydoc.info/gems/csv
👋 Hi Matt! Good to hear that you are now with Saxonica. (And sorry to all for this OT comment.)
I would very much prefer …
Thinking again about the problem of varying row lengths, especially when a header row (used to define keys for the maps returned) has fewer columns than some of the data rows: in that case, if your data row maps just use those keys, are the 'excess' columns discarded? (You might expect that all rows are the same size...) If they aren't discarded, then I guess they just get integers-as-keys. Other language implementations of this often use a Row object that acts as both a Map and an Array, which we can't do, and in those cases where you're dealing with a variable number of 'anonymous' columns, …
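A sketch of that integers-as-keys fallback (one possible policy, not agreed behaviour):

```xquery
(: Cells beyond the header width get their integer position as key :)
let $headers := ('name', 'city')
let $cells   := ('Anna', 'Oslo', 'x1', 'x2')
return map:merge(
  for $cell at $i in $cells
  return map { (if ($i le count($headers)) then $headers[$i] else $i): $cell }
)
(: map { 'name': 'Anna', 'city': 'Oslo', 3: 'x1', 4: 'x2' } :)
```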
I wonder if the parse-csv function should be as primitive as possible and just return an array of rows, each row an array of columns. It could take an optional mapping function to transform that into a map. We could provide standard mapping functions for the obvious cases: unique column headers, no extra columns, etc., but let users provide their own functions for more complicated cases.
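A hedged sketch of that layering (names and shapes invented; the literal array of arrays stands in for the primitive parser's output):

```xquery
(: A standard mapping function for the simple case: unique headers, no extras :)
let $rows := [ ['name', 'city'], ['Anna', 'Oslo'], ['Ben', 'Turin'] ]
let $row-to-map := function($headers as array(*), $row as array(*)) as map(*) {
  map:merge(
    for $i in 1 to array:size($headers)
    return map { array:get($headers, $i): array:get($row, $i) }
  )
}
for $i in 2 to array:size($rows)
return $row-to-map(array:get($rows, 1), array:get($rows, $i))
(: map { 'name': 'Anna', 'city': 'Oslo' }, map { 'name': 'Ben', 'city': 'Turin' } :)
```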
The XDM conversion we use (see #413 (comment)) was mainly designed to support bidirectional conversions. It’s not as convenient as other representations may be, but it can pretty well handle irregular CSV inputs (which are more or less the standard case in practice). The following CSV input…

```
A,A,,B
a1
a1,a2,,b,c
```

…is represented with:

```xquery
map {
  "names": [ "A", "A", "", "B" ],
  "records": ( ["a1"], ["a1", "a2", "", "b", "c"] )
}
```

The values of a specific column can be accessed as follows:

```xquery
for $A-index in array:index-of($csv?names, 'A') (: …actually »indexes of«, in this example :)
for $record in $csv?records
return array:get($record, $A-index, ())
```

If we want to circumvent the error handling caused by the strict array bounds, we could also represent the headers and single records as sequences and return the records as an array:

```xquery
let $csv := map {
  "names": ( "A", "A", "", "B" ),
  "records": [ ("a1"), ("a1", "a2", "", "b", "c") ]
}
for $A-index in index-of($csv?names, 'A')
for $record in array:members($csv?records)
return $record?value[$A-index]
(: or :)
for member $record in $csv?records
return $record[$A-index]
```

Most people still prefer our XML representations for CSV data.
There's an expectation, I think, of some special behaviour around the header row itself being part of the most-primitive functionality, but I like the idea of it being as dumb as possible at core. Without referring to the BaseX documentation (I'll have a proper look later) I'd probably lean toward something like this pseudocode:

```xquery
let $result := fn:parse-csv($my-csv-string, map { "headers": true() })
let $headers := map:get($result, 'headers') (: ['name', 'address'] :)
let $data := map:get($result, 'data') (: [ ['Christian', 'Fakestr. 1'], ['Matt', 'Fakestr. 2'] ] :)
```

with header parsing defaulting to off:

```xquery
let $result := fn:parse-csv($my-csv-string)
let $headers := map:get($result, 'headers') (: [] :)
let $data := map:get($result, 'data') (: [ ['name', 'address'], ['Christian', 'Fakestr. 1'], ['Matt', 'Fakestr. 2'] ] :)
```

WDYT?
Oops, sorry Christian, I didn't see your comment before I posted.
```xquery
let $csv := fn:parse-csv($my-csv-string, map { "headers": true() })
let $csv-getter := csv:make-getter($csv?headers) (: function(row as array(), header as item()) :)
$csv-getter($csv?data[1], 'name') (: 'Matt' :)
```

As per @ndw's suggestion, providing a standard function to produce a …
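One way a csv:make-getter helper could be sketched (the name comes from the snippet above; everything else is an assumption):

```xquery
declare function local:make-getter($headers as array(xs:string))
    as function(array(*), xs:string) as item()* {
  function($row as array(*), $header as xs:string) {
    (: return every cell whose header matches, tolerating short rows :)
    for $i in index-of(array:flatten($headers), $header)
    where $i le array:size($row)
    return array:get($row, $i)
  }
};
```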
No problem; hi Matt! The function for retrieving map entries could also be embedded in the map:

```xquery
(: without headers :)
let $csv := map {
  "records": ( ["a1"], ["a1", "a2", "", "b", "c"] ),
  "value"  : function($row, $column) { (: addressed with indexes :) }
}
return $csv?value(2, 1) (: returns "a1" :)

(: with headers :)
let $csv := map {
  "names"  : [ "A", "A", "", "B" ],
  "records": ( ["a1"], ["a1", "a2", "", "b", "c"] ),
  "value"  : function($row, $name) { (: addressed with row index and column name :) }
}
return $csv?value(2, "A") (: returns "a1" and "a2" :)
```
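For what it's worth, one way the value entry could be filled in (an assumption, not part of the proposal):

```xquery
let $names   := [ "A", "A", "", "B" ]
let $records := ( ["a1"], ["a1", "a2", "", "b", "c"] )
let $value   := function($row as xs:integer, $name as xs:string) {
  let $record := $records[$row]
  for $i in index-of(array:flatten($names), $name)
  where $i le array:size($record)
  return array:get($record, $i)
}
return $value(2, "A") (: returns "a1" and "a2" :)
```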
How about returning a map with the following fields: …

Simple use case: where you know there are headers and you know what they are, you can do …

For more complex use cases, if you need to know the order of the columns, you can look in …
I still have some concerns with the idea of automagically renaming headers. I believe it’s never a good design choice to change data to make it easier to process. Duplicate header names aren’t that rare, and if you have to work with CSV data that you can’t tailor yourself, the representation with renamed headers is not that convincing, and it is difficult to convert back. But I agree, it would be much easier to have maps for regular/consistent data.
I'm with @ChristianGruen on automagically renaming headers, at least as part of the primitive parse function: it feels like we're doing more, and possibly different, work than is needed in all cases.

Thinking about sequences vs. arrays for the headers row and body rows, it occurs to me that we can't use a sequence because we need to be able to have gaps, and if we follow the spec (at least, the note in https://www.w3.org/TR/xpath-31/#id-array-constructors) then we shouldn't be using arrays either, but integer-keyed maps, because we are expecting sparse arrays as rows. (Unless we're proposing that missing cells be the empty string in all cases.)

```
fn:parse-csv(...) #=> { "headers": array(xs:string), body: array(map(*)) }
```
On further reflection, we should represent empty cells with the empty string, so:

```
fn:parse-csv(...) #=> { "headers": array(xs:string), body: array(array(*)) }
```
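Illustrative use of that shape, borrowing the 4.0 for member syntax from earlier in the thread:

```xquery
let $result := map {
  "headers": [ "name", "city" ],
  "body"   : [ ["Anna", "Oslo"], ["Ben", ""] ]  (: empty string for a missing cell :)
}
for member $row in $result?body
return array:get($row, 2)
(: "Oslo", "" :)
```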
On further further reflection I realise that I skipped over the body being returned as a …
I think we should focus on the usability of the data structure that's returned. Let's look at a few use cases, like turning the CSV data into an HTML table with table headers, or into an XML structure where each row becomes an element and the cells become attributes named after the column header, or into a JSON structure where each row becomes an object with the cells mapping to properties of the object.

Another thing we haven't looked at much is data typing. It would be convenient sometimes to return fields as numbers or dates, without having to do a wholesale transformation after the parsing.
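To make the first use case concrete, a rough sketch assuming the names/records map shape discussed above:

```xquery
let $csv := map {
  "names"  : [ "name", "city" ],
  "records": ( ["Anna", "Oslo"], ["Ben", "Turin"] )
}
return
  <table>
    <tr>{
      for $i in 1 to array:size($csv?names)
      return <th>{ array:get($csv?names, $i) }</th>
    }</tr>{
      for $record in $csv?records
      return
        <tr>{
          for $i in 1 to array:size($record)
          return <td>{ array:get($record, $i) }</td>
        }</tr>
    }</table>
```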
My experience with building things that process CSV has been that whatever my end goal is, the actual CSV parsing part of that is the smallest bit. Reliably getting the bytes off the disk and into arrays/hashes is very important, but the structure of CSV data varies so wildly that the bulk of whatever I built was always taking the sea of arrays and making it fit whatever model I needed, so that I could then build my HTML table or JSON structure. In terms of usability, what's been important for me is ideally being able to extract headers so I can use names to address the data, but usually what I end up doing is assigning headers to the columns, imposing those on the data at parse time, and then using my names to address the data. I don't want much more than that (apart from handling separators and quoting) from the parser. If it did much more, most of the time it would just be in my way. If everyone's happy, I'll find a reasonable variety of samples from my archive of projects that represent different kinds of CSVs I've encountered, and construct use cases based on what I needed to do with them.
It would. Ruby's CSV library allows you to register callback functions for each column so that you can do exactly that, which is a nice way of handling it without burdening the parser with too much logic. In practice I've very rarely found myself using it, and instead have tended to work with a row as it is spat out of the iterator. I think I'd like to try constructing use cases and having a crack at a simpler implementation, and see how dealing with those kinds of issues feels before throwing in more machinery.
Thanks, that's always interesting. Maybe we should also decide if we want to facilitate or rule out roundtripping in the future. What do others think?
Roundtripping: we should retain sufficient information to reconstruct an equivalent CSV, but not necessarily an identical CSV; for example, there is no need to remember which fields originally appeared in quotation marks.
I put some examples together, unfortunately still unfinished (schedule clash), but hopefully enough to talk about, in this gist: https://gist.github.com/fidothe/5442c1e8770e8275334bebedbdafce28. They're all from real projects.
The CG agreed to close this issue without further action at meeting 068.
I propose a new function parse-csv() that accepts a CSV string (such as might be read from a CSV file using unparsed-text()). CSV is as defined in RFC 4180; implementations may be liberal in what they accept, and may define additional options.
An options parameter includes the option header=true|false to indicate whether the first line should be taken as containing column headings.
The result of the function is a sequence of maps, one map per row of the CSV file (excluding the header). Each map contains one entry per column, the key being taken from the column header if present, or an integer 1...n if not.
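For illustration, with these semantics a call might behave roughly as follows (the option spelling is taken from the paragraph above; the char('\n') helper is borrowed from an earlier comment in the thread):

```xquery
let $input := 'name,city' || char('\n') || 'Anna,Oslo' || char('\n') || 'Ben,Turin'
return parse-csv($input, map { 'header': true() })
(: expected result, per the proposal:
   map { 'name': 'Anna', 'city': 'Oslo' },
   map { 'name': 'Ben',  'city': 'Turin' } :)
```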