Pattern string formats for parsing dates/numbers/durations #54
I'd like to propose that we drop the requirement to define pattern strings for parsing values; the pattern strings in Unicode TR35 (eg for dates) are extremely complicated and it's a large implementation burden to support the full set of locales. |
I support the proposal to drop this as a requirement. We could include an informational reference to the relevant standards, as a way to encourage toolmakers to add non-standard but useful extras. |
To help resolve this, @6a6d74 will look at ISO8601 format strings to see if they're simpler than TR35. I will look at what it would take to properly support parsing numbers. |
As suspected, ISO 8601:2004 (the most recent version) does not cater for variations in formatting the way we need things. It's strict about the way the date-time is formatted (albeit allowing variations like 'week number', 'day in year' etc. alongside the normal year-month-day representation). TR35 looks more complex than I think is warranted. However, the [“XPath and XQuery Functions and Operators 3.0” 9.8 Formatting dates and times](http://www.w3.org/TR/xpath-functions-30/#formatting-dates-and-times) looks to cover what we need whilst not being super complex. Of course - we're interested in converting from the locale-dependent number strings to xsd date-time (and gYear, gMonth etc.), which is the opposite direction to the functions in XPath. That said, the rules are very clearly defined & should be reversible. (Also see [4.7 Formatting numbers](http://www.w3.org/TR/xpath-functions-30/#formatting-numbers)) I think that it would be possible to express a testable set of functions for dealing with picture strings as defined for XPath 3.0 - especially if we assert things like "The ISO (ISO 8601) calendar must be supported. Support for other calendars may be provided, in which case conversion between different calendars is implementation defined". There are a few other challenges to deal with ... I'll try to write some testable rules. |
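To illustrate what "reversing" an XPath-style picture string might look like, here is a minimal sketch. It assumes a tiny, hypothetical subset of the picture syntax (only numeric `[Y…]`, `[M…]` and `[D…]` components, each used once, with the width modifiers ignored) and the ISO calendar. The function names `picture_to_regex` and `parse_date` are made up for this example and are not taken from any specification.

```python
import re
from datetime import date

# Assumption: only the numeric year, month and day components are supported.
_COMPONENT = re.compile(r"\[([YMD])[^\]]*\]")
_GROUP = {
    "Y": r"(?P<year>\d{4})",
    "M": r"(?P<month>\d{1,2})",
    "D": r"(?P<day>\d{1,2})",
}


def picture_to_regex(picture: str) -> re.Pattern:
    """Turn a picture string such as '[D01]/[M01]/[Y0001]' into a regex."""
    parts = []
    pos = 0
    for match in _COMPONENT.finditer(picture):
        # Text between components is treated as a literal separator.
        parts.append(re.escape(picture[pos:match.start()]))
        parts.append(_GROUP[match.group(1)])
        pos = match.end()
    parts.append(re.escape(picture[pos:]))
    return re.compile("".join(parts))


def parse_date(value: str, picture: str) -> date:
    """Parse a formatted date back into a date object using the picture."""
    match = picture_to_regex(picture).fullmatch(value)
    if match is None:
        raise ValueError(f"{value!r} does not match picture {picture!r}")
    return date(int(match["year"]), int(match["month"]), int(match["day"]))
```

For example, `parse_date("19/11/2014", "[D01]/[M01]/[Y0001]").isoformat()` gives the `xsd:date` lexical form `2014-11-19`. Month names, other calendars and time components are deliberately left out.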
This relates to a recent thread in public-vocabs which I commented on [1]. The Microdata to RDF spec looks for specific XSD patterns; ISO 8601 allows a wider range of formats, but still not as free-form as most spreadsheet content. For those of strong stomach, here's the (ruby extended) regular expression I use to match many 8601 patterns:
|
Aside: I often consider regular expressions a "write only" syntax, as you can create them, but trying to understand one by reading it is a job best left to computers! :) |
discussion: http://www.w3.org/2014/11/19-csvw-irc |
My proposal based on chat w/ jtandy + ivan is:
|
+1 |
To complement these, what we thought of proposing is:
Ivan
Ivan Herman, W3C |
Re "copied verbatim", ... I assume you have the mapping to RDF/JSON use cases most heavily in mind. But are the picture strings also useful just as metadata to document the meaning of fields? even if not doing a bulk conversion... |
That is probably correct. Ivan
|
Here's a summary of the key points from the teleconf discussion of Wed-19-Nov-2014
So basically, we're
(I am assuming that dealing with 'naturally formatted' numbers, e.g. with comma ',' as decimal separator, will be treated in the same way) |
The proposal above (with an ISO 8601 date-time complementing the 'naturally formatted' date-time) assumes that the naturally formatted date-time column(s) can be suppressed in the output. Need to ensure that 'column suppression in mapping' is supported. Also I wonder how easy it would be to use a REGEXP on the naturally formatted date-times to create a 'virtual' column when parsing the data? (I'll think about that some more!) |
Yes.
I just try to understand what you mean: would the metadata include a regexp pair (from and to) that should be applied on a literal before output? With the 'group' feature of the usual regular expression syntaxes that may work. I am just a little bit afraid of opening up the floodgates of some sort of an extra mapping mechanism for the mapping... Another problem with this: as Jeremy put it, I believe, regexp is a write only language. Our authors are not necessarily computer techies; asking them to write a regexp (and test it!) rather than a picture string might be a really tall order. Ivan
Ivan Herman, W3C |
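A minimal sketch of the "regexp pair (from and to)" idea discussed above, assuming the metadata carried a match pattern and a replacement template for a column. The property names `match` and `replace` and the function `virtual_value` are hypothetical and only serve to show how regex groups could rearrange a naturally formatted date into an ISO 8601 one.

```python
import re

# Hypothetical metadata fragment: "match" and "replace" are invented names.
column = {
    "name": "date",
    "match": r"(\d{2})/(\d{2})/(\d{4})",  # day/month/year as written in the CSV
    "replace": r"\3-\2-\1",               # reorder groups into year-month-day
}


def virtual_value(raw: str, col: dict) -> str:
    """Derive the 'virtual' column value from the raw cell using the regex pair."""
    if re.fullmatch(col["match"], raw) is None:
        raise ValueError(f"{raw!r} does not match {col['match']!r}")
    return re.sub(col["match"], col["replace"], raw)
```

With this fragment, `virtual_value("19/11/2014", column)` returns `"2014-11-19"`. The sketch also shows the concern raised in the comment: the author of the metadata has to write and test the expression themselves.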
Good points ... as ever I'm pushing the boundaries (but mindful of our agreement regarding "simple mapping") Jeremy |
+1 on simplicity (e.g. ISO 8601) - I also imagine we would likely have this as SHOULD rather than MUST. |
@6a6d74 can you provide an example of the metadata syntax that you have in mind based on the above discussion? It's not really clear to me. Regarding suppressing columns, I think it's actually useful to maintain the original human-readable data so I don't think that recommending two columns necessitates having a feature to suppress columns. (Can we have a separate issue for suppressing columns please?) |
Is it something like:
We would define I note that XPath has additional arguments for language, calendar and place when creating these strings. Should we provide mechanisms in the metadata to allow these to be set? TR35 has a whole section about how a calendar is defined (which is then used to parse/format the date). I'm guessing that support for all of these, and for particular languages and calendars, would be implementation defined for 1.0 ie not something where we would insist on compatibility? |
For numbers, TR35 suggests that you can parse numbers without knowing what format they're actually in and the main ambiguity is over the grouping and decimal separators. So that would suggest having something like:
and having rules along the lines of:
Presumably if there is a percent or per-mille character then the resulting number should be divided by 100/1000 when mapping into a numeric value? Should people be able to specify a particular format for the number in the schema, eg |
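As a rough illustration of parsing numbers from just a decimal separator and a grouping separator, with percent and per-mille handled as suggested above, something like the following could work. The function name `parse_number` and its parameter names are assumptions made for this sketch, not proposed metadata properties.

```python
from decimal import Decimal


def parse_number(raw: str, decimal_char: str = ".", group_char: str = ",") -> Decimal:
    """Parse a locale-formatted number given its separator characters."""
    text = raw.strip()
    divisor = 1
    # A trailing percent or per-mille sign scales the value down.
    if text.endswith("%"):
        text, divisor = text[:-1], 100
    elif text.endswith("\u2030"):  # per-mille sign
        text, divisor = text[:-1], 1000
    text = text.strip()
    if group_char:
        text = text.replace(group_char, "")  # drop grouping separators
    text = text.replace(decimal_char, ".")   # normalise the decimal separator
    return Decimal(text) / divisor
```

For instance, `parse_number("1.234,56", decimal_char=",", group_char=".")` yields `Decimal("1234.56")`, and `parse_number("12,5%", decimal_char=",", group_char=".")` yields `Decimal("0.125")`. The sketch does not check that grouping separators appear in sensible positions.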
For dates we decided not to go down the route of the full Unicode standard; instead, the approach was to ask for a standard format, and have a structure whereby the metadata can specify the various possible "picture strings", assigned to various programming languages, that describe the data. This means implementations do not have to implement complex parsers but can rely on the 'standard' tools of their respective environments. I would prefer to be consistent and choose the same approach for numbers. Ivan Ivan Herman (Written on mobile, sorry for brevity and misspellings...)
|
I am struggling to understand what this looks like in practice. Can you supply a sample of a metadata document to illustrate? |
(Agreed about aiming for consistency in approach between dates and numbers, though I think in practice there are very different considerations and levels of complexity.) |
I am making this up as I write, because we did not work out the details. But it would be something like (I use the date example): { "datatype" : "date", Not pretty, I am the first one to say. On the other hand, referring to the Unicode standard for picture strings that, afaik, nobody really implements is a major drag; it means all implementations, checkers or converters, will have to implement complex parsing for the datatypes, and I do not believe this will really happen:-( Ivan
Ivan Herman, W3C |
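One way to picture the "picture strings assigned to various programming languages" approach is the sketch below. The metadata shape (a `format` object keyed by language name) is purely hypothetical and is not the structure from the comment above, whose example is incomplete. The point is only that each implementation picks the string its own standard library understands.

```python
from datetime import datetime

# Hypothetical metadata: one picture string per programming environment.
column = {
    "datatype": "date",
    "format": {
        "python": "%d/%m/%Y",    # strptime-style
        "java": "dd/MM/yyyy",    # SimpleDateFormat-style
    },
}


def parse_with_native_format(raw: str, col: dict, language: str = "python") -> str:
    """Use the picture string for this environment, if the metadata has one."""
    fmt = col["format"].get(language)
    if fmt is None:
        # No picture string for this environment: leave the value untouched.
        return raw
    return datetime.strptime(raw, fmt).date().isoformat()
```

Here `parse_with_native_format("19/11/2014", column)` returns `"2014-11-19"`, while an environment with no matching entry passes the cell through unchanged.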
@JeniT: for numbers I think that your proposal (specifying decimal and grouping chars) covers most of the issues. However, is there a case where, when validating CSV files, you would wish to check the number of decimal places used? (i.e. where there are mandatory digits in the number string). For example:
I'm guessing that this is probably beyond what we want/need - certainly the use cases don't cover this. So just specifying the grouping and decimal-separator characters is probably enough. If we do adopt picture strings for numbers, note that the XPath function for number formatting includes some useful statements for parsing picture strings. In fact, there are a number of rules spec'd by XPath that might be applied when we parse the number fields themselves ...
(etc. ... and it also talks about repeating patterns for grouping characters too) |
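If validating the number of decimal places were in scope, a check could be as small as the sketch below. The function `has_decimal_places` and its `minimum`/`maximum` parameters are assumptions for illustration; a real picture string would encode the same limits as mandatory and optional digit placeholders.

```python
import re


def has_decimal_places(raw: str, minimum: int, maximum: int, decimal_char: str = ".") -> bool:
    """Check that a plain number string has an allowed count of fraction digits."""
    pattern = rf"[+-]?\d+(?:{re.escape(decimal_char)}(\d+))?"
    match = re.fullmatch(pattern, raw)
    if match is None:
        return False
    places = len(match.group(1) or "")  # zero when there is no fraction part
    return minimum <= places <= maximum
```

With two mandatory decimal places, `has_decimal_places("3.10", 2, 2)` is `True` and `has_decimal_places("3.1", 2, 2)` is `False`. Grouping separators are not handled here.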
Looking at numbers and date-times, I think it's ok to have different approaches (e.g. one requires a full picture string, the other does not). However, we have Also, I note that there's a mismatch between numbers and date-times ...
Feels a bit discordant ... would suggest something like:
You can see that I've added a |
@JeniT: should the |
@JeniT: and here's another one ... currency token (noting that the metadata for the column might provide additional metadata about the actual currency so that a value with token See ISO 4217 "Codes for the representation of currencies and funds" for authoritative currency codes |
There is also the habit of expressing negative numbers in parentheses ... often used in accounting. So we'd want to be able to parse |
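A minimal sketch of handling the accounting convention, assuming that a value wrapped in parentheses is simply the negative of the number inside. The name `parse_accounting` is hypothetical, and currency tokens are not handled.

```python
from decimal import Decimal


def parse_accounting(raw: str, decimal_char: str = ".", group_char: str = ",") -> Decimal:
    """Parse a number where surrounding parentheses mean a negative value."""
    text = raw.strip()
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]  # strip the parentheses
    # Drop grouping separators, then normalise the decimal separator.
    text = text.replace(group_char, "").replace(decimal_char, ".")
    value = Decimal(text)
    return -value if negative else value
```

For example, `parse_accounting("(1,234.56)")` yields `Decimal("-1234.56")`, while `parse_accounting("1,234.56")` yields `Decimal("1234.56")`.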
discussed 14/1/2015 and decided to support a fixed set of known/popular date-time formats, which we will list, probably using Unicode picture string formats for the names. Editors to propose a list and work out whether this needs to be supplemented with the names/abbreviations for months. |
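The "fixed set of known formats" decision could be implemented as a simple lookup from a Unicode-style name to whatever the host environment's parser expects. The three entries below are placeholders chosen for illustration, since the actual list is left to the editors. `KNOWN_FORMATS` and `parse_known` are hypothetical names.

```python
from datetime import datetime

# Placeholder list: Unicode-style names mapped to Python strptime formats.
KNOWN_FORMATS = {
    "yyyy-MM-dd": "%Y-%m-%d",
    "dd/MM/yyyy": "%d/%m/%Y",
    "MM/dd/yyyy": "%m/%d/%Y",
}


def parse_known(raw: str, format_name: str) -> str:
    """Parse a date written in one of the listed formats into ISO 8601."""
    try:
        strptime_format = KNOWN_FORMATS[format_name]
    except KeyError:
        raise ValueError(f"unsupported date format: {format_name}") from None
    return datetime.strptime(raw, strptime_format).date().isoformat()
```

So `parse_known("11/19/2014", "MM/dd/yyyy")` returns `"2014-11-19"`, and any format name outside the list is rejected rather than interpreted as a free-form pattern.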
Note that this should be handled in the metadata document, and conversion documents do not need to repeat the logic. |
What pattern string formats should we use? There are pattern string formats defined in Unicode TR35.