Add MAY language for using CLDR symbols and number formats as request… #651

Merged
merged 3 commits into from Jul 5, 2015

Conversation

@gkellogg
Member

commented Jul 3, 2015

…ed in #617.

Note that if a CLDR format pattern is used, there is no way to know whether pattern is a regular expression or CLDR format. Also, additional symbols are not included in our namespace, and so will be lost through a hypothetical JSON-LD transformation.

@iherman

Member

commented Jul 4, 2015

On 03 Jul 2015, at 20:27 , Gregg Kellogg notifications@github.com wrote:

…ed in #617.

Thanks!

Note that if a CLDR format pattern is used, there is no way to know whether pattern is a regular expression or CLDR format.

That is correct. But a regular expression is already some sort of a fall-back today, right? Ie, if an implementation is not prepared for CLDR, nothing changes...

Also, additional symbols are not included in our namespace, and so will be lost through a hypothetical JSON-LD transformation.

I am not sure which symbols you are referring to

@JeniT

Contributor

commented Jul 4, 2015

I might have misinterpreted the summary that @r12a gave in #617, but I don't think that the suggestion was to change the interpretation of the pattern property. I think that the suggestion was only to allow processors to recognise (in some way that we don't specify) numbers when they are formatted according to different (but recognisable) patterns. So if a cell contains something like %12.5 then rather than saying that this is an error, processors may interpret it as 0.125 because they recognise the use of one of the number patterns in CLDR.
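To make the `%12.5` case concrete, here is a minimal sketch of that kind of lenient recognition. The function name `parse_lenient_number` is hypothetical, and the rule that a percent sign on either side divides the value by 100 is an assumption for illustration, not anything the spec text says.

```python
from decimal import Decimal, InvalidOperation


def parse_lenient_number(text):
    """Return a Decimal for plain or percent-marked numbers, else None.

    Assumption: a '%' on either side of the digits scales the value by 1/100.
    """
    stripped = text.strip()
    is_percent = stripped.startswith("%") or stripped.endswith("%")
    digits = stripped.strip("%")
    try:
        value = Decimal(digits)
    except InvalidOperation:
        # Not recognisable as a number under this simplified rule.
        return None
    return value / 100 if is_percent else value
```

A processor without this kind of support would simply report `%12.5` as an error, which is why the proposed text would be a MAY.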

I think we'd be asking for trouble to say that pattern could be interpreted by some processors as a regular expression and by other processors as a CLDR numeric pattern. If we were to introduce a mechanism for authors to provide number patterns, we ought to use a separate property to do it or stop use of regular expressions. I do not think we want to do this.

@JeniT JeniT assigned gkellogg and unassigned JeniT Jul 4, 2015
@gkellogg

Member Author

commented Jul 4, 2015

I read it that way too, but without a format, I didn't see how it could be possible to interpret a value using CLDR without a format specification. I'll wait for clarification from @r12a.

It would be better to have some way of determining that the pattern is not supposed to be an RE, but a CLDR format string.

@iherman

Member

commented Jul 4, 2015

I think this is not a problem with @r12a's suggestion, but rather that the description in the model [1] is not really crisp enough. I mean, forget about the CLDR reference and let us consider only those formats that are in section 6.4.2.

My reading should be that:

  • if the string value of the cell can be interpreted as one of the formats listed in the numbered item, then the value is the number parsed accordingly
  • otherwise I use the pattern as a regular expression

In other words, the value of "pattern" is not taken into account if I can parse that number. If this is the intention, then we do not have any problem with the CLDR reference, it just adds more formats that pre-empt the usage of "pattern". And I think that is fine.

But I am not sure this is really what that section says; it does not crisply say that 'pattern' comes into the picture only if the default formatting requirements fail, although the situations leading to possible errors seem to indicate that. In other words, it should be stated more clearly that the pattern has nothing to do with the formats described in bullet points 1 to 6 (and, possibly, the CLDR patterns), but is a fallback.

Ivan

[1] http://w3c.github.io/csvw/syntax/index.html#h-formats-for-numeric-types

Ivan Herman, W3C
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
ORCID ID: http://orcid.org/0000-0003-0782-2704

@JeniT

Contributor

commented Jul 4, 2015

@gkellogg, the only text that @r12a suggested adding was "In addition, implementations MAY support the patterns described by CLDR, of which the preceding is a subset." I'm assuming that he feels that this is all that's needed. He didn't say anything about metadata authors needing to supply number patterns to support parsing.

However, for forwards compatibility (should we wish to be more supportive of locale-based patterns in the future) it does make sense for pattern to take a number pattern instead of a regular expression. I'm just wary of going the whole hog. I suggest that we constrain them to contain only 0, #, the decimal separator, the group separator, E, +, % and (but enable processors to support more complex formats).
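As a rough sketch of what such a constraint could look like in a processor, the check below accepts a pattern only if every character is one of the symbols listed above. The function name and the `decimal_char` / `group_char` parameters are hypothetical, and the allowed set contains only the symbols that are legible in this comment.

```python
def is_restricted_number_pattern(pattern, decimal_char=".", group_char=","):
    """Check that a number pattern uses only the restricted symbol set.

    Assumption: the allowed symbols are 0, #, E, +, % plus the decimal and
    group separators; anything else makes the pattern "more complex".
    """
    allowed = set("0#E+%") | {decimal_char, group_char}
    # An empty pattern is rejected; every character must be in the allowed set.
    return bool(pattern) and all(ch in allowed for ch in pattern)
```

Under these assumptions `#,##0.00` passes, while a regular expression such as `[0-9]+` does not, which would give processors a simple way to tell the two apart.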

In addition, I think that @r12a was suggesting adding similar text around the parsing of dates and times, so implementations could support other formats if they wanted to (presumably taking locale from the language of the table).

If you agree, let me know if you'd like me to make changes in this branch as suggested above.

@JeniT

Contributor

commented Jul 4, 2015

@iherman As currently written, the cell value will be a number if it is recognisable as a number. But errors will be raised if it doesn't match the regular expression given in the pattern. For example, if we have:

"datatype": {
  "base": "number",
  "pattern": "[0-9]+.[0-9][0-9]"
}

then:

  • a cell with the string value rubbish will have an error because it isn't a recognisable number and the cell's value will be set to the string "rubbish"
  • a cell with the string value 123 will have an error because it doesn't meet the regular expression pattern, but the cell's value will be set to the number 123
  • a cell with the string value 1.23 will not have any errors and the cell's value will be set to the number 1.23

If someone defined a regular expression that couldn't match numbers, such as [0-9]+k, then all the values in the column would have errors: ones that were numbers wouldn't match the regular expression, and ones that matched the regular expression wouldn't be recognised as numbers. (An advantage of making pattern be interpreted as a number pattern rather than a regular expression is that this kind of situation couldn't then happen.)
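The behaviour described above can be sketched roughly as follows. This is an illustration only: `interpret_cell` is a hypothetical name, `Decimal` stands in for whatever number recognition a processor uses, and the pattern is assumed to be written with character classes (`[0-9]+.[0-9][0-9]`) and matched against the whole string.

```python
import re
from decimal import Decimal, InvalidOperation


def interpret_cell(text, pattern):
    """Return (value, errors) with the two checks applied independently."""
    errors = []
    try:
        # The value is a number whenever the string is recognisable as one.
        value = Decimal(text)
    except InvalidOperation:
        value = text
        errors.append("not a recognisable number")
    # The pattern check happens regardless of the outcome above.
    if re.fullmatch(pattern, text) is None:
        errors.append("does not match pattern")
    return value, errors


pattern = "[0-9]+.[0-9][0-9]"
for cell in ("rubbish", "123", "1.23"):
    print(cell, interpret_cell(cell, pattern))
```

With a pattern like `[0-9]+k`, every cell fails one check or the other, which is the situation described in the last paragraph.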

If you have any suggestions to make it clearer in the text that this is what's going on, please make them.

@gkellogg

Member Author

commented Jul 4, 2015

@JeniT, I'm fine with that. I think we should use MAY text for all of it, so that non-RE patterns are not required. @iherman's language about determining if the pattern can be interpreted as a CLDR pattern makes sense too.

I agree that what @r12a said was different, but when I was working on the section, I couldn't see how this could possibly work without a pattern, so took license to go that way.

@JeniT, regarding the behavior of an otherwise valid number that does not match the pattern, in every other case, IIRC, it is treated as a string (for example, for dates). I think it would be consistent to render anything not matching the pattern, even if otherwise valid, as being a string, and, of course, create a cell error.

Please go ahead and modify as you think is necessary.

@gkellogg gkellogg assigned JeniT and unassigned gkellogg Jul 4, 2015
@iherman

Member

commented Jul 4, 2015

Hm. Now that I re-read the text after this comment, yes, this is what it says, but, shall we say, this is not what I expected. I regarded the pattern as a fallback if the cell value cannot be interpreted (in this case) as a number. I can see some value in the current setup (e.g., I want a specific format within the allowed number formats), but I did not get it. The only place that makes it clear is the error condition specification.

I think it would be worth emphasizing this in 6.4.2, saying that regardless of whether the number can be interpreted as such, the pattern matching does happen.

(Yes, the error causes are clear but, let us face it, people do not start by reading the details of the error conditions.)

Ivan

Cc @JeniT


Ivan Herman
Tel:+31 641044153
http://www.ivan-herman.net

(Written on mobile, sorry for brevity and misspellings...)

@gkellogg

Member Author

commented Jul 4, 2015

If a cell error does not result in a plain string result for the value for mis-matching numbers, then you'd need to consider other datatypes as well. You might have a format for dates, and if the date happened to be a valid xsd:date, but did not match the format, you'd need to consider that valid. In my implementation, if there's a format, then it MUST match, otherwise it's a plain string. Indeed, in this case, matching is required in order to get the pattern matches to form the resulting date.

Booleans may have a format, but if the format was Y|N, would true be considered an xsd:boolean?

We either stay consistent on the current behavior (and tests), or we change it to say that if an unmatched cell string value is valid against the underlying datatype, its cell value is a literal of that datatype, rather than being a plain string. This would allow for cell string values that have the legal form of their base to be rendered accordingly, and still generate a cell error.

@JeniT

Contributor

commented Jul 4, 2015

@gkellogg yes, agree that the cell value should be a string if it doesn't match the pattern.
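A minimal sketch of that agreed rule, for comparison with the earlier reading: if a pattern is given and the string does not match it, the value stays a plain string and a cell error is recorded, even when the string would otherwise be a valid number. The name `interpret_cell_strict` is hypothetical, the pattern is treated as a regular expression, and `Decimal` is a stand-in for real number parsing.

```python
import re
from decimal import Decimal, InvalidOperation


def interpret_cell_strict(text, pattern):
    """Return (value, errors); a pattern mismatch always yields a string."""
    if re.fullmatch(pattern, text) is None:
        # Mismatch: keep the raw string and raise a cell error.
        return text, ["does not match pattern"]
    try:
        return Decimal(text), []
    except InvalidOperation:
        # Matched the pattern but is still not a number.
        return text, ["not a recognisable number"]
```

Under this rule a cell containing `123` checked against `[0-9]+.[0-9][0-9]` yields the string `"123"` plus an error, not the number 123.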

@JeniT

Contributor

commented Jul 4, 2015

@gkellogg see what you think?

@JeniT JeniT assigned gkellogg and unassigned JeniT Jul 4, 2015
@gkellogg

Member Author

commented Jul 5, 2015

That's going further than we had originally discussed, but it seems better than using simple regular expressions and it is limited in scope.

I'll need to update/create tests to flesh this out, so I'll keep the issue open until that's complete.

gkellogg added a commit that referenced this pull request Jul 5, 2015
Add MAY language for using CLDR symbols and number formats as request…
@gkellogg gkellogg merged commit f4261bd into gh-pages Jul 5, 2015
@iherman

Member

commented Jul 5, 2015

On 04 Jul 2015, at 21:36 , Gregg Kellogg notifications@github.com wrote:

If a cell error does not result in a plain string result for the value for mis-matching numbers, then you'd need to consider other datatypes as well. You might have a format for dates, and if the date happened to be a valid xsd:date, but did not match the format, you'd need to consider that valid. In my implementation, if there's a format,

you mean 'pattern', right?

then it MUST match, otherwise it's a plain string. Indeed, in this case, matching is required in order to get the pattern matches to form the resulting date.

With the remark above that would indeed be more consistent.

@iherman

Member

commented Jul 5, 2015

What would that mean?

  • the language tag is just an indicator, ie, if the string would still be interpreted in CLDR correctly albeit not for that specific locale, then the match is still valid; or
  • the match is invalid in the case above?

I would expect the second to be consistent with the rest, but it should be properly indicated.

Also, would we then use the lang property? Is it inherited?

I am not sure that it is worth going down that road. (I know I did propose something like that as a first proposal, but by going down the CLDR route I think this becomes superfluous.)

@gkellogg gkellogg deleted the issue-617-number-formats branch Sep 2, 2015