Human readable formats for numeric percentages #617

Closed
r12a opened this issue Jun 15, 2015 · 17 comments · Fixed by #671

Comments

@r12a

commented Jun 15, 2015

6.2.2 Formats for numeric types
http://www.w3.org/TR/2015/WD-tabular-data-model-20150416/#formats-for-numeric-types

"When parsing the string value of a cell against this format specification, implementations MUST recognise and parse numbers that consist of:"

the list that follows this makes allowances for groupChar and decimalChar, but still fails to capture real life formats from various cultures, when it comes to percentages.

For example, CLDR[1] shows formats where the percent sign appears to the left of the numeric sequence (Basque, Turkish), or where the percent sign is separated from the numeric sequence by white space (about 30 locales).

It's also not clear to me how pattern properties (which would appear to be the alternative) are taken into consideration in the validation checks (which actually are well-formedness checks, aren't they?).

It unfortunately seems a little too restrictive, and the alternative property values seem a little too simplistic, for international use.

Is there a way we can improve on that?

I'm wondering whether the spec should:

  1. only specify required formats for normalized numbers (ie. converted to the canonical form)
  2. do 1 and require applications to convert from real world scenarios as a separate step, taking into account the possible variations represented in CLDR
  3. do 1 and provide a property that specifies a locale, which recognises values that cover all the CLDR alternatives
  4. something else

[1] http://www.unicode.org/cldr/charts/27/by_type/numbers.number_formatting_patterns.html

@JeniT

Contributor

commented Jun 17, 2015

@r12a Thanks for the comment. We are trying to strike a balance between the variety of numeric formats that are encountered in the real world and not requiring implementors to get to grips with all the locale-specific formats.

What do you think about simply amending the text around recognising/parsing numbers such that the percent/permille sign can come before (as well as after) the number, and with optional spaces?
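A minimal sketch of what that amended rule might look like in an implementation, assuming plain ASCII digits, "." as the decimal separator and no grouping. The helper name `parse_scaled` and the regular expression are illustrative only and are not taken from the spec or from any existing implementation:

```python
import re
from decimal import Decimal

# Divisor applied for each sign (U+2030 is PER MILLE SIGN).
_SCALE = {"%": Decimal(100), "\u2030": Decimal(1000)}

# Sign may come before or after the number, with optional white space.
_NUMBER = re.compile(
    r"(?:(?P<pre>[%\u2030])\s*)?"
    r"(?P<num>[+-]?\d+(?:\.\d+)?)"
    r"(?:\s*(?P<post>[%\u2030]))?"
)

def parse_scaled(value):
    """Parse a cell value, accepting a leading or trailing %/per-mille sign."""
    m = _NUMBER.fullmatch(value.strip())
    if m is None:
        raise ValueError("not a recognised number: %r" % value)
    pre, post = m.group("pre"), m.group("post")
    if pre and post:
        raise ValueError("sign given twice: %r" % value)
    number = Decimal(m.group("num"))
    sign = pre or post
    return number / _SCALE[sign] if sign else number
```

In Python, `\s` on a `str` pattern also matches the no-break space, so `parse_scaled("45\u00a0%")` and `parse_scaled("% 45")` would both be accepted under this sketch.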

@gkellogg

Member

commented Jun 17, 2015

FWIW, my implementation does look for percent/per-mille at the end, but could easily look for it anywhere within the cell string value.

@lreis

commented Jun 17, 2015

Also to consider during implementation, like with percent symbols, the usage of currency symbols is also locale-dependent. They can precede or follow numbers, with or without spaces. Also, some currencies don't use single-character symbols, rather they are represented by multiple characters, including those from non-Latin scripts (e.g. '1.000,00 DKK', '1 000,00 лв.', '1000 դր', 'د.إ.‏ 1,000.00').

Lastly, if you want to achieve further internationalization, parsing should expand beyond 'decimal digits' / Indo-Arabic numerals (0-9), to include other (albeit less common) numeric systems such as Eastern Arabic (٠١٢٣٤٥٦٧٨٩) and Indian (०१२३४५६७८९), to name a few.
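For illustration, one way an implementation could fold such digits into 0-9 before parsing is sketched below. It relies on Python's standard `unicodedata.decimal`; the function name `to_ascii_digits` is hypothetical and nothing here is required by the spec:

```python
import unicodedata

def to_ascii_digits(text):
    """Replace any Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        # unicodedata.decimal returns the digit value, or the default (None)
        # for characters that are not decimal digits.
        value = unicodedata.decimal(ch, None)
        out.append(ch if value is None else str(value))
    return "".join(out)
```

Group and decimal separators, currency symbols and bidi control characters are left untouched by this sketch, so they would still need separate handling.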

@JeniT

Contributor

commented Jun 30, 2015

@r12a please could you review #640 to see if it addresses your issue.

@lreis I think that supporting numbers with currencies is probably going too far for numeric formats and if we were to support it we'd probably want a separate currency type in order to have some way to retain the information about which currency was used with the number. Note that people can still constrain the values of a column that includes currency information by datatyping it as a string and using a regular expression. I think we'd want to see evidence of non-indo-arabic-numerals being used within CSV files in order to add support for that. Both are probably vNext features.

@r12a

Author

commented Jul 1, 2015

@JeniT thanks for the edit. Note, however, that the percent thing was only really meant to be an example of something that was missed. For example, CLDR says that you should allow a different symbol for minus sign in languages such as Finnish, Lithuanian, Norwegian, Swedish, etc. (ie. U+2212 MINUS SIGN).
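As a rough sketch of the minus-sign case only: a parser could map U+2212 to the ASCII hyphen-minus before handing the value to a standard numeric parser. The helper name `parse_signed` is made up for this example:

```python
from decimal import Decimal

MINUS_SIGN = "\u2212"  # U+2212 MINUS SIGN

def parse_signed(value):
    """Parse a number that may use U+2212 instead of ASCII '-'."""
    return Decimal(value.strip().replace(MINUS_SIGN, "-"))
```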

What i was thinking was rather that it may make more sense to say something like: "When parsing the string value of a cell against this format specification, implementations MUST recognise and parse numbers that conform to the formatting patterns described by CLDR."

(see http://www.unicode.org/cldr/charts/27/by_type/numbers.number_formatting_patterns.html to get a notion about what that means, though that shouldn't necessarily be listed as a source of the data)

This not only makes it easier to specify the allowable patterns, but allows us to expand the list as information is added to CLDR for new locales, without having to reissue the CSV standard.

Btw, reading through the thread again, i realise that (although i mentioned it to Ivan) i didn't say in this issue that there may be similar issues for the other formats, beside numbers. Numbers is the easy stuff. I'll look into that and raise another issue if i spot some problems for the other formats. But a similar approach for those formats would presumably make life easier there too.

@gkellogg

Member

commented Jul 1, 2015

Given our proximity to LCCR, I would say that this entire mechanism should be left to V2, as it is quite a substantial change; I don't see a point in doing something specific to percent/per-mille given @r12a's advice. We do format checking for dates and times now, and there's no end of what might be done. This should be use-case driven with some evidence to do this, which is beyond what the WG can do at this point, IMO.

There are probably many things that need to be left out, and hopefully there will be interest to do a V2 spec subsequently to get what was missed. There may also be room for some community extensions which may see developer adoption.

@JeniT

Contributor

commented Jul 2, 2015

I agree we should not go overboard here. We purposefully decided to scope ourselves to supporting the 80% use case with the number formatting that we support (see #54 for our discussion).

I note that the numeric parsing suggested in TR35 does not include the possibility for percent/per-mille at the beginning of the string.

I'm +0 on supporting different characters for minus signs and exponents: there is a fixed set in the By-Type charts and using them was in my original proposal at #54 (comment).

What do others in the WG think to this (responses quickly please as we want to publish LCCR)?

@6a6d74

Contributor

commented Jul 2, 2015

@JeniT - I can see how this might add value but note our scoping decision to be use-case driven. Neither the RTL directionality use case nor any of the others talks about particular number formats for minus signs and exponents.

So from me, it's a -1 for this version, but a +1 for putting into the pot for the next release.

Jeremy

@iherman

Member

commented Jul 2, 2015

As a compromise solution, what about:

  • add a locale property in 6.2.2 (as Richard proposed) with a text saying that the locale MAY provide other locale-specific formats for the numbers (or dates or whatever) and implementations MAY be able to interpret numbers in those locales (or a subset thereof). What we describe in this specification is what all implementations MUST understand, and we test only those. We can also refer to CLDR as a non-normative reference for the various locale variants.
  • Version 2 may become more stringent and require the interpretation of all CLDR locales. I definitely agree that this seems to be way too complex to expect any implementations to fulfil those in this round.

@r12a what is the right term for locales? Is it the same as language tags, ie, can we reuse that reference?

@r12a

Author

commented Jul 2, 2015

i18n folks discussed this on the i18n call today, with Ivan. We understand the time pressure to find a solution, and so the need to compromise. Here's our proposal – somewhat different (and probably easier to manage) than what Ivan wrote just above.

We suggest that you continue to say in the spec: "When parsing the string value of a cell against this format specification, implementations MUST recognise and parse numbers that consist of:" followed by the current lists. But then add, "In addition, implementations MAY support the patterns described by CLDR, of which the preceding is a subset."

There's no need, at this point, to worry about matching to specific locales (you wouldn't do that anyway if dealing with the MUSTified set of validation criteria). The intention is simply that the data should at least fit one of the (few) patterns described for, in this case numeric, data in CLDR.

This means, for example (no need to add this to the spec – it's just an explanation for this comment, and i know that this one is already covered, it's just an example), that implementers could support one of the following patterns for percentages (but no more):
#,##,##0 %
#,##,##0%
#,##0 %
#,##0%
#0%
% #,##0
%#,##0
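To make the list above concrete, here is a hedged sketch that turns each of those percent patterns into a regular expression and tests a value against all of them. It is a deliberate simplification: integers only, "," as the group character, and a space in the pattern matching either a plain space or a no-break space. `pattern_to_regex` and `matches_any_pattern` are hypothetical names, not part of CLDR or of any CSVW implementation:

```python
import re

PERCENT_PATTERNS = [
    "#,##,##0 %", "#,##,##0%", "#,##0 %", "#,##0%", "#0%", "% #,##0", "%#,##0",
]

def pattern_to_regex(pattern, group_char=","):
    """Translate one of the simple percent patterns into a compiled regex."""
    core = pattern.replace("%", "").strip()
    groups = core.split(",")
    if len(groups) == 1:
        integer = r"\d+"  # e.g. "#0": no grouping at all
    else:
        primary = len(groups[-1])  # size of the rightmost group
        secondary = len(groups[-2]) if len(groups) > 2 else primary
        sep = re.escape(group_char)
        # Either a grouped number or a short ungrouped one.
        integer = r"\d{1,%d}(?:%s\d{%d})*%s\d{%d}|\d{1,%d}" % (
            secondary, sep, secondary, sep, primary, primary)
    space = "[ \u00a0]" if " " in pattern else ""
    number = "(?:" + integer + ")"
    if pattern.startswith("%"):
        return re.compile("%" + space + number)
    return re.compile(number + space + "%")

def matches_any_pattern(value):
    """True if the value fits at least one of the listed patterns."""
    return any(pattern_to_regex(p).fullmatch(value) for p in PERCENT_PATTERNS)
```

Under these assumptions `matches_any_pattern("12,34,567%")` and `matches_any_pattern("% 45")` succeed, while `matches_any_pattern("45 percent")` does not.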

What this buys you is a simple way to allow implementers to support additional, locally valid formats, in a predictable and controlled way, if they wish. In fact, if in the future some local format pattern is discovered to be missing, new data can be added to the CLDR repository. In this way the range of supported formats can be extended in future without having to change the spec.

This comment was originally specific to the numeral formatting section, but we would recommend a similar addition for the other formats for which data exists in CLDR.

Would that work for you?

@iherman

Member

commented Jul 2, 2015

This would indeed work for me

@6a6d74

Contributor

commented Jul 2, 2015

+1 from me too.

@gkellogg

Member

commented Jul 2, 2015

+1 from me, and I can do a PR containing something like that, as well as implement it in my version.

@gkellogg gkellogg self-assigned this Jul 2, 2015
gkellogg added a commit that referenced this issue Jul 3, 2015
…ed in #617.

Note that if a CLDR format pattern is used, there is no way to know whether `pattern` is a regular expression or CLDR format. Also, additional symbols are not included in our namespace, and so will be lost through a hypothetical JSON-LD transformation.
@iherman

Member

commented Jul 5, 2015

I am sorry, but I do not think what we have now as a result of #651 answers the original concern, ie, @r12a's comment. The way I read this, what it says now is that

  • the pattern annotation is not really a traditional regex any more, because it contains symbols that are not covered by traditional regular expressions (this fact should be emphasized!)
  • in spite of what we discussed with @r12a, the user of a number format that is not covered by the current default values still MUST fill in the right pattern, otherwise he/she will not get what is intended.

I think the idea of @r12a was, instead, that implementations MAY understand different patterns beyond the ones we have listed, without any further ado, certainly without the necessity to fill in those (complex) pattern values.

I still do not see what the real problem is to just say, well, that, and not involve regular expressions (or the CLDR version thereof). I would propose to go back to where we were, and just add a 'may' clause accordingly...

Cc: @gkellogg @JeniT

@JeniT

Contributor

commented Jul 5, 2015

@iherman,

Yes, the pattern annotation is not a regex under this edit, it's a restricted version of a number pattern format as defined in TR35. The text says:

A number format pattern as defined in [UAX35]. Implementations MUST recognise number format patterns containing the symbols 0, #, the specified decimalChar (or "." if unspecified), the specified groupChar (or "," if unspecified), E, +, % and ‰. Implementations MAY additionally recognise number format patterns containing other special pattern characters defined in [UAX35].

and doesn't mention regexs at all. I'm not sure why it should be emphasised that this isn't a regex. We don't emphasise that other annotations that aren't regexs aren't regexs.

Note that the text explicitly says that implementations MAY recognise other pattern characters, which should give the flexibility in use of number formats that @r12a wants.
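As an illustration of the MUST/MAY split, the sketch below checks whether a pattern stays within the symbols that every implementation has to recognise (0, #, decimalChar, groupChar, E, +, % and the per-mille sign, per the quoted text). The function names are invented for this example, and an implementation remains free to accept more:

```python
def required_symbols(decimal_char=".", group_char=","):
    """Symbols that every implementation MUST recognise in a pattern."""
    return set("0#E+%\u2030") | {decimal_char, group_char}

def uses_only_required_symbols(pattern, decimal_char=".", group_char=","):
    """True if the pattern needs no optional (MAY) pattern characters."""
    return set(pattern) <= required_symbols(decimal_char, group_char)
```

A pattern such as `"#,##0.00%"` passes, whereas `"#,##0.00 ¤"` falls into the optional territory because of the space and the currency placeholder.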

The text also says that, in the case that no pattern annotation is provided but there is a non-null groupChar then:

Implementations may also recognise numeric values that are in any of the standard-decimal, standard-percent or standard-scientific formats listed in the Unicode Common Locale Data Repository.

which is the text that @r12a wanted (made a bit more explicit about which of the patterns implementations are allowed to recognise).

I did make a choice here that the lack of a format (specifically, no groupChar or pattern) indicated that implementations should only recognise the standard (XML Schema 1.1) formats for the datatype. This was to bring it into line with the treatment of all the other datatypes (eg boolean, date/time) where supplying a format expands the range of values that are acceptable to alternative human-readable values. I don't think we want numbers to be treated completely differently from other datatypes.

You seem to be suggesting removing the pattern annotation entirely? That would prevent people from being able to say that they expect all the numbers in the column to have two decimal places, for example, or all to be percentages. Both of these are useful things to be able to validate. So I don't think we should leave numbers without any pattern-based validation.
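The "two decimal places" check could be sketched roughly as follows. This assumes the usual reading of number format patterns, where each `0` after the decimal character is a required fraction digit and each `#` an optional one; it ignores exponents and grouping, and the helper names are hypothetical:

```python
PERCENT_SIGNS = "%\u2030"

def fraction_bounds(pattern, decimal_char="."):
    """Return (min, max) fraction digits implied by a pattern like '0.00'."""
    fraction = pattern.partition(decimal_char)[2].rstrip(PERCENT_SIGNS)
    required = fraction.count("0")
    return required, required + fraction.count("#")

def has_valid_fraction(value, pattern, decimal_char="."):
    """Check that the value's fraction length fits the pattern."""
    lo, hi = fraction_bounds(pattern, decimal_char)
    digits = value.partition(decimal_char)[2].rstrip(PERCENT_SIGNS + " ")
    if digits and not (digits.isascii() and digits.isdigit()):
        return False
    return lo <= len(digits) <= hi
```

With the pattern `"#,##0.00"`, the value `"12.50"` is accepted and `"12.5"` is rejected, which is the kind of column-level constraint described above.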

We could go back to using a regular expression instead of a number format pattern. I think this will leave us with longer term problems should we wish to adopt extended number formats in the future. If we do use a regex instead of a number format pattern, I suggest that we call the annotation regex rather than pattern so that we can add a pattern annotation holding a number format pattern in the future. I'm OK with this as a way forward.

@iherman

Member

commented Jul 5, 2015

On 05 Jul 2015, at 11:52 , Jeni Tennison notifications@github.com wrote:

> @iherman,
>
> Yes, the pattern annotation is not a regex under this edit, it's a restricted version of a number pattern format as defined in TR35. The text says:
>
> A number format pattern as defined in [UAX35]. Implementations MUST recognise number format patterns containing the symbols 0, #, the specified decimalChar (or "." if unspecified), the specified groupChar (or "," if unspecified), E, +, % and ‰. Implementations MAY additionally recognise number format patterns containing other special pattern characters defined in [UAX35].
>
> and doesn't mention regexs at all. I'm not sure why it should be emphasised that this isn't a regex. We don't emphasise that other annotations that aren't regexs aren't regexs.

What is a little bit disturbing is that the 'format' string can be any of:

  • a number format for numeric type
  • a special format ('Y|N') for boolean
  • date format pattern for dates
  • regexp for duration
  • regexp otherwise

This is all o.k. and precise, but it has to be dug out of the text. What I am looking for is some text that makes it clear to the reader that the value of 'format' depends on the expected datatype.

But I am just picky, obviously:-)

> Note that the text explicitly says that implementations MAY recognise other pattern characters, which should give the flexibility in use of number formats that @r12a wants.
>
> The text also says that, in the case that no pattern annotation is provided but there is a non-null groupChar then:
>
> Implementations may also recognise numeric values that are in any of the standard-decimal, standard-percent or standard-scientific formats listed in the Unicode Common Locale Data Repository.

Sorry, I missed that one. You are right

> which is the text that @r12a wanted (made a bit more explicit about which of the patterns implementations are allowed to recognise).

Yes.

> I did make a choice here that the lack of a format (specifically, no groupChar or pattern) indicated that implementations should only recognise the standard (XML Schema 1.1) formats for the datatype. This was to bring it into line with the treatment of all the other datatypes (eg boolean, date/time) where supplying a format expands the range of values that are acceptable to alternative human-readable values. I don't think we want numbers to be treated completely differently from other datatypes.
>
> You seem to be suggesting removing the pattern annotation entirely? That would prevent people from being able to say that they expect all the numbers in the column to have two decimal places, for example, or all to be percentages. Both of these are useful things to be able to validate. So I don't think we should leave numbers without any pattern-based validation.

As I said, I missed the sentence above.

> We could go back to using a regular expression instead of a number format pattern. I think this will leave us with longer term problems should we wish to adopt extended number formats in the future. If we do use a regex instead of a number format pattern, I suggest that we call the annotation regex rather than pattern so that we can add a pattern annotation holding a number format pattern in the future. I'm OK with this as a way forward.

O.k., leave it as is…

@gkellogg gkellogg removed their assignment Jul 6, 2015
JeniT added a commit that referenced this issue Jul 7, 2015
fixes #617 as well as some of the test issues in #664
@iherman

Member

commented Jul 8, 2015

Close by virtue of telco resolution: http://www.w3.org/2015/07/08-csvw-irc#T14-12-00

Cc @r12a

@iherman iherman closed this Jul 8, 2015
@iherman iherman added the NonWG label Jul 9, 2015