fn:format-number: relax restrictions on exponent-separator (possibly minus-sign, percent, per-mille) #1048

Closed
ChristianGruen opened this issue Feb 28, 2024 · 14 comments
Labels
Enhancement A change or improvement to an existing feature XQFO An issue related to Functions and Operators

Comments

@ChristianGruen
Contributor

The current rules for decimal formats are too restrictive (i.e., too focused on Anglo-Saxon formatting rules). The most prominent case is the Arabic exponent-separator “character”, which consists of two characters: عر (https://www.localeplanet.com/icu/ar/). The exponent separators of other locales are not restricted to a single character either. For example, se-NO uses ·10^.

When we include the ICU library in the analysis, we also find minus-sign, percent and per-mille properties that are longer than 1 character. Examples:

  • The minus-sign for he consists of U+200E and U+002D (U+200E is the Left-to-Right Mark).
  • The Arabic percent sign consists of U+066A and U+061C (U+061C is the “Arabic Letter Mark”).
  • The per-mille property of en-US-posix is 0/00.
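
One way to check such symbols in a given runtime is to ask the formatter for its parts. The following is a minimal JavaScript sketch, assuming an engine whose `Intl.NumberFormat` supports `formatToParts` and `notation: 'scientific'`. The helper names `localeSymbols` and `describe` are made up for illustration, and the reported values depend on the ICU/CLDR data shipped with the engine, so they may differ from the list above.

```javascript
// Hypothetical helper: collect the symbols a locale uses for minus, percent
// and the exponent separator, as reported by the engine's locale data.
function localeSymbols(locale) {
  const pick = (options, value, type) => {
    const part = new Intl.NumberFormat(locale, options)
      .formatToParts(value)
      .find((p) => p.type === type);
    return part ? part.value : undefined;
  };
  return {
    minusSign: pick({}, -1, 'minusSign'),
    percentSign: pick({ style: 'percent' }, 0.5, 'percentSign'),
    exponentSeparator: pick({ notation: 'scientific' }, 1234.5, 'exponentSeparator'),
  };
}

// List the code points of a symbol, so that invisible marks such as U+200E
// become visible.
function describe(symbol) {
  return Array.from(symbol ?? '', (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

for (const locale of ['en', 'he', 'ar']) {
  const symbols = localeSymbols(locale);
  for (const [name, value] of Object.entries(symbols)) {
    console.log(locale, name, describe(value).join(' '));
  }
}
```

Any symbol that prints more than one code point would violate the current single-character restriction.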
@ChristianGruen ChristianGruen added XQFO An issue related to Functions and Operators Enhancement A change or improvement to an existing feature labels Feb 28, 2024
@michaelhkay
Contributor

Use of multi-character representations in the formatted number is not a problem (as with NaN, Infinity). Use of such symbols in the picture string is potentially much more problematic, as we need to be sure that the picture string parses unambiguously. For example, using 0/00 as the per-mille symbol wouldn't work in the picture string, because 0 already has a meaning there as a digit sign.
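
The ambiguity can be illustrated with a small check. This is a sketch in JavaScript, not the parsing algorithm of the specification: it assumes the default decimal-format characters (`0`, `#`, `.`, `,`, `;`, `e`, `%`, `‰`) and simply reports which characters of a candidate symbol already play another role in the picture string. The name `pictureConflicts` is hypothetical.

```javascript
// Default characters with a special role in an fn:format-number picture
// string (assumed here; a decimal format declaration may override them).
const PICTURE_ROLES = new Map([
  ['0', 'zero-digit'],
  ['#', 'digit'],
  ['.', 'decimal-separator'],
  [',', 'grouping-separator'],
  [';', 'pattern-separator'],
  ['e', 'exponent-separator'],
  ['%', 'percent'],
  ['\u2030', 'per-mille'],
]);

// Hypothetical check: which characters of `symbol` would be read as
// something other than `ownRole` if the symbol appeared in a picture string?
function pictureConflicts(symbol, ownRole) {
  const conflicts = [];
  for (const ch of symbol) {
    const role = PICTURE_ROLES.get(ch);
    if (role !== undefined && role !== ownRole) {
      conflicts.push({ character: ch, role });
    }
  }
  return conflicts;
}

console.log(pictureConflicts('\u2030', 'per-mille')); // no conflicts
console.log(pictureConflicts('0/00', 'per-mille'));   // each "0" is also a zero-digit
```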

@ChristianGruen
Contributor Author

True; it would be easier if the picture string was not language-specific.

@michaelhkay
Contributor

A pragmatic albeit ugly solution would be to allow per-mille="‰;0/00", where the single-character value appearing before the semicolon is the value used in the picture string, and the multi-character value after the semicolon is the value used in the formatted output.

@ChristianGruen
Contributor Author

> A pragmatic albeit ugly solution would be to allow per-mille="‰;0/00", where the single-character value appearing before the semicolon is the value used in the picture string, and the multi-character value after the semicolon is the value used in the formatted output.

I think we should try to avoid manually tweaking decimal formats that are provided by existing languages and libraries. Instead, we should rather try to get closer to what other languages do…

// JavaScript
new Intl.NumberFormat('de').format(1234.56);

// Java, with default picture string
DecimalFormat.getInstance(Locale.GERMANY).format(1234.56);

…and simplify the most common requests. Indeed, I would guess that a syntax as simple as…

format-number(1234.56, 'de')

…is what would make most users more than happy.

Related to the original thread, it would possibly be better to introduce a mode in which the default pattern can be used to specify patterns. That’s what Java and other languages offer:

DecimalFormat df = (DecimalFormat) DecimalFormat.getInstance(Locale.GERMANY);
df.applyPattern("#,##0.00")
df.format(1234.56);

The difficult question remains what would be the easiest syntax…

@michaelhkay
Contributor

michaelhkay commented Mar 5, 2024

It would be quite a big departure to use locales for number formatting rather than explicit specification; and personally I'm not sure it would be a good move. I've always taken the view that the way you format dates and numbers is more likely to depend on what publisher you are working for than on what country you are in. For example the decision whether to use a period (.) or a middle dot (·) as a decimal separator is a question of editorial house-style, not a question of what language you speak. The notion that Norwegians always write exponential/scientific notation differently from the rest of Europe is clearly absurd.

(I'm afraid when it comes to localization, my views have always been a bit maverick. Partly because we live in an increasingly globalised|globalized world; and probably a consequence of growing up in a bilingual family).

@ChristianGruen
Contributor Author

> It would be quite a big departure to use locales for number formatting rather than explicit specification; and personally I'm not sure it would be a good move. I've always taken the view that the way you format dates and numbers is more likely to depend on what publisher you are working for than on what country you are in. For example the decision whether to use a period (.) or a middle dot (·) as a decimal separator is a question of editorial house-style, not a question of what language you speak. The notion that Norwegians always write exponential/scientific notation differently from the rest of Europe is clearly absurd.

The good thing is that locales don’t only stand for languages, but regions as well. In Germany, I would claim that the grouping and decimal separators are almost always the same, so there is hardly any reason to deviate from the standard de locale. In Switzerland, ' and . are popular, for which the de-CH locale is available. If a publisher or a company has very specific requirements, it’s still possible to provide custom separators and characters (provided that no custom code exists anyway to tackle the deficiencies of given standard functions).

What I can safely confirm is that many users seem to simply avoid format-number at the moment because it is too sophisticated and idiosyncratic. Thus, pragmatic fallbacks that are encountered frequently in practice look like…

string(1.23) ! replace(., '\.', ',')
string(1.23) => translate('.,', ',.')

…which work fine, but are probably not what we would want to encourage.
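
For readers less familiar with `translate`, the second fallback swaps both characters in a single pass, which two successive replacements cannot do safely. The sketch below emulates that behaviour in JavaScript; the function `translate` is a simplified stand-in written for this illustration, not a library API.

```javascript
// Simplified emulation of fn:translate: every character of `input` found in
// `from` is replaced by the character at the same position in `to`, or
// dropped if `to` is shorter; all other characters are copied unchanged.
function translate(input, from, to) {
  const source = Array.from(from);
  const target = Array.from(to);
  let result = '';
  for (const ch of input) {
    const index = source.indexOf(ch);
    if (index === -1) {
      result += ch;
    } else if (index < target.length) {
      result += target[index];
    }
  }
  return result;
}

// Single pass: "." and "," trade places.
console.log(translate('1,234.56', '.,', ',.')); // 1.234,56

// Two successive replacements lose the distinction between the separators.
console.log('1,234.56'.replace(/\./g, ',').replace(/,/g, '.')); // 1.234.56
```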

From an implementor’s perspective, we can benefit a lot from the extensive work that has been done by the Java and ICU folks. It’s straightforward to apply settings of the predefined locales to fn:format-number (as long as they don’t violate our current restrictions imposed by the picture string).

And sorry again for mixing up two topics; my thoughts won’t provide a solution for the original problem of multi-character symbols. An obvious option would be providing an option…

format-number(1234.56, '#,##0.00', 'de', options := map { 'default-picture': true() })

…but it would certainly be too verbose. I would favor a fixed character or prefix in the picture string (provided it doesn’t render the picture string ambiguous):

(: 1,2      :) format-number(1.2  , '=0.0'  , 'de')
(: ١٢٣٠٫٠٠e :) format-number(1.2e3, '=0.0e0', 'ar')
(: 10 0/00  :) format-number(.01,   '=0 ‰'  , 'en-US-posix')

Of course another option is to regard the cases as too specific. I have only encountered them by analysing the pre-defined decimal format symbols of Java and ICU, and I don’t know how many people have ever missed the possibility of outputting correctly formatted Arabic exponential signs.

@michaelhkay
Contributor

michaelhkay commented Mar 5, 2024

> In Germany, I would claim that the grouping and decimal separators are almost always the same

That might be true*, but I would also claim that whether you write 1e5 or 1×10^5 doesn't depend on whether you're in Germany, it depends on whether you are part of a programming community or a scientific community, and I happen to think that the I18N folks have completely failed to grasp that. Which is why people have so much trouble using Unicode collations to produce a decent index to a book: the experts live in a bubble disconnected from the real world.

*And even then it's not always true. A quick glance at the Frankfurter Allgemeine quickly found a reference to "iOS 17.3".

@ChristianGruen
Contributor Author

> the experts live in a bubble disconnected from the real world.

Yes, I can sense that.

> *And even then it's not always true. A quick glance at the Frankfurter Allgemeine quickly found a reference to "iOS 17.3".

It’s true: For version numbers of software, I would never use commas either (or any formatting functions at all). My intuitive rationale would be that .3 in a version number is not regarded as the fractional part of a floating-point number, but I can be wrong. @gimsieke You may be THE person to answer that (?)…

@michaelhkay
Contributor

Going off at a tangent here, but yes, with version numbers we would tend to think that 17.10 comes after 17.9. Let's hope we never have an XQuery version 4.10, because the current spec says it is the same thing as version 4.1.
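
The difference between the two readings can be shown in a few lines. This is a JavaScript sketch with a hypothetical helper `compareVersions` that compares dot-separated components as integers; it is only meant to illustrate the point, not to describe how any XQuery processor treats version strings.

```javascript
// Hypothetical helper: compare dotted version strings component by component.
// Returns a negative number, zero, or a positive number.
function compareVersions(a, b) {
  const left = a.split('.').map(Number);
  const right = b.split('.').map(Number);
  const length = Math.max(left.length, right.length);
  for (let i = 0; i < length; i++) {
    const diff = (left[i] ?? 0) - (right[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Read as decimal numbers, 17.10 is smaller than 17.9, and 4.10 equals 4.1.
console.log(Number('17.10') < Number('17.9')); // true
console.log(Number('4.10') === Number('4.1')); // true

// Read as version components, 17.10 comes after 17.9, and 4.10 after 4.1.
console.log(compareVersions('17.10', '17.9') > 0); // true
console.log(compareVersions('4.10', '4.1') > 0);   // true
```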

@michaelhkay
Contributor

> The good thing is that locales don’t only stand for languages, but regions as well.

I'm told that the distinction between "Samstag" and "Sonnabend" for Saturday is traditionally based on religious affiliation, which is only loosely correlated with region; but these days it certainly depends mainly on which publisher's house style you are following.

@gimsieke
Contributor

gimsieke commented Mar 5, 2024

> It’s true: For version numbers of software, I would never use commas either (or any formatting functions at all). My intuitive rationale would be that .3 in a version number is not regarded as the fractional part of a floating-point number, but I can be wrong. @gimsieke You may be THE person to answer that (?)…

Maybe not. I’m guilty of xs:decimal(system-property('xsl:version')) ge 3.0 myself… One can argue that for 3.0 specifically it doesn’t matter since all minor versions are covered by that expression.

@ChristianGruen
Contributor Author

> I'm told that the distinction between "Samstag" and "Sonnabend" for Saturday is traditionally based on religious affiliation, which is only loosely correlated with region; but these days it certainly depends mainly on which publisher's house style you are following.

Ironically, and as I learned just recently, the term “Sonnabend” is an Anglicism: It was brought to Germany by an Anglo-Saxon missionary centuries ago (while England itself eventually went for “Saturday”), and it became an official term much later in the officially non-religious GDR/DDR, which is why it’s still popular in Eastern/Northern areas of Germany. Today, as far as I know, “Samstag” is the official/recommended term, which I would assume most publishers use (one advantage is that the short version “Sa” differs from “So”, which is used for “Sonntag”), but… there may be exceptions (indeed, ICU won’t help you here, as it’s not possible to name an exact region for it; it only knows “Samstag”).

> Going off at a tangent here, but yes, with version numbers we would tend to think that 17.10 comes after 17.9. Let's hope we never have an XQuery version 4.10, because the current spec says it is the same thing as version 4.1.

;·) Let’s see if we can avoid it.

@michaelhkay
Contributor

michaelhkay commented Jun 2, 2024

Proposal.

For decimal format properties that define characters used both in the picture string and the result string, specifically decimal-separator, grouping-separator, exponent-separator, percent and per-mille (but not zero-digit), we allow the format property to take the value "x:y", where x is a single character indicating the marker used in the picture string, and y is an arbitrary string indicating the form used in the result string.

For example, if the percent property has the value "%:pc" then format-number(0.358, '#0.0%') produces "35.8pc".

For minus-sign, we remove the constraint that the value must be a single character.
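
To make the proposed "x:y" convention concrete, here is a rough JavaScript sketch. The functions `parseProperty` and `formatPercent` are invented for illustration, the handling of edge cases (for example a property whose marker is itself `:`) is an assumption rather than something the proposal states, and the percent formatting is hard-wired to one fraction digit instead of interpreting a picture string.

```javascript
// Hypothetical parser for a decimal-format property value.
// "%"    -> marker "%", rendition "%"
// "%:pc" -> marker "%", rendition "pc"
function parseProperty(value) {
  const chars = Array.from(value); // split by code point, not UTF-16 unit
  if (chars.length === 1) {
    return { marker: chars[0], rendition: chars[0] };
  }
  if (chars.length >= 3 && chars[1] === ':') {
    return { marker: chars[0], rendition: chars.slice(2).join('') };
  }
  throw new RangeError(`Invalid property value: ${value}`);
}

// Toy stand-in for the picture '#0.0%': the marker is what would be looked
// up in the picture string, the rendition is what ends up in the output.
function formatPercent(number, percentProperty) {
  const { rendition } = parseProperty(percentProperty);
  return (number * 100).toFixed(1) + rendition;
}

console.log(formatPercent(0.358, '%:pc')); // 35.8pc
console.log(formatPercent(0.358, '%'));    // 35.8%
```

Keeping the marker to a single character is what lets the picture string stay unambiguous, while the rendition is free to be any string.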

@michaelhkay michaelhkay added the PR Pending A PR has been raised to resolve this issue label Jun 2, 2024
@michaelhkay
Contributor

The PR was accepted so the issue is closed.

@michaelhkay michaelhkay removed the PR Pending A PR has been raised to resolve this issue label Jun 23, 2024
3 participants