Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fn:sort, and XSLT and XQuery sorting, should use transitive comparisons #866

Closed
michaelhkay opened this issue Dec 3, 2023 · 6 comments · Fixed by #881
Closed

fn:sort, and XSLT and XQuery sorting, should use transitive comparisons #866

michaelhkay opened this issue Dec 3, 2023 · 6 comments · Fixed by #881
Labels
Enhancement A change or improvement to an existing feature Tests Needed Tests need to be written or merged XQFO An issue related to Functions and Operators

Comments

@michaelhkay
Copy link
Contributor

michaelhkay commented Dec 3, 2023

We have addressed the question of non-transitivity of equality matching in distinct-values(), and in XSLT and XQuery grouping, but the same issue exists for sorting. Currently fn:sort, as well as XSLT and XQuery sorting, rely on the "lt" operator for comparing values including mixed numerics such as doubles and decimals. Because this promotes to double, it is capable of losing precision, and is therefore non-transitive. Most sort algorithms rely on the supplied comparison function being transitive, and if it isn't, then undefined failures may occur including non-termination.

One particular quirk (which led me here) is that fn:highest and fn:lowest start by using fn:sort semantics to put the values in order, and then rely on fn:deep-equal semantics to find the values that are "equal highest" or "equal lowest". But fn:sort and fn:deep-equal have different ways of deciding whether two values are equal: decimal 1.2 and double 1.2 are equal for fn:sort, but not for fn:deep-equal.

@michaelhkay
Copy link
Contributor Author

michaelhkay commented Dec 3, 2023

It's so tempting to say we should be using a transitive order relation for everything including the "eq" and "lt" operators.

But this causes problems: currently the decimal 1.2 and the double 1.2e0 compare equal, and they would then compare non-equal. Because the simple notation "1.2" represents a decimal, while atomization and arithmetic operators default to producing a double, unintentional comparisons between decimals and doubles are very common.

@michaelhkay
Copy link
Contributor Author

michaelhkay commented Dec 5, 2023

Specific proposal: I propose that in XSLT xsl:sort (and xsl:perform-sort, xsl:merge, etc), in XQuery order by, and in fn:sort, fn:min, fn:max, fn:highest, and fn:lowest, numeric values should be compared using the new function fn:numeric-compare($v1, $v2) which returns -1, 0, or +1 depending on the exact mathematical relationship between the numeric values (or, if you like, conversion to xs:decimal with no limit on precision). In this ordering NaN equals NaN and precedes -INF, followed by all ordinary values, followed by +INF. Note that for numeric operands fn:numeric-compare($v1, $v2) returns 0 if and only if fn:atomic-equal($v1, $v2) returns true.

The behaviour of the lt operator (and its backing function, op:numeric-less-than) is not changed.

@ChristianGruen ChristianGruen added XQFO An issue related to Functions and Operators Enhancement A change or improvement to an existing feature labels Dec 6, 2023
@michaelhkay
Copy link
Contributor Author

Note in passing: the 3.1 specifications (and the current 4.0 drafts) for fn:min and fn:max have a "properties" section that refers to the arity-0 and arity-1 forms of these functions, which is nonsense; the available arities are 1 and 2.

@ChristianGruen
Copy link
Contributor

I definitely see the need to make the functions transitive.

We’d then have fn:deep-equal, fn:compare, fn:atomic-equal, fn:numeric-compare, … Which makes sense from a technical point of view, but it could confuse all those people who don’t belong to the exclusive group of implementors.

I may be overlooking something, but couldn’t we simply drop fn:atomic-equal and fn:numeric-compare and generalize fn:compare to accept arbitrary atomic types? fn:deep-equal could then be simplified as well. The following rules could be dropped…

d. One of the following conditions is true:
i. If both $i1 and $i2 are instances of xs:string or xs:untypedAtomic, equal-strings($i1, $i2, $collation, $options) returns true.
ii. If both $i1 and $i2 are instances of xs:date, xs:time or xs:dateTime, $i1 eq $i2 returns true.
iii. Otherwise, fn:atomic-equal($i1, $i2) returns true.

…and replaced with:

fn:compare(normalize($i1), normalize($i2)) returns true.

…with equal-strings being replaced with:

declare function normalize(
  $item     as xs:anyAtomicType,
  $options  as map(*)
) as xs:string {
  if($item instance of xs:string) then (
    $string
    ! (if($options?whitespace = "normalize")) then normalize-unicode(., $options?normalization-form)  else .)
    ! (if($options?normalize-space) then normalize-space(.) else .)
  ) else $item
}

The signature of fn:compare could be:

fn:compare(
  $value1     as xs:anyAtomicType?,
  $value2     as xs:anyAtomicType?,
  $collation  as xs:string?	    := fn:default-collation()
) as xs:integer?

For types other than strings, we could adopt the rules from fn:atomic-equal and the ones proposed in #881.

Another positive effect would be that fn:compare (which has no reference to strings in its name) could then be used for arbitrary atomic items. It would also be helpful for the upcoming fn:sort-with function: fn:sort-with($sequence, compare#2)) could be used for strings, numbers, dates, etc.

@michaelhkay
Copy link
Contributor Author

michaelhkay commented Dec 6, 2023

couldn’t we simply drop fn:atomic-equal and fn:numeric-compare and generalize fn:compare to accept arbitrary atomic types

Firstly, atomic-equal was specifically designed to be context-free, which makes it messy to combine it into a function that uses collations.

We could generalise atomic-equal to do an ordered comparison over all data types, in which case it could embrace numeric-compare, but that's providing functionality that I'm not sure we need: we're happy, I think, for sorting to be context-dependent.

Folding fn:numeric-compare into fn:compare is more feasible, but you've then got one function that does two different jobs; there's no type safety to ensure that the arguments have compatible types, and you need ad-hoc rules to say which combinations of arguments are valid and which aren't. The merit of two separate functions is that each is a total function over the domain implied by its signature.

Another option is to make numeric-compare private (i.e. put it in "op" namespace). I put it in the fn namespace because I think there are valid use-cases for applications to use it directly; that will be especially true if we introduce fn:sort-with.

I did wonder while writing this whether we should combine the op:xxx-less-than and op:xxx-equal functions for other data types into op:xxx-compare functions, but I think that's an unrelated task, and is purely editorial.

As for deep-equal, I wish we could separate it into two functions, one which compares items and one which compares sequences taking an item-comparison function as an argument; but again that's a separate project, and one that is hampered by compatibility concerns.

@ChristianGruen
Copy link
Contributor

Folding fn:numeric-compare into fn:compare is more feasible, but you've then got one function that does two different jobs; there's no type safety to ensure that the arguments have compatible types, and you need ad-hoc rules to say which combinations of arguments are valid and which aren't.

I can’t agree here; those seem technical concerns to me that can be solved (similar to arithmetic expressions, for example, ad-hoc rules are only necessary for those cases in which static types are not available). From a user perspective, it’s not intuitive at all to have multiple compare functions, i.e., to have fn:compare for strings and fn:numeric-compare for numbers. We neither have fn:date-compare nor do we have fn:numeric-max, fn:duration-sum, fn:numeric-sort, etc. and I’m convinced no one misses these functions. The same applies to comparison operators: < or lt are defined for different types; there is no specific operator for strings or numbers.

Apart from the already mentioned advantages for a generalized fn:compare function, it could get easier to define other comparisons in the future.

Firstly, atomic-equal was specifically designed to be context-free, which makes it messy to combine it into a function that uses collations.

I see. Perhaps fn:atomic-equal could be made private: Maps seem to be the only language feature that requires a context-free comparison; and I don’t think we should confront people with yet another comparison function when they already have =, eq, fn:deep-equal and fn:compare.

@michaelhkay michaelhkay added PR Pending A PR has been raised to resolve this issue Tests Needed Tests need to be written or merged labels Dec 6, 2023
@ndw ndw closed this as completed in #881 Dec 19, 2023
ChristianGruen added a commit to qt4cg/qt4tests that referenced this issue Dec 20, 2023
ChristianGruen added a commit to qt4cg/qt4tests that referenced this issue Dec 20, 2023
ChristianGruen added a commit to qt4cg/qt4tests that referenced this issue Dec 20, 2023
@ChristianGruen ChristianGruen removed the PR Pending A PR has been raised to resolve this issue label Mar 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement A change or improvement to an existing feature Tests Needed Tests need to be written or merged XQFO An issue related to Functions and Operators
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants