EBV 4.0 #817

Closed
ChristianGruen opened this issue Nov 6, 2023 · 11 comments
Labels
Enhancement A change or improvement to an existing feature XPath An issue related to XPath XQFO An issue related to Functions and Operators

Comments

@ChristianGruen
Contributor

Yes, I dare to question the semantics of effective boolean values. The reason is that I never learned to fully like them. It seems obvious where the rules come from, and why they have been reasonable in previous versions of the language. From today’s perspective, I think there’s really some need to simplify and unify the rules, and I believe it’s possible with little effort and without endangering backward compatibility (provided that we are willing to drop errors and return results).

Some examples of the somewhat strange nature of the current rules:

  • boolean((<_>x</_>, <_>y</_>)) returns true, whereas boolean(('x', 'y')) raises an error.
  • boolean(xs:NCName('x')) returns true, whereas boolean(xs:QName('x')) raises an error.
  • boolean((<a/>, 1)) and boolean((1, <a/>)) may either return true or raise an error, depending on the implementation.
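For reference, these cases can be tried out with a small XQuery 3.1 sketch. `local:try-ebv` is a hypothetical helper (it is not part of any specification) that returns the EBV as a string, or the local name of the error code if fn:boolean fails. Passing the input through a function parameter is only intended to keep a processor from reporting the error statically.

```xquery
xquery version "3.1";

(: Some processors predeclare the err prefix; declaring it keeps the sketch portable. :)
declare namespace err = "http://www.w3.org/2005/xqt-errors";

(: Hypothetical helper: the EBV as a string, or the error code if fn:boolean raises one. :)
declare function local:try-ebv($input as item()*) as xs:string {
  try {
    string(boolean($input))
  } catch * {
    local-name-from-QName($err:code)
  }
};

local:try-ebv((<_>x</_>, <_>y</_>)),  (: the first item is a node :)
local:try-ebv(('x', 'y')),            (: two strings :)
local:try-ebv(xs:NCName('x')),        (: xs:NCName is derived from xs:string :)
local:try-ebv(xs:QName('x')),         (: xs:QName is not covered by the EBV rules :)
local:try-ebv((<a/>, 1)),             (: node first :)
local:try-ebv((1, <a/>))              (: number first, more than one item :)
```

A processor that follows the XPath 3.1 rules should return true, FORG0006, true, FORG0006, true, FORG0006. For the last two calls, only the first item decides the outcome.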

I believe it will make much more sense to

  1. check all values of the input equally (in analogy to the existential semantics of general comparisons), and
  2. use existence checks for more types instead of raising a clueless error.

The semantics would be tidied up a lot; it could look like this…

declare function ebv($input as item()*) as xs:boolean {
  some $item in $input satisfies typeswitch($item) {
    case xs:untypedAtomic | xs:string | xs:anyURI  return $item != ''
    case xs:numeric                                return $item != 0
    case xs:boolean                                return $item
    default                                        return true()
  }
};

…or, if we include more types, like this:

declare function ebv($input as item()*) as xs:boolean {
  some $item in $input satisfies typeswitch($item) {
    case xs:untypedAtomic | xs:string | xs:anyURI  return $item != ''
    case xs:numeric                                return $item != 0
    case xs:boolean                                return $item
    case xs:base64Binary                           return $item != xs:base64Binary('')
    case xs:hexBinary                              return $item != xs:hexBinary('')
    case array(*)                                  return array:size($item) != 0
    case map(*)                                    return map:size($item) != 0
    default                                        return true()
  }
};

(If we believe that it’s too progressive to accept all types, we could still raise an error for some specific types… although I don’t think that anyone would benefit from this choice).

As a result, EBV checks could also be used to check more than one item:

(: true if at least one tokenized string is non-empty :)
if(tokenize('a/', '/')) then ...
(: true if at least one number is unequal to 0 :)
if($numbers) then ...
(: true if at least one Boolean is true :)
if(false(), true(), true()) then ...

Nothing would change for the classical EBV checks: if($node/*), if($x = $y), if($ok), …

Regarding “1. check all values of the input equally”, one could argue that this might affect performance. I don’t actually think so: For node sequences, it will still be sufficient to retrieve only the first item. For mixed-type sequences, errors were raised in the past.
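The performance argument can be sketched as follows, written in XQuery 3.1 syntax (typeswitch without braces) so that it runs in current processors. The function names `local:item-ebv`, `local:ebv` and `local:ebv-fast` are hypothetical. The sketch assumes that NaN stays false, as in fn:boolean, which the shorter comparison `$item != 0` would not guarantee. Because every node is true under the proposed rule, a leading node already decides the result, so the fast path never has to look at the rest of the sequence.

```xquery
xquery version "3.1";

(: Proposed per-item rule; any item not listed explicitly counts as true. :)
declare function local:item-ebv($item as item()) as xs:boolean {
  typeswitch ($item)
    case xs:untypedAtomic | xs:string | xs:anyURI return string($item) != ''
    (: $item = $item is false only for NaN, which keeps NaN false :)
    case xs:numeric return $item = $item and $item != 0
    case xs:boolean return $item
    default return true()
};

(: Proposed existential rule: true if at least one item is true. :)
declare function local:ebv($input as item()*) as xs:boolean {
  some $item in $input satisfies local:item-ebv($item)
};

(: Same result, but a leading node is decided without visiting further items. :)
declare function local:ebv-fast($input as item()*) as xs:boolean {
  if (head($input) instance of node()) then true()
  else local:ebv($input)
};

local:ebv-fast((<a/>, 1)) = local:ebv((<a/>, 1)),
local:ebv-fast((0, '', <a/>)) = local:ebv((0, '', <a/>)),
local:ebv-fast(()) = local:ebv(())
```

All three comparisons yield true. Whether a processor evaluates the `some` expression lazily is implementation-dependent, so the explicit fast path only illustrates that the node case needs no extra work.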

The resulting EBV could be easily combined with revised predicate semantics (#816).

@ChristianGruen ChristianGruen added XPath An issue related to XPath XQFO An issue related to Functions and Operators Enhancement A change or improvement to an existing feature labels Nov 6, 2023
@michaelhkay
Contributor

Many of our users will have spent many frustrated hours learning the JavaScript rules, and I think it's important we remain consistent with them. At present we are well aligned - except that in JS, if it's not one of a small number of falsy things, then it's truthy, whereas with our rules things like empty arrays and maps are errors rather than truthy. I'm really not keen on making the rules more complicated, especially if it leads to outcomes that are different from JS.

@ChristianGruen
Contributor Author

> Many of our users will have spent many frustrated hours learning the JavaScript rules, and I think it's important we remain consistent with them. At present we are well aligned - except that in JS, if it's not one of a small number of falsy things, then it's truthy, whereas with our rules things like empty arrays and maps are errors rather than truthy. I'm really not keen on making the rules more complicated, especially if it leads to outcomes that are different from JS.

I don’t see those similarities between JavaScript and XPath. The only thing that is close is the treatment of strings, numbers and booleans, and we would keep this anyway.

The main difference, and the one that regularly causes confusion, is the varying treatment of node sequences and other sequences, and it’s hard for me to grasp why this seems necessary today. The confusing examples that I’ve stated in the initial comment have no counterpart in JavaScript, and I’m convinced we could simplify the rules here by treating all items of a sequence identically, and achieving a more intuitive result. In addition, we can also sort out different behavior across implementations for heterogeneous sequences:

  • Both boolean((<_>x</_>, <_>y</_>)) and boolean(('x', 'y')) would return true.
  • Both boolean(xs:NCName('x')) and boolean(xs:QName('x')) would return true. It makes sense to always return true for xs:QName, because a QName can never be empty.
  • Both boolean((<a/>, 1)) and boolean((1, <a/>)) will return true – no matter which implementation is used.

For function items and arrays (positional, associative), my proposal would bring XPath and JS even closer together, by getting rid of the error message, which I believe serves no one in practice. It would be much easier to use if($array) then ... instead of if(exists($array)) then .... For arrays, we could certainly choose the JS way and return true without checking the contents (provided that we believe that sequences and arrays are different enough). Same for maps.

@michaelhkay
Contributor

> Both boolean((<a/>, 1)) and boolean((1, <a/>)) will return true – no matter which implementation is used.

I'm not sure why you draw out this case as being implementation-dependent. Currently the first case is unambiguously true, the second case is unambiguously an error.

> I don’t see those similarities between JavaScript and XPath.

In JavaScript any array or object is truthy, regardless of its contents. I'm not sure what you're proposing for arrays and maps, but for sequences you're proposing something very different, and I'm still not sure exactly what. Or what the use cases are.

Read JavaScript tutorials, and you find people advising everyone to steer clear of this minefield. With XPath too, a lot of people suggest using functions like exists() to avoid relying on the complex EBV rules. If we make them even more complex, there will be even more advice telling users not to go there.

@ChristianGruen
Contributor Author

> I'm not sure why you draw out this case as being implementation-dependent. Currently the first case is unambiguously true, the second case is unambiguously an error.

Sorry for that. Indeed the specification states clearly that it's the first item that’s responsible for the result. I got misled by one implementation (well, not ours) that behaves differently.

I think/hope we can agree that it's at least strange that the order of the input defines here what is going to happen. I cannot think of any good reason for the current behavior for sequences of mixed type (apart from maybe historical reasons and algebra with XPath 1.0).

> In Javascript any array or object is truthy, regardless of its contents. I'm not sure what you're proposing for arrays and maps,

In my initial proposal, I suggested checking the map/array size and returning true or false. I’d be open to the decision to always return true, in alignment with JS. The EBV of function items would always be true (similar to JS).

> but for sequences you're proposing something very different, and I'm still not sure exactly what.

I hoped that the equivalent XQuery code was self-explanatory. I think it's questionable to base the result on the first item (which can easily change if data is reordered), and to raise errors for sequences… unless the first item is a node. I don't know any other language that behaves similarly.

I really don't believe that the proposed rules would make EBV more complex. Quite the contrary, I think that the new rules would be more consistent and easier to explain and teach: For each item in the input sequence, there's a well-defined rule to get true or false. If at least one item matches, the EBV is true.

@michaelhkay
Contributor

> I cannot think of any good reason for the current behavior for sequences of mixed type (apart from maybe historical reasons and algebra with XPath 1.0).

In XPath 1.0 there were essentially four types: string, number, boolean, and node-set, and EBV was defined for each of them. When the data model was extended in 2.0, the rules had to be compatible with the 1.0 rules, but also to handle mixed sequences, and there was a significant amount of debate on the best way of doing this. One of the concerns, if I remember rightly, was that the revised rules should not make it necessary to read an entire sequence before making a decision (so if(//x) could still be decided on finding the first x element). But I think there was also a strong view that the rules should not become too unwieldy, and it was better to make most cases (other than 1.0-compatible cases) into errors than to have very complex rules that people would have trouble remembering.

@ChristianGruen
Contributor Author

ChristianGruen commented Nov 7, 2023

Thanks for the discussion. I think it helps to look at two aspects of the EBV computation separately:

  1. processing the input sequence and
  2. processing single items.

For 1., the current rules are:

declare function ebv($input as item()*) as xs:boolean {
  if (empty($input)) then false()
  else if(head($input) instance of node()) then true()
  else single-ebv($input)
};

I think it would be more intuitive to get rid of any special-casing and use existential semantics instead, so I would propose:

declare function ebv($input as item()*) as xs:boolean {
  some $item in $input satisfies single-ebv($item)
};

This way, the result won’t change if the input sequence is reordered. More importantly, all item types would have “equal rights”. This feels important to me, as the language has evolved a lot since XPath 1.0, which was very node-centric. I really can’t find a good reason today for treating node sequences differently to sequences of other types.

For 2., we currently have…

declare function single-ebv($item as item()) as xs:boolean {
  typeswitch($item) {
    case xs:untypedAtomic | xs:string | xs:anyURI  return $item != ''
    case xs:numeric                                return $item != 0
    case xs:boolean                                return $item
    default                                        return error(xs:QName('err:FORG0006'))
  }
};

It could possibly be:

declare function single-ebv($item as item()) as xs:boolean {
  typeswitch($item) {
    case xs:untypedAtomic | xs:string | xs:anyURI  return $item != ''
    case xs:numeric                                return $item != 0
    case xs:boolean                                return $item
    (: to be discussed... :)
    case xs:base64Binary                           return $item != xs:base64Binary('')
    case xs:hexBinary                              return $item != xs:hexBinary('')
    case array(*)                                  return array:size($item) != 0
    case map(*)                                    return map:size($item) != 0
    default                                        return true()
  }
};

We should get rid of the raised error. I can see it was reasonable in the past, but I don’t believe it’s suitable today. If we want to align sequences and arrays, it just makes no sense to me that boolean(()) returns false and boolean([]) raises an error. If we think it does – apart from doing what we did in the past – we should find good arguments for it. Same for boolean(xs:QName('x'))… well, I’m repeating myself.

PS: I wondered why “…and algebra” slipped into my sentence. Should probably have been “…and alignment”.

@michaelhkay
Contributor

Starting from first principles, I can certainly see why you want boolean([]) and boolean(map{}) to be false, but the fact that both are true in JavaScript feels like we're just making life too hard for our users.

@ChristianGruen
Contributor Author

> Starting from first principles, I can certainly see why you want boolean([]) and boolean(map{}) to be false, but the fact that both are true in JavaScript feels like we're just making life too hard for our users.

If we believe that JavaScript users are (one of) our main target groups today, we should at least return true() instead of an error (which is what I suggested in the first proposal in the first comment of this issue).

@michaelhkay
Contributor

Another point to bear in mind: in XSLT predicates, failure means no match. So in 3.0 match="person[*!string-length(.)]" will be a no-match (not an error) if person has more than one child element. Under your rules it would be a match if any child has a non-zero string length. It's unlikely anyone is doing this deliberately, but dormant template rules that never match anything often lie around in legacy code.
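The compatibility concern can be approximated in XQuery 3.1. This is only a sketch: `local:matches-now` and `local:matches-proposed` are hypothetical helpers that model the EBV part of the predicate, not XSLT pattern matching itself. The first treats a dynamic error as "no match", as described above; the second applies the proposed existential rule to the string lengths. Both sample elements have two children, because a predicate that yields a single number is treated as a positional test, which this sketch does not model.

```xquery
xquery version "3.1";

(: Models the 3.0 behaviour: an error while computing the EBV counts as "no match". :)
declare function local:matches-now($person as element()) as xs:boolean {
  try {
    boolean($person/* ! string-length(.))
  } catch * {
    false()
  }
};

(: Models the proposed rule: true if at least one child has a non-zero string length. :)
declare function local:matches-proposed($person as element()) as xs:boolean {
  some $length in $person/* ! string-length(.) satisfies $length != 0
};

let $filled := <person><first>Ann</first><last>Lee</last></person>
let $empty  := <person><first/><last/></person>
return (
  local:matches-now($filled), local:matches-proposed($filled),
  local:matches-now($empty),  local:matches-proposed($empty)
)
```

The result should be false, true, false, false: the element with two non-empty children is the case in which a previously dormant rule would start to match.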

@ChristianGruen
Contributor Author

> Another point to bear in mind: in XSLT predicates, failure means no match. So in 3.0 match="person[*!string-length(.)]" will be a no-match (not an error) if person has more than one child element. Under your rules it would be a match if any child has a non-zero string length. It's unlikely anyone is doing this deliberately, but dormant template rules that never match anything often lie around in legacy code.

Oh dear; yes, that sounds like a hard nut to crack. If we think this through, it basically prevents us from turning any error in the language into a success (try/catch is regularly used when people find it too hard to assess what exactly is supposed to go wrong in more complex code).

@ChristianGruen
Contributor Author

ChristianGruen commented Nov 12, 2023

I’m grateful for the discussion! I’ll open another issue with a narrower focus → #829.
