Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Short-circuiting functions, function-arity guards and lazy hints #299

Closed
dnovatchev opened this issue Jan 2, 2023 · 46 comments
Closed

Short-circuiting functions, function-arity guards and lazy hints #299

dnovatchev opened this issue Jan 2, 2023 · 46 comments
Labels
Feature A change that introduces a new feature XPath An issue related to XPath

Comments

@dnovatchev
Copy link
Contributor

dnovatchev commented Jan 2, 2023

I. Shortcutting and lazy hints

Let us have this expression:

let $f := function($arg1 as item()*, $arg2 as item()*) as function(item()*) as item()*
             {  (: Some code here :) }
  return
    $f($x) ($y)

Evaluating $f($x) produces a function. The actual arity of this resulting function can be any number N >= 0 :

  • If N > 1 there would be arity mismatch error, as only one argument $y is provided in the expression.

  • If N = 1 the final function call can be evaluated, and the argument $y must be evaluated, or

  • If N = 0, then $y is unneeded and can safely be ignored according to the updated Coercion Rules / Function Coercion in Xpath 4.0.

Because a possibility exists to be able to ignore the evaluation of $y, it is logical to delay the evaluation of $y until the actual arity of $f($x) is known.

The current XPath 4.0 evaluation rules do not require an implementation to base its decision whether or not to evaluate $y on the actual arity of the function produced by $f($x), thus at present an implementation could decide to evaluate $y regardless of the actual arity of the function produced by $f($x).

This is where a lazy hint comes: it indicates to the XPath processor that it is logical to make the decision about evaluation of $y based on the actual arity of the function returned by $f($x).

A rewrite of the above expression using a lazy hint looks like this:

let $f := function($arg1 as item()*, $arg2 as item()*) as function(item()*) as item()*
             {  (: Some code here :) }  
  return
    $f($x) (lazy $y)

Here is one example of a function with short-cutting and calling it with a lazy hint:

let $fAnd := function($x as xs:boolean, $y as xs:boolean) as xs:boolean
             {
                let $partial := function($x as xs:boolean) as function(xs:boolean) as  xs:boolean
                                {
                                  if(not($x)) then ->(){false()}
                                              else ->($t) {$t}
                                }
                 return $partial($x)($y)
             }
  return
     $fAnd($x (: possibly false() :), lazy $SomeVeryComplexAndSlowComputedExpression)

Without the lazy hint in the above example, it is perfectly possible that an XPath implementation, unrestricted by the current rules, would evaluate $SomeVeryComplexAndSlowComputedExpression - something that is unneeded and could be avoided completely.

Formal syntax and semantics

  1. The lazy keyword should immediately precede any argument in a function call. If specified, it means that it is logical to make the decision about evaluation of this argument based on the actual arity of the function in this function call.

    Based on this definition, it follows that lazy $argK implies lazy for all arguments following $argK in the function call. Thus specifying more than one lazy hint within a given function call is redundant and an implementation may report this redundancy to the user.

    The scope of a lazy keyword specified on an argument is this and all following arguments of (only) the current function call.

  2. It is possible to specify a lazy keyword that is in force for the respective argument(s) of all function calls of the given function. To do this, the lazy keyword must be specified immediately preceding a parameter name in the function definition of that function.

    For example, if the function $f is specified as:

    let $f := function($arg1 as item()*, lazy $arg2 as item()*, $arg3 as item()*, $arg4 as item()* ) 
              { (: some code here:) }
      return
         $someExpression

    Then any call of $f in its definition scope that has the form:

    $f($x, $y, $z, $t)

    is equivalent to:

    $f($x, lazy $y, $z, $t)
  3. It is possible to specify the lazy keyword immediately preceding a function definition. This instructs the XPath processor that any call of this function is only necessary to be evaluated if the function is actually called during the evaluation of the expression that contains this function call.

    For example:

    let $complexComputation := lazy function($x, $y) {$x + $y}, (: Make it as complex as you want ... :)
         $someCondition := function()
            {
                let $date := current-date()
                  return
                      month-from-date($date) eq 2
                    and 
                     day-from-date($date) eq 29 
           }
      return if($someCondition()) 
               then $complexComputation(2, 3)
               else 0

    Specifying the lazy keyword in the function definition for $complexComputation can save significant computing resources, because the programmer knows that $someCondition() is true during only a single day in any 4-years period.

II.fn:lazy

Summary

Applied on a single argument that can be any expression. Lazily returns its argument expression.

Signature

lazy fn:lazy( 
        $expression as item()*
) as item()*

Properties

This function is deterministic, context-independent, focus-independent

Rules

The semantics of the function is strictly defined below:

let $lazyFunction := lazy fn:identity#1
   return
      (: AnyExpression here :)

Any expression Q of the form:

Q(E1, lazy(E2))

where E1 and E2 are subexpressions of Q, must be evaluated by the Processor in two steps:

  1. Substitute the expression

    Q(E1, lazy(E2))

    with:

    Q(E1, ?) (lazy E2)

  2. Evaluate the latter according to the rules for a lazy argument

Example

We can use almost the same example as above, but here $complexComputation is defined without the lazy keyword and thus is not a lazy function. To have $complexComputation evaluated lazily, we call the lazy() function, passing $complexComputation to it:

let $complexComputation := (: no lazy here :) function($x, $y) {$x + $y}, (: Make it as complex as you want ... :)
     $someCondition := function()
        {
            let $date := current-date()
              return
                  month-from-date($date) eq 2
                and 
                 day-from-date($date) eq 29 
       }
  return 
        $someCondition() and lazy( $complexComputation(2, 3))

Here the expression Q is:

$someCondition() and lazy( $complexComputation(2, 3))

This is the same as:

fn:op("and")($someCondition(), lazy( $complexComputation(2, 3))

According to the Rules above, the processor converts this to:

fn:op("and")($someCondition(), ?) (lazy( $complexComputation(2, 3)) )

$someCondition() is evaluated and if its value is false(), then the expression to be evaluated is:

fn:op("and")(false(), ?) (lazy( $complexComputation(2, 3)) )

As fn:op("and")(false(), ?) by definition is function() {false()}. then the final result false() is produced and the unnecessary argument $complexComputation(2, 3) is not evaluated at all.

III. A function's arity is a guard for its arguments

Let us have a function $f defined as below:

let $f := function($arg1 as item()*, $arg2 as item()*, …, $argN as item()*)
   as function(item()*, item()*, …, item()*) as item()*
     {
       if($cond0($arg1))       then -> () { 123 }
        else if($cond1($arg1)) then -> ($Z1 as item()*) {$Z1}
        else if($cond2($arg1)) then -> ($Z1 as item()*, $Z2 as item()*) {$Z1 + $Z2}
        (:    .        .        .        .         .        .        .         .  :)
        else if($condK($arg1)) then -> ($Z1 as item()*, $Z2 as item()*, …, $Zk as item()*)
                                       {$Z1 + $Z2 + … + $Zk}
        else ()
     }
  return
     $f($y1, $y2, …, $yN) ($z1, $z2, …, $zk)

A call to $f returns a function whose arity may be any of the numbers: 0, 1, …, K.

Depending on the arity of the returned function (0, 1, …, K), the last (K, K-1, K-2, …, 2, 1, 0) arguments of the function call:

$f($y1, $y2, . . . , $yN) ($z1, $z2, . . . , $zk)

are unneeded and it is logical that they would not need to be evaluated.

So, the actual arity of the result of calling $f is a guard for the arguments of a call to this function-result.

Thus, one more bullet needs to be added to [2.4.5 Guarded Expressions] https://qt4cg.org/specifications/xquery-40/xpath-40.html#id-guarded-expressions), specifying an additional guard-type:

  • In an expression of the type E(A1, A2, ..., AN) any of the arguments AK is guarded by the condition
    actual-arity(E) ge K. This rule has the consequence that if the actual arity of E() is less than K then if any argument Am (where m >= K) is evaluated, this must not raise a dynamic error. An implementation may base on the actual arity of E() its decision for the evaluation of the arguments.
@dnovatchev dnovatchev added XPath An issue related to XPath Feature A change that introduces a new feature labels Jan 2, 2023
@michaelhkay
Copy link
Contributor

michaelhkay commented Jan 8, 2023

I agree with you in principle, but I think fixing this is a little bit more complicated.

Consider

declare function $select($input, $predicate as function($item, $position) as s:boolean) {
   $input[$predicate(., position() div 0)]
}

with the call

$select(1 to 10, is-NaN#1)

Then I guess what we are trying to achieve is that no error occurs as a result of evaluating the expression position() div 0.

The rules for function coercion in the current draft say

If F has lower arity than the expected type [i.e. N < 1], then F is wrapped in a new function [let's call it G] that declares and ignores the additional argument; the following steps are then applied to this new function.

So this is equivalent to

`let $g := function($item, $position) {is-NaN($item)} return $select(1 to 10, $g)`

The actual call $input[$predicate(., position() div 0)] is now calling the arity-2 function $g with two arguments. So we can't apply any special rules here for function calls where the number of supplied arguments is different from the number required.

One way forward would be to change the mechanism: instead of having function coercion create an arity-2 function by wrapping a supplied arity-1 function, we could have a general rule that the number of arguments supplied to a dynamic function call can be greater than the arity of the function item. I think that would be a bad idea, as it would lead to many undiagnosed user errors.

Another way would be to have a general rule for all function calls saying that an error in evaluating an argument is ignored if the value of the argument is not used by the function. I'm not sure this would really improve interoperability, because in principle the caller of a function doesn't know anything about the implementation of the function. Consider for example a function such as max($input, $collation). Is $collation used in the case where $input is an empty sequence or a singleton? There might be some implementations where it is used and others where it isn't. The caller won't get any improved interoperability by knowing that there will be no error if the argument isn't used, unless they also have some way of knowing whether it will be used or not.

In fact the more I think about it, it seems to me that you are trying to achieve a level of interoperability that can only be delivered by exposing details of the implementation of a function (specifically, the conditions under which it uses the value of a particular argument) that are best left hidden.

But of course there is one case where we know that an argument is not used, and that's the case where the argument is added to a wrapper function created by function coercion. Perhaps we can find some way of special-casing that situation. But is it worth the effort, can't we just leave it to implementors to "do the right thing"?

@dnovatchev
Copy link
Contributor Author

dnovatchev commented Jan 8, 2023

One way forward would be to change the mechanism: instead of having function coercion create an arity-2 function by wrapping a supplied arity-1 function, we could have a general rule that the number of arguments supplied to a dynamic function call can be greater than the arity of the function item. I think that would be a bad idea, as it would lead to many undiagnosed user errors.

No, we don't want the change the rules. In this case we simply define the function using the lazy hint:

declare function $select($input, $predicate as function($item, lazy $position) as xs:boolean) {
   $input[$predicate(., position() div 0)]
}

This clearly tells the XPath processor: "Do not issue any errors when (if at all) evaluating $position unless the actual function arity of the passed as parameter $predicate is >= 2"

This is actually a very good example of the usefulness of lazy hints.

Thank you!

@ChristianGruen
Copy link
Contributor

This is actually a very good example of the usefulness of lazy hints.

@dnovatchev Wouldn’t it be better to let the query processor evaluate arguments lazily whenever possible? Or, to put it differently, are there any good reasons for a developer to omit the proposed lazy keyword?

And a related question: Is a processor allowed to do lazy evaluation if the keyword is omitted?

@dnovatchev
Copy link
Contributor Author

dnovatchev commented Jan 8, 2023

This is actually a very good example of the usefulness of lazy hints.

@dnovatchev Wouldn’t it be better to let the query processor evaluate arguments lazily whenever possible? Or, to put it differently, are there any good reasons for a developer to omit the proposed lazy keyword?

@ChristianGruen Of course, the query processor can take any decisions for lazy evaluation. The lazy hint is an additional help to the processor in cases where it may not have (what it considers) enforcing reason for laziness.

And a related question: Is a processor allowed to do lazy evaluation if the keyword is omitted?

Absolutely, as per above.

... are there any good reasons for a developer to omit the proposed lazy keyword?

The developer uses the lazy hint to indicate to the processor that there is a significant probability that the evaluation of the argument may not be needed. Let's take the well-known example:

let $fAnd := function($x as xs:boolean, $y as xs:boolean) as xs:boolean
             {
                let $partial := function($x as xs:boolean) as function(xs:boolean) as  xs:boolean
                                {
                                  if(not($x)) then ->(){false()}
                                              else ->($t) {$t}
                                }
                 return $partial($x)($y)
             }
  return
     $fAnd($x (: possibly false() :), lazy $SomeVeryComplexAndSlowComputedExpression)

In this particular case the developer knows that the 2nd argument will be unneeded whenever $x is false(), and a boolean value could be false() in roughly 50% of all cases. So, this is a good reason to use

lazy $SomeVeryComplexAndSlowComputedExpression

On the other side, if the 2nd argument would be not needed in just 1 in 10 cases (let's imagine a multi-valued logic with 10 possible "states" where only one state-value causes short-circuiting). Then it is most likely that the developer will not use a lazy hint, as there is no pressing evidence that a short-cutting case may happen frequently enough.

And finally, in most cases there is enough information to know in advance that all arguments are needed, such as in:

function($x, $y) {$x + $y)

To summarize:

The developer uses the lazy hint only on as-needed basis (as in the real-world case where an XPath expression was being evaluated in 100 seconds by Processor 1 and in 0 seconds by Processor 2. In any such case, the work of Processor 1 is enormously enhanced with a lazy hint -- a win/win situation for all: the vendor, the developer, and the users).

@michaelhkay
Copy link
Contributor

But what "lazy" actually says, I think, is "don't report any errors evaluating the argument unless the implementation actually needs to know the value of the argument". And what I was trying to explain was that that's only interoperable if some internal information about the implementation is exposed: one implementation might be able to work out that the evaluation is possible without knowing the $collation, another might not.

Also, what if the implementation needs to know part of the argument value but not all?

@dnovatchev
Copy link
Contributor Author

dnovatchev commented Jan 9, 2023

And what I was trying to explain was that that's only interoperable if some internal information about the implementation is exposed: one implementation might be able to work out that the evaluation is possible without knowing the $collation, another might not.

Actually, this is not the definition of lazy.

It was clearly defined as:

" it indicates to the XPath processor that it is logical to make the decision about evaluation of $y based on the actual arity of the function returned by $f($x)"

There is no interoperability problem because every Processor knows the actual arity of the function E() at the moment when this function has to be executed. This is how the implementation knows to issue an arity error, if too-many arguments have been provided in the call, or to wrap the function in another one, if the actual arity is less than the expected.

According to the rules, every implementation has to do this. Thus there is no interoperability problem!

Here is a small code example demonstrating this both with Saxon-EE 11.4 and with BaseX 10.4:

let $f := function($x)
  {
    if($x lt 1) then function() {15}
    else if($x eq 1) then function($z) {$z}
    else if($x eq 2) then function($z, $t) {$z +$t}
    else function($z, $t, $r) {$z + $t + $r}
  }
 return
   for $i in (0 to 3)
     return
        function-arity($f($i))

Both produce the correct result: 0, 1, 2, 3 :

image

image

@dnovatchev
Copy link
Contributor Author

dnovatchev commented Jan 9, 2023

Also, what if the implementation needs to know part of the argument value but not all?

Not addressed in this proposal, can be addressed separately.

One way to do this is to use destructuring of the argument and insert the lazy hint before the unneeded part(s) of the destructured argument.

I am guessing the correct syntax of the destructured argument, in the case of a list may look like this:

$(x -> head(), lazy x -> tail())

@rhdunn ? What would be your suggestion for destructuring of arguments of a function call?

@ChristianGruen
Copy link
Contributor

ChristianGruen commented Jan 9, 2023

@dnovatchev I don’t understand the implications for implementors yet (including myself):

A processor is free to evaluate arguments lazily whenever it seems suitable. The lazy keyword is described as “an additional help to the processor”, but what does that mean exactly?

If an argument is unneeded and can safely be ignored, my natural decision as an implementor would be to never evaluate it anyway. What would you expect me to do with the additional keyword?

Next, it is worth noting that the usage of variable references as arguments may be confusing in the examples. As a variable may be referenced more than once, its value is often assigned before it’s eventually used. Think of the following example, in which it will be too late to evaluate the argument lazily (presumably, depending on the processor):

let $e := EXPENSIVE
return $function($e, lazy $e)

@dnovatchev
Copy link
Contributor Author

@dnovatchev I don’t understand the implications for implementors yet (including myself):

A processor is free to evaluate arguments lazily whenever it seems suitable. The lazy keyword is described as “an additional help to the processor”, but what does that mean exactly?

@ChristianGruen ,

This means that whenever the developer is in the situation that Processor1 he is using takes 100s. time to evaluate an expression but Processor 2 takes 0s. time to evaluate the same expression, and there is obvious shortcutting, the fastest way to signal and correct the issue is to use the lazy hint so that Processor1 would only decide whether or not to evaluate the argument(s) following the hint based on the actual arity of the function.

Thus, the Developer (and their users) will see the improvement immediately after specifying the hint and not have to file a bug with the vendor, have to persuade the vendor even to look at this bug and then wait several months until this is fixed. And this describes just a single case, but there may be many such cases occurring through the course of development and usage of any product.

If an argument is unneeded and can safely be ignored, my natural decision as an implementor would be to never evaluate it anyway.

Only if your Processor is smart enough. There are cases when this is obvious to the Developer, yet it is difficult to infer/prove automatically from the code.

What would you expect me to do with the additional keyword?

As per above: Do not to evaluate the argument(s) following the hint, before knowing the actual arity of this function. And if the actual arity turns out to be K < N and there are N expressions supplied as arguments, evaluate just the needed K expressions-arguments and not all N.

Next, it is worth noting that the usage of variable references as arguments may be confusing in the examples. As a variable may be referenced more than once, its value is often assigned before it’s eventually used. Think of the following example, in which it will be too late to evaluate the argument lazily (presumably, depending on the processor):

let $e := EXPENSIVE
return $function($e, lazy $e)

This is a very good observation, thank you! You are absolutely right!

In fact I should be using metasymbols denoting expressions, like E1, E,2 , ... , Ek.

But I wanted to always have syntactically correct code (that would compile), not meta symbols that when directly input to a Processor would cause syntax errors.

Perhaps if we had a macro-expansion facility (imagine not $x but #x, then #x would be something that would expand to expression).

I would be glad to update the proposal using meta-symbols (Capital letters, each standing for "expression") if you recommend this as an improvement.

Thank you, Christian!

@ChristianGruen
Copy link
Contributor

As per above: Do not to evaluate the argument(s) following the hint, before knowing the actual arity of this function. And if the actual arity turns out to be K < N and there are N expressions supplied as arguments, evaluate just the needed K expressions-arguments and not all N.

Wouldn’t that always make sense if K < N? In other words, can you imagine reasons why an implementation should decide to evaluate all N arguments eagerly if it does come with support for lazy evaluation and if the keyword is omitted?

I think it would be helpful to find at least one implementor that would actually benefit from the existence of the keyword. Maybe you have a specific implementor in your mind that does not take part in this discussion yet?

Personally, I would tend to choose a different name for the optimization. By lazy evaluation, I would rather think of an evaluation that is supposed to take place, but is postponed until the value of an expression is actually needed. In the given case, though, the argument will simply be dropped/ignored once we know that it’s not required. See the following query:

let $a := EXPENSIVE
return 123

The value of $a will not be evaluated lazily. Instead, it can simply be dropped, as it is never referenced. I would do the same for…

(function() { 123 })(EXPENSIVE)

…as it’s possible to statically detect that the function item has 0 parameters. If the arity of a function item is only known at runtime…

(if(random-number-generator()?number > 0.5)
 then function() {}
 else function($a) { 123 }
)(EXPENSIVE)

…and if the value of an argument is not needed, I would ignore the expression as well instead of evaluating it lazily any time later.

@michaelhkay
Copy link
Contributor

As far as Saxon is concerned, I can't see us changing the behaviour of the product to take account of this "hint". We already evaluate function parameters lazily whenever possible. In the cases where it's not possible (typically because the expression has dependencies on parts of the context that aren't easily saved, such as last()), the "hint" doesn't magically solve the problem.

I think this feature is very similar to unordered{}. No-one ever uses unordered mode, and Saxon ignores it if they do. The historic experience of using hints like this is not a good one; neither the users nor the specification writers have a good understanding of the performance complexities of the internals of an implementation. If an implementation wants to provide ways for users to give performance hints there are plenty of mechanisms for doing so. In Saxon we only do this as a last resort, because we know that only a tiny number of users will ever discover the hint exists, and only a tiny number of those will know how to take advantage of it. Our business aim is to deliver good performance across the board "out of the box" to relatively unskilled users, not to provide extra controls that enable highly-skilled users to make it go even faster.

@michaelhkay
Copy link
Contributor

@ChristianGruen wrote: I would ignore the expression as well instead of evaluating it lazily any time later.

Indeed, this illustrates that so-called "lazy" evaluation embraces a wide range of techniques, including eliminating dead code, deferring evaluation of expressions until the value is needed, and incremental evaluation of sequences (or other structures...) so that only as much of the sequence is evaluated as necessary.

@dnovatchev
Copy link
Contributor Author

Wouldn’t that always make sense if K < N? In other words, can you imagine reasons why an implementation should decide to evaluate all N arguments eagerly if it does come with support for lazy evaluation and if the keyword is omitted?

@ChristianGruen It is one thing for it to "always make sense" and another to be required by the specification.

Something that "always makes sense" to you may not even come to mind to another implementor.

And currently this "future code" produces an error with BaseX 10.4, so I wouldn't like to guess if the right thing to do will "totally make sense" to you in this case in a future version. We need to accept that a thing that "always makes sense" to any person Allice may not "always makes sense" to any other person Bob.

image

If we want not to need a lazy hint, then we need to require that in

E1() (E2, E3)

none of the arguments E2 and E3 are evaluated before E1() is evaluated and the actual arity of its result is known. But I think such a rule will be much more restrictive, than just indicating with a hint only the cases when this is actually meaningful.

@dnovatchev
Copy link
Contributor Author

@ChristianGruen wrote: I would ignore the expression as well instead of evaluating it lazily any time later.

Indeed, this illustrates that so-called "lazy" evaluation embraces a wide range of techniques, including eliminating dead code, deferring evaluation of expressions until the value is needed, and incremental evaluation of sequences (or other structures...) so that only as much of the sequence is evaluated as necessary.

@michaelhkay If the word "lazy" bothers you, we could use any other word that is appropriate, such as:

  • if-needed
  • when-needed
  • as-needed
  • when-required

although I still think that "lazy" best denotes what is intended here.

As for "lazy" referring to "eliminating dead code", I don't see this mentioned in Wikipedia.

As for "incremental evaluation of sequences", as per a previous reply, we could very well also specify the lazy hint combined with sequence destructuring.

As for "deferring evaluation of expressions until the value is needed", yes, this is exactly what is being proposed -- specifically for the case when these expressions are values of arguments within a function call.

@dnovatchev
Copy link
Contributor Author

We already evaluate function parameters lazily whenever possible. In the cases where it's not possible (typically because the expression has dependencies on parts of the context that aren't easily saved, such as last()), the "hint" doesn't magically solve the problem.

@michaelhkay Here is just one example where Saxon EE 11.4 does not evaluate lazily the argument of the last function call in the below expression or otherwise the evaluation wouldn't take 100 seconds. And this function is context-free, thus the explanation above doesn't apply ("In the cases where it's not possible (typically because the expression has dependencies on parts of the context that aren't easily saved, such as last())" :

let $fnAnd := function($x)
   {
     function($y)
     {
      if(not($x)) then false()
                  else $y
     }
   }
   return
      $fnAnd(false())(some $b in ( ((1 to 2147483647) !true()) )  satisfies not($b)   )

The historic experience of using hints like this is not a good one; neither the users nor the specification writers have a good understanding of the performance complexities of the internals of an implementation.

The proposed feature has nothing to do with understanding the internals of any implementation. On the contrary, its purpose is to provide a quick fix for gross external disbehavior, such as evaluating an expression for 100 seconds, while observing how with another implementation this takes 0 seconds.

Our business aim is to deliver good performance across the board "out of the box" to relatively unskilled users, not to provide extra controls that enable highly-skilled users to make it go even faster.

Excuse me if I find such a statement rather offensive:

  1. It is sharply discriminatory -- should "highly-skilled users" feel "guilty"?
  2. We have an open issue in this group to work for supporting diversity, which means to honor the interests of all people, not to promote division into groups and falsely maintain that the interests of each group are opposing each other.
  3. What can "go even faster" than a 100 seconds evaluation? Answer: almost everything, thus neither of the two groups of users should suffer such "out of the box" behavior.

@michaelhkay
Copy link
Contributor

Your example illustrates some of the difficulties.

The expression some $b in ( ((1 to 2147483647) !true()) ) satisfies not($b) is a constant subexpression (rather like 2+2) and as such is a candidate for early evaluation. Many processors are going to evaluate constant subexpressions unconditionally (for example by promoting them to global variables). To comply with the rules on error handling, an implementation might well evaluate this expression statically, while catching and deferring any error so it isn't raised unless the value of the expression is actually used. If you're working in an environment where queries are typically compiled once and executed repeatedly, then early static evaluation of constant subexpressions is usually a good strategy, and cases where it isn't are probably rare enough to be treated as pathological: it's a basic principle of optimizer design that you should optimize the queries that occur in real life, not those that arise in artificial test cases.

Would a processor take notice if the expression were tagged "lazy"? Well, it might avoid early evaluation of the entire expression. But it probably wouldn't avoid early evaluation of constant subexpressions, such as (1 to 2147483647) !true(). Why should it?

It's perfectly possible to execute this query efficiently without any user-supplied hints, and when the inefficiencies in Saxon's optimization strategy were pointed out to us, we improved the strategy. That's the way performance of an optimizing implementation improves over time, by studying test cases and adapting the strategy by a process of continuous improvement. As it happens, we had done very little work on the performance of dynamic function calls; our general approach when we first implement a new feature is to focus on correctness rather than efficiency, and then to make performance improvements in subsequent releases as use cases start to emerge. Although we've had dynamic function calls in Saxon for at least ten years, they have very rarely featured in the customer workloads that we have studied from a performance perspective, so they had never received a great deal of attention.

@ChristianGruen
Copy link
Contributor

There are several solutions to simplify the given expression at compile time. Here’s one possible path (BaseX proceeds differently):

(: Step 1 … inline function :)
let $x := false()
return function($y) {
  if(not($x)) then false() else $y
}(some $b in (1 to 2147483647) ! true()  satisfies not($b))

(: Step 2 … inline function :)
let $x := false()
let $y := some $b in (1 to 2147483647) ! true()  satisfies not($b)
return if(not($x)) then false() else $y

(: Step 3 … inline FLWOR variables :)
if(not(false())) then false() else some $b in (1 to 2147483647) ! true()  satisfies not($b)

(: Step 4 … simplify condition :)
if(true()) then false() else some $b in (1 to 2147483647) ! true()  satisfies not($b)

(: Step 5 … choose branch :)
false()

Another strategy is to simplify the quantifier expression. Both can be done at constant time, and in both cases, the lazy keyword would get lost in the optimized query.

Dimitre, it seems you are offended by the headwind. Sorry for that. I just haven’t come across a use case yet in which I would even know how to exploit the keyword reasonably in BaseX. I’m convinced we should at least find one implementor who’s capable of doing so. Do you have someone in your mind who would benefit from the proposed keyword?

I know I’m just Alice, but we should get to know Bob before we proceed.

@dnovatchev
Copy link
Contributor Author

dnovatchev commented Jan 10, 2023

@ChristianGruen

Thank you for this analysis.

(: Step 1 … inline function :)
let $x := false()
  return function($y) {
     if(not($x)) then false() else $y
    }(some $b in (1 to 2147483647) ! true()  satisfies not($b))

I used something very similar, and the same implementation still took more than 20 seconds to evaluate it (though this is quite an improvement: from 100s. to 20 s. is 5 times faster), so I would still put in this case a lazy hint:

let $fnAnd := function($x)
   {
     function($y as function(*))
     {
      if(not($x)) then false()
                  else $y()
     }
   }
   return
      $fnAnd(false())(function()  {some $b in ( ((1 to 2147483647) !true()) )  satisfies not($b)}   )

BaseX takes 0 seconds to evaluate this.

I’m convinced we should at least find one implementor who’s capable of doing so.

Don't you think the implementation that takes 20s. to evaluate the above expression or 100s. to evaluate the original expression, will not benefit from having a lazy hint?

@ChristianGruen
Copy link
Contributor

Don't you think the implementation that takes 20s. to evaluate the above expression or 100s. to evaluate the original expression, will not benefit from having a lazy hint?

Maybe. I need to see the query that takes 20 seconds to assess this; feel free to attach it.

@dnovatchev
Copy link
Contributor Author

Don't you think the implementation that takes 20s. to evaluate the above expression or 100s. to evaluate the original expression, will not benefit from having a lazy hint?

Maybe. I need to see the query that takes 20 seconds to assess this; feel free to attach it.

It is already in my previous reply:

let $fnAnd := function($x)
   {
     function($y as function(*))
     {
      if(not($x)) then false()
                  else $y()
     }
   }
   return
      $fnAnd(false())(function()  {some $b in ( ((1 to 2147483647) !true()) )  satisfies not($b)}   )

@ChristianGruen
Copy link
Contributor

This query is evaluated in constant time. Do you use an older version of BaseX?

@dnovatchev
Copy link
Contributor Author

Yes, BaseX evaluates it in 0s. It is the other implementation that takes 20s. for the evaluation

@ChristianGruen
Copy link
Contributor

I can only talk about our implementation, for which I don’t see or understand how to utilize the keyword (…yet. I’m interested in new examples). If Saxon is the other implementation, I think that Michael already gave an answer above (#299 (comment)). If there’s yet another implementation, maybe we should include the implementor.

@dnovatchev
Copy link
Contributor Author

dnovatchev commented Jan 10, 2023

I can only talk about our implementation, for which I don’t see or understand how to utilize the keyword (…yet. I’m interested in new examples). If Saxon is the other implementation, I think that Michael already gave an answer above (#299 (comment)). If there’s yet another implementation, maybe we should include the implementor.

I only have access to BaseX and Saxon.

There may be many explanations and justifications why the evaluation is very far from well-performing, but having explanations doesn't solve the issue, and using the lazy hint is designed to be used exactly in such cases of good explanations and justifications of the unacceptable.

@michaelhkay
Copy link
Contributor

The poor performance of Saxon on this query was a bug. The code in Saxon was designed to use lazy evaluation but wasn't working as designed. The bug tells us nothing about the language design, only about the difficulty of achieving perfection in implementation.

@dnovatchev
Copy link
Contributor Author

dnovatchev commented Jan 11, 2023

The poor performance of Saxon on this query was a bug. The code in Saxon was designed to use lazy evaluation but wasn't working as designed. The bug tells us nothing about the language design, only about the difficulty of achieving perfection in implementation.

Exactly. Nothing is perfect in the real world. In order to survive, human beings have been using many life-saving mechanisms/strategies like pillows, blankets, coats, helmets (to protect oneself from flying debris) and similarly, in the area of software development we can significantly soothe / survive hard bugs that bring down an application to an almost lifeless state, by using universal, wellknown helpers, such as the lazy "shield" .

The XPath developer could either stay helpless, as at present, or be armed with a few tools that help survival, such as the lazy keyword. We could make all the difference by providing this capability.

@dnovatchev
Copy link
Contributor Author

dnovatchev commented Jan 22, 2023

Note:

I have updated the initial post in this thread with a few new things (please read, any comments will be appreciated):

  • Defining a whole function as lazy, meaning that any call of this function is only necessary to be evaluated if the function is actually called during the evaluation of the expression that contains this function call
  • A new function: fn:lazy() that can be specified directly on any expression to indicate that this expression is not necessary to be evaluated unless the function that returns it is actually called.

What still remains to be done: Ways to specify incremental (or partial) evaluation of composite (non-atomic) items. Hopefully we could discuss this together with @rhdunn and link this to his proposal for destructuring of composite items #37.

@michaelhkay
Copy link
Contributor

A minor complication is that the grammar doesn't work unless "lazy" is made a reserved function name; lazy Expr could be lazy (2+2) which is parsed as a function call.

The rationale states that in the expression $f($x)($y) "it is logical to delay the evaluation of $y until the actual arity of $f($x) is known". However, under the current rules, this expression is an error unless the actual arity of $f($x) is 1.

If we were to allow a dynamic function call to supply more arguments than are needed, then I fail to see why the user needs to indicate to the processor that the extra arguments are not needed: not evaluating such arguments is an obvious optimization that requires no intelligence on the part of the processor.

Further the rationale makes assumptions like "Specifying the lazy keyword ... can save significant computing resources, because the programmer knows that $someCondition() is true during only a single day in any 4-years period." My experience is (a) very few users will discover that the feature exists, and (b) even fewer will know how to use it correctly to achieve such benefits. Note that even if the value of an argument is only used very rarely, saving the expression and its context so that it can be evaluated later will often take longer than evaluating it eagerly, and users simply do not have enough detailed knowledge of the implementation to assess this. In my view the implementation is much more likely to get this right than the user is: especially if it uses smart techniques (which are becoming common in JIT optimisers) such as adjusting its strategy based on monitoring of actual run-time behaviour.

My view is that this proposal does not improve the specification.

@dnovatchev
Copy link
Contributor Author

dnovatchev commented Jan 22, 2023

A minor complication is that the grammar doesn't work unless "lazy" is made a reserved function name; lazy Expr could be lazy (2+2) which is parsed as a function call.

This can be fixed easily. let's name the function "lazily" 😄

The rationale states that in the expression $f($x)($y) "it is logical to delay the evaluation of $y until the actual arity of $f($x) is known". However, under the current rules, this expression is an error unless the actual arity of $f($x) is 1.

Not true: Here are the current rules (which, btw you introduced). Bullet 2 tells us that "If F has lower arity than the expected type, then F is wrapped in a new function that declares and ignores the additional argument":

  1. If F has higher arity than the expected type, a type error is raised [err:XPTY0004]
  2. If F has lower arity than the expected type, then F is wrapped in a new function that declares and ignores the additional argument;

If we were to allow a dynamic function call to supply more arguments than are needed, then I fail to see why the user needs to indicate to the processor that the extra arguments are not needed: not evaluating such arguments is an obvious optimization that requires no intelligence on the part of the processor.

What is "obvious" to one person may not be obvious to another. This is why rules and specifications are used to fill such ambiguities.

Further the rationale makes assumptions like "Specifying the lazy keyword ... can save significant computing resources, because the programmer knows that $someCondition() is true during only a single day in any 4-years period." My experience is (a) very few users will discover that the feature exists, and (b) even fewer will know how to use it correctly to achieve such benefits.

This could be so if we all quietly waited for another book similar to "XSLT 2.0 and XPath 2/0" to be written, this time about XPath 4.0.

However, luckily, there are a number of people that provide more immediate teaching, like @liamquin, G. Ken Holman and even I.

One look at the Pluralsight statistics for the course "The Evolution of XPath: What’s New in XPath 3.0" has been watched by almost 4000 people. Summing up the people who were trained just by these 3 authors will give us many tens of thousands of people. Do you still consider this as "very few users"?

Note that even if the value of an argument is only used very rarely, saving the expression and its context so that it can be evaluated later will often take longer than evaluating it eagerly, and users simply do not have enough detailed knowledge of the implementation to assess this. In my view the implementation is much more likely to get this right than the user is: especially if it uses smart techniques (which are becoming common in JIT optimisers) such as adjusting its strategy based on monitoring of actual run-time behaviour.

No JIT optimiser can save hours of processing time and these are exactly the cases where specifying laziness matters most.

Should I rename $complexComputation to $computationTakingHours ?

My view is that this proposal does not improve the specification.

And my view is that it fills a gap in the language and provides the end user with a powerful, last-resort weapon against "optimization-gone-wrong" or "lack-of-optimization", thus making it possible to save hours of unnecessary use of processing power; -- immediately, rather than having to report a bug and wait potentially months for this to be acknowledged and fixed (at the mercy of the "Gods").

@ChristianGruen
Copy link
Contributor

@dnovatchev I’m sorry to say I share many of Michael’s concerns. The two query processors we mostly talk about in this context both deploy a wide variety of optimization techniques that include different kinds of lazy evaluation, and it will be pretty difficult to illustrate to the users what exactly will change if lazy evaluation is enforced. At least, I would still be overwhelmed with finding reasonable implementation strategies that are not applied anyway without the proposed extensions and (even more) teach users which effects this will have in practice.

And we still haven’t managed to define at least one example that everyone here agrees on that would be slow without an explicit lazy hint and faster with this hint. That should be the minimum requirement before we push this any further.

@michaelhkay
Copy link
Contributor

@dnovatchev The rules you quote from are the rules for function coercion, which apply when a function F is supplied in a context where a function of type T is required. In a dynamic function call there is currently no required type and no function coercion. We could change that, of course; though on balance I don't think we should, because it would lead to user mistakes going undetected.

@dnovatchev
Copy link
Contributor Author

dnovatchev commented Jan 22, 2023

@dnovatchev I’m sorry to say I share many of Michael’s concerns. The two query processors we mostly talk about in this context both deploy a wide variety of optimization techniques that include different kinds of lazy evaluation, and it will be pretty difficult to illustrate to the users what exactly will change if lazy evaluation is enforced.

@ChristianGruen Are you saying that you would be afraid against giving the users an easy way to experiment themselves in order to find these answers?

At least, I would still be overwhelmed with finding reasonable implementation strategies that are not applied anyway without the proposed extensions and (even more) teach users which effects this will have in practice.

This is really good, and I don't see how lazy hints may stop you from doing this in the future 😄

And we still haven’t managed to define at least one example that everyone here agrees on that would be slow without an explicit lazy hint and faster with this hint. That should be the minimum requirement before we push this any further.

If the example of the evaluation of an expression once in 4 years is not a good example for you, then I am not sure what example you would like ... 😞

@dnovatchev
Copy link
Contributor Author

@dnovatchev The rules you quote from are the rules for function coercion, which apply when a function F is supplied in a context where a function of type T is required. In a dynamic function call there is currently no required type and no function coercion. We could change that, of course; though on balance I don't think we should, because it would lead to user mistakes going undetected.

@michaelhkay A dynamically defined function still can have its return type defined, thus the current rules do apply in this case.

@rhdunn
Copy link
Contributor

rhdunn commented Jan 22, 2023

One interesting point is that MarkLogic has xdmp:lazy and xdmp:eager functions. -- Lazy evaluation causes the results to be streamed (making it more expensive if the result is consumed multiple times, esp. for database requests), whereas eager causes the results to be stored in memory (like an array of the results).

Personally, I find a user having to make this decision is the wrong way around. -- If the user has to manually optimize a query, the processor's optimizer is not doing its job. Requiring something like lazy/eager forces decisions that may work for small data sets and break for large data sets.

@ChristianGruen
Copy link
Contributor

@ChristianGruen Are you saying that you would be afraid against giving the users an easy way to experiment themselves in order to find these answers?

No, I’d first need to know by myself – as implementor – which difference it makes if the keyword is specified.

If the example of the evaluation of an expression once in 4 years is not a good example for you, then I am not sure what example you would like ...

I'm afraid it isn't. As I tried to indicate before, I don't know why an implementation should choose a different evaluation strategy if the keyword is omitted. Do you have cases in your mind in which you believe that an eager evaluation strategy could be advantageous?

Next, $complexComputation doesn't tell me anything. Can you envision any concrete example here that would benefit from (whatever we may jointly believe to be) lazy evaluation?

@dnovatchev
Copy link
Contributor Author

Next, $complexComputation doesn't tell me anything. Can you envision any concrete example here that would benefit from (whatever we may jointly believe to be) lazy evaluation?

Maybe BaseX has a near-perfect optimizer, thus it would be much more difficult to construct such an example for BaseX.

However, users of other XPath processors may not be so lucky and could experience serious huge difference in performance comparing the evaluation of the same XPath expression on their chosen/main XPath processor and a better-optimizing one.

I already provided an example of such an expression whose evaluation takes around 100 seconds on Saxon 11.4 but takes 0 seconds on BaseX. A "user-optimization" of the same expression run 5 times faster on Saxon (for "only" 20 seconds) but still is hugely disappointing.

This is a real case in which if using a lazy hint was possible, and honored by the implementation, then the evaluation time would instantly become similar to that of BaseX.

The reason may be just a bug, that could be fixed in the nearest future (2-3 months from acknowledging it ...), but this is a real case and there is no guarantee that other bugs do not exist, and when encountering them the user could be stranded for a prohibitively long time, but having the ability to specify the lazy hint could give that user an immediate improvement.

@michaelhkay
Copy link
Contributor

The bug that caused Saxon not to use lazy evaluation in this example would have applied either way: Saxon was trying to use lazy evaluation and failing (because of a complication involving dynamic type checking) and that would have applied exactly the same if the request for lazy evaluation had been explicit. Please don't try to use this example as justification.

Also note: the bug was fixed as soon as it was pointed out. Optimizers are constantly improving. Hand-optimization by users based on experience with old releases ensures that user queries don't benefit from this continual improvement.

@dnovatchev
Copy link
Contributor Author

Also note: the bug was fixed as soon as it was pointed out. Optimizers are constantly improving. Hand-optimization by users based on experience with old releases ensures that user queries don't benefit from this continual improvement.

I guess that such kind of reasoning may indicate that an implementor doesn't know (or doesn't care about) the real-world developer whose given requirements often are that code is supposed to work as expected now.

Having faith in the bright future of product X has no defined time limits and thus little, if anything, to do with any rational solution process. This proposal's goal is to provide one tool that helps achieving the latter.

Seems that we really need diversity in this group, and diversity cannot be achieved without having users properly represented.

@rhdunn
Copy link
Contributor

rhdunn commented Jan 23, 2023

MarkLogic

@dnovatchev I use MarkLogic for work which (as I note above) has lazy/eager functions similar to what you propose. It uses those to determine whether to build a complete sequence into memory (eager) or to stream the results (lazy) akin to Java stream() or Kotlin sequences. -- I've also spent a lot of time profiling and optimizing MarkLogic queries, using the profiler that MarkLogic provides.

When developing with languages like that, the developer makes the choice of what data structure and processing methods to use. In other languages (including XPath/XQuery) the exact implementation is up to the processor implementation. That gives an implementation the option to use different data structures depending on the exact query type, or switch to a different data structure in the future.

Having used MarkLogic, I get frustrated with it that I can't write a correct FLWOR statement for processing data in the database and have it figure out which indices and search primitives to use in order to make the performance of that query optimal. Optimizing MarkLogic queries often involves rewriting the FLWOR queries into MarkLogic's search APIs (cts:search, cts:query). -- That doesn't mean that we should be standardizing those search APIs just because MarkLogic struggles to optimize those queries. Instead, MarkLogic should get better at figuring out how to do that itself.

Kotlin

Kotlin has a concept of lazy properties where you can do things like:

val connection by lazy { database.connect() }

Lazy properties are calculated at the point where they are used, not at the point where they are first used. In XPath/XQuery/XSLT, variables are in effect lazy by default.

The some/every example

A processor can reorder or rewrite a query as it needs to in order to optimize it. In your some $b in (1 to 2147483647) ! true() satisfies not($b) case, I could see a processor going:

  1. (1 to 2147483647) ! true() in a some/all expression is equivalent to just (true()) -- this is because some/all are set operations so you only need one of each possible value. [1] [2]
  2. not($b) expands to not(true()), which expands to false() -- this is simple logic.
  3. some ... satisfies false() is false -- this is the behaviour of some/every expressions.

[1] Note also that a static analyzer can identify that the expression will always return true, so could use that in the optimization path. For more complex expressions, it could figure out the paths that return/can return true in the satisfies clause (or different values), and just check those paths.

[2] For my XQuery plugin, I've been thinking about static analysis (e.g. Type Manipulation). There, I've been thinking about static type analysis, but could extend that to value analysis per your example. I've not yet implemented this in the public version of the plugin, but have in the past experimented with it. -- I'm currently working on making my plugin work on other IDEs, and in doing that I'm also working to the point where I can actually evaluate queries. At that point, I'll be looking at optimizing the engine.

From the spec:

The order in which test expressions are evaluated for the various binding tuples is implementation-dependent. If the quantifier is some, an implementation may return true as soon as it finds one binding tuple for which the test expression has an effective boolean value of true, and it may raise a dynamic error as soon as it finds one binding tuple for which the test expression raises an error.

This is saying that the operation is lazy, and an implementation is free to optimize it that way. For example:

  1. Create (1 to 2147483647) ! true() as a lazily evaluated list instead of computing it in memory.
  2. Performing satisfies not(true()) on the first item in the sequence.
  3. Stopping there because the some expression is false.

Another query processor may (and some do) rewrite the some/every query to a SQL expression, where it is the SQL engine that will do the optimizations and that does not have the concept of a "lazy" hint.

Using Kotlin or similar

Note that where I said that a processor could use different logic/techniques, an implementation could for example implement start to end as a Kotlin IntRange or similar. That just stores the start and end values, it does not enumerate the values or build them into a list, but has extension methods to do that when needed.

Expressions like $sequence[1 to 10] and fn:subsequence($sequence, 1, 10) can be further optimized. If the given sequence is backed by a List, Kotlin and Java have that as a 0 cost operation -- they return a new List that wraps the original list, where index operations are mapped according to the specified range. The implementation could also do something like constructing a generating list where the values are computed in a lambda expression (higher order function) as needed and backed on a list to avoid recomputation.

A user does not need to know any of these implementation details. They just write the query they want to write.

If a query is slow, a query processor should provide profiling tools for evaluating where the performance issues are in the query, and guides on optimizing queries. Those optimizations will by their nature be vendor specific, and optimizing for one processor may slow down the query for another.

Optimizations

There is a huge amount of research and implementation experience in optimizing programs.

For native programs (C, C++, Rust, etc.) the backends of compiler tools like LLVM and GCC do a lot of work with things like constant folding and more complex optimizations.

For Java/C#/JavaScript/Lua/etc. there is a lot of research in "Just in Time" (JIT) compilation. These languages typically have a two-stage JIT process -- initial unoptimized code generation, with optimized code generation for slow/long-running programs. Reading how JavaScript engines work (and how they've evolved over time w.r.t. interpretation and JIT compilation) is fascinating.

Final Thoughts

This isn't a diversity issue. It's an optimization issue, so is a processor implementation issue.

Implementations require users to report bugs and performance issues in order for them to improve.

@ChristianGruen
Copy link
Contributor

Thanks, @rhdunn, for the excellent summary.

You mentioned xdmp:lazy or xdmp:eager: We had comparable helper functions in initial versions of our processor, and we soon replaced those by an architecture that allows every expression to be evaluated iteratively or fully:

  • If fn:head is called, it is favorable to call the iterative implementation of the argument.
  • Functions like fn:sort rely on caching, and it is faster to request the full argument value only once.
  • Things are getting trickier for functions like fn:foot: If the argument size is known at runtime, it can be faster to explicitly request the last item of the argument value (e.g. if it’s an integer range with constant memory consumption). If not, the iterative approach is the one to go.

In all cases, even experienced users would need to have full knowledge on the internals of a language to understand why/when lazy or eager evaluation may yield better results. Every new version of a product may introduce unexpected changes, and if a product is implemented on top of another programming language, this language will evolve in parallel.

@michaelhkay
Copy link
Contributor

michaelhkay commented Jan 23, 2023

Indeed, I think most mature processors will have a range of strategies, not just a binary choice "eager" or "lazy".

Another factor that users may not be aware of is that lazy evaluation can in some cases have a substantial space overhead. This is especially true when recursion leads to long chains of dependencies (one lazily-evaluated expression depends on other lazily-evaluated expression, etc). Of course lazy evaluation can also reduce memory requirements; it's a complex trade-off.

We no longer live in an age where car drivers get maximum performance from their engine by making manual adjustments to the carburettor: computers do the job far better. At any rate, they do the job better most of the time, for most drivers. And our target market is people who want to jump in the car and go shopping, not racing drivers who employ specialist engineering teams. Sadly, while it would be nice to have average users represented on the working group, it's in the nature of such users that they are more interested in the destination than in the journey.

@benibela
Copy link

Seems that we really need diversity in this group, and diversity cannot be achieved without having users properly represented.

Here is a diverse approach in the other direction: Xidel is never lazy and performs no optimizations. Everything is evaluated eagerly as it is written. And users have told me it is ten times faster than basex on their real-world queries. Perhaps Xidel finishes evaluating the query before basex has done its optimizations.

Most users also rarely declare any functions. They hardly use any features beyond XPath 2.0. If they knew what a function was, they would use something like Python rather than XPath.

@michaelhkay
Copy link
Contributor

Yes, I've seen XPath engines that are amazingly simple-minded, and I suspect they do a very good job on 90% of XPath expressions. In fact, inspired by this, I did an experiment where we found that 40% of XPath expressions in a typical XSLT stylesheet (docbook) can be handled by tokenizing the XPath, matching the sequence of tokens to a dictionary of known expression types, and evaluating those directly without further analysis or optimization. A 10% reduction in the cost of evaluating select="@code" has much more impact than a 90% reduction in the cost of dynamic function calls.

@dnovatchev
Copy link
Contributor Author

Yes, I've seen XPath engines that are amazingly simple-minded, and I suspect they do a very good job on 90% of XPath expressions. In fact, inspired by this, I did an experiment where we found that 40% of XPath expressions in a typical XSLT stylesheet (docbook) can be handled by tokenizing the XPath, matching the sequence of tokens to a dictionary of known expression types, and evaluating those directly without further analysis or optimization. A 10% reduction in the cost of evaluating select="@code" has much more impact than a 90% reduction in the cost of dynamic function calls.

@michaelhkay You are reasoning again statistically, "about the average", but this proposal is all about specific cases, where it can provide immediate help.

As in the case of your findings, they still didn't make you remove the optimizors from Saxon 😄

@michaelhkay
Copy link
Contributor

Indeed, I would love to get rid of a lot of the optimizations in Saxon but I'm not brave enough to do so because I have no way of predicting the effect.

@dnovatchev
Copy link
Contributor Author

Closing this issue because its problem area and solution are fully covered by the newer #670 :

"The trouble with XPath‘s fn:fold-right.
Laziness in XPath.
"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature A change that introduces a new feature XPath An issue related to XPath
Projects
None yet
Development

No branches or pull requests

5 participants