150 fn:ranks #1027

Closed
wants to merge 39 commits into from

dnovatchev
Contributor

As proposed and discussed here: #150

@ChristianGruen changed the title from "fn:ranks" to "150 fn:ranks" on Feb 19, 2024
@michaelhkay
Contributor

Looking at the first example, why do we want to return [(2, 4)], [3] rather than [2, 4], [3]. Generally I would have thought an array with two singleton members was more useful than an array with one member being a sequence of two items.

To make this change return [$input[$key(.) eq $v]] should be return array{$input[$key(.) eq $v]}.

In all of the examples given, the supplied key function returns a single item. But it is allowed (according to the signature) to return any sequence of atomic values. I don't think I understand the intended behaviour when it returns multiple items. The predicate $key(.) eq $v requires $key(.) to return zero or one items.

What is the effect of NaN values?

The supplied $collation is used when sorting the values, but not when deciding whether they are distinct. Is that right?

Specification style: this is the subject of a separate issue. We should either provide an expression that can act as an implementation of the function being specified, or we should provide a user-written function that has the same effect. In this case I think providing an expression will do the job. Alternatively, wouldn't an XQuery expression using group by and order by be clearer (perhaps not, since FLWOR expressions cannot have a dynamic collation).

Summary: I would suggest: Sorts a supplied sequence based on the value of a sort key function, grouping the results so that items with the same key appear together as members of the same array.
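The suggested summary (sort by key, then put items with equal keys into the same array) can be modelled outside XQuery. The following is a minimal Python sketch, not the proposed specification text: Python lists stand in for XQuery arrays, each item becomes its own member (the `array{...}` variant discussed above), and the input and key function are hypothetical, chosen only to reproduce the shape `[2, 4], [3]`.

```python
from itertools import groupby

def ranks(items, key):
    """Sort items by key, then group items that share a key.

    sorted() is stable, so items with equal keys keep their input order.
    """
    ordered = sorted(items, key=key)
    return [list(group) for _, group in groupby(ordered, key=key)]

# Hypothetical input: rank integers by their remainder modulo 2.
result = ranks([3, 2, 4], key=lambda n: n % 2)
print(result)
```

Collations and multi-item keys are deliberately left out of this sketch, since their intended behaviour is exactly what the questions above ask about.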

@dnovatchev
Contributor Author

@michaelhkay Thank you for these observations.

I am studying them and will respond.

@dnovatchev
Contributor Author

Looking at the first example, why do we want to return [(2, 4)], [3] rather than [2, 4], [3]. Generally I would have thought an array with two singleton members was more useful than an array with one member being a sequence of two items.

To make this change return [$input[$key(.) eq $v]] should be return array{$input[$key(.) eq $v]}.

A good observation. Probably I wanted the key() function to be most general, but it feels difficult to find an immediate and compelling example.

In all of the examples given, the supplied key function returns a single item. But it is allowed (according to the signature) to return any sequence of atomic values. I don't think I understand the intended behaviour when it returns multiple items. The predicate $key(.) eq $v requires $key(.) to return zero or one items.

Yes, then we would need a function such as deep-equal

@ChristianGruen
Contributor

A good observation. Probably I wanted the key() function to be most general, but it feels difficult to find an immediate and compelling example.

Off-topic, but maybe we can ask the same question for the scan functions: Wouldn’t singleton members be more intuitive?

@dnovatchev
Contributor Author

dnovatchev commented Feb 19, 2024

What is the effect of NaN values?

Aren't NaN values supposed to be smaller than anything else? The answer should be: "The effect is the same as when sorting."

The supplied $collation is used when sorting the values, but not when deciding whether they are distinct. Is that right?

The function distinct-values, used in the sample implementation, can be passed a collation, too. Not sure if the collation in the signature of fn:ranks should be used both for sorting and getting the distinct values from the input-sequence, or (if this at all is so important), we could have two different collations as parameters.

I prefer this to be as simple as possible. The $collation parameter was intended only because fn:sort needs one, not for producing the distinct values.

@dnovatchev
Contributor Author

Specification style: this is the subject of a separate issue. We should either provide an expression that can act as an implementation of the function being specified, or we should provide a user-written function that has the same effect. In this case I think providing an expression will do the job. Alternatively, wouldn't an XQuery expression using group by and order by be clearer (perhaps not, since FLWOR expressions cannot have a dynamic collation).

A function definition is an expression, isn't it?

Summary: I would suggest: Sorts a supplied sequence based on the value of a sort key function, grouping the results so that items with the same key appear together as members of the same array.

A good one, thanks.

I will incorporate these suggestions now.

@dnovatchev
Contributor Author

dnovatchev commented Feb 19, 2024

The supplied $collation is used when sorting the values, but not when deciding whether they are distinct. Is that right?

The function distinct-values, used in the sample implementation, can be passed a collation, too. Not sure if the collation in the signature of fn:ranks should be used both for sorting and getting the distinct values from the input-sequence, or (if this at all is so important), we could have two different collations as parameters.

@michaelhkay ,

Thinking further on this, if the key function is, say, translation from English to Swedish, then we must have two different collations - one for the English input words, and one for the Swedish translation results.

It is a pity we don't have the set type yet, otherwise the type of $input would more precisely be specified as set and the question about making the input values distinct would be eliminated.

This will also make a fine example - maybe close synonyms will have the same translation and would thus be in the same ranking set.

What do you think?

@michaelhkay
Contributor

michaelhkay commented Feb 20, 2024

Thanks for responding to my comments.

I find it hard to believe that multiple collations are needed here; on the contrary, the way that sort keys are compared using distinct-values needs to be consistent with the way they are compared using sort. I'm also worried that there's a third comparison being done using eq, which uses the default collation rather than the supplied collation. I think this is also why I was uneasy about NaN - there are three different comparisons here which all potentially treat NaN differently.

I would like to suggest an alternative approach.

  1. Make the signature compatible with fn:sort except that it returns array(item())*.

  2. Take the rules of fn:sort as currently written, and modify them as described below to define fn:ranks

  3. Change the definition of fn:sort so that fn:sort($input, $collations, $keys, $orders) returns fn:ranks($input, $collations, $keys, $orders)?*. So we define fn:sort in terms of fn:ranks, not the other way round.

The changes needed to the fn:sort rules might be primarily, under "The result of the function is obtained as follows:" change rules 1, 3, and 4 as follows:

  1. The result is a sequence of arrays S such that S?* contains the same items as the input sequence $input, but generally in a different order.

  2. (unchanged)

  3. When a pair of corresponding sort key values of $A and $B are found to be not equal, then $A and $B appear in different arrays in the result sequence, and the array containing $A precedes the array containing $B in the result sequence if both the following conditions are true, or if both conditions are false:

     • The sort key value for $A is less than the sort key value for $B, as defined below.
     • The order direction in the corresponding sort key definition is "ascending".

  4. If all the sort key values for $A and $B are pairwise equal, then $A and $B appear in the same array in the result sequence, and $A precedes $B in this array if and only if $A precedes $B in the input sequence.
     Note: That is, the sort is stable.
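
The rules above, and the idea of defining fn:sort as the flattened result of fn:ranks, can be sketched in Python. This is an illustrative model under stated assumptions, not the specification: collations are omitted, lists stand in for arrays, and the function names `ranks` and `sort` simply mirror the ones discussed here.

```python
from functools import cmp_to_key
from itertools import chain

def ranks(items, keys, orders=None):
    """Group items whose sort keys are all pairwise equal; order the groups."""
    orders = orders or ["ascending"] * len(keys)

    def compare(a, b):
        for key, order in zip(keys, orders):
            ka, kb = key(a), key(b)
            if ka != kb:
                less = ka < kb
                # a precedes b if both conditions hold or both fail (rule 3).
                return -1 if less == (order == "ascending") else 1
        return 0  # all keys pairwise equal (rule 4)

    ordered = sorted(items, key=cmp_to_key(compare))  # stable sort
    groups = []
    for item in ordered:
        if groups and compare(groups[-1][-1], item) == 0:
            groups[-1].append(item)   # same rank as the previous item
        else:
            groups.append([item])     # start a new rank
    return groups

def sort(items, keys, orders=None):
    """Model of fn:sort as ranks(...)?* : flatten the groups in order."""
    return list(chain.from_iterable(ranks(items, keys, orders)))

# Hypothetical data: rank words by length, longest first.
words = ["pear", "fig", "apple", "kiwi", "plum"]
print(ranks(words, [len], ["descending"]))
print(sort(words, [len], ["descending"]))
```

Because a single comparison function drives both the ordering and the grouping, there is only one notion of "equal keys" in this model, which is the consistency the proposal is after.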

@dnovatchev
Contributor Author

dnovatchev commented Feb 20, 2024

I would like to suggest an alternative approach.

  1. Make the signature compatible with fn:sort except that it returns array(item())*.
  2. Take the rules of fn:sort as currently written, and modify them as described below to define fn:ranks
  3. Change the definition of fn:sort so that fn:sort($input, $collations, $keys, $orders) returns fn:ranks($input, $collations, $keys, $orders)?*. So we define fn:sort in terms of fn:ranks, not the other way round.

The changes needed to the fn:sort rules might be primarily, under "The result of the function is obtained as follows:" change rules 1, 3, and 4 as follows:

Thank you, @michaelhkay ,

I understand exactly what you are proposing, and yes, this is possible, however it becomes overly (and is that necessary?) complicated.

In particular, I never wanted to have a sequence of key-functions, and it seems that just one function can internally perform multiple comparisons, if that is necessary at all.

Also, by definition, fn:ranks is defined (to be meaningful) over a set of (distinct) items -- while fn:sort returns all the input items, even in the case when they are not distinct.

As for comparing NaN values, can't we just say that NaN is less than any other item, and for the purposes of this function NaN is equal to NaN? Thus no additional collation for treating NaN would be necessary.

@michaelhkay
Contributor

however it becomes overly (and is that necessary?) complicated.

I think there's a lot of complexity in the current proposed spec, which compares values in three different ways: For example, sort and distinct-values treat two NaNs as equal, while eq treats them as not-equal. The proposal to define fn:sort in terms of fn:ranks is certainly a significant refactoring that may be difficult to get right, but if successful it will reduce complexity overall. (It might also be possible to define other functions such as distinct-values, duplicates, min, max, highest and lowest by reference to fn:ranks, and that would certainly be a great reduction in complexity if it can be achieved). But I agree it might be over-ambitious.
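
The NaN concern can be illustrated with IEEE 754 doubles in Python. This is only an analogy for the XQuery comparisons named above: `==` plays the role of `eq`, while the hypothetical `group_key` function models a comparison that treats all NaNs as one key and orders NaN before every other number.

```python
import math

nan = float("nan")
values = [2.0, nan, 1.0, nan]

# eq-style comparison: NaN is never equal to anything, including itself.
eq_equal = (nan == nan)
# Selecting the "NaN group" with an eq-style predicate finds nothing.
eq_group = [v for v in values if v == nan]

def group_key(x):
    """Assumed rule for this sketch: all NaNs share one key, sorted first."""
    return (0, 0.0) if math.isnan(x) else (1, x)

ordered = sorted(values, key=group_key)
distinct_keys = sorted({group_key(v) for v in values})

print(eq_equal, eq_group)
print(ordered)
print(len(distinct_keys))
```

In this model the distinct-key step sees three keys (NaN, 1.0, 2.0), but an eq-style filter would leave the NaN group empty, so the two NaN items would silently drop out of the result.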

Another point, I just spotted the error condition "If the set of computed keys contains xs:untypedAtomic values that are not castable to xs:double then [the] operation will fail with a dynamic error." Why is that? All three comparisons that are used in the specification (sort(), distinct-values() and 'eq') treat untypedAtomic values as strings; I can't see where untypedAtomic-to-double conversion occurs.

@ChristianGruen added the "Tests Needed" label on Feb 20, 2024
@dnovatchev
Contributor Author

dnovatchev commented Mar 5, 2024

I would prefer to spend a little bit more time reading and understanding (being assigned with this) or hearing the person assigned to do so, than realizing when it is too-late that everybody's time was wasted at one or more meetings due to unrealized complexity and lack of understanding.

»Everybody« implies I’m part of it, but I don’t see myself involved. Are you sure others, or even all of us, share your perspective?

When I have the impression that a feature is too complex to be accepted, I tend to ask for more time before we accept it.

What I’ve indeed suggested just recently is that we should spend time on the features that have already been added to the draft, but have not been officially accepted (https://lists.w3.org/Archives/Public/public-xslt-40/2024Feb/0016.html). I didn’t get any reply, so it could be that people don’t feel it’s necessary (or again it’s a matter of time).

If it is regularly the case that I don't understand well at least 50% of what someone is writing, should I constantly raise this (might well be mistaken for having a personal grudge or embarrassment) or should we deal in a more organized, systematic way? And what if I am not the only one who feels that way and who is shy to raise their voice? Doesn't this make for a significant part of the people (maybe even the majority)?

I think it is the Chair's responsibility not to ask for a vote if there is even the slightest sense of not understanding and discomfort. Maybe we are often rushed to make decisions when we are still not fully prepared to do so? Here is where having an officially assigned independent reviewer could help everyone of us get a better understanding.

@ChristianGruen
Contributor

If it is regularly the case that I don't understand well at least 50% of what someone is writing, should I constantly raise this (might well be mistaken for having a personal grudge or embarrassment)

I welcome this personally (neutral language might help to avoid irritation). In addition, I have repeatedly observed that my lack of native language skills leads to technical misunderstandings that I like to have clarified myself.

Doesn't this make for a significant part of the people (maybe even the majority)?

…could very well be the case.

I think it is the Chair's responsibility not to ask for a vote if there is even the slightest sense of not understanding and discomfort.

We should keep in mind that too strict a procedure might lead to stagnation. Several years have already passed, and we are far from finalizing version 4.

But I think we would not lose anything by spending 10 or 20 minutes of our joint time to discuss the current procedure in an upcoming meeting.

My personal hope is slightly different: I think we all should be as open-minded as possible to accept others’ thoughts and opinions. It hurts to see a PR questioned on which one has spent hours and hours to make it seemingly watertight. However, that doesn’t prevent any of us from being confronted with a result that differs a lot from the initial proposal.

When saying this, I hope not to be suggestive. I don’t refer to this specific proposal; I rather have my own proposals in mind that underwent various changes before becoming accepted or eventually rejected.

@ChristianGruen added and removed the "PR Pending" label on Mar 6, 2024
@ChristianGruen
Contributor

@dnovatchev Thanks for the example code. I took the liberty of pasting your reply to the mailing list:


Yes, any sequence of functions can be replaced by a single function.

Here is one such example:

We are given a company's employees and each employee has a name, department and salary.

We will rank the employees first just by department, then by both department and salary - done with a single function as specified in the 2nd call to fn:ranks below:

let $employees := map{
  "John Smith":   map{ "dept": "Sales",     "salary": 50000 },
  "Erin Carter":  map{ "dept": "Computing", "salary": 120000 },
  "Ryan Gosling": map{ "dept": "Sales",     "salary": 100000 },
  "Ann Gould":    map{ "dept": "Computing", "salary": 150000 },
  "Pete Lagard":  map{ "dept": "Sales",     "salary": 50000 },
  "Jim Carter":   map{ "dept": "Sales",     "salary": 80000 },
  "Greg Wilson":  map{ "dept": "Computing", "salary": 120000 }
}
return
(
  (: Rank by department only. :)
  ranks(map:keys($employees), fn($emp){ $employees($emp)("dept") }),

  "===============================================================================",
  (: Rank by department and salary with a single key: the department name
     followed by the salary, left-padded with zeros to seven digits. :)
  ranks(map:keys($employees),
    fn($emp){
      $employees($emp)("dept")
      || (let $sal := $employees($emp)("salary"),
              $salDigits := string-length(string($sal))
          return substring('0000000', $salDigits + 1) || string($sal))
    })
)


I see that the concatenated string seems to be based on the actual value distribution of the input data (e.g., knowledge on the maximum value)…

  1. How would you handle arbitrary numbers (e.g. doubles)?
  2. How would you sort a secondary double sort key in a descending order?
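
To make the dependence on the value distribution concrete, here is a Python model of the employee example. It is a sketch under assumptions: the hypothetical `ranks` helper sorts and groups by one key, `padded` reproduces the seven-digit zero padding from the XQuery code, and Python's dictionary insertion order stands in for whatever order map:keys would return.

```python
from itertools import groupby

employees = {
    "John Smith":   {"dept": "Sales",     "salary": 50000},
    "Erin Carter":  {"dept": "Computing", "salary": 120000},
    "Ryan Gosling": {"dept": "Sales",     "salary": 100000},
    "Ann Gould":    {"dept": "Computing", "salary": 150000},
    "Pete Lagard":  {"dept": "Sales",     "salary": 50000},
    "Jim Carter":   {"dept": "Sales",     "salary": 80000},
    "Greg Wilson":  {"dept": "Computing", "salary": 120000},
}

def ranks(items, key):
    """Stable sort by key, then group items with equal keys."""
    ordered = sorted(items, key=key)
    return [list(group) for _, group in groupby(ordered, key=key)]

def padded(dept, salary, width=7):
    """Department name followed by the zero-padded salary."""
    return dept + str(salary).zfill(width)

def composite_key(name):
    emp = employees[name]
    return padded(emp["dept"], emp["salary"])

by_dept = ranks(list(employees), key=lambda n: employees[n]["dept"])
by_dept_salary = ranks(list(employees), key=composite_key)
print(by_dept)
print(by_dept_salary)

# The padding width is tied to the data: an eight-digit salary compares
# as a string *before* a seven-digit one, which is the wrong order.
wrong_order = padded("Sales", 10_000_000) < padded("Sales", 9_000_000)
print(wrong_order)
```

The single-key version works for this data set because every salary fits in seven digits; the last line shows how the string comparison breaks once that assumption no longer holds.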

@dnovatchev
Contributor Author

@dnovatchev Thanks for the example code. I took the liberty of pasting your reply to the mailing list:

@Christian Grün Thanks, but I actually sent this to the mailing list and to Norm Walsh and not to any other recipient.


@ChristianGruen
Contributor

@Christian Grün Thanks, but I actually sent this to the mailing list and to Norm Walsh and not to any other recipient.

That’s what I wanted to say (might have been a misunderstanding?).

@dnovatchev
Contributor Author

dnovatchev commented Mar 12, 2024

I see that the concatenated string seems to be based on the actual value distribution of the input data (e.g., knowledge on the maximum value)…

  1. How would you handle arbitrary numbers (e.g. doubles)?

A general answer: A double is less precise than a decimal, and the example shows how to handle decimals - thus handling doubles can be done in a similar way

  2. How would you sort a secondary double sort key in a descending order?

By substituting it with the difference between a suitable constant and this value. For any x we use N - x where N is the largest possible value.

@ChristianGruen
Contributor

A general answer: A double is less precise than a decimal, and the example shows how to handle decimals - thus handling doubles can be done in a similar way

This is how I would sort ascending strings and descending doubles with two keys:

let $items := (
  map { 'name': 'A', 'size': 1e33  },
  map { 'name': 'A', 'size': .1    },
  map { 'name': 'B', 'size': 0.01  },
  map { 'name': 'B', 'size': -1e99 }
)
return sort($items, (), (fn { ?name }, fn { -?size }))

How would you do it with a single key?
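
For comparison, the same two-key ordering (name ascending, size descending) can be modelled in Python, where a tuple plays the role of the sequence of key functions. This is only an analogy to the `sort` call above; the input order is shuffled here so that the effect of sorting is visible.

```python
items = [
    {"name": "B", "size": -1e99},
    {"name": "A", "size": 0.1},
    {"name": "B", "size": 0.01},
    {"name": "A", "size": 1e33},
]

# Primary key: name ascending. Secondary key: negated size, i.e. size descending.
ordered = sorted(items, key=lambda m: (m["name"], -m["size"]))

for m in ordered:
    print(m["name"], m["size"])
```

Each component of the tuple is compared with its own type's ordering, so no encoding of numbers into strings is needed.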

@dnovatchev
Contributor Author

dnovatchev commented Mar 12, 2024

There are many ways to do this.

We can even return the hash of the concatenation of the 'name' and the normalized form of the 'size' (normalized meaning that it has the same agreed-upon representation).

@dnovatchev
Contributor Author

This is the maximum double value:

dMax = 1.7976931348623157E+308

Use: dMax - ?size, then convert this to a fixed-length string with the decimal representation, then concat the result to ?name.

There are many possible ways to compute the final single value, and I am not saying that I can immediately provide the best algorithm to do that.

The statement is that all this can be done with a single function.
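
The comment does not say which numeric type the subtraction `dMax - ?size` would use. As a hedged illustration only, the Python sketch below compares two readings: the subtraction done in double arithmetic, and the same subtraction done exactly with rational numbers. The sizes are the ones from the two-key example; `D_MAX` and the other variable names are assumptions of this sketch.

```python
import sys
from fractions import Fraction

D_MAX = sys.float_info.max  # 1.7976931348623157e308, the largest finite double

sizes = [0.01, 1e33, -1e99, 0.1]

# Reading 1: subtract in double arithmetic. Every size here is far smaller
# than the spacing between doubles near D_MAX, so each result rounds to D_MAX.
double_keys = [D_MAX - s for s in sizes]
collapsed = all(k == D_MAX for k in double_keys)

# Reading 2: subtract exactly. Fraction(float) is an exact conversion,
# so the four keys stay distinct and reverse the order of the sizes.
exact_keys = [Fraction(D_MAX) - Fraction(s) for s in sizes]
descending = [s for _, s in sorted(zip(exact_keys, sizes))]

print(collapsed)
print(len(set(exact_keys)))
print(descending)
```

Under the first reading all four items would receive the same key and land in the same rank; under the second the descending order is preserved, at the cost of exact arithmetic and a fixed-length string encoding wide enough for the whole double range.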

@dnovatchev
Contributor Author

I am closing this PR because it comes from my master branch, which is not good when one has more than one open PR.

Will re-submit it from a dedicated feature branch.
