feat: add least and greatest functions to functions_comparison.yml #247

richtia · 2022-07-15T03:38:36Z

PR to add functions for least and greatest.

jvanstraten · 2022-07-15T04:40:58Z

I feel like string could use some more details in the description. Case sensitivity? Lexicographic vs natural? etc. Float too maybe for NaN behavior.

westonpace · 2022-07-15T13:38:00Z

Hmm...the entire topic of "comparison" probably deserves a nice hefty block of prose somewhere (perhaps on the site itself). Otherwise we are at risk of repeating ourselves all over the YAML. The same goes for overflow, overflow vs NaN, etc.

I wonder if we want a section on the website for "functions" where some of this text can live.

ianmcook · 2022-07-15T13:44:43Z

In PostgreSQL, least and greatest return null only if all the arguments evaluate to null. I think that's the behavior we want consumers to implement, so we should say something about that in the descriptions. (Because I believe in some other databases/engines, least and greatest return null if any argument is null.)
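
The two null policies mentioned above can be sketched as follows. This is only an illustration in Python, with `None` standing in for SQL null; the helper names `least_null_if_all` and `least_null_if_any` are made up for this sketch and are not part of the PR.

```python
def least_null_if_all(*values):
    """PostgreSQL-style policy: nulls are ignored, and the result is
    null only when every argument is null."""
    non_null = [v for v in values if v is not None]
    return min(non_null) if non_null else None


def least_null_if_any(*values):
    """Policy used by some other engines: a single null argument makes
    the whole result null."""
    if any(v is None for v in values):
        return None
    return min(values)


mixed = (3, None, 1)
skipping = least_null_if_all(*mixed)      # 1
propagating = least_null_if_any(*mixed)   # None
```

Both helpers assume at least one argument, matching the variadic minimum in the YAML.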

richtia · 2022-07-15T16:40:22Z

> I feel like string could use some more details in the description. Case sensitivity? Lexicographic vs natural? etc. Float too maybe for NaN behavior.

Do you or @westonpace have a suggestion for how to handle the NaN behavior?

jvanstraten · 2022-07-18T11:05:59Z

> Hmm...the entire topic of "comparison" probably deserves a nice hefty block of prose somewhere (perhaps on the site itself).

Isn't this kind of behavior up to the extension that specifies the function, though? I could imagine different engines having subtly different native ways of doing comparisons, in which case conforming with Substrait's "defaults" might cost performance. You'd then want to have the option to override Substrait's defaults by just using a different function.

The alternative is associating ordering information with types instead, or at least default ordering information. SortRel actually already requires this for the default ascending/descending sorts, but beyond how nulls are to be ordered it leaves ordering up to the imagination of the user. Also, for the sort-by-function method it leaves the function signature (return type?) and the behavior for return values outside of [-1, 0, 1] unspecified.

Digression: personally I don't like having these SQL-esque default "ascending/descending" sorts at all; the implication that all types should have exactly one default ordering method seems odd to me. There is no logical way to order a 2D coordinate, for instance: you could just order by X first and then by Y, but that's as meaningless as any other sort order (Y first, by polar coordinates, by Hilbert curve, whatever). Instead, I'd much prefer having only the "custom function identifier" and "clustered" methods. If it were up to me I'd deprecate/remove the default orderings and define something like this instead:

By function: compare(T, T) -> i8 or less_than(T, T) -> boolean
By function, nulls first: compare(T!, T!) -> i8 or less_than(T!, T!) -> boolean
By function, nulls last: compare(T!, T!) -> i8 or less_than(T!, T!) -> boolean
Clustered by implicit identity function (I don't really like it, but implicit identity is used all over the place already anyway, for example in set relations)
Clustered using equality or comparison function: compare(T, T) -> i8 or equals(T, T) -> boolean

where by T! I mean the non-nullable version of the nullable type that's being sorted, so you only need to define ordering functions for non-nullable types without loss of generality.
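
A minimal sketch of the "only define ordering for non-nullable types" idea, assuming Python with `None` as null. The names `compare_i64` and `with_nulls` are hypothetical; the point is that a comparator over `T!` can be lifted mechanically to the nullable type for either null placement.

```python
from functools import cmp_to_key


def compare_i64(a: int, b: int) -> int:
    """Comparator over the non-nullable type: returns -1, 0 or 1."""
    return (a > b) - (a < b)


def with_nulls(compare, nulls_first: bool):
    """Lift a comparator over T! to one over nullable T."""
    def lifted(a, b):
        if a is None and b is None:
            return 0
        if a is None:
            return -1 if nulls_first else 1
        if b is None:
            return 1 if nulls_first else -1
        return compare(a, b)
    return lifted


rows = [3, None, 1, 2]
sorted_nulls_first = sorted(rows, key=cmp_to_key(with_nulls(compare_i64, True)))
sorted_nulls_last = sorted(rows, key=cmp_to_key(with_nulls(compare_i64, False)))
# sorted_nulls_first == [None, 1, 2, 3]
# sorted_nulls_last == [1, 2, 3, None]
```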

If I'm not the only one who feels this way I can escalate this to an issue or PR.

> Otherwise we are at risk of repeating ourselves all over the YAML.

Personally I'd much rather documentation be repeated ad nauseam than only be specified in one place where someone might not find it. It requires more maintenance, but no one wants to or can be expected to scour the complete documentation for clues when they need one specific piece of information, especially when (at present) odds are that no one has thought about it yet at all, or at least no one has written it down anywhere. Linking to the single point of truth would also be fine (or better), but that also requires maintenance to keep the links live.

westonpace · 2022-07-18T21:34:00Z

> If I'm not the only one who feels this way I can escalate this to an issue or PR.

This pushes the burden of defining how types are sorted out of the spec and into the producers. However, the communication between producer & consumer would be very clear at that point which I believe is the point of the spec. This seems very similar to the implicit cast discussion. However, as someone working primarily on a consumer it is easy for me to say "push it all to the producer" 😆

@ianmcook @cpcloud thoughts?

> Personally I'd much rather documentation be repeated ad nauseam than only be specified in one place where someone might not find it.

There should always be links/pointers to the comprehensive documentation. Yet I'd like to avoid copy/pasting entire paragraphs.

jacques-n · 2022-07-18T21:44:24Z

WRT the sorting discussion specifically, there are two options in the spec:

You choose the structured SQL type sorts with asc/desc and nulls first/nulls last OR you choose a specific function to reference.

https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L857

The intention I had was to formally declare a default comparison operation within the Substrait spec for known types but also allow one to use any function one wants for sorting using a direct function reference that is specified as returning -1, 0 or 1. We should add more content to the formalization of this but I feel like it allows for arbitrary alternative collations, etc while also having a more meaningful representation. I'd also be open to enhancing this so that if you choose the asc/desc/nf/nl paradigm, we could have multiple default collations to avoid having to use opaque function references if you're non-default.

richtia · 2022-07-19T18:06:41Z

> WRT the sorting discussion specifically, there are two options in the spec:
>
> You choose the structured SQL type sorts with asc/desc and nulls first/nulls last OR you choose a specific function to reference.
>
> https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L857
>
> The intention I had was to formally declare a default comparison operation within the Substrait spec for known types but also allow one to use any function one wants for sorting using a direct function reference that is specified as returning -1, 0 or 1. We should add more content to the formalization of this but I feel like it allows for arbitrary alternative collations, etc while also having a more meaningful representation. I'd also be open to enhancing this so that if you choose the asc/desc/nf/nl paradigm, we could have multiple default collations to avoid having to use opaque function references if you're non-default.

Do you have a suggestion for how to handle the sorting in the YAML spec? For example, if the default were lexicographical, how would I specify a natural sorting option?

Maybe for this PR we could also just try to get in what the function signatures look like, and we can document/follow up on the sorting expectations for types via a GitHub issue.

westonpace · 2022-07-19T22:03:38Z

> Do you have a suggestion for how to handle the sorting in the YAML spec? For example, if the default were lexicographical, how would I specify a natural sorting option?

This is unclear to me as well. How would a custom compare function be provided to a scalar function like the ones specified here (or lt or gt)?

> Maybe for this PR we could also just try to get in what the function signatures look like, and we can document/follow up on the sorting expectations for types via a GitHub issue.

A follow-up issue seems reasonable to me given we already have functions like lt or gt that rely on the comparison of their inputs.

jvanstraten

LGTM as is, considering the description states how strings are to be ordered, despite the discussion surrounding ordering. I could see the string versions being superseded by something more generic at some point, though.

richtia · 2022-07-29T02:27:37Z

Needed to fix the commit messages to pass the linting check.

cpcloud · 2022-08-12T21:04:14Z

extensions/functions_comparison.yaml

+        - value: "string"
+        variadic:
+          min: 1
+        return: "string"


Should we add the rest of the types here?

I would add everything except:

interval_year
interval_day
struct
list
map

richtia · 2022-08-12

Just added all the other ones, except for the ones you listed and boolean.

edit: I also added uuid, but not sure how much sense that one makes? Let me know if that one should be removed.

jvanstraten · 2022-08-15

@cpcloud Why don't you consider intervals to be comparable? Substrait effectively defines them as a number of months and a number of microseconds respectively, so I don't see why they're special. I would personally argue that only UUIDs and maps make little sense to order, though it's perfectly possible to define an ordering for them (for maps because they are basically just defined to behave like list<struct<K, V>>, and both of those can be compared using tie).

IMO this should just be least(T) -> T/greatest(T) -> T. Likewise for all normal comparison functions. If an engine doesn't consider a type comparable*, they will already have had to solve this problem and return suitable errors for sort relations with sort keys that have no default comparison operation and no custom comparator function.

* and we should just define how each builtin type is to be compared in the absence of a custom comparator function somewhere in the spec, so this isn't up to the engine to decide.

cpcloud · 2022-08-15

I don't consider intervals to be comparable because of time zone shenanigans.

Consider two intervals: 1 day and 24 hours. How do these compare?

Here's how I view this:

==: only possible to implement if you know the two timestamps (with time zones) that produced the interval (if that was even how it was produced!), because a timezone change across the boundary would potentially make these two unequal. An interval whose fields are exactly equivalent should compare equal.

!=: Similar to ==, time zones make implementing this in any kind of "obvious" way approximately impossible

</>: ordering has similar problems to == and !=

Let's say we have a timestamp of 2022-01-01 12:00:00 and tomorrow we switch to DST (bear with me for the sake of example).

Looking at the result of adding the above two intervals to that timestamp:

2022-01-01 12:00:00 + 1 day gives 2022-01-02 13:00:00

2022-01-01 12:00:00 + 24 hours gives 2022-01-02 12:00:00

Duration-based intervals I think should be comparable, but comparing finer granularity than day with day or coarser seems too fraught.

I hope I'm just wrong here and there's a sane way to do this.

jvanstraten · 2022-09-12

This kind of slipped through the cracks because I thought we had resolved this offline (more or less), but I must admit I never considered this example.

At some point in the past, when I clarified what the built-in types mean, I made the assertion that it should be allowed to store interval types using a single number. In other words, 1 day, 24 hours, 1440 minutes, or 86400 seconds, should all mean exactly the same thing. I figured we could do this because year-month and day-second were separate types already anyway (but, again, didn't consider this one). I did this to make the description of the types at all compatible with Arrow's types, because you could fairly easily construct an interval type that stores the components separately using structs anyway, and frankly, because (even after this example) I remain convinced that it's good enough.

For this particular example, I would argue that the result depends on whether you're adding the interval to a timestamp or timestamp_tz. timestamp has no timezone awareness, so in both cases you would get 2022-01-02 12:00:00. timestamp_tz instead represents "real" time, where DST just doesn't exist. You simply get the timestamp that occurs 24 hours/1 day later. When represented in this particular timezone that might result in 2022-01-01 12:00:00 -> 2022-01-02 13:00:00, but represented in UTC it might be 2022-01-01 11:00:00Z -> 2022-01-02 11:00:00Z if that timezone happened to be CET. Timezone shenanigans in general are captured by the conversion between timestamp and timestamp_tz, and need not exist anywhere else.

This leaves only the more fundamental problem that months are not always the same length in our calendar, which is already covered by having different types for year-month and day-second, and the lack of overlap between these ranges. Even leap seconds (if we would want to consider those) are defined to always happen at the end of a month.

What I don't know is whether existing query engines and SQL operations are defined with sufficient sanity to be encompassed by this logic. I'm going to hazard a guess based on recent experiences with null and say no, so maybe we need to revise these types again. But in any case, the current definition of Substrait interval types is encompassed by a number with some implicit time unit associated with it (i.e. seconds or months), which makes them trivially ordered.
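
The argument above can be illustrated with a small Python sketch. It assumes `datetime.timedelta` as a stand-in for a day-second interval stored as a single number, a naive `datetime` as a stand-in for `timestamp`, and an aware `datetime` as a stand-in for `timestamp_tz`. The fixed offsets `CET` and `CEST` model the thread's hypothetical DST switch; they are not real 2022 calendar data.

```python
from datetime import datetime, timedelta, timezone

# A day-second interval held as one number: every spelling is the same value,
# so equality and ordering are trivial.
one_day = timedelta(days=1)
twenty_four_hours = timedelta(hours=24)
same_interval = one_day == twenty_four_hours          # True
shorter_sorts_first = timedelta(hours=23) < one_day   # True

# timestamp: no zone awareness, so both additions give the same wall-clock time.
naive_start = datetime(2022, 1, 1, 12, 0)
naive_results = (naive_start + one_day, naive_start + twenty_four_hours)

# timestamp_tz: an instant. Only the rendering changes across the
# hypothetical switch from CET (+01:00) to CEST (+02:00).
CET = timezone(timedelta(hours=1), "CET")
CEST = timezone(timedelta(hours=2), "CEST")
instant = datetime(2022, 1, 1, 12, 0, tzinfo=CET) + twenty_four_hours
local_after_switch = instant.astimezone(CEST)   # 2022-01-02 13:00 +02:00
utc_view = instant.astimezone(timezone.utc)     # 2022-01-02 11:00 UTC
```

Under these assumptions the time zone effect shows up only when an instant is converted for display, not in the interval itself.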

CLAassistant · 2022-10-06T23:47:47Z

All committers have signed the CLA.

CLAassistant · 2022-10-06T23:48:23Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

jacques-n · 2022-11-25T01:32:23Z

extensions/functions_comparison.yaml

+      Uppercase letters are less than lowercase letters.
+    impls:
+      - args:
+          - value: List<T>


I don't think the value should be a List. I think it should just be a T. Variadic means that we can execute something like:

least(4,6,8,10).

I don't understand what something like the below would mean. This is what you are currently indicating with the arg type of List<T>, min:2.

least([2,4],[5,8])
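
To make the distinction concrete, here is a rough Python analogy (an assumption for illustration, not Substrait semantics): a variadic function over `T` receives each scalar as its own argument, whereas with `List<T>` arguments every argument is itself a whole list, so the result would be a list rather than an element.

```python
def least(*values):
    """Variadic over T: each argument is one candidate value."""
    return min(values)


scalar_result = least(4, 6, 8, 10)     # 4

# With list-typed arguments the candidates are entire lists, which Python
# happens to compare lexicographically; the result is a list, not a number.
list_result = least([2, 4], [5, 8])    # [2, 4]
```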

richtia · 2022-12-01

Updated! Thanks!

richtia · 2023-04-14T17:16:58Z

Would it make sense to provide an option to decide when null should be returned? (all arguments are null, or any arguments are null)

westonpace · 2023-04-14T19:42:06Z

Given there are inconsistent vendors already then yes, I think we at least need an option.

However, you could argue the return types are different between the two variations.

"Return null if any of the arguments is null" would have a return type of T|? while "return null only if all of the arguments are null" would have a return type of T&?.

Given that, instead of an option, it might almost make sense to have two different functions (least and least_skip_null?)

Though, I assume it is safe to err on the side of returning something that is nullable so it would probably be ok (if a little imprecise) to handle this with an option and use T|? as the return type.
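
One way to read the `T|?` and `T&?` shorthand, sketched in Python under the assumption that `|` means "nullable if any argument is nullable" and `&` means "nullable only if all arguments are nullable". The function names are hypothetical and this is not an existing Substrait API.

```python
from typing import Sequence


def nullable_if_any(arg_nullable: Sequence[bool]) -> bool:
    """T|?: the output can be null as soon as one input can be null."""
    return any(arg_nullable)


def nullable_if_all(arg_nullable: Sequence[bool]) -> bool:
    """T&?: the output can be null only when every input can be null."""
    return all(arg_nullable)


# least(i32?, i32): one nullable argument, one non-nullable argument.
args = [True, False]
propagating_is_nullable = nullable_if_any(args)   # True
skip_null_is_nullable = nullable_if_all(args)     # False
```

Declaring the skip-null variant with the "any" rule would be the imprecise but safe option described above: it never claims a non-nullable result that could actually be null.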

I think we need @jacques-n or @cpcloud to weigh in on #340 first.

EpsilonPrime · 2023-09-26T01:42:37Z

extensions/functions_comparison.yaml

+    description: >-
+      Returns the smallest value. Only return null if 'all' arguments evaluate to null.
+
+      String comparison is done in lexicographical ordering, one character at a time, from left to right.


By lexicographic, is the assumption the "C" locale? The ordering could be different if an alternative locale is chosen.
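
As a rough illustration of why the choice matters, the sketch below uses Python's built-in string comparison, which orders by Unicode code point (comparable to the "C" locale for ASCII text). The `natural_key` helper is a hypothetical example of a "natural" ordering and is not something the PR defines.

```python
import re


def natural_key(s: str):
    """Split digit runs out so that they compare as numbers."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", s)]


words = ["apple", "Banana"]
by_code_point = min(words)                      # "Banana": "B" sorts before "a"
case_insensitive = min(words, key=str.casefold) # "apple"

files = ["file10", "file2"]
lexicographic = min(files)                      # "file10": "1" sorts before "2"
natural = min(files, key=natural_key)           # "file2": 2 < 10
```

The same inputs give different "least" values depending on the collation, which is the ambiguity raised in this comment.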

westonpace

In yesterday's community meeting we discussed this PR. I think we came to the following conclusions (but I am paraphrasing here so @EpsilonPrime can feel free to correct me):

Being able to specify a custom comparator for sorting is a problem that affects several functions. We should not hold up this PR while we figure that out. In the meantime, we should assume that all types have a default comparison method and we should not explicitly mention how values are compared.
We should support both "skip null" and "don't skip null" variants as two different functions instead of one function with two options.

With that said, I think these descriptions need to be updated. I also think we need a greatest_skip_null.

westonpace · 2023-10-26T16:11:08Z

extensions/functions_comparison.yaml

+      String comparison is done in lexicographical ordering, one character at a time, from left to right.
+      Uppercase letters are less than lowercase letters.
+
+      There is no greatest_skip_null function because it behaves the same as greatest.


I disagree with this conclusion. There is no existing engine I am aware of that behaves this way. For example, in both Oracle and MySQL GREATEST(1, 3, NULL) yields NULL. The theory is that "if any one of the inputs is unknown then I cannot know which is the greatest value because it may be the unknown one"

I think we do need a greatest_skip_null variant, and this variant (greatest) should not skip nulls.

Co-authored-by: Weston Pace <weston.pace@gmail.com>

richtia · 2023-10-26T18:31:38Z

@westonpace Thanks for the suggestion! I included them and added the greatest_skip_null function

jvanstraten previously approved these changes Jul 27, 2022

richtia dismissed jvanstraten’s stale review via dc2616b July 27, 2022 16:38

richtia force-pushed the add_comparison_functions branch 2 times, most recently from dc2616b to e15c063 Compare July 27, 2022 16:56

richtia requested a review from jvanstraten July 29, 2022 02:30

cpcloud reviewed Aug 12, 2022

richtia force-pushed the add_comparison_functions branch from edb382b to 5a66d77 Compare August 12, 2022 21:52

richtia requested a review from cpcloud August 12, 2022 21:54

richtia force-pushed the add_comparison_functions branch 3 times, most recently from afd3bcd to 3fb45ab Compare September 2, 2022 20:09

jvanstraten mentioned this pull request Sep 13, 2022

feat: add temporal functions #272

Merged

This was referenced Nov 2, 2022

Discuss including Clip arithmetic operator #366

Closed

Add support for Clip ibis-project/ibis-substrait#409

Closed

cpcloud reviewed Nov 2, 2022

richtia requested review from cpcloud and removed request for jvanstraten November 8, 2022 20:41

jacques-n reviewed Nov 25, 2022

gforsyth added the extension label Jun 27, 2023

EpsilonPrime reviewed Sep 26, 2023

EpsilonPrime added the under consideration label Oct 19, 2023

richtia added 2 commits October 25, 2023 14:59

feat: add least and greatest functions

eb6dffb

fix: add least_skip_null function and options to least function

b44bdcc

richtia force-pushed the add_comparison_functions branch from c58b8e4 to b44bdcc Compare October 26, 2023 00:07

richtia requested a review from vbarua as a code owner October 26, 2023 00:07

EpsilonPrime reviewed Oct 26, 2023

fix: use 2 functions without options

ab7bcbe

richtia force-pushed the add_comparison_functions branch from 89b5cbb to ab7bcbe Compare October 26, 2023 01:05

EpsilonPrime previously approved these changes Oct 26, 2023

EpsilonPrime added awaiting SMC approval and removed under consideration labels Oct 26, 2023

EpsilonPrime requested review from westonpace and jacques-n October 26, 2023 01:08

westonpace requested changes Oct 26, 2023

fix: update descriptions for least functions

bebf67f

Co-authored-by: Weston Pace <weston.pace@gmail.com>

richtia dismissed EpsilonPrime’s stale review via bebf67f October 26, 2023 18:14

feat: add greatest_skip_null function

c01331b

richtia requested review from EpsilonPrime and westonpace October 26, 2023 18:29

EpsilonPrime approved these changes Oct 26, 2023

EpsilonPrime assigned westonpace Oct 26, 2023

westonpace approved these changes Oct 29, 2023

EpsilonPrime merged commit b3071bc into substrait-io:main Oct 31, 2023
13 checks passed

richtia mentioned this pull request Feb 8, 2024

fix: remove function definitions w/ invalid return types #599

Merged

vbarua mentioned this pull request Apr 18, 2024

add a CI check to lint function extensions #633

Open
