Improve filter expression language consistency #1478

jkbonfield · 2022-07-20T15:39:43Z

This tries to follow the principles of 3-state logic used in other systems, such as SQL. See https://en.wikipedia.org/wiki/Null_(SQL)#Comparisons_with_NULL_and_the_three-valued_logic_(3VL)

There is some controversy there and a suggestion that 4-state logic is better, which we sort of already have with the "null-but-true" concept, which may impact below.

The primary changes here are:

New hts_expr_val_exists and hts_expr_val_undef inline functions to check if a value exists and to clear it. These simply logic elsewhere. This fixes SAM filter expressions like [AB] / [BC] > 1 where tags AB or BC don't exist. Similar updates have been applied to many logic and mathematical operators. To avoid spammage these are silent, failing the expression without complaint due to undefined values. Additionally hts_exp_val_existsT for conditionals (&&, ||) is true if the data exists or it is true (so null-but-true works).
hts_expr_val_undef now uses NAN as a definition of undefined, rather than purely a false empty string. I do this because we automatically gain some consistency with how NAN works on mathematical operators which we previously didn't have. Eg sqrt(-1) == sqrt(-1) is false according to the strict comparison rules for NAN.
In addition to this, <, >, <=, >=, == and != also fail for undefined values (due to the use of NAN).
A new function exists(var) is used to test for the existance (whether it is defined) of a variable. This returns a boolean, which removes the need for the null-but-true and zero-but-true nature used in current tag querying. We still have this null-but-true concept as we've previously documented the difference between filter expression [X0] (has tag X0) and [X0] && [X0]!=0 (has tag X0 with a non-zero value). This still works, but is now much easier to write using [X0] != 0 rather than it automatically including !exists[X0] || [X0] != 0. This is therefore an incompatibility, but probably a sensible one.

What we previously struggled to do was to check a tag does not exist. It was possible via !([X0] || [X1]==0), but that was highly cryptic and I doubt many figured it out! Now !exists([X0]) is a far simpler solution. NOTE however the old cryptic method no longer works, as null || false is still null (not false) and !null is also null due to the afforementioned logic. (I'm prepared to change this though; maybe we need null-and-false vs null-and-true so we can explicitly distinguish from null-is-unknown.)

INCOMPATIBILITY: Note previously "[X1] != 0" meant X1 is non-zero or X1 doesn't exist. This now gets interpreted only as X1 is non-zero. I believe this change is a considerable improvement and worth breaking compatibility over.
A new function default(var1,var2) which returns the first value that is defined of the pair. So exists(var1)?var1:var2 (if we had a ?: operator that is).
New mathematic operators pow, exp, sqrt, log. These perhaps should have been their own commit, but I added them during this work purely because they're a convenient mechanism for getting +/- inf and nan into things for validation testing.

If I use the old test_expr.c test harness on the new hts_expr.c code now also passes all tests.

Fixes samtools/samtools#1677

jkbonfield · 2022-07-21T12:03:59Z

After some corrections and rebasing I think I'm ready for this to be reviewed.

It's almost entirely compatible with the old mechanism apart from expressions genuinely involving NaN (eg 0/0) which were considered true previously and are now false, and the [X1] != 0 example quoted above.

This means we probably want a slight update to the samtools man page too. Although what it states there is all legal, some of it is out of date and it misses the newer functionality.

daviesrob

I notice that if bam_sym_lookup() were changed to return NaN for missing tags instead of NULL strings, it would go a long way to solve the problem of having to free the kstring each time hts_filter_eval2() is called.

daviesrob · 2022-07-22T11:09:59Z

hts_expr.c

+        if (strncmp(str, "exists(", 7) == 0) {
+            if (expression(filt, data, fn, str+7, end, res)) return -1;
+            func_ok = 1;
+            res->d = hts_expr_val_exists(res);


Maybe set res->is_true here as well, to match other operations that return a truth value?

There's quite a lot of the function code which doesn't set res->is_true. Eg the existing expr_func_max, expr_func_avg, etc.

The catch all is in

htslib/hts_expr.c

Lines 879 to 888 in 8344b82

// Strings evaluate to true. An empty string is also true, but an

// absent (null) string is false, unless overriden by is_true. An

// empty string has kstring length of zero, but a pointer as it's

// nul-terminated.

if (res->is_str) {

res->is_true |= res->s.s != NULL;

res->d = res->is_true;

} else if (hts_expr_val_exists(res)) {

res->is_true |= res->d != 0;

}

which basically states any non-zero value sets is_true. (This probably means a lot of code elsewhere is redundant.)

Given we're setting a boolean here of yes (1) or no (0) then explicitly setting is_true is redundant.

hts_expr.c

daviesrob · 2022-07-22T14:18:33Z

hts_expr.c

-            res->is_str = 0;
+            if (!hts_expr_val_existsT(res) || !hts_expr_val_existsT(&val)) {
+                hts_expr_val_undef(res);
+                res->d = 0;


This doesn't match the normal 3VL truth table. Is that what you want?

Note the wiki page shows how and and or can be done using min and max, via the encoding (−1, false; 0, unknown; +1, true) .

It's deliberate, as stated in the commit message.

You've moved on from quoting SQL to quoting somewhere else as the source of 3VL, but the SQL one was potentially more interesting. Specifically it stated that NOT(unknown) is unknown. That is obviously logical from a boolean value view point as bool can be {0,1} and when negating it's {1,0} which is still basically all possibilities.

However it then went on to say that NOT used in a comparison would yield the opposite case, so "if (unknown)" will fail and "if (NOT(unknown))" would pass. That is also clearly useful as it means negation always takes the opposite code path even when we don't know what the value is.

I applied the same logic to the AND and OR case. I could potentially set is_true 0/1 and leave the value as unknown, but I felt it's not desparately helpful. I came to this conclusion when looking at how things like NAN work in C. "NAN || 0" is 1, not NAN, despite NAN being considered true. It's canonicallised it basically. The same happens with !NAN and !!NAN. I took the same approach here, especially as (at your request) I switched to using NANs. We're not bit-wise ANDing and ORing, but as a logic statement, and if it's going to return true/false as far as a program execution goes then it's also less surprising if it returns 1/0 in value too.

hts_expr.c

htslib/hts_expr.h

daviesrob · 2022-07-22T15:39:33Z

test/test_expr.c

@@ -64,14 +64,17 @@ int lookup(void *data, char *str, char **end, hts_expr_val_t *res) {
        kputs("", ks_clear(&res->s));
    } else if (strncmp(str, "null-but-true", 13) == 0) {
        *end = str+13;
+        hts_expr_val_undef(res);


Maybe this should be done using a nan-but-true, leaving the null-but-true test as-is.

Note that nan-but-true could occur, if a floating-point tag exists with a nan in it.

I switched to using nan at your own request...

I'm not inclined to switch back again now. I don't think we really need to worry too much about real nans as it's such a corner case we can simply document it as nan being considered to be false by default (which it is in hts_expr.h).

I thought I'd already updated the expression syntax documentation too, but apparently not. (It's rather annoying that most of htslib's documentation is in samtools, but that's a different issue.)

jkbonfield · 2022-08-03T09:18:14Z

Adding documentation I'm struck by null being "false/null" while !null being "true/1". I'm wondering what the purpose was of that, but we could just do is_true ^= 1 so it becomes "true/null". This seems to be more logical, although for numeric values !2 becomes 0 and !!2 therefore becomes 1, so maybe that's why I chose to make !null to be 1 (and !!null to be 0).

Thoughts?

daviesrob · 2022-08-04T10:42:20Z

hts_expr.c

+            if (!hts_expr_val_exists(&val) || !hts_expr_val_exists(res)) {
+                hts_expr_val_undef(res);
+                hts_expr_val_free(&val);
+                return 0;


I just noticed that this short-circuits, which can lead to a parsing failure if you hit a null value:

$ ./test/test_expr 'null * 1 * 1' Unable to parse expression at null * 1 * 1

Compare:

$ ./test/test_expr 'null + 1 + 1' false nan

Just commenting out the free and return solves it, but this is slightly different to how the add code works which sets undef=1. Given nan +/- anything is nan then it won't change the d value and is_true becomes 0 in both cases.

I am wondering though if null-but-true + 1 and null-but-true * 1 should be true still. I expect there are lots of bizarre corner cases like that. Practically speaking, false is better as [NM]+1 < 11 and [NM] < 10 give the same result now, but wouldn't I think if null-but-true + 1 is true.

Looks like that fixed it.

It's possible to get NAN with is_true = 1 by making an aux tag with NAN in it: NT:f:nan. It stays true long enough for exists() and the logical-and/or operators to work. Everything else thinks it's false, which is probably OK, apart maybe from default(). If the tag exists, shouldn't default() return its value, even if it's nan?

This also leads to another slight default() oddity, as it also checks the second value with hts_expr_val_exists():

$ ./test/test_expr 'default(null, 123)' true 123 << Expected $ ./test/test_expr 'default(null-but-true, 123)' true 123 << Arguable, maybe wrong? $ ./test/test_expr 'default(null, null-but-true)' false nan << Not the second argument $ ./test/test_expr 'default(null-but-true, null)' true nan << Not the second argument

The last two look inconsistent with the first two. I think default() should only test hts_expr_val_existsT(res) and always return val if false. (Or hts_expr_val_exists(res) if null-but-true is more null than true in this context).

I'm not sure it's worth worrying overly about genuine NANs.

We took the decision to basically relegate genuine NANs when we decided that NAN was to be our "null" state instead of a null string (and something you suggested IIRC). Specifically using NAN instead of null meant nan is false in conditionals (unlike C where it's true). I was OK with that as realistically it's not something that crops up in the wild, at least not deliberately so! So I'm not concerned with getting cases like NT:f:nan working fully.

It's possible we should improve the consistency of null-but-true though and how it propagates through things. Mostly it's fine, but there's perhaps some debate over !null-but-true being false/0 rather than false/null. I can see default changes are justified. I'll think some more about that and test other cases too.

Regarding default(null-but-true, 123), currently all nulls, whether true or false, are still considered as null. The only difference is how they work in conditionals and the negation operator. I think that's fine. Although maybe what you're saying is "null-but-true" is infact a genuine NAN, and "null" is our non-NAN "null" equivalent. That's very mirky though and it doesn't make sense for the NOT operator to toggle between a null state and a nan state. I'd rather not sail in those mirky waters and think the current action of default is the correct one.

It matters when you have a tag like fN:f:nan, as for those default disagrees with exists:

$ ./test/test_view -i filter='exists([fN])' /tmp/ce#1.sam @SQ SN:CHROMOSOME_I LN:1009800 SRR065390.14978392 16 CHROMOSOME_I 2 1 27M1D73M CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA #############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC XN:i:0 XO:i:1 fN:f:nan f0:f:0 $ ./test/test_view -i filter='default([fN], 0)' /tmp/ce#1.sam @SQ SN:CHROMOSOME_I LN:1009800

Compare with a tag that isn't nan:

$ ./test/test_view -i filter='exists([f0])' /tmp/ce#1.sam @SQ SN:CHROMOSOME_I LN:1009800 SRR065390.14978392 16 CHROMOSOME_I 2 1 27M1D73M CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA #############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC XN:i:0 XO:i:1 fN:f:nan f0:f:0 $ ./test/test_view -i filter='default([f0], 0)' /tmp/ce#1.sam @SQ SN:CHROMOSOME_I LN:1009800 SRR065390.14978392 16 CHROMOSOME_I 2 1 27M1D73M CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA #############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC XN:i:0 XO:i:1 fN:f:nan f0:f:0

And one that isn't there:

$ ./test/test_view -i filter='exists([XX])' /tmp/ce#1.sam @SQ SN:CHROMOSOME_I LN:1009800 $ ./test/test_view -i filter='default([XX], 0)' /tmp/ce#1.sam @SQ SN:CHROMOSOME_I LN:1009800

I'd agree, having an explicit exists() makes things a lot easier. In fact, sambamba seems to have something similar, in the form of explicit == null and != null comparisons.

So, given we now have exists(), should [XX] always go with the value of the tag, including for truthiness? As of 1feac5e, it still seems to have a foot in both camps:

$ ./test/test_view -i filter='[XN]' /tmp/ce#1.sam @SQ SN:CHROMOSOME_I LN:1009800 SRR065390.14978392 16 CHROMOSOME_I 2 1 27M1D73M CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA #############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC XN:i:0 XO:i:1 fN:f:nan f0:f:0 $ ./test/test_view -i filter='![XN]' /tmp/ce#1.sam @SQ SN:CHROMOSOME_I LN:1009800 SRR065390.14978392 16 CHROMOSOME_I 2 1 27M1D73M CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA #############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC XN:i:0 XO:i:1 fN:f:nan f0:f:0 $ ./test/test_view -i filter='!![XN]' /tmp/ce#1.sam @SQ SN:CHROMOSOME_I LN:1009800

This matches what happened before we started fiddling with the expressions, but current develop (ad80f8e) has [XN] true, ![XN] false, !![XN] true. If we went strictly with the value, this would invert to [XN] false, ![XN] true, !![XN] false. Given that the last release was logically inconsistent here both for tags with value 0, and non-existent tags, I guess we should worry too much about the resulting breaking change?

Indeed we seem to have got back to where we started for [XN] syntax.

If I was designing it right now, with what I've learnt, then absolutely I'd require exists to check for existance and have [XN] purely being true/false based on the content of the tag. (Obviously when it doesn't exist it should fail regardless.) The old "[NM]" syntax is short, sweet, and almost certainly in use somewhere. If we changed that to mean "NM:i:0" now returns false, then we're going to break pipelines.

I also highly doubt anyone really cares about nuances of things like "MD:Z:" (empty string) and "AB:f:nan" handling. They'll just be dealing with the typical values that occur on a daily basis. So we shouldn't throw the baby out with the bath-water here and we need to make sure the most basic operations don't change.

I agree the old way of doing things wasn't entirely consistent when it came to negation, so "![NM]" and "[NM] != 0" gave odd results, but these were actually documented with workarounds:

If no comparison is used with an auxiliary tag it is taken simply to be a test for the existence of that tag. So "[NM]" will return any record containing an NM tag, even if that tag is zero (NM:i:0).

If you need to check specifically for a non-zero value then use [NM] && [NM]!=0.

That still works, but it's now been revised so the [NM] && bit is not needed as the expression is only true when the tag exists. We could improve the documentation further. I've attempted to do this in samtools/samtools#1687 but if you have further suggestions I'm happy to incorporate them.

Edit: one obvious thing is ![NM] isn't just NM==0 but also not-present. So explicitly stating to use [NM]==0 is the recommended solution ([NM] && ![NM] also works, but that's just bonkers looking!)

The documentation didn't mention the ![NM] case where the value is 0, probably because no-one had noticed it. In fact, it's the only anomalous case for unary-not, as far as I can tell. It can be made consistent with this:

diff --git a/hts_expr.c b/hts_expr.c index 9e2454b4..b86803ac 100644 --- a/hts_expr.c +++ b/hts_expr.c @@ -380,7 +380,12 @@ static int unary_expr(hts_filter_t *filt, void *data, hts_expr_sym_func *fn, if (!hts_expr_val_exists(res)) { // We can still negate undef values by toggling the // is_true override value. + res->is_str = 0; res->d = res->is_true = !res->is_true; + } else if (res->is_true) { + // Explicit true becomes false + res->is_str = 0; + res->d = res->is_true = 0; } else if (res->is_str) { res->is_str = 0; // !null = true, !"foo" = false, NOTE: !"" = false also

Note that all the existing tests still pass after applying this change.

It alluded to the case of [NM] and zero values being in the "check for zero" example, but you're correct it didn't explicitly spell out the ![NM] case. I'll need to think on the explicit true becoming false thing here a bit and do some experimentation of my own.

Ok pushed some new changes, including new and updated tests.

I'll update the documentation side in the samtools PR too.

hts_expr.c

1. We have strings, numerics, and a null/unknown state. The latter was previously only used as a mechanism for reporting the presence of absence of a tag, eg "[X1]" being the value of X1 if found, or null if not. We also had an override of is_true=1 to permit "![X1]", although sadly this wasn't robust (and still isn't) as it has a dual meaning of X1 not existing or X1 existing but being zero. Our unknown state now uses NaN semantics as defined by IEE 754 for comparisons and mathematics, but not for conditionals (see below). This means unknown+2 is still unknown (and false). Previously "0/0+2" was true/-nan, it's now false/nan. 2. NaN semantics means <, >, == and != all return false with null/nan values, even when comparing null to null. Note this changes the language results slightly. INCOMPATIBILITY: Previously "[X1] != 0" meant X1 tag exists and is zero, or X1 tag does not exist. To avoid that second clause the man page recommends "[X1] && [X1] != 0" to add a clause of checking the tag for existance first. This was illogical and almost certainly not the intended outcome. Now != will be false whenever tag [X1] does not exist so the expression is only true when the value is defined. (The man page expression still works, but has a redundant component.) Similarly "![X1]" now means X1 doesn't exist, rather than the previous interpretation of doesn't exist or is zero. 3. Fix arithmetic on non-existant aux tags. Undefined values are now considered to be false. They defined as either null strings or NaN doubles, although we use the latter ourselves. (The former are considered the same as the previous code used this and possibly external methods copying our old style). Note for compatibility with before the empty string is not false. Previously attempting to use an undefined value gave a warning message, so expressions like "[X0] + [X1] > 10" would spam when X0 or X1 were absent. Added a hts_expr_val_exists() function to simplify testing for defined values, hts_expr_val_existsT() for defined or undef-but-true (useful in conditionals), and hts_expr_val_undef() to set a variable to be undefined (used when invalidating things). 4. Added an exists() function. "[X0]" is a synonym for X0-exists, and "![X0]" as doesn't exist, but sadly previously "![X0]" was interpreted as "X0-doesn't-exist or X0==0". Given this change and for general clarity, a less ambiguous explicit exists function has been added. Note "exists" means has a known value or has an explicit is_true==1. So null-but-true is still "exists". Hence tags XX:f:nan are considered to exist. 5. Added a default(a,b) function where "a" is returned if defined, otherwise "b". As previously explained the expression "[X0] + [X1] > 10" is false whenever X0 or X1 don't exist, but using "default([X0],0) + default([X1],0) > 10" we can use the sum of the values present, or a single value if one is absent. 6. Added mathematical functions of sqrt, log, exp, and pow. 7. Null and boolean operations are largely unchanged, but for clarification they work as follow, with 0/1 also being false/true and symmetric operations. null == NaN (false) null-but-true == NaN (true) null && x == 0 null || 0 == 0 null || 1 == 1 !null == 1 !!null == 0 !null-but-true == 0 Although we're using NaN internally in order to get the arithmetic consistent, it's not good to assume we rigidly follows all NaN semantics. Specifically in C NaN is considered to be true (so "NaN && 1" is true), but for us it is false.

daviesrob · 2022-08-08T14:25:59Z

I think we've achieved some sort of logical consistency. We'll see how it goes when released into the wild.

jkbonfield force-pushed the expr_arith branch 3 times, most recently from e29899e to 8344b82 Compare July 21, 2022 12:01

jkbonfield marked this pull request as ready for review July 21, 2022 12:01

daviesrob reviewed Jul 22, 2022

View reviewed changes

daviesrob self-assigned this Jul 26, 2022

jkbonfield force-pushed the expr_arith branch from dbe6585 to 3d10eb4 Compare August 4, 2022 09:10

daviesrob reviewed Aug 4, 2022

View reviewed changes

jkbonfield force-pushed the expr_arith branch from 1611197 to 1feac5e Compare August 5, 2022 11:55

daviesrob reviewed Aug 8, 2022

View reviewed changes

hts_expr.c Show resolved Hide resolved

jkbonfield force-pushed the expr_arith branch from 0095669 to b9a809c Compare August 8, 2022 11:33

jkbonfield force-pushed the expr_arith branch from b9a809c to c270926 Compare August 8, 2022 12:43

daviesrob merged commit c270926 into samtools:develop Aug 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve filter expression language consistency #1478

Improve filter expression language consistency #1478

jkbonfield commented Jul 20, 2022 •

edited

jkbonfield commented Jul 21, 2022

daviesrob left a comment

daviesrob Jul 22, 2022

jkbonfield Aug 2, 2022

daviesrob Jul 22, 2022

jkbonfield Jul 22, 2022 •

edited

daviesrob Jul 22, 2022

jkbonfield Aug 2, 2022

jkbonfield commented Aug 3, 2022 •

edited

daviesrob Aug 4, 2022

jkbonfield Aug 4, 2022

daviesrob Aug 4, 2022

jkbonfield Aug 5, 2022

daviesrob Aug 5, 2022

daviesrob Aug 5, 2022

jkbonfield Aug 5, 2022 •

edited

daviesrob Aug 5, 2022

jkbonfield Aug 8, 2022

jkbonfield Aug 8, 2022

daviesrob commented Aug 8, 2022

	// Strings evaluate to true. An empty string is also true, but an
	// absent (null) string is false, unless overriden by is_true. An
	// empty string has kstring length of zero, but a pointer as it's
	// nul-terminated.
	if (res->is_str) {
	res->is_true \|= res->s.s != NULL;
	res->d = res->is_true;
	} else if (hts_expr_val_exists(res)) {
	res->is_true \|= res->d != 0;
	}

Improve filter expression language consistency #1478

Improve filter expression language consistency #1478

Conversation

jkbonfield commented Jul 20, 2022 • edited

jkbonfield commented Jul 21, 2022

daviesrob left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jkbonfield Jul 22, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jkbonfield commented Aug 3, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jkbonfield Aug 5, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

daviesrob commented Aug 8, 2022

jkbonfield commented Jul 20, 2022 •

edited

jkbonfield Jul 22, 2022 •

edited

jkbonfield commented Aug 3, 2022 •

edited

jkbonfield Aug 5, 2022 •

edited