Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve filter expression language consistency #1478

Merged
merged 1 commit into from Aug 8, 2022

Conversation

jkbonfield
Copy link
Contributor

@jkbonfield jkbonfield commented Jul 20, 2022

This tries to follow the principles of 3-state logic used in other systems, such as SQL. See https://en.wikipedia.org/wiki/Null_(SQL)#Comparisons_with_NULL_and_the_three-valued_logic_(3VL)

There is some controversy there and a suggestion that 4-state logic is better, which we sort of already have with the "null-but-true" concept, which may impact below.

The primary changes here are:

  • New hts_expr_val_exists and hts_expr_val_undef inline functions to check if a value exists and to clear it. These simply logic elsewhere. This fixes SAM filter expressions like [AB] / [BC] > 1 where tags AB or BC don't exist. Similar updates have been applied to many logic and mathematical operators. To avoid spammage these are silent, failing the expression without complaint due to undefined values. Additionally hts_exp_val_existsT for conditionals (&&, ||) is true if the data exists or it is true (so null-but-true works).

  • hts_expr_val_undef now uses NAN as a definition of undefined, rather than purely a false empty string. I do this because we automatically gain some consistency with how NAN works on mathematical operators which we previously didn't have. Eg sqrt(-1) == sqrt(-1) is false according to the strict comparison rules for NAN.

  • In addition to this, <, >, <=, >=, == and != also fail for undefined values (due to the use of NAN).

  • A new function exists(var) is used to test for the existance (whether it is defined) of a variable. This returns a boolean, which removes the need for the null-but-true and zero-but-true nature used in current tag querying. We still have this null-but-true concept as we've previously documented the difference between filter expression [X0] (has tag X0) and [X0] && [X0]!=0 (has tag X0 with a non-zero value). This still works, but is now much easier to write using [X0] != 0 rather than it automatically including !exists[X0] || [X0] != 0. This is therefore an incompatibility, but probably a sensible one.

    What we previously struggled to do was to check a tag does not exist. It was possible via !([X0] || [X1]==0), but that was highly cryptic and I doubt many figured it out! Now !exists([X0]) is a far simpler solution. NOTE however the old cryptic method no longer works, as null || false is still null (not false) and !null is also null due to the afforementioned logic. (I'm prepared to change this though; maybe we need null-and-false vs null-and-true so we can explicitly distinguish from null-is-unknown.)

    INCOMPATIBILITY: Note previously "[X1] != 0" meant X1 is non-zero or X1 doesn't exist. This now gets interpreted only as X1 is non-zero. I believe this change is a considerable improvement and worth breaking compatibility over.

  • A new function default(var1,var2) which returns the first value that is defined of the pair. So exists(var1)?var1:var2 (if we had a ?: operator that is).

  • New mathematic operators pow, exp, sqrt, log. These perhaps should have been their own commit, but I added them during this work purely because they're a convenient mechanism for getting +/- inf and nan into things for validation testing.

If I use the old test_expr.c test harness on the new hts_expr.c code now also passes all tests.

Fixes samtools/samtools#1677

@jkbonfield jkbonfield force-pushed the expr_arith branch 3 times, most recently from e29899e to 8344b82 Compare July 21, 2022 12:01
@jkbonfield jkbonfield marked this pull request as ready for review July 21, 2022 12:01
@jkbonfield
Copy link
Contributor Author

After some corrections and rebasing I think I'm ready for this to be reviewed.

It's almost entirely compatible with the old mechanism apart from expressions genuinely involving NaN (eg 0/0) which were considered true previously and are now false, and the [X1] != 0 example quoted above.

This means we probably want a slight update to the samtools man page too. Although what it states there is all legal, some of it is out of date and it misses the newer functionality.

Copy link
Member

@daviesrob daviesrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice that if bam_sym_lookup() were changed to return NaN for missing tags instead of NULL strings, it would go a long way to solve the problem of having to free the kstring each time hts_filter_eval2() is called.

hts_expr.c Outdated
if (strncmp(str, "exists(", 7) == 0) {
if (expression(filt, data, fn, str+7, end, res)) return -1;
func_ok = 1;
res->d = hts_expr_val_exists(res);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe set res->is_true here as well, to match other operations that return a truth value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's quite a lot of the function code which doesn't set res->is_true. Eg the existing expr_func_max, expr_func_avg, etc.

The catch all is in

htslib/hts_expr.c

Lines 879 to 888 in 8344b82

// Strings evaluate to true. An empty string is also true, but an
// absent (null) string is false, unless overriden by is_true. An
// empty string has kstring length of zero, but a pointer as it's
// nul-terminated.
if (res->is_str) {
res->is_true |= res->s.s != NULL;
res->d = res->is_true;
} else if (hts_expr_val_exists(res)) {
res->is_true |= res->d != 0;
}
which basically states any non-zero value sets is_true. (This probably means a lot of code elsewhere is redundant.)

Given we're setting a boolean here of yes (1) or no (0) then explicitly setting is_true is redundant.

hts_expr.c Show resolved Hide resolved
hts_expr.c Outdated Show resolved Hide resolved
hts_expr.c Show resolved Hide resolved
hts_expr.c Show resolved Hide resolved
res->is_str = 0;
if (!hts_expr_val_existsT(res) || !hts_expr_val_existsT(&val)) {
hts_expr_val_undef(res);
res->d = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't match the normal 3VL truth table. Is that what you want?

Note the wiki page shows how and and or can be done using min and max, via the encoding (−1, false; 0, unknown; +1, true) .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's deliberate, as stated in the commit message.

You've moved on from quoting SQL to quoting somewhere else as the source of 3VL, but the SQL one was potentially more interesting. Specifically it stated that NOT(unknown) is unknown. That is obviously logical from a boolean value view point as bool can be {0,1} and when negating it's {1,0} which is still basically all possibilities.

However it then went on to say that NOT used in a comparison would yield the opposite case, so "if (unknown)" will fail and "if (NOT(unknown))" would pass. That is also clearly useful as it means negation always takes the opposite code path even when we don't know what the value is.

I applied the same logic to the AND and OR case. I could potentially set is_true 0/1 and leave the value as unknown, but I felt it's not desparately helpful. I came to this conclusion when looking at how things like NAN work in C. "NAN || 0" is 1, not NAN, despite NAN being considered true. It's canonicallised it basically. The same happens with !NAN and !!NAN. I took the same approach here, especially as (at your request) I switched to using NANs. We're not bit-wise ANDing and ORing, but as a logic statement, and if it's going to return true/false as far as a program execution goes then it's also less surprising if it returns 1/0 in value too.

hts_expr.c Show resolved Hide resolved
htslib/hts_expr.h Outdated Show resolved Hide resolved
@@ -64,14 +64,17 @@ int lookup(void *data, char *str, char **end, hts_expr_val_t *res) {
kputs("", ks_clear(&res->s));
} else if (strncmp(str, "null-but-true", 13) == 0) {
*end = str+13;
hts_expr_val_undef(res);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be done using a nan-but-true, leaving the null-but-true test as-is.

Note that nan-but-true could occur, if a floating-point tag exists with a nan in it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched to using nan at your own request...

I'm not inclined to switch back again now. I don't think we really need to worry too much about real nans as it's such a corner case we can simply document it as nan being considered to be false by default (which it is in hts_expr.h).

I thought I'd already updated the expression syntax documentation too, but apparently not. (It's rather annoying that most of htslib's documentation is in samtools, but that's a different issue.)

@daviesrob daviesrob self-assigned this Jul 26, 2022
@jkbonfield
Copy link
Contributor Author

jkbonfield commented Aug 3, 2022

Adding documentation I'm struck by null being "false/null" while !null being "true/1". I'm wondering what the purpose was of that, but we could just do is_true ^= 1 so it becomes "true/null". This seems to be more logical, although for numeric values !2 becomes 0 and !!2 therefore becomes 1, so maybe that's why I chose to make !null to be 1 (and !!null to be 0).

Thoughts?

hts_expr.c Outdated
if (!hts_expr_val_exists(&val) || !hts_expr_val_exists(res)) {
hts_expr_val_undef(res);
hts_expr_val_free(&val);
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed that this short-circuits, which can lead to a parsing failure if you hit a null value:

$ ./test/test_expr 'null * 1 * 1'
Unable to parse expression at null * 1 * 1

Compare:

$ ./test/test_expr 'null + 1 + 1'
false	nan

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just commenting out the free and return solves it, but this is slightly different to how the add code works which sets undef=1. Given nan +/- anything is nan then it won't change the d value and is_true becomes 0 in both cases.

I am wondering though if null-but-true + 1 and null-but-true * 1 should be true still. I expect there are lots of bizarre corner cases like that. Practically speaking, false is better as [NM]+1 < 11 and [NM] < 10 give the same result now, but wouldn't I think if null-but-true + 1 is true.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like that fixed it.

It's possible to get NAN with is_true = 1 by making an aux tag with NAN in it: NT:f:nan. It stays true long enough for exists() and the logical-and/or operators to work. Everything else thinks it's false, which is probably OK, apart maybe from default(). If the tag exists, shouldn't default() return its value, even if it's nan?

This also leads to another slight default() oddity, as it also checks the second value with hts_expr_val_exists():

$ ./test/test_expr 'default(null, 123)'
true	123    << Expected
$ ./test/test_expr 'default(null-but-true, 123)'
true	123    << Arguable, maybe wrong?
$ ./test/test_expr 'default(null, null-but-true)'
false	nan    << Not the second argument
$ ./test/test_expr 'default(null-but-true, null)'
true	nan    << Not the second argument

The last two look inconsistent with the first two. I think default() should only test hts_expr_val_existsT(res) and always return val if false. (Or hts_expr_val_exists(res) if null-but-true is more null than true in this context).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's worth worrying overly about genuine NANs.

We took the decision to basically relegate genuine NANs when we decided that NAN was to be our "null" state instead of a null string (and something you suggested IIRC). Specifically using NAN instead of null meant nan is false in conditionals (unlike C where it's true). I was OK with that as realistically it's not something that crops up in the wild, at least not deliberately so! So I'm not concerned with getting cases like NT:f:nan working fully.

It's possible we should improve the consistency of null-but-true though and how it propagates through things. Mostly it's fine, but there's perhaps some debate over !null-but-true being false/0 rather than false/null. I can see default changes are justified. I'll think some more about that and test other cases too.

Regarding default(null-but-true, 123), currently all nulls, whether true or false, are still considered as null. The only difference is how they work in conditionals and the negation operator. I think that's fine. Although maybe what you're saying is "null-but-true" is infact a genuine NAN, and "null" is our non-NAN "null" equivalent. That's very mirky though and it doesn't make sense for the NOT operator to toggle between a null state and a nan state. I'd rather not sail in those mirky waters and think the current action of default is the correct one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It matters when you have a tag like fN:f:nan, as for those default disagrees with exists:

$ ./test/test_view -i filter='exists([fN])' /tmp/ce#1.sam
@SQ	SN:CHROMOSOME_I	LN:1009800
SRR065390.14978392	16	CHROMOSOME_I	2	1	27M1D73M	CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA	#############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	XN:i:0	XO:i:1	fN:f:nan	f0:f:0

$ ./test/test_view -i filter='default([fN], 0)' /tmp/ce#1.sam
@SQ	SN:CHROMOSOME_I	LN:1009800

Compare with a tag that isn't nan:

$ ./test/test_view -i filter='exists([f0])' /tmp/ce#1.sam
@SQ	SN:CHROMOSOME_I	LN:1009800
SRR065390.14978392	16	CHROMOSOME_I	2	1	27M1D73M	CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA	#############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	XN:i:0	XO:i:1	fN:f:nan	f0:f:0

$ ./test/test_view -i filter='default([f0], 0)' /tmp/ce#1.sam
@SQ	SN:CHROMOSOME_I	LN:1009800
SRR065390.14978392	16	CHROMOSOME_I	2	1	27M1D73M	CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA	#############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	XN:i:0	XO:i:1	fN:f:nan	f0:f:0

And one that isn't there:

$ ./test/test_view -i filter='exists([XX])' /tmp/ce#1.sam
@SQ	SN:CHROMOSOME_I	LN:1009800

$ ./test/test_view -i filter='default([XX], 0)' /tmp/ce#1.sam
@SQ	SN:CHROMOSOME_I	LN:1009800

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd agree, having an explicit exists() makes things a lot easier. In fact, sambamba seems to have something similar, in the form of explicit == null and != null comparisons.

So, given we now have exists(), should [XX] always go with the value of the tag, including for truthiness? As of 1feac5e, it still seems to have a foot in both camps:

$ ./test/test_view -i filter='[XN]' /tmp/ce#1.sam
@SQ	SN:CHROMOSOME_I	LN:1009800
SRR065390.14978392	16	CHROMOSOME_I	2	1	27M1D73M	CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA	#############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	XN:i:0	XO:i:1	fN:f:nan	f0:f:0

$ ./test/test_view -i filter='![XN]' /tmp/ce#1.sam
@SQ	SN:CHROMOSOME_I	LN:1009800
SRR065390.14978392	16	CHROMOSOME_I	2	1	27M1D73M	CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA	#############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	XN:i:0	XO:i:1	fN:f:nan	f0:f:0

$ ./test/test_view -i filter='!![XN]' /tmp/ce#1.sam
@SQ	SN:CHROMOSOME_I	LN:1009800

This matches what happened before we started fiddling with the expressions, but current develop (ad80f8e) has [XN] true, ![XN] false, !![XN] true. If we went strictly with the value, this would invert to [XN] false, ![XN] true, !![XN] false. Given that the last release was logically inconsistent here both for tags with value 0, and non-existent tags, I guess we should worry too much about the resulting breaking change?

Copy link
Contributor Author

@jkbonfield jkbonfield Aug 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed we seem to have got back to where we started for [XN] syntax.

If I was designing it right now, with what I've learnt, then absolutely I'd require exists to check for existance and have [XN] purely being true/false based on the content of the tag. (Obviously when it doesn't exist it should fail regardless.) The old "[NM]" syntax is short, sweet, and almost certainly in use somewhere. If we changed that to mean "NM:i:0" now returns false, then we're going to break pipelines.

I also highly doubt anyone really cares about nuances of things like "MD:Z:" (empty string) and "AB:f:nan" handling. They'll just be dealing with the typical values that occur on a daily basis. So we shouldn't throw the baby out with the bath-water here and we need to make sure the most basic operations don't change.

I agree the old way of doing things wasn't entirely consistent when it came to negation, so "![NM]" and "[NM] != 0" gave odd results, but these were actually documented with workarounds:

If no comparison is used with an auxiliary tag it is taken simply to be a test for the existence of that tag. So "[NM]" will return any record containing an NM tag, even if that tag is zero (NM:i:0).

If you need to check specifically for a non-zero value then use [NM] && [NM]!=0.

That still works, but it's now been revised so the [NM] && bit is not needed as the expression is only true when the tag exists. We could improve the documentation further. I've attempted to do this in samtools/samtools#1687 but if you have further suggestions I'm happy to incorporate them.

Edit: one obvious thing is ![NM] isn't just NM==0 but also not-present. So explicitly stating to use [NM]==0 is the recommended solution ([NM] && ![NM] also works, but that's just bonkers looking!)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation didn't mention the ![NM] case where the value is 0, probably because no-one had noticed it. In fact, it's the only anomalous case for unary-not, as far as I can tell. It can be made consistent with this:

diff --git a/hts_expr.c b/hts_expr.c
index 9e2454b4..b86803ac 100644
--- a/hts_expr.c
+++ b/hts_expr.c
@@ -380,7 +380,12 @@ static int unary_expr(hts_filter_t *filt, void *data, hts_expr_sym_func *fn,
         if (!hts_expr_val_exists(res)) {
             // We can still negate undef values by toggling the
             // is_true override value.
+            res->is_str = 0;
             res->d = res->is_true = !res->is_true;
+        } else if (res->is_true) {
+            // Explicit true becomes false
+            res->is_str = 0;
+            res->d = res->is_true = 0;
         } else if (res->is_str) {
             res->is_str = 0;
             // !null = true, !"foo" = false, NOTE: !"" = false also

Note that all the existing tests still pass after applying this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It alluded to the case of [NM] and zero values being in the "check for zero" example, but you're correct it didn't explicitly spell out the ![NM] case. I'll need to think on the explicit true becoming false thing here a bit and do some experimentation of my own.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok pushed some new changes, including new and updated tests.

I'll update the documentation side in the samtools PR too.

hts_expr.c Show resolved Hide resolved
1. We have strings, numerics, and a null/unknown state.  The latter
was previously only used as a mechanism for reporting the presence of
absence of a tag, eg "[X1]" being the value of X1 if found, or null if
not.  We also had an override of is_true=1 to permit "![X1]", although
sadly this wasn't robust (and still isn't) as it has a dual meaning of
X1 not existing or X1 existing but being zero.

Our unknown state now uses NaN semantics as defined by IEE 754 for
comparisons and mathematics, but not for conditionals (see
below). This means unknown+2 is still unknown (and false).  Previously
"0/0+2" was true/-nan, it's now false/nan.

2. NaN semantics means <, >, == and != all return false with null/nan
values, even when comparing null to null.  Note this changes the
language results slightly.

INCOMPATIBILITY: Previously "[X1] != 0" meant X1 tag exists and is
zero, or X1 tag does not exist.  To avoid that second clause the man
page recommends "[X1] && [X1] != 0" to add a clause of checking the
tag for existance first.

This was illogical and almost certainly not the intended outcome.  Now
!= will be false whenever tag [X1] does not exist so the expression is
only true when the value is defined.  (The man page expression still
works, but has a redundant component.)

Similarly "![X1]" now means X1 doesn't exist, rather than the previous
interpretation of doesn't exist or is zero.

3. Fix arithmetic on non-existant aux tags.

Undefined values are now considered to be false.  They defined as
either null strings or NaN doubles, although we use the latter
ourselves.  (The former are considered the same as the previous code
used this and possibly external methods copying our old style).  Note
for compatibility with before the empty string is not false.

Previously attempting to use an undefined value gave a warning
message, so expressions like "[X0] + [X1] > 10" would spam when X0 or
X1 were absent.

Added a hts_expr_val_exists() function to simplify testing for defined
values, hts_expr_val_existsT() for defined or undef-but-true (useful
in conditionals), and hts_expr_val_undef() to set a variable to be
undefined (used when invalidating things).

4. Added an exists() function.  "[X0]" is a synonym for
X0-exists, and "![X0]" as doesn't exist, but sadly previously "![X0]"
was interpreted as "X0-doesn't-exist or X0==0".  Given this change and
for general clarity, a less ambiguous explicit exists function has
been added.

Note "exists" means has a known value or has an explicit is_true==1.
So null-but-true is still "exists".  Hence tags XX:f:nan are
considered to exist.

5. Added a default(a,b) function where "a" is returned if defined,
otherwise "b".  As previously explained the expression "[X0] + [X1] > 10"
is false whenever X0 or X1 don't exist, but using
"default([X0],0) + default([X1],0) > 10" we can use the sum of the
values present, or a single value if one is absent.

6. Added mathematical functions of sqrt, log, exp, and pow.

7. Null and boolean operations are largely unchanged, but for
clarification they work as follow, with 0/1 also being false/true and
symmetric operations.

null           == NaN (false)
null-but-true  == NaN (true)
null && x      == 0
null || 0      == 0
null || 1      == 1
!null          == 1
!!null         == 0
!null-but-true == 0

Although we're using NaN internally in order to get the arithmetic
consistent, it's not good to assume we rigidly follows all NaN
semantics.  Specifically in C NaN is considered to be true (so "NaN &&
1" is true), but for us it is false.
@daviesrob
Copy link
Member

I think we've achieved some sort of logical consistency. We'll see how it goes when released into the wild.

@daviesrob daviesrob merged commit c270926 into samtools:develop Aug 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

How to apply tag division?
2 participants