Test suite fails on some architectures #389

Open
afif-elghraoui opened this Issue Mar 10, 2016 · 13 comments

Projects

None yet

6 participants

@afif-elghraoui

When building bcftools on architectures where htslib is available, we see test failures on 32-bit architectures as well as some others.

This was picked up by Debian's QA in version 1.2, but apparently persists in the latest release. The Debian bug report is at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=812268

The first post there (partially reproduced below) explains the nature of the problem.

Builds of bcftools for several architectures failed with test suite
errors, as detailed at

https://buildd.debian.org/status/logs.php?pkg=bcftools&ver=1.2-1

Specifically:

* On arm64, armel, and ppc64el, test_vcf_consensus* failed with usage
  errors(!).

* On i386 and kfreebsd-i386, 57 other test cases failed with output
  differences (perhaps related to floating-point unit idiosyncracies --
  it normally works with double-extended precision internally, rather
  than rounding at every step).

The updated link to the build logs is https://buildd.debian.org/status/logs.php?pkg=bcftools&ver=1.3-2
I'm sorry for not forwarding this report earlier.

@jrandall jrandall self-assigned this Mar 14, 2016
@jrandall
Contributor

I have confirmed the second issue on develop (using lenny-dev32 at Sanger).

Here is an example of the differences (in this case from the test_vcf_merge test) after running make test:

$ diff -y --suppress-common-lines ./test/merge.abc.out ./test/merge.abc.out.new
1       3000150 .       C       T       59.2    PASS    AN=4; | 1       3000150 .       C       T       nan     PASS    AN=4;
1       3000151 .       C       T       59.2    PASS    AN=4; | 1       3000151 .       C       T       nan     PASS    AN=4;
1       3062915 id3D    GTTT    G       84.6    q10;q20 INDEL | 1       3062915 id3D    GTTT    G       nan     q10;q20 INDEL
1       3062915 id1D;id2D       GTT     GT,G    999     q20;q | 1       3062915 id1D;id2D       GTT     GT,G    nan     q20;q
1       3062915 idSNP   G       T,C,A   419     test;q20      | 1       3062915 idSNP   G       T,C,A   nan     test;q20
1       3106154 .       CAAAA   CA,C    342     PASS    DP=15 | 1       3106154 .       CAAAA   CA,C    nan     PASS    DP=15
1       3106154 .       C       CT      459     PASS    AN=8; | 1       3106154 .       C       CT      nan     PASS    AN=8;
1       3106154 .       C       T       999     PASS    DP=15 | 1       3106154 .       C       T       nan     PASS    DP=15
1       3157410 .       GAC     GC,G    90.6    q10     DP=11 | 1       3157410 .       GAC     GC,G    nan     q10     DP=11
1       3157410 .       G       T       46.7    q10     AN=4; | 1       3157410 .       G       T       nan     q10     AN=4;
1       3162006 .       GAA     G,GA    238     PASS    DP=19 | 1       3162006 .       GAA     G,GA    nan     PASS    DP=19
1       3177144 .       G       T       999     PASS    DP=24 | 1       3177144 .       G       T       nan     PASS    DP=24
1       3177144 .       G       .       45      PASS    AN=4  | 1       3177144 .       G       .       nan     PASS    AN=4
1       3177144 .       GT      G       999     PASS    DP=24 | 1       3177144 .       GT      G       nan     PASS    DP=24
1       3184885 .       TAAAA   TA,T    61.5    PASS    DP=16 | 1       3184885 .       TAAAA   TA,T    nan     PASS    DP=16
2       3188209 .       GA      G       41.5    .       DP=15 | 2       3188209 .       GA      G       nan     .       DP=15
2       3199812 .       G       GTT,GT  291     PASS    AN=8; | 2       3199812 .       G       GTT,GT  nan     PASS    AN=8;
3       3199812 .       GA      GTT,GT  17.5    PASS    DP=19 | 3       3199812 .       GA      GTT,GT  nan     PASS    DP=19
3       3212016 .       CTT     C,CT    79      PASS    AN=8; | 3       3212016 .       CTT     C,CT    nan     PASS    AN=8;
4       3212016 .       CTT     C       999     q20     DP=15 | 4       3212016 .       CTT     C       nan     q20     DP=15
4       3258448 .       TACACACAC       T       .       PASS  | 4       3258448 .       TACACACAC       T       nan     PASS

On a 32-bit system, the QUAL values are being output as nan for some reason.

@jmarshall
Member

For example for test/query.12.out, I think what we're seeing here is bcf_float_missing, which is a signalling NaN, being turned into a quiet NaN somewhere along the line, and then not being recognised as missing by bcf_float_is_missing().

@jrandall
Contributor

All tests pass on a 32-bit debian system when omitting -O or using -O0 to compile both bcftools and htslib.

htslib bcftools failed tests
-O2 -O2 93
-O2 -O1 93
-O2 -O0 79
-O1 -O2 93
-O1 -O1 93
-O1 -O0 79
-O0 -O2 24
-O0 -O1 24
-O0 -O0 0
@jrandall
Contributor

It seems like there are 24 tests that fail on 32-bit systems due to issues with optimisations in bcftools and 79 tests that fail due to issues with optimisations in htslib, with some overlap between them.

The 24 tests that fail for (htslib -O0 bcftools -O1) and (htslib -O0 bcftools -O2) are the same:
12 test_vcf_merge tests - QUAL column is filled with . instead of numeric values
6 test_vcf_query tests - various values are set to nan instead of . (e.g. IAF and IF)
6 test_vcf_norm tests - QUAL column is filled with . instead of numeric values

@daviesrob
Member

I've tracked down the reason for the failures like the ones in the armel build where consensus prints out a usage message. It's due to char being unsigned on arm (and presumably some other platforms).

At https://github.com/samtools/bcftools/blob/develop/consensus.c#L627 the result of getopt_long is put into a char. On arm, this means c is set to 0xff when getopt_long has finished, which triggers the default case of the switch.

The solution is trivial, just change line 626 so c is an int.

@jmarshall
Member

@daviesrob: Thanks. Would be caught with -Wconversion, sigh. I'll take care of it.

@jkbonfield
Contributor

On Tue, Mar 15, 2016 at 08:41:58AM -0700, John Marshall wrote:

@daviesrob: Thanks. Would be caught with -Wconversion, sigh. I'll take care of it.

We can also try making one of our travis builds compile with -funsigned-char.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@jrandall
Contributor

Looks like @jmarshall is right about the nan issue. The float assignment on line: https://github.com/samtools/htslib/blob/develop/htslib/vcf.h#L786 appears to change the value of the float from 0x7f800001 to 0x7fc00001 but only when optimisations are turned on. We could probably "fix" this by doing the assignment a different way, but then still any attempt to assign to a qual would potentially break the signal that it is missing. I think the BCF spec needs to be changed so that only the lower bits of the NaN are examined.

@jmarshall jmarshall added a commit that referenced this issue Mar 15, 2016
@jmarshall jmarshall getopt_long() returns int, not char
Much like fgetc() returning EOF, we need to store getopt_long()'s
result in an int so that we can reliably check the -1 return value.
Hat tip @daviesrob; fixes the usage errors part of #389.
afecb4e
@daviesrob
Member

@jmarshall , @jrandall : Ignoring the wisdom of using NAN payloads for anything, couldn't this be fixed by a bit of bit-masking in bcf_float_is_missing and bcf_float_is_vector_end? Something like:

return (u.i & 0xffbfffff)==bcf_float_missing ? 1 : 0;

bcf_enc_vfloat would also need to detect NANs and bit-mask them so they are written in the correct form to comply with the specification. (Possibly also other functions I haven't found yet as well?)

@jrandall
Contributor

Yes, I suppose we could patch this all around so that the NaN quiet bit is ignored for relevant comparisons and forced off when writing nans (perhaps that would also be an opportunity to handle unsupported nans).

@afif-elghraoui

Are there any updates on this? While I'm admittedly not a user on 32-bit x86, it's one of the more mainstream architectures that Debian supports and is more likely to have a non-zero user base that's affected by this problem---especially if they didn't run the test suite.

@jmarshall jmarshall referenced this issue in samtools/hts-specs Apr 13, 2016
Open

BCF's use of signalling NaNs #145

@jmarshall
Member

IMHO the best approach would be for the BCF format to move away from signalling NaNs — see samtools/hts-specs#145.

AFAIK no-one's yet worked on changing bcf_enc_vfloat() and probably other functions to ensure that they write out the currently-proper SNaN values for missing et al. My suspicion is that this is going to be rather nasty and invasive, which is why I'd prefer to fix the format.

@rekado
rekado commented Oct 4, 2016

I'm also seeing test failures when building bcftools on i686 and armhf for GNU Guix. The full build log for the i686 build can be seen here: https://hydra.gnu.org/build/1479605

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment