New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test suite fails on some architectures #389

Open
0xaf1f opened this Issue Mar 10, 2016 · 13 comments

Comments

Projects
None yet
6 participants
@0xaf1f

0xaf1f commented Mar 10, 2016

When building bcftools on architectures where htslib is available, we see test failures on 32-bit architectures as well as some others.

This was picked up by Debian's QA in version 1.2, but apparently persists in the latest release. The Debian bug report is at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=812268

The first post there (partially reproduced below) explains the nature of the problem.

Builds of bcftools for several architectures failed with test suite
errors, as detailed at

https://buildd.debian.org/status/logs.php?pkg=bcftools&ver=1.2-1

Specifically:

* On arm64, armel, and ppc64el, test_vcf_consensus* failed with usage
  errors(!).

* On i386 and kfreebsd-i386, 57 other test cases failed with output
  differences (perhaps related to floating-point unit idiosyncracies --
  it normally works with double-extended precision internally, rather
  than rounding at every step).

The updated link to the build logs is https://buildd.debian.org/status/logs.php?pkg=bcftools&ver=1.3-2
I'm sorry for not forwarding this report earlier.

@jrandall jrandall self-assigned this Mar 14, 2016

@jrandall

This comment has been minimized.

Show comment
Hide comment
@jrandall

jrandall Mar 14, 2016

Contributor

I have confirmed the second issue on develop (using lenny-dev32 at Sanger).

Here is an example of the differences (in this case from the test_vcf_merge test) after running make test:

$ diff -y --suppress-common-lines ./test/merge.abc.out ./test/merge.abc.out.new
1       3000150 .       C       T       59.2    PASS    AN=4; | 1       3000150 .       C       T       nan     PASS    AN=4;
1       3000151 .       C       T       59.2    PASS    AN=4; | 1       3000151 .       C       T       nan     PASS    AN=4;
1       3062915 id3D    GTTT    G       84.6    q10;q20 INDEL | 1       3062915 id3D    GTTT    G       nan     q10;q20 INDEL
1       3062915 id1D;id2D       GTT     GT,G    999     q20;q | 1       3062915 id1D;id2D       GTT     GT,G    nan     q20;q
1       3062915 idSNP   G       T,C,A   419     test;q20      | 1       3062915 idSNP   G       T,C,A   nan     test;q20
1       3106154 .       CAAAA   CA,C    342     PASS    DP=15 | 1       3106154 .       CAAAA   CA,C    nan     PASS    DP=15
1       3106154 .       C       CT      459     PASS    AN=8; | 1       3106154 .       C       CT      nan     PASS    AN=8;
1       3106154 .       C       T       999     PASS    DP=15 | 1       3106154 .       C       T       nan     PASS    DP=15
1       3157410 .       GAC     GC,G    90.6    q10     DP=11 | 1       3157410 .       GAC     GC,G    nan     q10     DP=11
1       3157410 .       G       T       46.7    q10     AN=4; | 1       3157410 .       G       T       nan     q10     AN=4;
1       3162006 .       GAA     G,GA    238     PASS    DP=19 | 1       3162006 .       GAA     G,GA    nan     PASS    DP=19
1       3177144 .       G       T       999     PASS    DP=24 | 1       3177144 .       G       T       nan     PASS    DP=24
1       3177144 .       G       .       45      PASS    AN=4  | 1       3177144 .       G       .       nan     PASS    AN=4
1       3177144 .       GT      G       999     PASS    DP=24 | 1       3177144 .       GT      G       nan     PASS    DP=24
1       3184885 .       TAAAA   TA,T    61.5    PASS    DP=16 | 1       3184885 .       TAAAA   TA,T    nan     PASS    DP=16
2       3188209 .       GA      G       41.5    .       DP=15 | 2       3188209 .       GA      G       nan     .       DP=15
2       3199812 .       G       GTT,GT  291     PASS    AN=8; | 2       3199812 .       G       GTT,GT  nan     PASS    AN=8;
3       3199812 .       GA      GTT,GT  17.5    PASS    DP=19 | 3       3199812 .       GA      GTT,GT  nan     PASS    DP=19
3       3212016 .       CTT     C,CT    79      PASS    AN=8; | 3       3212016 .       CTT     C,CT    nan     PASS    AN=8;
4       3212016 .       CTT     C       999     q20     DP=15 | 4       3212016 .       CTT     C       nan     q20     DP=15
4       3258448 .       TACACACAC       T       .       PASS  | 4       3258448 .       TACACACAC       T       nan     PASS

On a 32-bit system, the QUAL values are being output as nan for some reason.

Contributor

jrandall commented Mar 14, 2016

I have confirmed the second issue on develop (using lenny-dev32 at Sanger).

Here is an example of the differences (in this case from the test_vcf_merge test) after running make test:

$ diff -y --suppress-common-lines ./test/merge.abc.out ./test/merge.abc.out.new
1       3000150 .       C       T       59.2    PASS    AN=4; | 1       3000150 .       C       T       nan     PASS    AN=4;
1       3000151 .       C       T       59.2    PASS    AN=4; | 1       3000151 .       C       T       nan     PASS    AN=4;
1       3062915 id3D    GTTT    G       84.6    q10;q20 INDEL | 1       3062915 id3D    GTTT    G       nan     q10;q20 INDEL
1       3062915 id1D;id2D       GTT     GT,G    999     q20;q | 1       3062915 id1D;id2D       GTT     GT,G    nan     q20;q
1       3062915 idSNP   G       T,C,A   419     test;q20      | 1       3062915 idSNP   G       T,C,A   nan     test;q20
1       3106154 .       CAAAA   CA,C    342     PASS    DP=15 | 1       3106154 .       CAAAA   CA,C    nan     PASS    DP=15
1       3106154 .       C       CT      459     PASS    AN=8; | 1       3106154 .       C       CT      nan     PASS    AN=8;
1       3106154 .       C       T       999     PASS    DP=15 | 1       3106154 .       C       T       nan     PASS    DP=15
1       3157410 .       GAC     GC,G    90.6    q10     DP=11 | 1       3157410 .       GAC     GC,G    nan     q10     DP=11
1       3157410 .       G       T       46.7    q10     AN=4; | 1       3157410 .       G       T       nan     q10     AN=4;
1       3162006 .       GAA     G,GA    238     PASS    DP=19 | 1       3162006 .       GAA     G,GA    nan     PASS    DP=19
1       3177144 .       G       T       999     PASS    DP=24 | 1       3177144 .       G       T       nan     PASS    DP=24
1       3177144 .       G       .       45      PASS    AN=4  | 1       3177144 .       G       .       nan     PASS    AN=4
1       3177144 .       GT      G       999     PASS    DP=24 | 1       3177144 .       GT      G       nan     PASS    DP=24
1       3184885 .       TAAAA   TA,T    61.5    PASS    DP=16 | 1       3184885 .       TAAAA   TA,T    nan     PASS    DP=16
2       3188209 .       GA      G       41.5    .       DP=15 | 2       3188209 .       GA      G       nan     .       DP=15
2       3199812 .       G       GTT,GT  291     PASS    AN=8; | 2       3199812 .       G       GTT,GT  nan     PASS    AN=8;
3       3199812 .       GA      GTT,GT  17.5    PASS    DP=19 | 3       3199812 .       GA      GTT,GT  nan     PASS    DP=19
3       3212016 .       CTT     C,CT    79      PASS    AN=8; | 3       3212016 .       CTT     C,CT    nan     PASS    AN=8;
4       3212016 .       CTT     C       999     q20     DP=15 | 4       3212016 .       CTT     C       nan     q20     DP=15
4       3258448 .       TACACACAC       T       .       PASS  | 4       3258448 .       TACACACAC       T       nan     PASS

On a 32-bit system, the QUAL values are being output as nan for some reason.

@jmarshall

This comment has been minimized.

Show comment
Hide comment
@jmarshall

jmarshall Mar 14, 2016

Member

For example for test/query.12.out, I think what we're seeing here is bcf_float_missing, which is a signalling NaN, being turned into a quiet NaN somewhere along the line, and then not being recognised as missing by bcf_float_is_missing().

Member

jmarshall commented Mar 14, 2016

For example for test/query.12.out, I think what we're seeing here is bcf_float_missing, which is a signalling NaN, being turned into a quiet NaN somewhere along the line, and then not being recognised as missing by bcf_float_is_missing().

@jrandall

This comment has been minimized.

Show comment
Hide comment
@jrandall

jrandall Mar 14, 2016

Contributor

All tests pass on a 32-bit debian system when omitting -O or using -O0 to compile both bcftools and htslib.

htslib bcftools failed tests
-O2 -O2 93
-O2 -O1 93
-O2 -O0 79
-O1 -O2 93
-O1 -O1 93
-O1 -O0 79
-O0 -O2 24
-O0 -O1 24
-O0 -O0 0
Contributor

jrandall commented Mar 14, 2016

All tests pass on a 32-bit debian system when omitting -O or using -O0 to compile both bcftools and htslib.

htslib bcftools failed tests
-O2 -O2 93
-O2 -O1 93
-O2 -O0 79
-O1 -O2 93
-O1 -O1 93
-O1 -O0 79
-O0 -O2 24
-O0 -O1 24
-O0 -O0 0
@jrandall

This comment has been minimized.

Show comment
Hide comment
@jrandall

jrandall Mar 14, 2016

Contributor

It seems like there are 24 tests that fail on 32-bit systems due to issues with optimisations in bcftools and 79 tests that fail due to issues with optimisations in htslib, with some overlap between them.

The 24 tests that fail for (htslib -O0 bcftools -O1) and (htslib -O0 bcftools -O2) are the same:
12 test_vcf_merge tests - QUAL column is filled with . instead of numeric values
6 test_vcf_query tests - various values are set to nan instead of . (e.g. IAF and IF)
6 test_vcf_norm tests - QUAL column is filled with . instead of numeric values

Contributor

jrandall commented Mar 14, 2016

It seems like there are 24 tests that fail on 32-bit systems due to issues with optimisations in bcftools and 79 tests that fail due to issues with optimisations in htslib, with some overlap between them.

The 24 tests that fail for (htslib -O0 bcftools -O1) and (htslib -O0 bcftools -O2) are the same:
12 test_vcf_merge tests - QUAL column is filled with . instead of numeric values
6 test_vcf_query tests - various values are set to nan instead of . (e.g. IAF and IF)
6 test_vcf_norm tests - QUAL column is filled with . instead of numeric values

@daviesrob

This comment has been minimized.

Show comment
Hide comment
@daviesrob

daviesrob Mar 15, 2016

Member

I've tracked down the reason for the failures like the ones in the armel build where consensus prints out a usage message. It's due to char being unsigned on arm (and presumably some other platforms).

At https://github.com/samtools/bcftools/blob/develop/consensus.c#L627 the result of getopt_long is put into a char. On arm, this means c is set to 0xff when getopt_long has finished, which triggers the default case of the switch.

The solution is trivial, just change line 626 so c is an int.

Member

daviesrob commented Mar 15, 2016

I've tracked down the reason for the failures like the ones in the armel build where consensus prints out a usage message. It's due to char being unsigned on arm (and presumably some other platforms).

At https://github.com/samtools/bcftools/blob/develop/consensus.c#L627 the result of getopt_long is put into a char. On arm, this means c is set to 0xff when getopt_long has finished, which triggers the default case of the switch.

The solution is trivial, just change line 626 so c is an int.

@jmarshall

This comment has been minimized.

Show comment
Hide comment
@jmarshall

jmarshall Mar 15, 2016

Member

@daviesrob: Thanks. Would be caught with -Wconversion, sigh. I'll take care of it.

Member

jmarshall commented Mar 15, 2016

@daviesrob: Thanks. Would be caught with -Wconversion, sigh. I'll take care of it.

@jkbonfield

This comment has been minimized.

Show comment
Hide comment
@jkbonfield

jkbonfield Mar 15, 2016

Contributor

On Tue, Mar 15, 2016 at 08:41:58AM -0700, John Marshall wrote:

@daviesrob: Thanks. Would be caught with -Wconversion, sigh. I'll take care of it.

We can also try making one of our travis builds compile with -funsigned-char.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

Contributor

jkbonfield commented Mar 15, 2016

On Tue, Mar 15, 2016 at 08:41:58AM -0700, John Marshall wrote:

@daviesrob: Thanks. Would be caught with -Wconversion, sigh. I'll take care of it.

We can also try making one of our travis builds compile with -funsigned-char.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@jrandall

This comment has been minimized.

Show comment
Hide comment
@jrandall

jrandall Mar 15, 2016

Contributor

Looks like @jmarshall is right about the nan issue. The float assignment on line: https://github.com/samtools/htslib/blob/develop/htslib/vcf.h#L786 appears to change the value of the float from 0x7f800001 to 0x7fc00001 but only when optimisations are turned on. We could probably "fix" this by doing the assignment a different way, but then still any attempt to assign to a qual would potentially break the signal that it is missing. I think the BCF spec needs to be changed so that only the lower bits of the NaN are examined.

Contributor

jrandall commented Mar 15, 2016

Looks like @jmarshall is right about the nan issue. The float assignment on line: https://github.com/samtools/htslib/blob/develop/htslib/vcf.h#L786 appears to change the value of the float from 0x7f800001 to 0x7fc00001 but only when optimisations are turned on. We could probably "fix" this by doing the assignment a different way, but then still any attempt to assign to a qual would potentially break the signal that it is missing. I think the BCF spec needs to be changed so that only the lower bits of the NaN are examined.

jmarshall added a commit that referenced this issue Mar 15, 2016

getopt_long() returns int, not char
Much like fgetc() returning EOF, we need to store getopt_long()'s
result in an int so that we can reliably check the -1 return value.
Hat tip @daviesrob; fixes the usage errors part of #389.
@daviesrob

This comment has been minimized.

Show comment
Hide comment
@daviesrob

daviesrob Mar 15, 2016

Member

@jmarshall , @jrandall : Ignoring the wisdom of using NAN payloads for anything, couldn't this be fixed by a bit of bit-masking in bcf_float_is_missing and bcf_float_is_vector_end? Something like:

return (u.i & 0xffbfffff)==bcf_float_missing ? 1 : 0;

bcf_enc_vfloat would also need to detect NANs and bit-mask them so they are written in the correct form to comply with the specification. (Possibly also other functions I haven't found yet as well?)

Member

daviesrob commented Mar 15, 2016

@jmarshall , @jrandall : Ignoring the wisdom of using NAN payloads for anything, couldn't this be fixed by a bit of bit-masking in bcf_float_is_missing and bcf_float_is_vector_end? Something like:

return (u.i & 0xffbfffff)==bcf_float_missing ? 1 : 0;

bcf_enc_vfloat would also need to detect NANs and bit-mask them so they are written in the correct form to comply with the specification. (Possibly also other functions I haven't found yet as well?)

@jrandall

This comment has been minimized.

Show comment
Hide comment
@jrandall

jrandall Mar 15, 2016

Contributor

Yes, I suppose we could patch this all around so that the NaN quiet bit is ignored for relevant comparisons and forced off when writing nans (perhaps that would also be an opportunity to handle unsupported nans).

Contributor

jrandall commented Mar 15, 2016

Yes, I suppose we could patch this all around so that the NaN quiet bit is ignored for relevant comparisons and forced off when writing nans (perhaps that would also be an opportunity to handle unsupported nans).

@0xaf1f

This comment has been minimized.

Show comment
Hide comment
@0xaf1f

0xaf1f Apr 13, 2016

Are there any updates on this? While I'm admittedly not a user on 32-bit x86, it's one of the more mainstream architectures that Debian supports and is more likely to have a non-zero user base that's affected by this problem---especially if they didn't run the test suite.

0xaf1f commented Apr 13, 2016

Are there any updates on this? While I'm admittedly not a user on 32-bit x86, it's one of the more mainstream architectures that Debian supports and is more likely to have a non-zero user base that's affected by this problem---especially if they didn't run the test suite.

@jmarshall

This comment has been minimized.

Show comment
Hide comment
@jmarshall

jmarshall Apr 13, 2016

Member

IMHO the best approach would be for the BCF format to move away from signalling NaNs — see samtools/hts-specs#145.

AFAIK no-one's yet worked on changing bcf_enc_vfloat() and probably other functions to ensure that they write out the currently-proper SNaN values for missing et al. My suspicion is that this is going to be rather nasty and invasive, which is why I'd prefer to fix the format.

Member

jmarshall commented Apr 13, 2016

IMHO the best approach would be for the BCF format to move away from signalling NaNs — see samtools/hts-specs#145.

AFAIK no-one's yet worked on changing bcf_enc_vfloat() and probably other functions to ensure that they write out the currently-proper SNaN values for missing et al. My suspicion is that this is going to be rather nasty and invasive, which is why I'd prefer to fix the format.

@rekado

This comment has been minimized.

Show comment
Hide comment
@rekado

rekado Oct 4, 2016

I'm also seeing test failures when building bcftools on i686 and armhf for GNU Guix. The full build log for the i686 build can be seen here: https://hydra.gnu.org/build/1479605

rekado commented Oct 4, 2016

I'm also seeing test failures when building bcftools on i686 and armhf for GNU Guix. The full build log for the i686 build can be seen here: https://hydra.gnu.org/build/1479605

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment