New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing test_vcf_api/test_vcf_sweep tests on i386 #565

Open
satta opened this Issue Jul 15, 2017 · 12 comments

Comments

Projects
None yet
6 participants
@satta

satta commented Jul 15, 2017

I'm forwarding Debian bug 865012 (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=865012), which mentions failing tests on i386. I can reproduce these on a plain Debian stretch i386 install with the most recent htslib release 1.5:

vagrant@debian-9rc1-i386:/vagrant/htslib$ make test
test/hts_endian
test/fieldarith test/fieldarith.sam
test/hfile
test/test_bgzf test/bgziptest.txt
cd test/tabix && ./test-tabix.sh tabix.tst
Testing tabix...
PASS : $tabix -f -p vcf vcf_file.tbi.tmp.vcf.gz
[...]
test_vcf_api:
	/vagrant/htslib/test/test-vcf-api /tmp/KuypJQksGq/test-vcf-api.bcf
bcf_get_format_float didn't produce the expected output.

.. failed ...

test_vcf_sweep:
	/vagrant/htslib/test/test-vcf-sweep /tmp/KuypJQksGq/test-vcf-api.bcf

	The outputs differ:
		/vagrant/htslib/test/test-vcf-sweep.out
		/vagrant/htslib/test/test-vcf-sweep.out.new
.. failed ...

It looks like the test_vcf_sweep failure is a consequence of the test_vcf_api test failing. That failure seems to be caused by test-vcf-api.c:360:

           if (ret != 4 ||
            count < ret ||
            !bcf_float_is_missing(values[0]) ||
            values[1] != 47.11f ||
            !bcf_float_is_vector_end(values[2]) ||
            !bcf_float_is_vector_end(values[3]))
        {
            fprintf(stderr, "bcf_get_format_float didn't produce the expected output.\n");
            exit(-1);
        }

in particular the bcf_float_is_vector_end() checks which are failing. Values read do not match the constants defined in vcf.c for bcf_float_vector_end.
Apparently there is also some amount of uninitialized memory handled while writing the formatted file in the first place:

vagrant@debian-9rc1-i386:/vagrant/htslib$ valgrind ./test/test-vcf-api test-vcf-api.bcf
==25115== Memcheck, a memory error detector
==25115== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==25115== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==25115== Command: ./test/test-vcf-api test-vcf-api.bcf
==25115==
==25115== Use of uninitialised value of size 4
==25115==    at 0x4843683: crc32 (in /lib/i386-linux-gnu/libz.so.1.2.8)
==25115==    by 0x142274: bgzf_compress (bgzf.c:377)
==25115==    by 0x1423E2: deflate_block (bgzf.c:411)
==25115==    by 0x144A6F: bgzf_flush (bgzf.c:1351)
==25115==    by 0x144FF8: bgzf_close (bgzf.c:1435)
==25115==    by 0x11376A: hts_close (hts.c:924)
==25115==    by 0x10B046: write_format_values (test-vcf-api.c:339)
==25115==    by 0x10A16D: test_get_format_values (test-vcf-api.c:381)
==25115==    by 0x10A16D: main (test-vcf-api.c:390)
==25115==
==25115== Use of uninitialised value of size 4
==25115==    at 0x4843686: crc32 (in /lib/i386-linux-gnu/libz.so.1.2.8)
==25115==    by 0x142274: bgzf_compress (bgzf.c:377)
==25115==    by 0x1423E2: deflate_block (bgzf.c:411)
==25115==    by 0x144A6F: bgzf_flush (bgzf.c:1351)
==25115==    by 0x144FF8: bgzf_close (bgzf.c:1435)
==25115==    by 0x11376A: hts_close (hts.c:924)
==25115==    by 0x10B046: write_format_values (test-vcf-api.c:339)
==25115==    by 0x10A16D: test_get_format_values (test-vcf-api.c:381)
==25115==    by 0x10A16D: main (test-vcf-api.c:390)
==25115==
==25115== Use of uninitialised value of size 4
==25115==    at 0x484368D: crc32 (in /lib/i386-linux-gnu/libz.so.1.2.8)
==25115==    by 0x142274: bgzf_compress (bgzf.c:377)
==25115==    by 0x1423E2: deflate_block (bgzf.c:411)
==25115==    by 0x144A6F: bgzf_flush (bgzf.c:1351)
==25115==    by 0x144FF8: bgzf_close (bgzf.c:1435)
==25115==    by 0x11376A: hts_close (hts.c:924)
==25115==    by 0x10B046: write_format_values (test-vcf-api.c:339)
==25115==    by 0x10A16D: test_get_format_values (test-vcf-api.c:381)
==25115==    by 0x10A16D: main (test-vcf-api.c:390)
==25115==
==25115== Use of uninitialised value of size 4
==25115==    at 0x4843694: crc32 (in /lib/i386-linux-gnu/libz.so.1.2.8)
==25115==    by 0x142274: bgzf_compress (bgzf.c:377)
==25115==    by 0x1423E2: deflate_block (bgzf.c:411)
==25115==    by 0x144A6F: bgzf_flush (bgzf.c:1351)
==25115==    by 0x144FF8: bgzf_close (bgzf.c:1435)
==25115==    by 0x11376A: hts_close (hts.c:924)
==25115==    by 0x10B046: write_format_values (test-vcf-api.c:339)
==25115==    by 0x10A16D: test_get_format_values (test-vcf-api.c:381)
==25115==    by 0x10A16D: main (test-vcf-api.c:390)
==25115==
bcf_get_format_float didn't produce the expected output.
==25115==
==25115== HEAP SUMMARY:
==25115==     in use at exit: 136,576 bytes in 69 blocks
==25115==   total heap usage: 213 allocs, 144 frees, 857,261 bytes allocated
==25115==
==25115== LEAK SUMMARY:
==25115==    definitely lost: 0 bytes in 0 blocks
==25115==    indirectly lost: 0 bytes in 0 blocks
==25115==      possibly lost: 0 bytes in 0 blocks
==25115==    still reachable: 136,576 bytes in 69 blocks
==25115==         suppressed: 0 bytes in 0 blocks
==25115== Rerun with --leak-check=full to see details of leaked memory
==25115==
==25115== For counts of detected and suppressed errors, rerun with: -v
==25115== Use --track-origins=yes to see where uninitialised values come from
==25115== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 70 from 30)

This is currently breaking htslib builds on Debian and will affect dependent packages if unresolved.

FYI, the suggested approach of setting -ffloat-store in CFLAGS did not help.

@satta

This comment has been minimized.

Show comment
Hide comment
@satta

satta Jul 17, 2017

FYI, we were able to work around this by adding -fno-strict-aliasing to CFLAGS.

satta commented Jul 17, 2017

FYI, we were able to work around this by adding -fno-strict-aliasing to CFLAGS.

@daviesrob

This comment has been minimized.

Show comment
Hide comment
@daviesrob

daviesrob Jul 17, 2017

Member

Which one does it fix, the make test, or valgrind?

The problem with make test is a known one, caused by BCF's rather unfortunate (ab)use of signalling NANs (see samtools/hts-specs#145). It affects i386 because the FPU converts signalling NANs to quiet ones unless much care is taken. I suspect we need API changes to completely fix it, although #485 helped a bit.

We've seen problems with zlib and valgrind before, but never in crc32. Do you know if it happens with the latest zlib (1.2.11)? I'll have a go to see if I can reproduce.

Member

daviesrob commented Jul 17, 2017

Which one does it fix, the make test, or valgrind?

The problem with make test is a known one, caused by BCF's rather unfortunate (ab)use of signalling NANs (see samtools/hts-specs#145). It affects i386 because the FPU converts signalling NANs to quiet ones unless much care is taken. I suspect we need API changes to completely fix it, although #485 helped a bit.

We've seen problems with zlib and valgrind before, but never in crc32. Do you know if it happens with the latest zlib (1.2.11)? I'll have a go to see if I can reproduce.

@satta

This comment has been minimized.

Show comment
Hide comment
@satta

satta Jul 17, 2017

Which one does it fix, the make test, or valgrind?

The failing test. I was assuming the valgrind issue had to do with it, but then again it might not.

The problem with make test is a known one, caused by BCF's rather unfortunate (ab)use of
signalling NANs (see samtools/hts-specs#145). It affects i386 because the FPU converts
signalling NANs to quiet ones unless much care is taken. I suspect we need API changes to
completely fix it, although #485 helped a bit.

OK. Thanks for the insight.

We've seen problems with zlib and valgrind before, but never in crc32. Do you know if it
happens with the latest zlib (1.2.11)?

I haven't tried 1.2.11, 1.2.8 is the latest in Debian unstable.

satta commented Jul 17, 2017

Which one does it fix, the make test, or valgrind?

The failing test. I was assuming the valgrind issue had to do with it, but then again it might not.

The problem with make test is a known one, caused by BCF's rather unfortunate (ab)use of
signalling NANs (see samtools/hts-specs#145). It affects i386 because the FPU converts
signalling NANs to quiet ones unless much care is taken. I suspect we need API changes to
completely fix it, although #485 helped a bit.

OK. Thanks for the insight.

We've seen problems with zlib and valgrind before, but never in crc32. Do you know if it
happens with the latest zlib (1.2.11)?

I haven't tried 1.2.11, 1.2.8 is the latest in Debian unstable.

@daviesrob

This comment has been minimized.

Show comment
Hide comment
@daviesrob

daviesrob Jul 18, 2017

Member

It looks like valgrindhas spotted a real problem. write_format_values() in test-vcf-api.c does this:

    float test[4];
    bcf_float_set_missing(test[0]);
    test[1] = 47.11f;
    bcf_float_set_vector_end(test[2]);
    bcf_update_format_float(hdr, rec, "TF", test, 4);
    bcf_write1(fp, hdr, rec);

So it allocates an array of four values, sets three of them and then updates the record with four. This means the last item in the record is undefined, triggering the warning when it's eventually written out.

When the data is checked (in check_format_values()), the undefined value is converted to vector_end as bcf_get_format_values() ignores anything after a vector_end marker and sets all remaining output values to vector_end.

The fix in this case is trivial (just set test[3] to any value). @pd3 may want to comment on whether bcf_update_format() should also set the appropriate vector_end values after it's seen one in it's input.

Member

daviesrob commented Jul 18, 2017

It looks like valgrindhas spotted a real problem. write_format_values() in test-vcf-api.c does this:

    float test[4];
    bcf_float_set_missing(test[0]);
    test[1] = 47.11f;
    bcf_float_set_vector_end(test[2]);
    bcf_update_format_float(hdr, rec, "TF", test, 4);
    bcf_write1(fp, hdr, rec);

So it allocates an array of four values, sets three of them and then updates the record with four. This means the last item in the record is undefined, triggering the warning when it's eventually written out.

When the data is checked (in check_format_values()), the undefined value is converted to vector_end as bcf_get_format_values() ignores anything after a vector_end marker and sets all remaining output values to vector_end.

The fix in this case is trivial (just set test[3] to any value). @pd3 may want to comment on whether bcf_update_format() should also set the appropriate vector_end values after it's seen one in it's input.

@pd3

This comment has been minimized.

Show comment
Hide comment
@pd3

pd3 Jul 18, 2017

Member

@daviesrob bcf_update_format() uses your serialize_float_array() which calls float_to_le(), so no special handling of vector_end values should be required.

Member

pd3 commented Jul 18, 2017

@daviesrob bcf_update_format() uses your serialize_float_array() which calls float_to_le(), so no special handling of vector_end values should be required.

@daviesrob

This comment has been minimized.

Show comment
Hide comment
@daviesrob

daviesrob Jul 18, 2017

Member

@pd3 The problem is the values that come after the vector_end. Should they all be written as vector_end too, which is how they end up when read back in? Or possibly as zeros, which might compress better if there's a long run of them?

As the reading side ignores all of the data after vector_end, it would seem sensible that the writing side should too?

Member

daviesrob commented Jul 18, 2017

@pd3 The problem is the values that come after the vector_end. Should they all be written as vector_end too, which is how they end up when read back in? Or possibly as zeros, which might compress better if there's a long run of them?

As the reading side ignores all of the data after vector_end, it would seem sensible that the writing side should too?

@pd3

This comment has been minimized.

Show comment
Hide comment
@pd3

pd3 Jul 18, 2017

Member

The specification requires that all values after vector_end are filled with vector_end values (if the sample has fewer values than the rest of the samples). The bcf_update_format() function assumes that the caller provides output that is valid in this sense.

Member

pd3 commented Jul 18, 2017

The specification requires that all values after vector_end are filled with vector_end values (if the sample has fewer values than the rest of the samples). The bcf_update_format() function assumes that the caller provides output that is valid in this sense.

@anderskaplan

This comment has been minimized.

Show comment
Hide comment
@anderskaplan

anderskaplan Jul 18, 2017

Contributor

Ironically I added this test case, with the behavior I could find in the code comments, as part of PR #485 to make sure I didn't mess up when I got the code to work with MSVC, 32-bit. So the real problem (i.e. bad output produced on i386) might have been there for a while, undetected.

For the write_format_values test case, I didn't realize that the vector should be padded with vector_end values. That should be an easy fix though and not something that will make things break in production.

I believe the issues reported by valgrind related to CRC32 is a whole different problem.

Contributor

anderskaplan commented Jul 18, 2017

Ironically I added this test case, with the behavior I could find in the code comments, as part of PR #485 to make sure I didn't mess up when I got the code to work with MSVC, 32-bit. So the real problem (i.e. bad output produced on i386) might have been there for a while, undetected.

For the write_format_values test case, I didn't realize that the vector should be padded with vector_end values. That should be an easy fix though and not something that will make things break in production.

I believe the issues reported by valgrind related to CRC32 is a whole different problem.

@anderskaplan

This comment has been minimized.

Show comment
Hide comment
@anderskaplan

anderskaplan Jul 18, 2017

Contributor

I've reproduced the problem on Debian stretch i386. The problem seems to be that bcf_float_set sometimes doesn't do what it's supposed to, if an earlier floating-point operation has put the FPU in an erroneous state.

This piece of code demonstrates the behavior.

float f;
bcf_float_set_missing(f);
hexdump(&f, sizeof(float));
bcf_float_set_vector_end(f);
hexdump(&f, sizeof(float)); 

Where hexdump is a small utility function that takes a void* as input, casts it to a uint8_t*, and prints it byte by byte.
The correct output on a little-endian system is:

0100807f
0200807f

But depending on what state the FPU is in, i.e., what operations were executed before this code, the output can also be:

0100c07f
0200807f

where the signaling NaN has been converted to quiet.

Contributor

anderskaplan commented Jul 18, 2017

I've reproduced the problem on Debian stretch i386. The problem seems to be that bcf_float_set sometimes doesn't do what it's supposed to, if an earlier floating-point operation has put the FPU in an erroneous state.

This piece of code demonstrates the behavior.

float f;
bcf_float_set_missing(f);
hexdump(&f, sizeof(float));
bcf_float_set_vector_end(f);
hexdump(&f, sizeof(float)); 

Where hexdump is a small utility function that takes a void* as input, casts it to a uint8_t*, and prints it byte by byte.
The correct output on a little-endian system is:

0100807f
0200807f

But depending on what state the FPU is in, i.e., what operations were executed before this code, the output can also be:

0100c07f
0200807f

where the signaling NaN has been converted to quiet.

@anderskaplan anderskaplan referenced a pull request that will close this issue Jul 28, 2017

Open

Fix for #565, failing tests on i386 #571

@mapreri

This comment has been minimized.

Show comment
Hide comment
@mapreri

mapreri Nov 10, 2017

Did anybody work on this bug in the last month?

mapreri commented Nov 10, 2017

Did anybody work on this bug in the last month?

@daviesrob

This comment has been minimized.

Show comment
Hide comment
@daviesrob

daviesrob Nov 13, 2017

Member

No, mainly because it's not easy to fix. This signalling NAN issue arises due to reliance on behaviour which its not well-defined, especially on i386. Even if it's possible to fix it in one place, there's no guarantee that it won't pop up again somewhere else.

A full solution is likely to need changes in the way floating-point values are stored internally. This may also require changes to the ways in which they are stored and accessed. It's going to need some thought to get the right solution.

Member

daviesrob commented Nov 13, 2017

No, mainly because it's not easy to fix. This signalling NAN issue arises due to reliance on behaviour which its not well-defined, especially on i386. Even if it's possible to fix it in one place, there's no guarantee that it won't pop up again somewhere else.

A full solution is likely to need changes in the way floating-point values are stored internally. This may also require changes to the ways in which they are stored and accessed. It's going to need some thought to get the right solution.

@ginggs

This comment has been minimized.

Show comment
Hide comment
@ginggs

ginggs Nov 20, 2017

Contributor

We were able to work around this again for GCC 7 with -fno-strict-aliasing -fno-code-hoisting.

Contributor

ginggs commented Nov 20, 2017

We were able to work around this again for GCC 7 with -fno-strict-aliasing -fno-code-hoisting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment