Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random bcftools concat CI failures on Windows #1901

Open
jkbonfield opened this issue Apr 5, 2023 · 4 comments · Fixed by samtools/htslib#1601
Open

Random bcftools concat CI failures on Windows #1901

jkbonfield opened this issue Apr 5, 2023 · 4 comments · Fixed by samtools/htslib#1601

Comments

@jkbonfield
Copy link
Contributor

Even with the same commit hash, sometimes make check works, and sometimes not. For example:

https://ci.appveyor.com/project/jkbonfield/bcftools/builds/46710244#L6077
https://ci.appveyor.com/project/samtools/bcftools/builds/46694487#L6077

This test does both bcf and vcf. In both cases the vcf worked, although oddly there's a rogue newline in the command line reported for the second URL above in the vcf command line. This may mean some bizarre line-wrapping buglet going on? I'm really not sure.

However one other difference in the first test_naive_concat bcf test is the use of --naive-force over --naive. The documentation claims to use this with caution as no compatibility check is done. Given this is a random error and the tests take 10 minutes to run, we can only run one at a time, and it takes lots of tests to get one failing, this isn't an ideal thing to experiment with.

The BCF headers contain things like dates and times. We rapidly create 10 test files in a row:

    for my $file (@files)
    {
        cmd("$$opts{bin}/bcftools view -Ob -o $file.bcf $file.vcf");
        cmd("$$opts{bin}/bcftools view -Oz -o $file.vcf.gz $file.vcf");
    }

What happens if the time changes between two of those files? It may give different compressed sizes of the headers, and then give different byte offsets for downstream data? I'm clutching at straws because I've no idea what --naive-force is actually warning about. Is it literally a rapid byte copying function, or is it doing decode and reencode jobs and it's irrelevant if the headers differ slightly, as long as they have the same contig names and lengths?

Diagnosing this remotely like this is almost impossible. Locally I can't get it to fail at all, but locally it also runs much quicker, so that's also that got me wondering if it's timestamps that are causing it to break. I see some code has --no-version options. So maybe that would cure it.

Any clues welcomed as I'm just making wild stabs in the dark at the moment.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Apr 6, 2023

Argh, this is another impossible to reproduce random bug!

On my local windows box I tried:

$ for i in `seq 1 100`;do TEST_OPTS="-t tests.out -f test_naive_concat" make test-plugins 2>&1 | grep failed;done
    failed  .. 0
[repeated 100x over]

Yet we know this test does fail sometimes on AppVeyor. Eg https://ci.appveyor.com/project/samtools/bcftools/builds/46694487#L6079

I'll try it again after modifying the perl script to force the seed to 3434840604.
(So far it's working although not many iterations done, but even so it's not turning it into an always-fail bug.)

Edit: no errors there either.

So for faster iteration I ran through the actual command it's executing 1000 times and that also works:

$ for i in `seq 1 1000`;do ./bcftools.exe concat --naive-force tests.out/naive_concat.*.bcf 2>/dev/null|wc -c;done | uniq
377537

Yet clearly the 7th file on appveyor had a bgzf error:

	Non-zero status 127
		[W::bcf_sr_add_reader] No BGZF EOF marker; file '-' may be truncated
		[E::bgzf_read_block] Invalid BGZF header at offset 0
		[E::bgzf_read] Read block operation failed with error 2 after 0 of 5 bytes
		[E::bcf_hdr_read] Failed to read the header (reading BCF in text mode?)
		Failed to read from standard input: could not parse header
		Concatenating C:/msys64/tmp/TOCMppXCxh/naive_concat.0.bcf	0.003108 seconds
		Concatenating C:/msys64/tmp/TOCMppXCxh/naive_concat.1.bcf	0.002063 seconds
		Concatenating C:/msys64/tmp/TOCMppXCxh/naive_concat.2.bcf	0.001913 seconds
		Concatenating C:/msys64/tmp/TOCMppXCxh/naive_concat.3.bcf	0.000318 seconds
		Concatenating C:/msys64/tmp/TOCMppXCxh/naive_concat.4.bcf	0.001813 seconds
		Concatenating C:/msys64/tmp/TOCMppXCxh/naive_concat.5.bcf	0.003632 seconds
		Concatenating C:/msys64/tmp/TOCMppXCxh/naive_concat.6.bcf	0.002178 seconds
		Concatenating C:/msys64/tmp/TOCMppXCxh/naive_concat.7.bcf
.. failed ...

Rob was reporting errors with his fixes-1.17 branch too, in the exact same test. Although his time it failed on the 4th bcf file. Yet other times this same code works on AppVeyor.

So clearly this is a genuine intermittent bug as the fail twice on the same test in the same place is vanishingly small for it to just be a random system glitch. Sadly I can't work out how to reproduce it locally though.

My exhaustive testing has been on the concat command, but that is very robust. So I'm now assuming it's more likely to be the generation of the data that the concat command is reading that has silently failed. Any clues welcome.

@jkbonfield
Copy link
Contributor Author

I should have thought about those errors more carefully. Specifically:

[W::bcf_sr_add_reader] No BGZF EOF marker; file '-' may be truncated

That coming from blah | bcftools view -H. The question that needed asking is, why is it validating if the file has EOF when reading from a pipe?

The answer to this, surprisingly, is that on newer mingw releases lseek on a pipe no longer returns -1! Eg:

#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>

int main(int argc, char **argv) {
  printf("Seek END returned %ld\n", (long)lseek(0, 0, SEEK_END));
  return 0;
}

Now test it on a small file, with "< file" (fd 0 is a real file on disk) vs "cat file | ..." (fd 0 is a pipe):

jkbon@nagafen MINGW64 ~/bcftools
$ ./a.exe < _
Seek END returned 226472
# the actual file length

jkbon@nagafen MINGW64 ~/bcftools
$ cat _ | ./a.exe
Seek END returned 131072
# the size of a pipe buffer?

This appears to have changed behaviour somewhere between mingw gcc 11.2 and gcc 12.2, but it's likely not the compiler per se and rather somewhere in the standard C library. I'm struggling to find another route for validating this. We could potentially modify the hfile struct somehow so stdin sets a flag which is then explicitly checked later, but given this is just validation of input my inclination is to fix this with a #ifdef __MINGW32__ guard and simply avoid the problematic behaviour. I filed this as a bug with mingw.

@jkbonfield
Copy link
Contributor Author

As to why this is randomly failing, I think it's because most of the files being concatenated are < 128Kb long so the SEEK_END -28 actually works, but occasionally it gets a long one (they're randomly generated with a different seed every time), which then seeks to the wrong location.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Apr 11, 2023
Unfortunately this test is proving unreliable now following a change
to lseek so that it no longer returns -1 on pipes.

Fixes samtools/bcftools#1901
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Apr 11, 2023
Unfortunately this test is proving unreliable now following a change
to lseek so that it no longer returns -1 on pipes.

Fixes samtools/bcftools#1901
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Apr 11, 2023
Unfortunately this test is proving unreliable now following a change
to lseek so that it no longer returns -1 on pipes.

Fixes samtools/bcftools#1901
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Apr 18, 2023
MinGW 12.x started returning non-zero values from lseek when the fd
is a pipe.  This is unhelpful and it breaks bgzf_check_EOF as seeking
to the end is actually seeking to the end of the pipe memory buffer,
causing invalid EOFs.  (This breaks bcftools CI tests.)

Fixes samtools/bcftools#1901

Co-authored-by: John Marshall <jmarshall@hey.com>
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Apr 18, 2023
MinGW 12.x started returning non-zero values from lseek when the fd
is a pipe.  This is unhelpful and it breaks bgzf_check_EOF as seeking
to the end is actually seeking to the end of the pipe memory buffer,
causing invalid EOFs.  (This breaks bcftools CI tests.)

Fixes samtools/bcftools#1901

Co-authored-by: John Marshall <jmarshall@hey.com>
daviesrob pushed a commit to samtools/htslib that referenced this issue May 16, 2023
MinGW 12.x started returning non-zero values from lseek when the fd
is a pipe.  This is unhelpful and it breaks bgzf_check_EOF as seeking
to the end is actually seeking to the end of the pipe memory buffer,
causing invalid EOFs.  (This breaks bcftools CI tests.)

Fixes samtools/bcftools#1901

Co-authored-by: John Marshall <jmarshall@hey.com>
vasudeva8 pushed a commit to vasudeva8/htslib that referenced this issue May 30, 2023
MinGW 12.x started returning non-zero values from lseek when the fd
is a pipe.  This is unhelpful and it breaks bgzf_check_EOF as seeking
to the end is actually seeking to the end of the pipe memory buffer,
causing invalid EOFs.  (This breaks bcftools CI tests.)

Fixes samtools/bcftools#1901

Co-authored-by: John Marshall <jmarshall@hey.com>
@pd3
Copy link
Member

pd3 commented Jun 25, 2024

I just received the same error, apparently this can still happen
https://cirrus-ci.com/task/6336215039868928?logs=test#L6624

test_naive_concat:
	/tmp/cirrus-ci-build/bcftools concat --naive /tmp/jSUwzfxxPq/naive_concat.0.vcf.gz /tmp/jSUwzfxxPq/naive_concat.1.vcf.gz /tmp/jSUwzfxxPq/naive_concat.2.vcf.gz /tmp/jSUwzfxxPq/naive_concat.3.vcf.gz /tmp/jSUwzfxxPq/naive_concat.4.vcf.gz /tmp/jSUwzfxxPq/naive_concat.5.vcf.gz /tmp/jSUwzfxxPq/naive_concat.6.vcf.gz /tmp/jSUwzfxxPq/naive_concat.7.vcf.gz /tmp/jSUwzfxxPq/naive_concat.8.vcf.gz /tmp/jSUwzfxxPq/naive_concat.9.vcf.gz | /tmp/cirrus-ci-build/bcftools view -H
	Non-zero status 9
		Checking the headers of 10 files.
		Done, the headers are compatible.
		Concatenating /tmp/jSUwzfxxPq/naive_concat.0.vcf.gz	0.019761 seconds
		Concatenating /tmp/jSUwzfxxPq/naive_concat.1.vcf.gz	0.005515 seconds
		Concatenating /tmp/jSUwzfxxPq/naive_concat.2.vcf.gz	0.007389 seconds
		Concatenating /tmp/jSUwzfxxPq/naive_concat.3.vcf.gz	0.003913 seconds
		Concatenating /tmp/jSUwzfxxPq/naive_concat.4.vcf.gz[E::bgzf_read_block] Failed to read BGZF block data at offset 182405 expected 12805 bytes; hread returned 12763
		Error: BCF read error
		
.. failed ...

@pd3 pd3 reopened this Jun 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants