Random bcftools concat CI failures on Windows #1901
Comments
Argh, this is another impossible-to-reproduce random bug! On my local Windows box I tried:
Yet we know this test does sometimes fail on AppVeyor, e.g. https://ci.appveyor.com/project/samtools/bcftools/builds/46694487#L6079 I'll try it again after modifying the Perl script to force the seed to 3434840604. Edit: no errors there either. So for faster iteration I ran the actual command it's executing 1000 times, and that also works:
Yet clearly the 7th file on AppVeyor had a BGZF error:
Rob was reporting errors with his fixes-1.17 branch too, in the exact same test, although this time it failed on the 4th BCF file. Yet other times this same code works on AppVeyor. So clearly this is a genuine intermittent bug, as the chance of it failing twice on the same test in the same place is vanishingly small if it were just a random system glitch. Sadly I can't work out how to reproduce it locally though. My exhaustive testing has been on the concat command, but that is very robust. So I'm now assuming it's more likely that the generation of the data the concat command is reading has silently failed. Any clues welcome.
I should have thought about those errors more carefully. Specifically:
That coming from
The answer to this, surprisingly, is that on newer MinGW releases lseek on a pipe no longer returns -1! E.g.:
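A minimal sketch of this kind of probe, not necessarily the program used in the thread. It assumes a POSIX-style `unistd.h`, and the helper names `probe_seek` and `report_seek` are hypothetical. On POSIX systems `lseek` on a pipe fails with `ESPIPE`; the behaviour under investigation is a runtime that returns a non-negative offset instead.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: ask for the current offset of fd without moving it.
 * Returns the offset, or -1 with errno set (ESPIPE for a pipe on POSIX). */
off_t probe_seek(int fd)
{
    errno = 0;
    return lseek(fd, 0, SEEK_CUR);
}

/* Hypothetical helper: print the outcome so that runs built with
 * different toolchains can be compared side by side. */
void report_seek(int fd)
{
    off_t pos = probe_seek(fd);
    if (pos < 0)
        printf("fd %d: lseek failed: %s\n", fd, strerror(errno));
    else
        printf("fd %d: lseek returned %lld\n", fd, (long long) pos);
}
```

Calling `report_seek(0)` from `main` and running the binary once with "< file" and once with "cat file | ..." would show whether the C runtime still distinguishes a regular file from a pipe.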
Now test it on a small file, with "< file" (fd 0 is a real file on disk) vs "cat file | ..." (fd 0 is a pipe):
This appears to have changed behaviour somewhere between MinGW gcc 11.2 and gcc 12.2, but it's likely not the compiler per se but rather something in the standard C library. I'm struggling to find another route for validating this. We could potentially modify the hfile struct somehow so stdin sets a flag which is then explicitly checked later, but given this is just validation of input my inclination is to fix this with a
As to why this is randomly failing, I think it's because most of the files being concatenated are < 128Kb long, so the SEEK_END -28 actually works, but occasionally it gets a longer one (they're randomly generated with a different seed every time), which then seeks to the wrong location.
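One way to avoid trusting the return value of `lseek` alone is to ask for the file type first. The sketch below is an illustration only and is not htslib's `bgzf_check_EOF`: the function name `check_eof_marker` and its return codes are hypothetical, and it assumes `fstat` reports a pipe as a non-regular file on the target platform (true on POSIX, not verified here for MinGW). The 28 marker bytes are the empty BGZF block defined in the SAM/BAM specification, which is why the check seeks to SEEK_END -28.

```c
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* The 28-byte empty BGZF block that marks end-of-file (SAM/BAM spec). */
static const unsigned char BGZF_EOF_MARKER[28] = {
    0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0x06, 0x00,
    0x42, 0x43, 0x02, 0x00, 0x1b, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00
};

/* Hypothetical helper, not htslib code.
 * Returns 1 if fd ends with the EOF marker, 0 if it does not,
 * 2 if fd is not a regular file (check skipped), -1 on error. */
int check_eof_marker(int fd)
{
    struct stat st;
    unsigned char buf[sizeof BGZF_EOF_MARKER];
    off_t saved;
    ssize_t got;
    int result;

    if (fstat(fd, &st) != 0) return -1;
    /* Decide seekability from the file type, not from lseek's result. */
    if (!S_ISREG(st.st_mode)) return 2;
    /* A file shorter than the marker cannot end with it. */
    if (st.st_size < (off_t) sizeof buf) return 0;

    saved = lseek(fd, 0, SEEK_CUR);
    if (saved < 0) return -1;
    if (lseek(fd, -(off_t) sizeof buf, SEEK_END) < 0) return -1;

    got = read(fd, buf, sizeof buf);
    if (got < 0)
        result = -1;
    else
        result = (got == (ssize_t) sizeof buf &&
                  memcmp(buf, BGZF_EOF_MARKER, sizeof buf) == 0) ? 1 : 0;

    /* Restore the original position so the caller can keep reading. */
    if (lseek(fd, saved, SEEK_SET) < 0) return -1;
    return result;
}
```

With a guard like this, a pipe is reported as "cannot check" regardless of what `lseek` returns, so the length of the piped file no longer affects the outcome.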
Unfortunately this test is proving unreliable now following a change to lseek so that it no longer returns -1 on pipes. Fixes samtools/bcftools#1901
MinGW 12.x started returning non-zero values from lseek when the fd is a pipe. This is unhelpful and it breaks bgzf_check_EOF as seeking to the end is actually seeking to the end of the pipe memory buffer, causing invalid EOFs. (This breaks bcftools CI tests.) Fixes samtools/bcftools#1901 Co-authored-by: John Marshall <jmarshall@hey.com>
I just received the same error; apparently this can still happen.
Even with the same commit hash, sometimes make check works, and sometimes not. For example:
https://ci.appveyor.com/project/jkbonfield/bcftools/builds/46710244#L6077
https://ci.appveyor.com/project/samtools/bcftools/builds/46694487#L6077
This test does both bcf and vcf. In both cases the vcf worked, although oddly there's a rogue newline in the vcf command line reported for the second URL above. This may mean there's some bizarre line-wrapping buglet going on? I'm really not sure.
However, one other difference in the first test_naive_concat bcf test is the use of --naive-force over --naive. The documentation says to use this with caution as no compatibility check is done. Given this is a random error, the tests take 10 minutes to run, we can only run one at a time, and it takes lots of runs to get one failing, this isn't an ideal thing to experiment with.
The BCF headers contain things like dates and times. We rapidly create 10 test files in a row:
What happens if the time changes between two of those files? It may give different compressed sizes of the headers, and then give different byte offsets for downstream data? I'm clutching at straws because I've no idea what --naive-force is actually warning about. Is it literally a rapid byte-copying function, or is it doing decode and re-encode jobs, so that it's irrelevant if the headers differ slightly as long as they have the same contig names and lengths?
Diagnosing this remotely is almost impossible. Locally I can't get it to fail at all, but locally it also runs much quicker, which is also what got me wondering if it's timestamps that are causing it to break. I see some code has --no-version options, so maybe that would cure it.
Any clues welcomed, as I'm just making wild stabs in the dark at the moment.