64bit integers break VCF/BCF #999
Comments
test2.vcf is simply impossible in BCF. Like BAM, it simply doesn't support 64bit coordinates. Hence you must keep all data in VCF (or likewise in SAM). However it probably ought to complain when encoding to BCF rather than only decoding. The first case I don't understand why it's failing. Maybe it doesn't like negatives. What is the permitted range of values for MPOS? I assume it's meant to be unsigned. |
The first test fails on the reserved integer values; 64 bits are used for these. Yes, BCF does not support 64-bit integers. Therefore htslib happily writing 64-bit BCFs is problematic and unusable in practice. The recommended bcftools workflow is to use uncompressed BCF for streaming from one command into another to avoid the costly binary -> text -> binary conversion (which is significant for files with many samples). Right now a single 64-bit value in the VCF breaks the whole processing pipeline, so programs relying on htslib cannot operate in a robust way. 64-bit integers are a niche use. I am considering a pull request where 64-bit integers are turned into a missing value by default, and the 64-bit values are compiled in conditionally for those who need them. Alternatively, the header structure could be extended and the user program could control what to do on out-of-bounds values. |
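A minimal sketch of the "turn out-of-range integers into a missing value" idea proposed above. All names here (sketch_to_bcf_int32 and the SKETCH_* macros) are hypothetical and are not htslib API; the only assumptions taken from the thread are that BCF reserves the lowest eight 32-bit values for its own use and that INT32_MIN plays the role of "missing".

```c
#include <stdint.h>

/* Hypothetical names for illustration only, not htslib identifiers. */
#define SKETCH_INT32_MISSING   INT32_MIN        /* sentinel used for "missing" */
#define SKETCH_INT32_MIN_VALID (INT32_MIN + 8)  /* lowest storable real value  */

/* Map a parsed 64-bit value onto the 32-bit BCF range.
 * Values outside the range become "missing"; *was_out_of_range (if not NULL)
 * reports whether a replacement happened so the caller can warn once. */
static int32_t sketch_to_bcf_int32(int64_t v, int *was_out_of_range)
{
    if (v < SKETCH_INT32_MIN_VALID || v > INT32_MAX) {
        if (was_out_of_range) *was_out_of_range = 1;
        return SKETCH_INT32_MISSING;
    }
    if (was_out_of_range) *was_out_of_range = 0;
    return (int32_t)v;
}
```

Reporting the replacement through a flag rather than failing outright keeps the pipeline running on slightly invalid input while still letting the caller emit a warning, which is the trade-off being argued for here.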
Ok ignore my comment on test1.vcf. I see MPOS has no special meaning in the spec and it's just a generic parser bug. Any negative number between INT_MIN and INT_MIN+8 falls through the cracks as it's not interpreted as a 64-bit int but also isn't representable due to the BCF code wanting those values for its own purpose. I'll try and fix this. |
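To make the gap concrete, here is a small sketch of the three cases a parser has to distinguish. The enum and function names are hypothetical; the boundaries assume, as described above, that the eight values starting at INT_MIN are claimed by the BCF encoding.

```c
#include <stdint.h>

/* Hypothetical classification for illustration only, not htslib code. */
enum sketch_class {
    SKETCH_FITS_INT32,   /* storable as an ordinary 32-bit BCF integer      */
    SKETCH_RESERVED,     /* fits in 32 bits but collides with reserved codes */
    SKETCH_NEEDS_INT64   /* outside the 32-bit range altogether              */
};

static enum sketch_class sketch_classify(int64_t v)
{
    if (v < INT32_MIN || v > INT32_MAX)
        return SKETCH_NEEDS_INT64;
    if (v < (int64_t)INT32_MIN + 8)   /* INT32_MIN .. INT32_MIN+7 */
        return SKETCH_RESERVED;
    return SKETCH_FITS_INT32;
}
```

The middle case is the one that falls through the cracks: a check that only asks "does it fit in 32 bits?" treats these values as ordinary integers even though they cannot be stored as such.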
I think this is the bug. I'm testing.
|
If the underlying variable is now an |
While as a QoI matter it would be good if 64 bit values were preserved when reading/writing VCF and a suitable error message was produced if an attempt is made to write an out-of-range value to BCF (or BCF was appropriately extended to handle this), the VCF spec is quite clear (§1.3) that the supported Integer range in both VCF and BCF is -2^31+8 to 2^31-1. The MPOS values in both these test files are outside that range, so they are invalid VCF files. Bcftools 1.9 happened to preserve the out-of-bounds value in test 1 because it was buggy wrt #766, and it silently corrupted the out-of-bounds value in test 2 to So the 64-bit support changed the behaviour but it was already broken in the previous release. This seems like more of a “don't do that then” than something requiring urgent fixing… |
The "don't do that then" philosophy can be aimed at software developers. Regular users who just need to process a 700GB file tend to view things differently... |
Well duh. Regular users who need to process their subtly-invalid 700GB files will do so using tools that exhibit one of three categories of behaviour:
Bcftools has changed from behaviour 1 in 1.9 to behaviour 3 in 1.10, and that is a valid complaint (though opinions differ as to which is worse). But it is good to be clear that both are bad, and behaviour 2 of some flavour is the desired one. And I think we can assume that Picard will also exhibit behaviour 3 on such files. Presumably it is time for an hts-specs issue along the lines of “VCF/BCF representation of integers larger than 2^31”… |
For what it's worth, I think I was wrong with that. I was changing the value to test, but this was actually changing the values being used to interpret the other INFO fields and hence either producing errors or not (I didn't notice then it was also changing the values themselves). The bug is simply getting out of sync with the in-memory data block. |
I think there are several things to address here.
|
Any 64-bit INFO field that wasn't the last in the list would cause subsequent fields to be decoded incorrectly. This commit fixes that, plus updates the tests accordingly so the bug could be triggered. Fixes samtools#999 Fixes samtools/bcftools#1123
Any 64-bit INFO field that wasn't the last in the list would cause subsequent fields to be decoded incorrectly. This commit fixes that, plus updates the tests accordingly so the bug could be triggered. Fixes the first part of samtools#999 (test1.vcf), but doesn't fix the second part (BCF output silently being broken). Fixes samtools/bcftools#1123
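The class of bug described in these commit messages can be illustrated with a toy decoder. This is not the htslib code: the type codes, the one-byte type tag and the function names are all invented for the sketch. It only shows why the per-type size must be right for every field, since the read position for each field depends on the sizes of all fields before it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type codes for this sketch only (not htslib's BCF_BT_* values). */
enum { SK_INT32 = 1, SK_INT64 = 2 };

static size_t sk_type_size(int type)
{
    switch (type) {
    case SK_INT32: return 4;
    case SK_INT64: return 8;  /* returning 4 here would desynchronise the walk */
    default:       return 0;  /* unknown type */
    }
}

/* Walk a buffer of [type byte][payload] records in host byte order.
 * Returns the number of values decoded, or -1 on malformed input. */
static int sk_decode(const uint8_t *buf, size_t len, int64_t *out, int max_out)
{
    size_t pos = 0;
    int n = 0;
    while (pos < len && n < max_out) {
        int type = buf[pos++];
        size_t sz = sk_type_size(type);
        if (sz == 0 || len - pos < sz)
            return -1;
        if (type == SK_INT32) {
            int32_t v;
            memcpy(&v, buf + pos, sizeof v);
            out[n++] = v;
        } else {
            int64_t v;
            memcpy(&v, buf + pos, sizeof v);
            out[n++] = v;
        }
        pos += sz;  /* a wrong size makes every later field start mid-payload */
    }
    return n;
}
```

With a wrong size for the 64-bit case, a lone trailing 64-bit field would still appear to decode, which matches the observation that only fields that were not last in the list caused trouble.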
Hi, the source of the error was the GATK-recommended MuTect2 workflow using gatk 4.1.4.0. I could also replicate it with the current version, gatk 4.1.4.1, so the issue originates from a relatively commonly used tool and recommended workflow. I'm not sure whether test1 and test2 are from the specific files I provided, but it didn't error on the original MuTect2 calls after merging chunks with bcftools; it did error after applying FILTER flags using "gatk FilterMutectCalls" |
Does it work after applying the patch in #1000? If it does, it would be interesting to know what the difference is between a file processed by that version and one processed by 1.9. I suspect that the real problem happened slightly after the line you quoted in samtools/bcftools#1123 as some data may not have been flushed out before the program stopped. |
Also, if |
@pd3 - do you have a time line for your changes?
Also how complex is this likely to be? Right now we have a trivial one line (one character!) fix to the first issue here which we could merge as-is and make a quick patch-level release. Or we could wait and do it properly, but I'm loath to do that unless it is both coming fast and also is trivial enough to code review and be absolutely certain it doesn't introduce any other problems. Right now I see the first part of this issue as a substantial bug and the second part as a quality of life improvement which could, if needs be, be punted to the next release (presumably scheduled for March as we'll be back on the 3-monthly cycle). Also, I haven't actually seen the cause of this problem - only your crafted demonstration. @PedalheadPHX is this a bug which happens regularly now, or is it a one off caused by something odd in your input data that led to MuTect2 producing invalid output? If the latter I'm not inclined to consider this as an urgent "must fix now" issue, but if it's perfectly normal data then we really do need it fixing. Finally, as to that point, my proposal is as follows. Please comment if there are better solutions.
I don't personally think it helps to have a new samtools-1.10.1 and bcftools-1.10.1 as they have no code changes and their version reporting already copes with splitting the application and library version numbers apart. For what it's worth, we did something like this with 1.2 to 1.2.1, where a bug fix to htslib landed 2 days after the release. |
I think this is most likely caused by a bug in GATK (see below), which is now being revealed by a bug in HTSlib. Applying #1000 will fix the HTSlib problem, and make the GATK problem silent again (as long as you don't use BCF) albeit with a different outcome (MPOS will be a large negative number instead of a missing value). As the HTSlib bug is only triggered by large positions or bad INFO tags, I think I would apply #1000 and then wait to see if any more reports come in before going to the trouble of rushing out a new release. It would also be good to try to reproduce the suspected GATK bug and report it to them if we can demonstrate it. The code in GATK to calculate If this analysis is correct, it would suggest that the bug is triggered when a variant happens to overlap a lot of reads with deletions at the same point. Do we know if this is true in the test case for samtools/bcftools#1123? |
Yes, I've got testing data from the user who opened the issue. I have now put together a pull request, #1003, which fixes a few other bugs in addition to @jkbonfield's fix and makes the 64-bit goodness optional. The commit fails on trivial log issues and one major issue: 64-bit positions are not compiled in by default. I made it this way because large positions cannot be stored in BCF, but full compatibility of VCF and BCF was an essential part of the design goal. I would rather live without 64-bit positions than give up VCF/BCF interoperability for two reasons: 1) the number of large genomes that will ever be processed is negligible compared to normal-sized genomes and 2) a large chromosome can always be split into two. In general, the situation in the VCF world is more chaotic than in SAM and we need the parsing to be more permissive. There are many more tools that add annotations and they have bugs. As a result we often work with files which are not 100% valid VCFs: wrong number of fields, out-of-range values, etc. Occasional silent under/overflow of some field at a few positions is in practice less bad than not being able to process the file at all. |
I could be persuaded to simply make the 64-bit support in VCF optional for this release, as a temporary fix to getting a functioning bcftools on broken-in-the-wild data, but I absolutely can't agree with how it changes the external BCF format. If we want to go down the road of temporarily disabling the VCF 64-bit support so we can implement it more fully in the next release then please make a minimal PR that does that and only that. Perhaps by simply commenting out the offending lines that caused this bug. (However I have yet to see an argument for why my PR #1000 doesn't fix things.) |
The original version did already output a modified BCF. This pull request just makes it possible to read it as well, although only partially; POS would be scrambled anyway.
|
On Wed, Dec 11, 2019 at 04:45:45AM -0800, Petr Danecek wrote:
The original version did already output a modified bcf. This pull request just enabled to read it as well, although partially, POS would be scrambled anyway.
Which was a bug. I'm attempting to fix it, not to bless it.
James
|
Nice analysis. This is broadinstitute/gatk#5492; see also this GATK forum thread. |
@jkbonfield Apologies for the late reply. We encountered this issue in all tests to date, about 5-10 independent samples. In at least one I removed the line in question on chromosome1 and encountered the same error on chr19, so clearly not extremely common within one file but to date we have hit it each time. For priority, we had been waiting for this new release to leverage the updated samtools markdup in particular, for VCF processing we elected to stay with bcftools v1.9 for the time being but would be interested to move to a whole version when available |
Fixed by #1003 |
The BCF_BT_INT64 type breaks VCF and BCF. It might be better to replace out-of-range values with missing values instead of producing files that cannot be processed.
Test 1:
Test 2:
This was originally reported in samtools/bcftools#1123