64bit integers break VCF/BCF #999

Closed
pd3 opened this issue Dec 8, 2019 · 23 comments

@pd3
Member

pd3 commented Dec 8, 2019

The BCF_BT_INT64 type breaks VCF and BCF. It might be better to replace out-of-range values with missing values instead of producing files that cannot be processed.

Test 1:

$ cat test1.vcf
##fileformat=VCFv4.2
##INFO=<ID=MPOS,Number=A,Type=Integer,Description="dummy">
##INFO=<ID=NALOD,Number=A,Type=Float,Description="dummy">
##INFO=<ID=NLOD,Number=A,Type=Float,Description="dummy">
##INFO=<ID=POPAF,Number=A,Type=Float,Description="dummy">
##contig=<ID=chr1,length=248956422>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	1	.	G	C	.	.	MPOS=-2147483641;NALOD=-8.279e-01;NLOD=15.45;POPAF=6.00

$ bcftools view test1.vcf 
[E::bcf_fmt_array] Unexpected type 0

Test 2:

$ cat test2.vcf
##fileformat=VCFv4.2
##INFO=<ID=MPOS,Number=A,Type=Integer,Description="dummy">
##contig=<ID=chr1,length=248956422>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr1    1   .   G   C   .   .   MPOS=42949672950

$ bcftools view test2.vcf -Ou | bcftools view -o /dev/null 
[W::bcf_record_check] Bad BCF record: Invalid INFO type 4 (unknown)
Error: BCF read error

This was originally reported in samtools/bcftools#1123

@jkbonfield
Contributor

test2.vcf is simply impossible in BCF. Like BAM, BCF doesn't support 64-bit coordinates. Hence you must keep such data in VCF (or likewise in SAM). However it probably ought to complain when encoding to BCF rather than only when decoding.

I don't understand why the first case is failing. Maybe it doesn't like negatives. What is the permitted range of values for MPOS? I assume it's meant to be unsigned.

@pd3
Member Author

pd3 commented Dec 9, 2019

The first test fails on the reserved integer values; 64 bits are used to represent these.

Yes, BCF does not support 64-bit integers. Therefore htslib happily writing 64-bit BCFs is problematic / unusable in practice. The recommended bcftools workflow is to use uncompressed BCF for streaming from one command into another, to avoid the costly binary -> text -> binary conversion (which is significant for files with many samples). Right now one 64-bit value in the VCF breaks the whole processing pipeline, and programs relying on htslib cannot operate in a robust way.

64-bit integers are a niche use. I am considering a pull request where 64-bit integers are turned into a missing value by default, making the 64-bit values compile conditionally for those who need them. Alternatively, the header structure could be extended and the user program could control what to do on out-of-bounds values.

@pd3 pd3 added the P1: Urgent label Dec 9, 2019
@jkbonfield
Contributor

Ok ignore my comment on test1.vcf. I see MPOS has no special meaning in the spec and it's just a generic parser bug.

Any negative number between INT_MIN and INT_MIN+8 falls through the cracks as it's not interpreted as a 64-bit int but also isn't representable due to the BCF code wanting those values for its own purpose. I'll try and fix this.

@jkbonfield
Contributor

I think this is the bug. I'm testing.

diff --git a/vcf.c b/vcf.c
index 9bcad91..1d26c91 100644
--- a/vcf.c
+++ b/vcf.c
@@ -59,7 +59,7 @@ HTSLIB_EXPORT
 uint32_t bcf_float_vector_end = 0x7F800002;
 
 HTSLIB_EXPORT
-uint8_t bcf_type_shift[] = { 0, 0, 1, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+uint8_t bcf_type_shift[] = { 0, 0, 1, 2, 3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
 
 static bcf_idinfo_t bcf_idinfo_def = { .info = { 15, 15, 15 }, .hrec = { NULL, NULL, NULL}, .id = -1 };

@jmarshall
Member

jmarshall commented Dec 9, 2019

Any negative number between INT_MIN and INT_MIN+8 falls through the cracks as it's not interpreted as a 64-bit int but also isn't representable due to the BCF code wanting those values for its own purpose. I'll try and fix this.

If the underlying variable is now an int64_t, it seems like there's now a need for bcf_enc_vint() to acquire an additional clause: if (max <= BCF_MAX_BT_INT32 && min >= BCF_MIN_BT_INT32) … else … use BCF_BT_INT64? And similarly for the other places touched by PR #766?

@jmarshall
Member

While as a QoI matter it would be good if 64-bit values were preserved when reading/writing VCF and a suitable error message was produced if an attempt is made to write an out-of-range value to BCF (or BCF was appropriately extended to handle this), the VCF spec is quite clear (§1.3) that the supported Integer range in both VCF and BCF is -2^31+8 to 2^31-1.

The MPOS values in both these test files are outside that range, so they are invalid VCF files.

Bcftools 1.9 happened to preserve the out-of-bounds value in test 1 because it was buggy wrt #766, and it silently corrupted the out-of-bounds value in test 2 to MPOS=-10. (And it silently changed the out-of-bounds value in samtools/bcftools#1123 (comment) to MPOS=..)

So the 64-bit support changed the behaviour but it was already broken in the previous release. This seems like more of a “don't do that then” than something requiring urgent fixing…

@pd3
Member Author

pd3 commented Dec 9, 2019

The "don't do that then" philosophy can be aimed at software developers. Regular users who just need to process a 700GB file tend to view things differently...

@jmarshall
Member

jmarshall commented Dec 9, 2019

Well duh. Regular users who need to process their subtly-invalid 700GB files will do so using tools that exhibit one of three categories of behaviour:

  1. Process the files successfully while silently changing the out-of-bounds values
  2. Process the files successfully while producing warning messages and some combination of preserving out-of-bounds values or changing them
  3. Abort on the invalid records and produce an error message of some kind

Bcftools has changed from behaviour 1 in 1.9 to behaviour 3 in 1.10, and that is a valid complaint (though opinions differ as to which is worse). But it is good to be clear that both are bad, and behaviour 2 of some flavour is the desired one.

And I think we can assume that Picard will also exhibit behaviour 3 on such files.

Presumably it is time for an hts-specs issue along the lines of “VCF/BCF representation of integers larger than 2^31”…

@jkbonfield
Contributor

Any negative number between INT_MIN and INT_MIN+8 falls through the cracks as it's not interpreted as a 64-bit int

For what it's worth, I think I was wrong with that. I was changing the value to test with, but this actually changed the values being used to interpret the other INFO fields, and hence either produced errors or not (I didn't notice at the time that it was also changing the values themselves).

The bug is simply the decoder getting out of sync with the in-memory data block.

@jkbonfield
Contributor

jkbonfield commented Dec 9, 2019

I think there are several things to address here.

  1. Clearly a user triggered this bug with genuine data. What was that data? Not what we see in samtools/bcftools#1123 ("bcftools 1.10 Unexpected type 0"), as that looks fine. I assume they have data with out-of-bounds values, but what produced it? My naive view (and presumably @daviesrob's) was that the change in behaviour only caused issues with already-broken data, so it's a non-issue. I'd certainly not flag it as "urgent" unless we know the tool that is breaking the data is widespread and not amenable to fixing itself, or that we have many such files deposited in various repositories.

  2. How should we handle out-of-bounds values? Personally I agree with @pd3 that we should treat them as missing, but with a warning and with some stringency option to control this (@jmarshall's option 2). Absolutely though it shouldn't silently "work" and store up a gotcha for later on. This is how the 1.9 code "worked" and it's a horrible approach.

  3. I know the VCF spec has limits on what integer ranges are valid, but that's something that will be changing in future simply because it must. The INFO "END" field needs to be able to handle more than 32-bit sizes in order to cope with large genomes, of which we know several have already been sequenced (and have had to have chromosomes chopped up to be accepted by tools). This is just a case of the code leading the spec rather than the other way around.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Dec 9, 2019
Any 64-bit INFO field that wasn't the last in the list would cause
subsequent fields to be decoded incorrectly.

This commit fixes that, plus updates the tests accordingly so the bug
could be triggered.

Fixes samtools#999
Fixes samtools/bcftools#1123
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Dec 9, 2019
Any 64-bit INFO field that wasn't the last in the list would cause
subsequent fields to be decoded incorrectly.

This commit fixes that, plus updates the tests accordingly so the bug
could be triggered.

Fixes the first part of samtools#999 (test1.vcf), but doesn't fix the second
part (BCF output silently being broken).

Fixes samtools/bcftools#1123
@PedalheadPHX

Hi, the source of the error was the GATK-recommended MuTect2 workflow using GATK 4.1.4.0; I could also replicate it with the current version, 4.1.4.1, so the issue originates from a relatively commonly used tool and recommended workflow.

I'm not sure if tests 1 and 2 are from the specific files I provided, but it didn't error on the original MuTect2 calls after merging chunks with bcftools; it did error after applying FILTER flags using "gatk FilterMutectCalls".

@daviesrob
Member

Does it work after applying the patch in #1000? If it does, it would be interesting to know what the difference is between a file processed by that version and one processed by 1.9. I suspect that the real problem happened slightly after the line you quoted in samtools/bcftools#1123 as some data may not have been flushed out before the program stopped.

@daviesrob
Member

Also, if MPOS ("median distance from end of read") is the cause, would you normally expect it to have such a large value?

@jkbonfield
Copy link
Contributor

@pd3 - do you have a time line for your changes?

64 bit integers is a niche use. I am considering a pull request where 64bit integers are turned into a missing value by default, and make the 64bit values compile conditionally for those who need it. Alternatively, the header structure could be extended and the user program could control what to do on out-of-bounds values.

Also how complex is this likely to be? Right now we have a trivial one line (one character!) fix to the first issue here which we could merge as-is and make a quick patch-level release. Or we could wait and do it properly, but I'm loath to do that unless it is both coming fast and also is trivial enough to code review and be absolutely certain it doesn't introduce any other problems.

Right now I see the first part of this issue as a substantial bug and the second part as a quality of life improvement which could, if needs be, be punted to the next release (presumably scheduled for March as we'll be back on the 3-monthly cycle).

Also, I haven't actually seen the cause of this problem - only your crafted demonstration. @PedalheadPHX is this a bug which happens regularly now, or is it a one-off caused by something odd in your input data that led to MuTect2 producing invalid output? If the latter, I'm not inclined to consider this an urgent "must fix now" issue, but if it's perfectly normal data then we really do need to fix it.

Finally, as to that point, my proposal is as follows. Please comment if there are better solutions.

  1. Merge PR #1000 ("Fixes BCF in-memory decoding with 64-bit integers") to fix the bcftools view bug.

  2. Produce a new htslib release (NEWS, version numbers, tag, etc).

  3. Add htslib-1.10.1 subdirectories to the samtools-1.10 and bcftools-1.10 tarballs and upload these as new assets to github (and similarly for sourceforge), overwriting the old ones. (That may give issues if people have md5sums of them, so an alternative is bcftools-1.10.1.tar.bz2 plus updating the download links on the website?)

I don't personally think it helps to have a new samtools-1.10.1 and bcftools-1.10.1 as they have no code changes, and their version reporting already copes with splitting the application and library version numbers apart. For what it's worth, we did something like this going from 1.2 to 1.2.1, when a bug fix to htslib landed 2 days after the release.

@daviesrob
Member

I think this is most likely caused by a bug in GATK (see below), which is now being revealed by a bug in HTSlib. Applying #1000 will fix the HTSlib problem, and make the GATK problem silent again (as long as you don't use BCF) albeit with a different outcome (MPOS will be a large negative number instead of a missing value).

As the HTSlib bug is only triggered by large positions or bad INFO tags, I think I would apply #1000 and then wait to see if any more reports come in before going to the trouble of rushing out a new release. It would also be good to try to reproduce the suspected GATK bug and report it to them if we can demonstrate it.


The code in GATK to calculate MPOS seems to be here. It calls ReadPosRankSumTest.getReadPosition and casts the result to an int. Looking at that function, it can return INVALID_ELEMENT_FROM_READ, which is defined as Double.NEGATIVE_INFINITY. According to the Java documentation, casting NEGATIVE_INFINITY to int results in a value of INT_MIN. I'm not sure exactly how this would result in the number given in Test 1, but it seems the most likely explanation for how a big negative value would end up there.

If this analysis is correct, it would suggest that the bug is triggered when a variant happens to overlap a lot of reads with deletions at the same point. Do we know if this is true in the test case for samtools/bcftools#1123?

@pd3
Member Author

pd3 commented Dec 11, 2019

Yes, I've got testing data from the user who opened the issue. I've now put together a pull request, #1003, which fixes a few other bugs in addition to @jkbonfield's fix and makes the 64-bit goodness optional.

The commit fails on trivial log issues and one major issue: 64-bit positions are not compiled in by default. I made it this way because large positions cannot be stored in BCF, and full compatibility of VCF and BCF was an essential part of the design goal. I would rather live without 64-bit positions than give up VCF/BCF interoperability, for two reasons: 1) the number of large genomes that will ever be processed is negligible compared to normal-sized genomes, and 2) a large chromosome can always be split in two.

In general, the situation in the VCF world is more chaotic than in SAM and we need the parsing to be more permissive. There are many more tools that add annotations, and they have bugs. As a result we often work with files which are not 100% valid VCFs - wrong number of fields, out-of-range values, etc. Occasional silent under/overflow of some field at a few positions is in practice less bad than not being able to process the file at all.

@jkbonfield
Contributor

jkbonfield commented Dec 11, 2019

I could be persuaded to simply make the 64-bit support in VCF optional for this release, as a temporary fix to getting a functioning bcftools on broken-in-the-wild data, but I absolutely can't agree with how it changes the external BCF format.

If we want to go down the road of temporarily disabling the VCF 64-bit support so we can implement it more fully in the next release then please make a minimal PR that does that and only that. Perhaps by simply commenting out the offending lines that caused this bug. (However I have yet to see an argument for why my PR #1000 doesn't fix things.)

@pd3
Member Author

pd3 commented Dec 11, 2019 via email

@jkbonfield
Contributor

jkbonfield commented Dec 11, 2019 via email

@jmarshall
Member

jmarshall commented Dec 11, 2019

I think this is most likely caused by a bug in GATK (see below), which is now being revealed by a bug in HTSlib.

Nice analysis. I didn't immediately notice any GATK issues reporting this bug, but there are a few interesting mentions of MPOS=-2147483648, notably broadinstitute/gatk#5684 (comment).

This is broadinstitute/gatk#5492; see also this GATK forum thread.

@PedalheadPHX

@jkbonfield Apologies for the late reply. We encountered this issue in all tests to date, about 5-10 independent samples. In at least one I removed the line in question on chromosome 1 and encountered the same error on chr19, so it's clearly not extremely common within one file, but to date we have hit it each time. As for priority, we had been waiting for this new release to leverage the updated samtools markdup in particular; for VCF processing we elected to stay with bcftools v1.9 for the time being, but would be interested in moving to the new version when available.

@PedalheadPHX

Regarding how often we see MPOS=-2147483648: it is fairly common in some files, with at least 10 instances in chr1 alone on one genome. They are clearly in garbage areas of the genome (see attached IGV images), but not limited to INDELs. Most are not even clear somatic events.

[Attached IGV screenshots: Screen Shot 2019-12-12 at 3.45.59 PM, 3.43.53 PM, 3.36.14 PM]

@valeriuo
Contributor

Fixed by #1003
