Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vcf int64 part2 #1004

Closed
wants to merge 3 commits into from
Closed

Conversation

jkbonfield
Copy link
Contributor

@jkbonfield jkbonfield commented Dec 11, 2019

NOTE: this is branched off my vcf_int64 PR #1000 so it contains that commit also.

This adds extra checks to vcf parsing to set a flag to indicate the record contains 64-bit data which can only be written back to VCF, and not to BCF.

It's probably not complete (I don't check FORMAT data, only INFO and POS), but I'm not sufficiently familier with the vcf parser and format to know where 64-bit data can occur. I think it's just POS and INFO (for END)?

Anyway, this is a proof of principle, hence in its own separate PR so we can merge the other one if needed without too many shenanigans.

Any 64-bit INFO field that wasn't the last in the list would cause
subsequent fields to be decoded incorrectly.

This commit fixes that, plus updates the tests accordingly so the bug
could be triggered.

Fixes the first part of samtools#999 (test1.vcf), but doesn't fix the second
part (BCF output silently being broken).

Fixes samtools/bcftools#1123
@jkbonfield jkbonfield mentioned this pull request Dec 11, 2019
@daviesrob
Copy link
Member

You should also check v->pos and v->rlen as they could both contain 64-bit values that cannot be stored in BCF.

To reduce concerns about change in behaviour due to 64 bit INFO tags, it should be fairly easy to restrict them to just the END tag. There's already a strcmp(key, "END") in vcf_parse so it can detect the need to adjust v->rlen. If that was added to the if (n_val == 1) condition, END tags could take 64 bit values while all others would work as before.

@pd3
Copy link
Member

pd3 commented Dec 11, 2019

This goes in the right direction, but should build on top of #1003. I agree with forbidding writing out BCFs with 64-bit values. But the problem remains: one out-of-range value will turn on 64-bit mode and will make it impossible to write out an otherwise valid BCF. It should not be compiled in by default, unless you know you are going to work with large genomes. There are terabytes of invalid VCF/BCFs sitting on sanger's drives, but I have not seen a single 64-bit genome yet. It should be the rare case that gets inconvenienced, not the major use case.

vcf.c Show resolved Hide resolved
@jkbonfield
Copy link
Contributor Author

If the problem really is that endemic, that we have thousands of invalid files and that we must be able to transform these invalid files in / out of BCF silently (with corresponding broken data due to lack of support), then for this immediate problem we should simply disable the 64-bit VCF code so we can take time to do things "right" for the next release (where "right" probably means issue a warning and transform the values to 'missing').

However that's also not your PR which amended the BCF format.

Do you have a simple PR that does this? We don't have time to do something complex as it will require a huge amount of testing which simply won't get done, at least not by me, due to insufficient time this year.

@pd3
Copy link
Member

pd3 commented Dec 11, 2019

Hm, I did consider the PR simple. Anything outside 64bit branch is just turning large values into missing ones and printing a warning. I have tests in bcftools ready, which are waiting for the change in htslib.

Also adds check for rlen (END) field when trying to output to BCF.

Note: this will emit a warning per out-of-range value, but it is
expected that the frequency of these is very low.  If not, consider
this a good icentive to fix your pipelines!
@jkbonfield
Copy link
Contributor Author

@pd3 how about this latest commit?

It is still simple. It's only 20-odd lines of new code which is about 1/3rd the size of your PR and contains fewer problematic things (like mixing long with int64_t, strtol vs strtoll, and creation of non-compliant BCF files) while offering better diagnostics and reporting to the user.

It takes @daviesrob's idea of only permitting the internal representation on the pos field and the END info tag. All other out of range info tags now get converted to "missing" and a warning is issued. It also tracks when the VCF is not representable in BCF and refuses to write one out. That can only possibly happen when BCF itself cannot be used, so it is not a regression in any shape or form.

@valeriuo
Copy link
Contributor

Superseded by #1003

@valeriuo valeriuo closed this Dec 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants