VCF 4.4 issues #640

Closed
d-cameron opened this issue Apr 4, 2022 · 4 comments

@d-cameron
Contributor

  1. END

“For precise variants, END is POS + length of REF allele − 1”

chr 1 id AT <DEL> . . SVLEN=100

For the above variant, the given definition means that END=1+2-1=2 which is not the intended definition.

  • Need to fix the definition to include the SVLEN of the variant.
  • Need to explicitly state that SVLEN takes priority over END (since END is used for gVCF), and that implementations MAY infer SVLEN from END but MUST use SVLEN if both SVLEN and END are present.
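
As an illustration only (not part of the original comment), the mismatch and the proposed priority rule can be sketched in Python. The function names are hypothetical, and the `END - POS` inference is an assumption that treats POS as a padding base; the spec text under discussion does not fix that formula here.

```python
from typing import Optional


def end_from_ref(pos: int, ref: str) -> int:
    """END under the quoted definition: POS + length of REF allele - 1."""
    return pos + len(ref) - 1


def resolve_svlen(pos: int, info: dict) -> Optional[int]:
    """Apply the proposed rule: SVLEN wins whenever it is present."""
    if "SVLEN" in info:
        return int(info["SVLEN"])
    if "END" in info:
        # Assumption: infer the length as END - POS (POS is a padding base).
        return int(info["END"]) - pos
    return None


# Example record from above: chr 1 id AT <DEL> . . SVLEN=100
pos, ref = 1, "AT"
print(end_from_ref(pos, ref))                   # 2, not the intended end
print(resolve_svlen(pos, {"SVLEN": "100"}))     # 100
```

With both keys present, `resolve_svlen` ignores END entirely, which is the MUST part of the proposal; the fallback branch is the MAY part.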
  2. SVLEN

“SVLEN is defined for INS, DUP, INV, and DEL symbolic alleles as the number of the inserted, duplicated, inverted, and deleted bases respectively.”

This is redefining SVLEN for DEL events as well as DUP & INV. Ok to do, but less backwards compatible

  3. MEINFO/METRANS

Add clarifying text that the number of entries must be 4 * # alt because each entry contains 4 values (since they are comma separated and that's the string list separator).
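
A minimal sketch of that counting rule, for illustration only: because the four values of each entry are themselves comma-separated, a plain split on `,` yields 4 values per ALT allele. The helper names are hypothetical, and the sample value assumes the NAME,START,END,POLARITY layout of MEINFO.

```python
from typing import List, Tuple


def meinfo_count_ok(value: str, n_alt: int) -> bool:
    """True if the comma-separated list holds exactly 4 values per ALT."""
    return len(value.split(",")) == 4 * n_alt


def split_meinfo(value: str) -> List[Tuple[str, ...]]:
    """Group the flat comma-separated list into one 4-tuple per ALT."""
    fields = value.split(",")
    return [tuple(fields[i:i + 4]) for i in range(0, len(fields), 4)]


# One ALT allele: NAME,START,END,POLARITY (assumed layout)
print(meinfo_count_ok("AluYa5,5,307,+", 1))  # True
print(split_meinfo("AluYa5,5,307,+"))        # [('AluYa5', '5', '307', '+')]
```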

  4. CICN

Should be Float, not Integer

  5. CN

The current definition, “Copy number of segment containing breakend”, doesn't make sense except for BND variants. Need separate definitions for <CNV>, breakpoint, and symbolic allele variants.

  6. CNADJ

Same issue as CN. Also need to define in which direction the adjacency is.

@tcezard
Contributor

tcezard commented Jun 13, 2022

  1. END
    The specified definition, “For precise variants, END is POS + length of REF allele − 1”, and the field description, “End position of the variant described in this record”, seem to indicate that END specifies the last position of the reference genome where the variant is inserted. I am guessing that this is useful for indexing (sorry if this is obvious to everyone but me).
    I agree that in the case of a long deletion END should be POS + SVLEN − 1, but that is not universally true for the other types of variants.

  2. SVLEN
    was not defined in VCF 4.3, so it's not clear how this new definition changes it. Based on the examples, the only one being changed is DEL. Depending on the outcome of the discussion on END, we might want to keep SVLEN for deletions as negative numbers. If we don't, we need to at least change the examples.

  3. MEINFO/METRANS
    The text in VCF 4.4 already specifies that the number of entries must be 4 * # alt, although there is a typo in the text for METRANS.

  4. CICN
    I think I understand where you're coming from: the confidence interval is due to the imprecision of the data, which is not necessarily going to give you an integer. For the sake of the argument, though, wouldn't that also apply to CIPOS or CILEN?

  5. CN
    I agree that the definition is not applicable to all CNVs. DP also has the same definition issue. I think that the copy number definition needs a bit of work to cover all the use cases.

Some of the issues above are simple to fix, but others probably need more work. Maybe a separate issue, at least for copy number variation, would be warranted.

d-cameron pushed a commit to d-cameron/hts-specs that referenced this issue Jun 21, 2022
…ified END & symbolic SVs start position
@d-cameron
Contributor Author

“I agree that in the case of a long deletion END should be POS + SVLEN − 1, but that is not universally true for the other types of variants.”

It's POS + SVLEN as the actual SV starts after POS. Buried in S1.4.1.4 all the way back to VCF4.1 is this fairly unambiguous statement:

“If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String “<ID>”) then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.”

I've been adding matching clarification text to other sections for 4.4 RC2 so it's much more obvious that there is indeed a padding base for SVs.
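
As a sketch of what the padding-base rule implies for a symbolic deletion (illustration only, hypothetical function name): the deleted bases start at POS + 1 and the last deleted base is at POS + SVLEN. Taking the absolute value of SVLEN is an assumption, made so that the negative-SVLEN convention for deletions mentioned earlier in the thread gives the same coordinates.

```python
from typing import Tuple


def deletion_span(pos: int, svlen: int) -> Tuple[int, int]:
    """Return (first deleted base, END) for a <DEL> with a padding base at POS."""
    length = abs(svlen)        # assumption: tolerate negative SVLEN for deletions
    first_deleted = pos + 1    # POS is the base preceding the deletion
    end = pos + length         # END = POS + SVLEN
    return first_deleted, end


# Example record from the first comment: chr 1 id AT <DEL> . . SVLEN=100
first, end = deletion_span(1, 100)
print(first, end)              # 2 101
print(end - first + 1)         # 100 deleted bases
```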

@d-cameron
Contributor Author

4. “For the sake of the argument though wouldn't that also apply to CIPOS or CILEN”

Not really, as CIPOS and CILEN are CIs about fundamentally discrete positional values. I have a PR that changes CN to Float so it's actually a usable field for somatic samples but that's a different argument for a different thread.

@d-cameron
Contributor Author

5. “DP also has the same definition issue.”

I dislike DP as an INFO field even more than I dislike AF (that is, both should be FORMAT-only fields). AF can at least serve as a proxy for population penetrance. Unfortunately, it's a bit late to remove it from the specs now.

d-cameron pushed a commit to d-cameron/hts-specs that referenced this issue Jun 21, 2022

2 participants