Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCF 4.2 SV errata #134

Closed
d-cameron opened this issue Mar 10, 2016 · 9 comments
Closed

VCF 4.2 SV errata #134

d-cameron opened this issue Mar 10, 2016 · 9 comments
Labels

Comments

@d-cameron
Copy link
Contributor

For precise variants, END is POS + length of REF allele - 1, and the for imprecise variants the corresponding best estimate.

This does not apply to symbolic alleles as the REF allele length is usually 1bp.

Value should be one of DEL, INS, DUP, INV, CNV, BND. This key can be derived from the REF/ALT fields but is useful for filtering.

This cannot be directly derived from REF/ALT fields as distinguishing between INS and DUP requires knowing additional reference bases (eg: A/AACGTA could be an INS or a DUP).

@jmarshall jmarshall added the vcf label Mar 10, 2016
@jmarshall
Copy link
Member

The same text appears in the 4.1 and 4.3 specs. So presumably these concerns are not limited to v4.2?

@d-cameron
Copy link
Contributor Author

Correct, although different versions have changes regarding valid SVTYPEs, they do not affect these minor wording updates.

@atks
Copy link

atks commented Mar 10, 2016

With regards to the END info tag, I think it can also be useful for non SV precise variants.

For example, consider a G/GA insertion of one A base. It is common that the insertion is actually into a stretch of A nucleotides i.e. GAAAAAAAA/GAAAAAAAAA. So it might be useful to use the END tag to represent the end of the poly A tract. This can be generalized to any repeat types that is represented explicitly as sequences. (this is very common amongst short indels)

By not strictly binding the definition of END to length(REF) + POS -1, but rather the end of an allelic region, it also allows some flexibility in representing the REF and ALT fields in a parsimonious fashion and remain correct when annotating the records identified with overlaps in the REF sequence by correctly choosing the region to check for overlapping via the END tag.

An example will be the above example G/GA again, say if we want to annotate this with low complexity regions generated by mdust for example, we will miss this annotation as the region annotated will only include the region covered by the A nucleotides, and since the G anchor base is just one position to the left, the variant will not be identified thus. Of course, this can be easily overcome by adding a window of 1 bp for checking overlaps.

@d-cameron
Copy link
Contributor Author

The end of the poly-A tract can already be represented using HOMLEN,HOMSEQ and/or CIPOS fields and a number of SV callers (at least pindel, SOCRATES and GRIDSS) already do this as microhomology is quite common at SV breakpoints. This syntax is also easier to parse and process as it is relatively straight-forward to extract the positions from ALT, POS, SVLEN, SVTYPE, END then apply CIPOS/CIEND for homology. Redefining END would create yet another edge case and break the few tools that actually do handle this.

@atks
Copy link

atks commented Mar 11, 2016

I think HOMLEN, HOMSEQ are meant for regions of micro homology, which is different from homopolymer tracts. The example I gave was a poly A tract, but the idea extends to (AC)n and other repeat like tracts.

I'm not completely convinced that this is going to be a wholesale redefinition rather than loosening the definition. For imprecise variants, we can infer that the END is an imprecise if a symbolic allele exists in the ALT column. For precise variants, the current definition is actually recoverable simply by computing length(REF) + POS -1, which makes it kind of unnecessary.

I might be misunderstanding some things here though since I understand you are coming from a SV point of view while I am talking about short repeat like indels.

@ekg - Erik, do you have any views on this with regards to short indels?

@d-cameron
Copy link
Contributor Author

Symbolic alleles are not necessarily imprecise, and that is already covered by the IMPRECISE flag. From an SV perspective, repeat expansion is an instance of breakpoint microhomology.

I would encode your example as:
chrA 2 polyAexpansion A AA . . SVTYPE=INS;SVLEN=1;CIPOS=0,8

or, using breakend notation:
chrA 1 polyAexpansionLow G GA[chrA:2[ . . SVTYPE=BND;CIPOS=0,8;HOMLEN=8;HOMSEQ=AAAAAAAA;MATEID=polyAexpansionHigh
chrA 2 polyAexpansionHigh A ]chrA:1]AA . . SVTYPE=BND;CIPOS=0,8;HOMLEN=8;HOMSEQ=AAAAAAAA;MATEID=polyAexpansionLow

and an AC(n) as:
chrA 2 polyACexpansion AC ACAC . . SVTYPE=INS;SVLEN=2;CIPOS=0,8;HOMLEN=8;HOMSEQ=ACACACAC

Annoyingly, HOMLEN and HOMSEQ don't actually encode where the homology is located with respect to the breakpoint position (hence the need for CIPOS), and for repeats, you could argue that HOMLEN=1;HOMSEQ=A is more informative, even if it is an incorrect usage of these fields.

You do raise a interesting issue: the spec doesn't seem to address breakpoint microhomology that itself is homologous to the inserted sequence (such as occurs with repeat expansion). The AC(n) encoding I gave, would technically allow ACA_ac_CACAC as well as ACAC_ac_ACAC since there is no way of specifying a stride length when specifying CIPOS.

@atks
Copy link

atks commented Mar 11, 2016

Thanks for the detailed explanation and examples for representing the repeats as breakpoints.

I would like to represent those repeats allowing for genotyping of the variability of those repeats and I think what lobSTR and repeatseq are doing now is to represent (albeit with more alleles usually) that homopolymer stretch as:

chrA 2 . AAAAAAAA AAAAAAAAA . . MOTIF=A;RU=A;RL=8;REF=8
(http://melissagymrek.com/lobstr-code/file-formats.html)

It might be a good idea to incorporate their way for representing tandem repeats into the spec as you observed. They also allow for the presence of inexact repeats. I'm adding Melissa, @mgymrek in case she is interested in pushing for the addition of tags used to represent STRs.

Back to the issue that I want to raise: Tandem repeats are usually predetermined by having a minimal length of the reference repeat tract. Therefore shorter indels that are repeat like are represented without annotation of a repeat unit and sometimes in a parsimonious fashion. I guess I was thinking along the lines of having an END tag to represent the end of that repeat tract (which you can identify by right alignment) just so that regardless of representation, we can always have the same END position. If the indel is left aligned, then regardless of representation, POS will always be the same. It is useful to store that information in an INFO tag as you need abit of additional processing with respect to the reference sequence to obtain the END position. As to why the END tag is important, I refer back to the reason of annotation which I mentioned earlier and in addition, it automatically gives the user a sense of repeatedness without an explict repeat unit being specified.

Out of curiosity, SVs are usually considered as large events involving many hundreds of bases to thousands. Is there a size cutoff recommended before SV notation are used in VCF?

@d-cameron
Copy link
Contributor Author

Pindel reports repeat expansions as:
chr1 10108 . C <INS> . PASS END=10108;HOMLEN=41;HOMSEQ=AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC;SVLEN=12;SVTYPE=INS GT:AD 0/0:268,1

With the position apparently left aligned to the start of the repeat.

dbSNP/dGV use a 50bp cut-off when determining whether an event should go into dbSNP or dGV. VCF makes no recommendation either way thus the notation is generally either fixed per tool, or changes depending on whether the exact breakpoint is known.

Cortex, for example, doesn't report symbolic alleles thus resulting in unnecessarily verbose variants such as:
chr11 45429849 UNION_BC_k31_var_1849759 CAGAATGTGGGGTCACACGTGGTCTTTTGTCTTCTCCCTGAACATTGGGTGTCTTAAGGAAAGGGATTAAATTTTAGTTCCCTCTTCATATCTCAAACAGGGCCTGTCTCATTACAGGTGCCCTGGAACAGTCATTGAACAAGAACTGGAGAAGGAGATGCAACCTGTGAAAGGCAAAGTCTCCACCTTCTTCCACACAGCCCAGCTCTGGTTCCCACAGTCACCACTGTGGAGGGTCAGCCTGGCTCCCCAGCCCCAGAGTGGCCTGGCCTGACTCACTGATCATGCAGACAGGATGTCAGGATTCCCTGGAAGCTCAGGGGCAGGCCATAGGGACTCCACTTGCCACCCTGACTGACTCAATGTGTCCAGTCTGAGTGTCCCCTGAAGAAATACATCATCTGCCATCTGAGGGCAGGCAGCCAGGCACTGACCTAGGATCAGGCCTTCTCACTTGTAGCTGGCTGCCTCCATCCTTTTGAGCTGGGAAGTTGTGAAGAATTTGTTAAGAACCAGAAAGAACTTGTTTTGGGGACCACAGGCACGTGTCAGACCCAGGTCCCAAAGCCCTGGGCTTGCAGAGCCAGGTGGGCTGTGGGAATGAGACATGGCAGGGGCAGGGTTCCTGCTGAATTGGGAGAGCCCTGTCCAGCACCAGCCAGTGAAGCCCGGTGAGTTAGCACCCAGGAAAGCCAGACCTTCTGATTTGTATGTCAATCTCCTGATATTGAAAAATCGGCTCAAGACATTTTTTGACCCTGAAAGGTAAAGAAAACCAGGCAGCTGGTCCAACCTGTCCTGTGGCTGTCGGCTTGACCTTTGTGTTAGGTTAGCATCATACCAAGCCAAGTGCACAGGTCAAGGTGTTGTGTGAGATGCAGCTCATCTGGCACATGGGGGTGGGCATTTCAAAGTCACGGTTGAGTGGATGAGTATGTGAAAGCTGCGTATCCGTCATGGATTAAGTAAACAGTGTCGAATGTCAGGAGCCATATCCGAGTTTCATTTTTACAATCATGTCAGCATCGGGTGCAGGTTGCTCATGTCGGAGGCAGGCACGAGGTACGCACATCAGCCATATGGTAGGTGCTGCTGTGTGTTACATGTGTATGTCGGTATCAGGTGTGGGGCCTCGGTTCACATGCTCATGTAAAGCGTGCCTGGCAACCTGTCTGGTACCCAGCCCCTCCTCCCAGCACCCCACTCCCCTCTCACAGGAATGGCTGCCTCTCCTGCTCAGGGCCTGTCCAGGTGACATAGCGGAAGCCCCTGCCTTTCATCTCCAGCCCCATAAGGTTGGGAGCCCCAAGCACATCCCAGCTCAGGGAATAAAGAGGCCCCTTTGTCGGACGCTACAGCAGACTCCCTGGTCCCCTCCTCCAGCCCTACCTCACCTATCCGTGTGCCCCACTGACCACAATCGCACTCGACCTGAGTCATGCTCCCTGGAGAAGGGGCCTGATGTCATCGAGAGCTGGCAGAGGAGAGGCATAGACAGGCTGCTGCTGTTAAAAATGACACCTCTGCAGGCTGCTGTGATACTGTGTGTGGGCCAGCTCTGTCCTCTATCTTCTCCTCTCCCTGTGGCACTGCCCTAGCAGGGCCTGCCTGCATTGCTCATTGCTCCAGGCTTTATGTTTGGAGAAGAGAGAGAGAGAGATGAACTGGGGCCCAGGAAAATATGCAAGTGGGGGCTGTGCATGAGGACCATGGCTCTGCTGCTTACCAGCTCTGGGAGCTTGAGCAAATTGCTTAAGCTCTTCTTGCC C . PASS KMER=31;SVLEN=-1767;SVTYPE=DEL GT:COV:GT_CONF 1/1:0,33:465.40
combined with necessarily verbose exact variants such as:
chr1 168364733 UNION_BC_k31_var_351081 C CTTCTGAGGTATCACTATCTTTTGGTTCCTTCCTAGCCCTGCCTTCTTCATGGAATCCTCTTCTTTTGCCCACTTAAACTTCGGTGTTTCTCTTCTTACACCATACCCTAATGGTGAGTGACCTTATTTAAAGCCAATTTCAGCTTCCACAGATGAGTACCAATTATCTCCAGCCTTCCCTCTGGTGTCCCACAGAGCTATCTCTCTACTAGATAGTTCTACTAGGATGCACATAGGCATTTCAAGTTCAGTATGCTAAATTCTTGGCAAAACACACAACTCCTGCTATATTTCTTATTCCCTTAAATGGCATCACAGGTCCCTTATTTGTTCATTTCAAGGATGACTCCTTTCACTCACTTGTGCATCTTCTCTAACTCAACAAACCAGTCACCACCTTCCGTGGCCTCCTTAAACTCTCCCATATCCATTAACATGCCTCCATCCTCACTGATGTTGCCTCGGATTAGTCCTTCATTGCTTCTTGGCTGGGTTACTATATGTAGTTACCATAAAGTAACCCAGGTTAACATTCTTTCTGCCATCAGTATTTTCCACCCTCCAATACATCTTCCATACTGCAGACTGAGTCATTTTTCTCAAATCCAATTTCCAAATTGATATGTCACTCTCCTCCAGTGGATTTTTCTCAATGTGCTCCTTAGTAAAGTGTTTCAAGGTCTAGCTTTTGCTTACCTCTCTAGAATCATTTCTCTTTACTAGGCTACGTACATTATTTGCTCCAATAGTACTACCCAGCCAAGGCCTGAGCTTCTACAAAATTTTTTTGCTTCTTTTTTGTCCTAACTTGCTTTCCTGTCTTTGAGCT . PASS KMER=31;PV=3;SVLEN=828;SVTYPE=INS GT:COV:GT_CONF 1/1:0,312:122.62

@atks
Copy link

atks commented Mar 16, 2016

ok. thanks. on a side note, the pindel example is a telomere repeat.

d-cameron pushed a commit to d-cameron/hts-specs that referenced this issue Jun 21, 2022
…ified END & symbolic SVs start position
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants