VCF 4.2 contig header clarification #162

Open
keithj opened this issue Aug 2, 2016 · 2 comments
keithj commented Aug 2, 2016

Section 1.2

There is no description of the permitted values for fileDate, source or phasing fields.

Section 1.2.7 (Contig field format) states

As with chromosomal sequences [...] The format is identical to that of a reference sequence, but with an additional URL tag [...]

yet there is no description of how chromosomal sequences are handled, and the format of a reference sequence header record is not described.

Section 1.4.1 (Fixed fields) states

CHROM - chromosome: An identifier from the reference genome or an angle-bracketed ID string pointing to a contig in the assembly file.

The use of _or_ implies exclusivity, i.e. if it is an identifier from the reference genome, it is not an angle-bracketed ID string. When I create a VCF using reference genome identifiers, without inserting the chromosomes as angle-bracketed ID strings in the header, both @cyenyxe's validator and bcftools give warnings related to missing header IDs.

Contributor

d-cameron commented Aug 3, 2016

When I create a VCF using reference genome identifiers, without inserting the chromosomes as angle-bracketed ID strings in the header, both @cyenyxe's validator and bcftools give warnings related to missing header IDs.

This is expected behavior. The angle-bracketed ID string refers to the CHROM field itself, not to the syntax for defining reference sequences in the header (which also happens to use angle brackets). VCF 4.2 Section 5.4.2 has an example where angle-bracketed ID strings are used in the CHROM field:

#CHROM  POS     ID     REF  ALT            QUAL  FILTER  INFO
13      123456  bnd_U  C    C[<ctg1>:229[  6     PASS    SVTYPE=BND
13      123457  bnd_V  A    ]<ctg1>:45]A   6     PASS    SVTYPE=BND
<ctg1>  1       bnd_X  A    ]<ctg1>:329]A  6     PASS    SVTYPE=BND
<ctg1>  329     bnd_Y  T    T[<ctg1>:1[    6     PASS    SVTYPE=BND

In this example, 13 requires a header line since it is a reference contig, but ctg1 does not since it is not in the reference genome.
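To make the rule above concrete, here is a minimal Python sketch of the kind of check that would produce such a warning. It is only an illustration of the rule as described in this comment, not how bcftools or the validator actually implement it. The function name `missing_contig_headers`, the contig length, and the extra record on chromosome 14 are made up for the example.

```python
import re

# Matches the ID key anywhere inside a ##contig=<...> header line.
CONTIG_ID = re.compile(r"^##contig=<(?:.*,)?ID=([^,>]+)")


def missing_contig_headers(lines):
    """Return CHROM names used in records that have no ##contig header line.

    Angle-bracketed CHROM values (e.g. <ctg1>) are skipped, on the assumption
    that they point at the assembly file rather than the reference genome.
    """
    declared = set()
    missing = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##contig="):
            match = CONTIG_ID.match(line)
            if match:
                declared.add(match.group(1))
        elif line and not line.startswith("#"):
            chrom = line.split("\t", 1)[0]
            if chrom.startswith("<") and chrom.endswith(">"):
                continue  # assembly-file contig, no header line expected
            if chrom not in declared and chrom not in missing:
                missing.append(chrom)
    return missing


# Illustrative input: the length value and the chromosome 14 record are hypothetical.
example = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=13,length=115169878>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "13\t123456\tbnd_U\tC\tC[<ctg1>:229[\t6\tPASS\tSVTYPE=BND",
    "<ctg1>\t1\tbnd_X\tA\t]<ctg1>:329]A\t6\tPASS\tSVTYPE=BND",
    "14\t100\t.\tA\tT\t6\tPASS\t.",
]
print(missing_contig_headers(example))  # ['14']
```

Here 13 is declared in the header and <ctg1> is skipped, so only the undeclared reference name 14 is reported.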

Author

keithj commented Aug 3, 2016

Thanks for the clarification. I'll submit some suggestions as a pull request in due course.

Is there a description of the format of the assembly file? I can infer that it might be FASTA from looking at the file names in the examples. Are there any restrictions on what can go in it? E.g. single versus multiple sequences.

jmarshall added the vcf label Sep 1, 2016
jmarshall changed the title from "VCF 4.2 config header clarification" to "VCF 4.2 contig header clarification" Dec 1, 2016