Modify VCF to support GRC assembly model (GRCh37, GRCh38, GRCm38) #51

Open
deannachurch opened this issue Nov 12, 2014 · 2 comments

@deannachurch

There are a couple of limitations in the current VCF spec that make it difficult to fully represent data on the full GRC assemblies (specifically GRCh37, GRCh38 and GRCm38). These are:

  • lack of representation of the relationship between alternate alleles and their chromosome locations (i.e. maintaining the allelic relationships of the alternate sequences).
  • inability for an ID to have more than one location. While this is a valid requirement within a specific assembly unit, it needs to be relaxed when describing data on the full assembly, as a feature can validly be on the Primary assembly as well as on an alternate locus.
    To get around this issue, for example, NCBI emits two VCF files: one for the Primary assembly and one for the alternate sequences. This also changes how the ID column is used, and it is used differently in the two files, because some rsIDs can be on more than one alternate locus in a biologically consistent way.

This issue was discussed at a workshop put on by the GRC at Genome Informatics 2014 and there are a series of proposals we'd like to put forth. A set of coherent examples can be found here: http://www.slideshare.net/GenomeRef/variant-calling-ii

  • Augment VCF header with more specific assembly information:
##seq-info=<name=chr17, id=CM000679.2>
##region-info=<name=MAPT, id=GL000258.2, assoc_id=CM000679.2, reg=45309498-46836265>
  • Introduce new reserved VCF INFO and FORMAT tags:
##INFO=<ID=ALTLOCS, Number=.,Type=String,Description="A list of the alternate loci in the reference genome that are associated with this locus">

##INFO=<ID=ALTHAPS, Number=.,Type=String,Description="A list of the known haplotypes that are associated with this locus">

##FORMAT=<ID=HT,Number=1,Type=String,Description="Haplotype combination based on ALTHAPS">
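
To make the header proposal more concrete, here is a minimal Python sketch of how a tool might read the proposed ##seq-info and ##region-info lines. These header keys are only a proposal and are not part of the current VCF specification. The function names (parse_structured_line, load_assembly_model) and the dictionary layout are hypothetical, and the comma splitting is deliberately naive (it would not cope with quoted values that contain commas).

```python
import re

# Header lines copied from the proposal in this issue.
# Assumption: this syntax is a proposal only, not current VCF.
PROPOSED_HEADER = """\
##seq-info=<name=chr17, id=CM000679.2>
##region-info=<name=MAPT, id=GL000258.2, assoc_id=CM000679.2, reg=45309498-46836265>
"""

# Matches structured meta lines of the form ##key=<field=value, field=value>
LINE_RE = re.compile(r"^##(?P<key>[\w-]+)=<(?P<body>.*)>$")


def parse_structured_line(line):
    """Return (key, {field: value}) for a '##key=<a=b, c=d>' line, else None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = {}
    # Naive split: fine for these lines, not for quoted values with commas.
    for item in match.group("body").split(","):
        name, _, value = item.strip().partition("=")
        fields[name] = value
    return match.group("key"), fields


def load_assembly_model(text):
    """Collect sequence names and alt-locus regions from proposed header lines."""
    sequences = {}  # accession -> display name
    regions = []    # one dict per alternate locus region
    for line in text.splitlines():
        parsed = parse_structured_line(line)
        if parsed is None:
            continue
        key, fields = parsed
        if key == "seq-info":
            sequences[fields["id"]] = fields["name"]
        elif key == "region-info":
            start, end = (int(x) for x in fields["reg"].split("-"))
            regions.append({
                "name": fields["name"],
                "alt_id": fields["id"],
                "chrom_id": fields["assoc_id"],
                "start": start,
                "end": end,
            })
    return sequences, regions


if __name__ == "__main__":
    sequences, regions = load_assembly_model(PROPOSED_HEADER)
    print(sequences)
    print(regions)
```

With the two example lines, this links the alternate locus GL000258.2 to the region 45309498-46836265 on CM000679.2 (chr17), which is the allelic relationship the first bullet above says VCF currently cannot express.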

There may be additional improvements/suggestions that can be made, but these seem like a reasonable start. Making these types of modifications will be an important part of helping groups migrate to GRCh38.

@jmarshall jmarshall added the vcf label Nov 12, 2014
@lh3
Member

lh3 commented Nov 12, 2014

On the representation of alt contigs, I think we should develop a best practice before modifying the spec. What is the intended output from variant callers? Is it practical for callers to generate such output? How are downstream tools supposed to use the VCF?

Specifically, you proposed to add HT, but in my experience, alt contigs frequently recombine with each other, which makes the tag inapplicable most of the time. In addition, how are we supposed to use ALTLOCS? If we know that a locus overlaps an alt contig, what can we do with it?

We will be clearer about the answers, and can then develop the right spec, once more researchers gain experience with GRCh38. Tools determine the adoption of alt contigs. It is not urgent to change the spec.

@deannachurch
Author

I think this is a bit of a chicken and egg problem. If we want variant callers to be able to use the Alt loci, we need to be able to express the variants in VCF. This doesn't work well with the current spec (see how dbSNP distributes data).

I think the issue is that there are multiple ways to use VCF; it is just a reporting tool. dbSNP uses it to dump data, so there you want to report all genomic contexts for a given SNV. An argument could be made that in the context of an individual genome, you may only want to report one context for a SNP, but how do you handle that when you have multiple samples in the VCF? I fear that decision making will be hard.
I agree that trying to define some best practices is useful.
To attempt to address some specific issues:

  • knowing an alt-locus is allelic with a region on the chromosome lets you put the alt-locus in chromosome context, which is useful for reporting. Granted, this could be done as some sort of post-processing step, but then you'd have to convert the data to some other format (which may be OK, but is likely inconvenient for everyone who wants VCF).
  • I agree there are few loci with named haplotypes, but for the ones that exist, this is useful information.
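
As one possible answer to "what can we do with ALTLOCS", here is a minimal Python sketch that looks up which alternate loci are allelic with a given chromosome position. The MAPT region is the example from the original proposal in this issue. The AltRegion type and the alt_loci_for function are hypothetical names, and the sketch assumes the region coordinates are 1-based and inclusive on the associated chromosome, which the proposal does not state.

```python
from typing import List, NamedTuple


class AltRegion(NamedTuple):
    """One alternate locus and the chromosome region it is allelic with."""
    name: str
    alt_id: str    # accession of the alternate locus sequence
    chrom_id: str  # accession of the associated chromosome
    start: int     # assumption: 1-based, inclusive chromosome coordinate
    end: int       # assumption: 1-based, inclusive chromosome coordinate


# Example region taken from the ##region-info line in the original proposal.
REGIONS = [
    AltRegion("MAPT", "GL000258.2", "CM000679.2", 45309498, 46836265),
]


def alt_loci_for(chrom_id: str, pos: int, regions=REGIONS) -> List[str]:
    """Return accessions of alt loci whose associated region contains pos."""
    return [
        region.alt_id
        for region in regions
        if region.chrom_id == chrom_id and region.start <= pos <= region.end
    ]


if __name__ == "__main__":
    # A chr17 position inside the MAPT region reports the alt locus.
    print(alt_loci_for("CM000679.2", 46000000))  # ['GL000258.2']
    # A position outside any region reports nothing.
    print(alt_loci_for("CM000679.2", 1000))      # []
```

A reporting tool could use the returned list to fill an ALTLOCS-style value for a record on the Primary assembly. Going the other way, from alt-locus coordinates to exact chromosome coordinates, needs the alignment between the two sequences and is not covered here.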

This is really meant to start the discussion about how we want to represent variation on GRCh38. It will be good to have some concrete examples.
