graph can imply unobserved sequences #1

Closed
davebiffuk opened this issue Feb 26, 2015 · 2 comments

Comments

@davebiffuk

What are the implications of the graph encoding (i.e. implying the existence of) sequences which have never been observed? e.g. variants at different locations which are not seen in a single individual but are on a valid path through the graph. Is there a way to encode that sort of contextual data in the graph?

@ekg
Member

ekg commented Feb 26, 2015

@davebiffuk is referring to the phenomenon that occurs when phasing information about the original sequences is removed and the graph is constructed using only the edit information implied by the VCF, ignoring any haplotypes that are given as input. (As of the time of this comment, this is the default mode of VCF-based construction in vg.)

This is a really interesting issue. It presents some problems but may not be completely avoidable in all contexts. I'll try to explain when it does and doesn't make sense to allow the graph to imply unobserved sequences.

Suppose this graph had been generated from two original input sequences, one red and one black:

[Figure: Variation graph]

Now we can find many sequences in it that may never have been observed, for instance [1,2,5,6,7]. There are many of these: the graph contains 48 paths in total. You can see this by running this code in the test directory of this repo:

# first we make a sub-graph by taking the head of the GFA format version of the graph
vg construct -r small/x.fa -v small/x.vcf.gz \
    | vg view - | head -28 | vg view -v - >y.vg
# then we make all paths of ~length 40, which includes all the paths in this graph
vg paths -s -l 40 y.vg | wc -l # 48
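
To make the counting concrete without vg, here is a minimal Python sketch. It does not use the graph from the figure: the node ids, edges, and the two "observed" haplotypes below are hypothetical, chosen only to show how a few bubbles multiply into paths that nobody observed.

```python
def enumerate_paths(edges, source, sink):
    """Return every source-to-sink walk in a DAG as a tuple of node ids."""
    paths = []
    stack = [(source,)]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == sink:
            paths.append(path)
            continue
        for nxt in edges.get(node, ()):
            stack.append(path + (nxt,))
    return paths


# Hypothetical toy graph (not the one in the figure):
# nodes 1, 4, 7, 10 are shared; (2, 3), (5, 6), (8, 9) are biallelic bubbles.
TOY_EDGES = {
    1: [2, 3],
    2: [4], 3: [4],
    4: [5, 6],
    5: [7], 6: [7],
    7: [8, 9],
    8: [10], 9: [10],
}

# Two hypothetical input haplotypes, standing in for "red" and "black".
OBSERVED = {(1, 2, 4, 5, 7, 8, 10), (1, 3, 4, 6, 7, 9, 10)}

ALL_PATHS = enumerate_paths(TOY_EDGES, 1, 10)
UNOBSERVED = [p for p in ALL_PATHS if p not in OBSERVED]

print(len(ALL_PATHS))   # 8 paths in total (2 choices at each of 3 bubbles)
print(len(UNOBSERVED))  # 6 of them were never observed
```

Each additional independent bubble doubles the path count while adding only two nodes, which is the same effect the vg paths count above shows on the real test graph.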

This seems problematic, but there are a few things to keep in mind.

  1. Homologous regions can support recombination. Although recombination is rare (on the order of de novo variation), it does happen and it can happen anywhere. Many species do not have specific recombination loci. These paths could exist in the event of a specific set of recombinations between the original red and black sequences.
  2. This succinct representation is large in terms of the haplotype space but small in terms of sequence space. This enables sensitive and efficient pairwise local alignment algorithms to run natively on the variation graph.
  3. Allowing a graph to encode sequences which haven't been observed could be expedient. For example, you may not know the actual genomes that were observed, and only have information about variants and frequencies. This is actually a rather common situation, particularly when the identities of the individuals who have gone into the list of variants are private or not shareable. A variant list is easy to exchange and rather lightweight relative to a full set of haplotypes.

Problems do occur. In particular, if one naïvely samples k-mers of a particular length from a graph that allows many recombinations between closely spaced variants, certain regions will generate huge numbers of k-mers. This limits our ability to map to those regions and, in the extreme, even our ability to generate the k-mer index of the graph (done via vg index -k N x.vg).
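
The blow-up can be sketched in a few lines of Python. This is an illustration only and not how vg enumerates k-mers: the graph builder `snp_chain`, the node names, and the sequences are all hypothetical, and the function counts k-mer walks (a k-mer together with the node path that spells it) rather than distinct strings.

```python
def kmer_walks(seqs, edges, k):
    """List every (kmer, node_path) pair of length k spelled by a walk in the graph."""
    walks = []

    def extend(node, offset, prefix, path):
        # Take as many bases as are still needed from the current node.
        prefix += seqs[node][offset:offset + k - len(prefix)]
        path = path + (node,)
        if len(prefix) == k:
            walks.append((prefix, path))
            return
        # Node exhausted: continue into every successor.
        for nxt in edges.get(node, ()):
            extend(nxt, 0, prefix, path)

    for node, seq in seqs.items():
        for offset in range(len(seq)):
            extend(node, offset, "", ())
    return walks


def snp_chain(n_sites):
    """Hypothetical graph: n biallelic SNPs (A/C), each separated by one shared base (T)."""
    seqs, edges = {"s0": "T"}, {}
    prev = "s0"
    for i in range(1, n_sites + 1):
        ref, alt, shared = f"ref{i}", f"alt{i}", f"s{i}"
        seqs[ref], seqs[alt], seqs[shared] = "A", "C", "T"
        edges[prev] = [ref, alt]
        edges[ref] = [shared]
        edges[alt] = [shared]
        prev = shared
    return seqs, edges


# The same 9 bases as a plain linear reference (reference alleles only).
LINEAR_SEQS, LINEAR_EDGES = {"ref": "TATATATAT"}, {}
GRAPH_SEQS, GRAPH_EDGES = snp_chain(4)

LINEAR_5 = kmer_walks(LINEAR_SEQS, LINEAR_EDGES, 5)
GRAPH_5 = kmer_walks(GRAPH_SEQS, GRAPH_EDGES, 5)
GRAPH_9 = kmer_walks(GRAPH_SEQS, GRAPH_EDGES, 9)

print(len(LINEAR_5))  # 5
print(len(GRAPH_5))   # 28
print(len(GRAPH_9))   # 16 (2**4: every combination of the four SNPs)
```

In this toy, a k-mer spanning s variant sites is spelled by 2**s walks, so the count grows with the density of variants inside a k-mer window rather than with the length of the region.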

This issue can be mitigated in several ways.

  1. We can limit the number of edges that may be crossed when a k-mer is generated. (To do this, specify -e to vg paths, vg kmers, or vg index, as in: vg paths -s -k 21 -e 9 x.vg.)
  2. We can construct using a VCF which has short haplotypes combined into a single variant (so, multiallelic variants with long lengths against the reference) and use this for graph construction. Note that this isn't yet supported, but would require only a few minor changes to begin testing.
  3. We could remove any edge that doesn't lie in a path defined by one of the input haplotypes. This also isn't supported but would be straightforward to do, and is probably better than (2).
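
Option (3) can be sketched as follows. This is a rough Python illustration, not vg code (the option is described above as unsupported): `prune_to_haplotypes`, the node ids, and the haplotypes are hypothetical. The toy graph has two directly adjacent variant sites so that some edges are used by no haplotype.

```python
def prune_to_haplotypes(edges, haplotypes):
    """Keep only the edges that lie on at least one haplotype path."""
    used = set()
    for hap in haplotypes:
        used.update(zip(hap, hap[1:]))
    return {node: [nxt for nxt in succ if (node, nxt) in used]
            for node, succ in edges.items()}


def count_paths(edges, node, sink):
    """Count the walks from node to sink in a DAG."""
    if node == sink:
        return 1
    return sum(count_paths(edges, nxt, sink) for nxt in edges.get(node, ()))


# Hypothetical graph: site A is nodes 2/3, site B is nodes 4/5, directly adjacent.
ADJACENT_EDGES = {
    1: [2, 3],
    2: [4, 5], 3: [4, 5],
    4: [6], 5: [6],
}
HAPLOTYPES = [(1, 2, 4, 6), (1, 3, 5, 6)]

PRUNED_EDGES = prune_to_haplotypes(ADJACENT_EDGES, HAPLOTYPES)

print(count_paths(ADJACENT_EDGES, 1, 6))  # 4 paths before pruning
print(count_paths(PRUNED_EDGES, 1, 6))    # 2 paths after pruning
```

One limit of this sketch: when two variant sites are separated by a shared node, every edge around that node is used by some haplotype, so edge pruning alone removes nothing and the recombinant paths remain.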

I haven't pursued the latter two approaches because I think it should be possible for people to use vg to build graphs when they only have variant lists. Doing so is sometimes harder than building and mapping against graphs that contain only observed haplotypes as paths. Experience with real data will likely suggest the best and most general approach.

@ekg
Member

ekg commented Apr 23, 2015

I'm going to mark this as closed. The way to resolve this in the future is to store the paths of haplotypes, perhaps as compressed bitvectors, or perhaps in pBWT, perhaps both.
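
As a rough illustration of the bitvector idea, here is a Python sketch that uses plain integers as uncompressed bitsets. It is one possible reading of the comment, not vg's implementation and not a pBWT; the function names, node ids, and haplotypes are hypothetical, and it assumes an acyclic graph in which each haplotype crosses an edge at most once.

```python
def edge_masks(haplotypes):
    """Map each edge to a bitmask whose bit i is set if haplotype i crosses it."""
    masks = {}
    for i, hap in enumerate(haplotypes):
        for edge in zip(hap, hap[1:]):
            masks[edge] = masks.get(edge, 0) | (1 << i)
    return masks


def supporting_haplotypes(walk, masks, n_haplotypes):
    """Return the indices of haplotypes that contain every edge of the walk."""
    support = (1 << n_haplotypes) - 1  # start with all haplotypes as candidates
    for edge in zip(walk, walk[1:]):
        support &= masks.get(edge, 0)
        if not support:
            break
    return [i for i in range(n_haplotypes) if (support >> i) & 1]


# Hypothetical haplotypes over a two-bubble graph.
HAPS = [(1, 2, 4, 5, 7), (1, 3, 4, 6, 7)]
MASKS = edge_masks(HAPS)

print(supporting_haplotypes((1, 2, 4, 5, 7), MASKS, len(HAPS)))  # [0]
print(supporting_haplotypes((1, 2, 4, 6, 7), MASKS, len(HAPS)))  # [] (recombinant)
print(supporting_haplotypes((4, 6, 7), MASKS, len(HAPS)))        # [1]
```

With such an annotation the graph can keep all of its edges while k-mer generation or mapping skips any walk whose support is empty.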

@ekg ekg closed this as completed Apr 23, 2015
ekg pushed a commit that referenced this issue Feb 19, 2016
Merge master into dev branch.
adamnovak pushed a commit that referenced this issue Oct 1, 2018
@ekg ekg mentioned this issue Apr 5, 2019
Robin-Rounthwaite added a commit that referenced this issue Mar 26, 2021
windows-friendly-filenames
jeizenga pushed a commit that referenced this issue Sep 23, 2021
A spelling correction in  autoindex_main.cpp
@xiaoguizz xiaoguizz mentioned this issue Feb 7, 2023