graph can imply unobserved sequences #1
Comments
@davebiffuk is referring to the phenomenon that occurs when phasing information about the original sequences is removed and the graph is constructed using only the edit information implied by the VCF, ignoring any haplotypes that are given as input. (As of the time of this comment, this is the default mode of VCF-based construction in [...].)

This is a really interesting issue. It presents some problems but may not be completely avoidable in all contexts. I'll try to explain when it does and doesn't make sense to allow the graph to imply unobserved sequences.

Suppose this graph had been generated from two original input sequences, one red and one black:

Now, we are able to find many sequences in it that may never have been observed. For instance,

    # first we make a sub-graph by taking the head of the GFA format version of the graph
    vg construct -r small/x.fa -v small/x.vcf.gz \
      | vg view - | head -28 | vg view -v - >y.vg
    # then we make all paths of ~length 40, which includes all the paths in this graph
    vg paths -s -l 40 y.vg | wc -l # 48

This seems problematic, but there are a few things to keep in mind.
Problems do occur. In particular, if one naïvely samples k-mers of a particular length from a graph that allows many recombinations between closely spaced variants, certain regions will generate huge numbers of k-mers. This limits our ability to map to those regions and, in the extreme, even our ability to generate the k-mer index of the graph (done via [...]).

This issue can be mitigated in several ways.
I haven't pursued the second two approaches because I think it should be possible for people to use [...]

I'm going to mark this as closed. The way to resolve this in the future is to store the paths of haplotypes, perhaps as compressed bitvectors, or perhaps in pBWT, perhaps both.
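
To make the point of this thread concrete, here is a minimal Python sketch of a toy model. It is not vg code and does not use vg's data structures: the reference string, the variant positions, the two haplotypes, and every function name are made up for illustration. It compares the walks and k-mers implied by a graph built from variants alone with those supported by stored haplotype paths.

```python
from itertools import product

# Toy model (hypothetical, not vg's data structures): a linear reference
# with closely spaced biallelic SNPs, each offering a ref or an alt base.
REFERENCE = "ACGTACGTACGTACGTACGT"
VARIANTS = {3: "A", 6: "C", 9: "G", 12: "T"}  # position -> alternate base

# Two observed haplotypes, each given as the set of positions carrying alt.
HAPLOTYPES = [frozenset(), frozenset({3, 6, 9, 12})]


def spell(alt_positions):
    """Return the sequence with alt alleles applied at the given positions."""
    bases = list(REFERENCE)
    for pos in alt_positions:
        bases[pos] = VARIANTS[pos]
    return "".join(bases)


def all_graph_sequences():
    """Every walk through the graph: each site independently takes ref or alt."""
    sites = sorted(VARIANTS)
    for choice in product([False, True], repeat=len(sites)):
        yield spell(p for p, use_alt in zip(sites, choice) if use_alt)


def kmers(sequences, k):
    """Distinct k-mers over a collection of sequences."""
    return {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}


graph_seqs = set(all_graph_sequences())
hap_seqs = {spell(h) for h in HAPLOTYPES}

K = 12
graph_kmers = kmers(graph_seqs, K)
hap_kmers = kmers(hap_seqs, K)

print(len(graph_seqs), "walks implied by the graph;", len(hap_seqs), "observed")
print(len(graph_kmers), "k-mers from all walks;", len(hap_kmers), "from haplotypes")
```

With four biallelic sites the graph implies 2^4 = 16 walks, although only two sequences were ever observed. Restricting k-mer generation to the stored haplotypes keeps only k-mers that occur in real sequences, at the cost of keeping the haplotype paths alongside the graph.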
What are the implications of the graph encoding (i.e. implying the existence of) sequences which have never been observed? For example, variants at different locations that are never seen together in a single individual can still lie on a valid path through the graph. Is there a way to encode that sort of contextual data in the graph?