graph can imply unobserved sequences #1

davebiffuk · 2015-02-26T10:16:30Z

What are the implications of the graph encoding (i.e. implying the existence of) sequences which have never been observed? e.g. variants at different locations which are not seen in a single individual but are on a valid path through the graph. Is there a way to encode that sort of contextual data in the graph?

ekg · 2015-02-26T11:34:53Z

@davebiffuk is referring to the phenomenon that occurs when phasing information about the original sequences is removed and the graph is constructed using only the edit information implied by the VCF, ignoring any haplotypes that are given as input. (As of the time of this comment, this is the default mode of VCF-based construction in vg.)

This is a really interesting issue. It presents some problems but may not be completely avoidable in all contexts. I'll try to explain when it does and doesn't make sense to allow the graph to imply unobserved sequences.

Suppose this graph had been generated from two original input sequences, one red and one black:

We can now find many sequences in it that may never have been observed, for instance [1,2,5,6,7]. In total the graph contains 48 paths. You can see this by running the following in the test directory of this repo:

# first we make a sub-graph by taking the head of the GFA format version of the graph
vg construct -r small/x.fa -v small/x.vcf.gz \
    | vg view - | head -28 | vg view -v - >y.vg
# then we make all paths of ~length 40, which includes all the paths in this graph
vg paths -s -l 40 y.vg | wc -l # 48
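To see why the count grows so quickly, here is a minimal Python sketch. It assumes a hypothetical toy graph (three biallelic sites in a row, each a two-node bubble), not the graph built from small/x.fa, and `count_paths` is an illustrative helper rather than anything in vg. Each independent site doubles the number of source-to-sink paths, even though only two input sequences were used.

```python
from functools import lru_cache

# Hypothetical toy graph (NOT the graph from small/x.fa): three biallelic
# sites in a row. Keys are node IDs, values are lists of successor node IDs.
EDGES = {
    1: [2, 3],
    2: [4], 3: [4],
    4: [5, 6],
    5: [7], 6: [7],
    7: [8, 9],
    8: [10], 9: [10],
    10: [],
}

@lru_cache(maxsize=None)
def count_paths(node):
    """Count distinct walks from `node` to a sink of the acyclic graph."""
    successors = EDGES[node]
    if not successors:
        return 1  # a sink ends exactly one walk
    return sum(count_paths(nxt) for nxt in successors)

n_paths = count_paths(1)
print(n_paths)
```

With two input haplotypes and three sites, six of the resulting paths are recombinants that neither input sequence contains.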

This seems problematic, but there are a few things to keep in mind.

1. Homologous regions can support recombination. Although recombination is rare (on the order of de novo variation), it does happen and it can happen anywhere. Many species do not have specific recombination loci. These paths could exist given a specific set of recombinations between the original red and black sequences.
2. This succinct representation is large in terms of haplotype space but small in terms of sequence space. This enables sensitive and efficient pairwise local alignment algorithms to run natively on the variation graph.
3. Allowing a graph to encode sequences that haven't been observed can be expedient. For example, you may not know the actual genomes that were observed, and only have information about variants and frequencies. This is a rather common situation, particularly when the identities of the individuals who contributed to the list of variants are private or not shareable. A variant list is easy to exchange and rather lightweight relative to a full set of haplotypes.

Problems do occur. In particular, if one naïvely samples k-mers of a particular length from a graph that allows many recombinations between closely spaced variants, certain regions will generate huge numbers of k-mers. This limits our ability to map to them and, in the extreme, even our ability to generate the k-mer index of the graph (done via vg index -k N x.vg).
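The effect of variant spacing can be sketched in Python. The example assumes a hypothetical chain of biallelic SNPs separated by a shared spacer node. `build_snp_chain` and `count_kmer_walks` are illustrative names, not vg functions, and the code counts k-length walks (start position plus path) rather than reproducing what vg kmers emits.

```python
def build_snp_chain(n_sites, spacer):
    """Build a toy graph: spacer, then (A|T, spacer) repeated n_sites times."""
    seqs = {1: spacer}
    edges = {1: []}
    last = 1
    for _ in range(n_sites):
        ref, alt, anchor = last + 1, last + 2, last + 3
        seqs.update({ref: "A", alt: "T", anchor: spacer})
        edges[last] = [ref, alt]
        edges[ref] = [anchor]
        edges[alt] = [anchor]
        edges[anchor] = []
        last = anchor
    return seqs, edges

def count_kmer_walks(seqs, edges, k):
    """Count walks spelling exactly k bases, over all start positions."""
    def extend(node, offset, remaining):
        available = len(seqs[node]) - offset
        if available >= remaining:
            return 1  # the k-mer ends inside this node
        # Otherwise continue into every successor (0 if the node is a sink).
        return sum(extend(nxt, 0, remaining - available) for nxt in edges[node])

    return sum(
        extend(node, offset, k)
        for node in seqs
        for offset in range(len(seqs[node]))
    )

k = 20
for label, spacer in [("tight", "C"), ("wide", "C" * 30)]:
    seqs, edges = build_snp_chain(10, spacer)
    linear_length = 11 * len(spacer) + 10  # length of any single haplotype
    linear_kmers = linear_length - k + 1   # k-mers on one linear sequence
    print(label, count_kmer_walks(seqs, edges, k), "graph walks vs", linear_kmers, "linear")
```

When the ten sites are one base apart, every k-mer spans all of them and the walk count is multiplied by a power of two. With 30-base spacers a 20-mer can cover at most one site, so the inflation over the linear sequence stays below a factor of two.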

This issue can be mitigated in several ways.

1. We can limit the number of edges that may be crossed when a k-mer is generated. (To do this, specify -e to vg paths, vg kmers, or vg index, as in: vg paths -s -k 21 -e 9 x.vg.)
2. We can build a VCF in which short haplotypes are combined into single variants (that is, multiallelic variants with long lengths against the reference) and use this for graph construction. Note that this isn't yet supported, but would require only a few minor changes to begin testing.
3. We could remove any edge that doesn't lie in a path defined by one of the input haplotypes. This also isn't supported but would be straightforward to do, and is probably better than (2).
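Option (3) can be sketched in a few lines of Python. This is only an illustration on a hypothetical toy graph with two directly adjacent SNPs. `prune_to_haplotypes` and `count_paths` are made-up helper names, and nothing here reflects how vg would implement it.

```python
# Hypothetical toy graph: two adjacent biallelic sites with no shared node
# between them, so all four allele combinations have their own edge.
EDGES = {
    1: [2, 3],
    2: [4, 5], 3: [4, 5],
    4: [6], 5: [6],
    6: [],
}

# The two observed input haplotypes, written as node paths.
HAPLOTYPES = {
    "red": [1, 2, 4, 6],
    "black": [1, 3, 5, 6],
}

def prune_to_haplotypes(edges, haplotypes):
    """Keep only edges that are traversed by at least one haplotype path."""
    supported = set()
    for path in haplotypes.values():
        supported.update(zip(path, path[1:]))
    return {
        node: [nxt for nxt in successors if (node, nxt) in supported]
        for node, successors in edges.items()
    }

def count_paths(edges, node):
    """Count walks from `node` to a sink (fine for a tiny acyclic graph)."""
    successors = edges[node]
    if not successors:
        return 1
    return sum(count_paths(edges, nxt) for nxt in successors)

pruned = prune_to_haplotypes(EDGES, HAPLOTYPES)
print(count_paths(EDGES, 1), "paths before pruning")
print(count_paths(pruned, 1), "paths after pruning")
```

Pruning only helps where unobserved combinations have their own edges, as with the adjacent sites here. When two sites are separated by a shared node, every edge lies on some haplotype, so edge removal alone cannot exclude the recombinant paths.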

I haven't pursued the latter two approaches because I think it should be possible for people to use vg to build graphs when they only have variant lists. This is sometimes harder than building and mapping against graphs that contain only observed haplotypes as paths. Experience with real data will likely suggest the best and most general approach.

ekg · 2015-04-23T07:39:49Z

I'm going to mark this as closed. The way to resolve this in the future is to store the paths of haplotypes, perhaps as compressed bitvectors, perhaps in a pBWT, or perhaps both.
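To make the bitvector idea concrete, here is a rough Python sketch. It assumes uncompressed Python integers as bitvectors (one bit per haplotype, one integer per edge) and a hypothetical toy graph, so it is neither a compressed bitvector nor a pBWT, and the helper names are invented for illustration. It also assumes each haplotype visits a node at most once.

```python
# Hypothetical haplotypes over a toy graph with two sites separated by
# the shared node 4; each is a simple path of node IDs.
HAPLOTYPES = [
    [1, 2, 4, 5, 7],  # "red"   -> bit 0
    [1, 3, 4, 6, 7],  # "black" -> bit 1
]

def index_haplotypes(haplotypes):
    """Map each edge to an int whose bit i is set if haplotype i uses it."""
    edge_bits = {}
    for i, path in enumerate(haplotypes):
        for edge in zip(path, path[1:]):
            edge_bits[edge] = edge_bits.get(edge, 0) | (1 << i)
    return edge_bits

def supporting_haplotypes(edge_bits, walk, n_haplotypes):
    """AND the per-edge bitvectors; set bits mark haplotypes containing walk."""
    support = (1 << n_haplotypes) - 1  # start with every haplotype
    for edge in zip(walk, walk[1:]):
        support &= edge_bits.get(edge, 0)
    return support

edge_bits = index_haplotypes(HAPLOTYPES)
observed = supporting_haplotypes(edge_bits, [1, 2, 4, 5, 7], len(HAPLOTYPES))
recombinant = supporting_haplotypes(edge_bits, [1, 2, 4, 6, 7], len(HAPLOTYPES))
print(bin(observed), bin(recombinant))
```

Every edge of the recombinant walk is used by some haplotype, yet no single haplotype uses all of them, so the intersection is empty. Stored haplotype paths carry this information, whereas removing unsupported edges does not.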

ekg closed this as completed Apr 23, 2015

ekg pushed a commit that referenced this issue Feb 19, 2016

Merge pull request #1 from edawson/master

ffb295f

Merge master into dev branch.

adamnovak mentioned this issue Mar 11, 2016

Update to xg with graph PBWT support #229

Closed

glennhickey mentioned this issue Nov 10, 2016

genotyper: cactus source sink looks for node 0 #541

Closed

This was referenced Feb 13, 2017

vg map issue using pacbio reads #662

Closed

vg index segmantation fault #674

Closed

glennhickey mentioned this issue Apr 6, 2017

vg test fails at src/unittest/chunker.cpp:93 #760

Closed

ChriKub mentioned this issue Feb 19, 2018

vg call fails: Address not mapped to object #1459

Open

ChriKub mentioned this issue Mar 22, 2018

vg constuct maf import: `aln.second.size() == aln_len' #1563

Closed

glennhickey mentioned this issue Jun 13, 2018

Construction crash introduced with Safe SVs #1737

Closed

JervenBolleman mentioned this issue Jul 9, 2018

VG annotate can't add BED records across the junctions of circular genomes #1775

Closed

adamnovak pushed a commit that referenced this issue Oct 1, 2018

Merge pull request #1 from vgteam/master

8f06a18

Updating Master

glennhickey mentioned this issue Jan 8, 2019

VG crashed when performing construct #2048

Open

ekg mentioned this issue Apr 5, 2019

vg view terminates #2196

Closed

glennhickey mentioned this issue May 23, 2019

vg construct fails on pbsv INS and DEL #2278

Closed

yangxiaofeill mentioned this issue Jul 8, 2019

surject is extremely slow #2328

Open

This was referenced Sep 12, 2019

xg indexing doesn't work on whole-genome graph #2447

Closed

vg view #2457

Open

ekg mentioned this issue Sep 28, 2019

VG construct crashes #2490

Open

sc13-bioinf mentioned this issue Oct 1, 2019

Crash using vg ids --join #2492

Open

glennhickey mentioned this issue Oct 10, 2019

vg augment crashed #2502

Open

ekg mentioned this issue Nov 20, 2019

vg map crashes with signal 11 #2541

Open

ekg mentioned this issue Jan 2, 2020

vg find command to get subgraphs of genomic regions failed #2585

Open

ekg mentioned this issue Jan 10, 2020

VG crashing: invalid, or corrupt input at message #2597

Closed

glennhickey mentioned this issue Jan 21, 2020

Question on resulting VG variants #2594

Open

ekg mentioned this issue Feb 19, 2020

Help with snarls computation #2629

Open

adamnovak mentioned this issue Mar 25, 2020

SDSL assertion error when working with graph from vg mod -M 8 #2681

Closed

gwylym mentioned this issue Sep 29, 2020

vg crashes on centos 8 #3012

Open

Robin-Rounthwaite added a commit that referenced this issue Mar 26, 2021

Merge pull request #1 from vgteam/master

f8b2560

windows-friendly-filenames

jeizenga pushed a commit that referenced this issue Sep 23, 2021

Merge pull request #1 from mdkeehan/mdkeehan-patch-1

dc4e608

A spelling correction in autoindex_main.cpp

ZoeYang2020 mentioned this issue Jan 18, 2022

Got an error called: Signal 6 occurred. VG has crashed. #3516

Closed

jinshangkun mentioned this issue Apr 12, 2022

vg pack error #3634

Open

8banzhuan mentioned this issue Jul 17, 2022

Assertion `reference_for.count(fasta_contig)' failed. #3702

Open

xiaoguizz mentioned this issue Feb 7, 2023

vg #3845

Open

marinak-ebi mentioned this issue Mar 29, 2023

vg giraffe: unable to retrieve stacktrace.txt file because no access to /tmp #3882

Closed

jingydz mentioned this issue May 30, 2023

Error running VG #3365

Open

kaurharpreet-umn mentioned this issue Jun 14, 2023

Does VG giraffe works with genotype-by-sequencing (GBS) short reads data? #3991

Open

xuxingyubio mentioned this issue Jul 11, 2023

the problem in vg surject #4018

Open

Sourirewang mentioned this issue Sep 7, 2023

.gamp file can not convert to .gam file #4075

Closed

bcantarel mentioned this issue Sep 10, 2023

vg index crashes #4080

Closed

linsindian mentioned this issue Sep 25, 2023

Error for " vg convert -x" #4099

Closed

Tonitsk8264 mentioned this issue Sep 26, 2023

questions about vg autoindex #4098

Closed

yunhanajing mentioned this issue Oct 27, 2023

Error running giraffe step #4135

Closed

zhangming-m mentioned this issue Nov 8, 2023

use vg rna to construct spliced pangenome and crashed #4150

Open

Mirkocoggi mentioned this issue Jan 26, 2024

Problem with vg autoindex with phased VCF #4219

Closed

santhanakrishnanb mentioned this issue Apr 30, 2024

vg map errors #4280

Closed

SwenDiepstraten mentioned this issue May 3, 2024

VCF file empty when calling SV on ONT data #4279

Open

AlphaJulietAlpha mentioned this issue May 16, 2024

Mapping paired reads w/ giraffe, no EOF marker, job stalls, exit code 79 #4294

Open

NaturalKiller-code mentioned this issue Jul 10, 2024

Fastq analysis in euka #4341

Closed
