VG graph with duplicate nodes at the same location #1801

zafarsustbd · 2018-07-23T14:36:26Z

Hi, I am augmenting a vg graph with multiple samples. In each step, I index the graph, map my sample (paired-end FASTQ files), filter the alignments, and finally augment the graph from the previous iteration with this mapping. For augmenting, I am using the following command:

vg mod -i graph.aln.gam graph.vg | vg mod -z - | vg mod -c - > graph.aug.vg

After several iterations, I got a graph that looks like this:

[Image: visualization of the augmented graph]

Why do I get multiple nodes with the same character at the same location, and why would we keep Ns in the graph? Some nodes also look like they are dangling (e.g., node 1878). Could you please explain this graph?


adamnovak · 2018-08-02T17:38:44Z

The Ns make it into the graph because the Ns are in the reads and count as mismatches. When you add the reads to the graph with vg mod -i, there's no special check to detect Ns and treat them as matches, so new nodes get created for the N characters just like they would be for any other mismatching characters.

Ns in reads never match against anything, even other Ns, so on subsequent iterations you can add additional Ns at the same position (like on the right side in your image).
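As a rough illustration of why the Ns pile up, here is a toy Python model of a single graph position. It is only a sketch of the matching rule described above, not vg's actual code; the names bases_match and augment_column are made up for this example.

```python
def bases_match(read_base, graph_base):
    """Toy rule: an N in the read never matches anything, not even another N."""
    return read_base != "N" and read_base == graph_base


def augment_column(column, read_base):
    """Return the alternatives at one position after adding one read base.

    `column` is the list of single-base nodes already present at the position.
    A new node is appended unless the read base matches an existing node.
    """
    if any(bases_match(read_base, base) for base in column):
        return column
    return column + [read_base]


# One reference base, then four reads (one per iteration) covering the position.
column = ["A"]
for read_base in ["N", "N", "A", "N"]:
    column = augment_column(column, read_base)

print(column)  # ['A', 'N', 'N', 'N']
```

The read carrying A reuses the existing node, while every read carrying N adds another N node, which is the pattern on the right side of the image.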

For other characters, you can get duplicates because vg mod -i doesn't bother to "tuck in" mismatches at the last base of a read. So you can get node 1878, a dangling C, created by a read that ends at that position with a C, and then have another read come in on the next iteration with an internal C, which needs an edge coming in and an edge going out. The aligner can't align it to the existing dangling C, because the aligner can't postulate new edges, so it gets aligned as a mismatch against something else and results in the creation of a new C node.
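The same mechanism can be sketched with a tiny edge set in Python. This is a hypothetical toy graph (node IDs 1 to 5 and the helper can_align_through are invented for illustration), assuming only what is described above: a read can reuse a node only when the edges it needs already exist.

```python
def can_align_through(edges, prev_node, node, next_node):
    """A read can pass prev -> node -> next only if both edges already exist."""
    return (prev_node, node) in edges and (node, next_node) in edges


# Toy reference path: 1 (G) -> 2 (T) -> 3 (A)
labels = {1: "G", 2: "T", 3: "A"}
edges = {(1, 2), (2, 3)}

# Iteration 1: a read ends with a C where the graph has T.
# The new C node gets an incoming edge but no outgoing edge (a dangling end).
labels[4] = "C"
edges.add((1, 4))

# Iteration 2: another read has an internal C between the G and the A.
# Node 4 has no edge to node 3, so the read cannot be aligned through it,
# and the mismatch produces a second C node with both edges.
if not can_align_through(edges, 1, 4, 3):
    labels[5] = "C"
    edges |= {(1, 5), (5, 3)}

duplicate_cs = sorted(n for n, base in labels.items() if base == "C")
print(duplicate_cs)  # [4, 5]
```

Node 4 plays the role of the dangling node 1878 here, and node 5 is the duplicate C created on the next iteration.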

There might be an underlying bug here in that vg mod -i, which uses VG::edit(), doesn't let you tuck in dangling ends, while VG::edit_fast(), which we use elsewhere, provides for that. And the duplication of Ns is undesirable for this use case.

But on the other hand, vg mod -i is meant to be a low-level tool that does exactly one thing (call VG::edit()); it's not meant to be used to augment the graph with plausible variant candidates from a set of reads. For that use case, we have vg augment, which won't leave dangling ends and which will treat N characters as missing data. It also can filter out non-recurrent edits so your graph doesn't get overrun with all sequencing errors that ever occurred, and it won't atomize the graph into single-base nodes because it doesn't cut the graph every time a read begins or ends.
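To make the idea of filtering non-recurrent edits concrete, here is a minimal Python sketch. It is not how vg augment is implemented; the (position, base) edit representation, the function recurrent_edits, and the min_support threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def recurrent_edits(edits, min_support=2):
    """Keep edits seen in at least `min_support` reads, ignoring N bases.

    `edits` is a list of (position, base) pairs, one per read that carries
    the edit. N is treated as missing data and never becomes an edit.
    """
    counts = Counter(edit for edit in edits if edit[1] != "N")
    return {edit for edit, n in counts.items() if n >= min_support}


edits = [
    (10, "C"), (10, "C"), (10, "C"),  # seen in three reads: kept
    (42, "G"),                        # seen once: likely a sequencing error
    (57, "T"), (57, "T"),             # seen twice: kept
    (12, "N"), (12, "N"),             # Ns are dropped regardless of count
]

kept = recurrent_edits(edits)
print(sorted(kept))  # [(10, 'C'), (57, 'T')]
```

A threshold like this keeps one-off sequencing errors out of the graph, at the cost of also dropping true variants that appear in only a single read.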

Can you try doing your pipeline with vg augment instead?

zafarsustbd · 2018-08-02T17:53:40Z

If I use vg augment, it doesn't keep the path of each read. Since I am augmenting with multiple samples iteratively, I need the path information for all the reads for my downstream analysis. So using vg augment doesn't help in this case.
