VG graph with duplicate nodes at the same location #1801

Open
zafarsustbd opened this issue Jul 23, 2018 · 2 comments

@zafarsustbd

Hi, I am augmenting a vg graph with multiple samples. In each step, I index the graph, map my sample (paired-end FASTQ files) to it, filter the alignments, and finally augment the graph from the previous iteration with this mapping. For augmenting, I am using the following command:

vg mod -i graph.aln.gam graph.vg | vg mod -z - | vg mod -c - > graph.aug.vg

After several iterations, I got a graph that looks like this:
[Screenshot of the graph, 2018-07-23]

Why do I get multiple nodes with the same character at the same location, and why would we keep Ns in the graph? Some nodes also appear to be dangling (e.g. node 1878). Could you please explain this graph?

@adamnovak
Member

The Ns make it into the graph because the Ns are in the reads and count as mismatches. When you add the reads to the graph with vg mod -i, there's no special check to detect Ns and treat them as matches, so new nodes get created for the N characters just like they would be for any other mismatching characters.

Ns in reads never match against anything, even other Ns, so on subsequent iterations you can add additional Ns at the same position (like on the right side in your image).
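As a rough illustration of why the Ns pile up, here is a toy Python model of the rule described above. It is not vg's alignment code: `bases_match` and `augment_position` are hypothetical helpers, and a "position" is reduced to a flat list of alternative bases.

```python
# Toy model only: this is not how vg represents graphs or alignments.
def bases_match(read_base: str, graph_base: str) -> bool:
    """Return True if the bases match; an N never matches, even another N."""
    if read_base == "N" or graph_base == "N":
        return False
    return read_base == graph_base


def augment_position(alternatives: list, read_base: str) -> list:
    """Add read_base as a new alternative node unless it matches an existing one."""
    if any(bases_match(read_base, base) for base in alternatives):
        return alternatives
    return alternatives + [read_base]


position = ["A"]  # one reference base at this position
for read_base in ["N", "N", "A", "N"]:  # one read per augmentation iteration
    position = augment_position(position, read_base)

print(position)  # ['A', 'N', 'N', 'N']
```

Each read carrying an N at this position adds another N node, while the read carrying an A reuses the existing node.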

For other characters, you can get duplicates because vg mod -i doesn't bother to "tuck in" mismatches at the last base of a read. So you can get node 1878, a dangling C, created by a read that ends at that position with a C, and then have another read come in on the next iteration with an internal C, which needs an edge coming in and an edge going out. The aligner can't align it to the existing dangling C, because the aligner can't postulate new edges, so it gets aligned as a mismatch against something else and results in the creation of a new C node.
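The dangling-end case can be sketched the same way. This is a toy model under stated assumptions, not vg's data structures: `ToyGraph`, `find_bridge`, and the node IDs are made up for illustration, and the "aligner" is reduced to a lookup that only accepts a node if the edges on both sides already exist.

```python
# Toy model only: class, method names, and node IDs are hypothetical.
class ToyGraph:
    def __init__(self):
        self.seq = {}       # node id -> base
        self.edges = set()  # (from_id, to_id)
        self.next_id = 1

    def add_node(self, base):
        node_id = self.next_id
        self.next_id += 1
        self.seq[node_id] = base
        return node_id

    def find_bridge(self, prev_id, base, next_id):
        """Find a node with this base that already has edges prev -> node -> next."""
        for node_id, node_base in self.seq.items():
            if (node_base == base
                    and (prev_id, node_id) in self.edges
                    and (node_id, next_id) in self.edges):
                return node_id
        return None


graph = ToyGraph()
left, ref, right = graph.add_node("A"), graph.add_node("G"), graph.add_node("T")
graph.edges |= {(left, ref), (ref, right)}

# Iteration 1: a read ends on a mismatching C. Only an incoming edge is created,
# so the new C node dangles.
dangling = graph.add_node("C")
graph.edges.add((left, dangling))

# Iteration 2: a read has an internal C at the same position. The lookup cannot
# invent the missing outgoing edge, so the dangling C is not reused.
duplicate = None
reused = graph.find_bridge(left, "C", right)
if reused is None:
    duplicate = graph.add_node("C")
    graph.edges |= {(left, duplicate), (duplicate, right)}

c_nodes = [node_id for node_id, base in graph.seq.items() if base == "C"]
print(c_nodes)  # [4, 5]
```

The graph ends up with two C nodes at the same position: the dangling one from the first iteration and a fully connected one from the second.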

There might be an underlying bug here in that vg mod -i, which uses VG::edit(), doesn't let you tuck in dangling ends, while VG::edit_fast(), which we use elsewhere, provides for that. And the duplication of Ns is undesirable for this use case.

But on the other hand, vg mod -i is meant to be a low-level tool that does exactly one thing (call VG::edit()); it's not meant to be used to augment the graph with plausible variant candidates from a set of reads. For that use case, we have vg augment, which won't leave dangling ends and which will treat N characters as missing data. It also can filter out non-recurrent edits so your graph doesn't get overrun with all sequencing errors that ever occurred, and it won't atomize the graph into single-base nodes because it doesn't cut the graph every time a read begins or ends.
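To make the idea of filtering out non-recurrent edits concrete, here is a minimal Python sketch. It only illustrates the general technique of keeping edits supported by several reads; it is not vg augment's implementation, and `recurrent_edits`, the `min_support` threshold, and the example data are all assumptions.

```python
from collections import Counter


def recurrent_edits(edits, min_support=2):
    """Keep edits that at least min_support reads agree on (threshold is hypothetical)."""
    counts = Counter(edits)
    return {edit for edit, n in counts.items() if n >= min_support}


# Each edit is (position, reference base, read base); the values are made up.
observed = [
    (101, "G", "C"),
    (101, "G", "C"),
    (101, "G", "C"),
    (250, "A", "T"),  # seen only once, so treated as a likely sequencing error
]

kept = recurrent_edits(observed)
print(kept)  # {(101, 'G', 'C')}
```

Only the edit seen in three reads survives; the singleton is dropped and never becomes a node in the graph.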

Can you try doing your pipeline with vg augment instead?

@zafarsustbd
Author

zafarsustbd commented Aug 2, 2018

If I use vg augment, it doesn't keep the path of each read. Since I am augmenting with multiple samples iteratively, I need the path information for all the reads for my downstream analysis. So using vg augment doesn't help in this case.
