vg sim error: [insert_gbwt_path()] path name already exists: #4209

cwatt · 2024-01-12T18:00:12Z

Hello, I made a post asking about this on biostars when I thought the error was more innocuous than it actually is.

1. What were you trying to do?

Produce simulated reads from a sample within a graph.

2. What did you want to happen?

Produce a .gam file with simulated reads only from the sample specified.

3. What actually happened?

I received these errors, which differed slightly if the sample specified was the reference sample of the graph:

Inserting 1 GBWT threads into the graph
error: [insert_gbwt_path()] path name already exists: sample#0#ZEAMA_sample_0_v1_0_0_chr1
[... repeated for each chromosome]
Inserted 0 paths

or, if the sample specified was not the reference sample (the error repeats many additional times per chromosome):

error: [insert_gbwt_path()] path name already exists: sample#0#ZEAMA_sample-XKM-0_v1_0_0_chr1#186287117

Despite the errors, a .gam file of the simulated reads is produced. However, I noticed that reads simulated from the same graph and seed but from different samples produced identically sized .gam files, and that the reads produced are identical but in a different order. It seems vg sim isn't constraining its simulation to just one sample in the graph due to this error.
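A check like "identical reads, different order" can be sketched in Python as a multiset comparison. This is only a minimal illustration: it assumes the read sequences have already been extracted from each .gam file into plain lists of strings (that extraction step is not shown), and the function name and example reads are hypothetical.

```python
from collections import Counter


def same_reads_ignoring_order(reads_a, reads_b):
    """Return True if both collections hold the same reads with the same counts."""
    # Counter compares the reads as multisets, so order is ignored
    # but duplicated reads still have to match in number.
    return Counter(reads_a) == Counter(reads_b)


# Hypothetical read sequences standing in for two simulation outputs.
reads_sample_1 = ["ACGT", "TTGA", "GGCC", "ACGT"]
reads_sample_2 = ["GGCC", "ACGT", "ACGT", "TTGA"]
reads_sample_3 = ["GGCC", "ACGT", "TTGA", "TTGA"]

identical_up_to_order = same_reads_ignoring_order(reads_sample_1, reads_sample_2)
different_content = same_reads_ignoring_order(reads_sample_1, reads_sample_3)
print(identical_up_to_order, different_content)
```

If two runs that name different samples compare as equal this way, the sample option is probably not restricting the simulation.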

I created my graphs using minigraph-cactus to produce a .gbz, then created a .gbwt from the .gbz using vg gbwt. The error occurs no matter what sample I specify or if I use a .xg file produced from the .gbz file instead.

5. What data and command can the vg dev team use to make the problem happen?

EDIT: I was able to recreate this error using minigraph-cactus' yeast pangenome dataset from their tutorial. The required data is here.

I used the following sequence of commands to reproduce the error:

cactus-pangenome ./js ./examples/yeastPangenome.txt --reference S288C --outDir yeast-pg --outName yeast-pg --vcf --giraffe
vg gbwt -Z yeast-pg.d2.gbz -o yeast-pg.d2.gbwt
vg convert -x yeast-pg.d2.gbz > yeast-pg.d2.xg
vg sim -x yeast-pg.d2.gbz -g yeast-pg.d2.gbwt -m SK1 -n 10 -l 150 -p 335 -v 130 -t 30 -s 1 -t 30 -r > yeast-pg-simreads.gam
vg sim -x yeast-pg.d2.xg -g yeast-pg.d2.gbwt -m SK1 -n 10 -l 150 -p 335 -v 130 -t 30 -s 1 -t 30 -r > yeast-pg-simreads-xg.gam

Both vg sim commands result in the same error.

6. What does running vg version say?

vg version v1.52.0 "Bozen"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by jeizenga@emerald

I'm relatively new to using graphs and vg, so I'm hoping this is a simple mistake on my part at some point in the pipeline.

Thank you for your help!


cwatt · 2024-01-18T15:18:52Z

I think I've (kind of) figured out what's going on here, or at least fixed my specific problem.

When I created the .xg graph, I did not include the -H option in vg convert, which drops haplotypes, so the haplotype paths were kept in the .xg. .gbz graphs include haplotype information as well, which I assume is why using either graph produced the same errors. Re-creating the .xg with -H and simulating from it results in simulated read files with unique read names and file sizes, which I take as a good sign.

I think including the haplotype information in the graph somehow conflicts with the .gbwt index when simulating reads from a specific sample, though I'm still not sure how this resulted in the same simulated read file being produced from commands specifying different samples.

When I run vg sim with the haplotype-dropped .xg graph and specify a non-reference sample, paths are inserted without errors. When I specify the reference sample, I receive the aforementioned errors, but I think this probably makes sense in this case because the sample is the reference used for the graph?

I only have a fuzzy idea of how the algorithm is working and the structure of these files, so if anyone has a clearer explanation, please let me know!

jeizenga · 2024-01-26T00:40:02Z

Sorry to be so delayed responding to this issue. It's definitely understandable that aspects of the vg sim interface would be confusing. The truth is that the current organization has a lot to do with the development history of VG, and since vg sim doesn't have a wide user base outside of the VG developers, we haven't put in the work to rationalize its interface.

This particular confusion is because vg sim is only implemented to work with embedded paths that are stored along with the graph. These tend to be reference paths, and they're often expressed as P lines in a GFA file. They contrast with haplotype paths (often W lines in the GFA), which tend to be much more numerous, and as such they require more specialized data structures to be efficient--in this case, the GBWT or GBZ.

The hack that we use to also simulate from haplotype paths is to add the haplotype paths as embedded paths and then simulate from them with the standard algorithm (with the graph in XG format). Because the names of the embedded paths are expected to be unique, you have to ensure that you only add them to the XG graph once: either when converting from the GBZ or when starting vg sim. It looks like you eventually found your way to a pipeline that did this.
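The uniqueness constraint described above can be illustrated with a toy Python model. This is not vg's actual code: the ToyGraph class, its method, and the path names are all hypothetical, and the model only captures the idea that a haplotype path cannot be embedded under a name the graph already contains.

```python
class ToyGraph:
    """Toy stand-in for a graph with uniquely named embedded paths (not vg's data structure)."""

    def __init__(self, embedded_path_names):
        self.paths = set(embedded_path_names)

    def insert_haplotype_paths(self, names):
        """Try to embed each haplotype path, skipping names that already exist."""
        inserted = 0
        errors = []
        for name in names:
            if name in self.paths:
                errors.append(f"path name already exists: {name}")
                continue
            self.paths.add(name)
            inserted += 1
        return inserted, errors


# Hypothetical path names for a reference sample and one haplotype sample.
reference = ["S288C#0#chrI", "S288C#0#chrII"]
haplotypes = ["SK1#0#chrI", "SK1#0#chrII"]

# Case 1: haplotypes were kept during conversion, so they are already embedded.
with_haps = ToyGraph(reference + haplotypes)
inserted_dup, errors_dup = with_haps.insert_haplotype_paths(haplotypes)

# Case 2: haplotypes were dropped during conversion, so only reference paths are embedded.
ref_only = ToyGraph(reference)
inserted_ok, errors_ok = ref_only.insert_haplotype_paths(haplotypes)

# Case 3: asking for the reference sample collides with the embedded reference paths.
ref_again = ToyGraph(reference)
inserted_ref, errors_ref = ref_again.insert_haplotype_paths(reference)

print(inserted_dup, len(errors_dup))
print(inserted_ok, len(errors_ok))
print(inserted_ref, len(errors_ref))
```

In this model, case 1 corresponds to the "Inserted 0 paths" output from the original report, case 2 to the haplotype-dropped .xg with a non-reference sample, and case 3 to the errors that remain when the reference sample is specified.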

One thing that I notice from looking at your commands is that you are simulating from the .d2 graph. This is probably not what you want. The .d2 removes a lot of sequence from the graph if it isn't sufficiently common in the population. That can make the graph more effective as a mapping target, but it doesn't reflect the generating process for sequencing data. You probably want to simulate reads from all parts of the haplotype, not just the common ones. For that, you should use the full GBZ produced by Minigraph-Cactus.

cwatt · 2024-01-26T16:18:59Z

Thank you for the incredibly helpful response! This makes sense to me now, and gives me some extra confidence moving forward with my analysis. I'll keep your .d2 graph warning in mind as well!

cwatt closed this as completed Jan 26, 2024
