vg sim error: [insert_gbwt_path()] path name already exists: #4209

Closed
cwatt opened this issue Jan 12, 2024 · 3 comments

Comments

cwatt commented Jan 12, 2024

Hello, I made a post asking about this on biostars when I thought the error was more innocuous than it actually is.

1. What were you trying to do?

Produce simulated reads from a sample within a graph.

2. What did you want to happen?

Produce a .gam file with simulated reads only from the sample specified.

3. What actually happened?

I received these errors, which differed slightly if the sample specified was the reference sample of the graph:

Inserting 1 GBWT threads into the graph
error: [insert_gbwt_path()] path name already exists: sample#0#ZEAMA_sample_0_v1_0_0_chr1
[... repeated for each chromosome]
Inserted 0 paths

or, when the sample specified is not the reference sample (the error repeats many additional times per chromosome):

error: [insert_gbwt_path()] path name already exists: sample#0#ZEAMA_sample-XKM-0_v1_0_0_chr1#186287117

Despite the errors, a .gam file of the simulated reads is produced. However, I noticed that reads simulated from the same graph and seed but from different samples produced identically sized .gam files, and that the reads produced are identical but in a different order. It seems vg sim isn't constraining its simulation to just one sample in the graph due to this error.
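A check like the one described above (same reads, different order) can be sketched in plain shell. This is a toy illustration only: the two files below are invented stand-ins holding one read sequence per line, and `same_reads` is a hypothetical helper name. For real .gam files the sequences would first need to be extracted, which is not shown in this thread.

```shell
#!/bin/sh
# Toy stand-ins for two simulated read sets, one read sequence per line.
# For real GAM files the sequences would first have to be extracted, for
# example with something like `vg view -a reads.gam | jq -r .sequence`
# (an assumption here, not a command taken from this thread).
workdir=$(mktemp -d)
printf 'ACGT\nGGCA\nTTAG\n' > "$workdir/sample_a.txt"
printf 'TTAG\nACGT\nGGCA\n' > "$workdir/sample_b.txt"

# Succeeds when both files hold the same lines, regardless of order.
same_reads() {
    sort "$1" > "$workdir/a.sorted"
    sort "$2" > "$workdir/b.sorted"
    cmp -s "$workdir/a.sorted" "$workdir/b.sorted"
}

if same_reads "$workdir/sample_a.txt" "$workdir/sample_b.txt"; then
    echo "identical read sets (order ignored)"
else
    echo "read sets differ"
fi
```

If two samples that should differ report identical read sets, the sample restriction is probably not being applied.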

I created my graphs using minigraph-cactus to produce a .gbz, then created a .gbwt from the .gbz using vg gbwt. The error occurs no matter what sample I specify or if I use a .xg file produced from the .gbz file instead.

5. What data and command can the vg dev team use to make the problem happen?

EDIT: I was able to recreate this error using minigraph-cactus' yeast pangenome dataset from their tutorial. The required data is here.

I used the following sequence of commands to reproduce the error:

cactus-pangenome ./js ./examples/yeastPangenome.txt --reference S288C --outDir yeast-pg --outName yeast-pg --vcf --giraffe
vg gbwt -Z yeast-pg.d2.gbz -o yeast-pg.d2.gbwt
vg convert -x yeast-pg.d2.gbz > yeast-pg.d2.xg
vg sim -x yeast-pg.d2.gbz -g yeast-pg.d2.gbwt -m SK1 -n 10 -l 150 -p 335 -v 130 -t 30 -s 1 -t 30 -r > yeast-pg-simreads.gam
vg sim -x yeast-pg.d2.xg -g yeast-pg.d2.gbwt -m SK1 -n 10 -l 150 -p 335 -v 130 -t 30 -s 1 -t 30 -r > yeast-pg-simreads-xg.gam

Both vg sim commands result in the same error.

6. What does running vg version say?

vg version v1.52.0 "Bozen"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by jeizenga@emerald

I'm relatively new to using graphs and vg, so I'm hoping this is a simple mistake on my part at some point in the pipeline.

Thank you for your help!

Author

cwatt commented Jan 18, 2024

I think I've (kind of) figured out what's going on here, or at least fixed my specific problem.

When I created the .xg graph, I did not include the -H option in vg convert, which drops haplotypes. The .gbz graph includes haplotype information as well, which I assume is why using either graph produced the same errors. Converting with -H results in simulated read files with unique read names and file sizes, which I take as a good sign.

I think including the haplotype information in the graph doesn't play well with the .gbwt index while simulating reads from a specific sample in some way, though I'm still not sure how this resulted in the same simulated read file being produced from commands specifying different samples.

When I run vg sim with the haplotype-dropped .xg graph and specify a non-reference sample, paths are inserted without errors. When I specify the reference sample, I receive the aforementioned errors, but I think this probably makes sense in this case because the sample is the reference used for the graph?

I only have a fuzzy idea of how the algorithm is working and the structure of these files, so if anyone has a clearer explanation, please let me know!
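The workaround described in this comment can be sketched as a dry run. The input file names, sample name, and vg sim options are copied from the commands in the original post (minus the duplicated `-t 30`); the `.nohap.xg` and output file names are hypothetical, as is the `print_plan` helper. The script only prints the commands so they can be reviewed before running.

```shell
#!/bin/sh
# Sketch only: input names come from the commands earlier in this thread,
# output names are placeholders.
GBZ=yeast-pg.d2.gbz
GBWT=yeast-pg.d2.gbwt
XG=yeast-pg.d2.nohap.xg          # hypothetical name for the haplotype-free graph
SAMPLE=SK1
OUT=yeast-pg-simreads-nohap.gam  # hypothetical output name

# Print the commands instead of running them.
# Pipe the output to `sh` to execute them once they look right.
print_plan() {
    # -H drops haplotype paths during conversion, so they are not already
    # embedded in the graph when vg sim inserts them from the GBWT.
    echo "vg convert -x -H $GBZ > $XG"
    # Same simulation options as in the original post.
    echo "vg sim -x $XG -g $GBWT -m $SAMPLE -n 10 -l 150 -p 335 -v 130 -t 30 -s 1 -r > $OUT"
}

print_plan
```

Printing first keeps the sketch safe to run on a machine without vg installed.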

@jeizenga
Contributor

Sorry to be so delayed responding to this issue. It's definitely understandable that aspects of the vg sim interface would be confusing. The truth is that the current organization has a lot to do with the development history of VG, and since vg sim doesn't have a wide user base outside of the VG developers, we haven't put in the work to rationalize its interface.

This particular confusion is because vg sim is only implemented to work with embedded paths that are stored along with the graph. These tend to be reference paths, and they're often expressed as P lines in a GFA file. They contrast with haplotype paths (often W lines in the GFA), which tend to be much more numerous, and as such they require more specialized data structures to be efficient--in this case, the GBWT or GBZ.
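To make the P line versus W line distinction concrete, here is a toy GFA 1.1 file with one embedded path and two haplotype walks, plus a count of each record type. The segment sequences, the `sampleB` name, and the `count_records` helper are invented for illustration; only the record types themselves come from the explanation above.

```shell
#!/bin/sh
# Toy GFA 1.1 file; the segments, path, and walks are invented.
gfa=$(mktemp)
{
    printf 'H\tVN:Z:1.1\n'
    printf 'S\t1\tACGT\n'
    printf 'S\t2\tTTGA\n'
    printf 'L\t1\t+\t2\t+\t0M\n'
    # P line: an embedded path (typically a reference path).
    printf 'P\tS288C#0#chrI\t1+,2+\t*\n'
    # W lines: haplotype walks (sample, haplotype index, contig, start, end, walk).
    printf 'W\tSK1\t0\tchrI\t0\t8\t>1>2\n'
    printf 'W\tsampleB\t0\tchrI\t0\t8\t>1>2\n'
} > "$gfa"

# Count the records whose first tab-separated field equals the given type.
count_records() {
    awk -F '\t' -v t="$1" '$1 == t { n++ } END { print n + 0 }' "$2"
}

echo "embedded (P) paths: $(count_records P "$gfa")"
echo "haplotype (W) walks: $(count_records W "$gfa")"
```

In a real pangenome the W lines would usually far outnumber the P lines, which is the reason for the specialized haplotype index.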

The hack that we use to also simulate from haplotype paths is to add the haplotype paths as embedded paths and then simulate from them with the standard algorithm (with the graph in XG format). Because the names of the embedded paths are expected to be unique, you have to ensure that you only add them to the XG graph once: either when converting from the GBZ or when starting vg sim. It looks like you eventually found your way to a pipeline that did this.
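The uniqueness requirement can be illustrated with a small name-collision check. This is not how vg performs the check internally; it is a toy sketch with hypothetical path names, where one list stands for paths already embedded in the graph and the other for haplotype paths about to be inserted. The `colliding_names` helper is likewise made up for this example.

```shell
#!/bin/sh
# Hypothetical path names. The first list stands for paths already embedded
# in the graph, the second for haplotype paths that would be inserted from
# the GBWT. Real lists might come from `vg paths -L` (an assumption; check
# `vg paths --help` for the options your version supports).
workdir=$(mktemp -d)
printf '%s\n' 'S288C#0#chrI' 'SK1#0#chrI' > "$workdir/embedded.txt"
printf '%s\n' 'SK1#0#chrI' 'SK1#0#chrII' > "$workdir/to_insert.txt"

# Print names present in both lists; inserting these would collide.
colliding_names() {
    sort "$1" > "$workdir/left.sorted"
    sort "$2" > "$workdir/right.sorted"
    comm -12 "$workdir/left.sorted" "$workdir/right.sorted"
}

colliding_names "$workdir/embedded.txt" "$workdir/to_insert.txt"
```

Any name printed here corresponds to a path that would trigger a "path name already exists" style failure if it were added a second time.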

One thing that I notice from looking at your commands is that you are simulating from the .d2 graph. This is probably not what you want. The .d2 graph removes a lot of sequence if it isn't sufficiently common in the population. That can make the graph more effective as a mapping target, but it doesn't reflect the generating process for sequencing data. You probably want to simulate reads from all parts of the haplotype, not just the common ones. For that, you should use the full GBZ produced by Minigraph-Cactus.

Author

cwatt commented Jan 26, 2024

Thank you for the incredibly helpful response! This makes sense to me now, and gives me some extra confidence moving forward with my analysis. I'll keep your .d2 graph warning in mind as well!

cwatt closed this as completed Jan 26, 2024