Can VG simulate third-generation long reads? #4284

Open
tanger-code opened this issue May 6, 2024 · 5 comments

Comments

@tanger-code

Hi.

I have a .gbz graph file, and I want to simulate third-generation long reads from a pangenome graph. Can VG simulate third-generation long reads? If not, is there another method to do this?

Any advice would be very helpful to me. Thanks.

@jeizenga
Contributor

jeizenga commented May 6, 2024

Although vg sim can run with long read input, it's really designed for short reads. If you use it to generate long reads, you won't get very realistic errors or a realistic read length distribution. In our own testing and development, we've used pbsim to simulate long reads. You would probably want to generate the reads from FASTAs of sample haplotypes, rather than directly from the GBZ file.
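
The route suggested above (haplotype FASTAs first, then pbsim) could be scripted roughly as follows. This is a minimal sketch, not the maintainers' actual pipeline: the function names, file names, sample name, and depth are placeholders, and the QSHMM model file is assumed to be one of those shipped with pbsim3. It also assumes your vg version can read haplotypes from a GBZ with vg paths.

```shell
# Sketch only: function names and all arguments are placeholders.

# Step 1: write one sample's haplotype paths from the graph to a FASTA file.
extract_haplotypes() {
    # $1 = GBZ graph, $2 = sample name, $3 = output FASTA
    vg paths -x "$1" -F -S "$2" > "$3"
}

# Step 2: simulate long reads from that FASTA with pbsim3.
simulate_long_reads() {
    # $1 = haplotype FASTA, $2 = QSHMM model file (assumed to come from
    # the pbsim3 distribution), $3 = depth, $4 = output file prefix
    pbsim --strategy wgs --method qshmm --qshmm "$2" \
          --depth "$3" --genome "$1" --prefix "$4"
}
```

For example, `extract_haplotypes graph.gbz SAMPLE SAMPLE.fa` followed by `simulate_long_reads SAMPLE.fa QSHMM-RSII.model 20 SAMPLE`. Wrapping the commands in functions just keeps anything from running until you call them with your own files.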

@tanger-code
Author

Thank you!
Can I also use vg sim with the .gbz file to generate short reads, for example with vg sim -x graph.xg **-g graph.gbz** -m SAMPLE -n 1000 -l 150 -a > SAMPLE.gam ?
I have a .gbz file containing the pangenome graph for all chromosomes, and I want to generate short reads only for chr21. Do I need to extract chr21 from the .gbz file first? I can't find a command for that.

@tanger-code
Author

> Although vg sim can run with long read input, it's really designed for short reads. If you use it to generate long reads, you won't get very realistic errors or a realistic read length distribution. In our own testing and development, we've used pbsim to simulate long reads. You would probably want to generate the reads from FASTAs of sample haplotypes, rather than directly from the GBZ file.

I'm simulating long reads with pbsim3, and the output is a .maf file. If I want to run simulation experiments, such as calling SVs from the simulated reads, can I use the .maf file as the truth set, or should I use a public truth set?

Do you have any suggestions?

@jeizenga
Contributor

Looking through our script, it seems that we used the maf2sam subcommand of bioconvert.
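
A conversion step along those lines might look like the sketch below. It assumes pbsim wrote one uncompressed MAF per simulated sequence, named PREFIX_0001.maf, PREFIX_0002.maf, and so on; the function name is hypothetical, and if your pbsim version compresses its output you would need to decompress it first. Each MAF records where its reads were simulated from on the input sequence, so the resulting SAM holds the true read placements.

```shell
# Sketch only: convert every MAF that pbsim wrote under a given prefix.
convert_pbsim_mafs() {
    # $1 = the output prefix that was passed to pbsim
    for maf in "$1"_*.maf; do
        # Skip the unexpanded pattern when no MAF files match.
        [ -e "$maf" ] || continue
        # bioconvert takes the input file, then the output file.
        bioconvert maf2sam "$maf" "${maf%.maf}.sam"
    done
}
```

For example, `convert_pbsim_mafs SAMPLE` would write SAMPLE_0001.sam next to SAMPLE_0001.maf.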

@adamnovak
Member

@tanger-code If you want to simulate from just one named path in the graph, you can use the -P option to vg sim.

But that simulates from just that path; it won't include variants in the graph that leave the embedded path.

I don't think we have a way to simulate from the connected component of the graph that contains a path, other than using vg chunk --components -p name-of-path to pull out that subgraph and then simulating from it.
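
The two options described above could be sketched as shell functions like these. The function names and arguments are placeholders. The vg chunk call uses only the flags mentioned in this thread plus -x for the input graph, and it assumes that a single extracted component is written to standard output; depending on your vg version it may instead write chunk files, so check vg chunk --help before relying on the redirect.

```shell
# Sketch only: function names and all arguments are placeholders.

# Option 1: simulate reads from a single named path.
# Variants that leave the path are not included.
simulate_from_path() {
    # $1 = graph, $2 = path name, $3 = number of reads,
    # $4 = read length, $5 = output GAM
    vg sim -x "$1" -P "$2" -n "$3" -l "$4" -a > "$5"
}

# Option 2, first step: pull out the connected component that contains
# the path, so that reads can then be simulated from that subgraph.
extract_component() {
    # $1 = graph, $2 = path name, $3 = output subgraph
    # Assumption: the component is written to standard output.
    vg chunk -x "$1" --components -p "$2" > "$3"
}
```

For example, `simulate_from_path graph.xg name-of-path 1000 150 chr21.gam`. The path name has to match a path that actually exists in the graph, so it is worth listing the path names before choosing one.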
