Feature request: Accept jellyfish kmer counts for vg haplotypes #4215

Open
JosephLalli opened this issue Jan 25, 2024 · 2 comments

@JosephLalli

JosephLalli commented Jan 25, 2024

For more background see eblerjana/pangenie#62.

Long story short, I'm trying to replicate and make use of the personalized pangenome pipeline described in your recent paper (https://www.biorxiv.org/content/10.1101/2023.12.13.571553v2.full).

When using Pangenie to genotype the graph created by vg haplotypes from a human 30X Illumina fastq dataset, a representative run in my hands spends 1484s of a total runtime of 1910s (about 78%) counting kmers in the fastq reads. Pangenie can accept pre-counted kmer files, but only if they are in Jellyfish2's format. Internally, Pangenie uses the Jellyfish API for kmer management.

KFF files seem difficult for Pangenie to use, since the format does not appear to allow random access. So maybe we could count kmers with Jellyfish and provide those counts to vg haplotypes? That would avoid having two different tools count the same kmers.

Best,
Joe

@jltsiren
Contributor

We chose KFF because we wanted to avoid adding yet another major dependency. VG already has too many of them, making the build system fragile.

As for random access, we also need it in vg haplotypes. We simply load the kmer counts into a hash map. On my laptop, that takes ~100 seconds for the counts from 30x reads: 25 seconds for prepopulating the hash map with the kmers we are interested in and 75 seconds for multithreaded reading.
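The two phases described above (prepopulate a hash map with the kmers of interest, then fill in counts with multithreaded reading) can be sketched roughly as follows. This is an illustrative Python sketch only, not the vg implementation: the function names `prepopulate` and `fill_counts`, the in-memory `chunks` standing in for blocks of a kmer count file, and the assumption that each kmer appears in at most one chunk are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def prepopulate(kmers_of_interest):
    """Phase 1: create an entry (count 0) for every kmer we care about."""
    return {kmer: 0 for kmer in kmers_of_interest}


def fill_counts(table, chunks, threads=4):
    """Phase 2: scan chunks of (kmer, count) pairs in parallel.

    Only kmers already present in the table are updated, so the set of
    keys never changes while the worker threads are running.
    Assumes each kmer appears in at most one chunk.
    """
    def process(chunk):
        for kmer, count in chunk:
            if kmer in table:
                table[kmer] = count

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Consume the iterator so any worker exception is raised here.
        list(pool.map(process, chunks))
    return table


# Hypothetical usage: two chunks standing in for blocks of a count file.
table = prepopulate(["ACG", "CGT", "GTA"])
chunks = [[("ACG", 5), ("TTT", 9)], [("GTA", 2)]]
fill_counts(table, chunks, threads=2)
```

Because the keys are fixed before the readers start, the workers only overwrite existing values, and kmers that are not of interest (such as `TTT` here) are skipped without growing the table.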

@JosephLalli
Author

Understood. I agree about the dependencies!

I'll copy your comment on the similar issue I created at Pangenie (eblerjana/pangenie#62). Maybe you and Jana can help each other get behind one kmer ecosystem for pangenome analysis.

Best,
Joe
