When using Pangenie to genotype the graph created by vg haplotypes from a human 30X Illumina fastq dataset, a representative run on my machine spends 1484s of its 1910s total runtime counting kmers in the fastq reads. Pangenie can accept pre-counted kmer files, but only in Jellyfish2's format. Internally, Pangenie uses the Jellyfish API for kmer management.
KFF files seem difficult for Pangenie to use, since the format does not appear to allow random access. So, maybe we could use Jellyfish to count kmers, and provide those counts to vg haplotypes? That would avoid having two different tools count the same kmers twice.
Best,
Joe
We chose KFF because we wanted to avoid adding yet another major dependency. VG already has too many of them, making the build system fragile.
As for random access, we also need it in vg haplotypes. We simply load the kmer counts into a hash map. On my laptop, that takes ~100 seconds for the counts from 30x reads: 25 seconds for prepopulating the hash map with the kmers we are interested in and 75 seconds for multithreaded reading.
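To illustrate the approach described above, here is a minimal Python sketch of the two phases: prepopulating a hash map with the kmers of interest, then filling in counts while streaming over the counted kmers. It is not vg's actual code. The function names, the use of canonical kmers, and the `(kmer, count)` record shape are assumptions made for illustration, and the sketch is single-threaded, whereas the reading step described above is multithreaded.

```python
from typing import Dict, Iterable, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a kmer and its reverse complement.

    Using canonical kmers is an assumption of this sketch.
    """
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def prepopulate(kmers_of_interest: Iterable[str]) -> Dict[str, int]:
    """Phase 1: create one zero-count entry per kmer we care about."""
    return {canonical(kmer): 0 for kmer in kmers_of_interest}


def load_counts(table: Dict[str, int],
                records: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Phase 2: stream over (kmer, count) records and keep only known kmers.

    Records for kmers that were not prepopulated are skipped, so the table
    never grows beyond the set of kmers of interest.
    """
    for kmer, count in records:
        key = canonical(kmer)
        if key in table:
            table[key] += count
    return table


if __name__ == "__main__":
    # Hypothetical toy input; a real run would stream records from a counts file.
    table = prepopulate(["ACG", "TTT"])
    load_counts(table, [("CGT", 5), ("AAA", 2), ("GGG", 7)])
    print(table)
```

Because the table is filled in a single sequential pass, the input file only needs to support streaming. Random access is provided by the in-memory hash map afterwards.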
I'll copy your comment on the similar issue I created at Pangenie (eblerjana/pangenie#62). Maybe you and Jana can help each other get behind one kmer ecosystem for pangenome analysis.
For more background see eblerjana/pangenie#62.
Long story short, I'm trying to replicate and make use of the personalized pangenome pipeline described in your recent paper (https://www.biorxiv.org/content/10.1101/2023.12.13.571553v2.full).