kmcp search is very slow for metatranscriptome data #23

shenwei356 · 2022-11-15T15:30:03Z

The search results showed that a huge number of reads from 16S rRNA genes had thousands of matches each, and writing those results slowed down the search. Such reads should be filtered out before the search using tools like https://github.com/hzi-bifo/RiboDetector.

houjialin · 2022-11-24T05:51:11Z

I'm trying to use 'kmcp search' to classify my metagenome data, but it is unacceptably slow (I killed it after a one-week run), even though I set -j 40 to use multiple CPUs. My command is:

nohup kmcp search -j 40 -d ~/data/Database/KMCP_database/GTDB_rep_genomes_r207/gtdb.r207.minh5.kmcp/ -1 ../../01-Trimming/02_trimmed_reads_1P.fq.gz -2 ../../01-Trimming/02_trimmed_reads_2P.fq.gz -o 150cm_ECS.KMCP.tsv.gz &

@shenwei356 Could you give me some idea about that?

shenwei356 · 2022-11-24T06:51:28Z

Hi, thanks for using KMCP. Please provide more details:

Is it a local machine or a cluster? How many CPUs? How much RAM?
Where is the database stored, on a local disk or a NAS? Try adding -w (https://bioinf.shenwei.me/kmcp/faq/#why-are-the-cpu-usages-are-very-low-not-100)
How was the database built? Is sketching used? Minimizer? What's the size of gtdb.r207.minh5.kmcp? I'd recommend using all k-mers.
Information about the query reads: read length and the number of reads.
Please rerun and check the instant speed.
Please use other tools like screen, instead of nohup, to run the command in the background.
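The details requested above can be collected with standard tools. The following is only a minimal sketch: the function names (count_reads, mean_read_length, report_machine) are hypothetical helpers, not part of kmcp, and the FASTQ helpers assume plain four-line records.

```shell
#!/bin/sh
# Hypothetical helpers for gathering the requested details; not part of kmcp.

# Count reads in a gzipped FASTQ file, assuming four lines per record.
count_reads() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Mean read length, assuming the sequence is line 2 of every four-line record.
mean_read_length() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { n++; s += length($0) }
        END { if (n) printf "%.1f\n", s / n; else print "0.0" }'
}

# Print CPU architecture, CPU count, and the size and filesystem of a database directory.
report_machine() {
    echo "arch: $(uname -m)"
    echo "cpus: $(getconf _NPROCESSORS_ONLN)"
    du -sh "$1"   # total size of the database directory
    df -h "$1"    # filesystem holding it (local disk or network mount)
}

# Example calls (paths taken from the command earlier in this thread):
#   count_reads ../../01-Trimming/02_trimmed_reads_1P.fq.gz
#   mean_read_length ../../01-Trimming/02_trimmed_reads_1P.fq.gz
#   report_machine ~/data/Database/KMCP_database/GTDB_rep_genomes_r207/gtdb.r207.minh5.kmcp/
```

Counting reads in an 11 GB compressed file takes a while because the whole file has to be decompressed once.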

houjialin · 2022-11-24T07:39:47Z

Thanks for your quick reply.

I ran it on our local server with 80 CPUs and 1 TB RAM.
I built the database myself from the latest GTDB release; here is the index.log:

16:29:44.659 [INFO] kmcp v0.9.0
16:29:44.695 [INFO]   https://github.com/shenwei356/kmcp
16:29:44.696 [INFO]
16:29:44.696 [INFO] loading .unik file infos from file: gtdb-r207-k21-n10/_info.txt
16:29:45.409 [INFO]   657030 cached file infos loaded
16:29:45.554 [INFO]
16:29:45.554 [INFO] -------------------- [main parameters] --------------------
16:29:45.554 [INFO]   number of hashes: 1
16:29:45.554 [INFO]   false positive rate: 0.200000
16:29:45.554 [INFO]   k-mer size(s): 21
16:29:45.554 [INFO]   split seqequence size: 0, overlap: 150
16:29:45.554 [INFO]   block-sizeX-kmers-t: 10.00 M
16:29:45.555 [INFO]   block-sizeX        : 256
16:29:45.555 [INFO]   block-size8-kmers-t: 20.00 M
16:29:45.555 [INFO]   block-size1-kmers-t: 200.00 M
16:29:45.555 [INFO] -------------------- [main parameters] --------------------
16:29:45.555 [INFO]
16:29:45.555 [INFO] building index ...
16:29:46.285 [INFO]
16:29:46.285 [INFO]   block size: 16432
16:29:46.285 [INFO]   number of index files: 40 (may be more)
16:29:46.285 [INFO]
17:56:35.564 [INFO]
17:56:35.564 [INFO] kmcp database with 213177546931 k-mers saved to gtdb.r207.minh5.kmcp
17:56:35.564 [INFO] total file size: 120.21 GB
17:56:35.564 [INFO] total index files: 40
17:56:35.564 [INFO]
17:56:35.565 [INFO] elapsed time: 1h26m50.906706327s
17:56:35.565 [INFO]

My input is a pair of trimmed paired-end 150-bp metagenome read files, 22 GB compressed in total (11 GB each).
The instant speed in the log file is very low. I reran it about 2 hours ago, and the last line of the log file is currently:

processed queries: 4608, speed: 0.000 million queries per minute^Mprocessed queries: 4672, speed: 0.000 million queries per minute^M
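The progress updates in that log line are separated by carriage returns (displayed as ^M), so the whole history sits on one physical line. A minimal sketch for pulling out only the most recent update and turning it into a rough rate follows; latest_progress and queries_per_minute are hypothetical helper names, and the log file name in the usage comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical helpers for reading a carriage-return-separated progress log.

# Print the most recent "processed queries" record from the log file given as $1.
latest_progress() {
    tr '\r' '\n' < "$1" | grep 'processed queries' | tail -n 1
}

# Average rate: number of processed queries ($1) divided by elapsed minutes ($2).
queries_per_minute() {
    awk -v n="$1" -v m="$2" 'BEGIN { printf "%.1f\n", n / m }'
}

# Example calls (search.log is a placeholder name):
#   latest_progress search.log
#   queries_per_minute 4672 120
```

Taking the figures quoted above at face value (4672 queries after roughly 120 minutes), the second call gives about 38.9 queries per minute, which is why the log rounds the speed to 0.000 million queries per minute.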

shenwei356 · 2022-11-24T10:26:43Z

That's weird. Please add me on WeChat if you have it: shenwei356

shenwei356 · 2022-11-27T03:32:49Z

Are the CPUs ARM?
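One quick way to answer this is to look at the machine string printed by uname -m. The sketch below maps it to a coarse family; cpu_family is a hypothetical helper, and the list of patterns is an assumption that covers common Linux values rather than every possible one.

```shell
#!/bin/sh
# Hypothetical helper: classify a machine string as printed by `uname -m`.
cpu_family() {
    case "$1" in
        aarch64*|arm*) echo arm ;;        # 64-bit and 32-bit ARM
        x86_64|amd64|i?86) echo x86 ;;    # 64-bit and 32-bit x86
        *) echo other ;;
    esac
}

# Classify the machine this script runs on.
cpu_family "$(uname -m)"
```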

shenwei356 · 2022-11-27T06:54:34Z

Are the CPUs ARM?

If they are, please try the new binaries. I've fixed the search for ARM architectures.

BTW, there's no need to set the false positive rate to 0.2 for kmcp index; 0.3 is OK.

shenwei356 closed this as completed Dec 6, 2022
