Detecting closest reference in custom DB #32

jodyphelan · 2023-06-12T16:31:35Z

I've got some dengue NGS data and I'm wondering if I can use kmcp to find the best matching reference. I've built a database from ~4800 reference sequnces using the following:

kmcp compute -I ~/.dengue-ngs/ref/ -k 31 -D 10 -B partial -O temp
kmcp index -j 32 -I temp/ -O db -n 1 -f 0.3

Doing an assembly and then using kmcp search contig.fa -d kmcpdb/ works quite nicely and identifies the same reference sequence I get when I use blast.

#query	qLen	qKmers	FPR	hits	target	chunkIdx	chunks	tLen	kSize	mKmers	qCov	tCov	jacc	queryIdx
k141_133	10523	1975	0.0000e+00	231	MH823208.1	0	1	10723	31	1503	0.7610	0.7444	0.6034	0

However sometimes it is difficult to do a genome assembly and in this case I would like to search directly using the reads. Is this possible? I've tried with kmcp search -d kmcpdb reads.fq.gz --query-whole-file -o result.tsv.gz, however this does not seem to return any hits. Any guidance would be appreciated.

The text was updated successfully, but these errors were encountered:

shenwei356 · 2023-06-12T16:56:52Z

Yes, it does support search reads to detect the most similar strain against many reference genomes. Please follow the recently added tutorial: https://bioinf.shenwei.me/kmcp/tutorial/detecting-pathogens/ .

You may need to remove duplicated sequences first and choose a limited number of representative genomes for each strain. Because, for thousands of reference genomes of small genomes like HBV and dengue viruses, KMCP might fail to detect the target species.

Some tips:

Use a larger k value, e.g., 31 or 51, cause you have so many reference genomes and they are highly similar.
In the profiling stage, use -n 1 if no targets are detected.

Anyway, just have a try, and let me know if you have any issues.

jodyphelan · 2023-06-12T16:59:39Z

Thanks, I'll give it a try. Great tool btw!

jodyphelan · 2023-06-12T17:22:47Z

Amazing that works like a charm!

shenwei356 · 2023-06-13T00:08:25Z

Glad it helps!

I just remembered I forgot to paste the latest unreleased version, which fixes a bug in chunk computation when splitting circular genomes (--circular).

https://github.com/shenwei356/kmcp/files/11702458/kmcp_linux_amd64.tar.gz

jodyphelan closed this as completed Jun 12, 2023

ctb mentioned this issue Jun 13, 2023

tutorial TODO: finding the best reference sequences for NGS data sourmash-bio/sourmash#2648

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Detecting closest reference in custom DB #32

Detecting closest reference in custom DB #32

jodyphelan commented Jun 12, 2023

shenwei356 commented Jun 12, 2023

jodyphelan commented Jun 12, 2023

jodyphelan commented Jun 12, 2023

shenwei356 commented Jun 13, 2023

Detecting closest reference in custom DB #32

Detecting closest reference in custom DB #32

Comments

jodyphelan commented Jun 12, 2023

shenwei356 commented Jun 12, 2023

jodyphelan commented Jun 12, 2023

jodyphelan commented Jun 12, 2023

shenwei356 commented Jun 13, 2023