Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detecting closest reference in custom DB #32

Closed
jodyphelan opened this issue Jun 12, 2023 · 4 comments
Closed

Detecting closest reference in custom DB #32

jodyphelan opened this issue Jun 12, 2023 · 4 comments

Comments

@jodyphelan
Copy link

I've got some dengue NGS data and I'm wondering if I can use kmcp to find the best matching reference. I've built a database from ~4800 reference sequnces using the following:

kmcp compute -I ~/.dengue-ngs/ref/ -k 31 -D 10 -B partial -O temp
kmcp index -j 32 -I temp/ -O db -n 1 -f 0.3

Doing an assembly and then using kmcp search contig.fa -d kmcpdb/ works quite nicely and identifies the same reference sequence I get when I use blast.

#query	qLen	qKmers	FPR	hits	target	chunkIdx	chunks	tLen	kSize	mKmers	qCov	tCov	jacc	queryIdx
k141_133	10523	1975	0.0000e+00	231	MH823208.1	0	1	10723	31	1503	0.7610	0.7444	0.6034	0

However sometimes it is difficult to do a genome assembly and in this case I would like to search directly using the reads. Is this possible? I've tried with kmcp search -d kmcpdb reads.fq.gz --query-whole-file -o result.tsv.gz, however this does not seem to return any hits. Any guidance would be appreciated.

@shenwei356
Copy link
Owner

Yes, it does support search reads to detect the most similar strain against many reference genomes. Please follow the recently added tutorial: https://bioinf.shenwei.me/kmcp/tutorial/detecting-pathogens/ .

You may need to remove duplicated sequences first and choose a limited number of representative genomes for each strain. Because, for thousands of reference genomes of small genomes like HBV and dengue viruses, KMCP might fail to detect the target species.

Some tips:

  1. Use a larger k value, e.g., 31 or 51, cause you have so many reference genomes and they are highly similar.
  2. In the profiling stage, use -n 1 if no targets are detected.

Anyway, just have a try, and let me know if you have any issues.

@jodyphelan
Copy link
Author

Thanks, I'll give it a try. Great tool btw!

@jodyphelan
Copy link
Author

Amazing that works like a charm!

@shenwei356
Copy link
Owner

Glad it helps!

I just remembered I forgot to paste the latest unreleased version, which fixes a bug in chunk computation when splitting circular genomes (--circular).

https://github.com/shenwei356/kmcp/files/11702458/kmcp_linux_amd64.tar.gz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants