Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Pre-filtering" use case #1697

Closed
sivico26 opened this issue Aug 1, 2021 · 8 comments
Closed

"Pre-filtering" use case #1697

sivico26 opened this issue Aug 1, 2021 · 8 comments

Comments

@sivico26
Copy link

sivico26 commented Aug 1, 2021

Hi there!

I wanted to try Sourmash for a specific use case, but after going through the docs I am still not sure how is the best way to use it.

I have a set of raw reads for which I want to know if they match to a genome or not, and filter them on that basis. Usually, you would do this by mapping those reads to the genome, but I wonder if it could be done faster with sketching tools such as Sourmash.

Specifically, I have a genome skimming dataset and wanted to extract the chloroplast reads. So my database is actually very small, but the number of queries is quite large.

This is what I was trying to do:

## Prepare db
sourmash sketch dna -p k=31,scaled=10 Inputs/refence.fasta -o ref.sig
sourmash index cpdb ref.sig
## Prepare Query 
mkdir query_sigs
sourmash sketch dna --singleton --outdir query_sigs Input/*.fastq
## Do the search
sourmash search -o report.csv --ksize 31 --containment out/sub1.fq.sig cpdb.sbt.zip

And I am getting this:

== This is sourmash version 4.2.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
When loading query from 'out/sub1.fq.sig'
100000 signatures matching ksize and molecule type;
need exactly one. Specify --ksize or --dna, --rna, or --protein.

I also tried to specify --dna, not putting --ksize, or any sensible combination but they generated the same error. I found that the same kind of error was reported in #1028 and #1089, but those seem triggered by lca and multigather, while this one emerged calling search.

By the way, I am using sourmash 4.2.1 installed through conda.

Any thoughts on this use case, approach, and errors would be greatly appreciated.
Regards

@ctb
Copy link
Contributor

ctb commented Aug 2, 2021

huh, that is a confusing error message. I can sort of dimly intuit what might be going on, but I'll have to dig!

one immediate problem I see is that you'll want both sourmash sketch dna commands to use the same parameter string, so -p k=31,scaled=10.

in re the beginning question,

Usually, you would do this by mapping those reads to the genome, but I wonder if it could be done faster with sketching tools such as Sourmash.

sourmash is unlikely to be faster, since mappers have been optimized quite a bit :). But k-mers were used for many years for filtering, before the Burroughs-Wheeler Transform methods became dominant; and software like mashmap uses k-mer sketching to do similar things, but only for longer sequences where the downsampling that happens with sketching still provides strong guarantees. You're heading in the right direction with scaled=10, but

we have developed the spacegraphcats software to do graph-based k-mer matching, and that is used for situations where you're trying to retrieve graph-adjacent sequences where mappers simply won't work.

thanks for posting the problem! I'll see what I can do to figure it out :)

@sivico26
Copy link
Author

Hi, sorry for the big delay.

I just wanted to mention that the issue persisted after using the -p k=31,scaled=10 in both the signature generation and the sketching of the reads.

Thanks a lot for your thoughts. I finally had the time to try spacegraphcats. Unfortunately, even though I installed successfully run the test data, I could not make it work over my data.

@ctb
Copy link
Contributor

ctb commented Aug 3, 2022

closed in favor of genome-grist - dib-lab/genome-grist#193

@ctb ctb closed this as completed Aug 3, 2022
@ReneKat
Copy link

ReneKat commented Jan 6, 2023

Hi there sourmash team,

I am having this issue with sourmash gather. I just updated from 4.4 to 4.6 and the issue persists. I see there was work done with genome-grist #193 and this issue was recently closed, but am unsure how to solve the issue myself.
I am following this tutorial published July '22.

To generate signatures I'm running:
sourmash sketch dna -p k=31,scaled=1000,abund -o ./${sample}_k31.sig ${sample}_HQ_R1.fastq ${sample}_HQ_R2.fastq
Then using gather from the GenBank viral signatures:
sourmash gather ${sample}_k31.sig genbank-2022.03-viral-k31.zip --ksize 31 --dna -o ${sample}_k31_rep.csv

== This is sourmash version 4.6.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting specified query k=31
When loading query from 'F04-20180823_k31.sig'
2 signatures matching ksize and molecule type;
need exactly one. Specify --ksize or --dna, --rna, or --protein.

No output is being generated but I also do not have an error or warning message.

@ctb Let me know if I should create a new issue for this.
Thanks for any help and Happy New Year!
Best,
Rene

@ctb
Copy link
Contributor

ctb commented Jan 6, 2023

hi @sivico26 try adding --name {sample}_HQ to the sketch command.

If that works, what is happening is this:

two different sketches are being generated by the sourmash sketch command, one for each input file; and they are then being output to a single .sig file. (We support multiple sketches per .sig file).

The --name arg (equivalent to --merge if you prefer that) tells sourmash to produce only one sketch, named arg.

More generally, you can diagnose this kind of thing with sourmash sig summarize F04-20180823_k31.sig which should tell you you have two sketches, and sourmash sig describe F04-20180823_k31.sig which should show you that you have an R1 and R2 sketch.

Let me know if that works! Or if it doesn't 😆

@ReneKat
Copy link

ReneKat commented Jan 6, 2023

Hi @ctb , thank you so much for the rapid response. :) I'm regenerating the signatures now.
I had left the --merge off thinking that it was only for using different k-mer sizes. Thanks for explaining that in detail. :) I'll check back with you here when sourmash gather works.
Best,
Rene

@ReneKat
Copy link

ReneKat commented Jan 6, 2023

Works great! Thanks @ctb

@ctb
Copy link
Contributor

ctb commented Jan 6, 2023

welcome! glad it was an easy fix 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants