sourmash LCA database update url? #15

Closed
nextgenusfs opened this issue Aug 25, 2021 · 18 comments

Comments

@nextgenusfs
Collaborator

nextgenusfs commented Aug 25, 2021

It seems that at some point newer versions of sourmash started requiring an updated LCA database; I got a failure with v4.2.2 and the database we have pinned in resources.py. The new link is https://osf.io/9xdg2/download and the page is https://sourmash.readthedocs.io/en/latest/databases.html

@nextgenusfs
Collaborator Author

Hmm, actually I'm not getting any taxonomy results from sourpurge, so there must be a change in sourmash...

@hyphaltip
Member

hmm, I'll try to test it myself locally; not sure if we need to ask for input from @ctb or @luizirber if stuck.

@nextgenusfs
Collaborator Author

yeah could just be me -- I haven't tried to use it in a few years. (I'm also on my Mac with sourmash installed from Bioconda)

@nextgenusfs
Collaborator Author

nextgenusfs commented Aug 25, 2021

Alternatively, we could stick with an older version of sourmash. In case anybody is looking here, the commands we are trying to run are:

$ sourmash compute -k 31 --scaled=1000 --singleton assembly.fasta > assembly.fasta.sig
$ sourmash lca classify --db genbank-k31.lca.json.gz --query assembly.fasta.sig

Using the latest v4.2.2 on Mac OS, I essentially got this result:

ID status superkingdom phylum class order family genus species strain
NODE_1_length_531696_cov_9.953852 nomatch                
NODE_2_length_448760_cov_9.622479 nomatch                
NODE_3_length_360422_cov_9.704374 nomatch                
NODE_4_length_343545_cov_9.301333 nomatch                
NODE_5_length_319398_cov_10.079307 nomatch                
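A quick way to confirm that every contig really came back as nomatch is to tally the status column. This is only a minimal sketch: it assumes the classify output is saved as CSV with the column names shown in the header above, and the helper name tally_status is hypothetical (it is not part of sourmash or AAFTF).

```python
import csv
import io
from collections import Counter


def tally_status(csv_text):
    """Count rows per value of the 'status' column (assumed CSV layout)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["status"] for row in reader)


# Two rows copied from the result above, written out as CSV (an assumption
# about the on-disk format; the thread shows them whitespace-separated).
example = """ID,status,superkingdom,phylum,class,order,family,genus,species,strain
NODE_1_length_531696_cov_9.953852,nomatch,,,,,,,,
NODE_2_length_448760_cov_9.622479,nomatch,,,,,,,,
"""

counts = tally_status(example)
print(dict(counts))
```

If the tally contains nothing but nomatch, the problem is more likely the database or the sketch parameters than any individual contig.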

@nextgenusfs
Collaborator Author

Ah, I guess I should look at the command help menu!

$ sourmash compute
usage: 

** WARNING: the sourmash compute command is DEPRECATED as of 4.0 and
** will be removed in 5.0. Please see the 'sourmash sketch' command instead.

   sourmash compute -k 21,31,51 *.fa *.fq

Create MinHash sketches at k-mer sizes of 21, 31 and 51, for
all FASTA and FASTQ files in the current directory, and save them in
signature files ending in '.sig'. You can rapidly compare these files
with `compare` and query them with `search`, among other operations;
see the full documentation at http://sourmash.rtfd.io/.

@hyphaltip
Member

ahh okay. changes to apply.

@ctb

ctb commented Aug 26, 2021

hi all, thanks for tagging me in!

I'll have to go digging to give you exact dates, but we updated LCA database formats many, many moons ago - back in 2.x somewhere.

The difference in results is unexpected. The underlying algorithms didn't change; the database format expanded to accommodate sketches that didn't have taxonomy associated.

Last but by no means least, sourmash compute still works as it did before, and the sketch/signature formats are the same. So no change needed there right now. It's just getting removed in 5.0 :).
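For whenever the move away from compute does happen, here is a rough sketch of building the corresponding sourmash sketch call from Python. The function name sketch_command is hypothetical, and the mapping of -k 31 --scaled=1000 onto -p k=31,scaled=1000 is based on the sourmash sketch documentation rather than anything confirmed in this thread, so treat it as an assumption to verify against your installed version.

```python
def sketch_command(fasta, ksize=31, scaled=1000, singleton=True):
    """Build an argument list for 'sourmash sketch dna' (not executed here)."""
    cmd = ["sourmash", "sketch", "dna", "-p", f"k={ksize},scaled={scaled}"]
    if singleton:
        # One signature per sequence record, as with 'compute --singleton'.
        cmd.append("--singleton")
    # Write the signatures next to the input, mirroring the redirect used above.
    cmd += [fasta, "-o", fasta + ".sig"]
    return cmd


print(" ".join(sketch_command("assembly.fasta")))
```

The list can be handed to subprocess.run(cmd, check=True) once sourmash is on the PATH.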

@nextgenusfs
Collaborator Author

Okay thanks @ctb -- must be related to something with my install. I'll try to figure out and open an issue on sourmash GitHub if I can't figure it out. So @hyphaltip no reason to change the way we are running this quite yet, but we will need to update the database/resource link I think.

@ctb

ctb commented Aug 26, 2021

well, I doubt it's your install - it's probably some SNAFU on our part, since it should have been working the same as before :). Either that or the database is bad/wrong? Yay computerz. We'll figure it out together tho, promise.

I do think you might want to take advantage of the new sourmash gather/sourmash tax approach, which is much better than lca classify, but that would be a somewhat bigger change. See @bluegenes blog post, https://bluegenes.github.io/sourmash-tax/.

@nextgenusfs
Collaborator Author

Okay, I'll look into that. Basically what we are trying to do here is just classify each contig from a de novo assembly and remove things that are obviously contamination, i.e. contigs with a bacterial taxonomic classification when we are working on a fungal genome.
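The filtering idea described here could look roughly like the following. This is a sketch under stated assumptions, not the actual AAFTF sourpurge code: the function name flag_contaminants is hypothetical, the contig rows are made up for illustration, and it assumes classified rows carry the status value "found" in a CSV with the same columns as the classify output shown earlier.

```python
import csv
import io


def flag_contaminants(csv_text, expected_superkingdom="Eukaryota"):
    """Return IDs of contigs classified outside the expected superkingdom."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        superkingdom = (row.get("superkingdom") or "").strip()
        # Unclassified contigs (empty superkingdom) are kept, not flagged.
        if row["status"] == "found" and superkingdom and superkingdom != expected_superkingdom:
            flagged.append(row["ID"])
    return flagged


# Hypothetical rows for illustration only.
example = """ID,status,superkingdom,phylum,class,order,family,genus,species,strain
contig_a,found,Bacteria,Proteobacteria,,,,,,
contig_b,found,Eukaryota,Ascomycota,,,,,,
contig_c,nomatch,,,,,,,,
"""

print(flag_contaminants(example))
```

Keeping nomatch contigs is a deliberate choice in this sketch: a contig absent from the database is not evidence of contamination.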

@ctb

ctb commented May 3, 2022

hi! reminded of this by https://twitter.com/jonpalmer2013/status/1521312530936725506 :)

we did just release new databases! it would be easy for me to build you a new Genbank LCA (or give you the commands to do it), or you could just use the GTDB ones.

@ctb

ctb commented May 3, 2022

(as of sourmash v4.4, scheduled soon, we can also point you at larger-on-disk but much faster and lower memory SQLite-based LCA database.)

@hyphaltip
Member

thanks, I was trying something before and it was way too slow for us to put in, but I want to give this another go.

Noting that our current 'sourpurge' with sourmash did about as good a job as NCBI's now-available screening tool, in a fraction of the time and with a lot less data to download.

@ctb

ctb commented May 6, 2022

k - let us know how we can help! would AAFTF be something we can just download and run on our own, if we feel so inclined to try it out?

@hyphaltip
Member

sure - it is a very simple Python package, and I'd certainly welcome someone else helping me package it up for conda properly...

@hyphaltip
Member

I have implemented it, but with gtdb or gtdb-reps it doesn't seem to work as well as the old genbank-k31.

CMD: sourmash lca classify --db /srv/projects/db/AAFTF_DB/gtdb-rs207-genomic-reps.dna.k31.lca.json.gz --query assembly.fasta.sig
[May 29 11:18 AM] Found 0 taxonomic classifications for contigs:

With old scheme.

CMD: sourmash lca classify --db /srv/projects/db/AAFTF_DB/genbank-k31.lca.json.gz --query assembly.fasta.sig
[May 29 11:17 AM] Found 2 taxonomic classifications for contigs:
Eukaryota;Ascomycota;Dothideomycetes;Capnodiales;Cladosporiaceae;Rachicladosporium;Rachicladosporium antarcticum
Eukaryota;Ascomycota;Dothideomycetes;Capnodiales;Cladosporiaceae;Rachicladosporium

@ctb is this because the gtdb is really bacteria only? I think this is okay in a sense but I guess DBs are too large now to really do a single sourmash search on representative dbs?
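To check which superkingdom each reported hit belongs to, the semicolon-separated lineage strings in the log above can be split into named ranks. A minimal sketch, assuming the fields follow the rank order of the classify header earlier in the thread; parse_lineage is a hypothetical helper, not a sourmash function.

```python
# Rank order taken from the classify header earlier in the thread.
RANKS = ["superkingdom", "phylum", "class", "order",
         "family", "genus", "species", "strain"]


def parse_lineage(lineage):
    """Map a ';'-separated lineage string onto rank names, left to right."""
    names = [name.strip() for name in lineage.split(";")]
    # zip stops at the shorter input, so missing lower ranks are simply absent.
    return dict(zip(RANKS, names))


hit = parse_lineage(
    "Eukaryota;Ascomycota;Dothideomycetes;Capnodiales;"
    "Cladosporiaceae;Rachicladosporium"
)
print(hit["superkingdom"], hit["genus"], hit.get("species"))
```

Both hits from the old genbank-k31 database are eukaryotic lineages, which is exactly the kind of result a database without eukaryotic genomes cannot return.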

@hyphaltip
Member

hyphaltip commented Oct 11, 2022 via email

@hyphaltip
Member

This has been fixed now: the database downloads automatically into $AAFTF_DB (set via either an environment variable or a command-line option). There are also options to run against the GTDB databases indexed by the sourmash tools of @ctb and @sourmash-bio, relying on the LCA index files listed at https://sourmash.readthedocs.io/en/latest/databases.html#gtdb-r08-rs214-dna-databases, in addition to the LCA database for GenBank 2017, which actually works really well for finding bacterial contamination and is fast to search: https://sourmash.readthedocs.io/en/latest/legacy-databases.html#genbank-microbial-genomes-lca
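The download-if-missing behaviour could be sketched roughly as below. This is not the AAFTF implementation: resolve_db_dir and ensure_db are hypothetical names, and letting the command-line value take precedence over $AAFTF_DB is an assumption, since the thread only says either may be used.

```python
import os
from pathlib import Path
from urllib.request import urlretrieve


def resolve_db_dir(cli_value=None, environ=os.environ):
    """Pick the database directory; CLI value wins over $AAFTF_DB (assumed)."""
    chosen = cli_value or environ.get("AAFTF_DB")
    if not chosen:
        raise ValueError("set AAFTF_DB or pass a database directory")
    return Path(chosen)


def ensure_db(url, db_dir, filename, fetch=urlretrieve):
    """Download url into db_dir/filename only if it is not already there."""
    db_dir = Path(db_dir)
    db_dir.mkdir(parents=True, exist_ok=True)
    target = db_dir / filename
    if not target.exists():
        fetch(url, str(target))
    return target


# Resolution only; nothing is downloaded by this line.
print(resolve_db_dir(None, {"AAFTF_DB": "/srv/projects/db/AAFTF_DB"}))
```

A call such as ensure_db("https://osf.io/9xdg2/download", resolve_db_dir(), "genbank-k31.lca.json.gz") would then fetch the file once and reuse it afterwards; the URL and filename are the ones quoted earlier in this thread and may no longer be current.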
