Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixing and debugging LCA_SqliteDatabase problems #2407

Open
ctb opened this issue Dec 15, 2022 · 2 comments
Open

fixing and debugging LCA_SqliteDatabase problems #2407

ctb opened this issue Dec 15, 2022 · 2 comments

Comments

@ctb
Copy link
Contributor

ctb commented Dec 15, 2022

(Should this maybe be moved over to https://github.com/sourmash-bio/sourmash-examples/?)

So @bluegenes built an LCA_SqliteDatabase using:

sourmash lca index gtdb-rs207.taxonomy.with-strain.csv \
    gtdb-rs207.protein.k10.scaled1000.lca.sql \
    gtdb-rs207.protein.k10.scaled1000.zip \
    -F sql --protein -k 10 --no-dna --scaled 1000

and got the following warnings:

WARNING: 12139 duplicate signatures.
WARNING: no lineage provided for 305403 signatures.
WARNING: no signatures for 317542 spreadsheet rows.
WARNING: 317542 unused lineages.
WARNING: 317542 unused identifiers.

Yay warnings! Boo warnings!

What's going on?

First: fixing it.

No taxonomies were loaded, and sourmash SQLite LCA databases without taxonomies are just sourmash SQLite databases, so it should be possible to add a taxonomy ;).

So I tried:

sourmash tax prepare -t \
    /group/ctbrowngrp/sourmash-db/gtdb-rs207/gtdb-rs207.taxonomy.with-strain.csv \
    -o gtdb-rs207.protein.k10.scaled1000.lca.sql \
    -F sql

and got back:

ERROR while saving!
taxonomy table already exists in 'gtdb-rs207.protein.k10.scaled1000.lca.sql'

because even though the taxonomy table was empty, they still existed.

So I ran the sqlite3 command line interface -

sqlite3 gtdb-rs207.protein.k10.scaled1000.lca.sql

and then inside sqlite,

sqlite> select * from sourmash_internal;

which returned:

SqliteIndex|1.0
SqliteManifest|1.0
SqliteLineage|1.0

The key/value pair SqliteLineage plus the existence of the table sourmash_taxonomy were preventing sourmash tax prepare from running.

The following commands removed these:

sqlite> delete from sourmash_internal where key='SqliteLineage';
sqlite> drop table sourmash_taxonomy;

and then I could re-run:

sourmash tax prepare -t \
    /group/ctbrowngrp/sourmash-db/gtdb-rs207/gtdb-rs207.taxonomy.with-strain.csv \
    -o gtdb-rs207.protein.k10.scaled1000.lca.sql \
    -F sql

Now sourmash tax summarize produces good results:

loading taxonomies...
...loaded 317542 entries.
number of distinct taxonomic lineages: 317542
rank superkingdom:        2 distinct taxonomic lineages
rank phylum:              189 distinct taxonomic lineages
rank class:               481 distinct taxonomic lineages
rank order:               1593 distinct taxonomic lineages   
rank family:              4107 distinct taxonomic lineages   
rank genus:               16686 distinct taxonomic lineages  
rank species:             65703 distinct taxonomic lineages  
rank strain:              317542 distinct taxonomic lineages 

as does sourmash sig summarize:

path filetype: LCA_SqliteDatabase
location: gtdb-rs207.protein.k10.scaled1000.lca.sql
is database? yes
has manifest? yes
num signatures: 305403
** examining manifest...
total hashes: 318937259
summary of sketches:
   305403 sketches with protein, k=10, scaled=1000    318937259 total hashes

yay w00t!

why did the sourmash lca index fail!?

Let's diagnose that - first, produce a smaller taxonomy:

head -20 /group/ctbrowngrp/sourmash-db/gtdb-rs207/gtdb-rs207.taxonomy.with-strain.csv \
    > subtax.csv

and subset the database:

sourmash sig cat gtdb-rs207.protein.k10.scaled1000.lca.sql --picklist subtax.csv:ident:ident -o subsig.zip

then this reproduces the problem, but, like, faster ;) -

sourmash lca index subtax.csv \
    subsig.lca.sql subsig.zip -F sql \
    --protein -k 10 --no-dna --scaled 1000

adding a --report out.txt, I see:

Unused identifiers: 19
GCF_014333955.1
GCF_002513785.1
GCF_900482335.1
GCF_008388435.1
...
No lineage provided for these identifiers: 19
GCF_000566285.1 s__Escherichia coli
GCA_900448105.1 s__Escherichia coli
GCF_003460375.1 s__Escherichia coli
...

so it looks like identifiers are not being split, oops.

Fixed with:

sourmash lca index subtax.csv subsig.lca.sql subsig.zip \
    -F sql --protein -k 10 --no-dna --scaled 1000 \
    --keep-identifier-versions --split-identifiers

which then yields

19 distinct identities in spreadsheet out of 19 rows.
19 distinct lineages in spreadsheet out of 19 rows.
... loaded 1 signatures.
loaded 3887 hashes at ksize=10 scaled=1000
19 assigned lineages out of 19 distinct lineages in spreadsheet.
19 identifiers used out of 19 distinct identifiers in spreadsheet

tada!

An alternate (and safer?) construction method -

First, build a SQLite database:

sourmash sig cat subsig.zip -o try.lca.sqldb

then add taxonomy:

sourmash tax prepare --keep-identifier-versions \
    -t subtax.csv -o try.lca.sqldb -F sql

Double check - do all these work?

Extract a query:

sourmash sig cat --include GCF_000459755.1 subsig.zip -o query.sig

Check subsig.lca.sql:

% sourmash lca classify --query query.sig --db subsig.lca.sql

...

loaded 1 LCA databases. ksize=10, scaled=1000 moltype=protein
finding query signatures...
outputting classifications to -
ID,status,superkingdom,phylum,class,order,family,genus,species,strain
GCF_000459755.1 s__Escherichia coli,found,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia coli,GCF_000459755.1
classified 1 signatures total

Check try.lca.sqldb:

sourmash lca classify --query query.sig --db try.lca.sqldb 

...

loaded 1 LCA databases. ksize=10, scaled=1000 moltype=protein
finding query signatures...
outputting classifications to -
ID,status,superkingdom,phylum,class,order,family,genus,species,strain
GCF_000459755.1 s__Escherichia coli,found,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia coli,GCF_000459755.1
classified 1 signatures total

Check full GTDB:

 sourmash lca classify --query query.sig --db gtdb-rs207.protein.k10.scaled1000.lca.sql

...

loaded 1 LCA databases. ksize=10, scaled=1000 moltype=protein
finding query signatures...
outputting classifications to -
ID,status,superkingdom,phylum,class,order,family,genus,species,strain
notify doneying GCF_000459755.1 s__Escherichia coli (file 1 of 1)
GCF_000459755.1 s__Escherichia coli,found,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia coli,
classified 1 signatures total

yay! w00t! it all works!

Exploring questions -

Why did we need --no-dna? Or did we?

sourmash lca index subtax.csv try2.lca.sql subsig.zip \
    -F sql --protein -k 10 --scaled 1000 \
    --keep-identifier-versions --split-identifiers

followed by

sourmash lca classify --query query.sig --db try2.lca.sql 

worked:

== This is sourmash version 4.6.2.dev7+g1c8b1659. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 1 LCA databases. ksize=10, scaled=1000 moltype=protein
finding query signatures...
outputting classifications to -
ID,status,superkingdom,phylum,class,order,family,genus,species,strain
GCF_000459755.1 s__Escherichia coli,found,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia coli,GCF_000459755.1
classified 1 signatures total

so Tessa was just being very cautious ;).

@ctb
Copy link
Contributor Author

ctb commented Dec 16, 2022

ref #2198 - the tax code and commands are much more straightforward than the lca code these days!

@ctb
Copy link
Contributor Author

ctb commented Dec 18, 2022

note: on farm, all of this was done in ~ctbrown/scratch/2022-tessa-lca-db-fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant