WARNING: 12139 duplicate signatures.
WARNING: no lineage provided for 305403 signatures.
WARNING: no signatures for 317542 spreadsheet rows.
WARNING: 317542 unused lineages.
WARNING: 317542 unused identifiers.
Yay warnings! Boo warnings!
What's going on?
First: fixing it.
No taxonomies were loaded, and sourmash SQLite LCA databases without taxonomies are just sourmash SQLite databases, so it should be possible to add a taxonomy ;).
path filetype: LCA_SqliteDatabase
location: gtdb-rs207.protein.k10.scaled1000.lca.sql
is database? yes
has manifest? yes
num signatures: 305403
** examining manifest...
total hashes: 318937259
summary of sketches:
305403 sketches with protein, k=10, scaled=1000 318937259 total hashes
yay w00t!
why did the sourmash lca index fail!?
Let's diagnose that - first, produce a smaller taxonomy:
head -20 /group/ctbrowngrp/sourmash-db/gtdb-rs207/gtdb-rs207.taxonomy.with-strain.csv \
> subtax.csv
and subset the database:
sourmash sig cat gtdb-rs207.protein.k10.scaled1000.lca.sql --picklist subtax.csv:ident:ident -o subsig.zip
then this reproduces the problem, but, like, faster ;) -
19 distinct identities in spreadsheet out of 19 rows.
19 distinct lineages in spreadsheet out of 19 rows.
... loaded 1 signatures.
loaded 3887 hashes at ksize=10 scaled=1000
19 assigned lineages out of 19 distinct lineages in spreadsheet.
19 identifiers used out of 19 distinct identifiers in spreadsheet
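The "19 rows" reported above follows from `head -20` keeping the header line plus 19 data rows. A minimal Python sketch of that subsetting step, run on a made-up table (the `ident` and `lineage` columns and their values are placeholders, not the real GTDB spreadsheet, and it assumes no quoted field spans multiple lines):

```python
import csv
import io


def head_csv(text, n_lines):
    """Keep the first n_lines lines of a CSV, header included (like `head -n`)."""
    return "".join(text.splitlines(keepends=True)[:n_lines])


# Build a placeholder taxonomy table: one header line plus 50 data rows.
full = io.StringIO()
writer = csv.writer(full, lineterminator="\n")
writer.writerow(["ident", "lineage"])
for i in range(50):
    writer.writerow([f"GCF_{i:09d}.1", f"placeholder lineage {i}"])

subset_text = head_csv(full.getvalue(), 20)
rows = list(csv.DictReader(io.StringIO(subset_text)))
distinct_idents = {row["ident"] for row in rows}
```

Here `rows` holds 19 entries because one of the 20 lines is the header.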
(Should this maybe be moved over to https://github.com/sourmash-bio/sourmash-examples/?)
So @bluegenes built an LCA_SqliteDatabase using:
and got the following warnings:
Yay warnings! Boo warnings!
What's going on?
First: fixing it.
No taxonomies were loaded, and sourmash SQLite LCA databases without taxonomies are just sourmash SQLite databases, so it should be possible to add a taxonomy ;).
So I tried:
and got back:
because even though the taxonomy tables were empty, they still existed.
So I ran the sqlite3 command line interface -
and then inside sqlite,
which returned:
The key/value pair SqliteLineage plus the existence of the table sourmash_taxonomy were preventing sourmash tax prepare from running.
The following commands removed these:
and then I could re-run:
Now sourmash tax summarize produces good results:
as does sourmash sig summarize:
yay w00t!
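The cleanup described above (removing the SqliteLineage key/value pair and the sourmash_taxonomy table) can also be sketched with Python's standard `sqlite3` module. This is a sketch under assumptions, not the exact commands used: it assumes the key/value pairs live in a table named `sourmash_internal` with `key` and `value` columns, and it runs against an in-memory stand-in rather than a real database. The `SqliteIndex` row and the column layout of the stand-in taxonomy table are made up for illustration.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the database, sorted."""
    query = "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    return [row[0] for row in conn.execute(query)]


def strip_taxonomy(conn):
    """Remove the lineage marker and the taxonomy table."""
    # Assumption: the key/value pairs are stored in `sourmash_internal`.
    conn.execute("DELETE FROM sourmash_internal WHERE key = ?", ("SqliteLineage",))
    conn.execute("DROP TABLE IF EXISTS sourmash_taxonomy")
    conn.commit()


# In-memory stand-in that mimics the assumed layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sourmash_internal (key TEXT UNIQUE, value TEXT)")
conn.execute("INSERT INTO sourmash_internal VALUES ('SqliteIndex', '1.0')")
conn.execute("INSERT INTO sourmash_internal VALUES ('SqliteLineage', '1.0')")
conn.execute("CREATE TABLE sourmash_taxonomy (ident TEXT, lineage TEXT)")

tables_before = list_tables(conn)
strip_taxonomy(conn)
tables_after = list_tables(conn)
keys_after = [row[0] for row in conn.execute("SELECT key FROM sourmash_internal")]
```

On a real file, work on a copy first, since dropping a table cannot be undone.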
why did the sourmash lca index fail!?
Let's diagnose that - first, produce a smaller taxonomy:
and subset the database:
then this reproduces the problem, but, like, faster ;) -
adding a --report out.txt, I see:
so it looks like identifiers are not being split, oops.
Fixed with:
which then yields
tada!
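To illustrate what "identifiers are not being split" means in practice: if a signature name carries an accession followed by a description while the spreadsheet lists only the accession, nothing matches until the name is cut at the first space. The sketch below is an illustration under that assumption, not sourmash's own code. `split_identifier` is a hypothetical helper, and the names and accessions are made up.

```python
def split_identifier(name, keep_version=True):
    """Take the first whitespace-delimited token of a name as its identifier."""
    ident = name.split(" ", 1)[0]
    if not keep_version:
        # Optionally drop a trailing ".N" version suffix.
        ident = ident.split(".", 1)[0]
    return ident


# Made-up examples: spreadsheet identifiers vs. full signature names.
spreadsheet_idents = {"GCF_000000001.1", "GCA_000000002.1"}
signature_names = [
    "GCF_000000001.1 Hypothetical organism A",
    "GCA_000000002.1 Hypothetical organism B",
]

# Without splitting, the full name is compared and nothing matches.
unsplit_matches = [n for n in signature_names if n in spreadsheet_idents]

# With splitting, the accession alone is compared and both match.
split_matches = [
    n for n in signature_names if split_identifier(n) in spreadsheet_idents
]
```

With these inputs `unsplit_matches` is empty while `split_matches` contains both names, which mirrors the jump from zero assigned lineages to all of them.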
An alternate (and safer?) construction method -
First, build a SQLite database:
then add taxonomy:
Double check - do all these work?
Extract a query:
Check subsig.lca.sql:
Check try.lca.sqldb:
Check full GTDB:
yay! w00t! it all works!
Exploring questions -
Why did we need --no-dna? Or did we?
followed by
worked:
so Tessa was just being very cautious ;).