# `sourmash tax` submodule
### for integrating taxonomic information

The `sourmash tax` (alias `taxonomy`) commands integrate taxonomic information into the results of `sourmash gather`. `tax` commands require a properly formatted taxonomy CSV file that corresponds to the database used for `gather`. For supported databases (e.g. GTDB), we provide these files, but they can also be generated for user-generated databases. For more information, see the [databases documentation](https://sourmash.readthedocs.io/en/latest/databases.html).

These commands rely upon the fact that gather results are non-overlapping: the total fraction matched across all gather results for a query will be between 0 (no database matches) and 1 (100% of query matched). We use this property to aggregate gather matches at the desired taxonomic rank. For example, if the gather results for a metagenome include results for 30 different strains of a given species, we can sum the fraction match to each strain to obtain the fraction match to this species.
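
As a rough illustration of this aggregation (a minimal sketch, not sourmash's implementation), the snippet below sums per-strain fractions into totals at a chosen rank. The strain names, the shortened rank list, and the fraction values are all invented for the example.

```python
from collections import defaultdict

# Hypothetical gather results: (lineage down to strain, fraction of query matched).
# Names and numbers are invented for illustration only.
gather_matches = [
    (("Bacteria", "Escherichia", "Escherichia coli", "strain A"), 0.10),
    (("Bacteria", "Escherichia", "Escherichia coli", "strain B"), 0.05),
    (("Bacteria", "Prevotella", "Prevotella copri", "strain C"), 0.20),
]

# Shortened rank list, used only to keep the example small.
RANKS = ("superkingdom", "genus", "species", "strain")


def summarize_at_rank(matches, rank):
    """Sum non-overlapping match fractions for each lineage truncated at `rank`."""
    depth = RANKS.index(rank) + 1
    totals = defaultdict(float)
    for lineage, fraction in matches:
        totals[lineage[:depth]] += fraction
    return dict(totals)


species_totals = summarize_at_rank(gather_matches, "species")
# Whatever is not matched at all remains unclassified.
unclassified = 1.0 - sum(species_totals.values())

for lineage, fraction in species_totals.items():
    print(";".join(lineage), round(fraction, 3))
print("unclassified", round(unclassified, 3))
```

Because the fractions do not overlap, the same function can be called with any rank and the totals plus the unclassified remainder still add up to 1.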

As with all reference-based analyses, results can be affected by the completeness of the reference database. However, summarizing taxonomic results from gather minimizes the impact of reference database issues that can derail standard k-mer LCA approaches. See the [blog post]() for a full explanation, and the [`sourmash tax` documentation](https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-tax-prepare-prepare-and-or-combine-taxonomy-files) for additional usage details.

## Download example inputs for `sourmash tax`

In this example, we'll be using a small test dataset run against both the `GTDB-rs202` database
and our legacy `Genbank` database. (New genbank databases coming soon, please bear with us :).


#### download and look at the gtdb-rs202 lineage file

This is the taxonomy file in `csv` format.
The column headers for `GTDB` are the accession (`ident`), and the taxonomic ranks `superkingdom` --> `species`.

In [1]:
%%bash

mkdir -p lineages
curl -L https://osf.io/p6z3w/download -o lineages/gtdb-rs202.taxonomy.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   483  100   483    0     0    273      0  0:00:01  0:00:01 --:--:--   273
100 35.2M  100 35.2M    0     0  5497k      0  0:00:06  0:00:06 --:--:--  9.8M


In [2]:
%%bash
head lineages/gtdb-rs202.taxonomy.csv

ident,superkingdom,phylum,class,order,family,genus,species
GCF_014075335.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia flexneri
GCF_002310555.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia flexneri
GCF_900013275.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia flexneri
GCF_000168095.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia flexneri
GCF_002459845.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia flexneri
GCF_001614695.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia flexneri
GCF_000356585.2,d__Bacteria,p__Proteobact
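
If you want to inspect a lineage file from Python rather than with `head`, the standard-library `csv` module is enough. The sketch below parses two rows copied from the output above into an `ident` to lineage mapping; `load_lineages` is a helper name made up for this example and is not part of sourmash.

```python
import csv
import io

# Header and two rows copied from the `head` output above.
sample_csv = """\
ident,superkingdom,phylum,class,order,family,genus,species
GCF_014075335.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia flexneri
GCF_002310555.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia flexneri
"""


def load_lineages(handle):
    """Return ({ident: lineage tuple}, rank names) from a GTDB-style taxonomy CSV."""
    reader = csv.DictReader(handle)
    # Every column except the identifier is treated as a taxonomic rank.
    ranks = [name for name in reader.fieldnames if name != "ident"]
    lineages = {row["ident"]: tuple(row[rank] for rank in ranks) for row in reader}
    return lineages, ranks


lineages, ranks = load_lineages(io.StringIO(sample_csv))
print(ranks)
print(lineages["GCF_014075335.1"][-1])
```

To read the downloaded file instead of the inline sample, pass `open("lineages/gtdb-rs202.taxonomy.csv", newline="")` to the same function.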

#### download NCBI lineage files

Now let's go ahead and grab the Genbank taxonomy files as well. 

In [3]:
%%bash
curl -L https://osf.io/cbhgd/download -o lineages/bacteria_genbank_lineages.csv
curl -L https://osf.io/urtfx/download -o lineages/protozoa_genbank_lineages.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   459  100   459    0     0   1267      0 --:--:-- --:--:-- --:--:--  1267
100 83.1M  100 83.1M    0     0  16.5M      0  0:00:05  0:00:05 --:--:-- 20.6M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   459  100   459    0     0    803      0 --:--:-- --:--:-- --:--:--   803
100  107k  100  107k    0     0  64622      0  0:00:01  0:00:01 --:--:--  474k


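If you run `head` on these files, you'll notice they have an extra `taxid` column. In the file shown below, the identifier column is also named `accession` rather than `ident`, and there is a trailing `strain` column. Otherwise they follow the same format.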

In [4]:
%%bash
head lineages/protozoa_genbank_lineages.csv

accession,taxid,superkingdom,phylum,class,order,family,genus,species,strain
GCA_004431415,2762,Eukaryota,,Glaucocystophyceae,,Cyanophoraceae,Cyanophora,Cyanophora paradoxa,
GCA_000150955,556484,Eukaryota,Bacillariophyta,Bacillariophyceae,Naviculales,Phaeodactylaceae,Phaeodactylum,Phaeodactylum tricornutum,Phaeodactylum tricornutum CCAP 1055/1
GCA_000310025,2880,Eukaryota,,Phaeophyceae,Ectocarpales,Ectocarpaceae,Ectocarpus,Ectocarpus siliculosus,
GCA_000194455,2898,Eukaryota,,Cryptophyceae,Cryptomonadales,Cryptomonadaceae,Cryptomonas,Cryptomonas paramecium,
GCA_000372725,280463,Eukaryota,Haptista,Haptophyta,Isochrysidales,Noelaerhabdaceae,Emiliania,Emiliania huxleyi,Emiliania huxleyi CCMP1516
GCA_001939145,2951,Eukaryota,,Dinophyceae,Suessiales,Symbiodiniaceae,Symbiodinium,Symbiodinium microadriaticum,
GCA_900617105,2996,Eukaryota,,Chrysophyceae,Hydrurales,Hydruraceae,Hydrurus,Hydrurus foetidus,
GCA_001638955,158060,Eukaryota,Euglenozoa,Euglenida,Euglenales,Euglenaceae,Euglena,Euglena g

## Combining taxonomies with `sourmash tax prepare`

All sourmash tax commands must be given one or more taxonomy files as parameters to the `--taxonomy` argument.

`sourmash tax prepare` is a utility function that can ingest and validate multiple CSV files or sqlite3
databases, and output a CSV file or a sqlite3 database. It can be used to combine multiple taxonomies
into a single file, as well as change formats between CSV and sqlite3.

> Note: `--taxonomy` files can be either CSV files or (as of sourmash 4.2.1) sqlite3 databases.
> sqlite3 databases are much faster for large taxonomies, while CSV files are easier to view
> and modify using spreadsheet software.

Let's use `tax prepare` to combine the downloaded taxonomies and output into a sqlite3 database:

In [5]:
# to see the arguments, run the `--help` like so:
! sourmash tax prepare --help

== This is sourmash version 4.2.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

usage: 

    sourmash tax prepare --taxonomy-csv <taxonomy_file> [ ... ] -o <output>

The 'tax prepare' command reads in one or more taxonomy databases
and saves them into a new database. It can be used to combine databases
in the desired order, as well as output different database formats.

Please see the 'tax prepare' documentation for more details:
  https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-tax-prepare-prepare-and-or-combine-taxonomy-files

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           suppress non-error output
  -t FILE [FILE ...], --taxonomy-csv FILE [FILE ...], --taxonomy FILE [FILE ...]
                        database lineages
  -o OUTPUT, --output OUTPUT
                        output file
  -F {csv,sql}, --database-format {csv,sql}
                        format of output file; defaul

In [6]:
%%bash

sourmash tax prepare --taxonomy lineages/bacteria_genbank_lineages.csv \
                                lineages/protozoa_genbank_lineages.csv \
                                lineages/gtdb-rs202.taxonomy.csv \
                                -o lineages/gtdb-rs202_genbank.taxonomy.db

== This is sourmash version 4.2.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 882562 entries.
saving to 'lineages/gtdb-rs202_genbank.taxonomy.db', format sql...
done!


> Note that the **order is important if the databases contain overlapping
> accession identifiers**. In this case, GTDB contains only a subset of all identifiers
> in the NCBI taxonomy. Putting GTDB last here allows the GTDB lineage information
> to override the lineage information provided in the NCBI files, thus using GTDB
> taxonomy when available, and NCBI lineages for all other accessions.
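
The effect of file order can be pictured with plain Python dictionaries. This is only a model of the behaviour described in the note, not sourmash's code: the accessions `ACC_0001` and `ACC_0002`, the lineage strings, and the `combine` helper are all hypothetical.

```python
# Hypothetical accessions and lineages, invented to illustrate precedence only.
ncbi_lineages = {
    "ACC_0001": "Bacteria;Proteobacteria",
    "ACC_0002": "Eukaryota;Apicomplexa",
}
gtdb_lineages = {
    "ACC_0001": "d__Bacteria;p__Proteobacteria",
}


def combine(*taxonomies):
    """Merge taxonomies in order; later ones override earlier ones on shared identifiers."""
    combined = {}
    for taxonomy in taxonomies:
        combined.update(taxonomy)
    return combined


# GTDB is given last, so its lineage wins for the shared accession.
combined = combine(ncbi_lineages, gtdb_lineages)
print(combined["ACC_0001"])  # GTDB-style lineage
print(combined["ACC_0002"])  # only present in the NCBI mapping
```

Swapping the argument order would leave `ACC_0001` with the NCBI-style lineage instead, which is why GTDB is listed last in the `tax prepare` command above.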

In [7]:
%%bash
ls -lsrht lineages/

total 566776
 73856 -rw-r--r--  1 tessa  staff    35M Aug  2 15:16 gtdb-rs202.taxonomy.csv
196736 -rw-r--r--  1 tessa  staff    83M Aug  2 15:16 bacteria_genbank_lineages.csv
   256 -rw-r--r--  1 tessa  staff   107K Aug  2 15:16 protozoa_genbank_lineages.csv
295928 -rw-r--r--  1 tessa  staff   139M Aug  2 15:17 gtdb-rs202_genbank.taxonomy.db


We'll use this prepared database in each of the commands below.

## `sourmash tax metagenome`

In [8]:
# to see the arguments, run the `--help` like so:
! sourmash tax metagenome --help

== This is sourmash version 4.2.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

usage: 

    sourmash tax metagenome --gather-csv <gather_csv> [ ... ] --taxonomy-csv <taxonomy-csv> [ ... ]

The 'tax metagenome' command reads in metagenome gather result CSVs and
summarizes by taxonomic lineage.

The default output format consists of four columns,
 `query_name,rank,fraction,lineage`, where `fraction` is the fraction
 of the query matched to that reported rank and lineage. The summarization
 is reported for each taxonomic rank.

Alternatively, you can output results at a specific rank (e.g. species)
in `krona` or `lineage_summary` formats.

Please see the 'tax metagenome' documentation for more details:
  https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-tax-metagenome-summarize-metagenome-content-from-gather-results

optional arguments:
  -h, --help            show this help message and exit
  -g [GATHER_CSV ...], --gather-csv [GATHER_C

#### Download a small demo `sourmash gather` output file from metagenome `HSMA33MX`.
This `gather` was run at a DNA ksize of 31 against both GTDB and our legacy Genbank database.

In [9]:
%%bash
mkdir -p gather
curl -L https://osf.io/xb8jg/download -o gather/HSMA33MX_gather_x_gtdbrs202_genbank_k31.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   459  100   459    0     0   1201      0 --:--:-- --:--:-- --:--:--  1201
100  2662  100  2662    0     0   1928      0  0:00:01  0:00:01 --:--:--     0


Take a look at this gather file:

In [10]:
%%bash
head gather/HSMA33MX_gather_x_gtdbrs202_genbank_k31.csv

intersect_bp,f_orig_query,f_match,f_unique_to_query,f_unique_weighted,average_abund,median_abund,std_abund,name,filename,md5,f_match_orig,unique_intersect_bp,gather_result_rank,remaining_bp,query_filename,query_name,query_md5,query_bp
442000,0.08815317112086159,0.08438335242458954,0.08815317112086159,0.05815279361459521,1.6153846153846154,1.0,1.1059438185997785,"GCF_001881345.1 Escherichia coli strain=SF-596, ASM188134v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,683df1ec13872b4b98d59e98b355b52c,0.042779713511420826,442000,0,4572000,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,HSMA33MX,9687eeed,5014000
390000,0.07778220981252493,0.10416666666666667,0.07778220981252493,0.050496823586903404,1.5897435897435896,1.0,0.8804995294906566,"GCF_009494285.1 Prevotella copri strain=iAK1218, ASM949428v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,1266c86141e3a5603da61f57dd863ed0,0.052236806857755155,390000,1,4182000,outputs/abundtrim/HSMA33MX.abundtr

### summarize this metagenome and produce `krona` output at the `species` level:

In [11]:
%%bash
sourmash tax metagenome --gather-csv gather/HSMA33MX_gather_x_gtdbrs202_genbank_k31.csv \
                        --taxonomy  lineages/gtdb-rs202_genbank.taxonomy.db \
                        --output-format csv_summary krona --rank species \
                        --output-base HSMA33MX.gather-tax

== This is sourmash version 4.2.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 6 gather results.
of 6, missed 0 lineage assignments.
loaded results from 1 gather CSVs
saving `csv_summary` output to HSMA33MX.gather-tax.summarized.csv.
saving `krona` output to HSMA33MX.gather-tax.krona.tsv.


#### This will produce both `csv_summary` and `krona` output files, with the basename `HSMA33MX.gather-tax`:

In [12]:
%%bash
ls HSMA33MX.gather-tax*

HSMA33MX.gather-tax.krona.tsv
HSMA33MX.gather-tax.summarized.csv


In [13]:
%%bash
head HSMA33MX.gather-tax.summarized.csv

query_name,rank,fraction,lineage,query_md5,query_filename,f_weighted_at_rank,bp_match_at_rank
HSMA33MX,superkingdom,0.2042281611487834,d__Bacteria,9687eeed,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,0.13080306238801107,1024000
HSMA33MX,superkingdom,0.051455923414439574,Eukaryota,9687eeed,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,0.24482814790682522,258000
HSMA33MX,superkingdom,0.7443159154367771,unclassified,9687eeed,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,0.6243687897051637,3732000
HSMA33MX,phylum,0.11607499002792182,d__Bacteria;p__Bacteroidota,9687eeed,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,0.07265026877341586,582000
HSMA33MX,phylum,0.08815317112086159,d__Bacteria;p__Proteobacteria,9687eeed,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,0.05815279361459521,442000
HSMA33MX,phylum,0.051455923414439574,Eukaryota;Apicomplexa,9687eeed,outputs/abundtrim/HSMA33MX.abundtrim.fq.gz,0.24482814790682522,258000
HSMA33MX,phylum,0.7443159154367771,unclassified,9687eeed,outputs/abundtrim/HSMA33M

In [14]:
%%bash
head HSMA33MX.gather-tax.krona.tsv

fraction	superkingdom	phylum	class	order	family	genus	species
0.0885520542481053	d__Bacteria	p__Bacteroidota	c__Bacteroidia	o__Bacteroidales	f__Bacteroidaceae	g__Prevotella	s__Prevotella copri
0.08815317112086159	d__Bacteria	p__Proteobacteria	c__Gammaproteobacteria	o__Enterobacterales	f__Enterobacteriaceae	g__Escherichia	s__Escherichia coli
0.041084962106102914	Eukaryota	Apicomplexa	Aconoidasida	Haemosporida	Plasmodiidae	Plasmodium	Plasmodium vivax
0.027522935779816515	d__Bacteria	p__Bacteroidota	c__Bacteroidia	o__Bacteroidales	f__Bacteroidaceae	g__Phocaeicola	s__Phocaeicola vulgatus
0.010370961308336658	Eukaryota	Apicomplexa	Conoidasida	Eucoccidiorida	Sarcocystidae	Toxoplasma	Toxoplasma gondii
0.7443159154367771	unclassified	unclassified	unclassified	unclassified	unclassified	unclassified	unclassified
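
The species-level krona rows can be rolled back up to coarser ranks as a sanity check against the `csv_summary` output shown earlier. The sketch below uses the fractions and lineages copied from the krona output above; `roll_up` is a helper written for this example, and the unclassified row is stored as a single `unclassified` label for brevity.

```python
import math
from collections import defaultdict

# (fraction, lineage) pairs copied from the krona output above.
krona_rows = [
    (0.0885520542481053, "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella copri"),
    (0.08815317112086159, "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli"),
    (0.041084962106102914, "Eukaryota;Apicomplexa;Aconoidasida;Haemosporida;Plasmodiidae;Plasmodium;Plasmodium vivax"),
    (0.027522935779816515, "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola vulgatus"),
    (0.010370961308336658, "Eukaryota;Apicomplexa;Conoidasida;Eucoccidiorida;Sarcocystidae;Toxoplasma;Toxoplasma gondii"),
    (0.7443159154367771, "unclassified"),
]


def roll_up(rows, depth):
    """Sum fractions over lineages truncated to the first `depth` ranks."""
    totals = defaultdict(float)
    for fraction, lineage in rows:
        key = ";".join(lineage.split(";")[:depth])
        totals[key] += fraction
    return dict(totals)


superkingdom = roll_up(krona_rows, depth=1)
phylum = roll_up(krona_rows, depth=2)
total = sum(fraction for fraction, _ in krona_rows)

print(superkingdom)
print(phylum)
print(math.isclose(total, 1.0))
```

Up to floating-point rounding, the rolled-up values agree with the `superkingdom` and `phylum` rows of `HSMA33MX.gather-tax.summarized.csv`, and the fractions (including `unclassified`) sum to 1.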


In [35]:
# generate krona html
!ktImportText HSMA33MX.gather-tax.krona.tsv

Writing text.krona.html...


### comparing metagenomes with `sourmash tax metagenome`:

We can also download a second metagenome `gather` csv and use `metagenome` to generate a
`lineage_summary` output to compare these samples.


> The lineage summary format is most useful when comparing across metagenome queries.
> Each row is a lineage at the desired reporting rank. The columns are each query used for
> gather, with the fraction match reported for each lineage. This format is commonly used
> as input for many external multi-sample visualization tools.

In [15]:
%%bash

curl -L https://osf.io/nqtgs/download -o gather/PSM7J4EF_gather_x_gtdbrs202_genbank_k31.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   459  100   459    0     0   1180      0 --:--:-- --:--:-- --:--:--  1182
100  2312  100  2312    0     0   1202      0  0:00:01  0:00:01 --:--:--  1742


In [16]:
%%bash
sourmash tax metagenome --gather-csv gather/HSMA33MX_gather_x_gtdbrs202_genbank_k31.csv \
                                     gather/PSM7J4EF_gather_x_gtdbrs202_genbank_k31.csv \
                        --taxonomy  lineages/gtdb-rs202_genbank.taxonomy.db \
                        --output-format lineage_summary --rank species \
                        --output-base HSMA33MX-PSM7J4EF.gather-tax

== This is sourmash version 4.2.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 6 gather results.
of 6, missed 0 lineage assignments.
loaded 5 gather results.
of 5, missed 0 lineage assignments.
loaded results from 2 gather CSVs
saving `lineage_summary` output to HSMA33MX-PSM7J4EF.gather-tax.lineage_summary.tsv.


In [17]:
%%bash
head HSMA33MX-PSM7J4EF.gather-tax.lineage_summary.tsv

lineage	HSMA33MX	PSM7J4EF
Eukaryota;Apicomplexa;Aconoidasida;Haemosporida;Plasmodiidae;Plasmodium;Plasmodium vivax	0.041084962106102914	0.004553734061930784
Eukaryota;Apicomplexa;Conoidasida;Eucoccidiorida;Sarcocystidae;Toxoplasma;Toxoplasma gondii	0.010370961308336658	0.0011275912915257177
d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides fragilis	0	0.05134877266024807
d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides ovatus	0	0.056726515742909184
d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola dorei	0	0.10625379477838494
d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola vulgatus	0.027522935779816515	0
d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella copri	0.0885520542481053	0
d__Bacteria;p__Proteob

Note, these are mini gather results, so your unclassified fraction will hopefully be much smaller!
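
Because the lineage summary is a plain tab-separated table, comparing samples takes only a few lines of Python. The sketch below parses the complete rows copied from the output above (the truncated last row is left out) and splits species into those seen in both samples and those seen in only one. The variable names are chosen for this example.

```python
import csv
import io

# Header and complete rows copied from the lineage_summary output above.
lineage_summary_tsv = (
    "lineage\tHSMA33MX\tPSM7J4EF\n"
    "Eukaryota;Apicomplexa;Aconoidasida;Haemosporida;Plasmodiidae;Plasmodium;Plasmodium vivax\t0.041084962106102914\t0.004553734061930784\n"
    "Eukaryota;Apicomplexa;Conoidasida;Eucoccidiorida;Sarcocystidae;Toxoplasma;Toxoplasma gondii\t0.010370961308336658\t0.0011275912915257177\n"
    "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides fragilis\t0\t0.05134877266024807\n"
    "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides ovatus\t0\t0.056726515742909184\n"
    "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola dorei\t0\t0.10625379477838494\n"
    "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola vulgatus\t0.027522935779816515\t0\n"
    "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella copri\t0.0885520542481053\t0\n"
)

shared, only_hsma, only_psm = [], [], []
reader = csv.DictReader(io.StringIO(lineage_summary_tsv), delimiter="\t")
for row in reader:
    species = row["lineage"].split(";")[-1]  # last rank in the lineage string
    in_hsma = float(row["HSMA33MX"]) > 0
    in_psm = float(row["PSM7J4EF"]) > 0
    if in_hsma and in_psm:
        shared.append(species)
    elif in_hsma:
        only_hsma.append(species)
    elif in_psm:
        only_psm.append(species)

print("shared:", shared)
print("only HSMA33MX:", only_hsma)
print("only PSM7J4EF:", only_psm)
```

For the rows shown, the two eukaryotic species appear in both samples, while the bacterial species differ between them.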

## Classifying genomes with `sourmash tax genome`

To illustrate the utility of `tax genome`, let's consider a signature consisting of two different
Shewanella strains, Shewanella baltica OS185 strain=OS185 and Shewanella baltica OS223 strain=OS223.
For simplicity, we gave this query the name "Sb47+63".

When we gather this signature against the gtdb-rs202 representatives database, we see that about 66% of the query matches one strain, and the remaining 34% matches the other:

abbreviated `gather_csv`:

```
f_orig_query,f_unique_to_query,name,query_name
0.664,0.664,"GCF_000021665.1 Shewanella baltica OS223 strain=OS223, ASM2166v1",Sb47+63
0.656,0.335,"GCF_000017325.1 Shewanella baltica OS185 strain=OS185, ASM1732v1",Sb47+63
```
> Here, `f_orig_query` shows that independently, both strains match ~65% of this mixed query.
> The `f_unique_to_query` column has the results of gather-style decomposition. As the OS223 strain
> had a slightly higher `f_orig_query` (66%), it was the first match. The remaining ~34% of the query
> matched to strain OS185.

#### download the gather results

In [18]:
%%bash

curl -L https://osf.io/pgsc2/download -o gather/Sb47+63_x_gtdb-rs202.gather.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   459  100   459    0     0   1267      0 --:--:-- --:--:-- --:--:--  1267
100   820  100   820    0     0    738      0  0:00:01  0:00:01 --:--:--     0


In [19]:
%%bash

head gather/Sb47+63_x_gtdb-rs202.gather.csv

intersect_bp,f_orig_query,f_match,f_unique_to_query,f_unique_weighted,average_abund,median_abund,std_abund,name,filename,md5,f_match_orig,unique_intersect_bp,gather_result_rank,remaining_bp,query_filename,query_name,query_md5,query_bp
5238000,0.6642150646715699,1.0,0.6642150646715699,0.6642150646715699,,,,"GCF_000021665.1 Shewanella baltica OS223 strain=OS223, ASM2166v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,38729c6374925585db28916b82a6f513,1.0,5238000,0,2648000,,Sb47+63,491c0a81,7886000
5177000,0.6564798376870403,0.5114931427467645,0.3357849353284301,0.3357849353284301,,,,"GCF_000017325.1 Shewanella baltica OS185 strain=OS185, ASM1732v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,09a08691ce52952152f0e866a59f6261,1.0,2648000,1,0,,Sb47+63,491c0a81,7886000
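
The fraction columns in these two rows can be reproduced from the base-pair columns. The relationship used below (`intersect_bp / query_bp` and `unique_intersect_bp / query_bp`) is inferred from the numbers shown here rather than taken from the sourmash documentation, so treat it as a consistency check on this example only.

```python
import math

# Values copied from the gather CSV rows above.
query_bp = 7886000
matches = [
    {"ident": "GCF_000021665.1", "intersect_bp": 5238000, "unique_intersect_bp": 5238000},
    {"ident": "GCF_000017325.1", "intersect_bp": 5177000, "unique_intersect_bp": 2648000},
]

for match in matches:
    # Fraction of the query shared with this genome, ignoring other matches.
    match["f_orig_query"] = match["intersect_bp"] / query_bp
    # Fraction of the query assigned to this genome after earlier matches are removed.
    match["f_unique_to_query"] = match["unique_intersect_bp"] / query_bp
    print(match["ident"], match["f_orig_query"], match["f_unique_to_query"])

# The non-overlapping fractions add up to the whole query.
total_unique = sum(match["f_unique_to_query"] for match in matches)
print(math.isclose(total_unique, 1.0))
```

The overlapping fractions (`f_orig_query`) sum to more than 1 because both strains share much of their sequence, whereas the unique fractions partition the query.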


#### Now, let's run `tax genome` classification:

In [20]:
%%bash
# to see the arguments, run the `--help` like so:
sourmash tax genome --help

usage: 

    sourmash tax genome --gather-csv <gather_csv> [ ... ] --taxonomy-csv <taxonomy-csv> [ ... ]

The 'tax genome' command reads in genome gather result CSVs and reports likely
classification for each query genome.

By default, classification uses a containment threshold of 0.1, meaning at least
10 percent of the query was covered by matches with the reported taxonomic rank and lineage.
You can specify an alternate classification threshold or force classification by
taxonomic rank instead, e.g. at species or genus-level.

The default output format consists of five columns,
 `query_name,status,rank,fraction,lineage`, where `fraction` is the fraction
 of the query matched to the reported rank and lineage. The `status` column
 provides additional information on the classification, and can be:
  - `match` - this query was classified
  - `nomatch`- this query could not be classified
  - `below_threshold` - this query was classified at the specified rank,
     but the query fraction 

== This is sourmash version 4.2.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==



In [21]:
%%bash

sourmash tax genome --gather-csv gather/Sb47+63_x_gtdb-rs202.gather.csv \
                    --taxonomy lineages/gtdb-rs202_genbank.taxonomy.db \
                    --output-base Sb47+63.gather-tax

== This is sourmash version 4.2.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 2 gather results.
of 2, missed 0 lineage assignments.
loaded results from 1 gather CSVs
saving `classification` output to Sb47+63.gather-tax.classifications.csv.


The default output format is `csv_summary`.

This outputs a CSV with the taxonomic classification for each query genome. In sourmash 4.2.1, this output consists of nine columns:
`query_name`, `status`, `rank`, `fraction`, `lineage`, `query_md5`, `query_filename`, `f_weighted_at_rank`, `bp_match_at_rank`, where `fraction` is the fraction of the query matched to
the reported rank and lineage. The `status` column provides additional information on the classification:

- `match` - this query was classified
- `nomatch` - this query could not be classified
- `below_threshold` - this query was classified at the specified rank,
but the query fraction matched was below the containment threshold
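
One way to picture how the three `status` values arise is the sketch below. It is not sourmash's code: `classify` and its input structure are invented for this example, and the choice to walk from `species` upward and to report `below_threshold` at the broadest available rank when nothing reaches the threshold is an assumption. Only the 0.1 default threshold and the meaning of the status values come from the help text above.

```python
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def classify(best_at_rank, threshold=0.1, rank=None):
    """Return (status, rank, fraction, lineage) for one query genome.

    `best_at_rank` maps a rank name to (lineage, fraction) for the
    best-supported lineage at that rank (hypothetical input structure).
    """
    if not best_at_rank:
        return ("nomatch", "", 0.0, "")
    if rank is not None:
        # Forced rank: always report at `rank`, flagging weak support.
        lineage, fraction = best_at_rank[rank]
        status = "match" if fraction >= threshold else "below_threshold"
        return (status, rank, fraction, lineage)
    # Assumed default: most specific rank whose best lineage meets the threshold.
    for candidate in reversed(RANKS):
        if candidate in best_at_rank:
            lineage, fraction = best_at_rank[candidate]
            if fraction >= threshold:
                return ("match", candidate, fraction, lineage)
    # Assumption: nothing met the threshold, so report the broadest available rank.
    top = next(r for r in RANKS if r in best_at_rank)
    lineage, fraction = best_at_rank[top]
    return ("below_threshold", top, fraction, lineage)


# Species-level lineage and fraction taken from the Sb47+63 classification output.
shewanella = {
    "species": (
        "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;"
        "f__Shewanellaceae;g__Shewanella;s__Shewanella baltica",
        1.0,
    ),
}
result = classify(shewanella)
print(result[0], result[1], result[2])
```

With the default threshold, the Sb47+63 query is reported as a `match` at `species`, in line with the classification file.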

In [22]:
!ls Sb47+63.gather-tax*

Sb47+63.gather-tax.classifications.csv


In [23]:
!head Sb47+63.gather-tax.classifications.csv

query_name,status,rank,fraction,lineage,query_md5,query_filename,f_weighted_at_rank,bp_match_at_rank
Sb47+63,match,species,1.0,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Shewanellaceae;g__Shewanella;s__Shewanella baltica,491c0a81,,1.0,7886000.0


> Here, we see that the match percentages to both strains have been aggregated, and we have 100% species-level
> Shewanella baltica annotation (fraction = 1.0).

## `sourmash tax annotate`

`sourmash tax annotate` adds a column with taxonomic lineage information for each database match to gather output.
It does not do any LCA summarization or classification. The results from `annotate` are not required for any other
`tax` command, but may be useful if you're doing your own exploration of `gather` results.
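
Conceptually, annotation is a lookup from each match to its lineage. The sketch below assumes the identifier is the first whitespace-separated token of the gather `name` column and that unmatched rows get an empty lineage; both are assumptions of this example, and `annotate` is a hypothetical helper rather than the sourmash function. The names and the species lineage are taken from the Sb47+63 results above.

```python
SB_LINEAGE = (
    "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;"
    "f__Shewanellaceae;g__Shewanella;s__Shewanella baltica"
)

# Identifier -> lineage mapping for the two matched genomes.
taxonomy = {
    "GCF_000021665.1": SB_LINEAGE,
    "GCF_000017325.1": SB_LINEAGE,
}

# Two columns from the gather results; the real file has many more.
gather_rows = [
    {"name": "GCF_000021665.1 Shewanella baltica OS223 strain=OS223, ASM2166v1",
     "f_unique_to_query": 0.6642150646715699},
    {"name": "GCF_000017325.1 Shewanella baltica OS185 strain=OS185, ASM1732v1",
     "f_unique_to_query": 0.3357849353284301},
]


def annotate(rows, taxonomy):
    """Return copies of `rows` with an added `lineage` column."""
    annotated = []
    for row in rows:
        ident = row["name"].split(" ")[0]  # assumed: accession is the first token
        annotated.append({**row, "lineage": taxonomy.get(ident, "")})
    return annotated


annotated_rows = annotate(gather_rows, taxonomy)
for row in annotated_rows:
    print(row["name"].split(" ")[0], row["lineage"].split(";")[-1])
```

No fractions are summed here, which matches the statement that `annotate` performs no summarization or classification.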

Let's annotate a previously downloaded `gather` file:

In [24]:
%%bash
sourmash tax annotate --gather-csv gather/Sb47+63_x_gtdb-rs202.gather.csv \
                      --taxonomy lineages/gtdb-rs202_genbank.taxonomy.db

== This is sourmash version 4.2.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 2 gather results.
of 2, missed 0 lineage assignments.
loaded results from 1 gather CSVs
saving `annotate` output to Sb47+63_x_gtdb-rs202.gather.with-lineages.csv.


In [25]:
!head Sb47+63_x_gtdb-rs202.gather.with-lineages.csv

intersect_bp,f_orig_query,f_match,f_unique_to_query,f_unique_weighted,average_abund,median_abund,std_abund,name,filename,md5,f_match_orig,unique_intersect_bp,gather_result_rank,remaining_bp,query_filename,query_name,query_md5,query_bp,lineage
5238000,0.6642150646715699,1.0,0.6642150646715699,0.6642150646715699,,,,"GCF_000021665.1 Shewanella baltica OS223 strain=OS223, ASM2166v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,38729c6374925585db28916b82a6f513,1.0,5238000,0,2648000,,Sb47+63,491c0a81,7886000,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Shewanellaceae;g__Shewanella;s__Shewanella baltica
5177000,0.6564798376870403,0.5114931427467645,0.3357849353284301,0.3357849353284301,,,,"GCF_000017325.1 Shewanella baltica OS185 strain=OS185, ASM1732v1",/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip,09a08691ce52952152f0e866a59f6261,1.0,2648000,1,0,,Sb47+63,491c0a81,7886000,d__Bacteria;p__Proteobacteria;c__Gammaproteob