# Converting `sourmash gather` output to the format required for `krona` visualization

Lineage files must be downloaded before they can be used.
Their locations are documented below.

### Paths to lineage files on farm

**Genbank**
```
/home/irber/sourmash_databases/outputs/lca/lineages/protozoa_genbank_lineage.csv 
/home/irber/sourmash_databases/outputs/lca/lineages/fungi_genbank_lineage.csv 
/home/irber/sourmash_databases/outputs/lca/lineages/viral_genbank_lineage.csv
/home/irber/sourmash_databases/outputs/lca/lineages/bacteria_genbank_lineage.csv
/home/irber/sourmash_databases/outputs/lca/lineages/archaea_genbank_lineage.csv
```

**GTDB**
```
/group/ctbrowngrp/gtdb/gtdb-rs202.taxonomy.csv
```

### Download links for lineage files

**Genbank**
```
wget -O viral_genbank_lineage.csv https://osf.io/q2emr/download
wget -O protozoa_genbank_lineage.csv https://osf.io/urtfx/download
wget -O fungi_genbank_lineage.csv https://osf.io/9u6qh/download
wget -O archaea_genbank_lineage.csv https://osf.io/mv5hs/download
wget -O bacteria_genbank_lineage.csv https://osf.io/cbhgd/download
```


**GTDB**
```
wget -O gtdb-rs202.taxonomy.v2.csv https://osf.io/p6z3w/download
```
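The same downloads can be scripted from Python, which may be convenient inside a notebook. The following is a minimal sketch using only the standard library. The file names and URLs are the ones listed above; `LINEAGE_URLS` and `download_lineages` are hypothetical names introduced here for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve

# File names and OSF links copied from the wget commands above.
LINEAGE_URLS = {
    "viral_genbank_lineage.csv": "https://osf.io/q2emr/download",
    "protozoa_genbank_lineage.csv": "https://osf.io/urtfx/download",
    "fungi_genbank_lineage.csv": "https://osf.io/9u6qh/download",
    "archaea_genbank_lineage.csv": "https://osf.io/mv5hs/download",
    "bacteria_genbank_lineage.csv": "https://osf.io/cbhgd/download",
    "gtdb-rs202.taxonomy.v2.csv": "https://osf.io/p6z3w/download",
}


def download_lineages(dest_dir=".", urls=LINEAGE_URLS, fetch=urlretrieve):
    """Download each lineage file into dest_dir, skipping files already present."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, url in urls.items():
        target = dest / name
        if not target.exists():
            # fetch(url, filename) defaults to urllib's urlretrieve.
            fetch(url, str(target))
        paths.append(target)
    return paths
```

Calling `download_lineages("lineages")` would place all six files in a `lineages/` directory (an assumed location). The `fetch` parameter exists only so the function can be exercised without network access.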

## Summarize gather to lineage

In [1]:
!wget -O viral_genbank_lineage.csv https://osf.io/q2emr/download
!wget -O protozoa_genbank_lineage.csv https://osf.io/urtfx/download
!wget -O fungi_genbank_lineage.csv https://osf.io/9u6qh/download
!wget -O archaea_genbank_lineage.csv https://osf.io/mv5hs/download
!wget -O bacteria_genbank_lineage.csv https://osf.io/cbhgd/download

--2021-05-24 16:21:33--  https://osf.io/q2emr/download
Resolving osf.io (osf.io)... 35.190.84.173
Connecting to osf.io (osf.io)|35.190.84.173|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://files.osf.io/v1/resources/t3fqa/providers/osfstorage/60ac309aa480c602300d3c01?action=download&direct&version=1 [following]
--2021-05-24 16:21:33--  https://files.osf.io/v1/resources/t3fqa/providers/osfstorage/60ac309aa480c602300d3c01?action=download&direct&version=1
Resolving files.osf.io (files.osf.io)... 35.186.214.196
Connecting to files.osf.io (files.osf.io)|35.186.214.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4476110 (4.3M) [text/csv]
Saving to: ‘viral_genbank_lineage.csv’


2021-05-24 16:21:34 (17.9 MB/s) - ‘viral_genbank_lineage.csv’ saved [4476110/4476110]

--2021-05-24 16:21:34--  https://osf.io/urtfx/download
Resolving osf.io (osf.io)... 35.190.84.173
Connecting to osf.io (osf.io)|35.190.84.173|:443... connected.
[output truncated]

In [2]:
!cat *genbank_lineage.csv > all_genbank_lineages.csv
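Note that `cat` copies every file's header row into the combined file. Whether `gather-to-tax.py` tolerates the repeated header lines is not stated here. If a single header is preferred, the files could be combined with pandas instead. This is a sketch, assuming all of the lineage files share the same columns; `combine_lineages` is a hypothetical helper, not part of sourmash.

```python
import glob

import pandas as pd


def combine_lineages(pattern="*genbank_lineage.csv", out="all_genbank_lineages.csv"):
    """Concatenate the lineage CSVs matching pattern, writing a single header row."""
    files = sorted(glob.glob(pattern))
    # Read everything as text so accessions and taxon names are not reinterpreted.
    frames = [pd.read_csv(f, dtype=str) for f in files]
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(out, index=False)
    return combined
```

The default output name does not match the input pattern (`lineages` versus `lineage`), so rerunning the function does not fold the combined file back into itself.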

In [4]:
!scripts/gather-to-tax.py outputs/gather/HSMA33MX_gather_x_genbank_k31.csv all_genbank_lineages.csv > outputs/gather/HSMA33MX_gather_x_genbank_k31_lineage.txt

examining spreadsheet headers...
** assuming column 'accession' is identifiers in spreadsheet


## Read in summarized lineages and format for krona

In [5]:
import pandas as pd

In [6]:
# Equivalent R code to read in the lineages using tidyverse
# read_delim("outputs/gather/HSMA33MX_gather_x_genbank_k31_lineage.txt", delim = ";", skip=3, col_names = c("tmp", "phylum", "class", "order", "family", "genus", "species")) %>%
#  separate(col = tmp, into = c("level", "fraction", "superkingdom"), sep = " ") %>%
#  mutate(lineage = paste(superkingdom, phylum, class, order, family, genus, species, sep = ";"))

In [20]:
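# Read in the lineage summary. Each data line has the form
# "level fraction superkingdom;phylum;...", so splitting on ";" leaves the
# first three values packed together in the "tmp" column.
# skiprows=4 skips the lines that precede the lineage rows.
lineage = pd.read_csv("outputs/gather/HSMA33MX_gather_x_genbank_k31_lineage.txt",
                      sep=";", skiprows=4, header=None,
                      names=["tmp", "phylum", "class", "order", "family", "genus", "species"])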

In [21]:
lineage.head()

| | tmp | phylum | class | order | family | genus | species |
|---|---|---|---|---|---|---|---|
| 0 | superkingdom 0.245 Eukaryota | | | | | | |
| 1 | superkingdom 0.131 Bacteria | | | | | | |
| 2 | phylum 0.245 Eukaryota | Apicomplexa | | | | | |
| 3 | phylum 0.073 Bacteria | Bacteroidetes | | | | | |
| 4 | phylum 0.058 Bacteria | Proteobacteria | | | | | |


In [22]:
# Split the packed "tmp" column ("level fraction superkingdom") into three
# columns. n=2 limits the split to the first two spaces.
lineage[['level', 'fraction', 'superkingdom']] = lineage['tmp'].str.split(" ", n=2, expand=True)
# Put the columns in rank order and drop "tmp".
cols = ['level', 'fraction', 'superkingdom', "phylum", "class", "order", "family", "genus", "species"]
lineage = lineage[cols]
lineage.head()

| | level | fraction | superkingdom | phylum | class | order | family | genus | species |
|---|---|---|---|---|---|---|---|---|---|
| 0 | superkingdom | 0.245 | Eukaryota | | | | | | |
| 1 | superkingdom | 0.131 | Bacteria | | | | | | |
| 2 | phylum | 0.245 | Eukaryota | Apicomplexa | | | | | |
| 3 | phylum | 0.073 | Bacteria | Bacteroidetes | | | | | |
| 4 | phylum | 0.058 | Bacteria | Proteobacteria | | | | | |
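The two parsing steps can be tried on a few lines without the full output file. This is a sketch, assuming each line has the form `level fraction superkingdom;phylum;...` as the tables above suggest. The sample lines are reconstructed from those tables, and `parse_lineage_lines` is a hypothetical helper.

```python
import io

import pandas as pd

RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def parse_lineage_lines(text):
    """Parse 'level fraction superkingdom;phylum;...' lines into a DataFrame."""
    # Short lines (higher ranks) are padded with NaN for the missing ranks.
    df = pd.read_csv(io.StringIO(text), sep=";", header=None,
                     names=["tmp"] + RANKS[1:])
    # The first field packs three space-separated values; split at most twice.
    parts = df["tmp"].str.split(" ", n=2, expand=True)
    df["level"] = parts[0]
    df["fraction"] = parts[1].astype(float)
    df["superkingdom"] = parts[2]
    return df[["level", "fraction"] + RANKS]


# Sample lines reconstructed from the tables above, for illustration only.
sample = "\n".join([
    "superkingdom 0.245 Eukaryota",
    "phylum 0.245 Eukaryota;Apicomplexa",
    "species 0.222 Eukaryota;Apicomplexa;Aconoidasida;Haemosporida;Plasmodiidae;Plasmodium;Plasmodium vivax",
])
print(parse_lineage_lines(sample))
```

Unlike the notebook cell, this version converts `fraction` to a float, which makes later sums and comparisons straightforward.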


In [23]:
# Keep only the species-level rows.
# Rows summarized at higher ranks are excluded by this filter, so any fraction
# that was not classified down to species is left out of the Krona chart.
species = lineage[lineage['level'] == "species"]
species.head()

| | level | fraction | superkingdom | phylum | class | order | family | genus | species |
|---|---|---|---|---|---|---|---|---|---|
| 23 | species | 0.222 | Eukaryota | Apicomplexa | Aconoidasida | Haemosporida | Plasmodiidae | Plasmodium | Plasmodium vivax |
| 24 | species | 0.058 | Bacteria | Proteobacteria | Gammaproteobacteria | Enterobacterales | Enterobacteriaceae | Escherichia | Escherichia coli |
| 25 | species | 0.057 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella | Prevotella copri |
| 26 | species | 0.023 | Eukaryota | Apicomplexa | Conoidasida | Eucoccidiorida | Sarcocystidae | Toxoplasma | Toxoplasma gondii |
| 27 | species | 0.016 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Bacteroidaceae | Bacteroides | Bacteroides vulgatus |
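One way to check how much the species filter leaves out is to compare the summed fraction at each rank. If the species total is lower than the superkingdom total, the difference corresponds to matches that were not classified to species. This is a sketch on a subset of the rows shown above, used for illustration only, so the printed difference says nothing about the real sample. `rank_totals` is a hypothetical helper.

```python
import pandas as pd


def rank_totals(lineage):
    """Sum the fraction column for each taxonomic level."""
    # The fraction column may still hold strings after splitting, so convert first.
    fractions = pd.to_numeric(lineage["fraction"])
    return fractions.groupby(lineage["level"]).sum()


# Illustrative subset of the rows shown above, not the full table.
rows = pd.DataFrame({
    "level": ["superkingdom", "superkingdom", "species", "species"],
    "fraction": ["0.245", "0.131", "0.222", "0.058"],
})
totals = rank_totals(rows)
dropped = totals["superkingdom"] - totals["species"]
print(totals)
print(f"fraction not represented at species level: {dropped:.3f}")
```

On the full table the same call would be `rank_totals(lineage)`.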


In [28]:
# Format for Krona.
# ktImportText creates a chart from a text file that lists values and the wedge hierarchies to add them to.
# Each line should be an optional quantity followed by a list of wedges to contribute to
# (starting from the highest level), separated by tabs.
# If a hierarchy is listed more than once, the quantities are added.
cols = ['fraction', 'superkingdom', "phylum", "class", "order", "family", "genus", "species"]
species = species[cols]
# header=False: the ktImportText format has no header row, so the column names
# would otherwise be read as a data line.
species.to_csv("outputs/gather/HSMA33MX_gather_x_genbank_k31_krona.txt", sep="\t", index=False, header=False)

In [29]:
!ktImportText outputs/gather/HSMA33MX_gather_x_genbank_k31_krona.txt

Writing text.krona.html...
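By default the chart is written to `text.krona.html` in the working directory, so running the command for a second sample would overwrite the first chart. `ktImportText` accepts `-o` to name the output file. The following is a sketch of calling it from Python. `krona_command` and `run_krona` are hypothetical wrappers, and the `.html` path in the usage note is an assumed naming choice rather than part of the original workflow.

```python
import shutil
import subprocess


def krona_command(text_file, html_file):
    """Build the ktImportText command line with an explicit output file."""
    return ["ktImportText", "-o", html_file, text_file]


def run_krona(text_file, html_file):
    """Run ktImportText if it is on PATH; return False when it is not installed."""
    if shutil.which("ktImportText") is None:
        return False
    # check=True raises CalledProcessError if ktImportText exits with an error.
    subprocess.run(krona_command(text_file, html_file), check=True)
    return True
```

For the sample used here, the call might look like `run_krona("outputs/gather/HSMA33MX_gather_x_genbank_k31_krona.txt", "outputs/gather/HSMA33MX_gather_x_genbank_k31_krona.html")`.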


![Screenshot of the resulting Krona chart](_static/krona_screenshot.png)