# Using taxonomic information from GTDB

Since we have the accession numbers from NCBI, it is straightforward to map them to the taxonomy inferred by [GTDB](https://gtdb.ecogenomic.org/). We downloaded the [bac120_metadata_r89.tsv](https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/bac120_metadata_r89.tsv) file (200MB), which lets us work with the 'original' taxonomies (from NCBI) or with the taxonomies inferred by GTDB or SILVA.
We do not distribute this table (nor the smaller one we created by hand with the relevant columns only), but once you download it you can reproduce this notebook in a few seconds.

The main output of this notebook is at the end, where we create a list of the downloaded RefSeq sequences (identified by a `GCF` accession number) that also belong to the GTDB data set (meaning that they passed its quality controls).

### Read csv file with information from our pilot sequences
* this step is not necessary, but we want to compare the inferred taxonomy against our own samples
* by 'pilot' sequences we mean those from the [first version of the study](https://www.biorxiv.org/content/10.1101/626093v1)

In [2]:
# read csv with our samples, just for comparison
#a1<-read.csv("pilot/016_results/all.csv",colClasses="character")
a1<-read.csv("../notebooks/016_results/all.csv",colClasses="character")
head(a1)

X,Accession.Number,Consensus.16S,Consensus.23S,Consensus.5S,Copies.16S,Copies.23S,Copies.5S,File.Name,Longest.16S,Longest.23S,Longest.5S,Organism,Species
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
0,GCF_002736045.1,1547,2917,116,7,7,8,GCF_002736045.1_ASM273604v1_genomic.gbff.gz,1547,2916,116,Pseudomonas putida,Pseudomonas putida
1,GCF_003589865.1,1554,3109,116,7,7,8,GCF_003589865.1_ASM358986v1_genomic.gbff.gz,1552,2929,116,Salmonella enterica subsp. enterica serovar Dublin,Salmonella enterica
2,GCF_003180975.1,1554,2933,116,7,7,8,GCF_003180975.1_ASM318097v1_genomic.gbff.gz,1554,2930,116,Escherichia coli,Escherichia coli
3,GCF_001516165.2,1544,2912,116,4,4,4,GCF_001516165.2_ASM151616v2_genomic.gbff.gz,1544,2911,116,Pseudomonas aeruginosa,Pseudomonas aeruginosa
4,GCF_001697305.1,1551,2898,114,4,4,4,GCF_001697305.1_ASM169730v1_genomic.gbff.gz,1551,2898,114,Neisseria meningitidis,Neisseria meningitidis
5,GCF_001661115.1,1551,2898,114,4,4,4,GCF_001661115.1_ASM166111v1_genomic.gbff.gz,1551,2898,114,Neisseria gonorrhoeae,Neisseria gonorrhoeae


### Read taxonomy inference from GTDB
* The file `bac120_metadata_r89_taxonomy_only.tsv` contains only the relevant columns of the downloaded table (we deleted the other columns, most of them statistics, to save space); it is not distributed with this repository.
* The code should also work with the original spreadsheet, provided the column names don't change.
* To replicate this part of the analysis, [download the original spreadsheet](https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/bac120_metadata_r89.tsv) yourself.
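
A minimal sketch of how a reduced table like this could be produced from the full download, assuming the original file contains the nine column names that appear in the output below. The helper name `reduce_gtdb_metadata` and the paths in the usage comment are hypothetical; the reduced file used in this notebook was created by hand.

```r
# Hypothetical helper: keep only the taxonomy-related columns of the GTDB metadata table.
reduce_gtdb_metadata <- function(infile, outfile) {
    keep <- c("accession", "gtdb_taxonomy", "lsu_silva_23s_taxonomy",
              "ncbi_genbank_assembly_accession", "ncbi_organism_name",
              "ncbi_taxonomy", "ncbi_taxonomy_unfiltered",
              "ssu_gg_taxonomy", "ssu_silva_taxonomy")
    full <- read.delim(infile, colClasses = "character")
    # Fail early if the release renamed or dropped one of the columns we rely on
    missing_cols <- setdiff(keep, names(full))
    if (length(missing_cols) > 0) {
        stop("columns not found: ", paste(missing_cols, collapse = ", "))
    }
    write.table(full[, keep], outfile, sep = "\t", quote = FALSE, row.names = FALSE)
    invisible(keep)
}

# Usage (paths are assumptions):
# reduce_gtdb_metadata("bac120_metadata_r89.tsv", "../extra/bac120_metadata_r89_taxonomy_only.tsv")
```

Checking the column names up front gives a clear error message if a later GTDB release changes the header.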

In [11]:
a2<-read.delim("../extra/bac120_metadata_r89_taxonomy_only.tsv", colClasses="character")
head(a2)

accession,gtdb_taxonomy,lsu_silva_23s_taxonomy,ncbi_genbank_assembly_accession,ncbi_organism_name,ncbi_taxonomy,ncbi_taxonomy_unfiltered,ssu_gg_taxonomy,ssu_silva_taxonomy
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
RS_GCF_001999625.1,d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Enterococcaceae;g__Enterococcus;s__Enterococcus faecalis,Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis ATCC 29212,GCA_001999625.1,Enterococcus faecalis ATCC 29212,d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Enterococcaceae;g__Enterococcus;s__Enterococcus faecalis,d__Bacteria;x__Terrabacteria group;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Enterococcaceae;g__Enterococcus;s__Enterococcus faecalis;x__Enterococcus faecalis ATCC 29212,k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Enterococcaceae;g__Enterococcus;s__,Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis
RS_GCF_001658645.1,d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus epidermidis,Bacteria;Firmicutes;Bacilli;Bacillales;Stapyhlococcaceae;Staphylococcus;Staphylococcus epidermidis,GCA_001658645.1,Staphylococcus epidermidis,d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus epidermidis,d__Bacteria;x__Terrabacteria group;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus epidermidis,k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae;g__Staphylococcus;s__epidermidis,Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus epidermidis NIHLM088
RS_GCF_900117665.1,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii,Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Acinetobacter;Acinetobacter baumannii,GCA_900117665.1,Acinetobacter baumannii,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;x__Acinetobacter calcoaceticus/baumannii complex;s__Acinetobacter baumannii,k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__,Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Acinetobacter;Acinetobacter baumannii
RS_GCF_000652055.1,d__Bacteria;p__Actinobacteriota;c__Actinobacteria;o__Mycobacteriales;f__Mycobacteriaceae;g__Mycobacterium;s__Mycobacterium tuberculosis,Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Mycobacteriaceae;Mycobacterium;Mycobacterium tuberculosis,GCA_000652055.1,Mycobacterium tuberculosis TKK_03_0108,d__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Corynebacteriales;f__Mycobacteriaceae;g__Mycobacterium;s__Mycobacterium tuberculosis,d__Bacteria;x__Terrabacteria group;p__Actinobacteria;c__Actinobacteria;o__Corynebacteriales;f__Mycobacteriaceae;g__Mycobacterium;x__Mycobacterium tuberculosis complex;s__Mycobacterium tuberculosis;x__Mycobacterium tuberculosis TKK_03_0108,k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Mycobacteriaceae;g__Mycobacterium;s__,Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Mycobacteriaceae;Mycobacterium;Mycobacterium bovis
RS_GCF_003037025.1,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Klebsiella;s__Klebsiella pneumoniae,Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae subsp. pneumoniae,GCA_003037025.1,Klebsiella pneumoniae,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Klebsiella;s__Klebsiella pneumoniae,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Klebsiella;s__Klebsiella pneumoniae,k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Klebsiella;s__,Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Enterobacter;Klebsiella pneumoniae subsp. pneumoniae KPNIH18
RS_GCF_002138225.1,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter pittii,Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Acinetobacter;Acinetobacter pittii,GCA_002138225.1,Acinetobacter pittii,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter pittii,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;x__Acinetobacter calcoaceticus/baumannii complex;s__Acinetobacter pittii,k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__rhizosphaerae,Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Acinetobacter;Acinetobacter oleivorans DR1


### Extracting taxonomic information from columns
We are only interested in the binomial name, so we reduce each taxonomy string to it. A brief explanation of the functions:
* `strsplit()` splits each item of a column at the given separator; splitting at `;s__` yields a vector of up to 2 elements
* `lapply(..., '[', 2)` then keeps only the second element of each vector (usually the string after `;s__`)
* `paste0(..., collapse=" ")` concatenates the strings of a vector, separated by a space
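
To make these three functions concrete, here is a small standalone illustration on single strings taken from the table above. The variable names are ours and exist only for this example.

```r
# One taxonomy string per format (values copied from the table above)
gtdb  <- "d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Enterococcaceae;g__Enterococcus;s__Enterococcus faecalis"
silva <- "Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis ATCC 29212"
gg    <- "k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae;g__Staphylococcus;s__epidermidis"

# GTDB/NCBI style: split at ';s__' and keep the second piece
gtdb_species <- as.character(lapply(strsplit(gtdb, ";s__"), "[", 2))

# SILVA style: ranks carry no prefix, so take the seventh ';'-separated field
silva_species <- as.character(lapply(strsplit(silva, ";"), "[", 7))

# ssu_gg_taxonomy style: genus and epithet are stored separately, so glue them together
gg_tail    <- as.character(lapply(strsplit(gg, ";g__"), "[", 2))
gg_species <- as.character(lapply(strsplit(gg_tail, ";s__"), paste0, collapse = " "))

print(c(gtdb_species, silva_species, gg_species))
```

Note that the SILVA field keeps any strain designation, which is why the SILVA columns in the output are not always strict binomials.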

In [13]:
get_genus_sp <- function (string) {
    x<-unlist(strsplit(string, ';s__')); # now x[2] has the species, or is NA if there is none
    if (is.na(x[2])) { x <- unlist(strsplit(x[1], ';g__')); x[2] <- paste(x[2], "sp."); } # fall back to "<genus> sp."
    x[2];
}

a2$gtdb_taxonomy   <- as.character( lapply(strsplit(a2$gtdb_taxonomy, ';s__'), '[', 2) )
a2$ncbi_taxonomy   <- as.character( lapply(a2$ncbi_taxonomy, get_genus_sp) )

#g_s_tmp <- as.character( lapply(strsplit(a2$ncbi_taxonomy_unfiltered, ';s__'), '[', 2) )
#a2$ncbi_taxonomy_unfiltered <- as.character( lapply(strsplit(g_s_tmp, ';x__'), '[', 2) )

a2$lsu_silva_23s_taxonomy <- as.character( lapply(strsplit(a2$lsu_silva_23s_taxonomy, ';'), '[', 7) )
a2$ssu_silva_taxonomy     <- as.character( lapply(strsplit(a2$ssu_silva_taxonomy,     ';'), '[', 7) )

#g_s_tmp <- as.character( lapply(strsplit(a2$ssu_gg_taxonomy, ';g__'), '[', 2) )
#a2$ssu_gg_taxonomy <- as.character(lapply(strsplit(g_s_tmp, ';s__'), paste0, collapse=" ") )

drops <- c("ncbi_genbank_assembly_accession", "ncbi_taxonomy_unfiltered", "ssu_gg_taxonomy")
a2<-a2[ , !(names(a2) %in% drops)] # remove columns (but we keep comments above in case we want to use them)
head (a2)

accession,gtdb_taxonomy,lsu_silva_23s_taxonomy,ncbi_organism_name,ncbi_taxonomy,ssu_silva_taxonomy
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
RS_GCF_001999625.1,Enterococcus faecalis,Enterococcus faecalis ATCC 29212,Enterococcus faecalis ATCC 29212,Enterococcus faecalis,Enterococcus faecalis
RS_GCF_001658645.1,Staphylococcus epidermidis,Staphylococcus epidermidis,Staphylococcus epidermidis,Staphylococcus epidermidis,Staphylococcus epidermidis NIHLM088
RS_GCF_900117665.1,Acinetobacter baumannii,Acinetobacter baumannii,Acinetobacter baumannii,Acinetobacter baumannii,Acinetobacter baumannii
RS_GCF_000652055.1,Mycobacterium tuberculosis,Mycobacterium tuberculosis,Mycobacterium tuberculosis TKK_03_0108,Mycobacterium tuberculosis,Mycobacterium bovis
RS_GCF_003037025.1,Klebsiella pneumoniae,Klebsiella pneumoniae subsp. pneumoniae,Klebsiella pneumoniae,Klebsiella pneumoniae,Klebsiella pneumoniae subsp. pneumoniae KPNIH18
RS_GCF_002138225.1,Acinetobacter pittii,Acinetobacter pittii,Acinetobacter pittii,Acinetobacter pittii,Acinetobacter oleivorans DR1


### Compare NCBI and GTDB classifications
We use our samples from the pilot analyses. The classifications can be the same, distinct, or "absent" when the sequence did not pass the GTDB quality checks.

In [9]:
missing = 0;
count = 0;
distinct = 0;
#for (i in 1:length(a1[,1])) {  ## around 700 sequences?
for (i in 1:20) {  ## just a small sample
    idx <- grep(a1$Accession.Number[i], a2$accession, fixed = TRUE); # literal match, so '.' is not a regex wildcard
    if (length(idx) > 0) { 
        if (a2$ncbi_taxonomy[idx] != a2$gtdb_taxonomy[idx]) {
            print (paste(a1$Species[i], a2$gtdb_taxonomy[idx], sep=" ; ")); 
            distinct <- distinct + 1;
        }
        else count <- count + 1;
    }
    else missing <- missing + 1;
}
print (paste("same taxon=",count, " distinct taxon=", distinct, " absent from GTDB=", missing));

[1] "Pseudomonas putida ; Pseudomonas_E putida_M"
[1] "Escherichia coli ; Escherichia flexneri"
[1] "Neisseria meningitidis ; Neisseria meningitidis_B"
[1] "Neisseria meningitidis ; Neisseria meningitidis_B"
[1] "Escherichia coli ; Escherichia flexneri"
[1] "Escherichia coli ; Escherichia flexneri"
[1] "Helicobacter pylori ; Helicobacter pylori_C"
[1] "same taxon= 7  distinct taxon= 7  absent from GTDB= 6"
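
The same three-way tally can be sketched without an explicit loop. The example below runs on toy data frames with made-up accessions, and it assumes that GTDB accessions are the NCBI accession with an `RS_` or `GB_` prefix, as in the rows shown earlier. The `strip_suffix` helper is hypothetical; it removes alphabetic suffixes such as `_E` or `_M`, which we take to be GTDB placeholder labels, so that names can also be compared while ignoring them.

```r
# Toy stand-ins for a1 (our samples) and a2 (GTDB table); values are illustrative only
samples <- data.frame(Accession.Number = c("GCF_000000001.1", "GCF_000000002.1", "GCF_000000003.1"),
                      stringsAsFactors = FALSE)
gtdb <- data.frame(accession     = c("RS_GCF_000000001.1", "RS_GCF_000000002.1"),
                   ncbi_taxonomy = c("Pseudomonas putida", "Listeria monocytogenes"),
                   gtdb_taxonomy = c("Pseudomonas_E putida_M", "Listeria monocytogenes"),
                   stringsAsFactors = FALSE)

# Assumption: dropping the 'RS_'/'GB_' prefix recovers the NCBI accession
key <- sub("^(RS|GB)_", "", gtdb$accession)
idx <- match(samples$Accession.Number, key)   # NA when the genome is absent from GTDB

status <- ifelse(is.na(idx), "absent",
          ifelse(gtdb$ncbi_taxonomy[idx] == gtdb$gtdb_taxonomy[idx], "same", "distinct"))
print(table(status))

# Hypothetical helper: remove suffixes like '_E' before comparing names
strip_suffix <- function(x) gsub("_[A-Z]+(?= |$)", "", x, perl = TRUE)
print(strip_suffix(gtdb$gtdb_taxonomy))
```

Whether a suffixed name should count as "the same taxon" is a judgment call; the loop above treats it as distinct.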


### Creating compacted spreadsheet with GTDB data
Since the original spreadsheet is too large, we use the shortened columns created above and select only the genera of interest. This is not the final data set, since it may contain many entries for which we don't have data.

In [15]:
genera <- c("Clostridium", "Enterococcus", "Listeria", "Mycobacterium", "Staphylococcus", "Streptococcus",
  "Campylobacter", "Escherichia", "Helicobacter", "Klebsiella", "Leptospira", "Neisseria", 
  "Pseudomonas", "Salmonella", "Vibrio");
xframe <- data.frame();
for (gn in genera) {
    x1 <- a2[grep(gn, a2$gtdb_taxonomy),]; # only rows that represent this genus
    if (length(xframe) > 0) {xframe <- rbind(xframe, x1); } # bind rows 
    else {xframe <- x1; } # first time we just copy
}
head(xframe)
#write.csv(xframe,"gtdb_all.csv", quote=FALSE) ## file with around 12MB 

,accession,gtdb_taxonomy,lsu_silva_23s_taxonomy,ncbi_organism_name,ncbi_taxonomy,ssu_silva_taxonomy
,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
355,RS_GCF_000968245.1,Clostridium butyricum,Clostridium butyricum 5521,Clostridium sp. IBUN22A,Clostridium sp.,Clostridium butyricum
374,RS_GCF_900104115.1,Clostridium gasigenes,Clostridium sp. 7_2_43FAA,Clostridium gasigenes,Clostridium gasigenes,unidentified
534,RS_GCF_001573435.1,Clostridium_F botulinum,Clostridium botulinum A str. ATCC 19397,Clostridium botulinum,Clostridium botulinum,Clostridium botulinum NCTC 2916
569,GB_GCA_900066815.1,Clostridium_A leptum,[Clostridium] leptum DSM 753,uncultured Ruminococcus sp.,Ruminococcus sp.,uncultured bacterium
750,RS_GCF_001243045.1,Clostridium_F niameyense,Clostridium botulinum A str. ATCC 19397,Clostridium niameyense,Clostridium niameyense,Clostridium sporogenes
974,RS_GCF_001579765.1,Clostridium_P perfringens,Clostridium perfringens B str. ATCC 3626,Clostridium perfringens,Clostridium perfringens,Clostridium perfringens B str. ATCC 3626
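
As an alternative to growing `xframe` with `rbind()` inside a loop, the selection can be sketched as a single pattern match. The toy table and shortened genus list below are illustrative only. Unlike the loop above, this version keeps the original row order, cannot duplicate a row, and requires the genus to be the first word, optionally followed by an alphabetic suffix such as `_F`.

```r
genera <- c("Clostridium", "Vibrio")   # shortened list, for illustration

# Toy stand-in for the a2 table (illustrative values)
toy <- data.frame(gtdb_taxonomy = c("Clostridium_F botulinum", "Vibrio cholerae",
                                    "Vibrionimonas magnilacihabitans", "Listeria innocua"),
                  stringsAsFactors = FALSE)

# Builds '^(Clostridium|Vibrio)(_[A-Z]+)? ': genus, optional suffix, then a space
pattern  <- paste0("^(", paste(genera, collapse = "|"), ")(_[A-Z]+)? ")
selected <- toy[grepl(pattern, toy$gtdb_taxonomy), , drop = FALSE]
print(selected)
```

Anchoring the pattern means a genus whose name merely starts with the same letters (the third toy row) is not selected, whereas an unanchored `grep("Vibrio", ...)` would include it.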


## Saving table with species names for available sequences
This table holds the taxonomies recorded in the GTDB table (GTDB, SILVA, NCBI) for the sequences we downloaded, whose accession numbers are listed in the file `gcf_names.txt`.

In [17]:
seqnames<-read.table("bigdata/gcf_names.txt",colClasses="character")$V1

xframe <- data.frame();
for (acc in seqnames) {
    idx <- grep(acc, a2$accession, fixed = TRUE); # one row, with this sequence (literal match)
    if (length(idx) > 0) {
        a2$accession[idx] <- acc;
        if (length(xframe) > 0) {xframe <- rbind(xframe, a2[idx,]); } # bind rows 
        else {xframe <- a2[idx,]; } # first time we just copy
    }
}
head(xframe)
write.csv(xframe,"bigdata/gtdb_list.csv", quote=FALSE)

,accession,gtdb_taxonomy,lsu_silva_23s_taxonomy,ncbi_organism_name,ncbi_taxonomy,ssu_silva_taxonomy
,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
93263,GCF_000465235.1,Campylobacter_D coli,Campylobacter coli CVM N29710,Campylobacter coli CVM N29710,Campylobacter coli,Campylobacter jejuni 30318
10680,GCF_000494775.1,Campylobacter_D coli,Campylobacter jejuni K5,Campylobacter coli 15-537360,Campylobacter coli,Campylobacter jejuni 30318
28928,GCF_000583755.1,Campylobacter_D coli,Campylobacter coli RM1875,Campylobacter coli RM1875,Campylobacter coli,Campylobacter coli RM1875
16675,GCF_000583795.1,Campylobacter_D coli,Campylobacter coli RM5611,Campylobacter coli RM5611,Campylobacter coli,Campylobacter jejuni 30318
52579,GCF_000954195.1,Campylobacter_D coli,Campylobacter jejuni K5,Campylobacter coli,Campylobacter coli,Campylobacter jejuni 30318
90511,GCF_001417635.1,Campylobacter_D coli,Campylobacter coli RM5611,Campylobacter coli,Campylobacter coli,Campylobacter jejuni 30318
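
Looking accessions up with `grep()` is a substring search, so in principle one accession can hit more than one row. The sketch below uses made-up accessions to show the effect, and an exact-match alternative based on `match()`. It assumes that removing an `RS_` or `GB_` prefix from the GTDB accession recovers the NCBI accession.

```r
# Toy stand-ins for a2$accession and the contents of gcf_names.txt (illustrative values)
gtdb_acc <- c("RS_GCF_000465235.1", "RS_GCF_000465235.10", "GB_GCA_900066815.1")
wanted   <- c("GCF_000465235.1", "GCF_999999999.1")

# Substring search: '...235.1' is also contained in '...235.10'
hits_regex <- grep(wanted[1], gtdb_acc)
hits_fixed <- grep(wanted[1], gtdb_acc, fixed = TRUE)
print(hits_regex)
print(hits_fixed)

# Exact matching gives at most one row per accession, and NA when it is absent
key <- sub("^(RS|GB)_", "", gtdb_acc)
idx <- match(wanted, key)
print(idx)

found <- wanted[!is.na(idx)]   # accessions that are present in the GTDB table
print(found)
```

With the real data, the rows `a2[idx[!is.na(idx)], ]` would then form the table to save, in the order of `gcf_names.txt`.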
