# Converting and updating the main trait database

Last Updated: 2022-04-14  
Quang Nguyen  

This notebook was run in a conda environment with `r-base=4.1.2`, with packages managed via `renv`.

This script takes the raw tables from Madin et al. and Weissman et al. and merges them, together with an updated version of the GOLD data set. 

We're prepping the latest database for export for evaluation. For this manuscript, we're merging a couple of existing databases:  

1. The comprehensive trait database synthesis from [Madin et al. 2020](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7275036/). This database was last updated in 2020. Most of its sources are static, with the exception of the [GOLD database](https://gold.jgi.doe.gov/downloads). As such, we're merging the existing release of the Madin et al. database with the most recent GOLD release (2022-03-12).  
2. Manual curation of Bergey's Manual by [Weissman et al.](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04216-2). This database contains manually curated entries from Bergey's Manual that are specific to human-associated microbiomes.   

We combine these disparate sources with an approach similar to that of Madin et al., based on their [R code](https://github.com/bacteria-archaea-traits/bacteria-archaea-traits/blob/master/R/functions.R) on GitHub. We apply the relevant transformations and mappings where applicable.  

Before continuing, please create a directory called `large_data` to hold all large files.  

In [52]:
library(data.table)
library(dtplyr)
library(here)
library(stringdist)
library(tidyverse)
library(taxizedb)
library(BiocParallel)
here::i_am("notebooks/db_prep.ipynb");
setDTthreads(4)

here() starts at /dartfs-hpc/rc/home/k/f00345k/research/microbe_set_trait



In [53]:
library(taxizedb)
db_download_ncbi(overwrite = FALSE)

Database already exists, returning old file



The following code uploads data to GitHub using the `piggyback` R package:
```r
piggyback::pb_upload(file = here("large_files", "goldData.xlsx"), tag = "0.1", overwrite = TRUE)
```

# Analysis

In [54]:
base <- read_csv(here("data", "condensed_species_NCBI.txt")) %>% 
    select(species_tax_id, superkingdom, phylum, class, order, family, 
           genus, species, metabolism, gram_stain, pathways, 
           carbon_substrates, sporulation, motility, cell_shape) %>% 
    rename("substrate" = carbon_substrates)

Rows: 14893 Columns: 79
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (19): species, genus, family, order, class, phylum, superkingdom, gram_s...
dbl (60): species_tax_id, d1_lo, d1_up, d2_lo, d2_up, doubling_h, genome_siz...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


## GOLD and Weissman data processing

Before we merge our data sets, we have to prepare the GOLD and Weissman et al. databases individually so that they follow the same conventions. See the notebooks `notebooks/gold_proc.ipynb` and `notebooks/weissman_proc.ipynb`. 

In [55]:
gold <- readRDS(file = here("output", "databases", "gold_proc.rds"))
weissman <- readRDS(file = here("output", "databases", "weissman_proc.rds"))

## Checking for duplicates

First, we define a function to check for similarly spelled names across all the unique pathways and substrates in the data sets. Here, we use the `stringdist` function from the `stringdist` package with its default OSA (optimal string alignment) metric, a restricted form of the Damerau-Levenshtein distance. Any pair of names within a distance of 2 is flagged as a potential variant spelling of the same pathway or compound. 
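To make the metric concrete, here is a minimal sketch that is not part of the original pipeline. It contrasts `method = "osa"` with `method = "dl"` on a textbook pair of strings, then applies the same `0 < d <= 2` filter to three pathway names that appear in this notebook. The `ref` vector is a hand-picked subset chosen for illustration only.

```r
library(stringdist)

# "osa" is the default method of stringdist(); "dl" is the unrestricted
# Damerau-Levenshtein distance. OSA never edits the same substring twice,
# so the two metrics can disagree.
osa_d <- stringdist("ca", "abc", method = "osa")
dl_d  <- stringdist("ca", "abc", method = "dl")

# Illustrative reference terms (a hand-picked subset, not the full database)
ref <- c("gelatin_degradation", "plastic_degradation", "nitrogen_fixation")
d <- stringdist("elastin_degradation", ref)

# Keep near matches only: 0 means identical, > 2 is treated as unrelated
near <- ref[d > 0 & d <= 2]
print(near)
```

A threshold of 2 is permissive enough to catch spacing and case variants, but it also flags genuinely different terms, so the hits still need a manual look.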

In [56]:
check_matches <- function(df, type = c("pathways", "substrate")){
    # "substrate" is the column name after the rename applied to `base`
    type <- match.arg(type)
    # unique reference terms from the Madin et al. table (`base` is a global)
    b_val <- base %>% pull(!!type) %>% unique() %>% str_split(pattern = ", ") %>%
        unlist() %>% unique() %>% na.omit() %>% as.vector()
    
    # unique query terms from the data set being checked
    q_val <- df %>% pull(!!type) %>% unique() %>% str_split(pattern = ", ") %>% 
        unlist() %>% unique() %>% na.omit() %>% as.vector()
    
    check <- map(q_val, ~{
        match <- stringdist(a = .x, b = b_val)
        # a distance of 0 is an exact match, and a distance > 2 is too different 
        ret <- b_val[match > 0 & match <= 2]
        if (length(ret) == 0){
            return(NA)
        } else {
            return(tibble(
                query = rep(.x, length(ret)),
                ref = ret
            ))
        }
    })
    # drop queries that have no near match
    check <- check[!sapply(check, function(x) all(is.na(x)))]
    
    return(check)
}

In [57]:
Reduce(check_matches(weissman, "pathways"), f = rbind)
Reduce(check_matches(weissman, "substrate"), f = rbind)

query,ref
<chr>,<chr>
elastin_degradation,plastic_degradation
elastin_degradation,gelatin_degradation


query,ref
<chr>,<chr>
sucrose,fucose
butanol,2-butanol
butanol,1-butanol
ethanol,methanol
glucose,fucose
mannose,maltose
fructose,fucose
lactose,lactate
lactose,galactose
lactose,maltose


We can see that many of the flagged compounds have similar names but are actually different compounds. However, there are certain conventions, such as using "_" or "-" in place of spaces, that we might need to address for the final merge.  

Let's check the GOLD database for similarly spelled names. Since GOLD does not have substrate information, we only check the pathways.

In [58]:
Reduce(check_matches(gold, "pathways"), f = rbind)

query,ref
<chr>,<chr>
Nitrogen fixation,nitrogen_fixation
Methane oxidation,methane_oxidation
Chitin degradation,chitin_degradation
Chitin_degradation,chitin_degradation
Nitrogen_fixation,nitrogen_fixation
Methane_oxidation,methane_oxidation


In [59]:
trim_path <- function(vec){
    vec <- vec %>% tolower() %>% 
        str_split(pattern = "(, |\\|)") %>% 
        map(~{
            str_trim(.x) %>% str_replace_all("\\-", "") %>%
                str_replace_all(" ", "_") %>% unique()
        })
    return(vec)
}


gold$pathways <- trim_path(gold$pathways)
base$pathways <- trim_path(base$pathways)
weissman$pathways <- trim_path(weissman$pathways)
weissman$substrate <- trim_path(weissman$substrate)
base$substrate <- trim_path(base$substrate)
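
The normalization above lowercases each entry, splits it on `", "` or `"|"`, trims whitespace, removes hyphens, replaces spaces with underscores, and deduplicates. The sketch below re-expresses those steps in base R so it can be run without the tidyverse. `normalize_terms` is a hypothetical name and the two input strings are made up for illustration.

```r
# Base-R re-expression of the normalization steps (illustrative only)
normalize_terms <- function(vec) {
  # lowercase, then split each entry on ", " or "|"
  parts <- strsplit(tolower(vec), "(, |\\|)")
  lapply(parts, function(p) {
    p <- trimws(p)
    p <- gsub("-", "", p, fixed = TRUE)   # drop hyphens
    p <- gsub(" ", "_", p, fixed = TRUE)  # spaces become underscores
    unique(p)
  })
}

# Made-up inputs mixing the conventions seen in the different sources
out <- normalize_terms(c("Nitrogen fixation, Methane_oxidation",
                         "Chitin-degradation|Nitrogen fixation"))
print(out)
```

Note that dropping hyphens also alters compound names that contain them: `2-butanol` becomes `2butanol` in every source, so the merge stays consistent.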

## Combine all data frames

After munging, let's combine all of the databases! The strategy is very similar to the one used for handling multiple entries in GOLD. First, we bind all of our databases together. Then, we `group_by` `species_tax_id` and nest the taxonomic names and the trait data into list columns. Finally, we process these lists and return `unique` (deduplicated) rows. If a species still has more than one row, we reconcile the rows either by concatenating the traits together or by voting on a consensus using the most represented trait.  
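As a small illustration of the bind-then-nest step, the sketch below uses a toy table. The tax ids, trait values and the single `metabolism` column are invented for the example and stand in for the full set of trait columns.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy stand-ins for two sources; ids and values are made up
toy <- bind_rows(
  tibble(species_tax_id = c(1, 2), metabolism = c("aerobic", "anaerobic"),
         source = "madin"),
  tibble(species_tax_id = 1, metabolism = "aerobic", source = "gold")
)

# Nest the trait columns so that each species_tax_id occupies one row,
# then count how many source rows each species carries
nested <- toy %>%
  nest(data = c(metabolism, source)) %>%
  mutate(row_data = map_dbl(data, nrow))

print(nested)
```

Species with `row_data >= 2` are the ones that need reconciliation. The rest can be passed through unchanged.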

In [60]:
complete <- bind_rows(
    base %>% mutate(substrate = map_chr(substrate, ~{paste0(.x, collapse = ",")}), 
                    pathways = map_chr(pathways, ~{paste0(.x, collapse = ",")})) %>% 
            mutate(source = "madin"), 
    gold %>% mutate(pathways = map_chr(pathways, ~{paste0(.x, collapse = ",")})) %>% 
            mutate(source = "gold"), 
    weissman %>% mutate(substrate = map_chr(substrate, ~{paste0(.x, collapse = ",")}), 
                        pathways = map_chr(pathways, ~{paste0(.x, collapse = ",")})) %>% 
            mutate(source = "weissman")
)
complete <- complete %>% filter(!is.na(species)) %>% 
    group_by(species_tax_id) %>%
    nest(names = c(superkingdom, phylum, class, order, family, genus, species), 
         data = c(metabolism, gram_stain, pathways, substrate, sporulation, motility, cell_shape, source)) %>%
    ungroup() %>%
    mutate(row_data = map_dbl(data, ~{nrow(.x)}))

In [61]:
head(complete)

species_tax_id,names,data,row_data
<dbl>,<list>,<list>,<dbl>
1243001,"Bacteria , Bacteria , Actinobacteria , Actinobacteria , Actinobacteria , Actinomycetia , Propionibacteriales , Propionibacteriales , Propionibacteriaceae , Propionibacteriaceae , Acidipropionibacterium , Acidipropionibacterium , Acidipropionibacterium damnosum, Acidipropionibacterium damnosum","microaerophilic, microaerophilic, NA , positive , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , madin , gold",2
1679466,"Bacteria , Bacteria , Bacteroidetes , Bacteroidetes , Flavobacteriia , Flavobacteriia , Flavobacteriales , Flavobacteriales , Flavobacteriaceae , Weeksellaceae , Apibacter , Apibacter , Apibacter adventoris, Apibacter adventoris","microaerophilic, NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , madin , gold",2
1591092,"Bacteria , Proteobacteria , Betaproteobacteria, Neisseriales , Chromobacteriaceae, Aquaspirillum , Aquaspirillum soli","microaerophilic, NA , NA , NA , NA , NA , NA , madin",1
1904463,"Bacteria , Bacteria , Proteobacteria , Proteobacteria , Epsilonproteobacteria , Epsilonproteobacteria , Campylobacterales , Campylobacterales , Campylobacteraceae , Arcobacteraceae , Arcobacter , Poseidonibacter , Arcobacter lekithochrous , Poseidonibacter lekithochrous","microaerophilic, NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , madin , gold",2
1935204,"Bacteria , Bacteria , Proteobacteria , Proteobacteria , Epsilonproteobacteria, Epsilonproteobacteria, Campylobacterales , Campylobacterales , Campylobacteraceae , Arcobacteraceae , Arcobacter , Aliarcobacter , Arcobacter porcinus , [Arcobacter] porcinus","microaerophilic, NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , madin , gold",2
209458,"Bacteria , Firmicutes , Bacilli , Bacillales , Bacillaceae , Bacillus , Bacillus arbutinivorans","microaerophilic, NA , NA , NA , NA , NA , NA , madin",1


In [62]:
multiple_rows <- complete %>% filter(row_data >= 2) %>% pull(species_tax_id)
length(multiple_rows)

In [63]:
reconcile <- complete %>% filter(species_tax_id %in% multiple_rows)

In [64]:
head(reconcile)

species_tax_id,names,data,row_data
<dbl>,<list>,<list>,<dbl>
1243001,"Bacteria , Bacteria , Actinobacteria , Actinobacteria , Actinobacteria , Actinomycetia , Propionibacteriales , Propionibacteriales , Propionibacteriaceae , Propionibacteriaceae , Acidipropionibacterium , Acidipropionibacterium , Acidipropionibacterium damnosum, Acidipropionibacterium damnosum","microaerophilic, microaerophilic, NA , positive , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , madin , gold",2
1679466,"Bacteria , Bacteria , Bacteroidetes , Bacteroidetes , Flavobacteriia , Flavobacteriia , Flavobacteriales , Flavobacteriales , Flavobacteriaceae , Weeksellaceae , Apibacter , Apibacter , Apibacter adventoris, Apibacter adventoris","microaerophilic, NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , madin , gold",2
1904463,"Bacteria , Bacteria , Proteobacteria , Proteobacteria , Epsilonproteobacteria , Epsilonproteobacteria , Campylobacterales , Campylobacterales , Campylobacteraceae , Arcobacteraceae , Arcobacter , Poseidonibacter , Arcobacter lekithochrous , Poseidonibacter lekithochrous","microaerophilic, NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , madin , gold",2
1935204,"Bacteria , Bacteria , Proteobacteria , Proteobacteria , Epsilonproteobacteria, Epsilonproteobacteria, Campylobacterales , Campylobacterales , Campylobacteraceae , Arcobacteraceae , Arcobacter , Aliarcobacter , Arcobacter porcinus , [Arcobacter] porcinus","microaerophilic, NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , madin , gold",2
1780362,"Bacteria , Bacteria , Proteobacteria , Proteobacteria , Epsilonproteobacteria , Epsilonproteobacteria , Campylobacterales , Campylobacterales , Campylobacteraceae , Campylobacteraceae , Campylobacter , Campylobacter , Campylobacter geochelonis, Campylobacter geochelonis","microaerophilic, NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , madin , gold",2
1848766,"Bacteria , Bacteria , Proteobacteria , Proteobacteria , Epsilonproteobacteria , Epsilonproteobacteria , Campylobacterales , Campylobacterales , Campylobacteraceae , Campylobacteraceae , Campylobacter , Campylobacter , Campylobacter ornithocola, Campylobacter ornithocola","microaerophilic, NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , madin , gold",2


Let's collapse the traits of the same species as before and merge everything together to save to a database. Since Weissman et al. is a manually curated source, we prefer Weissman et al. over the other sources when there are conflicts. The second preference is Madin et al., since Madin et al. also draws on curated static sources while GOLD relies on community submissions. When there are multiple Weissman et al. entries per taxon, we concatenate the pathways and substrates.
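The preference order can be sketched for a single trait as follows. This is a simplified illustration, not the function used in this notebook: `pick_by_priority` is a hypothetical helper, and it returns `NA` when the preferred source disagrees with itself, whereas the notebook tries to resolve such cases by matching species names first.

```r
# Simplified source-priority rule for one trait (illustrative only)
pick_by_priority <- function(values, sources,
                             priority = c("weissman", "madin", "gold")) {
  for (src in priority) {
    # distinct non-missing values reported by this source
    candidates <- unique(na.omit(values[sources == src]))
    if (length(candidates) == 1) return(candidates[[1]])
    # the preferred source disagrees with itself: give up in this sketch
    if (length(candidates) > 1) return(NA_character_)
  }
  NA_character_
}

# Weissman has no value here, so the Madin entry wins over GOLD
a <- pick_by_priority(c("rod", "coccus", NA), c("madin", "gold", "weissman"))
# Weissman has a value, so it overrides both other sources
b <- pick_by_priority(c("rod", "coccus", "spiral"), c("madin", "gold", "weissman"))
print(c(a, b))
```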

In [65]:
#' @param data This is a data frame of multiple columns, where the columns of 
#'     pathways and substrates are themselves lists 
reconcile_trait <- function(data, species_tax_id, names, ...){
    nonlist <- c("metabolism", "gram_stain", "sporulation", "motility", 
                 "cell_shape")    
    
    weissman_idx <- which(data$source == "weissman")
    # reconcile each non-pathway-substrate trait 
    out <- suppressMessages(map_dfc(nonlist, ~{
        # for each data frame
        traits <- data %>% pull(.x)        
        traits <- as.data.frame(table(traits))
        if (nrow(traits) == 1){
            return(data %>% filter(!!as.symbol(.x) == traits$traits) %>% pull(.x) %>% unique())
        } else if (nrow(traits) == 0){
            return(NA_character_)
        } else {
            w_source <- data %>% slice(weissman_idx) %>% pull(.x) 
            w_source <- na.omit(unique(w_source))
            if (length(w_source) >= 2) {
                # if there are conflicting weissman et al. information 
                # first, let's check that the names are accurate according to the species_tax_id 
                query_name <- taxizedb::taxid2name(species_tax_id)
                # second, let's extract the name to match and if there is only one name, attach the genus
                spec <- names %>% slice(weissman_idx) %>% pull(species)
                gen <- names %>% slice(weissman_idx) %>% pull(genus)
                check_name <- map_dbl(spec, ~length(str_split(.x,pattern = " ")[[1]]))
                spec[which(check_name == 1)] <- paste(gen[which(check_name == 1)], spec[which(check_name == 1)])
                # third, now check with the query name 
                match_idx <- which(spec == query_name)
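                # if there are two or more matches, or none, return NA because the conflict cannot be resolved
                if (length(match_idx) == 1){
                    # index the raw Weissman rows: `w_source` is deduplicated and
                    # NA-stripped, so its positions do not line up with `spec`
                    w_all <- data %>% slice(weissman_idx) %>% pull(.x)
                    return(w_all[match_idx])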
                } else {
                    return(NA_character_)
                }
            }
            if (is_empty(w_source)){
                return(data %>% filter(source == "madin") %>% pull(.x))
            } else if (is.na(w_source)) {
                return(data %>% filter(source == "madin") %>% pull(.x))
            } else {
                return(w_source)
            }
        }
    }))
    names(out) <- nonlist
    # process pathway and substrate traits
    path_vec <- data %>% pull("pathways") %>% 
        map(., ~{ str_split(.x, ',')[[1]] }) %>% 
        Reduce(f = c, x = .) %>% unique()
    path_vec <- path_vec[!path_vec %in% c("NA", NA_character_)]
    if (length(path_vec) == 0){
        out$pathways <- NA_character_
    } else {
        out$pathways <- paste(path_vec, collapse = ",")
    }
    
    path_substr <- data %>% pull("substrate") %>% 
        map(., ~{ str_split(.x, ',')[[1]] }) %>% 
        Reduce(f = c, x = .) %>% unique()
    path_substr <- path_substr[!path_substr %in% c("NA", NA_character_)]
    if (length(path_substr) == 0){
        out$substrate <- NA_character_
    } else {
        out$substrate <- paste(path_substr, collapse = ",")
    }
    return(out)
}
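
For pathways and substrates the function takes the union of every term reported by any source, rather than picking one source. A minimal base-R sketch of that union follows. `merge_terms` is a hypothetical name and the input strings are made up.

```r
# Union of comma-separated term lists across sources (illustrative only)
merge_terms <- function(x) {
  # split each non-missing entry on commas and pool the terms
  terms <- unique(unlist(strsplit(x[!is.na(x)], ",", fixed = TRUE)))
  # drop literal "NA" strings left over from pasting missing values
  terms <- terms[!terms %in% c("NA", "")]
  if (length(terms) == 0) NA_character_ else paste(terms, collapse = ",")
}

merged <- merge_terms(c("nitrogen_fixation,methane_oxidation",
                        NA,
                        "methane_oxidation,NA"))
print(merged)
```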

In [66]:
reconcile <- reconcile %>% select(-row_data)
head(reconcile)

species_tax_id,names,data
<dbl>,<list>,<list>
1243001,"Bacteria , Bacteria , Actinobacteria , Actinobacteria , Actinobacteria , Actinomycetia , Propionibacteriales , Propionibacteriales , Propionibacteriaceae , Propionibacteriaceae , Acidipropionibacterium , Acidipropionibacterium , Acidipropionibacterium damnosum, Acidipropionibacterium damnosum","microaerophilic, microaerophilic, NA , positive , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , madin , gold"
1679466,"Bacteria , Bacteria , Bacteroidetes , Bacteroidetes , Flavobacteriia , Flavobacteriia , Flavobacteriales , Flavobacteriales , Flavobacteriaceae , Weeksellaceae , Apibacter , Apibacter , Apibacter adventoris, Apibacter adventoris","microaerophilic, NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , madin , gold"
1904463,"Bacteria , Bacteria , Proteobacteria , Proteobacteria , Epsilonproteobacteria , Epsilonproteobacteria , Campylobacterales , Campylobacterales , Campylobacteraceae , Arcobacteraceae , Arcobacter , Poseidonibacter , Arcobacter lekithochrous , Poseidonibacter lekithochrous","microaerophilic, NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , madin , gold"
1935204,"Bacteria , Bacteria , Proteobacteria , Proteobacteria , Epsilonproteobacteria, Epsilonproteobacteria, Campylobacterales , Campylobacterales , Campylobacteraceae , Arcobacteraceae , Arcobacter , Aliarcobacter , Arcobacter porcinus , [Arcobacter] porcinus","microaerophilic, NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , madin , gold"
1780362,"Bacteria , Bacteria , Proteobacteria , Proteobacteria , Epsilonproteobacteria , Epsilonproteobacteria , Campylobacterales , Campylobacterales , Campylobacteraceae , Campylobacteraceae , Campylobacter , Campylobacter , Campylobacter geochelonis, Campylobacter geochelonis","microaerophilic, NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , madin , gold"
1848766,"Bacteria , Bacteria , Proteobacteria , Proteobacteria , Epsilonproteobacteria , Epsilonproteobacteria , Campylobacterales , Campylobacterales , Campylobacteraceae , Campylobacteraceae , Campylobacter , Campylobacter , Campylobacter ornithocola, Campylobacter ornithocola","microaerophilic, NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , madin , gold"


In [67]:
reconcile %>% pull(data) %>% .[[1]]
complete %>% pull(data) %>% .[[1]]
colnames(gold)
colnames(weissman)

metabolism,gram_stain,pathways,substrate,sporulation,motility,cell_shape,source
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
microaerophilic,,,,,,,madin
microaerophilic,positive,,,,,,gold


metabolism,gram_stain,pathways,substrate,sporulation,motility,cell_shape,source
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
microaerophilic,,,,,,,madin
microaerophilic,positive,,,,,,gold


In [68]:
reconcile <- reconcile %>% mutate(traits = pmap(reconcile, reconcile_trait));

In [69]:
reconcile <- reconcile %>% select(-data)

dim(reconcile)
head(reconcile)

species_tax_id,names,traits
<dbl>,<list>,<list>
1243001,"Bacteria , Bacteria , Actinobacteria , Actinobacteria , Actinobacteria , Actinomycetia , Propionibacteriales , Propionibacteriales , Propionibacteriaceae , Propionibacteriaceae , Acidipropionibacterium , Acidipropionibacterium , Acidipropionibacterium damnosum, Acidipropionibacterium damnosum","microaerophilic, positive , NA , NA , NA , NA , NA"
1679466,"Bacteria , Bacteria , Bacteroidetes , Bacteroidetes , Flavobacteriia , Flavobacteriia , Flavobacteriales , Flavobacteriales , Flavobacteriaceae , Weeksellaceae , Apibacter , Apibacter , Apibacter adventoris, Apibacter adventoris","microaerophilic, NA , NA , NA , NA , NA , NA"
1904463,"Bacteria , Bacteria , Proteobacteria , Proteobacteria , Epsilonproteobacteria , Epsilonproteobacteria , Campylobacterales , Campylobacterales , Campylobacteraceae , Arcobacteraceae , Arcobacter , Poseidonibacter , Arcobacter lekithochrous , Poseidonibacter lekithochrous","microaerophilic, NA , NA , NA , NA , NA , NA"
1935204,"Bacteria , Bacteria , Proteobacteria , Proteobacteria , Epsilonproteobacteria, Epsilonproteobacteria, Campylobacterales , Campylobacterales , Campylobacteraceae , Arcobacteraceae , Arcobacter , Aliarcobacter , Arcobacter porcinus , [Arcobacter] porcinus","microaerophilic, NA , NA , NA , NA , NA , NA"
1780362,"Bacteria , Bacteria , Proteobacteria , Proteobacteria , Epsilonproteobacteria , Epsilonproteobacteria , Campylobacterales , Campylobacterales , Campylobacteraceae , Campylobacteraceae , Campylobacter , Campylobacter , Campylobacter geochelonis, Campylobacter geochelonis","microaerophilic, NA , NA , NA , NA , NA , NA"
1848766,"Bacteria , Bacteria , Proteobacteria , Proteobacteria , Epsilonproteobacteria , Epsilonproteobacteria , Campylobacterales , Campylobacterales , Campylobacteraceae , Campylobacteraceae , Campylobacter , Campylobacter , Campylobacter ornithocola, Campylobacter ornithocola","microaerophilic, NA , NA , NA , NA , NA , NA"


In [70]:
complete <- complete %>% rename("traits" = "data") %>% filter(!species_tax_id %in% multiple_rows)
head(complete)

species_tax_id,names,traits,row_data
<dbl>,<list>,<list>,<dbl>
1591092,"Bacteria , Proteobacteria , Betaproteobacteria, Neisseriales , Chromobacteriaceae, Aquaspirillum , Aquaspirillum soli","microaerophilic, NA , NA , NA , NA , NA , NA , madin",1
209458,"Bacteria , Firmicutes , Bacilli , Bacillales , Bacillaceae , Bacillus , Bacillus arbutinivorans","microaerophilic, NA , NA , NA , NA , NA , NA , madin",1
1735111,"Bacteria , Bacteroidetes , Flavobacteriia , Flavobacteriales , Flavobacteriaceae , Bergeyella , Bergeyella porcorum","microaerophilic, NA , NA , NA , NA , NA , NA , madin",1
1460875,"Bacteria , Proteobacteria , Gammaproteobacteria , Pasteurellales , Pasteurellaceae , Bisgaardia , Bisgaardia miroungae","microaerophilic, NA , NA , NA , NA , NA , NA , madin",1
1448267,"Bacteria , Actinobacteria , Actinobacteria , Corynebacteriales , Corynebacteriaceae , Corynebacterium , Corynebacterium nasicanis","microaerophilic, NA , NA , NA , NA , NA , NA , madin",1
913829,"Bacteria , Proteobacteria , Gammaproteobacteria , Chromatiales , Thioalkalispiraceae , Endothiovibrio , Endothiovibrio diazotrophicus","microaerophilic, NA , NA , NA , NA , NA , NA , madin",1


In [71]:
complete <- complete %>% select(-row_data) %>% mutate(traits = map(traits, ~{.x %>% select(-source)})) 

In [72]:
complete <- bind_rows(complete, reconcile)
head(complete)

species_tax_id,names,traits
<dbl>,<list>,<list>
1591092,"Bacteria , Proteobacteria , Betaproteobacteria, Neisseriales , Chromobacteriaceae, Aquaspirillum , Aquaspirillum soli","microaerophilic, NA , NA , NA , NA , NA , NA"
209458,"Bacteria , Firmicutes , Bacilli , Bacillales , Bacillaceae , Bacillus , Bacillus arbutinivorans","microaerophilic, NA , NA , NA , NA , NA , NA"
1735111,"Bacteria , Bacteroidetes , Flavobacteriia , Flavobacteriales , Flavobacteriaceae , Bergeyella , Bergeyella porcorum","microaerophilic, NA , NA , NA , NA , NA , NA"
1460875,"Bacteria , Proteobacteria , Gammaproteobacteria , Pasteurellales , Pasteurellaceae , Bisgaardia , Bisgaardia miroungae","microaerophilic, NA , NA , NA , NA , NA , NA"
1448267,"Bacteria , Actinobacteria , Actinobacteria , Corynebacteriales , Corynebacteriaceae , Corynebacterium , Corynebacterium nasicanis","microaerophilic, NA , NA , NA , NA , NA , NA"
913829,"Bacteria , Proteobacteria , Gammaproteobacteria , Chromatiales , Thioalkalispiraceae , Endothiovibrio , Endothiovibrio diazotrophicus","microaerophilic, NA , NA , NA , NA , NA , NA"


Let's check whether all traits are properly collapsed by counting the number of rows in each nested `traits` entry. No species should have more than one row.

In [73]:
complete %>% mutate(t_rows = map_dbl(traits, ~nrow(.x))) %>% filter(t_rows > 1)
complete %>% dim()

species_tax_id,names,traits,t_rows
<dbl>,<list>,<list>,<dbl>


Even though we have a nice set of traits and associated NCBI ids, we also want to make sure the names are up-to-date. 

In [74]:
#' This function takes an NCBI species_tax_id, runs `classification` on it (from taxizedb) 
#' and retrieves all the ranks from superkingdom to species
#' @param species_tax_id A string representing an NCBI taxonomy id
#' @param names A data.frame representing one or multiple hypothetical names, 
#'     mostly for use when the NCBI id does not resolve
collapse_name <- function(species_tax_id, names, ...) {
    names <- names %>% dplyr::distinct() 
    query_ranks <- classification(species_tax_id, db = "ncbi", verbose = FALSE)[[1]]
    # if query via species_tax_id does not work. 
    # do this reverse query thing where the name is re-queried back to get NCBIids
    if (nrow(query_ranks) == 0){
        # first we loop through each candidate species name and
        # annotate or return NA if there is ambiguity
        cand_ids = vector(length = nrow(names))
        for (i in seq_len(nrow(names))){
            # rev_query here is a data frame; query one candidate name per iteration
            rev_query <- name2taxid(names$species[i], out_type = "summary")
            if (nrow(rev_query) != 1){
                # if there are ambiguous names or if there are no matches, return NA
                cand_ids[i] <- NA_character_
            } else {
                cand_ids[i] <- rev_query$id
            }
        }
        # remove NAs and get only unique ids
        cand_ids <- unique(na.omit(cand_ids))
        if (length(cand_ids) >= 2){
            # if more than one name then just concatenate all the names together 
            reconciled_names <- as_tibble(map(names, ~{ paste0(.x, collapse = "|") }))
        } else if (length(cand_ids) == 1) {
            reconciled_names <- classification(cand_ids, db = "ncbi")[[1]] %>% 
                filter(!rank %in% c("no rank", "clade")) %>% 
                select(-id) %>% pivot_wider(names_from = rank, values_from = name)
        } else {
            # create an empty final names
            reconciled_names <- matrix(rep(NA_character_, ncol(names)), nrow = 1, ncol = ncol(names))
            colnames(reconciled_names) <- colnames(names)
            reconciled_names <- as_tibble(reconciled_names)
        }
    # second case where queried ranks do work
    } else {
        reconciled_names <- query_ranks %>% filter(!rank %in% c("no rank", "clade")) %>% select(-id) %>% 
            pivot_wider(names_from = rank, values_from = name)
    }
    return(reconciled_names)
}

#' This function is similar to collapse_name but only repairs
#' tax ids that are defunct (it does not resolve names) 
fix_ids <- function(species_tax_id, names, ...) {
    names <- names %>% dplyr::distinct() 
    query_ranks <- taxizedb::classification(species_tax_id, db = "ncbi", verbose = FALSE)[[1]]
    if (nrow(query_ranks) > 0){
        # the id resolves, so keep it as is
        out <- species_tax_id
    } else {
        # if there is no match, we perform a reverse query
        cand_ids = vector(length = nrow(names))
        # for each name in the list of possible names get the identifiers 
        for (i in seq_len(nrow(names))){
            # rev_query here is a data frame; query one candidate name per iteration
            rev_query <- name2taxid(names$species[i], out_type = "summary")
            if (nrow(rev_query) != 1){
                # if there are ambiguous names or if there are no matches, return NA
                cand_ids[i] <- NA_character_
            } else {
                cand_ids[i] <- rev_query$id
            }
        }
        # remove NAs and get only unique ids
        cand_ids <- unique(na.omit(cand_ids))
        if (length(cand_ids) == 1) {
            out <- cand_ids
        } else {
            # if more than one possible cand_ids, then also return NA due to unresolvable names
            out <- NA_character_
        }
    }
    return(out)
}



#' Only the reverse query part 
#' @param names A data.frame of candidate names with a `species` column
rev_query <- function(names) {
    cand_ids <- character(nrow(names))
    # look up each candidate species name in turn
    for (i in seq_len(nrow(names))){
        hits <- name2taxid(names %>% slice(i) %>% pull(species), out_type = "summary")
        if (nrow(hits) != 1){
            # if there are ambiguous names or if there are no matches, return NA
            cand_ids[i] <- NA_character_
        } else {
            cand_ids[i] <- hits$id
        } 
    }
    # keep the id only if all candidate names agree on a single one
    cand_ids <- unique(na.omit(cand_ids))
    if (length(cand_ids) != 1){
        return(NA_character_)
    } else {
        return(cand_ids)
    }
}

Let's loop through every `species_tax_id` and correct its name. The `BiocParallel` package is loaded so that this procedure can be run in parallel, although the cells below run the lookups serially.  

In [75]:
length(complete$species_tax_id)
length(complete$names)

We retrieve the classification for each tax id, find the ids that come back as NA, and reverse-query the associated names to recover a valid id.
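The reverse-query rule keeps an id only when the candidate names point to exactly one taxon. The sketch below shows that rule against a small in-memory table instead of the NCBI database. The `lookup` table, its names and ids, and the `resolve_id` helper are all hypothetical stand-ins.

```r
# Hypothetical stand-in for the NCBI name table; names and ids are invented
lookup <- data.frame(
  name = c("Genus alpha", "Genus beta", "Genus beta"),
  id   = c("101", "202", "203"),
  stringsAsFactors = FALSE
)

resolve_id <- function(candidate_names, lookup) {
  ids <- vapply(candidate_names, function(nm) {
    hits <- lookup$id[lookup$name == nm]
    # zero hits or several hits are both treated as unresolved
    if (length(hits) == 1) hits else NA_character_
  }, character(1))
  ids <- unique(ids[!is.na(ids)])
  # accept the result only if all resolvable names agree on one id
  if (length(ids) == 1) ids else NA_character_
}

ok        <- resolve_id(c("Genus alpha", "Genus gamma"), lookup)
ambiguous <- resolve_id("Genus beta", lookup)
print(c(ok, ambiguous))
```

Returning `NA` for ambiguous names is conservative: those species are dropped later rather than assigned a possibly wrong id.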

In [76]:
class <- classification(complete$species_tax_id)

In [77]:
na_idx <- which(is.na(class))
length(na_idx)

In [78]:
id_corrected <- vector(length = length(na_idx))
for (i in seq_along(id_corrected)){
    id_corrected[i] <- rev_query(complete$names[na_idx[i]][[1]])
}
id_corrected[1:10]

In [79]:
complete$species_tax_id[na_idx] <- id_corrected

We remove the taxa names (because they can be out of date relative to the `species_tax_id`), drop all rows whose id is NA, and unnest the traits.
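The effect of this final step can be seen on a toy table. The ids, names and trait values below are invented for illustration and only two trait columns are shown.

```r
library(dplyr)
library(tidyr)

# Toy nested table; the second row has an id that could not be resolved
toy <- tibble(
  species_tax_id = c("101", NA, "303"),
  names = list(tibble(species = "Genus alpha"),
               tibble(species = "Genus beta"),
               tibble(species = "Genus gamma")),
  traits = list(tibble(metabolism = "aerobic", gram_stain = "negative"),
                tibble(metabolism = "anaerobic", gram_stain = NA_character_),
                tibble(metabolism = NA_character_, gram_stain = "positive"))
)

# Drop unresolved ids, discard the stale names, and flatten the traits
flat <- toy %>%
  drop_na(species_tax_id) %>%
  select(-names) %>%
  unnest(traits)

print(flat)
```

The result has one row per resolved species, with the trait columns sitting next to the id.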

In [80]:
complete <- complete %>% drop_na(species_tax_id) %>% select(-names) %>% unnest(traits)

In [None]:
head(complete)
dim(complete)

In [129]:
saveRDS(complete, file = here("output", "databases", "db_merged.rds"))
write_csv(x = complete, file = here("output", "databases", "db_merged.csv"))