# Process GOLD database

Last updated: 2022-04-14.    
Quang Nguyen  

This notebook was run in a conda environment with `r-base=4.1.2`, with packages managed via `renv`.

Here, we process the GOLD data downloaded into the `large_files` folder. The objective is to collapse traits to the species level and to resolve inconsistencies in naming conventions, so that the result aligns with the base database from Madin et al. 2020. 

In [6]:
library(tidyverse)
library(here)
library(dtplyr)
library(data.table)
here::i_am("notebooks/gold_proc.ipynb")


Attaching package: ‘data.table’


The following objects are masked from ‘package:dplyr’:

    between, first, last


The following object is masked from ‘package:purrr’:

    transpose


here() starts at /dartfs-hpc/rc/home/k/f00345k/research/microbe_set_trait



First, we load the data from GOLD. We also load the base Madin et al. database for comparison.

In [4]:
pth <- here("large_files", "goldData.csv")
gold <- read_csv(file = pth);
base <- read_csv(here("data", "condensed_species_NCBI.txt")) %>% 
    select(species_tax_id, superkingdom, phylum, class, order, family, 
           genus, species, metabolism, gram_stain, pathways, 
           carbon_substrates, sporulation, motility, cell_shape) %>% 
    rename("substrate" = carbon_substrates)

“One or more parsing issues, see `problems()` for details”
Rows: 428241 Columns: 42
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (35): ORGANISM GOLD ID, ORGANISM NAME, ORGANISM NCBI SUPERKINGDOM, ORGAN...
dbl  (4): ORGANISM NCBI TAX ID, ORGANISM ISOLATION PUBMED ID, ORGANISM ECOSY...
lgl  (3): ORGANISM SALINITY CONCENTRATION, ORGANISM PRESSURE, ORGANISM CARBO...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

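The parsing warning above may come from type guessing: `read_csv` infers column types from the leading rows, so a sparsely filled column can be guessed as logical and then fail on later values. The sketch below is one possible way to sidestep guessing, not necessarily what was done for this notebook. The miniature CSV and its column names are hypothetical.

```r
library(readr)

# Hypothetical miniature CSV: `pressure` is empty in the first rows
csv_text <- I("name,pressure\nA,\nB,\nC,high\n")

# Read every column as character so no type has to be guessed
toy <- read_csv(csv_text, col_types = cols(.default = col_character()))
print(toy)

# problems() lists values that failed to parse; here it should have zero rows
print(problems(toy))
```

Reading everything as character trades convenience for safety: numeric columns such as the taxonomic identifier would then need an explicit conversion afterwards.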

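We convert all column names to lower case, replace spaces with `_`, and drop the `organism_` prefix. We also rename columns to match the base database from Madin et al. In particular, GOLD's `metabolism` column becomes `pathways`, and `oxygen_requirement` becomes `metabolism`. Finally, we nest all relevant traits into a column called `traits`.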

In [7]:
# convert names 
colnames(gold) <- colnames(gold) %>% 
    gsub(x = ., pattern = " ", replacement = "_") %>% 
    tolower() %>% 
    gsub(x = ., pattern = "organism_", replacement = "")


gold_reduced <- gold %>% 
    select(ncbi_tax_id, ncbi_superkingdom,  
            ncbi_phylum, ncbi_class, ncbi_order, ncbi_family, ncbi_genus, ncbi_species, 
            name, gram_stain, metabolism, oxygen_requirement, 
            sporulation, motility, cell_shape) %>% 
    rename("species_tax_id" = ncbi_tax_id,
           "superkingdom" = ncbi_superkingdom,
           "phylum" = ncbi_phylum,
           "class" = ncbi_class,
           "order" = ncbi_order,
           "family" = ncbi_family,
           "genus" = ncbi_genus,
           "species" = ncbi_species,
           "pathways" = metabolism,
           "metabolism" = oxygen_requirement) %>% 
    mutate(metabolism = str_replace(tolower(metabolism), pattern = "obe$", replacement = "obic"), 
           gram_stain = if_else(gram_stain == "Gram-", "negative", "positive"), 
           sporulation = if_else(sporulation == "Nonsporulating", "no", "yes"), 
           motility = case_when(
               motility == "Nonmotile" ~ "no", 
               motility == "Motile" ~ "yes", 
               TRUE ~ motility
           ), 
           cell_shape = tolower(str_replace(cell_shape,"-shaped","")),
           cell_shape = case_when(
               cell_shape %in% c("rod") ~ "bacillus",
               cell_shape %in% c("sphere", "oval", 
                                 "bean", "coccoid", "ovoid", 
                                 "spore", "Coccus-shaped") ~ "coccus", 
               cell_shape %in% c("helical") ~ "spiral", 
               cell_shape %in% c("curved") ~ "vibrio", 
               cell_shape %in% c("flask", "open-ring", "lancet") ~ "irregular", 
               # only Mycoplasma genitalium for flask 
               # only Thiomicrospira cyclica for open-ring
               # only Nitrolancea hollandica for lancet
               TRUE ~ cell_shape
           )) %>% 
    as.data.table()
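
As a quick check of the renaming logic above, this sketch applies the same three steps to a few column names taken from the `read_csv` column specification, using base R only. The `oxygen` vector holds hypothetical raw values, assumed here to resemble GOLD's oxygen-requirement labels.

```r
# Column names copied from the column specification printed by read_csv
raw_names <- c("ORGANISM GOLD ID", "ORGANISM NAME",
               "ORGANISM NCBI SUPERKINGDOM", "ORGANISM NCBI TAX ID")

# Same three steps as the notebook: spaces -> "_", lower case, drop the prefix
clean_names <- gsub(pattern = " ", replacement = "_", x = raw_names)
clean_names <- tolower(clean_names)
clean_names <- gsub(pattern = "organism_", replacement = "", x = clean_names)
print(clean_names)

# Hypothetical raw oxygen-requirement labels (assumed, not read from GOLD)
oxygen <- c("Aerobe", "Obligate anaerobe", "Facultative")

# Lower-case, then turn a trailing "obe" into "obic"
metabolism <- sub(pattern = "obe$", replacement = "obic", x = tolower(oxygen))
print(metabolism)
```

Because the pattern is anchored with `$`, labels that do not end in "obe" (such as "facultative") pass through unchanged apart from the lower-casing.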

In [8]:
# nest traits 
tbl <- gold_reduced %>%
    select(-name) %>%
    group_by(species_tax_id, superkingdom, phylum, class, order, 
             family, genus, species) %>%
    nest(traits = c(gram_stain, pathways, metabolism, 
           cell_shape, motility, sporulation))
    

# a subset of the table that has more than one row per trait nested values 
tbl_munge <- tbl %>% filter(map_lgl(traits, ~{nrow(.x) > 1})) %>% drop_na(species_tax_id)
tbl_munge

Source: local data table [11,177 x 9]
Groups: species_tax_id, superkingdom, phylum, class, order, family, genus, species
Call:
  _DT2 <- `_DT1`[, .(species_tax_id, superkingdom, phylum, class, order, family, genus,
  _DT2 <-   species, gram_stain, pathways, metabolism, sporulation, motility, cell_shape)][
  _DT2 <-   , .(traits = .(.SD)), by = .(species_tax_id, superkingdom, phylum, class,
  _DT2 <-     order, family, genus, species)]
  na.omit(`_DT2`[`_DT2`[, .I[map_lgl(traits, ~{
    nrow(.x) > 1
})], by = .(species_tax_id, superkingdom, phylum, class, order, 
    family, genus, species)]$V1], cols = "species_tax_id")

  species_tax_id superkingdom phylum       class      order family genus species
           <dbl> <chr>        <chr>        <chr>      <chr> <chr>  <chr> <chr>  

`tbl_munge` is the subset of the GOLD database with more than one row of traits per species identifier (presumably due to conflicting information or to multiple strains within a species). Within this subset, for `gram_stain`, `metabolism`, `sporulation`, `motility`, and `cell_shape`, the species-level trait is the most frequent value, provided that value accounts for more than 50% of the non-missing entries. Often the trait vectors are actually identical and we are merely collapsing duplicates. For `pathways`, we simply concatenate all the reported pathways and keep only the unique ones. 
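
To make "more than one row of traits per species identifier" concrete, here is a minimal sketch on a hypothetical five-row table. It uses plain tibbles rather than the `dtplyr` pipeline of the notebook, and the identifiers and trait values are invented for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical records: species 1 has two conflicting rows, species 2 has one,
# and two rows lack a taxonomic identifier
toy <- tibble(
    species_tax_id = c(1, 1, 2, NA, NA),
    motility       = c("yes", "no", "yes", "no", "no")
)

# One row per identifier, with the trait rows packed into a list-column
nested <- toy %>% nest(traits = c(motility))

# Keep identifiers with more than one trait row, then drop missing identifiers
multi <- nested %>%
    filter(map_lgl(traits, ~ nrow(.x) > 1)) %>%
    drop_na(species_tax_id)
print(multi)
```

Only species 1 survives both filters: species 2 has a single trait row, and the rows without an identifier are removed by `drop_na()`.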

In [9]:
# This function takes a data frame and a column 
# and selects the response with the highest frequency
select_best <- function(df, column){
    vec <- unlist(df[,..column])
    freq <- as.data.frame(table(vec))
    if (nrow(freq) == 0){
        return(NA_character_)
    } else {
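        freq <- freq %>% mutate(prop = Freq/sum(Freq)) %>%
            filter(prop > 0.5) %>% top_n(n = 1, wt = prop)
        # no value exceeds 50%: return NA so the result always has length one
        if (nrow(freq) == 0){
            return(NA_character_)
        }
        return(freq %>% pull(vec) %>% as.vector())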
    }
}

# This function then utilizes select_best
# to process entries with duplicates (more than one row)
# for pathways, the goal is to concatenate them
process_duplicates <- function(df){
    # get only unique rows
    df <- unique(df)
    if (nrow(df) == 1){
        return(df)
    }
    v <- c("gram_stain", "pathways", "metabolism", 
           "sporulation", "motility", "cell_shape")
    suppressMessages(res <- map_dfc(v, ~{
        if (.x == "pathways"){
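            # drop missing and repeated pathway labels before concatenating
            str_vec <- na.omit(df$pathways) %>% as.vector() %>% unique()
            if (length(str_vec) == 0){
                out <- NA_character_
            } else {
                # replace every space so multi-word labels stay intact
                out <- str_replace_all(str_vec, pattern = " ", 
                                       replacement = "_") %>% 
                    paste(collapse = ", ")
            }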
        } else {
            out <- select_best(df, .x)
        }
        return(out)
    }))
    colnames(res) <- v
    res <- as.data.table(res)
    return(res)
}
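
The two collapsing rules can be illustrated in isolation with base R. This is a sketch under stated assumptions, not the notebook's implementation: `majority_value()` and `collapse_pathways()` are hypothetical helpers, and returning `NA` when no value exceeds 50% is an assumed convention.

```r
# Hypothetical helper: return the value covering more than half of the
# non-missing entries, or NA when there is no such value
majority_value <- function(x) {
    freq <- table(x)                       # table() ignores NA by default
    if (length(freq) == 0) {
        return(NA_character_)
    }
    prop <- as.vector(freq) / sum(freq)
    winner <- names(freq)[prop > 0.5]
    if (length(winner) == 0) NA_character_ else winner
}

# Hypothetical helper: concatenate the unique, non-missing pathway labels
collapse_pathways <- function(x) {
    kept <- unique(x[!is.na(x)])
    if (length(kept) == 0) {
        return(NA_character_)
    }
    paste(gsub(" ", "_", kept), collapse = ", ")
}

shape_a <- majority_value(c("bacillus", "bacillus", "coccus"))  # 2 of 3 agree
shape_b <- majority_value(c("yes", "no"))                       # exact tie
shape_c <- majority_value(c(NA, NA))                            # nothing recorded
paths   <- collapse_pathways(c("Cellulose degrader", NA,
                               "Methanogen", "Cellulose degrader"))
print(list(shape_a, shape_b, shape_c, paths))
```

The strict `> 0.5` threshold means an exact tie yields no species-level value, which is the conservative choice when strains disagree evenly.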

In [10]:
tbl_munge_proc <- tbl_munge %>% 
    mutate(traits = map(traits, process_duplicates)) 

head(tbl_munge_proc)

Source: local data table [6 x 9]
Groups: species_tax_id, superkingdom, phylum, class, order, family, genus, species
Call:
  _DT2 <- `_DT1`[, .(species_tax_id, superkingdom, phylum, class, order, family, genus,
  _DT2 <-   species, gram_stain, pathways, metabolism, sporulation, motility, cell_shape)][
  _DT2 <-   , .(traits = .(.SD)), by = .(species_tax_id, superkingdom, phylum, class,
  _DT2 <-     order, family, genus, species)]
  head(na.omit(`_DT2`[`_DT2`[, .I[map_lgl(traits, ~{
    nrow(.x) > 1
})], by = .(species_tax_id, superkingdom, phylum, class, order, 
    family, genus, species)]$V1], cols = "species_tax_id")[, 
    `:=`(traits = map(traits, ..process_duplicates)), by = .(species_tax_id, 
        superkingdom, phylum, class, order, family, genus, species)], 
    n = 6L)

  species_tax_id superkingdom phylum       class      order family genus species
           <dbl> <chr>        <chr>       

We merge by extracting the `species_tax_id` values from `tbl_munge_proc`, removing all rows with those identifiers from `tbl`, and replacing them with the processed rows from `tbl_munge_proc`. We also process some of the trait names and attempt to unify them with the nomenclature of the Madin et al. base database. 

In [11]:
ids <- tbl_munge_proc %>% pull(species_tax_id)

In [12]:
gold_final <- tbl %>% filter(!species_tax_id %in% ids)

gold_final <- bind_rows(as_tibble(gold_final), as_tibble(tbl_munge_proc)) %>% unnest(traits)
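
The replace-by-identifier pattern used above can be sketched on a hypothetical three-species table. Here `toy` stands in for `tbl` and `toy_proc` for `tbl_munge_proc`; both are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical nested table: one row per species, traits in a list-column
toy <- tibble(
    species_tax_id = c(1, 2, 3),
    traits = list(
        tibble(motility = "yes"),
        tibble(motility = c("yes", "no", "yes")),  # still needs collapsing
        tibble(motility = "no")
    )
)

# Stand-in for the processed subset: species 2 collapsed to a single row
toy_proc <- tibble(species_tax_id = 2,
                   traits = list(tibble(motility = "yes")))

# Drop the unprocessed rows, append the processed ones, then flatten
ids <- toy_proc %>% pull(species_tax_id)
toy_final <- toy %>%
    filter(!species_tax_id %in% ids) %>%
    bind_rows(toy_proc) %>%
    unnest(traits)
print(toy_final)
```

After `unnest()`, each species contributes exactly one row, with the processed species appended at the end rather than kept in its original position.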


In [13]:
head(gold_final)

species_tax_id,superkingdom,phylum,class,order,family,genus,species,gram_stain,pathways,metabolism,sporulation,motility,cell_shape
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
515635,Bacteria,Dictyoglomi,Dictyoglomia,Dictyoglomales,Dictyoglomaceae,Dictyoglomus,Dictyoglomus turgidum,positive,Cellulose degrader,anaerobic,,,bacillus
521011,Archaea,Euryarchaeota,Methanomicrobia,Methanomicrobiales,Methanoregulaceae,Methanosphaerula,Methanosphaerula palustris,positive,Methanogen,anaerobic,,no,coccus
498848,Bacteria,Deinococcus-Thermus,Deinococci,Thermales,Thermaceae,Thermus,Thermus aquaticus,negative,,obligate aerobic,no,no,bacillus
481743,Bacteria,Firmicutes,Bacilli,Bacillales,Paenibacillaceae,Paenibacillus,Paenibacillus sp. Y412MC10,positive,,facultative,yes,yes,bacillus
634499,Bacteria,Proteobacteria,Gammaproteobacteria,Enterobacterales,Erwiniaceae,Erwinia,Erwinia pyrifoliae,negative,,facultative,no,yes,bacillus
580327,Bacteria,Firmicutes,Clostridia,Thermoanaerobacterales,Thermoanaerobacterales Family III. Incertae Sedis,Thermoanaerobacterium,Thermoanaerobacterium thermosaccharolyticum,positive,,obligate anaerobic,yes,yes,bacillus


In [23]:
write_csv(gold_final, file = here("output", "databases", "gold_proc.csv"))
saveRDS(gold_final, file = here("output", "databases", "gold_proc.rds"))
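
One possible reason to keep both formats is that an RDS file preserves R column types exactly, while a CSV is re-parsed on reading. The sketch below uses base R and temporary files on a hypothetical two-row table; it does not touch the notebook's output folder.

```r
# Hypothetical two-row table with a numeric identifier and a missing trait
toy <- data.frame(species_tax_id = c(515635, 521011),
                  motility = c(NA, "no"))

# RDS round trip: the object comes back unchanged
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, rds_path)
from_rds <- readRDS(rds_path)

# CSV round trip: types are guessed again when the file is read
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)

print(identical(from_rds, toy))
print(class(from_csv$species_tax_id))
```

The CSV remains the more portable file for use outside R, so the two outputs serve different audiences.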

In [24]:
# list the written files, resolving the path from the project root like the cells above
list.files(here("output", "databases"))