# Process and export WGS data sets from Human Microbiome

Last updated: 2022-04-26.   
Quang Nguyen.    

Here we obtain metagenomic samples from the `curatedMetagenomicData` package for selected conditions. We filter the metadata and then retrieve both the relative abundance and the pathway abundance data sets. The conditions of interest are CRC (colorectal cancer) and IBD (inflammatory bowel disease). We also collect the samples labeled as controls.


In [27]:
library(curatedMetagenomicData)
library(tidyverse)
library(here)
library(piggyback)
here::i_am(file.path("notebooks", "retrieve_wgs.ipynb"))

here() starts at /Users/quangnguyen/research/microbe_set_trait



Here, we use the metadata to find the studies that cover our two conditions of interest: CRC (colorectal cancer) and IBD (inflammatory bowel disease). From those studies we keep the samples with the condition and the samples labeled as controls. Controls are controls for the condition under study; the `disease` column shows that some of them still carry other diagnoses.

We save pathway abundances as individual CSV files and taxonomic data as `TreeSummarizedExperiment` objects (to be converted into trait abundances later). We use `returnSamples()` to retrieve all relevant data.
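The two-step selection described above (first find the studies, then keep cases and controls within them) can be sketched on a toy table. This is only an illustration in base R so that it runs without the data package; `toy_metadata`, its study names, and its sample IDs are made up and are not part of `sampleMetadata`.

```r
# Toy stand-in for sampleMetadata; all rows are hypothetical.
toy_metadata <- data.frame(
  study_name      = c("StudyA", "StudyA", "StudyA", "StudyB", "StudyB", "StudyC"),
  sample_id       = c("a1", "a2", "a3", "b1", "b2", "c1"),
  study_condition = c("CRC", "control", "adenoma", "IBD", "control", "control"),
  stringsAsFactors = FALSE
)

# Step 1: studies that contain at least one sample with the target condition.
target <- "CRC"
target_studies <- unique(
  toy_metadata$study_name[toy_metadata$study_condition == target]
)

# Step 2: within those studies, keep only cases and controls.
keep <- toy_metadata$study_name %in% target_studies &
  toy_metadata$study_condition %in% c(target, "control")
toy_samples <- toy_metadata[keep, ]

print(toy_samples)
```

Note that controls from studies without any CRC sample (`StudyB` and `StudyC` in the toy table) are not selected, which keeps cases and controls matched by study.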

In [28]:
metadata <- as_tibble(sampleMetadata)

Studies that include CRC samples:

In [29]:
metadata %>% filter(study_condition == "CRC") %>% pull(study_name) %>% unique()

In [30]:
s_names <- metadata %>% filter(study_condition == "CRC") %>% pull(study_name) %>% unique()
samples <- metadata %>% filter(study_name %in% s_names, study_condition %in% c("CRC", "control"))
data <- returnSamples(samples, "relative_abundance", rownames = "NCBI")
# keep only the metadata columns needed downstream to make the object leaner
colData(data) <- colData(data)[, c("study_name", "disease", "study_condition")]
# "tse" in the file name stands for TreeSummarizedExperiment
saveRDS(data, file = here("data", "pred_relabun_crc_wgs_tse.rds"))

snapshotDate(): 2021-10-19


$`2021-03-31.FengQ_2015.relative_abundance`
dropping rows without rowTree matches:
  k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Atopobiaceae|g__Olsenella|s__Olsenella_profusa
  k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Collinsella|s__Collinsella_stercoris
  k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Enorma|s__[Collinsella]_massiliensis
  k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Bacillales_unclassified|g__Gemella|s__Gemella_bergeri
  k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Carnobacteriaceae|g__Granulicatella|s__Granulicatella_elegans
  k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus|s__Ruminococcus_champanellensis
  k__Bacteria|p__Firmicutes|c__Erysipelotrichia|o__Erysipelotrichales|f__Erysipelotrichaceae|g__Bulleidia|s__Bulleidia_extructa
  k__Bacteria|p__Pr

In [31]:
path <- returnSamples(samples, "pathway_abundance")
# drop the stratified (per-taxon) pathway rows, whose names contain "|"
path <- path[!str_detect(rownames(path), "\\|"), ]
# subset the metadata of `path` itself so the labels follow its own sample order
colData(path) <- colData(path)[, c("study_name", "disease", "study_condition")]

snapshotDate(): 2021-10-19
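
To illustrate the stratified-row filter used in the cell above, here is a small sketch on a toy matrix. It assumes the usual naming in which a stratified row is the pathway name followed by `|` and a taxon label; the matrix values, sample names, and the `toy_` objects are made up, and base R `grepl()` stands in for `str_detect()` so the block has no package dependencies.

```r
# Toy pathway-by-sample matrix; values and sample names are hypothetical.
toy_path <- matrix(
  c(0.20,   0.25,
    0.0008, 0.0007,
    0.0005, 0.0004,
    0.0003, 0.0003),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("UNMAPPED",
      "PWY-6737: starch degradation V",
      "PWY-6737: starch degradation V|g__Bacteroides.s__Bacteroides_vulgatus",
      "PWY-6737: starch degradation V|unclassified"),
    c("sample1", "sample2")
  )
)

# A literal "|" marks a stratified row; fixed = TRUE avoids regex escaping.
is_stratified <- grepl("|", rownames(toy_path), fixed = TRUE)

# drop = FALSE keeps the matrix shape even if a single row remains.
toy_unstratified <- toy_path[!is_stratified, , drop = FALSE]

rownames(toy_unstratified)
```

Only the community-level rows remain, which matches the row names seen in the `head(assay(path))` output.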



In [32]:
colData(path)
head(assay(path))

DataFrame with 1395 rows and 3 columns
                     study_name                disease study_condition
                    <character>            <character>     <character>
SID31004             FengQ_2015 CRC;fatty_liver;hype..             CRC
SID31009             FengQ_2015 fatty_liver;hyperten..         control
SID31021             FengQ_2015                healthy         control
SID31071             FengQ_2015            fatty_liver         control
SID31112             FengQ_2015            fatty_liver         control
...                         ...                    ...             ...
CCIS95097901ST-4-0 ZellerG_2014                healthy         control
CCIS95409808ST-4-0 ZellerG_2014                healthy         control
CCIS98482370ST-3-0 ZellerG_2014                healthy         control
CCIS98512455ST-4-0 ZellerG_2014                    CRC             CRC
CCIS98832363ST-4-0 ZellerG_2014                    CRC             CRC

pathway,SID31004,SID31009,SID31021,SID31071,SID31112,SID31129,SID31159,SID31160,SID31188,SID31219,⋯,CCIS90164298ST-4-0,CCIS91228662ST-4-0,CCIS93040568ST-20-0,CCIS94417875ST-3-0,CCIS94603952ST-4-0,CCIS95097901ST-4-0,CCIS95409808ST-4-0,CCIS98482370ST-3-0,CCIS98512455ST-4-0,CCIS98832363ST-4-0
UNMAPPED,0.18532,0.283803,0.20424,0.21606,0.20954,0.188614,0.119249,0.269025,0.299294,0.206811,⋯,0.348917,0.220432,0.28214,0.377796,0.21917,0.204407,0.267223,0.294681,0.254781,0.28439
UNINTEGRATED,0.752336,0.663467,0.736379,0.720423,0.727546,0.742973,0.804209,0.676189,0.650754,0.728276,⋯,0.603559,0.735687,0.681455,0.588922,0.724008,0.739987,0.683444,0.667413,0.695994,0.670673
PWY-6737: starch degradation V,0.000872864,0.000731553,0.000826433,0.000834292,0.000768542,0.00075252,0.000585599,0.000819843,0.00073335,0.000613006,⋯,0.000560943,0.000499765,0.000429311,0.000316972,0.00076788,0.000767049,0.000582825,0.000655082,0.000526094,0.000562393
PWY-1042: glycolysis IV (plant cytosol),0.000862913,0.000747766,0.000895665,0.000971838,0.000813784,0.00088778,0.000558261,0.000997893,0.000771418,0.000659373,⋯,0.000729978,0.00055262,0.000381822,0.000493607,0.000687855,0.000751392,0.000677932,0.000529453,0.000511968,0.000637122
ILEUSYN-PWY: L-isoleucine biosynthesis I (from threonine),0.000799992,0.000701581,0.000845111,0.000996022,0.000738284,0.00087116,0.000679917,0.000859983,0.00058229,0.000697518,⋯,0.000650104,0.000470803,0.000398481,0.000284959,0.000674522,0.000668004,0.000589544,0.000458247,0.000530179,0.000526148
PWY-7111: pyruvate fermentation to isobutanol (engineered),0.000799992,0.000701581,0.000845111,0.000996022,0.000729398,0.00087116,0.000765416,0.000859983,0.000614498,0.000697518,⋯,0.000650104,0.000470803,0.000398481,0.000284959,0.000674522,0.000668004,0.000589544,0.000458247,0.000530179,0.000526148


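`colData` holds the labels and other sample metadata, while `assay` holds the features. The assay stores pathways in rows and samples in columns, so we transpose it to put samples in rows and then export.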
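
The effect of the transpose on the exported layout can be checked with a toy matrix. This is a sketch with made-up values and sample names (`s1` to `s3`), written to a temporary file rather than the project's `data` folder.

```r
# Toy feature matrix in the assay orientation: pathways in rows, samples in columns.
toy_assay <- matrix(
  c(0.20, 0.25, 0.30,
    0.75, 0.70, 0.65),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("UNMAPPED", "UNINTEGRATED"), c("s1", "s2", "s3"))
)

# t() puts samples in rows and pathways in columns.
toy_feat <- t(toy_assay)

# Write to a temporary file and read it back to inspect the layout.
tmp <- tempfile(fileext = ".csv")
write.csv(toy_feat, tmp)
# check.names = FALSE keeps column names unchanged when reading back.
toy_back <- read.csv(tmp, row.names = 1, check.names = FALSE)

dim(toy_back)
```

When reading the real feature files back, `check.names = FALSE` matters because pathway names such as `PWY-6737: starch degradation V` contain spaces and colons that `read.csv()` would otherwise rewrite.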

In [33]:
path_abun <- assay(path)
write.csv(t(path_abun), here("data", "pred_pathway_crc_feat.csv"))

In [34]:
meta_path <- colData(path)
write.csv(meta_path, here("data", "pred_pathway_crc_metadata.csv"))
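
Because features and metadata are written to separate files, it can be worth confirming that their rows refer to the same samples in the same order before modeling. The following is a minimal sketch of such a check on toy objects; `toy_feat` and `toy_meta` are hypothetical and deliberately start out in different orders.

```r
# Toy samples-by-pathways features and a metadata table in a different row order.
toy_feat <- matrix(
  c(0.2, 0.3, 0.7, 0.6),
  nrow = 2,
  dimnames = list(c("s1", "s2"), c("UNMAPPED", "UNINTEGRATED"))
)
toy_meta <- data.frame(
  study_condition = c("control", "CRC"),
  row.names = c("s2", "s1"),
  stringsAsFactors = FALSE
)

# Reorder the metadata so its rows follow the feature rows.
toy_meta_aligned <- toy_meta[rownames(toy_feat), , drop = FALSE]

# Stop early if the sample IDs do not line up.
stopifnot(identical(rownames(toy_feat), rownames(toy_meta_aligned)))
```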

We repeat the same steps for the IBD data set.

In [35]:
metadata %>% filter(study_condition == "IBD") %>% pull(study_name) %>% unique()

In [36]:
s_names <- metadata %>% filter(study_condition == "IBD") %>% pull(study_name) %>% unique()
samples <- metadata %>% filter(study_name %in% s_names, study_condition %in% c("IBD", "control"))
data <- returnSamples(samples, "relative_abundance", rownames = "NCBI")
# keep only the metadata columns needed downstream to make the object leaner
colData(data) <- colData(data)[, c("study_name", "disease", "study_condition")]
# "tse" in the file name stands for TreeSummarizedExperiment
saveRDS(data, file = here("data", "pred_relabun_ibd_wgs_tse.rds"))
path <- returnSamples(samples, "pathway_abundance")
# drop the stratified (per-taxon) pathway rows, whose names contain "|"
path <- path[!str_detect(rownames(path), "\\|"), ]
# subset the metadata of `path` itself so the labels follow its own sample order
colData(path) <- colData(path)[, c("study_name", "disease", "study_condition")]

snapshotDate(): 2021-10-19


$`2021-10-14.HallAB_2017.relative_abundance`
dropping rows without rowTree matches:
  k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Atopobiaceae|g__Olsenella|s__Olsenella_profusa
  k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Collinsella|s__Collinsella_stercoris
  k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Carnobacteriaceae|g__Granulicatella|s__Granulicatella_elegans
  k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus|s__Ruminococcus_champanellensis
  k__Bacteria|p__Firmicutes|c__Erysipelotrichia|o__Erysipelotrichales|f__Erysipelotrichaceae|g__Bulleidia|s__Bulleidia_extructa
  k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales|f__Sutterellaceae|g__Sutterella|s__Sutterella_parvirubra
  k__Bacteria|p__Synergistetes|c__Synergistia|o__Synergistales|f__Synergistaceae|g__Cloacibacillus|s__Cloacibacillus_evryensis




In [37]:
colData(path)
head(assay(path))

DataFrame with 2881 rows and 3 columns
                              study_name     disease study_condition
                             <character> <character>     <character>
SKST006_6_G102964            HallAB_2017         IBD             IBD
SKST006_7_G102965            HallAB_2017         IBD             IBD
SKST006_4_G102962            HallAB_2017         IBD             IBD
SKST006_5_G102963            HallAB_2017         IBD             IBD
SKST006_2_G102960            HallAB_2017         IBD             IBD
...                                  ...         ...             ...
EGAR00001773343_1000IBD00723 VilaAV_2018         IBD             IBD
EGAR00001773344_1000IBD01328 VilaAV_2018         IBD             IBD
EGAR00001773345_1000IBD01329 VilaAV_2018         IBD             IBD
EGAR00001773346_1000IBD01330 VilaAV_2018         IBD             IBD
EGAR00001773347_1000IBD01332 VilaAV_2018         IBD             IBD

pathway,SKST006_6_G102964,SKST006_7_G102965,SKST006_4_G102962,SKST006_5_G102963,SKST006_2_G102960,SKST006_3_G102961,SKST006_10_G102994,SKST006_1_G102959,SKST006_9_G103014,SKST027_3_G102945,⋯,EGAR00001773338_1000IBD00708,EGAR00001773339_1000IBD00711,EGAR00001773340_1000IBD00715,EGAR00001773341_1000IBD00720,EGAR00001773342_1000IBD00722,EGAR00001773343_1000IBD00723,EGAR00001773344_1000IBD01328,EGAR00001773345_1000IBD01329,EGAR00001773346_1000IBD01330,EGAR00001773347_1000IBD01332
UNMAPPED,0.20563,0.209946,0.187781,0.21859,0.232418,0.246693,0.218541,0.239545,0.24522,0.252768,⋯,0.414955,0.324406,0.323949,0.449578,0.509086,0.411191,0.288359,0.45749,0.382657,0.380637
UNINTEGRATED,0.750061,0.742283,0.766908,0.737545,0.720824,0.706614,0.737843,0.711877,0.70358,0.699368,⋯,0.54344,0.61572,0.627511,0.51272,0.457353,0.545759,0.664042,0.499267,0.567415,0.57189
PWY-6737: starch degradation V,0.000681276,0.000639302,0.000628194,0.000687507,0.000684065,0.000644857,0.000539748,0.000670355,0.000722124,0.000677463,⋯,0.000503671,0.000783759,0.000553519,0.000554002,0.000476965,0.000561188,0.000504029,0.000512983,0.000459279,0.000549292
PWY-1042: glycolysis IV (plant cytosol),0.000609668,0.000641384,0.000591613,0.000685281,0.000644436,0.000656113,0.000582671,0.000775793,0.000801949,0.000694438,⋯,0.000562172,0.000777546,0.000601775,0.000596447,0.000411513,0.000679171,0.000536813,0.000543392,0.000526592,0.000625602
PWY-5686: UMP biosynthesis,0.000576416,0.000574242,0.000552523,0.000526368,0.000595218,0.000601026,0.000563912,0.000638705,0.00063192,0.00059602,⋯,0.000535471,0.000643761,0.000549748,0.00047227,0.000453503,0.000428509,0.000469401,0.000538435,0.000524443,0.000487242
PWY-6163: chorismate biosynthesis from 3-dehydroquinate,0.000564095,0.000513062,0.00050037,0.000482707,0.000595965,0.000607155,0.000487917,0.000654172,0.000691006,0.000584941,⋯,0.000394237,0.000660835,0.000550726,0.000444953,0.00038439,0.000457283,0.000401115,0.000486368,0.000428287,0.000393119


In [38]:
path_abun <- assay(path)
write.csv(t(path_abun), here("data", "pred_pathway_ibd_feat.csv"))
meta_path <- colData(path)
write.csv(meta_path, here("data", "pred_pathway_ibd_metadata.csv"))