## Summarize microbiome data by genus

The goal of this notebook is to summarize microbiome data according to host taxonomy. This includes several measures of richness and evenness, as well as the most abundant microbial group at several levels of taxonomy. These are then merged with the disease data to produce a final trait table.

## Import necessary libraries

First we'll import the libraries we need, taking some care to accommodate older versions of QIIME2.

In [30]:
from qiime2 import Artifact
from qiime2.plugins.feature_table.methods import filter_samples
from qiime2.plugins.taxa.methods import filter_table,collapse

#The try/except block below is unsightly, but the alpha functions moved between recent versions of QIIME2
#and it's nice if the notebook is compatible with either
try:
    from qiime2.plugins.diversity.methods import alpha
    from qiime2.plugins.diversity.methods import alpha_phylogenetic
except ImportError:
    #Only fall back when the import itself fails, so unrelated errors aren't hidden
    from qiime2.plugins.diversity.pipelines import alpha
    from qiime2.plugins.diversity.pipelines import alpha_phylogenetic
    
from qiime2.plugins.diversity.visualizers import alpha_group_significance

from qiime2.plugins.feature_table.methods import rarefy
from qiime2.plugins.feature_table.visualizers import summarize
from qiime2.plugins.feature_table.methods import filter_features

from qiime2.metadata import Metadata
from os.path import abspath,exists,join
from os import mkdir
import shutil

from IPython.core.display import HTML

import pandas as pd
import numpy as np
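
The same fallback idea can be written as a small helper that tries several module paths in order. This is a minimal sketch using only the standard library; `import_first_available` is a hypothetical helper (not part of QIIME2), and the module names in the usage line are placeholders chosen so the example runs anywhere.

```python
from importlib import import_module


def import_first_available(module_paths):
    """Return the first module in module_paths that imports successfully.

    Hypothetical helper for illustration; raises ImportError if none import.
    """
    errors = []
    for path in module_paths:
        try:
            return import_module(path)
        except ImportError as error:
            # Remember why this candidate failed, then try the next one
            errors.append(f"{path}: {error}")
    raise ImportError(
        "None of the candidate modules could be imported:\n" + "\n".join(errors)
    )


# Placeholder names: the first module does not exist, so the helper falls back to json
module = import_first_available(["module_that_does_not_exist_xyz", "json"])
print(module.__name__)
```

Catching only `ImportError` keeps other failures visible, such as a typo inside an imported module.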

## (Optional) Check what files are in the input directory

It can be useful to manually confirm that all the files are where we expect before we begin.

In [2]:
from os import listdir
listdir("../input")

['feature_tables',
 'feature_table_greengenes_mucus_1000.qza',
 'silva_metaxa2_reference_taxonomy.qza',
 'disease_by_genus.csv',
 'feature_table_silva_metaxa2_tissue_1000.qza',
 'feature_table_greengenes_skeleton_1000.qza',
 'feature_table_greengenes_mucus.qza',
 'insertion-placements_GCMP.qza',
 'silva-138-99-nb-classifier.qza',
 'all_seqs.fna',
 'physeq.noncont-feature-table.qza',
 'GCMP_decontaminated_1000_tree-rooted.qza',
 'feature_table_greengenes_tissue_1000.qza',
 'all_seqs.qza',
 'GCMP_decontaminated_1000_tax.qza',
 'feature_table_greengenes_skeleton.qza',
 'feature_table_greengenes_metaxa2_skeleton.qza',
 'feature_table_greengenes_metaxa2_mucus.qza',
 'feature_table_greengenes_metaxa2_tissue.qza',
 'physeq.noncton-rooted-tree.qza',
 'darling_2019_LHS_all.csv',
 'insertion-tree_GCMP.qza',
 'GCMP_EMP_map_r28_no_empty_samples.txt',
 'feature_table_silva_metaxa2_mucus_1000.qza',
 'insertion-tree_silva_GCMP.qza',
 'feature_table_greengenes_tissue.qza',
 'feature_table_silva_metaxa

## Define paths for all required input files

We'll save the expected filename of each required input file to a variable. This lets us check that all are present before we do any work inside QIIME2.

In [3]:
#Set paths for all required files.
#Currently all are assumed to be directly in the ../input directory
#(if they are somewhere else, just modify the input_directory variable accordingly)
input_directory = abspath("../input")

mapping_file = join(input_directory,"GCMP_EMP_map_r28_no_empty_samples.txt")
taxonomy_file = join(input_directory,"silva_metaxa2_reference_taxonomy.qza") 
tree_file = join(input_directory,"physeq_rooted_tree.qza")
feature_table_all_path = join(input_directory,"feature_tables/feature_table_decon_all_1000.qza")
feature_table_mucus_path = join(input_directory,"feature_tables/feature_table_decon_mucus_1000.qza")
feature_table_tissue_path = join(input_directory,"feature_tables/feature_table_decon_tissue_1000.qza")
feature_table_skeleton_path = join(input_directory,"feature_tables/feature_table_decon_skeleton_1000.qza")

disease_data_path = join(input_directory,"disease_by_genus.csv")


## (Optional) Check that all required files are present

If any are missing, it's easier to find out now, before we've spent a long time processing the data.

In [4]:
required_files = [mapping_file,taxonomy_file,tree_file,feature_table_all_path,feature_table_mucus_path,feature_table_tissue_path,\
  feature_table_skeleton_path,disease_data_path]

for required_filepath in required_files:
    
    if not exists(required_filepath):
        raise ValueError(f"Filepath {required_filepath} is needed for this workflow but isn't present. Could it be named something else or in another directory?")

    print(f"Filepath {required_filepath} exists and is accessible...OK")

Filepath /home/qiime2/GCMP/input/GCMP_EMP_map_r28_no_empty_samples.txt exists and is accessible...OK
Filepath /home/qiime2/GCMP/input/silva_metaxa2_reference_taxonomy.qza exists and is accessible...OK
Filepath /home/qiime2/GCMP/input/physeq_rooted_tree.qza exists and is accessible...OK
Filepath /home/qiime2/GCMP/input/feature_tables/feature_table_decon_all_1000.qza exists and is accessible...OK
Filepath /home/qiime2/GCMP/input/feature_tables/feature_table_decon_mucus_1000.qza exists and is accessible...OK
Filepath /home/qiime2/GCMP/input/feature_tables/feature_table_decon_tissue_1000.qza exists and is accessible...OK
Filepath /home/qiime2/GCMP/input/feature_tables/feature_table_decon_skeleton_1000.qza exists and is accessible...OK
Filepath /home/qiime2/GCMP/input/disease_by_genus.csv exists and is accessible...OK
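
The check above stops at the first missing file. If several files might be missing, it can be handy to report them all in one pass. The sketch below is one way to do that; `find_missing_files` is a hypothetical helper and the example paths are placeholders, not files from this project.

```python
from os.path import exists


def find_missing_files(filepaths):
    """Return the subset of filepaths that do not exist (hypothetical helper)."""
    return [path for path in filepaths if not exists(path)]


# Placeholder paths for illustration only
example_paths = [
    "/path/that/should/not/exist/table.qza",
    "/path/that/should/not/exist/tree.qza",
]
missing = find_missing_files(example_paths)
if missing:
    print("Missing files:")
    for path in missing:
        print(" ", path)
```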


## Create an output directory if one doesn't exist

In [5]:
from os import makedirs

output_dir = abspath("../output/")
ft_output_dir = join(output_dir,"feature_tables")

if not exists(ft_output_dir):
    print(f"Output directory {ft_output_dir} does not yet exist, creating it...")
    #makedirs also creates the parent ../output directory if it is missing
    makedirs(ft_output_dir)
    print("Done.")

## Load feature tables, the coral tree, and metadata into QIIME2 as Artifacts

In [6]:
tree = Artifact.load(tree_file)
taxonomy = Artifact.load(taxonomy_file)
metadata = Metadata.load(mapping_file)
feature_table_decon_all_1000 = Artifact.load(feature_table_all_path)
feature_table_decon_mucus_1000 = Artifact.load(feature_table_mucus_path)
feature_table_decon_tissue_1000 = Artifact.load(feature_table_tissue_path)
feature_table_decon_skeleton_1000 = Artifact.load(feature_table_skeleton_path)

#create a dictionary to hold each feature table so they are easy to access in loops
feature_tables_decon_1000 = {"mucus":feature_table_decon_mucus_1000, "tissue":feature_table_decon_tissue_1000, "skeleton":feature_table_decon_skeleton_1000, "all":feature_table_decon_all_1000}


## Use the Metadata file to find a list of all unique coral genera in the study

In [22]:
# Find unique genus names
df = metadata.to_dataframe()
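#Rename Caulastraea to Caulastrea to match the disease data.
#(This changes only the DataFrame; the metadata object used for filtering below is unchanged.)
df['host_genus'] = df['host_genus'].replace(['Caulastraea'],'Caulastrea')
#Filter the metadata so that only samples from Australia are left.
df_aust = df[df["political_area"]=="Australia"]
#Note: despite the variable name, these are genus names, since they come from the host_genus column
unique_species_names = list(set(list(df_aust['host_genus'])))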
print(f"There are {len(unique_species_names)} unique species names in the dataset:")
print(sorted(unique_species_names))

There are 35 unique species names in the dataset:
['Acropora', 'Alveopora', 'Astrea', 'Caulastrea', 'Cyphastrea', 'Diploastrea', 'Dipsastraea', 'Echinophyllia', 'Echinopora', 'Favites', 'Fungid', 'Galaxea', 'Goniastrea', 'Heliopora', 'Homophyllia', 'Hydnophora', 'Isopora', 'Leptastrea', 'Lobophyllia', 'Lobophytum', 'Merulina', 'Millepora', 'Montipora', 'Not applicable', 'Pachyseris', 'Palythoa', 'Pavona', 'Physogyra', 'Platygyra', 'Pocillopora', 'Porites', 'Psammocora', 'Seriatopora', 'Stylophora', 'Turbinaria']


## Filter feature tables so that they only contain data from Australia

In [23]:
# Filter each feature table to only Australian samples
filtered_feature_tables = {}
for compartment,table in feature_tables_decon_1000.items():
    feature_table_results = filter_samples(table,metadata=metadata,where= "political_area='Australia'")
    filtered_table = feature_table_results.filtered_table
    filtered_feature_tables[compartment]=filtered_table
    
print("Filtered features tables:", filtered_feature_tables)

Filtered features tables: {'mucus': <artifact: FeatureTable[Frequency] uuid: 645e03ff-b920-443a-b853-0e0861132324>, 'tissue': <artifact: FeatureTable[Frequency] uuid: 68465527-7a84-4de2-a3b9-0719f7579dde>, 'skeleton': <artifact: FeatureTable[Frequency] uuid: 4d82854f-0073-4ff4-b49a-0e07d410bffa>, 'all': <artifact: FeatureTable[Frequency] uuid: 35462808-3757-43de-b0a8-a22234ecb304>}
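
QIIME2 interprets the `where` argument as a SQLite WHERE clause evaluated against the sample metadata. The sketch below imitates that selection step with Python's built-in `sqlite3` module. It is an illustration, not QIIME2's own code: the table name `metadata`, the sample IDs, and the values are all invented.

```python
import sqlite3

# Toy sample metadata; the sample IDs and values are invented for illustration
rows = [
    ("sample1", "Australia", "Acropora"),
    ("sample2", "Panama", "Porites"),
    ("sample3", "Australia", "Porites"),
]

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE metadata (id TEXT, political_area TEXT, host_genus TEXT)"
)
connection.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)


def select_ids(where):
    """Return the sample IDs whose metadata satisfy a SQLite WHERE clause."""
    query = f"SELECT id FROM metadata WHERE {where} ORDER BY id"
    return [row[0] for row in connection.execute(query)]


print(select_ids("political_area='Australia'"))
print(select_ids("[host_genus] = 'Porites'"))
```

The square brackets in the second clause are SQLite's way of quoting a column name. That is why a bracketed form such as `[host_genus] = '...'` works as a `where` string.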


## Define a function for calculating the most abundant microbe in each host taxon

We want to know, for some level of taxonomic specificity, what the most abundant microbial group was in a particular taxon of hosts. For example, were Proteobacteria the most abundant phylum of microbes in all corals?

**Note:** we don't call this function directly; it is instead called by the `calculate_per_species_diversities` function.
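
To make the idea concrete, here is a minimal sketch of picking the taxon with the highest mean abundance from a table that has already been collapsed to one taxonomic level. It uses plain dictionaries rather than QIIME2 artifacts; `most_abundant_taxon` is a hypothetical name and the counts are invented.

```python
from statistics import mean

# Toy collapsed table: taxon -> counts in each sample (values invented for illustration)
taxon_counts = {
    "Proteobacteria": [600, 450, 700],
    "Bacteroidetes": [300, 500, 200],
    "Firmicutes": [100, 50, 100],
}


def most_abundant_taxon(counts_by_taxon):
    """Return the taxon with the highest mean count, or None for an empty table."""
    if not counts_by_taxon:
        return None
    mean_by_taxon = {taxon: mean(counts) for taxon, counts in counts_by_taxon.items()}
    # Rank the taxon names by mean abundance; ties go to the first taxon listed
    return max(mean_by_taxon, key=mean_by_taxon.get)


print(most_abundant_taxon(taxon_counts))
```

Keying the dictionary by taxon name rather than by mean value means two taxa with the same mean cannot overwrite each other.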

In [24]:
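#Find the dominant (most abundant) microbial taxon at a given taxonomic level
def get_dominant_taxon(feature_table,taxonomy,level=5):
    """Collapse the feature table at the specified level, then find which feature is most abundant.

    Returns the name of the taxon with the highest mean abundance across samples,
    or None if the table can't be collapsed or has no taxa."""
    try:
        collapse_results = collapse(feature_table,taxonomy,level)
        taxon_table = collapse_results.collapsed_table
        taxon_df = taxon_table.view(pd.DataFrame)
    except TypeError:
        return None
    #Calculate average abundance of each taxon in this species.
    #Taxa with identical means share a dictionary key, so only the last such taxon is kept.
    taxon_abundance_dict = {taxon_df[col].mean():col for col in list(taxon_df.columns)}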
    
    mean_abundance = sorted(taxon_abundance_dict.keys())
    print(mean_abundance)
    if not mean_abundance:
        return None
    greatest_mean_abundance = mean_abundance[-1]
    most_abundant_taxon = taxon_abundance_dict[greatest_mean_abundance]
    print("Most abundant taxon:",most_abundant_taxon)
    return most_abundant_taxon

## Define a function for calculating alpha diversity within host taxa

We'll keep the specific column for the host taxon generic, so we can calculate per-species alpha diversity using e.g. the 'host_species' column, or per-genus diversity using 'host_genus' as the column.
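
The value of a generic grouping column can be seen in a small sketch: the same averaging code works per genus or per species just by changing the column name. The example uses plain Python records rather than QIIME2 metadata; `mean_by_group` is a hypothetical helper and the values are invented.

```python
from collections import defaultdict
from statistics import mean

# Toy per-sample records; the values are invented for illustration
samples = [
    {"host_genus": "Acropora", "host_species": "Acropora hyacinthus", "observed_features": 120},
    {"host_genus": "Acropora", "host_species": "Acropora tenuis", "observed_features": 80},
    {"host_genus": "Porites", "host_species": "Porites lutea", "observed_features": 200},
]


def mean_by_group(records, group_column, value_column):
    """Average value_column within each distinct value of group_column."""
    values = defaultdict(list)
    for record in records:
        values[record[group_column]].append(record[value_column])
    return {group: mean(group_values) for group, group_values in values.items()}


# Same function, two levels of host taxonomy
print(mean_by_group(samples, "host_genus", "observed_features"))
print(mean_by_group(samples, "host_species", "observed_features"))
```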

In [25]:
def calculate_per_species_diversities(feature_table,\
                                      metadata,\
                                      species_column = "host_genus",\
                                      compartment_name = 'all',\
                                      taxonomy = None,\
                                      metrics = ['faith_pd','observed_features','gini_index','dominance','simpson_e'],\
                                      to_skip = ['none','','Not applicable','Missing: Not collected']\
                                      ):
    #Set up a DataFrame to hold results
    results_columns = [species_column,f"n_samples_{compartment_name}"]
    results_columns.extend([f"{metric}_{compartment_name}" for metric in metrics])
    taxonomy_levels = ('domain','phylum','class','order','family','genus')
    taxonomy_labels = [f"most_abundant_{level}_{compartment_name}" for level in taxonomy_levels]
    results_columns.extend(taxonomy_labels)
    print("Result columns:",results_columns)
    results_df = pd.DataFrame(columns = results_columns)
    results_df = results_df.set_index(species_column)
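    #Use the Australia-only metadata (the global df_aust defined earlier), so we only
    #loop over host taxa sampled in Australia
    metadata_df = df_aust
    #metadata_df = metadata.to_dataframe()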
    unique_species_names = list(set(list(metadata_df[species_column])))
    for species in unique_species_names:
        if species in to_skip:
            continue
            
        #Filter the feature table to just our current species
        where = f"[{species_column}] = '{species}'"
        filter_results = filter_samples(feature_table, metadata = metadata,where = where)
        species_table = filter_results.filtered_table
        #We'll use the species table, not the overall feature table from here on down!
        
        print("Analyzing taxon:",species)
        #If taxonomy is provided, summarize the type of microbe with highest average
        #abundance at each taxonomic level
        if taxonomy:
            for i, taxon_label in enumerate(taxonomy_levels):
                level = i + 1 #domain is level 1 in QIIME2, not level 0
                most_abundant_taxon = get_dominant_taxon(species_table,taxonomy,level=level)
                column_label = f"most_abundant_{taxon_label}_{compartment_name}"
                results_df.loc[species,column_label] = most_abundant_taxon
            
        
        #Track n for this species only, so a value from a previous species is never reused
        species_n = None
        for metric in metrics:
            #print(f"Calculating {metric} for {species}")
            try:
                if metric == "faith_pd":
                    alpha_results = alpha_phylogenetic(species_table,phylogeny=tree,metric=metric)
                else:
                    alpha_results = alpha(species_table, metric = metric)
            except ValueError:
                print(f"Can't calculate {metric} for {species} {compartment_name}...skipping")
                continue

            alpha_diversity = alpha_results.alpha_diversity
            species_adiv = alpha_diversity.view(pd.Series)

            species_mean = species_adiv.mean()
            results_df.loc[species,f"{metric}_{compartment_name}"] = species_mean
            print(f"{species}\t{metric}\t{compartment_name}\t{round(species_mean,4)}")
            species_n = len(species_adiv)

        #Record n from the last metric that succeeded (it should be the same for all metrics)
        if species_n is not None:
            results_df.loc[species,f"n_samples_{compartment_name}"]=species_n
    return results_df
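
For intuition about what some of the default metrics measure, the sketch below computes three of them on toy count vectors using their textbook definitions. These functions are illustrative only and are not the implementations QIIME2 calls. `faith_pd` (which needs a tree) and `gini_index` are left out, and the counts are invented.

```python
def observed_features(counts):
    """Richness: the number of features with a nonzero count."""
    return sum(1 for count in counts if count > 0)


def dominance(counts):
    """Sum of squared relative abundances; 1.0 means one feature holds all reads."""
    total = sum(counts)
    return sum((count / total) ** 2 for count in counts)


def simpson_evenness(counts):
    """Simpson's evenness: inverse dominance divided by richness (1.0 is perfectly even)."""
    return (1 / dominance(counts)) / observed_features(counts)


# Two toy samples with the same richness but very different evenness
even_sample = [25, 25, 25, 25]
skewed_sample = [97, 1, 1, 1]

for name, sample in [("even", even_sample), ("skewed", skewed_sample)]:
    print(
        name,
        observed_features(sample),
        round(dominance(sample), 4),
        round(simpson_evenness(sample), 4),
    )
```

Both samples contain four features, so richness alone cannot tell them apart. Dominance and evenness capture how unequally the reads are spread.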


## Generate a microbial results .tsv file for each compartment

In [26]:
results_dfs = {} 
for compartment,table in sorted(filtered_feature_tables.items(),reverse=True):
    result_df = calculate_per_species_diversities(table,metadata,compartment_name = compartment,taxonomy = taxonomy)
    species_adiv_path = join(output_dir,f"adiv_trait_table_{compartment}_australia.tsv")
    results_dfs[compartment]=result_df
    result_df.to_csv(species_adiv_path,sep="\t")


Result columns: ['host_genus', 'n_samples_tissue', 'faith_pd_tissue', 'observed_features_tissue', 'gini_index_tissue', 'dominance_tissue', 'simpson_e_tissue', 'most_abundant_domain_tissue', 'most_abundant_phylum_tissue', 'most_abundant_class_tissue', 'most_abundant_order_tissue', 'most_abundant_family_tissue', 'most_abundant_genus_tissue']
Analyzing taxon: Lobophyllia
[4.818181818181818, 26.181818181818183, 969.0]
Most abundant taxon: D_0__Bacteria
[0.09090909090909091, 0.18181818181818182, 0.2727272727272727, 0.36363636363636365, 0.45454545454545453, 0.7272727272727273, 1.0, 1.0909090909090908, 1.5454545454545454, 1.6363636363636365, 1.8181818181818181, 1.9090909090909092, 2.5454545454545454, 4.818181818181818, 5.090909090909091, 5.7272727272727275, 6.818181818181818, 7.090909090909091, 10.363636363636363, 18.0, 19.181818181818183, 20.636363636363637, 20.90909090909091, 22.272727272727273, 24.363636363636363, 68.36363636363636, 106.27272727272727, 641.0909090909091]
Most abundant taxo

## Integrate these new microbial data with existing disease data

In [36]:
#Integrate data into a per-genus table
from IPython.display import HTML,display

disease_df = pd.read_csv(disease_data_path)
disease_df = disease_df.set_index('host_genus')
for compartment,results_df in results_dfs.items():
    #Index.rename returns a new index, so assign it back for the name to take effect
    results_df.index = results_df.index.rename('host_genus')
    disease_df = pd.merge(disease_df, results_df, how="outer", left_index = True, right_index = True, indicator=False)
disease_df.to_csv("../output/GCMP_trait_table_genus_australia.tsv",sep="\t")
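
An outer merge keeps every genus that appears in either table and leaves gaps where one side has no data. The plain-Python sketch below mimics that behaviour on two tiny tables to show why some rows end up without microbiome values. It is a stand-in for illustration, not the pandas implementation; `outer_join` and the column name `disease_percent` are hypothetical, and the values are invented.

```python
# Toy tables keyed by host genus; all values are invented for illustration
disease = {"Acropora": {"disease_percent": 5.0}, "Orbicella": {"disease_percent": 9.0}}
microbiome = {"Acropora": {"n_samples_all": 12}, "Porites": {"n_samples_all": 30}}


def outer_join(left, right):
    """Combine two genus-keyed tables, keeping genera found in either one."""
    columns = sorted(
        {col for table in (left, right) for row in table.values() for col in row}
    )
    joined = {}
    for genus in sorted(set(left) | set(right)):
        row = dict.fromkeys(columns)  # every column starts as None (missing)
        row.update(left.get(genus, {}))
        row.update(right.get(genus, {}))
        joined[genus] = row
    return joined


for genus, row in outer_join(disease, microbiome).items():
    print(genus, row)
```

Genera with disease data but no microbiome samples keep a missing `n_samples_all`, which is what the next cell uses to drop them.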

In [48]:
import numpy as np
#Delete rows that do not have microbiome data (i.e. corals not from Australia)
df = pd.read_csv("../output/GCMP_trait_table_genus_australia.tsv",sep="\t")
#np.nan is used here because the np.NaN alias was removed in NumPy 2.0
disease_df = df.replace('',np.nan)
disease_df_filtered = disease_df.dropna(subset=['n_samples_all'])
disease_df_filtered.to_csv("../output/GMP_trait_table_genus_australia_only.tsv",sep="\t")