## Effects of Mitochondrial Removal Protocol on Coral Microbiome Alpha and Beta Diversity, accounting for rarefaction

This notebook tests how the choice of mitochondrial annotation and removal method influences coral microbiome alpha and beta diversity, accounting for rarefaction. The strategy is to run alpha and beta diversity analyses on coral mucus, tissue, and skeleton samples using either the standard Greengenes_13_8 or SILVA annotations, or expanded versions of these references that include supplemental mitochondrial sequences.

**However**, the tables annotated with the standard Greengenes_13_8 and SILVA references will be filtered to just the samples that survive rarefaction in the corresponding analysis with the supplemented references.

#### How this notebook is different from `adiv_and_bdiv_effects_of_mitochondrial_removal.ipynb`

An initial notebook (`adiv_and_bdiv_effects_of_mitochondrial_removal.ipynb`) tested the effects of improved mitochondrial removal. However, while mitochondrial removal may improve accuracy, it also reduces the number of sequences per sample. During rarefaction, some samples may then fall below the rarefaction threshold and drop out of the analysis. This may cause some trends to appear non-significant in the dataset that has mitochondria removed correctly.

In the other analysis, it was not possible to tell whether this was due to better mitochondrial removal eliminating artifactual effects, or to a lower sample size reducing statistical power.
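
To make the concern concrete, here is a minimal sketch with made-up read counts. The sample names, the counts, and the `samples_surviving` helper are hypothetical and are not taken from the GCMP data. It only illustrates how removing mitochondrial reads can push a sample below the rarefaction depth.

```python
# Hypothetical per-sample read counts: (total reads, mitochondrial reads)
toy_counts = {
    "sample_A": (5000, 200),
    "sample_B": (1500, 900),
    "sample_C": (1200, 100),
}
depth = 1000  # same depth used for rarefaction later in this notebook


def samples_surviving(counts, depth, remove_mito):
    """Return sorted ids of samples with at least `depth` reads remaining."""
    survivors = []
    for sample_id, (total, mito) in counts.items():
        remaining = total - mito if remove_mito else total
        if remaining >= depth:
            survivors.append(sample_id)
    return sorted(survivors)


print(samples_surviving(toy_counts, depth, remove_mito=False))
print(samples_surviving(toy_counts, depth, remove_mito=True))
```

In this toy case `sample_B` is retained when mitochondrial reads are kept but is lost once they are removed, so the two analyses would be run on different sample sets unless they are harmonized.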

#### Set up

We'll import QIIME2 Artifact API functions and objects to do the analysis, as well as some basic Python functions for working with the file system (e.g. from `os.path`).

In [2]:
from qiime2 import Artifact,Metadata
from qiime2.plugins.feature_table.methods import filter_samples,rarefy
from qiime2.plugins.feature_table.visualizers import summarize
from qiime2.plugins.taxa.methods import filter_table
#The try/except block below is unsightly, but the alpha function was moved between recent versions of QIIME2
#and it's nice if the notebook is compatible with either
try:
    from qiime2.plugins.diversity.methods import alpha,beta
except ImportError:
    from qiime2.plugins.diversity.pipelines import alpha,beta
from qiime2.plugins.diversity.visualizers import alpha_group_significance,beta_group_significance

from os.path import abspath,exists,join
from os import mkdir

import shutil

import pandas as pd

#### Set up input filenames

We'll set up input filenames all at once so we can refer to them later.

In [6]:
#Define the input and output paths used throughout the notebook
mucus_feature_table = "../output/M_ft.qza"
tissue_feature_table = "../output/T_ft.qza"
skeleton_feature_table = "../output/S_ft.qza"
overall_feature_table = "../output/gcmp_raw_overall_ft_all_compartments.qza"

mapping_file = "../input/GCMP_EMP_map_r28_no_empty_samples.txt"
sequence_file = "../output/GCMP_seqs.qza"

output_dir = abspath("../output/effects_of_rarefaction_analysis")
input_directory = abspath("../input")

taxonomy_files = {"silva_metaxa2":"../output/silva_metaxa2_reference_taxonomy.qza",
                 "silva":"../output/silva_reference_taxonomy.qza",
                 "greengenes":"../output/greengenes_reference_taxonomy.qza",
                 "greengenes_metaxa2":"../output/greengenes_metaxa2_reference_taxonomy.qza"}

required_files = [mucus_feature_table,tissue_feature_table,skeleton_feature_table,overall_feature_table,mapping_file,sequence_file]
required_files.extend(taxonomy_files.values())



#### Check that all required files really exist and are named correctly

In [8]:
print("Verifying that all needed starting data files exist.")
for existing_file in required_files:
    if not exists(existing_file):
        raise IOError(f"Required file {existing_file} not found. Please ensure it is in that directory.")
print("Done.")

if not exists(output_dir):
    print(f"Output directory {output_dir} does not yet exist, creating it...")
    mkdir(output_dir)
    print("Done.")

Verifying that all needed starting data files exist.
Done.
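
The check above stops at the first missing file. One possible variation, sketched here on the assumption that a complete list is more convenient when several inputs are absent, collects every missing path before reporting. The `find_missing` helper and the temporary file names are hypothetical.

```python
import tempfile
from os.path import basename, exists, join


def find_missing(paths):
    """Return the subset of paths that do not exist, preserving input order."""
    return [path for path in paths if not exists(path)]


# Demonstrate on a throwaway directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    present = join(tmp_dir, "present.qza")
    absent = join(tmp_dir, "absent.qza")
    open(present, "w").close()  # create an empty placeholder file
    missing_names = [basename(path) for path in find_missing([present, absent])]

print(missing_names)
```

In the notebook this could be called as `find_missing(required_files)`, raising a single error that names all missing files if the returned list is non-empty.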


#### Check QIIME2 version

Do a quick check that the QIIME2 version is what's expected. If you get an error at this step due to a different QIIME2 version, the code may very well still work, but if you want to reproduce the results exactly, you'll want QIIME2 2020.8.0.


In [9]:
from qiime2 import __version__ as qiime_version

if qiime_version != "2020.8.0":
    raise ValueError("This code was developed with QIIME2 2020.8.0. It will *probably* work with related versions, but there are no guarantees as some functions may change in call signature.")
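
Since the code may still work with related versions, a softer variant of this check could warn rather than stop. The sketch below is one way to do that, not part of the original analysis. The `check_version` helper and its `strict` flag are hypothetical names.

```python
import warnings

EXPECTED_VERSION = "2020.8.0"


def check_version(found, expected=EXPECTED_VERSION, strict=False):
    """Return True on an exact match; otherwise warn, or raise if strict."""
    if found == expected:
        return True
    message = (
        f"This code was developed with QIIME2 {expected}, but found {found}. "
        "It will probably work, but some call signatures may differ."
    )
    if strict:
        raise ValueError(message)
    warnings.warn(message)
    return False
```

Calling `check_version(qiime_version)` would let the notebook continue with a warning, while `check_version(qiime_version, strict=True)` reproduces the stricter behavior for exact replication.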



#### Generate filtered tables using several sets of taxonomy annotations

We will filter mitochondria (and chloroplasts) out of our feature tables using either the default taxonomies (Greengenes_13_8 or SILVA), or our supplemented versions with additional metaxa2 mitochondrial 16S rRNA sequences.
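
The filtering cell in this section passes `exclude="mitochondria,chloroplast"` with `mode="contains"`. As a rough illustration of what substring-based exclusion does, here is a toy sketch in plain Python. It is not QIIME2's implementation: the taxonomy strings and the `keep_feature` helper are hypothetical, and the case-insensitive matching is an assumption of this sketch.

```python
def keep_feature(taxonomy_string, exclude_terms):
    """Keep a feature unless its annotation contains any excluded term."""
    lowered = taxonomy_string.lower()
    return not any(term.lower() in lowered for term in exclude_terms)


# Hypothetical annotations, loosely in Greengenes style
toy_taxonomy = {
    "feature_1": "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__mitochondria",
    "feature_2": "k__Bacteria; p__Cyanobacteria; c__Chloroplast",
    "feature_3": "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
}

exclude_terms = "mitochondria,chloroplast".split(",")
kept = [name for name, tax in toy_taxonomy.items() if keep_feature(tax, exclude_terms)]
print(kept)
```

Only `feature_3` is kept here. The sketch also shows why the reference matters: a mitochondrial sequence that the reference annotates as something else would pass this filter untouched.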

In [12]:
from qiime2.plugins.feature_table.methods import filter_features
from collections import defaultdict

filtered_feature_tables_by_taxonomy = defaultdict(dict)

metadata = Metadata.load(mapping_file)
seqs = Artifact.load(sequence_file)

for label,taxonomy_file in taxonomy_files.items():
    
    print(f"Analyzing data using the {label} taxonomy ({taxonomy_file})")
    taxonomy = Artifact.load(taxonomy_file)  
    
    mucus_features = Artifact.load(mucus_feature_table)
    tissue_features = Artifact.load(tissue_feature_table)
    skeleton_features = Artifact.load(skeleton_feature_table)
    overall_features = Artifact.load(overall_feature_table)
    feature_tables = {"all":overall_features,"mucus":mucus_features,\
                      "tissue": tissue_features,"skeleton":skeleton_features,}
    
    for compartment,table in feature_tables.items():
        print("Removing mitochondria from:", compartment,table)
        #NOTE: the QIIME2 API does NOT return a single object (as I thought based on the documentation), but a NamedTuple
        #structure with each output in it
        filter_table_results = filter_table(table,taxonomy,exclude="mitochondria,chloroplast",mode="contains")
        filtered_table = filter_table_results.filtered_table
    
        #Save the resulting feature table to disk
        output_filename = f"feature_table_{label}_{compartment}.qza"
        output_filepath = join(output_dir,output_filename)
        print(f"Saving results to:{output_filepath}")
        filtered_table.save(output_filepath)
        
        #Output a sample summary
        summary_visualization = summarize(filtered_table,sample_metadata=metadata)
        vis = summary_visualization.visualization
        output_filename = f"feature_table_{label}_{compartment}.qzv"
        output_filepath = join(output_dir,output_filename)
        print(f"Saving summary file to:{output_filepath}")
        vis.save(output_filepath)
        
        filtered_feature_tables_by_taxonomy[label][compartment]=filtered_table
    
    print(f"Done with processing {label} taxonomy annotations!\n\n")

Analyzing data using the silva_metaxa2 taxonomy (../output/silva_metaxa2_reference_taxonomy.qza)
Removing mitochondria from: all <artifact: FeatureTable[Frequency] uuid: ca2f992d-79c2-49bb-9145-3cdae1c09977>
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_metaxa2_all.qza


Saving summary file to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_metaxa2_all.qzv
Removing mitochondria from: mucus <artifact: FeatureTable[Frequency] uuid: 8045253c-8a06-4ba5-9188-36ae7ca39531>
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_metaxa2_mucus.qza


Saving summary file to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_metaxa2_mucus.qzv
Removing mitochondria from: tissue <artifact: FeatureTable[Frequency] uuid: 14e5fea4-6aee-4dec-9065-33a307fb3140>
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_metaxa2_tissue.qza


Saving summary file to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_metaxa2_tissue.qzv
Removing mitochondria from: skeleton <artifact: FeatureTable[Frequency] uuid: f7e6592c-affd-418c-aaf3-876a1294691e>
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_metaxa2_skeleton.qza


Saving summary file to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_metaxa2_skeleton.qzv
Done with processing silva_metaxa2 taxonomy annotations!


Analyzing data using the silva taxonomy (../output/silva_reference_taxonomy.qza)
Removing mitochondria from: all <artifact: FeatureTable[Frequency] uuid: ca2f992d-79c2-49bb-9145-3cdae1c09977>
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_all.qza


Saving summary file to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_all.qzv
Removing mitochondria from: mucus <artifact: FeatureTable[Frequency] uuid: 8045253c-8a06-4ba5-9188-36ae7ca39531>
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_mucus.qza


Saving summary file to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_mucus.qzv
Removing mitochondria from: tissue <artifact: FeatureTable[Frequency] uuid: 14e5fea4-6aee-4dec-9065-33a307fb3140>
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_tissue.qza


Saving summary file to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_tissue.qzv
Removing mitochondria from: skeleton <artifact: FeatureTable[Frequency] uuid: f7e6592c-affd-418c-aaf3-876a1294691e>
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_skeleton.qza


Saving summary file to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_skeleton.qzv
Done with processing silva taxonomy annotations!


Analyzing data using the greengenes taxonomy (../output/greengenes_reference_taxonomy.qza)
Removing mitochondria from: all <artifact: FeatureTable[Frequency] uuid: ca2f992d-79c2-49bb-9145-3cdae1c09977>
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_greengenes_all.qza


Saving summary file to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_greengenes_all.qzv
Removing mitochondria from: mucus <artifact: FeatureTable[Frequency] uuid: 8045253c-8a06-4ba5-9188-36ae7ca39531>
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_greengenes_mucus.qza


Saving summary file to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_greengenes_mucus.qzv
Removing mitochondria from: tissue <artifact: FeatureTable[Frequency] uuid: 14e5fea4-6aee-4dec-9065-33a307fb3140>
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_greengenes_tissue.qza


Saving summary file to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_greengenes_tissue.qzv
Removing mitochondria from: skeleton <artifact: FeatureTable[Frequency] uuid: f7e6592c-affd-418c-aaf3-876a1294691e>
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_greengenes_skeleton.qza


Saving summary file to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_greengenes_skeleton.qzv
Done with processing greengenes taxonomy annotations!


Analyzing data using the greengenes_metaxa2 taxonomy (../output/greengenes_metaxa2_reference_taxonomy.qza)
Removing mitochondria from: all <artifact: FeatureTable[Frequency] uuid: ca2f992d-79c2-49bb-9145-3cdae1c09977>
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_greengenes_metaxa2_all.qza


Saving summary file to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_greengenes_metaxa2_all.qzv
Removing mitochondria from: mucus <artifact: FeatureTable[Frequency] uuid: 8045253c-8a06-4ba5-9188-36ae7ca39531>
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_greengenes_metaxa2_mucus.qza


Saving summary file to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_greengenes_metaxa2_mucus.qzv
Removing mitochondria from: tissue <artifact: FeatureTable[Frequency] uuid: 14e5fea4-6aee-4dec-9065-33a307fb3140>
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_greengenes_metaxa2_tissue.qza


Saving summary file to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_greengenes_metaxa2_tissue.qzv
Removing mitochondria from: skeleton <artifact: FeatureTable[Frequency] uuid: f7e6592c-affd-418c-aaf3-876a1294691e>
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_greengenes_metaxa2_skeleton.qza


Saving summary file to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_greengenes_metaxa2_skeleton.qzv
Done with processing greengenes_metaxa2 taxonomy annotations!




#### Rarefy tables to even depth

In [13]:
from collections import defaultdict
rarefaction_depth = 1000

rarefied_feature_tables_by_taxonomy = defaultdict(dict)

for label,filtered_feature_tables in filtered_feature_tables_by_taxonomy.items():

    for compartment,table in filtered_feature_tables.items():
        print(f"Rarefying: {compartment} feature table {table} to {rarefaction_depth} sequences/sample")
        rarefy_results = rarefy(table=table, sampling_depth=rarefaction_depth)
        #Get the rarefied table out of the NamedTuple of results
        rarefied_filtered_table = rarefy_results.rarefied_table

        #Save the resulting feature table to disk
        output_filename = f"feature_table_{label}_{compartment}_{rarefaction_depth}.qza"
        output_filepath = join(output_dir,output_filename)
        print(f"Saving results to:{output_filepath}")
        rarefied_filtered_table.save(output_filepath)

        #Store rarefied feature table in a dict so we don't have to reload
        rarefied_feature_tables_by_taxonomy[label][compartment]=rarefied_filtered_table


Rarefying: all feature table <artifact: FeatureTable[Frequency] uuid: 2bb82c5f-0195-4da4-a532-879a73aad7e4> to 1000 sequences/sample
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_metaxa2_all_1000.qza
Rarefying: mucus feature table <artifact: FeatureTable[Frequency] uuid: f92e8cf5-d582-4bc0-85a4-ef6611b659ab> to 1000 sequences/sample
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/feature_table_silva_metaxa2_mucus_1000.qza
Rarefying: tissue feature table <artifact: FeatureTable[Frequency] uuid: 35a8b0a7-e99e-4f27-84a4-2a9d41978fa6> to 1000 sequences/sample
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal
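
For readers unfamiliar with rarefaction, the idea can be sketched in plain Python as random subsampling without replacement to a fixed depth, with samples below that depth dropped. This is a toy illustration, not the QIIME2 implementation. The `rarefy_sample` helper, the ASV names, and the counts are hypothetical.

```python
import random
from collections import Counter


def rarefy_sample(feature_counts, depth, rng):
    """Subsample one sample's reads to `depth` without replacement.

    Returns None when the sample has fewer than `depth` reads,
    mirroring how shallow samples drop out of a rarefied table.
    """
    # Expand counts into one entry per read, then draw `depth` reads
    pool = [feature for feature, count in feature_counts.items() for _ in range(count)]
    if len(pool) < depth:
        return None
    return dict(Counter(rng.sample(pool, depth)))


rng = random.Random(42)
deep_sample = {"asv_1": 600, "asv_2": 500, "asv_3": 100}
shallow_sample = {"asv_1": 400, "asv_2": 50}

rarefied = rarefy_sample(deep_sample, 1000, rng)
print(sum(rarefied.values()))
print(rarefy_sample(shallow_sample, 1000, rng))
```

The deep sample is reduced to exactly 1000 reads, while the shallow sample returns `None`. Samples dropped this way are the reason the sample sets need to be harmonized between taxonomy schemes.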

#### New step: harmonize sample sets between treatments

Next we will filter the samples in the rarefied Greengenes and SILVA tables to match the samples in the Greengenes+metaxa2 or SILVA+metaxa2 annotation datasets.

In [14]:


# Build a two-level dict to hold sample ids for every feature table, keyed by
# taxonomic scheme and then by compartment
sample_ids_by_taxonomy = defaultdict(dict)

#Iterate over all the feature tables from the last step
#and collect their sample ids

for taxonomy_name,data_dict in rarefied_feature_tables_by_taxonomy.items():
    print("Taxonomy scheme:",taxonomy_name)
    for compartment,table in data_dict.items():
        print("Compartment:",compartment)
        
        # View as Pandas dataframe
        df = table.view(pd.DataFrame)
        # Extract ids from this dataframe
        n_samples,n_features = df.shape
        sample_ids = list(df.index)
        print("This table has ",len(sample_ids),"samples")
        sample_ids_by_taxonomy[taxonomy_name][compartment] = sample_ids
        


Taxonomy scheme: silva_metaxa2
Compartment: all
This table has  1099 samples
Compartment: mucus
This table has  312 samples
Compartment: tissue
This table has  360 samples
Compartment: skeleton
This table has  364 samples
Taxonomy scheme: silva
Compartment: all
This table has  1155 samples
Compartment: mucus
This table has  318 samples
Compartment: tissue
This table has  389 samples
Compartment: skeleton
This table has  382 samples
Taxonomy scheme: greengenes
Compartment: all
This table has  1156 samples
Compartment: mucus
This table has  318 samples
Compartment: tissue
This table has  390 samples
Compartment: skeleton
This table has  382 samples
Taxonomy scheme: greengenes_metaxa2
Compartment: all
This table has  1100 samples
Compartment: mucus
This table has  312 samples
Compartment: tissue
This table has  361 samples
Compartment: skeleton
This table has  364 samples


Now that we have a data structure holding the sample ids for every feature table, we pair each standard taxonomy with its metaxa2-supplemented counterpart and filter the standard tables down to the samples retained in the supplemented ones.
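
The pairing logic can be sketched on plain lists of ids before applying it to QIIME2 artifacts. The `harmonize` helper and the toy sample ids below are hypothetical; only the shape of the `pairings` dict (target scheme mapped to source scheme) follows the notebook.

```python
def harmonize(ids_by_scheme, pairings):
    """Restrict each target scheme's sample ids to those present in its source."""
    harmonized = {scheme: list(ids) for scheme, ids in ids_by_scheme.items()}
    for target, source in pairings.items():
        keep = set(ids_by_scheme[source])
        harmonized[target] = [s for s in ids_by_scheme[target] if s in keep]
    return harmonized


# Hypothetical ids: "s2" survived rarefaction only under the standard taxonomy
toy_ids = {
    "silva": ["s1", "s2", "s3", "s4"],
    "silva_metaxa2": ["s1", "s3", "s4"],
}
toy_pairings = {"silva": "silva_metaxa2"}

result = harmonize(toy_ids, toy_pairings)
print(result["silva"])
print(set(result["silva"]) == set(result["silva_metaxa2"]))
```

Note that the two sets end up identical only if every sample in the supplemented table is also present in the standard table. Comparing the post-filtering sample counts between paired tables is a quick way to confirm this.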

In [15]:
pairings = {"silva":"silva_metaxa2","greengenes":"greengenes_metaxa2"}
filtered_tables = defaultdict(dict)
for target_tables,source_tables in pairings.items():
    print(f"Filtering {target_tables} to have same samples as {source_tables}")
    for compartment,table in rarefied_feature_tables_by_taxonomy[target_tables].items():
        print("Compartment:",compartment)
        
        
        # Extract ids from this table
        df = table.view(pd.DataFrame)
        n_samples,n_features = df.shape
        sample_ids = list(df.index)
        
        print("Pre-filtering, this table has ",len(sample_ids),"samples")
        
        #Convert the list of ids to QIIME2 metadata
        #this requires going list --> DataFrame --> Metadata
        id_list = sample_ids_by_taxonomy[source_tables][compartment]
        id_df = pd.DataFrame(id_list,columns=['#SampleID'])
        id_df = id_df.set_index('#SampleID')
        id_md = Metadata(id_df)
        
        filter_results = filter_samples(table,metadata=id_md)
        filtered_table = filter_results.filtered_table
        
        # View as Pandas dataframe
        df = filtered_table.view(pd.DataFrame)
        # Extract ids from this dataframe
        n_samples,n_features = df.shape
        sample_ids = list(df.index)
        print("Post-filtering, this table has ",len(sample_ids),"samples")

        #Update the feature_table dict with this new version
        rarefied_feature_tables_by_taxonomy[target_tables][compartment] = filtered_table
        
#The standard silva and greengenes tables should now contain only the samples
#retained in their metaxa2-supplemented counterparts

Filtering silva to have same samples as silva_metaxa2
Compartment: all
Pre-filtering, this table has  1155 samples
Post-filtering, this table has  1099 samples
Compartment: mucus
Pre-filtering, this table has  318 samples
Post-filtering, this table has  312 samples
Compartment: tissue
Pre-filtering, this table has  389 samples
Post-filtering, this table has  360 samples
Compartment: skeleton
Pre-filtering, this table has  382 samples
Post-filtering, this table has  364 samples
Filtering greengenes to have same samples as greengenes_metaxa2
Compartment: all
Pre-filtering, this table has  1156 samples
Post-filtering, this table has  1100 samples
Compartment: mucus
Pre-filtering, this table has  318 samples
Post-filtering, this table has  312 samples
Compartment: tissue
Pre-filtering, this table has  390 samples
Post-filtering, this table has  361 samples
Compartment: skeleton
Pre-filtering, this table has  382 samples
Post-filtering, this table has  364 samples


#### Calculate alpha diversity for each combination of taxonomic scheme and anatomy 

In [16]:
metrics = ['observed_features','gini_index','dominance','simpson_e']
alpha_diversities = {}
for label, rarefied_feature_tables in rarefied_feature_tables_by_taxonomy.items():
    for compartment,table in rarefied_feature_tables.items():
        for metric in metrics:
            print(f"Calculating alpha diversity for {compartment} using {metric}")
            alpha_results = alpha(table=table,metric = metric)
            alpha_diversity = alpha_results.alpha_diversity
            alpha_diversities[f"{label}_{compartment}_{metric}_{rarefaction_depth}"] = alpha_diversity

            #Save the resulting alpha diversity vector to disk
            output_filename = f"adiv_{label}_{compartment}_{metric}_{rarefaction_depth}_samples_harmonized.qza"
            output_filepath = join(output_dir,output_filename)
            print(f"Saving results to:{output_filepath}")
            alpha_diversity.save(output_filepath)

            #Calculate alpha group significance for categorical variables
            alpha_group_sig_results = alpha_group_significance(alpha_diversity=alpha_diversity,metadata=metadata)
            alpha_group_sig_visualization = alpha_group_sig_results.visualization
            output_filename = f"adiv_{label}_{compartment}_{metric}_{rarefaction_depth}_group_sig_samples_harmonized.qzv"
            output_filepath = join(output_dir,output_filename)
            print(f"Saving significance results to:{output_filepath}")
            alpha_group_sig_visualization.save(output_filepath)


Calculating alpha diversity for all using observed_features
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/adiv_silva_metaxa2_all_observed_features_1000_samples_harmonized.qza
Saving significance results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/adiv_silva_metaxa2_all_observed_features_1000_group_sig_samples_harmonized.qzv
Calculating alpha diversity for all using gini_index
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/adiv_silva_metaxa2_all_gini_index_1000_samples_harmonized.qza
Saving significance results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/g

Saving significance results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/adiv_silva_metaxa2_skeleton_dominance_1000_group_sig_samples_harmonized.qzv
Calculating alpha diversity for skeleton using simpson_e
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/adiv_silva_metaxa2_skeleton_simpson_e_1000_samples_harmonized.qza
Saving significance results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/adiv_silva_metaxa2_skeleton_simpson_e_1000_group_sig_samples_harmonized.qzv
Calculating alpha diversity for all using observed_features
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Glob

Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/adiv_silva_skeleton_gini_index_1000_samples_harmonized.qza
Saving significance results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/adiv_silva_skeleton_gini_index_1000_group_sig_samples_harmonized.qzv
Calculating alpha diversity for skeleton using dominance
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/adiv_silva_skeleton_dominance_1000_samples_harmonized.qza
Saving significance results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_anal

Saving significance results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/adiv_greengenes_tissue_simpson_e_1000_group_sig_samples_harmonized.qzv
Calculating alpha diversity for skeleton using observed_features
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/adiv_greengenes_skeleton_observed_features_1000_samples_harmonized.qza
Saving significance results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/adiv_greengenes_skeleton_observed_features_1000_group_sig_samples_harmonized.qzv
Calculating alpha diversity for skeleton using gini_index
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Project

Calculating alpha diversity for tissue using dominance
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/adiv_greengenes_metaxa2_tissue_dominance_1000_samples_harmonized.qza
Saving significance results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/adiv_greengenes_metaxa2_tissue_dominance_1000_group_sig_samples_harmonized.qzv
Calculating alpha diversity for tissue using simpson_e
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/adiv_greengenes_metaxa2_tissue_simpson_e_1000_samples_harmonized.qza
Saving significance results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disea
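
As a reference for interpreting three of the metrics used above, here is a small pure-Python sketch. It follows my understanding of the usual definitions (observed features as the number of non-zero features, dominance as the sum of squared proportions, and Simpson's evenness as inverse dominance divided by observed features). Treat the helper functions and toy counts as illustrative assumptions rather than the library code. `gini_index` is omitted.

```python
def observed_features(counts):
    """Number of features with a non-zero count."""
    return sum(1 for count in counts if count > 0)


def dominance(counts):
    """Sum of squared proportions; 1.0 means one feature holds every read."""
    total = sum(counts)
    return sum((count / total) ** 2 for count in counts)


def simpson_e(counts):
    """Simpson's evenness: inverse dominance scaled by observed features."""
    return (1.0 / dominance(counts)) / observed_features(counts)


even_sample = [10, 10, 10, 0]   # hypothetical, perfectly even community
skewed_sample = [27, 2, 1, 0]   # hypothetical, dominated by one feature

for name, counts in [("even", even_sample), ("skewed", skewed_sample)]:
    print(name, observed_features(counts), round(dominance(counts), 3), round(simpson_e(counts), 3))
```

Both toy samples have three observed features, but the skewed one has higher dominance and lower evenness. A sample dominated by unremoved mitochondrial reads would be expected to look like the skewed case.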

#### Test the effects of mitochondrial removal on between-family beta diversity

If mitochondria are misannotated at different rates in different coral families, the misannotated reads could artificially inflate inter-family beta diversity. Alternatively, *removing* mitochondria may reduce intra-family variability, shrinking the observed variance within each family and thereby increasing the significance of inter-family differences. The code below runs PERMANOVA between coral families under each taxonomic scheme to test these ideas.
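
To make the second scenario concrete, here is a minimal sketch (not part of the original analysis) of the pseudo-F statistic that PERMANOVA reports, computed directly from a distance matrix using the standard sum-of-squares formulation. The function name `pseudo_f` and the toy one-dimensional "samples" are assumptions made purely for illustration; the notebook itself relies on QIIME2's `beta_group_significance`, which additionally derives a p-value by permutation.

```python
import numpy as np

def pseudo_f(distance_matrix, groups):
    """Pseudo-F statistic from a square distance matrix and one group label per sample."""
    d2 = np.asarray(distance_matrix, dtype=float) ** 2
    groups = np.asarray(groups)
    n = len(groups)
    labels = np.unique(groups)
    a = len(labels)
    # Total sum of squares: squared distances over all unordered pairs, divided by N
    ss_total = d2[np.triu_indices(n, k=1)].sum() / n
    # Within-group sum of squares: same idea, restricted to pairs inside each group
    ss_within = 0.0
    for g in labels:
        idx = np.flatnonzero(groups == g)
        sub = d2[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))

def toy_distances(positions):
    """Absolute differences between hypothetical 1-D sample positions."""
    x = np.asarray(positions, dtype=float)
    return np.abs(x[:, None] - x[None, :])

families = ["A", "A", "B", "B"]
# Same family centroids (0.5 vs 10.5 and 1.5 vs 11.5 are both 10 apart),
# but different amounts of within-family spread
f_tight = pseudo_f(toy_distances([0, 1, 10, 11]), families)
f_loose = pseudo_f(toy_distances([0, 3, 10, 13]), families)
print(f_tight, f_loose)
```

In this toy case the separation between family centroids is identical, yet the tighter families give the larger pseudo-F. That is the sense in which reduced intra-family variability alone could strengthen an inter-family result.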

In [17]:
metrics = ['braycurtis']
beta_diversities = {}
for label, rarefied_feature_tables in rarefied_feature_tables_by_taxonomy.items():
    for compartment,table in rarefied_feature_tables.items():
        for metric in metrics:
            print(f"Calculating beta diversity for {compartment} using {metric}")
            beta_results = beta(table=table,metric = metric)
            beta_dm = beta_results.distance_matrix
            beta_diversities[f"{label}_{compartment}_{metric}_{rarefaction_depth}"] = beta_dm

            #Save the resulting distance matrix to disk
            output_filename = f"bdiv_{label}_{compartment}_{metric}_{rarefaction_depth}.qza"
            output_filepath = join(output_dir,output_filename)
            print(f"Saving results to:{output_filepath}")
            beta_dm.save(output_filepath)

            #Calculate beta group significance for categorical variables
            sig_method = 'permanova'
            metadata_column = 'taxonomy_string_to_family'
            
            beta_group_sig_results =\
              beta_group_significance(distance_matrix=beta_dm,method=sig_method,metadata=metadata.get_column(metadata_column))
            
            beta_group_sig_visualization = beta_group_sig_results.visualization
            
            output_filename = f"bdiv_{label}_{compartment}_{metric}_{rarefaction_depth}_{sig_method}_group_sig.qzv"
            output_filepath = join(output_dir,output_filename)
            print(f"Saving significance results to:{output_filepath}")
            beta_group_sig_visualization.save(output_filepath)

Calculating beta diversity for all using braycurtis
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/bdiv_silva_metaxa2_all_braycurtis_1000.qza


Invalid limit will be ignored.
  ax.set_xlim(-.5, len(self.plot_data) - .5, auto=None)


Saving significance results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/bdiv_silva_metaxa2_all_braycurtis_1000_permanova_group_sig.qzv
Calculating beta diversity for mucus using braycurtis
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/bdiv_silva_metaxa2_mucus_braycurtis_1000.qza
Saving significance results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/bdiv_silva_metaxa2_mucus_braycurtis_1000_permanova_group_sig.qzv
Calculating beta diversity for tissue using braycurtis
Saving results to:/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_rem

### Summarizing results of alpha and beta diversity analysis

The above analyses produce a long list of files in the output folder. Manually opening each one to tabulate how the various alpha and beta diversity measures changed with rarefaction would be tedious and error-prone. Unfortunately, QIIME2 currently does not provide an easy way to get all of the underlying data that is shown in a .qzv file when it is opened with view.qiime2.org.

We will pursue the following strategy:
0. Install BeautifulSoup for HTML parsing
1. Loop over the generated files
2. Export the content of each
3. Parse the index file with BeautifulSoup to get the info we want directly from the HTML
4. Profit!

All credit for developing this approach to scraping QIIME2 qzv files goes to John Sterett on the QIIME2 forum, who helpfully shared a [question](https://forum.qiime2.org/t/beta-diversity-api/16288) that described how this is done. Thank you John!
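
As a rough illustration of the scraping idea, the sketch below pulls header (`th`) and data (`td`) cells out of a small HTML table. Two things here are assumptions rather than facts about QIIME2: the toy HTML is invented and far simpler than a real exported `index.html`, and the parsing uses Python's built-in `html.parser` module as a stand-in so the example runs without extra installs. The notebook's own function uses BeautifulSoup instead.

```python
from html.parser import HTMLParser

class TableCellCollector(HTMLParser):
    """Collect the text of <th> and <td> cells in document order."""
    def __init__(self):
        super().__init__()
        self.headers = []
        self.values = []
        self._current = None  # tag we are currently inside, if it is th or td

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current is None:
            return
        if self._current == "th":
            self.headers.append(text)
        else:
            self.values.append(text)

# Hypothetical, heavily simplified stand-in for an exported index.html
toy_html = """
<table>
  <tr><th>method name</th><td>PERMANOVA</td></tr>
  <tr><th>test statistic name</th><td>pseudo-F</td></tr>
  <tr><th>p-value</th><td>0.001</td></tr>
</table>
"""

collector = TableCellCollector()
collector.feed(toy_html)
toy_results = dict(zip(collector.headers, collector.values))
print(toy_results)
```

Pairing headers with values by position only works when every header has exactly one data cell, which is why the real function has to skip the first couple of `th` entries that merely introduce the table.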

First, we'll write a function that handles parsing the qzv file:

In [82]:
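from qiime2 import Visualization
import numpy as np
import pandas as pd
import shutil
from os.path import exists
from os import listdir,mkdir
#Import the submodule explicitly: a bare "import distutils" does not
#guarantee that distutils.dir_util has been loaded
import distutils.dir_util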
#Install Beautiful Soup 4
#Note, since we're installing from 
#inside this notebook we have to 
#pass the yes flag to approve the
#install ahead of time.

#Use a Try/Except block to avoid
#slow reinstallation if BeautifulSoup
#is already installed.
try:
    from bs4 import BeautifulSoup
except ImportError:
    !conda install bs4 --yes
    from bs4 import BeautifulSoup
    

def bdiv_group_sig_qzv_to_dataframe(viz_filepath,\
                                    tmp_dir,\
                                    label = "bdiv_results",
                                    add_to_df = None,
                                    delete_tmp_dir = False):
    """Parse a dataframe from a QIIME2 beta diversity group significance .qzv file
    viz_filepath -- path to the qzv file
    tmp_dir -- name of a new temporary directory that can safely be deleted 
    when done
    label -- string that has a name for the results in the dataframe
       (Useful when running this script in a loop with the add_to_df option)
    add_to_df -- optionally concatenate the new results to an existing dataframe before
      returning them
    delete_tmp_dir -- optionally delete the temporary directory after finishing. 
       BE 100% SURE THE DIRECTORY IS CORRECT (not '/'!!!!) AND DOESN'T ALREADY 
       EXIST IF SETTING THIS OPTION TO True. You have been warned ;).
    """    
    bdiv_viz_file = Visualization.load(viz_filepath)
    
    #If the user told us to delete the tmp dir when done, require that it
    #doesn't already exist, so that we never delete pre-existing data.
    if delete_tmp_dir:
        if exists(tmp_dir):
            raise ValueError(f"Directory {tmp_dir} already exists. "
                             "Please specify a *new* not already existing "
                             "directory name if using the delete_tmp_dir "
                             "parameter. This is to avoid deleting your "
                             "existing data by accident. Thanks.")
    
    #Make a directory to hold our exported results
    if not exists(tmp_dir):
        mkdir(tmp_dir)
    #Clear the distutils directory cache before exporting
    #(see the note after rmtree below)
    distutils.dir_util._path_created = {}
    bdiv_viz_file.export_data(tmp_dir)
    column = "taxonomy_string_to_family"


    with open(f'{tmp_dir}/index.html') as f:
        soup = BeautifulSoup(f, 'html.parser')

    #First find the keys for each entry in the table
    #This is hackish, but they are found by looking for all
    #header cell entries (marked th) except the few used to introduce
    #the results
    keys = [key.string for key in soup.find_all('th')[2:]]
    #keys output is ['method name','test statistic name','sample size','number of groups',
    #                'test statistic','p-value','number of permutations']

    #td tags indicate standard data cells in HTML tables,
    #so we find them to rip out the statistical results
    values = np.array([value.string for value in soup.find_all('td')])
    #values is a 1-D array with one entry per key, in the same order.
    #No reshape is needed: the DataFrame constructor below turns it
    #into a single column.

    #Now that we have the data, we can safely delete
    #the new tmp_dir if the user asked us to:

    if delete_tmp_dir:
        shutil.rmtree(tmp_dir)
        #Sadly due to a weird interaction with caching in distutils
        #we have to clear the cache after using rmtree or 
        #QIIME2's data export will fail!
        distutils.dir_util._path_created = {}
    #Make the data frame for the current column, concatenate it with the one for other columns 
    #under the given metric
    #look into changing to df[f'{column}'] == values? Bigger fish to fry rn though
    bdiv_results_df = pd.DataFrame(values,index=keys,columns=[label])
    #add_to_df defaults to None, so test for None (not False)
    if add_to_df is None:
        return bdiv_results_df
    else:
        #If the user gave us a DataFrame in the add_to_df option
        #concatenate our new results to it.
        bdiv_results_df = pd.concat([add_to_df,bdiv_results_df],axis=1)
        return bdiv_results_df

### Demonstrating this export on a single file

Here's an example of how the export function works on a single qzv file (to run locally,
adjust your viz_filepath to point to any beta group significance .qzv file):

In [83]:
tmp_dir = join(output_dir,"tmp_qzv_exports/")
viz_filepath = "/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/bdiv_greengenes_metaxa2_skeleton_braycurtis_1000_permanova_group_sig.qzv"
bdiv_results_df = bdiv_group_sig_qzv_to_dataframe(viz_filepath,tmp_dir= tmp_dir,add_to_df=None,delete_tmp_dir = True)
from IPython.display import HTML
print(bdiv_results_df)
HTML(bdiv_results_df.to_html())

                       bdiv_results
method name               PERMANOVA
test statistic name        pseudo-F
sample size                     364
number of groups                 22
test statistic              2.34464
p-value                       0.001
number of permutations          999
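
The function above insists on a brand-new `tmp_dir` whenever `delete_tmp_dir` is set, so that nothing pre-existing can be deleted by mistake. One alternative, which this notebook does not use, is to let the standard library's `tempfile.TemporaryDirectory` create and clean up the directory. The sketch below assumes the export step can write into an already-existing empty directory; `fake_export` is a hypothetical stand-in for `Visualization.export_data`, and whether the distutils caching issue described in the code comments still applies in that setting has not been checked here.

```python
import tempfile
from os.path import exists, join

def read_exported_index(export_func, filename="index.html"):
    """Export into a throwaway directory and return (file text, directory used)."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        export_func(tmp_dir)
        with open(join(tmp_dir, filename)) as f:
            text = f.read()
        used_dir = tmp_dir
    # The directory and its contents are removed when the with block exits
    return text, used_dir

def fake_export(output_dir):
    # Hypothetical stand-in for Visualization.export_data (assumption)
    with open(join(output_dir, "index.html"), "w") as f:
        f.write("<td>PERMANOVA</td>")

index_text, used_dir = read_exported_index(fake_export)
print(index_text, exists(used_dir))
```

Because the directory name is generated fresh on every call, there is no path for the caller to get wrong, which removes the need for the warning in the docstring.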



#### Run the results in a loop
Next, we'll run the above function in a loop over our many bdiv_group_sig files to tally up how the PERMANOVA results change in each.

In [106]:
bdiv_df = pd.DataFrame()

for label, rarefied_feature_tables in rarefied_feature_tables_by_taxonomy.items():
    for compartment,table in rarefied_feature_tables.items():
        for metric in metrics:
            
            bdiv_output_filename = f"bdiv_{label}_{compartment}_{metric}_{rarefaction_depth}_{sig_method}_group_sig.qzv"
            bdiv_output_filepath = join(output_dir,bdiv_output_filename)
            print(f"Parsing {bdiv_output_filepath}")
            bdiv_df = bdiv_group_sig_qzv_to_dataframe(bdiv_output_filepath,\
                  tmp_dir= tmp_dir,\
                  label = f"bdiv__{label}__{compartment}__{metric}__{rarefaction_depth}",\
                  add_to_df=bdiv_df,\
                  delete_tmp_dir = True)

print("Done!")

Parsing /Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/bdiv_silva_metaxa2_all_braycurtis_1000_permanova_group_sig.qzv
Parsing /Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/bdiv_silva_metaxa2_mucus_braycurtis_1000_permanova_group_sig.qzv
Parsing /Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/bdiv_silva_metaxa2_tissue_braycurtis_1000_permanova_group_sig.qzv
Parsing /Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/bdiv_silva_metaxa2_skeleton_braycurtis_1000_permanova_group_sig.qzv
Parsing /Users/jzaneveld/Dropbox/Zaneveld_
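
The labels built in the loop above join their components with double underscores rather than single ones. A plausible reason, sketched below, is that some taxonomy scheme names (for example `greengenes_metaxa2`) and metric names themselves contain single underscores, so a single-underscore split would not recover the fields. The helper `parse_label` is hypothetical and written only for illustration; its field names mirror the column names used when the summary table is assembled.

```python
def parse_label(label):
    """Split a results label of the form type__scheme__compartment__metric__depth."""
    fields = ("Diversity Type", "Taxonomy Scheme", "Compartment",
              "Metric", "Rarefaction Depth")
    parts = label.split("__")
    if len(parts) != len(fields):
        raise ValueError(f"Expected {len(fields)} fields in {label!r}, got {len(parts)}")
    return dict(zip(fields, parts))

parsed = parse_label("bdiv__greengenes_metaxa2__skeleton__braycurtis__1000")
print(parsed)

# Splitting the single-underscore file-name style on "_" breaks the scheme name apart
naive_parts = "bdiv_greengenes_metaxa2_skeleton_braycurtis_1000".split("_")
print(len(naive_parts))
```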

In [128]:

#Transpose so that each row is one analysis and each column is one statistic
bdiv_df_transposed = bdiv_df.transpose()
bdiv_df_transposed.reset_index(inplace=True)
bdiv_df_transposed = bdiv_df_transposed.rename(columns = {'index':'Method'})
#Split each label on its double underscores to recover the components as new columns
new_columns = bdiv_df_transposed["Method"].str.split("__",expand=True)
bdiv_df_transposed[['Diversity Type','Taxonomy Scheme','Compartment','Metric','Rarefaction Depth']]=new_columns
bdiv_df_transposed.set_index("Method",inplace=True)

output_file = join(output_dir,"beta_diversity_results_summary.tsv")
sort_order = ["Diversity Type","Metric","Compartment","Rarefaction Depth","Taxonomy Scheme"]
bdiv_df_transposed.sort_values(sort_order,inplace=True)
print(f"Saving summary to: {output_file}")
bdiv_df_transposed.to_csv(output_file,sep="\t")
HTML(bdiv_df_transposed.to_html())



Saving summary to: /Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/beta_diversity_results_summary.tsv


| Method | method name | test statistic name | sample size | number of groups | test statistic | p-value | number of permutations | Diversity Type | Taxonomy Scheme | Compartment | Metric | Rarefaction Depth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bdiv__greengenes__all__braycurtis__1000 | PERMANOVA | pseudo-F | 1098 | 31 | 6.83575 | 0.001 | 999 | bdiv | greengenes | all | braycurtis | 1000 |
| bdiv__greengenes_metaxa2__all__braycurtis__1000 | PERMANOVA | pseudo-F | 1098 | 31 | 3.61683 | 0.001 | 999 | bdiv | greengenes_metaxa2 | all | braycurtis | 1000 |
| bdiv__silva__all__braycurtis__1000 | PERMANOVA | pseudo-F | 1097 | 31 | 6.9519 | 0.001 | 999 | bdiv | silva | all | braycurtis | 1000 |
| bdiv__silva_metaxa2__all__braycurtis__1000 | PERMANOVA | pseudo-F | 1097 | 31 | 3.61168 | 0.001 | 999 | bdiv | silva_metaxa2 | all | braycurtis | 1000 |
| bdiv__greengenes__mucus__braycurtis__1000 | PERMANOVA | pseudo-F | 312 | 24 | 2.97668 | 0.001 | 999 | bdiv | greengenes | mucus | braycurtis | 1000 |
| bdiv__greengenes_metaxa2__mucus__braycurtis__1000 | PERMANOVA | pseudo-F | 312 | 24 | 1.96929 | 0.001 | 999 | bdiv | greengenes_metaxa2 | mucus | braycurtis | 1000 |
| bdiv__silva__mucus__braycurtis__1000 | PERMANOVA | pseudo-F | 312 | 24 | 3.00256 | 0.001 | 999 | bdiv | silva | mucus | braycurtis | 1000 |
| bdiv__silva_metaxa2__mucus__braycurtis__1000 | PERMANOVA | pseudo-F | 312 | 24 | 1.97825 | 0.001 | 999 | bdiv | silva_metaxa2 | mucus | braycurtis | 1000 |
| bdiv__greengenes__skeleton__braycurtis__1000 | PERMANOVA | pseudo-F | 364 | 22 | 3.2725 | 0.001 | 999 | bdiv | greengenes | skeleton | braycurtis | 1000 |
| bdiv__greengenes_metaxa2__skeleton__braycurtis__1000 | PERMANOVA | pseudo-F | 364 | 22 | 2.34464 | 0.001 | 999 | bdiv | greengenes_metaxa2 | skeleton | braycurtis | 1000 |


## Summarizing results of alpha diversity analysis

We will next summarize the alpha diversity analyses stored in the .qzv files. Unfortunately, in the current version of QIIME2, I was unable to identify a simple place in the visualizer where the global p and H values for Kruskal-Wallis tests are saved. Additionally, while the built-in export function produces nice .csv files for pairwise comparisons, the global values (which are the ones we mostly want) are buried in .jsonp files used by the index.html web page. To get this done, we therefore export each qzv into a temporary directory, manually parse the jsonp files for our specific values of interest for every metadata column of interest, then compile all these results into a pandas dataframe. This was then exported as a .csv and used as the basis for our Supplementary Table (after light formatting and reordering of columns).
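
To illustrate the scraping idea before the real functions, here is a minimal sketch that splits a JSONP-style string on curly braces and evaluates each flat fragment as a Python dict. The string `toy_jsonp` is invented for this example and is only an assumption about the general shape of such files; the real exported files are larger and may be laid out differently.

```python
import ast
import re

# Hypothetical JSONP-like content: a callback wrapping a few flat objects
toy_jsonp = 'load_data("family", {"initial": 10, "filtered": 8}, {"H": 5.5, "p": 0.02});'

toy_stats = {}
for field in re.split("[{}]", toy_jsonp):
    # Keep only the fragments that carry the values we care about
    if '"H":' in field or '"initial"' in field:
        toy_stats.update(ast.literal_eval("{" + field + "}"))

print(toy_stats)
```

This approach is deliberately crude. It only works when the objects of interest are flat (no nested braces), and `ast.literal_eval` would reject JSON-only literals such as `null` or `true`, so it should be treated as a workaround rather than a general JSONP parser.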

In [273]:
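import ast
import re

def parse_adiv_group_sig_jsonp_file(input_file):
    """Parse a jsonp file by crudely scraping out the H and p values for the overall analysis
    
    input_file -- path to one of the .jsonp files exported from an
      alpha group significance .qzv file
    
    Returns a dict of the scraped values (the keys used downstream
    are 'H', 'p', 'initial' and 'filtered').
    """
    #Use a context manager so the file handle is closed promptly
    with open(input_file) as f:
        file_text = f.read()
    #Split on curly braces so each flat {...} object becomes its own field
    file_fields = re.split("[{}]",file_text)
    results = {}
    for field in file_fields:
        #Overall test statistics
        if '"H":' in field:
            stats_result_dict = ast.literal_eval("{"+f"{field}"+"}")
            results.update(stats_result_dict)
            
        #Sample counts before and after filtering
        if '"initial"' in field:
            n_samples_dict = ast.literal_eval("{"+f"{field}"+"}")
            results.update(n_samples_dict)
    return results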


In [294]:
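def dataframe_from_alpha_group_sig_export_dir(export_dir,\
                                              label='adiv_results',\
                                              limit_to_columns=None):
    """Build a dataframe of overall H and p values from an exported alpha group significance qzv
    export_dir -- directory holding the exported .jsonp files
    label -- string that names these results in the 'label' column
    limit_to_columns -- optional list of metadata columns to keep
    """
    results_df = pd.DataFrame(columns=('label','metadata_column', 'H','p','n_initial','n_filtered'))
    #Read from the export_dir argument rather than the global tmp_dir
    for i,filename in enumerate(listdir(export_dir)):
        if not filename.endswith('.jsonp'):
            continue

        #The metadata column name sits between "column-" and ".jsonp" in the filename
        metadata_column = filename.split("column-",1)[1].split(".jsonp")[0]
        if limit_to_columns:
            if metadata_column not in limit_to_columns:
                continue
        curr_file = join(export_dir,filename)
        results_dict = parse_adiv_group_sig_jsonp_file(curr_file)
        results_df.loc[i]= [label,metadata_column,results_dict["H"],results_dict["p"],results_dict["initial"],results_dict["filtered"]]
    return results_df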



In [297]:
def adiv_group_sig_qzv_to_dataframe(viz_filepath,\
                                    tmp_dir,\
                                    label = "adiv_results",
                                    add_to_df = None,
                                    delete_tmp_dir = False,\
                                    limit_to_columns=None):
    """Parse a dataframe from a QIIME2 alpha diversity group significance .qzv file
    viz_filepath -- path to the qzv file
    tmp_dir -- name of a new temporary directory that can safely be deleted 
    when done
    label -- string that has a name for the results in the dataframe
       (Useful when running this script in a loop with the add_to_df option)
    add_to_df -- optionally concatenate the new results to an existing dataframe before
      returning them
    delete_tmp_dir -- optionally delete the temporary directory after finishing. 
       BE 100% SURE THE DIRECTORY IS CORRECT (not '/'!!!!) AND DOESN'T ALREADY 
       EXIST IF SETTING THIS OPTION TO True. You have been warned ;).
    limit_to_columns -- a list of columns. Within the qzv only these columns / categories
      will be parsed
    """    
    adiv_viz_file = Visualization.load(viz_filepath)
    
    #If the user told us to delete the tmp dir when done, require that it
    #doesn't already exist, so that we never delete pre-existing data.
    if delete_tmp_dir:
        if exists(tmp_dir):
            raise ValueError(f"Directory {tmp_dir} already exists. "
                             "Please specify a *new* not already existing "
                             "directory name if using the delete_tmp_dir "
                             "parameter. This is to avoid deleting your "
                             "existing data by accident. Thanks.")
    
    #Make a directory to hold our exported results
    if not exists(tmp_dir):
        mkdir(tmp_dir)
    
    #This is ugly looking but necessary because
    #of a secret and horrific interaction between rmtree
    #and distutils (distutils caches that a dir is present
    #and doesn't know you've deleted it unless you clear
    #its cache in this arcane way). Sorry.
    distutils.dir_util._path_created = {}
    
    adiv_viz_file.export_data(tmp_dir)
    #column = "taxonomy_string_to_family"
    
    adiv_results_df = dataframe_from_alpha_group_sig_export_dir(tmp_dir,label=label,\
      limit_to_columns=limit_to_columns)
    
    if delete_tmp_dir:
        shutil.rmtree(tmp_dir)
        #Sadly due to a weird interaction with caching in distutils
        #we have to clear the cache after using rmtree or 
        #QIIME2's data export will fail!
        distutils.dir_util._path_created = {}
    #Return the results for this qzv, optionally appended (as new rows)
    #to the results already collected for other files
    #add_to_df defaults to None, so test for None (not False)
    if add_to_df is None:
        return adiv_results_df
    else:
        #If the user gave us a DataFrame in the add_to_df option
        #concatenate our new results to it.
        adiv_results_df = pd.concat([add_to_df,adiv_results_df],axis=0)
        return adiv_results_df

In [299]:
tmp_dir = join(output_dir,"tmp_qzv_exports/")
adiv_df = pd.DataFrame()
metrics = ['observed_features','gini_index','dominance','simpson_e']
adiv_columns = ['taxonomy_string_to_family']
for label, rarefied_feature_tables in rarefied_feature_tables_by_taxonomy.items():
    for compartment,table in rarefied_feature_tables.items():
        for metric in metrics:
            
            adiv_output_filename = f"adiv_{label}_{compartment}_{metric}_{rarefaction_depth}_group_sig_samples_harmonized.qzv"
            adiv_output_filepath = join(output_dir,adiv_output_filename)
            print(f"Parsing {adiv_output_filepath}")
            adiv_df = adiv_group_sig_qzv_to_dataframe(adiv_output_filepath,\
                  tmp_dir= tmp_dir,\
                  label = f"adiv__{label}__{compartment}__{metric}__{rarefaction_depth}",\
                  add_to_df=adiv_df,\
                  delete_tmp_dir = True,\
                  limit_to_columns = adiv_columns)
            print(adiv_df)

print("Done!")

from IPython.display import HTML
HTML(adiv_df.to_html())


Parsing /Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/adiv_silva_metaxa2_all_observed_features_1000_group_sig_samples_harmonized.qzv
                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   

               metadata_column           H             p n_initial n_filtered  
148  taxonomy_string_to_family  245.315487  1.213026e-35      1099       1097  
Parsing /Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/effects_of_rarefaction_analysis/adiv_silva_metaxa2_all_gini_index_1000_group_sig_samples_harmonized.qzv
                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   

               metadata_column           H       

                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
148          adiv__silva_metaxa2__all__simpson_e__1000   
145  adiv__silva_metaxa2__mucus__observed_features_...   
145       adiv__silva_metaxa2__mucus__gini_index__1000   
145        adiv__silva_metaxa2__mucus__dominance__1000   
145        adiv__silva_metaxa2__mucus__simpson_e__1000   
145  adiv__silva_metaxa2__tissue__observed_features...   

               metadata_column           H             p n_initial n_filtered  
148  taxonomy_string_to_family  245.315487  1.213026e-35      1099       1097  
148  taxonomy_string_to_family  233.741716  2.022842e-33      1099       1097  
148  taxonomy_string_to_family  206.290136  3.274478e-28      1099       1097  
148  taxonomy_string_to_family   97.364539  4.818763e-09      1099       1097  
145  taxonomy_strin

                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
148          adiv__silva_metaxa2__all__simpson_e__1000   
145  adiv__silva_metaxa2__mucus__observed_features_...   
145       adiv__silva_metaxa2__mucus__gini_index__1000   
145        adiv__silva_metaxa2__mucus__dominance__1000   
145        adiv__silva_metaxa2__mucus__simpson_e__1000   
145  adiv__silva_metaxa2__tissue__observed_features...   
145      adiv__silva_metaxa2__tissue__gini_index__1000   
145       adiv__silva_metaxa2__tissue__dominance__1000   
145       adiv__silva_metaxa2__tissue__simpson_e__1000   
145  adiv__silva_metaxa2__skeleton__observed_featur...   
145    adiv__silva_metaxa2__skeleton__gini_index__1000   

               metadata_column           H             p n_initial n_filtered  
148  taxonomy_string_to_family  245.315487  1.213

                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
148          adiv__silva_metaxa2__all__simpson_e__1000   
145  adiv__silva_metaxa2__mucus__observed_features_...   
145       adiv__silva_metaxa2__mucus__gini_index__1000   
145        adiv__silva_metaxa2__mucus__dominance__1000   
145        adiv__silva_metaxa2__mucus__simpson_e__1000   
145  adiv__silva_metaxa2__tissue__observed_features...   
145      adiv__silva_metaxa2__tissue__gini_index__1000   
145       adiv__silva_metaxa2__tissue__dominance__1000   
145       adiv__silva_metaxa2__tissue__simpson_e__1000   
145  adiv__silva_metaxa2__skeleton__observed_featur...   
145    adiv__silva_metaxa2__skeleton__gini_index__1000   
145     adiv__silva_metaxa2__skeleton__dominance__1000   
145     adiv__silva_metaxa2__skeleton__simpson_e__1000   
148          a

                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
148          adiv__silva_metaxa2__all__simpson_e__1000   
145  adiv__silva_metaxa2__mucus__observed_features_...   
145       adiv__silva_metaxa2__mucus__gini_index__1000   
145        adiv__silva_metaxa2__mucus__dominance__1000   
145        adiv__silva_metaxa2__mucus__simpson_e__1000   
145  adiv__silva_metaxa2__tissue__observed_features...   
145      adiv__silva_metaxa2__tissue__gini_index__1000   
145       adiv__silva_metaxa2__tissue__dominance__1000   
145       adiv__silva_metaxa2__tissue__simpson_e__1000   
145  adiv__silva_metaxa2__skeleton__observed_featur...   
145    adiv__silva_metaxa2__skeleton__gini_index__1000   
145     adiv__silva_metaxa2__skeleton__dominance__1000   
145     adiv__silva_metaxa2__skeleton__simpson_e__1000   
148          a

                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
148          adiv__silva_metaxa2__all__simpson_e__1000   
145  adiv__silva_metaxa2__mucus__observed_features_...   
145       adiv__silva_metaxa2__mucus__gini_index__1000   
145        adiv__silva_metaxa2__mucus__dominance__1000   
145        adiv__silva_metaxa2__mucus__simpson_e__1000   
145  adiv__silva_metaxa2__tissue__observed_features...   
145      adiv__silva_metaxa2__tissue__gini_index__1000   
145       adiv__silva_metaxa2__tissue__dominance__1000   
145       adiv__silva_metaxa2__tissue__simpson_e__1000   
145  adiv__silva_metaxa2__skeleton__observed_featur...   
145    adiv__silva_metaxa2__skeleton__gini_index__1000   
145     adiv__silva_metaxa2__skeleton__dominance__1000   
145     adiv__silva_metaxa2__skeleton__simpson_e__1000   
148          a

                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
148          adiv__silva_metaxa2__all__simpson_e__1000   
145  adiv__silva_metaxa2__mucus__observed_features_...   
145       adiv__silva_metaxa2__mucus__gini_index__1000   
145        adiv__silva_metaxa2__mucus__dominance__1000   
145        adiv__silva_metaxa2__mucus__simpson_e__1000   
145  adiv__silva_metaxa2__tissue__observed_features...   
145      adiv__silva_metaxa2__tissue__gini_index__1000   
145       adiv__silva_metaxa2__tissue__dominance__1000   
145       adiv__silva_metaxa2__tissue__simpson_e__1000   
145  adiv__silva_metaxa2__skeleton__observed_featur...   
145    adiv__silva_metaxa2__skeleton__gini_index__1000   
145     adiv__silva_metaxa2__skeleton__dominance__1000   
145     adiv__silva_metaxa2__skeleton__simpson_e__1000   
148          a

                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
148          adiv__silva_metaxa2__all__simpson_e__1000   
145  adiv__silva_metaxa2__mucus__observed_features_...   
145       adiv__silva_metaxa2__mucus__gini_index__1000   
145        adiv__silva_metaxa2__mucus__dominance__1000   
145        adiv__silva_metaxa2__mucus__simpson_e__1000   
145  adiv__silva_metaxa2__tissue__observed_features...   
145      adiv__silva_metaxa2__tissue__gini_index__1000   
145       adiv__silva_metaxa2__tissue__dominance__1000   
145       adiv__silva_metaxa2__tissue__simpson_e__1000   
145  adiv__silva_metaxa2__skeleton__observed_featur...   
145    adiv__silva_metaxa2__skeleton__gini_index__1000   
145     adiv__silva_metaxa2__skeleton__dominance__1000   
145     adiv__silva_metaxa2__skeleton__simpson_e__1000   
148          a

                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
148          adiv__silva_metaxa2__all__simpson_e__1000   
145  adiv__silva_metaxa2__mucus__observed_features_...   
145       adiv__silva_metaxa2__mucus__gini_index__1000   
145        adiv__silva_metaxa2__mucus__dominance__1000   
145        adiv__silva_metaxa2__mucus__simpson_e__1000   
145  adiv__silva_metaxa2__tissue__observed_features...   
145      adiv__silva_metaxa2__tissue__gini_index__1000   
145       adiv__silva_metaxa2__tissue__dominance__1000   
145       adiv__silva_metaxa2__tissue__simpson_e__1000   
145  adiv__silva_metaxa2__skeleton__observed_featur...   
145    adiv__silva_metaxa2__skeleton__gini_index__1000   
145     adiv__silva_metaxa2__skeleton__dominance__1000   
145     adiv__silva_metaxa2__skeleton__simpson_e__1000   
148          a

                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
148          adiv__silva_metaxa2__all__simpson_e__1000   
145  adiv__silva_metaxa2__mucus__observed_features_...   
145       adiv__silva_metaxa2__mucus__gini_index__1000   
145        adiv__silva_metaxa2__mucus__dominance__1000   
145        adiv__silva_metaxa2__mucus__simpson_e__1000   
145  adiv__silva_metaxa2__tissue__observed_features...   
145      adiv__silva_metaxa2__tissue__gini_index__1000   
145       adiv__silva_metaxa2__tissue__dominance__1000   
145       adiv__silva_metaxa2__tissue__simpson_e__1000   
145  adiv__silva_metaxa2__skeleton__observed_featur...   
145    adiv__silva_metaxa2__skeleton__gini_index__1000   
145     adiv__silva_metaxa2__skeleton__dominance__1000   
145     adiv__silva_metaxa2__skeleton__simpson_e__1000   
148          a

                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
148          adiv__silva_metaxa2__all__simpson_e__1000   
145  adiv__silva_metaxa2__mucus__observed_features_...   
145       adiv__silva_metaxa2__mucus__gini_index__1000   
145        adiv__silva_metaxa2__mucus__dominance__1000   
145        adiv__silva_metaxa2__mucus__simpson_e__1000   
145  adiv__silva_metaxa2__tissue__observed_features...   
145      adiv__silva_metaxa2__tissue__gini_index__1000   
145       adiv__silva_metaxa2__tissue__dominance__1000   
145       adiv__silva_metaxa2__tissue__simpson_e__1000   
145  adiv__silva_metaxa2__skeleton__observed_featur...   
145    adiv__silva_metaxa2__skeleton__gini_index__1000   
145     adiv__silva_metaxa2__skeleton__dominance__1000   
145     adiv__silva_metaxa2__skeleton__simpson_e__1000   
148          a

                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
148          adiv__silva_metaxa2__all__simpson_e__1000   
145  adiv__silva_metaxa2__mucus__observed_features_...   
145       adiv__silva_metaxa2__mucus__gini_index__1000   
145        adiv__silva_metaxa2__mucus__dominance__1000   
145        adiv__silva_metaxa2__mucus__simpson_e__1000   
145  adiv__silva_metaxa2__tissue__observed_features...   
145      adiv__silva_metaxa2__tissue__gini_index__1000   
145       adiv__silva_metaxa2__tissue__dominance__1000   
145       adiv__silva_metaxa2__tissue__simpson_e__1000   
145  adiv__silva_metaxa2__skeleton__observed_featur...   
145    adiv__silva_metaxa2__skeleton__gini_index__1000   
145     adiv__silva_metaxa2__skeleton__dominance__1000   
145     adiv__silva_metaxa2__skeleton__simpson_e__1000   
148          a

                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
148          adiv__silva_metaxa2__all__simpson_e__1000   
145  adiv__silva_metaxa2__mucus__observed_features_...   
145       adiv__silva_metaxa2__mucus__gini_index__1000   
145        adiv__silva_metaxa2__mucus__dominance__1000   
145        adiv__silva_metaxa2__mucus__simpson_e__1000   
145  adiv__silva_metaxa2__tissue__observed_features...   
145      adiv__silva_metaxa2__tissue__gini_index__1000   
145       adiv__silva_metaxa2__tissue__dominance__1000   
145       adiv__silva_metaxa2__tissue__simpson_e__1000   
145  adiv__silva_metaxa2__skeleton__observed_featur...   
145    adiv__silva_metaxa2__skeleton__gini_index__1000   
145     adiv__silva_metaxa2__skeleton__dominance__1000   
145     adiv__silva_metaxa2__skeleton__simpson_e__1000   
148          a

                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
148          adiv__silva_metaxa2__all__simpson_e__1000   
145  adiv__silva_metaxa2__mucus__observed_features_...   
145       adiv__silva_metaxa2__mucus__gini_index__1000   
145        adiv__silva_metaxa2__mucus__dominance__1000   
145        adiv__silva_metaxa2__mucus__simpson_e__1000   
145  adiv__silva_metaxa2__tissue__observed_features...   
145      adiv__silva_metaxa2__tissue__gini_index__1000   
145       adiv__silva_metaxa2__tissue__dominance__1000   
145       adiv__silva_metaxa2__tissue__simpson_e__1000   
145  adiv__silva_metaxa2__skeleton__observed_featur...   
145    adiv__silva_metaxa2__skeleton__gini_index__1000   
145     adiv__silva_metaxa2__skeleton__dominance__1000   
145     adiv__silva_metaxa2__skeleton__simpson_e__1000   
148          a

                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
148          adiv__silva_metaxa2__all__simpson_e__1000   
145  adiv__silva_metaxa2__mucus__observed_features_...   
145       adiv__silva_metaxa2__mucus__gini_index__1000   
145        adiv__silva_metaxa2__mucus__dominance__1000   
145        adiv__silva_metaxa2__mucus__simpson_e__1000   
145  adiv__silva_metaxa2__tissue__observed_features...   
145      adiv__silva_metaxa2__tissue__gini_index__1000   
145       adiv__silva_metaxa2__tissue__dominance__1000   
145       adiv__silva_metaxa2__tissue__simpson_e__1000   
145  adiv__silva_metaxa2__skeleton__observed_featur...   
145    adiv__silva_metaxa2__skeleton__gini_index__1000   
145     adiv__silva_metaxa2__skeleton__dominance__1000   
145     adiv__silva_metaxa2__skeleton__simpson_e__1000   
148          a

                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
148          adiv__silva_metaxa2__all__simpson_e__1000   
145  adiv__silva_metaxa2__mucus__observed_features_...   
145       adiv__silva_metaxa2__mucus__gini_index__1000   
145        adiv__silva_metaxa2__mucus__dominance__1000   
145        adiv__silva_metaxa2__mucus__simpson_e__1000   
145  adiv__silva_metaxa2__tissue__observed_features...   
145      adiv__silva_metaxa2__tissue__gini_index__1000   
145       adiv__silva_metaxa2__tissue__dominance__1000   
145       adiv__silva_metaxa2__tissue__simpson_e__1000   
145  adiv__silva_metaxa2__skeleton__observed_featur...   
145    adiv__silva_metaxa2__skeleton__gini_index__1000   
145     adiv__silva_metaxa2__skeleton__dominance__1000   
145     adiv__silva_metaxa2__skeleton__simpson_e__1000   
148          a

                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
148          adiv__silva_metaxa2__all__simpson_e__1000   
145  adiv__silva_metaxa2__mucus__observed_features_...   
145       adiv__silva_metaxa2__mucus__gini_index__1000   
145        adiv__silva_metaxa2__mucus__dominance__1000   
145        adiv__silva_metaxa2__mucus__simpson_e__1000   
145  adiv__silva_metaxa2__tissue__observed_features...   
145      adiv__silva_metaxa2__tissue__gini_index__1000   
145       adiv__silva_metaxa2__tissue__dominance__1000   
145       adiv__silva_metaxa2__tissue__simpson_e__1000   
145  adiv__silva_metaxa2__skeleton__observed_featur...   
145    adiv__silva_metaxa2__skeleton__gini_index__1000   
145     adiv__silva_metaxa2__skeleton__dominance__1000   
145     adiv__silva_metaxa2__skeleton__simpson_e__1000   
148          a

                                                 label  \
148  adiv__silva_metaxa2__all__observed_features__1000   
148         adiv__silva_metaxa2__all__gini_index__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
148          adiv__silva_metaxa2__all__simpson_e__1000   
145  adiv__silva_metaxa2__mucus__observed_features_...   
..                                                 ...   
145  adiv__greengenes_metaxa2__tissue__observed_fea...   
145  adiv__greengenes_metaxa2__tissue__gini_index__...   
145  adiv__greengenes_metaxa2__tissue__dominance__1000   
145  adiv__greengenes_metaxa2__tissue__simpson_e__1000   
145  adiv__greengenes_metaxa2__skeleton__observed_f...   

               metadata_column           H             p n_initial n_filtered  
148  taxonomy_string_to_family  245.315487  1.213026e-35      1099       1097  
148  taxonomy_string_to_family  233.741716  2.022842e-33      1099       1097  
148  taxonomy_string_to_family  206.290136  3.274478e-28      1

| Unnamed: 0 | label | metadata_column | H | p | n_initial | n_filtered |
|---|---|---|---|---|---|---|
| 148 | adiv__silva_metaxa2__all__observed_features__1000 | taxonomy_string_to_family | 245.315487 | 1.213026e-35 | 1099 | 1097 |
| 148 | adiv__silva_metaxa2__all__gini_index__1000 | taxonomy_string_to_family | 233.741716 | 2.022842e-33 | 1099 | 1097 |
| 148 | adiv__silva_metaxa2__all__dominance__1000 | taxonomy_string_to_family | 206.290136 | 3.274478e-28 | 1099 | 1097 |
| 148 | adiv__silva_metaxa2__all__simpson_e__1000 | taxonomy_string_to_family | 97.364539 | 4.818763e-09 | 1099 | 1097 |
| 145 | adiv__silva_metaxa2__mucus__observed_features__1000 | taxonomy_string_to_family | 33.195214 | 0.07772493 | 312 | 312 |
| 145 | adiv__silva_metaxa2__mucus__gini_index__1000 | taxonomy_string_to_family | 28.09212 | 0.2123181 | 312 | 312 |
| 145 | adiv__silva_metaxa2__mucus__dominance__1000 | taxonomy_string_to_family | 30.935418 | 0.124388 | 312 | 312 |
| 145 | adiv__silva_metaxa2__mucus__simpson_e__1000 | taxonomy_string_to_family | 27.33034 | 0.2421677 | 312 | 312 |
| 145 | adiv__silva_metaxa2__tissue__observed_features__1000 | taxonomy_string_to_family | 113.42303 | 1.458049e-12 | 360 | 360 |
| 145 | adiv__silva_metaxa2__tissue__gini_index__1000 | taxonomy_string_to_family | 107.247438 | 1.612841e-11 | 360 | 360 |

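Each `label` in the results packs five fields separated by double underscores (for example `adiv__silva_metaxa2__all__observed_features__1000`). The sketch below shows, on a toy table, one way those fields could be unpacked into named columns. The helper name `split_labels`, the `FIELDS` constant, and the toy index are illustrative assumptions rather than part of the notebook; the labels themselves are copied from the output above. Values are assigned positionally because the results table repeats index values (148, 145), and the rarefaction depth is cast to an integer so that it would sort numerically.

```python
import pandas as pd

# Column names for the five "__"-separated fields (order taken from the labels above)
FIELDS = ["Diversity Type", "Taxonomy Scheme", "Compartment", "Metric", "Rarefaction Depth"]


def split_labels(df, label_column="label", sep="__"):
    """Return a copy of df with one new column per label field (hypothetical helper)."""
    parts = df[label_column].str.split(sep, expand=True)
    if parts.shape[1] != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} fields per label, found {parts.shape[1]}"
        )
    parts.columns = FIELDS
    out = df.copy()
    for name in FIELDS:
        # Positional assignment avoids alignment problems with a repeated index
        out[name] = parts[name].to_numpy()
    # Store depth as an integer so sorting is numeric rather than lexicographic
    out["Rarefaction Depth"] = out["Rarefaction Depth"].astype(int)
    return out


# Toy table: labels copied from the output above, index values repeated on purpose
toy = pd.DataFrame(
    {
        "label": [
            "adiv__silva_metaxa2__all__observed_features__1000",
            "adiv__silva__all__dominance__1000",
            "adiv__greengenes__mucus__dominance__1000",
        ]
    },
    index=[148, 148, 145],
)

result = split_labels(toy)
print(result[FIELDS])
```

Note that `observed_features` survives intact because only the double underscore is treated as a separator.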

In [302]:
output_file = join(output_dir, "alpha_diversity_results_summary.tsv")

# Each label has five "__"-separated fields, e.g.
# adiv__silva_metaxa2__all__observed_features__1000
label_parts = adiv_df["label"].str.split("__", expand=True)
label_columns = ["Diversity Type", "Taxonomy Scheme", "Compartment", "Metric", "Rarefaction Depth"]
adiv_df[label_columns] = label_parts

# Sort so that results for the same metric and compartment sit next to each other
# across taxonomy schemes
sort_order = ["Diversity Type", "Metric", "Compartment", "Rarefaction Depth", "Taxonomy Scheme"]
adiv_df.sort_values(sort_order, inplace=True)
print(adiv_df)

print(f"Saving summary to: {output_file}")
adiv_df.to_csv(output_file, sep="\t")

                                                 label  \
148             adiv__greengenes__all__dominance__1000   
148     adiv__greengenes_metaxa2__all__dominance__1000   
148                  adiv__silva__all__dominance__1000   
148          adiv__silva_metaxa2__all__dominance__1000   
145           adiv__greengenes__mucus__dominance__1000   
..                                                 ...   
145     adiv__silva_metaxa2__skeleton__simpson_e__1000   
145          adiv__greengenes__tissue__simpson_e__1000   
145  adiv__greengenes_metaxa2__tissue__simpson_e__1000   
145               adiv__silva__tissue__simpson_e__1000   
145       adiv__silva_metaxa2__tissue__simpson_e__1000   

               metadata_column           H             p n_initial n_filtered  \
148  taxonomy_string_to_family  165.526691  1.108918e-20      1100       1098   
148  taxonomy_string_to_family  207.480092  1.955855e-28      1100       1098   
148  taxonomy_string_to_family  180.631451  1.944442e-23    