# What you will get from this tutorial:

- Knowledge about the concepts of alpha and beta diversity
- How to calculate the metrics that describe alpha diversity with QIIME 2

Now that we have processed all of the sequences and generated a feature table, we are ready to study the samples’ diversity. Balanced gut microbiomes with greater diversity are present in healthy, long-living people, while disturbed microbiomes with dysbiosis (increased levels of Proteobacteria or reduced diversity) are observed in elderly people with different comorbidities [1] and in patients with conditions such as Crohn's disease, irritable bowel syndrome, obesity, and autism. Transplantation of fecal microbiota aims to restore diversity [2]. Let’s see whether it is successful in our case.

<a href="https://www.smartnutritionbykg.com/microbial-diversity-the-single-most-important-factor-when-it-comes-to-our-health/"><img src="images/Microbial-Diveristy.jpeg" width="300" align="center"></a>

Before we start analysing the diversity, we need to generate a rooted phylogenetic tree relating the features to one another. It will be stored in a Phylogeny[Rooted] artifact. To generate the tree we will use the align-to-tree-mafft-fasttree pipeline.
First, the pipeline performs a multiple sequence alignment of the sequences in our FeatureData[Sequence] artifact to create a FeatureData[AlignedSequence] QIIME 2 artifact. Next, it masks (or filters) the alignment to remove positions that are highly variable and would only add noise to the resulting phylogenetic tree. After that, it generates two phylogenetic trees: an unrooted tree and a rooted one, with the root placed at the midpoint of the longest tip-to-tip distance in the unrooted tree.

<a href="https://slideplayer.com/slide/5986566/"><img src="images/rooted_unrooted_tree.jpg" width="350" align="center"></a>

Make sure QIIME 2 is active: type conda activate qiime2-2022.2 in your terminal before starting Jupyter Notebook.

In [1]:
%%bash
qiime info
qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza \
  --o-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza

System versions
Python version: 3.8.12
QIIME 2 release: 2022.2
QIIME 2 version: 2022.2.0
q2cli version: 2022.2.0

Installed plugins
alignment: 2022.2.0
composition: 2022.2.0
cutadapt: 2022.2.0
dada2: 2022.2.0
deblur: 2022.2.0
demux: 2022.2.0
diversity: 2022.2.0
diversity-lib: 2022.2.0
emperor: 2022.2.0
feature-classifier: 2022.2.0
feature-table: 2022.2.0
fragment-insertion: 2022.2.0
gneiss: 2022.2.0
longitudinal: 2022.2.0
metadata: 2022.2.0
phylogeny: 2022.2.0
quality-control: 2022.2.0
quality-filter: 2022.2.0
sample-classifier: 2022.2.0
taxa: 2022.2.0
types: 2022.2.0
vsearch: 2022.2.0

Application config directory
/home/felitsiya/miniconda3/envs/qiime2-2022.2/var/q2cli

Getting help
To get help with QIIME 2, visit https://qiime2.org
Saved FeatureData[AlignedSequence] to: aligned-rep-seqs.qza
Saved FeatureData[AlignedSequence] to: masked-aligned-rep-seqs.qza
Saved Phylogeny[Unrooted] to: unrooted-tree.qza
Saved Phylogeny[Rooted] to: rooted-tree.qza


The microbiome within a sample or between multiple samples can be described with alpha and beta diversity. They let us see the “big picture”: broader differences in the composition of microorganisms.

## Alpha diversity

Alpha diversity describes microbial diversity <b>WITHIN ONE</b> community (within-habitat diversity). The metrics here describe the richness or evenness of a microbial community, or a combination of both. Richness is the number of species present in the community. Evenness describes how evenly the individuals are distributed among the species in the community. Evenness is highest when all species in the community have the same abundance and approaches zero as the relative abundances become more unequal.

<a href="https://www.nature.com/scitable/knowledge/library/characterizing-communities-13241173/"><img src="images/richness_evenness.jpg" width="400" align="center"></a>

In the picture above both communities contain five species, which means that richness is the same. There is a total of 25 organisms in each sample, so abundance is 25. The left community is dominated by one species. The right community has equal proportions of each species and therefore greater evenness. Thus the community on the right has higher species diversity.
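
The comparison above can be reproduced numerically. The following is a minimal sketch, not part of the QIIME 2 workflow: the species counts are hypothetical (the exact counts in the picture are not given in the text), chosen only so that each community has five species and 25 organisms. Evenness is computed as Pielou's J′, the Shannon entropy divided by the logarithm of the richness.

```python
import math

def richness(counts):
    """Number of species with at least one individual."""
    return sum(1 for c in counts if c > 0)

def shannon(counts):
    """Shannon entropy H' (natural log) of the species proportions."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def pielou_evenness(counts):
    """Pielou's J' = H' / ln(S); a value of 1 means perfectly even."""
    s = richness(counts)
    if s < 2:
        return 0.0  # convention chosen for this sketch: J' is undefined for S < 2
    return shannon(counts) / math.log(s)

# Hypothetical counts: five species and 25 organisms in each community.
dominated = [21, 1, 1, 1, 1]   # one species dominates
even = [5, 5, 5, 5, 5]         # equal proportions

for name, counts in [("dominated", dominated), ("even", even)]:
    print(name, "richness:", richness(counts),
          "evenness:", round(pielou_evenness(counts), 3))
```

Both communities have the same richness, but the evenly distributed one has an evenness of 1, while the dominated one scores much lower.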

Such a straightforward estimation of richness and evenness is not always possible, because the number of species observed in a community depends on the number of collected samples. More species are usually found with more samples, which invalidates comparisons between unequally sampled communities. A solution to this problem is to standardize the sampling by creating a taxon sampling (rarefaction) curve. <a href="https://www.youtube.com/watch?v=g5BdGP4V5YA">Rarefaction</a> is a resampling approach (without replacement) that generates a curve allowing comparisons among samples at a common sample size, the smallest of all the collections. Rarefaction curves are often used when calculating alpha diversity metrics to estimate the full sample richness. The curve plots the number of species against the number of samples. It is built by randomly re-sampling the sample pool multiple times and then plotting (usually) the number R of species found at each sample size. Usually, the curve initially grows rapidly, as the most common species are found, and then flattens, as only the rarest species remain to be sampled. If we obtain a similar value with fewer observations, R has converged on a good estimate of the correct richness. If R keeps increasing or decreasing, then we cannot make a good estimate. In the example below, the red curve is still increasing: it has not converged. The blue curve has reached a horizontal asymptote, so the value of R is a good estimate of the richness. In the case of the ever-increasing species count, there are two possibilities: we need to collect more samples, because we have not yet sampled all the species present, or we have read errors that inflate R.

<a href="https://www.drive5.com/usearch/manual9.2/rare.html"><img src="images/rare.gif" width="400" align="center"></a>
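
To make the resampling idea concrete, here is a small sketch of how a rarefaction curve could be computed for one sample. It is an illustration under stated assumptions, not the QIIME 2 implementation: the counts are hypothetical, the function name `rarefaction_curve` is made up for this example, and R is taken to be the mean number of observed species over a few random subsamples drawn without replacement.

```python
import random

def rarefaction_curve(counts, depths, iterations=10, seed=0):
    """Mean number of observed species at each subsampling depth."""
    rng = random.Random(seed)
    # Expand the counts into one label per observed individual (read).
    pool = [species for species, n in enumerate(counts) for _ in range(n)]
    curve = []
    for depth in depths:
        if depth > len(pool):
            break  # cannot subsample more than was observed
        # Sampling without replacement, repeated several times.
        observed = [len(set(rng.sample(pool, depth))) for _ in range(iterations)]
        curve.append((depth, sum(observed) / iterations))
    return curve

# Hypothetical sample: a few common species and many rare ones (1000 reads).
counts = [400, 300, 150, 80, 40, 10, 5, 5, 3, 2, 2, 1, 1, 1]

for depth, r in rarefaction_curve(counts, depths=[10, 50, 100, 250, 500, 1000]):
    print(depth, round(r, 1))
```

At small depths only the common species are found; the rare ones appear only as the depth approaches the full sample size, which is why the curve rises quickly and then flattens.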

Take a look at alpha diversity explained in a <a href="https://www.youtube.com/watch?v=9ZvoR89HYP8&t=20s">video</a>.

We need to apply the core-metrics-phylogenetic method, which rarefies the FeatureTable[Frequency] to a given depth and then computes several alpha and beta diversity metrics. Here we need to provide an important parameter: --p-sampling-depth, which is the rarefaction depth. Because most diversity metrics are sensitive to differences in sampling depth between samples, the method randomly subsamples the counts from each sample to the provided value. If the total count for a sample is smaller than this value, that sample is dropped from the diversity analysis. Choosing this value is tricky and important! Make the choice by reviewing the table.qzv file. Pick a value that is as high as possible (so you retain more sequences per sample) while excluding as few samples as possible. Which sample(s) did you exclude? Control, treatment or donor? To identify their label, look at the metadata (remember, you downloaded it earlier).
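
The trade-off behind the sampling depth can be sketched in a few lines. This is a simplified illustration rather than what QIIME 2 runs internally: the table, the sample and feature names, and the helper functions `rarefy` and `rarefy_table` are all hypothetical. Each sample is subsampled without replacement to the chosen depth, and samples with fewer total counts than the depth are dropped.

```python
import random

def rarefy(sample_counts, depth, rng):
    """Subsample one sample's feature counts to `depth` without replacement."""
    pool = [feature for feature, n in sample_counts.items() for _ in range(n)]
    rarefied = {}
    for feature in rng.sample(pool, depth):
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied

def rarefy_table(table, depth, seed=0):
    """Return (rarefied table, list of dropped sample ids)."""
    rng = random.Random(seed)
    kept, dropped = {}, []
    for sample_id, counts in table.items():
        if sum(counts.values()) < depth:
            dropped.append(sample_id)  # too few counts for this depth
        else:
            kept[sample_id] = rarefy(counts, depth, rng)
    return kept, dropped

# Hypothetical feature table: sample id -> {feature id: count}.
table = {
    "sample-a": {"f1": 900, "f2": 600, "f3": 100},  # 1600 counts in total
    "sample-b": {"f1": 700, "f2": 500},             # 1200 counts in total
    "sample-c": {"f1": 300, "f3": 200},             # 500 counts in total
}

for depth in (500, 1150, 1500):
    kept, dropped = rarefy_table(table, depth)
    print("depth", depth, "kept", sorted(kept), "dropped", dropped,
          "sequences retained", depth * len(kept))
```

A higher depth keeps more sequences per sample but drops more samples, so the choice balances the two.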

In [2]:
%%bash
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table table.qza \
  --p-sampling-depth 1150 \
  --m-metadata-file sample-metadata.tsv \
  --output-dir core-metrics-results

Usage: qiime diversity core-metrics-phylogenetic [OPTIONS]

  Applies a collection of diversity metrics (both phylogenetic and non-
  phylogenetic) to a feature table.

Inputs:
  --i-table ARTIFACT FeatureTable[Frequency]
                          The feature table containing the samples over which
                          diversity metrics should be computed.     [required]
  --i-phylogeny ARTIFACT  Phylogenetic tree containing tip identifiers that
    Phylogeny[Rooted]     correspond to the feature identifiers in the table.
                          This tree can contain tip ids that are not present
                          in the table, but all feature ids in the table must
                          be present in this tree.                  [required]
Parameters:
  --p-sampling-depth INTEGER
    Range(1, None)        The total frequency that each sample should be
                          rarefied to prior to computing diversity metrics.
                                           

CalledProcessError: Command 'b'qiime diversity core-metrics-phylogenetic \\\n  --i-phylogeny rooted-tree.qza \\\n  --i-table table.qza \\\n  --p-sampling-depth 1150 \\\n  --m-metadata-file sample-metadata.tsv \\\n  --output-dir core-metrics-results\n'' returned non-zero exit status 1.

We can now begin to explore the microbial composition of the samples in relation to the sample metadata. Let’s attempt to answer some questions about the personal human microbiome. Do samples differ in richness, evenness or composition by subject-id (for each individual separately)?

We’ll first test for associations between categorical metadata columns and alpha diversity data. We’ll do that here for the Faith Phylogenetic Diversity (a measure of community richness that is calculated by summing the branch lengths of the rooted tree that connect the features found in the sample) and evenness metrics. The <a href="https://www.marinespecies.org/introduced/wiki/Measurements_of_biodiversity">Pielou</a> evenness index (J′) is 1 if all species are represented in equal numbers in the sample. If one species strongly dominates, J′ is close to zero.
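
As an illustration of how Faith Phylogenetic Diversity adds up branches, here is a minimal sketch on a toy tree. The tree, its branch lengths, and the function name `faith_pd` are hypothetical, and the sketch assumes the convention that every branch on the path from a present tip up to the root is counted once.

```python
def faith_pd(tree, present_tips):
    """Sum of branch lengths on the paths from the present tips to the root."""
    visited = set()
    total = 0.0
    for tip in present_tips:
        node = tip
        # Walk up towards the root, counting each branch only once.
        while node is not None and node not in visited:
            visited.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total

# Hypothetical rooted tree: node -> (parent, branch length to the parent).
tree = {
    "root": (None, 0.0),
    "n1": ("root", 0.5),
    "n2": ("root", 1.0),
    "A": ("n1", 0.25),
    "B": ("n1", 0.25),
    "C": ("n2", 0.5),
    "D": ("n2", 0.75),
}

# Two samples with the same richness (two features each).
print("A and B (close relatives):", faith_pd(tree, ["A", "B"]))
print("A and C (distant relatives):", faith_pd(tree, ["A", "C"]))
```

Both samples contain two features, but the one with distantly related features covers more of the tree and therefore has a higher phylogenetic diversity.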

In [3]:
%%bash
qiime diversity alpha-group-significance \
  --i-alpha-diversity core-metrics-results/faith_pd_vector.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization core-metrics-results/faith-pd-group-significance.qzv

qiime diversity alpha-group-significance \
  --i-alpha-diversity core-metrics-results/evenness_vector.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization core-metrics-results/evenness-group-significance.qzv

Saved Visualization to: core-metrics-results/faith-pd-group-significance.qzv
Saved Visualization to: core-metrics-results/evenness-group-significance.qzv


Are there any categorical sample metadata columns that are strongly associated with the differences in microbial community richness/evenness? Are these differences statistically significant?

Do the Faith Phylogenetic Diversity and evenness change in individuals between baseline and the end of the study? There are other questions concerning longitudinal measurements, but let’s answer this first.

Pairwise difference tests determine whether a specific metric changed significantly between paired samples (e.g., pre- and post-treatment). We will test whether the two alpha diversity metrics changed significantly between two time points, grouped by treatment-group. Filter the donor group out of the metadata, because it doesn’t have time data.
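
The quantity being tested is the per-subject change in a metric between the two time points. The sketch below only computes those differences; it does not run the statistical tests, and it does not model replicate handling. All rows and values are hypothetical, and `pairwise_differences` here is a plain Python helper written for this example, not the QIIME 2 action.

```python
from collections import defaultdict

def pairwise_differences(rows, metric, state_1, state_2):
    """Per-subject change in `metric` between two states, keyed by group."""
    values = defaultdict(dict)
    group_of = {}
    for row in rows:
        if row["week"] in (state_1, state_2):
            values[row["subject-id"]][row["week"]] = row[metric]
            group_of[row["subject-id"]] = row["treatment-group"]
    diffs = defaultdict(dict)
    for subject, by_week in values.items():
        # A subject contributes only if both time points are present.
        if state_1 in by_week and state_2 in by_week:
            diffs[group_of[subject]][subject] = by_week[state_2] - by_week[state_1]
    return dict(diffs)

# Hypothetical metadata merged with a hypothetical faith_pd column.
rows = [
    {"subject-id": "subj-1", "treatment-group": "treatment", "week": 0, "faith_pd": 4.0},
    {"subject-id": "subj-1", "treatment-group": "treatment", "week": 18, "faith_pd": 6.5},
    {"subject-id": "subj-2", "treatment-group": "treatment", "week": 0, "faith_pd": 5.0},
    {"subject-id": "subj-2", "treatment-group": "treatment", "week": 18, "faith_pd": 6.0},
    {"subject-id": "subj-3", "treatment-group": "control", "week": 0, "faith_pd": 5.5},
    {"subject-id": "subj-3", "treatment-group": "control", "week": 18, "faith_pd": 5.0},
    {"subject-id": "subj-4", "treatment-group": "control", "week": 0, "faith_pd": 6.0},
    {"subject-id": "donor-1", "treatment-group": "donor", "week": None, "faith_pd": 7.0},
]

# Drop the donor rows first, because they have no time data.
filtered = [row for row in rows if row["treatment-group"] != "donor"]
diffs = pairwise_differences(filtered, "faith_pd", state_1=0, state_2=18)
print(diffs)
```

Subjects that lack one of the two time points, such as subj-4 in this toy data, do not form a pair and are left out.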

In [5]:
%%bash
grep -v 'donor' sample-metadata.tsv > filtered-sample-metadata.tsv
qiime longitudinal pairwise-differences \
  --m-metadata-file filtered-sample-metadata.tsv \
  --m-metadata-file core-metrics-results/faith_pd_vector.qza \
  --p-metric faith_pd \
  --p-group-column treatment-group \
  --p-state-column week \
  --p-state-1 0 \
  --p-state-2 18 \
  --p-individual-id-column subject-id \
  --p-replicate-handling random \
  --o-visualization pairwise-differences.qzv

qiime longitudinal pairwise-differences \
  --m-metadata-file filtered-sample-metadata.tsv \
  --m-metadata-file core-metrics-results/evenness_vector.qza \
  --p-metric pielou_evenness \
  --p-group-column treatment-group \
  --p-state-column week \
  --p-state-1 0 \
  --p-state-2 18 \
  --p-individual-id-column subject-id \
  --p-replicate-handling random \
  --o-visualization pairwise-differences_e.qzv

Saved Visualization to: pairwise-differences.qzv
Saved Visualization to: pairwise-differences_e.qzv
