# Phyloseq

**Phyloseq** is an R package designed for the analysis and visualization of microbiome census data, particularly high-throughput phylogenetic sequencing data such as 16S rRNA gene sequencing. It provides a unified framework to import, store, process, analyze, and graphically represent complex microbiome datasets. The package integrates multiple data types, including OTU/ASV abundance tables, taxonomic assignments, phylogenetic trees, and sample metadata, into a single object for streamlined analysis.

#### Key Features of Phyloseq:
1. **Data Integration**: Combines OTU/ASV tables, taxonomic classifications, phylogenetic trees, and sample metadata into a single object for easy manipulation and analysis.
2. **Analysis Tools**:
    - Alpha diversity (e.g., richness and Shannon diversity).
    - Beta diversity using ecological distance metrics and ordination methods (e.g., NMDS, PCoA).
    - Taxonomic composition analysis.
3. **Visualization**: Produces publication-quality graphics using ggplot2 for bar plots, heatmaps, ordination plots, and more.
4. **Compatibility**: Works with outputs from popular OTU (or ASV) clustering pipelines and integrates with other R packages for advanced statistical analyses.

#### Applications:

Phyloseq is widely used to study microbial communities in diverse environments such as human microbiomes, soil ecosystems, marine environments, and industrial settings. It simplifies the handling of complex datasets while enabling robust ecological and statistical analyses.


### Let's get started (Note: this is all done in RStudio):

#### **Step 1**: 

The first thing you have to do, ***as with any R package***, is to make sure you have the appropriate packages installed and loaded into your active R session.
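
If you are not sure what is already installed, here is a minimal sketch of a quick check, using only base R. The package list is taken from the `library()` calls in this step; the variable names `needed`, `installed`, and `missing_pkgs` are just for illustration.

```r
# Packages this tutorial loads in Step 1.
needed <- c("phyloseq", "tidyverse", "ggplot2", "dada2")

# requireNamespace() tries to load a package's namespace and returns
# TRUE or FALSE instead of raising an error when the package is absent.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Anything listed here still needs to be installed;
# character(0) means everything is available.
missing_pkgs <- needed[!installed]
print(missing_pkgs)
```

Anything that shows up in `missing_pkgs` still needs to be installed before you can load it.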

In [None]:
## This installs the phyloseq and dada2 packages and their dependencies (software components, libraries, or packages that another piece of software relies on to function correctly) from Bioconductor, then tidyverse and ggplot2 from CRAN
install.packages("BiocManager")
BiocManager::install(c("phyloseq", "dada2"))
install.packages(c("tidyverse", "ggplot2"))

In [None]:
## This loads the packages so that they are active during our R session. You have to "load your libraries" EVERY TIME you open R.
library(phyloseq)
library(tidyverse)
library(ggplot2)
library(dada2) ## You should already have this loaded in your R session.
## Record the R and package versions for reproducibility, so others can replicate your analysis environment. The file is saved as output/sessionInfo.txt inside your working directory.
dir.create("output", showWarnings = FALSE) ## Make sure the output folder exists before writing to it.
writeLines(capture.output(sessionInfo()), file.path("output", "sessionInfo.txt"))

#### **Step 2**: 

Now that we have the packages installed and loaded in our R session, we can start to organize our data in RStudio. We are going to load the ASV table and taxonomy table into our session, along with our metadata:

In [None]:
## Load DADA2 outputs
otu <- read.table("~/path/to/Ast_mixed_silva_otu_table.txt", sep = "\t", header = TRUE, row.names = 1)
taxon <- read.table("~/path/to/Ast_mixed_silva_taxa_table.txt", sep = "\t", header = TRUE, row.names = 1)
## Load metadata
samples <- read.table("~/path/to/Ast_mixed_metadata.txt", sep = "\t", header = TRUE, row.names = 1)

#### **Step 2 (continued)**: Create a Phyloseq Object

Now that we have all of our data loaded in, we can make our phyloseq object.
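
Before combining the tables, it can help to confirm that the ASV names and sample names line up across them. Here is a minimal sketch using toy stand-ins (`otu_toy`, `taxon_toy`, and `samples_toy` are hypothetical, and so are the values in them). It assumes samples are the rows of the count table, which is the layout that goes with `taxa_are_rows=FALSE`.

```r
# Toy stand-ins for the three tables (hypothetical values).
# Count table: samples in rows, ASVs in columns.
otu_toy <- data.frame(ASV1 = c(12, 0), ASV2 = c(3, 8),
                      row.names = c("S1", "S2"))
# Taxonomy table: one row per ASV.
taxon_toy <- data.frame(Kingdom = c("Bacteria", "Bacteria"),
                        Phylum = c("Proteobacteria", "Firmicutes"),
                        row.names = c("ASV1", "ASV2"))
# Metadata: one row per sample.
samples_toy <- data.frame(SymbiontState = c("state1", "state2"),
                          row.names = c("S1", "S2"))

# Every ASV in the count table should have a taxonomy row, and every
# sample in the count table should have a metadata row.
asvs_match <- setequal(colnames(otu_toy), rownames(taxon_toy))
samples_match <- setequal(rownames(otu_toy), rownames(samples_toy))

# Stop early with an error if either set of names disagrees.
stopifnot(asvs_match, samples_match)
```

Running the same two `setequal()` checks on your real `otu`, `taxon`, and `samples` objects is a cheap way to catch a mislabeled file before building the phyloseq object.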

##### Wait, real quick: What is a phyloseq object?
A phyloseq object is a specialized data structure in R used for storing and analyzing microbiome data, particularly high-throughput sequencing datasets like 16S rRNA gene sequencing. It combines multiple related data types into a single, coherent object, making it easier to manage and analyze complex microbiome datasets.

A phyloseq object can integrate the following core data types:

1. **OTU/ASV Table (`otu_table`)**:
    - A matrix of operational taxonomic unit (OTU) or amplicon sequence variant (ASV) abundances across samples.
    - Rows represent taxa (e.g., OTUs or ASVs), and columns represent samples (or vice versa).
2. **Taxonomy Table (`tax_table`)**:
    - A matrix describing the taxonomic classification of each OTU/ASV (e.g., Kingdom, Phylum, Genus).
3. **Sample Data (`sample_data`)**:
    - Metadata about the samples, such as environmental conditions, treatment groups, or collection dates.
4. **Phylogenetic Tree (`phy_tree`)** (optional):
    - A tree representing evolutionary relationships among OTUs/ASVs.
5. **Reference Sequences (`refseq`)** (optional):
    - DNA sequences corresponding to each OTU/ASV.

Purpose and Advantages:

- **Unified Data Storage**: All related data types are stored together in a single object.
- **Data Consistency**: Ensures that all components (e.g., OTU table, taxonomy table) are aligned and describe the same set of samples and taxa.
- **Ease of Analysis**: Simplifies downstream analyses like diversity estimation, ordination, and visualization by providing a consistent framework.
- **Reproducibility**: Facilitates reproducible workflows by keeping all data in one place.

##### Today, we will work with our ASV (OTU) table, our Taxonomy table, and our metadata.

So, let's make our very first phyloseq object!

In [None]:
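## Wrap each table in its phyloseq class. Samples are the rows of our count table, so taxa_are_rows = FALSE.
OTU <- otu_table(otu, taxa_are_rows = FALSE)
taxon <- as.matrix(taxon) ## tax_table() expects a matrix
TAX <- tax_table(taxon)
sampledata <- sample_data(samples)
## Combine the three components into a single phyloseq object.
ps <- phyloseq(OTU, TAX, sampledata)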

This R code is creating a phyloseq object by combining three key data components: an OTU table, a taxonomy table, and sample metadata.
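
To see the same construction on something small enough to check by eye, here is a sketch that builds a toy phyloseq object from made-up data. The objects `counts`, `tax`, `meta`, and `toy` and all of their values are hypothetical; the accessor functions (`ntaxa()`, `nsamples()`, `rank_names()`, `sample_variables()`) come from phyloseq itself.

```r
library(phyloseq)

# Toy count matrix: 2 samples (rows) by 3 ASVs (columns).
counts <- matrix(c(10, 0, 5,
                    3, 7, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("S1", "S2"), c("ASV1", "ASV2", "ASV3")))

# Toy taxonomy matrix: one row per ASV, one column per rank.
tax <- matrix(c("Bacteria", "Proteobacteria",
                "Bacteria", "Firmicutes",
                "Archaea",  "Euryarchaeota"),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("ASV1", "ASV2", "ASV3"), c("Kingdom", "Phylum")))

# Toy metadata: one row per sample.
meta <- data.frame(Treatment = c("control", "treated"),
                   row.names = c("S1", "S2"))

# Combine the three components, as in the cell above.
toy <- phyloseq(otu_table(counts, taxa_are_rows = FALSE),
                tax_table(tax),
                sample_data(meta))

ntaxa(toy)            # number of ASVs
nsamples(toy)         # number of samples
rank_names(toy)       # taxonomic rank columns
sample_variables(toy) # metadata columns
```

These accessors are handy on the real object too, since they let you pull out one number at a time instead of reading the printed summary.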

We can inspect the phyloseq object just by running:

In [None]:
ps

Here is an example output when we run `ps`, inspecting a phyloseq object:

```
phyloseq-class experiment-level object
otu_table()    OTU Table: [100 taxa and 20 samples]
sample_data()  Sample Data: [20 samples by 5 variables]
tax_table()    Taxonomy Table: [100 taxa by 6 taxonomic ranks]
```

This indicates that:
* There are 100 taxa (e.g., OTUs or ASVs) and 20 samples in the dataset.
* The sample metadata has 5 variables (e.g., location, treatment).
* The taxonomy table contains 6 levels of classification (e.g., Kingdom to Genus).

Great work! Now, when we run this with our own data, we should see something that looks like this:

```
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 8605 taxa and 20 samples ]
sample_data() Sample Data:       [ 20 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 8605 taxa by 13 taxonomic ranks ]
```

So, we know that we have 8605 taxa (unique ASVs) in our 20 samples, and our metadata has three separate variables.  

Sometimes in our data, we can have ASVs that have been classified as mitochondria or chloroplasts, and even some eukaryotes and NAs that slipped through. We have to get rid of those, as they are not biologically relevant to our study.

The first thing we are going to do is just inspect how many unique taxa we have in a given group:

In [None]:
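## get_taxa_unique() returns the unique names at a rank; length() counts them
length(get_taxa_unique(ps, "Family")) #433
length(get_taxa_unique(ps, "Order")) #298
length(get_taxa_unique(ps, "Kingdom")) #4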

You should see something like these numbers in your output. This is letting us know that we have 433 unique families represented in the dataset, 298 unique orders, and so on. 
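
If it helps to see what "unique taxa at a rank" means, here is a small base R sketch on a made-up taxonomy matrix (`tax_toy` and its contents are hypothetical). It also shows that `unique()` treats `NA` as a value of its own. As far as I know, `get_taxa_unique()` behaves the same way, so unclassified ASVs can add one to the count.

```r
# Toy taxonomy matrix: 4 ASVs, 3 ranks. ASV4 is unclassified below Kingdom.
tax_toy <- matrix(c("Bacteria", "Rickettsiales", "Mitochondria",
                    "Bacteria", "Rickettsiales", "Mitochondria",
                    "Bacteria", "Bacillales",    "Bacillaceae",
                    "Archaea",  NA,              NA),
                  ncol = 3, byrow = TRUE,
                  dimnames = list(paste0("ASV", 1:4),
                                  c("Kingdom", "Order", "Family")))

# Unique family labels, with NA kept as a distinct entry.
families <- unique(tax_toy[, "Family"])

n_with_na <- length(families)              # counts NA as a family
n_without_na <- length(na.omit(families))  # counts named families only
print(c(n_with_na, n_without_na))
```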

Now, let's remove the mitochondria, chloroplasts, eukaryotes, and NAs that slipped through:

In [None]:
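## Note: a != comparison evaluates to NA for unclassified taxa, and subset_taxa() drops those rows too.
ps <- subset_taxa(ps, Family != "Mitochondria")
ps <- subset_taxa(ps, Order != "Chloroplast")
ps <- subset_taxa(ps, Kingdom != "Eukaryota")
ps <- subset_taxa(ps, !is.na(Kingdom) & Kingdom != "NA")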

Now let's check how many unique taxa we have in a group:

In [None]:
length(get_taxa_unique(ps, "Family")) #428
length(get_taxa_unique(ps, "Order")) #294
length(get_taxa_unique(ps, "Kingdom")) #2

We see that when we re-check, we have removed some unique families, orders, and kingdoms that were ultimately classified as mitochondria, chloroplasts, eukaryotes, and NAs.
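
One detail worth knowing about filters like `Family != "Mitochondria"`: in R, comparing `NA` to a string gives `NA`, and `subset()` drops rows where the condition is `NA`. To my understanding `subset_taxa()` relies on the same logic, so ASVs with no family assignment are removed along with the mitochondria. Here is a base R sketch on a made-up data frame (`tax_df` is hypothetical) showing the difference.

```r
# Toy taxonomy with one unclassified family.
tax_df <- data.frame(
  Family = c("Mitochondria", "Bacillaceae", NA),
  row.names = c("ASV1", "ASV2", "ASV3")
)

# NA != "Mitochondria" is NA, and subset() does not keep NA rows,
# so ASV3 disappears together with ASV1.
dropped_na <- subset(tax_df, Family != "Mitochondria")

# Keeping unclassified families requires saying so explicitly.
kept_na <- subset(tax_df, is.na(Family) | Family != "Mitochondria")

print(rownames(dropped_na))
print(rownames(kept_na))
```

Whether you want to keep or drop unclassified ASVs depends on your study, but it is better to make that choice on purpose.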

When we inspect our phyloseq object again, 

In [None]:
ps

It should look something like: 
```
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 7296 taxa and 20 samples ]
sample_data() Sample Data:       [ 20 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 7296 taxa by 13 taxonomic ranks ]
```
We know that we have successfully removed the mitochondria, chloroplasts, eukaryotes, and NAs from our phyloseq object because the number of taxa went down, from 8605 to 7296.

#### **Step 3**: Basic Quality Control

A good rule of thumb is to only keep ASVs with a mean abundance greater than 5 across all samples. This is a way to remove very rare or low abundance ASVs. Let's take a look here:

In [None]:
ntaxa(ps) #7296
ps5 <- filter_taxa(ps, function(x) mean(x) > 5, prune = TRUE)
ntaxa(ps5) #1253
length(get_taxa_unique(ps, "Genus")) #720
length(get_taxa_unique(ps5, "Genus")) #317

In summary, you're filtering out low-abundance ASVs, which reduces the number of ASVs from 7,296 to 1,253, and the number of unique genera from 720 to 317. This filtering helps to focus on more abundant and potentially more reliable ASVs, reducing noise in your dataset.
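
To make the filter concrete, here is a base R sketch of the same idea on a toy count matrix (`counts` and its values are made up). The function passed to `filter_taxa()` sees the counts of one ASV across all samples; here `colMeans()` computes that mean for every ASV at once.

```r
# Toy counts: 3 samples (rows) by 3 ASVs (columns).
counts <- matrix(c(20, 0, 1,
                   30, 2, 0,
                   10, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("S1", "S2", "S3"),
                                 c("ASV1", "ASV2", "ASV3")))

# Mean abundance of each ASV across all samples.
mean_abund <- colMeans(counts)

# Keep only ASVs whose mean abundance is greater than 5.
keep <- mean_abund > 5
filtered <- counts[, keep, drop = FALSE]

print(mean_abund)
print(colnames(filtered))
```

Only the abundant ASV survives in this toy case; the two rare ones are dropped.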

##### **It is IMPORTANT that you pick your filtering parameters for your own data. Not every dataset is as rich as this one. I have had some datasets where I've only removed singletons (ASVs represented by a single sequence across the entire dataset) or kept ASVs with a mean abundance greater than 1 across all samples. Pick the parameters that are best for YOUR data.**
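
Note that removing singletons and requiring a mean abundance above 1 are not the same filter. Here is a base R sketch on made-up counts (`counts` is hypothetical) that compares the two: a total-count rule (`colSums() > 1`) and a mean-abundance rule (`colMeans() > 1`).

```r
# Toy counts: 4 samples (rows) by 3 ASVs (columns).
counts <- matrix(c(1, 2, 10,
                   0, 1, 10,
                   0, 0, 10,
                   0, 0, 10),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("S", 1:4),
                                 c("ASV_a", "ASV_b", "ASV_c")))

# Singleton removal: keep ASVs seen more than once in the whole dataset.
not_singleton <- colSums(counts) > 1

# Mean-abundance rule: keep ASVs averaging more than 1 read per sample.
mean_above_1 <- colMeans(counts) > 1

print(rbind(not_singleton, mean_above_1))
```

Here `ASV_b` has 3 reads in total, so it is not a singleton, but its mean across 4 samples is 0.75, so the mean-abundance rule would still remove it. The mean-based rule gets stricter as the number of samples grows.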

#### **Step 3 (continued)**: Data Normalization

In microbiome studies, we often normalize to the relative abundance of our taxa. Why?

Microbiome datasets often have varying sequencing depths (i.e., different total read counts across samples). Without normalization, these differences can bias comparisons between samples. By converting to relative abundance:
* Samples become comparable, regardless of their sequencing depth.
* The analysis focuses on the proportions of taxa within each sample rather than raw counts.

Let's try it:

In [None]:
ps_ra <- transform_sample_counts(ps5, function(x) x / sum(x))

What the Code Does:
1. `transform_sample_counts()` Function:
    * This function applies a transformation to the count data in the OTU table of a phyloseq object (here, `ps5`).
    * It processes each sample independently, applying the specified function to the counts.
2. Normalization Function:
    * The function `function(x) x / sum(x)` divides each count (`x`) in a sample by the total sum of counts for that sample (`sum(x)`).
    * This converts raw counts into proportions or relative abundances, where the sum of all taxa in a sample equals 1 (or 100% if expressed as percentages).
3. Result:
    * The new phyloseq object `ps_ra` contains normalized data, where OTU/ASV abundances are expressed as proportions relative to the total sequencing depth of each sample.
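
The same arithmetic can be checked by hand on a toy matrix. This is a base R sketch with made-up counts (`counts` and the sample names `deep` and `shallow` are hypothetical). The two samples have very different sequencing depths, but each row sums to 1 after the transformation.

```r
# Toy counts: 2 samples (rows) with very different depths, 3 ASVs (columns).
counts <- matrix(c(10, 30, 60,
                    1,  1,  2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("deep", "shallow"),
                                 c("ASV1", "ASV2", "ASV3")))

# Divide every count by its sample total (row-wise x / sum(x)).
rel_abund <- sweep(counts, 1, rowSums(counts), "/")

print(rel_abund)
print(rowSums(rel_abund))
```

`ASV3` makes up 60% of the deep sample and 50% of the shallow one, which is a comparison the raw counts (60 vs. 2) would hide.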

Now that we have a phyloseq object that is normalized, we can make a couple of figures to showcase Alpha and Beta diversity, along with some taxonomy plots!

#### **Step 4**: Alpha & Beta Diversity

For the sake of quick visualization, let's work with our original phyloseq object, `ps`. 
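
The two alpha diversity measures plotted in this step are observed richness and Shannon diversity. Here is a base R sketch of how each can be computed by hand for a single sample. The counts in `x` are made up, and I am assuming the natural-log form of the Shannon index, $H = -\sum_i p_i \ln p_i$, which as far as I know is what phyloseq reports.

```r
# Toy counts for one sample; ASV4 was not observed.
x <- c(ASV1 = 50, ASV2 = 30, ASV3 = 20, ASV4 = 0)

# Observed richness: how many ASVs have at least one read.
observed <- sum(x > 0)

# Shannon diversity: -sum(p * ln(p)) over the ASVs that are present.
p <- x[x > 0] / sum(x)
shannon <- -sum(p * log(p))

print(observed)
print(shannon)
```

Richness only asks whether an ASV is present, while Shannon also rewards evenness, so a sample dominated by one ASV scores lower than a sample with the same ASVs in equal proportions.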

Let's visualize the alpha diversity:

In [None]:
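## Put a metadata variable on the x-axis so each box summarizes one group of samples.
plot_richness(ps, x = "SymbiontState", measures = c("Observed", "Shannon")) +
  geom_boxplot() +
  theme_bw()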

How about the beta diversity?
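
A bit of background first: an ordination like PCoA starts from a table of pairwise distances between samples. The default distance in phyloseq's `ordinate()` is Bray-Curtis. Here is a base R sketch of that dissimilarity on made-up count vectors (`bray_curtis()` is a hypothetical helper written for this example, not a phyloseq function).

```r
# Bray-Curtis dissimilarity between two samples' count vectors:
# sum of absolute differences divided by the sum of all counts.
bray_curtis <- function(a, b) {
  sum(abs(a - b)) / sum(a + b)
}

# Toy samples over the same three ASVs.
s1 <- c(10, 0, 5)
s2 <- c(5, 5, 5)
s3 <- c(0, 7, 0)

d_same <- bray_curtis(s1, s1)      # identical samples
d_similar <- bray_curtis(s1, s2)   # partly overlapping samples
d_disjoint <- bray_curtis(s1, s3)  # no shared ASVs

print(c(d_same, d_similar, d_disjoint))
```

The value runs from 0 (identical communities) to 1 (no ASVs in common), and the ordination tries to place samples so that these distances are preserved on the plot.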

In [None]:
# Ordination using PCoA on Bray-Curtis distances (the default distance in ordinate())
ord <- ordinate(ps, method = "PCoA", distance = "bray")
plot_ordination(ps, ord, color = "SymbiontState") +
  geom_point(size = 4)

#### **Step 5**: Taxonomic Composition
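
Plotting at the phylum level means adding up the abundances of all ASVs that share a phylum, which is the job `tax_glom()` does in this step. Here is a base R sketch of that summing on a made-up relative abundance matrix (`rel`, `phylum`, and `by_phylum` are hypothetical names and values).

```r
# Toy relative abundances: 2 samples (rows) by 3 ASVs (columns).
rel <- matrix(c(0.5, 0.3, 0.2,
                0.1, 0.1, 0.8),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("S1", "S2"), c("ASV1", "ASV2", "ASV3")))

# Phylum assignment for each ASV.
phylum <- c(ASV1 = "Proteobacteria",
            ASV2 = "Proteobacteria",
            ASV3 = "Firmicutes")

# rowsum() adds up rows that share a group label, so transpose to put
# ASVs in rows, sum by phylum, then transpose back to samples in rows.
by_phylum <- t(rowsum(t(rel), group = phylum[colnames(rel)]))

print(by_phylum)
```

Each sample still sums to 1 after the collapse, which is why the stacked bars all reach the same height when you plot relative abundances.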

Let's visualize our phyla across our samples in a stacked bar chart:

In [None]:
ps_phylum <- tax_glom(ps_ra, "Phylum")
plot_bar(ps_phylum, fill = "Phylum") +
  theme(legend.position = "bottom")