# What you will learn today:

- How organisms are classified by scientists.
- How to classify the sequences present in gut microbiome samples.

## Taxonomy classification

Biologists classify organisms into groups based on shared characteristics. The scientific discipline devoted to naming and classifying organisms is called taxonomy. Organisms are arranged in a hierarchy in which each group is contained within a larger group. The highest-level groups are the largest and most general, and contain a wide variety of organisms, both living and extinct. They are split into progressively smaller groups whose members share more and more features. Each classification group, or level in the hierarchy, is called a taxon. The most basic taxon is the species, a group of closely related organisms that can produce viable offspring that can also reproduce. The scientific name of each species consists of two parts (in Greek or Latin): the genus name, italicized and always capitalized, and the species name, italicized and lowercase (<i>Homo sapiens</i>).

The Linnaean system, the most common classification system today, has eight levels of taxa: domain, kingdom, phylum, class, order, family, genus, and species. There are three domains: Archaea, Bacteria, and Eukarya. The Archaea and the Bacteria are prokaryotes (single-cell organisms without a nucleus) that differ from each other in structural, genetic, and biochemical characteristics. Eukarya contains the eukaryotes, organisms with a nucleus and membrane-bound organelles. There are six kingdoms: Archaea, Bacteria, Protista, Fungi, Plantae, and Animalia.
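The hierarchy above is also how taxonomy shows up in the data later in this tutorial. The sketch below assumes the Greengenes-style lineage format, in which each rank is written with a one-letter prefix (`k__`, `p__`, ..., `s__`) and ranks are separated by semicolons; note that this format has seven ranks and no separate domain level. The function name `parse_lineage` and the example string are illustrative, not part of QIIME 2.

```python
# Rank prefixes used in Greengenes-style lineage strings (seven ranks).
RANKS = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}

def parse_lineage(lineage):
    """Split 'k__Bacteria; p__Firmicutes; ...' into a {rank: name} dict."""
    parsed = {}
    for part in lineage.split(";"):
        part = part.strip()
        if not part:
            continue
        prefix, _, name = part.partition("__")
        # An empty name (for example 's__') means the rank was left unassigned.
        parsed[RANKS[prefix]] = name or None
    return parsed

# Illustrative lineage string, written by hand for this example.
example = (
    "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Lactobacillaceae; g__Lactobacillus; s__"
)
lineage = parse_lineage(example)
print(lineage["phylum"])   # Firmicutes
print(lineage["species"])  # None
```

Reading a lineage from left to right moves from the most general group to the most specific one, which mirrors the hierarchy described above.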

<a href="https://kids.britannica.com/students/article/biological-classification/611149"><img src="images/Taxa.png" width="300" align="center"></a>

## How do we obtain the OTUs?

The 16S rRNA gene (in our case its V4 region) is a popular target for taxonomy and phylogeny studies because it is highly conserved (it retains similarity throughout evolution), while its hypervariable regions, such as V4, still differ enough between taxa to tell them apart. To obtain the sample’s taxonomic groups, or Operational Taxonomic Units (OTUs), which you got acquainted with in T2, the <b>SEQUENCES ARE FIRST CLUSTERED</b>, i.e. combined into groups by sequence similarity. Generally, sequences with > 95% identity are considered to represent the same genus, while sequences with > 97% identity are considered the same species [1]. Note that different sequencing machines are prone to different errors, and such errors can be mistaken for new species, which we want to avoid. Also, we assume that one sequence corresponds to one species, but in reality one species can carry multiple different copies of a gene, and an identical amplified region can be shared by multiple species.
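The clustering idea can be made concrete with a toy example. The sketch below is a simplified greedy clustering, not the algorithm of any particular tool: it assumes equal-length sequences and measures identity as the fraction of matching positions, whereas real tools align the sequences first. The helper names (`identity`, `greedy_cluster`, `mutate`) and the sequences are made up for illustration.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy measure needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, threshold=0.97):
    """Put each sequence into the first cluster whose centroid it matches."""
    clusters = []  # each cluster is a list; its first element is the centroid
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            # No existing centroid is similar enough: start a new cluster.
            clusters.append([seq])
    return clusters

def mutate(seq, n):
    """Change the first n bases so that exactly n positions differ."""
    swap = {"A": "C", "C": "A", "G": "T", "T": "G"}
    return "".join(swap[b] for b in seq[:n]) + seq[n:]

base = "ACGT" * 25                                # 100 nt toy sequence
reads = [base, mutate(base, 1), mutate(base, 5)]  # 100%, 99%, 95% identity to base

clusters = greedy_cluster(reads, threshold=0.97)
print([len(c) for c in clusters])  # [2, 1]
```

At the 97% threshold the read with one mismatch joins the first cluster, while the read with five mismatches starts its own. This also shows why sequencing errors matter: a handful of erroneous bases is enough to create a spurious new group.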

<b>SECOND IS COMPARISON TO A REFERENCE DATABASE</b> for taxonomic assignment. Many studies have already attempted to describe the gut microbiome, so a reference database is available. In open-reference clustering, sequences that are similar to those in the database are clustered against it, and de novo clustering is performed on the sequences that are too different to match. In closed-reference clustering, the least computationally expensive procedure, those unmatched sequences are dropped instead, which works for thoroughly studied sample types such as human stool. In de novo clustering, OTUs are obtained from the distances between the sequences themselves rather than from their distance to a reference database. Here the groups resulting from clustering can change as more data are added [2], and the approach is very computationally expensive. We are going to use closed-reference clustering in our study.
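The difference between the reference-based strategies can be sketched in a few lines. This is a toy illustration under the same simplifying assumption as before (equal-length sequences, identity as the fraction of matching positions); the reference IDs, sequences, and function names are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def closed_reference(seqs, reference, threshold=0.97):
    """Assign each sequence to its best reference hit, or mark it unmatched."""
    otus = {}
    unmatched = []
    for seq in seqs:
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in reference.items():
            score = identity(seq, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            otus.setdefault(best_id, []).append(seq)
        else:
            unmatched.append(seq)
    return otus, unmatched

# Hypothetical reference database with two entries of 100 nt each.
reference = {"ref_A": "ACGT" * 25, "ref_B": "GGCC" * 25}

read_a = "T" + reference["ref_A"][1:]   # 99% identical to ref_A
read_b = "AA" + reference["ref_B"][2:]  # 98% identical to ref_B
novel = "TTAA" * 25                     # matches neither reference

otus, unmatched = closed_reference([read_a, read_b, novel], reference)
print(sorted(otus))    # ['ref_A', 'ref_B']
print(len(unmatched))  # 1
```

Closed-reference clustering stops here and discards `unmatched`. An open-reference approach would instead pass `unmatched` on to a de novo clustering step, which is where the extra computational cost comes from.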

To find out which taxa are present, we classify the amplicon sequence variants (ASVs) that we already have and associate them with taxon names using the q2-feature-classifier plugin. We will use a pre-trained naive Bayes machine-learning classifier that was trained to differentiate the taxa present in the 99% Greengenes 13_8 reference set, specifically for the V4 hypervariable region. A pre-trained classifier is a model that was trained on a large benchmark dataset to solve a problem similar to the one that we have to solve. <a href="https://www.youtube.com/watch?v=O2L2Uv9pdDA">Naive Bayes</a> assigns each sequence the label with the highest probability, combining the prior probability of each label with the evidence found in the sequence and "naively" treating each piece of evidence as independent of the others. The classifier works by identifying k-mers (substrings of a sequence) that are diagnostic for particular taxonomic groups, and using that information to predict the taxonomic affiliation of each ASV. We can download the pre-trained classifier:

In [1]:
%%bash
qiime info
wget \
  -O "gg-13-8-99-515-806-nb-classifier.qza" \
  "https://data.qiime2.org/2022.2/common/gg-13-8-99-515-806-nb-classifier.qza"

System versions
Python version: 3.8.12
QIIME 2 release: 2022.2
QIIME 2 version: 2022.2.0
q2cli version: 2022.2.0

Installed plugins
alignment: 2022.2.0
composition: 2022.2.0
cutadapt: 2022.2.0
dada2: 2022.2.0
deblur: 2022.2.0
demux: 2022.2.0
diversity: 2022.2.0
diversity-lib: 2022.2.0
emperor: 2022.2.0
feature-classifier: 2022.2.0
feature-table: 2022.2.0
fragment-insertion: 2022.2.0
gneiss: 2022.2.0
longitudinal: 2022.2.0
metadata: 2022.2.0
phylogeny: 2022.2.0
quality-control: 2022.2.0
quality-filter: 2022.2.0
sample-classifier: 2022.2.0
taxa: 2022.2.0
types: 2022.2.0
vsearch: 2022.2.0

Application config directory
/home/felitsiya/miniconda3/envs/qiime2-2022.2/var/q2cli

Getting help
To get help with QIIME 2, visit https://qiime2.org


--2022-04-13 14:38:35--  https://data.qiime2.org/2022.2/common/gg-13-8-99-515-806-nb-classifier.qza
Resolving data.qiime2.org (data.qiime2.org)... 54.200.1.12
Connecting to data.qiime2.org (data.qiime2.org)|54.200.1.12|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://s3-us-west-2.amazonaws.com/qiime2-data/2022.2/common/gg-13-8-99-515-806-nb-classifier.qza [following]
--2022-04-13 14:38:36--  https://s3-us-west-2.amazonaws.com/qiime2-data/2022.2/common/gg-13-8-99-515-806-nb-classifier.qza
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.234.184
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.234.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28289645 (27M) [binary/octet-stream]
Saving to: ‘gg-13-8-99-515-806-nb-classifier.qza’


Naive Bayes classifiers perform best when they are trained for the specific hypervariable region that was amplified in the particular study. You can train a classifier specific for your dataset or download pre-trained classifiers for other datasets from the QIIME 2 resource page. Let's classify our representative sequences.

In [2]:
%%bash
qiime feature-classifier classify-sklearn \
  --i-reads ./rep-seqs.qza \
  --i-classifier ./gg-13-8-99-515-806-nb-classifier.qza \
  --o-classification ./taxonomy.qza

Saved FeatureData[Taxonomy] to: ./taxonomy.qza
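To build some intuition for what a naive Bayes k-mer classifier does, here is a minimal from-scratch sketch. It is not the q2-feature-classifier implementation: it assumes a multinomial model over 3-mers with add-one (Laplace) smoothing, and the training sequences and the labels `taxon_A` and `taxon_B` are made up. The log scores it returns are also not the same thing as the confidence values reported by QIIME 2.

```python
import math
from collections import Counter, defaultdict

def kmers(seq, k=3):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train(labelled_seqs, k=3):
    """Count k-mers per taxon; returns (kmer_counts, taxon_counts, vocabulary)."""
    kmer_counts = defaultdict(Counter)
    taxon_counts = Counter()
    vocab = set()
    for seq, taxon in labelled_seqs:
        taxon_counts[taxon] += 1
        for km in kmers(seq, k):
            kmer_counts[taxon][km] += 1
            vocab.add(km)
    return kmer_counts, taxon_counts, vocab

def classify(seq, model, k=3):
    """Return the taxon with the highest log posterior score, plus all scores."""
    kmer_counts, taxon_counts, vocab = model
    n_seqs = sum(taxon_counts.values())
    scores = {}
    for taxon in taxon_counts:
        log_p = math.log(taxon_counts[taxon] / n_seqs)  # prior probability
        total = sum(kmer_counts[taxon].values())
        for km in kmers(seq, k):
            # Add-one smoothing keeps unseen k-mers from giving log(0).
            log_p += math.log((kmer_counts[taxon][km] + 1) / (total + len(vocab)))
        scores[taxon] = log_p
    return max(scores, key=scores.get), scores

# Made-up training data: two sequences per hypothetical taxon.
training = [
    ("ACGTACGTACGT", "taxon_A"),
    ("ACGTACGAACGT", "taxon_A"),
    ("GGCCGGCCGGCC", "taxon_B"),
    ("GGCCGGCAGGCC", "taxon_B"),
]
model = train(training)
label, scores = classify("ACGTACGTACGA", model)
print(label)  # taxon_A
```

The query shares most of its 3-mers with the `taxon_A` training sequences and none with `taxon_B`, so `taxon_A` gets the higher score. A real classifier follows the same idea with longer k-mers and a full reference database.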


Let’s see the assigned taxonomy associated with the sequences.

In [3]:
%%bash
qiime metadata tabulate \
  --m-input-file ./taxonomy.qza \
  --o-visualization ./taxonomy.qzv

Saved Visualization to: ./taxonomy.qzv


Let’s also tabulate the representative sequences. This will allow us to see the sequence assigned to each feature identifier and interactively BLAST the sequence against the NCBI database.

In [4]:
%%bash
qiime feature-table tabulate-seqs \
  --i-data ./rep-seqs.qza \
  --o-visualization ./rep-seqs.qzv

Saved Visualization to: ./rep-seqs.qzv


Can you answer the following questions? 

- Find the feature 002e78333d6cf2b11aa7a5ba03dd2c68. What is the taxonomic classification of this sequence? What’s the confidence for the assignment?
- How many features are classified as g__Lactobacillus?
- Use the tabulated representative sequences to look up these features. If you BLAST them against NCBI, do you get the same taxonomic assignment as you obtained with q2-feature-classifier?
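Questions like these can also be answered programmatically once the taxonomy is available as a table. The sketch below assumes a tab-separated layout with the columns `Feature ID`, `Taxon`, and `Confidence`; the rows, feature IDs, and confidence values are invented for illustration and do not come from this dataset.

```python
import csv
import io

# Invented example rows in an assumed three-column, tab-separated layout.
toy_tsv = (
    "Feature ID\tTaxon\tConfidence\n"
    "feature_1\tk__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Lactobacillaceae; g__Lactobacillus; s__\t0.98\n"
    "feature_2\tk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; "
    "f__Bacteroidaceae; g__Bacteroides; s__\t0.91\n"
    "feature_3\tk__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Lactobacillaceae; g__Lactobacillus; s__reuteri\t0.85\n"
)

rows = list(csv.DictReader(io.StringIO(toy_tsv), delimiter="\t"))
by_id = {row["Feature ID"]: row for row in rows}

def has_label(taxon, label):
    """True if the lineage contains exactly this rank label, e.g. 'g__Lactobacillus'."""
    return label in [part.strip() for part in taxon.split(";")]

# Look up one feature's classification and confidence.
print(by_id["feature_2"]["Taxon"])
print(by_id["feature_2"]["Confidence"])  # 0.91

# Count the features assigned to a given genus.
lacto = [r["Feature ID"] for r in rows if has_label(r["Taxon"], "g__Lactobacillus")]
print(len(lacto))  # 2
```

Matching on the whole rank label rather than on a substring avoids accidentally counting genera whose names merely start with the same letters.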

Let’s look at the taxonomic composition of the samples. To visualize it, we will build a taxonomic bar chart of the samples we analyzed in the diversity dataset.

In [5]:
%%bash
qiime taxa barplot \
  --i-table ./table.qza \
  --i-taxonomy ./taxonomy.qza \
  --m-metadata-file ./sample-metadata.tsv \
  --o-visualization ./taxa_barplot.qzv

Saved Visualization to: ./taxa_barplot.qzv


Visualize the data at level 2 (phylum level) and sort the samples by sample-type. Can you observe a consistent difference in phylum composition between stool and swab samples?
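Viewing the bar chart at level 2 amounts to summing the counts of all features that share the same first two ranks and then expressing them as a fraction of each sample. The sketch below illustrates that idea with invented counts; the sample names, feature IDs, and the functions `collapse` and `relative_abundance` are hypothetical and only approximate what the visualization does internally.

```python
from collections import defaultdict

# Invented feature table: feature -> {sample: read count}.
counts = {
    "feature_1": {"stool_1": 30, "swab_1": 10},
    "feature_2": {"stool_1": 60, "swab_1": 10},
    "feature_3": {"stool_1": 10, "swab_1": 30},
}
# Invented lineages, truncated at class level for brevity.
taxonomy = {
    "feature_1": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "feature_2": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia",
    "feature_3": "k__Bacteria; p__Firmicutes; c__Clostridia",
}

def collapse(counts, taxonomy, level):
    """Sum counts per sample over features sharing the first `level` ranks."""
    collapsed = defaultdict(lambda: defaultdict(int))
    for feature, per_sample in counts.items():
        ranks = [part.strip() for part in taxonomy[feature].split(";")]
        label = "; ".join(ranks[:level])  # level 2 keeps kingdom and phylum
        for sample, n in per_sample.items():
            collapsed[sample][label] += n
    return collapsed

def relative_abundance(collapsed):
    """Convert per-sample counts into fractions that sum to 1."""
    result = {}
    for sample, taxa in collapsed.items():
        total = sum(taxa.values())
        result[sample] = {label: n / total for label, n in taxa.items()}
    return result

rel = relative_abundance(collapse(counts, taxonomy, level=2))
for sample, taxa in rel.items():
    print(sample, taxa)
```

In this invented example the two Firmicutes features are merged into a single phylum-level bar, so the stool sample is 40% Firmicutes and the swab sample is 80% Firmicutes.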

[1] J. S. Johnson et al., “Evaluation of 16S rRNA gene sequencing for species and strain-level microbiome analysis,” Nat. Commun., vol. 10, no. 1, p. 5029, Nov. 2019, doi: 10.1038/s41467-019-13036-1.

[2] S. L. Westcott and P. D. Schloss, “De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units,” PeerJ, vol. 3, p. e1487, 2015, doi: 10.7717/peerj.1487.