# NASA Simulated Space Radiation and HindLimb suspension study - Taxonomy Assignment

Run this notebook in `qiime2-2024.4`.

Continuing with the [Taxonomy](https://docs.qiime2.org/2022.11/tutorials/pd-mice/#taxonomic-classification) and [Phylogeny](https://docs.qiime2.org/2022.11/tutorials/pd-mice/#generating-a-phylogenetic-tree-for-diversity-analysis) steps. Note: we'll use the *de novo* [align-to-tree-mafft-fasttree](https://docs.qiime2.org/2022.11/tutorials/phylogeny/#pipelines) pipeline so we can run through this tutorial more quickly.

In [1]:
from os import getcwd, listdir, chdir, mkdir
import qiime2 as q2

In [2]:
getcwd()

'/mnt/e/NASA_microbiome'

In [3]:
chdir('./processed')
getcwd()

'/mnt/e/NASA_microbiome/processed'

## Download classifiers if running on your laptop:

We'll assign taxonomy using SILVA. You can obtain pre-trained classifiers from the [Data Resource Page](https://docs.qiime2.org/2022.11/data-resources/).

In [5]:
mkdir('silva-classifiers')

In [6]:
! wget https://data.qiime2.org/2022.11/common/silva-138-99-515-806-nb-classifier.qza \
    -O ./silva-classifiers/silva-138-99-515-806-nb-classifier.qza

--2024-03-08 21:07:31--  https://data.qiime2.org/2022.11/common/silva-138-99-515-806-nb-classifier.qza
Resolving data.qiime2.org (data.qiime2.org)... 54.200.1.12
Connecting to data.qiime2.org (data.qiime2.org)|54.200.1.12|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://s3-us-west-2.amazonaws.com/qiime2-data/2022.11/common/silva-138-99-515-806-nb-classifier.qza [following]
--2024-03-08 21:07:32--  https://s3-us-west-2.amazonaws.com/qiime2-data/2022.11/common/silva-138-99-515-806-nb-classifier.qza
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.92.179.192, 52.92.228.200, 52.92.162.16, ...
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.92.179.192|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 148294965 (141M) [binary/octet-stream]
Saving to: ‘./silva-classifiers/silva-138-99-515-806-nb-classifier.qza’


2024-03-08 21:07:42 (14.4 MB/s) - ‘./silva-classifiers/silva-138-99-515-
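
Before using a downloaded classifier, it can be worth confirming that the file arrived in full. The sketch below compares the size on disk with the `Length: 148294965` value reported by `wget` above. The helper name `download_looks_complete` is hypothetical and not part of QIIME 2; a size check is only a rough guard and does not replace a checksum.

```python
from pathlib import Path

# Byte count taken from the "Length:" line of the wget log above.
EXPECTED_BYTES = 148294965


def download_looks_complete(path, expected_bytes=EXPECTED_BYTES):
    """Return True if `path` is a file whose size equals `expected_bytes`."""
    candidate = Path(path)
    return candidate.is_file() and candidate.stat().st_size == expected_bytes


# Path used by the wget command in this notebook.
classifier_path = './silva-classifiers/silva-138-99-515-806-nb-classifier.qza'
print(download_looks_complete(classifier_path))
```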

## If you are running on the HPC, the classifiers are located at:
 - `/home/SE/BMIG-6202-MSR/RefDBs/q2-2022.11/silva-138-1-ssu-nr99-515f-806r-classifier.qza`
 - `/home/SE/BMIG-6202-MSR/RefDBs/q2-2022.11/silva-138-1-ssu-nr99-classifier.qza`
 
 You can set up a shortcut like this:
 
`silva_classifier='/mnt/e/NASA_microbiome/Processed/silva-classifiers/silva-138-99-515-806-nb-classifier.qza'`

In [4]:
silva_classifier='/mnt/e/NASA_microbiome/Processed/v3_v4_classifier/silva-138-1-ssu-nr99-357f-806r-classifier.qza'
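
Because the classifier lives in different places on the HPC and on a laptop, a small check can catch a wrong path before the long-running classification step. This is a minimal sketch: `first_existing` is a hypothetical helper, the candidate list simply reuses paths mentioned in this notebook, and the result is stored in a separate variable so the `silva_classifier` shortcut above is left untouched.

```python
from pathlib import Path


def first_existing(candidates):
    """Return the first candidate path that exists as a file, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return str(path)
    return None


# Candidate locations taken from this notebook (HPC first, then laptop).
candidates = [
    '/home/SE/BMIG-6202-MSR/RefDBs/q2-2022.11/silva-138-1-ssu-nr99-515f-806r-classifier.qza',
    './silva-classifiers/silva-138-99-515-806-nb-classifier.qza',
]

silva_classifier_checked = first_existing(candidates)
print(silva_classifier_checked)  # None when neither file is present
```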

## Classify sequences / reads

In the command below, I'll be running on the HPC using the shortcut `$silva_classifier`.

In [5]:
! qiime feature-classifier classify-sklearn \
    --i-reads ./dada2-pe-repseqs.qza \
    --i-classifier $silva_classifier \
    --p-n-jobs 2 \
    --o-classification ./taxonomy.qza

Saved FeatureData[Taxonomy] to: ./taxonomy.qza

In [6]:
# View list of classifications
! qiime metadata tabulate \
    --m-input-file ./taxonomy.qza \
    --o-visualization ./taxonomy.qzv

Saved Visualization to: ./taxonomy.qzv

In [7]:
# View a taxonomy barplot
! qiime taxa barplot \
    --i-table ./dada2-pe-table.qza \
    --i-taxonomy ./taxonomy.qza \
    --m-metadata-file ./NASA-Metadata.tsv \
    --o-visualization ./taxa_barplot.qzv

Saved Visualization to: ./taxa_barplot.qzv

## Remove poorly classified reads

[Filtering Documentation](https://docs.qiime2.org/2020.11/tutorials/filtering/)
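
The `qiime taxa filter-table` call in this section uses `--p-mode 'contains'` with `--p-include 'p__'` and `--p-exclude 'p__;,Eukaryota,Chloroplast,Mitochondria'`. The sketch below illustrates the intended effect on a few made-up taxonomy strings: a feature is kept only if its annotation contains an include term and none of the exclude terms. It assumes case-insensitive substring matching and comma-delimited terms; `keep_feature` is a hypothetical helper, not the plugin's implementation.

```python
def keep_feature(taxon, include, exclude):
    """Approximate 'contains' filtering: any include term and no exclude term."""
    taxon_lc = taxon.lower()
    included = any(term.lower() in taxon_lc for term in include)
    excluded = any(term.lower() in taxon_lc for term in exclude)
    return included and not excluded


# Same terms as the command in this section, split on the comma delimiter.
include = 'p__'.split(',')
exclude = 'p__;,Eukaryota,Chloroplast,Mitochondria'.split(',')

# Made-up annotations for illustration only.
examples = [
    'd__Bacteria; p__Firmicutes; c__Bacilli',          # has a phylum: kept
    'd__Bacteria',                                     # no phylum: dropped
    'd__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast',
    'd__Eukaryota; p__Ascomycota',
    'k__Bacteria; p__; c__',                           # empty phylum: dropped
]

for taxon in examples:
    print(keep_feature(taxon, include, exclude), taxon)
```

Requiring `p__` removes features that were classified only to domain level, while the `p__;` exclude term catches annotations where the phylum rank is present but empty.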

In [8]:
! qiime taxa filter-table \
    --i-table ./dada2-pe-table.qza \
    --i-taxonomy ./taxonomy.qza \
    --p-mode 'contains'  \
    --p-include 'p__' \
    --p-exclude 'p__;,Eukaryota,Chloroplast,Mitochondria' \
    --o-filtered-table ./table-no-ecmu.qza

Saved FeatureTable[Frequency] to: ./table-no-ecmu.qza

In [9]:
# keep seq file in sync with table
! qiime feature-table filter-seqs \
    --i-data ./dada2-pe-repseqs.qza \
    --i-table ./table-no-ecmu.qza \
    --o-filtered-data rep_set-no-ecmu.qza

Saved FeatureData[Sequence] to: rep_set-no-ecmu.qza

In [10]:
! qiime tools export \
    --input-path rep_set-no-ecmu.qza \
    --output-path rep_set-no-ecmu-export

Exported rep_set-no-ecmu.qza as DNASequencesDirectoryFormat to directory rep_set-no-ecmu-export

In [11]:
# View a taxonomy barplot
! qiime taxa barplot \
    --i-table ./table-no-ecmu.qza \
    --i-taxonomy ./taxonomy.qza \
    --m-metadata-file ./NASA-Metadata.tsv \
    --o-visualization ./table-no-ecmu-taxa-barplot.qzv

Saved Visualization to: ./table-no-ecmu-taxa-barplot.qzv

#### Krona plot

To install the Krona plugin, see [q2-krona](https://github.com/kaanb93/q2-krona).

NB: Krona requires 7 levels of taxonomic classification. Check your taxonomy, because it may have only 6 levels if the classifier does not assign species. Adjust `--p-level` to match the number of levels the classifier provides.
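
One way to check how many levels your classifier assigns before choosing `--p-level` is to count the ranks in the taxonomy strings, for example from an exported `taxonomy.tsv`. This is a minimal sketch that assumes semicolon-delimited annotations; the helper names and the example strings are made up for illustration.

```python
def taxonomy_depth(taxon):
    """Count the non-empty, semicolon-separated ranks in one taxonomy string."""
    return sum(1 for rank in taxon.split(';') if rank.strip())


def max_depth(taxa):
    """Deepest rank count seen across an iterable of taxonomy strings."""
    return max((taxonomy_depth(taxon) for taxon in taxa), default=0)


# Made-up annotations: one resolved to genus (6 ranks), one to class (3 ranks).
taxa = [
    'd__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; '
    'f__Lactobacillaceae; g__Lactobacillus',
    'd__Bacteria; p__Bacteroidota; c__Bacteroidia',
]

print(max_depth(taxa))  # deepest level available for --p-level
```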

In [17]:
! qiime krona collapse-and-plot \
    --i-table ./table-no-ecmu.qza \
    --i-taxonomy ./taxonomy.qza \
    --p-level 6 \
    --o-krona-plot ./table-no-ecmu-taxa-krona.qzv

Saved Visualization to: ./table-no-ecmu-taxa-krona.qzv

##  Other QA / QC Operations

See [q2-quality-control tutorial](https://docs.qiime2.org/2022.11/tutorials/quality-control/).

In [19]:
silva_ref_seq='/mnt/e/NASA_microbiome/Processed/references/silva-138-1-ssu-nr99-seqs-357f-806r-cln-dr-uniq.qza'

In [20]:
# Remove poor-quality sequences that lack a decent match to our curated reference database.
! qiime quality-control exclude-seqs \
    --i-query-sequences ./rep_set-no-ecmu.qza \
    --i-reference-sequences $silva_ref_seq \
    --p-method vsearch \
    --p-perc-identity 0.90 \
    --p-perc-query-aligned 0.90 \
    --p-threads 8 \
    --o-sequence-hits ./hits.qza \
    --o-sequence-misses ./misses.qza \
    --verbose

Running external command line application. This may print messages to stdout and/or stderr.
The commands to be run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: vsearch --usearch_global /tmp/qiime2/raotoo/data/03a7accf-ed9f-4492-9126-23c8334ee130/data/dna-sequences.fasta --id 0.9 --strand both --maxaccepts 1 --maxrejects 0 --db /tmp/qiime2/raotoo/data/f678cd14-77e1-4e83-976e-fd56f4e8e433/data/dna-sequences.fasta --threads 8 --userfields query+target+ql+qlo+qhi --userout /tmp/tmpfkxrrkeq

vsearch v2.22.1_linux_x86_64, 63.7GB RAM, 16 cores
https://github.com/torognes/vsearch

Reading file /tmp/qiime2/raotoo/data/f678cd14-77e1-4e83-976e-fd56f4e8e433/data/dna-sequences.fasta 100%                   
135945441 nt in 322776 seqs, min 150, max 2005, avg 421
Masking 100%                                                                                                                                                       
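
The logged vsearch command applies the identity cutoff directly (`--id 0.9`) and reports the query length and alignment coordinates (`ql`, `qlo`, `qhi`). The sketch below shows one plausible way the `--p-perc-query-aligned` threshold could be evaluated from those fields. This is an assumption for illustration; the exact logic inside q2-quality-control may differ, and `passes_query_aligned` is a hypothetical name.

```python
def passes_query_aligned(ql, qlo, qhi, perc_query_aligned=0.90):
    """Return True if the aligned span covers enough of the query.

    ql  : query length in nucleotides
    qlo : first aligned query position (1-based)
    qhi : last aligned query position (1-based)
    """
    aligned_fraction = (abs(qhi - qlo) + 1) / ql
    return aligned_fraction >= perc_query_aligned


# A 429 nt query aligned end to end counts as a hit.
print(passes_query_aligned(ql=429, qlo=1, qhi=429))
# The same query aligned over only 300 nt falls below the 0.90 cutoff.
print(passes_query_aligned(ql=429, qlo=1, qhi=300))
```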

In [21]:
# filter table to match filtered sequence file
! qiime feature-table filter-features \
    --i-table ./table-no-ecmu.qza \
    --m-metadata-file ./hits.qza \
    --o-filtered-table ./table-no-ecmu-hits.qza

Saved FeatureTable[Frequency] to: ./table-no-ecmu-hits.qza

#### Given that we filtered our data again, you may want to regenerate the taxonomy plots. Use the earlier taxonomy visualization commands as a guide and run them below with the new table:

In [22]:
# updated taxonomy barplot
! qiime taxa barplot \
    --i-table ./table-no-ecmu-hits.qza \
    --i-taxonomy ./taxonomy.qza \
    --m-metadata-file ./NASA-Metadata.tsv \
    --o-visualization ./table-no-ecmu-hits-taxa-barplot.qzv

Saved Visualization to: ./table-no-ecmu-hits-taxa-barplot.qzv

In [24]:
# updated krona plot
! qiime krona collapse-and-plot \
    --i-table ./table-no-ecmu-hits.qza \
    --i-taxonomy ./taxonomy.qza \
    --p-level 6 \
    --o-krona-plot ./table-no-ecmu-hits-taxa-krona.qzv

Saved Visualization to: ./table-no-ecmu-hits-taxa-krona.qzv

#### krona collapse by group 
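
The grouping step in this section uses `--p-mode 'mean-ceiling'`, which, for each feature, averages the counts of all samples in a group and rounds the result up to the next integer. The sketch below reproduces that arithmetic for a single feature; the sample IDs, group labels, and helper name are invented for illustration.

```python
import math
from collections import defaultdict


def group_mean_ceiling(counts, groups):
    """Collapse per-sample counts of one feature into per-group ceil(mean).

    counts : {sample_id: count}
    groups : {sample_id: group_label}
    """
    by_group = defaultdict(list)
    for sample, count in counts.items():
        by_group[groups[sample]].append(count)
    return {
        group: math.ceil(sum(values) / len(values))
        for group, values in by_group.items()
    }


# Invented counts for one feature across six samples in two groups.
counts = {'s1': 1, 's2': 2, 's3': 2, 's4': 0, 's5': 0, 's6': 1}
groups = {'s1': 'A', 's2': 'A', 's3': 'A', 's4': 'B', 's5': 'B', 's6': 'B'}

print(group_mean_ceiling(counts, groups))
```

Rounding up means a feature seen in only one sample of a group still gets a count of at least 1 in the grouped table.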

In [26]:
! qiime feature-table group \
    --i-table ./table-no-ecmu-hits.qza \
    --p-axis sample \
    --m-metadata-file ./NASA-Metadata.tsv \
    --m-metadata-column TreatmentGroup \
    --p-mode 'mean-ceiling' \
    --o-grouped-table ./table-no-ecmu-hits-TreatmentGroup.qza

Saved FeatureTable[Frequency] to: ./table-no-ecmu-hits-TreatmentGroup.qza

In [27]:
! qiime krona collapse-and-plot \
    --i-table ./table-no-ecmu-hits-TreatmentGroup.qza \
    --i-taxonomy ./taxonomy.qza \
    --p-level 6 \
    --o-krona-plot ./table-no-ecmu-hits-TreatmentGroup-taxa-krona.qzv

Saved Visualization to: ./table-no-ecmu-hits-TreatmentGroup-taxa-krona.qzv

## Construct phylogeny

See the [Inferring Phylogenies tutorial](https://docs.qiime2.org/2022.11/tutorials/phylogeny/) for more information.

We'll run [FastTree](https://docs.qiime2.org/2022.11/tutorials/phylogeny/#fasttree) to be quick, though I'd recommend [iqtree](https://docs.qiime2.org/2022.11/tutorials/phylogeny/#iqtree) or [fragment-insertion](https://library.qiime2.org/plugins/q2-fragment-insertion/16/).

We'll be using the [align-to-tree-mafft-fasttree](https://docs.qiime2.org/2022.11/tutorials/phylogeny/#pipelines) pipeline.

### *De novo* phylogeny

View with [iTOL](https://itol.embl.de/) or [Empress](https://github.com/biocore/empress).

In [30]:
# pipeline: alignment through phylogeny
! qiime phylogeny align-to-tree-mafft-fasttree \
    --i-sequences ./hits.qza \
    --output-dir ./mafft-fasttree-output \
    --verbose

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: mafft --preservecase --inputorder --thread 1 /tmp/qiime2/raotoo/data/aee46648-71b3-4f9d-925d-06f70d1bce82/data/dna-sequences.fasta

inputfile = orig
3484 x 429 - 266 d
nthread = 1
nthreadpair = 1
nthreadtb = 1
ppenalty_ex = 0
stacksize: 8192 kb
generating a scoring matrix for nucleotide (dist=200) ... done
Gap Penalty = -1.53, +0.00, +0.00



Making a distance matrix ..
 3401 / 3484 (thread    0)
done.

Constructing a UPGMA tree (efffree=0) ... 
 3480 / 3484
done.

Progressive alignment 1/2... 
STEP  2201 / 3483 (thread    0)
Reallocating..done. *alloclen = 1862
STEP  3401 / 3483 (thread    0) h
done.

Making a distance matrix from msa.. 
 3400 / 3484 (thread    0)
done.

Constructing a UPGMA tree (efffree=1) ... 
 3480 / 3484
done.

Progressive alignment 2/

### Convert .qza files to an R-friendly format for downstream analysis

In [38]:
%cd '/mnt/e/NASA_microbiome/downstream/'

/mnt/e/NASA_microbiome/downstream


In [40]:
%%bash
# Export every artifact in this directory, then convert the BIOM table to TSV.
for i in *.qza; do
    qiime tools export --input-path "$i" --output-path .
done
biom convert -i feature-table.biom -o feature-table.tsv --to-tsv

Exported table-no-ecmu-hits.qza as BIOMV210DirFmt to directory .
Exported taxonomy.qza as TSVTaxonomyDirectoryFormat to directory .
Exported tree.qza as NewickDirectoryFormat to directory .
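
To sanity-check the converted table before moving to R, it can be read back in Python. The sketch below assumes the usual layout written by `biom convert --to-tsv`: a leading comment line followed by a header row that starts with `#OTU ID`. The helper name `read_biom_tsv` and the inline example data are hypothetical; for the real file, pass `open('feature-table.tsv')` instead of the `StringIO` object.

```python
import csv
import io


def read_biom_tsv(handle):
    """Parse a biom-style TSV into (sample_ids, {feature_id: {sample: count}})."""
    samples = None
    table = {}
    for row in csv.reader(handle, delimiter='\t'):
        if not row:
            continue
        if row[0].startswith('#'):
            # The header row names the samples; other '#' lines are comments.
            if row[0] == '#OTU ID':
                samples = row[1:]
            continue
        if samples is None:
            raise ValueError('data row found before the "#OTU ID" header')
        table[row[0]] = dict(zip(samples, (float(x) for x in row[1:])))
    return samples, table


# Invented two-feature, two-sample table in the assumed layout.
example = (
    '# Constructed from biom file\n'
    '#OTU ID\tS1\tS2\n'
    'featA\t3.0\t0.0\n'
    'featB\t1.0\t7.0\n'
)

samples, table = read_biom_tsv(io.StringIO(example))
print(samples)
print(table['featB'])
```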


### Another phylogenetic approach: Fragment Insertion