In [4]:
import qiime2
from qiime2.plugins import (
    cutadapt, demux, dada2, feature_table, metadata,
    greengenes2, taxa, feature_classifier,
    vsearch
)

from qiime2 import Artifact, Metadata
import os

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

path = {
    "art": "../data/artifacts/mock/",
    "vis": "../visualizations/mock/",
    "res": "../data/resources/",
}

for filepath in path.values():
    os.makedirs(filepath, exist_ok=True)

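def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (IUPAC codes supported)."""
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A',
                  'R': 'Y', 'Y': 'R', 'S': 'S', 'W': 'W',
                  'K': 'M', 'M': 'K', 'B': 'V', 'V': 'B',
                  'D': 'H', 'H': 'D', 'N': 'N'}
    return "".join(complement[base] for base in reversed(seq))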

In [5]:
# create mock metadata
metadata_df = pd.DataFrame({
    'sampleid': ['2476', '2477'], 'mock' : [0, 1]}).set_index('sampleid')

metadata_q2 = qiime2.Metadata(metadata_df)

# 1. Quality control 
## 1.1. Adapter trimming

In [None]:
# import raw data with manifest
raw_seqs = qiime2.Artifact.import_data('SampleData[PairedEndSequencesWithQuality]',
                                       '../data/manifest-mock.tsv', view_type='PairedEndFastqManifestPhred33V2')

Unlike the 16S rRNA gene, the ITS region varies in length. The 16S gene is long, so a single MiSeq read does not cover the whole region. An ITS amplicon can be shorter than the read, in which case the full ITS region gets sequenced and the read runs on into the primer and adapter at the far end. These sequences must be removed, because they would otherwise masquerade as real biological DNA. Since the primers and adapters are known, we remove them with `cutadapt`.  
Primers with adapter overhangs used for MiSeq:
- forward: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGTAAAAGTCGTAACAAGGTTTC
- reverse: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGTTCAAAGAYTCGATGATTCAC

- Primers should be passed to `cutadapt` without the overhang, see the [discussion on QIIME2 Forum](https://forum.qiime2.org/t/remove-primer-in-paired-end-demultiplexed-file/17376/12)
- These primers were also used in another [paper](https://www.biorxiv.org/content/10.1101/2021.07.19.452952v1.full)
- Reverse complements should also be passed, as in the [tutorial](https://forum.qiime2.org/t/fungal-its-analysis-tutorial/7351)
```
--p-front-f GTAAAAGTCGTAACAAGGTTTC \
--p-front-r GTTCAAAGAYTCGATGATTCAC \
```
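As a minimal sketch of the two points above, the snippet below strips the overhang from the full oligos and computes the reverse complements that go into the `adapter` options. The helper names `strip_overhang` and `revcomp` are illustrative and are not part of QIIME 2 or cutadapt.

```python
# Full oligos (overhang + locus-specific primer), copied from the list above.
full_forward = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGTAAAAGTCGTAACAAGGTTTC"
full_reverse = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGTTCAAAGAYTCGATGATTCAC"

# Locus-specific primers, i.e. what cutadapt should receive.
forward_primer = "GTAAAAGTCGTAACAAGGTTTC"
reverse_primer = "GTTCAAAGAYTCGATGATTCAC"

# IUPAC complement table (illustrative helper, not a QIIME 2 API).
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def revcomp(seq):
    """Reverse complement of a DNA sequence, including IUPAC ambiguity codes."""
    return seq.translate(_COMPLEMENT)[::-1]


def strip_overhang(oligo, primer):
    """Return the overhang that precedes `primer` at the 3' end of `oligo`."""
    if not oligo.endswith(primer):
        raise ValueError("primer is not the 3' end of the oligo")
    return oligo[: -len(primer)]


forward_overhang = strip_overhang(full_forward, forward_primer)
reverse_overhang = strip_overhang(full_reverse, reverse_primer)

print("forward:", forward_overhang, "|", forward_primer, "| rc:", revcomp(forward_primer))
print("reverse:", reverse_overhang, "|", reverse_primer, "| rc:", revcomp(reverse_primer))
```

Note that the degenerate base `Y` in the reverse primer becomes `R` in its reverse complement, so the complement table has to cover ambiguity codes as well as `A`, `C`, `G` and `T`.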

In [4]:
forward_primer = "GTAAAAGTCGTAACAAGGTTTC"
reverse_primer = "GTTCAAAGAYTCGATGATTCAC"

trimmed = cutadapt.methods.trim_paired(
    demultiplexed_sequences=raw_seqs,
    # reverse complement of reverse primer
    adapter_f = [reverse_complement(reverse_primer)],
    front_f = [forward_primer],
    # reverse complement of forward primer
    adapter_r = [reverse_complement(forward_primer)],
    front_r = [reverse_primer],
    cores = 8
)

Running external command line application. This may print messages to stdout and/or stderr.
The commands to be run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: cutadapt --cores 8 --error-rate 0.1 --times 1 --overlap 3 --minimum-length 1 -q 0,0 --quality-base 33 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-fq52rinq/2476_0_L001_R1_001.fastq.gz -p /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-fq52rinq/2476_2_L001_R2_001.fastq.gz --adapter GTGAATCATCGARTCTTTGAAC --front GTAAAAGTCGTAACAAGGTTTC -A GAAACCTTGTTACGACTTTTAC -G GTTCAAAGAYTCGATGATTCAC /tmp/qiime2/vbezshapkin/data/5ba0a8e6-26dd-4ea9-8fba-9b6073d1068e/data/2476_0_L001_R1_001.fastq.gz /tmp/qiime2/vbezshapkin/data/5ba0a8e6-26dd-4ea9-8fba-9b6073d1068e/data/2476_2_L001_R2_001.fastq.gz

This is cutadapt 4.6 with Python 3.8.15
Command line parameters: --cores 8 --error-rate 0.1 --times 1 --overlap 3 --minimum-length 1 -q 0,0 --quality-base 33 -o /tmp/q2-
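
To see why each read gets both a `front` and an `adapter` sequence, the toy example below trims a synthetic forward read whose insert is shorter than the read length. It is only a sketch: it uses exact regular-expression matching, whereas cutadapt itself tolerates mismatches (`--error-rate 0.1` in the command above). The insert and the trailing bases are made up for illustration, and `iupac_to_regex` and `toy_trim` are hypothetical helpers, not cutadapt functions.

```python
import re

# IUPAC codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(primer):
    """Translate a primer with ambiguity codes into a regex pattern."""
    return "".join(IUPAC[base] for base in primer)


def toy_trim(read, front, adapter):
    """Exact-match imitation of 5' primer removal plus 3' adapter removal."""
    # 5' end: drop the primer if the read starts with it.
    match = re.match(iupac_to_regex(front), read)
    if match:
        read = read[match.end():]
    # 3' end: cut at the first adapter occurrence and drop everything after it.
    match = re.search(iupac_to_regex(adapter), read)
    if match:
        read = read[:match.start()]
    return read


forward_primer = "GTAAAAGTCGTAACAAGGTTTC"
rc_reverse_primer = "GTGAATCATCGARTCTTTGAAC"  # value of --adapter in the log above

insert = "ACGTTGCAACGTTAGC"              # made-up biological insert
read_through = "GTGAATCATCGAGTCTTTGAAC"  # one concrete instance of rc_reverse_primer (R = G)
tail = "CTGTCTC"                         # made-up bases past the primer

forward_read = forward_primer + insert + read_through + tail
print(toy_trim(forward_read, forward_primer, rc_reverse_primer))
```

Only the insert survives: the forward primer is removed from the 5' end, and the reverse-complemented reverse primer together with everything downstream of it is removed from the 3' end. The reverse read is handled the same way with the roles of the two primers swapped.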

In [5]:
quality_vis = demux.visualizers.summarize(trimmed.trimmed_sequences, n = 100_000)
quality_vis.visualization.save(path["vis"] + "quality-plot-fungi.qzv")

'../visualizations/mock/quality-plot-fungi.qzv'

## 1.2. DADA2

In [10]:
qc_reads = dada2.methods.denoise_paired(
    trimmed.trimmed_sequences,
    trunc_len_f=240, trunc_len_r=169, n_threads=32,
    min_fold_parent_over_abundance=4
)

qc_reads.denoising_stats.save(path["art"] + "denoise-stats-fungi.qza")
qc_reads.table.save(path["art"] + "feature-table-fungi.qza")
qc_reads.representative_sequences.save(path["art"] + "rep-seqs-fungi.qza")


metadata.visualizers.tabulate(
    input=qc_reads.denoising_stats.view(Metadata)
).visualization.save(path["vis"] + "denoise-stats-fungi.qzv")
feature_table.visualizers.summarize(
    qc_reads.table
).visualization.save(path["vis"] + "feature-table-fungi.qzv")

Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada.R --input_directory /tmp/tmp0xn0virg/forward --input_directory_reverse /tmp/tmp0xn0virg/reverse --output_path /tmp/tmp0xn0virg/output.tsv.biom --output_track /tmp/tmp0xn0virg/track.tsv --filtered_directory /tmp/tmp0xn0virg/filt_f --filtered_directory_reverse /tmp/tmp0xn0virg/filt_r --truncation_length 240 --truncation_length_reverse 169 --trim_left 0 --trim_left_reverse 0 --max_expected_errors 2.0 --max_expected_errors_reverse 2.0 --truncation_quality_score 2 --min_overlap 12 --pooling_method independent --chimera_method consensus --min_parental_fold 4 --allow_one_off False --num_threads 32 --learn_min_reads 1000000



package ‘optparse’ was built under R version 4.2.3 
Loading required package: Rcpp


R version 4.2.2 (2022-10-31) 
DADA2: 1.26.0 / Rcpp: 1.0.12 / RcppParallel: 5.1.6 
2) Filtering ..
3) Learning Error Rates
19731600 total bases in 82215 reads from 2 samples will be used for learning the error rates.
13894335 total bases in 82215 reads from 2 samples will be used for learning the error rates.
3) Denoise samples ..
..
5) Remove chimeras (method = consensus)
6) Report read numbers through the pipeline
7) Write output


'../visualizations/mock/feature-table-fungi.qzv'
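
Because ITS length varies, it can help to check what the chosen truncation lengths imply. The sketch below takes the values from the `run_dada.R` command above (240, 169 and `--min_overlap 12`), derives the longest primer-trimmed amplicon whose truncated reads can still overlap enough to merge, and confirms that the base counts in the log correspond to reads truncated to exactly those lengths. The function name is illustrative, and the calculation ignores mismatches and indels in the overlap.

```python
def max_mergeable_length(trunc_len_f, trunc_len_r, min_overlap):
    """Longest amplicon whose truncated reads still overlap by `min_overlap` bases."""
    return trunc_len_f + trunc_len_r - min_overlap


trunc_len_f, trunc_len_r, min_overlap = 240, 169, 12
limit = max_mergeable_length(trunc_len_f, trunc_len_r, min_overlap)
print(f"Amplicons longer than {limit} nt cannot be merged")

# Cross-check against the log: total bases divided by reads used for error learning.
reads = 82215
forward_bases, reverse_bases = 19731600, 13894335
print(forward_bases / reads, reverse_bases / reads)
```

DADA2 also discards reads that are shorter than the truncation length, so short ITS amplicons that lose bases during primer and read-through trimming can drop out at the filtering step. The denoising statistics saved above are the place to check how many reads that affects.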

# 2. Taxonomic classification - ITS1

In [9]:
# # Ran out of memory, so TMPDIR is reset below. QIIME 2 runs bash scripts internally and does not respect the fish shell :/
# # classification command
# export TMPDIR=./tmp/
# qiime feature-classifier classify-sklearn \
#     --i-reads data/artifacts/mock/rep-seqs-fungi.qza \
#     --i-classifier data/resources/unite_ver9_99_25.07.2023-Q2-2023.9.qza \
#     --p-n-jobs 8 \
#     --o-classification data/artifacts/mock/rep-seqs-unite-its.qza
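
One way that may sidestep the shell issue is to launch the same command from Python with `TMPDIR` set only for the child process, so no shell-specific `export` syntax is involved. This is a sketch under assumptions: the paths are the ones from the commented command above (relative to the project root), `build_classify_command` and `run_with_tmpdir` are hypothetical helpers, and nothing is executed until `run_with_tmpdir` is called.

```python
import os
import subprocess


def build_classify_command(reads, classifier, output, n_jobs=8):
    """Assemble the classify-sklearn call as an argument list (no shell needed)."""
    return [
        "qiime", "feature-classifier", "classify-sklearn",
        "--i-reads", reads,
        "--i-classifier", classifier,
        "--p-n-jobs", str(n_jobs),
        "--o-classification", output,
    ]


def run_with_tmpdir(command, tmpdir):
    """Run `command` with TMPDIR pointing at `tmpdir` for the child process only."""
    os.makedirs(tmpdir, exist_ok=True)
    env = dict(os.environ, TMPDIR=os.path.abspath(tmpdir))
    return subprocess.run(command, env=env, check=True)


command = build_classify_command(
    reads="data/artifacts/mock/rep-seqs-fungi.qza",
    classifier="data/resources/unite_ver9_99_25.07.2023-Q2-2023.9.qza",
    output="data/artifacts/mock/rep-seqs-unite-its.qza",
)
print(" ".join(command))
# run_with_tmpdir(command, "./tmp/")  # uncomment to actually run the classifier
```

Passing the arguments as a list avoids quoting problems, and `check=True` raises an exception if the classifier exits with an error instead of failing silently.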

In [6]:
# load classification data
tax = Artifact.load(path["art"] + "rep-seqs-unite-its.qza")
table = Artifact.load(path["art"] + "feature-table-fungi.qza")

In [7]:
vis = taxa.visualizers.barplot(
    table=table,
    taxonomy=tax,
    metadata=metadata_q2
)

vis.visualization.save(path["vis"] + "taxonomy-barplot-fungi.qzv")

'../visualizations/mock/taxonomy-barplot-fungi.qzv'