# Processing 16S IBD datasets
Last updated: 2020-04-08  
Quang Nguyen

Here we use the `dada2` plugin within `qiime2` to process all our sequences. The data are from the Gevers et al. study and are publicly available from ENA (project: PRJEB13679). All metadata were downloaded from the [associated `Qiita` study](https://qiita.ucsd.edu/study/description/1939).

We use the metadata as a guide for which samples to download by creating a data processing manifest. We keep all samples from the same body site (terminal ileum) and use `dada2` to denoise them and obtain ASV sequences.

This notebook was run using the qiime2-2022.2 conda environment. In-line visualization of artifacts was enabled using the command `jupyter serverextension enable --py qiime2 --sys-prefix`. See [this thread](https://forum.qiime2.org/t/update-on-embedding-visualizations-in-jupyter/10092/2).

In [54]:
from q2_types.per_sample_sequences import SingleEndFastqManifestPhred33V2
from qiime2 import Artifact
from qiime2.plugins.dada2.methods import denoise_single
import qiime2.plugins.demux.actions as demux_actions
from qiime2 import Visualization
import pandas as pd
import os
import biom
import skbio.io
dpaths = "/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/ResultsFiles/data/gevers_data/fastq/"

In [6]:
metadata = pd.read_csv("../metadata/gevers_metadata.tsv", sep = "\t")
metadata.columns

Index(['sample_name', 'age', 'age_unit', 'altitude', 'anonymized_name',
       'antibiotics', 'b_cat', 'biologics', 'biopsy_location', 'birthdate',
       'body_habitat', 'body_product', 'body_site', 'collection',
       'collection_timestamp', 'country', 'depth', 'description', 'diagnosis',
       'disease_duration', 'disease_extent', 'disease_stat', 'diseasesubtype',
       'dna_extracted', 'elevation', 'env_biome', 'env_feature',
       'env_material', 'env_package', 'gastric_involvement',
       'gastrointest_disord', 'geo_loc_name', 'host_common_name',
       'host_scientific_name', 'host_subject_id', 'host_taxid',
       'ileal_invovlement', 'immunosup', 'inflammationstatus', 'latitude',
       'longitude', 'mesalamine', 'orig_name', 'perianal',
       'physical_specimen_location', 'physical_specimen_remaining', 'public',
       'qiita_empo_1', 'qiita_empo_2', 'qiita_empo_3', 'qiita_study_id',
       'race', 'sample_type', 'scientific_name', 'sex', 'smoking', 'steroids',
       '

In [7]:
keep_cols = ["sample_name", "age", "biopsy_location", "body_site",
             "diseasesubtype", "host_subject_id"]
# keep terminal ileum biopsies only, then renumber rows from 0
metadata = metadata.loc[metadata.biopsy_location == "Terminal ileum", keep_cols].reset_index(drop=True)
metadata = metadata.rename(columns = {"sample_name" : "sample-id"})
metadata.head()

Unnamed: 0,sample-id,age,biopsy_location,body_site,diseasesubtype,host_subject_id
0,1939.MGH100079,53.0,Terminal ileum,UBERON:ileum,iCD,7161
1,1939.MGH100698,75.0,Terminal ileum,UBERON:ileum,iCD,7225
2,1939.MGH100896.a,34.0,Terminal ileum,UBERON:ileum,iCD,7094
3,1939.MGH100896.b,34.0,Terminal ileum,UBERON:ileum,iCD,7094
4,1939.MGH101010,30.0,Terminal ileum,UBERON:ileum,iCD,7141


In [8]:
metadata.diseasesubtype.value_counts()

iCD    251
no     194
UC      72
cCD     70
IC      34
CD      18
Name: diseasesubtype, dtype: int64

In [9]:
metadata["absolute-filepath"] = dpaths + metadata["sample-id"] + ".fastq.gz"
metadata.head()

Unnamed: 0,sample-id,age,biopsy_location,body_site,diseasesubtype,host_subject_id,absolute-filepath
0,1939.MGH100079,53.0,Terminal ileum,UBERON:ileum,iCD,7161,/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/Results...
1,1939.MGH100698,75.0,Terminal ileum,UBERON:ileum,iCD,7225,/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/Results...
2,1939.MGH100896.a,34.0,Terminal ileum,UBERON:ileum,iCD,7094,/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/Results...
3,1939.MGH100896.b,34.0,Terminal ileum,UBERON:ileum,iCD,7094,/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/Results...
4,1939.MGH101010,30.0,Terminal ileum,UBERON:ileum,iCD,7141,/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/Results...


In [10]:
metadata.to_csv("../metadata/ibd_qiime2_metadata.tsv", sep="\t", index=False)
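
Before importing, it can help to confirm that every manifest row points at a file that exists on disk and that no sample ID is repeated, since either problem is likely to surface later as an import error. The sketch below is one way to do that. `check_manifest` is a hypothetical helper (not part of `qiime2`), and it only assumes rows of `(sample-id, absolute-filepath)` pairs.

```python
import os
from collections import Counter


def check_manifest(rows):
    """Return (missing_files, duplicated_ids) for (sample_id, filepath) pairs.

    Hypothetical helper for illustration; it is not part of qiime2.
    """
    rows = list(rows)
    # sample IDs whose fastq file is not present on disk
    missing = [sid for sid, path in rows if not os.path.isfile(path)]
    # sample IDs that appear on more than one manifest row
    counts = Counter(sid for sid, _ in rows)
    duplicated = sorted(sid for sid, n in counts.items() if n > 1)
    return missing, duplicated
```

With the data frame above, this could be called as `check_manifest(zip(metadata["sample-id"], metadata["absolute-filepath"]))`. Two empty lists suggest the manifest is ready to import.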

We load the sequences through the `qiime2` Artifact API, using the metadata file written above as a single-end manifest.

In [11]:
sequences = Artifact.import_data('SampleData[SequencesWithQuality]', 
                                 "../metadata/ibd_qiime2_metadata.tsv", 
                                 SingleEndFastqManifestPhred33V2)

In [8]:
# usually run once to obtain visualizations
if not os.path.exists("../output/sequence_process_16s/ibd_16s/demux_viz.qzv"):
    seq_viz = demux_actions.summarize(sequences)
    seq_viz.visualization.save("../output/sequence_process_16s/ibd_16s/demux_viz.qzv")


`QIIME2 View` [link](https://view.qiime2.org/visualization/?src=b3e90925-0235-402b-94e4-4d32eb1cc950&type=html) 

We can see that read quality is good on the left, but quality drops over the last nucleotides on the right-hand side. We trim 15 nucleotides from the right end by truncating reads at position 160 (`trunc_len=160`) and otherwise use mostly default options from the `dada2` R tutorials.
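
To make the filtering parameters concrete, here is a simplified sketch of what truncation and the expected-error filter do to a single read. It illustrates the idea and is not DADA2's implementation: `filter_read` and `expected_errors` are hypothetical functions, quality strings are assumed to be Phred+33 encoded, and the order of steps (quality truncation, length truncation, then expected errors) follows the DADA2 documentation as we understand it. Expected errors are computed as the sum of per-base error probabilities, `EE = sum(10 ** (-Q / 10))`.

```python
def expected_errors(quals):
    """Sum of per-base error probabilities, EE = sum(10 ** (-Q / 10))."""
    return sum(10 ** (-q / 10) for q in quals)


def filter_read(seq, qual_string, trunc_len=160, trunc_q=2, max_ee=2.0):
    """Return the truncated (seq, quals) pair, or None if the read is dropped."""
    quals = [ord(c) - 33 for c in qual_string]  # decode Phred+33
    # truncate at the first base whose quality is <= trunc_q
    for i, q in enumerate(quals):
        if q <= trunc_q:
            seq, quals = seq[:i], quals[:i]
            break
    # reads shorter than trunc_len are discarded; longer ones are cut
    if len(seq) < trunc_len:
        return None
    seq, quals = seq[:trunc_len], quals[:trunc_len]
    # discard reads with too many expected errors after truncation
    if expected_errors(quals) > max_ee:
        return None
    return seq, quals
```

Under these assumptions, a 175-nucleotide read of uniformly high quality is kept and cut to 160 nucleotides, while a 150-nucleotide read is dropped no matter how good its quality is.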

In [13]:
if not os.path.exists("../output/sequence_process_16s/ibd_16s/feature_table/feature-table.biom"):
    feat_table, asv_sequences, dada_stats = denoise_single(demultiplexed_seqs = sequences, trunc_len=160, max_ee=2, 
                                                           trunc_q=2, pooling_method="pseudo", chimera_method="pooled", 
                                                           n_threads=10)
    feat_table.export_data("../output/sequence_process_16s/ibd_16s/feature_table")
    asv_sequences.export_data("../output/sequence_process_16s/ibd_16s/seqs")
    dada_stats.export_data("../output/sequence_process_16s/ibd_16s/dada2_stats")

Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_single.R /scratch/qiime2-archive-5t18wbxt/1cdabe70-41f4-4988-a43f-7242dd6245f7/data /scratch/tmphpkbv_tb/output.tsv.biom /scratch/tmphpkbv_tb/track.tsv /scratch/tmphpkbv_tb 160 0 2 2 Inf pseudo pooled 1.0 10 1000000 NULL 16



Project requested Python version '3.8.9' but '3.6.8' is currently being used 
Loading required package: Rcpp


R version 4.1.2 (2021-11-01) 
DADA2: 1.22.0 / Rcpp: 1.0.8 / RcppParallel: 5.1.5 
1) Filtering ...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
2) Learning Error Rates
164399840 total bases in 1027499 reads from 44 samples will be used for learning the error rates.
3) Denoise samples .............................................................................................................................
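
The error-learning line in the log doubles as a quick sanity check on truncation: if every read that passed filtering was cut to exactly `trunc_len` bases, total bases divided by reads should come out to 160 with no remainder. The numbers below are copied from the log output above; the variable names are ours.

```python
total_bases = 164_399_840  # "total bases" reported by DADA2
n_reads = 1_027_499        # reads used for learning the error rates
trunc_len = 160            # value passed to denoise_single

mean_len, remainder = divmod(total_bases, n_reads)
print(mean_len, remainder)  # prints: 160 0
```

A remainder of zero is consistent with all retained reads having the same length after truncation.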

Export the primary tables for downstream analysis and PICRUSt2, but keep `.qza` back-ups.

Let's classify sequences for taxonomic reference