# Processing CRC 16S Data

Last updated 2022-04-12.   
Quang Nguyen    

This script processes 16S rRNA gene sequencing data from the Zeller et al. (2014) paper, available under [ENA Project ID: PRJEB13679](https://www.ebi.ac.uk/ena/browser/view/PRJEB13679?show=reads). The script used to download the raw data can be found in the `python` folder (file `download_crc.py`). The manifest file (`python/crc_16s.tsv`) can be downloaded directly from the ENA website. The paper can be found [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4299606/).
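
The download script itself is not reproduced in this notebook. As a rough sketch of what the manifest-to-URL step might look like, the snippet below reads an ENA-style TSV report and lists the FASTQ URLs for each run. It assumes the manifest has `run_accession` and `fastq_ftp` columns, with `fastq_ftp` holding semicolon-separated paths without a scheme. The function name `fastq_urls` and the `ftp.example.org` paths are hypothetical, and this is not necessarily how `download_crc.py` works.

```python
import io

import pandas as pd


def fastq_urls(manifest, scheme="ftp://"):
    """Map each run accession to the list of FASTQ URLs in an ENA-style report."""
    table = pd.read_csv(manifest, sep="\t", dtype=str)
    urls = {}
    for run, paths in zip(table["run_accession"], table["fastq_ftp"]):
        if pd.isna(paths):
            # Runs without submitted FASTQ files get an empty list.
            urls[run] = []
            continue
        urls[run] = [scheme + p for p in paths.split(";") if p]
    return urls


# Illustrative manifest: the host and file paths are placeholders, not real ENA paths.
example_manifest = io.StringIO(
    "run_accession\tfastq_ftp\n"
    "ERR674170\tftp.example.org/ERR674170_1.fastq.gz;ftp.example.org/ERR674170_2.fastq.gz\n"
)
example_urls = fastq_urls(example_manifest)
```

Each run maps to one URL per read direction, which is the shape a paired-end download loop would need.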

In [180]:
from q2_types.per_sample_sequences import PairedEndFastqManifestPhred33V2
from qiime2 import Artifact
from qiime2.plugins.dada2.methods import denoise_single
import qiime2.plugins.demux.actions as demux_actions
import pandas as pd
import os
import biom
dpaths = "/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/ResultsFiles/data/crc_16s/"

In [181]:
metadata = pd.read_csv("../metadata/crc_16s_metadata.csv", index_col=0)
metadata.head()

Unnamed: 0,host subject id,run_accession,sample_accession,diagnosis,sample name,sex,age
0,DE-069,ERR674170,ERS581126,,,,71
1,DE-068,ERR674169,ERS581125,,,,66
2,DE-066,ERR674168,ERS581124,,,,36
5,DE-001,ERR674075,ERS581031,,,,82
6,DE-053,ERR674158,ERS581114,,,,77


In [182]:
metadata = metadata[["host subject id", "diagnosis", "sample name", "age"]]
metadata = metadata.rename(columns = {"host subject id" : "sample-id", "sample name" : "seq_sample_id"})
metadata.head()

Unnamed: 0,sample-id,diagnosis,seq_sample_id,age
0,DE-069,,,71
1,DE-068,,,66
2,DE-066,,,36
5,DE-001,,,82
6,DE-053,,,77


Patients with no recorded diagnosis appear as `NA` values. According to the supplementary materials of the manuscript, all patients with a DE designation are cancer patients. We therefore download the data directly from the supplementary materials and extract the relevant tabs.

In [183]:
de_patients = pd.read_csv("../metadata/crc_16s_DE_subjects_metadata.csv")
de_patients.head()

Unnamed: 0,Subject ID,Sample ID,Age (years),Gender,BMI (kg/m²),Country of Residence,Diagnosis,AJCC Stage,TNM Stage,Localization,Unnamed: 10,Unnamed: 11
0,DE-079,CCMD88272491ST-21-0,72.0,M,28.0,Germany,Cancer,0,TisN0M0,Rectum,,
1,DE-080,CCMD87156761ST-21-0,55.0,M,28.0,Germany,Cancer,I,T2N0M0,LC,,
2,DE-081,CCMD86707194ST-21-0,53.0,F,31.0,Germany,Cancer,I,T1N0M0,RC,,
3,DE-082,CCMD82866709ST-21-0,77.0,F,26.0,Germany,Cancer,II,T3N0M0,LC,,
4,DE-083,CCMD79987997ST-21-0,70.0,M,25.0,Germany,Cancer,I,T2N0M0,Sigma,,


In [184]:
de_patients = de_patients[["Subject ID", "Age (years)", "Diagnosis", "Sample ID"]]
de_patients = de_patients.rename(columns = {"Subject ID" : "sample-id", 
                                            "Age (years)" : "age", 
                                            "Diagnosis" : "diagnosis", 
                                            "Sample ID" : "seq_sample_id"})
de_patients.head()

Unnamed: 0,sample-id,age,diagnosis,seq_sample_id
0,DE-079,72.0,Cancer,CCMD88272491ST-21-0
1,DE-080,55.0,Cancer,CCMD87156761ST-21-0
2,DE-081,53.0,Cancer,CCMD86707194ST-21-0
3,DE-082,77.0,Cancer,CCMD82866709ST-21-0
4,DE-083,70.0,Cancer,CCMD79987997ST-21-0


In [185]:
# Build the lookup set once instead of rebuilding a list for every row
de_ids = set(de_patients["sample-id"].dropna())
de_in_metadata = [f for f in metadata["sample-id"].tolist() if f in de_ids]
de_in_metadata

['DE-049',
 'DE-045',
 'DE-034',
 'DE-013',
 'DE-046',
 'DE-044',
 'DE-039',
 'DE-038',
 'DE-037',
 'DE-031',
 'DE-029',
 'DE-062']

In [186]:
metadata = metadata[pd.notna(metadata.diagnosis)].reset_index(drop=True)
de_patients = de_patients[de_patients["sample-id"].isin(de_in_metadata)]
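
The matching and filtering steps above can also be written with boolean masks only. The sketch below uses two small made-up tables (the `S-001` subject and its `Control` label are toy values, not study data) to show the same logic: keep sequenced rows that already have a diagnosis, keep supplementary rows whose subject was sequenced, then stack the two.

```python
import pandas as pd

# Toy stand-in for the sequencing metadata (diagnosis missing for DE subjects).
sequenced = pd.DataFrame(
    {"sample-id": ["DE-049", "DE-045", "S-001"],
     "diagnosis": [None, None, "Control"]}
)
# Toy stand-in for the supplementary table, including an empty trailing row.
supplement = pd.DataFrame(
    {"sample-id": ["DE-049", "DE-079", None],
     "diagnosis": ["Cancer", "Cancer", None]}
)

# Supplementary subjects that were actually sequenced.
in_both = supplement["sample-id"].isin(sequenced["sample-id"])
matched = supplement[in_both]

# Sequenced subjects whose diagnosis is already known.
with_diagnosis = sequenced[sequenced["diagnosis"].notna()]

combined = pd.concat([with_diagnosis, matched], ignore_index=True)
```

Unlike the list comprehension, `isin` keeps the row order of the supplementary table rather than that of the sequencing metadata, which does not matter for the concatenation that follows.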


In [187]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
metadata = pd.concat([metadata, de_patients])

In [188]:
metadata.head()
metadata.shape

(141, 4)
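
After combining the two tables, it may be worth confirming that each subject appears only once and that no diagnosis is still missing. The helper below is a minimal sketch of such a check; `metadata_problems` is a hypothetical name, and the toy tables are made up for illustration rather than taken from the study.

```python
import pandas as pd


def metadata_problems(table):
    """Return human-readable descriptions of issues found in a metadata table."""
    problems = []
    duplicated = table.loc[table["sample-id"].duplicated(), "sample-id"]
    if not duplicated.empty:
        problems.append(f"duplicated sample-id: {sorted(set(duplicated))}")
    n_missing = int(table["diagnosis"].isna().sum())
    if n_missing:
        problems.append(f"{n_missing} row(s) without a diagnosis")
    return problems


# Toy table with one repeated subject and one missing diagnosis.
toy_bad = pd.DataFrame(
    {"sample-id": ["DE-049", "DE-049", "DE-045"],
     "diagnosis": ["Cancer", "Cancer", None]}
)
# Toy table with no issues.
toy_ok = pd.DataFrame(
    {"sample-id": ["DE-049", "DE-045"],
     "diagnosis": ["Cancer", "Cancer"]}
)

bad_report = metadata_problems(toy_bad)
ok_report = metadata_problems(toy_ok)
```

An empty list means the table passed both checks; calling `metadata_problems(metadata)` at this point would apply the same checks to the combined table.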

In [203]:
metadata["forward-absolute-filepath"] = dpaths + metadata["sample-id"] + "_R1_001.fastq.gz"
metadata["reverse-absolute-filepath"] = dpaths + metadata["sample-id"] + "_R2_001.fastq.gz"

In [207]:
metadata.to_csv("../metadata/crc_qiime2_metadata.tsv", sep = "\t", index=False)
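
Because the file paths are assembled from `sample-id` by string concatenation, a missing or misnamed FASTQ file would only surface at import time. The sketch below checks a written manifest for the three columns used above and lists any paths that do not exist on disk. The function name `missing_files` is hypothetical, and the demonstration runs against a temporary directory rather than the real data folder.

```python
import os
import tempfile

import pandas as pd

REQUIRED = ["sample-id", "forward-absolute-filepath", "reverse-absolute-filepath"]


def missing_files(manifest_path):
    """Return the manifest file paths that do not exist on disk."""
    table = pd.read_csv(manifest_path, sep="\t", dtype=str)
    absent = [col for col in REQUIRED if col not in table.columns]
    if absent:
        raise ValueError(f"manifest lacks columns: {absent}")
    missing = []
    for col in REQUIRED[1:]:
        missing.extend(p for p in table[col] if not os.path.isfile(p))
    return missing


# Demonstration: only the forward read file is created, so the reverse is reported.
with tempfile.TemporaryDirectory() as tmp:
    fwd = os.path.join(tmp, "DE-049_R1_001.fastq.gz")
    rev = os.path.join(tmp, "DE-049_R2_001.fastq.gz")
    open(fwd, "wb").close()
    manifest = os.path.join(tmp, "manifest.tsv")
    pd.DataFrame(
        {"sample-id": ["DE-049"],
         "forward-absolute-filepath": [fwd],
         "reverse-absolute-filepath": [rev]}
    ).to_csv(manifest, sep="\t", index=False)
    demo_missing = missing_files(manifest)
    demo_expected = [rev]
```

Running `missing_files("../metadata/crc_qiime2_metadata.tsv")` before the import step would flag any subject whose reads were not downloaded.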

Next, we import the reads listed in this manifest into QIIME 2 and summarize their quality profiles as a visualization for QIIME 2 View.

In [208]:
sequences = Artifact.import_data('SampleData[PairedEndSequencesWithQuality]', 
                                 "../metadata/crc_qiime2_metadata.tsv",
                                PairedEndFastqManifestPhred33V2)

In [210]:
if not os.path.exists("../output/sequence_process_16s/crc_16s/demux_viz.qzv"):
    seq_viz = demux_actions.summarize(sequences)
    seq_viz.visualization.save("../output/sequence_process_16s/crc_16s/demux_viz.qzv")

The saved visualization can be inspected via QIIME 2 View: [URL]()