# Mock community dataset generation
Modified from tax-credit by the [Caporaso Lab](https://github.com/caporaso-lab/tax-credit) and described [here](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0470-z).

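To get the mockrobiota mock communities, we first need to download mockrobiota and tax-credit, and then patch tax-credit so that it runs with this project's version of QIIME 2.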

In [1]:
import os
from os.path import expandvars, join

In [2]:
#first, make sure you're in the project directory
project_dir = '/mnt/c/Users/Dylan/Documents/zaneveld/GCMP_Global_Disease-master/analysis/organelle_removal/'
os.chdir(project_dir)

In [3]:
#download the mockrobiota info
!git clone https://github.com/caporaso-lab/mockrobiota.git  

fatal: destination path 'mockrobiota' already exists and is not an empty directory.


In [4]:
#download and install tax-credit
!git clone https://github.com/caporaso-lab/tax-credit.git
%cd tax-credit/
!pip install .

fatal: destination path 'tax-credit' already exists and is not an empty directory.
/mnt/c/Users/Dylan/Documents/zaneveld/GCMP_Global_Disease-master/analysis/organelle_removal/tax-credit
Processing /mnt/c/Users/Dylan/Documents/zaneveld/GCMP_Global_Disease-master/analysis/organelle_removal/tax-credit
^C
ERROR: Operation cancelled by user


In [5]:
# The organelle removal project runs on a different version of QIIME 2 than the one tax-credit was written for.
# One line of process_mocks.py must be modified to prevent errors in subsequent cells.
with open('tax_credit/process_mocks.py') as file:
    lines = file.readlines()
# Index 373 is zero-based, so this overwrites line 374 of the file.
# The replacement unpacks three return values (table, sequences, stats) from denoise_single.
lines[373] = '    biom_table, rep_seqs, stats = dada2.methods.denoise_single(\n'
with open('tax_credit/process_mocks.py', 'w') as file:
    file.writelines(lines)
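
Overwriting a line by index is brittle: if the upstream file changes, index 373 may point at a different statement. The following is a minimal sketch of a more defensive variant that replaces the line only when it contains an expected fragment. The helper `patch_line`, its `expected_fragment` argument, and the toy file contents are all hypothetical; they are not part of tax-credit, and the actual text of the original line is an assumption.

```python
def patch_line(lines, index, expected_fragment, replacement):
    """Return a copy of `lines` with lines[index] replaced.

    The replacement happens only if lines[index] contains
    `expected_fragment`; otherwise an error is raised instead of
    silently overwriting an unrelated line.
    """
    if not 0 <= index < len(lines):
        raise IndexError(f"line index {index} is outside the file")
    if expected_fragment not in lines[index]:
        raise ValueError(
            f"line {index} does not contain {expected_fragment!r}: "
            f"{lines[index]!r}")
    patched = list(lines)  # leave the caller's list untouched
    patched[index] = replacement
    return patched


# Toy stand-in for the contents of process_mocks.py (assumed, not the real file)
example = ["import dada2\n",
           "    biom_table, rep_seqs = dada2.methods.denoise_single(\n",
           "        demux)\n"]
patched = patch_line(
    example, 1, "denoise_single(",
    "    biom_table, rep_seqs, stats = dada2.methods.denoise_single(\n")
```

With the real file, the same call would use index 373 and the lines read from `tax_credit/process_mocks.py`, and the result would be written back with `writelines` as in the cell above.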

Now we're ready to create the mockrobiota datasets. The remaining code is taken from tax-credit by the [Caporaso Lab](https://github.com/caporaso-lab/tax-credit) and described [here](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0470-z), with minor modifications to fit this project.

In [6]:
# tax-credit directory
repo_dir = join(project_dir, 'tax-credit')
#cd to main directory of repository
os.chdir(repo_dir)

In [7]:
from tax_credit.process_mocks import (extract_mockrobiota_dataset_metadata,
                                      extract_mockrobiota_data,
                                      batch_demux,
                                      denoise_to_phylogeny,
                                      transport_to_repo
                                     )

Set the source and destination filepaths.

In [8]:
# mockrobiota directory
mockrobiota_dir = join(project_dir, "mockrobiota")
# temp destination for mock community files
mock_data_dir = join(project_dir, "mock-community")
# destination for expected taxonomy assignments
expected_data_dir = join(repo_dir, "data", "precomputed-results", "mock-community")


First, we define which mock communities we plan to use, along with the necessary parameters.

In [9]:
# Use a sequential set of mockrobiota datasets (mock-12 to mock-22, skipping mock-17).
# Alternatively, list the community names manually.
communities = ['mock-{0}'.format(n) for n in range(12, 23) if n != 17]
#communities = ['mock-{0}'.format(n) for n in range(16, 27) if n != 17]

# Create dictionary of mock community dataset metadata
community_metadata = extract_mockrobiota_dataset_metadata(mockrobiota_dir, communities)

# Map marker-gene to reference database names in tax-credit and in mockrobiota
# marker-gene : (tax-credit-dir, mockrobiota-dir, version, clustering level)
reference_dbs = {'16S' : ('gg_13_8_otus', 'greengenes', '13-8', '99-otus'),
                 'ITS' : ('unite_20.11.2016', 'unite', '7-1', '99-otus')
                }
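
As a quick check of what the list comprehension above selects, the sketch below rebuilds the list on its own and prints it. It uses only the expression from the cell above; the variable name `selected` is introduced here for illustration.

```python
# Same expression as in the cell above: mock-12 through mock-22, without mock-17
selected = ['mock-{0}'.format(n) for n in range(12, 23) if n != 17]

print(len(selected))  # 10
print(selected[0], selected[-1])  # mock-12 mock-22
```

Because `range(12, 23)` excludes its upper bound, mock-23 is not part of the selection.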

Now we will generate data directories in ``tax-credit`` for each community and begin populating these with files from ``mockrobiota``. This may take some time, because it involves downloading the raw fastq files.

In [10]:
extract_mockrobiota_data(communities, community_metadata, reference_dbs, 
                         mockrobiota_dir, mock_data_dir, 
                         expected_data_dir)

## Process data in QIIME2
Finally, we can get to processing our data. We begin by importing our data, demultiplexing, and viewing a few fastq quality summaries to decide how to trim our raw reads prior to processing.

Each dataset may require different parameters. For example, some mock communities used here require different barcode orientations, while others may already be demultiplexed. These parameters may be read in as a dictionary of tuples.

In [11]:
# {community : (demultiplex, rev_comp_barcodes, rev_comp_mapping_barcodes)}
demux_params = {'mock-1' : (True, False, True),
               'mock-2' : (True, False, True),
               'mock-3' : (True, False, False),
               'mock-4' : (True, False, True),
               'mock-5' : (True, False, True),
               'mock-6' : (True, False, True),
               'mock-7' : (True, False, True),
               'mock-8' : (True, False, True),
               'mock-9' : (True, False, True),
               'mock-10' : (True, False, True),
               'mock-12' : (False, False, False),
               'mock-13' : (False, False, False),
               'mock-14' : (False, False, False),
               'mock-15' : (False, False, False),
               'mock-16' : (False, False, False),
               'mock-18' : (False, False, False),
               'mock-19' : (False, False, False),
               'mock-20' : (False, False, False),
               'mock-21' : (False, False, False),
               'mock-22' : (False, False, False),
               'mock-23' : (False, False, False),
               'mock-24' : (False, False, False),
               'mock-25' : (False, False, False),
               'mock-26' : (True, False, False), # Note we only use samples 1-40 in mock-26
              }
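
Positional tuples of booleans are easy to misread. The sketch below wraps two toy entries in a `namedtuple` whose field names are taken from the comment at the top of the cell above, so that each flag can be read by name. `DemuxParams` and the `toy_` variables are hypothetical helpers for inspection only; whether `batch_demux` accepts anything other than plain tuples has not been checked here.

```python
from collections import namedtuple

# Field names follow the comment in the cell above
DemuxParams = namedtuple(
    'DemuxParams',
    ['demultiplex', 'rev_comp_barcodes', 'rev_comp_mapping_barcodes'])

# Two toy entries copied from the dictionary above
toy_demux_params = {'mock-3': (True, False, False),
                    'mock-12': (False, False, False)}

named = {name: DemuxParams(*values)
         for name, values in toy_demux_params.items()}

# Communities that still need demultiplexing
needs_demux = sorted(name for name, p in named.items() if p.demultiplex)
print(needs_demux)  # ['mock-3']
```

A namedtuple is still a tuple, so it unpacks and indexes exactly like the original entries.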

In [12]:
batch_demux(communities, mock_data_dir, demux_params)

mock-12 complete
mock-13 complete
mock-14 complete
mock-15 complete
mock-16 complete
mock-18 complete
mock-19 complete
mock-20 complete
mock-21 complete
mock-22 complete


To view the ``demux_summary.qzv`` (demultiplexed sequences per sample counts) and ``demux_plot_qual.qzv`` (fastq quality profiles) summaries that you just created, drag and drop the files into [q2view](https://view.qiime2.org/).
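
QIIME 2 ``.qzv`` files are zip archives, so their contents can also be listed with the Python standard library when a browser is not at hand. This is a minimal sketch; the helper name `list_archive_members` and the example path in the comment are hypothetical, and the actual location depends on where `batch_demux` wrote its output.

```python
import zipfile


def list_archive_members(path):
    """Return the names of all files stored inside a .qzv or .qza archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


# Hypothetical usage (adjust the path to your own output directory):
# for name in list_archive_members('mock-community/mock-12/demux_summary.qzv'):
#     print(name)
```

This only lists the files in the archive; q2view remains the convenient way to look at the rendered summaries.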

Use the fastq quality data above to decide how to proceed. As each dataset will have different quality profiles and read lengths, we will enter trimming parameters as a dictionary. We can use this dict to pass other parameters to ``denoise_to_phylogeny()``, including whether we want to build a phylogeny for each community.

In [13]:
# {community : (trim_left, trunc_len, build_phylogeny)}
trim_params = {'mock-1' : (0, 100, True),
               'mock-2' : (0, 130, True),
               'mock-3' : (0, 150, True),
               'mock-4' : (0, 150, True),
               'mock-5' : (0, 200, True),
               'mock-6' : (0, 50, True),
               'mock-7' : (0, 90, True),
               'mock-8' : (0, 100, True),
               'mock-9' : (0, 100, False),
               'mock-10' : (0, 100, False),
               'mock-12' : (0, 230, True),
               'mock-13' : (0, 250, True),
               'mock-14' : (0, 250, True),
               'mock-15' : (0, 250, True),
               'mock-16' : (19, 231, False),
               'mock-18' : (19, 231, False),
               'mock-19' : (19, 231, False),
               'mock-20' : (0, 250, False),
               'mock-21' : (0, 250, False),
               'mock-22' : (19, 250, False),
               'mock-23' : (19, 250, False),
               'mock-24' : (0, 150, False),
               'mock-25' : (0, 165, False),
               'mock-26' : (0, 290, False),
              }
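
Because the denoising step can run for hours, it may be worth confirming beforehand that every selected community has an entry in each parameter dictionary. The sketch below is one way to do that; `missing_params` and the `toy_` variables are hypothetical and not part of tax-credit.

```python
def missing_params(communities, *param_dicts):
    """Return the sorted community names absent from any of the dictionaries."""
    missing = set()
    for params in param_dicts:
        missing.update(c for c in communities if c not in params)
    return sorted(missing)


# Toy inputs: mock-17 has no entry in either dictionary
toy_communities = ['mock-12', 'mock-17', 'mock-18']
toy_demux = {'mock-12': (False, False, False),
             'mock-18': (False, False, False)}
toy_trim = {'mock-12': (0, 230, True),
            'mock-18': (19, 231, False)}

print(missing_params(toy_communities, toy_demux, toy_trim))  # ['mock-17']
```

In the notebook, the equivalent call would be `missing_params(communities, demux_params, trim_params)`; an empty list means every community is covered.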

Now we will quality filter with ``dada2`` and, for the communities where ``build_phylogeny`` is set to ``True`` in ``trim_params``, use the representative sequences to generate a phylogeny.

In [14]:
#this command may take hours
denoise_to_phylogeny(communities, mock_data_dir, trim_params)

Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_single.R /tmp/qiime2-archive-ih1lwiwb/1d483815-15da-4a4c-b6ef-05cc43e1b921/data /tmp/tmpjwi0504r/output.tsv.biom /tmp/tmpjwi0504r/track.tsv /tmp/tmpjwi0504r 230 0 2.0 2 Inf independent consensus 1.0 1 1000000 NULL 16



  os.path.join(output_dir, 'sample-frequency-detail.csv'))
  os.path.join(output_dir, 'feature-frequency-detail.csv'))


Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: mafft --preservecase --inputorder --thread 1 /tmp/qiime2-archive-vaptcq__/56f99adf-238a-45b9-bf99-6185e4b56c66/data/dna-sequences.fasta

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: FastTree -quote -nt /tmp/qiime2-archive-sjqdf83k/1b73f5fb-141d-4d2f-b73b-0b6f870ff289/data/aligned-dna-sequences.fasta

mock-12 complete
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_single.R /

  os.path.join(output_dir, 'sample-frequency-detail.csv'))
  os.path.join(output_dir, 'feature-frequency-detail.csv'))


Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: mafft --preservecase --inputorder --thread 1 /tmp/qiime2-archive-mppjoyd5/b2804c5c-4bd9-406f-9294-dd827d01b44a/data/dna-sequences.fasta

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: FastTree -quote -nt /tmp/qiime2-archive-kfv45_m_/7ba50e93-5f37-4d4c-a7ca-fe674b2f6d94/data/aligned-dna-sequences.fasta

mock-13 complete
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_single.R /

  os.path.join(output_dir, 'sample-frequency-detail.csv'))
  os.path.join(output_dir, 'feature-frequency-detail.csv'))


Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: mafft --preservecase --inputorder --thread 1 /tmp/qiime2-archive-9pib_6nk/009125ac-663c-4e34-8cc1-1aa7ba2766f3/data/dna-sequences.fasta

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: FastTree -quote -nt /tmp/qiime2-archive-1rmfsctx/7ed63ede-e403-4e36-84b6-38365d71c17e/data/aligned-dna-sequences.fasta

mock-14 complete
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_single.R /

  os.path.join(output_dir, 'sample-frequency-detail.csv'))
  os.path.join(output_dir, 'feature-frequency-detail.csv'))


Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: mafft --preservecase --inputorder --thread 1 /tmp/qiime2-archive-3oxn5wpr/30e357d1-ee5d-41f2-ab7a-b98f0018a3c3/data/dna-sequences.fasta

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: FastTree -quote -nt /tmp/qiime2-archive-4yk6ec_v/0bfb76f3-cbcd-4b4c-b6a4-98374641368c/data/aligned-dna-sequences.fasta

mock-15 complete
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_single.R /

  os.path.join(output_dir, 'sample-frequency-detail.csv'))
  os.path.join(output_dir, 'feature-frequency-detail.csv'))


mock-16 complete
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_single.R /tmp/qiime2-archive-g8p_rt26/02ee3cb7-85c5-4ee9-ba9e-e9f60444a8cd/data /tmp/tmp5b2kihzb/output.tsv.biom /tmp/tmp5b2kihzb/track.tsv /tmp/tmp5b2kihzb 231 19 2.0 2 Inf independent consensus 1.0 1 1000000 NULL 16



  os.path.join(output_dir, 'sample-frequency-detail.csv'))
  os.path.join(output_dir, 'feature-frequency-detail.csv'))


mock-18 complete
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_single.R /tmp/qiime2-archive-x218sgl_/722345d9-3abe-46e1-982f-1f8a66b65a3d/data /tmp/tmpogvhc2zb/output.tsv.biom /tmp/tmpogvhc2zb/track.tsv /tmp/tmpogvhc2zb 231 19 2.0 2 Inf independent consensus 1.0 1 1000000 NULL 16



  os.path.join(output_dir, 'sample-frequency-detail.csv'))
  os.path.join(output_dir, 'feature-frequency-detail.csv'))


mock-19 complete
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_single.R /tmp/qiime2-archive-nl4re43s/b9a1fcb2-6872-4a65-a0ee-eaaf243afc55/data /tmp/tmpul7mdwly/output.tsv.biom /tmp/tmpul7mdwly/track.tsv /tmp/tmpul7mdwly 250 0 2.0 2 Inf independent consensus 1.0 1 1000000 NULL 16



  os.path.join(output_dir, 'sample-frequency-detail.csv'))
  os.path.join(output_dir, 'feature-frequency-detail.csv'))


mock-20 complete
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_single.R /tmp/qiime2-archive-fcexd7o4/f1eba1ae-5c7c-41e3-a530-9cbf6e7b6af0/data /tmp/tmp6h3usxyl/output.tsv.biom /tmp/tmp6h3usxyl/track.tsv /tmp/tmp6h3usxyl 250 0 2.0 2 Inf independent consensus 1.0 1 1000000 NULL 16



  os.path.join(output_dir, 'sample-frequency-detail.csv'))
  os.path.join(output_dir, 'feature-frequency-detail.csv'))


mock-21 complete
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_single.R /tmp/qiime2-archive-4c1hjkq3/c7d22e53-e4fa-47e6-b4a9-4d1fd489c476/data /tmp/tmptkcxhazj/output.tsv.biom /tmp/tmptkcxhazj/track.tsv /tmp/tmptkcxhazj 250 19 2.0 2 Inf independent consensus 1.0 1 1000000 NULL 16

mock-22 complete


  os.path.join(output_dir, 'sample-frequency-detail.csv'))
  os.path.join(output_dir, 'feature-frequency-detail.csv'))


To view the ``feature_table_summary.qzv`` summaries you just created, drag and drop the files into [q2view](https://view.qiime2.org/).

## Extract results and move to repo

In [15]:
transport_to_repo(communities, mock_data_dir, repo_dir)