Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
69 commits
Select commit Hold shift + click to select a range
2ac9903
Initial creation of blank kneaddata scripts
Sep 28, 2016
20495f5
Merge branch 'adding-humann2-pipeline' into adding-kneaddata-pipeline
Sep 29, 2016
b142aa8
Created empty kneaddata command
Sep 29, 2016
6ff7909
Added kneaddata import to __init__.py
Sep 29, 2016
7ee38b6
Added kneaddata default params to __init__.py
Sep 29, 2016
c12f0d0
Merging Antonio's newest pull request
Sep 29, 2016
370611b
adding kneaddata functionality
Sep 29, 2016
6715d47
Merge branch 'master' of https://github.com/qiita-spots/qp-shotgun in…
Sep 29, 2016
8b5abb3
Added method for matching reads and run prefixes
Sep 29, 2016
978f613
Added tests for make_read_pairs_per_sample
Sep 29, 2016
e65adba
Added generate_kneaddata_commands
Sep 30, 2016
d306b90
Added tests for generate_kneaddata_commands
Sep 30, 2016
4f1cbf4
commented out unused parameter for running TRF
Sep 30, 2016
85a15ee
Fixed flake8 errors
Sep 30, 2016
3c2bbda
Added kneaddata to setup
Sep 30, 2016
caf10fd
Added kneaddata plugin
Sep 30, 2016
38a3724
Typos and flake8
Sep 30, 2016
8aa7664
Fixed plugin name collision
Sep 30, 2016
0415af8
Fixed error in generate_kneaddata_commands
Sep 30, 2016
e7f42ef
Fixed flake8 errors
Sep 30, 2016
47fb1cc
Fixing unit tests
Sep 30, 2016
55c7cb2
Commented out unused commands for flake8
Sep 30, 2016
e6a41a5
Fixing flake8 errors and command tests
Sep 30, 2016
7ba8411
Fixing flake8 errors and command tests
Sep 30, 2016
2a718f5
Moved param string formatting to function
Sep 30, 2016
c24774d
Fixed param string formatting
Sep 30, 2016
c6e7632
Fixed typo
Sep 30, 2016
3817629
Added test for format_kneaddata_params
Sep 30, 2016
1c00529
Fixing flake8 errors
Sep 30, 2016
726c580
Fixing typos
Sep 30, 2016
5c9dade
Fixing typos
Sep 30, 2016
7536a5f
Fixed flake8 errors
Sep 30, 2016
b6452a3
Fixed flake8 errors
Sep 30, 2016
6bacd88
Fixed merge of antonios humann2 pull request
Oct 3, 2016
2e953f1
Added install info for FastQC and KneadData
Oct 3, 2016
25f47a5
Fixed dependency link?
Oct 3, 2016
0d94279
Merge branch 'humann2-full-pipeline' into adding-kneaddata-pipeline
Oct 3, 2016
2b3abf7
Fixed typo
Oct 3, 2016
a86586d
Upgrade pip
Oct 3, 2016
7b18a9f
Trying to process dependency links
Oct 3, 2016
4d6498d
Declared egg ver in dep link
Oct 3, 2016
87f1caf
Added trimmomatic and bowtie2 install code
Oct 11, 2016
416f407
Adding Trimmomatic install
Oct 16, 2016
13fdbb3
Merged with Antonio's most recent PR
Oct 16, 2016
31bbe78
changed kneaddata dependency install to .travis
Oct 16, 2016
440bac9
Commented out Metaphlan install for travis testing
Oct 16, 2016
4f56331
Fixed bowtie2 conda install
Oct 16, 2016
74ad741
Fixed bowtie2 conda install
Oct 16, 2016
5dea337
Adding kneaddata functionality
Oct 17, 2016
482703d
Adding tests for kneaddata functionality
Oct 17, 2016
36a6f31
Updating Trimmomatic install info
Oct 17, 2016
6b5bb74
Fixed KneadData execution
Oct 17, 2016
9c6635a
Fixed KneadData execution
Oct 17, 2016
6d03d18
Added some more fastq test files
Oct 17, 2016
af9278a
Fixed local db reference
Oct 17, 2016
8b5fc09
Flake8
Oct 17, 2016
a22922d
hiding humann2 tests from nose
Oct 17, 2016
cd5f23f
Fixed to handle single read case in artifact generation
Oct 18, 2016
8457f93
Fixed merge conflicts with master
Oct 18, 2016
6f0f4a6
Updated unit tests
Oct 18, 2016
830bd81
Addressing @wasade's comments
Oct 18, 2016
6faf771
Please work jarvis
Oct 19, 2016
04533eb
Please work jarvis
Oct 19, 2016
d0ec922
flake8
Oct 19, 2016
1448263
Fixed merge conflicts with upstream master
Oct 19, 2016
40e9ed7
Adding some comments about read pairing
Oct 20, 2016
5167ad4
Removing humann2 test masking from .travis.yml
Oct 20, 2016
d76b614
flake8
Oct 20, 2016
ea6dfff
Merging Antonios humann2 code
Oct 25, 2016
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 7 additions & 1 deletion .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -13,9 +13,15 @@ install:
# At this moment this plugin has overlapping dependencies than qiita so installing everything in the same environment
- conda create --yes -n qiita_env python=2.7 pip nose flake8
pyzmq networkx pyparsing natsort mock future libgfortran
'pandas>=0.18' 'scipy>0.13.0' 'numpy>=1.7' 'h5py>=2.3.1'
'pandas>=0.18' 'scipy>0.13.0' 'numpy>=1.7' 'h5py>=2.3.1' matplotlib
- source activate qiita_env
- export CONDA_BIN=/home/travis/miniconda3/bin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if you are using this anymore, but this no longer running on travis so this path is incorrect. Check this line for an idea on how to fix it, unless you don't need anymore, in which case can be removed.

- pip install sphinx sphinx-bootstrap-theme coveralls ipython[all]==2.4.1
# kneaddata dependencies
- conda install --yes -c bioconda bowtie2 trimmomatic=0.36 fastqc=0.11.5
- export TRIMMOMATIC_BIN=$(readlink -f $(which trimmomatic))
- export TRIMMOMATIC_DIR=`dirname $TRIMMOMATIC_BIN`
- export KD_DEMO_DB=$PWD/miniconda3/envs/qiita_env/lib/python2.7/site-packages/kneaddata/tests/data/demo_bowtie2_db/demo_db
- pip install https://github.com/biocore/qiita/archive/master.zip --process-dependency-links
# installing other deps
- conda install --yes -c jjhelmus glpk
Expand Down
86 changes: 86 additions & 0 deletions qp_shotgun/kneaddata/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
# -----------------------------------------------------------------------------
# Copyright (c) 2014--, The Qiita Development Team.
#
# Distributed under the terms of the BSD 3-clause License.
#
# The full license is in the file LICENSE, distributed with this software.
# -----------------------------------------------------------------------------

from qiita_client import QiitaPlugin, QiitaCommand

from .kneaddata import kneaddata

__all__ = ['kneaddata']

# Initialize the plugin
plugin = QiitaPlugin(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is going to work. Each repository should have a single "QiitaPlugin" instance. In case that we want to have multiple "QiitaPlugin" instances we need a configure and a start_* script for each of them, in which case I don't see the benefit of having all these tools in a single repository. I think this can be added as a command to the shotgun plugin.

'KneadData', '0.5.1', 'KneadData is a tool designed to perform quality '
'control on metagenomic and metatranscriptomic sequencing data, '
'especially data from microbiome experiments.')

# Define the HUMAnN2 command
req_params = {'input': ('artifact', ['per_sample_FASTQ'])}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we specify that it needs forward and (or or) reverse fastqs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code will always return all the available ones, in your code you need to check if they exist, for example this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I get it... Currently this check occurs later on in processing.

opt_params = {
# there are other parameters not included that will be ignored in this
# configuration as we assume that the correct values were set in the env
# by the admin installing the tools:
# trimmomatic
# bowtie2
# --input # input FASTQ file (add a second argument for paired input files)
# --output # directory to write output files
# --output-prefix # prefix for output files [ DEFAULT : $SAMPLE_kneaddata ]
# --log # filepath for log [ DEFAULT : $OUTPUT_DIR/$SAMPLE_kneaddata.log ]
# --bowtie2 # path to bowtie executable
# --bmtagger # path to bmtagger exectuable
# --trf # path to TRF executable
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how these executable paths get specified. I assume that's what line 19-23 are referring to, and that we just need to have an executable in the $PATH?

'reference-db': ['choice:["human_genome"]', 'human_genome'], # ref db
'bypass-trim': ['boolean', 'False'], # bypass the trim step
'threads': ['integer', '1'], # threads to run
'processes': ['integer', '1'], # processes to run
'quality-scores': ['choice:["phred33","phred64"]', 'phred33'], # quality
'run-bmtagger': ['boolean', 'False'], # run BMTagger instead of Bowtie2
'run-trf': ['boolean', 'False'], # run TRF repeat finder tool
'run-fastqc-start': ['boolean', 'True'], # run FastQC on original data
'run-fastqc-end': ['boolean', 'True'], # run FastQC on filtered data
'store-temp-output': ['boolean', 'False'], # store temp output files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this parameter needs to be exposed. In which cases are we going to let the user store the temp output?

'log-level': ['choice:["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]',
'DEBUG'],

# Trimmomatic options
'max-memory': ['string', '500m'], # max memory in mb [ DEFAULT : 500m ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I need more info about this parameter. In general we don't expose these types of parameters to the user since it is system dependent (and we don't want the users to request, let's say, 3TB...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, for admin and this is required by JAVA! 😭

'trimmomatic': ['string', '$TRIMMOMATIC_DIR'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parameter and the one below shouldn't be exposed here. The parameters exposed here are parameters that are going to be eventually exposed to the user. I this case, this parameter should never be changed (i.e. it is set up at installation time and then it always is the same).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this.

'trimmomatic-options': ['string', 'ILLUMINACLIP:$TRIMMOMATIC_DIR/adapters/'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to allow the user to change this parameter, o provide a different set of parameters? This looks like a long string of values that seems unlikely that the user is going to change. If the answer for both previous questions is no, this parameter should also be removed from here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this so admins can change it.

'TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 '
'SLIDINGWINDOW:4:15 MINLEN:36'],
# Bowtie2 options
# 'bowtie2-options': ['string', '--very-sensitive']
# BMTagger options
# # TRF options
# 'match': ['integer', '2'], # matching weight
# 'mismatch': ['integer', '7'], # mismatching penalty
# 'delta': ['integer', '7'], # indel penalty
# 'pm': ['integer', '80'], # match probability
# 'pi': ['integer', '10'], # indel probability
# 'minscore': ['integer', '50'], # mimimum alignment score to report
# 'maxperiod': ['integer', '500m'] # maximum period size to report
# FastQC options
}
outputs = {'per_sample_FASTQ': 'per_sample_FASTQ'}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KneadData should produce a lot of outputs, and the quantity will depend on the other options specified (e.g. if you provide multiple reference dbs, or skip fastQC). how can we deal with this?

dflt_param_set = {
'Defaults': {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need to be updated accordingly as the parameters above get removed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup.

'reference-db': 'human_genome', 'bypass-trim': False, 'threads': 1,
'processes': 1, 'quality-scores': 'phred33', 'run-bmtagger': False,
'run-trf': False, 'run-fastqc-start': True, 'run-fastqc-end': True,
'store-temp-output': False, 'log-level': 'DEBUG', 'max-memory': '500m',
'trimmomatic': '"$TRIMMOMATIC_DIR"',
'trimmomatic-options': '"ILLUMINACLIP:$TRIMMOMATIC_DIR/adapters/'
'TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 '
'MINLEN:36"'
# , 'match': 2, 'mismatch': 7,
# 'delta': 7, 'pm': 80, 'pi': 10, 'minscore': 50, 'maxperiod': '500m'
}
}
kneaddata_cmd = QiitaCommand(
"KneadData", "Sequence QC", kneaddata, req_params, opt_params,
outputs, dflt_param_set)
plugin.register_command(kneaddata_cmd)
287 changes: 287 additions & 0 deletions qp_shotgun/kneaddata/kneaddata.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,287 @@
# -----------------------------------------------------------------------------
# Copyright (c) 2014--, The Qiita Development Team.
#
# Distributed under the terms of the BSD 3-clause License.
#
# The full license is in the file LICENSE, distributed with this software.
# -----------------------------------------------------------------------------

from itertools import zip_longest
from os.path import basename, join

from qiita_client import ArtifactInfo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra line

from qiita_client.util import system_call, get_sample_names_by_run_prefix


def make_read_pairs_per_sample(forward_seqs, reverse_seqs, map_file):
"""Recovers read pairing information
Parameters
----------
forward_seqs : list of str
The list of forward seqs filepaths
reverse_seqs : list of str
The list of reverse seqs filepaths
map_file : str
The path to the mapping file
Returns
-------
samples: list of tup
list of 3-tuples with run prefix, sample name, fwd read fp, rev read fp
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add a catch so that if the run prefix isn't in the mapping file that it assumes the run prefix is equal to the full file name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea but I think that change will change the behavior compared to other plugins, which points that perhaps we should have a method to do this in qiita_client vs. having ad hoc functions. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree in principle! Probably this is going to change if we make a more sensible fastq file format or artifact structure, though, so maybe wait until then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, could you add an issue so we don't forget ... thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be covered by qiita-spots/qtp-sequencing#8?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure as that doesn't deal with run_prefix as the full name, perhaps pointing to this discussion, adding a note or both will make it easier to follow up.

Raises
------
ValueError
If the rev is not an empty list and the same length than fwd seqs
The prefixes of the run_prefix don't match the file names
Notes
-----
At this stage it is required that if reverse sequences are present that all
samples have both a forward and a reverse sequence. However, the read
trimming step can sometimes eliminate all reverse reads, especially in low
coverage samples with poor overall reverse read quality.
"""

# sort forward seqs
forward_seqs.sort()

# check that rev seqs are same len
if reverse_seqs:
if len(forward_seqs) != len(reverse_seqs):
raise ValueError('Your reverse and forward files are of different '
'length. Forward: %s. Reverse: %s.' %
(', '.join(forward_seqs),
', '.join(reverse_seqs)))
reverse_seqs.sort()

# get run prefixes
# These are prefixes that should match uniquely to forward reads
# sn_by_rp is dict of samples keyed by run prefixes
sn_by_rp = get_sample_names_by_run_prefix(map_file)

# make pairings
samples = []
used_prefixes = set()
for i, (fwd_fp, rev_fp) in enumerate(zip_longest(forward_seqs,
reverse_seqs)):
# fwd_fp is the fwd read filepath
fwd_fn = basename(fwd_fp)

# iterate over run prefixes and make sure only one matches
run_prefix = None
for rp in sn_by_rp:
if fwd_fn.startswith(rp) and run_prefix is None:
run_prefix = rp
elif fwd_fn.startswith(rp) and run_prefix is not None:
raise ValueError('Multiple run prefixes match this fwd read: '
'%s' % fwd_fn)

# make sure that we got one matching run prefix:
if run_prefix is None:
raise ValueError('No run prefix matching this fwd read: %s'
% fwd_fn)

if run_prefix in used_prefixes:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can used_prefixes be a set? this will be a list lookup, and while the number of entries probably wont get too large, it's good practice to use a better lookup structure when the thing being searched over is expected to reasonably be > 5 elements

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(rule of thumb)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok!

raise ValueError('This run prefix matches multiple fwd reads: '
'%s' % run_prefix)

if rev_fp is None:
samples.append((run_prefix, sn_by_rp[run_prefix], fwd_fp, None))
else:
rev_fn = basename(rev_fp)
# if we have reverse reads, make sure the matching pair also
# matches the run prefix:
if not rev_fn.startswith(run_prefix):
raise ValueError('Reverse read does not match this run prefix.'
'\nRun prefix: %s\nForward read: %s\n'
'Reverse read: %s\n' %
(run_prefix, fwd_fn, rev_fn))

samples.append((run_prefix, sn_by_rp[run_prefix], fwd_fp,
rev_fp))

used_prefixes.add(run_prefix)

return(samples)


def format_kneaddata_params(parameters):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add documentation


params = []

for param in sorted(parameters):
value = parameters[param]

if value is True:
params.append('--%s' % param)
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since these are if/elif, the continue statements are not needed - can you remove them?

elif value is False:
continue
elif value:
params.append('--%s %s' % (param, value))
continue

param_string = ' '.join(params)

return(param_string)


def generate_kneaddata_commands(forward_seqs, reverse_seqs, map_file,
out_dir, parameters):
"""Generates the KneadData commands
Parameters
----------
forward_seqs : list of str
The list of forward seqs filepaths
reverse_seqs : list of str
The list of reverse seqs filepaths
map_file : str
The path to the mapping file
out_dir : str
The job output directory
parameters : dict
The command's parameters, keyed by parameter name
Returns
-------
cmds: list of str
The KneadData commands
samples: list of tup
list of 3-tuples with run prefix, sample name, fwd read fp, rev read fp
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra space

Raises
------
ValueError
If the rev is not an empty list and the same length than fwd seqs
The prefixes of the run_prefix don't match the file names
Notes
-----
Currently this is requiring matched pairs in the make_read_pairs_per_sample
step but implicitly allowing empty reverse reads in the actual command
generation. This behavior may allow support of situations with empty
reverse reads in some samples, for example after trimming and QC.
"""
# making sure the forward and reverse reads are in the same order

# we match filenames, samples, and run prefixes
samples = make_read_pairs_per_sample(forward_seqs, reverse_seqs, map_file)

cmds = []

param_string = format_kneaddata_params(parameters)
prefixes = []
for run_prefix, sample, f_fp, r_fp in samples:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it ever the case that some samples will have reverse reads and others will not within the same study? right now the forloop implicitly supports this but I don't think that is the case and open the door for a subtle bug in the future as code fluxes over time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and it's a big pain in the ass. This frequently happens with shallow shotgun runs like we have been getting a lot of recently, where the quality is not so good and you end up with the reverse read being trimmed to extinction.

I agree that this kind of thing can set us up for problems and we should keep aware of it! Perhaps for early QC tools like kneaddata we can enforce proper paired reads (assuming no trimming so everything should be consistent) while downstream tools can be more flexible?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not forcing paring makes sense and more cause it sounds like is something that is already happening. What about adding a comment here, perhaps in the Notes and on this line so future coders/reviewers are aware of this. Additionally, we could add a test for this but I don't feel strongly about this addition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, adding that comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think that helps. Perhaps adding a comment on top of the for loop, like # see notes before changing!, might be good.

prefixes.append(run_prefix)
if r_fp is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but I think this if/else block can be reduced to:

r_fp_str = ' --input "%s"' % r_fp if r_fp is not None else ""
cmds.append('kneaddata --input "%s"%s --output "%s" --output-prefix "%s" %s'
            % (f_fp, r_fp_str, join(out_dir, run_prefix), run_prefix, param_string))

to avoid some code duplication

cmds.append('kneaddata --input "%s" --output "%s" '
'--output-prefix "%s" %s' %
(f_fp, join(out_dir, run_prefix),
run_prefix, param_string))
else:
cmds.append('kneaddata --input "%s" --input "%s" --output "%s" '
'--output-prefix "%s" %s' %
(f_fp, r_fp, join(out_dir, run_prefix),
run_prefix, param_string))

return cmds, samples


def _run_commands(qclient, job_id, commands, msg):
for i, cmd in enumerate(commands):
qclient.update_job_step(job_id, msg % i)
std_out, std_err, return_value = system_call(cmd)
if return_value != 0:
error_msg = ("Error running KneadData:\nStd out: %s\nStd err: %s"
"\n\nCommand run was:\n%s"
% (std_out, std_err, cmd))
return False, error_msg

return True, ""


def _per_sample_ainfo(out_dir, samples):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is adding 1 artifact per output file - I don't think this is the desired behavior right? I think what we want to get is a single artifact with all the sequence file. Note that the number of output artifacts from a command should be deterministic (i.e. it is known in advance how many artifacts a command is going to generate). Note that artifact != file so it shouldn't be a problem.

Now, extending on that, it looks like we can potentially group the output files in 3 different artifacts: (1) clean paired reads with R1 and R2, (2) clean unpaired R1, and (3) clean unpaired R2. However, this doesn't work for the case in which R2 is not provided, since we only have 1 output artifact. To solve this case, I would add 2 commands, 1 for paired reads and 1 for unpaired reads. Note that internally you can be using the same functions if you want, but from the Qiita point of view this simplifies things as the output artifacts are known in advance.

To create the 3 artifacts for the paired case, the code would be something like:

paired_files = []
r1_files = []
r2_files = []
for run_prefix, sample, f_fp, r_fp in samples:
    sam_out_dir = join(out_dir, run_prefix)
    paired_files.append((join(sam_out_dir, '%s_paired_1.fastq' % run_prefix), 'preprocessed_fastq'))
    paired_files.append((join(sam_out_dir, '%s_paired_2.fastq' % run_prefix), 'preprocessed_fastq'))
    r1_files.append((join(sam_out_dir, '%s_unmatched_1.fastq' % run_prefix), 'preprocessed_fastq'))
    r2_files.append((join(sam_out_dir, '%s_unmatched_2.fastq' % run_prefix), 'preprocessed_fastq'))

ainfo = [ArtifactInfo('clean paired', 'per_sample_FASTQ', paired_files),
         ArtifactInfo('clean unmatched R1', 'per_sample_FASTQ', r1_files),
         ArtifactInfo('clean unmatched R1', 'per_sample_FASTQ', r2_files)

Note that the definition of the command in the __init__.py file would need to be update to show that this command outputs 3 artifacts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I understand better now how this works. How can we retain information about the readdirectionality in the paired reads then given the current structure? Is there something besides 'preprocessed_fastq'?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a totally valid point and it shows how much we need a small type decorator system that will allow this differentiation.

ainfo = []
for run_prefix, sample, f_fp, r_fp in samples:
sam_out_dir = join(out_dir, run_prefix)

if r_fp:
ainfo.extend([
ArtifactInfo('clean paired R1', 'per_sample_FASTQ',
[(join(sam_out_dir, '%s_paired_1.fastq' %
run_prefix), 'per_sample_FASTQ')]),
ArtifactInfo('clean paired R2', 'per_sample_FASTQ',
[(join(sam_out_dir, '%s_paired_2.fastq' %
run_prefix), 'per_sample_FASTQ')]),
ArtifactInfo('clean unpaired R1', 'per_sample_FASTQ',
[(join(sam_out_dir, '%s_unmatched_1.fastq' %
run_prefix), 'per_sample_FASTQ')]),
ArtifactInfo('clean unpaired R2', 'per_sample_FASTQ',
[(join(sam_out_dir, '%s_unmatched_2.fastq' %
run_prefix), 'per_sample_FASTQ')])])
else:
ainfo.extend([
ArtifactInfo('cleaned reads', 'per_sample_FASTQ',
[(join(sam_out_dir, '%s.fastq' %
run_prefix), 'per_sample_FASTQ')])])

return ainfo


def kneaddata(qclient, job_id, parameters, out_dir):
"""Run kneaddata with the given parameters
Parameters
----------
qclient : tgp.qiita_client.QiitaClient
The Qiita server client
job_id : str
The job id
parameters : dict
The parameter values to run split libraries
out_dir : str
The path to the job's output directory
Returns
-------
bool, list, str
The results of the job
"""
# Step 1 get the rest of the information need to run kneaddata
qclient.update_job_step(job_id, "Step 1 of 3: Collecting information")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Step 1 of 5

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

K

artifact_id = parameters['input']

# removing input from parameters so it's not part of the final command
del parameters['input']

# Get the artifact filepath information
artifact_info = qclient.get("/qiita_db/artifacts/%s/" % artifact_id)
fps = artifact_info['files']

# Get the artifact metadata
prep_info = qclient.get('/qiita_db/prep_template/%s/'
% artifact_info['prep_information'][0])
qiime_map = prep_info['qiime-map']

# Step 2 generating command kneaddata
qclient.update_job_step(job_id, "Step 2 of 5: Generating"
" KneadData command")
rs = fps['raw_reverse_seqs'] if 'raw_reverse_seqs' in fps else []
commands, samples = generate_kneaddata_commands(fps['raw_forward_seqs'],
rs, qiime_map, out_dir,
parameters)

# Step 3 execute kneaddata
msg = "Step 3 of 5: Executing KneadData job (%d/{0})".format(len(commands))
success, msg = _run_commands(qclient, job_id, commands, msg)
if not success:
return False, None, msg

# Step 4 generating artifacts
ainfo = _per_sample_ainfo(out_dir, samples)

return True, ainfo, ""
7 changes: 7 additions & 0 deletions qp_shotgun/kneaddata/tests/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# -----------------------------------------------------------------------------
# Copyright (c) 2014--, The Qiita Development Team.
#
# Distributed under the terms of the BSD 3-clause License.
#
# The full license is in the file LICENSE, distributed with this software.
# -----------------------------------------------------------------------------
Loading