# Using the Seven Bridges Public SMC-RNA DREAM Project

In this Jupyter Notebook, we'll:
- review some basic Seven Bridges API functions
- examine the training data for the DREAM Challenge
- use a set of helper functions to easily filter files and apps
- learn how to test a single app across all the DREAM training data in one click

In [1]:
from __future__ import print_function
from os import environ
from datetime import datetime
from time import sleep
import sevenbridges as sbg
from dream_helpers import *
import pprint 
pp = pprint.PrettyPrinter(indent=4)

Before using the notebook, add API_URL and AUTH_TOKEN to your OS environment

API_URL is "https://cgc-api.sbgenomics.com/v2"

AUTH_TOKEN can be retrieved from the Account settings on the CGC (see the Developer tab) 


    export API_URL="https://cgc-api.sbgenomics.com/v2"
    export AUTH_TOKEN=<cgc auth_token>
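
If either variable is missing, the next cell fails with a bare `KeyError`. The sketch below is one way to check for both up front; `load_credentials` is a hypothetical helper written for this example, not part of `sevenbridges` or `dream_helpers`.

```python
import os


def load_credentials(env=None):
    """Return (url, token), raising a readable error if either variable is unset."""
    # `env` defaults to the real process environment; a plain dict can be passed for testing
    env = os.environ if env is None else env
    missing = [name for name in ("API_URL", "AUTH_TOKEN") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return env["API_URL"], env["AUTH_TOKEN"]
```

The returned pair can then be passed to `sbg.Config(url=..., token=...)` in the same way the notebook passes the `environ` values.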

In [2]:
# Create the API object using a config built from your environment variables
api = sbg.Api(config=sbg.Config(url=environ['API_URL'], token=environ['AUTH_TOKEN']))

In [4]:
"""
Getting to know the API
    Remember that you can introspect any object using the dir() function.
"""
print("Attributes of root api object: {}".format(dir(api)))
print("\nAttributes of api.files: {}".format(dir(api.files)))

Attributes of root api object: ['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_limit', '_remaining', '_request', '_request_id', '_reset', '_session', 'apps', 'billing_groups', 'delete', 'download_pool', 'endpoints', 'files', 'get', 'headers', 'invoices', 'limit', 'oauth_token', 'patch', 'post', 'projects', 'put', 'remaining', 'request_id', 'reset_time', 'session', 'tasks', 'timeout', 'token', 'upload_pool', 'url', 'users']

Attributes of api.files: ['_API', '_URL', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_fields', '_modified_data', '_query', 'copy', 'created_on', 'delete', 

## Basic API Functions
Here are a few quick examples to help you get to know the most common API calls.

In [5]:
# Get username
username = api.users.me().username
print("Username: {}".format(username))

# Get list of projects (objects)
projects_list = get_projects_list(api)

# Get list of project names
projects_list_by_name = [p.name for p in projects_list] 
print("\nList of project names: {}{}".format("\n", projects_list_by_name))

# Get list of project IDs
projects_list_by_id = [p.id for p in projects_list] 
print("\nList of project IDs: {}{}".format("\n", projects_list_by_id))

Username: gauravdream

List of project names: 
[u'DREAM-Eval', u'Kallisto-DREAM', u'DREAM']

List of project IDs: 
[u'gauravdream/dream-eval', u'gauravCGC/kallisto-dream', u'gauravdream/dream']


In [6]:
"""
Throughout this notebook, we'll often print the names of objects
    (e.g. projects, files, apps) to sanity-check our calls.
Let's use a helper function to clean up our prints and save us time.
"""
print("List of project names: ")
print(*get_names(projects_list), sep="\n")
print("\nList of project IDs: ")
print(*get_ids(projects_list), sep="\n")

List of project names: 
DREAM-Eval
Kallisto-DREAM
DREAM

List of project IDs: 
gauravdream/dream-eval
gauravCGC/kallisto-dream
gauravdream/dream
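
The source of `get_names` and `get_ids` is not shown in this notebook. As a rough sketch, helpers of this kind only need each object to expose `.name` and `.id`, which the project objects above do. The `names_of` and `ids_of` functions and the `Project` stand-in below are hypothetical names chosen for this example.

```python
from __future__ import print_function
from collections import namedtuple

# Stand-in for API objects; only the .id and .name attributes are assumed
Project = namedtuple("Project", ["id", "name"])


def names_of(objects):
    """Collect the .name attribute of each object."""
    return [obj.name for obj in objects]


def ids_of(objects):
    """Collect the .id attribute of each object."""
    return [obj.id for obj in objects]


demo_projects = [
    Project("gauravdream/dream-eval", "DREAM-Eval"),
    Project("gauravdream/dream", "DREAM"),
]
print(*names_of(demo_projects), sep="\n")
```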


## Working with Projects

"Projects" are workspaces that contain a set of files and applications on which a group of collaborators can work.

In [7]:
"""
Query all your projects and retrieve a list of matching projects
"""
project_query = "DREAM"

# Print projects matched by your query
print("Query returned the following projects: ")
print(*get_projects_by_string(api, project_query), sep="\n")

Query returned the following projects: 
gauravdream/dream-eval
gauravCGC/kallisto-dream
gauravdream/dream


In [8]:
# Set the name of the project we'll use for this tutorial
DREAM_PROJECT="gauravdream/dream"

In [8]:
# Get a list of the applications in your DREAM project
def get_apps_in_project(api, project):
    return list(api.apps.query(project).all())

apps = get_apps_in_project(api, DREAM_PROJECT)

# Print all apps in DREAM Project
print("Apps in '{}' project: ".format(DREAM_PROJECT))
print(*get_names(apps), sep="\n")

Apps in 'gauravdream/dream' project: 
DREAM Fusion Detection Evaluation Workflow
DREAM Isoform Quantification Evaluation Workflow
DREAM RSEM
DREAM STAR
DREAM TopHat
import-synapse-data
rsem-cut
rsem-gunzip
rsem-tar
smcFusion-STAR-workflow
smcFusion-TopHat-Workflow
smcIsoform-RSEM-Workflow
star-converter
star-fusion
star-tar
tophat-converter
tophat-grep
tophat-tar


In [10]:
# Get list of file objects in the test DREAM project
files = get_files_in_project(api, DREAM_PROJECT)

# Print all files in DREAM project
print("Files in '{}' project:".format(DREAM_PROJECT))
print(*sorted(get_names(files)), sep="\n")

Files in 'gauravdream/dream' project:
GRCh37_index.tar.gz
Homo_sapiens.GRCh37.75.dna_sm.primary_assembly.fa.gz
Homo_sapiens.GRCh37.75.gtf.gz
Homo_sapiens.GRCh37.75.gtf.txt
_1_result.out
_2_result.out
ensembl.hg19.txt
rererunning_sim8_rsem_isoform_quant.tsv
rerunning_sim8_rsem_isoform_quant.tsv
rsem_index.tar.gz
sim11_filtered.bedpe
sim11_isoforms_truth.txt
sim11_mergeSort_1.fq.gz
sim11_mergeSort_2.fq.gz
sim13_filtered.bedpe
sim13_isoforms_truth.txt
sim13_mergeSort_1.fq.gz
sim13_mergeSort_2.fq.gz
sim14_filtered.bedpe
sim14_isoforms_truth.txt
sim14_mergeSort_1.fq.gz
sim14_mergeSort_2.fq.gz
sim15_filtered.bedpe
sim15_isoforms_truth.txt
sim15_mergeSort_1.fq.gz
sim15_mergeSort_2.fq.gz
sim16_filtered.bedpe
sim16_isoforms_truth.txt
sim16_mergeSort_1.fq.gz
sim16_mergeSort_2.fq.gz
sim17_filtered.bedpe
sim17_isoforms_truth.txt
sim17_mergeSort_1.fq.gz
sim17_mergeSort_2.fq.gz
sim19_filtered.bedpe
sim19_isoforms_truth.txt
sim19_mergeSort_1.fq.gz
sim19_mergeSort_2.fq.gz
sim1_filtered.bedpe
sim1_isof

In [22]:
# Get only the fastq files - filter for "mergeSort" in filename 
#     (note that there are other ways to do this)
str_filter = "mergeSort"
mergeSort_files = get_files_by_filename_filter(api, DREAM_PROJECT, str_filter)

# Print filenames for the gzipped fastq files identified
print("Files in '{}' project with '{}' in filename:".format(DREAM_PROJECT, str_filter))
print(*sorted(get_names(mergeSort_files)), sep="\n")

Files in 'gauravdream/dream' project with 'mergeSort' in filename:
sim11_mergeSort_1.fq.gz
sim11_mergeSort_2.fq.gz
sim13_mergeSort_1.fq.gz
sim13_mergeSort_2.fq.gz
sim14_mergeSort_1.fq.gz
sim14_mergeSort_2.fq.gz
sim15_mergeSort_1.fq.gz
sim15_mergeSort_2.fq.gz
sim16_mergeSort_1.fq.gz
sim16_mergeSort_2.fq.gz
sim17_mergeSort_1.fq.gz
sim17_mergeSort_2.fq.gz
sim19_mergeSort_1.fq.gz
sim19_mergeSort_2.fq.gz
sim1_mergeSort_1.fq.gz
sim1_mergeSort_2.fq.gz
sim21_mergeSort_1.fq.gz
sim21_mergeSort_2.fq.gz
sim2_mergeSort_1.fq.gz
sim2_mergeSort_2.fq.gz
sim3_mergeSort_1.fq.gz
sim3_mergeSort_2.fq.gz
sim4_mergeSort_1.fq.gz
sim4_mergeSort_2.fq.gz
sim5_mergeSort_1.fq.gz
sim5_mergeSort_2.fq.gz
sim7_mergeSort_1.fq.gz
sim7_mergeSort_2.fq.gz
sim8_mergeSort_1.fq.gz
sim8_mergeSort_2.fq.gz


In [21]:
# Get metadata properties for a single file
print("Metadata for '{}'".format(mergeSort_files[0].name))
pp.pprint(dict(mergeSort_files[0].metadata))

Metadata for 'sim11_mergeSort_2.fq.gz'
{   u'experimental_strategy': u'RNA-Seq',
    u'investigation': u'DREAM SMC-RNA',
    u'paired_end': u'2',
    u'sample_id': u'sim11'}


In [25]:
# Return list of file ids based on metadata
metadata_filters = {
                    "sample_id": "sim8",
                    "experimental_strategy": "RNA-Seq"
                   }

files_by_metadata = get_files_by_metadata(api, DREAM_PROJECT, metadata_filters)

# Print filenames for all files returned by metadata filter
print("Files in '{}' project with metadata filter:".format(DREAM_PROJECT))
print(*get_names(files_by_metadata), sep="\n")

Files in 'gauravdream/dream' project with metadata filter:
sim8_isoforms_truth.txt
sim8_filtered.bedpe
sim8_mergeSort_1.fq.gz
sim8_mergeSort_2.fq.gz
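
The filter above keeps a file only when every key/value pair in `metadata_filters` matches. A minimal sketch of that matching rule follows. It assumes exact string equality on each key and uses plain dicts in place of `file.metadata`; the real `get_files_by_metadata` may work differently, and `matches_metadata` is a hypothetical name.

```python
def matches_metadata(metadata, filters):
    """True when every key in `filters` is present in `metadata` with an equal value."""
    return all(metadata.get(key) == value for key, value in filters.items())


# Plain dicts standing in for file.metadata (values copied from the outputs above)
demo_metadata = {
    "sim8_mergeSort_1.fq.gz": {"sample_id": "sim8", "experimental_strategy": "RNA-Seq", "paired_end": "1"},
    "sim11_mergeSort_2.fq.gz": {"sample_id": "sim11", "experimental_strategy": "RNA-Seq", "paired_end": "2"},
}
filters = {"sample_id": "sim8", "experimental_strategy": "RNA-Seq"}
selected = sorted(name for name, md in demo_metadata.items() if matches_metadata(md, filters))
```

With a real file object, `dict(file.metadata)` (as used earlier in this notebook) gives a plain dict that can be passed to the function.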


In [27]:
# Another implementation - trim a list (e.g. filenames) by explicit file extension

files_by_ext = get_files_by_extension(api, DREAM_PROJECT, ext="fq.gz")

# Print files returned by extension query
print("Files in '{}' project with extension filter:".format(DREAM_PROJECT))
print(*sorted(get_names(files_by_ext)), sep="\n")

Files in 'gauravdream/dream' project with extension filter:
sim11_mergeSort_1.fq.gz
sim11_mergeSort_2.fq.gz
sim13_mergeSort_1.fq.gz
sim13_mergeSort_2.fq.gz
sim14_mergeSort_1.fq.gz
sim14_mergeSort_2.fq.gz
sim15_mergeSort_1.fq.gz
sim15_mergeSort_2.fq.gz
sim16_mergeSort_1.fq.gz
sim16_mergeSort_2.fq.gz
sim17_mergeSort_1.fq.gz
sim17_mergeSort_2.fq.gz
sim19_mergeSort_1.fq.gz
sim19_mergeSort_2.fq.gz
sim1_mergeSort_1.fq.gz
sim1_mergeSort_2.fq.gz
sim21_mergeSort_1.fq.gz
sim21_mergeSort_2.fq.gz
sim2_mergeSort_1.fq.gz
sim2_mergeSort_2.fq.gz
sim3_mergeSort_1.fq.gz
sim3_mergeSort_2.fq.gz
sim4_mergeSort_1.fq.gz
sim4_mergeSort_2.fq.gz
sim5_mergeSort_1.fq.gz
sim5_mergeSort_2.fq.gz
sim7_mergeSort_1.fq.gz
sim7_mergeSort_2.fq.gz
sim8_mergeSort_1.fq.gz
sim8_mergeSort_2.fq.gz


In [31]:
# Get a file object by its explicit filename

file_by_name = get_file_by_name(api, DREAM_PROJECT, filename="sim1_mergeSort_1.fq.gz")

# Print filename for the file returned by the filename query
print("File returned by filename filter: {}".format(file_by_name.name))

File returned by filename filter: sim1_mergeSort_1.fq.gz


In [34]:
# Get file objects by querying the filenames with a string
#     for example, get all paired_end=1 fq.gz files
str_query = "_1."
files_by_string = get_files_by_string(api, DREAM_PROJECT, query=str_query)

# Print filename for files returned by filename query
print("Files in '{}' project with '{}' in filename:".format(DREAM_PROJECT, str_query))
print(*sorted(get_names(files_by_string)), sep="\n")

Files in 'gauravdream/dream' project with '_1.' in filename:
sim11_mergeSort_1.fq.gz
sim13_mergeSort_1.fq.gz
sim14_mergeSort_1.fq.gz
sim15_mergeSort_1.fq.gz
sim16_mergeSort_1.fq.gz
sim17_mergeSort_1.fq.gz
sim19_mergeSort_1.fq.gz
sim1_mergeSort_1.fq.gz
sim1a_shrunk_1.fq
sim21_mergeSort_1.fq.gz
sim2_mergeSort_1.fq.gz
sim3_mergeSort_1.fq.gz
sim4_mergeSort_1.fq.gz
sim5_mergeSort_1.fq.gz
sim7_mergeSort_1.fq.gz
sim8_mergeSort_1.fq.gz
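
Note that the `'_1.'` query also returned `sim1a_shrunk_1.fq`, which is not one of the gzipped training files. Combining a substring test with a suffix test excludes it. The sketch below works on plain filename strings, and `filter_names` is a hypothetical helper rather than a `dream_helpers` function.

```python
def filter_names(names, contains=None, suffix=None):
    """Keep names that contain `contains` and end with `suffix` (each test is optional)."""
    kept = []
    for name in names:
        if contains is not None and contains not in name:
            continue
        if suffix is not None and not name.endswith(suffix):
            continue
        kept.append(name)
    return kept


demo_names = [
    "sim1_mergeSort_1.fq.gz",
    "sim1_mergeSort_2.fq.gz",
    "sim1a_shrunk_1.fq",
    "sim11_mergeSort_1.fq.gz",
]
read1_gz = filter_names(demo_names, contains="_1.", suffix=".fq.gz")
```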


## Working with CWL Applications

"Applications" are tools (single executables) or workflows (chains of tools) used to analyze data. All applications on the CGC are described using Draft 2 of the Common Workflow Language (stay tuned for info on support for CWL v1.0!).

In [38]:
# Print a list of apps (by name) in the project
print("The apps in the '{}' project: ".format(DREAM_PROJECT))
print(*sorted(get_names(get_apps_in_project(api, DREAM_PROJECT))), sep="\n")

The apps in the 'gauravdream/dream' project: 
DREAM Fusion Detection Evaluation Workflow
DREAM Isoform Quantification Evaluation Workflow
DREAM RSEM
DREAM STAR
DREAM TopHat
import-synapse-data
rsem-cut
rsem-gunzip
rsem-tar
smcFusion-STAR-workflow
smcFusion-TopHat-Workflow
smcIsoform-RSEM-Workflow
star-converter
star-fusion
star-tar
tophat-converter
tophat-grep
tophat-tar


In [41]:
# Grab the "RSEM" apps
apps = get_apps_by_string(api, DREAM_PROJECT, query="RSEM")

# Print a list of apps (by query) in the project
print("The apps in the '{}' project filtered by an app-name query: ".format(DREAM_PROJECT))
print(*sorted(get_names(apps)), sep="\n")

The apps in the 'gauravdream/dream' project filtered by an app-name query: 
DREAM RSEM
rsem-cut
rsem-gunzip
rsem-tar
smcIsoform-RSEM-Workflow


In [50]:
# Retrieve a specific app

DREAM_APP = get_app_by_name(api, DREAM_PROJECT, app_name='smcIsoform-RSEM-Workflow')

# Print the ID and name of your app
print("App ID: {}".format(DREAM_APP.id))
print("App Name: {}".format(DREAM_APP.name))

App ID: gauravdream/dream/smcisoform-rsem-workflow/1
App Name: smcIsoform-RSEM-Workflow


In [160]:
# You can get the CWL description (format: JSON) for your app
cwl_app = DREAM_APP.raw

# Get the input port definitions (required and optional) from the CWL description
inputs = cwl_app['inputs']

# Print the input port labels (None if a port has no label) and IDs
print("App inputs by (Label, ID):")
print(*zip(get_input_labels(inputs), get_input_ids(inputs)), sep="\n")

App inputs by (Label, ID):
(u'TUMOR_FASTQ_1', u'#input')
(u'TUMOR_FASTQ_2', u'#input_1')
(u'index', u'#index')
(None, u'#pairedend')
(None, u'#strandspecific')
(None, u'#threads')
(None, u'#output_filename')
(None, u'#f')
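
The `(Label, ID)` pairs above come from the app's input port definitions. The sketch below assumes each port is a dict with an `id` key and an optional `label` key, which is consistent with the `None` labels in the output. The port list and the `input_ids` / `input_labels` names are made up for this example, and the real helpers may differ.

```python
# Hypothetical, trimmed-down port definitions in the shape assumed above
demo_inputs = [
    {"id": "#input", "label": "TUMOR_FASTQ_1"},
    {"id": "#index", "label": "index"},
    {"id": "#threads"},  # no label set on this port
]


def input_ids(inputs):
    """Return the ID of every input port."""
    return [port["id"] for port in inputs]


def input_labels(inputs):
    """Return the label of every input port, or None where no label is set."""
    return [port.get("label") for port in inputs]


pairs = list(zip(input_labels(demo_inputs), input_ids(demo_inputs)))
```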


In [176]:
"""
When submitting tasks, you will need to submit a dict() with the inputs and values,
    where the keys are the input port IDs
    and the values are the actual inputs, e.g. file objects, integers, strings, etc.

dream_helpers.py provides the generate_input_object function to help you with this.

"""
print("All inputs: ")
generate_input_object(DREAM_APP, required=False, print_opt=True)
print("\nRequired inputs: ")
generate_input_object(DREAM_APP, required=True, print_opt=True)

All inputs: 
{   'f': '',
    'index': '',
    'input': '',
    'input_1': '',
    'output_filename': '',
    'pairedend': '',
    'strandspecific': '',
    'threads': ''}

Required inputs: 
{   'f': '', 'index': '', 'input': '', 'input_1': '', 'output_filename': ''}


{'f': '', 'index': '', 'input': '', 'input_1': '', 'output_filename': ''}
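
One way such a template could be built is sketched below. It assumes a port counts as optional when its CWL `type` list includes `"null"`, and that dictionary keys are the port IDs with the leading `#` removed (as in the output above). `input_template` and `is_required` are hypothetical names; this is not the actual `generate_input_object` code.

```python
def is_required(port):
    """Treat a port as required unless its type allows "null" (assumed CWL convention)."""
    port_type = port.get("type", [])
    if isinstance(port_type, list):
        return "null" not in port_type
    return port_type != "null"


def input_template(inputs, required_only=False):
    """Build {port_id_without_hash: ''} for all ports, or only the required ones."""
    template = {}
    for port in inputs:
        if required_only and not is_required(port):
            continue
        template[port["id"].lstrip("#")] = ""
    return template


# Hypothetical port definitions for illustration
demo_ports = [
    {"id": "#index", "type": ["File"]},
    {"id": "#output_filename", "type": ["string"]},
    {"id": "#threads", "type": ["null", "int"]},
]
all_ports = input_template(demo_ports)
required_ports = input_template(demo_ports, required_only=True)
```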

In [58]:
# Now that we have the app, and the information for which files are needed, let's get the files
# Grab the rsem_index file by searching by string and retrieving the first result
RSEM_INDEX = get_files_by_string(api, DREAM_PROJECT, query='rsem_index')[0]
print("Index File: {}".format(RSEM_INDEX.name))

Index File: rsem_index.tar.gz


In [77]:
# Grab all the fq.gz files
fastqs_all = get_files_by_extension(api, DREAM_PROJECT, ext='fq.gz')

# Define a list of filters you want to apply to your fastq files
#     Note the use of 'sim1_' to ensure that a query of "sim1" doesn't return "sim11" files as well
filter_list = ['sim8_', 'sim1_']
fastqs = filter_by_prefixes(fastqs_all, filter_list)

# Print the final list of files
print("Filtered list: ")
print(*sorted(get_names(fastqs)), sep="\n")

Filtered list: 
sim1_mergeSort_1.fq.gz
sim1_mergeSort_2.fq.gz
sim8_mergeSort_1.fq.gz
sim8_mergeSort_2.fq.gz
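
The trailing underscore matters because prefix matching is literal. The sketch below shows the difference on plain filename strings. `filter_names_by_prefixes` is a hypothetical stand-in, and the real `filter_by_prefixes` (which takes file objects) may be implemented differently.

```python
def filter_names_by_prefixes(names, prefixes):
    """Keep names that start with any of the given prefixes, preserving input order."""
    # str.startswith accepts a tuple and returns True if any prefix matches
    return [name for name in names if name.startswith(tuple(prefixes))]


demo_names = ["sim1_mergeSort_1.fq.gz", "sim11_mergeSort_1.fq.gz", "sim8_mergeSort_1.fq.gz"]
strict = filter_names_by_prefixes(demo_names, ["sim8_", "sim1_"])
loose = filter_names_by_prefixes(demo_names, ["sim1"])
```

Here `strict` holds only the `sim1_` and `sim8_` files, while `loose` also picks up the `sim11_` file.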


In [78]:
# Split the fastqs into two lists by paired-end value
FQ1 = [fq for fq in fastqs if '_1.' in fq.name]
FQ2 = [fq for fq in fastqs if '_2.' in fq.name]
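# Wrap in list() so the pairs can be iterated more than once
#     (on Python 3, zip() returns a one-shot iterator)
fastqs_paired = list(zip(FQ1, FQ2))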

# Note that we've introduced many ways to do this query
#     e.g. you can split by metadata value ['paired_end']

# There are additional functions for pairing fastqs in dream_helpers.py

# Verify that the lists split properly
print("Pairs of fastq files: ")
print(*[(f[0].name, f[1].name) for f in fastqs_paired], sep="\n")

Pairs of fastq files: 
(u'sim1_mergeSort_1.fq.gz', u'sim1_mergeSort_2.fq.gz')
(u'sim8_mergeSort_1.fq.gz', u'sim8_mergeSort_2.fq.gz')
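
Zipping two lists pairs files by position, so it relies on both lists being in the same sample order. As an alternative, the sketch below pairs files by their `sample_id` and `paired_end` metadata, which does not depend on ordering. It assumes `paired_end` holds the strings `"1"` and `"2"`, as in the metadata printed earlier. `pair_by_sample` and the `Fastq` stand-in are hypothetical names for this example.

```python
from collections import defaultdict, namedtuple

# Stand-in for a file object with .name and .metadata
Fastq = namedtuple("Fastq", ["name", "metadata"])


def pair_by_sample(fastqs):
    """Return (read1, read2) tuples, one per sample that has both mates."""
    by_sample = defaultdict(dict)
    for fq in fastqs:
        by_sample[fq.metadata["sample_id"]][fq.metadata["paired_end"]] = fq
    pairs = []
    for sample_id in sorted(by_sample):
        mates = by_sample[sample_id]
        if "1" in mates and "2" in mates:  # skip samples with a missing mate
            pairs.append((mates["1"], mates["2"]))
    return pairs


demo_fastqs = [
    Fastq("sim8_mergeSort_2.fq.gz", {"sample_id": "sim8", "paired_end": "2"}),
    Fastq("sim1_mergeSort_1.fq.gz", {"sample_id": "sim1", "paired_end": "1"}),
    Fastq("sim8_mergeSort_1.fq.gz", {"sample_id": "sim8", "paired_end": "1"}),
    Fastq("sim1_mergeSort_2.fq.gz", {"sample_id": "sim1", "paired_end": "2"}),
    Fastq("sim3_mergeSort_1.fq.gz", {"sample_id": "sim3", "paired_end": "1"}),
]
demo_pairs = [(r1.name, r2.name) for r1, r2 in pair_by_sample(demo_fastqs)]
```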


In [80]:
# The Main Event - run a task for each pair of fastqs!

for i, fq in enumerate(fastqs_paired):
    
    # Create individualized task names with sample ID and current time
    sample_id = fq[0].metadata['sample_id']
    current_time = datetime.now().strftime("%m-%d-%Y %H:%M:%S")
    TASK_NAME = "DREAM_{}_{} - {}".format(DREAM_APP.name, sample_id, current_time)
    
    # Create the input object
    #     - index is the same for all tasks
    #     - iterate over the lists to pair files
    #     - set custom output filename (get sample ID from file)

    INPUTS = {
        "index": RSEM_INDEX,
        "input": fq[0],
        "input_1": fq[1],
        "output_filename": sample_id + "_isoform_quants.tsv"
    }

    # Create the task
    api.tasks.create(name=TASK_NAME, 
                     project=DREAM_PROJECT,
                     app=DREAM_APP, 
                     inputs=INPUTS,
                     run=False) # IMPORTANT! set run=True if you want to run, not just draft the tasks
    print("Task created: {}".format(TASK_NAME))
    print("Input files: {}, {}".format(fq[0].name, fq[1].name))
    print("Output file(s): {}".format(sample_id + "_isoform_quants.tsv"))
    print("\n")

Task created: DREAM_smcIsoform-RSEM-Workflow_sim1 - 08-22-2016 22:50:06
Input files: sim1_mergeSort_1.fq.gz, sim1_mergeSort_2.fq.gz
Output file(s): sim1_isoform_quants.tsv


Task created: DREAM_smcIsoform-RSEM-Workflow_sim8 - 08-22-2016 22:50:08
Input files: sim8_mergeSort_1.fq.gz, sim8_mergeSort_2.fq.gz
Output file(s): sim8_isoform_quants.tsv
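
Before creating many tasks, it can help to confirm that the input dict covers every required port. The sketch below compares a list of required port IDs against an inputs dict; `missing_inputs` is a hypothetical helper. The task details printed later in this notebook show `f` with a value even though the loop above never sets it, so a port flagged here may still receive a value from the app itself.

```python
def missing_inputs(required_ids, inputs):
    """Return the required port IDs that are absent or empty in `inputs`."""
    missing = []
    for port_id in required_ids:
        value = inputs.get(port_id)
        if value is None or value == "":
            missing.append(port_id)
    return sorted(missing)


# Required port IDs as listed by generate_input_object above; values are placeholders
required_ids = ["f", "index", "input", "input_1", "output_filename"]
demo_inputs = {
    "index": "rsem_index.tar.gz",
    "input": "sim1_mergeSort_1.fq.gz",
    "input_1": "sim1_mergeSort_2.fq.gz",
    "output_filename": "sim1_isoform_quants.tsv",
}
unfilled = missing_inputs(required_ids, demo_inputs)
```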




In [153]:
"""
Let's put it all together!
    This last section will give you the power to run an app on all the test files!
    Use it carefully!
"""

DREAM_PROJECT = "gauravdream/dream"
DREAM_APP = get_app_by_name(api, DREAM_PROJECT, app_name='smcIsoform-RSEM-Workflow')
INDEX = get_file_by_name(api, DREAM_PROJECT, filename="rsem_index.tar.gz")
FASTQS = split_fastqs_tuple(fastqs=get_all_fastqs(api, DREAM_PROJECT))

def test_rsem(api, project, app, index, fastqs, run_opt=False):
    
    # Set up list of task objects, which we can call later to check status, grab outputs, etc.
    tasks = []
    
    for i, fq in enumerate(fastqs):

        # Create individualized task names using the file's sample ID and current time
        sample_id = fq[0].metadata['sample_id']
        current_time = datetime.now().strftime("%m-%d-%Y %H:%M:%S")
        task_name = "DREAM_{}_{} - {}".format(app.name, sample_id, current_time)
        output_filename = sample_id + "_isoforms_quants.tsv"

        # Create the input object -- remember to rename based on your app's input IDs!!
        inputs = {
            "index": index,
            "input": fq[0],
            "input_1": fq[1],
            "output_filename": output_filename
        }

        # Create the task
        task = api.tasks.create(name=task_name, project=project, app=app, inputs=inputs, run=run_opt)
        tasks.append(task)
        sleep(5) # good idea to put a short (5-15 sec) break between jobs
        
        # Print information about the tasks
        print("\nTask created: {}".format(task_name))
        print("Input files: {}, {}".format(fq[0].name, fq[1].name))
        print("Output file(s): {}".format(output_filename))
    
    return tasks

# To run tasks, set run_opt to True
RUN_OPT = False
tasks = test_rsem(api, DREAM_PROJECT, DREAM_APP, INDEX, FASTQS, run_opt=RUN_OPT)
print("\n{} tasks created in '{}' with '{}'".format(len(tasks), DREAM_PROJECT, DREAM_APP.name))


Task created: DREAM_smcIsoform-RSEM-Workflow_sim11 - 08-23-2016 15:32:22
Input files: sim11_mergeSort_1.fq.gz, sim11_mergeSort_2.fq.gz
Output file(s): sim11_isoforms_quants.tsv

Task created: DREAM_smcIsoform-RSEM-Workflow_sim13 - 08-23-2016 15:32:29
Input files: sim13_mergeSort_1.fq.gz, sim13_mergeSort_2.fq.gz
Output file(s): sim13_isoforms_quants.tsv

Task created: DREAM_smcIsoform-RSEM-Workflow_sim14 - 08-23-2016 15:32:36
Input files: sim14_mergeSort_1.fq.gz, sim14_mergeSort_2.fq.gz
Output file(s): sim14_isoforms_quants.tsv

Task created: DREAM_smcIsoform-RSEM-Workflow_sim15 - 08-23-2016 15:32:42
Input files: sim15_mergeSort_1.fq.gz, sim15_mergeSort_2.fq.gz
Output file(s): sim15_isoforms_quants.tsv

Task created: DREAM_smcIsoform-RSEM-Workflow_sim16 - 08-23-2016 15:32:49
Input files: sim16_mergeSort_1.fq.gz, sim16_mergeSort_2.fq.gz
Output file(s): sim16_isoforms_quants.tsv

Task created: DREAM_smcIsoform-RSEM-Workflow_sim17 - 08-23-2016 15:32:56
Input files: sim17_mergeSort_1.fq.gz

In [157]:
# The "tasks" list now stores the task objects that you've generated
#     You can refresh the status of each task and print it using these functions
def refresh_task_status(task_object, print_opt=False):
    task_object.reload()
    if print_opt:
        print("'{}' status: {}".format(task_object.name, task_object.status))
    return task_object

def refresh_task_status_list(tasks_objects_list, print_opt=False):
    return [refresh_task_status(t, print_opt) for t in tasks_objects_list]

tasks = refresh_task_status_list(tasks, print_opt=True)

'DREAM_smcIsoform-RSEM-Workflow_sim11 - 08-23-2016 15:32:22' status: DRAFT
'DREAM_smcIsoform-RSEM-Workflow_sim13 - 08-23-2016 15:32:29' status: DRAFT
'DREAM_smcIsoform-RSEM-Workflow_sim14 - 08-23-2016 15:32:36' status: DRAFT
'DREAM_smcIsoform-RSEM-Workflow_sim15 - 08-23-2016 15:32:42' status: DRAFT
'DREAM_smcIsoform-RSEM-Workflow_sim16 - 08-23-2016 15:32:49' status: DRAFT
'DREAM_smcIsoform-RSEM-Workflow_sim17 - 08-23-2016 15:32:56' status: DRAFT
'DREAM_smcIsoform-RSEM-Workflow_sim19 - 08-23-2016 15:33:02' status: DRAFT
'DREAM_smcIsoform-RSEM-Workflow_sim1 - 08-23-2016 15:33:09' status: DRAFT
'DREAM_smcIsoform-RSEM-Workflow_sim21 - 08-23-2016 15:33:15' status: DRAFT
'DREAM_smcIsoform-RSEM-Workflow_sim2 - 08-23-2016 15:33:22' status: DRAFT
'DREAM_smcIsoform-RSEM-Workflow_sim3 - 08-23-2016 15:33:29' status: DRAFT
'DREAM_smcIsoform-RSEM-Workflow_sim4 - 08-23-2016 15:33:35' status: DRAFT
'DREAM_smcIsoform-RSEM-Workflow_sim5 - 08-23-2016 15:33:42' status: DRAFT
'DREAM_smcIsoform-RSEM-Workflo
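
Once tasks are actually running, the refresh step can be repeated until every task has finished. The sketch below polls with `reload()` and stops when all tasks reach a terminal status or a poll limit is hit. It assumes `COMPLETED`, `FAILED`, and `ABORTED` are the terminal statuses. `wait_for_tasks` and `FakeTask` are hypothetical names, and tasks left in `DRAFT` will never finish, so the poll limit matters.

```python
from time import sleep

# Assumed terminal statuses; adjust if your platform reports others
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "ABORTED"}


def wait_for_tasks(tasks, poll_seconds=60, max_polls=120, sleep_fn=sleep):
    """Reload tasks until all are terminal; return False if max_polls runs out first."""
    for attempt in range(max_polls):
        for task in tasks:
            task.reload()
        if all(task.status in TERMINAL_STATUSES for task in tasks):
            return True
        if attempt < max_polls - 1:
            sleep_fn(poll_seconds)
    return False


class FakeTask(object):
    """Stand-in task whose status advances by one step on each reload()."""

    def __init__(self, statuses):
        self._statuses = list(statuses)
        self.status = "DRAFT"

    def reload(self):
        if self._statuses:
            self.status = self._statuses.pop(0)
        return self


demo_tasks = [FakeTask(["RUNNING", "COMPLETED"]), FakeTask(["COMPLETED"])]
all_done = wait_for_tasks(demo_tasks, poll_seconds=0, max_polls=5, sleep_fn=lambda s: None)
```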

In [158]:
"""
Evaluation!
    After your tasks have completed, you can evaluate the "truthiness" of your tool's output.
    For your convenience, the truth file of each sample has been annotated with its sample_id,
    so you can easily grab the truth file for each task using the sample_id of the inputs.
    
    For example, the files with "sample_id" = "sim11":
        sim11_mergeSort_1.fq.gz     # Training data
        sim11_mergeSort_2.fq.gz     # Training data
        sim11_isoforms_truth.txt    # Isoform quantification truth file
        sim11_filtered.bedpe        # Gene fusion detection truth file
    
    The evaluation workflows are:
        DREAM Isoform Quantification Evaluation Workflow
        DREAM Fusion Detection Evaluation Workflow
"""

# Grab the evaluation app and print the input object
EVAL_DREAM_APP = get_app_by_name(api, project=DREAM_PROJECT, app_name="DREAM Isoform Quantification Evaluation Workflow")
generate_input_object(EVAL_DREAM_APP, required=False, print_opt=True)

{   'gtf': '', 'input': '', 'truth': ''}


In [156]:
# Forgot your tasks' inputs and outputs? You can recall them easily:
print_task = tasks[0]
print("Task: {}".format(print_task.name))
print("\nStatus: {}".format(refresh_task_status(print_task).status))
print("\nInputs: {}".format(print_task.inputs))
print("\nOutputs: {}".format(print_task.outputs))

Task: DREAM_smcIsoform-RSEM-Workflow_sim11 - 08-23-2016 15:32:22
Status: DRAFT

Inputs: {u'index': <File: id=57a9f513e4b0a2cad67e8581>, u'f': u'"1,6"', u'pairedend': True, u'output_filename': u'sim11_isoforms_quants.tsv', u'threads': 8, u'input_1': <File: id=57a9f513e4b0a2cad67e854b>, u'input': <File: id=57a9f513e4b0a2cad67e84db>, u'strandspecific': True}

Outputs: {u'OUTPUT': None}


In [154]:
# Evaluate RSEM output function
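def evaluate_rsem(api, project, eval_app, task_objects, run_opt=False):

    # Collect the evaluation tasks across all iterations
    #     (initialized once, before the loop, so earlier tasks are not discarded)
    eval_tasks = []

    # Iterate over each task, grab the truth file, and set inputs
    for i, task in enumerate(task_objects):
        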
        # Create individualized task names using the input file's sample ID and current time
        sample_id = task.inputs['input'].metadata['sample_id'] # IMPORTANT!! For your submission, you should set label as "TUMOR_FASTQ_*" and replace "input" here with "TUMOR_FASTQ_1"
        current_time = datetime.now().strftime("%m-%d-%Y %H:%M:%S")
        eval_task_name = "Evaluation_{}_{} - {}".format(eval_app.name, sample_id, current_time)

        # Grab the appropriate output from our task
        input_tsv = task.outputs['OUTPUT']
        
        # Here's the fun part - grab the truth file for this sample
        truth_txt = get_file_by_name(api, project, filename=sample_id + "_isoforms_truth.txt")
        
        # Get the GTF file used for evaluation
        input_gtf = get_file_by_name(api, project=project, filename="Homo_sapiens.GRCh37.75.gtf.txt")
        
        # Create the input object -- remember to rename based on your app's input IDs!
        eval_inputs = {   
                        'gtf': input_gtf, 
                        'input': input_tsv, 
                        'truth': truth_txt
                      }

        # Create the task
        eval_task = api.tasks.create(name=eval_task_name, project=project, app=eval_app, inputs=eval_inputs, run=run_opt)
        eval_tasks.append(eval_task)
        sleep(10)
        
        # Print information about the tasks
        print("\nTask created: {}".format(eval_task_name))
        print("Inputs: {}, {}, {}".format(input_gtf.name, input_tsv.name, truth_txt.name))
        print("Output(s): {}".format(task.outputs))
    
    return eval_tasks

# Check if all tasks are completed, then trigger evaluation
tasks = refresh_task_status_list(tasks)
if all([task.status == "COMPLETED" for task in tasks]):
    evaluation_tasks = evaluate_rsem(api, DREAM_PROJECT, EVAL_DREAM_APP, task_objects=tasks, run_opt=False)
else:
    print("No tasks drafted or started. Inputs unavailable.")

No tasks drafted or started. Inputs unavailable.
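
The check above is all-or-nothing: one unfinished task blocks every evaluation. As a possible refinement, the sketch below summarizes task statuses and selects only the completed tasks, which could then be passed as `task_objects`. `status_counts`, `completed_tasks`, and `TaskStub` are hypothetical names, and only a `.status` attribute is assumed on each task.

```python
from collections import Counter, namedtuple

# Stand-in for a task object with .name and .status
TaskStub = namedtuple("TaskStub", ["name", "status"])


def status_counts(tasks):
    """Count how many tasks are in each status."""
    return dict(Counter(task.status for task in tasks))


def completed_tasks(tasks):
    """Return only the tasks whose status is COMPLETED."""
    return [task for task in tasks if task.status == "COMPLETED"]


demo_tasks = [
    TaskStub("sim1", "COMPLETED"),
    TaskStub("sim8", "RUNNING"),
    TaskStub("sim11", "COMPLETED"),
]
summary = status_counts(demo_tasks)
ready = [task.name for task in completed_tasks(demo_tasks)]
```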
