# How can I split a workflow and create paired batch tasks?
### Overview
Some workflows are only **partially** amenable to batch processing. A good example is the [Cufflinks](http://cole-trapnell-lab.github.io/cufflinks/cuffcompare/) suite, used to calculate _differential expression_ from RNA-seq files. The Public workflow for this is [RNA-seq Differential Expression](https://igor.sbgenomics.com/public/apps#workflow/sevenbridges/public-apps/rna-seq-differential-expression). Here the first stage (_Cuffquant_) can be parallelized so that each task processes a single tumor-normal pair. The second stage (_Cuffdiff_) then processes all of the Cuffquant outputs together. A **scientifically valid** tutorial is [here](https://github.com/sbg/okAPI/blob/advanced_access/Tutorials/CGC/thyroid.ipynb) for anyone with TCGA Controlled-Data access and a [Cancer Genomics Cloud](https://cgc.sbgenomics.com) account.

Here we are going to use an **arbitrary** organization (_alphabetical_) to match **randomly** between _Urothelial Bladder Carcinoma_ (BLCA) vs _Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma_ (CESC) samples rather than using the UUID in the metadata to match _normal_ vs _tumor_ samples.

This tutorial is conceptually similar to [batch_o_tasks_standard](batch_o_tasks_standard.ipynb); however, here we are creating the batch of _single tasks_ **within** the API instead of using the _batch-task functionality_ shown in that tutorial.

 1. Create a project
 2. (optional) Add members
 3. Copy RNA-Seq bam files from the Public [CCLE project](https://igor.sbgenomics.com/u/sevenbridges/cancer-cell-line-encyclopedia-ccle/)
 4. Upload workflows which were previously modified in Public Apps
 5. Organize input files alphabetically
 6. Run a set of first-stage processing
 7. Wait for the first-stage to finish
 8. Pass the outputs of the first-stage to the second-stage.
 
Throughout this **tutorial**, we will link back to different **recipes** in case you need more detail about the calls. We will also link to the **documentation** for each call. Both links will be under the **PROTIPS** section heading at the end of the markdown section.

### Prerequisites
 1. You need your _authentication token_ and the API needs to know about it. See <a href="Setup_API_environment.ipynb">**Setup_API_environment.ipynb**</a> for details.
 2. You have cloned the Public Project _Cancer Cell Line Encyclopedia (CCLE)_. We will walk through that in the markdown of Step 3.

 
### WARNING
This will burn through some processing credits (**about \$1** per pair of files, **\$10** total for eight input pairs followed by second-stage processing). You can create _DRAFT_ tasks without running them, just to see how it works. To do this, remove `run = True` from the following call:

```python
    my_task = api.tasks.create(name=task_name, project=my_project, \
                           app=my_app, inputs=inputs, \
                           run = True)     
```
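One way to make the draft-versus-run choice explicit is a small wrapper with a flag. This is only a sketch: `submit_task` and `dry_run` are hypothetical names introduced here, and the only library call it relies on is `api.tasks.create(..., run=...)` as used throughout this tutorial.

```python
def submit_task(api, name, project, app, inputs, dry_run=True):
    """Create a task and leave it as a DRAFT when dry_run is True.

    `submit_task` and `dry_run` are hypothetical names for this sketch;
    the call itself mirrors the api.tasks.create usage in this tutorial.
    """
    return api.tasks.create(name=name, project=project, app=app,
                            inputs=inputs, run=not dry_run)
```

With `dry_run=True` no credits should be spent, since the task is created but not started; a draft can be started later, as with the commented-out `my_task.run()` at the end of this tutorial.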

## Imports
We import the _Api_ class from the official sevenbridges-python bindings below.

In [None]:
import sevenbridges as sbg

## Initialize the object
The _Api_ object needs to know your **auth\_token** and the correct path. Here we assume you are using the .sbgrc file in your home directory. For other options, see <a href="Setup_API_environment.ipynb">Setup_API_environment.ipynb</a>.

In [None]:
# [USER INPUT] Specify platform {cgc, sbg}
prof = 'sbpla'


config_config_file = sbg.Config(profile=prof)
api = sbg.Api(config=config_config_file)

## 1) Create a  new project
We create a project using your first billing group. The project is described by a small dictionary:
* **billing_group** *Billing group* that will be charged for this project
* **description**   (optional) Project description
* **name**   Name of the project, may be *non-unique*<sup>1</sup>
* **tags**   List of tags, currently _unused_. **cannot** be set while creating project

#### PROTIPS
 * A detailed _recipe_ for creating projects is [here](../../Recipes/SBPLAT/projects_makeNew.ipynb)
 * Detailed documentation of this particular REST architectural style request is available [here](http://docs.sevenbridges.com/docs/create-a-new-project)

In [None]:
# [USER INPUT] Set project name here:
new_project_name = 'ML'                          
      
    
# What are my funding sources?
billing_groups = api.billing_groups.query()  

# Pick the first group (arbitrary)
print((billing_groups[0].name + \
       ' will be charged for computation and storage (if applicable) for your new project'))

# Set up the information for your new project
new_project = {
        'billing_group': billing_groups[0].id,
        'description': """A project created by the API recipe (projects_makeNew.ipynb).
                      This also supports **markdown**
                      _Pretty cool_, right?
                   """,
        'name': new_project_name
}

# check if this project already exists. LIST all projects and check for name match
my_project = [p for p in api.projects.query(limit=100).all() \
              if p.name == new_project_name]      
              
if my_project:    # exploit fact that empty list is False, {list, tuple, etc} is True
    print('A project with the name (%s) already exists, please choose a unique name' \
          % new_project_name)
    raise KeyboardInterrupt
else:
    # CREATE the new project
    my_project = api.projects.create(name = new_project['name'], \
                                     billing_group = new_project['billing_group'], \
                                     description = new_project['description'])
    
    # (re)list all projects, and get your new project
    my_project = [p for p in api.projects.query(limit=100).all() \
              if p.name == new_project_name][0]

## 2) (optional) Add project members
Teamwork - it gets stuff done! If you want to add some members to your project, run the next cell.

#### PROTIPS
 * A detailed _recipe_ for adding members to project is [here](../../Recipes/SBPLAT/projects_addMembers.ipynb)
 * Detailed documentation of this particular REST architectural style request is available [here](http://docs.sevenbridges.com/docs/add-a-member-to-a-project)

In [None]:
# [USER INPUT] List names of members to add (prefilled with Jacqueline & Fede):
user_names =['jrosains',
            'ftorri']


# Permissions - here we are assigning all users the same permissions (could also be a list)
user_permissions = {'write': True,
                    'read': True,
                    'copy': True,
                    'execute': False,
                    'admin': False
                    }

for name in user_names:
    my_project.add_member(user = name, permissions = user_permissions)

## 3) Copy RNA-Seq bam files from the CCLE project
There is a helpful Public Project on the Seven Bridges Platform called CCLE. We are going to take all of our files from there. The first step, which cannot be done with the API, is to clone that project.

### Clone the project (GUI)
Log in to the Seven Bridges [Platform]() and click on **Public projects**. In the drop-down menu, select _Cancer Cell Line Encyclopedia (CCLE)_. Near the top of the screen, press the blue button **Copy this project**.

<img src = "images/CCLE_0.png" height="462" width="780"> 

A dialog box will ask for the new project name. You can just press the **Copy** button.

<img src = "images/CCLE_1.png" height="288" width="405"> 

You will be taken to your new project.

<img src = "images/CCLE_2.png" height="416" width="780"> 

### Search and copy files
Now that we have the project copied, we also have all of the files. We will search files within that project and copy the ones which fit our criteria - listed here:

 * experimental strategy is **RNA-Seq**
 * file extension is **bam**
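The filter-then-copy logic can be sketched as two small helpers that work on any objects with a `name` attribute. `pick_bam_files` and `copy_missing` are hypothetical names for illustration, not part of the sevenbridges library, and the `copy_file` callable is an assumption that stands in for something like `lambda f: f.copy(project=my_project)`.

```python
def pick_bam_files(files, limit):
    """Keep files whose name ends in '.bam', up to `limit` of them."""
    bams = [f for f in files if f.name.endswith('.bam')]
    return bams[:limit]


def copy_missing(files, existing_names, copy_file):
    """Copy only the files that are not already present in the target project.

    copy_file is any callable that copies one file and returns the copy.
    Files whose name is in existing_names are skipped, not returned.
    """
    copied = []
    for f in files:
        if f.name in existing_names:
            continue
        copied.append(copy_file(f))
    return copied
```

Keeping the copy step behind a callable means the same helper could serve both the CESC and BLCA groups, and it can be tried out with stand-in objects before touching real files.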

#### PROTIPS
 * A detailed, related _recipe_ for copying files from a project is [here](../../Recipes/SBPLAT/files_copyFromMyProject.ipynb)
 * Detailed documentation of these particular REST architectural style requests is available [here (list files)](http://docs.sevenbridges.com/v1.0/docs/list-files-primary-method) and [here (copy files)](http://docs.sevenbridges.com/docs/copy-a-file)

In [None]:
# [USER INPUT] Set the source project name:
source_project_name = 'Copy of Cancer Cell Line Encyclopedia (CCLE)'  
files_to_copy = 8
reference_genome = 'HG19_Broad_variant.fasta'
annotations = 'Homo_sapiens.GRCh37.68.gtf'


# get details of your source project
source_project = [p for p in api.projects.query(limit=100).all() \
                  if p.name == source_project_name]

if not source_project:  # exploit fact that empty list is False, {list, tuple, etc} is True
    print('Source project (%s) not found, check spelling' % source_project_name)
    raise KeyboardInterrupt
else:
    source_project = source_project[0]

In [None]:
# [CESC files] list all files in source project that are RNA-Seq; keep only bam files
source_files = api.files.query(limit = 100, project = source_project, \
                              metadata = {'experimental_strategy' : 'RNA-Seq',
                                          'sample_type' : 'Cell Line',
                                          'investigation' : 'CCLE-CESC'})
source_files = [f for f in source_files.all() if \
               f.name[-3:] == 'bam']

# List the files you already have
my_file_names = [f.name for f in \
                 api.files.query(limit = 100, project = my_project.id).all()]

# Copy files to your project
CESC_files = []    # will use this list later as an input
count = 0
for f in source_files:
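    if f.name in my_file_names:
        print('file already exists in your project, reusing it')
        # Fetch the existing copy so it is still available as a task input
        CESC_files.append(api.files.query(limit = 100, project = my_project, \
                                          names = [f.name])[0])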
    else:
        print('File (%s) does not exist in Project (%s); copying now' % \
          (f.name, my_project.name))
        new_f = f.copy(project = my_project)
        CESC_files.append(new_f)
    count += 1
    if count >= files_to_copy:
        break

In [None]:
# [BLCA files] list all files in source project that are RNA-Seq; keep only bam files
source_files = api.files.query(limit = 100, project = source_project, \
                              metadata = {'experimental_strategy' : 'RNA-Seq',
                                          'sample_type' : 'Cell Line',
                                          'investigation' : 'CCLE-BLCA'})
source_files = [f for f in source_files.all() if \
               f.name[-3:] == 'bam']

# List the files you already have
my_file_names = [f.name for f in \
                 api.files.query(limit = 100, project = my_project.id).all()]

# Copy files to your project
BLCA_files = []    # will use this list later as an input
count = 0
for f in source_files:
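    if f.name in my_file_names:
        print('file already exists in your project, reusing it')
        # Fetch the existing copy so it is still available as a task input
        BLCA_files.append(api.files.query(limit = 100, project = my_project, \
                                          names = [f.name])[0])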
    else:
        print('File (%s) does not exist in Project (%s); copying now' % \
          (f.name, my_project.name))
        new_f = f.copy(project = my_project)
        BLCA_files.append(new_f)
    count += 1
    if count >= files_to_copy:
        break

In [None]:
# Reference Genome
ref_file = api.files.query(limit = 100, \
                           project = source_project, \
                           names = [reference_genome])[0]

if ref_file.name in my_file_names:
    print('file already exists in your project, skipping')
    ref_genome = api.files.query(limit = 100, \
                       project = my_project, \
                       names = [reference_genome])[0]
else:
    print('File (%s) does not exist in Project (%s); copying now' % \
      (ref_file.name, my_project.name))
    ref_genome = ref_file.copy(project = my_project)
    
# Annotations
ref_file = api.files.query(limit = 100, \
                           project = source_project, \
                           names = [annotations])[0]

if ref_file.name in my_file_names:
    print('file already exists in your project, skipping')
    annotations = api.files.query(limit = 100, \
                           project = my_project, \
                           names = [annotations])[0]
else:
    print('File (%s) does not exist in Project (%s); copying now' % \
      (ref_file.name, my_project.name))
    annotations = ref_file.copy(project = my_project)

## 4) Create a workflow from the Application JSON
We will load a tool from its JSON here because it has been modified from the version in _Public Apps_. This is _not_ the most common user flow, but it may be useful to see. We need to import _json_ here to do this correctly. Please be **careful** when exporting and importing Apps, as normal _copy-paste_ operations may introduce JSON formatting errors.
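Because copy-paste damage tends to show up as a parse error, it can help to validate each exported file before installing it. The helper below is a minimal sketch using only the standard library; `load_app_json` is a hypothetical name, not part of the sevenbridges library.

```python
import json


def load_app_json(path):
    """Read an exported app description and fail early if it is not valid JSON."""
    with open(path, 'r') as handle:
        try:
            return json.load(handle)
        except ValueError as err:
            # json.JSONDecodeError is a subclass of ValueError
            raise ValueError('%s is not valid JSON: %s' % (path, err))
```

A file that loads cleanly here is at least well-formed JSON; whether it is a valid app description is still decided by the platform when the app is installed.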

#### Note
These JSON files were exported after modifications to the Public App detailed in screenshots [here](https://github.com/sbg/okAPI/blob/advanced_access/Tutorials/CGC/thyroid.ipynb)

#### PROTIPS
 * Detailed documentation of this particular REST architectural style request is available [here](http://docs.sevenbridges.com/docs/add-an-app-using-raw-cwl)

In [None]:
# Load the Application JSONs
import json

# Use context managers so the file handles are closed after reading
with open('files/rna-seq-diff-expression-first.json', 'r') as f:
    first = json.load(f)

with open('files/rna-seq-diff-expression-second.json', 'r') as f:
    second = json.load(f)

# Create the Workflows
a_first_id = (my_project.id + '/first')
my_app_first = api.apps.install_app(id =a_first_id, raw = first)

a_second_id = (my_project.id + '/second')
my_app_second = api.apps.install_app(id =a_second_id, raw = second)

## Organize files into a **cohort**
Here we don't have matched samples, so we are organizing _arbitrarily_. However, it is _straightforward_ to reorganize the data by UUID or some other metadata. We will scan through the metadata in both **CESC_files** and **BLCA_files** and sort by _Case ID_ alphabetically.
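To make the difference between arbitrary and matched pairing concrete, here is a sketch of both. It assumes only that each file object exposes a `metadata` mapping with a `case_id` key, as in this tutorial; `pair_alphabetically` and `pair_by_metadata` are hypothetical helper names.

```python
def pair_alphabetically(group_a, group_b, key='case_id'):
    """Sort each group by a metadata value and pair them position by position."""
    a = sorted(group_a, key=lambda f: f.metadata[key])
    b = sorted(group_b, key=lambda f: f.metadata[key])
    return list(zip(a, b))   # zip stops at the shorter of the two lists


def pair_by_metadata(group_a, group_b, key='case_id'):
    """Pair files from two groups only when they share the same metadata value."""
    lookup = {f.metadata[key]: f for f in group_b}
    return [(f, lookup[f.metadata[key]]) for f in group_a
            if f.metadata[key] in lookup]
```

The first function reproduces the arbitrary ordering used here; the second is closer to what a real tumor-normal design would need, since files without a partner are dropped rather than paired with an unrelated sample.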

In [None]:
# Create a list of case_ids (or any other metadata)
case_id_CESC = []
for f in CESC_files:
    case_id_CESC.append(f.metadata['case_id'])
    
case_id_BLCA = []
for f in BLCA_files:
    case_id_BLCA.append(f.metadata['case_id'])

# Sort lists alphabetically
ind_CESC = sorted(range(len(case_id_CESC)), key=lambda k: case_id_CESC[k])
ind_BLCA = sorted(range(len(case_id_BLCA)), key=lambda k: case_id_BLCA[k])

## Build and run _first-stage_ tasks
Here we use the API to create a _new\_task_ dictionary that we will use for each pair of files. All of the front-end tasks will be drafted and started within seconds.

#### Note
Here we are not doing any error checking as in [batch_o_tasks_standard](batch_o_tasks_standard.ipynb); instead we are firing up the tasks directly.

#### PROTIPS
* Detailed documentation of this particular REST architectural style request is available [here](http://docs.sevenbridges.com/docs/create-a-new-task)

In [None]:
my_task_list = []

# Loop through the CESC/BLCA pairs (stop at the shorter list to avoid an IndexError)
for ii in range(min(len(ind_CESC), len(ind_BLCA))):
    # Format the JSON to pass values to the FRONT-END workflow (frontend RNA-seq cuffquant)
    task_name = 'first_stage_%i' % (ii)
    inputs = {
        'Reference' : ref_genome,
        'Annotations' : annotations,
        'BAM_Group_A' : [CESC_files[ind_CESC[ii]]],
        'BAM_Group_B' : [BLCA_files[ind_BLCA[ii]]],
        'library_type' : 'fr-unstranded',
        'group_name' : 'CESC',
        'group_name_1' : 'BLCA'
    }

    my_task = api.tasks.create(name=task_name, project=my_project, \
                               app=my_app_first, inputs=inputs, \
                               run = True)
    my_task_list.append(my_task)

if len(my_task_list) > 0:
    print("""
    %i tasks have been created. Enjoy a break, treat yourself to a muffin, 
    and come back to us once you've gotten an email that tasks are done.
    (alternatively, use the task monitoring cells below)""" % (len(my_task_list)))

## Check task completion
These tasks may take a long time to complete. Here are two ways to check in on them:
* Wait for email confirmation <sup>1</sup>
* Ping the task to see its _status_. Here we use a 10 min interval; adjust it appropriately for longer or shorter workflows

<sup>1</sup> Emails will arrive regardless of whether the task was started by GUI or API. These notifications can be configured [here](http://docs.sevenbridges.com/docs/account-settings#section-manage-email-notifications)
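The polling approach can also be written as a reusable helper that checks every pending task on each pass instead of waiting on them one at a time. This is a sketch under stated assumptions: `wait_for_all` and `FINISHED` are hypothetical names, and `get_status` is whatever callable returns a task's status string, for example `lambda t: t.get_execution_details().status` as used in this tutorial.

```python
from time import sleep

# Status values treated as final, matching the checks in this tutorial
FINISHED = ('COMPLETED', 'FAILED', 'ABORTED')


def wait_for_all(tasks, get_status, interval=600, pause=sleep):
    """Block until every task reaches a final state; return {task name: status}."""
    results = {}
    pending = list(tasks)
    while pending:
        still_running = []
        for t in pending:
            status = get_status(t)
            if status in FINISHED:
                results[t.name] = status
            else:
                still_running.append(t)
        pending = still_running
        if pending:
            pause(interval)   # wait before asking again
    return results
```

Returning the final status per task makes it easy to decide afterwards whether to continue with the second stage or to inspect the failed tasks first.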

#### PROTIPS
* The closest recipe for _monitoring tasks_ is [here](../../Recipes/SBPLAT/tasks_monitorAndGetResults.ipynb)
* Detailed documentation of this particular REST architectural style request is available [here](http://docs.sevenbridges.com/docs/perform-an-action-on-a-specific-task)

In [None]:
# [USER INPUT] Set loop time (seconds):
loop_time = 600


from time import sleep

for t in my_task_list:
    # Check on one task at a time;
    #   if ANY are running, we are NOT done (no sense in querying the others)
    print('Pinging SBPLAT for task completion')
    
    flag = {'task_running': True}
    while flag['task_running']:
        details = t.get_execution_details()
        if details.status == 'COMPLETED':
            flag['task_running'] = False
            print('Task has completed, life is beautiful')
        elif details.status  == 'FAILED' or details.status == 'ABORTED':  
            print('Task (%s) failed, check it out' % t.name)
            flag['task_running'] = False
        else:
            sleep(loop_time)

## Build and run _second-stage_ task
This is similar to our approach with the front-end tasks. We first get all the files in the project (there will be more since our front-end tasks have completed<sup>2</sup>). Then we search for the file extension **.cxb**. A more elegant option would be to query the outputs of the tasks like this:

```python
    task_details = api.tasks.get(id = my_task_list[0].id)
    print(task_details.outputs)
```
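Building on that idea, the sketch below gathers the abundance files directly from a list of tasks. It assumes that `outputs` behaves like a dictionary whose values are a single file, a list of files, or nothing, and that each file has a `name` attribute; `collect_outputs` is a hypothetical helper name, and the exact shape of `outputs` should be checked against your own tasks.

```python
def collect_outputs(tasks, suffix='.cxb'):
    """Gather output files whose name ends with `suffix` from a list of tasks."""
    found = []
    for t in tasks:
        for value in (t.outputs or {}).values():
            # An output port may hold one file, a list of files, or nothing
            items = value if isinstance(value, list) else [value]
            for item in items:
                name = getattr(item, 'name', None)
                if name and name.endswith(suffix):
                    found.append(item)
    return found
```

Compared with scanning the whole project by file extension, this only picks up files produced by the tasks you pass in, so stale `.cxb` files from earlier runs would not leak into the second stage.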

#### Notes
Here we are not doing any error checking as in [batch_o_tasks_standard](batch_o_tasks_standard.ipynb); instead we are firing up the tasks directly.

<sup>2</sup> Note this means we _ignore_ any failed tasks. You can choose your own adventure here, e.g. re-running (or QC-ing) failed tasks, all-or-none processing, etc.

#### PROTIPS
* Detailed documentation of this particular REST architectural style request is available [here](http://docs.sevenbridges.com/docs/create-a-new-task)

In [None]:
# Recheck the files in the project, make a list of the abundances (cxb)
my_files = api.files.query(limit = 100, project = my_project)
abundances = [f for f in my_files.all() if f.name[-3:] == 'cxb']

# Format the JSON to pass values to the BACK-END workflow [for backend RNA-seq cuffdiff]
task_name = 'second_stage'
inputs = {
    'Reference' : ref_genome,
    'Annotations' : annotations,
    'sample_files' : abundances,
    'FDR': 0.05,
    'library_type': u'ff-unstranded',
    'min_reps_for_js_test': 3,
    'library_norm_method': u'classic-fpkm',
    'dispersion_method': u'per-condition'
}
    
my_task = api.tasks.create(name=task_name, project=my_project, \
                               app=my_app_second, inputs=inputs)
# my_task.run()
    
print("You've made it to the end, yaay!")