# How can I split a workflow and create paired batch tasks?
### Overview
There are some workflows which are only **partially** amenable to batch processing. A good example is the [Cufflinks](http://cole-trapnell-lab.github.io/cufflinks/cuffcompare/) suite, which calculates _differential expression_ from RNA-seq files.

Here we are going to use an **arbitrary** organization (_alphabetical_) to match **randomly** between _Urothelial Bladder Carcinoma_ (BLCA) vs _Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma_ (CESC) samples rather than using the UUID in the metadata to match _normal_ vs _tumor_ samples.

This tutorial is conceptually similar to [batch_o_tasks_standard](batch_o_tasks_standard.ipynb); however, here we are creating the batch of _single tasks_ **within** the API instead of using the _batch-task functionality_ shown in that tutorial.

 1. Create a project
 2. (optional) Add members
 3. Copy RNA-Seq bam files from the Public [CCLE project](https://igor.sbgenomics.com/u/sevenbridges/cancer-cell-line-encyclopedia-ccle/)
 4. Upload workflows
 5. Organize input files alphabetically
 6. Run a set of first-stage tasks
 7. Wait for the first stage to finish
 8. Pass the outputs of the first stage to the second stage
 
Throughout this **tutorial**, we will link back to different **recipes** in case you need more detail about the calls. We will also link to the **documentation** for each call. Both links will be under the **PROTIPS** section heading at the end of the markdown section.

### Prerequisites
 1. You need your _authentication token_ and the API needs to know about it. See <a href="Setup_API_environment.ipynb">**Setup_API_environment.ipynb**</a> for details.
 
### WARNING
This will burn through some processing credits. You can create _DRAFT_ tasks without running them, just to see how it works. To do this, remove `run = True` from the following call:

```python
    my_task = api.tasks.create(name=task_name, project=my_project, \
                           app=my_app, inputs=inputs, \
                           run = True)     
```
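
One way to make this switch explicit is to build the keyword arguments in one place and add `run=True` only when you actually want to spend credits. This is a minimal sketch: the helper `task_kwargs` and its `dry_run` flag are hypothetical names introduced here for illustration, and the string placeholders stand in for real project and app objects.

```python
def task_kwargs(name, project, app, inputs, dry_run=True):
    """Collect keyword arguments for api.tasks.create().

    When dry_run is True, 'run' is left out so the task stays a DRAFT.
    """
    kwargs = {'name': name, 'project': project, 'app': app, 'inputs': inputs}
    if not dry_run:
        kwargs['run'] = True
    return kwargs


# Placeholders only; in the notebook these would be my_project, my_app, inputs
draft = task_kwargs('first_stage_0', 'my-project', 'my-app', {}, dry_run=True)
live = task_kwargs('first_stage_0', 'my-project', 'my-app', {}, dry_run=False)
print(sorted(draft))
print(sorted(live))
```

The resulting dictionary could then be passed on as `api.tasks.create(**task_kwargs(...))`.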

## Imports
We import the _Api_ class from the official sevenbridges-python bindings below.

In [1]:
import sevenbridges as sbg

## Initialize the object
The _Api_ object needs to know your **auth\_token** and the correct path. Here we assume you are using the .sbgrc file in your home directory. For other options see <a href="Setup_API_environment.ipynb">Setup_API_environment.ipynb</a>

In [2]:
# [USER INPUT] specify platform {cgc, sbpla, etc}
prof = 'default'


config_file = sbg.Config(profile=prof)
api = sbg.Api(config=config_file)

## 1) Create a new project
We create a project using your first billing group. The project is described by a small dictionary:
* **billing_group** *Billing group* that will be charged for this project
* **description**   (optional) Project description
* **name**   Name of the project, may be *non-unique*<sup>1</sup>

#### PROTIPS
 * A detailed _recipe_ for creating projects is [here](../../Recipes/SBPLAT/projects_makeNew.ipynb)
 * Detailed documentation of this particular REST architectural style request is available [here](http://docs.sevenbridges.com/docs/create-a-new-project)

In [5]:
# [USER INPUT] Set project name here:
new_project_name = 'MAL'                          
      
    
# check if this project already exists. LIST all projects and check for name match
# Note that you can have more than one project with the same name. It is best practice to find things by ID.
my_project_exists = [p for p in api.projects.query(limit=100).all() 
              if p.name == new_project_name]      
              
if my_project_exists:    # exploit fact that empty list is False
    # If a project with the same name already exists, reuse the existing one
    my_project = my_project_exists[0]

    print('Project {} will be reused for next steps.'.format(my_project.id))

else: 
    # What are my funding sources?
    billing_groups = api.billing_groups.query()  

    # Pick the first group (arbitrary)
    print((billing_groups[0].name +
           ' will be charged for computation and storage (if applicable) for your new project'))

    # Set up the information for your new project
    new_project = {
            'billing_group': billing_groups[0].id,
            'description': """A project created by the API recipe (projects_makeNew.ipynb).
                          This also supports **markdown**
                          _Pretty cool_, right?
                       """,
            'name': new_project_name
    }
    
    # CREATE the new project
    my_project = api.projects.create(
        name=new_project['name'], 
        billing_group=new_project['billing_group'],
        description=new_project['description'],
    )

    print('Your new project {} has been created.'.format(my_project.name))

BIX_Customer_project_Belgrade will be charged for computation and storage (if applicable) for your new project
Your new project MAL has been created.


## 2) (optional) Add project members
Teamwork - it gets stuff done! If you want to add some members to your project, run the next cell.

#### PROTIPS
 * A detailed _recipe_ for adding members to project is [here](../../Recipes/SBPLAT/projects_addMembers.ipynb)
 * Detailed documentation of this particular REST architectural style request is available [here](http://docs.sevenbridges.com/docs/add-a-member-to-a-project)

In [None]:
# [USER INPUT] List names of members to add (prefilled with Jacqueline & Fede):
user_names =['jrosains',
            'ftorri']


# Permissions - here we are assigning all users the same permissions (could also be a list)
user_permissions = {'write': True,
                    'read': True,
                    'copy': True,
                    'execute': False,
                    'admin': False
                    }

for name in user_names:
    my_project.add_member(user=name, permissions=user_permissions)

## 3) Copy RNA-Seq bam files from the CCLE project
The Cancer Cell Line Encyclopedia (CCLE) public project contains Open Access sequencing data (in the form of reads aligned to the hg19 broad variant reference genome) for nearly 1000 cancer cell line samples. You can use the data contained within this project for your analyses on the Platform. Learn more about the [CCLE public project](http://docs.sevenbridges.com/docs/ccle).

For this tutorial, we will obtain our files from the CCLE public project. To do so, we will specify the project ID of the Public Project. 

### (OPTIONAL) Clone the project (GUI)
We can also clone this project in the visual interface. This step cannot be done with the API. After cloning, the project will be available in your project list.
Log in to the Seven Bridges [Platform](https://igor.sbgenomics.com) and click on **Public Projects**. From that page, click the **Copy Project** action for **Cancer Cell Line Encyclopedia (CCLE)**.

<img src = "images/CCLE_0.png" height="462" width="780"> 

A dialog box prompts you for the new project name. Rename the project or simply press the **Copy** button.

<img src = "images/CCLE_1.png" height="288" width="405"> 

You can then go to your new project.

<img src = "images/CCLE_2.png" height="416" width="780"> 

### Search and copy files
Now that we have the project copied, we can access all of its files. We will search files within that project and copy the files containing:

 * an experimental strategy of **RNA-Seq**
 * a file extension of **bam**

#### PROTIPS
 * A detailed, related recipe for copying files from a project is [here](../../Recipes/SBPLAT/files_copyFromMyProject.ipynb).
 * Detailed documentation of these particular REST architectural style requests is available [here (list files)](http://docs.sevenbridges.com/v1.0/docs/list-files-primary-method) and [here (copy files)](http://docs.sevenbridges.com/docs/copy-a-file).

In [22]:
# [USER INPUT] Set the source project name:
source_project_id = 'sevenbridges/cancer-cell-line-encyclopedia-ccle-1'  
files_to_copy = 8
reference_genome = 'HG19_Broad_variant.fasta'
annotations_file = 'Homo_sapiens.GRCh37.75.gtf'


# get details of your source project
source_project = api.projects.get(source_project_id)

In [23]:
# [CESC files] list all RNA-Seq files in the source project; keep only bam files
source_files = api.files.query(limit=100, project=source_project,
                              metadata={'experimental_strategy' : 'RNA-Seq',
                                        'sample_type' : 'Cell Line',
                                        'investigation' : 'CCLE-CESC'})
source_files = [f for f in source_files.all() if
               f.name[-3:] == 'bam']

# List the files you already have
my_file_names = [f.name for f in api.files.query(limit = 100, project = my_project.id).all()]

# Copy files to your project
CESC_files = []    # will use this list later as an input
count = 0
for f in source_files:
    if f.name in my_file_names:
        print('File ({}) already exists in your project, skipping'.format(f.name))
        CESC_files.append(api.files.query(project=my_project, names =[f.name])[0])
    else:
        print('File ({}) does not exist; copying now'.format(f.name))
        new_f = f.copy(project = my_project)
        CESC_files.append(new_f)
    count += 1
    if count >= files_to_copy:
        break

File (G25239.MFE-296.1.bam) already exists in your project, skipping
File (G26218.HEC-1-A.2.bam) already exists in your project, skipping
File (G26254.EFE-184.2.bam) already exists in your project, skipping
File (G27259.AN3_CA.1.bam) already exists in your project, skipping
File (G27265.SNG-M.1.bam) already exists in your project, skipping
File (G27326.EN.1.bam) already exists in your project, skipping
File (G27372.COLO_684.1.bam) already exists in your project, skipping
File (G27455.RL95-2.2.bam) already exists in your project, skipping


In [24]:
# [BLCA files] list all RNA-Seq files in the source project; keep only bam files
source_files = api.files.query(limit=100, project=source_project,
                              metadata={'experimental_strategy' : 'RNA-Seq',
                                        'sample_type' : 'Cell Line',
                                        'investigation' : 'CCLE-BLCA'})
source_files = [f for f in source_files.all() if
               f.name[-3:] == 'bam']

# List the files you already have
my_file_names = [f.name for f in api.files.query(limit = 100, project = my_project.id).all()]

# Copy files to your project
BLCA_files = []    # will use this list later as an input
count = 0
for f in source_files:
    if f.name in my_file_names:
        print('File ({}) already exists in your project, skipping'.format(f.name))
        BLCA_files.append(api.files.query(project=my_project, names =[f.name])[0])
    else:
        print('File ({}) does not exist; copying now'.format(f.name))
        new_f = f.copy(project = my_project)
        BLCA_files.append(new_f)
    count += 1
    if count >= files_to_copy:
        break

File (G20466.5637.2.bam) already exists in your project, skipping
File (G26243.HT-1197.2.bam) already exists in your project, skipping
File (G27217.TCCSUP.1.bam) already exists in your project, skipping
File (G27234.647-V.1.bam) already exists in your project, skipping
File (G27235.BC-3C.1.bam) already exists in your project, skipping
File (G27264.RT-112.1.bam) already exists in your project, skipping
File (G27276.639-V.1.bam) already exists in your project, skipping
File (G27287.253J-BV.1.bam) already exists in your project, skipping
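
The two copy cells above differ only in the `investigation` value, so the skip-or-copy decision could be factored into one function. The sketch below shows that decision in isolation. `plan_copies` and `FakeFile` are hypothetical names introduced here, and the stand-in objects only mimic the `.name` attribute of real platform files, so nothing is actually copied.

```python
from collections import namedtuple

# Stand-in for a platform file object; real ones come from api.files.query()
FakeFile = namedtuple('FakeFile', ['name'])


def plan_copies(source_files, existing_names, limit):
    """Split the first `limit` bam files into those to reuse and those to copy."""
    reuse, copy = [], []
    for f in source_files:
        if not f.name.endswith('.bam'):
            continue    # skip indexes and other non-bam files
        if len(reuse) + len(copy) >= limit:
            break
        if f.name in existing_names:
            reuse.append(f)
        else:
            copy.append(f)
    return reuse, copy


source = [FakeFile('a.bam'), FakeFile('a.bam.bai'),
          FakeFile('b.bam'), FakeFile('c.bam')]
reuse, copy = plan_copies(source, existing_names={'a.bam'}, limit=2)
print([f.name for f in reuse], [f.name for f in copy])
```

With real objects, each entry in `copy` would be passed through `f.copy(project=my_project)` and each entry in `reuse` looked up in your own project, as in the cells above.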


In [25]:
# Get the reference_genome from the same project
ref_file = api.files.query(limit=100, project=source_project,
                           names=[reference_genome])[0]

if ref_file.name in my_file_names:
    ref_genome = api.files.query(limit=100, project=my_project,
                                 names=[reference_genome])[0]
    print('File ({}) already exists in your project, skipping'.format(ref_file.name))
else:
    print('File ({}) does not exist; copying now'.format(ref_file.name))
    ref_genome = ref_file.copy(project=my_project)
    
# Annotations
# Get the annotations file from the same project
ref_file = api.files.query(limit=100, project=source_project,names=[annotations_file])[0]

if ref_file.name in my_file_names:
    annotations = api.files.query(limit=100, project=my_project,
                                  names=[annotations_file])[0]
    print('File ({}) already exists in your project, skipping'.format(ref_file.name))
else:
    print('File ({}) does not exist; copying now'.format(ref_file.name))
    annotations = ref_file.copy(project=my_project)

File (HG19_Broad_variant.fasta) already exists in your project, skipping
File (Homo_sapiens.GRCh37.75.gtf) already exists in your project, skipping


## 4) Create a workflow from the Application JSON
We will load a tool from its JSON here because it has been modified from the version in _Public Apps_. This is _not_ the most common user-flow, but it may be useful to see. We need to import _json_ here to do this correctly. Please be **careful** when exporting and importing Apps, as normal _copy-paste_ operations may introduce JSON formatting errors.

#### PROTIPS
 * Detailed documentation of this particular REST architectural style request is available [here](http://docs.sevenbridges.com/docs/add-an-app-using-raw-cwl)
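
Because copy-paste damage usually shows up as a parse error, it can help to validate the JSON before installing the app. The following is a minimal sketch, not part of the original recipe: `load_app` is a hypothetical helper, and the check for a top-level `class` field assumes the app is a CWL document (for example `Workflow` or `CommandLineTool`).

```python
import json


def load_app(raw_text):
    """Parse app JSON and check for the CWL 'class' field before installing."""
    try:
        app = json.loads(raw_text)
    except json.JSONDecodeError as err:
        raise ValueError('App JSON is malformed: {}'.format(err))
    if not isinstance(app, dict) or 'class' not in app:
        raise ValueError("App JSON has no 'class' field; is it a CWL document?")
    return app


good = load_app('{"class": "Workflow", "steps": []}')
print(good['class'])

try:
    load_app('{"class": "Workflow", "steps": [}')    # simulated copy-paste damage
except ValueError as err:
    print('Rejected:', err)
```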

In [26]:
# Load the Application JSONs
import json

# Use context managers so the file handles are closed after reading
with open('files/rna-seq-diff-expression-first.json', 'r') as f:
    first = json.load(f)

with open('files/rna-seq-diff-expression-second.json', 'r') as f:
    second = json.load(f)

# Create the Workflows
a_first_id = (my_project.id + '/first')
my_app_first = api.apps.install_app(id=a_first_id, raw=first)

a_second_id = (my_project.id + '/second')
my_app_second = api.apps.install_app(id=a_second_id, raw=second)

## Organize files into a **cohort**
Here we don't have matched samples, so we are organizing _arbitrarily_. However, it is _straightforward_ to reorganize the data by UUID or some other metadata. We will scan through the metadata in both **CESC_files** and **BLCA_files** and sort by _Case ID_ alphabetically.

In [27]:
# Create a list of case_ids (or any other metadata)
case_id_CESC = []
for f in CESC_files:
    case_id_CESC.append(f.metadata['case_id'])
    
case_id_BLCA = []
for f in BLCA_files:
    case_id_BLCA.append(f.metadata['case_id'])

# Sort lists alphabetically
ind_CESC = sorted(range(len(case_id_CESC)), key=lambda k: case_id_CESC[k])
ind_BLCA = sorted(range(len(case_id_BLCA)), key=lambda k: case_id_BLCA[k])
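
The index-sorting cell above can also be expressed by sorting the file objects directly and pairing them with `zip`. This is a sketch on stand-in dictionaries with made-up case IDs. Real file objects would expose the same value as `f.metadata['case_id']`.

```python
from operator import itemgetter

# Stand-in records with illustrative case IDs (not real CCLE metadata)
cesc = [{'name': 'c2.bam', 'case_id': 'case-20'},
        {'name': 'c1.bam', 'case_id': 'case-03'}]
blca = [{'name': 'b1.bam', 'case_id': 'case-15'},
        {'name': 'b2.bam', 'case_id': 'case-01'},
        {'name': 'b3.bam', 'case_id': 'case-09'}]

by_case = itemgetter('case_id')

# zip() stops at the shorter list, so an unmatched file is simply left out
pairs = list(zip(sorted(cesc, key=by_case), sorted(blca, key=by_case)))
for a, b in pairs:
    print(a['name'], '<->', b['name'])
```

Swapping `by_case` for another key function is all it takes to pair on a different metadata field.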

## Build and run _first-stage_ tasks
Here we use the API to create an _inputs_ dictionary that we will use for each pair of files. All of the front-end tasks will be drafted and started within seconds.

#### Note
Here we are not doing any error checking as in [batch_o_tasks_standard](batch_o_tasks_standard.ipynb); instead we are firing up the tasks directly.

#### PROTIPS
* Detailed documentation of this particular REST architectural style request is available [here](http://docs.sevenbridges.com/docs/create-a-new-task)

In [28]:
my_task_list = []

# Loop through the sorted pairs (stops at the shorter list if the counts differ)
for ii in range(min(len(ind_CESC), len(ind_BLCA))):
    # Format the JSON to pass values to the FRONT-END workflow (frontend RNA-seq cuffquant)
    task_name = 'first_stage_%i' % (ii)
    inputs = {
        'Reference' : ref_genome,
        'Annotations' : annotations,
        'BAM_Group_A' : [CESC_files[ind_CESC[ii]]],
        'BAM_Group_B' : [BLCA_files[ind_BLCA[ii]]],
        'library_type' : 'fr-unstranded',
        'group_name' : 'CESC',
        'group_name_1' : 'BLCA'
    }

    my_task = api.tasks.create(name=task_name, project=my_project, \
                               app=my_app_first, inputs=inputs, \
                               run = True)
    my_task_list.append(my_task)

if len(my_task_list) > 0:
    print("""
    %i tasks have been created. Enjoy a break, treat yourself to a muffin, 
    and come back to us once you've gotten an email that tasks are done.
    (alternatively, use the task monitoring cells below)""" % (len(my_task_list)))


    8 tasks have been created. Enjoy a break, treat yourself to a muffin, 
    and come back to us once you've gotten an email that tasks are done.
    (alternatively, use the task monitoring cells below)


## Check task completion
These tasks may take a long time to complete. Here are two ways to check in on them:
* Wait for email confirmation <sup>1</sup>
* Ping the task to see its _status_. Here we use a 10 min interval; adjust it appropriately for longer or shorter workflows

<sup>1</sup> Emails will arrive regardless of whether the task was started by GUI or API. These notifications can be configured [here](http://docs.sevenbridges.com/docs/account-settings#section-manage-email-notifications)

#### PROTIPS
* The closest recipe for _monitoring tasks_ is [here](../../Recipes/SBPLAT/tasks_monitorAndGetResults.ipynb)
* Detailed documentation of this particular REST architectural style request is available [here](http://docs.sevenbridges.com/docs/perform-an-action-on-a-specific-task)
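
As a variation on the polling approach, the loop can refresh every unfinished task on each pass and return a summary of final statuses. The sketch below runs against a stand-in class so it can be tried without the platform. `wait_for_all` and `FakeTask` are hypothetical names, and the only assumption about real task objects is that they expose `.name`, `.status` and `.reload()`, as used elsewhere in this tutorial.

```python
import time

TERMINAL = {'COMPLETED', 'FAILED', 'ABORTED'}


def wait_for_all(tasks, interval, sleep=time.sleep, max_polls=None):
    """Poll until every task reaches a terminal status; return {name: status}."""
    polls = 0
    while True:
        for t in tasks:
            if t.status not in TERMINAL:
                t.reload()    # refresh only the tasks that are still active
        pending = [t for t in tasks if t.status not in TERMINAL]
        if not pending:
            break
        polls += 1
        if max_polls is not None and polls >= max_polls:
            break    # give up waiting; unfinished tasks keep their last status
        sleep(interval)
    return {t.name: t.status for t in tasks}


class FakeTask:
    """Stand-in that steps through a scripted list of statuses on reload()."""

    def __init__(self, name, statuses):
        self.name = name
        self._statuses = list(statuses)
        self.status = self._statuses.pop(0)

    def reload(self):
        if self._statuses:
            self.status = self._statuses.pop(0)
        return self


tasks = [FakeTask('first_stage_0', ['RUNNING', 'COMPLETED']),
         FakeTask('first_stage_1', ['RUNNING', 'RUNNING', 'FAILED'])]
result = wait_for_all(tasks, interval=0, sleep=lambda seconds: None)
print(result)
```

Passing `sleep` as a parameter keeps the example instant here. With real tasks you would leave the default and set `interval` to something like the 600 seconds used in the next cell.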

In [None]:
# [USER INPUT] Set loop time (seconds):
loop_time = 600

from time import sleep

for t in my_task_list:
    # Check on one task at a time;
    #   if ANY is still running, we are NOT done (no sense in querying the others yet)
    print('Pinging SBPLAT for task completion')
    
    task_running = True
    while task_running:
        t.reload()    # refresh the task's status from the platform
        if t.status == 'COMPLETED':
            task_running = False
            print('Task has completed, life is beautiful')
        elif t.status in ('FAILED', 'ABORTED'):
            print('Task (%s) failed, check it out' % t.name)
            task_running = False
        else:
            sleep(loop_time)

Pinging SBPLAT for task completion


## Build and run _second-stage_ task
Now we collect outputs from the first-stage tasks and provide them to the second-stage task.

#### Notes
Here we are not doing any error checking as in [batch_o_tasks_standard](batch_o_tasks_standard.ipynb); instead we are firing up the tasks directly. Note this means we _ignore_ any failed tasks. You can choose your own approach here, e.g. re-running (or QC-ing) failed tasks, all-or-none processing, etc.

#### PROTIPS
* Detailed documentation of this particular REST architectural style request is available [here](http://docs.sevenbridges.com/docs/create-a-new-task)
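
If you prefer to skip failed first-stage tasks explicitly rather than pass their empty outputs along, the outputs can be filtered by status first. This is a minimal sketch on stand-in objects: `collect_outputs` is a hypothetical helper, the file names are placeholders, and the stand-ins use plain dictionaries for `outputs`, whereas real task objects are indexed as `task.outputs['abundances']` in the cell below.

```python
from types import SimpleNamespace

# Stand-ins for finished task objects; real ones come from my_task_list
finished = [
    SimpleNamespace(name='first_stage_0', status='COMPLETED',
                    outputs={'abundances': 'abundances_0.cxb'}),
    SimpleNamespace(name='first_stage_1', status='FAILED',
                    outputs={'abundances': None}),
]


def collect_outputs(tasks, key):
    """Return the `key` output of every COMPLETED task, plus skipped task names."""
    kept, skipped = [], []
    for t in tasks:
        value = t.outputs.get(key) if t.status == 'COMPLETED' else None
        if value is None:
            skipped.append(t.name)
        else:
            kept.append(value)
    return kept, skipped


abundances, skipped = collect_outputs(finished, 'abundances')
print(abundances, skipped)
```

The `skipped` list gives you a starting point for re-running or inspecting the tasks that did not finish.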

In [None]:
# Set input for the second stage task
abundances = [task.outputs['abundances'] for task in my_task_list]

# Format the JSON to pass values to the BACK-END workflow [for backend RNA-seq cuffquant]
task_name = 'second_stage'
inputs = {
    'Reference' : ref_genome,
    'Annotations' : annotations,
    'sample_files' : abundances,
    'FDR': 0.05,
    'library_type': u'ff-unstranded',
    'min_reps_for_js_test': 3,
    'library_norm_method': u'classic-fpkm',
    'dispersion_method': u'per-condition'
}
    
my_task = api.tasks.create(name=task_name, project=my_project,
                               app=my_app_second, inputs=inputs)
my_task.run()
    
print("You've made it to the end, yaay!")