# How do I create a _batch_ of tasks and _dynamically_ choose a workflow for each of my files?
### Overview
_[Use Case] Batch process a set of WXS files with SAMTools at specific regions of interest_

This example processes Whole Exome Sequencing (WXS) files (bam) and their indices (bai), using the Seven Bridges version of the [SAMtools](https://github.com/samtools/samtools) software suite to extract a _subset of alignments_. Conceptually, the script scans all available files within a project; for each _bam_ file it finds the corresponding:
 - index file (bai)
 - _Case\_UUID_
 - _Disease Name_
 - _File Size_

The bam _File Size_ determines the EC2 allocation size<sup>1</sup> for SAMtools; _Case\_UUID_ and _Disease Name_ are collected only to illustrate the technique, and users can build on them to refine the batch. The script then loops over the bam files and starts one task for each. There is also the option to poll the CGC for task completion and download the output files. Results are always saved on the CGC.

We will work through each step below. This example requires a mix of skills:
 
 - Data Browser Use
 - Workflow Copying
 - API use

<sup>1</sup> Note: dynamic allocation size is already available on the back-end; we don't **need** to do this manually via API. However, we intend to demonstrate the flexibility you can achieve for your _own ends_.

### Prerequisites
 1. You need your _authentication token_ and the API needs to know about it. See <a href="set_AUTH_TOKEN.ipynb">**set_AUTH_TOKEN.ipynb**</a> for details.
 2. You need _TCGA **Controlled** Data_ access
 
### WARNING
This will burn through some processing credits, depending on how many files you pull from _Data Browser_ (about $0.12 to $0.23 per file). To just see how it works, you can create _DRAFT_ tasks instead: swap the commenting in **Build and run tasks** so that only this line runs:
```python
my_task = API(method='POST', data=new_task, path='tasks/')        # task created in DRAFT state
```

## Steps on the GUI
  
We will always be working with the [Cancer Genomics Cloud](https://cgc.sbgenomics.com), but will mix _GUI_ and _API_ tasks here. GUI tasks will be descriptive (markdown cells); API tasks will be in an executable cell but preceded with an explanation in a markdown cell. 

### 1) Create a project
Create a project in the GUI. Name it 'Batch is Super'. Mark that it will contain _TCGA Controlled Data_.
<img src="images/batch_1.png"> 

### 2) Use _Data Browser_ to get expressions
Go into _Data Browser_ and construct the following query (should generate approximately 132 files). Click the DataFormat nodes and **Copy files to Project**. Select _Batch is Super_.
<img src="images/batch_2.png"> 

    - hasCase
         - hasAgeAtDiagnosis = 60 < age < 65
         - hasDiseaseType = Acute Myeloid Leukemia
         - hasGender = MALE
         - hasFile
             - hasDataFormat = BAM, BAI
             - hasExperimentalStrategy = WXS
         
Now let's get into some IPython!

## Imports and Definitions
We will use a Python class (API) as a wrapper for API calls. All classes and methods are defined in <a href="defs/apimethods.py" target="_blank">_defs/apimethods.py_</a>.

In [None]:
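from defs.apimethods import *
import json                 # used below to parse the CWL app descriptions
from time import sleep      # used below when polling for task completion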

## Common Workflow Language (CWL) description of our Apps
We could copy these Apps from the Public Reference Apps, but _what fun would that be?_ Let's instead specify the exact CWL JSON! In the **Find your project** section, we will create our apps with it.

In [None]:
# Load the CWL (JSON) description of each tool; 'with' closes the file for us
with open('files/reg_tool.json', 'r') as f:
    regular_tool_raw = f.read()
regular_tool = json.loads(regular_tool_raw)

with open('files/large_tool.json', 'r') as f:
    large_tool_raw = f.read()
large_tool = json.loads(large_tool_raw)

## User Input
We need to set a few things here, depending on which areas we want to look at. Additionally, this would be the space to set project and tool names.

Note that `target_region` **must be an array**, even if there is only one target. To target a single SNP on chromosome 2 at base pair 387840, use:
```python
target_region = ['2:387840-387840']
```
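
A malformed region string can be easy to miss before tasks are created. The following is a minimal sketch, not part of the original notebook, of a check that could be run on `target_region` first. The helper name `check_regions` and the regular expression are assumptions made for illustration; they encode only the `N:n-m` format described here and are not a specification of everything SAMtools accepts.

```python
import re

# Assumed pattern: chromosome label, a colon, then start-end in base pairs
REGION_PATTERN = re.compile(r'^(\w+):(\d+)-(\d+)$')

def check_regions(regions):
    """Return a list of (chromosome, start, end) tuples, or raise ValueError."""
    if isinstance(regions, str):
        # A bare string would otherwise be iterated character by character
        raise ValueError('target_region must be a list, e.g. [%r]' % regions)
    parsed = []
    for region in regions:
        match = REGION_PATTERN.match(region)
        if match is None:
            raise ValueError('Bad region format: %r' % region)
        chrom = match.group(1)
        start, end = int(match.group(2)), int(match.group(3))
        if start > end:
            raise ValueError('Start is after end in region: %r' % region)
        parsed.append((chrom, start, end))
    return parsed

parsed_regions = check_regions(['2:387840-387840', '17:7668402-7687550'])
```

Passing a plain string instead of a list raises a `ValueError`, which catches the single-target mistake described above.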

In [None]:
# Set project and app names:
project_name = 'Batch is Super'
app_name_reg = 'SAMtools View Region (Regular)' 
app_name_XL = 'SAMtools View Region (Large)' 

# Limit for calling app_name_reg, if (size >= size_limit) call app_name_XL
size_limit = 25000000000                        

# File extensions we will be working with
input_ext = 'bam'
index_ext = 'bai'
output_ext = 'sam'

# Prefix for created tasks
task_name = 'batch_SAMtoolsView_'

# Regions to investigate (target_region[1] is TP53).
# Format is N:n-m, where N is the chromosome, n is the starting base pair, and
# m is the ending base pair. If n = m, it is a SNP.
# Can be an array of regions. If a single SNP, keep it an array of 1.
target_region = ['2:387840-387840',
                 '17:7668402-7687550',
                 '3:192000-192000']
# n_regions = len(target_region)
# if n_regions > 1:
#     flag = {'multi_region': True}
# else:
#     flag = {'multi_region': False}

## Find your project
This code lists all projects in your account and gets the details of _project\_name_, which confirms that the project from the GUI steps above exists. It then lists:

 - All files
 
Then we will _create_ the apps we need from the CWL in the earlier cell and list:

 - All apps
     - details of the app matching app_name_reg
     - details of the app matching app_name_XL
     
#### PROTIPS
* The recipes involved in this cell are [here](../../Recipes/CGC/projects_detailOne.ipynb), [here](../../Recipes/CGC/files_listAll.ipynb), and [here](../../Recipes/CGC/files_detailOne.ipynb)
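
Looking up a project by name with `list.index` raises a bare `ValueError` when the name is missing, for example if the project was named differently in the GUI. The sketch below shows one way to make that failure easier to read. The helper `find_id_by_name` and the example names and ids are hypothetical; in the notebook the two lists would be `existing_projects.name` and `existing_projects.id`.

```python
def find_id_by_name(names, ids, wanted):
    """Return the id whose name equals `wanted`, with a readable error if missing."""
    if wanted not in names:
        raise LookupError('No entry named %r; available: %s'
                          % (wanted, ', '.join(names)))
    return ids[names.index(wanted)]

# Made-up data standing in for existing_projects.name / existing_projects.id
example_names = ['Sandbox', 'Batch is Super']
example_ids = ['user/sandbox', 'user/batch-is-super']

project_id = find_id_by_name(example_names, example_ids, 'Batch is Super')
```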

In [None]:
# LIST all projects
existing_projects = API(path='projects')  

# DETAIL my_project
p_index = existing_projects.name.index(project_name)
my_project = API(path=('projects/'+ existing_projects.id[p_index]))  

# LIST all files in project
my_files = API(path='files', query={'limit':100, 'project': my_project.id}) 

# CREATE your tools from the JSON
API(path=('apps/' + my_project.id + '/sam-normal/0/raw'), \
             method='POST', data = regular_tool)
API(path=('apps/' + my_project.id + '/sam-large/0/raw'), \
             method='POST', data = large_tool)

# LIST all apps in project and make sure they were created
my_apps = API(path='apps', query={'limit':100, 'project': my_project.id}) 
if len(my_apps.id) > 1:
    for ii, a_name in enumerate(my_apps.name):
        if app_name_reg == a_name:
            my_inputs = API(path=('apps/' + my_apps.id[ii]))
        elif app_name_XL == a_name:
            my_inputs_XL = API(path=('apps/' + my_apps.id[ii]))
    del ii, a_name
else:
    print("Apps were not created in selected project, cannot continue")
    raise KeyboardInterrupt

## Organize files into a **cohort**
The _Data Browser_ is excellent for finding files. However, there are challenges to working with them smoothly, especially as the number of files grows. Specifically

 - File naming ambiguity between patients and centers (related to the change from **TCGA Barcode** to **UUID**)
     - This is not a critical issue here, but as an example we save several metadata fields before starting tasks
         - CASE_UUID
         - disease_type
         - size (of file)
 - Uncertainty whether samples are matched (e.g. does the index file (BAI) exist for all input files (BAM))
     - In the code below, the comment `INDEX exists for this INPUT` marks this check
     
We only take action on file size, calling a task with a bigger EC2 allocation if the input file reaches `size_limit`. However, this example should be _illustrative_ for users who need other metadata, all of which is accessible from
```python
single_file.metadata
```
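
The matching logic can be tried without any API calls. The following is a small sketch on a made-up list of file names; the function `pair_with_index` is hypothetical and only mirrors the naming convention used in this notebook, where the index of `x.bam` is named `x.bam.bai`. Unlike `list.index`, the dictionary lookup keeps the last position when a name appears twice.

```python
def pair_with_index(file_names, input_ext='bam', index_ext='bai'):
    """Return (input_position, index_position) pairs for inputs that have an index."""
    # Map each file name to its position for fast lookups
    position = {name: ii for ii, name in enumerate(file_names)}
    pairs = []
    for ii, name in enumerate(file_names):
        if name.endswith('.' + input_ext):
            index_name = name + '.' + index_ext
            if index_name in position:          # skip inputs without an index
                pairs.append((ii, position[index_name]))
    return pairs

# Made-up names: b.bam has no index, so it is left out
example_files = ['a.bam', 'a.bam.bai', 'b.bam', 'c.bam.bai', 'c.bam']
pairs = pair_with_index(example_files)
```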

In [None]:
case_ids = {'uuid': [], 'index': [], 'input': [], 'size': [], 'disease': []}

# Collect input file metadata. Saving the case_UUID and DiseaseType 
# (not used, but you can add whatever filtering rocks your world)
for ii, f_name in enumerate(my_files.name):
    if f_name.endswith(input_ext):      # input_ext defined for 'SAMtools View Region'
        index_name = f_name + '.' + index_ext
        if index_name in my_files.name:   # INDEX exists for this INPUT
            single_file = API(path=('files/' + my_files.id[ii]))
            case_ids['uuid'].append(single_file.metadata['case_uuid'])
            case_ids['input'].append(ii)
            case_ids['index'].append(my_files.name.index(index_name))
            case_ids['size'].append(single_file.size)
            case_ids['disease'].append(single_file.metadata['disease_type'])

print('There are %i indexed bam files within the project' % (len(case_ids['uuid'])))

## Build and run tasks
Here we build a _new\_task_ dictionary for each pair of files<sup>1</sup> and send it to the API. Once the requests reach the CGC, all of the tasks will be drafted and started within seconds.

<sup>1</sup> Note: we overwrite the 'app' entry of _new\_task_ before creating the task, switching to the app with the larger EC2 instance when the input file size reaches `size_limit`.
     
#### PROTIPS
* Detailed documentation of this API request is available [here](http://docs.cancergenomicscloud.org/docs/create-a-new-task)
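
The size-based choice of workflow is the core of this example, so here it is in isolation. This is a sketch under stated assumptions: `choose_app` is a hypothetical helper, and the two app ids are placeholders; in the notebook they would be `my_inputs.id` and `my_inputs_XL.id`.

```python
def choose_app(file_size, size_limit, regular_app_id, large_app_id):
    """Pick the large app when the file size reaches the limit, else the regular one."""
    return large_app_id if file_size >= size_limit else regular_app_id

# Placeholder ids, not real apps
REGULAR = 'user/project/sam-normal'
LARGE = 'user/project/sam-large'
limit = 25000000000      # same value as size_limit in the User Input cell

# Three made-up file sizes: below, exactly at, and above the limit
chosen = [choose_app(size, limit, REGULAR, LARGE)
          for size in (12000000000, 25000000000, 40000000000)]
```

A file exactly at the limit goes to the large app, matching the `>=` comparison used in this notebook.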

In [None]:
my_task_list = [None]
for ii,case_id in enumerate(case_ids['input']):    
    new_task = {
        'description': 'Created from batch_SAMtoolsView.ipynb',
        'name': (task_name + str(ii)),
        'app': my_inputs.id,
        'project': my_project.id,
        'inputs': {
            'regions_array': target_region,       # region(s) of interest
            'input_bam_or_sam_file': {                 # BAM file
                'class': 'File',
                'path': my_files.id[case_id],
                'name': my_files.name[case_id]
            },
            'input_index': {                           # BAI index
                'class': 'File',
                'path': my_files.id[case_ids['index'][ii]],
                'name': my_files.name[case_ids['index'][ii]]
            }
        }
    }

#         if flag['multi_region']:
#             for reg in target_region[1:]:
#                 new_task['inputs']['regions_array'].append(reg)

    # check if larger task is needed
    if case_ids['size'][ii] >= size_limit:
        new_task['app'] = my_inputs_XL.id

    # CREATE and RUN tasks
    my_task = API(method='POST', data=new_task, path='tasks/', query = {'action': 'run'})
    my_task_list.append(my_task.id)
    # ALTERNATIVE: create DRAFT tasks, do not run
#     my_task = API(method='POST', data=new_task, path='tasks/')        # task created in DRAFT state
my_task_list.pop(0)

print("""
%i tasks have been created. Enjoy a break, treat yourself to a coffee, 
and come back to us once you've gotten an email that tasks are done.
(alternatively, use the task monitoring cells below)""" % (len(my_task_list)))

## Check task completion
These tasks may take a long time to complete. Here are two ways to check in on them:
* Wait for email confirmation <sup>1</sup>
* Poll the task to see its _status_. Here we use a 2 min interval; adjust it appropriately for longer or shorter workflows

<sup>1</sup> Emails will arrive regardless of whether the task was started by GUI or API

#### PROTIPS
* The closest recipe for _monitoring tasks_ is [here](../../Recipes/CGC/tasks_monitorAndGetResults.ipynb)
* Detailed documentation of this API request is available [here](http://docs.cancergenomicscloud.org/docs/perform-an-action-on-a-specific-task)
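
The polling pattern can be separated from the API so that it is easy to test and to cap. The sketch below is an illustration, not the notebook's own code: `wait_for_task`, its `max_checks` cap, and the injected `get_status` and `sleep` callables are all assumptions. The terminal states are the three this notebook checks for.

```python
import time

def wait_for_task(get_status, loop_time=120, max_checks=None, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state; return that state.

    Returns None if max_checks polls pass without reaching a terminal state.
    """
    terminal = ('COMPLETED', 'FAILED', 'ABORTED')
    checks = 0
    while max_checks is None or checks < max_checks:
        status = get_status()
        if status in terminal:
            return status
        checks += 1
        sleep(loop_time)      # wait before asking again
    return None

# Fake status source for illustration: two RUNNING polls, then COMPLETED
fake_statuses = iter(['RUNNING', 'RUNNING', 'COMPLETED'])
result = wait_for_task(lambda: next(fake_statuses), loop_time=0)
```

In this notebook, `get_status` could be something like `lambda: api_call('tasks/' + t_id)['status']`, assuming `api_call` returns the task as a dictionary as it does in the cell that follows.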

In [None]:
# [USER INPUT] Set loop time (seconds):
loop_time = 120

for t_id in my_task_list:
    # Check on one task at a time, 
    #  if ANY running, we are not done (no sense to query others)
    flag = {'taskRunning': True}
    while flag['taskRunning']:
        task = api_call(('tasks/' + t_id))
        if task['status'] == 'COMPLETED':
            flag['taskRunning'] = False
            print('Task has completed, life is beautiful')
        elif (task['status'] == 'FAILED') or (task['status'] == 'ABORTED'):
            print('Task (%s) failed, check it out' \
                  % (t_id))
            flag['taskRunning'] = False
        else:
            sleep(loop_time) 

## (optional) Branch point
You have now completed all (most) of your tasks and will have a large set of output files. One interesting case would be another set of code here to do
 - Quality Control
 - Second level of analysis (e.g. these output files will serve as inputs to another App; see [thyroid.ipynb](thyroid.ipynb))

## (optional) Download processed SAM files
You will already have all of these saved in your project (and a _lot of emails_ - one for each completed task). You may also download all of the SAM files.

In [None]:
# [USER INPUT] Set file extension(s) to download here:

dl_list = ["links to file downloads"]

my_files = API(path='files', query={'project': my_project.id})
for ii, f_name in enumerate(my_files.name):
    if (f_name[-len(output_ext):] == output_ext):
        dl_list.append(api_call(path=('files/' + my_files.id[ii] + '/download_info'))['url'])
        
download_files(dl_list)