# How do I summarize metadata _properties_ for all the files that were used in a task?

### Overview
We build on Gaurav's _Gene Data Munger_ app, which cleans and organizes level 3 gene expression data into a table with dimensions [number_of_genes x number_of_patients]. In this notebook, we will:
 
 1. Query a task that Gaurav ran
 2. List all input files to that task
 3. Query the metadata for each input file
 4. Pass this information back to a tool which makes a table [number_of_properties x number_of_patients] on the platform

### Prerequisites
 1. You need to be a member (or owner) of _Gaurav's_ project.
 2. You need your _authentication token_ and the API needs to know about it. See <a href="set_AUTH_TOKEN.ipynb">**set_AUTH_TOKEN.ipynb**</a> for details.
 
### Imports and Definitions
A single call is sufficient to get a file list. We will show two different options, both of which are defined in the apimethods.py file.

In [None]:
from defs.apimethods import *

## List all the _files_ and _tasks_ in Gaurav's project
We will list all of your files and tasks in Gaurav's project

#### PROTIPS
* The recipe for _listing projects_ is [here](../../Recipes/CGC/projects_listAll.ipynb)
* The recipe for _listing files_ in a project is [here](../../Recipes/CGC/files_listAll.ipynb)
* The recipe for _listing tasks_ in a project is [here](../../Recipes/CGC/tasks_listAll.ipynb)
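
One caveat with the file listing in the next cell: the query asks for at most 100 files (`'limit': 100`). If the `API` helper does not follow pagination itself, a project with more than 100 files would be listed only partially, and a later lookup by file name could fail. The sketch below shows the general offset/limit paging pattern. It assumes the listing also accepts an `offset` parameter, and `fetch_page` is a hypothetical stand-in for the real request, not part of apimethods.py.

```python
def fetch_all(fetch_page, limit=100):
    """Collect items from a paged listing until a short page is returned.

    fetch_page is a hypothetical callable taking (offset, limit) and
    returning a list of items; it stands in for a real API request.
    """
    items = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        items.extend(page)
        if len(page) < limit:   # a short (or empty) page means we are done
            break
        offset += limit
    return items


# Stand-in for the remote listing: 250 made-up file names.
fake_listing = ['file_%03d.txt' % i for i in range(250)]


def fake_fetch_page(offset, limit):
    return fake_listing[offset:offset + limit]


all_names = fetch_all(fake_fetch_page, limit=100)
print(len(all_names))   # 250
```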

In [None]:
# [USER INPUT]
project_name = 'Gene Expression'
ind_task = 0      # task to get inputs from

# LIST all projects
existing_projects = API(path='projects')                              
if project_name in existing_projects.name:
    p_index = existing_projects.name.index(project_name)
else:
    print('project does not exist, please check name.')
    raise KeyboardInterrupt
    
# LIST all files in the project
my_files = API(path='files', query={'project': existing_projects.id[p_index], \
                                   'limit':100})
print('There are %i files in project (%s)'\
      % (len(my_files.name),existing_projects.name[p_index]))

# LIST all tasks in the project
my_tasks = API(path='tasks', query={'project': existing_projects.id[p_index]})
print('There are %i tasks in project (%s)'\
      % (len(my_tasks.name),existing_projects.name[p_index]))

## Get the files that were used for the most recent task
We will get the details of the task selected by `ind_task` and parse out the names of its input files.

#### PROTIPS
* The recipe for getting task inputs is [here](../../Recipes/CGC/tasks_monitorAndGetResults.ipynb)
* The recipe for getting the metadata for single files is [here](../../Recipes/CGC/files_detailOne.ipynb)
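
The cell below reads file names from one specific input port, `input_files`. As a rough sketch of how the same idea generalizes, the function here walks every entry of a task's inputs and keeps anything that looks like a file record. The shape of `example_inputs` (lists of dictionaries with a `name` key, mixed with plain parameter values) is an assumption based on how the cell below uses `single_task.inputs`; the `path` key and the `threshold` parameter are made up for illustration.

```python
def input_file_names(inputs):
    """Return the names of all file-like entries in a task's inputs dict."""
    names = []
    for value in inputs.values():
        # An input port may hold a single record or a list of records.
        entries = value if isinstance(value, list) else [value]
        for entry in entries:
            if isinstance(entry, dict) and 'name' in entry:
                names.append(entry['name'])
    return names


# Hypothetical task inputs: one file-list port and one plain parameter.
example_inputs = {
    'input_files': [
        {'name': 'patient_A.txt', 'path': 'id-0001'},
        {'name': 'patient_B.txt', 'path': 'id-0002'},
    ],
    'threshold': 0.5,
}

print(input_file_names(example_inputs))
```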

In [None]:
# DETAIL a single task
single_task = API(method='GET', path=('tasks/' + my_tasks.id[ind_task]))

# parse out the file names
my_input_files = [ii['name'] for ii in single_task.inputs['input_files']]

## Get all metadata for the files 
Now we loop through each of the input files, matching it to the list of all files.
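
Matching by name with `list.index` scans the whole file list for every input and raises a `ValueError` when a name is absent. One possible alternative, sketched here with made-up names and ids in place of `my_files.name` and `my_files.id`, is to build a name-to-id dictionary once and collect any names that cannot be matched.

```python
# Made-up names and ids standing in for my_files.name and my_files.id.
file_names = ['patient_A.txt', 'patient_B.txt', 'patient_C.txt']
file_ids = ['id-0001', 'id-0002', 'id-0003']

# Build the lookup once. If a name appeared twice, the last id would win.
id_by_name = dict(zip(file_names, file_ids))

# Hypothetical task inputs, one of which is not in the project listing.
wanted = ['patient_C.txt', 'patient_A.txt', 'not_in_project.txt']

matched_ids = []
unmatched = []
for name in wanted:
    if name in id_by_name:
        matched_ids.append(id_by_name[name])
    else:
        unmatched.append(name)

print(matched_ids)   # ['id-0003', 'id-0001']
print(unmatched)     # ['not_in_project.txt']
```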

In [None]:
# [USER INPUT] 
metadata_to_include = [u'aliquot_id', 
                       u'data_subtype', 
                       u'case_uuid', 
                       u'disease_type', 
                       u'data_type', 
                       u'gender', 
                       u'sample_uuid', 
                       u'sample_id', 
                       u'investigation', 
                       u'data_format', 
                       u'sample_type', 
                       u'platform', 
                       u'case_id', 
                       u'primary_site', 
                       u'age_at_diagnosis', 
                       u'race', 
                       u'vital_status', 
                       u'experimental_strategy', 
                       u'ethnicity', 
                       u'aliquot_uuid']

# Initiate a metadata dictionary with one empty list per property
metadata = {md: [] for md in metadata_to_include}

# SINGLE-FILE method
for f_name in my_input_files:
    f_id = my_files.id[my_files.name.index(f_name)]
    single_file = API(path=('files/' + f_id))
    keys = single_file.metadata.keys()
    for md in metadata_to_include:
        if md in keys:
            metadata[md].append(single_file.metadata[md])
        else:
            # Error handling for missing data, is 0 best choice?
            metadata[md].append(0)    
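
On the question raised in the comment above: `0` can be a legitimate value for a numeric property, so a placeholder such as `None` may be easier to tell apart from real data. This is a minimal sketch of that choice using `dict.get`; the records are invented, and `collect_metadata` is a hypothetical helper rather than anything defined in apimethods.py.

```python
def collect_metadata(records, properties, missing=None):
    """Build {property: [value per record]}, filling gaps with `missing`."""
    table = {prop: [] for prop in properties}
    for record in records:
        for prop in properties:
            table[prop].append(record.get(prop, missing))
    return table


# Invented per-file metadata; the second record lacks age_at_diagnosis.
records = [
    {'gender': 'FEMALE', 'age_at_diagnosis': 61},
    {'gender': 'MALE'},
]

table = collect_metadata(records, ['gender', 'age_at_diagnosis'])
print(table['gender'])             # ['FEMALE', 'MALE']
print(table['age_at_diagnosis'])   # [61, None]
```

Every list ends up with one entry per record, so the columns stay aligned across properties.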

## Can we save these locally?
Yes we can! They are already sitting in our Python variables, holding out for a hero - a data scientist hero!
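
As an alternative to writing each value by hand, the standard `csv` module can produce a tab-separated table. This sketch assumes Python 3, uses invented metadata, and writes into a temporary directory. Unlike the cell below, it puts the property name in the first column of each row so the table is self-describing.

```python
import csv
import os
import tempfile

# Invented metadata: one list of values per property, one value per file.
metadata = {
    'gender': ['FEMALE', 'MALE'],
    'age_at_diagnosis': [61, 58],
}
properties = ['gender', 'age_at_diagnosis']

out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'big_table.txt')

# One row per property: the property name, then one value per file.
with open(out_path, 'w', newline='') as handle:
    writer = csv.writer(handle, delimiter='\t')
    for prop in properties:
        writer.writerow([prop] + list(metadata[prop]))

# Read the table back to check the layout (all values come back as strings).
with open(out_path, newline='') as handle:
    rows = list(csv.reader(handle, delimiter='\t'))

print(rows)
```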

In [None]:
# put these in a directory
import os

dl_dir = 'tables/'
# make sure we have the download directory
if not os.path.isdir(dl_dir):
    os.mkdir(dl_dir)
    
    
# SAVE the index file  
with open(dl_dir + 'metadata.txt', 'w') as f_id:
    for md_prop in metadata_to_include:
        f_id.write(md_prop + ' \n')


# SAVE individual files (one file per property, one value per line)
for md_prop in metadata_to_include:
    with open(dl_dir + md_prop + '_index.txt', 'w') as f_id:
        for md in metadata[md_prop]:
            # str() leaves strings unchanged and converts everything else
            f_id.write(str(md) + ' \n')
    
    
# SAVE one big table (one row per property, one column per file)
with open(dl_dir + 'big_table.txt', 'w') as f_id:
    for md_prop in metadata_to_include:
        for md in metadata[md_prop]:
            # str() leaves strings unchanged and converts everything else
            f_id.write(str(md) + ' \t')
        f_id.write('\n')

## Write a table in a different project
Now we need to write this metadata somewhere. What better place than on the CGC?

#### PROTIPS
* The recipe for getting app inputs is [here](../../Recipes/CGC/apps_detailOne.ipynb)
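
Projects and apps are both looked up by name with the same check-then-index pattern. As a sketch, that pattern could be factored into a small helper; `index_by_name` is a hypothetical function, not part of apimethods.py, and the names in the example are made up. It raises `LookupError` with a descriptive message instead of printing and then raising separately.

```python
def index_by_name(names, target, kind='item'):
    """Return the position of `target` in `names`, or raise LookupError."""
    try:
        return names.index(target)
    except ValueError:
        raise LookupError('%s "%s" does not exist, please check name.'
                          % (kind, target))


# Made-up project names standing in for existing_projects.name.
project_names = ['Gene Expression', 'Keep on Smiling']

p_index = index_by_name(project_names, 'Keep on Smiling', kind='Project')
print(p_index)   # 1

try:
    index_by_name(project_names, 'No Such Project', kind='Project')
except LookupError as err:
    print(err)
```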

In [None]:
# [USER INPUT]
project_name = 'Keep on Smiling'
app_name = 'Table Maker'

# LIST all projects
existing_projects = API(path='projects')                              
if project_name in existing_projects.name:
    p_index = existing_projects.name.index(project_name)
    my_project = API(path=('projects/'+ existing_projects.id[p_index]))
else:
    print('Project does not exist, please check name.')
    raise KeyboardInterrupt
    
# LIST all apps in the project
my_apps = API(path='apps', query={'project': my_project.id, \
                                   'limit':100})
print('There are %i apps in project (%s)'\
      % (len(my_apps.name),my_project.name))
if app_name in my_apps.name:
    a_index = my_apps.name.index(app_name)
else:
    print('App does not exist, please check name.')
    raise KeyboardInterrupt
    
single_app = API(path=('apps/' + my_apps.id[a_index]))

## What we should do next
If I were able to wrap better apps, the next step would be to start a task that passes the metadata to the app:

In [None]:
single_app.raw['inputs']
in_list = [ii['id'][1:] for ii in single_app.raw['inputs']]
# print(in_list)

new_task = {'description': 'Create a table of metadata, started with hackathon_Metadata.ipynb',
            'name': 'Table Maker Run',
            'app': (single_app.id),
            'project': my_project.id,
            'inputs': {
                'f_name_out': 'awesome', 
                'properties': metadata_to_include, 
                'age_at_diagnosis': metadata['age_at_diagnosis'], 
                'aliquot_id': metadata['aliquot_id'], 
                'aliquot_uuid': metadata['aliquot_uuid'], 
                'case_id': metadata['case_id'], 
                'case_uuid': metadata['case_uuid'], 
                'data_format': metadata['data_format'], 
                'data_subtype': metadata['data_subtype'], 
                'data_type': metadata['data_type'], 
                'disease_type': metadata['disease_type'],
                'ethnicity': metadata['ethnicity'], 
                'experimental_strategy': metadata['experimental_strategy'],
                'gender': metadata['gender'], 
                'investigation': metadata['investigation'], 
                'platform': metadata['platform'], 
                'primary_site': metadata['primary_site'], 
                'race': metadata['race'], 
                'sample_id': metadata['sample_id'], 
                'sample_type': metadata['sample_type'], 
                'sample_uuid': metadata['sample_uuid'], 
                'vital_status': metadata['vital_status']
            }
}

my_task = API(method='POST', data=new_task, path='tasks/', query = {'action': 'run'})
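
The `inputs` dictionary above repeats every property name by hand, so it has to be edited whenever `metadata_to_include` changes. A possible alternative is to build it from the property list. The sketch below uses a shortened property list and invented values, and the `app` and `project` ids are placeholders rather than real identifiers.

```python
# Shortened, invented stand-ins for the notebook's variables.
metadata_to_include = ['gender', 'race']
metadata = {
    'gender': ['FEMALE', 'MALE'],
    'race': ['WHITE', 'ASIAN'],
}

# Fixed inputs first, then one entry per metadata property.
inputs = {'f_name_out': 'awesome', 'properties': metadata_to_include}
inputs.update((prop, metadata[prop]) for prop in metadata_to_include)

new_task = {
    'description': 'Create a table of metadata',
    'name': 'Table Maker Run',
    'app': 'placeholder-app-id',          # would be single_app.id
    'project': 'placeholder-project-id',  # would be my_project.id
    'inputs': inputs,
}

print(sorted(new_task['inputs'].keys()))
# ['f_name_out', 'gender', 'properties', 'race']
```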