# How do I get a list of all files that _match_ a particular metadata _property_?

### Overview
Here we focus on listing all files within a single project that **match** a particular metadata property. One _use-case_ that benefits greatly from this is:

 * I have _hundreds_ of files in my project
 * I want to run one or more tasks which only use _type X_ files
 * I want to query all _type X_ files with one call

The other examples of doing this (e.g. the _Organizing files into a Cohort_ cells [here](../../Tutorials/CGC/batch_SAMtoolsView.ipynb) or [here](../../Tutorials/CGC/thyroid.ipynb)) follow this general strategy:

 1. List all the files (n = _N_)
 2. Loop through the list
 3. Split off the file extension and see if it's _feasible_ (i.e. a file type you want to process)
 4. Get the metadata of any feasible files (as [here](files_detailOne.ipynb))
 5. If the property matches, add the file to a _list_ of files to process

This works fine, but it can require up to **_N_+1 API calls**: one to list the files, plus up to one per file to fetch its metadata. Here we show how to get the same result with only **one API call** and measure the speed improvement.
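To make the call counts concrete, here is a toy model of the two strategies. It is only a sketch: `FakeFileServer` is a hypothetical in-memory stand-in that counts calls and never touches the network, and the file names and metadata values are made up for illustration.

```python
# Toy in-memory model (no network access); it only counts "API calls".
# FakeFileServer and its methods are hypothetical, not part of apimethods.py.
class FakeFileServer:
    def __init__(self, files):
        self._files = files  # {file_id: {"name": ..., "metadata": {...}}}
        self.calls = 0

    def list_files(self):
        """One call that returns every file id."""
        self.calls += 1
        return list(self._files)

    def get_file(self, file_id):
        """One call per file to fetch its details (including metadata)."""
        self.calls += 1
        return self._files[file_id]

    def query_files(self, **metadata):
        """One call that filters by metadata on the 'server' side."""
        self.calls += 1
        return [fid for fid, f in self._files.items()
                if all(f["metadata"].get(k) == v for k, v in metadata.items())]


# Ten made-up files, half of them with vital_status == "Alive"
files = {
    "id%d" % i: {"name": "sample%d.bam" % i,
                 "metadata": {"vital_status": "Alive" if i % 2 == 0 else "Dead"}}
    for i in range(10)
}

# Strategy 1: list everything, then fetch each file's metadata (N+1 calls)
loop_server = FakeFileServer(files)
loop_matches = [fid for fid in loop_server.list_files()
                if loop_server.get_file(fid)["metadata"].get("vital_status") == "Alive"]

# Strategy 2: a single metadata query (1 call)
query_server = FakeFileServer(files)
query_matches = query_server.query_files(vital_status="Alive")

print("loop calls: %i, query calls: %i" % (loop_server.calls, query_server.calls))
```

Both strategies return the same five files here; only the number of calls differs (11 versus 1 for ten files).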

### Prerequisites
 1. You need to be a member (or owner) of _at least one_ project.
 2. You need your _authentication token_ and the API needs to know about it. See <a href="set_AUTH_TOKEN.ipynb">**set_AUTH_TOKEN.ipynb**</a> for details.
 3. You understand how to <a href="projects_listAll.ipynb" target="_blank">list</a> projects you are a member of (we will just use that call directly and pick one here).
 4. Your project should have _many files_ inside. **PROTIP** Complete the **Tutorial** [batch\_SAMtoolsView](../../Tutorials/CGC/batch_SAMtoolsView.ipynb) so you will have a project full of files. I used the same project name.
 
### Imports and Definitions
A single call is sufficient to get a file list. We will show two different options, both of which are defined in the apimethods.py file.

In [1]:
from defs.apimethods import *
import time as timer

## Use API() _object_
We will list all of your files, then compare two ways of filtering them:

 * using the query by metadata call
 * looping through them to check the metadata
 
Here we are checking two (hard-coded) metadata properties. It's possible to check as many as you'd like.
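Because any number of properties can be checked, the query dictionary can also be built in a loop rather than by hand. The helper below is a minimal sketch: `build_metadata_query` is a hypothetical name (it is not defined in apimethods.py), and the project id shown is a placeholder.

```python
def build_metadata_query(project_id, properties, values, limit=100):
    """Build a files query dict with one 'metadata.<property>' key per pair.

    Hypothetical helper for illustration; not part of apimethods.py.
    """
    if len(properties) != len(values):
        raise ValueError('properties and values must have the same length')
    query = {'project': project_id, 'limit': limit}
    for prop, value in zip(properties, values):
        query['metadata.' + prop] = value
    return query


# Placeholder project id; in the notebook this would be existing_projects.id[p_index]
query = build_metadata_query('my-username/my-project',
                             ['reference_genome', 'vital_status'],
                             ['HG19_Broad_variant', 'Alive'])
print(query)
```

The resulting dictionary has the same shape as the hard-coded `query` used in this recipe, so it could presumably be passed as `API(path='files', query=query)`.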

#### Note
The properties *age\_at\_diagnosis* and *days\_to\_death* are currently **not** working within the query. We are working to fix this anomaly.

In [2]:
# [USER INPUT] Set the project name and the metadata properties/values to match here:
project_name = 'Batch is Super'
metadata_property = ['reference_genome',
                     'vital_status'
                    ]
metadata_value = ['HG19_Broad_variant',
                  'Alive'
                 ]

# LIST all projects
existing_projects = API(path='projects')                              
if project_name in existing_projects.name:
    p_index = existing_projects.name.index(project_name)
else:
    print('project does not exist, please check name.')
    raise KeyboardInterrupt
    
# LIST all files in the project
my_files = API(path='files', query={'project': existing_projects.id[p_index],
                                    'limit': 100})
print('There are %i files in your project' % len(my_files.name))

# Query by metadata
T0 = timer.time()
my_matching_files = API(path='files', \
                        query={'project': existing_projects.id[p_index],
                               'limit':100,
                               ('metadata.' + metadata_property[0]):\
                               metadata_value[0],
                               ('metadata.' + metadata_property[1]):\
                               metadata_value[1]
                              })
     
print("""
There are %i files matching the metadata criteria.
Total query time (metadata query method) was %f seconds.""" \
      % (len(my_matching_files.name), timer.time()-T0))

There are 198 files in your project

There are 24 files matching the metadata criteria.
Total query time (metadata query method) was 1.228818 seconds.


In [3]:
# SINGLE-FILE method
T0 = timer.time()
file_list = []
for f_id in my_files.id:
    single_file = API(path=('files/' + f_id))
    if single_file.metadata[metadata_property[0]] == metadata_value[0] \
    and single_file.metadata[metadata_property[1]] == metadata_value[1] :
        file_list.append(single_file.name)

print("""
There are %i files matching the metadata criteria.
Total query time (single file method) was %f seconds.""" \
      % (len(file_list), timer.time()-T0))


There are 24 files matching the metadata criteria.
Total query time (single file method) was 246.575445 seconds.
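The speed improvement can be read off the two timings printed above. The short calculation below just plugs in those recorded numbers; your own timings will differ with network conditions and project size.

```python
# Timings and file count copied from the cell outputs above
n_files = 198
t_query = 1.228818    # seconds, metadata query method
t_loop = 246.575445   # seconds, single-file method

speedup = t_loop / t_query
seconds_per_file_call = t_loop / n_files

print('The metadata query was about %.0f times faster.' % speedup)
print('The single-file method averaged %.2f seconds per call.' % seconds_per_file_call)
```

In this run the single query took about as long as one single-file call, so the speed-up is roughly the number of files in the project.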


## Additional Information
Detailed documentation of this REST request is available [here](http://docs.cancergenomicscloud.org/docs/list-files-in-a-project).
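For orientation, the sketch below shows how the query parameters used in this recipe might look once encoded into a request URL. It only builds a string and makes no network call. The base URL and the project id are assumptions for illustration; check the documentation linked above for the actual endpoint and authentication details.

```python
from urllib.parse import urlencode

# ASSUMPTION: base URL of the API; verify against the official documentation
BASE_URL = 'https://cgc-api.sbgenomics.com/v2'

params = {
    'project': 'my-username/my-project',   # placeholder project id
    'limit': 100,
    'metadata.reference_genome': 'HG19_Broad_variant',
    'metadata.vital_status': 'Alive',
}

# urlencode percent-encodes reserved characters such as '/' in the project id
url = BASE_URL + '/files?' + urlencode(params)
print(url)
```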