# How do I get a list of all files that _match_ a particular metadata _property_?

### Overview
Here we focus on listing all files within a single project that **match** a particular metadata property. One _use-case_ which will benefit greatly from this is:

 * I have _hundreds_ of files in my project
 * I want to run one or more tasks which only use _type X_ files
 * I want to query all _type X_ files with one call

Our prior examples of doing this (e.g. the _Organizing files into a Cohort_ cells [here](https://github.com/sbg/okAPI/blob/advanced_access/Tutorials/CGC/batch_SAMtoolsView.ipynb)) followed the general strategy of:

 1. List all the files (n = _N_)
 2. Loop through the list
 3. Split off the file extension and see if it's _feasible_
 4. Get the metadata of any feasible files (as [here](files_detailOne.ipynb))
 5. If the property matches, add it to a _list_ of files to process

This works, but will result in up to **_N_+1 API calls**: one to list the files, plus one per feasible file to fetch its metadata. Here we will show how to do this with only **one API call** and show the speed improvement. If you run this code in the next ten minutes, we'll include the **special bonus** of searching for files using a list of names!
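To make the call counting concrete, here is a minimal sketch that uses a made-up, in-memory stand-in for the API. The `FakeApi` class and its `list_files`, `get_file`, and `query_files` methods are hypothetical and are not part of sevenbridges-python; they exist only to count how many requests each strategy would issue.

```python
# Hypothetical in-memory stand-in for an API client; it only counts calls.
class FakeApi(object):
    def __init__(self, files):
        self._files = files  # list of dicts: {'name': ..., 'metadata': {...}}
        self.calls = 0

    def list_files(self):
        """One call returns every file, but without its metadata."""
        self.calls += 1
        return [{'id': i, 'name': f['name']}
                for i, f in enumerate(self._files)]

    def get_file(self, file_id):
        """One call per file returns its details, including metadata."""
        self.calls += 1
        return self._files[file_id]

    def query_files(self, metadata):
        """One call returns only files matching every metadata property."""
        self.calls += 1
        return [f for f in self._files
                if all(f['metadata'].get(k) == v
                       for k, v in metadata.items())]


# Made-up files, for illustration only
files = [
    {'name': 'a.bam', 'metadata': {'experimental_strategy': 'WXS'}},
    {'name': 'b.bam', 'metadata': {'experimental_strategy': 'RNA-Seq'}},
    {'name': 'c.txt', 'metadata': {}},
    {'name': 'd.bam', 'metadata': {'experimental_strategy': 'WXS'}},
]
wanted = {'experimental_strategy': 'WXS'}

# Strategy 1: list, filter by extension, then fetch metadata file by file
loop_api = FakeApi(files)
loop_matches = []
for f in loop_api.list_files():
    if f['name'].endswith('.bam'):
        details = loop_api.get_file(f['id'])
        if all(details['metadata'].get(k) == v for k, v in wanted.items()):
            loop_matches.append(details['name'])

# Strategy 2: let the server do the metadata filtering in a single call
query_api = FakeApi(files)
query_matches = [f['name'] for f in query_api.query_files(wanted)]

print('loop strategy :', loop_matches, 'calls =', loop_api.calls)
print('query strategy:', query_matches, 'calls =', query_api.calls)
```

Both strategies find the same files, but the loop pays one extra call for every feasible file while the query pays one call in total.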

### Prerequisites
 1. You need your _authentication token_ and the API needs to know about it. See <a href="set_AUTH_TOKEN.ipynb">**set_AUTH_TOKEN.ipynb**</a> for details.
 2. You understand how to <a href="projects_listAll.ipynb" target="_blank">list</a> projects you are a member of (we will just use that call directly and pick one here).
 3. You have already cloned the Public Project _Cancer Cell Line Encyclopedia (CCLE)_.
 
## Imports
We import the official sevenbridges-python bindings (which provide the _Api_ class) below. We also grab the time library for crude benchmarking, and get fancy with future division so that `/` always returns a float, even on Python 2.

In [1]:
from __future__ import division  # must precede other statements in a regular Python file
import sevenbridges as sbg
import time as timer

## Initialize the object
The _Api_ object needs to know your **auth\_token** and the correct path. Here we assume you are using the .sbgrc file in your home directory. For other options, see <a href="Setup_API_environment.ipynb">Setup_API_environment.ipynb</a>.

In [2]:
# [USER INPUT] specify platform {cgc, sbpla, etc}
prof = 'sbpla'


config_file = sbg.Config(profile=prof)
api = sbg.Api(config=config_file)

## Search by metadata
This is the **optimal** way to query files matching particular metadata properties. Here, we check two (hard-coded) metadata properties, but it's possible to check as many as you'd like. We are going to use _Copy of Cancer Cell Line Encyclopedia (CCLE)_, which is a nice big project with **2579** files.

#### Notes:
 * The search by metadata function does **not** work with booleans or integers right now. This is a **known** bug so you **know** we are on it!

In [3]:
# [USER INPUT] Set metadata properties and values here; set project name
project_name = 'Copy of Cancer Cell Line Encyclopedia (CCLE)'
metadata_to_match = {'experimental_strategy': 'WXS',
                     'platform':'Illumina'
                    }


# Find project
my_project = [p for p in api.projects.query(limit=100).all()
              if p.name == project_name]

if not my_project:  # an empty list is falsy, so no project matched the name
    print('Target project ({}) not found, please check spelling'.format(project_name))
    raise KeyboardInterrupt
else:
    my_project = my_project[0]
    my_project = api.projects.get(id = my_project.id)

# How many files do we have?
my_files = api.files.query(project = my_project)
print('There are {} files in your project'.format(my_files.total))

# Query by metadata
T0 = timer.time()
my_matched_files = api.files.query(
    project=my_project, limit=100, 
    metadata = metadata_to_match)
 
print("""
There are {} files matching the metadata criteria.
This is {} percent of the dataset.
Total query time (metadata query method) 
was {} seconds."""
      .format(my_matched_files.total,
              100*(my_matched_files.total/my_files.total),
              timer.time()-T0))

There are 2579 files in your project

There are 654 files matching the metadata criteria.
This is 25.3586661497 percent of the dataset.
Total query time (metadata query method) 
was 0.383278846741 seconds.


## Loop through metadata
This is very likely the non-optimal approach - please take a look in the mirror and ask yourself "Do I really need this?" We are going to mimic the operation above, with a few approximations:

 * build a list of all the file names
 * randomly sample 100 of them
 * search files by **file names**
 * check the metadata of each file within the list individually
 * build a matched_file_list for any single file that matches
 
This would let us do things like check for booleans (bug), integers (bug), or **ranges**, none of which the metadata query currently supports.
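As a rough illustration of that flexibility, the sketch below checks plain values, booleans, and a numeric range on the client side. It assumes each file's metadata has already been copied into a plain Python dict; the helper names `matches` and `in_range`, and the `paired_end` and `age_at_diagnosis` fields, are invented for this example and are not part of the API.

```python
def matches(metadata, criteria):
    """Return True if every criterion is satisfied by the metadata dict.

    A criterion value is either a plain value (compared with ==) or a
    callable that receives the metadata value and returns True/False.
    """
    for key, expected in criteria.items():
        actual = metadata.get(key)
        if callable(expected):
            # Missing values never satisfy a predicate
            if actual is None or not expected(actual):
                return False
        elif actual != expected:
            return False
    return True


def in_range(low, high):
    """Build a predicate accepting numbers with low <= value <= high."""
    return lambda value: low <= value <= high


# Made-up metadata records, for illustration only
records = [
    {'platform': 'Illumina', 'paired_end': True, 'age_at_diagnosis': 42},
    {'platform': 'Illumina', 'paired_end': False, 'age_at_diagnosis': 67},
    {'platform': 'Illumina', 'paired_end': True},
]

criteria = {
    'platform': 'Illumina',                  # plain string match
    'paired_end': True,                      # boolean match
    'age_at_diagnosis': in_range(40, 50),    # range match
}

hits = [r for r in records if matches(r, criteria)]
print('{} of {} records match'.format(len(hits), len(records)))
```

The price of this flexibility is the per-file lookup needed to obtain each metadata dict in the first place, which is what the cells below measure.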

In [4]:
import numpy as np


# Build list of file names
f_names = [f.name for f in my_files.all()]
f_total = len(f_names)

# Randomly sample 100 file names to check (without replacement)
some_files = list(np.random.choice(
    f_names, size=100, replace=False))

# file list of only those 100 files
some_of_my_files = api.files.query(
    project = my_project, limit = 100,
    names = some_files)

print('Good news, we have taken {} random files'
      .format(some_of_my_files.total))

Good news, we have taken 100 random files


In [5]:
# (key, value) pairs to check; a list works on both Python 2 and 3
criteria = list(metadata_to_match.items())

T0 = timer.time()
file_list = []

for f in some_of_my_files:
    # One API call per file to fetch its details, including metadata
    single_f = api.files.get(id = f.id)
    # Keep the file only if every metadata property matches
    if all(single_f.metadata[k] == v for k, v in criteria):
        file_list.append(single_f)

print("""
There are {} files matching the metadata criteria.
This is {} percent of the sampled files.
Total query time (single file method)
was {} seconds."""
      .format(len(file_list),
              100*(len(file_list)/len(some_files)),
              timer.time()-T0))


There are 22 files matching the metadata criteria.
This is 22.0 percent of the sampled files.
Total query time (single file method)
was 18.7982280254 seconds.


## Comparison
There is some _randomness_ here, but running this a few times I've found about a **factor of 40-50x** speed advantage to _including metadata in the query_ rather than _checking files individually_. Keep in mind that only 100 individual files were checked. The per-file approach would scale **very poorly** if all 2579 files were checked.
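The arithmetic behind that comparison can be sketched with the timings printed above. The extrapolation to the whole project is only a back-of-the-envelope estimate: it assumes the per-file cost stays constant (linear scaling), which was not measured here.

```python
# Timings and counts copied from the outputs recorded above
t_query = 0.383278846741   # seconds for the single metadata query
t_loop = 18.7982280254     # seconds for 100 per-file lookups
n_sampled = 100            # files checked individually
n_total = 2579             # files in the project

# Observed speedup for this one run
speedup = t_loop / t_query

# ASSUMPTION: per-file cost is constant, so total time scales linearly
per_file = t_loop / n_sampled
t_loop_all = per_file * n_total

print('Observed speedup: about {:.0f}x'.format(speedup))
print('Per-file lookup cost: about {:.2f} seconds'.format(per_file))
print('Estimated time to check all {} files individually: '
      'about {:.0f} seconds'.format(n_total, t_loop_all))
```

Under that assumption, checking every file individually would take minutes rather than a fraction of a second.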

## Additional Information
Detailed documentation of this REST API request is available [here](http://docs.cancergenomicscloud.org/docs/list-files-in-a-project).