# Xtract: Extracting Metadata from heterogeneous files stored across various computing machinery

### In this demo, we show that we can extract metadata from a distributed data collection:
* Materials Data Facility files stored at NCSA (National Center for Supercomputing Applications, UIUC), accessed via Globus HTTPS
* Personal free text data from my Google Drive

All metadata extractions are done in two phases. The first phase (the crawl) creates a manifest of the physical attributes of each file in its native location (e.g., POSIX file system, Globus, Google Drive, Box (in development), ...). The second phase, the extraction, applies metadata extractors to the files based on what was learned from the crawl, and will transfer files to other computing machinery if required by the user. 
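
To make the first phase concrete, here is a minimal sketch of a crawl over a local POSIX directory. It is not the Xtract crawler: the helper name `crawl_posix` is hypothetical, the `'POSIX'` label is assumed by analogy with `'GLOBUS'`, and the field names (`size_b`, `extension`, `path_type`) are borrowed from the file metadata shown later in this demo.

```python
import os


def crawl_posix(root):
    """Walk `root` and record physical attributes of every file (hypothetical helper)."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Files without an extension (e.g. INCAR) get None.
            ext = os.path.splitext(name)[1].lstrip(".") or None
            manifest[path] = {
                "size_b": os.path.getsize(path),
                "extension": ext,
                "path_type": "POSIX",  # assumed label, by analogy with 'GLOBUS'
            }
    return manifest
```

The second phase would then read a manifest like this one to decide which extractor to apply to each file, without having to touch the files again.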

In [122]:
import os 
import json
import pickle
import requests

# Globus Auth imports
from fair_research_login import NativeClient


# Google Drive Auth imports
from googleapiclient.discovery import build
from google.auth.transport.requests import Request
from google_auth_oauthlib.flow import InstalledAppFlow

# Step 1: Crawling

### First, we want to *crawl* all of the data on our Globus HTTPS-accessible endpoint containing our materials data.

We begin by parameterizing the crawl: the Globus endpoint containing the data (`source_ep_id`), the relative path on that endpoint at which the data are stored (`source_ep_path`), and the funcX \[1\] endpoint on which we want to run the extractions after we create our initial manifest (`funcx_ep_id`).

We should also define a "grouper". In the context of metadata extraction, a **group** is any collection of files that can be treated as a single 'data' entity. Groups are not mutually exclusive: a file may belong to more than one group. In this case we use MaterialsIO \[2\], which leverages file extensions to link related files together (such as the files belonging to one DFT calculation).

\[1\] Every machine on which we are running extractions should have a funcX endpoint installed and 'Active' (https://funcx.readthedocs.io/en/latest/)

\[2\] MaterialsIO can be found here: https://github.com/materials-data-facility/MaterialsIO

In [123]:
# First, we'll define the Globus (and fx) endpoint on which we want to extract. 

source_ep_path = "/thurston_selfassembled_peptide_spectra_v1.1/DFT/MoleculeConfigs/di_30_-10.xyz/"
source_ep_id = "82f1b5c6-6e9b-11e5-ba47-22000b92c6ec"
funcx_ep_id = "82ceed9f-dce1-4dd1-9c45-6768cf202be8"

base_url = "https://data.materialsdatafacility.org"
grouper = "matio"  

# eb_url = "http://127.0.0.1:5000"
eb_url = "http://xtract-crawler-4.eba-ghixpmdf.us-east-1.elasticbeanstalk.com"

Now that we've defined our Globus HTTPS-accessible data path, we must authenticate with Globus so that we can read file metadata, download files, and run funcX on our desired endpoint. 

In [124]:
client = NativeClient(client_id='7414f0b4-7d05-4bb6-bb00-076fa3f17cf5')
tokens = client.login(
    requested_scopes=['https://auth.globus.org/scopes/56ceac29-e98a-440a-a594-b41e7a084b62/all', 
                      'urn:globus:auth:scope:transfer.api.globus.org:all',
                     "https://auth.globus.org/scopes/facd7ccc-c5f4-42aa-916b-a0e270e2c2a9/all", 
                     'email', 'openid'],
    no_local_server=True,
    no_browser=True)

auth_token = tokens["petrel_https_server"]['access_token']
transfer_token = tokens['transfer.api.globus.org']['access_token']
funcx_token = tokens['funcx_service']['access_token']

headers = {'Authorization': f"Bearer {auth_token}", 'Transfer': transfer_token, 'FuncX': funcx_token, 'Petrel': auth_token}
print(f"Headers: {headers}")

Headers: {'Authorization': 'Bearer AgkW75JJoV6wmON7QlpX326vW8M6lrzg1V1OGyPG6njMrweO5wuOCwWyQz9NMJ1q8mKbEmEvMo66J1Cjdm32phYVw7', 'Transfer': 'Ag6pzYn2NjWPG1zDNO341E06zJ3kQ9wg2vxM9XXOpPVgryWzJms2C9zOd2zegga4y642Q2emlqlkGkh1qey37SV6Pe', 'FuncX': 'AgO9Kl1aWgvrJ0w3aq4m7w2p63qJjV5a0Kpy2Xp0Oza6Pxzd9MsnCQax6o44kdvX3xwB3pgG01j6piezgWk8soXy8', 'Petrel': 'AgkW75JJoV6wmON7QlpX326vW8M6lrzg1V1OGyPG6njMrweO5wuOCwWyQz9NMJ1q8mKbEmEvMo66J1Cjdm32phYVw7'}


In [125]:
crawl_url = f'{eb_url}/crawl'
print(f"Crawl URL is : {crawl_url}")
crawl_req = requests.post(crawl_url, json={
    'repo_type': "GLOBUS", 'eid': source_ep_id, 'dir_path': source_ep_path,
    'Transfer': transfer_token, 'Authorization': funcx_token,
    'grouper': grouper, 'https_info': {'base_url': base_url}})
print(crawl_req.content)
globus_crawl_id = json.loads(crawl_req.content)['crawl_id']
# Print the ID we just received (not a stale `crawl_id` left over from an earlier run).
print(f"Crawl ID: {globus_crawl_id}")

Crawl URL is : http://xtract-crawler-4.eba-ghixpmdf.us-east-1.elasticbeanstalk.com/crawl
b'{"crawl_id":"6aa64b7e-cdf9-4729-9588-bf1806774678"}\n'
Crawl ID: 6aa64b7e-cdf9-4729-9588-bf1806774678


We can **periodically poll for the crawl status** below. When crawling is completed, it will be denoted under the `crawl_status` field of the returned JSON. 

In [127]:
crawl_status = requests.get(f'{eb_url}/get_crawl_status', json={'crawl_id': globus_crawl_id})
print(crawl_status)
crawl_content = json.loads(crawl_status.content)
print(f"Crawl Status: {crawl_content}")

<Response [200]>
Crawl Status: {'bytes_crawled': 33760192, 'crawl_id': '6aa64b7e-cdf9-4729-9588-bf1806774678', 'crawl_status': 'SUCCEEDED', 'files_crawled': 8, 'groups_crawled': 80}
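
Rather than re-running the status cell by hand, the polling can be wrapped in a small loop. This is a sketch only: the helper names `fetch_status` and `wait_for_crawl` are hypothetical, `'SUCCEEDED'` and `'COMPLETED'` are the two finished states that appear in this demo's outputs, and `'FAILED'` is an assumed name for an error state that the service may or may not use.

```python
import time

# 'SUCCEEDED' and 'COMPLETED' appear in this demo's outputs;
# 'FAILED' is an assumption about how errors are reported.
TERMINAL_STATES = {"SUCCEEDED", "COMPLETED", "FAILED"}


def fetch_status(eb_url, crawl_id):
    """Issue the same request as the cell above and return the parsed JSON."""
    import requests  # imported here so the polling logic has no hard network dependency

    resp = requests.get(f"{eb_url}/get_crawl_status", json={"crawl_id": crawl_id})
    resp.raise_for_status()
    return resp.json()


def wait_for_crawl(eb_url, crawl_id, interval_s=2.0, timeout_s=600.0, fetch=fetch_status):
    """Poll until `crawl_status` is terminal, or raise TimeoutError after `timeout_s`."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch(eb_url, crawl_id)
        if status.get("crawl_status") in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Crawl {crawl_id} still running after {timeout_s}s")
        time.sleep(interval_s)
```

With the variables defined earlier, a call would look like `wait_for_crawl(eb_url, globus_crawl_id)`. The `fetch` parameter exists so the loop can be exercised with a stub instead of the live service.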


## Google Drive 

### Now let's say the second half of our file collection is on our personal Google Drive. 

#### The requests to the Xtract crawler service will look roughly the same as in the Globus case, except (1) the auth credentials will use Google OAuth 2, and (2) we will use the `simple_ext` grouper that groups files based on extension (or Google doc type). 

Crawling a sub-component of one's Google Drive is not yet supported: you must crawl the entire repo. 
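
As a rough illustration of extension-based grouping, the sketch below assigns each file to an extension family and tallies the result. The family names mirror the `first_ext_tallies` keys that the crawler reports in its status output; the extension-to-family table and the helper names are hypothetical, and whether `simple_ext` works this way internally is an assumption.

```python
import os
from collections import Counter

# Hypothetical table: the family names come from the crawler's status output,
# the extensions listed under each family are guesses for illustration.
EXT_FAMILIES = {
    "txt": "text", "md": "text",
    "csv": "tabular", "xlsx": "tabular",
    "png": "images", "jpg": "images",
    "ppt": "presentation", "pptx": "presentation",
    "zip": "compressed", "gz": "compressed",
    "h5": "hierarch", "nc": "hierarch",
}


def family_of(name):
    """Return the extension family for a file name, or 'other' if unknown."""
    ext = os.path.splitext(name)[1].lstrip(".").lower()
    return EXT_FAMILIES.get(ext, "other")


def tally_families(names):
    """Count how many files fall into each extension family."""
    return Counter(family_of(n) for n in names)
```

Native Google documents have no file extension, which is presumably why the grouper also consults the Google doc type; that case is not modeled here.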

In [146]:
# We must first load the authentication credentials for our application and walk through the flow. 
project_id = os.environ["goog_project_id"]
client_id = os.environ["goog_client_id"]
client_secret = os.environ["goog_client_secret"]

# Add the secret stuff to our credentials document...
with open("config.json", "r") as f:
    creds = json.load(f)
print(creds)
creds['web']['client_id'] = client_id
creds['web']['client_secret'] = client_secret
creds['web']['project_id'] = project_id

# And write it.
with open("credentials.json", "w") as f: 
    json.dump(creds, f)

SCOPES = ['https://www.googleapis.com/auth/drive.metadata.readonly', 'https://www.googleapis.com/auth/drive.readonly']

# Adapted from the Google Quickstart docs
# https://developers.google.com/drive/api/v3/quickstart/python
def do_login_flow():
    creds = None
    # The file token.pickle stores the user's access and refresh tokens, and is
    # created automatically when the authorization flow completes for the first
    # time.
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            creds = pickle.load(token)

    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', SCOPES)
            creds = flow.run_local_server(port=0)
        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(creds, token)

    return creds, None  # Returning None because Tyler can't figure out how he wants to structure this yet. 

# THIS should force-open a Google Auth window in your local browser. If not, you can manually copy-paste it. 
auth_creds = do_login_flow()

# Now delete the file so you don't accidentally `git add` it. 
os.remove("credentials.json")

{'web': {'auth_uri': 'https://accounts.google.com/o/oauth2/auth', 'token_uri': 'https://oauth2.googleapis.com/token', 'auth_provider_x509_cert_url': 'https://www.googleapis.com/oauth2/v1/certs', 'redirect_uris': ['http://localhost', 'http://localhost/potato', 'http://localhost:8080/Callback'], 'javascript_origins': ['http://localhost:8080']}}
Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=364500245041-r1eebsermd1qp1qo68a3qp09hhpc5dfi.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A51314%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.metadata.readonly+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.readonly&state=Aqhq6BMWa7tGW4dBuvgfp5lm4lJYYK&access_type=offline


In [129]:
r = requests.post(url=f"{eb_url}/crawl",
                  data=pickle.dumps({'auth_creds': auth_creds, 'repo_type': 'GDRIVE'}) )

crawl_mdata = json.loads(r.content)
print(f"Crawl Started! Here's your trackable crawl_id:\n {crawl_mdata}")
gdrive_crawl_id = crawl_mdata['crawl_id']

Crawl Started! Here's your trackable crawl_id:
 {'crawl_id': 'b2ebce7a-3a18-4493-8c2f-bb45e9819e8f'}


In [140]:
r = requests.get(url=f'{eb_url}/get_crawl_status', json={'crawl_id': gdrive_crawl_id})
print(json.loads(r.content))

{'crawl_end_t': 1595955695.994185, 'crawl_start_t': 1595955657.875675, 'crawl_status': 'COMPLETED', 'gdrive_mdata': {'doc_types': {'is_gdoc': 1152, 'is_user_upload': 2035}, 'first_ext_tallies': {'compressed': 6, 'hierarch': 1, 'images': 377, 'other': 321, 'presentation': 151, 'tabular': 203, 'text': 2128}}, 'groups_crawled': 3187, 'n_commit_threads': 5, 'repo_type': 'GDrive', 'total_crawl_time': 38.118510007858276}


# Step 2: Extraction

At this point we have two manifests: one for Globus and one for Google Drive. Now we want to send both of them to the 'Xtract' service.

We first need to define where we would like our metadata to go. I will choose a folder on Petrel, hosted at ANL.

The last step of 'Extraction' is **validation**. Currently, validation removes all empty or 'None' metadata attributes. Soon it will also optionally match metadata against a schema (e.g., discarding metadata that do not match the schema).
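
One possible reading of the "remove empty or 'None' attributes" step is a recursive filter over the metadata dictionary. The sketch below is an assumption about the behavior, not the Xtract validator itself, and the helper names `strip_empty` and `_is_empty` are hypothetical.

```python
def _is_empty(value):
    # 0 and False are real values, so test for emptiness explicitly.
    return value is None or value == "" or value == [] or value == {}


def strip_empty(obj):
    """Recursively drop None values, empty strings, empty lists and empty dicts."""
    if isinstance(obj, dict):
        cleaned = {k: strip_empty(v) for k, v in obj.items()}
        return {k: v for k, v in cleaned.items() if not _is_empty(v)}
    if isinstance(obj, list):
        cleaned = [strip_empty(v) for v in obj]
        return [v for v in cleaned if not _is_empty(v)]
    return obj
```

Children are cleaned before their parent is tested, so a dictionary that contains only empty values disappears along with them.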


In [141]:
# Here we will send up two requests to initialize both manifest extractions. 
petrel_ep_id = "4f99675c-ac1f-11ea-bee8-0e716405a293"
label_by = ["crawl_id", "datetime"]  # We first want the crawl_id followed by the 'datetime' for each file attribute.  
petrel_mdata_path = "/demo/metadata"

In [142]:
# Site 1: 
# xtract_globus = requests.post(f'{eb_url}/extract', json={'crawl_id': globus_crawl_id,
#                                                   'repo_type': "HTTPS",
#                                                   'headers': json.dumps(headers),
#                                                   'funcx_eid': funcx_ep_id, 
#                                                   'source_eid': source_ep_id,
#                                                   'dest_eid': petrel_ep_id,
#                                                   'mdata_store_path': petrel_mdata_path})
# print(f"Xtract response (should be 200): {xtract_globus}")

In [143]:
# Site 2:
# xtract_gdrive = requests.post(f'{eb_url}/extract', data=pickle.dumps({
#                                                   'crawl_id': gdrive_crawl_id,
#                                                   'gdrive_pkl': auth_creds,
#                                                   'headers': json.dumps(headers),
#                                                   'funcx_eid': funcx_ep_id, 
#                                                   'source_eid': source_ep_id,
#                                                   'dest_eid': petrel_ep_id,
#                                                   'mdata_store_path': petrel_mdata_path,
#                                                   'label_by': label_by}))
# print(f"Xtract response (should be 200): {xtract_gdrive}")

At this point we've sent up a couple of extractions. However, extraction is time-consuming and not very demo-friendly (especially since we're transferring all files). Let's jump into some 'out of the oven' file metadata.

# "Out of the Oven" -- seeing our metadata. 

Let's say we just logged onto Petrel and we see a bunch of our metadata objects. Each metadata object includes the physical metadata of each file in the group, plus any group-level metadata produced by the extractors.
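
Because each object is a plain nested dictionary, group-level values can be read with a few lookups. The sketch below collects every property that carries a single scalar. The key names (`metadata`, `properties`, `scalars`, `units`) follow the DFT record shown in this section; the helper `scalar_properties` is hypothetical, and the sample is a trimmed-down copy of that record with paths shortened to base names.

```python
def scalar_properties(mdata):
    """Return {property name: (value, units)} for properties with exactly one scalar."""
    out = {}
    for prop in mdata.get("metadata", {}).get("properties", []):
        scalars = prop.get("scalars", [])
        if len(scalars) == 1:
            out[prop["name"]] = (scalars[0]["value"], prop.get("units"))
    return out


# Trimmed-down sample; values copied from the DFT record in this demo.
sample = {
    "group": ["INCAR", "OUTCAR", "POSCAR"],
    "metadata": {
        "properties": [
            {"name": "Converged", "scalars": [{"value": True}]},
            {"name": "Total Energy", "scalars": [{"value": -387.69146755}], "units": "eV"},
            {"name": "Pressure", "scalars": [{"value": 0.45}], "units": "kbar"},
            # Vector-valued properties are skipped by scalar_properties.
            {"name": "Positions", "vectors": [[{"value": 6.04}, {"value": 7.91}, {"value": 5.23}]]},
        ],
    },
}

for name, (value, units) in scalar_properties(sample).items():
    print(f"{name}: {value} {units or ''}".rstrip())
```

The same function can be applied to the full `dft_mdata` object defined in the next cell.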

In [144]:
# Materials (MDF) metadata (for DFT calculation)
dft_mdata = {'group': ['/thurston_selfassembled_peptide_spectra_v1.1/DFT/MoleculeConfigs/di_30_-10.xyz/INCAR', '/thurston_selfassembled_peptide_spectra_v1.1/DFT/MoleculeConfigs/di_30_-10.xyz/OUTCAR', '/thurston_selfassembled_peptide_spectra_v1.1/DFT/MoleculeConfigs/di_30_-10.xyz/POSCAR'], 'metadata': {'properties': [{'name': 'Converged', 'scalars': [{'value': True}], 'conditions': [{'name': 'XC Functional', 'scalars': [{'value': 'PAW_PBE'}]}, {'name': 'Cutoff Energy', 'scalars': [{'value': 550.0}], 'units': 'eV'}, {'name': 'k-Points per Reciprocal Atom', 'scalars': [{'value': 60.0}]}, {'name': 'Pseudopotentials', 'vectors': [[{'value': 'C'}, {'value': 'H'}, {'value': 'S'}]]}], 'methods': [{'name': 'Density Functional Theory', 'software': [{'name': 'VASP', 'version': '5.3.3'}]}], 'dataType': 'COMPUTATIONAL'}, {'name': 'Total Energy', 'scalars': [{'value': -387.69146755}], 'units': 'eV', 'conditions': [{'name': 'XC Functional', 'scalars': [{'value': 'PAW_PBE'}]}, {'name': 'Cutoff Energy', 'scalars': [{'value': 550.0}], 'units': 'eV'}, {'name': 'k-Points per Reciprocal Atom', 'scalars': [{'value': 60.0}]}, {'name': 'Pseudopotentials', 'vectors': [[{'value': 'C'}, {'value': 'H'}, {'value': 'S'}]]}], 'methods': [{'name': 'Density Functional Theory', 'software': [{'name': 'VASP', 'version': '5.3.3'}]}], 'dataType': 'COMPUTATIONAL'}, {'name': 'Pressure', 'scalars': [{'value': 0.45}], 'units': 'kbar', 'conditions': [{'name': 'XC Functional', 'scalars': [{'value': 'PAW_PBE'}]}, {'name': 'Cutoff Energy', 'scalars': [{'value': 550.0}], 'units': 'eV'}, {'name': 'k-Points per Reciprocal Atom', 'scalars': [{'value': 60.0}]}, {'name': 'Pseudopotentials', 'vectors': [[{'value': 'C'}, {'value': 'H'}, {'value': 'S'}]]}], 'methods': [{'name': 'Density Functional Theory', 'software': [{'name': 'VASP', 'version': '5.3.3'}]}], 'dataType': 'COMPUTATIONAL'}, {'name': 'Positions', 'vectors': [[{'value': 6.04}, {'value': 7.91}, {'value': 5.23}], [{'value': 6.56}, {'value': 6.64}, {'value': 
5.26}], [{'value': 7.97}, {'value': 6.61}, {'value': 5.32}], [{'value': 8.6}, {'value': 7.84}, {'value': 5.43}], [{'value': 9.99}, {'value': 8.22}, {'value': 5.45}], [{'value': 10.53}, {'value': 9.49}, {'value': 5.3}], [{'value': 11.93}, {'value': 9.41}, {'value': 5.18}], [{'value': 12.41}, {'value': 8.12}, {'value': 5.35}], [{'value': 13.81}, {'value': 7.76}, {'value': 5.26}], [{'value': 14.38}, {'value': 6.5}, {'value': 5.21}], [{'value': 15.79}, {'value': 6.53}, {'value': 5.25}], [{'value': 16.4}, {'value': 7.77}, {'value': 5.27}], [{'value': 20.29}, {'value': 7.99}, {'value': 5.38}], [{'value': 19.77}, {'value': 9.28}, {'value': 5.45}], [{'value': 18.36}, {'value': 9.39}, {'value': 5.47}], [{'value': 17.79}, {'value': 8.13}, {'value': 5.37}], [{'value': 6.11}, {'value': 9.07}, {'value': 8.27}], [{'value': 6.4}, {'value': 7.72}, {'value': 8.29}], [{'value': 7.79}, {'value': 7.45}, {'value': 8.35}], [{'value': 8.62}, {'value': 8.55}, {'value': 8.45}], [{'value': 10.06}, {'value': 8.68}, {'value': 8.47}], [{'value': 10.81}, {'value': 9.84}, {'value': 8.32}], [{'value': 12.17}, {'value': 9.52}, {'value': 8.19}], [{'value': 12.42}, {'value': 8.17}, {'value': 8.36}], [{'value': 13.74}, {'value': 7.57}, {'value': 8.26}], [{'value': 14.08}, {'value': 6.23}, {'value': 8.21}], [{'value': 15.47}, {'value': 6.01}, {'value': 8.24}], [{'value': 16.29}, {'value': 7.13}, {'value': 8.26}], [{'value': 20.16}, {'value': 6.67}, {'value': 8.35}], [{'value': 19.87}, {'value': 8.03}, {'value': 8.42}], [{'value': 18.5}, {'value': 8.38}, {'value': 8.45}], [{'value': 17.72}, {'value': 7.24}, {'value': 8.35}], [{'value': 5.0}, {'value': 8.22}, {'value': 5.22}], [{'value': 5.99}, {'value': 5.71}, {'value': 5.21}], [{'value': 8.49}, {'value': 5.65}, {'value': 5.31}], [{'value': 9.9}, {'value': 10.37}, {'value': 5.21}], [{'value': 12.58}, {'value': 10.26}, {'value': 5.0}], [{'value': 13.72}, {'value': 5.64}, {'value': 5.19}], [{'value': 16.34}, {'value': 5.6}, {'value': 5.32}], [{'value': 
21.36}, {'value': 7.82}, {'value': 5.41}], [{'value': 20.41}, {'value': 10.14}, {'value': 5.58}], [{'value': 17.81}, {'value': 10.33}, {'value': 5.45}], [{'value': 5.14}, {'value': 9.55}, {'value': 8.26}], [{'value': 5.68}, {'value': 6.91}, {'value': 8.24}], [{'value': 8.13}, {'value': 6.41}, {'value': 8.33}], [{'value': 10.34}, {'value': 10.82}, {'value': 8.23}], [{'value': 12.96}, {'value': 10.25}, {'value': 8.01}], [{'value': 13.28}, {'value': 5.5}, {'value': 8.19}], [{'value': 15.85}, {'value': 5.0}, {'value': 8.31}], [{'value': 21.18}, {'value': 6.32}, {'value': 8.37}], [{'value': 20.65}, {'value': 8.77}, {'value': 8.55}], [{'value': 18.12}, {'value': 9.41}, {'value': 8.44}], [{'value': 7.35}, {'value': 9.04}, {'value': 5.37}], [{'value': 11.16}, {'value': 6.95}, {'value': 5.57}], [{'value': 15.09}, {'value': 8.92}, {'value': 5.28}], [{'value': 18.97}, {'value': 6.87}, {'value': 5.31}], [{'value': 7.6}, {'value': 9.95}, {'value': 8.4}], [{'value': 10.99}, {'value': 7.23}, {'value': 8.58}], [{'value': 15.2}, {'value': 8.49}, {'value': 8.28}], [{'value': 18.67}, {'value': 5.8}, {'value': 8.28}]], 'conditions': [{'name': 'XC Functional', 'scalars': [{'value': 'PAW_PBE'}]}, {'name': 'Cutoff Energy', 'scalars': [{'value': 550.0}], 'units': 'eV'}, {'name': 'k-Points per Reciprocal Atom', 'scalars': [{'value': 60.0}]}, {'name': 'Pseudopotentials', 'vectors': [[{'value': 'C'}, {'value': 'H'}, {'value': 'S'}]]}], 'methods': [{'name': 'Density Functional Theory', 'software': [{'name': 'VASP', 'version': '5.3.3'}]}], 'dataType': 'COMPUTATIONAL'}, {'name': 'Forces', 'vectors': [[{'value': 0.863256}, {'value': 0.086954}, {'value': 0.011232}], [{'value': 0.058336}, {'value': -0.38005}, {'value': -0.407159}], [{'value': 0.369395}, {'value': -0.840704}, {'value': 0.078283}], [{'value': -0.989989}, {'value': 0.115059}, {'value': -0.529506}], [{'value': -1.177388}, {'value': -0.4077}, {'value': -0.234922}], [{'value': 0.206971}, {'value': -0.846463}, {'value': -0.520037}], 
[{'value': -0.419666}, {'value': 0.061881}, {'value': 0.067498}], [{'value': 1.432998}, {'value': 0.242094}, {'value': -0.94118}], [{'value': 0.207165}, {'value': -0.136524}, {'value': -0.280707}], [{'value': 0.043085}, {'value': 0.513096}, {'value': -0.12301}], [{'value': -0.038375}, {'value': -1.108488}, {'value': -0.329525}], [{'value': -1.238444}, {'value': -0.008117}, {'value': -0.177893}], [{'value': -1.132169}, {'value': 1.310382}, {'value': -0.184788}], [{'value': 0.32348}, {'value': -0.023265}, {'value': 0.11206}], [{'value': 0.547953}, {'value': -0.525569}, {'value': -0.608435}], [{'value': -0.72414}, {'value': 0.039158}, {'value': -0.379669}], [{'value': 0.973983}, {'value': -0.417712}, {'value': 0.600017}], [{'value': 0.257641}, {'value': 0.277942}, {'value': 0.095813}], [{'value': -0.119324}, {'value': -0.93163}, {'value': 0.444111}], [{'value': -0.648758}, {'value': 0.480501}, {'value': 0.07564}], [{'value': -1.421623}, {'value': 0.001043}, {'value': 0.438738}], [{'value': -0.185929}, {'value': -0.663209}, {'value': 0.103826}], [{'value': -0.079881}, {'value': 0.354541}, {'value': 0.767929}], [{'value': 1.725866}, {'value': -0.128846}, {'value': -0.179765}], [{'value': -0.025553}, {'value': -0.119867}, {'value': 0.525276}], [{'value': 0.072923}, {'value': 0.349377}, {'value': 0.527388}], [{'value': -0.153059}, {'value': -0.992536}, {'value': 0.351901}], [{'value': -1.478067}, {'value': -0.048776}, {'value': 0.568025}], [{'value': -0.963163}, {'value': 1.422391}, {'value': 0.338857}], [{'value': 0.381117}, {'value': -0.136229}, {'value': 0.60701}], [{'value': 0.378182}, {'value': -0.434394}, {'value': -0.18893}], [{'value': -0.649249}, {'value': 0.034309}, {'value': 0.142296}], [{'value': 0.073998}, {'value': 0.156906}, {'value': -0.03658}], [{'value': -0.153496}, {'value': 0.182746}, {'value': -0.008696}], [{'value': 0.193143}, {'value': 0.104435}, {'value': 0.007201}], [{'value': 0.10084}, {'value': 0.129564}, {'value': -0.042539}], [{'value': 
0.114153}, {'value': 0.070023}, {'value': 0.003411}], [{'value': 0.29518}, {'value': -0.43871}, {'value': 0.010627}], [{'value': 0.329391}, {'value': -0.188561}, {'value': -0.117654}], [{'value': -0.089627}, {'value': -0.680694}, {'value': -0.020625}], [{'value': 0.299236}, {'value': 0.2895}, {'value': -0.106121}], [{'value': -0.031666}, {'value': -0.091318}, {'value': 0.138207}], [{'value': 0.012084}, {'value': 0.219448}, {'value': -0.046172}], [{'value': -0.23013}, {'value': 0.059022}, {'value': -0.031376}], [{'value': 0.203569}, {'value': 0.191906}, {'value': 0.010944}], [{'value': 0.182022}, {'value': -0.039957}, {'value': -0.034154}], [{'value': -0.008499}, {'value': -0.083097}, {'value': 0.014616}], [{'value': 0.16469}, {'value': -0.51989}, {'value': 0.004421}], [{'value': 0.30964}, {'value': -0.295606}, {'value': -0.142287}], [{'value': -0.023936}, {'value': -0.723566}, {'value': -0.028907}], [{'value': 0.24721}, {'value': 0.149575}, {'value': -0.15134}], [{'value': 0.061727}, {'value': -0.388801}, {'value': 0.126646}], [{'value': -0.600474}, {'value': 0.372887}, {'value': -0.61}], [{'value': 0.260705}, {'value': 0.739735}, {'value': -1.005127}], [{'value': 0.168658}, {'value': 1.288811}, {'value': -0.856089}], [{'value': 0.832946}, {'value': 0.314949}, {'value': -0.571667}], [{'value': -0.447386}, {'value': 0.802511}, {'value': 0.50578}], [{'value': 0.306996}, {'value': 0.779622}, {'value': 0.857982}], [{'value': 0.419567}, {'value': 0.859536}, {'value': 0.881659}], [{'value': 0.611884}, {'value': -0.399626}, {'value': 0.477469}]], 'conditions': [{'name': 'positions', 'vectors': [[{'value': 6.04}, {'value': 7.91}, {'value': 5.23}], [{'value': 6.56}, {'value': 6.64}, {'value': 5.26}], [{'value': 7.97}, {'value': 6.61}, {'value': 5.32}], [{'value': 8.6}, {'value': 7.84}, {'value': 5.43}], [{'value': 9.99}, {'value': 8.22}, {'value': 5.45}], [{'value': 10.53}, {'value': 9.49}, {'value': 5.3}], [{'value': 11.93}, {'value': 9.41}, {'value': 5.18}], [{'value': 
12.41}, {'value': 8.12}, {'value': 5.35}], [{'value': 13.81}, {'value': 7.76}, {'value': 5.26}], [{'value': 14.38}, {'value': 6.5}, {'value': 5.21}], [{'value': 15.79}, {'value': 6.53}, {'value': 5.25}], [{'value': 16.4}, {'value': 7.77}, {'value': 5.27}], [{'value': 20.29}, {'value': 7.99}, {'value': 5.38}], [{'value': 19.77}, {'value': 9.28}, {'value': 5.45}], [{'value': 18.36}, {'value': 9.39}, {'value': 5.47}], [{'value': 17.79}, {'value': 8.13}, {'value': 5.37}], [{'value': 6.11}, {'value': 9.07}, {'value': 8.27}], [{'value': 6.4}, {'value': 7.72}, {'value': 8.29}], [{'value': 7.79}, {'value': 7.45}, {'value': 8.35}], [{'value': 8.62}, {'value': 8.55}, {'value': 8.45}], [{'value': 10.06}, {'value': 8.68}, {'value': 8.47}], [{'value': 10.81}, {'value': 9.84}, {'value': 8.32}], [{'value': 12.17}, {'value': 9.52}, {'value': 8.19}], [{'value': 12.42}, {'value': 8.17}, {'value': 8.36}], [{'value': 13.74}, {'value': 7.57}, {'value': 8.26}], [{'value': 14.08}, {'value': 6.23}, {'value': 8.21}], [{'value': 15.47}, {'value': 6.01}, {'value': 8.24}], [{'value': 16.29}, {'value': 7.13}, {'value': 8.26}], [{'value': 20.16}, {'value': 6.67}, {'value': 8.35}], [{'value': 19.87}, {'value': 8.03}, {'value': 8.42}], [{'value': 18.5}, {'value': 8.38}, {'value': 8.45}], [{'value': 17.72}, {'value': 7.24}, {'value': 8.35}], [{'value': 5.0}, {'value': 8.22}, {'value': 5.22}], [{'value': 5.99}, {'value': 5.71}, {'value': 5.21}], [{'value': 8.49}, {'value': 5.65}, {'value': 5.31}], [{'value': 9.9}, {'value': 10.37}, {'value': 5.21}], [{'value': 12.58}, {'value': 10.26}, {'value': 5.0}], [{'value': 13.72}, {'value': 5.64}, {'value': 5.19}], [{'value': 16.34}, {'value': 5.6}, {'value': 5.32}], [{'value': 21.36}, {'value': 7.82}, {'value': 5.41}], [{'value': 20.41}, {'value': 10.14}, {'value': 5.58}], [{'value': 17.81}, {'value': 10.33}, {'value': 5.45}], [{'value': 5.14}, {'value': 9.55}, {'value': 8.26}], [{'value': 5.68}, {'value': 6.91}, {'value': 8.24}], [{'value': 8.13}, 
{'value': 6.41}, {'value': 8.33}], [{'value': 10.34}, {'value': 10.82}, {'value': 8.23}], [{'value': 12.96}, {'value': 10.25}, {'value': 8.01}], [{'value': 13.28}, {'value': 5.5}, {'value': 8.19}], [{'value': 15.85}, {'value': 5.0}, {'value': 8.31}], [{'value': 21.18}, {'value': 6.32}, {'value': 8.37}], [{'value': 20.65}, {'value': 8.77}, {'value': 8.55}], [{'value': 18.12}, {'value': 9.41}, {'value': 8.44}], [{'value': 7.35}, {'value': 9.04}, {'value': 5.37}], [{'value': 11.16}, {'value': 6.95}, {'value': 5.57}], [{'value': 15.09}, {'value': 8.92}, {'value': 5.28}], [{'value': 18.97}, {'value': 6.87}, {'value': 5.31}], [{'value': 7.6}, {'value': 9.95}, {'value': 8.4}], [{'value': 10.99}, {'value': 7.23}, {'value': 8.58}], [{'value': 15.2}, {'value': 8.49}, {'value': 8.28}], [{'value': 18.67}, {'value': 5.8}, {'value': 8.28}]]}, {'name': 'XC Functional', 'scalars': [{'value': 'PAW_PBE'}]}, {'name': 'Cutoff Energy', 'scalars': [{'value': 550.0}], 'units': 'eV'}, {'name': 'k-Points per Reciprocal Atom', 'scalars': [{'value': 60.0}]}, {'name': 'Pseudopotentials', 'vectors': [[{'value': 'C'}, {'value': 'H'}, {'value': 'S'}]]}], 'methods': [{'name': 'Density Functional Theory', 'software': [{'name': 'VASP', 'version': '5.3.3'}]}], 'dataType': 'COMPUTATIONAL'}, {'name': 'Density', 'scalars': [{'value': 0.040651963745469616}], 'units': 'g/(cm^3)', 'conditions': [{'name': 'XC Functional', 'scalars': [{'value': 'PAW_PBE'}]}, {'name': 'Cutoff Energy', 'scalars': [{'value': 550.0}], 'units': 'eV'}, {'name': 'k-Points per Reciprocal Atom', 'scalars': [{'value': 60.0}]}, {'name': 'Pseudopotentials', 'vectors': [[{'value': 'C'}, {'value': 'H'}, {'value': 'S'}]]}], 'methods': [{'name': 'Density Functional Theory', 'software': [{'name': 'VASP', 'version': '5.3.3'}]}], 'dataType': 'COMPUTATIONAL'}, {'name': 'Stresses', 'matrices': [[[{'value': -0.30066}, {'value': -0.0966}, {'value': -0.03838}], [{'value': -0.0966}, {'value': 0.35117}, {'value': -0.03328}], [{'value': -0.03838}, 
{'value': -0.03328}, {'value': 1.31201}]]], 'units': 'kbar', 'conditions': [{'name': 'XC Functional', 'scalars': [{'value': 'PAW_PBE'}]}, {'name': 'Cutoff Energy', 'scalars': [{'value': 550.0}], 'units': 'eV'}, {'name': 'k-Points per Reciprocal Atom', 'scalars': [{'value': 60.0}]}, {'name': 'Pseudopotentials', 'vectors': [[{'value': 'C'}, {'value': 'H'}, {'value': 'S'}]]}], 'methods': [{'name': 'Density Functional Theory', 'software': [{'name': 'VASP', 'version': '5.3.3'}]}], 'dataType': 'COMPUTATIONAL'}, {'name': 'Number of atoms', 'scalars': [{'value': 60}], 'units': '/unit cell', 'conditions': [{'name': 'XC Functional', 'scalars': [{'value': 'PAW_PBE'}]}, {'name': 'Cutoff Energy', 'scalars': [{'value': 550.0}], 'units': 'eV'}, {'name': 'k-Points per Reciprocal Atom', 'scalars': [{'value': 60.0}]}, {'name': 'Pseudopotentials', 'vectors': [[{'value': 'C'}, {'value': 'H'}, {'value': 'S'}]]}], 'methods': [{'name': 'Density Functional Theory', 'software': [{'name': 'VASP', 'version': '5.3.3'}]}], 'dataType': 'COMPUTATIONAL'}, {'name': 'Initial volume', 'scalars': [{'value': 27000.0}], 'units': 'Angstrom^3/cell', 'conditions': [{'name': 'XC Functional', 'scalars': [{'value': 'PAW_PBE'}]}, {'name': 'Cutoff Energy', 'scalars': [{'value': 550.0}], 'units': 'eV'}, {'name': 'k-Points per Reciprocal Atom', 'scalars': [{'value': 60.0}]}, {'name': 'Pseudopotentials', 'vectors': [[{'value': 'C'}, {'value': 'H'}, {'value': 'S'}]]}], 'methods': [{'name': 'Density Functional Theory', 'software': [{'name': 'VASP', 'version': '5.3.3'}]}], 'dataType': 'COMPUTATIONAL'}, {'name': 'Final volume', 'scalars': [{'value': 27000.0}], 'units': 'Angstrom^3/cell', 'conditions': [{'name': 'XC Functional', 'scalars': [{'value': 'PAW_PBE'}]}, {'name': 'Cutoff Energy', 'scalars': [{'value': 550.0}], 'units': 'eV'}, {'name': 'k-Points per Reciprocal Atom', 'scalars': [{'value': 60.0}]}, {'name': 'Pseudopotentials', 'vectors': [[{'value': 'C'}, {'value': 'H'}, {'value': 'S'}]]}], 'methods': 
[{'name': 'Density Functional Theory', 'software': [{'name': 'VASP', 'version': '5.3.3'}]}], 'dataType': 'COMPUTATIONAL'}], 'category': 'system.chemical', 'chemicalFormula': 'C32H20S8'}, 'files': {'/thurston_selfassembled_peptide_spectra_v1.1/DFT/MoleculeConfigs/di_30_-10.xyz/INCAR': {'size_b': 476, 'extension': None, 'path_type': 'GLOBUS', 'globus_ep': '82f1b5c6-6e9b-11e5-ba47-22000b92c6ec'}, '/thurston_selfassembled_peptide_spectra_v1.1/DFT/MoleculeConfigs/di_30_-10.xyz/OUTCAR': {'size_b': 2269941, 'extension': None, 'path_type': 'GLOBUS', 'globus_ep': '82f1b5c6-6e9b-11e5-ba47-22000b92c6ec'}, '/thurston_selfassembled_peptide_spectra_v1.1/DFT/MoleculeConfigs/di_30_-10.xyz/POSCAR': {'size_b': 523, 'extension': None, 'path_type': 'GLOBUS', 'globus_ep': '82f1b5c6-6e9b-11e5-ba47-22000b92c6ec'}}}

import pprint
pprint.pprint(dft_mdata)


{'files': {'/thurston_selfassembled_peptide_spectra_v1.1/DFT/MoleculeConfigs/di_30_-10.xyz/INCAR': {'extension': None,
                                                                                                    'globus_ep': '82f1b5c6-6e9b-11e5-ba47-22000b92c6ec',
                                                                                                    'path_type': 'GLOBUS',
                                                                                                    'size_b': 476},
           '/thurston_selfassembled_peptide_spectra_v1.1/DFT/MoleculeConfigs/di_30_-10.xyz/OUTCAR': {'extension': None,
                                                                                                     'globus_ep': '82f1b5c6-6e9b-11e5-ba47-22000b92c6ec',
                                                                                                     'path_type': 'GLOBUS',
                                                                                           

In [145]:
gdoc_mdata = {'group': ['11s1EWKAdqIgaEC0ksSKunrQq_WgSKyZH6l1yig-lFgA'], 
              'files': [{"path": '11s1EWKAdqIgaEC0ksSKunrQq_WgSKyZH6l1yig-lFgA', "path_type": "gdrive", "is_gdoc": True, "mimeType": "text/plain", "owner": "skluzacek@uchicago.edu"}], 
              'metadata': {'keywords': 
                           {'cover': 44, 
                            'longest': 32, 
                            'vertex': 32, 
                            'independent': 28, 
                            'practice': 22, 
                            'compute': 22, 
                            'programming': 18, 
                            'building': 16, 
                            'bridges': 16, 
                            'knapsack': 16, 
                            'matching': 16, 
                            'substrings': 16, 
                            'distance': 16, 
                            'balanced': 16, 
                            'partition': 16, 
                            'shortest': 16, 
                            'paths': 16, 
                            'increasing': 16, 
                            'subsequence': 16, 
                            'final': 15}, 
                           'extract time': 0.46085143089294434, 
                           'parser': 'text'}}

import pprint
pprint.pprint(gdoc_mdata)
    

{'files': [{'is_gdoc': True,
            'mimeType': 'text/plain',
            'owner': 'skluzacek@uchicago.edu',
            'path': '11s1EWKAdqIgaEC0ksSKunrQq_WgSKyZH6l1yig-lFgA',
            'path_type': 'gdrive'}],
 'group': ['11s1EWKAdqIgaEC0ksSKunrQq_WgSKyZH6l1yig-lFgA'],
 'metadata': {'extract time': 0.46085143089294434,
              'keywords': {'balanced': 16,
                           'bridges': 16,
                           'building': 16,
                           'compute': 22,
                           'cover': 44,
                           'distance': 16,
                           'final': 15,
                           'increasing': 16,
                           'independent': 28,
                           'knapsack': 16,
                           'longest': 32,
                           'matching': 16,
                           'partition': 16,
                           'paths': 16,
                           'practice': 22,
                           'pro