# Using Xtract to extract MaterialsIO metadata from MDF files

This Xtract-MDF demo illustrates how to crawl, extract metadata from, and store metadata for any Globus HTTPS-accessible repository. 

Moreover, users can execute metadata extraction workflows on any machine running a funcX endpoint, whether it's ANL's Cooley, a laptop, or the cloud (for this demo, we use an EC2 instance). 

In [23]:
import json

import requests
from fair_research_login import NativeClient

In [20]:
client = NativeClient(client_id='7414f0b4-7d05-4bb6-bb00-076fa3f17cf5')
tokens = client.login(
    requested_scopes=['https://auth.globus.org/scopes/56ceac29-e98a-440a-a594-b41e7a084b62/all', 
                      'urn:globus:auth:scope:transfer.api.globus.org:all',
                     "https://auth.globus.org/scopes/facd7ccc-c5f4-42aa-916b-a0e270e2c2a9/all", 
                     'email', 'openid'],
    no_local_server=True,
    no_browser=True)

auth_token = tokens["petrel_https_server"]['access_token']
transfer_token = tokens['transfer.api.globus.org']['access_token']
funcx_token = tokens['funcx_service']['access_token']

headers = {'Authorization': f"Bearer {transfer_token}", 'Transfer': transfer_token, 'FuncX': funcx_token}
print(f"Headers: {headers}")

Headers: {'Authorization': 'Bearer Aggyraa2e24qQk031o9mo3WEGGmwYVQz3JpQpE0KVbrqmYxVG3IJC0Ogkbp6jYW52P5lneqBX9O234u1qwOakUWGO6', 'Transfer': 'Aggyraa2e24qQk031o9mo3WEGGmwYVQz3JpQpE0KVbrqmYxVG3IJC0Ogkbp6jYW52P5lneqBX9O234u1qwOakUWGO6', 'FuncX': 'AgP2gDp2zB82W1JxlWz9vjKYONlVw27nGK789bo00rbDgEQaggcmCy8Ywq6XJq9nPg9xPBk0K9OXqyHq5lVmXI1gv0'}
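
The printed headers above contain live bearer tokens, which are easy to leak when a notebook is shared. As a minimal sketch (the helper name `redact_headers` and its `keep` parameter are hypothetical and not part of Xtract, Globus, or funcX), the values can be masked before printing:

```python
def redact_headers(headers, keep=4):
    """Return a copy of `headers` with each token masked for display.

    Hypothetical helper for this notebook only. A value such as
    "Bearer <token>" keeps its scheme; only the token part is masked.
    """
    redacted = {}
    for name, value in headers.items():
        # "Bearer abc" -> ("Bearer", " ", "abc"); "abc" -> ("", "", "abc")
        scheme, sep, token = value.rpartition(" ")
        masked = token[:keep] + "..." if len(token) > keep else "..."
        redacted[name] = f"{scheme}{sep}{masked}"
    return redacted


# Placeholder values, not real tokens.
example = {"Authorization": "Bearer abcdefgh12345", "Transfer": "abcdefgh12345"}
print(f"Headers: {redact_headers(example)}")
```

In the cell above, this would replace the direct `print` of `headers`; the unmasked dictionary is still the one passed to `requests`.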


In [21]:
mdf_file = "/MDF/mdf_connect/prod/data/h2o_13_v1-1/split_xyz_files/watergrid_60_HOH_180__0.7_rOH_1.8_vario_PBE0_AV5Z_delta_PS_data/watergrid_PBE0_record-1237.xyz"
mdf_url = f'https://e38ee745-6d04-11e5-ba46-22000b92c6ec.e.globus.org{mdf_file}'

get_payload = {'url': mdf_url, 'headers': headers}

req = requests.get(mdf_url, headers=headers)

print(req.content)


SSLError: HTTPSConnectionPool(host='e38ee745-6d04-11e5-ba46-22000b92c6ec.e.globus.org', port=443): Max retries exceeded with url: /MDF/mdf_connect/prod/data/h2o_13_v1-1/split_xyz_files/watergrid_60_HOH_180__0.7_rOH_1.8_vario_PBE0_AV5Z_delta_PS_data/watergrid_PBE0_record-1237.xyz (Caused by SSLError(CertificateError("hostname 'e38ee745-6d04-11e5-ba46-22000b92c6ec.e.globus.org' doesn't match '1e4d7bc2-bd3f-11e9-9397-02ff96a5aa76.e.globus.org'",),))

In [50]:
# eb_url, source_ep_id, source_ep_path, and grouper must already be defined
# (the cell that sets them is not shown here).
crawl_url = f'{eb_url}/crawl'
print(f"Crawl URL is : {crawl_url}")
crawl_payload = {'eid': source_ep_id, 'dir_path': source_ep_path,
                 'Transfer': transfer_token, 'Authorization': funcx_token,
                 'grouper': grouper}
crawl_req = requests.post(crawl_url, json=crawl_payload)
print(crawl_req.content)
crawl_id = json.loads(crawl_req.content)['crawl_id']
print(f"Crawl ID: {crawl_id}")

Crawl URL is : http://xtractv1-env-2.p6rys5qcuj.us-east-1.elasticbeanstalk.com/crawl
b'{"crawl_id":"2cbccc2c-df88-4cc0-bcac-568cf92e9ddf"}\n'
Crawl ID: 2cbccc2c-df88-4cc0-bcac-568cf92e9ddf


We can query the crawl status to see how many groups the crawl has identified so far.

Note that we cannot report how many files remain to be crawled: the breadth-first search (BFS) may not have discovered every file yet, and Globus does not yet offer a recursive file count for a directory and its subdirectories. In other words, we know when the crawl is done, but we cannot tell in advance how far away that is.

In [51]:
# crawl_id = "92c4ecd5-0eb1-44f3-9f0e-378b8713838e"

crawl_status = requests.get(f'{eb_url}/get_crawl_status', json={'crawl_id': crawl_id})
print(crawl_status)
crawl_content = json.loads(crawl_status.content)
print(f"Crawl Status: {crawl_content}")

<Response [200]>
Crawl Status: {'crawl_status': 'complete', 'groups_crawled': 40}
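
Because the remaining work is unknown, the practical way to wait for a crawl is to poll the status until it reports `complete`. The following is a minimal sketch, not part of the Xtract API: `wait_for_crawl`, `fetch_status`, `poll_interval`, and `max_polls` are hypothetical names, and `complete` is the only status value confirmed by the output above, so any other value is treated as "still running".

```python
import time


def wait_for_crawl(fetch_status, poll_interval=2.0, max_polls=150):
    """Poll until the crawl reports 'complete' and return the final status dict.

    fetch_status is a zero-argument callable that returns the decoded status,
    e.g. {'crawl_status': 'complete', 'groups_crawled': 40}.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("crawl_status") == "complete":
            return status
        # Assumption: any value other than 'complete' means the crawl is still running.
        print(f"Groups crawled so far: {status.get('groups_crawled')}")
        time.sleep(poll_interval)
    raise TimeoutError(f"Crawl not complete after {max_polls} polls")
```

In this notebook, `fetch_status` could be `lambda: requests.get(f'{eb_url}/get_crawl_status', json={'crawl_id': crawl_id}).json()`. Passing the fetch step in as a callable keeps the loop independent of the service, and the poll limit prevents the cell from hanging indefinitely.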


## Step 3: Xtract

Next we launch a non-blocking metadata extraction workflow that automatically finds all groups generated by our `crawl_id`, ships parsers to our endpoint as funcX functions, transfers each file (if necessary), and extracts metadata and sends it back to the central Xtract service. The workflow keeps running until the crawl is done and there are no crawled groups left to extract.

In [22]:
import sys
sys.executable

'/Users/tylerskluzacek/opt/anaconda3/envs/funcx-jan/bin/python'

In [117]:
xtract_status = requests.get(f'{eb_url}/get_extract_status', json={'crawl_id': crawl_id})
xtract_content = json.loads(xtract_status.content)
print(f"Xtract Status: {xtract_content}")

Xtract Status: {'FINISHED': 18, 'IDLE': 1, 'PENDING': 11, 'crawl_id': '2eaaeb11-66ed-4fa5-8049-83b81eaf484d'}
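
The status dictionary can be reduced to a simple progress figure. This is a sketch under stated assumptions: the helper `summarize_extraction` is hypothetical, every integer-valued key is taken to be the number of groups in that state, and extraction is considered done only when all counted groups are `FINISHED`. The exact meaning of `IDLE` is not documented here, so it is simply counted as unfinished.

```python
def summarize_extraction(status):
    """Return (finished, total, done) from an extract-status dict.

    Hypothetical helper: assumes every integer value is a per-state group count.
    """
    # Keep only the integer counters; 'crawl_id' is a string and is skipped.
    counts = {key: value for key, value in status.items() if isinstance(value, int)}
    total = sum(counts.values())
    finished = counts.get("FINISHED", 0)
    done = total > 0 and finished == total
    return finished, total, done


# Status copied from the output above.
status = {'FINISHED': 18, 'IDLE': 1, 'PENDING': 11,
          'crawl_id': '2eaaeb11-66ed-4fa5-8049-83b81eaf484d'}
finished, total, done = summarize_extraction(status)
print(f"{finished}/{total} groups finished, done={done}")
```

Since the workflow runs alongside the crawl, `total` can still grow while the crawl is in progress, so `done` is only meaningful once the crawl status is `complete`.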


## Step 4: Access / Flush

We might want to flush all new metadata blobs to a separate Globus endpoint. Here we initialize a results poller that writes one file per metadata attribute into a folder at this path: `<mdata_path>/<crawl_id>/<group_id>`

In [246]:
poller = requests.post(f'{poller_url}/', json={'crawl_id': crawl_id, 'mdata_ep_id': mdata_ep_id, 'Transfer': transfer_token})
print(f'Flush Status: {poller}')

Flush Status: <Response [200]>
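
To locate the flushed files afterwards, the folder for a given group can be rebuilt from the path template above. This is only an illustration of that layout: `metadata_folder` is a hypothetical helper, and the `mdata_path` and `group_id` values below are placeholders, not values returned by the service.

```python
import posixpath


def metadata_folder(mdata_path, crawl_id, group_id):
    """Build <mdata_path>/<crawl_id>/<group_id> using forward slashes."""
    # posixpath is used because Globus endpoint paths are POSIX-style,
    # regardless of the operating system running this notebook.
    return posixpath.join(mdata_path, crawl_id, group_id)


# Placeholder base path and group ID; the crawl ID is the one from the crawl cell.
folder = metadata_folder("/mdata", "2cbccc2c-df88-4cc0-bcac-568cf92e9ddf", "example-group-id")
print(folder)
```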
