# Using Xtract to extract MaterialsIO metadata from MDF files

This Xtract-MDF demo illustrates how to crawl, extract metadata from, and store metadata for any Globus HTTPS-accessible repository. 

Moreover, users can execute metadata extraction workflows on any machine running a funcX endpoint, whether it's ANL's Cooley, a laptop, or the cloud (for this demo, we use an EC2 instance). 

In [12]:
from fair_research_login import NativeClient
import requests
import pickle
import json
import mdf_toolbox

# Globus endpoint and directory path to where the data live (default: MDF data repo on petrel#researchdatalanalytics).
source_ep_path_1 = "/MDF/mdf_connect/prod/data" #/_test_einstein_9vpflvd_v1.1"


source_ep_id = "e38ee745-6d04-11e5-ba46-22000b92c6ec"  # MDF at Petrel 
dest_ep_id = "af7bda53-6d04-11e5-ba46-22000b92c6ec"  # where they will be extracted

# Globus endpoint whatand file path at which we want to store metadata documents
mdata_ep_id = "5113667a-10b4-11ea-8a67-0e35e66293c2"  
mdata_path = None

data_prefetch_path = "/project2/chard/skluzacek/data_to_process"
funcx_ep_id = "b08b8612-cd87-47f4-b2ff-8c69c2c03d53"

base_url = ""  # Use this for the prefetch case! 

# DEV crawler: 
eb_url = "http://xtractcrawler5-env.eba-akbhvznm.us-east-1.elasticbeanstalk.com/"

# Grouping strategy we want to use for grouping. This will, by default, use all .group() functions from matio parsers.
grouper = "file_is_group"


## Step 1: Login 

Here we request tokens from Globus Auth coming from three separate scopes. When fresh tokens are needed, tthe NativeClient will provide a link at which the user can authenticate with their Globus ID, providing a box at which to paste the Authentication Code. The scopes are as follows: 

* **petrel_https_server**: needed to access the MDF data on Petrel. Will need to change if processing data off-Petrel. 
* **transfer_token**: needed to crawl the Globus endpoint and transfer metadata to its final location. 
* **funcx_token**: needed to orchestrate the metadata exraction at the given funcX endpoint.

Additionally we package the tokens as *headers* that we can easily ship with later requests. 

In [13]:
# client = NativeClient(client_id='7414f0b4-7d05-4bb6-bb00-076fa3f17cf5')
# tokens = client.login(
#     requested_scopes=['https://auth.globus.org/scopes/56ceac29-e98a-440a-a594-b41e7a084b62/all', 
#                       'urn:globus:auth:scope:transfer.api.globus.org:all',
#                       "https://auth.globus.org/scopes/facd7ccc-c5f4-42aa-916b-a0e270e2c2a9/all"],# , 
#                      # 'email'],# , 'openid'],
#     no_local_server=True,
#     no_browser=True, force=True)

# auth_token = tokens["petrel_https_server"]['access_token']
# transfer_token = tokens['transfer.api.globus.org']['access_token']
# funcx_token = tokens['funcx_service']['access_token']

auths = mdf_toolbox.login(
    services=[
        "openid",
        "data_mdf",
        "search",
        "petrel",
        "transfer",
        "dlhub",
        "https://auth.globus.org/scopes/facd7ccc-c5f4-42aa-916b-a0e270e2c2a9/all",
    ],
    app_name="Foundry",
    make_clients=True,
    no_browser=False,
    no_local_server=False,
)


print(auths)
fx_scope = 'https://auth.globus.org/scopes/facd7ccc-c5f4-42aa-916b-a0e270e2c2a9/all'
headers = {'Authorization': f"Bearer {auths['petrel']}", 'Transfer': auths['transfer'], 'FuncX': auths[fx_scope], 'Petrel': auths['petrel']}
fx_headers = {'Authorization': f"Bearer {auths[fx_scope].access_token}",
             'Search': auths['search'].authorizer.access_token,
             'Openid': auths['openid'].access_token}
print("\n" + str(fx_headers))

{'search': <globus_sdk.search.client.SearchClient object at 0x7ff0d933e6a0>, 'data_mdf': <globus_sdk.authorizers.refresh_token.RefreshTokenAuthorizer object at 0x7ff0d933e9e8>, 'petrel': <globus_sdk.authorizers.refresh_token.RefreshTokenAuthorizer object at 0x7ff0d933ea20>, 'dlhub': <globus_sdk.authorizers.refresh_token.RefreshTokenAuthorizer object at 0x7ff0d933eef0>, 'transfer': <globus_sdk.transfer.client.TransferClient object at 0x7ff0d9344860>, 'https://auth.globus.org/scopes/facd7ccc-c5f4-42aa-916b-a0e270e2c2a9/all': <globus_sdk.authorizers.refresh_token.RefreshTokenAuthorizer object at 0x7ff0d933eeb8>, 'openid': <globus_sdk.authorizers.refresh_token.RefreshTokenAuthorizer object at 0x7ff0d933ef28>}

{'Authorization': 'Bearer AgVzQJqp2yBYB9a0MKBOYQbpvXBBlEjQDqGKxexvd8PM9deP45UbCM0aPPkYYNM6PPy9wP66nWBVg4Tlj3yVxf2GYM', 'Search': 'Ag6bm2aO5WlYM6P0M07zdmw5VYle99pOGzrzKqozYO3wxBkW4lT2CBe2qX0Nm7Dorqa7nbbPqgzxY7c1m4NrKugGY7', 'Openid': 'Ag8yPE0We4o5e0dovD7YgaW2ExXlXyGdEM7xmoJ6GWYXM65mkw

## Step 2: Crawl
Crawling, behind the scenes, will scan a Globus directory breadth-first (using globus_ls), first extracting physical metadata such as path, size, and extension. Next, since the *grouper* we selected is 'matio', the crawler will execute matio's `get_groups_by_postfix()` function on all file names in a directory in order to return groups for each of matio's parsers (besides *generic* and *noop*). 

The crawl will run as a non-blocking thread, and return a crawl_id that will be used extensively to track progress of our metadata extraction workflow.

In [14]:
# TODO: Adjust this to the Google Drive model!!!

crawl_url = f'{eb_url}/crawl'
print(f"Crawl URL is : {crawl_url}")

first_ep_dict = {
    'repo_type': 'GLOBUS',
    'eid': source_ep_id,
    'dir_paths': [source_ep_path_1],# , source_ep_path_2],
    'grouper': grouper, 
    # 'prefetch_list':
    #     [{'dest_ep_id': dest_ep_id, 'path': '/project2/chard/skluzacek/data_to_process'}]
}

tokens = {'Transfer': auths['transfer'].authorizer.access_token, 
          'Authorization': f"Bearer {auths['petrel'].access_token}", 
          'FuncX': auths[fx_scope].access_token}
print(tokens)

crawl_req = requests.post(crawl_url, json={'endpoints': [first_ep_dict], 'tokens': tokens})
print(crawl_req.content)
crawl_id = json.loads(crawl_req.content)['crawl_id']
print(f"Crawl ID: {crawl_id}")


Crawl URL is : http://xtractcrawler5-env.eba-akbhvznm.us-east-1.elasticbeanstalk.com//crawl
{'Transfer': 'Agq7z74DPW6bwq671j6Ya9WvgkEYQ975n8oQyr6GdzBroXnPqdUqCrJQN3qomyP082JEg7dj1YOaY1CP769jgF2Qa9', 'Authorization': 'Bearer AgePVk0lbgK6P8GbKGzgyEGr3zzkkw4QgxDpyqP0kDPlblP0VlUVCV0a0EXkgbyQkplM5WP4eMgXEDuKWrB2XSNMmg', 'FuncX': 'AgVzQJqp2yBYB9a0MKBOYQbpvXBBlEjQDqGKxexvd8PM9deP45UbCM0aPPkYYNM6PPy9wP66nWBVg4Tlj3yVxf2GYM'}
b'{"crawl_id":"fc1e0b65-2782-4914-9f1f-0ea9b8b0acfa","status":"200 (OK)"}\n'
Crawl ID: fc1e0b65-2782-4914-9f1f-0ea9b8b0acfa


We can get crawl status, seeing how many groups have been identified in the crawl. 

Note that measuring the total files yet to crawl is impossible, as the BFS may not have discovered all files yet, and Globus does not yet have a file counting feature for all directories and subdirectories. I.e., we know when we're done, but we don't know until we get there. 

In [22]:
# TODO: update the crawl status to query the DB (might require occasionally updating db)
crawl_status = requests.get(f'{eb_url}/get_crawl_status', json={'crawl_id': crawl_id})
print(crawl_status)
crawl_content = json.loads(crawl_status.content)
print(f"Crawl Status: {crawl_content}")

<Response [200]>
Crawl Status: {'bytes_crawled': 5476192682, 'crawl_id': 'fc1e0b65-2782-4914-9f1f-0ea9b8b0acfa', 'crawl_status': 'crawling', 'files_crawled': 12615, 'groups_crawled': 12615}


In [121]:
import time
while True:
    req = requests.get(f'{eb_url}/fetch_crawl_mdata', json={'crawl_id': crawl_id, 'n': 100})
    print(req.content)
    time.sleep(1)

b'{"crawl_id":"e06d9c03-40f5-45d9-b5e2-5174732abe70","file_ls":[],"num_files":0,"queue_empty":true}\n'
b'{"crawl_id":"e06d9c03-40f5-45d9-b5e2-5174732abe70","file_ls":[{"base_url":null,"crawl_timestamp":1629919112.08806,"file_id":"ce3410be-d565-49de-9368-f176175e3011","metadata":{"physical":{"extension":null,"path_type":"globus","size":205918}},"path":"/MDF/mdf_connect/prod/data/_test_einstein_9vpflvd_v1.1/OUTCAR","size":205918},{"base_url":null,"crawl_timestamp":1629919112.087977,"file_id":"aacf3f08-456e-4aa4-8000-c24086f3a944","metadata":{"physical":{"extension":null,"path_type":"globus","size":473}},"path":"/MDF/mdf_connect/prod/data/_test_einstein_9vpflvd_v1.1/INCAR","size":473},{"base_url":null,"crawl_timestamp":1629919112.088217,"file_id":"01c88c45-9989-40a0-a9a7-2b0552bebf48","metadata":{"physical":{"extension":null,"path_type":"globus","size":732}},"path":"/MDF/mdf_connect/prod/data/_test_einstein_9vpflvd_v1.1/POSCAR","size":732}],"num_files":3,"queue_empty":true}\n'
b'{"crawl_i

b'{"crawl_id":"e06d9c03-40f5-45d9-b5e2-5174732abe70","file_ls":[],"num_files":0,"queue_empty":true}\n'
b'{"crawl_id":"e06d9c03-40f5-45d9-b5e2-5174732abe70","file_ls":[],"num_files":0,"queue_empty":true}\n'
b'{"crawl_id":"e06d9c03-40f5-45d9-b5e2-5174732abe70","file_ls":[],"num_files":0,"queue_empty":true}\n'
b'{"crawl_id":"e06d9c03-40f5-45d9-b5e2-5174732abe70","file_ls":[],"num_files":0,"queue_empty":true}\n'
b'{"crawl_id":"e06d9c03-40f5-45d9-b5e2-5174732abe70","file_ls":[],"num_files":0,"queue_empty":true}\n'
b'{"crawl_id":"e06d9c03-40f5-45d9-b5e2-5174732abe70","file_ls":[],"num_files":0,"queue_empty":true}\n'
b'{"crawl_id":"e06d9c03-40f5-45d9-b5e2-5174732abe70","file_ls":[],"num_files":0,"queue_empty":true}\n'
b'{"crawl_id":"e06d9c03-40f5-45d9-b5e2-5174732abe70","file_ls":[],"num_files":0,"queue_empty":true}\n'
b'{"crawl_id":"e06d9c03-40f5-45d9-b5e2-5174732abe70","file_ls":[],"num_files":0,"queue_empty":true}\n'
b'{"crawl_id":"e06d9c03-40f5-45d9-b5e2-5174732abe70","file_ls":[],"num_fi

KeyboardInterrupt: 

In [99]:
# crawl_id
# fetch_mdata = requests.get(f'{eb_url}/fetch_crawl_mdata', json={'crawl_id': crawl_id, 'n': 2})
# print(fetch_mdata.content)
# # fetch_content = json.loads(fetch_mdata.content)
# print(f"Crawl Status: {crawl_content}")

# tokens['Bearer']
print(f"Tokens: {tokens}")



# HERE WE WILL TEST CONFIGURING OUR ENDPOINT. 
config_status = requests.post(f"{eb_url}/configure_ep/{funcx_ep_id}", json={'headers': fx_headers, 
                                                                            'timeout': 25, 
                                                                            'ep_name': 'tyler_test_ep_2', 
                                                                            'globus_eid': '12345', 
                                                                            'xtract_path':'/Users/tylerskluzacek/.xtract',
                                                                            'local_download_path': 'foobar',
                                                                            'local_mdata_path': '/Users/tylerskluzacek/Desktop/metadata'
                                                                            })
config_content = json.loads(config_status.content)
print(f"Returned: {config_content}")


Tokens: {'Transfer': 'Agwvnakew9eJVYMvPPl2w2VaGw0yprWExvJrk47Qj23Wq7Ky7DIlCQj2l3WMYbb2NyP248BWOYo4G7S2kdDbwhEnqz', 'Authorization': 'Bearer Ag7rlbGlyY01B9VedoOm8VVpJ3aaDxobo5O1exrpJV8qjxrJ7zsWCyv7WQw1j98jn492qpGrzvqlnWtzK5lvWTWwXP', 'FuncX': 'Ag0YXW3mMOadwbO6JyOd88WVmmVX60arVO7oGDrzqJEXQ0Ev7Xt7CxeeQOJ4GeMe2DOQ9rDeV8J64bsw2qYNlF8dnN'}


JSONDecodeError: Expecting value: line 1 column 1 (char 0)

## Step 3: Xtract

Next we launch a non-blocking metadata extraction workflow that will automatically find all groups generated from our crawl_id, ship parsers to our endpoint as funcX, transfer the file (if necessary), and extract/send back metadata to the central Xtract service. This will just run constantly until the crawl is done and there are crawled groups left to extract. 

In [11]:
# crawl_id = "246d0edf-2641-46f7-9700-3cc5a49c4890"
xtract = requests.post(f'{eb_url}/extract', json={'crawl_id': crawl_id,
                                                  'repo_type': "HTTPS",
                                                  'headers': json.dumps(headers),
                                                  'funcx_eid': funcx_ep_id, 
                                                  'source_eid': source_ep_id,
                                                  'dest_eid': dest_ep_id,
                                                  'mdata_store_path': mdata_path, 
                                                  'data_prefetch_path': data_prefetch_path, 
                                                  'prefetch_remote': True})
print(f"Xtract response (should be 200): {xtract}")

TypeError: Object of type 'TransferClient' is not JSON serializable

In [16]:
xtract_status = requests.get(f'{eb_url}/get_extract_status', json={'crawl_id': crawl_id})
xtract_content = json.loads(xtract_status.content)
print(f"Xtract Status: {xtract_content}")

Xtract Status: {'crawl_id': 'd74d5cd1-7f77-474b-9ed0-8c56af832070', 'poll_status': 'RUNNING', 'send_status': 'SUCCEEDED'}


## Step 4: Access / Flush

We might want to flush all new metadata blobs to a separate Globus endpoint. Here we initialize a results poller that creates a file of each metadata attribute to a folder at this path: `<mdata_path>/<crawl_id>/<group_id>`

In [239]:
import time 

while True: 
    poller = requests.post(f'{eb_url}/fetch_crawl_mdata', json={'crawl_id': crawl_id, 'Transfer': transfer_token, 'n': 100})
    print(f'Flush Status: {poller}')
    time.sleep(5)

NameError: name 'transfer_token' is not defined