# Using Xtract to index research artifacts stored on Jetstream


### This Xtract-Jetstream demo illustrates how to crawl, extract metadata from, and ingest metadata for any Globus Endpoint.

#### We begin by importing the libraries we need. Of note, we use the `mdf_toolbox` library as a wrapper for Globus Auth.

In [1]:
from xtract_sdk.endpoint import XtractEndpoint
from xtract_sdk.client import XtractClient

import time
import json


## Step 0: Configuration

#### Here we provide configuration details for our metadata extraction job, including specifications for both Globus and funcX.

In [2]:
# TODO: FIX THIS TO BE AN ACTUAL PATH ON THE ENDPOINT. 
xep1 = XtractEndpoint(repo_type='globus',
                      globus_ep_id='f9959bd2-e98f-11eb-884c-aba19178789c',
                      funcx_ep_id='e1398319-0d0f-4188-909b-a978f6fc5621',
                      dirs=['/home/tskluzac/cord-19-test/'], 
                      grouper='file_is_group', 
                      local_mdata_path='/home/tskluzac/mdata',
                      remote_mdata_path=None)

xtr = XtractClient(auth_scopes=["petrel"])

# TODO: WANT TO PUT THIS BACK IN HIDING
fx_headers = {'Authorization': f"Bearer {xtr.auths[xtr.funcx_scope].access_token}",
             'Search': xtr.auths['search'].authorizer.access_token,
             'Openid': xtr.auths['openid'].access_token}


# Run this if you want to update the container library.
xep1.register_containers(container_path='/home/tskluzac/.xtract/.containers', headers=fx_headers)


Register containers status: OK


## Step 1: Login 

Here we use `mdf_toolbox` to request tokens from Globus Auth. When fresh tokens are needed, users authenticate with their Globus ID by following the directions printed to stdout. Notable auth scopes are as follows:

* **openid**: provides the username for identity.
* **search**: needed to interact with Globus Search.
* **petrel**: needed to read or write data on Petrel. Not needed if no data is going to Petrel.
* **transfer**: needed to crawl the Globus endpoint and transfer metadata to its final location.
* **funcx_scope**: needed to orchestrate the metadata extraction at the given funcX endpoint.

The following code block initializes all of the tokens. It is commented out here because the client was already created in Step 0.

In [None]:
# xtr = XtractClient(auth_scopes=["petrel"])
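
The same three-token header dictionary is rebuilt by hand in several cells of this notebook. As a minimal sketch, it could be assembled by one helper. Note that `build_fx_headers` is a hypothetical name and not part of `xtract_sdk`; it only assumes that the token mapping is shaped like `xtr.auths` as used elsewhere in this notebook. Stand-in token objects are used so that the cell runs without Globus Auth.

```python
from types import SimpleNamespace


def build_fx_headers(auths, funcx_scope):
    """Assemble the token headers sent with the notebook's REST calls.

    Hypothetical helper (not part of xtract_sdk). `auths` is assumed to be a
    mapping shaped like `xtr.auths`: each entry exposes `.access_token`, and
    the 'search' entry exposes `.authorizer.access_token`.
    """
    return {
        'Authorization': f"Bearer {auths[funcx_scope].access_token}",
        'Search': auths['search'].authorizer.access_token,
        'Openid': auths['openid'].access_token,
    }


# Stand-in token objects (made up for illustration only).
fake_auths = {
    'fx-scope': SimpleNamespace(access_token='fx-token'),
    'search': SimpleNamespace(
        authorizer=SimpleNamespace(access_token='search-token')),
    'openid': SimpleNamespace(access_token='openid-token'),
}

headers = build_fx_headers(fake_auths, 'fx-scope')
print(headers['Authorization'])
```

With a live client, the equivalent call would presumably be `build_fx_headers(xtr.auths, xtr.funcx_scope)`.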

## Step 2: Crawl
Behind the scenes, the crawler scans a Globus directory breadth-first (using globus_ls), first extracting physical metadata such as path, size, and extension. Then, since the *grouper* we selected is 'file_is_group', the crawler simply creates one single-file group per file, so `n` files yield `n` groups.

The crawl is **non-blocking**, and the crawl IDs returned here are used to execute and monitor downstream extraction processes.

In [3]:
cids = xtr.crawl([xep1])
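
To make the breadth-first scan and the 'file_is_group' grouping concrete, here is a minimal local sketch. It is not the Xtract crawler: it walks a local directory with `os.scandir` as a stand-in for globus_ls, and the function name and the shape of each group record are assumptions made for illustration.

```python
import os
import tempfile
from collections import deque


def crawl_file_is_group(root):
    """Breadth-first walk of `root`; every file becomes its own group.

    Illustrative stand-in for the crawler: os.scandir plays the role of
    globus_ls, and the group record layout is hypothetical.
    """
    groups = []
    queue = deque([root])
    while queue:
        current = queue.popleft()  # FIFO order gives breadth-first traversal
        with os.scandir(current) as entries:
            for entry in sorted(entries, key=lambda e: e.name):
                if entry.is_dir(follow_symlinks=False):
                    queue.append(entry.path)  # visit subdirectories later
                else:
                    groups.append({
                        'files': [entry.path],  # single-file group
                        'size': entry.stat(follow_symlinks=False).st_size,
                        'extension': os.path.splitext(entry.name)[1],
                    })
    return groups


# Build a tiny throwaway directory tree so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'sub'))
    for rel, text in [('a.txt', 'hello'), (os.path.join('sub', 'b.csv'), 'x,y')]:
        with open(os.path.join(tmp, rel), 'w') as fh:
            fh.write(text)
    demo_groups = crawl_file_is_group(tmp)

print(len(demo_groups), [g['extension'] for g in demo_groups])
```

Files in the top-level directory are grouped before files in subdirectories, which is the ordering a breadth-first scan produces.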

We can get crawl status, seeing how many groups have been identified in the crawl. 

Note that it is impossible to measure how many files remain to be crawled: the BFS may not have discovered all files yet, and Globus does not yet have a feature for counting files across all directories and subdirectories. That is, we know when the crawl is done, but we cannot know how far away that is until we get there.

**Warning:** it currently takes up to 30 seconds for a crawl to start. *Why?* Container warming time. 

In [4]:
# Poll every 2 seconds until every crawl reports 'complete'.
while True:
    statuses = xtr.get_crawl_status(crawl_ids=cids)
    for status in statuses:
        print(status)
    if all(status['status'] == 'complete' for status in statuses):
        break
    time.sleep(2)


{'crawl_id': '9175a3cb-b016-4b9f-a96c-12b34b2d203b', 'status': 'crawling', 'message': 'OK', 'data': {'groups_crawled': 0, 'files_crawled': 0, 'bytes_crawled': 0}}
{'crawl_id': '9175a3cb-b016-4b9f-a96c-12b34b2d203b', 'status': 'crawling', 'message': 'OK', 'data': {'groups_crawled': 3, 'files_crawled': 3, 'bytes_crawled': 399171}}
{'crawl_id': '9175a3cb-b016-4b9f-a96c-12b34b2d203b', 'status': 'crawling', 'message': 'OK', 'data': {'groups_crawled': 3, 'files_crawled': 3, 'bytes_crawled': 399171}}
{'crawl_id': '9175a3cb-b016-4b9f-a96c-12b34b2d203b', 'status': 'complete', 'message': 'OK', 'data': {'groups_crawled': 3, 'files_crawled': 3, 'bytes_crawled': 399171}}
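
The polling loop above runs forever if a crawl never reaches 'complete'. A minimal sketch of the same loop with a timeout is shown below. `wait_for_crawls` is a hypothetical helper, not part of `xtract_sdk`; it takes any zero-argument callable that returns a list of status dictionaries with a 'status' key, as in the output above. A canned sequence of responses stands in for the service.

```python
import time


def wait_for_crawls(fetch_statuses, poll_interval=2.0, timeout=300.0):
    """Poll until every status is 'complete' or `timeout` seconds elapse.

    Hypothetical helper: `fetch_statuses` is any callable returning a list
    of dicts that each carry a 'status' key.
    """
    deadline = time.monotonic() + timeout
    while True:
        statuses = fetch_statuses()
        if all(s['status'] == 'complete' for s in statuses):
            return statuses
        if time.monotonic() >= deadline:
            raise TimeoutError('crawl did not complete before the timeout')
        time.sleep(poll_interval)


# Canned responses standing in for the crawl-status endpoint.
_responses = iter([
    [{'crawl_id': 'demo', 'status': 'crawling'}],
    [{'crawl_id': 'demo', 'status': 'complete'}],
])
final = wait_for_crawls(lambda: next(_responses), poll_interval=0.01, timeout=5)
print(final)
```

Against the live client, the callable would presumably be `lambda: xtr.get_crawl_status(crawl_ids=cids)`.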


## Step 3a. You can directly flush the crawl metadata via REST 

#### Why? Downloading crawl metadata is useful for many file organization tasks, such as: 
- I want a list of all files on my file system
- I want to know the total size (GB) of a folder
- I want to tally files by extension
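
Each of these tasks reduces to a short aggregation over the crawl records. The sketch below uses made-up records with `path`, `size` (bytes), and `extension` fields; these field names are assumptions based on the physical metadata mentioned in Step 2, and the actual payload returned by the service may be shaped differently.

```python
from collections import Counter

# Hypothetical crawl-metadata records (illustrative values only).
records = [
    {'path': '/data/a.pdf', 'size': 120_000, 'extension': 'pdf'},
    {'path': '/data/b.pdf', 'size': 80_000, 'extension': 'pdf'},
    {'path': '/data/sub/c.csv', 'size': 300_000, 'extension': 'csv'},
]

# List of all files.
all_paths = [r['path'] for r in records]

# Total size in GB (decimal gigabytes: 1 GB = 1e9 bytes).
total_gb = sum(r['size'] for r in records) / 1e9

# Number of files per extension.
by_extension = Counter(r['extension'] for r in records)

print(all_paths)
print(f"{total_gb:.6f} GB")
print(dict(by_extension))
```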

#### Currently Foundry uses Xtract to create a list of all files in user-submitted folders. Check it out here: 
TODO: LINK TO FOUNDRY. 

**Caution**: if you flush the crawl metadata (3a), **you cannot** then extract metadata from those same groups (3b). If you want to do both, you must launch two separate crawl jobs.

In [5]:
# while True:
#     req = requests.get(f'{eb_url}/fetch_crawl_mdata', json={'crawl_id': crawl_id, 'n': 100})
#     print(req.content)
#     time.sleep(1)

In [6]:
# print(f"Tokens: {tokens}")

# # HERE WE WILL TEST CONFIGURING OUR ENDPOINT. 
# config_status = requests.post(f"{eb_url}/configure_ep/{funcx_ep_id}", json={'headers': fx_headers, 
#                                                                             'timeout': 25, 
#                                                                             'ep_name': 'tyler_test_ep_2', 
#                                                                             'globus_eid': '12345', 
#                                                                             'xtract_path':'/Users/tylerskluzacek/.xtract',
#                                                                             'local_download_path': 'foobar',
#                                                                             'local_mdata_path': '/Users/tylerskluzacek/Desktop/metadata'
#                                                                      })
# config_content = json.loads(config_status.content)
# print(f"Returned: {config_content}")


## Step 3b: Xtract

Next we launch a non-blocking metadata extraction workflow that automatically finds all groups generated from our crawl_id, ships parsers to our endpoint as funcX functions, transfers the files (if necessary), and extracts and sends metadata back to the central Xtract service. The workflow keeps running until the crawl is done and there are no crawled groups left to extract.

In [7]:
import requests

# Just grab the first one here. 
crawl_id = cids[0]

# TODO: Break open the xtract-client object
# xtr.

eb_extract_url = 'http://127.0.0.1:5000'
fx_headers = {'Authorization': f"Bearer {xtr.auths[xtr.funcx_scope].access_token}",
             'Search': xtr.auths['search'].authorizer.access_token,
             'Openid': xtr.auths['openid'].access_token}

xtract = requests.post(f'{eb_extract_url}/extract', json={
    'crawl_id': crawl_id, 
    'tokens': fx_headers, 
    'local_mdata_path': xep1.local_mdata_path,
    'fx_ep_ids': ['e1398319-0d0f-4188-909b-a978f6fc5621'], # TODO: THIS IS HARDCODED. 
    'remote_mdata_path': xep1.remote_mdata_path})
print(f"Xtract response (should be 200): {xtract}")


Xtract response (should be 200): <Response [403]>


In [None]:
xtract_status = requests.get(f'{eb_extract_url}/get_extract_status', json={'crawl_id': crawl_id})
# Parse the response body once and reuse it.
status_content = json.loads(xtract_status.content)
print(f"Xtract Status: {status_content['status']}")
print(f"Xtract Counters: {status_content['counters']}")

## Step 4 (optional): Globus Search ingest

#### In this step we ingest the extracted metadata into a Globus Search index for our data.



In [None]:
# eb_extract_url = 'http://127.0.0.1:5000'

search_index = "ce2d9637-ad96-423f-99bc-935de889f640"

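# Rebuild the token headers from the Xtract client, as in Step 3b
# (assumes the same 'search' scope key used there).
fx_headers = {'Authorization': f"Bearer {xtr.auths[xtr.funcx_scope].access_token}",
             'Search': xtr.auths['search'].authorizer.access_token,
             'Openid': xtr.auths['openid'].access_token}

search_info = {
    'dataset_mdata': {'organizer':  'Tyler J. Skluzacek'},
    'search_index_id': search_index,
    'mdata_dir': xep1.local_mdata_path,  # where the extracted metadata were written
    'tokens': fx_headers
}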

resp = requests.post(f'{eb_extract_url}/ingest_search', json=search_info)
print(resp)




## Step 5: Metadata transfer (archive)

#### Metadata, by default, are stored on the filesystem of the machine on which they were extracted. Here we can move them to a Globus endpoint of our choosing. 

Here we push the metadata to ALCF's Petrel data store and opt not to delete them from Jetstream.

In [None]:
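# Token for the transfer, taken from the Xtract client as in earlier steps.
tokens = {"Transfer": xtr.auths["petrel"].access_token}

# NOTE: source_ep_id and mdata_ep_id (the Globus endpoint IDs of the metadata
# source and destination) are not defined in this notebook; set them first.
while True:
    xtract_status = requests.post(f'{eb_extract_url}/offload_mdata', json={
        'crawl_id': crawl_id, 
        'tokens': tokens, 
        'source_ep': source_ep_id, 
        'mdata_ep': mdata_ep_id, 
        'delete_source': False})

    response = json.loads(xtract_status.content)
    print(response['status'])
    if response['status'] == 'SUCCESS':
        break
    time.sleep(2)  # pause between polls, as in the crawl-status loop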

## Step 6: Let's query the index!

In [None]:
# COMING SOON.