# Using Xtract to index research artifacts stored on Jetstream

### This Xtract-Jetstream demo illustrates how to crawl, extract metadata from, and ingest metadata for any Globus Endpoint.

#### We begin by importing the libraries we need. Of note, we use the `mdf_toolbox` library as a wrapper for Globus Auth.

In [1]:
import mdf_toolbox
import requests
import jsonschema

#### Now, we import the `XtractClient` and `XtractEndpoint` classes from `xtract_sdk`.

In [2]:
from xtract_sdk.client import XtractClient
from xtract_sdk.endpoint import XtractEndpoint

## Step 1.a: Login
### Creating an XtractClient object

Here we create an XtractClient object to request tokens from Globus Auth. When fresh tokens are needed, users authenticate with their Globus ID by following the directions printed to stdout. The default auth scopes are as follows:

* **openid**: provides username for identity.
* **data_mdf**: FILL IN
* **search**: interact with Globus Search
* **petrel**: read or write data on Petrel. Not needed if no data going to Petrel.
* **transfer**: needed to crawl the Globus endpoint and transfer metadata to its final location.
* **dlhub**: FILL IN
* **funcx_scope**: needed to orchestrate the metadata extraction at the given funcX endpoint.

Additional auth scopes can be added with the `auth_scopes` argument.

The following code block initializes all of the tokens by creating an `XtractClient` object.

In [5]:
xtr = XtractClient(auth_scopes=None, dev=False)  
print(f'Auths: {xtr.auths}')

Auths: {'openid': <globus_sdk.authorizers.refresh_token.RefreshTokenAuthorizer object at 0x7fa069ccd8e0>, 'https://auth.globus.org/scopes/facd7ccc-c5f4-42aa-916b-a0e270e2c2a9/all': <globus_sdk.authorizers.refresh_token.RefreshTokenAuthorizer object at 0x7fa069ccd6a0>, 'search': <globus_sdk.services.search.client.SearchClient object at 0x7fa069f5c6d0>, 'data_mdf': <globus_sdk.authorizers.refresh_token.RefreshTokenAuthorizer object at 0x7fa069f77fd0>, 'petrel': <globus_sdk.authorizers.refresh_token.RefreshTokenAuthorizer object at 0x7fa0695bae50>, 'dlhub': <globus_sdk.authorizers.refresh_token.RefreshTokenAuthorizer object at 0x7fa069f5cb20>, 'transfer': <globus_sdk.services.transfer.client.TransferClient object at 0x7fa069f94ee0>}


## Step 1.b: Defining endpoints
### Creating an XtractEndpoint object

Here we create an XtractEndpoint object, which describes a data source for later steps such as crawling.

Required arguments are as follows:
* **repo_type**: the repository type. Currently only `globus` is accepted; GDrive and others are to be implemented at a later date.
* **globus_ep_id**: the source endpoint ID, currently assumed to be a Globus endpoint ID (see the previous bullet).
* **dirs**: the directory paths where the data resides.
* **grouper**: the strategy used to group files.

An XtractEndpoint can also be given an optional `funcx_ep_id`.

The following code block creates two `XtractEndpoint` objects, which we crawl in the next step.

In [6]:
xep1 = XtractEndpoint(repo_type='globus',
                      globus_ep_id='f9959bd2-e98f-11eb-884c-aba19178789c',
                      funcx_ep_id='aaaa-0000-3333',
                      dirs=['/home/tskluzac/cord-19'], 
                      grouper='file_is_group')

xep2 = XtractEndpoint(repo_type='globus',
                      globus_ep_id='f9959bd2-e98f-11eb-884c-aba19178789c',
                      dirs=['/home/tskluzac/cord-19'], 
                      grouper='file_is_group')
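
The Globus endpoint ID used above has the form of a UUID. As a small sanity check before constructing endpoints, one could verify that an ID parses as a UUID. This is only a sketch: `is_uuid` is a hypothetical helper, not part of `xtract_sdk`, and it assumes that funcX endpoint IDs are UUIDs as well.

```python
import uuid


def is_uuid(value):
    """Return True if `value` parses as a UUID string (hypothetical helper)."""
    try:
        uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        # ValueError: malformed string; the others cover non-string input
        return False
    return True


print(is_uuid('f9959bd2-e98f-11eb-884c-aba19178789c'))  # True
print(is_uuid('aaaa-0000-3333'))                        # False
```

Under that assumption, the `funcx_ep_id` passed to `xep1` would not pass this check, so it likely needs to be replaced with a real funcX endpoint ID before extraction can run there.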

## Step 2.a: Crawl
Behind the scenes, crawling scans a Globus directory breadth-first (using globus_ls), first extracting physical metadata such as path, size, and extension. Then, because the *grouper* we selected is 'file_is_group', the crawler simply creates one single-file group per file.
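
To make the breadth-first scan and the 'file_is_group' grouping concrete, here is a minimal local sketch. It is not the Xtract crawler: it assumes the local filesystem (`os.scandir`) as a stand-in for a Globus directory listing, and `crawl_bfs` and the shape of the returned group dictionaries are hypothetical, chosen only for illustration.

```python
import os
from collections import deque


def crawl_bfs(root):
    """Walk `root` breadth-first and return one single-file group per file."""
    groups = []
    queue = deque([root])  # directories waiting to be listed
    while queue:
        current = queue.popleft()
        # sorted() only makes the order deterministic for this illustration
        for entry in sorted(os.scandir(current), key=lambda e: e.name):
            if entry.is_dir(follow_symlinks=False):
                # Listed only after the whole current level has been handled
                queue.append(entry.path)
            elif entry.is_file(follow_symlinks=False):
                physical = {
                    'path': entry.path,
                    'size': entry.stat().st_size,
                    'extension': os.path.splitext(entry.name)[1],
                }
                # 'file_is_group': every file becomes its own group
                groups.append({'files': [physical]})
    return groups
```

Because a FIFO queue holds the directories still to be listed, files at shallower levels are grouped before files in deeper subdirectories.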

The crawl is **non-blocking**, and the crawl_id here will be used to execute and monitor downstream extraction processes. 

The crawl IDs, one per endpoint, are stored as a list in the `xtr` XtractClient object (`xtr.crawl_ids`).

In [7]:
xtr.crawl([xep1, xep2])
print(f'Crawl IDs: {xtr.crawl_ids}')

Crawl IDs: ['4343ae57-118d-4ba9-9355-81fefa4fd6f7', '018d51be-9bf4-44db-acb3-79ad5a36150d']


## Step 2.b: Crawl status
We can check crawl status to see how many groups have been identified so far. If `.crawl()` has already been run, `.get_crawl_status()` reports on the IDs that `.crawl()` stored in the XtractClient object. Otherwise, pass a list of `crawl_ids` to `.get_crawl_status()`.

Note that the number of files still to be crawled cannot be measured: the BFS may not have discovered every file yet, and Globus does not yet offer a recursive file count over a directory and its subdirectories. In other words, we can tell when the crawl is done, but not how far away that point is.

**Warning:** it currently takes up to 30 seconds for a crawl to start. *Why?* Container warming time. 

In [8]:
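import time

# Poll every 2 seconds until every crawl reports 'complete'.
while True:
    # crawl_ids=None uses the IDs stored on `xtr` by .crawl()
    crawl_statuses = xtr.get_crawl_status(crawl_ids=None)
    for resp in crawl_statuses:
        print(resp)

    sub_statuses = [d['status'] for d in crawl_statuses]
    if all(s == 'complete' for s in sub_statuses):
        break

    time.sleep(2)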

{'crawl_id': '4343ae57-118d-4ba9-9355-81fefa4fd6f7', 'status': 'initializing', 'message': 'OK or error', 'data': {}}
{'crawl_id': '018d51be-9bf4-44db-acb3-79ad5a36150d', 'status': 'initializing', 'message': 'OK or error', 'data': {}}
{'crawl_id': '4343ae57-118d-4ba9-9355-81fefa4fd6f7', 'status': 'initializing', 'message': 'OK or error', 'data': {}}
{'crawl_id': '018d51be-9bf4-44db-acb3-79ad5a36150d', 'status': 'initializing', 'message': 'OK or error', 'data': {}}
{'crawl_id': '4343ae57-118d-4ba9-9355-81fefa4fd6f7', 'status': 'initializing', 'message': 'OK or error', 'data': {}}
{'crawl_id': '018d51be-9bf4-44db-acb3-79ad5a36150d', 'status': 'initializing', 'message': 'OK or error', 'data': {}}
{'crawl_id': '4343ae57-118d-4ba9-9355-81fefa4fd6f7', 'status': 'initializing', 'message': 'OK or error', 'data': {}}
{'crawl_id': '018d51be-9bf4-44db-acb3-79ad5a36150d', 'status': 'initializing', 'message': 'OK or error', 'data': {}}
{'crawl_id': '4343ae57-118d-4ba9-9355-81fefa4fd6f7', 'status': '

KeyboardInterrupt: 
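
The polling loop above runs until every crawl reports 'complete', so it has to be interrupted by hand if a crawl never reaches that state. A bounded variant might look like the sketch below. `wait_for_crawls` is a hypothetical helper, not part of `xtract_sdk`; it assumes only that the status call returns a list of dictionaries with `crawl_id` and `status` keys, as in the output above.

```python
import time


def wait_for_crawls(get_statuses, timeout_s=300.0, interval_s=2.0,
                    done_states=('complete',)):
    """Poll `get_statuses()` until every crawl is done or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        statuses = get_statuses()
        if all(s['status'] in done_states for s in statuses):
            return statuses
        if time.monotonic() >= deadline:
            pending = [s['crawl_id'] for s in statuses
                       if s['status'] not in done_states]
            raise TimeoutError(f'crawls still running: {pending}')
        time.sleep(interval_s)


# Possible use with the client from this notebook:
# statuses = wait_for_crawls(lambda: xtr.get_crawl_status(crawl_ids=None))
```

Passing the status call in as a function keeps the helper independent of the client, so it can be tried out with a stub before pointing it at a real crawl.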