# Using Xtract to index research artifacts stored on Jetstream

### This Xtract-Jetstream demo illustrates how to crawl, extract metadata from, and ingest metadata for any Globus Endpoint.

We begin by importing the `XtractClient` and `XtractEndpoint` classes from `xtract_sdk`.

In [None]:
from xtract_sdk.client import XtractClient
from xtract_sdk.endpoint import XtractEndpoint

## Step 1.a: Login
### Creating an XtractClient object

Here we create an `XtractClient` object to request tokens from Globus Auth. When fresh tokens are needed, users authenticate with their Globus ID by following the directions printed to stdout. The default auth scopes are as follows:

* **openid**: provides username for identity.
* **data_mdf**: FILL IN
* **search**: interact with Globus Search
* **petrel**: read or write data on Petrel. Not needed if no data is going to Petrel.
* **transfer**: needed to crawl the Globus endpoint and transfer metadata to its final location.
* **dlhub**: FILL IN
* **funcx_scope**: needed to orchestrate the metadata extraction at the given funcX endpoint.

Additional auth scopes can be added with the `auth_scopes` argument.

The following code block initializes all of the tokens by creating an `XtractClient` object.

In [None]:
xtr = XtractClient(auth_scopes=None, force_login=False)  
print(f'Auths: {xtr.auths}')

## Step 1.b: Defining endpoints
### Creating an XtractEndpoint object

Here we create an `XtractEndpoint` object to be used later in a crawl and in the steps that follow it.

Required arguments are as follows:
* **repo_type**: at this point, only Globus is accepted. GDrive and others are to be implemented at a later date.
* **globus_ep_id**: the source endpoint ID, at this point assumed to be a Globus ID (see previous bullet point).
* **dirs**: the directory paths where the data resides.
* **grouper**: the strategy used to group files.

The XtractEndpoint can also be given a `funcx_ep_id`.
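
Both `globus_ep_id` and `funcx_ep_id` are UUID strings, and a mistyped ID tends to surface only later as a failed crawl or extraction. As a small precaution, the sketch below checks the format up front using only the standard library. `is_valid_endpoint_id` is a hypothetical helper, not part of `xtract_sdk`, and it verifies the format only, not that the endpoint exists or is reachable.

```python
import uuid


def is_valid_endpoint_id(ep_id):
    """Return True if ep_id is a canonically formatted UUID string."""
    try:
        parsed = uuid.UUID(ep_id)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID accepts several spellings (braces, no hyphens); require the
    # hyphenated form used for the endpoint IDs in this notebook.
    return str(parsed) == ep_id.lower()


print(is_valid_endpoint_id('cb61bb16-5144-11ec-a6c6-9b4f84e67de8'))
print(is_valid_endpoint_id('not-a-uuid'))
```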

The following code block creates two `XtractEndpoint` objects, which we can then crawl.

In [None]:
xep1 = XtractEndpoint(repo_type="GLOBUS",
                      globus_ep_id='cb61bb16-5144-11ec-a6c6-9b4f84e67de8',
                      funcx_ep_id='6b3a1745-5e0e-4c60-82db-0faac6cc246f',
                      dirs=['/home/tskluzac/Documents/to-transfer-smaller'],
                      grouper='file_is_group',
                      local_mdata_path="/home/tskluzac/mdata")

transferxep = XtractEndpoint(repo_type="GLOBUS",
                            globus_ep_id='cb61bb16-5144-11ec-a6c6-9b4f84e67de8',
                            dirs=['/home/tskluzac/Documents/to-transfer-smaller'],
                            grouper='file_is_group',
                            local_mdata_path='/home/tskluzac/mdata')

#transferxep.register_containers(container_path='/home/tskluzac/.xtract/.containers')

## Step 2.a: Crawl
Crawling, behind the scenes, will scan a Globus directory breadth-first (using globus_ls), first extracting physical metadata such as path, size, and extension. Next, since the *grouper* we selected is 'file_is_group', the crawler will simply create `n` single-file groups. 
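
As a rough illustration of the crawl described above, the sketch below walks a local directory tree breadth-first, records path, size, and extension for each file, and places every file in its own group. It is not the Xtract crawler: `os.scandir` stands in for the Globus listing call, and the function name `crawl_bfs` and the group layout (`{'files': [...]}`) are assumptions made for this example.

```python
import os
import tempfile
from collections import deque


def crawl_bfs(root):
    """Breadth-first walk of root; returns one single-file group per file."""
    groups = []
    queue = deque([root])
    while queue:
        current = queue.popleft()
        with os.scandir(current) as entries:
            for entry in sorted(entries, key=lambda e: e.name):
                if entry.is_dir(follow_symlinks=False):
                    # Subdirectories are visited later, in FIFO order.
                    queue.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    physical = {
                        'path': entry.path,
                        'size': entry.stat().st_size,
                        'extension': os.path.splitext(entry.name)[1],
                    }
                    # 'file_is_group': every file becomes its own group.
                    groups.append({'files': [physical]})
    return groups


# Tiny demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'sub'))
    for rel in ('a.txt', os.path.join('sub', 'b.csv')):
        with open(os.path.join(tmp, rel), 'w') as f:
            f.write('demo')
    demo_groups = crawl_bfs(tmp)
    print(len(demo_groups), 'groups')
```

With `n` files under the root, the function returns `n` groups, each holding exactly one file.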

The crawl is **non-blocking**, and the crawl_id here will be used to execute and monitor downstream extraction processes. 

The crawl ID for each endpoint will be stored in the XtractClient object as a list `self.crawl_ids`.

In [None]:
xtr.crawl([xep1])  # additional XtractEndpoint objects can be added to this list
print(f'Crawl IDs: {xtr.crawl_ids}')

## Step 2.b: Crawl status
We can get crawl status, seeing how many groups have been identified in the crawl. If `xtr.crawl()` has already been run, then `xtr.get_crawl_status()` will get the status of the IDs stored in `xtr.crawl_ids`. Otherwise, a list of `crawl_ids` may be given to `xtr.get_crawl_status()`.

This will return a dictionary resembling: 
```
{'crawl_id': 'xxx',
 'status': 'xxx',
 'message': "OK" if everything is fine, otherwise describes the error,
 'data': {'bytes_crawled': xxx, ..., 'files_crawled': xxx}}
```

Note that measuring the total files yet to crawl is impossible, as the BFS may not have discovered all files yet, and Globus does not yet have a file counting feature for all directories and subdirectories. I.e., we know when we're done, but we do not know until we get there. 
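
Because the total is unknown until the crawl finishes, a status line can only report progress so far rather than a percentage. The sketch below formats one status dictionary that way. It relies only on the keys shown above; `summarize_crawl_status` is a hypothetical helper, and the example values (including the `'crawling'` status string) are made up for illustration.

```python
def summarize_crawl_status(resp):
    """Format one crawl-status dictionary as a single progress line."""
    data = resp.get('data') or {}
    line = (f"{resp.get('crawl_id')}: {resp.get('status')} "
            f"({data.get('files_crawled', 0)} files, "
            f"{data.get('bytes_crawled', 0)} bytes so far)")
    # Surface the message only when it reports something other than OK.
    if resp.get('message') != 'OK':
        line += f" [message: {resp.get('message')}]"
    return line


# Made-up response in the shape shown above; the values are illustrative.
example = {'crawl_id': 'abc123', 'status': 'crawling', 'message': 'OK',
           'data': {'bytes_crawled': 2048, 'files_crawled': 3}}
print(summarize_crawl_status(example))
```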

**Warning:** it currently takes up to 30 seconds for a crawl to start. *Why?* Container warming time. 
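
Given the start-up delay, it can be useful to bound how long a notebook waits on a crawl. The sketch below is one way to poll with a timeout. `wait_for_completion` is a hypothetical helper, not an `xtract_sdk` function; it assumes only that the status call returns a list of dictionaries with a `'status'` key that eventually becomes `'complete'`. A stand-in callable replaces the real client so the example runs on its own.

```python
import time


def wait_for_completion(get_statuses, timeout_s=300.0, interval_s=2.0):
    """Poll get_statuses() until every status is 'complete' or time runs out.

    get_statuses is any zero-argument callable returning a list of dicts
    with a 'status' key, e.g. lambda: xtr.get_crawl_status(crawl_ids=None).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        statuses = get_statuses()
        if all(d['status'] == 'complete' for d in statuses):
            return statuses
        if time.monotonic() >= deadline:
            raise TimeoutError(f'not complete after {timeout_s} s')
        time.sleep(interval_s)


# Stand-in for the real client: it reports an in-progress status twice,
# then 'complete'. The 'crawling' status string is illustrative.
_fake = iter(['crawling', 'crawling', 'complete'])
final = wait_for_completion(lambda: [{'status': next(_fake)}],
                            timeout_s=5.0, interval_s=0.01)
print(final)
```

Using `time.monotonic()` for the deadline keeps the timeout unaffected by system clock adjustments.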

In [None]:
import time

while True:
    
    crawl_statuses = xtr.get_crawl_status(crawl_ids=None)
    for resp in crawl_statuses:
        print(resp)
    
    sub_statuses = [d['status'] for d in crawl_statuses]
    if all(s == 'complete' for s in sub_statuses):
        break
    
    time.sleep(2)

## Step 2.c: Flushing crawl metadata

After running a crawl, we can use `xtr.flush_crawl_metadata()` to return a list of all metadata from the crawl. 

As with `.get_crawl_status()`, if `xtr.crawl()` has already been run, then `xtr.flush_crawl_metadata()` will flush the metadata for the IDs stored in `xtr.crawl_ids`. Otherwise, a list of `crawl_ids` may be given to `xtr.flush_crawl_metadata()`.
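
Each flush response carries a `queue_empty` flag alongside `file_ls`, which suggests collecting results until the queue reports empty. The sketch below shows that pattern under two assumptions that are not confirmed by the SDK: that each call returns a single dictionary, and that `queue_empty` being true means nothing further is waiting to be flushed. `drain_metadata` is a hypothetical helper, and a stub replaces `xtr.flush_crawl_metadata` so the example is runnable. In practice the queue may be empty only momentarily while a crawl is still running, so this check is probably best combined with the crawl status.

```python
import time


def drain_metadata(flush, interval_s=1.0, max_polls=1000):
    """Call flush() repeatedly, collecting 'file_ls' until 'queue_empty'."""
    collected = []
    for _ in range(max_polls):
        resp = flush()
        collected.extend(resp['file_ls'])
        if resp['queue_empty']:
            return collected
        time.sleep(interval_s)
    raise RuntimeError(f'queue not empty after {max_polls} polls')


# Stub standing in for xtr.flush_crawl_metadata; the values are made up.
_batches = iter([
    {'crawl_id': 'abc123', 'file_ls': ['f1', 'f2'], 'num_files': 2,
     'queue_empty': False},
    {'crawl_id': 'abc123', 'file_ls': ['f3'], 'num_files': 1,
     'queue_empty': True},
])
all_files = drain_metadata(lambda: next(_batches), interval_s=0.01)
print(all_files)
```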

Flushing crawl metadata will return a dictionary resembling:
```
{"crawl_id": String,
 "file_ls": List,
 "num_files": Integer,
 "queue_empty": Boolean}
```

In [None]:
# Note: this loop has no exit condition; interrupt the kernel to stop it.
while True:
    
    print(xtr.flush_crawl_metadata(crawl_ids=None))

    time.sleep(1)

## Step 3.a: Xtract-ing

Under construction

In [None]:
print(f"Should be 200: {xtr.xtract()}")

## Step 3.b: Getting Xtract status

Under construction

In [None]:
import time

while True:
    
    xtract_statuses = xtr.get_xtract_status()
    for resp in xtract_statuses:
        print(resp)
    
    sub_statuses = [d['xtract_status'] for d in xtract_statuses]
    if all(s == 'complete' for s in sub_statuses):
        break
    
    time.sleep(2)

## Offload Metadata

Under construction

In [None]:
from xtract_sdk.client import XtractClient
from xtract_sdk.endpoint import XtractEndpoint

import globus_sdk
import time

xtr = XtractClient(auth_scopes=None, force_login=False)  
print(f'Auths: {xtr.auths}')

# transferxep = XtractEndpoint(repo_type="GLOBUS",
#                             globus_ep_id='cb61bb16-5144-11ec-a6c6-9b4f84e67de8',
#                             dirs=['/home/tskluzac/Documents/to-transfer-smaller'],
#                             grouper='file_is_group',
#                             local_mdata_path='/home/tskluzac/mdata')

transferxep = XtractEndpoint(repo_type="GLOBUS",
                            globus_ep_id='4f99675c-ac1f-11ea-bee8-0e716405a293',
                            dirs=['/tyler_g_test/FREN'],
                            grouper='file_is_group',
                            local_mdata_path='/tyler_g_test/FREN')

In [None]:
xtr.crawl_and_wait([transferxep])

In [None]:
xtr.offload_metadata(dest_ep_id='0caf6e8e-4974-11ec-a515-b537d6c07c1d',
                    dest_path='Desktop/mdata/',
                    delete_source=True)