# Xtract SDK v0.0.7a6

## Login: Creating an XtractClient object

In [179]:
from xtract_sdk.client import XtractClient

First, we import the XtractClient class from the Xtract SDK.

In [180]:
xtr = XtractClient(auth_scopes=[], dev=True, force_login=False)
# print(xtr.auths)

Here we create an XtractClient object to request tokens from Globus Auth.

The **auth_scopes** argument accepts an optional list of strings, each naming an authorization scope. Scopes passed in **auth_scopes** are requested in addition to the following default scopes, which the system requests automatically:

* **openid**: provides the username for identity.
* **search**: needed to interact with Globus Search.
* **petrel**: needed to read or write data on Petrel. Not needed if no data is going to Petrel.
* **transfer**: needed to crawl the Globus endpoint and transfer metadata to its final location.
* **funcx_scope**: needed to orchestrate the metadata extraction at the given funcX endpoint.

When true, **force_login** makes you go through the full authorization flow again.

## Defining endpoints: Creating an XtractEndpoint object
Endpoints in Xtract are the computing fabric that enables us to move files and apply extractors to them. To this end, an Xtract endpoint is the combination of the following two software endpoints:
* **Globus endpoints** [required] enable us to access all file system metadata about files stored on an endpoint, and to transfer files between machines for more efficient processing.
* **funcX endpoints** [optional] are capable of remotely receiving extraction functions that can be applied to files on the Globus endpoint. Note that if an Xtract endpoint has no funcX endpoint, a file must be transferred to an endpoint *with* a valid funcX endpoint in order to have its metadata extracted.

In [181]:
from xtract_sdk.endpoint import XtractEndpoint

In order to create an Xtract endpoint, we first import the XtractEndpoint class from the Xtract SDK.

In [182]:
# xep1 = XtractEndpoint(repo_type="GLOBUS",
#                       globus_ep_id='f9959bd2-e98f-11eb-884c-aba19178789c',
#                       dirs=['/home/tskluzac/cord-19-test'],
#                        #globus_ep_id='24a66214-76eb-11ec-9b64-f9dfb1abb183',
#                       # dirs=['/home/tskluzac/demo_files'],
#                       grouper='file_is_group',
#                       #funcx_ep_id='6b3a1745-5e0e-4c60-82db-0faac6cc246f',
#                       funcx_ep_id='e1398319-0d0f-4188-909b-a978f6fc5621',
#                       metadata_directory='/home/tskluzac/mdata')

xep1 = XtractEndpoint(repo_type="GLOBUS",
                      globus_ep_id='08925f04-569f-11e7-bef8-22000b9a448b',
                      dirs=['/eagle/Xtract/cdiac-decomp'],
                      # globus_ep_id='24a66214-76eb-11ec-9b64-f9dfb1abb183',
                      # dirs=['/home/tskluzac/demo_files'],
                      grouper='file_is_group',
                      # funcx_ep_id='6b3a1745-5e0e-4c60-82db-0faac6cc246f',
                      funcx_ep_id='e1398319-0d0f-4188-909b-a978f6fc5621',
                      metadata_directory='/home/tskluzac/mdata')

Then we can create the endpoint object to be used later in a crawl, xtraction, etc. The arguments are as follows:
* **repo_type**: (str) at this point, only Globus is accepted. Google Drive and others will be made available at a later date.
* **globus_ep_id**: (uuid str) the Globus endpoint ID.
* **dirs**: (list of str) directory paths on the Globus endpoint where the data reside.
* **grouper**: (str) grouping strategy for files.
* **funcx_ep_id**: (optional uuid str) funcX endpoint ID.
* **metadata_directory**: (optional str) directory path on the Globus endpoint where xtraction metadata should go.
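
Because both ID arguments are UUID strings, a malformed ID can be caught locally before an endpoint object is built. The following is a minimal sketch using only the Python standard library; `is_uuid_string` is a hypothetical helper and not part of the Xtract SDK. Note that `uuid.UUID` is lenient and also accepts forms without hyphens.

```python
import uuid


def is_uuid_string(value):
    """Return True if `value` parses as a UUID (hypothetical helper, not an SDK call)."""
    try:
        uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        # ValueError: badly formed string; the others: value is not a string.
        return False
    return True


# The Globus endpoint ID used in the cell above, followed by a malformed one.
print(is_uuid_string('08925f04-569f-11e7-bef8-22000b9a448b'))  # True
print(is_uuid_string('not-a-uuid'))                            # False
```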

## Crawling

In [171]:
xtr.crawl(endpoints=[xep1])

['5e4e0c8b-ded3-4f56-8de4-121587006b57']

Where **endpoints** is a list of XtractEndpoint objects.

The crawl ID for each endpoint will be stored in the XtractClient object as a list `xtr.crawl_ids`. Furthermore, each endpoint will be stored in the XtractClient object in a dictionary `cid_to_endpoint_map`, where each crawl ID key maps to the corresponding endpoint as a value.

Behind the scenes, this will scan a Globus directory breadth-first (using globus_ls), first extracting physical metadata such as path, size, and extension. Next, since the *grouper* we selected is 'file_is_group', the crawler will create a single-file group for every file it finds.
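
To make the traversal concrete, here is a minimal sketch of a breadth-first scan over a *local* directory tree. It is an analogue for illustration only, not the Xtract crawler: it uses `os.scandir` rather than Globus, and the function name `bfs_crawl` and the group layout (`{'files': [...]}`) are assumptions made for this example.

```python
import os
from collections import deque


def bfs_crawl(root):
    """Breadth-first scan of a local directory tree (illustrative stand-in).

    Records path, size, and extension for each file, and places each file
    in its own single-file group, mirroring the 'file_is_group' idea.
    """
    groups = []
    queue = deque([root])
    while queue:
        current = queue.popleft()  # FIFO order gives breadth-first traversal
        with os.scandir(current) as entries:
            for entry in sorted(entries, key=lambda e: e.name):
                if entry.is_dir(follow_symlinks=False):
                    queue.append(entry.path)  # visit after the current level
                elif entry.is_file(follow_symlinks=False):
                    metadata = {
                        'path': entry.path,
                        'size': entry.stat().st_size,
                        'extension': os.path.splitext(entry.name)[1],
                    }
                    groups.append({'files': [metadata]})
    return groups
```

Because directories are queued and only visited after the current level is finished, the total number of files is unknown until the queue is empty.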

The crawl is **non-blocking**, and the crawl ID returned here is used to execute and monitor downstream extraction processes.

### Getting Crawl status

In [178]:
xtr.get_crawl_status(crawl_ids=None)

[{'crawl_id': '5e4e0c8b-ded3-4f56-8de4-121587006b57',
  'status': 'initializing',
  'message': 'OK',
  'data': {}}]

We can get crawl status, seeing how many groups have been identified in the crawl. If `xtr.crawl()` has already been run, then `xtr.get_crawl_status()` will get the status of the IDs stored in `xtr.crawl_ids`. Otherwise, a list of `crawl_ids` may be given to `xtr.get_crawl_status()`.

This will return a list of dictionaries, one per crawl ID, each resembling:
```
{'crawl_id': String,
 'status': String,
 'message': "OK" if everything is fine, otherwise describes the error,
 'data': {'bytes_crawled': Integer, ..., 'files_crawled': Integer}}
```

Note that measuring the total number of files yet to be crawled is impossible, as the BFS may not have discovered all files yet, and Globus does not yet have a file-counting feature that covers all directories and subdirectories. In other words, we know when the crawl is done, but not how much remains before then.
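
The status entries can be condensed into one line per crawl for logging. This is a minimal sketch that relies only on the keys shown above; `summarize_crawl_status` is a hypothetical helper, not part of the SDK, and the sample entry is copied from the output above.

```python
def summarize_crawl_status(statuses):
    """Format each crawl-status entry as one line of text.

    Counters missing from 'data' (for example while a crawl is still
    'initializing') are shown as '?' rather than assumed to be zero.
    """
    lines = []
    for entry in statuses:
        data = entry.get('data', {})
        lines.append(
            f"{entry['crawl_id']}: {entry['status']} "
            f"(files={data.get('files_crawled', '?')}, "
            f"bytes={data.get('bytes_crawled', '?')}, "
            f"message={entry['message']})"
        )
    return lines


# Sample entry copied from the output above.
sample = [{'crawl_id': '5e4e0c8b-ded3-4f56-8de4-121587006b57',
           'status': 'initializing',
           'message': 'OK',
           'data': {}}]
for line in summarize_crawl_status(sample):
    print(line)
```

With the SDK, the same helper would presumably be called as `summarize_crawl_status(xtr.get_crawl_status())`.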

**Warning:** it currently takes up to 30 seconds for a crawl to start. *Why?* Container warming time. 

### Crawl and wait

In [83]:
# xtr.crawl_and_wait(endpoints=[xep1])

Where **endpoints** is a list of XtractEndpoint objects.

For ease of testing, we've implemented a **crawl_and_wait** method, which will crawl the given endpoints and then print the crawl status of all given endpoints every two seconds until all have completed crawling.
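
The print-and-wait pattern described above can be sketched generically. This is not the SDK's implementation: `poll_until` and its parameters are hypothetical, and the completion test is passed in as `is_done` because this section does not list the status strings that mark a finished crawl. The `'done'` value below is a placeholder for demonstration only.

```python
import time


def poll_until(get_status, is_done, interval=2.0, max_polls=None):
    """Call get_status() every `interval` seconds until is_done(status) is true.

    Returns the last status seen. If max_polls is set, gives up after
    that many calls instead of looping forever.
    """
    polls = 0
    while True:
        status = get_status()
        print(status)
        polls += 1
        if is_done(status) or (max_polls is not None and polls >= max_polls):
            return status
        time.sleep(interval)


# Stand-in status source so the sketch runs without a live service.
# With the SDK, one would presumably pass xtr.get_crawl_status instead.
fake_statuses = iter([[{'status': 'initializing'}], [{'status': 'done'}]])
final = poll_until(lambda: next(fake_statuses),
                   is_done=lambda s: all(e['status'] == 'done' for e in s),
                   interval=0)
```

Passing `max_polls` bounds the loop, which is useful when a crawl stalls and would otherwise require a manual interrupt.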

## Xtract-ing

### Registering containers for Xtraction

In [148]:
import requests
# print(xtr.extract_url)
# x = requests.get(xtr.extract_url)
xtr.register_containers(endpoint=xep1,
                        container_path='/home/tskluzac/.xtract/.containers')


'Register containers status (should be 200): 200'

The **endpoint** argument should be an XtractEndpoint object, and the **container_path** (str) argument should be the path to the xtraction containers on the Globus endpoint.


In order to perform an xtraction, we must have the requisite containers for each extractor that is to be used. After creating client and endpoint instances, containers must be registered for each endpoint, using `.register_containers()`. 

This can be executed regardless of **crawl** completion status.

### Xtract

In [149]:
xtr.xtract()

[<Response [200]>]

The **crawl** method must have already been run, and an **xtract**ion will be run for each endpoint that was given to **crawl**. **xtract** will return a list of HTTP responses, each of which should have status code 200.
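
Rather than inspecting the returned list by eye, the status codes can be checked programmatically. This sketch assumes each returned item exposes a `status_code` attribute, as the `<Response [200]>` output suggests; `failed_responses` is a hypothetical helper, and stand-in objects are used so the example runs without a live service.

```python
from types import SimpleNamespace


def failed_responses(responses):
    """Return (index, status_code) for every response whose status code is not 200."""
    return [(i, r.status_code)
            for i, r in enumerate(responses)
            if r.status_code != 200]


# Stand-in objects with a status_code attribute (not real HTTP responses).
responses = [SimpleNamespace(status_code=200), SimpleNamespace(status_code=500)]
print(failed_responses(responses))  # [(1, 500)]
```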

### Getting Xtract status

In [150]:
import time

xtr.get_xtract_status()

# Print the xtract status every two seconds; interrupt the kernel to stop.
while True:
    print(xtr.get_xtract_status())
    time.sleep(2)

[{'xtract_status': 'SCHEDULED', 'xtract_counters': {'cumu_orch_enter': 0, 'cumu_pulled': 5, 'cumu_scheduled': 3, 'cumu_to_schedule': 5, 'flagged_unknown': 0, 'fx': {'failed': 0, 'pending': 0, 'success': 0}}}]
[{'xtract_status': 'EXTRACTING', 'xtract_counters': {'cumu_orch_enter': 3, 'cumu_pulled': 5, 'cumu_scheduled': 3, 'cumu_to_schedule': 5, 'flagged_unknown': 0, 'fx': {'failed': 0, 'pending': 0, 'success': 0}}}]
[{'xtract_status': 'EXTRACTING', 'xtract_counters': {'cumu_orch_enter': 3, 'cumu_pulled': 5, 'cumu_scheduled': 3, 'cumu_to_schedule': 5, 'flagged_unknown': 0, 'fx': {'failed': 0, 'pending': 0, 'success': 0}}}]
[{'xtract_status': 'EXTRACTING', 'xtract_counters': {'cumu_orch_enter': 3, 'cumu_pulled': 5, 'cumu_scheduled': 3, 'cumu_to_schedule': 5, 'flagged_unknown': 0, 'fx': {'failed': 0, 'pending': 0, 'success': 0}}}]
[{'xtract_status': 'EXTRACTING', 'xtract_counters': {'cumu_orch_enter': 3, 'cumu_pulled': 5, 'cumu_scheduled': 3, 'cumu_to_schedule': 5, 'flagged_unknown': 0, 'f

KeyboardInterrupt: 

The **xtract** method must have already been run, and this call will return a list of **xtract statuses**, one for each endpoint given to **crawl**.

This will return a list of dictionaries, each resembling:

```
{'xtract_status': String,
 'xtract_counters': {'cumu_orch_enter': Integer,
                     'cumu_pulled': Integer,
                     'cumu_scheduled': Integer,
                     'cumu_to_schedule': Integer,
                     'flagged_unknown': Integer,
                     'fx': {'failed': Integer,
                            'pending': Integer,
                            'success': Integer}}}
```
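
When several endpoints are being extracted at once, the nested `'fx'` counters can be summed for a quick overview. This is a minimal sketch over the structure shown above; `fx_totals` is a hypothetical helper, not an SDK method, and the sample entry is copied from the output above.

```python
def fx_totals(statuses):
    """Sum the 'fx' counters across all xtract-status entries."""
    totals = {'failed': 0, 'pending': 0, 'success': 0}
    for entry in statuses:
        fx = entry['xtract_counters']['fx']
        for key in totals:
            totals[key] += fx.get(key, 0)
    return totals


# Sample entry copied from the output above.
sample_status = [{'xtract_status': 'EXTRACTING',
                  'xtract_counters': {'cumu_orch_enter': 3, 'cumu_pulled': 5,
                                      'cumu_scheduled': 3, 'cumu_to_schedule': 5,
                                      'flagged_unknown': 0,
                                      'fx': {'failed': 0, 'pending': 0, 'success': 0}}}]
print(fx_totals(sample_status))  # {'failed': 0, 'pending': 0, 'success': 0}
```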

## Offload metadata

In [None]:
xtr.offload_metadata(dest_ep_id='0caf6e8e-4974-11ec-a515-b537d6c07c1d',
                     dest_path='Desktop/mdata/',
                     timeout=600,
                     delete_source=False)

The **offload_metadata** method can be used to transfer files between two endpoints, and is included in this SDK for the purpose of transferring the metadata produced by an **xtract**ion. It takes the following arguments:
* **dest_ep_id**: (str) the ID of the endpoint to which the files are being transferred.
* **dest_path**: (optional str) the path on the destination endpoint where the files should go.
* **timeout**: (optional int, default 600) how long to wait for the transfer before giving up if it has not succeeded.
* **delete_source**: (optional boolean, default False) set to True if the source files should be deleted after the metadata transfer completes.

This method will transfer the metadata to a new folder (inside the destination path, if supplied) which is named using the convention **YYYY-MM-DD-hh:mm:ss**. Calling the method will return the path to this folder on the destination endpoint.
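
For reference, a folder name in this convention can be produced with the standard library as sketched below. This assumes **hh** is a 24-hour value; whether the SDK uses local time or UTC is not stated here, and `metadata_folder_name` is a hypothetical helper rather than an SDK function.

```python
from datetime import datetime


def metadata_folder_name(moment):
    """Format a datetime using the YYYY-MM-DD-hh:mm:ss convention (24-hour clock assumed)."""
    return moment.strftime('%Y-%m-%d-%H:%M:%S')


print(metadata_folder_name(datetime(2022, 1, 16, 14, 5, 9)))  # 2022-01-16-14:05:09
```

Names in this form sort chronologically as plain strings, but note that the colons are not valid in file names on some operating systems.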

## Search: coming soon! 

## Downloaders: coming soon! 