# Skluma Globus Crawler Notebook

#### <font color = 'grey'> This Jupyter notebook walks through how to initialize, run, and evaluate a Skluma metadata extraction job on a Globus endpoint.</font>

## <font color = 'blue'> Step 1: Initialization </font>
    
### Here we use MDF Forge to create a Globus transfer client, which lets us access the endpoint, check that the directory to scan exists, and pull the files to the server running this Jupyter notebook.

In [7]:
import globus_sdk
from mdf_forge.forge import Forge

## DEAR USERS: CHANGE THESE TO MATCH THE GLOBUS ENDPOINT AND THE ROOT DIRECTORY ON IT THAT YOU WANT TO SCAN.
ENDPOINT_UUID = "3a261574-3a83-11e8-b997-0ac6873fc732"
ROOT_DIR = "/"

In [8]:
# On the first login from this machine, Forge directs you to a Globus Auth copy/paste page.
mdf = Forge()   

# From mdf's stored Globus refresh token, get an authorized transfer-client 
tc = mdf.transfer_client

# Attempt to connect to the Globus endpoint. Will give status 200 if successful! 
try: 
    r = tc.get_endpoint(ENDPOINT_UUID)
    print("HTTP Status Code:", r.http_status)
    print("Endpoint Display Name:", r["display_name"])
    
except globus_sdk.GlobusAPIError as ex:
    print("HTTP Status Code:", ex.http_status)
    print("Error Code      :", ex.code)
    print("Error Message   :", ex.message)

HTTP Status Code: 200
Endpoint Display Name: CDIAC Dataset
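
The cell above confirms that the endpoint is reachable. Checking that the directory to scan exists could look something like the sketch below. It rests on assumptions: `directory_exists` is a hypothetical helper, not part of Skluma or the Globus SDK, and it treats any error of the supplied `error_type` raised by `operation_ls` as "this path cannot be listed", which also covers permission problems. `FakeClient` exists only so the example runs offline.

```python
def directory_exists(client, endpoint_id, path, error_type=Exception):
    """Return True if `path` can be listed on the endpoint, else False.

    `client` only needs an operation_ls(endpoint_id, path=...) method;
    `error_type` is the exception class that signals a failed listing.
    """
    try:
        client.operation_ls(endpoint_id, path=path)
    except error_type:
        return False
    return True


class FakeClient:
    """Offline stand-in for a transfer client (illustration only)."""

    def __init__(self, known_paths):
        self.known_paths = set(known_paths)

    def operation_ls(self, endpoint_id, path="/"):
        # Mimic a failed listing for paths the fake endpoint does not have.
        if path not in self.known_paths:
            raise LookupError("no such directory: " + path)
        return []


fake = FakeClient({"/", "/data/"})
print(directory_exists(fake, "fake-endpoint", "/", LookupError))
print(directory_exists(fake, "fake-endpoint", "/missing/", LookupError))
```

In this notebook the call would presumably be `directory_exists(tc, ENDPOINT_UUID, ROOT_DIR, globus_sdk.GlobusAPIError)`.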


## <font color='blue'> Step 2: Running Skluma </font> 

### Next, we iterate over the directories in the Globus endpoint, submit each file to Skluma for metadata extraction, and push the metadata to a Globus Search index.

In [None]:
### DFS by iteratively searching for directories in directories. 
endpoint_id = ENDPOINT_UUID
endpoint_path = ROOT_DIR

visited = set()
unvisited_list = [endpoint_path]

# Just so we can count directories without too many print statements
dir_count = 0
file_count = 0  ### TODO: create rudimentary file counter. 

while unvisited_list:  # while there are unvisited directories
    
    # Pop the most recently discovered directory (stack order gives DFS).
    current_path = unvisited_list.pop()
        
    visited.add(current_path)
    r = tc.operation_ls(endpoint_id, path=current_path)
    for item in r:
        # Strip any trailing slash before joining so paths never contain "//".
        child_path = current_path.rstrip("/") + "/" + item["name"] + "/"
        if item["type"] == 'dir' and child_path not in visited:
            dir_count += 1
            unvisited_list.append(child_path)
            if dir_count % 50 == 0: 
                print("Processed " + str(file_count) + " files in " + str(dir_count) + " directories...")
                
# str() is needed here: concatenating an int to a str raises a TypeError.
print("\n Total Directories Processed: " + str(dir_count))
print("\n Total Files Processed: " + str(file_count))

print("TODO: Pushing files to Globus Search...")  ### TODO. 

Processed 0 files in 50 directories...
Processed 0 files in 100 directories...
Processed 0 files in 150 directories...
Processed 0 files in 200 directories...
Processed 0 files in 250 directories...
Processed 0 files in 300 directories...
Processed 0 files in 350 directories...
Processed 0 files in 400 directories...
Processed 0 files in 450 directories...
Processed 0 files in 500 directories...
Processed 0 files in 550 directories...
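
The file counter marked TODO in the traversal above could be filled in along these lines. This is a minimal sketch, not the notebook's implementation: `crawl` and `list_dir` are hypothetical names, and `list_dir(path)` is assumed to return entries shaped like the `operation_ls` items used above (dicts with `"name"` and `"type"` keys). An in-memory tree with made-up names stands in for the endpoint so the sketch runs without Globus credentials.

```python
def crawl(list_dir, root="/"):
    """Iterative DFS from `root`; returns (dir_count, file_count).

    `list_dir` is a hypothetical callable: list_dir(path) -> iterable of
    dicts with "name" and "type" ("dir" or "file") keys.
    """
    visited = set()
    unvisited = [root]
    dir_count = 0   # directories discovered below the root
    file_count = 0  # files seen in every listed directory

    while unvisited:
        # Popping from the end of the list makes it a stack, hence DFS.
        current_path = unvisited.pop()
        if current_path in visited:
            continue
        visited.add(current_path)

        for item in list_dir(current_path):
            if item["type"] == "dir":
                child_path = current_path.rstrip("/") + "/" + item["name"] + "/"
                if child_path not in visited:
                    dir_count += 1
                    unvisited.append(child_path)
            elif item["type"] == "file":
                file_count += 1

    return dir_count, file_count


# Toy directory tree standing in for a Globus endpoint (made-up names).
FAKE_TREE = {
    "/": [{"name": "a", "type": "dir"}, {"name": "readme.txt", "type": "file"}],
    "/a/": [{"name": "b", "type": "dir"}, {"name": "x.csv", "type": "file"}],
    "/a/b/": [{"name": "y.csv", "type": "file"}, {"name": "z.csv", "type": "file"}],
}

dirs, files = crawl(lambda path: FAKE_TREE.get(path, []))
print("Total Directories Processed: " + str(dirs))
print("Total Files Processed: " + str(files))
```

Against a real endpoint, the listing callable would presumably wrap the transfer client, for example `crawl(lambda p: tc.operation_ls(ENDPOINT_UUID, path=p), ROOT_DIR)`.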


## <font color='blue'> Step 3: Evaluate Outputs </font>

### Finally, we evaluate Skluma's metadata outputs. 