# Skluma Globus Crawler Notebook

#### <font color = 'grey'> This Jupyter notebook walks through how one can initialize, run, and evaluate a Skluma metadata extraction job on a Globus endpoint.</font>   

## <font color = 'blue'> Step 1: Initialization.  </font>
    
#### Here we use MDF Forge to create a Globus transfer client, which lets us access the endpoint, check that the directory to scan exists, and pull files to the server running this Jupyter notebook.

In [16]:
import globus_sdk
from mdf_forge.forge import Forge

# NOTE TO USERS: change these to match the Globus endpoint, and the root directory on that endpoint, that you want to scan.
ENDPOINT_UUID = "3a261574-3a83-11e8-b997-0ac6873fc732"
ROOT_DIR = "/cdiac.ornl.gov/pub8old/pub8/oceans/a23woce/"

In [17]:
# If this is your first login on this machine, Forge will direct you to a Globus Auth copy/paste page.
mdf = Forge()   

# From mdf's stored Globus refresh token, get an authorized transfer-client 
tc = mdf.transfer_client

# Attempt to connect to the Globus endpoint. Will give status 200 if successful! 
try: 
    r = tc.get_endpoint(ENDPOINT_UUID)
    print("HTTP Status Code:", r.http_status)
    print("Endpoint Display Name:", r["display_name"])
    
except globus_sdk.GlobusAPIError as ex:
    print("HTTP Status Code:", ex.http_status)
    print("Error Code      :", ex.code)
    print("Error Message   :", ex.message)

HTTP Status Code: 200
Endpoint Display Name: CDIAC Dataset
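
The cell above confirms that the endpoint is reachable but does not yet check that the directory to scan exists. The following is a minimal sketch of one way to do that check, not the notebook's own method. The names `directory_exists`, `list_dir`, `is_not_found`, and `fake_ls` are hypothetical, and the listing call is passed in as a function so the sketch runs without a Globus login.

```python
def directory_exists(list_dir, path, is_not_found):
    """Return True if list_dir(path) succeeds, False if it fails with a
    'not found' error. Any other error is re-raised so that problems such
    as permission failures are not mistaken for a missing directory."""
    try:
        list_dir(path)
    except Exception as ex:
        if is_not_found(ex):
            return False
        raise
    return True


# Hypothetical stand-in for a remote listing call: only "/data/" exists.
def fake_ls(path):
    if path != "/data/":
        raise FileNotFoundError(path)
    return []


def missing(ex):
    # For the stand-in, a missing path shows up as FileNotFoundError.
    return isinstance(ex, FileNotFoundError)


print(directory_exists(fake_ls, "/data/", missing))
print(directory_exists(fake_ls, "/missing/", missing))
```

With the notebook's client, one could pass `lambda p: tc.operation_ls(ENDPOINT_UUID, path=p)` as `list_dir` and `lambda ex: getattr(ex, "http_status", None) == 404` as `is_not_found`. That a missing path surfaces as an HTTP 404 is an assumption to verify against your endpoint.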


## <font color='blue'> Step 2: Running Skluma </font> 

#### Next, we iterate over the directories in the Globus endpoint, submit each file to Skluma for metadata extraction, and push the metadata to a Globus Search index.

In [26]:
# Depth-first search (DFS): iteratively list the directories found inside each directory.
endpoint_id = ENDPOINT_UUID
endpoint_path = ROOT_DIR

destination_id = "8a8c01a4-02bb-11e8-a65a-0a448319c2f8"

visited_dirs = set()
unvisited_dir_list = [endpoint_path]
visited_files = set()
unvisited_file_list = []

# Just so we can count directories without too many print statements
dir_count = 0
file_count = 0  ### TODO: create rudimentary file counter. 

while unvisited_dir_list:  # while unvisited directories remain...
    
    # Pop the most recently discovered directory (stack order gives DFS).
    current_path = unvisited_dir_list.pop()
        
    visited_dirs.add(current_path)
    r = tc.operation_ls(endpoint_id, path=current_path)
    for item in r:
        # current_path always ends with "/", so append the name directly (avoids "//").
        sub_dir = current_path + item["name"] + "/"
        if item["type"] == 'dir' and sub_dir not in visited_dirs:
            dir_count += 1
            unvisited_dir_list.append(sub_dir)
            if dir_count % 50 == 0: 
                print("Processed " + str(file_count) + " files in " + str(dir_count) + " directories...")      
        elif item["type"] == 'file':
            print(current_path + item["name"])
            tdata = globus_sdk.TransferData(tc, endpoint_id,
                                destination_id,
                                label="debug sklobus notebook")
            tdata.add_item(current_path + item["name"], "/home/skluzacek/Downloads/" + item["name"])
            
            submit_result = tc.submit_transfer(tdata)
            print("Task ID:", submit_result["task_id"])

            # TODO (2): extract metadata from the transferred copy
            # TODO (3): delete the file's local copy
                
print("\n Total Directories Processed: " + str(dir_count))
print("\n Total Files Processed: " + str(file_count))

print("TODO: Pushing files to Globus Search...")  ### TODO. 

/cdiac.ornl.gov/pub8old/pub8/oceans/a23woce/README
Task ID: 0902c134-f40c-11e8-8ccf-0a1d4c5c824a
/cdiac.ornl.gov/pub8old/pub8/oceans/a23woce/README~
Task ID: 094d0168-f40c-11e8-8ccf-0a1d4c5c824a
/cdiac.ornl.gov/pub8old/pub8/oceans/a23woce/a23_hy1.csv
Task ID: 09236705-f40c-11e8-8ccf-0a1d4c5c824a
/cdiac.ornl.gov/pub8old/pub8/oceans/a23woce/a23do.pdf
Task ID: 09e62974-f40c-11e8-8ccf-0a1d4c5c824a

 Total Directories Processed: 0

 Total Files Processed: 0
TODO: Pushing files to Globus Search...
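
The traversal logic in the cell above can be tried without a Globus connection. The sketch below is an illustration under stated assumptions rather than the notebook's implementation: `crawl`, `list_dir`, and `FAKE_TREE` are hypothetical names, the directory listing is supplied as a function, and the in-memory tree stands in for an endpoint. Unlike the cell above, it collects file paths so that files can be counted, and its directory count includes the root.

```python
import posixpath


def crawl(list_dir, root):
    """Iterative depth-first walk starting at root.

    list_dir(path) must return an iterable of dicts with "type"
    ("dir" or "file") and "name" keys, similar to the items the
    notebook reads from the listing response.
    Returns (list_of_file_paths, number_of_directories_listed).
    """
    visited_dirs = set()
    stack = [root]
    files = []
    while stack:
        current = stack.pop()  # last in, first out gives depth-first order
        if current in visited_dirs:
            continue
        visited_dirs.add(current)
        for item in list_dir(current):
            # posixpath.join inserts exactly one "/" between the parts.
            child = posixpath.join(current, item["name"])
            if item["type"] == "dir":
                stack.append(child)
            elif item["type"] == "file":
                files.append(child)
    return files, len(visited_dirs)


# Hypothetical in-memory directory tree standing in for an endpoint.
FAKE_TREE = {
    "/data": [{"type": "dir", "name": "sub"},
              {"type": "file", "name": "README"}],
    "/data/sub": [{"type": "file", "name": "a.csv"},
                  {"type": "file", "name": "b.pdf"}],
}

files, dir_count = crawl(lambda path: FAKE_TREE[path], "/data")
print("Processed", len(files), "files in", dir_count, "directories")
```

Passing the listing function in as an argument keeps the walk separate from the transfer step, so the counting logic can be checked on a small fake tree before running it against a real endpoint.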


## <font color='blue'> Step 3: Evaluate Outputs </font>

#### Finally, we evaluate Skluma's metadata outputs. 