# Skluma Globus Crawler Notebook

#### <font color = 'grey'> This Jupyter notebook walks through how to initialize, run, and evaluate a Skluma metadata extraction job on a Globus endpoint.</font>   

## <font color = 'blue'> Step 1: Initialization.  </font>
    
#### Here we use MDF Forge to create a Globus transfer client, which lets us access the endpoint, check that the directory to scan exists, and pull its files to the server running this Jupyter notebook.  

In [1]:
import globus_sdk
import urllib
from mdf_forge.forge import Forge

## DEAR USERS: CHANGE THESE TO MATCH THE GLOBUS ENDPOINT, AND THE DIRECTORY ON IT, THAT YOU WANT TO SCAN.
ENDPOINT_UUID = "3a261574-3a83-11e8-b997-0ac6873fc732"

ROOT_DIR = "/cdiac.ornl.gov/pub8old/pub8/oceans/a23woce/"
DESTINATION_UUID = "8661d976-f71d-11e8-8cd4-0a1d4c5c824a"
SKLUMA_SERVER_ROUTE = "http://127.0.0.1:5001/"


# TODO: Remove the following (just get from Globus path) 
DESTINATION_STAGE = "/home/tskluzac/Downloads/"


In [2]:
# On the first login from this machine, Forge directs you to a Globus Auth copy/paste page.
mdf = Forge()   

# From mdf's stored Globus refresh token, get an authorized transfer-client 
tc = mdf.transfer_client

# Attempt to connect to the Globus endpoint. Will give status 200 if successful! 
try: 
    r = tc.get_endpoint(ENDPOINT_UUID)
    print("HTTP Status Code:", r.http_status)
    print("Endpoint Display Name:", r["display_name"])

    
except globus_sdk.GlobusAPIError as ex:
    print("HTTP Status Code:", ex.http_status)
    print("Error Code      :", ex.code)
    print("Error Message   :", ex.message)


HTTP Status Code: 200
Endpoint Display Name: CDIAC Dataset
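
The cell above confirms that the endpoint is reachable, but it does not yet check that the directory to scan exists. A minimal sketch of such a check follows, assuming that listing a missing path raises an API error. The helper `directory_exists` and the stand-in `FakeClient` are hypothetical names introduced here so the example runs offline; they are not part of Skluma or the Globus SDK.

```python
def directory_exists(client, endpoint_id, path, api_error=Exception):
    """Return True if `path` can be listed on the endpoint, else False.

    `client` is any object with an `operation_ls(endpoint_id, path=...)`
    method, such as the transfer client `tc` above. `api_error` is the
    exception type treated as "cannot list this path".
    """
    try:
        client.operation_ls(endpoint_id, path=path)
    except api_error:
        return False
    return True


class FakeClient:
    """Hypothetical offline stand-in for a transfer client."""

    def __init__(self, known_dirs):
        self.known_dirs = set(known_dirs)

    def operation_ls(self, endpoint_id, path=None):
        if path not in self.known_dirs:
            raise LookupError("no such directory: " + str(path))
        return []


fake = FakeClient(["/data/"])
found = directory_exists(fake, "example-endpoint", "/data/", api_error=LookupError)
missing = directory_exists(fake, "example-endpoint", "/nope/", api_error=LookupError)
print(found, missing)
```

In this notebook the call would presumably look like `directory_exists(tc, ENDPOINT_UUID, ROOT_DIR, api_error=globus_sdk.GlobusAPIError)`. A `False` result can also mean a permissions problem rather than a missing directory, so it is only a rough check.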


## <font color='blue'> Step 2: Running Skluma </font> 

#### Next, we iterate over the directories in the Globus endpoint, submit each file to Skluma for metadata extraction, and push the metadata to a Globus Search index.
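
The traversal described here can be tried offline before pointing it at a real endpoint. The sketch below is a hedged, simplified version of the depth-first walk that also returns the file list and a directory count. The function `crawl`, the `list_dir` callable, and `FAKE_TREE` are hypothetical names for illustration; the only assumption carried over from Globus is that each listing entry has a `"name"` and a `"type"` of `"dir"` or `"file"`.

```python
def crawl(list_dir, root):
    """Depth-first walk from `root`; return (file_paths, dir_count).

    `list_dir(path)` must return an iterable of dicts with "name" and
    "type" keys, which is the shape this notebook reads from a listing.
    """
    # Normalize so that every directory path ends with exactly one "/".
    root = root if root.endswith("/") else root + "/"
    visited_dirs = set()
    stack = [root]
    files = []
    while stack:
        current = stack.pop()  # last in, first out gives depth-first order
        if current in visited_dirs:
            continue
        visited_dirs.add(current)
        for item in list_dir(current):
            path = current + item["name"]
            if item["type"] == "dir":
                stack.append(path + "/")
            elif item["type"] == "file":
                files.append(path)
    # The root itself is not counted as a discovered directory.
    return files, len(visited_dirs) - 1


# Hypothetical in-memory listing used in place of a Globus endpoint.
FAKE_TREE = {
    "/data/": [{"name": "README", "type": "file"}, {"name": "sub", "type": "dir"}],
    "/data/sub/": [{"name": "a.csv", "type": "file"}],
}

files, dir_count = crawl(lambda p: FAKE_TREE.get(p, []), "/data")
print(sorted(files), dir_count)
```

Against a real endpoint, `list_dir` could be something like `lambda p: tc.operation_ls(ENDPOINT_UUID, path=p)`, and `len(files)` would serve as the file counter.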

In [10]:
import requests

### DFS by iteratively searching for directories in directories. 
endpoint_id = ENDPOINT_UUID
endpoint_path = ROOT_DIR


visited_dirs = set()
unvisited_dir_list = [endpoint_path]
visited_files = set()
unvisited_file_list = []

# Just so we can count directories without too many print statements
dir_count = 0
file_count = 0  ### TODO: create rudimentary file counter. 

while unvisited_dir_list: # while unvisited elements exist...
    
    current_path = unvisited_dir_list.pop()  # take the most recently found directory (depth-first)
        
    visited_dirs.add(current_path)
    r = tc.operation_ls(endpoint_id, path=current_path)
    for item in r:
        
        # current_path always ends with "/", so no extra separator is needed.
        file_path = current_path + item["name"]
        dir_path = file_path + "/"
        
        if item["type"] == 'dir' and dir_path not in visited_dirs:
            dir_count += 1
            unvisited_dir_list.append(dir_path)
            if dir_count % 50 == 0: 
                print("Processed " + str(file_count) + " files in " + str(dir_count) + " directories...")      
        elif item["type"] == 'file':
            print(current_path + item["name"])
            tdata = globus_sdk.TransferData(tc, endpoint_id,
                                DESTINATION_UUID,
                                label="debug sklobus notebook")
            
            # TODO: Rename the file right here, ideally to a UUID.
            path_comps = (current_path+item["name"]).split('/')
            uniq_filename = '---'.join(path_comps)  # Use this only for reconstructing filename (not tracking). 
            print(uniq_filename)
            
            # The flattened name is only printed for now; the file is staged under its original name.
            tdata.add_item(current_path + item["name"], DESTINATION_STAGE + item["name"])
            submit_result = tc.submit_transfer(tdata)
            print("Transfer Task ID:", submit_result["task_id"])

            
            # Here we submit the job to the Skluma server.
            # TODO: Send old file path and new name here. 
            # TODO: Revisit when working on race conditions (the transfer may still be in flight).
            # Assumption: the server identifies the file by the name it was staged under.
            job_post_url = SKLUMA_SERVER_ROUTE + "process_file/" + item["name"]
            
            submit_job = requests.post(job_post_url, allow_redirects=True)
#             submit_job = urllib.request.urlopen(job_post_url).read()
#             print(submit_job.content) # Callback from Skluma server.
                        
                
print("\nTotal Directories Processed: " + str(dir_count))
print("\nTotal Files Processed: " + str(file_count))

print("TODO: Pushing files to Globus Search...")  ### TODO. 

/cdiac.ornl.gov/pub8old/pub8/oceans/a23woce/README
---cdiac.ornl.gov---pub8old---pub8---oceans---a23woce---README
Transfer Task ID: fc3c2a5c-ff24-11e8-9345-0e3d676669f4
/cdiac.ornl.gov/pub8old/pub8/oceans/a23woce/README~
---cdiac.ornl.gov---pub8old---pub8---oceans---a23woce---README~
Transfer Task ID: ff7159e0-ff24-11e8-9345-0e3d676669f4
/cdiac.ornl.gov/pub8old/pub8/oceans/a23woce/a23_hy1.csv
---cdiac.ornl.gov---pub8old---pub8---oceans---a23woce---a23_hy1.csv
Transfer Task ID: 02e48462-ff25-11e8-9345-0e3d676669f4
/cdiac.ornl.gov/pub8old/pub8/oceans/a23woce/a23do.pdf
---cdiac.ornl.gov---pub8old---pub8---oceans---a23woce---a23do.pdf
Transfer Task ID: 061b32e8-ff25-11e8-9345-0e3d676669f4

Total Directories Processed: 0

Total Files Processed: 0
TODO: Pushing files to Globus Search...
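
In the run above, each transfer is submitted and the Skluma job is posted right away, so the server may be asked to process a file that has not arrived yet. This is the race condition the TODO in the cell refers to. One hedged way to avoid it is to wait for the transfer task to succeed before posting. The sketch below assumes the client offers `task_wait` and `get_task` methods, as the Globus SDK transfer client does to my understanding; `post_when_staged` and `FakeTransferClient` are hypothetical names used so the example runs offline.

```python
def post_when_staged(client, task_id, post_job, timeout=60, polling_interval=5):
    """Call `post_job()` only after the transfer task has succeeded.

    Returns whatever `post_job` returns, or None if the task did not
    finish in time or ended in a non-success state.
    """
    finished = client.task_wait(task_id, timeout=timeout,
                                polling_interval=polling_interval)
    if not finished:
        return None  # still running when the timeout expired
    if client.get_task(task_id)["status"] != "SUCCEEDED":
        return None  # finished, but the transfer did not succeed
    return post_job()


class FakeTransferClient:
    """Hypothetical offline stand-in that reports fixed task states."""

    def __init__(self, statuses):
        self.statuses = statuses

    def task_wait(self, task_id, timeout=10, polling_interval=10):
        # A task counts as finished once it is no longer active.
        return self.statuses[task_id] != "ACTIVE"

    def get_task(self, task_id):
        return {"status": self.statuses[task_id]}


fake = FakeTransferClient({"t1": "SUCCEEDED", "t2": "FAILED", "t3": "ACTIVE"})
results = [post_when_staged(fake, t, lambda: "posted") for t in ("t1", "t2", "t3")]
print(results)
```

In the crawl loop, `post_job` could wrap the existing `requests.post(job_post_url, allow_redirects=True)` call. Waiting per file serializes the crawl, so batching several files into one transfer task may be worth considering for larger directories.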


## <font color='blue'> Step 3: Evaluate Outputs </font>

#### Finally, we evaluate Skluma's metadata outputs. 

In [None]:
# TODO 1: Graph (from DB) of outstanding/finished/failed extraction jobs. 

# TODO 2: Count the files of each type. (see Fig 1)

# TODO 3: Fine-grained analysis of structured data. 

# TODO 4: Fine-grained analysis of free text data.

# TODO 5: Fine-grained analysis of image data. 
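
As a starting point for TODO 2, files can be tallied by extension. This is a minimal sketch, not Skluma's own type inference: the extension is only a rough proxy for the file type, and `count_by_extension` is a hypothetical helper. The paths are the four files listed in the Step 2 output.

```python
import os
from collections import Counter


def count_by_extension(paths):
    """Tally paths by lowercase file extension; "(none)" marks no extension."""
    counts = Counter()
    for path in paths:
        ext = os.path.splitext(os.path.basename(path))[1].lower()
        counts[ext or "(none)"] += 1
    return counts


crawled = [
    "/cdiac.ornl.gov/pub8old/pub8/oceans/a23woce/README",
    "/cdiac.ornl.gov/pub8old/pub8/oceans/a23woce/README~",
    "/cdiac.ornl.gov/pub8old/pub8/oceans/a23woce/a23_hy1.csv",
    "/cdiac.ornl.gov/pub8old/pub8/oceans/a23woce/a23do.pdf",
]

type_counts = count_by_extension(crawled)
for ext, n in sorted(type_counts.items()):
    print(ext, n)
```

The resulting `Counter` can be passed to a plotting library to produce a bar chart of file types.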