#### Copyright IBM Inc. All Rights Reserved.
#### SPDX-License-Identifier: Apache-2.0

# FAQs:
- I have a REST-UID, how can I find where the experiment instance is on the filesystem?
- How can I get all instances of a particular experiment type with a tag like X?
- How do I determine which of the experiment instances I've retrieved have succeeded?
- How can I aggregate particular results files from all experiment instances that have that file?
- How can I see what files were output by all components Y in replica N of a given instance?
- How can I get file SOMEFILE from a particular component?
- What is in the datastore?

## Terminology
- `experiment` - The definition of a `parameterised virtual experiment package` (see the [documentation](https://st4sd.github.io/overview/creating-a-parameterised-package/) for more information on parameterised virtual experiment packages)
- `experiment instance` - A running or completed instance of an `experiment` with a unique `rest-uid`
- `component` - A node in the workflow graph
- `tag` - Some custom metadata added to an experiment instance by a user when it was launched

In [None]:
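# import requests
from __future__ import print_function
from urllib.error import HTTPError

import experiment.service.db
import experiment.service.errors  # used below when validating the authentication token
import json
import urllib.request  # `import urllib` alone does not guarantee that urllib.request is loaded
import logging
import pandas as pd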

import ipywidgets as widgets
from IPython.display import display

def pretty_json(what):
    return json.dumps(what, indent=2, sort_keys=True)

logging.getLogger().setLevel(logging.CRITICAL)

In [None]:
!pip install pandas

## Setup
- **NOTE** Set the routes to the ST4SD project where the instances you want to query live

In [None]:
# NOTE: this route belongs to a cluster with minimal resources 
# e.g. can process roughly 1/2 mols at a time
route_st4sd_runtime_service = 'https://st4sd-prod.ve-5446-9ca4d14d48413d18ce61b80811ba4308-0000.us-south.containers.appdomain.cloud/rs'
route_st4sd_rest = None
route_st4sd_registry = None

# WSC - requires setting up your /etc/hosts file first
# route_st4sd_runtime_service = 'https://flow-api-dev.apps.foc.c699.net'
# route_st4sd_rest = 'https://foc-cdb-rest-dev.apps.foc.c699.net'
# route_st4sd_registry = 'https://foc-cdb-registry-dev.apps.foc.c699.net'

### Authenticate to the ST4SD Stack

Virtually all instances of the ST4SD stack have authentication enabled. Please follow the instructions in the cells below to correctly authenticate, if required.

**NOTE: If you are running this notebook via the st4sd-runtime-core container please ensure it has been recently updated**

In [None]:
authentication_enabled = False
try:
    response = urllib.request.urlopen('/'.join((route_st4sd_runtime_service, 'oauth/sign_in')))
except HTTPError as e:
    if e.code == 403:
        authentication_enabled = True
        print("Authentication is enabled, please proceed with the 'Authentication Enabled' section")
    else:
        print("Authentication is not enabled, skip to 'Connect to API'")

#### Authentication Enabled
- Visit the URL printed in the cell below
- If it's your first time accessing this ST4SD instance you will need to login to the OpenShift cluster once before you can proceed further.
   - Contact the administrator of the OpenShift cluster hosting the ST4SD instance. They need to add you as a user on their OpenShift cluster and give you permission to view the namespaced objects in the OpenShift project hosting the ST4SD instance.
   - Upon visiting `${url_sign_in}` you will have to choose the login method (depending on the OpenShift instance this can be W3Id, IBM SSO, LDAP, etc.)
   - If this is the first time you log in, you will be prompted to give your consent for the workflow-operator ServiceAccount to know that your username has authenticated to OpenShift. You need to agree to this before you can authenticate to the `st4sd-runtime-service` REST-API.
- After logging in, you will be presented with an authentication token that you will provide to the experiment.service.db.ExperimentRestAPI wrapper in a python cell below.
   - The very first time you visit the ST4SD runtime service, your browser may attempt to use stale cookies. Please visit `${route_st4sd_runtime_service}/oauth/sign_in` to trigger a new login cycle and invalidate your browser stale cookies.
      - For example: https://st4sd-prod.ve-5446-9ca4d14d48413d18ce61b80811ba4308-0000.us-south.containers.appdomain.cloud/oauth/sign_in

**The token will last for 168 hours**

In [None]:
if authentication_enabled:
    auth_url = '/'.join((route_st4sd_runtime_service, 'authorisation/token'))
    print(f"Visit this URL to get your authentication token:\n{auth_url}")

Run this cell, then paste the authorization token you were presented with at the URL from the previous cell into the widget below. Alternatively, you can use an OpenShift token (the one that you'd normally provide to the `--token` parameter of `oc login`) with the `cc_bearer_key` argument to `experiment.service.db.ExperimentRestAPI`.

NOTE: Visual Studio Code's renderer does not currently allow pasting in widgets such as the one below. If this is the case for you, you can fall back to pasting the token in the cell after this one.

In [None]:
w_label = 'Input your authentication token here:'
display(w_label)
token_widget = widgets.Password()
display(token_widget)

In [None]:
auth_token = ''
if authentication_enabled and auth_token == '':
    auth_token = token_widget.value
    if auth_token == '':
        print("Authentication is required. Please fill in your token in the box above.")
        raise Exception("Missing authentication")

### Connect to the API

The cell below will validate the token you provided, ensuring you have successfully authenticated to ST4SD. 

In [None]:
# Ensure that your account is authorised to use the ST4SD Runtime Service REST API
try:
    api = experiment.service.db.ExperimentRestAPI(route_st4sd_runtime_service, route_st4sd_registry, 
                                      route_st4sd_rest, max_retries=2, secs_between_retries=1,
                                      cc_auth_token=auth_token)
except experiment.service.errors.UnauthorisedRequest as e:
    print(f"Visit {auth_url} to retrieve your authentication token. Then use it to set the value of "
          "\"auth_token\" in the above cell. Execute that cell and then execute this one.")
else:
    print(f"You've successfully authenticated to {route_st4sd_runtime_service}")

## FAQ: I have a REST-UID, how can I find where the experiment instance is on the filesystem?

In [None]:
restUID = 'band-gap-dft-gamess-us-27b9dc-ka72wvld' #ADD YOUR REST-UID HERE

In [None]:
experiment = api.cdb_get_user_metadata_document_for_rest_uid(restUID)

In [None]:
print(pretty_json(experiment))

In [None]:
#Use this so you can execute 'FAQ: I want to aggregate particular results ...' on this output
#successful = [experiment]

## FAQ: How can I get all instances of an experiment ?

In [None]:
#Retrieve all instances:
# - of all parameterised virtual experiment package whose names begin with "band-gap-"
# - whose exit status is "Success"

#NOTE: `cdb_get_document` query can use any keys of any document whose values are NOT containers (lists etc)
#(sub-keys are queried using a "key-path" notation c.f. below)
#Only documents which contain ALL keys in the query, with values that match the regexes supplied, are returned
#This means, for example, that you don't have to ask for a document of type X if the key you are querying only exists
#in documents of type X

# For example, the Experiment-type documents are the only documents that contain the "metadata" field.

query = {
    # Case sensitive regular expression - use {"$regex": ..., "$options": "i"} for case insensitive query
    'metadata.userMetadata.st4sd-package-name': {'$regex': 'band-gap-.*'},
     # Exit status must be equal to "Success" (case sensitive)
    'status.exit-status': 'Success'
}

In [None]:
instances = api.cdb_get_document(query)

In [None]:
print(pretty_json(instances[0]))
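
The matching rules described in the query cell above (dot-separated key paths, every key must exist, regex or exact match on the value) can be illustrated locally. The following is a minimal sketch, not the datastore's implementation: `get_by_key_path`, `matches` and the sample document are hypothetical names and data introduced only for this illustration, and the regex is applied as an unanchored search, as MongoDB's `$regex` does.

```python
import re

def get_by_key_path(document, key_path):
    """Follow a dot-separated key path; return (found, value)."""
    current = document
    for key in key_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return False, None
        current = current[key]
    return True, current

def matches(document, query):
    """Return True only if the document satisfies EVERY entry of the query."""
    for key_path, condition in query.items():
        found, value = get_by_key_path(document, key_path)
        if not found:
            # A document that lacks a queried key never matches
            return False
        if isinstance(condition, dict) and "$regex" in condition:
            flags = re.IGNORECASE if "i" in condition.get("$options", "") else 0
            if re.search(condition["$regex"], str(value), flags) is None:
                return False
        elif value != condition:
            return False
    return True

# Hypothetical document, shaped like the experiment documents used in this notebook
doc = {
    "metadata": {"userMetadata": {"st4sd-package-name": "band-gap-dft-gamess-us"}},
    "status": {"exit-status": "Success"},
}
query = {
    "metadata.userMetadata.st4sd-package-name": {"$regex": "band-gap-.*"},
    "status.exit-status": "Success",
}
print(matches(doc, query))
```

Because a document must contain all queried keys, a query on a key that only one document type has implicitly restricts the results to that type.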

## FAQ: How can I get all instances of a specific version of an experiment given its full identifier ?

In [None]:
#Retrieve all experiment instances:
# - that ran on the "kubernetes" backend, AND
# - with a specific parameterised virtual experiment package identifier, AND
# - whose exit status is "Success"


identifier = "band-gap-pm3-gamess-us@sha256x822c80d16092f6e8f36689e915163c24decb261cb142bfc2aa931182"
query = {
  'metadata.variables.global.backend': 'kubernetes',
  'metadata.userMetadata.experiment-id': identifier,
  'status.exit-status': 'Success'
}

In [None]:
instances = api.cdb_get_document(query)

In [None]:
print(pretty_json(instances[0]))

## FAQ: How can I get all instances of a specific version of an experiment given its name and tag ?

In [None]:
#Retrieve all experiment instances:
# - that ran on the "kubernetes" backend, AND
# - whose parameterised virtual experiment package name starts with "band-gap-", AND
# - whose exit status is "Success"

# Each version of a PVEP has 0+ tags and exactly 1 digest
# To query the db for a tagged PVEP version first you need to obtain the digest

experiment_name = "band-gap-dft-gamess-us"
tag = "latest"
tagged_pvep = ":".join([experiment_name, tag])

# Then you can query the Datastore (CDB) using the name and digest of the PVEP version
pvep_def = api.api_experiment_get(tagged_pvep)
digest = pvep_def['metadata']['registry']['digest']
identifier = '@'.join([experiment_name, digest])

query = {'metadata.userMetadata.experiment-id': identifier}

# The above query is equivalent to:
# query = {'metadata.userMetadata.st4sd-package-name': experiment_name,
#          'metadata.userMetadata.st4sd-package-digest': digest}
# Because a version of a PVEP can have multiple tags (which can change over time)
# we do not currently record the tags at the point of submission of the PVEP instance 

In [None]:
instances = api.cdb_get_document(query)

In [None]:
print(pretty_json(instances[0]))
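
The two string conventions used above (`name:tag` to look up a package version, `name@digest` to query the datastore) can be sketched without contacting any service. The helper names below are hypothetical, and the digest value is a made-up placeholder, not a real digest.

```python
def tagged_reference(name, tag):
    """Build the `name:tag` string used to look up a package version."""
    return ":".join([name, tag])

def digest_identifier(name, digest):
    """Build the `name@digest` string stored under `experiment-id`."""
    return "@".join([name, digest])

def queries_for_version(name, digest):
    """Return the two query forms that the comments above describe as equivalent."""
    by_identifier = {
        "metadata.userMetadata.experiment-id": digest_identifier(name, digest),
    }
    by_name_and_digest = {
        "metadata.userMetadata.st4sd-package-name": name,
        "metadata.userMetadata.st4sd-package-digest": digest,
    }
    return by_identifier, by_name_and_digest

example_name = "band-gap-dft-gamess-us"
example_digest = "sha256xEXAMPLE"  # placeholder, not a real digest

print(tagged_reference(example_name, "latest"))
print(queries_for_version(example_name, example_digest))
```

Tags do not appear in either query form, consistent with the note above that tags are not recorded when an instance is submitted.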

## FAQ: How do I determine which experiment instances succeeded or failed?

In [None]:
#Note: you could also extend the query above so that it only returns successful instances in the first place
#Retrieve all experiment instances:
# - that ran on the "kubernetes" backend
# - whose experiment-id starts with "band-gap"
# - whose exit status is "Success"
#query = {'metadata.variables.global.backend': 'kubernetes',
#         'metadata.userMetadata.experiment-id': {'$regex': 'band-gap.*'},
#         'status.exit-status':'Success'}

In [None]:
successful = list(filter(lambda x: x['status']['exit-status'] == 'Success', instances))

In [None]:
max_to_show = 5
for idx, e in enumerate(successful):
    # Stop once max_to_show instances have been printed
    if idx >= max_to_show:
        break

    print(e['metadata']['userMetadata']['rest-uid'], e['metadata']['variables']['global']['backend'], 
          e['status']['exit-status'])
    try:
        print('Outputs available:', pretty_json(e['output']), end='\n\n')
    except KeyError:
        # The document has no `output` field
        print('No outputs available', end='\n\n')
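
To see failed and unfinished instances next to the successful ones, the same `status.exit-status` field can be tallied. This is a small sketch on made-up documents (the `rest-uid` values are hypothetical). It assumes that an empty or missing `exit-status` means the instance has not reported a final status yet.

```python
from collections import Counter

def exit_status(instance):
    """Return the exit status, or 'unknown' when it is empty or missing."""
    return instance.get("status", {}).get("exit-status") or "unknown"

# Hypothetical documents containing only the fields used here
mock_instances = [
    {"metadata": {"userMetadata": {"rest-uid": "example-a"}}, "status": {"exit-status": "Success"}},
    {"metadata": {"userMetadata": {"rest-uid": "example-b"}}, "status": {"exit-status": "Failed"}},
    {"metadata": {"userMetadata": {"rest-uid": "example-c"}}, "status": {"exit-status": ""}},
    {"metadata": {"userMetadata": {"rest-uid": "example-d"}}},
]

counts = Counter(exit_status(ins) for ins in mock_instances)
succeeded = [ins for ins in mock_instances if exit_status(ins) == "Success"]
failed = [ins for ins in mock_instances if exit_status(ins) == "Failed"]

print(dict(counts))
print([ins["metadata"]["userMetadata"]["rest-uid"] for ins in failed])
```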

## FAQ: How can I get all experiments which had a specific field (e.g. a user metadata)?

In [None]:
# You can use the $exists mongoDB operator.
# For example, to find all experiments which the user annotated with the
# user-provided metadata key "planet" (regardless of its value) you can use:

query = {
  # Changing True to False below will ask for `experiment` type documents which do not contain
  # the field metadata.userMetadata.planet
  'metadata.userMetadata.planet': {"$exists": True},
  'type': "experiment"
}

In [None]:
instances = api.cdb_get_document(query)

In [None]:
print(pretty_json(instances[0]))

## FAQ P1: I want to aggregate particular results files from all experiment instances that have that file ...
## FAQ P2: ... And merge with the input file

- Here we retrieve the results from the `status` json as we have it already
- Then retrieve the input file from the `st4sd-datastore`
- Then join the two tables using the `label` column

In [None]:
for ins in instances:
    try:
        s = api.api_rest_uid_status(ins['metadata']['userMetadata']['rest-uid'])
        print(pretty_json(s))
    except Exception as error:
        pass

In [None]:
inputFilename='input/input_smiles.csv'

In [None]:
#For each successful experiment instance:
#1. retrieve the energies.csv file of the ExtractEnergies component
#2. retrieve the input file
#NOTE: The input file is required if you want to translate the "label" column in the output file to the 
#value of some other column provided in the inputs e.g. SMILES
#NOTE: Optionally we can retrieve all the results from disk rather than via the `status` object
import io
results = []
for ins in list(successful):
    #The rest-uid is nested in experiment documents and top-level in user-metadata documents
    rest_uid = ins.get('metadata', {}).get('userMetadata', {}).get('rest-uid', ins.get('rest-uid'))
    try:
        #Here we retrieve the result via the datastore (cdb_get_data)
        print('Retrieving for', rest_uid)
        d = api.cdb_get_data(instance=ins['instance'],
                             component='ExtractEnergies', 
                             filename='energies.csv')[1][0]
        print(d)
        s = d.csvRepresentation()
        result = pd.read_csv(io.StringIO(u'%s' % s), sep=",")
        d = api.cdb_get_file_from_instance_uri(ins['instance'], inputFilename)     
        inputs = pd.read_csv(io.StringIO(d.decode("utf-8")))
        #Join the two based on the `label` column
        results.append(result.merge(inputs, on='label'))
    except KeyError as error: 
        print('Instance %s does not have the requested output' % rest_uid)

In [None]:
#Aggregate all results into one table
aggregated = pd.concat(results, axis=0)
display(aggregated)

In [None]:
with open('results.csv', 'w') as f:
    aggregated.to_csv(f, index=False)
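
The join and aggregation steps can be tried offline on in-memory CSV text. The sketch below uses made-up data (the column names `band-gap` and `smiles` and all values are hypothetical) and mirrors the `pd.read_csv`, `merge(on='label')` and `pd.concat` calls used above.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for energies.csv and input_smiles.csv
energies_csv = "label,band-gap\nmol0,3.1\nmol1,2.7\n"
inputs_csv = "label,smiles\nmol0,CCO\nmol1,c1ccccc1\nmol2,CC\n"

def merge_on_label(result_text, inputs_text):
    """Parse both CSV texts and join them on the shared `label` column."""
    result = pd.read_csv(io.StringIO(result_text))
    inputs = pd.read_csv(io.StringIO(inputs_text))
    # The default merge is an inner join: labels present in only one table are dropped
    return result.merge(inputs, on="label")

merged = merge_on_label(energies_csv, inputs_csv)

# Stack the per-instance tables (here the same table twice, as if from two instances)
stacked = pd.concat([merged, merged], axis=0, ignore_index=True)
print(stacked)
```

Note that `mol2` has an input row but no energy, so it does not appear in the merged table. Passing `how='left'` or `how='outer'` to `merge` would keep such rows.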

## FAQ: How can I see what files were output by all components Y in replica N of a given instance?

In [None]:
instance_uri=list(successful)[0]['instance'] #Put your instance URI here

In [None]:
#Example: Get data for particular molecule 
#This will retrieve component documents for  all components run on molecule 0
comps = api.cdb_get_document_component(component='.*0', instance=instance_uri)
comps
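
A pattern such as `.*0` is worth checking against component names before relying on it. The sketch below only tests patterns locally with `re.fullmatch`; how the datastore anchors the pattern is not shown in this notebook, so treat the anchoring as an assumption. The names `GeometryOptimisation10` and `SetBasis0` are hypothetical.

```python
import re

# GeometryOptimisation0 and ExtractEnergies appear in this notebook; the others are hypothetical
names = ["GeometryOptimisation0", "GeometryOptimisation10", "ExtractEnergies", "SetBasis0"]

def select(pattern, candidates):
    """Return the names that the pattern matches in full."""
    return [name for name in candidates if re.fullmatch(pattern, name)]

loose = select(r".*0", names)          # any name ending in the digit 0
strict = select(r".*[^0-9]0", names)   # names whose trailing number is exactly 0

print(loose)
print(strict)
```

With ten or more replicas, the loose pattern would also pick up replica 10, 20 and so on, while the stricter pattern keeps only replica 0.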

## FAQ: How can I get file SOMEFILE from a particular component?

In [None]:
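#Get a file for a particular component
component = 'GeometryOptimisation0' #Put your component name here (example name, also used at the end of this notebook)
filename = 'out.stdout'
instance = instance_uri #Put your instance URI here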

In [None]:
d = api.cdb_get_components_raw_files(instance=instance, 
                                     filename='.*out.stdout', 
                                     component=component)
d

## FAQ: How can I retrieve the log of an experiment instance?

In [None]:
log_bytes = api.cdb_get_file_from_instance_uri(instance_uri, "output/experiment.log")
print(log_bytes.decode('utf-8'))

## FAQ: How can I retrieve the log of a component?

In [None]:
api.cdb_get_components_last_stdout(instance=instance_uri, component=comps[0])

## FAQ: What is in the datastore?

## Datastore Structure

The `st4sd-datastore` is a document database. Each workflow instance has 3 sets of documents: 
- `experiment`: 1 document that contains information for 1 instance of a workflow
- `user-metadata`: 1 document that contains the user-provided key:value pairs associated with a particular workflow instance
- `component`: 1 per component in the experiment instance. Each document contains metadata for 1 component of the workflow instance.

Each document has a `uid` which **uniquely** identifies the document. Documents associated with a given workflow instance have the same `instance` field, which is a `file://` URI with the format `file://<cluster-label>/absolute/path/to/the/instance/directory`.
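
Because all documents of one workflow instance share the same `instance` value, a mixed list of documents can be regrouped per instance and per type. The sketch below works on made-up documents; the URIs, `uid` values and the helper name `group_by_instance` are hypothetical.

```python
from collections import defaultdict

def group_by_instance(documents):
    """Map instance URI -> document type -> list of documents."""
    grouped = defaultdict(lambda: defaultdict(list))
    for doc in documents:
        grouped[doc["instance"]][doc["type"]].append(doc)
    return grouped

# Hypothetical documents containing only the fields needed for grouping
uri = "file://example-cluster/tmp/workdir/example.instance"
documents = [
    {"uid": uri + "#experiment", "type": "experiment", "instance": uri},
    {"uid": uri + "#user-metadata", "type": "user-metadata", "instance": uri},
    {"uid": uri + "#stage0.A", "type": "component", "instance": uri},
    {"uid": uri + "#stage0.B", "type": "component", "instance": uri},
]

grouped = group_by_instance(documents)
for doc_type, docs in grouped[uri].items():
    print(doc_type, len(docs))
```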


### User Metadata


A **user-metadata** document has the following keys

- **uid**: A file:// URI that *uniquely* identifies this particular Mongo Document
- **type**: The document type: In this case always `user-metadata`
- **instance**: The file:// URI of the instance that the document refers to (unique per instance)
- **rest-uid**: The ID that the consumable-computing REST-API returns after a successful call to api_instance_create()
- 0+ custom key:values (keys are strings, values can be JSON-serializable objects, typically strings)

### Component Documents

A **component** document has the following keys

- **uid**: A file:// URI that *uniquely* identifies this particular Mongo Document
- **type**: The document type: In this case `component`
- **name**: The component's name
- **instance**: The file:// URI of the instance that the document refers to (unique per instance)
- **location**: The absolute path of the working directory of the component (not a URI)
- **stage**: The stage the component was in. Components are uniquely specified by their name and stage
- **files**: A list of the files in the component's working directory (paths relative to working directory). May be input or output
- **producers**:
   - `<the uid of a component document whose output(s) *this* component reads>`:
       - `path/relative/to/producer/working/directory`
       - `another/path/relative/to/producer/working/directory`
- **flowir**: The FlowIR definition of the component
- **memoization-hash**: A hash which can be used during memoization (i.e reuse a cached component's outputs instead of executing a component from scratch).
- **memoization-hash-fuzzy**: As above but used during fuzzy memoization
- **component-state**: last known state of component (finished/failed/shutdown/running)
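
Two of the keys above lend themselves to small helpers: the (stage, name) pair that identifies a component, and the `producers` mapping from producer `uid` to the paths this component reads. The sketch below uses a made-up component document; the `uid` strings, the integer `stage` value and the file paths are hypothetical.

```python
def component_key(doc):
    """Components are uniquely specified by their stage and name."""
    return (doc["stage"], doc["name"])

def producer_files(doc):
    """Flatten `producers` into (producer uid, relative path) pairs."""
    return [
        (uid, path)
        for uid, paths in doc.get("producers", {}).items()
        for path in paths
    ]

# Hypothetical component document with only the fields used here
component_doc = {
    "type": "component",
    "name": "ExtractEnergies",
    "stage": 1,  # assumption: the stage is stored as an integer
    "component-state": "finished",
    "producers": {
        "example-uid-of-GeometryOptimisation0": ["out.stdout", "molecule.inp"],
    },
}

print(component_key(component_doc))
print(producer_files(component_doc))
```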

### Experiment Documents

An **experiment** document has the following keys

- **uid**: A file:// URI that *uniquely* identifies this particular Mongo Document
- **type**: The document type: In this case `experiment`
- **name**: The experiment's name
- **instance**: The file:// URI of the instance that the document refers to (unique per instance)
- **output**:   
  - `<name of key output:str>`:
    - creationtime: `<datetime in format "%Y-%m-%dT%H%M%S.%f": str>`
    - description: `<str>`
    - filename: `<name of file:str>`
    - filepath: `<absolute path to file:str>`
    - final: `<yes/no:str>`
    - production: `<yes/no:str>`
    - type: `<the contents of the `type` field from the associated output entry in FlowIR:str>`
    - version: `<times this output has been updated:int>`
- **metadata**: A dictionary of experiment metadata
   - **arguments**: The elaunch command-line
   - **data**: The `-d` arguments to elaunch
   - **inputs**: The `-i` arguments to elaunch
   - **instanceName**: The final part of the **instance** path/URL
   - **pid**: The process id of the instance
   - **userVariables**: A dictionary with the contents of the `variables.conf` passed to elaunch if present
   - **version**: The st4sd-runtime-core version
   - **userMetadata**: Same as content of user metadata document
- **status**
  - experiment-state: (running, finished)
  - stage-state: (running, finished, failed, shutdown)
  - updated: `<string representation of timestamp>`
  - exit-status: (Success, Failed, or just empty)
  - error-description(optional): `<string representation of float in [0.0, 1.0]>`
  - const: `<string representation of float in [0.0, 1.0]>`
  - total-progress: `<string representation of float in [0.0, 1.0]>`
  - stage-progress: `<string representation of float in [0.0, 1.0]>`
  - stages (a list):
     - `<name of stage:str>`
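
Several of the fields above are stored as strings and need converting before use. The sketch below parses a `creationtime` value with the format string given in the list and turns a progress string into a float. The sample values and helper names are hypothetical.

```python
from datetime import datetime

# Format documented above for output.<name>.creationtime
CREATION_TIME_FORMAT = "%Y-%m-%dT%H%M%S.%f"

def parse_creation_time(text):
    """Convert a creationtime string into a datetime object."""
    return datetime.strptime(text, CREATION_TIME_FORMAT)

def parse_progress(text):
    """Convert a progress string into a float and check it lies in [0.0, 1.0]."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise ValueError("progress out of range: %r" % text)
    return value

# Hypothetical fragment of an experiment document
experiment_doc = {
    "output": {"Energies": {"creationtime": "2023-01-15T103000.123456", "final": "yes"}},
    "status": {"total-progress": "0.75", "stage-progress": "1.0"},
}

created = parse_creation_time(experiment_doc["output"]["Energies"]["creationtime"])
progress = parse_progress(experiment_doc["status"]["total-progress"])
print(created.isoformat(), progress)
```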

In [None]:
#NOTE: The rest-uid will ONLY get you the user-metadata doc of the instance from the st4sd-datastore. 
#For the other-docs you need the instance-uri - which is also in the user-metadata doc
restUID = 'band-gap-dft-gamess-us-27b9dc-ka72wvld' #Put your own RESTUID here

In [None]:
#Example user-metadata doc 
usermeta_doc = api.cdb_get_document({'rest-uid':restUID, 'type':'user-metadata'})[0]
print(pretty_json(usermeta_doc))
instance_uri = usermeta_doc['instance']

In [None]:
#Example experiment-instance doc
instance_doc = api.cdb_get_document({'instance':instance_uri, 'type':'experiment'})[0]
print(pretty_json(instance_doc))

In [None]:
#Example component doc
component_doc = api.cdb_get_document({'instance':instance_uri, 'type':'component', 'name':'GeometryOptimisation0'})[0]
print(pretty_json(component_doc))