#### Copyright IBM Inc. All Rights Reserved.
#### SPDX-License-Identifier: Apache-2.0

# Experiment Submission and Control


This notebook demonstrates
* Authenticating to the `st4sd-runtime-service` REST-API
* Querying available experiments
* Submitting a [`band-gap-pm3-gamess-us`](https://github.com/st4sd/band-gap-gamess/) experiment (semi-empirical variant)
* Querying status of experiment instance
* Retrieving top-level output results from the instance

For information on how to retrieve detailed information via the `st4sd-datastore`, see the notebook `ST4SD Datastore - Common Query Examples` located in the same repository as this notebook.


## Notes

* Go to https://st4sd.github.io/overview/api-docs/openapi/st4sd-runtime-service/st4sd-runtime-service.openapi.html for the full API specification
* For details on the **band-gap-gamess** experiments see https://github.com/st4sd/band-gap-gamess/
   
## Terminology
- **experiment**: Refers to the *definition* of a particular workflow e.g. the `band-gap-pm3-gamess-us` experiment
- **instance**: Refers to a particular execution of an experiment

## Setup

In [None]:
from __future__ import annotations
from urllib.error import HTTPError

import experiment.service.db
import experiment.service.errors
import json
# `import urllib` alone does not guarantee that `urllib.request` is loaded
import urllib.request
import logging
import typing
import time
import datetime
import pandas

import ipywidgets as widgets
from IPython.display import display
from IPython.display import clear_output


logging.basicConfig(format='%(levelname)-9s %(name)-15s: %(funcName)-20s %(asctime)-15s: %(message)s')
root=logging.getLogger()
root.setLevel(logging.CRITICAL)

def pretty_json(what):
    return json.dumps(what, indent=2, sort_keys=True)

# This is just a jupyter notebook helper class to display information
class PrettyInstanceStatus:
    def __init__(self, status: typing.Dict[str, typing.Any]):
        self.status = status
        self.text = pretty_json(status)
    
    def _repr_markdown_(self):
        # Read from the status passed to the constructor, not from a global variable
        exp_state = self.status['status']['experiment-state']
        outputs = self.status['outputs']
        
        ret = f"""
Experiment state is: **{exp_state}** <br>
Experiment has produced the following outputs so far {list(outputs)} <br> <br>

```json
{self.text}
```
"""
        return ret


Modify the routes below to match the ones of your OpenShift cluster. 

**Note**: You only need to define the `route_st4sd_runtime_service` endpoint; the `ExperimentRestAPI` wrapper discovers the remaining endpoints automatically.

**Note**: You must be an OpenShift user with access to the namespace that hosts the ST4SD instance before you can authenticate to the ST4SD instance.

If you don't remember the URL of your ST4SD instance, log in to your OpenShift environment and get the URL of the route `st4sd-authentication` in the namespace that you deployed ST4SD in.

In [None]:
# NOTE: this route belongs to a cluster with minimal resources 
# e.g. can process roughly 1-2 molecules at a time
# NOTE: This cluster requires a TUNNELALL VPN
route_st4sd_runtime_service = 'https://st4sd-prod.ve-5446-9ca4d14d48413d18ce61b80811ba4308-0000.us-south.containers.appdomain.cloud'
route_st4sd_rest = None
route_st4sd_registry = None

### Authenticate to the ST4SD Stack

Virtually all instances of the ST4SD stack have authentication enabled. Please follow the instructions in the cells below to correctly authenticate, if required.

**NOTE: If you are running this notebook via the st4sd-runtime-core container please ensure it has been recently updated**

In [None]:
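authentication_enabled = False
try:
    response = urllib.request.urlopen('/'.join((route_st4sd_runtime_service, 'experiments/')))
except HTTPError as e:
    if e.code == 403:
        authentication_enabled = True
        print("Authentication is enabled, please proceed with the 'Authentication Enabled' section")
    else:
        # Any other HTTP error says nothing about authentication; report it as-is
        print(f"Unexpected HTTP error {e.code} from {route_st4sd_runtime_service}")
else:
    # The request succeeded without credentials
    print("Authentication is not enabled, skip to 'Connect to the API'")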

#### Authentication Enabled
- Visit the URL printed in the cell below
- If it's your first time accessing this ST4SD instance you will need to log in to the OpenShift cluster once before you can proceed further.
   - Contact the administrator of the OpenShift cluster hosting the ST4SD instance. They need to add you as a user on their OpenShift cluster and give you permission to view the namespaced objects in the OpenShift project hosting the ST4SD instance.
   - Upon visiting the ${url_sign_in} you will have to choose the login method (depending on the OpenShift instance this can be W3Id, IBM SSO, LDAP, etc.)
   - If this is the first time you log in, you will be prompted to give your consent for the workflow-operator ServiceAccount to know that your username has authenticated to OpenShift. You need to agree to this before you can authenticate to the `st4sd-runtime-service` REST-API.
- After logging in, you will be presented with an authentication token that you will provide to the `experiment.service.db.ExperimentRestAPI` wrapper in a Python cell below.
   - The very first time you visit the ST4SD runtime service, your browser may attempt to use stale cookies. Please visit `${route_st4sd_runtime_service}/oauth/sign_in` to trigger a new login cycle and invalidate your browser's stale cookies.
      - For example: https://st4sd-prod.ve-5446-9ca4d14d48413d18ce61b80811ba4308-0000.us-south.containers.appdomain.cloud/oauth/sign_in

**The token will last for 168 hours**

In [None]:
if authentication_enabled:
    auth_url = '/'.join((route_st4sd_runtime_service, 'authorisation/token'))
    print(f"Visit this URL to get your authentication token:\n{auth_url}")

Run the cell below and paste into the widget the authorization token you were presented with when you visited the URL in the previous cell. Alternatively, you can use an OpenShift token (the one that you'd normally provide to the `--token` parameter of `oc login`) with the `cc_bearer_key` argument to `experiment.service.db.ExperimentRestAPI`.

NOTE: Visual Studio Code's renderer does not currently allow pasting in widgets such as the one below. If this is the case for you, you can fall back to pasting the token in the cell after this one.

In [None]:
w_label = 'Input your authentication token here:'
display(w_label)
token_widget = widgets.Password()
display(token_widget)

In [None]:
auth_token = ''
if authentication_enabled and auth_token == '':
    auth_token = token_widget.value
    if auth_token == '':
        print("Authentication is required. Please fill in your token in the box above.")
        raise Exception("Missing authentication")
    if auth_token.startswith("\""):
        auth_token = auth_token[1:]
    if auth_token.endswith("\""):
        auth_token = auth_token[:-1]
    token_widget.value = ""

### Connect to the API

The cell below will validate the token you provided, ensuring you have successfully authenticated to ST4SD. 

In [None]:
# Ensure that your account is authorised to use the ST4SD Runtime Service REST API
try:
    api = experiment.service.db.ExperimentRestAPI(route_st4sd_runtime_service, route_st4sd_registry, 
                                      route_st4sd_rest, max_retries=2, secs_between_retries=1,
                                      cc_auth_token=auth_token)
except experiment.service.errors.UnauthorisedRequest as e:
    print(f"Visit {auth_url} to retrieve your authentication token. Then use it to set the value of "
          "\"auth_token\" in the above cell. Execute that cell and then execute this one.")
else:
    print(f"You've successfully authenticated to {route_st4sd_runtime_service}")

## List and Add Parameterised Virtual Experiment Packages

Find out more about parameterised virtual experiment packages (i.e. `experiments`) on our [website](https://pages.github.ibm.com/st4sd/overview/creating-a-parameterised-package/).

In [None]:
# Query available experiments
experiments = api.api_experiment_list()
to_show = 5

print(f"There are {len(experiments.keys())} registered experiments", end='')
if len(experiments.keys()) > to_show:
    print(f". The first {to_show} are:", end='\n\n')
else:
    print(":", end='\n\n')

for idx, e in enumerate(experiments.keys()):
    # Stop once `to_show` entries have been printed
    if idx >= to_show:
        break
    print(e)

In [None]:
if len(experiments.keys()) > 0:
    print("The entry for the first experiment is:", end='\n\n')
    first_experiment = experiments[list(experiments.keys())[0]]
    print(pretty_json(first_experiment))

In [None]:
#This adds a band-gap-pm3-gamess-us to the target ST4SD deployment
#If it already exists its definition gets updated to match what is typed below
package = {
    "base": {
        "packages": [
            {
                "source": {
                    "git": {
                        "location": {
                            "url": "https://github.com/st4sd/band-gap-gamess.git",
                            "tag": "1.0.0"
                        }
                    }
                },
                "config": {
                    "path": "semi-empirical/homo-lumo-dft-semi-empirical.yaml",
                    "manifestPath": "semi-empirical/manifest.yaml"
                }
            }
        ]
    },
    "metadata": {
        "package": {
            "name": "band-gap-pm3-gamess-us",
            "tags": [
                "latest",
                "1.0.0"
            ],
            "maintainer": "https://github.com/michael-johnston",
            "description": "Uses the PM3 semi-empirical method to perform the geometry optimization and calculate the band-gap and related properties. The calculation is performed with GAMESS-US",
            "keywords": [
                "smiles",
                "computational chemistry",
                "semi-empirical",
                "geometry-optimization",
                "pm3",
                "homo",
                "lumo",
                "band-gap",
                "gamess-us"
            ]
        }
    },
    "parameterisation": {
        "presets": {
            "runtime": {
                "args": [
                    "--failSafeDelays=no",
                    "--registerWorkflow=yes"
                ]
            }
        },
        "executionOptions": {
            "variables": [
                {
                    "name": "numberMolecules"
                },
                {
                    "name": "startIndex"
                },
                {
                    "name":  "gamess-walltime-minutes"
                },
                {
                    "name":  "gamess-grace-period-seconds"
                },
                {
                    "name":  "number-processors"
                }
            ],
            "platform": [
                "openshift",
                "openshift-kubeflux"
            ]
        }
    }
}


package = api.api_experiment_push(package)

In [None]:
# The package you get back contains information that the registry auto-discovers about it
# uncomment the line below if you want to take a look at this information
# print(json.dumps(package, indent=2))


In [None]:
# You may delete a version of the band-gap-pm3-gamess-us by referencing it 
# either via a `:tag` or the full `@${digest}`

# For example:
# identifier = '@'.join((package['metadata']['package']['name'], package['metadata']['registry']['digest']))
# api.api_experiment_delete(identifier)

# WARNING: The only way to re-instate the experiment definition 
# is to api_experiment_push() it once again.
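
The two identifier forms mentioned above (`name:tag` and `name@digest`) can be built with a small helper. This is a minimal sketch: `package_identifier` is a hypothetical name that is not part of the ST4SD API, and the digest value is a placeholder.

```python
from typing import Optional


def package_identifier(name: str, tag: Optional[str] = None,
                       digest: Optional[str] = None) -> str:
    """Build `name:tag` or `name@digest` (hypothetical helper, not an ST4SD API)."""
    # Exactly one of `tag` and `digest` must be given
    if (tag is None) == (digest is None):
        raise ValueError("Provide exactly one of `tag` or `digest`")
    if tag is not None:
        return f"{name}:{tag}"
    return f"{name}@{digest}"


print(package_identifier("band-gap-pm3-gamess-us", tag="1.0.0"))
print(package_identifier("band-gap-pm3-gamess-us", digest="placeholder-digest"))
```

The resulting string is what you would pass to `api.api_experiment_delete()`.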

## Submit Experiment

In [None]:
#Input data - A CSV formatted string containing a label and a SMILES per row
#You can also use the next cell to load a file from your hard drive instead
molecules = '''label,smiles
mymolecule,CCCCCCCC[SH2+]
'''.rstrip()

In [None]:
#Read some input data for band-gap-pm3-gamess-us experiment
# df: pandas.DataFrame = pandas.read_csv('~/my_molecules.csv', engine="python", sep=None)
# df_filtered = df[["label", "smiles"]]
# molecules = df_filtered.to_csv(index=False).rstrip()

In [None]:
#Input configuration
#See the experiment description for defaults and other options

# Remove stray newlines at end of string representation of CSV file
molecules = molecules.rstrip()

experimentConfiguration = {
    "inputs": [
        {"filename": "input_smiles.csv", "content": molecules}
    ],
    "data": [
      # (Optional) override contents of data 
      # files, similarly to providing inputs
      # Note: "band-gap-pm3-gamess-us" does not support overriding data files
      # because it does not contain parameterisation.executionOptions.data[] settings
      # e.g. if it supported overriding the data file "input_molecule.txt"
      # the following Dictionary would be valid here:
      # {"filename": "input_molecule.txt", 
      #  "content": "the string representation of the file contents"}
    ],
    "variables": {
        "startIndex": 0,
        # you can submit multiple molecules in 1 experiment
        "numberMolecules": len(molecules.split("\n")) -1,
    },
    "metadata": {
      # you can provide user-metadata `key: value` pairs which you can use
      # in the future for querying the database (user-metadata)
      "author": "amazing-person"
    },
    "additionalOptions": [
      # you can provide additional arguments to elaunch here
      # but they cannot override those of the experiment definition
      # the additionalOptions of which will automatically be used too
      "--useMemoization=y"
    ],
    "orchestrator_resources": {
      "cpu": "1",
      "memory": "2Gi"
    }
}
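
The configuration above derives `numberMolecules` by counting lines in the CSV string, which miscounts if the string contains blank lines. As an alternative, the sketch below counts rows with the standard `csv` module and checks that the `label` and `smiles` columns are present. `count_molecules` is a hypothetical helper, and it assumes a comma-separated file.

```python
import csv
import io


def count_molecules(csv_text: str) -> int:
    """Count the data rows of a comma-separated string with `label` and `smiles` columns."""
    reader = csv.DictReader(io.StringIO(csv_text.strip()))
    missing = {"label", "smiles"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing CSV columns: {sorted(missing)}")
    # Blank lines are skipped by the reader; rows without a SMILES are ignored
    return sum(1 for row in reader if (row["smiles"] or "").strip())


example = "label,smiles\nmymolecule,CCCCCCCC[SH2+]\n"
print(count_molecules(example))
```

You could then set `"numberMolecules": count_molecules(molecules)` in the payload.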

In [None]:
#Submit an instance of the parameterised virtual experiment package
experimentId = 'band-gap-pm3-gamess-us'
rest_uid = api.api_experiment_start(experimentId, experimentConfiguration)

In [None]:
#Print REST-uid of experiment instance
print("rest-uid:", rest_uid)

In [None]:
#Get instance status - run this cell periodically, until experiment state becomes "Finished"
#it should take about 5 minutes
instance_status = api.api_rest_uid_status(rest_uid)

status = instance_status['status']
status = {key: status[key] for key in status if key != "meta"}
print("Status of instance is\n",json.dumps(status, indent=2))

#Uncomment to see verbose state of instance
#PrettyInstanceStatus(instance_status)

Virtual experiments produce "key-outputs" which you may download (see cell below).

The above experiment produces a single output, `OptimisationResults`. The cell below keeps running until the virtual experiment has produced that output; it then fetches it, displays it, and exits the `while` loop.

In [None]:
while True:
    clear_output(wait=True)
    instance_status = api.api_rest_uid_status(rest_uid)

    print(f"Outputs produced so far are {pretty_json(instance_status['outputs'])}", end='\n\n')
    exp_state = instance_status['status']['experiment-state']
    if exp_state is None:
        next_call = datetime.datetime.now() + datetime.timedelta(seconds=10)
        print(f"Kubernetes is spinning up objects - try again at {next_call}")
        time.sleep(10)
        continue
                                         
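    if exp_state in ["running", "initialising"]:
        print(f"Experiment is {exp_state}, it may produce more outputs")
        print("The experiment in this example only produces 1 output - OptimisationResults")
    else:
        print(f"Experiment is {exp_state} - it will not produce new outputs", end='\n\n')
        if 'OptimisationResults' not in instance_status['outputs']:
            # The experiment ended without the output: stop polling instead of looping forever
            break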

    # Get the CSV data associated with a particular result type
    # If you attempt to fetch an output for which there is no entry in the instance_status['outputs'] dictionary,
    # or there is no workflow instance the statement below will raise an InvalidHTTPRequest exception, 
    # read the description of the Exception for more information.
    if 'OptimisationResults' in instance_status['outputs']:
        filename, contents = api.api_rest_uid_output(rest_uid, 'OptimisationResults')
        print("Contents (i.e. bytes) of", filename, "are:")
        print(contents.decode('utf-8'))
        break
    else:
        next_call = datetime.datetime.now() + datetime.timedelta(seconds=10)
        print(f"Experiment has not produced outputs yet - try again at {next_call}")
        time.sleep(10)
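
The polling logic above can also be factored into a reusable function with a timeout. The sketch below is one possible shape: `wait_for_output` is a hypothetical helper that takes any callable returning a status dictionary laid out like the one from `api.api_rest_uid_status()` (an `outputs` dictionary and `status['experiment-state']`). The default timeout and interval are arbitrary choices.

```python
import time
from typing import Any, Callable, Dict, Optional


def wait_for_output(get_status: Callable[[], Dict[str, Any]], output_name: str,
                    timeout_seconds: float = 600.0, interval_seconds: float = 10.0,
                    sleep: Callable[[float], None] = time.sleep
                    ) -> Optional[Dict[str, Any]]:
    """Poll until `output_name` appears in the outputs.

    Returns the status containing the output, or None if the experiment
    stopped without producing it. Raises TimeoutError after `timeout_seconds`.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if output_name in status.get("outputs", {}):
            return status
        state = status.get("status", {}).get("experiment-state")
        # A state of None means the instance has not started reporting yet
        if state is not None and state not in ("running", "initialising"):
            return None
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No {output_name} after {timeout_seconds} seconds")
        sleep(interval_seconds)
```

In this notebook it could be called as `wait_for_output(lambda: api.api_rest_uid_status(rest_uid), 'OptimisationResults')`.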

The experiment instance will asynchronously register itself to the ST4SD datastore - this may take up to 1 minute after the virtual experiment instance transitions to the `running` state. It will then asynchronously update its state on the ST4SD datastore database.


One of the features of the Datastore (often referred to as ST4SD Centralized Database or CDB) is the generation of status reports about the experiment instance.

In [None]:
# First wait for the experiment to register itself to the ST4SD Datastore
# this can take up to 1 minute after the virtual experiment instance transitions to the `running` state

print(f"Waiting for virtual experiment instance {rest_uid} to register to the ST4SD Datastore")

instance_uri = None
while instance_uri is None:
    doc = api.cdb_get_document_experiment_for_rest_uid(rest_uid, query={
        "status.experiment-state": "finished"})   
    if doc is not None:
        instance_uri = doc['instance']
    else:
        time.sleep(10)

print(f"The Instance URI of the virtual experiment is {instance_uri} - fetching the status report")
report = None

while report is None:
    try:
        report = api.cdb_get_detailed_status_report(instance_uri)
        break
    except ValueError as e: 
        print(f"{e} - try again in 10 seconds")
        time.sleep(10)

print(report)

## Get the Interface of the Experiment instance

Experiments may optionally have an interface that maps an input system id (e.g. a SMILES) to properties that the parameterised virtual experiment package measures. Interfaces help users understand the contents of the output of a virtual experiment without the need to understand the internals of the experiment.

For more information, read our [documentation](https://st4sd.github.io/overview/using-a-virtual-experiment-interface/) about interfaces of parameterised virtual experiment packages.

Below, we demonstrate extracting the interface of the experiment instance you ran earlier.

In [None]:
doc = api.cdb_get_document_experiment_for_rest_uid(rest_uid, include_properties=["*"])
df: pandas.DataFrame = pandas.DataFrame.from_dict(doc['interface']['propertyTable'])

df

## Interact with S3 and Datasets

You may use `input` and `data` files stored on S3 and Dataset objects ([with the help of Datashim](https://github.com/datashim-io/datashim)). 

First, you need to create an `S3` bucket and credentials to access it. If you are using IBM Cloud, you can [use our guide](https://st4sd.github.io/overview/UsingCloudObjectStore).

### Using S3 directly


Below, we show how to read `input`/`data` files stored on S3:

In [None]:
#See the experiment description for defaults and other options
experimentConfiguration = {
    "variables": {
        "startIndex": 0,
        "numberMolecules": 1,
    },
    "orchestrator_resources": {
      "cpu": "1",
      "memory": "2Gi"
    },
    "s3": {
      # See guide https://pages.github.ibm.com/st4sd/overview/UsingCloudObjectStore
      # to populate the values here. If Datashim is running on your cluster you may use
      # Datasets instead of typing the S3 credentials here (see further down in this notebook)
      "accessKeyID": "the  contents of `cos_hmac_keys.access_key_id`",
      "secretAccessKey": "the contents of  `cos_hmac_keys.secret_access_key`",
      "endpoint":  "https endpoint (e.g. https://s3.eu-de.cloud-object-storage.appdomain.cloud)",
      "bucket": "the name of your bucket here"
    },
    "data": [{
        # The contents of this data file will be read from S3
       "filename": "/path/relative/to/S3/bucket/input_smiles.csv"
   }]
}

ST4SD also supports storing key-outputs of virtual experiments on S3. Recall that experiments may optionally define key-outputs (snippet from [`sum-numbers`](https://github.com/st4sd/sum-numbers/blob/main/conf/flowir_package.yaml)):

```yaml
output:
  TotalSum:
    data-in: "stage2.Sum/out.stdout:copy"
... rest of FlowIR ...
```

For example, in the FlowIR yaml above the experiment defines a single key-output named `TotalSum`. The key-output maps to the file `out.stdout` that the `Sum` component in stage `2` generates.
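
To make the mapping concrete, the sketch below splits the `data-in` string from the snippet into its parts. The pattern is inferred only from that single example (`stage<index>.<component>/<path>:<method>`); other valid forms of FlowIR references may exist that it does not handle, and `parse_data_in` is a hypothetical helper.

```python
import re
from typing import NamedTuple


class KeyOutputSource(NamedTuple):
    stage: int
    component: str
    path: str
    method: str


# Pattern inferred from the example "stage2.Sum/out.stdout:copy" (an assumption)
_DATA_IN = re.compile(
    r"^stage(?P<stage>\d+)\.(?P<component>[^/]+)/(?P<path>.+):(?P<method>[A-Za-z]+)$")


def parse_data_in(data_in: str) -> KeyOutputSource:
    """Split a `data-in` reference into stage, component, path and method."""
    match = _DATA_IN.match(data_in)
    if match is None:
        raise ValueError(f"Unrecognised data-in reference: {data_in!r}")
    return KeyOutputSource(int(match["stage"]), match["component"],
                           match["path"], match["method"])


print(parse_data_in("stage2.Sum/out.stdout:copy"))
```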


We can instruct ST4SD to upload the key-outputs to the path `/run1_output/` in bucket `my-bucket` on S3 like so:

In [None]:
#See the experiment description for defaults and other options
experimentConfiguration = {
    "variables": {
        "startIndex": 0,
        "numberMolecules": 1,
    },
    "orchestrator_resources": {
      "cpu": "1",
      "memory": "2Gi"
    },
    "s3Store":{
      "credentials": {
        # See guide https://pages.github.ibm.com/st4sd/overview/UsingCloudObjectStore
        # to populate the values here. If Datashim is running on your cluster you may use
        # Datasets instead of typing the S3 credentials here (see further down in this notebook)
        "accessKeyID": "the  contents of `cos_hmac_keys.access_key_id`",
        "secretAccessKey": "the contents of  `cos_hmac_keys.secret_access_key`",
        "endpoint": "your endpoint prefixed with https:// (e.g. https://s3.eu-de.cloud-object-storage.appdomain.cloud)",
        "bucket": "my-bucket", # the name of your bucket
      },
      # optional - defaults to "/"
      "bucketPath": "/run1_output/"
    },
}

You may combine the `"s3"` and `"s3Store"` fields to both read `input`/`data` files from S3 and store key-outputs to S3.
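
A combined payload could be assembled as in the sketch below. `s3_read_and_store` is a hypothetical helper; it reuses one set of credentials for both reading and storing, which is an assumption made for brevity rather than a requirement. The credential values are placeholders.

```python
from typing import Any, Dict


def s3_read_and_store(credentials: Dict[str, str], data_path: str,
                      bucket_path: str = "/") -> Dict[str, Any]:
    """Build the S3-related part of a payload that reads a data file and stores key-outputs."""
    return {
        # Credentials used to read `input`/`data` files
        "s3": dict(credentials),
        "data": [{"filename": data_path}],
        # Credentials and path used to store key-outputs
        "s3Store": {"credentials": dict(credentials), "bucketPath": bucket_path},
    }


credentials = {
    "accessKeyID": "<access key id>",
    "secretAccessKey": "<secret access key>",
    "endpoint": "https://s3.eu-de.cloud-object-storage.appdomain.cloud",
    "bucket": "my-bucket",
}
s3_fields = s3_read_and_store(credentials, "/path/relative/to/S3/bucket/input_smiles.csv",
                              bucket_path="/run1_output/")
print(sorted(s3_fields))
```

Merge the result with the `variables` and `orchestrator_resources` fields shown earlier before submitting, for example `{**experimentConfiguration, **s3_fields}`.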

### Using Datashim to access S3

[Datashim](https://github.com/datashim-io/datashim) is a Kubernetes-native framework that enables you to create `Dataset` objects which act as gateways to S3 buckets and can also be mounted as regular `PersistentVolumeClaims` in your pods. 

Datashim is an optional component of ST4SD. You can find installation instructions at <https://github.com/datashim-io/datashim>.

ST4SD supports using Datasets both to provide `input`/`data` files stored on S3 to virtual experiments and to store key-outputs of virtual experiments on S3.

To provide `input` and `data` files that are stored on an S3 bucket for which you have already created a Dataset object ([instructions to create Dataset](https://st4sd.github.io/overview/UsingCloudObjectStore/#datashim-method)) you can use the following snippet:

In [None]:
#See the experiment description for defaults and other options
experimentConfiguration = {
    "variables": {
        "startIndex": 0,
        "numberMolecules": 1,
    },
    "orchestrator_resources": {
      "cpu": "1",
      "memory": "2Gi"
    },
    "s3": {
      # By specifying a `dataset` we do not need to type out the credentials for accessing S3
      "dataset": "my-dataset"
    },
    "data": [{
        # The contents of this data file will be read from the S3 bucket that `my-dataset` proxies
       "filename": "/path/relative/to/dataset/bucket/input_smiles.csv"
   }]
}

We can instruct ST4SD to upload the key-outputs to the path `/run1_output/` in the bucket `my-bucket` that the `my-dataset` Dataset proxies like so:

In [None]:
#See the experiment description for defaults and other options
experimentConfiguration = {
    "variables": {
        "startIndex": 0,
        "numberMolecules": 1,
    },
    "orchestrator_resources": {
      "cpu": "1",
      "memory": "2Gi"
    },
    "s3Store": {
      "datasetStoreURI": "dataset://my-dataset/run1_output/"
    }
}

Finally, you may use both `s3.dataset` and `s3Store.datasetStoreURI` to read `input`/`data` files from a dataset and store key-outputs on a dataset (the same or a different one).
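
A sketch of such a combined payload follows. `dataset_read_and_store` is a hypothetical helper, and the `dataset://<name>/<path>` layout of the URI is inferred from the single example above.

```python
from typing import Any, Dict


def dataset_read_and_store(input_dataset: str, data_path: str,
                           output_dataset: str, output_path: str = "") -> Dict[str, Any]:
    """Build the dataset-related part of a payload that reads from and stores to Datasets."""
    # URI layout assumed from the example "dataset://my-dataset/run1_output/"
    store_uri = f"dataset://{output_dataset}/{output_path.lstrip('/')}"
    return {
        "s3": {"dataset": input_dataset},
        "data": [{"filename": data_path}],
        "s3Store": {"datasetStoreURI": store_uri},
    }


dataset_fields = dataset_read_and_store(
    "my-dataset", "/path/relative/to/dataset/bucket/input_smiles.csv",
    "my-dataset", "/run1_output/")
print(dataset_fields["s3Store"]["datasetStoreURI"])
```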

## Mounting Kubernetes objects (Secret, PersistentVolumeClaim, Dataset, ConfigMap)

You may mount `readOnly` volumes to your virtual experiments and map them to `application-dependencies`, which are folders that appear in the root directory of your virtual experiment instances. You can reference their contents in the same way that you reference the files under the `data` and `input` top-level directories.

The schema of the `volumes` array is:

```python
{
  "volumes": [
      {
        "type":
          {
            # You must use exactly one of the fields below
            "configMap": "<name of ConfigMap>",
            "persistentVolumeClaim": "<name of PVC>",
            "dataset": "<name of Dataset>",
            "secret": "<name of Secret>",
          },
         "applicationDependency": "application-dependencies entry that will point to contents of volume",
         "subPath": "Path within the volume from which the container's volume should be mounted. Defaults to empty string (volume's root) - not applicable to configMaps or secrets"
      }
  ]
}
```
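
The two constraints stated in the schema (exactly one `type` field, and no `subPath` for ConfigMaps or Secrets) can be checked locally before submitting. The sketch below does only that; `check_volume` is a hypothetical helper, not an ST4SD API, and the service may enforce further rules.

```python
from typing import Any, Dict, List

VOLUME_TYPE_FIELDS = ("configMap", "persistentVolumeClaim", "dataset", "secret")


def check_volume(volume: Dict[str, Any]) -> List[str]:
    """Return the problems found in one `volumes` entry (empty list if none)."""
    problems = []
    type_fields = volume.get("type") or {}
    used = [name for name in VOLUME_TYPE_FIELDS if type_fields.get(name)]
    if len(used) != 1:
        problems.append(f"expected exactly one type field, found {used}")
    unknown = sorted(set(type_fields) - set(VOLUME_TYPE_FIELDS))
    if unknown:
        problems.append(f"unknown type fields: {unknown}")
    # The schema notes that subPath does not apply to configMaps or secrets
    if volume.get("subPath") and len(used) == 1 and used[0] in ("configMap", "secret"):
        problems.append("subPath is not applicable to configMap or secret volumes")
    return problems


print(check_volume({"type": {"dataset": "my-dataset"}, "applicationDependency": "foo"}))
```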

### Example

We can mount a `Dataset` as the `foo` application dependency of a virtual experiment.

First, we need to include the application dependency in the FlowIR definition of the virtual experiment:

```yaml
application-dependencies:
  default:
  # indicate that this workflow expects a `foo` dependency.
  - foo
 
# Components reference the contents of the `foo` in the same way 
# that they reference data/input files, for example:
components:
- name: hello
  command:
    executable: cat
    arguments: foo/message.txt:ref
  references:
  - foo/message.txt:ref
```

The payload below mounts the S3 bucket `my-bucket` that the `my-dataset` Dataset proxies as the `foo` application-dependency:

In [None]:
#See the experiment description for defaults and other options
experimentConfiguration = {
    "variables": {
        "startIndex": 0,
        "numberMolecules": 1,
    },
    "orchestrator_resources": {
      "cpu": "1",
      "memory": "2Gi"
    },
    "s3Store": {
      "datasetStoreURI": "dataset://my-dataset/run1_output/"
    },
    "volumes":[
        {
            "type": {
                "dataset": "my-dataset"
            },
            "applicationDependency": "foo"
        }
    ]
}

## Query the ST4SD Datastore

Keep in mind that the `st4sd-datastore` API will truncate files that are larger than 32MB. In such a case
the returned contents will include the message below right at the end of the truncated contents:
`FILE TRUNCATED to 33554432 bytes, actual file size is <the-file-size-in-bytes>`
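
Client code can look for this trailing message to detect truncated files. The sketch below is a hypothetical helper (`truncation_info` is not part of the ST4SD API) and assumes the message is the last thing in the payload, optionally followed by whitespace.

```python
import re
from typing import Optional, Tuple, Union

_TRUNCATED = re.compile(
    rb"FILE TRUNCATED to (\d+) bytes, actual file size is (\d+)\s*$")


def truncation_info(contents: Union[bytes, str]) -> Optional[Tuple[int, int]]:
    """Return (truncated_size, actual_size) if the contents end with the truncation message."""
    if isinstance(contents, str):
        contents = contents.encode("utf-8")
    # The message sits at the very end, so only the tail needs to be searched
    match = _TRUNCATED.search(contents[-200:])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = b"a,b\n1,2\nFILE TRUNCATED to 33554432 bytes, actual file size is 40000000"
print(truncation_info(sample))
```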

In [None]:
docs = api.cdb_get_document_experiment(query={})
print(f"Recorded workflow instances: {len(docs)}")

In [None]:
# Uncomment to print each `experiment` document
# for d in docs:
#     print(pretty_json(d['type']))

In [None]:
docs, files = api.cdb_get_data(stage=1, instance='band-gap-.*', component='ExtractEnergies', filename='energies.csv')
print("Last Matching component document is", end='\n\n')
print(pretty_json(docs[-1]))

## Image Pull Secrets

Kubernetes uses so-called imagePullSecrets to pull images from container registries. If your workflow instances need to access a container registry use the API below to create/update imagePullSecrets. You may also list the existing imagePullSecrets that the mvp2-stack utilizes to pull images for the pods your workflow instances generate.

In [None]:
help(experiment.service.db.ExperimentRestAPI.api_image_pull_secrets_upsert)

In [None]:
help(experiment.service.db.ExperimentRestAPI.api_image_pull_secrets_list)

## Datasets

If your cluster-admin has installed [Datashim](https://github.com/datashim-io/datashim) (formerly known as DLF) you will be able to create and mount Datasets using the Datashim framework. For the time being, Datashim supports Dataset objects which point to COS/S3 buckets.

In [None]:
help(experiment.service.db.ExperimentRestAPI.api_datasets_create)

In [None]:
help(experiment.service.db.ExperimentRestAPI.api_datasets_list)