# Getting started with Determined AI, the open-source deep learning training platform - Lab 2

For this part of the lab we will consider the well-known Iris classification problem: predicting the iris species from sepal and petal size. We use Determined AI to train and tune a TensorFlow Keras model on TensorFlow's publicly available iris [training](http://download.tensorflow.org/data/iris_training.csv) and [test](http://download.tensorflow.org/data/iris_test.csv) datasets.

The small dataset consists of:
* 150 samples (120 samples in the training dataset, 30 samples in the test dataset)
* 3 labels: species of Iris (Iris setosa, Iris virginica and Iris versicolor)
* 4 features: sepal length, sepal width, petal length and petal width, in cm
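
To make the dataset layout concrete, here is a minimal sketch of how rows of this kind can be split into features and labels. The three sample rows and the label order (0 = setosa, 1 = versicolor, 2 = virginica) are assumptions made for illustration; check them against the downloaded CSV files before relying on them.

```python
import csv
import io

# Illustrative rows only (assumed layout: four measurements in cm, then an integer label).
SAMPLE_CSV = """6.4,2.8,5.6,2.2,2
5.0,2.3,3.3,1.0,1
4.9,3.1,1.5,0.1,0
"""

# Assumed label order; verify against the header of the real CSV file.
SPECIES = ["Iris setosa", "Iris versicolor", "Iris virginica"]


def split_features_labels(text):
    """Return (features, labels) parsed from CSV text with 4 features + 1 label per row."""
    features, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        features.append([float(value) for value in row[:4]])
        labels.append(int(row[4]))
    return features, labels


features, labels = split_features_labels(SAMPLE_CSV)
for measurements, label in zip(features, labels):
    print(measurements, "->", SPECIES[label])
```

The same split applies to the real files, except that their first line is a header and has to be skipped before parsing.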

In this part of the lab you will:
* Install the Determined AI CLI to interact with the Determined AI system running on a Kubernetes cluster.
* Get familiar with the Determined AI Command Line Interface (CLI).
* Create Determined AI **Experiments** to train your model with a single GPU, with multiple GPUs, and with Determined AI hyperparameter tuning.
* Interact with the Determined AI WebUI to visualize experiment metrics.

### 1- Install Determined AI CLI

The Determined CLI is a command line tool that allows you to interact with the Determined AI system. For example, the CLI allows you to launch a new experiment to train your deep learning (DL) model.

Here, you will use the line magic function below to install the **Determined AI CLI** as a new Python package in the current Jupyter kernel. The Determined CLI will be installed in the folder: ***~/.local/bin***

>**Note:** After installation of the package, you will need to restart the current kernel to use the newly installed package from the local Jupyter Notebook server.

> **Note:** When you see a [*] next to a code cell, it means the cell is still executing within the notebook.

In [None]:
# use the line magic function below to install new Python packages in the current Jupyter kernel.
# restart the current kernel to use the newly installed package
%pip install kfp-pipeline-spec --quiet
%pip install determined --quiet

### 2- Restart your Kernel to use the newly installed Python packages.
From the menu bar, select **"Kernel"** then **"Restart Kernel..."**

### 3- Fetch the KubeConfig file for your tenant user ID

Your local Jupyter Notebook has been tailored to interact with the Kubernetes resources using the Kubernetes API through the command line interface "***kubectl***".

When prompted to enter the password, make sure to enter the password for the StudentID credentials you received in the Workshop-on-demand registration e-mail.

>**Note:** You can ignore the _InsecureRequestWarning_ message.

If the password is correct, you will see the message ***kubeconfig set for user Student\<YourID\>***.

In [None]:
%kubeRefresh

### 4- Determined AI components

For this hands-on workshop, the Determined AI system has been installed on the Kubernetes cluster managed by HPE Ezmeral Runtime Enterprise, in the Kubernetes namespace _determinedai_.

When installing Determined AI on Kubernetes, an instance of the **Determined Master** and a **PostgreSQL database** are deployed in the Kubernetes cluster. Each of these components runs as a container within a Kubernetes POD. Run the code cell below and check out the output. You should see one container POD for the Determined Master service and another POD for the Database service.

In [None]:
!kubectl get pod -n determinedai | grep determined

The Determined Master is the central component of the Determined System. The Master is responsible for:
* **Scheduling** Determined training workloads as a collection of Kubernetes PODs. The Master brings up PODs to run workloads such as model training tasks, TensorBoard instances and Jupyter Notebook instances.
* **Tracking and storing** all model training workload metadata (description, labels, hyperparameters, search algorithm used, training metrics, validation metrics, start/end time, logs) in the PostgreSQL database.
* **Saving** model artifacts (model files, code, model definition files) and training checkpoints of Determined training workloads in a _checkpoint storage_ to keep records of the progress and ensure workload resiliency. Determined will automatically retry failed training tasks from the latest checkpoint.
* **Serving** the Web User Interface for users to visualize training and validation metrics across their model training tasks.

>**Note:** _For this hands-on lab, the Kubernetes cluster worker nodes that run the Determined AI system have been configured to connect to a distributed file system provided by HPE Ezmeral Runtime Enterprise. The distributed file system provides shared storage that works with the Determined AI system to:_
>* _allow training workloads to access model training/validation datasets for the Iris classification model example,_ 
>* _store model training artifacts (model files, model codes),_ 
>* _store training workload checkpoints (saved versions of trained models that users can access later. Checkpoints are also used by Determined to retry failed training tasks)._

### 5- Set the endpoint URL of the Determined master. 

To use Determined and interact with the Determined cluster, you need to tell the CLI where the Determined master service is running. Run the code cell below to get the Determined Master endpoint URL.

In [None]:
# Get the Determined AI Master service endpoint URL.
# The `!` shell assignment returns a list of output lines; the endpoint is the first line.
masterUrl = !kubectl describe service determined-master-service-stagingdetai -n determinedai | grep gateway/8080 | awk '{print $3}'
det_master = masterUrl[0].strip()
determined_master = "http://" + det_master
print(f"The Determined Master Service endpoint URL is: {determined_master}")
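
IPython's `!command` assignment returns a list-like object holding the command's output lines, so the endpoint is simply the first non-empty line. The sketch below makes that extraction explicit and fails loudly when the command prints nothing. The helper name `first_line` and the sample host and port are hypothetical.

```python
def first_line(shell_output):
    """Return the first non-empty, stripped line from a list of output lines."""
    for line in shell_output:
        stripped = line.strip()
        if stripped:
            return stripped
    raise ValueError("the command produced no output")


# Stand-in for what `masterUrl = !kubectl ...` would return in the notebook.
sample_output = ["gateway.example.com:10042"]
determined_master_example = "http://" + first_line(sample_output)
print(determined_master_example)
```

Raising an error on empty output is a deliberate choice: an empty endpoint would otherwise only surface later as a confusing connection failure.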

### 6- Authenticate to Determined AI
Run the code cell below and follow steps 1 to 4 to authenticate to Determined AI system as student\<yourId\>: 

In [None]:
userID = "student900"
#
print ("")
print ("export DET_MASTER=" + determined_master) 
print ("~/.local/bin/det user login " + userID)

1. Start a Terminal in the Launcher (navigate to Launcher tab --> Click Terminal tile; or go to Menu --> File --> New Launcher --> Terminal)
2. In the Terminal, copy/paste the two commands above to authenticate to Determined AI as user Student\<yourID\>.
3. Press return when prompted to enter a password. By default, users have a blank password in Determined AI.
4. Then continue from Step 7 onwards. 

### 7- Check connectivity to the Determined AI Master service endpoint

Run your first Determined CLI command to verify the connectivity to the Determined Master.

A Determined CLI command has the form: `det -m <master_URL_or_IP> <command> [arguments]`

In [None]:
# The Determined AI CLI is installed in '$HOME/.local/bin'
!~/.local/bin/det -m {determined_master} version

### 8- Create Determined AI experiments to train your model

Determined AI introduces some fundamental concepts that are used in this workshop: _Experiment, Trial and Hyperparameter_.

**Experiment:** In Determined AI terms, an ***experiment*** is a collection of one or more DL training tasks. A Determined experiment can either train a single model with a single training task (on a single GPU or on multiple GPUs), or it can define a search over a user-defined hyperparameter space with several training tasks.

**Trial:** Each training task in an experiment is called a ***trial***. A trial is a training task that consists of the dataset (training and validation/test dataset), a deep learning model (for example, the Python scripts), and an experiment configuration file that defines the values for all of the model’s hyperparameters. All the elements of a training task are put together in a `model definition directory`.

**Hyperparameters:** These are user-defined variables that define how a model is trained. They affect the accuracy of the trained model. By choosing the best combination of hyperparameters you can obtain better end results. 

Run the code cell below to display the content of the model definition directory:

In [None]:
!ls ~/source_control/Code -l

The Determined model definition directory contains:
- `model_def.py`: The model definition exposed to Determined. This is the core code for the model. It includes the data loading code and the code that builds and compiles the model.
- `startup-hook.sh`: Additional dependencies that Determined will automatically install into each POD container (trial runner) for this experiment. Here, the Pandas Python library will be installed.
- `*.yaml`: A set of experiment configuration files, each of which defines an experiment to train the model:
     - const.yaml: Trains the model with constant hyperparameter values and data located in a shared file system storage.
     - distributed.yaml: Same as const.yaml, but trains the model with multiple GPUs (distributed training).
     - adaptive.yaml: Performs a hyperparameter search using Determined's state-of-the-art adaptive hyperparameter tuning algorithm (aka a `Searcher` method).

#### Create an experiment to train a single instance of the model with a single GPU by defining the hyperparameters in a const.yaml file

Let’s start simple by training the Iris model on a single GPU. 

Let's take a closer look at this experiment configuration file.

In [None]:
!cat ~/source_control/Code/const.yaml

As you can see above, the hyperparameters are defined as fixed values, and the ***Searcher*** method is set to _single_, which does not perform a hyperparameter search.

The metric ***val_categorical_accuracy*** (the classification accuracy measured on the validation data) is used to evaluate the training and validation over 5000 batches. Because it is an accuracy, the higher the metric, the better the model.
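
As a reminder of what this metric measures, categorical accuracy is the fraction of samples whose highest-probability class matches the true label, so a value of 1.0 means every sample was classified correctly. The sketch below computes it in plain Python. It uses integer class labels for brevity (Keras' own `categorical_accuracy` works on one-hot labels), and the sample predictions are made up.

```python
def categorical_accuracy(true_labels, predicted_probabilities):
    """Fraction of samples whose highest-probability class equals the true label."""
    if not true_labels:
        raise ValueError("at least one sample is required")
    if len(true_labels) != len(predicted_probabilities):
        raise ValueError("labels and predictions must have the same length")
    correct = 0
    for label, probabilities in zip(true_labels, predicted_probabilities):
        # Index of the largest probability is the predicted class.
        predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
        if predicted_class == label:
            correct += 1
    return correct / len(true_labels)


# Hypothetical predictions for four samples over three classes.
labels = [0, 1, 2, 1]
probabilities = [
    [0.9, 0.05, 0.05],  # predicted class 0, correct
    [0.2, 0.7, 0.1],    # predicted class 1, correct
    [0.1, 0.6, 0.3],    # predicted class 1, wrong (true label is 2)
    [0.3, 0.4, 0.3],    # predicted class 1, correct
]
print(categorical_accuracy(labels, probabilities))
```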

Notice the ***bind_mounts*** attributes: to run an experiment that uses data stored in a distributed file system, bind_mounts attributes are specified in the experiment configuration file. Here, the bind_mounts point to the shared file system path mounted on the Kubernetes cluster worker nodes by HPE Ezmeral Runtime Enterprise. 
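
For orientation, the sketch below mirrors the general shape of such a configuration as a Python dictionary. The key names are the ones used in Determined's experiment configuration format, but every value (description, hyperparameters, paths, entrypoint class name) is a placeholder and not the content of the lab's `const.yaml`. Treat the output of the `cat` command above as the source of truth.

```python
import json

# Placeholder values throughout; only the overall structure is the point here.
example_config = {
    "description": "iris_tf_keras_const",
    "hyperparameters": {  # fixed values, so there is nothing to search over
        "learning_rate": 1.0e-4,
        "global_batch_size": 32,
    },
    "searcher": {
        "name": "single",  # one trial, no hyperparameter search
        "metric": "val_categorical_accuracy",
        "max_length": {"batches": 5000},
    },
    "bind_mounts": [
        {
            "host_path": "/path/on/worker/node",     # placeholder
            "container_path": "/path/in/container",  # placeholder
            "read_only": True,
        }
    ],
    "entrypoint": "model_def:IrisTrial",  # placeholder class name
}

print(json.dumps(example_config, indent=2))
```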

##### Let's run your first experiment!

The CLI command below submits the experiment to the Determined master. It specifies the experiment configuration file _const.yaml_ and the model definition directory to be used.

The command will return the Experiment Id.

In [None]:
# launch experiment to train a single model on a single GPU
!~/.local/bin/det -m {determined_master} experiment create ~/source_control/Code/const.yaml ~/source_control/Code

Using the command below, you will see that the Determined Master has launched the training task for your experiment as a container POD named:

 _exp-\<experimentID\>-trial-\<TrialID\>-\<unique-name\>_

In [None]:
!kubectl get pods -n determinedai

The command below is used to list your experiment and its status in the Determined system. 

Run the code cell regularly to track the execution progress until the status changes from **ACTIVE** to **COMPLETED**.

In [None]:
!~/.local/bin/det -m {determined_master} experiment list | tail -1
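
Instead of re-running the cell by hand, the check can be wrapped in a small polling loop. The sketch below is generic: `wait_for_state` and the `get_state` callback are hypothetical names, and the fake callback stands in for whatever code reads the experiment state (for example by parsing the CLI output above).

```python
import time


def wait_for_state(get_state, target="COMPLETED", interval=5.0, timeout=600.0):
    """Call get_state() every `interval` seconds until it returns `target`.

    Returns True if the target state was seen, False if `timeout` seconds elapsed first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_state() == target:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Fake state source for demonstration: ACTIVE twice, then COMPLETED.
fake_states = iter(["ACTIVE", "ACTIVE", "COMPLETED"])
finished = wait_for_state(lambda: next(fake_states), interval=0.01, timeout=1.0)
print("Experiment finished:", finished)
```

The timeout keeps the notebook from blocking forever if the experiment ends in a state other than the target, such as an error state.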

### 9- Monitor and visualize your experiment in Determined AI Web User Interface

To access information on both training and validation performance, simply go to the Determined **WebUI** by entering the service endpoint URL of the Determined Master in your web browser connected to the Internet.

* Run the code cell below to get the Determined Master WebUI URL. 
* Then, click on the link to connect. This will open a new tab in your browser.
* You will be prompted to enter your credentials. Type your StudentID as the username, leave the password empty (it is blank by default), and press return.
* Upon login you should see the WebUI dashboard.

In [None]:
# The gateway port is the part after the colon in the endpoint (first output line).
port = !kubectl describe service determined-master-service-stagingdetai -n determinedai | grep gateway/8080 | awk '{print $3}' | cut -d':' -f 2
portUI = port[0].strip()
print(f"The Determined Master WebUI URL is: http://notebooks.hpedev.io:{portUI}")
print(f"Log in with your student identifier: {userID}. Leave the password empty and click the Sign In button.")

![Det-WebUI-Login](DetWebUI-Login.png)

From the WebUI, make sure **you select your StudentID** from the ***Users*** drop-down list as shown in the picture below.  
From the dashboard, you can easily select the experiment you want to visualize.

![Det-WebUI-Users](DetWebUI-Users.png)

First experiment: After the experiment completes, we can see on the experiment detail page that training the model with the hyperparameter settings in `const.yaml` yields a validation accuracy of ~93%. 

### 10 - TensorBoard visualization

TensorBoard is a widely used tool for visualizing and inspecting deep learning models. Determined AI is integrated with TensorBoard for deeper analysis of your experiment and to help you examine your neural network model by viewing the training and validation loss curves for your experiment in TensorBoard. 

* Determined AI lets you launch a TensorBoard server and access TensorBoard in one click from the WebUI, or you can run the following command with the Determined CLI. This may take a minute or so, as Determined has to launch the TensorBoard server as a Kubernetes POD.

In [None]:
# Fetch the ID of your most recent experiment (first column of the last row of the list).
myexp = !~/.local/bin/det -m {determined_master} experiment list | tail -1 | cut -d'|' -f 1
myexp = myexp[0].strip()
print(f"Starting a TensorBoard server instance for your experiment {myexp}:")
!~/.local/bin/det -m {determined_master} tensorboard start -d {myexp}

* Run the code cell below to get the TensorBoard URL for your experiment. Then, click on the link to connect.

>**Note:** The associated TensorBoard server is launched as a container POD in the Kubernetes cluster. Determined AI proxies HTTP requests to and from the TensorBoard container through the Determined AI master node.

In [None]:
# Fetch the ID of the running TensorBoard instance (first column of the list output).
mytensorboard = !~/.local/bin/det -m {determined_master} tensorboard list | grep RUNNING | cut -d'|' -f 1
mytensorboard = mytensorboard[0].strip()
print(f"Your TensorBoard is running at http://notebooks.hpedev.io:{portUI}/proxy/{mytensorboard}/")

* When you have finished with TensorBoard, run the code cell below to kill the TensorBoard process.

In [None]:
!~/.local/bin/det -m {determined_master} tensorboard kill {mytensorboard}

### 11 - List the best model created by the training process

In [None]:
# To list all checkpoints instead of only the best one, drop the `--best 1` option.

# List the best checkpoint of the experiment:
!~/.local/bin/det -m {determined_master} experiment list-checkpoints --best 1 {myexp}

### 12- Inference with Determined AI (section under development)
When you train a model with Determined, all of the artifacts (model files) and metrics associated with that training task are tracked and stored in _checkpoint storage_. The artifacts are accessible programmatically. This makes it easy to export your best-performing trained model out of Determined and load it for inference (the process of using a trained model and new data to make a prediction).

More information on downloading a trained model can be found [here](https://docs.determined.ai/latest/post-training/use-trained-models.html).

In code, this looks like:

In [None]:
# After the experiment completes, fetch its ID so the trained model can be loaded without re-training.
masterUrl = determined_master
print(f"Determined Master service endpoint is: {masterUrl}")
print("")
print("Fetch the Experiment Id of your last Experiment.")
myexp = !~/.local/bin/det -m {determined_master} experiment list | tail -1 | cut -d'|' -f 1
myexp = myexp[0].strip()
print(f"Your last experiment ID is: {myexp}")

In [None]:
!~/.local/bin/det -m {determined_master} experiment download {myexp}

In [None]:
from determined.experimental import client

passw = ""  # users have a blank password by default in this lab
client.login(master=determined_master, user=userID, password=passw)
print(f"Loading top checkpoint from Experiment ID: {myexp}")
# Fetch the top checkpoint of the experiment.
# top_checkpoint() returns a Checkpoint instance for the checkpoint with the best validation metric.
checkpoint = client.get_experiment(int(myexp)).top_checkpoint()


In [None]:
# load() downloads the checkpoint files if needed and returns the trained model
model = checkpoint.load()
print(f"Loaded checkpoint: {checkpoint}")
print("Loaded trained model")

In [None]:
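# Now you can use the model to make predictions.
import numpy as np

# Each row holds sepal length, sepal width, petal length and petal width in cm
X_new = np.array([[3, 2, 1, 0.2], [4.9, 2.2, 3.8, 1.1], [5.3, 2.5, 4.6, 1.9]])
# Predict the species from the input vectors.
# This assumes checkpoint.load() returned a Keras model, which exposes predict().
prediction = model.predict(X_new)
print("Prediction of Species: {}".format(prediction))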
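
A Keras classifier of this kind typically returns one row of class probabilities per input sample. Assuming that output shape, and assuming the label order 0 = setosa, 1 = versicolor, 2 = virginica (neither is confirmed by the lab material), the probabilities can be turned into species names as sketched below. The probability values are made up.

```python
SPECIES = ["Iris setosa", "Iris versicolor", "Iris virginica"]  # assumed label order


def to_species(probability_rows):
    """Map each row of class probabilities to the name of the most likely species."""
    names = []
    for row in probability_rows:
        best_index = max(range(len(row)), key=lambda i: row[i])
        names.append(SPECIES[best_index])
    return names


# Made-up probabilities standing in for the model's prediction output.
example_prediction = [
    [0.97, 0.02, 0.01],
    [0.05, 0.90, 0.05],
    [0.01, 0.19, 0.80],
]
print(to_species(example_prediction))
```

With a real NumPy prediction array, `prediction.tolist()` would give rows in the list form this helper expects.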

### 13- Delete the checkpoints to reclaim some storage space in the storage file system

By default, Determined saves the most recent and the best checkpoint per training task (trial). 

We can reclaim some storage disk space by changing the checkpoint garbage collection policy as shown below:

In [None]:
# Delete the checkpoints data for the single model training using a single GPU
myexp = !~/.local/bin/det -m {determined_master} experiment list | tail -1 | cut -d'|' -f 1
myexp = myexp[0].strip()  # first output line holds the experiment ID
print (f"{myexp}")
!~/.local/bin/det -m {determined_master} experiment set gc-policy --yes --save-experiment-best 0 --save-trial-best 0 --save-trial-latest 0 {myexp}

### 14- Create an experiment to train a single instance of the model with multiple GPUs (distributed training)

Next, let's run an experiment with the same model definition (same code), but this time we leverage Determined's distributed training functionality. 

**Distributed training:** Determined can coordinate multiple GPUs, including GPUs spread across multiple machines, to train a DL model more quickly. Typically, ML engineers use distributed training to train models on larger datasets.

>**Note:** Distributed training performs best with complex models; therefore, the simple Iris model used in this example may not demonstrate the full benefits of using distributed training.

Determined AI automatically executes **[data parallel](https://www.oreilly.com/content/distributed-tensorflow/)** training **without requiring any model code changes**. All you need to do to start a multi-GPU training workload is to specify the desired number of GPUs you want to use in the experiment configuration file. For example:

    resources:
        slots_per_trial: 2

With this configuration, each trial within an experiment will use 2 GPUs to train a single model, whether leveraging 2 GPUs on a single machine or 2 GPUs across multiple machines in the Kubernetes cluster.
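
In data-parallel training, each GPU (slot) processes its own share of every batch. The sketch below shows the arithmetic under the assumption that the global batch size is divided evenly across the slots. This matches our reading of how Determined derives the per-slot batch size, but it is not confirmed by this lab, and the function name is hypothetical.

```python
def per_slot_batch_size(global_batch_size, slots_per_trial):
    """Number of samples each GPU processes per batch under an even split."""
    if slots_per_trial < 1:
        raise ValueError("slots_per_trial must be at least 1")
    if global_batch_size < slots_per_trial:
        raise ValueError("global_batch_size must be at least slots_per_trial")
    return global_batch_size // slots_per_trial


# Example with a hypothetical global batch size of 32.
for slots in (1, 2, 4):
    print(slots, "slot(s):", per_slot_batch_size(32, slots), "samples per GPU")
```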


Let's take a closer look at the experiment configuration file for distributed training:

In [None]:
!cat ~/source_control/Code/distributed.yaml

Next, submit the experiment with the experiment configuration file _distributed.yaml_:

In [None]:
# launch experiment to train a single model on multiple GPUs
!~/.local/bin/det -m {determined_master} experiment create ~/source_control/Code/distributed.yaml ~/source_control/Code

In [None]:
!kubectl get pods -n determinedai

In [None]:
# Delete the checkpoints data for the distributed training
myexp = !~/.local/bin/det -m {determined_master} experiment list | tail -1 | cut -d'|' -f 1
myexp = myexp[0].strip()  # first output line holds the experiment ID
print (f"{myexp}")
!~/.local/bin/det -m {determined_master} experiment set gc-policy --yes --save-experiment-best 0 --save-trial-best 0 --save-trial-latest 0 {myexp}

### 15- Train multiple models as part of a hyperparameter search, using Determined AI hyperparameter tuning functionality (HPO)

Next, let's run an experiment with the same model definition (same code), but this time leveraging Determined's hyperparameter tuning (aka Hyperparameter Optimization or **HPO**). ML engineers typically use HPO to efficiently determine the hyperparameter values that yield the best-performing model. Here the hyperparameters in the experiment configuration file are specified as ranges instead of fixed values, and the `adaptive_asha` searcher is used to explore the hyperparameter space.

With HPO, an experiment consists of multiple training tasks (trials), each with different hyperparameters. Determined AI hyperparameter tuning functionality helps you find the best combination of hyperparameters for your particular model. 

The number of trials to run, the user-defined hyperparameter ranges and the search algorithm (aka the searcher method) are defined in the configuration file _adaptive.yaml_.

>Note: The **searcher** is a method that is used to find effective hyperparameter settings within a predefined range of hyperparameter values.

More about hyperparameter optimization and the searcher methods supported by Determined AI can be found [here](https://docs.determined.ai/latest/training-hyperparameter/index.html#hyperparameter-tuning).
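
To give an intuition for adaptive search, the sketch below implements plain synchronous successive halving, the idea that ASHA-style searchers build on: train many configurations briefly, keep the best fraction, and give the survivors a larger training budget. It is a simplified illustration, not Determined's `adaptive_asha` implementation, and `toy_evaluate` is a stand-in for real training.

```python
import random


def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Repeatedly keep the top 1/eta of configs while multiplying the budget by eta.

    `evaluate(config, budget)` must return a score where higher is better.
    Returns the single surviving configuration.
    """
    survivors = list(configs)
    if not survivors:
        raise ValueError("at least one configuration is required")
    budget = min_budget
    while len(survivors) > 1:
        # Rank the current survivors by their score at this budget, best first.
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        keep = max(1, len(scored) // eta)
        survivors = scored[:keep]
        budget *= eta
    return survivors[0]


def toy_evaluate(config, budget):
    """Toy score: highest when the learning rate is close to 0.01, plus a small budget bonus."""
    return -abs(config["learning_rate"] - 0.01) + 0.001 * budget


rng = random.Random(0)
candidates = [{"learning_rate": 10 ** rng.uniform(-4, -1)} for _ in range(8)]
best = successive_halving(candidates, toy_evaluate)
print("Best learning rate found:", best["learning_rate"])
```

With eight candidates and `eta=2`, the loop runs three rounds (8, then 4, then 2 survivors), so poor configurations are dropped after only a small training budget.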

In [None]:
!cat ~/source_control/Code/adaptive.yaml

In [None]:
# Launch experiment to train the model with hyperparameter tuning
!~/.local/bin/det -m {determined_master} experiment create ~/source_control/Code/adaptive.yaml ~/source_control/Code

In [None]:
!kubectl get pods -n determinedai

In [None]:
# Delete the checkpoints data for the HPO training
myexp = !~/.local/bin/det -m {determined_master} experiment list | tail -1 | cut -d'|' -f 1
myexp = myexp[0].strip()  # first output line holds the experiment ID
print (f"{myexp}")
!~/.local/bin/det -m {determined_master} experiment set gc-policy --yes --save-experiment-best 0 --save-trial-best 0 --save-trial-latest 0 {myexp}

Third experiment: On the experiment detail page, we see the best categorical accuracy that Determined's adaptive search achieves over time. When the experiment finishes, we find that we reach 100% accuracy on the 30 test set examples, an improvement over the results of the fixed hyperparameter experiment. We can drill into the best-performing trial and view the associated hyperparameter values.