# Getting started with Determined AI, the open-source deep learning training platform - Lab 2

In this part of the lab, we consider the well-known iris classification problem: predicting the iris species from sepal and petal measurements. We use Determined AI to train and tune a TensorFlow Keras model on TensorFlow's publicly available iris [training](http://download.tensorflow.org/data/iris_training.csv) and [test](http://download.tensorflow.org/data/iris_test.csv) datasets.

The dataset consists of:
* 150 samples (120 in the training dataset, 30 in the test dataset)
* 3 labels: the iris species (Iris setosa, Iris virginica, and Iris versicolor)
* 4 features: sepal length, sepal width, petal length, and petal width, in cm

>**Note:** _The Iris training and validation/test datasets are stored in the HPE Ezmeral Data Fabric, a distributed file system integrated with HPE Ezmeral Runtime Enterprise. The shared file system has been made accessible to the Determined AI system._

In this part of the lab you will:
* Install the Determined AI CLI to interact with the Determined AI system running on a Kubernetes cluster.
* Use Determined AI to train your model with a single GPU, with multiple GPUs, and with hyperparameter tuning.
* Get familiar with the Determined AI WebUI and the Determined AI Command Line Interface (CLI).

### 1- Install Determined AI CLI

The Determined CLI is a command-line tool that allows you to interact with the Determined AI system. For example, the CLI allows you to launch a new experiment to train your deep learning model.

Here, you will use the line magic function below to install the **Determined AI CLI** as a new Python package in the current Jupyter kernel.

>**Note:** After installation of the package, you will need to restart the current kernel to use the newly installed package from the local Jupyter server.

In [None]:
# use the line magic function below to install new Python packages in the current Jupyter kernel.
# restart the current kernel to use the newly installed package
%pip install kfp-pipeline-spec --quiet
%pip install determined --quiet

### 2- Restart your Kernel to use the newly installed Python packages.
From the menu bar, select **"Kernel"** then **"Restart Kernel..."**

### 3- Fetch the KubeConfig file for your tenant user ID

When prompted for the password, make sure to enter the password for your StudentID credentials.

>**Note:** You can ignore _InsecureRequestWarning_ message

If the password is correct, you will see the message ***kubeconfig set for user Student\<YourID\>***.

In [None]:
%kubeRefresh

### 4- Determined AI components

For this hands-on workshop, the Determined AI system has been installed on the Kubernetes cluster managed by HPE Ezmeral Runtime Enterprise, in the Kubernetes namespace _determinedai_.

When installing Determined AI on Kubernetes, an instance of the **Determined Master** and a **Postgres database** are deployed in the Kubernetes cluster. Run the code cell below and check out the output. You should see one container POD for the Determined Master service and another POD for the Database service.

The Determined Master is the central component of the Determined System. The Master is responsible for:
* Tracking and storing the metadata of all Determined workloads (model training tasks) in the Postgres database: description, labels, hyperparameters, search algorithm used, validation metrics, start/end time, and logs.
* Scheduling Determined training workloads as a collection of Kubernetes PODs.
* Saving model artifacts (model files) and training checkpoints of Determined training workloads in a _checkpoint persistent storage_ to keep records of the progress and ensure workload resiliency.
* Serving the Web User Interface to visualize metrics across training tasks.


In [None]:
!kubectl get pod -n determinedai | grep determined

### 5- Set the endpoint URL of the Determined master

To interact with the Determined cluster, you need to tell the CLI where the Determined master service is running.

In [None]:
#
# Getting the DeterminedAI Master service endpoint URL:
#
#print ("The Determined Master service endpoint URL is: ")
masterUrl=!kubectl describe service determined-master-service-stagingdetai -n determinedai | grep gateway/8080 | awk '{print $3}'
det_master = str(masterUrl)[2:-2]
determined_master = "http://" + det_master
#print (f"The Determined Master Service endpoint URL is: http://{det_master}")
print (f"The Determined Master Service endpoint URL is: {determined_master}")
#print (f"{determined_master}")

#### If needed, replace the URL with the endpoint URL obtained from the output of the previous code cell.

In [None]:
# Test importing Determined.  If Determined is properly installed, you will see no output.
import determined as det

### 6- Authenticate to Determined AI

In [None]:
userID = "student900"
#
print ("")
print ("export DET_MASTER=" + determined_master) 
print ("~/.local/bin/det user login " + userID)

- Start a Terminal in the Launcher (navigate to the Launcher tab --> click the Terminal tile; or go to Menu --> File --> New Launcher --> Terminal).
- In the Terminal, copy/paste the two commands above to authenticate to Determined AI as user Student\<yourID\>.
- Press return when prompted to enter a password.
- Then continue from Step 7 onwards.

### 7- Check connectivity to the Determined AI Master service endpoint

In [None]:
# The Determined AI CLI is installed in '$HOME/.local/bin'
!~/.local/bin/det -m {determined_master} slot list

### 8- Create Determined AI experiments to train your model

In Determined AI, there are some foundational concepts to understand: experiment, trial, and hyperparameter.

**Experiment:** In Determined AI terms, an ***experiment*** is a collection of one or more DL training tasks. An experiment can either train a single model (with a single training task), or can define a search over a user-defined hyperparameter space (with several training tasks).

**Trial:** Each training task in an experiment is called a ***trial***. A trial is a training task that consists of the dataset (training and validation/test dataset), a deep learning model (e.g.: the Python scripts), and an experiment configuration file that defines the values for all of the model’s hyperparameters. All the elements of a training task are put together in a model definition directory.

**Hyperparameters:** These are user-defined variables that define how a model is trained. They affect the accuracy of the trained model. By choosing the best combination of hyperparameters you can obtain better end results. 

>**Note:** Determined AI on Kubernetes works by scheduling Determined workloads such as model training tasks as a collection of Kubernetes PODs.

In [None]:
!ls ~/source_control/Code -l

The Determined model definition directory contains:
- `model_def.py`: The TensorFlow Keras model definition exposed to Determined. The core code for the model. This includes building and compiling the model.
- `startup-hook.sh`: Additional dependencies that Determined will automatically install into each POD container (trial runner) for this experiment. Here, Pandas Python library will be installed.
- `*.yaml`: A set of configuration files that each define an experiment to train the model.
     - const-fsmount.yaml: Train the model with constant hyperparameter values and data located in NFS shared storage.
     - distributed-fsmount.yaml: Same as const-fsmount.yaml, but trains the model with multiple GPUs (distributed training).
     - adaptive-fsmount.yaml: Perform a hyperparameter search using Determined's state-of-the-art adaptive hyperparameter tuning algorithm.
     
**Data:** For this hands-on workshop, the Iris training dataset and validation/test dataset are stored in the HPE Ezmeral Data Fabric, a Distributed File System integrated into HPE Ezmeral Runtime Enterprise.

#### 8.1- Create an experiment to train a single instance of the model with a single GPU by defining the hyperparameters in a const-fsmount.yaml file

Let’s start simple by training the Iris model on a single GPU. We specify the hyperparameters as fixed values.

Let's take a closer look at this file:

In [None]:
!cat ~/source_control/Code/const-fsmount.yaml

#### Let's run your first experiment!  We submit the experiment configuration and model directory to the Determined master using the following CLI command:

The command below specifies the model definition directory to be used and the model configuration file _const-fsmount.yaml_. 

Optionally, the _-f_ flag could be used to print verbose output onto your terminal and to follow the progress of the experiment in real-time.

In [None]:
# launch experiment to train a single model on a single GPU
!~/.local/bin/det -m {determined_master} experiment create ~/source_control/Code/const-fsmount.yaml ~/source_control/Code

Using the command below, you will see that the Determined Master has launched the training task for your experiment as a container POD named _exp-\<experimentID\>-trial-\<TrialID\>-\<unique-name\>_.

In [None]:
!kubectl get pods -n determinedai

The command below is used to list your experiment and its status in the Determined system. 

Run the code cell regularly to track the execution progress until the status changes from **ACTIVE** to **COMPLETED**.

In [None]:
!~/.local/bin/det -m {determined_master} experiment list | tail -1
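
Instead of re-running the cell by hand, the same check can be automated with a small polling loop. The sketch below is only an illustration: `get_state` is a hypothetical callable that you would write yourself to return the experiment status string (for example, by parsing the CLI output above), and the set of terminal states is an assumption that may differ between Determined versions.

```python
import time

# Assumed terminal states; check your Determined version for the full list.
TERMINAL_STATES = {"COMPLETED", "CANCELED", "ERROR"}


def wait_for_experiment(get_state, poll_seconds=10.0, max_polls=60):
    """Poll get_state() until it returns a terminal state or max_polls is reached.

    get_state is a hypothetical callable that returns the status as a string.
    The last observed state is returned either way.
    """
    state = None
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    return state


# Usage with a stand-in status source that becomes COMPLETED on the third poll.
fake_states = iter(["ACTIVE", "ACTIVE", "COMPLETED"])
final_state = wait_for_experiment(lambda: next(fake_states), poll_seconds=0.0)
print(final_state)
```

Capping the number of polls keeps the cell from hanging forever if the experiment stays **ACTIVE** longer than expected.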

#### 8.2- Create an experiment to train a single instance of the model with multiple GPUs (distributed training) by defining the hyperparameters in a distributed-fsmount.yaml file

Next, let's run an experiment with the same model definition, but this time we leverage Determined's distributed training functionality. 

**Distributed training:** Determined can coordinate multiple GPUs, including GPUs spread across multiple machines, to train a single trial more quickly.

>**Note:** Distributed training performs best with complex models; therefore, the simple Iris model used in this example may not demonstrate the full benefits of using distributed training.

Determined AI automatically executes **[data parallel](https://www.oreilly.com/content/distributed-tensorflow/)** training without requiring any model code changes. All you need to do to start a multi-GPU training workload is to specify the desired number of GPUs you want to use in the experiment configuration file. For example:

                                      resources:
                                          slots_per_trial: 8

With this configuration, each trial within an experiment will use 8 GPUs to train a single model, whether leveraging 8 GPUs on a single machine or 8 GPUs across multiple machines in the Kubernetes cluster.
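
To get a feel for what data parallel training means for batch sizes, here is a small worked example. It assumes that the global batch size is split evenly across the slots (GPUs) of a trial, so each GPU processes `global_batch_size // slots_per_trial` samples per step; the value 32 and the helper name `per_slot_batch_size` are illustrative and are not taken from the lab's configuration files.

```python
def per_slot_batch_size(global_batch_size: int, slots_per_trial: int) -> int:
    """Return the number of samples each GPU handles per training step.

    Assumes the global batch is divided evenly across the slots of a trial.
    """
    if slots_per_trial < 1:
        raise ValueError("slots_per_trial must be at least 1")
    return global_batch_size // slots_per_trial


# Hypothetical global batch size, not read from the lab's YAML files.
global_batch_size = 32
for slots in (1, 2, 8):
    print(f"slots_per_trial={slots}: {per_slot_batch_size(global_batch_size, slots)} samples per GPU")
```

With a dataset as small as Iris, the per-GPU share shrinks quickly, which is one reason the benefit of adding GPUs is limited here.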


Let's take a closer look at the experiment configuration file for distributed training, then submit the experiment with the same command as above, this time with _distributed-fsmount.yaml_ as the experiment configuration file:


In [None]:
!cat ~/source_control/Code/distributed-fsmount.yaml

In [None]:
# launch experiment to train a single model on multiple GPUs
!~/.local/bin/det -m {determined_master} experiment create ~/source_control/Code/distributed-fsmount.yaml ~/source_control/Code

In [None]:
!kubectl get pods -n determinedai

#### 8.3- Train multiple models as part of a hyperparameter search, using Determined AI hyperparameter tuning functionality, by defining the hyperparameters in an adaptive-fsmount.yaml file

Next, let's run an experiment with the same model definition, but this time we leverage Determined's hyperparameter tuning to efficiently determine the hyperparameter values that yield the best-performing model.  We specify the hyperparameters as ranges instead of fixed values, and the `adaptive_simple` searcher to explore the hyperparameter space.

Determined AI will run multiple training tasks (trials), each with different hyperparameters. The Determined AI hyperparameter tuning functionality helps you find the best combination of hyperparameters for your particular model.

The number of trials to run, the user-defined hyperparameter ranges, and the search algorithm (also known as the searcher method) are defined in the configuration file _adaptive-fsmount.yaml_.

>**Note:** The **searcher** is a method that is used to find effective hyperparameter settings within a predefined range of hyperparameter values.

More about hyperparameter optimization with Determined AI can be found [here](https://docs.determined.ai/latest/training-hyperparameter/index.html#hyperparameter-tuning).
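
To illustrate the idea of searching over hyperparameter ranges, here is a minimal random-search sketch. It is not Determined's adaptive algorithm, which is more sophisticated; it only shows how a searcher can draw several trials from user-defined ranges and keep the best one. The hyperparameter names, the ranges, and the `toy_validation_accuracy` function are all hypothetical stand-ins and do not come from _adaptive-fsmount.yaml_.

```python
import math
import random

# Hypothetical search space: names and ranges are illustrative only.
SEARCH_SPACE = {
    "learning_rate": ("log", 1e-5, 1e-1),   # sampled uniformly in log10 space
    "layer1_dense_size": ("int", 4, 32),    # sampled as an integer, bounds included
}


def sample_hyperparameters(space, rng):
    """Draw one hyperparameter combination from the given ranges."""
    sample = {}
    for name, (kind, low, high) in space.items():
        if kind == "log":
            sample[name] = 10 ** rng.uniform(math.log10(low), math.log10(high))
        elif kind == "int":
            sample[name] = rng.randint(low, high)
        else:
            raise ValueError(f"unknown range type: {kind}")
    return sample


def toy_validation_accuracy(hparams):
    """Stand-in for training a trial; NOT a real model, just a smooth toy score."""
    lr_penalty = abs(math.log10(hparams["learning_rate"]) + 3) / 4
    size_bonus = hparams["layer1_dense_size"] / 32
    return max(0.0, 1.0 - 0.5 * lr_penalty) * (0.5 + 0.5 * size_bonus)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
trials = [sample_hyperparameters(SEARCH_SPACE, rng) for _ in range(8)]
best = max(trials, key=toy_validation_accuracy)
print("best hyperparameters found:", best)
```

An adaptive searcher improves on this by stopping poorly performing trials early and spending the saved compute on the promising ones.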

In [None]:
!cat ~/source_control/Code/adaptive-fsmount.yaml

In [None]:
# Launch experiment to train the model with hyperparameter tuning
!~/.local/bin/det -m {determined_master} experiment create ~/source_control/Code/adaptive-fsmount.yaml ~/source_control/Code

In [None]:
!kubectl get pods -n determinedai

### 9- Monitor and visualize your experiment in Determined AI Web User Interface

To access information on both training and validation performance, simply go to the WebUI by entering the service endpoint URL of the Determined master in your web browser connected to the Internet.

* Run the code cell below to get the Determined Master WebUI URL. 
* Then, click on the link to connect. 
* You will be prompted to enter your credentials. Type your StudentID as credentials and press return. The password is left blank.

In [None]:
port = !kubectl describe service determined-master-service-stagingdetai -n determinedai | grep gateway/8080 | awk '{print $3}' | cut -d':' -f 2
portUI = str(port)[2:-2]
print (f"The Determined Master WebUI URL is: http://notebooks.hpedev.io:{portUI}")
print (f"Login using your student Identifier: {userID}, do not enter password.")

First experiment: After the experiment completes, we can see on the experiment detail page that training the model with the hyperparameter settings in `const-fsmount.yaml` yields a validation accuracy of ~93%.

Third experiment: On the experiment detail page, we see the best categorical accuracy that Determined's adaptive search achieves over time. When the experiment finishes, we find that we reach 100% accuracy on the 30 test set examples, an improvement over the results of the fixed hyperparameter experiment. We can drill into the best-performing trial and view the associated hyperparameter values.

### 10- TensorBoard visualization
Determined AI is integrated with TensorBoard for deeper analysis. TensorBoard helps you better understand your neural network model by showing the training and validation loss curves for your experiment.

* Determined AI lets you launch a TensorBoard server and open TensorBoard in one click from the WebUI, or you can run the following command with the Determined CLI. This may take a minute or so, as Determined has to launch the TensorBoard server as a Kubernetes POD.

In [None]:
myexp=!~/.local/bin/det -m {determined_master} experiment list | tail -1 | cut -d'|' -f 1
# Take the first output line and strip the table padding around the experiment ID
myexp=myexp[0].strip()
print (f"{myexp}")
!~/.local/bin/det -m {determined_master} tensorboard start {myexp}

* Run the code cell below to get the TensorBoard URL for your experiment. Then, click on the link to connect.

>**Note:** The associated TensorBoard server is launched as a container POD in the Kubernetes cluster. Determined AI proxies HTTP requests to and from the TensorBoard container through the Determined AI master node.

In [None]:
#!~/.local/bin/det -m {determined_master} tensorboard list | grep RUNNING
mytensorboard=!~/.local/bin/det -m {determined_master} tensorboard list | grep RUNNING | cut -d'|' -f 1
# Take the first RUNNING entry and strip the table padding around the TensorBoard ID
mytensorboard=mytensorboard[0].strip()
#print (f"{mytensorboard}")
print (f"Your tensorboard is running at http://notebooks.hpedev.io:{portUI}/proxy/{mytensorboard}/")

* When you have finished with TensorBoard, run the code cell below to kill the TensorBoard process.

In [None]:
!~/.local/bin/det -m {determined_master} tensorboard kill {mytensorboard}

### 11- Inference with Determined AI
When you train a model with Determined, all of the artifacts (model files) and metrics associated with that training task are tracked and stored in _checkpoint storage_. The artifacts are accessible programmatically. This makes it easy to export your best-performing trained model out of Determined and load it for inference (the process of using a trained model and new data to make a prediction).

More information for downloading a trained model can be found [here](https://docs.determined.ai/latest/post-training/use-trained-models.html).

In code, this looks like:

In [None]:
# After the experiment completes, you can use the line below to load it without re-training the model

userID = "student900"
passw = ""

masterUrl=determined_master
print (f"Determined Master service endpoint is: {masterUrl}")
print ("")
print ("Fetch the Experiment Id of your last Experiment.")
myexp=!~/.local/bin/det -m {determined_master} experiment list | tail -1 | cut -d'|' -f 1
# Take the first output line and strip the table padding around the experiment ID
myexp=myexp[0].strip()
print (f"Your last experiment ID is:{myexp}")

In [None]:
!~/.local/bin/det -m {determined_master} experiment download {myexp}

In [None]:
from determined.experimental import client
client.login(master=determined_master, user=userID, password=passw)
print(f"Loading top checkpoint from Experiment ID: {myexp}")
# Get the top checkpoint of the trained model.
# top_checkpoint() returns a Checkpoint instance for the checkpoint that has the best validation metric.
checkpoint = client.get_experiment(myexp).top_checkpoint()
#
#print(f"Downloaded checkpoint to: {checkpoint}")


In [None]:
model = checkpoint.load()
print(f"Downloaded checkpoint to: {checkpoint}")
print("Loaded trained model")

In [None]:
# Now you can use the model to make predictions.

import numpy as np

X_new = np.array([[3, 2, 1, 0.2], [4.9, 2.2, 3.8, 1.1], [5.3, 2.5, 4.6, 1.9]])
# Predict the species for each input vector (assumes the loaded object is a Keras model exposing predict())
prediction = model.predict(X_new)
print("Prediction of Species: {}".format(prediction))
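
The raw prediction is usually easier to read once it is turned into species names. The sketch below assumes that the model returns one row of three class scores per sample and that the label order follows the TensorFlow iris CSV files (0 = setosa, 1 = versicolor, 2 = virginica); both are assumptions to verify against your own `model_def.py`. The scores in `example_scores` are made up for illustration and are not real model output.

```python
# Assumed label order, following the TensorFlow iris CSV files.
SPECIES = ["Iris setosa", "Iris versicolor", "Iris virginica"]


def to_species(scores):
    """Map each row of class scores to the name of the highest-scoring species."""
    names = []
    for row in scores:
        best_index = max(range(len(row)), key=lambda i: row[i])
        names.append(SPECIES[best_index])
    return names


# Made-up scores for three samples, NOT real model output.
example_scores = [
    [0.90, 0.08, 0.02],
    [0.10, 0.70, 0.20],
    [0.05, 0.25, 0.70],
]
print(to_species(example_scores))
```

Because the helper only uses indexing and `len`, it should also accept the NumPy array returned by a Keras-style `predict` call.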