# Getting started with Determined, the open-source deep learning training platform - Lab 3

In this part of the lab, you will learn how to create an experiment that trains a single instance of the model with multiple GPUs, a process known as **Distributed Training**. Again, this experiment will feature a single trial with a set of constant hyperparameters.

Determined can coordinate multiple GPUs, on a single machine or across multiple machines, to train a deep learning model more quickly. Typically, ML engineers use Distributed Training to train models on larger datasets in order to improve model performance and accuracy, which requires additional compute resources.

Determined automatically executes **[data parallelization](https://www.oreilly.com/content/distributed-tensorflow/)** training, where the dataset is divided into multiple pieces and distributed across the GPUs, **without requiring any model code changes**. Each GPU holds a full copy of the model and trains it on its portion of the data. Determined coordinates the training across multiple GPUs, on a single machine or across multiple machines, to keep the whole training task in sync.
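
To make the idea concrete, here is a minimal, framework-free sketch of data-parallel training for a one-parameter linear model. It is an illustration only and not how Determined implements it; the function names (`shard`, `gradient`, `data_parallel_gradient`) and the toy data are hypothetical. Each simulated worker holds the same weight `w`, computes a gradient on its own shard of the data, and the size-weighted average of those gradients matches the gradient computed on the full dataset.

```python
def shard(data, num_workers):
    """Split the data into num_workers pieces, round-robin."""
    return [data[i::num_workers] for i in range(num_workers)]


def gradient(w, samples):
    """Mean gradient of the squared error 0.5 * (w * x - y) ** 2 with respect to w."""
    return sum((w * x - y) * x for x, y in samples) / len(samples)


def data_parallel_gradient(w, data, num_workers):
    """Average the per-shard gradients, weighting each shard by its size."""
    shards = [s for s in shard(data, num_workers) if s]
    # Every worker uses the same full model (here just w) on its own shard.
    local = [(gradient(w, s), len(s)) for s in shards]
    total = sum(n for _, n in local)
    return sum(g * n for g, n in local) / total


# Toy dataset of (x, y) pairs, made up for this example.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
full = gradient(0.5, data)
parallel = data_parallel_gradient(0.5, data, num_workers=2)
print(full, parallel)
```

Because both approaches yield the same gradient, the model update is the same as in single-GPU training with the same total batch, while each worker only processes a fraction of the data.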

> <font color="green"> **Note:** Distributed Training performs best with complex models; therefore, the simple Iris model used in this example may not demonstrate the full benefits of using Distributed Training.</font>

>_Note: To learn more about Distributed Training with Determined, check out the online documentation [here](https://docs.determined.ai/latest/training-distributed/index.html)._ 

### 1- Create an experiment to train a single instance of the model with multiple GPUs (distributed training)

Let's run an experiment with the same model definition (same code), but this time leveraging Determined's distributed training functionality using the _distributed.yaml_ experiment configuration file. 

#### Let's take a closer look at the experiment configuration file for distributed training:

In [None]:
!cat ~/source_control/Code/distributed.yaml

As you can see here, this configuration file is very similar to the _const.yaml_ file you used earlier. All you need to do to start a multi-GPU training workload (trial) is to specify the number of GPUs you want to use in the experiment configuration file, and Determined takes care of the rest. For example:

                                      resources:
                                          slots_per_trial: 2

With this configuration, each trial within an experiment will use **2 GPUs** to train a single model, whether leveraging 2 GPUs on a single machine or 2 GPUs across multiple machines in the Kubernetes cluster.
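
The relationship between the two configuration files can be sketched in Python. This is a rough illustration that assumes the configuration has already been loaded into a nested dictionary; the helper name `with_slots` and every field value other than `slots_per_trial` are placeholders, not the contents of the lab files.

```python
import copy


def with_slots(config, slots_per_trial):
    """Return a copy of an experiment config that requests the given number of GPUs per trial."""
    new_config = copy.deepcopy(config)
    new_config.setdefault("resources", {})["slots_per_trial"] = slots_per_trial
    return new_config


# Placeholder stand-in for a single-GPU configuration such as const.yaml.
const_config = {"name": "placeholder-const", "hyperparameters": {"placeholder": 1}}

distributed_config = with_slots(const_config, 2)
print(distributed_config["resources"])
```

The original dictionary is left untouched, which mirrors the lab setup: the model code and the rest of the configuration stay the same, and only the `resources` section differs.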

#### Next, submit the experiment with the experiment configuration file _distributed.yaml_:

In [None]:
masterUrl=!kubectl describe service determined-master-service-stagingdetai -n determinedai | grep gateway/8080 | awk '{print $3}'
det_master = str(masterUrl)[2:-2] # we remove any potential brackets
determined_master = "http://" + det_master
#
# launch experiment to train a single model on multiple GPUs
!~/.local/bin/det -m {determined_master} experiment create ~/source_control/Code/distributed.yaml ~/source_control/Code
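
The `str(masterUrl)[2:-2]` slicing in the cell above works by stripping the `['` and `']` from the string form of a one-element list, so it relies on the command returning exactly one line. A slightly more defensive alternative is sketched below. The helper names `single_line` and `master_url` are hypothetical, and the host name in the example is made up; the only assumption about IPython is that the value captured with `x = !command` behaves like a list of output lines.

```python
def single_line(lines):
    """Return the only non-empty line from captured command output."""
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) != 1:
        raise ValueError(f"expected exactly one line of output, got {len(cleaned)}")
    return cleaned[0]


def master_url(lines, scheme="http"):
    """Build the master URL from the captured host:port line."""
    return f"{scheme}://{single_line(lines)}"


# Made-up host and port, standing in for the kubectl output.
print(master_url(["gateway.example.com:13000"]))
```

In the notebook, this could be used as `determined_master = master_url(masterUrl)`, which fails with a clear error if the service lookup returns nothing or more than one line.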

In the lab environment, the Kubernetes worker hosts each have only one GPU. The training task (trial) needs 2 GPUs as per the experiment configuration file. Therefore, Determined Master brings up two PODs for the same trial, on two different Kubernetes worker hosts.

#### Using the command below, you will see that Determined Master has launched **two** PODs for the training task in the Kubernetes cluster, with names of the form:

 _exp-\<experimentID\>-trial-\<TrialID\>-\<unique-name\>_

> **Note:** Notice the Trial ID is the same for the two PODs, which means your experiment features a single trial with a fixed set of hyperparameters. 

> <font color="blue"> **Note:** As you are sharing the same Kubernetes resources with other participants, and depending on the number of concurrent experiments running, your experiment's training task might be in **Pending** state. You might need to wait a few minutes until other experiments complete for your training task to become **Running**.</font>

In [None]:
!kubectl get pods -n determinedai
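
If many participants are running experiments, the POD list can get long. The sketch below groups POD names by experiment and trial using the naming pattern described above. It assumes the experiment and trial IDs are numeric; the helper name `group_pods_by_trial` and the sample names are made up for illustration.

```python
import re
from collections import defaultdict

# Pattern for names of the form exp-<experimentID>-trial-<TrialID>-<unique-name>.
POD_NAME = re.compile(r"^exp-(?P<experiment>\d+)-trial-(?P<trial>\d+)-(?P<suffix>.+)$")


def group_pods_by_trial(pod_names):
    """Map (experiment ID, trial ID) to the list of POD names that belong to it."""
    groups = defaultdict(list)
    for name in pod_names:
        match = POD_NAME.match(name)
        if match:  # ignore PODs that are not training tasks
            key = (int(match["experiment"]), int(match["trial"]))
            groups[key].append(name)
    return dict(groups)


# Made-up POD names for illustration.
pods = ["exp-12-trial-34-aaa-bbb", "exp-12-trial-34-ccc-ddd", "determined-master-xyz"]
print(group_pods_by_trial(pods))
```

Two names under the same key indicate a single trial spread over two PODs, which is what you should observe for this distributed experiment.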

#### Run the code cell below to monitor the execution progress of the experiment.

In [None]:
!~/.local/bin/det -m {determined_master} experiment list | tail -1
# Get the experiment Id, remove spaces
myexpId=!~/.local/bin/det -m {determined_master} experiment list | tail -1 | cut -d'|' -f 1 |  tr -d ' '
# remove the surrounding list brackets and quotes
myexpId=str(myexpId)[2:-2]
!~/.local/bin/det -m {determined_master} experiment describe {myexpId} --json | jq .[0].state
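
Rather than re-running the cell by hand, you could poll the state until the experiment stops. The following is a minimal sketch, not part of the lab code: `wait_for_experiment` is a hypothetical helper, `get_state` is any function you supply that returns the current state string, and the set of terminal states is an assumption.

```python
import time

# Assumption: these are the states in which an experiment has stopped running.
TERMINAL_STATES = {"COMPLETED", "CANCELED", "ERROR"}


def wait_for_experiment(get_state, poll_seconds=10.0, timeout_seconds=1800.0, sleep=time.sleep):
    """Call get_state() repeatedly until it returns a terminal state, then return it."""
    waited = 0.0
    while True:
        # jq prints JSON strings with surrounding quotes, for example "ACTIVE".
        state = get_state().strip().strip('"')
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"experiment still {state} after {waited:.0f} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing `sleep` as a parameter keeps the helper easy to try out with a fake state source before pointing it at the real `det experiment describe` command.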

### 2- Monitor and visualize your experiment in Determined Web User Interface

To monitor the progress of the training task and access information on both training and validation performance, simply return to the Determined **WebUI**.

##### From the **Dashboard**, select the most recent experiment.

You should see the experiment in an **active** state. The graph shows the training accuracy metric, and you can see it change in real time as the experiment runs.

> <font color="blue"> **Important Note:** If there are multiple concurrent participants in the workshop, your experiment might not be in an active state yet. This happens when there are more experiments running than the Kubernetes cluster has GPUs. You might need to wait a few minutes until other experiments complete for your experiment to turn active. </font>

From the **Metrics** menu, under **Training Metrics**, select _categorical_accuracy_ (see picture below for an example). This metric indicates the model accuracy on training data while the _val_categorical_accuracy_ indicates the model accuracy on validation data.
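
As a reminder of what this metric measures, here is a small pure-Python sketch of categorical accuracy: the fraction of samples for which the class with the highest predicted score matches the one-hot label. It is an illustration with made-up labels and predictions, not the code used by the model in this lab.

```python
def categorical_accuracy(y_true, y_pred):
    """Fraction of samples whose top predicted class matches the one-hot label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    def argmax(values):
        return max(range(len(values)), key=lambda i: values[i])

    correct = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up one-hot labels and predicted class probabilities for three classes.
labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
predictions = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.3, 0.4, 0.3], [0.1, 0.6, 0.3]]
accuracy = categorical_accuracy(labels, predictions)
print(accuracy)  # 3 of the 4 samples are classified correctly: 0.75
```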

After the experiment completes, you can see on the experiment detail page that training the model with the hyperparameter settings in `distributed.yaml` yields a validation accuracy between 93% and 97%. 

Scroll down to see a list of training and validation workloads and their metrics for the metric types you previously selected.
You might see one or two validation workloads with checkpoints. By default, Determined will checkpoint the most recent and the best model per training task (trial). If the most recent checkpoint is also the best checkpoint for a given trial, only one checkpoint will be saved for that trial as shown in the picture below.

<img src="WebUI-Exp-distribute-graph.png" height="520" width="900">

### 3- TensorBoard visualization

You can also use [TensorBoard](https://www.tensorflow.org/tensorboard) for visualizing and inspecting the deep learning models. 

#### Run the code cell below to launch the TensorBoard server instance.

This may take a minute or so, as Determined has to launch the TensorBoard server as a Kubernetes POD.

In [None]:
print(f"Start a TensorBoard server instance for your Experiment {myexpId} with TensorBoard instance ID:")
# start the TensorBoard server instance for the experiment
!~/.local/bin/det -m {determined_master} tensorboard start -d {myexpId}

#### Run the code cell below to get the TensorBoard URL for your experiment. Then, click on the link to connect.

>**Note:** The associated TensorBoard server is launched as a container POD in the Kubernetes cluster. Determined proxies HTTP requests to and from the TensorBoard container through the Determined Master node.

In [None]:
mytensorboard=!~/.local/bin/det -m {determined_master} tensorboard list | grep RUNNING | cut -d'|' -f 1 |  tr -d ' '
mytensorboard=str(mytensorboard)[2:-2]
#print (f"{mytensorboard}")
print(f"Your TensorBoard is running at http://notebooks.hpedev.io:{portUI}/proxy/{mytensorboard}/")
print("Click on the link to connect.")

<img src="TensorBoard-distribute-graph.png" height="413" width="900">

Determined created TensorBoard plots to show the training loss, validation loss, training accuracy and validation accuracy for the training task (trial).

#### When you have finished with TensorBoard, run the code cell below to `kill` the TensorBoard process.

In [None]:
!~/.local/bin/det -m {determined_master} tensorboard kill {mytensorboard}

### 4- List the best model created by the training process
By default, Determined will save the most recent and the best checkpoint per training task (trial) according to the validation metrics specified in the Searcher section of the configuration file for the experiment.

* _det experiment list-checkpoints [--best] [N best checkpoints to return] \<experiment_Id\>_

>**Note**: Upon completion of the training task, if the most recent checkpoint is also the best checkpoint for a given trial, only one checkpoint will be saved for that trial by Determined. Otherwise, two checkpoints will be saved. Other checkpoints will be automatically deleted to reclaim space.

#### Run the code cell below to display the best checkpoint for your experiment

In [None]:
#list the best Trial checkpoint(s) (training task):
!~/.local/bin/det -m {determined_master} experiment list-checkpoints --best 1 {myexpId}

### 5- Delete the checkpoints to reclaim storage space

The default **checkpoint garbage collection policy** directs Determined to save the most recent and the best checkpoint per training task (trial). The ***save_experiment_best***, ***save_trial_best*** and ***save_trial_latest*** parameters specify which checkpoints to save. The default policy is set as follows:

  * save_experiment_best: 0
  * save_trial_best: 1
  * save_trial_latest: 1
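
The effect of these parameters on a single trial can be sketched as follows. This is a simplified model for illustration, not Determined's garbage collection code: the helper `checkpoints_to_keep` is hypothetical, it ignores `save_experiment_best` (which defaults to 0), and it assumes each checkpoint is described by the number of batches trained and a validation metric where smaller is better.

```python
def checkpoints_to_keep(checkpoints, save_trial_best=1, save_trial_latest=1, smaller_is_better=True):
    """Return the checkpoints of one trial that the policy would retain.

    checkpoints is a list of (batches_trained, validation_metric) tuples.
    """
    by_metric = sorted(checkpoints, key=lambda c: c[1], reverse=not smaller_is_better)
    by_recency = sorted(checkpoints, key=lambda c: c[0], reverse=True)
    # A checkpoint that is both the best and the latest is only kept once.
    keep = set(by_metric[:save_trial_best]) | set(by_recency[:save_trial_latest])
    return sorted(keep)


# Made-up checkpoints: (batches trained, validation loss).
history = [(100, 0.40), (200, 0.25), (300, 0.30)]
print(checkpoints_to_keep(history))        # best and latest differ: two checkpoints
print(checkpoints_to_keep(history, 0, 0))  # all parameters at 0: nothing is kept
```

With the defaults, one or two checkpoints survive per trial, depending on whether the latest checkpoint is also the best one. Setting every parameter to 0 retains nothing.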
 
#### Run the code cell below to reclaim storage space by setting all three checkpoint garbage collection parameters to 0 for your experiment, which deletes its checkpoints:

In [None]:
# Delete the checkpoint data for the distributed training experiment
!~/.local/bin/det -m {determined_master} experiment set gc-policy --yes --save-experiment-best 0 --save-trial-best 0 --save-trial-latest 0 {myexpId}

#### Now, let's explore a Hyperparameter Optimization experiment.
Click on Lab 4 below to open a notebook to explore Hyperparameter Optimization (HPO). 
* [Lab 4](4-WKSHP-DET-AI-101-Getting-started-HPO.ipynb)