# Training A `TensorFlow.Keras` Classifier With And Without `Horovod`

This test trains a model on the MNIST dataset using TensorFlow.Keras, once without Horovod and once with it. It then verifies that:

  * Horovod did not degrade the accuracy.
  * The Horovod run was faster (only expected on large datasets).

## General Configurations

In [None]:
# Per ML-3824, tensorflow and mlrun must be installed in the same command:
!pip install plotly tensorflow mlrun

In [1]:
# Number of epochs to train (increases the training time without increasing the memory usage):
N_EPOCHS = 4

# Number of ranks (Horovod workers) to deploy for the Open MPI job:
N_RANKS = 4

## 1. Training Code

1. Get the MNIST data from `tensorflow.keras.datasets`.
2. Initialize a model.
3. Run training on the training set with validation on the testing set.

The accuracy score is logged as a run result by MLRun auto-logging.

In [2]:
# mlrun: start-code

In [3]:
from typing import Tuple
import time
import tensorflow as tf

import mlrun
import mlrun.frameworks.tf_keras as mlrun_tf_keras


def get_datasets(batch_size: int) -> Tuple[tf.data.Dataset, tf.data.Dataset]:
    # Download the data:
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    
    # Initialize tensorflow datasets:
    train_set = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size)
    test_set = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(batch_size)
    
    return train_set, test_set


def get_model() -> tf.keras.Model:
    # Build the model architecture:
    inputs = tf.keras.Input(shape=(28, 28))
    x = tf.keras.layers.experimental.preprocessing.Rescaling(1.0 / 255)(inputs)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
    
    # Initialize a model:
    model = tf.keras.Model(inputs, outputs)
    
    return model


@mlrun.handler(outputs=["time"])
def train(context: mlrun.MLClientCtx, n_epochs: int):
    # Start the timer:
    run_time = time.time()
    
    # Get the data:
    batch_size = 32
    train_set, test_set = get_datasets(batch_size=batch_size)

    # Get the model:
    model = get_model()
    
    # Apply MLRun:
    mlrun_tf_keras.apply_mlrun(model=model, context=context)

    # Compile the model:
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"]
    )

    # Train:
    model.fit(
        train_set,
        validation_data=test_set,
        epochs=n_epochs,
        steps_per_epoch=len(train_set) // batch_size,
    )
    run_time = time.time() - run_time
    
    return run_time

In [4]:
# mlrun: end-code
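
As a worked example of the `steps_per_epoch` expression in `train`: `train_set` is already batched, so `len(train_set)` counts batches rather than samples. The sketch below reproduces that arithmetic in plain Python, assuming the standard MNIST training split of 60,000 images. The helper name `steps_per_epoch` is hypothetical and not part of the notebook.

```python
import math


def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    """Mirror `len(train_set) // batch_size` for a dataset built with `.batch(batch_size)`."""
    # `.batch()` keeps the final partial batch by default, hence the ceiling:
    n_batches = math.ceil(n_samples / batch_size)
    return n_batches // batch_size


# Assumption: 60,000 training images, as in the standard MNIST split.
n_batches = math.ceil(60_000 / 32)
steps = steps_per_epoch(n_samples=60_000, batch_size=32)
print(n_batches, steps)  # prints: 1875 58
```

With these numbers, `fit` is asked for 58 training steps per epoch out of 1,875 available batches.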

## 2. Create a Project

1. Create the MLRun project.
2. Create MLRun functions from the training code: one `job` and one `mpijob`.

In [5]:
import numpy as np
import mlrun

In [6]:
# Create the project:
project = mlrun.get_or_create_project(name="horovod-tensorflow-test", context="./", user_project=True)

> 2022-12-26 12:16:58,139 [info] loaded project horovod-tensorflow-test from MLRun DB


In [7]:
# Create the job function:
job_function = project.set_function(name="train_job", kind="job", image="mlrun/ml-models", handler="train")
job_function.apply(mlrun.auto_mount())

# Create the open mpi function:
mpijob_function = project.set_function(name="train_mpijob", kind="mpijob", image="mlrun/ml-models", handler="train")
mpijob_function.apply(mlrun.auto_mount())
mpijob_function.spec.replicas = N_RANKS

## 3. Run As A Job

Run the training as a `job` and store the results.

In [8]:
# Run as a job:
job_run = job_function.run(
    name="training_job",
    params={
        "n_epochs": N_EPOCHS,
    },
)

# Store results:
job_time = job_run.status.results['time']
job_accuracy = job_run.status.results['validation_accuracy']

> 2022-12-26 12:17:11,730 [info] starting run training_job uid=7a436b0e191a445d99c6f70c33781c25 DB=http://mlrun-api:8080
> 2022-12-26 12:17:11,930 [info] Job is running in the background, pod: training-job-jtp75
2022-12-26 12:17:18.493877: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-12-26 12:17:18.493925: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
2022-12-26 12:17:22.222287: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-12-26 12:17:22.222636: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot ope

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| horovod-tensorflow-test-guyl | ...33781c25 | 0 | Dec 26 12:17:18 | completed | training_job | v3io_user=guyl, kind=job, owner=guyl, mlrun/client_version=1.2.1-rc4, host=training-job-jtp75 | | n_epochs=12 | n_epochs=12, lr=0.009999999776482582, training_loss=0.493896484375, training_accuracy=0.875, validation_loss=0.4776747340973193, validation_accuracy=0.875699954291883, time=24.575107097625732 | training_loss.html, training_accuracy.html, validation_loss.html, validation_accuracy.html, loss_summary.html, accuracy_summary.html, lr_values.html, model |





> 2022-12-26 12:17:46,107 [info] run executed, status=completed


## 4. Run As An MPIJob

Run the training as an `mpijob` and store the results.

In [9]:
# Run as a mpijob:
mpijob_run = mpijob_function.run(
    name="training_mpijob",
    params={
        "n_epochs": N_EPOCHS,
    },
)

# Store results:
mpijob_time = mpijob_run.status.results['time']
mpijob_accuracy = mpijob_run.status.results['validation_accuracy']

> 2022-12-26 12:17:46,153 [info] starting run training_mpijob uid=4f7bbe1a3ead410bb09946092d03531e DB=http://mlrun-api:8080
> 2022-12-26 12:17:56,521 [info] MpiJob training-mpijob-a3d4b5d9 launcher pod training-mpijob-a3d4b5d9-launcher state active
+ POD_NAME=training-mpijob-a3d4b5d9-worker-3
+ shift
+ /opt/kube/kubectl exec training-mpijob-a3d4b5d9-worker-3 -- /bin/sh -c        PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "572915712" -mca ess_base_vpid 4 -mca ess_base_num_procs "5" -mca orte_node_regex "training-mpijob-a[1:3]d4b5d9-launcher,training-mpijob-a[1:3]d4b5d9-worker-0,training-mpijob-a[1:3]d4b5d9-worker-1,training-mpijob-a[1:3]d4b5d9-worker-2,training-mpijob-a[1:3]d4b5d9-worker-3@0(5)" -mca orte_hnp_uri "572915712.0;tcp://10.200.83.234:33209" -mca plm "rsh" --tree-spa

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| horovod-tensorflow-test-guyl | ...2d03531e | 0 | Dec 26 12:18:01 | completed | training_mpijob | v3io_user=guyl, kind=mpijob, owner=guyl, mlrun/client_version=1.2.1-rc4, mlrun/job=training-mpijob-a3d4b5d9, host=training-mpijob-a3d4b5d9-worker-0 | | n_epochs=12 | n_epochs=12, lr=0.03999999910593033, training_loss=0.4735088348388672, training_accuracy=0.875, validation_loss=0.5860166702026757, validation_accuracy=0.8463000215280551, time=26.737416744232178 | training_loss.html, training_accuracy.html, validation_loss.html, validation_accuracy.html, loss_summary.html, accuracy_summary.html, lr_values.html, model |





> 2022-12-26 12:18:33,882 [info] run executed, status=completed
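
The two run summaries log different `lr` values: about 0.01 for the job and about 0.04 for the MPIJob. The ratio equals the four worker ranks visible in the launcher log, which is consistent with the common Horovod practice of scaling the learning rate by the number of workers. The notebook does not state the cause, so treat this as an observation rather than a description of MLRun's behavior. A minimal sketch of the check, with the values copied from the summaries:

```python
import math

# Values copied from the logged results of the two runs:
job_lr = 0.009999999776482582
mpijob_lr = 0.03999999910593033
n_ranks = 4  # workers 0 to 3 appear in the launcher log

# Compare the ratio of the logged learning rates with the number of ranks:
ratio = mpijob_lr / job_lr
lr_scaled_by_ranks = math.isclose(ratio, n_ranks, rel_tol=1e-6)
print(f"ratio={ratio:.4f}, scaled by ranks: {lr_scaled_by_ranks}")
```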


## 5. Compare Runtimes

1. Print a summary message.
2. Verify that:
  * The mpijob run took less time (only expected on stronger machines).
  * The accuracy values of the two runs are close.

In [None]:
# Delete the MLRun project:
mlrun.get_run_db().delete_project(name=project.name, deletion_strategy="cascading")

In [10]:
# Print the test's collected results:
print(
    "Job:\n"
    f"\t{job_time:.2f} Seconds\n"
    f"\tAccuracy: {job_accuracy}"
)
print(
    "Open MPI Job (Horovod):\n"
    f"\t{mpijob_time:.2f} Seconds\n"
    f"\tAccuracy: {mpijob_accuracy}\n"
)

# Verification (only meaningful on a stronger machine; requires big data and longer training):
# assert mpijob_time < job_time
# assert np.isclose(job_accuracy, mpijob_accuracy, atol=0.1)

Job:
	24.58 Seconds
	Accuracy: 0.875699954291883
Open MPI Job (Horovod):
	26.74 Seconds
	Accuracy: 0.8463000215280551
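
The commented-out assertions can also be expressed as a small report that never raises, which is convenient on machines where the speedup is not expected. This is a sketch only: `compare_runs` is a hypothetical helper, it uses `math.isclose` with an absolute tolerance in place of `np.isclose`, and the sample values are the ones recorded in the run results.

```python
import math


def compare_runs(job_time, job_accuracy, mpijob_time, mpijob_accuracy, atol=0.1):
    """Return the two verification outcomes as booleans instead of asserting."""
    return {
        "mpijob_faster": mpijob_time < job_time,
        "accuracy_close": math.isclose(
            job_accuracy, mpijob_accuracy, rel_tol=0.0, abs_tol=atol
        ),
    }


# Values taken from the recorded runs:
report = compare_runs(
    job_time=24.575107097625732,
    job_accuracy=0.875699954291883,
    mpijob_time=26.737416744232178,
    mpijob_accuracy=0.8463000215280551,
)
print(report)
```

For the recorded runs the accuracies agree within the tolerance, while the MPIJob was slightly slower.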

