# Training A `TensorFlow.Keras` Classifier With And Without `Horovod`

This test trains a model on the MNIST dataset using TensorFlow.Keras, once with Horovod and once without. It then verifies that:

  * The accuracy did not degrade in the Horovod run.
  * The Horovod run was faster (only observable with large datasets).

## General Configurations

In [None]:
# per ML-3824 need to install tensorflow and mlrun in the same command
!pip install plotly tensorflow==2.15.1 mlrun  # TODO: remove 2.15.1 here and in functions requirements after ML-6189 fix

In [None]:
# Number of epochs to train (to increase the training time without increasing the memory usage):
N_EPOCHS = 4

# Number of ranks (Horovod workers) to deploy for the Open MPI job:
N_RANKS = 4

## 1. Training Code

1. Get the MNIST data from `tensorflow.keras.datasets`.
2. Initialize a model.
3. Run training on the training set with validation on the testing set.

The accuracy score is logged as a run result by MLRun auto-logging.
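As a rough sketch of how much data each epoch actually consumes, the arithmetic below assumes the standard MNIST split of 60,000 training images and reuses the batch size of 32 and the `steps_per_epoch` expression from the training code. The helper name `batches_per_epoch` is illustrative and is not part of MLRun or TensorFlow.

```python
import math

# Assumption: the standard MNIST training split has 60,000 images.
N_TRAIN_SAMPLES = 60_000
BATCH_SIZE = 32  # same value as in the training code


def batches_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches a batched dataset yields (the last batch may be partial)."""
    return math.ceil(n_samples / batch_size)


# Equivalent of `len(train_set)` for the batched training dataset:
n_batches = batches_per_epoch(N_TRAIN_SAMPLES, BATCH_SIZE)

# The training code passes `len(train_set) // batch_size` as `steps_per_epoch`:
steps = n_batches // BATCH_SIZE

print(f"batches available per epoch: {n_batches}")
print(f"steps actually run per epoch: {steps}")
print(f"samples seen per epoch: {steps * BATCH_SIZE}")
```

Under these assumptions each epoch runs 58 of the 1,875 available batches, which keeps the test short but also means the timings reflect a small workload.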

In [None]:
# mlrun: start-code

In [None]:
from typing import Tuple
import time
import tensorflow as tf

import mlrun
import mlrun.frameworks.tf_keras as mlrun_tf_keras


def get_datasets(batch_size: int) -> Tuple[tf.data.Dataset, tf.data.Dataset]:
    # Download the data:
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    
    # Initialize tensorflow datasets:
    train_set = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size)
    test_set = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(batch_size)
    
    return train_set, test_set


def get_model() -> tf.keras.Model:
    # Build the model architecture:
    inputs = tf.keras.Input(shape=(28, 28))
    x = tf.keras.layers.Rescaling(1.0 / 255)(inputs)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
    
    # Initialize a model:
    model = tf.keras.Model(inputs, outputs)
    
    return model


@mlrun.handler(outputs=["time"])
def train(context: mlrun.MLClientCtx, n_epochs: int):
    # Start the timer:
    run_time = time.time()
    
    # Get the data:
    batch_size = 32
    train_set, test_set = get_datasets(batch_size=batch_size)

    # Get the model:
    model = get_model()
    
    # Apply MLRun:
    mlrun_tf_keras.apply_mlrun(model=model, context=context)

    # Compile the model:
    model.compile(
        optimizer=tf.keras.optimizers.legacy.SGD(learning_rate=0.1, momentum=0.9),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"]
    )

    # Train:
    model.fit(
        train_set,
        validation_data=test_set,
        epochs=n_epochs,
        # `len(train_set)` already counts batches, so this runs only about 1/batch_size of each epoch:
        steps_per_epoch=len(train_set) // batch_size,
    )
    run_time = time.time() - run_time
    
    return run_time

In [None]:
# mlrun: end-code

## 2. Create a Project

1. Create the MLRun project.
2. Create an MLRun function of the training code.

In [None]:
import numpy as np
import mlrun

In [None]:
# Create the project:
project = mlrun.get_or_create_project(name="horovod-tensorflow-test", context="./", user_project=True)

In [None]:
# Create the job function:
job_function = project.set_function(name="train_job", kind="job", image="mlrun/mlrun", handler="train", requirements=["tensorflow==2.15.1"])
job_function.apply(mlrun.auto_mount())
job_function.deploy()

In [None]:
# Create the open mpi function:
mpijob_function = project.set_function(name="train_mpijob", kind="mpijob", image="mlrun/mlrun", handler="train", requirements=["horovod[tensorflow]"])
mpijob_function.apply(mlrun.auto_mount())
mpijob_function.spec.replicas = N_RANKS
mpijob_function.with_commands(["pip install tensorflow==2.15.1"])
mpijob_function.deploy(builder_env={"HOROVOD_WITH_TENSORFLOW": "1"})

## 3. Run As A Job

Run the training as a `job` and store the results.
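The results of a run are read from a dictionary keyed by result name. A minimal sketch of reading them defensively, so that a missing key fails with a clear message, could look like the following. The `require_results` helper and the sample dictionary are hypothetical: the placeholder values stand in for `job_run.status.results` and are not measured results.

```python
from typing import Dict, Tuple


def require_results(results: Dict[str, float], keys: Tuple[str, ...]) -> Tuple[float, ...]:
    """Return the requested values, raising a descriptive error if any key is missing."""
    missing = [key for key in keys if key not in results]
    if missing:
        raise KeyError(f"Run results are missing: {missing}. Available: {sorted(results)}")
    return tuple(results[key] for key in keys)


# Placeholder values for illustration only (not real measurements):
sample_results = {"time": 12.5, "validation_accuracy": 0.9, "loss": 0.3}

run_time, accuracy = require_results(sample_results, ("time", "validation_accuracy"))
print(run_time, accuracy)
```

The same helper could be applied to both runs, so that a renamed or missing metric is reported before the comparison step.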

In [None]:
# Run as a job:
job_run = job_function.run(
    name="training_job",
    params={
        "n_epochs": N_EPOCHS,
    },
)

# Store results:
job_time = job_run.status.results['time']
job_accuracy = job_run.status.results['validation_accuracy']

## 4. Run As An MPIJob

Run the training as an `mpijob` and store the results.
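Once both runs have finished, the two timings can be summarized as a speedup and a parallel efficiency. The sketch below uses the usual definitions (speedup is serial time divided by parallel time, efficiency is speedup divided by the number of workers). The function name and the sample timings are made up for illustration.

```python
from typing import Tuple


def speedup_and_efficiency(serial_time: float, parallel_time: float, n_workers: int) -> Tuple[float, float]:
    """Compute speedup and parallel efficiency from two wall-clock timings."""
    if parallel_time <= 0 or n_workers <= 0:
        raise ValueError("parallel_time and n_workers must be positive")
    speedup = serial_time / parallel_time
    efficiency = speedup / n_workers
    return speedup, efficiency


# Hypothetical timings in seconds; in the notebook these would be `job_time` and `mpijob_time`:
speedup, efficiency = speedup_and_efficiency(serial_time=100.0, parallel_time=40.0, n_workers=4)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.3f}")
```

A speedup below 1 means the distributed run was slower, which is expected when the workload is too small to amortize the cost of launching and synchronizing the workers.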

In [None]:
# Run as a mpijob:
mpijob_run = mpijob_function.run(
    name="training_mpijob",
    params={
        "n_epochs": N_EPOCHS,
    },
)

# Store results:
mpijob_time = mpijob_run.status.results['time']
mpijob_accuracy = mpijob_run.status.results['validation_accuracy']

## 5. Compare Runtimes

1. Print a summary message.
2. Verify that:
  * The mpijob run took less time (only on stronger machines).
  * The accuracy value is equal between the runs.
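The two checks can be expressed as a small helper. This is a sketch rather than the test's actual verification: `verify_runs` is a hypothetical name, the default tolerance of 0.1 mirrors the `atol` used in the commented-out assertion, and the sample numbers are placeholders.

```python
import math
from typing import List


def verify_runs(
    job_time: float,
    job_accuracy: float,
    mpijob_time: float,
    mpijob_accuracy: float,
    atol: float = 0.1,
    check_time: bool = False,
) -> List[str]:
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []
    # Accuracy must match within an absolute tolerance:
    if not math.isclose(job_accuracy, mpijob_accuracy, rel_tol=0.0, abs_tol=atol):
        failures.append(f"accuracy differs by more than {atol}: {job_accuracy} vs {mpijob_accuracy}")
    # The timing check is optional because it only holds for large enough workloads:
    if check_time and mpijob_time >= job_time:
        failures.append(f"mpijob was not faster: {mpijob_time:.2f}s vs {job_time:.2f}s")
    return failures


# Placeholder numbers for illustration only:
failures = verify_runs(job_time=100.0, job_accuracy=0.95, mpijob_time=120.0, mpijob_accuracy=0.93)
print(failures or "all checks passed")
```

Collecting failures in a list instead of asserting immediately reports both problems in one pass. The timing check stays opt-in, matching the note that it only applies on stronger machines.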

In [None]:
# Delete the MLRun project:
mlrun.get_run_db().delete_project(name=project.name, deletion_strategy="cascading")

In [None]:
# Print the test's collected results:
print(
    "Job:\n"
    f"\t{job_time:.2f} Seconds\n"
    f"\tAccuracy: {job_accuracy}"
)
print(
    "Open MPI Job (Horovod):\n"
    f"\t{mpijob_time:.2f} Seconds\n"
    f"\tAccuracy: {mpijob_accuracy}\n"
)

# Verification (only meaningful on a stronger machine, since it requires big data and longer training):
# assert mpijob_time < job_time
# assert np.isclose(job_accuracy, mpijob_accuracy, atol=0.1)