# Horovod

HorovodRunner is a general API for running distributed deep learning (DL) workloads on Databricks using Uber’s [Horovod](https://github.com/uber/horovod) framework. By integrating Horovod with Spark’s barrier mode, Databricks can provide higher stability for long-running deep learning training jobs on Spark.

HorovodRunner takes a Python method that contains DL training code with Horovod hooks. The method is pickled on the driver and sent to the Spark workers. A Horovod MPI job is embedded as a Spark job using barrier execution mode: the first executor collects the IP addresses of all task executors using `BarrierTaskContext` and triggers a Horovod job using `mpirun`. Each Python MPI process then loads the pickled program, deserializes it, and runs it.

<br>

![](https://files.training.databricks.com/images/horovod-runner.png)

For additional resources, see:
* [Horovod Runner Docs](https://docs.microsoft.com/en-us/azure/databricks/applications/deep-learning/distributed-training/horovod-runner)
* [Horovod Runner webinar](https://vimeo.com/316872704/e79235f62c) 

Run the following cell to set up our environment.

In [0]:
%run "./Includes/Classroom-Setup"



## Build Model

In [0]:
import numpy as np
np.random.seed(0)
import tensorflow as tf
tf.set_random_seed(42) # For reproducibility
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_model():
  return Sequential([Dense(20, input_dim=8, activation='relu'),
                    Dense(20, activation='relu'),
                    Dense(1, activation='linear')]) # Keep the output layer as linear because this is a regression problem



### 1. Importing Libraries

```python
import numpy as np
np.random.seed(0)
```

- **`import numpy as np`**: This line imports the NumPy library, which is a fundamental package for scientific computing in Python. It provides support for arrays, matrices, and a wide range of mathematical functions.

- **`np.random.seed(0)`**: Setting the random seed ensures that the random numbers generated by NumPy will be the same each time you run the code. This is crucial for reproducibility. When working with machine learning models, it’s important to be able to reproduce results, and using a fixed seed helps achieve that.
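
As a quick check of that behavior, the sketch below reseeds NumPy's global generator and compares the draws. The variable names are illustrative only.

```python
import numpy as np

# Seed the global generator, then draw five values.
np.random.seed(0)
first_draw = np.random.rand(5)

# Reseeding with the same value restarts the sequence from the beginning.
np.random.seed(0)
second_draw = np.random.rand(5)

# A different seed gives a different sequence.
np.random.seed(1)
third_draw = np.random.rand(5)

print(np.array_equal(first_draw, second_draw))  # True
print(np.array_equal(first_draw, third_draw))   # False
```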

### 2. Importing TensorFlow and Setting the Seed

```python
import tensorflow as tf
tf.set_random_seed(42)  # For reproducibility
```

- **`import tensorflow as tf`**: This line imports TensorFlow, which is a comprehensive library for building and training machine learning models, especially deep learning models.

- **`tf.set_random_seed(42)`**: Similar to NumPy’s random seed, this sets the graph-level random seed for TensorFlow operations, so that weight initialization and other random processes start from the same state on each run. This is the TensorFlow 1.x name; in TensorFlow 2.x the equivalent call is `tf.random.set_seed(42)`. Some sources of nondeterminism (for example, certain GPU kernels) can still cause small run-to-run differences.

### 3. Importing Keras Modules

```python
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
```

- **`from tensorflow import keras`**: This imports the Keras API from TensorFlow. Keras is a high-level neural networks API that simplifies the process of building deep learning models. It provides an easier interface for creating and training neural networks compared to using TensorFlow directly.

- **`from tensorflow.keras.models import Sequential`**: This imports the **Sequential** model class. A Sequential model is a linear stack of layers, where you can easily add layers one after another. This is useful for simple feedforward neural networks.

- **`from tensorflow.keras.layers import Dense`**: This imports the **Dense** layer class. A Dense layer is a fully connected layer in a neural network, where each neuron in the layer receives input from all neurons in the previous layer. It’s a core building block of neural networks.

### 4. Building the Model Function

```python
def build_model():
    return Sequential([
        Dense(20, input_dim=8, activation='relu'),
        Dense(20, activation='relu'),
        Dense(1, activation='linear')  # Linear output layer for regression
    ])
```

#### Function Definition: `build_model()`
- **`def build_model():`**: This line defines a function called `build_model()`. This function, when called, will create and return a neural network model.

#### Creating the Sequential Model
- **`return Sequential([...])`**: This returns a Sequential model constructed from the list of layers defined within the brackets.

#### Layers of the Model
1. **First Dense Layer**: 
   ```python
   Dense(20, input_dim=8, activation='relu')
   ```
   - **`Dense(20, ...)`**: This creates a Dense layer with 20 neurons.
   - **`input_dim=8`**: This specifies that each input sample has 8 features. It is only needed on the first layer, where Keras has no previous layer from which to infer the input size.
   - **`activation='relu'`**: This sets the activation function for this layer to **ReLU** (Rectified Linear Unit). ReLU introduces non-linearity to the model, allowing it to learn complex patterns. It outputs the input directly if it is positive; otherwise, it outputs zero.

2. **Second Dense Layer**: 
   ```python
   Dense(20, activation='relu')
   ```
   - This creates another Dense layer with 20 neurons and uses the ReLU activation function. It doesn’t need an `input_dim` parameter because Keras automatically infers the input shape from the previous layer.

3. **Output Dense Layer**: 
   ```python
   Dense(1, activation='linear')
   ```
   - **`Dense(1, ...)`**: This creates the output layer with 1 neuron. This is appropriate for a regression task, where we want to predict a single continuous value (like a housing price).
   - **`activation='linear'`**: This means there’s no activation function applied; the output can be any real number. This is suitable for regression tasks where the output is continuous.
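
To make the three layers concrete, here is a minimal NumPy sketch of the forward pass they describe. The weights are random stand-ins rather than trained values, and the helper names `relu` and `dense` are hypothetical, not Keras APIs.

```python
import numpy as np

def relu(x):
    # ReLU keeps positive values and replaces negatives with zero.
    return np.maximum(x, 0.0)

def dense(x, weights, bias, activation=None):
    # A fully connected layer computes activation(x @ W + b).
    z = x @ weights + bias
    return activation(z) if activation is not None else z

rng = np.random.default_rng(0)

# Random parameters with the same shapes as the Keras model: 8 -> 20 -> 20 -> 1.
w1, b1 = rng.normal(size=(8, 20)), np.zeros(20)
w2, b2 = rng.normal(size=(20, 20)), np.zeros(20)
w3, b3 = rng.normal(size=(20, 1)), np.zeros(1)

batch = rng.normal(size=(4, 8))         # 4 samples, 8 features each
hidden1 = dense(batch, w1, b1, relu)    # shape (4, 20)
hidden2 = dense(hidden1, w2, b2, relu)  # shape (4, 20)
output = dense(hidden2, w3, b3)         # shape (4, 1), linear output

print(hidden1.shape, hidden2.shape, output.shape)
```

Because the last layer applies no nonlinearity, `output` can take any real value, which is what a regression target needs.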

### Summary

This code block sets up the foundation for a deep learning model using TensorFlow and Keras. Here’s a quick summary of what we’ve done:

- We imported necessary libraries and set random seeds for reproducibility.
- We created a function (`build_model`) that defines a simple neural network architecture with:
  - **Two hidden layers** with 20 neurons each using the ReLU activation function.
  - **One output layer** with a single neuron using a linear activation function, suitable for regression tasks.

The model can be trained on datasets with 8 features to predict a single continuous target variable, such as housing prices.
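
One way to sanity-check the architecture is to count its trainable parameters by hand. The sketch below assumes every Dense layer keeps its default bias term; `dense_param_count` is an illustrative helper, not part of Keras.

```python
def dense_param_count(n_inputs, n_units):
    # Each unit has one weight per input plus one bias term.
    return n_inputs * n_units + n_units

layer_sizes = [8, 20, 20, 1]  # input features, two hidden layers, output

per_layer = [
    dense_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total = sum(per_layer)

print(per_layer)  # [180, 420, 21]
print(total)      # 621
```

In an environment where TensorFlow is available, `build_model().summary()` should report the same total.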

## Shard Data

From the [Horovod docs](https://github.com/horovod/horovod/blob/master/docs/concepts.rst):

Horovod core principles are based on the MPI concepts size, rank, local rank, allreduce, allgather, and broadcast. These are best explained by example. Say we launched a training script on 4 servers, each having 4 GPUs. If we launched one copy of the script per GPU:

* Size would be the number of processes, in this case, 16.

* Rank would be the unique process ID from 0 to 15 (size - 1).

* Local rank would be the unique process ID within the server from 0 to 3.
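
The 4-server, 4-GPU example can be enumerated in a few lines of plain Python. This sketch assumes ranks are assigned server by server, which is a simplification for illustration; a real launcher may place processes differently.

```python
NUM_SERVERS = 4
GPUS_PER_SERVER = 4

# Size: one process per GPU across the whole cluster.
size = NUM_SERVERS * GPUS_PER_SERVER

# Assumption: consecutive ranks share a machine (server-by-server assignment).
processes = [
    {"rank": server * GPUS_PER_SERVER + gpu, "local_rank": gpu, "server": server}
    for server in range(NUM_SERVERS)
    for gpu in range(GPUS_PER_SERVER)
]

print(size)          # 16
print(processes[5])  # {'rank': 5, 'local_rank': 1, 'server': 1}
```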

We need to shard our data across our processes. **NOTE:** For demo purposes, the data is held in memory as NumPy arrays. In the next notebook we will use Parquet files with Petastorm for better scalability.
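
The `get_dataset` function in the next cell shards with the strided slice `[rank::size]`. The sketch below shows what that slice does on a small stand-in array; the values of `data` and `size` are made up for illustration.

```python
import numpy as np

data = np.arange(10)  # stand-in for the rows of a training set
size = 4              # hypothetical number of Horovod processes

# Each process keeps every `size`-th row, starting at its own rank.
shards = [data[rank::size] for rank in range(size)]

for rank, shard in enumerate(shards):
    print(rank, shard.tolist())
# 0 [0, 4, 8]
# 1 [1, 5, 9]
# 2 [2, 6]
# 3 [3, 7]

# The shards are disjoint and together cover every row exactly once.
recombined = np.sort(np.concatenate(shards))
```

When the row count is not a multiple of `size`, shard lengths differ by at most one row.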

In [0]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def get_dataset(rank=0, size=1):
  scaler = StandardScaler()
  cal_housing = fetch_california_housing(data_home="/dbfs/ml/" + str(rank) + "/")
  X_train, X_test, y_train, y_test = train_test_split(cal_housing.data,
                                                       cal_housing.target,
                                                       test_size=0.2,
                                                       random_state=1)
  scaler.fit(X_train)
  X_train = scaler.transform(X_train[rank::size])
  y_train = y_train[rank::size]
  X_test = scaler.transform(X_test[rank::size])
  y_test = y_test[rank::size]
  return (X_train, y_train), (X_test, y_test)


## Horovod

In [0]:
from tensorflow.keras import optimizers
import horovod.tensorflow.keras as hvd
from tensorflow.keras import backend as K  # only needed for the optional GPU setup below

def run_training_horovod():
  # Horovod: initialize Horovod.
  hvd.init()
  # If using GPU: pin GPU to be used to process local rank (one GPU per process)
  # config = tf.ConfigProto()
  # config.gpu_options.allow_growth = True
  # config.gpu_options.visible_device_list = str(hvd.local_rank())
  # K.set_session(tf.Session(config=config))
  print(f"Rank is: {hvd.rank()}")
  print(f"Size is: {hvd.size()}")
  
  (X_train, y_train), (X_test, y_test) = get_dataset(hvd.rank(), hvd.size())
  
  model = build_model()
  
  # Horovod: adjust learning rate based on number of GPUs/CPUs.
  optimizer = optimizers.Adam(lr=0.001*hvd.size())
  
  # Horovod: add Horovod Distributed Optimizer.
  optimizer = hvd.DistributedOptimizer(optimizer)

  model.compile(optimizer=optimizer, loss="mse", metrics=["mse"])
  
  history = model.fit(X_train, y_train, validation_split=.2, epochs=10, batch_size=64, verbose=2)


Test it out on just the driver.

In [0]:
from sparkdl import HorovodRunner
hr = HorovodRunner(np=-1)
hr.run(run_training_horovod)
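
The two Horovod-specific steps in `run_training_horovod`, scaling the learning rate by the number of processes and wrapping the optimizer so that gradients are averaged across workers, can be simulated without a cluster. This is a rough NumPy sketch: `size` stands in for `hvd.size()`, and the gradients are random placeholders rather than values from a real model.

```python
import numpy as np

base_lr = 0.001
size = 4  # stand-in for hvd.size()

# Linear scaling: the effective batch grows with the number of workers,
# so the learning rate is multiplied by the same factor.
scaled_lr = base_lr * size

# Each worker computes a gradient on its own shard (random placeholders here).
rng = np.random.default_rng(0)
local_grads = [rng.normal(size=3) for _ in range(size)]

# Allreduce with averaging: every worker ends up with the same mean gradient.
avg_grad = np.mean(local_grads, axis=0)

# Workers start from identical weights and apply the identical update,
# so their weights remain in sync after the step.
weights = np.zeros(3)
updated = [weights - scaled_lr * avg_grad for _ in range(size)]

print(scaled_lr)
print(all(np.array_equal(updated[0], w) for w in updated))
```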

## Better Horovod

In [0]:
import horovod.tensorflow.keras as hvd
from tensorflow.keras import optimizers
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

def run_training_horovod():
  # Horovod: initialize Horovod.
  hvd.init()
  # If using GPU: pin GPU to be used to process local rank (one GPU per process)
  # config = tf.ConfigProto()
  # config.gpu_options.allow_growth = True
  # config.gpu_options.visible_device_list = str(hvd.local_rank())
  # K.set_session(tf.Session(config=config))
  
  print(f"Rank is: {hvd.rank()}")
  print(f"Size is: {hvd.size()}")
  
  (X_train, y_train), (X_test, y_test) = get_dataset(hvd.rank(), hvd.size())
  
  model = build_model()
  
  # Horovod: adjust learning rate based on number of GPUs.
  optimizer = optimizers.Adam(lr=0.001*hvd.size())
  
  # Horovod: add Horovod Distributed Optimizer.
  optimizer = hvd.DistributedOptimizer(optimizer)

  model.compile(optimizer=optimizer, loss="mse", metrics=["mse"])

  # Use the optimized FUSE Mount
  checkpoint_dir = f"{ml_working_path}/horovod_checkpoint_weights.ckpt"
  
  callbacks = [
    # Horovod: broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),

    # Horovod: average metrics among workers at the end of every epoch.
    # Note: This callback must be in the list before the ReduceLROnPlateau,
    # TensorBoard or other metrics-based callbacks.
    hvd.callbacks.MetricAverageCallback(),

    # Horovod: using `lr = 1.0 * hvd.size()` from the very beginning leads to worse final
    # accuracy. Scale the learning rate `lr = 1.0` ---> `lr = 1.0 * hvd.size()` during
    # the first five epochs. See https://arxiv.org/abs/1706.02677 for details.
    hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1),
    
    # Reduce the learning rate if training plateaus.
    ReduceLROnPlateau(patience=10, verbose=1)
  ]
  
  # Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
  if hvd.rank() == 0:
    callbacks.append(ModelCheckpoint(checkpoint_dir, save_weights_only=True))
  
  history = model.fit(X_train, y_train, validation_split=.2, epochs=10, batch_size=64, verbose=2, callbacks=callbacks)

Test it out on just the driver.

In [0]:
from sparkdl import HorovodRunner
hr = HorovodRunner(np=-1)
hr.run(run_training_horovod)
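
The warmup idea described in the callback comments, ramping from the base learning rate to the scaled rate over the first five epochs, can be sketched as a simple per-epoch schedule. This is an illustrative simplification, not the formula used inside `LearningRateWarmupCallback`, which may adjust the rate more finely; `warmup_lr` is a hypothetical helper.

```python
def warmup_lr(epoch, base_lr, size, warmup_epochs):
    # Linearly interpolate from base_lr to base_lr * size during warmup,
    # then hold the fully scaled rate.
    if epoch >= warmup_epochs:
        return base_lr * size
    fraction = epoch / warmup_epochs
    return base_lr * (1 + fraction * (size - 1))

# Hypothetical settings: 4 workers, base rate 0.001, 5 warmup epochs.
schedule = [warmup_lr(e, base_lr=0.001, size=4, warmup_epochs=5) for e in range(8)]

for epoch, lr in enumerate(schedule):
    print(epoch, round(lr, 6))
```

With these settings the rate starts at 0.001 and reaches 0.004 at epoch 5, where it stays unless another callback such as `ReduceLROnPlateau` lowers it.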

## Run on all workers

In [0]:
from sparkdl import HorovodRunner
hr = HorovodRunner(np=0)
hr.run(run_training_horovod)