From 3d5e4ed17d4af7f685bdd01c604d2fdaa9c98c45 Mon Sep 17 00:00:00 2001
From: Dinesh Ramasamy <89654805+iitmdinesh@users.noreply.github.com>
Date: Tue, 31 Aug 2021 13:22:05 -0700
Subject: [PATCH] Fix mistake in keras LR scheduler callback (#3142)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Fixes issue when start_epoch != 0

Signed-off-by: Dinesh Ramasamy <89654805+iitmdinesh@users.noreply.github.com>
Signed-off-by: weihanmines

fix torch op handles lazy release which may cause oom in elastic scenario (#3110)

* fix torch op handles lazy release which may cause oom in elastic scenario
Signed-off-by: guoze.lin
* Update mpi_ops.py

Co-authored-by: guoze.lin
Co-authored-by: Travis Addair
Signed-off-by: weihanmines

Added support for extraction of storage options from url. (#3137)

* Added support for extraction of storage options from url.
Signed-off-by: Manjur Ansari
* mock fsspec.utils
Signed-off-by: Manjur Ansari
* Added missing comma

Co-authored-by: Travis Addair
Signed-off-by: weihanmines

Make RayExecutor use the current placement group if one exists (#3134)

Signed-off-by: weihanmines

Fix the mapping btw pyspark and numpy (#3146)

Signed-off-by: Haoyang Chen
Signed-off-by: weihanmines

Add tests for Keras callbacks: MetricAverageCallback, LearningRateScheduleCallback and LearningRateWarmupCallback (#3102)

There were no tests for MetricAverageCallback, LearningRateScheduleCallback and LearningRateWarmupCallback from hvd as noted in #2659. This PR adds testing to verify the callbacks work.

Signed-off-by: Moses Lee <14leeyuchieh@gmail.com>
Co-authored-by: Moses Lee
Signed-off-by: weihanmines

Split gpu tests in head and non-head versions (#3155)

Signed-off-by: Enrico Minack
Signed-off-by: weihanmines

Allow caller to customize the Tensorboard callback (#3153)

* Keras Estimator: Allow user to pass in TensorBoard callback
Signed-off-by: Rich Porter
* Remove callback from other processes on the same machine
Signed-off-by: Rich Porter
* Allow other ranks to profile as well. Doesn't seem to conflict
Signed-off-by: Rich Porter

Signed-off-by: weihanmines

test_torch.py: add explicit join() for testing duplicated name errors (#3159)

For torch nightly >= 1.10.0, we need to add an explicit join() call to avoid hanging when testing duplicated name errors.

Signed-off-by: Chongxiao Cao
Signed-off-by: weihanmines
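The idea behind the explicit join() in #3159 can be sketched as follows. This is illustrative only, not the actual test_torch.py change; the exact exception raised for a duplicated name and the use of allreduce_async here are assumptions for the purpose of the example.

```python
# Hypothetical sketch (not the real test): after an op intentionally fails due to a
# duplicated name, an explicit hvd.join() gives every rank a common synchronization
# point, so no rank hangs waiting for a collective the other ranks never submit.
# Launch with e.g.: horovodrun -np 2 python join_example.py
import torch
import horovod.torch as hvd

hvd.init()

tensor = torch.ones(4)
handle = hvd.allreduce_async(tensor, name="dup")
try:
    # Submitting a second op under the same name while the first is in flight is
    # expected to fail (the exact exception type is an assumption here).
    hvd.allreduce_async(tensor, name="dup")
except Exception:
    pass

hvd.synchronize(handle)
hvd.join()  # explicit join keeps all ranks in step before the test ends
```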
Disable TF2.6.0 XLA support on OSX (#3133)

Related to issue #3132

Signed-off-by: Chongxiao Cao
Signed-off-by: weihanmines

Fix linking _pywrap_tensorflow_internal.so and re-enable XLA on macOS (#3173)

Signed-off-by: weihanmines

Spark/Lightning: fix the usage of checkpoint callback (#3186)

Signed-off-by: Chongxiao Cao
Signed-off-by: weihanmines

Fix Cometlogger experiment key lost issue (#3184)

* test
Signed-off-by: Peng Zhang
* test
Signed-off-by: Peng Zhang
* fix_logger
Signed-off-by: Peng Zhang
* fix_logger
Signed-off-by: Peng Zhang
* recreate_loger
Signed-off-by: Peng Zhang
* fix_var
Signed-off-by: Peng Zhang
* test
Signed-off-by: Peng Zhang
* test
Signed-off-by: Peng Zhang

Signed-off-by: weihanmines

Updated torch c++ to use new aten api (#3175)

Signed-off-by: weihanmines

Spark/Keras: remove bare Keras support (#3191)

Signed-off-by: weihanmines

Make fork PRs publish test change stats (#3185)

Signed-off-by: Enrico Minack
Signed-off-by: weihanmines

Support for nccl on cuda 11.4 (#3182)

Signed-off-by: Evan Brossard
Signed-off-by: weihanmines

Fix MPICH support (#3148)

* fix MPICH implementation
* enable tests for MPICH and Intel MPI

Signed-off-by: Jinzhe Zeng
Signed-off-by: weihanmines

Increase build timeout to 40m on Buildkite (#3192)

Signed-off-by: Enrico Minack
Signed-off-by: weihanmines

Change CMake syntax to be compatible with old versions of CMake (#3196)

Signed-off-by: Max H. Gerlach
Signed-off-by: weihanmines

Reinit every torch test (#3194)

Signed-off-by: weihanmines

Add barrier call to torch module to support easy synchronization for process sets (#3139)

* Added barrier call to torch module

Signed-off-by: TJ
Signed-off-by: weihanmines
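A minimal usage sketch of the barrier call added for the torch module in #3139 above; this is hypothetical example code, not part of the patch itself.

```python
# Hypothetical usage of the new barrier call (not part of this patch).
# Launch with e.g.: horovodrun -np 2 python barrier_example.py
import horovod.torch as hvd

hvd.init()

# ... each rank does some independent work here ...

# Block until every rank has reached this point. The commit title also mentions
# process sets; restricting the barrier to a set's members (e.g. via a
# process_set argument) is assumed to be supported based on that description.
hvd.barrier()
```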
Bump version to 0.23.0 (#3200)

Signed-off-by: Travis Addair
Co-authored-by: Max H. Gerlach
Signed-off-by: weihanmines

Increase Parallel PyTest timeout to 10m (#3198)

* Increase MPI and Gloo Parallel PyTest timeout to 10m

Signed-off-by: Enrico Minack
Signed-off-by: weihanmines

Spark/Lightning: don't overwrite model with checkpoint by default (#3201)

The Lightning estimator saves the model by default if there is no specified checkpoint callback. However, the model was not overwritten with the checkpoint file in that case.

Signed-off-by: Chongxiao Cao
Signed-off-by: weihanmines

Spark/Lightning: fix checkpoint callback dirpath typo (#3204)

Signed-off-by: Chongxiao Cao
Signed-off-by: weihanmines

Rework events in CI workflows (#3202)

Signed-off-by: Enrico Minack
Signed-off-by: weihanmines

Allow for concurrent schedule and master build, document concurrency (#3206)

Signed-off-by: Enrico Minack
Signed-off-by: weihanmines

Ray: fix RayExecutor to fail when num_workers=0 and num_hosts=None (#3210)

Signed-off-by: Travis Addair
Signed-off-by: weihanmines

add_history_in_lightning_estimator (#3214)

Signed-off-by: Peng Zhang
Signed-off-by: weihanmines

Allow buildkite building merge commits on forks (#3215)

Signed-off-by: Enrico Minack
Signed-off-by: weihanmines

Fix json output in ci-results.yaml (#3217)

Signed-off-by: weihanmines

Spark/Lightning: fix history metrics for estimator serialization (#3216)

Save metrics inside the checkpoint dict, which will be loaded with map_location=torch.device('cpu').

Signed-off-by: Peng Zhang
Signed-off-by: weihanmines

patch python source files on macCI (#3220)

* patch python source files on macCI
* Trigger build and test CI

Signed-off-by: TJ
Co-authored-by: Enrico Minack
Signed-off-by: weihanmines

Updated examples of torch and tf to include mixed precision training (#3222)

* Added mixed precision example for pytorch
* added mixed precision for keras

Signed-off-by: TJ
Signed-off-by: weihanmines

Job buildkite-heads accesses ci-workflow outputs, add it to the needs (#3225)

Signed-off-by: Enrico Minack
Signed-off-by: weihanmines

Fixes race condition for ray scale up down tests (#3205)

Ensure that at least one host from the previous set of hosts has been registered. Without this, the discovery script will "discover" the new set of hosts before the current set can register, which results in a race condition. Consider a discovery schedule:

```
discovery_schedule = [
    (10, ['host-1:2']),
    (30, ['host-1:2', 'host-2:1', 'host-3:1']),
    (None, ['host-2:1']),
]
```

The initial set is ['host-1:2']. Before this is registered in the driver, the discovery script discovers the set ['host-1:2', 'host-2:1', 'host-3:1'] and adds ['host-2:1', 'host-3:1']. However, since ['host-1:2'] has not registered, there is no coordinator to notify the workers. When host-1 and host-3 are removed, driver.resume will call _activate_workers, which will update the host assignments. It checks whether the previous and current sets of hosts intersect: it finds that the previous set is ['host-1:2'] and the current set is ['host-2:1'], since there was no notification for the added and removed hosts. This change ensures that the previous set of hosts can register before the current set is discovered.

Signed-off-by: Abin Shahab
Signed-off-by: weihanmines

Removed a case of the default mutable argument pitfall (#3227)

Signed-off-by: Naelson Douglas
Signed-off-by: weihanmines

Updates to TSC members (#3234)

Signed-off-by: Travis Addair
Signed-off-by: weihanmines
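For context on #3227 above, the generic Python snippet below (not the Horovod code that was changed) shows the default mutable argument pitfall and its usual fix.

```python
# Generic illustration of the default mutable argument pitfall (not Horovod code).
# A mutable default is created once, at function definition time, and is then
# shared across all calls of the function.
def append_bad(item, items=[]):      # one shared list for every call
    items.append(item)
    return items

def append_good(item, items=None):   # the usual fix: default to None
    if items is None:
        items = []
    items.append(item)
    return items

print(append_bad(1), append_bad(2))    # [1, 2] [1, 2]  -- surprising
print(append_good(1), append_good(2))  # [1] [2]        -- expected
```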
Add in-place broadcast for TensorFlow (#3128)

* Update comment in FindTensorflow.cmake
Signed-off-by: Max H. Gerlach
* Add in-place broadcast_() and broadcast_variables() for TF
Signed-off-by: Max H. Gerlach
* Include source files from TF in build to avoid missing symbol errors
Signed-off-by: Max H. Gerlach
* Limit build and test to TF 2.6+
Signed-off-by: Max H. Gerlach
* Remove source files copied from TensorFlow

The missing symbols are resolved by linking against _pywrap_tensorflow_internal.so, which was introduced to Horovod with PR #3053.

Signed-off-by: Max H. Gerlach
* Fix possible type attribute values for HorovodBroadcastInplace
Signed-off-by: Max H. Gerlach
* Add reference variables to test
Signed-off-by: Max H. Gerlach
* Update comments, doc strings, changelog
Signed-off-by: Max H. Gerlach

Signed-off-by: weihanmines

[Elastic Horovod] Fix the bug for ElasticSampler and hvd.elastic.state (#3144)

Co-authored-by: gethinhu
Signed-off-by: weihanmines

a better way to handle nccl error under elastic scenario (#3112)

Signed-off-by: guoze.lin
Signed-off-by: weihanmines

check torch version for mixed precision example (#3238)

Signed-off-by: weihanmines

Lightning: set limit_train_batches and limit_val_batches (#3237)

Tell the Lightning trainer how many batches a single epoch needs.

Signed-off-by: Chongxiao Cao
Signed-off-by: weihanmines

Spark/Lightning: reduce memory footprint of async dataloader (#3239)

Limit async data loader queue size.

Signed-off-by: Peng Zhang
Signed-off-by: weihanmines

Change default fusion threshold from 64MB to 128MB in docs (#3241)

Signed-off-by: weihanmines

fix the example of pytorch_lightning_mnist.py (#3245)

- remove unused arg parameters
- fix model test issue on GPU

Signed-off-by: Chongxiao Cao
Signed-off-by: weihanmines

CI: use latest pytorch_lightning with torchhead (#3243)

Signed-off-by: weihanmines

test_gradient_aggregation with real gradient instead of a constant (#3176)

This fixes issue #2664 by performing gradient aggregation with a real gradient instead of a constant. PR #2647 shifts the gradient allreduce to the point where the gradient is computed (both through the DistributedOptimizer and through the DistributedGradientTape). This means that this unittest, by design in TF 2.4, doesn't call allreduce in _aggregate_gradients(). Since this unittest provides the gradient as a constant (without actually computing it), the gradient will never be allreduced. The current change ensures that a real gradient is computed from a loss function instead of using a constant. Note: the current loss function intentionally evaluates to zero. A future PR should convert it to a real loss function (e.g. MeanSquaredError) and compute gradients from that to test gradient aggregation.

Signed-off-by: Abin Shahab
Signed-off-by: weihanmines

Remove MetricAverageCallback warning on tf >= 2.5 (#3050)

Signed-off-by: Henrique Mendonça
Signed-off-by: weihanmines

Fix Horovod pyarrow IndexError: list index out of range (#3255)

Signed-off-by: Weichen Xu
Signed-off-by: weihanmines

Fixing up current CI test failures. (#3259)

Signed-off-by: Josh Romero
Co-authored-by: Travis Addair
Co-authored-by: Enrico Minack
Signed-off-by: weihanmines

Revert "Fix Horovod pyarrow IndexError: list index out of range (#3255)" (#3265)

This reverts commit 3efc229a8d12c250ea4a3493dc01aa8241a10899.

Signed-off-by: Travis Addair
Signed-off-by: weihanmines

Debugging for lightning data loader and fix for simple profiler. (#3253)

Add a debugging flag for the lightning data loader and make the async data loader queue size configurable.

Signed-off-by: weihanmines

Call process_set._setup in init() to point to the correct native lib path (#3258)

* call setup for common process_set in remote trainers

moved _setup call to init()

Signed-off-by: TJ
Signed-off-by: weihanmines

Add support for MXNet async dependency engine.
(#3242) Signed-off-by: Josh Romero Signed-off-by: weihanmines --- .buildkite/gen-pipeline.sh | 52 +- .github/gen-workflow-ci.py | 230 +++--- .github/get-changed-code-files.py | 8 +- .github/workflows/ci-fork.yaml | 121 --- .github/workflows/ci-results.yaml | 237 ++++++ .github/workflows/ci.yaml | 706 ++++++------------ .gitmodules | 2 +- CHANGELOG.md | 68 +- Dockerfile.test.cpu | 5 +- Dockerfile.test.gpu | 2 + GOVERNANCE.md | 5 +- Jenkinsfile.ppc64le | 8 +- cmake/Modules/FindTensorflow.cmake | 9 +- docker-compose.test.yml | 45 +- docs/mocks.py | 3 + docs/tensor-fusion.rst | 2 +- .../tensorflow2_keras_mnist_elastic.py | 17 + examples/pytorch/pytorch_lightning_mnist.py | 17 +- examples/pytorch/pytorch_mnist.py | 48 +- examples/spark/keras/keras_spark_mnist.py | 3 +- .../pytorch/pytorch_lightning_spark_mnist.py | 11 +- horovod/__init__.py | 2 +- horovod/_keras/callbacks.py | 6 +- horovod/common/basics.py | 3 +- horovod/common/common.h | 4 + horovod/common/controller.cc | 28 +- horovod/common/message.cc | 6 + horovod/common/message.h | 4 +- horovod/common/operations.cc | 267 ++++--- horovod/common/operations.h | 3 + horovod/common/ops/adasum_gpu_operations.cc | 94 ++- horovod/common/ops/collective_operations.cc | 16 + horovod/common/ops/collective_operations.h | 9 + horovod/common/ops/cuda_operations.cc | 147 ++-- horovod/common/ops/gpu_context_impl.cc | 64 +- horovod/common/ops/gpu_operations.cc | 310 +++++--- horovod/common/ops/gpu_operations.h | 118 +-- horovod/common/ops/hip_operations.cc | 64 +- horovod/common/ops/nccl_operations.cc | 422 +++++++---- horovod/common/ops/nccl_operations.h | 22 +- horovod/common/ops/operation_manager.cc | 9 + horovod/common/ops/operation_manager.h | 4 + horovod/common/response_cache.cc | 2 + horovod/common/tensor_queue.cc | 9 +- horovod/common/tensor_queue.h | 2 + horovod/common/wire/message.fbs | 6 +- horovod/common/wire/message_generated.h | 22 +- horovod/data/data_loader_base.py | 42 +- horovod/mxnet/mpi_ops.cc | 126 +++- horovod/mxnet/mpi_ops.h | 3 + horovod/mxnet/mpi_ops.py | 6 +- horovod/ray/runner.py | 34 +- horovod/ray/strategy.py | 49 +- horovod/runner/mpi_run.py | 26 +- horovod/spark/common/store.py | 7 +- horovod/spark/common/util.py | 2 +- horovod/spark/keras/estimator.py | 86 +-- horovod/spark/keras/remote.py | 20 +- horovod/spark/keras/util.py | 180 ----- horovod/spark/lightning/datamodule.py | 41 +- horovod/spark/lightning/estimator.py | 59 +- horovod/spark/lightning/remote.py | 186 +++-- horovod/spark/torch/remote.py | 4 + horovod/tensorflow/__init__.py | 2 +- horovod/tensorflow/functions.py | 41 +- horovod/tensorflow/mpi_ops.cc | 269 ++++++- horovod/tensorflow/mpi_ops.py | 43 +- horovod/torch/CMakeLists.txt | 4 +- horovod/torch/__init__.py | 1 + horovod/torch/adapter_v2.cc | 4 +- horovod/torch/cuda_util.cc | 7 +- horovod/torch/elastic/sampler.py | 36 +- horovod/torch/elastic/state.py | 5 - horovod/torch/mpi_ops.py | 31 +- horovod/torch/mpi_ops_v2.cc | 42 +- horovod/torch/ready_event.cc | 24 - setup.py | 7 +- test/integration/test_spark_keras.py | 130 +--- test/integration/test_spark_lightning.py | 33 + test/integration/test_static_run.py | 10 +- test/parallel/base_test_mxnet.py | 5 +- test/parallel/test_tensorflow.py | 160 ++++ test/parallel/test_tensorflow2_keras.py | 147 ++-- test/parallel/test_torch.py | 79 +- ...expected_buildkite_gpu_heads_pipeline.yaml | 147 ++++ ...ted_buildkite_gpu_non_heads_pipeline.yaml} | 657 +++------------- test/single/test_buildkite.py | 51 +- test/single/test_ray.py | 33 + test/single/test_ray_elastic.py | 62 
+- test/single/test_torch_elastic.py | 29 +- 90 files changed, 3549 insertions(+), 2623 deletions(-) delete mode 100644 .github/workflows/ci-fork.yaml create mode 100644 .github/workflows/ci-results.yaml create mode 100644 test/single/data/expected_buildkite_gpu_heads_pipeline.yaml rename test/single/data/{expected_buildkite_gpu_pipeline.yaml => expected_buildkite_gpu_non_heads_pipeline.yaml} (61%) diff --git a/.buildkite/gen-pipeline.sh b/.buildkite/gen-pipeline.sh index 5e7b4b39b5..c203324ae3 100755 --- a/.buildkite/gen-pipeline.sh +++ b/.buildkite/gen-pipeline.sh @@ -7,29 +7,29 @@ set -eu repository=823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite # our baseline test is -baseline="test-cpu-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2" +baseline="test-cpu-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2" # in run_gloo_integration we run 'Elastic Spark * Tests' for this baseline # so it has to have Gloo mpi kind # skip tests when there are no code changes dir="$(dirname "$0")" code_files=$(python "$dir/get_changed_code_files.py" || echo failure) -tests=$(if [[ "${PIPELINE_MODE:-}" == *"FULL"* ]] && ( [[ "${BUILDKITE_BRANCH:-}" == "${BUILDKITE_PIPELINE_DEFAULT_BRANCH:-}" ]] || [[ -n "$code_files" ]] ); then +tests=$(if [[ -n "${PIPELINE_MODE:-}" ]] && ( [[ "${BUILDKITE_BRANCH:-}" == "${BUILDKITE_PIPELINE_DEFAULT_BRANCH:-}" ]] || [[ -n "$code_files" ]] ); then # we vary the baseline along the Python dimension and PySpark together # run_gloo_integration expects these to have Gloo mpi kind to run 'Elastic Spark * Tests' - printf "test-cpu-gloo-py3_7-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark2_4_8 " - printf "test-cpu-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_0_3 " + printf "test-cpu-gloo-py3_7-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark2_4_8 " + printf "test-cpu-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_0_3 " # our baseline printf "$baseline " # then we vary the baseline along mpi kinds dimension # our baseline again -# printf "test-cpu-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 " - printf "test-cpu-mpich-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 " - printf "test-cpu-oneccl-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 " - printf "test-cpu-openmpi-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 " +# printf "test-cpu-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 " + printf "test-cpu-mpich-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 " + printf "test-cpu-oneccl-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 " + printf "test-cpu-openmpi-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 " # note: we test openmpi-gloo mpi kind in this variation in each of [cpu, gpu, mixed] - printf "test-cpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 " + printf "test-cpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 " # then we vary the baseline along the framework dimensions all together # some frameworks are not available for our baseline Python version 3.8, so we use Python 3.7 @@ -39,13 +39,13 @@ tests=$(if [[ "${PIPELINE_MODE:-}" == *"FULL"* ]] && ( [[ "${BUILDKITE_BRANCH:-} # https://github.com/apache/incubator-mxnet/issues/16193 # however, there is an mxnet-cu101-1.6.0.post0, so we test this with gpu instead of cpu #printf 
"test-cpu-gloo-py3_8-tf2_4_3-keras2_3_1-torch1_7_1-mxnet1_6_0_p0-pyspark3_1_2 " - printf "test-cpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_7_0_p2-pyspark3_1_2 " + printf "test-cpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_7_0_p2-pyspark3_1_2 " # our baseline again -# printf "test-cpu-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 " +# printf "test-cpu-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 " printf "test-cpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 " # then we vary the frameworks for gpu - printf "test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2 " + printf "test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2 " # this is required as we cannot test mxnet-1.6.0.post0 with cpu printf "test-gpu-gloo-py3_8-tf2_4_3-keras2_3_1-torch1_7_1-mxnet1_6_0_p0-pyspark3_1_2 " # we additionally test the previous framework combination (CUDA 10.x) with mxnet 1.7.x @@ -53,13 +53,15 @@ tests=$(if [[ "${PIPELINE_MODE:-}" == *"FULL"* ]] && ( [[ "${BUILDKITE_BRANCH:-} printf "test-gpu-gloo-py3_8-tf2_4_3-keras2_3_1-torch1_7_1-mxnet1_7_0_p1-pyspark3_1_2 " # we deviate from mxnet1_7_0_p2 here as other frameworks target CUDA 11.x and # mxnet 1.7.x only supports CUDA 10.x, with mxnet 1.8.x we have CUDA 11.x packages - printf "test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 " - printf "test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 " + printf "test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 " + printf "test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 " printf "test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 " # and one final test with mixed cpu+gpu - printf "test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 " -fi | if [[ "${PIPELINE_MODE:-}" == "GPU FULL" ]]; then sed -E "s/[^ ]*-cpu-[^ ]*//g"; else cat; fi) + printf "test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 " +fi | if [[ "${PIPELINE_MODE:-}" == "GPU"* ]]; then sed -E "s/[^ ]*-cpu-[^ ]*//g"; else cat; fi \ + | if [[ "${PIPELINE_MODE:-}" == "GPU HEADS" ]]; then sed -E "s/ /\n/g" | grep -e "-tfhead-keras_none-torchhead-mxnethead-" | paste -s -d " " -; else cat; fi \ + | if [[ "${PIPELINE_MODE:-}" == "GPU NON HEADS" ]]; then sed -E "s/[^ ]*-tfhead-keras_none-torchhead-mxnethead-[^ ]*//g"; else cat; fi) read -r -a tests <<< "$tests" @@ -76,7 +78,7 @@ build_test() { echo " push-retries: 5" echo " - ecr#v1.2.0:" echo " login: true" - echo " timeout_in_minutes: 30" + echo " timeout_in_minutes: 40" echo " retry:" echo " automatic: true" echo " agents:" @@ -123,7 +125,7 @@ run_mpi_pytest() { run_test "${test}" "${queue}" \ ":pytest: MPI Parallel PyTests (${test})" \ "bash -c \"${oneccl_env} ${test_env} cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \\\$(cat /mpirun_command) /bin/bash /pytest.sh mpi)\"" \ - 5 + 10 run_test "${test}" "${queue}" \ ":pytest: MPI Single PyTests (${test})" \ "bash -c \"${oneccl_env} ${test_env} cd /horovod/test/single && (ls -1 test_*.py | xargs -n 1 /bin/bash /pytest_standalone.sh mpi)\"" \ @@ -229,7 +231,7 @@ run_gloo_pytest() { run_test "${test}" "${queue}" \ ":pytest: Gloo Parallel PyTests (${test})" \ "bash -c \"${test_env} cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)\"" \ - 5 + 10 
run_test "${test}" "${queue}" \ ":pytest: Gloo Single PyTests (${test})" \ "bash -c \"${test_env} cd /horovod/test/single && (ls -1 test_*.py | xargs -n 1 /bin/bash /pytest_standalone.sh gloo)\"" \ @@ -298,7 +300,7 @@ run_gloo_integration() { run_test "${test}" "${queue}" \ ":factory: Elastic Spark TensorFlow Tests (${test})" \ "bash -c \"cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml ${elastic_spark_tensorflow}\"" \ - 20 + 40 fi # Elastic Horovod on Spark tests are very expensive (high timeout) @@ -307,7 +309,7 @@ run_gloo_integration() { run_test "${test}" "${queue}" \ ":factory: Elastic Spark Torch Tests (${test})" \ "bash -c \"cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.torch.xml test_elastic_spark_torch.py\"" \ - 20 + 40 fi } @@ -330,7 +332,7 @@ run_spark_integration() { run_test "${test}" "${queue}" \ ":spark: Spark PyTests (${test})" \ "bash -c \"cd /horovod/test/integration && (ls -1 test_spark*.py | xargs -n 1 /bin/bash /pytest_standalone.sh spark)\"" \ - 20 + 40 fi if [[ ${test} != *"tf2"* && ${test} != *"tfhead"* ]]; then @@ -418,7 +420,7 @@ for test in ${tests[@]-}; do run_mpi ${test} "cpu" ${oneccl_cmd_ofi} # always run spark tests which use MPI and Gloo - run_spark_integration ${test} "cpu" + #run_spark_integration ${test} "cpu" # no runner application, world size = 1 run_single_integration ${test} "cpu" ${oneccl_cmd_mpi} @@ -430,7 +432,7 @@ for test in ${tests[@]-}; do fi # always run spark tests which use MPI and Gloo - run_spark_integration ${test} "cpu" + #run_spark_integration ${test} "cpu" # no runner application, world size = 1 run_single_integration ${test} "cpu" @@ -472,6 +474,6 @@ for test in ${tests[@]-}; do run_mpi_integration ${test} "2x-gpu-v510" fi - run_spark_integration ${test} "2x-gpu-v510" + #run_spark_integration ${test} "2x-gpu-v510" fi done diff --git a/.github/gen-workflow-ci.py b/.github/gen-workflow-ci.py index d13b4d9e83..3ac6fac39c 100644 --- a/.github/gen-workflow-ci.py +++ b/.github/gen-workflow-ci.py @@ -113,6 +113,7 @@ def workflow_header() -> str: f'\n' f'on:\n' f' schedule:\n' + f' # run a build on master (this does not publish test results or cancel concurrent builds)\n' f' - cron: \'0 10 * * *\' # everyday at 10am\n' f' push:\n' f' # only consider push to master and tags\n' @@ -120,14 +121,28 @@ def workflow_header() -> str: f' branches: [ master ]\n' f' tags: [ \'v*.*.*\' ]\n' f' pull_request:\n' + f' # only consider pull requests into master\n' f' branches: [ master ]\n' f'\n' f'concurrency:\n' - f' # github.ref means something like refs/heads/master or refs/tags/v0.22.1 or the branch.\n' - f' # This helps to not cancel concurrent runs on master and a tag that share the same commit\n' - f' # On master, head_ref is empty, so we use the SHA of the commit, this means\n' - f' # individual commits to master will not be cancelled, but tagged\n' - f' group: ci-${{{{ github.ref }}}}-${{{{ github.head_ref || github.sha }}}}\n' + f' # This controls which concurrent builds to cancel:\n' + f' # - we do not want any concurrent builds on a 
branch (pull_request)\n' + f' # - we do not want concurrent builds on the same commit on master (push)\n' + f' # - we do not want concurrent builds on the same commit on a tag (push)\n' + f' # - we allow concurrent runs on the same commit on master and its tag (push)\n' + f' # - we allow concurrent runs on the same commit on master (push) and a scheduled build (schedule)\n' + f' #\n' + f' # A pull_request event only runs on branch commit, a push event only on master and tag commit.\n' + f' # A schedule event only runs on master HEAD commit.\n' + f' #\n' + f' # Expression github.ref means something like refs/heads/master or refs/tags/v0.22.1 or the branch.\n' + f' # This helps to not cancel concurrent runs on master or a tag that share the same commit.\n' + f' # Expression github.head_ref refers to the branch of the pull request.\n' + f' # On master, github.head_ref is empty, so we use the SHA of the commit, this means individual\n' + f' # commits to master will not be cancelled, while there can only be one concurrent build on a branch.\n' + f' #\n' + f' # We include the event name to we allow for concurrent scheduled and master builds.\n' + f' group: ci-${{{{ github.event_name }}}}-${{{{ github.ref }}}}-${{{{ github.head_ref || github.sha }}}}\n' f' cancel-in-progress: true\n' f'\n') @@ -137,7 +152,20 @@ def jobs(*jobs: str) -> str: ' runs-on: ubuntu-latest\n' \ ' steps:\n' \ ' - name: Debug Action\n' \ - ' uses: hmarr/debug-action@v1.0.0\n' + \ + ' uses: hmarr/debug-action@v1.0.0\n' \ + ' - name: Debug Concurrency\n' \ + ' run: echo "ci-${{ github.event_name }}-${{ github.ref }}-${{ github.head_ref || github.sha }}"\n' \ + '\n' \ + ' event_file:\n' \ + ' name: "Event File"\n' \ + ' runs-on: ubuntu-latest\n' \ + ' steps:\n' \ + ' - name: Upload\n' \ + ' uses: actions/upload-artifact@v2\n' \ + ' with:\n' \ + ' name: Event File\n' \ + ' path: ${{ github.event_path }}\n' \ + '\n' + \ '\n'.join(jobs) def init_workflow_job() -> str: @@ -145,9 +173,11 @@ def init_workflow_job() -> str: f' name: "Init Workflow"\n' f' runs-on: ubuntu-latest\n' f' outputs:\n' - f" run_at_all: ${{{{ github.event_name != 'schedule' || github.repository == 'horovod/horovod' }}}}\n" + f" run-at-all: ${{{{ github.event_name != 'schedule' || github.repository == 'horovod/horovod' }}}}\n" f" # if we don't get a clear 'false', we fall back to building and testing\n" - f" run_builds_and_tests: ${{{{ steps.tests.outputs.needed != 'false' }}}}\n" + f" run-builds-and-tests: ${{{{ steps.tests.outputs.needed != 'false' }}}}\n" + f' buildkite-branch-label: "${{{{ steps.config-buildkite.outputs.branch-label }}}}"\n' + f' buildkite-message: "${{{{ steps.config-buildkite.outputs.message }}}}"\n' f'\n' f' steps:\n' f' - name: Checkout\n' @@ -172,8 +202,8 @@ def init_workflow_job() -> str: f' - name: Check if tests are needed\n' f' id: tests\n' f' env:\n' - f' GITHUB_BASE: ${{{{ github.event.pull_request.base.sha }}}}\n' - f' GITHUB_HEAD: ${{{{ github.event.pull_request.head.sha }}}}\n' + f' GITHUB_BASE_SHA: ${{{{ github.event.pull_request.base.sha }}}}\n' + f' GITHUB_HEAD_SHA: ${{{{ github.event.pull_request.head.sha }}}}\n' f' run: |\n' f' if [[ "${{{{ github.event_name }}}}" == "pull_request" ]]\n' f' then\n' @@ -190,7 +220,50 @@ def init_workflow_job() -> str: f' else\n' f' echo "This is not part of a pull request, we need to build and test"\n' f' echo "::set-output name=needed::true"\n' - f' fi\n') + f' fi\n' + f'\n' + f' - name: Configure Buildkite Build\n' + f' id: config-buildkite\n' + f' env:\n' + f' GITHUB_TOKEN: 
${{secrets.GITHUB_TOKEN}}\n' + f' run: |\n' + f' branch="${{{{ github.event.pull_request.head.ref || github.ref }}}}"\n' + f' branch="${{branch#"refs/heads/"}}"\n' + f' branch="${{branch#"refs/tags/"}}"\n' + f'\n' + f' branch_label="${{branch}}"\n' + f' if [[ "${{{{ github.event_name }}}}" == "schedule" ]]\n' + f' then\n' + f' # we add this label to the branch used by Buildkite to avoid it cancelling one of concurrent schedule and push builds on master\n' + f' branch_label="${{branch}} (schedule)"\n' + f' fi\n' + f' echo "::set-output name=branch-label::${{branch_label}}"\n' + f'\n' + f' if [[ "${{{{ github.event_name }}}}" == "pull_request" ]]\n' + f' then\n' + f' head_sha="${{{{ github.event.pull_request.head.sha }}}}"\n' + f' message="$(gh api https://api.github.com/repos/horovod/horovod/commits/${{head_sha}} -q .commit.message | head -n1)"\n' + f' echo "::set-output name=message::${{message}}"\n' + f' fi\n' + f'\n' + f' - name: Provide PR meta\n' + f" if: github.event_name == 'pull_request'\n" + f' run: |\n' + f' rm -f pr.json\n' + f' echo -n "{{" >> pr.json\n' + f' echo -n " \\\"merge_sha\\\": \\\"${{{{ github.sha }}}}\\\"," >> pr.json\n' + f' echo -n " \\\"base_sha\\\": \\\"${{{{ github.event.pull_request.base.sha }}}}\\\"," >> pr.json\n' + f' echo -n " \\\"head_sha\\\": \\\"${{{{ github.event.pull_request.head.sha }}}}\\\" " >> pr.json\n' + f' echo -n "}}" >> pr.json\n' + f' cat pr.json\n' + f'\n' + f' - name: Upload PR meta\n' + f' uses: actions/upload-artifact@v2\n' + f" if: github.event_name == 'pull_request'\n" + f' with:\n' + f' name: PR Meta\n' + f' path: pr.json\n' + f'\n') def build_and_test_images(id: str, name: str, @@ -207,8 +280,8 @@ def build_and_test_images(id: str, f' name: "{name} (${{{{ matrix.image }}}})"\n' f' needs: [{", ".join(needs)}]\n' f' if: >\n' - f" needs.init-workflow.outputs.run_at_all == 'true' &&\n" - f" needs.init-workflow.outputs.run_builds_and_tests == 'true'\n" + f" needs.init-workflow.outputs.run-at-all == 'true' &&\n" + f" needs.init-workflow.outputs.run-builds-and-tests == 'true'\n" f' runs-on: ubuntu-latest\n' f'\n' f' strategy:\n' @@ -358,8 +431,8 @@ def build_and_test_macos(id: str, name: str, needs: List[str], attempts: int = 3 f' name: "{name} (${{{{ matrix.image }}}}-macos)"\n' f' needs: [{", ".join(needs)}]\n' f' if: >\n' - f" needs.init-workflow.outputs.run_at_all == 'true' &&\n" - f" needs.init-workflow.outputs.run_builds_and_tests == 'true'\n" + f" needs.init-workflow.outputs.run-at-all == 'true' &&\n" + f" needs.init-workflow.outputs.run-builds-and-tests == 'true'\n" f' runs-on: macos-latest\n' f'\n' f' strategy:\n' @@ -368,34 +441,34 @@ def build_and_test_macos(id: str, name: str, needs: List[str], attempts: int = 3 f' matrix:\n' f' include:\n' f'' - f' - image: test-cpu-openmpi-py3_7-tf1_15_5-keras2_2_4-torch1_2_0-mxnet1_5_0\n' + f' - image: test-cpu-openmpi-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_0\n' f' HOROVOD_WITH_MPI: 1\n' f' HOROVOD_WITHOUT_GLOO: 1\n' f' TENSORFLOW: 1.15.0\n' f' KERAS: 2.2.4\n' - f' PYTORCH: 1.2.0\n' - f' PYTORCH_LIGHTNING: 0.7.6\n' - f' TORCHVISION: 0.4.0\n' + f' PYTORCH: 1.6.0\n' + f' PYTORCH_LIGHTNING: 1.3.8\n' + f' TORCHVISION: 0.7.0\n' f' MXNET: 1.5.0\n' f'\n' - f' - image: test-cpu-gloo-py3_8-tf2_2_0-keras2_3_1-torch1_5_0-mxnet1_5_0\n' + f' - image: test-cpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_5_0\n' f' HOROVOD_WITHOUT_MPI: 1\n' f' HOROVOD_WITH_GLOO: 1\n' - f' TENSORFLOW: 2.2.0\n' - f' KERAS: 2.3.1\n' - f' PYTORCH: 1.5.0\n' - f' PYTORCH_LIGHTNING: 1.2.9\n' - f' TORCHVISION: 0.6.0\n' + 
f' TENSORFLOW: 2.5.1\n' + f' KERAS: 2.5.0rc0\n' + f' PYTORCH: 1.8.1\n' + f' PYTORCH_LIGHTNING: 1.3.8\n' + f' TORCHVISION: 0.9.1\n' f' MXNET: 1.5.0\n' f'\n' - f' - image: test-openmpi-cpu-gloo-py3_8-tf2_3_0-keras2_3_1-torch1_6_0-mxnet1_5_0\n' + f' - image: test-openmpi-cpu-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_5_0\n' f' HOROVOD_WITH_MPI: 1\n' f' HOROVOD_WITH_GLOO: 1\n' - f' TENSORFLOW: 2.3.0\n' - f' KERAS: 2.3.1\n' - f' PYTORCH: 1.6.0\n' - f' PYTORCH_LIGHTNING: 1.2.9\n' - f' TORCHVISION: 0.7.0\n' + f' TENSORFLOW: 2.6.0\n' + f' KERAS: 2.6.0\n' + f' PYTORCH: 1.9.0\n' + f' PYTORCH_LIGHTNING: 1.3.8\n' + f' TORCHVISION: 0.10.0\n' f' MXNET: 1.5.0\n' f'\n' f' steps:\n' @@ -418,11 +491,14 @@ def build_and_test_macos(id: str, name: str, needs: List[str], attempts: int = 3 f' TORCHVISION: ${{{{ matrix.TORCHVISION }}}}\n' f' MXNET: ${{{{ matrix.MXNET }}}}\n' f'\n' + f' # The python patch in the pyenv install step is to work around an incompatibility introduced in new xcode version in macOS Big Sur. The patch is provided by python team.\n' + f' # The original discussion is here https://github.com/pyenv/pyenv/issues/1737\n' f' run: |\n' - f' brew install -f openmpi cmake libuv pyenv coreutils\n' + f' brew reinstall -f zlib bzip2\n' + f' brew install -f openmpi cmake libuv pyenv coreutils curl\n' f' export PATH=$(pyenv root)/shims:$PATH\n' f' pyenv uninstall -f 3.7.7\n' - f' pyenv install 3.7.7\n' + f' CFLAGS="-I$(brew --prefix bzip2)/include -I$(brew --prefix zlib)/include" LDFLAGS="-L$(brew --prefix zlib)/lib -L$(brew --prefix bzip2)/lib" pyenv install --patch 3.7.7 < <(curl -sSL https://github.com/python/cpython/commit/8ea6353.patch)\n' f' pyenv global 3.7.7\n' f' python --version\n' f'\n' @@ -462,17 +538,17 @@ def build_and_test_macos(id: str, name: str, needs: List[str], attempts: int = 3 '\n'.join([f' ${{{{ steps.test-{attempt}.outputs.artifacts-path }}}}' for attempt in range(1, attempts+1)])) - def trigger_buildkite_job(id: str, needs: List[str]) -> str: + def trigger_buildkite_job(id: str, name: str, needs: List[str], mode: str) -> str: if 'init-workflow' not in needs: needs.insert(0, 'init-workflow') return (f' {id}:\n' - f' name: "Build and Test GPU (on Builtkite)"\n' + f' name: "{name}"\n' f' needs: [{", ".join(needs)}]\n' f' runs-on: ubuntu-latest\n' f' if: >\n' f' github.repository == \'horovod/horovod\' &&\n' - f" needs.init-workflow.outputs.run_at_all == 'true' &&\n" - f" needs.init-workflow.outputs.run_builds_and_tests == 'true' &&\n" + f" needs.init-workflow.outputs.run-at-all == 'true' &&\n" + f" needs.init-workflow.outputs.run-builds-and-tests == 'true' &&\n" f' ( github.event_name != \'pull_request\' || github.event.pull_request.head.repo.full_name == github.repository )\n' f'\n' f' steps:\n' @@ -481,12 +557,12 @@ def trigger_buildkite_job(id: str, needs: List[str]) -> str: f' uses: EnricoMi/trigger-pipeline-action@master\n' f' env:\n' f' PIPELINE: "horovod/horovod"\n' - # on "push" event, github.event.pull_request.head.ref will be empty - # and trigger-pipeline-action falls back to github.ref - f' BRANCH: "${{{{ github.event.pull_request.head.ref }}}}"\n' - f' MESSAGE: "GPU Tests triggered by GitHub"\n' + f' # COMMIT is taken from GITHUB_SHA\n' + f' BRANCH: "${{{{ needs.init-workflow.outputs.buildkite-branch-label }}}}"\n' + f' # empty MESSAGE will be filled by Buildkite from commit message\n' + f' MESSAGE: "${{{{ needs.init-workflow.outputs.buildkite-message }}}}"\n' f' BUILDKITE_API_ACCESS_TOKEN: ${{{{ secrets.BUILDKITE_TOKEN }}}}\n' - f' BUILD_ENV_VARS: 
"{{\\"PIPELINE_MODE\\": \\"GPU FULL\\"}}"\n' + f' BUILD_ENV_VARS: "{{\\"PIPELINE_MODE\\": \\"{mode}\\"}}"\n' f'\n' f' - name: Download Buildkite Artifacts\n' f' id: download\n' @@ -497,14 +573,14 @@ def trigger_buildkite_job(id: str, needs: List[str]) -> str: f' buildkite_build_url: ${{{{ steps.build.outputs.url }}}}\n' f' ignore_build_states: blocked,canceled,skipped,not_run\n' f' ignore_job_states: timed_out\n' - f' output_path: artifacts/Unit Test Results - GPUs on Buildkite\n' + f' output_path: artifacts/Unit Test Results - {mode} on Builtkite\n' f'\n' f' - name: Upload Test Results\n' f' uses: actions/upload-artifact@v2\n' f' if: always()\n' f' with:\n' - f' name: Unit Test Results - GPUs on Builtkite\n' - f' path: artifacts/Unit Test Results - GPUs on Buildkite/**/*.xml\n' + f' name: Unit Test Results - {mode} on Builtkite\n' + f' path: artifacts/Unit Test Results - {mode} on Builtkite/**/*.xml\n' + f'\n' f' - name: Check Buildkite job state\n' f' if: >\n' @@ -515,60 +591,6 @@ def trigger_buildkite_job(id: str, needs: List[str]) -> str: f' echo "::warning::Buildkite pipeline did not pass: ${{{{ steps.build.outputs.url }}}}"\n' f' exit 1\n') - def publish_unit_test_results(id: str, needs: List[str]) -> str: - return (f' {id}:\n' - f' name: "Publish Unit Tests Results"\n' - f' needs: [{", ".join(needs)}]\n' - f' runs-on: ubuntu-latest\n' - f' # only run this job when the workflow is in success or failure state,\n' - f' # not when it is in cancelled or skipped state\n' - f' # only run this job on push events or when the event does not run in a fork repository\n' - f' if: >\n' - f' ( success() || failure() ) &&\n' - f" needs.init-workflow.outputs.run_at_all == 'true' &&\n" - f' ( github.event_name == \'push\' || ! github.event.head.repo.fork )\n' - f'\n' - f' steps:\n' - f' - name: Download GitHub Artifacts\n' - f' uses: actions/download-artifact@v2\n' - f' with:\n' - f' path: artifacts\n' - f'\n' - f' - name: Identify last run of each test\n' - f' continue-on-error: true\n' - f' run: |\n' - f' declare -A last_runs\n' - f' ls -d artifacts/Unit\\ Test\\ Results\\ */* | sort > runs.txt\n' - f' while read run\n' - f' do\n' - f' test=${{run/%[_-]run[_-][0123456789]/}}\n' - f' last_runs[$test]=$run\n' - f' done < runs.txt\n' - f'\n' - f' echo "LAST_RUNS<> $GITHUB_ENV\n' - f' for test in "${{!last_runs[@]}}"\n' - f' do\n' - f' echo "${{last_runs[$test]}}" >&2\n' - f' echo "${{last_runs[$test]}}/**/*.xml" >> $GITHUB_ENV\n' - f' done\n' - f' echo "EOF" >> $GITHUB_ENV\n' - f' shell: bash\n' - f'\n' - f' - name: Publish Unit Test Results\n' - f' uses: EnricoMi/publish-unit-test-result-action@v1\n' - f' if: always()\n' - f' with:\n' - f' check_name: Unit Test Results\n' - f' files: "${{{{ env.LAST_RUNS }}}}"\n' - f'\n' - f' - name: Publish Unit Test Results (with flaky tests)\n' - f' uses: EnricoMi/publish-unit-test-result-action@v1\n' - f' if: always()\n' - f' with:\n' - f' check_name: Unit Test Results (with flaky tests)\n' - f' fail_on: errors\n' - f' files: "artifacts/Unit Test Results */**/*.xml"\n') - def publish_docker_images(needs: List[str], images: List[str]) -> str: if 'init-workflow' not in needs: needs.insert(0, 'init-workflow') @@ -577,13 +599,13 @@ def publish_docker_images(needs: List[str], images: List[str]) -> str: return (f' docker-config:\n' f' name: Configure docker build\n' f' needs: [{", ".join(needs)}]\n' - f" # build-and-test-cpu, build-gpu and buildkite might have been skipped (! 
needs.init-workflow.outputs.run_builds_and_tests)\n" + f" # build-and-test-cpu, build-gpu and buildkite might have been skipped (! needs.init-workflow.outputs.run-builds-and-tests)\n" f' # buildkite might have been skipped (workflow runs for a fork PR),\n' f' # we still want to build docker images (though we might not want to push them)\n' f' if: >\n' f' always() &&\n' - f" needs.init-workflow.outputs.run_at_all == 'true' &&\n" - f" needs.init-workflow.outputs.run_builds_and_tests == 'true' &&\n" + f" needs.init-workflow.outputs.run-at-all == 'true' &&\n" + f" needs.init-workflow.outputs.run-builds-and-tests == 'true' &&\n" f" needs.build-and-test.result == 'success' &&\n" f" ( needs.buildkite.result == 'success' || needs.buildkite.result == 'skipped' )\n" f' runs-on: ubuntu-latest\n' @@ -779,8 +801,8 @@ def sync_files(needs: List[str]) -> str: build_and_test_images(id='build-and-test', name='Build and Test', needs=['init-workflow'], images=release_images, parallel_images='-cpu-', tests_per_image=tests_per_image, tests=tests), build_and_test_images(id='build-and-test-heads', name='Build and Test heads', needs=['build-and-test'], images=allhead_images, parallel_images='', tests_per_image=tests_per_image, tests=tests), build_and_test_macos(id='build-and-test-macos', name='Build and Test macOS', needs=['build-and-test']), - trigger_buildkite_job(id='buildkite', needs=['build-and-test']), - publish_unit_test_results(id='publish-test-results', needs=['build-and-test', 'build-and-test-heads', 'build-and-test-macos', 'buildkite']), + trigger_buildkite_job(id='buildkite', name='Build and Test GPU (on Builtkite)', needs=['build-and-test'], mode='GPU NON HEADS'), + trigger_buildkite_job(id='buildkite-heads', name='Build and Test GPU heads (on Builtkite)', needs=['buildkite'], mode='GPU HEADS'), publish_docker_images(needs=['build-and-test', 'buildkite'], images=['horovod', 'horovod-cpu', 'horovod-ray']), sync_files(needs=['init-workflow']) ) diff --git a/.github/get-changed-code-files.py b/.github/get-changed-code-files.py index 69deb260a0..0e68121353 100644 --- a/.github/get-changed-code-files.py +++ b/.github/get-changed-code-files.py @@ -7,8 +7,8 @@ import requests # this script outputs all code files that have changed between commit and master -# environment variable GITHUB_HEAD provides the commit SHA -# environment variable GITHUB_BASE provides the master SHA +# environment variable GITHUB_HEAD_SHA provides the commit SHA +# environment variable GITHUB_BASE_SHA provides the master SHA # files that match any of these regexps are considered non-code files # even though those files have changed, they will not be in the output of this script @@ -49,8 +49,8 @@ def is_non_code_file(file): if __name__ == "__main__": logging.getLogger().level = logging.DEBUG - base = os.environ.get('GITHUB_BASE') - head = os.environ.get('GITHUB_HEAD') + base = os.environ.get('GITHUB_BASE_SHA') + head = os.environ.get('GITHUB_HEAD_SHA') if head is None or base is None: logging.warning('no base commit ({}) or head commit ({}) given'.format(base, head)) sys.exit(1) diff --git a/.github/workflows/ci-fork.yaml b/.github/workflows/ci-fork.yaml deleted file mode 100644 index 3b7907c5fd..0000000000 --- a/.github/workflows/ci-fork.yaml +++ /dev/null @@ -1,121 +0,0 @@ -name: CI (Fork) - -on: - workflow_run: - workflows: ["CI"] - types: - - completed - -jobs: - debug: - runs-on: ubuntu-latest - steps: - - name: Debug Action - uses: hmarr/debug-action@v1.0.0 - - ci-workflow: - name: "Check CI workflow outcome" - runs-on: 
ubuntu-latest - # only run if CI workflow ran on a fork - if: > - github.event.workflow_run.conclusion != 'skipped' && - github.event.workflow_run.conclusion != 'cancelled' && - github.event.workflow_run.head_repository.fork - outputs: - build-and-test: ${{ steps.workflow-conclusion.outputs.build-and-test }} - - steps: - - name: Fetch workflow conclusion - id: workflow-conclusion - run: | - curl -s "${{ github.event.workflow_run.jobs_url }}" > workflow_run_jobs.json - conclusion=$(jq -r '.jobs[] | select(.name | startswith("Build and Test (")) | .conclusion' workflow_run_jobs.json | sort | uniq | paste -sd "," -) - echo "::set-output name=build-and-test::${conclusion}" - shell: bash - - buildkite: - name: "Build and Test GPU (on Builtkite)" - needs: [ci-workflow] - runs-on: ubuntu-latest - # only run if CI workflow's build-and-test job succeeded and CI workflow ran on a fork - if: needs.ci-workflow.outputs.build-and-test == 'success' - - steps: - - name: Trigger Buildkite Pipeline - id: buildkite - uses: EnricoMi/trigger-pipeline-action@master - env: - PIPELINE: "horovod/horovod" - COMMIT: "${{ github.event.workflow_run.head_sha }}" - BRANCH: "${{ github.event.workflow_run.head_repository.owner.login }}:${{ github.event.workflow_run.head_branch }}" - MESSAGE: "${{ github.event.workflow_run.message }}" - BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_TOKEN }} - BUILD_ENV_VARS: "{\"PIPELINE_MODE\": \"GPU FULL\"}" - - - name: Download Buildkite Artifacts - id: download - uses: docker://ghcr.io/enricomi/download-buildkite-artifact-action:v1 - with: - github_token: ${{ github.token }} - buildkite_token: ${{ secrets.BUILDKITE_TOKEN }} - buildkite_build_url: ${{ steps.buildkite.outputs.url }} - ignore_build_states: blocked,canceled,skipped,not_run - ignore_job_states: timed_out - output_path: artifacts/Unit Test Results - GPUs on Buildkite - - - name: Upload Test Results - uses: actions/upload-artifact@v2 - if: always() - with: - name: Unit Test Results - GPUs on Builtkite - path: artifacts/Unit Test Results - GPUs on Buildkite/**/*.xml - - - name: Check Buildkite job state - if: > - always() && - steps.download.conclusion == 'success' && - steps.download.outputs.build-state != 'passed' - run: | - echo "::warning::Buildkite pipeline did not pass: ${{ steps.buildkite.outputs.url }}" - exit 1 - - publish-test-results: - name: "Publish Unit Tests Results" - needs: [buildkite] - runs-on: ubuntu-latest - # only run if CI workflow ran on a fork - if: > - always() && - github.event.workflow_run.conclusion != 'skipped' && - github.event.workflow_run.conclusion != 'cancelled' && - github.event.workflow_run.head_repository.fork - - steps: - - name: Debug Action - uses: hmarr/debug-action@v2.0.0 - - - name: Download and Extract Artifacts - env: - GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}} - run: | - mkdir -p artifacts && cd artifacts - - artifacts_url=${{ github.event.workflow_run.artifacts_url }} - - gh api "$artifacts_url" -q '.artifacts[] | [.name, .archive_download_url] | @tsv' | while read artifact - do - IFS=$'\t' read name url <<< "$artifact" - gh api $url > "$name.zip" - unzip -d "$name" "$name.zip" - done - - - name: Download Buildkite Artifacts - uses: actions/download-artifact@v2 - with: - path: artifacts - - - name: Publish Unit Test Results - uses: EnricoMi/publish-unit-test-result-action@v1 - with: - commit: ${{ github.event.workflow_run.head_sha }} - files: "artifacts/*/**/*.xml" diff --git a/.github/workflows/ci-results.yaml b/.github/workflows/ci-results.yaml new file mode 100644 index 
0000000000..6608b4df1a --- /dev/null +++ b/.github/workflows/ci-results.yaml @@ -0,0 +1,237 @@ +# publishes test results from the CI workflow (not when run on schedule) +# this publishes test results of PRs from horovod repository and fork repositories +# buildkite tests are only run here for fork repositories +name: CI (Results) + +on: + workflow_run: + workflows: ["CI"] + types: + - completed + +jobs: + debug: + runs-on: ubuntu-latest + steps: + - name: Debug Action + uses: hmarr/debug-action@v1.0.0 + + ci-workflow: + name: "Check CI workflow outcome" + runs-on: ubuntu-latest + # only run if CI workflow has not been skipped or cancelled + # only run if CI workflow did not run on schedule + if: > + github.event.workflow_run.conclusion != 'skipped' && + github.event.workflow_run.conclusion != 'cancelled' && + github.event.workflow_run.event != 'schedule' + outputs: + build-and-test: ${{ steps.workflow-conclusion.outputs.build-and-test }} + pr-json: ${{ steps.pr.outputs.json }} + + steps: + - name: Fetch workflow conclusion + id: workflow-conclusion + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + run: | + conclusion=$(gh api "${{ github.event.workflow_run.jobs_url }}" -q '.jobs[] | select(.name | startswith("Build and Test (")) | .conclusion' | sort | uniq | paste -sd "," -) + echo "build-and-test conclusion: ${conclusion}" + echo "::set-output name=build-and-test::${conclusion}" + shell: bash + + - name: Fetch PR meta + id: pr + if: github.event.workflow_run.event == 'pull_request' && github.event.workflow_run.head_repository.fork + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + run: | + artifacts_url=${{ github.event.workflow_run.artifacts_url }} + gh api "$artifacts_url" -q '.artifacts[] | select(.name == "PR Meta") .archive_download_url' | while read url + do + gh api "$url" > "pr.zip" + unzip -o "pr.zip" + echo "::set-output name=json::$(cat pr.json)" + cat pr.json + echo + done + + if [[ ! -e "pr.json" ]] + then + echo "::error title=Artifact 'PR Meta' missing::Expected artifact 'PR Meta' does not exist for pull_request event." 
+ exit 1 + fi + + buildkite: + name: "Build and Test GPU (on Builtkite)" + needs: [ci-workflow] + runs-on: ubuntu-latest + # only run if CI workflow's build-and-test job succeeded and CI workflow ran on a fork + if: > + needs.ci-workflow.outputs.build-and-test == 'success' && + github.event.workflow_run.head_repository.fork + + steps: + - name: Trigger Buildkite Pipeline + id: buildkite + uses: EnricoMi/trigger-pipeline-action@master + env: + PIPELINE: "horovod/horovod" + COMMIT: "${{ fromJSON( needs.ci-workflow.outputs.pr-json ).merge_sha }}" + BRANCH: "${{ github.event.workflow_run.head_repository.owner.login }}:${{ github.event.workflow_run.head_branch }}" + MESSAGE: "${{ github.event.workflow_run.head_commit.message }} (release versions)" + BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_TOKEN }} + BUILD_ENV_VARS: "{\"PIPELINE_MODE\": \"GPU NON HEADS\"}" + + - name: Download Buildkite Artifacts + id: download + uses: docker://ghcr.io/enricomi/download-buildkite-artifact-action:v1 + with: + github_token: ${{ github.token }} + buildkite_token: ${{ secrets.BUILDKITE_TOKEN }} + buildkite_build_url: ${{ steps.buildkite.outputs.url }} + ignore_build_states: blocked,canceled,skipped,not_run + ignore_job_states: timed_out + output_path: artifacts/Unit Test Results - GPU NON HEADS on Builtkite + + - name: Upload Test Results + uses: actions/upload-artifact@v2 + if: always() + with: + name: Unit Test Results - GPU NON HEADS on Builtkite + path: artifacts/Unit Test Results - GPU NON HEADS on Builtkite/**/*.xml + + - name: Check Buildkite job state + if: > + always() && + steps.download.conclusion == 'success' && + steps.download.outputs.build-state != 'passed' + run: | + echo "::warning::Buildkite pipeline did not pass: ${{ steps.buildkite.outputs.url }}" + exit 1 + + buildkite-heads: + name: "Build and Test GPU heads (on Builtkite)" + needs: [ci-workflow, buildkite] + runs-on: ubuntu-latest + # only run if CI workflow's build-and-test job succeeded and CI workflow ran on a fork + if: > + needs.ci-workflow.outputs.build-and-test == 'success' && + github.event.workflow_run.head_repository.fork + + steps: + - name: Trigger Buildkite Pipeline + id: buildkite + uses: EnricoMi/trigger-pipeline-action@master + env: + PIPELINE: "horovod/horovod" + COMMIT: "${{ fromJSON( needs.ci-workflow.outputs.pr-json ).merge_sha }}" + BRANCH: "${{ github.event.workflow_run.head_repository.owner.login }}:${{ github.event.workflow_run.head_branch }}" + MESSAGE: "${{ github.event.workflow_run.head_commit.message }} (head versions)" + BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_TOKEN }} + BUILD_ENV_VARS: "{\"PIPELINE_MODE\": \"GPU HEADS\"}" + + - name: Download Buildkite Artifacts + id: download + uses: docker://ghcr.io/enricomi/download-buildkite-artifact-action:v1 + with: + github_token: ${{ github.token }} + buildkite_token: ${{ secrets.BUILDKITE_TOKEN }} + buildkite_build_url: ${{ steps.buildkite.outputs.url }} + ignore_build_states: blocked,canceled,skipped,not_run + ignore_job_states: timed_out + output_path: artifacts/Unit Test Results - GPU HEADS on Builtkite + + - name: Upload Test Results + uses: actions/upload-artifact@v2 + if: always() + with: + name: Unit Test Results - GPU HEADS on Builtkite + path: artifacts/Unit Test Results - GPU HEADS on Builtkite/**/*.xml + + - name: Check Buildkite job state + if: > + always() && + steps.download.conclusion == 'success' && + steps.download.outputs.build-state != 'passed' + run: | + echo "::warning::Buildkite pipeline did not pass: ${{ 
steps.buildkite.outputs.url }}" + exit 1 + + publish-test-results: + name: "Publish Unit Tests Results" + needs: [ci-workflow, buildkite, buildkite-heads] + runs-on: ubuntu-latest + # only publish results when ci-workflow job has not been skipped, meaning: + # - CI workflow has not been skipped or cancelled + # - CI workflow did not run on schedule + # and CI workflow's build-and-test jobs have not all been skipped + if: > + always() && + needs.ci-workflow.result != 'skipped' && + needs.ci-workflow.outputs.build-and-test != 'skipped' + + steps: + - name: Debug Action + uses: hmarr/debug-action@v2.0.0 + + - name: Download and Extract Artifacts + env: + GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}} + run: | + mkdir -p artifacts && cd artifacts + + artifacts_url=${{ github.event.workflow_run.artifacts_url }} + + gh api "$artifacts_url" -q '.artifacts[] | [.name, .archive_download_url] | @tsv' | while read artifact + do + IFS=$'\t' read name url <<< "$artifact" + gh api $url > "$name.zip" + unzip -d "$name" "$name.zip" + done + + - name: Download Buildkite Artifacts + uses: actions/download-artifact@v2 + with: + path: artifacts + + - name: Identify last run of each test + continue-on-error: true + run: | + declare -A last_runs + ls -d artifacts/Unit\ Test\ Results\ */* | sort > runs.txt + while read run + do + test=${run/%[_-]run[_-][0123456789]/} + last_runs[$test]=$run + done < runs.txt + + echo "LAST_RUNS<> $GITHUB_ENV + for test in "${!last_runs[@]}" + do + echo "${last_runs[$test]}" >&2 + echo "${last_runs[$test]}/**/*.xml" >> $GITHUB_ENV + done + echo "EOF" >> $GITHUB_ENV + shell: bash + + - name: Publish Unit Test Results + uses: EnricoMi/publish-unit-test-result-action@v1 + if: always() + with: + check_name: Unit Test Results + event_file: artifacts/Event File/event.json + event_name: ${{ github.event.workflow_run.event }} + commit: ${{ github.event.workflow_run.head_sha }} + files: "${{ env.LAST_RUNS }}" + + - name: Publish Unit Test Results (with flaky tests) + uses: EnricoMi/publish-unit-test-result-action@v1 + if: always() + with: + check_name: Unit Test Results (with flaky tests) + event_file: artifacts/Event File/event.json + event_name: ${{ github.event.workflow_run.event }} + commit: ${{ github.event.workflow_run.head_sha }} + files: "artifacts/Unit Test Results */**/*.xml" + fail_on: errors diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml index 96d184ba60..7c77c2c7bd 100644 --- a/.github/workflows/ci.yaml +++ b/.github/workflows/ci.yaml @@ -4,6 +4,7 @@ name: CI on: schedule: + # run a build on master (this does not publish test results or cancel concurrent builds) - cron: '0 10 * * *' # everyday at 10am push: # only consider push to master and tags @@ -11,14 +12,28 @@ on: branches: [ master ] tags: [ 'v*.*.*' ] pull_request: + # only consider pull requests into master branches: [ master ] concurrency: - # github.ref means something like refs/heads/master or refs/tags/v0.22.1 or the branch. 
- # This helps to not cancel concurrent runs on master and a tag that share the same commit - # On master, head_ref is empty, so we use the SHA of the commit, this means - # individual commits to master will not be cancelled, but tagged - group: ci-${{ github.ref }}-${{ github.head_ref || github.sha }} + # This controls which concurrent builds to cancel: + # - we do not want any concurrent builds on a branch (pull_request) + # - we do not want concurrent builds on the same commit on master (push) + # - we do not want concurrent builds on the same commit on a tag (push) + # - we allow concurrent runs on the same commit on master and its tag (push) + # - we allow concurrent runs on the same commit on master (push) and a scheduled build (schedule) + # + # A pull_request event only runs on branch commit, a push event only on master and tag commit. + # A schedule event only runs on master HEAD commit. + # + # Expression github.ref means something like refs/heads/master or refs/tags/v0.22.1 or the branch. + # This helps to not cancel concurrent runs on master or a tag that share the same commit. + # Expression github.head_ref refers to the branch of the pull request. + # On master, github.head_ref is empty, so we use the SHA of the commit, this means individual + # commits to master will not be cancelled, while there can only be one concurrent build on a branch. + # + # We include the event name to we allow for concurrent scheduled and master builds. + group: ci-${{ github.event_name }}-${{ github.ref }}-${{ github.head_ref || github.sha }} cancel-in-progress: true jobs: @@ -27,13 +42,28 @@ jobs: steps: - name: Debug Action uses: hmarr/debug-action@v1.0.0 + - name: Debug Concurrency + run: echo "ci-${{ github.event_name }}-${{ github.ref }}-${{ github.head_ref || github.sha }}" + + event_file: + name: "Event File" + runs-on: ubuntu-latest + steps: + - name: Upload + uses: actions/upload-artifact@v2 + with: + name: Event File + path: ${{ github.event_path }} + init-workflow: name: "Init Workflow" runs-on: ubuntu-latest outputs: - run_at_all: ${{ github.event_name != 'schedule' || github.repository == 'horovod/horovod' }} + run-at-all: ${{ github.event_name != 'schedule' || github.repository == 'horovod/horovod' }} # if we don't get a clear 'false', we fall back to building and testing - run_builds_and_tests: ${{ steps.tests.outputs.needed != 'false' }} + run-builds-and-tests: ${{ steps.tests.outputs.needed != 'false' }} + buildkite-branch-label: "${{ steps.config-buildkite.outputs.branch-label }}" + buildkite-message: "${{ steps.config-buildkite.outputs.message }}" steps: - name: Checkout @@ -58,8 +88,8 @@ jobs: - name: Check if tests are needed id: tests env: - GITHUB_BASE: ${{ github.event.pull_request.base.sha }} - GITHUB_HEAD: ${{ github.event.pull_request.head.sha }} + GITHUB_BASE_SHA: ${{ github.event.pull_request.base.sha }} + GITHUB_HEAD_SHA: ${{ github.event.pull_request.head.sha }} run: | if [[ "${{ github.event_name }}" == "pull_request" ]] then @@ -78,12 +108,55 @@ jobs: echo "::set-output name=needed::true" fi + - name: Configure Buildkite Build + id: config-buildkite + env: + GITHUB_TOKEN: ${secrets.GITHUB_TOKEN} + run: | + branch="${{ github.event.pull_request.head.ref || github.ref }}" + branch="${branch#"refs/heads/"}" + branch="${branch#"refs/tags/"}" + + branch_label="${branch}" + if [[ "${{ github.event_name }}" == "schedule" ]] + then + # we add this label to the branch used by Buildkite to avoid it cancelling one of concurrent schedule and push builds on master + 
branch_label="${branch} (schedule)" + fi + echo "::set-output name=branch-label::${branch_label}" + + if [[ "${{ github.event_name }}" == "pull_request" ]] + then + head_sha="${{ github.event.pull_request.head.sha }}" + message="$(gh api https://api.github.com/repos/horovod/horovod/commits/${head_sha} -q .commit.message | head -n1)" + echo "::set-output name=message::${message}" + fi + + - name: Provide PR meta + if: github.event_name == 'pull_request' + run: | + rm -f pr.json + echo -n "{" >> pr.json + echo -n " \"merge_sha\": \"${{ github.sha }}\"," >> pr.json + echo -n " \"base_sha\": \"${{ github.event.pull_request.base.sha }}\"," >> pr.json + echo -n " \"head_sha\": \"${{ github.event.pull_request.head.sha }}\" " >> pr.json + echo -n "}" >> pr.json + cat pr.json + + - name: Upload PR meta + uses: actions/upload-artifact@v2 + if: github.event_name == 'pull_request' + with: + name: PR Meta + path: pr.json + + build-and-test: name: "Build and Test (${{ matrix.image }})" needs: [init-workflow] if: > - needs.init-workflow.outputs.run_at_all == 'true' && - needs.init-workflow.outputs.run_builds_and_tests == 'true' + needs.init-workflow.outputs.run-at-all == 'true' && + needs.init-workflow.outputs.run-builds-and-tests == 'true' runs-on: ubuntu-latest strategy: @@ -104,15 +177,9 @@ jobs: Single_Keras_MNIST: true Single_MXNet_MNIST: true Single_PyTorch_MNIST: true - Spark_Keras_MNIST: true - Spark_Keras_Rossmann_Estimator: true - Spark_Keras_Rossmann_Run: true - Spark_Lightning_MNIST: true - Spark_PyTests: true - Spark_Torch_MNIST: true build_timeout: 30 - - image: test-cpu-gloo-py3_7-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark2_4_8 + - image: test-cpu-gloo-py3_7-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark2_4_8 Elastic_Spark_TensorFlow_Tests_1: true Elastic_Spark_Torch_Tests: true Elastic_Tests_1: true @@ -125,12 +192,9 @@ jobs: Gloo_TensorFlow_2_0_MNIST: true Single_MXNet_MNIST: true Single_PyTorch_MNIST: true - Spark_Lightning_MNIST: true - Spark_PyTests: true - Spark_Torch_MNIST: true build_timeout: 30 - - image: test-cpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_7_0_p2-pyspark3_1_2 + - image: test-cpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_7_0_p2-pyspark3_1_2 Elastic_Tests_1: true Gloo_Cluster_PyTests: true Gloo_MXNet_MNIST: true @@ -141,12 +205,9 @@ jobs: Gloo_TensorFlow_2_0_MNIST: true Single_MXNet_MNIST: true Single_PyTorch_MNIST: true - Spark_Lightning_MNIST: true - Spark_PyTests: true - Spark_Torch_MNIST: true build_timeout: 30 - - image: test-cpu-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_0_3 + - image: test-cpu-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_0_3 Elastic_Spark_TensorFlow_Tests_1: true Elastic_Spark_Torch_Tests: true Elastic_Tests_1: true @@ -159,12 +220,9 @@ jobs: Gloo_TensorFlow_2_0_MNIST: true Single_MXNet_MNIST: true Single_PyTorch_MNIST: true - Spark_Lightning_MNIST: true - Spark_PyTests: true - Spark_Torch_MNIST: true build_timeout: 30 - - image: test-cpu-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + - image: test-cpu-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 Elastic_Spark_TensorFlow_Tests_1: true Elastic_Spark_Torch_Tests: true Elastic_Tests_1: true @@ -177,12 +235,9 @@ jobs: Gloo_TensorFlow_2_0_MNIST: true Single_MXNet_MNIST: true Single_PyTorch_MNIST: true - Spark_Lightning_MNIST: true - Spark_PyTests: true - Spark_Torch_MNIST: true build_timeout: 30 - - image: test-cpu-mpich-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + - 
image: test-cpu-mpich-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 MPI_Cluster_PyTests: true MPI_MXNet_MNIST: true MPI_Parallel_PyTests: true @@ -194,7 +249,7 @@ jobs: Single_PyTorch_MNIST: true build_timeout: 30 - - image: test-cpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + - image: test-cpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 Elastic_Tests_1: true Gloo_Cluster_PyTests: true Gloo_MXNet_MNIST: true @@ -213,12 +268,9 @@ jobs: Run_PyTests_test_interactiverun: true Single_MXNet_MNIST: true Single_PyTorch_MNIST: true - Spark_Lightning_MNIST: true - Spark_PyTests: true - Spark_Torch_MNIST: true build_timeout: 30 - - image: test-cpu-openmpi-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + - image: test-cpu-openmpi-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 MPI_Cluster_PyTests: true MPI_MXNet_MNIST: true MPI_Parallel_PyTests: true @@ -229,12 +281,9 @@ jobs: Run_PyTests_test_interactiverun: true Single_MXNet_MNIST: true Single_PyTorch_MNIST: true - Spark_Lightning_MNIST: true - Spark_PyTests: true - Spark_Torch_MNIST: true build_timeout: 30 - - image: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2 + - image: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2 build_timeout: 40 - image: test-gpu-gloo-py3_8-tf2_4_3-keras2_3_1-torch1_7_1-mxnet1_6_0_p0-pyspark3_1_2 @@ -243,13 +292,13 @@ jobs: - image: test-gpu-gloo-py3_8-tf2_4_3-keras2_3_1-torch1_7_1-mxnet1_7_0_p1-pyspark3_1_2 build_timeout: 40 - - image: test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 + - image: test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 build_timeout: 40 - - image: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + - image: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 build_timeout: 40 - - image: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + - image: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 build_timeout: 40 steps: @@ -350,7 +399,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_TensorFlow_Tests_1 && true run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow2.py" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow2.py" 
shell: bash - name: "Elastic Spark TensorFlow Tests 1 [attempt 2 of 3]" @@ -359,7 +408,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_TensorFlow_Tests_1 && steps.Elastic_Spark_TensorFlow_Tests_1_run_1.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow2.py" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow2.py" shell: bash - name: "Elastic Spark TensorFlow Tests 1 [attempt 3 of 3]" @@ -368,7 +417,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_TensorFlow_Tests_1 && steps.Elastic_Spark_TensorFlow_Tests_1_run_2.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow2.py" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow2.py" shell: bash - name: "Elastic Spark TensorFlow Tests 2 [attempt 1 of 3]" @@ -377,7 +426,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_TensorFlow_Tests_2 && true run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 
--log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow.py" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow.py" shell: bash - name: "Elastic Spark TensorFlow Tests 2 [attempt 2 of 3]" @@ -386,7 +435,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_TensorFlow_Tests_2 && steps.Elastic_Spark_TensorFlow_Tests_2_run_1.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow.py" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow.py" shell: bash - name: "Elastic Spark TensorFlow Tests 2 [attempt 3 of 3]" @@ -395,7 +444,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_TensorFlow_Tests_2 && steps.Elastic_Spark_TensorFlow_Tests_2_run_2.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow.py" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s 
%(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow.py" shell: bash - name: "Elastic Spark Torch Tests [attempt 1 of 3]" @@ -404,7 +453,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_Torch_Tests && true run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.torch.xml test_elastic_spark_torch.py" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.torch.xml test_elastic_spark_torch.py" shell: bash - name: "Elastic Spark Torch Tests [attempt 2 of 3]" @@ -413,7 +462,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_Torch_Tests && steps.Elastic_Spark_Torch_Tests_run_1.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.torch.xml test_elastic_spark_torch.py" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.torch.xml test_elastic_spark_torch.py" shell: bash - name: "Elastic Spark Torch Tests [attempt 3 of 3]" @@ -422,7 +471,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_Torch_Tests && steps.Elastic_Spark_Torch_Tests_run_2.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh 
HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.torch.xml test_elastic_spark_torch.py" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.torch.xml test_elastic_spark_torch.py" shell: bash - name: "Elastic Tests 1 [attempt 1 of 3]" @@ -593,7 +642,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Gloo_Parallel_PyTests && true run: | mkdir -p artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" shell: bash - name: "Gloo Parallel PyTests [attempt 2 of 3]" @@ -602,7 +651,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Gloo_Parallel_PyTests && steps.Gloo_Parallel_PyTests_run_1.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" shell: bash - name: "Gloo Parallel PyTests [attempt 3 of 3]" @@ -611,7 +660,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Gloo_Parallel_PyTests && steps.Gloo_Parallel_PyTests_run_2.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_3:/artifacts" ${{ matrix.image }} 
/usr/bin/timeout 10m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" shell: bash - name: "Gloo PyTorch MNIST [attempt 1 of 3]" @@ -917,7 +966,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests && true run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI Parallel PyTests [attempt 2 of 3]" @@ -926,7 +975,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests && steps.MPI_Parallel_PyTests_run_1.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI Parallel PyTests [attempt 3 of 3]" @@ -935,7 +984,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests && steps.MPI_Parallel_PyTests_run_2.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI Parallel PyTests [ONECCL MPI] [attempt 1 of 3]" @@ -944,7 +993,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests_ONECCL_MPI && true run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_mpi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) 
/bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_mpi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI Parallel PyTests [ONECCL MPI] [attempt 2 of 3]" @@ -953,7 +1002,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests_ONECCL_MPI && steps.MPI_Parallel_PyTests_ONECCL_MPI_run_1.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_mpi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_mpi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI Parallel PyTests [ONECCL MPI] [attempt 3 of 3]" @@ -962,7 +1011,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests_ONECCL_MPI && steps.MPI_Parallel_PyTests_ONECCL_MPI_run_2.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_mpi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_mpi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI Parallel PyTests [ONECCL OFI] [attempt 1 of 3]" @@ -971,7 +1020,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests_ONECCL_OFI && true run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_ofi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_1:/artifacts" 
${{ matrix.image }} /usr/bin/timeout 10m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_ofi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI Parallel PyTests [ONECCL OFI] [attempt 2 of 3]" @@ -980,7 +1029,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests_ONECCL_OFI && steps.MPI_Parallel_PyTests_ONECCL_OFI_run_1.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_ofi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_ofi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI Parallel PyTests [ONECCL OFI] [attempt 3 of 3]" @@ -989,7 +1038,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests_ONECCL_OFI && steps.MPI_Parallel_PyTests_ONECCL_OFI_run_2.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_ofi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_ofi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI PyTorch MNIST [attempt 1 of 3]" @@ -1559,168 +1608,6 @@ jobs: docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Single_PyTorch_MNIST_ONECCL_OFI_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_ofi' > /mpirun_command && python /horovod/examples/pytorch/pytorch_mnist.py --epochs 3 --data-dir /data/pytorch_datasets" shell: bash - - name: "Spark Keras MNIST [attempt 1 of 3]" - id: Spark_Keras_MNIST_run_1 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Keras_MNIST && true - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_MNIST_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_MNIST_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_mnist.py 
--num-proc 2 --work-dir /work --data-dir /data --epochs 3" - shell: bash - - - name: "Spark Keras MNIST [attempt 2 of 3]" - id: Spark_Keras_MNIST_run_2 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Keras_MNIST && steps.Spark_Keras_MNIST_run_1.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_MNIST_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_MNIST_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - shell: bash - - - name: "Spark Keras MNIST [attempt 3 of 3]" - id: Spark_Keras_MNIST_run_3 - continue-on-error: false - if: always() && steps.build.outcome == 'success' && matrix.Spark_Keras_MNIST && steps.Spark_Keras_MNIST_run_2.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_MNIST_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_MNIST_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - shell: bash - - - name: "Spark Keras Rossmann Estimator [attempt 1 of 3]" - id: Spark_Keras_Rossmann_Estimator_run_1 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Keras_Rossmann_Estimator && true - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Estimator_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Estimator_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_rossmann_estimator.py --num-proc 2 --work-dir /work --data-dir file:///data --epochs 3 --sample-rate 0.01" - shell: bash - - - name: "Spark Keras Rossmann Estimator [attempt 2 of 3]" - id: Spark_Keras_Rossmann_Estimator_run_2 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Keras_Rossmann_Estimator && steps.Spark_Keras_Rossmann_Estimator_run_1.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Estimator_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Estimator_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_rossmann_estimator.py --num-proc 2 --work-dir /work --data-dir file:///data --epochs 3 --sample-rate 0.01" - shell: bash - - - name: "Spark Keras Rossmann Estimator [attempt 3 of 3]" - id: Spark_Keras_Rossmann_Estimator_run_3 - continue-on-error: false - if: always() && steps.build.outcome == 'success' && matrix.Spark_Keras_Rossmann_Estimator && steps.Spark_Keras_Rossmann_Estimator_run_2.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Estimator_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Estimator_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m 
bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_rossmann_estimator.py --num-proc 2 --work-dir /work --data-dir file:///data --epochs 3 --sample-rate 0.01" - shell: bash - - - name: "Spark Keras Rossmann Run [attempt 1 of 3]" - id: Spark_Keras_Rossmann_Run_run_1 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Keras_Rossmann_Run && true - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Run_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Run_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_rossmann_run.py --num-proc 2 --data-dir file:///data --epochs 3 --sample-rate 0.01" - shell: bash - - - name: "Spark Keras Rossmann Run [attempt 2 of 3]" - id: Spark_Keras_Rossmann_Run_run_2 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Keras_Rossmann_Run && steps.Spark_Keras_Rossmann_Run_run_1.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Run_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Run_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_rossmann_run.py --num-proc 2 --data-dir file:///data --epochs 3 --sample-rate 0.01" - shell: bash - - - name: "Spark Keras Rossmann Run [attempt 3 of 3]" - id: Spark_Keras_Rossmann_Run_run_3 - continue-on-error: false - if: always() && steps.build.outcome == 'success' && matrix.Spark_Keras_Rossmann_Run && steps.Spark_Keras_Rossmann_Run_run_2.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Run_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Run_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_rossmann_run.py --num-proc 2 --data-dir file:///data --epochs 3 --sample-rate 0.01" - shell: bash - - - name: "Spark Lightning MNIST [attempt 1 of 3]" - id: Spark_Lightning_MNIST_run_1 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Lightning_MNIST && true - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Lightning_MNIST_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Lightning_MNIST_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_lightning_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - shell: bash - - - name: "Spark Lightning MNIST [attempt 2 of 3]" - id: Spark_Lightning_MNIST_run_2 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Lightning_MNIST && steps.Spark_Lightning_MNIST_run_1.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Lightning_MNIST_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Lightning_MNIST_run_2:/artifacts" ${{ matrix.image }} 
/usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_lightning_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - shell: bash - - - name: "Spark Lightning MNIST [attempt 3 of 3]" - id: Spark_Lightning_MNIST_run_3 - continue-on-error: false - if: always() && steps.build.outcome == 'success' && matrix.Spark_Lightning_MNIST && steps.Spark_Lightning_MNIST_run_2.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Lightning_MNIST_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Lightning_MNIST_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_lightning_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - shell: bash - - - name: "Spark PyTests [attempt 1 of 3]" - id: Spark_PyTests_run_1 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_PyTests && true - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_PyTests_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_PyTests_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && (ls -1 test_spark*.py | xargs -n 1 /bin/bash /pytest_standalone.sh spark)" - shell: bash - - - name: "Spark PyTests [attempt 2 of 3]" - id: Spark_PyTests_run_2 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_PyTests && steps.Spark_PyTests_run_1.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_PyTests_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_PyTests_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && (ls -1 test_spark*.py | xargs -n 1 /bin/bash /pytest_standalone.sh spark)" - shell: bash - - - name: "Spark PyTests [attempt 3 of 3]" - id: Spark_PyTests_run_3 - continue-on-error: false - if: always() && steps.build.outcome == 'success' && matrix.Spark_PyTests && steps.Spark_PyTests_run_2.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_PyTests_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_PyTests_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && (ls -1 test_spark*.py | xargs -n 1 /bin/bash /pytest_standalone.sh spark)" - shell: bash - - - name: "Spark Torch MNIST [attempt 1 of 3]" - id: Spark_Torch_MNIST_run_1 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Torch_MNIST && true - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Torch_MNIST_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Torch_MNIST_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - shell: bash - - - name: "Spark Torch MNIST [attempt 2 of 3]" - id: Spark_Torch_MNIST_run_2 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Torch_MNIST && 
steps.Spark_Torch_MNIST_run_1.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Torch_MNIST_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Torch_MNIST_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - shell: bash - - - name: "Spark Torch MNIST [attempt 3 of 3]" - id: Spark_Torch_MNIST_run_3 - continue-on-error: false - if: always() && steps.build.outcome == 'success' && matrix.Spark_Torch_MNIST && steps.Spark_Torch_MNIST_run_2.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Torch_MNIST_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Torch_MNIST_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - shell: bash - - name: Upload Test Results uses: actions/upload-artifact@v2 if: always() && contains(matrix.image, '-cpu-') @@ -1747,8 +1634,8 @@ jobs: name: "Build and Test heads (${{ matrix.image }})" needs: [init-workflow, build-and-test] if: > - needs.init-workflow.outputs.run_at_all == 'true' && - needs.init-workflow.outputs.run_builds_and_tests == 'true' + needs.init-workflow.outputs.run-at-all == 'true' && + needs.init-workflow.outputs.run-builds-and-tests == 'true' runs-on: ubuntu-latest strategy: @@ -1767,9 +1654,6 @@ jobs: Gloo_TensorFlow_2_0_MNIST: true Single_MXNet2_MNIST: true Single_PyTorch_MNIST: true - Spark_Lightning_MNIST: true - Spark_PyTests: true - Spark_Torch_MNIST: true build_timeout: 30 - image: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 @@ -1873,7 +1757,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_TensorFlow_Tests_1 && true run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow2.py" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow2.py" shell: bash - name: "Elastic Spark TensorFlow Tests 1 [attempt 2 of 3]" @@ -1882,7 +1766,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_TensorFlow_Tests_1 && steps.Elastic_Spark_TensorFlow_Tests_1_run_1.outcome 
== 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow2.py" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow2.py" shell: bash - name: "Elastic Spark TensorFlow Tests 1 [attempt 3 of 3]" @@ -1891,7 +1775,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_TensorFlow_Tests_1 && steps.Elastic_Spark_TensorFlow_Tests_1_run_2.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow2.py" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_1_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow2.py" shell: bash - name: "Elastic Spark TensorFlow Tests 2 [attempt 1 of 3]" @@ -1900,7 +1784,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_TensorFlow_Tests_2 && true run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow.py" + 
docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow.py" shell: bash - name: "Elastic Spark TensorFlow Tests 2 [attempt 2 of 3]" @@ -1909,7 +1793,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_TensorFlow_Tests_2 && steps.Elastic_Spark_TensorFlow_Tests_2_run_1.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow.py" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow.py" shell: bash - name: "Elastic Spark TensorFlow Tests 2 [attempt 3 of 3]" @@ -1918,7 +1802,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_TensorFlow_Tests_2 && steps.Elastic_Spark_TensorFlow_Tests_2_run_2.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow.py" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_TensorFlow_Tests_2_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.tf.xml test_elastic_spark_tensorflow.py" shell: bash - name: "Elastic Spark Torch Tests 
[attempt 1 of 3]" @@ -1927,7 +1811,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_Torch_Tests && true run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.torch.xml test_elastic_spark_torch.py" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.torch.xml test_elastic_spark_torch.py" shell: bash - name: "Elastic Spark Torch Tests [attempt 2 of 3]" @@ -1936,7 +1820,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_Torch_Tests && steps.Elastic_Spark_Torch_Tests_run_1.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.torch.xml test_elastic_spark_torch.py" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.torch.xml test_elastic_spark_torch.py" shell: bash - name: "Elastic Spark Torch Tests [attempt 3 of 3]" @@ -1945,7 +1829,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Elastic_Spark_Torch_Tests && steps.Elastic_Spark_Torch_Tests_run_2.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors 
--junit-xml=/artifacts/junit.gloo.elastic.spark.torch.xml test_elastic_spark_torch.py" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Elastic_Spark_Torch_Tests_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 40m bash -c "cd /horovod/test/integration && /spark_env.sh HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.spark.torch.xml test_elastic_spark_torch.py" shell: bash - name: "Elastic Tests 1 [attempt 1 of 3]" @@ -2116,7 +2000,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Gloo_Parallel_PyTests && true run: | mkdir -p artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" shell: bash - name: "Gloo Parallel PyTests [attempt 2 of 3]" @@ -2125,7 +2009,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Gloo_Parallel_PyTests && steps.Gloo_Parallel_PyTests_run_1.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" shell: bash - name: "Gloo Parallel PyTests [attempt 3 of 3]" @@ -2134,7 +2018,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.Gloo_Parallel_PyTests && steps.Gloo_Parallel_PyTests_run_2.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Gloo_Parallel_PyTests_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" shell: bash - name: "Gloo PyTorch MNIST [attempt 1 of 3]" 
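Every test step in this workflow follows the same three-attempt retry pattern seen above: the first two attempts set continue-on-error, and each later attempt only runs when the previous attempt's outcome was 'failure'. Below is a minimal sketch of that pattern with a placeholder job name and test command; the real steps additionally gate on the build step's outcome and the matrix flags.

jobs:
  retry-example:
    runs-on: ubuntu-latest
    steps:
      - name: "Flaky test [attempt 1 of 3]"
        id: test_run_1
        continue-on-error: true    # a failure here does not fail the job yet
        run: ./run-tests.sh        # placeholder command

      - name: "Flaky test [attempt 2 of 3]"
        id: test_run_2
        continue-on-error: true
        # retry only when the first attempt failed; always() keeps the step eligible to run
        if: always() && steps.test_run_1.outcome == 'failure'
        run: ./run-tests.sh

      - name: "Flaky test [attempt 3 of 3]"
        id: test_run_3
        continue-on-error: false   # the last attempt determines the job's final result
        if: always() && steps.test_run_2.outcome == 'failure'
        run: ./run-tests.sh

Because only the final attempt has continue-on-error set to false, the job fails only when all three attempts fail, while earlier failures remain visible in the per-attempt outcomes and the uploaded artifacts.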
@@ -2440,7 +2324,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests && true run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI Parallel PyTests [attempt 2 of 3]" @@ -2449,7 +2333,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests && steps.MPI_Parallel_PyTests_run_1.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI Parallel PyTests [attempt 3 of 3]" @@ -2458,7 +2342,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests && steps.MPI_Parallel_PyTests_run_2.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI Parallel PyTests [ONECCL MPI] [attempt 1 of 3]" @@ -2467,7 +2351,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests_ONECCL_MPI && true run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_mpi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_1:/artifacts" ${{ 
matrix.image }} /usr/bin/timeout 10m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_mpi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI Parallel PyTests [ONECCL MPI] [attempt 2 of 3]" @@ -2476,7 +2360,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests_ONECCL_MPI && steps.MPI_Parallel_PyTests_ONECCL_MPI_run_1.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_mpi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_mpi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI Parallel PyTests [ONECCL MPI] [attempt 3 of 3]" @@ -2485,7 +2369,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests_ONECCL_MPI && steps.MPI_Parallel_PyTests_ONECCL_MPI_run_2.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_mpi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_MPI_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_mpi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI Parallel PyTests [ONECCL OFI] [attempt 1 of 3]" @@ -2494,7 +2378,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests_ONECCL_OFI && true run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_ofi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_ofi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat 
/mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI Parallel PyTests [ONECCL OFI] [attempt 2 of 3]" @@ -2503,7 +2387,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests_ONECCL_OFI && steps.MPI_Parallel_PyTests_ONECCL_OFI_run_1.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_ofi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_ofi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI Parallel PyTests [ONECCL OFI] [attempt 3 of 3]" @@ -2512,7 +2396,7 @@ jobs: if: always() && steps.build.outcome == 'success' && matrix.MPI_Parallel_PyTests_ONECCL_OFI && steps.MPI_Parallel_PyTests_ONECCL_OFI_run_2.outcome == 'failure' run: | mkdir -p artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 5m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_ofi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" + docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/MPI_Parallel_PyTests_ONECCL_OFI_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_ofi' > /mpirun_command && cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" shell: bash - name: "MPI PyTorch MNIST [attempt 1 of 3]" @@ -3082,168 +2966,6 @@ jobs: docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Single_PyTorch_MNIST_ONECCL_OFI_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "\$(cat /oneccl_env) && echo '/mpirun_command_ofi' > /mpirun_command && python /horovod/examples/pytorch/pytorch_mnist.py --epochs 3 --data-dir /data/pytorch_datasets" shell: bash - - name: "Spark Keras MNIST [attempt 1 of 3]" - id: Spark_Keras_MNIST_run_1 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Keras_MNIST && true - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_MNIST_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_MNIST_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - shell: bash - - - name: "Spark Keras MNIST [attempt 2 of 3]" - id: Spark_Keras_MNIST_run_2 - continue-on-error: true - if: 
always() && steps.build.outcome == 'success' && matrix.Spark_Keras_MNIST && steps.Spark_Keras_MNIST_run_1.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_MNIST_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_MNIST_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - shell: bash - - - name: "Spark Keras MNIST [attempt 3 of 3]" - id: Spark_Keras_MNIST_run_3 - continue-on-error: false - if: always() && steps.build.outcome == 'success' && matrix.Spark_Keras_MNIST && steps.Spark_Keras_MNIST_run_2.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_MNIST_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_MNIST_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - shell: bash - - - name: "Spark Keras Rossmann Estimator [attempt 1 of 3]" - id: Spark_Keras_Rossmann_Estimator_run_1 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Keras_Rossmann_Estimator && true - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Estimator_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Estimator_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_rossmann_estimator.py --num-proc 2 --work-dir /work --data-dir file:///data --epochs 3 --sample-rate 0.01" - shell: bash - - - name: "Spark Keras Rossmann Estimator [attempt 2 of 3]" - id: Spark_Keras_Rossmann_Estimator_run_2 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Keras_Rossmann_Estimator && steps.Spark_Keras_Rossmann_Estimator_run_1.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Estimator_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Estimator_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_rossmann_estimator.py --num-proc 2 --work-dir /work --data-dir file:///data --epochs 3 --sample-rate 0.01" - shell: bash - - - name: "Spark Keras Rossmann Estimator [attempt 3 of 3]" - id: Spark_Keras_Rossmann_Estimator_run_3 - continue-on-error: false - if: always() && steps.build.outcome == 'success' && matrix.Spark_Keras_Rossmann_Estimator && steps.Spark_Keras_Rossmann_Estimator_run_2.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Estimator_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Estimator_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_rossmann_estimator.py --num-proc 2 --work-dir /work --data-dir file:///data --epochs 3 
--sample-rate 0.01" - shell: bash - - - name: "Spark Keras Rossmann Run [attempt 1 of 3]" - id: Spark_Keras_Rossmann_Run_run_1 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Keras_Rossmann_Run && true - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Run_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Run_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_rossmann_run.py --num-proc 2 --data-dir file:///data --epochs 3 --sample-rate 0.01" - shell: bash - - - name: "Spark Keras Rossmann Run [attempt 2 of 3]" - id: Spark_Keras_Rossmann_Run_run_2 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Keras_Rossmann_Run && steps.Spark_Keras_Rossmann_Run_run_1.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Run_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Run_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_rossmann_run.py --num-proc 2 --data-dir file:///data --epochs 3 --sample-rate 0.01" - shell: bash - - - name: "Spark Keras Rossmann Run [attempt 3 of 3]" - id: Spark_Keras_Rossmann_Run_run_3 - continue-on-error: false - if: always() && steps.build.outcome == 'success' && matrix.Spark_Keras_Rossmann_Run && steps.Spark_Keras_Rossmann_Run_run_2.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Run_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Keras_Rossmann_Run_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_rossmann_run.py --num-proc 2 --data-dir file:///data --epochs 3 --sample-rate 0.01" - shell: bash - - - name: "Spark Lightning MNIST [attempt 1 of 3]" - id: Spark_Lightning_MNIST_run_1 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Lightning_MNIST && true - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Lightning_MNIST_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Lightning_MNIST_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_lightning_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - shell: bash - - - name: "Spark Lightning MNIST [attempt 2 of 3]" - id: Spark_Lightning_MNIST_run_2 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Lightning_MNIST && steps.Spark_Lightning_MNIST_run_1.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Lightning_MNIST_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Lightning_MNIST_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_lightning_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data 
--epochs 3" - shell: bash - - - name: "Spark Lightning MNIST [attempt 3 of 3]" - id: Spark_Lightning_MNIST_run_3 - continue-on-error: false - if: always() && steps.build.outcome == 'success' && matrix.Spark_Lightning_MNIST && steps.Spark_Lightning_MNIST_run_2.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Lightning_MNIST_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Lightning_MNIST_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_lightning_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - shell: bash - - - name: "Spark PyTests [attempt 1 of 3]" - id: Spark_PyTests_run_1 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_PyTests && true - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_PyTests_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_PyTests_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && (ls -1 test_spark*.py | xargs -n 1 /bin/bash /pytest_standalone.sh spark)" - shell: bash - - - name: "Spark PyTests [attempt 2 of 3]" - id: Spark_PyTests_run_2 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_PyTests && steps.Spark_PyTests_run_1.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_PyTests_run_2 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_PyTests_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && (ls -1 test_spark*.py | xargs -n 1 /bin/bash /pytest_standalone.sh spark)" - shell: bash - - - name: "Spark PyTests [attempt 3 of 3]" - id: Spark_PyTests_run_3 - continue-on-error: false - if: always() && steps.build.outcome == 'success' && matrix.Spark_PyTests && steps.Spark_PyTests_run_2.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_PyTests_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_PyTests_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 20m bash -c "cd /horovod/test/integration && (ls -1 test_spark*.py | xargs -n 1 /bin/bash /pytest_standalone.sh spark)" - shell: bash - - - name: "Spark Torch MNIST [attempt 1 of 3]" - id: Spark_Torch_MNIST_run_1 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Torch_MNIST && true - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Torch_MNIST_run_1 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Torch_MNIST_run_1:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - shell: bash - - - name: "Spark Torch MNIST [attempt 2 of 3]" - id: Spark_Torch_MNIST_run_2 - continue-on-error: true - if: always() && steps.build.outcome == 'success' && matrix.Spark_Torch_MNIST && steps.Spark_Torch_MNIST_run_1.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Torch_MNIST_run_2 - docker-compose -f docker-compose.test.yml run -e 
GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Torch_MNIST_run_2:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - shell: bash - - - name: "Spark Torch MNIST [attempt 3 of 3]" - id: Spark_Torch_MNIST_run_3 - continue-on-error: false - if: always() && steps.build.outcome == 'success' && matrix.Spark_Torch_MNIST && steps.Spark_Torch_MNIST_run_2.outcome == 'failure' - run: | - mkdir -p artifacts/${{ matrix.image }}/Spark_Torch_MNIST_run_3 - docker-compose -f docker-compose.test.yml run -e GITHUB_ACTIONS --rm --volume "$(pwd)/artifacts/${{ matrix.image }}/Spark_Torch_MNIST_run_3:/artifacts" ${{ matrix.image }} /usr/bin/timeout 10m bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - shell: bash - - name: Upload Test Results uses: actions/upload-artifact@v2 if: always() && contains(matrix.image, '-cpu-') @@ -3270,8 +2992,8 @@ jobs: name: "Build and Test macOS (${{ matrix.image }}-macos)" needs: [init-workflow, build-and-test] if: > - needs.init-workflow.outputs.run_at_all == 'true' && - needs.init-workflow.outputs.run_builds_and_tests == 'true' + needs.init-workflow.outputs.run-at-all == 'true' && + needs.init-workflow.outputs.run-builds-and-tests == 'true' runs-on: macos-latest strategy: @@ -3279,34 +3001,34 @@ jobs: fail-fast: false matrix: include: - - image: test-cpu-openmpi-py3_7-tf1_15_5-keras2_2_4-torch1_2_0-mxnet1_5_0 + - image: test-cpu-openmpi-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_0 HOROVOD_WITH_MPI: 1 HOROVOD_WITHOUT_GLOO: 1 TENSORFLOW: 1.15.0 KERAS: 2.2.4 - PYTORCH: 1.2.0 - PYTORCH_LIGHTNING: 0.7.6 - TORCHVISION: 0.4.0 + PYTORCH: 1.6.0 + PYTORCH_LIGHTNING: 1.3.8 + TORCHVISION: 0.7.0 MXNET: 1.5.0 - - image: test-cpu-gloo-py3_8-tf2_2_0-keras2_3_1-torch1_5_0-mxnet1_5_0 + - image: test-cpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_5_0 HOROVOD_WITHOUT_MPI: 1 HOROVOD_WITH_GLOO: 1 - TENSORFLOW: 2.2.0 - KERAS: 2.3.1 - PYTORCH: 1.5.0 - PYTORCH_LIGHTNING: 1.2.9 - TORCHVISION: 0.6.0 + TENSORFLOW: 2.5.1 + KERAS: 2.5.0rc0 + PYTORCH: 1.8.1 + PYTORCH_LIGHTNING: 1.3.8 + TORCHVISION: 0.9.1 MXNET: 1.5.0 - - image: test-openmpi-cpu-gloo-py3_8-tf2_3_0-keras2_3_1-torch1_6_0-mxnet1_5_0 + - image: test-openmpi-cpu-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_5_0 HOROVOD_WITH_MPI: 1 HOROVOD_WITH_GLOO: 1 - TENSORFLOW: 2.3.0 - KERAS: 2.3.1 - PYTORCH: 1.6.0 - PYTORCH_LIGHTNING: 1.2.9 - TORCHVISION: 0.7.0 + TENSORFLOW: 2.6.0 + KERAS: 2.6.0 + PYTORCH: 1.9.0 + PYTORCH_LIGHTNING: 1.3.8 + TORCHVISION: 0.10.0 MXNET: 1.5.0 steps: @@ -3329,11 +3051,14 @@ jobs: TORCHVISION: ${{ matrix.TORCHVISION }} MXNET: ${{ matrix.MXNET }} + # The python patch in the pyenv install step is to work around an incompatibility introduced in new xcode version in macOS Big Sur. The patch is provided by python team. 
+ # The original discussion is here https://github.com/pyenv/pyenv/issues/1737 run: | - brew install -f openmpi cmake libuv pyenv coreutils + brew reinstall -f zlib bzip2 + brew install -f openmpi cmake libuv pyenv coreutils curl export PATH=$(pyenv root)/shims:$PATH pyenv uninstall -f 3.7.7 - pyenv install 3.7.7 + CFLAGS="-I$(brew --prefix bzip2)/include -I$(brew --prefix zlib)/include" LDFLAGS="-L$(brew --prefix zlib)/lib -L$(brew --prefix bzip2)/lib" pyenv install --patch 3.7.7 < <(curl -sSL https://github.com/python/cpython/commit/8ea6353.patch) pyenv global 3.7.7 python --version @@ -3416,8 +3141,8 @@ jobs: runs-on: ubuntu-latest if: > github.repository == 'horovod/horovod' && - needs.init-workflow.outputs.run_at_all == 'true' && - needs.init-workflow.outputs.run_builds_and_tests == 'true' && + needs.init-workflow.outputs.run-at-all == 'true' && + needs.init-workflow.outputs.run-builds-and-tests == 'true' && ( github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name == github.repository ) steps: @@ -3426,10 +3151,12 @@ jobs: uses: EnricoMi/trigger-pipeline-action@master env: PIPELINE: "horovod/horovod" - BRANCH: "${{ github.event.pull_request.head.ref }}" - MESSAGE: "GPU Tests triggered by GitHub" + # COMMIT is taken from GITHUB_SHA + BRANCH: "${{ needs.init-workflow.outputs.buildkite-branch-label }}" + # empty MESSAGE will be filled by Buildkite from commit message + MESSAGE: "${{ needs.init-workflow.outputs.buildkite-message }}" BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_TOKEN }} - BUILD_ENV_VARS: "{\"PIPELINE_MODE\": \"GPU FULL\"}" + BUILD_ENV_VARS: "{\"PIPELINE_MODE\": \"GPU NON HEADS\"}" - name: Download Buildkite Artifacts id: download @@ -3440,14 +3167,14 @@ jobs: buildkite_build_url: ${{ steps.build.outputs.url }} ignore_build_states: blocked,canceled,skipped,not_run ignore_job_states: timed_out - output_path: artifacts/Unit Test Results - GPUs on Buildkite + output_path: artifacts/Unit Test Results - GPU NON HEADS on Builtkite - name: Upload Test Results uses: actions/upload-artifact@v2 if: always() with: - name: Unit Test Results - GPUs on Builtkite - path: artifacts/Unit Test Results - GPUs on Buildkite/**/*.xml + name: Unit Test Results - GPU NON HEADS on Builtkite + path: artifacts/Unit Test Results - GPU NON HEADS on Builtkite/**/*.xml - name: Check Buildkite job state if: > @@ -3458,69 +3185,66 @@ jobs: echo "::warning::Buildkite pipeline did not pass: ${{ steps.build.outputs.url }}" exit 1 - publish-test-results: - name: "Publish Unit Tests Results" - needs: [build-and-test, build-and-test-heads, build-and-test-macos, buildkite] + buildkite-heads: + name: "Build and Test GPU heads (on Builtkite)" + needs: [init-workflow, buildkite] runs-on: ubuntu-latest - # only run this job when the workflow is in success or failure state, - # not when it is in cancelled or skipped state - # only run this job on push events or when the event does not run in a fork repository if: > - ( success() || failure() ) && - needs.init-workflow.outputs.run_at_all == 'true' && - ( github.event_name == 'push' || ! 
github.event.head.repo.fork ) + github.repository == 'horovod/horovod' && + needs.init-workflow.outputs.run-at-all == 'true' && + needs.init-workflow.outputs.run-builds-and-tests == 'true' && + ( github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name == github.repository ) steps: - - name: Download GitHub Artifacts - uses: actions/download-artifact@v2 - with: - path: artifacts - - - name: Identify last run of each test - continue-on-error: true - run: | - declare -A last_runs - ls -d artifacts/Unit\ Test\ Results\ */* | sort > runs.txt - while read run - do - test=${run/%[_-]run[_-][0123456789]/} - last_runs[$test]=$run - done < runs.txt - - echo "LAST_RUNS<> $GITHUB_ENV - for test in "${!last_runs[@]}" - do - echo "${last_runs[$test]}" >&2 - echo "${last_runs[$test]}/**/*.xml" >> $GITHUB_ENV - done - echo "EOF" >> $GITHUB_ENV - shell: bash + - name: Trigger Buildkite Pipeline + id: build + uses: EnricoMi/trigger-pipeline-action@master + env: + PIPELINE: "horovod/horovod" + # COMMIT is taken from GITHUB_SHA + BRANCH: "${{ needs.init-workflow.outputs.buildkite-branch-label }}" + # empty MESSAGE will be filled by Buildkite from commit message + MESSAGE: "${{ needs.init-workflow.outputs.buildkite-message }}" + BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_TOKEN }} + BUILD_ENV_VARS: "{\"PIPELINE_MODE\": \"GPU HEADS\"}" - - name: Publish Unit Test Results - uses: EnricoMi/publish-unit-test-result-action@v1 - if: always() + - name: Download Buildkite Artifacts + id: download + uses: docker://ghcr.io/enricomi/download-buildkite-artifact-action:v1 with: - check_name: Unit Test Results - files: "${{ env.LAST_RUNS }}" + github_token: ${{ github.token }} + buildkite_token: ${{ secrets.BUILDKITE_TOKEN }} + buildkite_build_url: ${{ steps.build.outputs.url }} + ignore_build_states: blocked,canceled,skipped,not_run + ignore_job_states: timed_out + output_path: artifacts/Unit Test Results - GPU HEADS on Builtkite - - name: Publish Unit Test Results (with flaky tests) - uses: EnricoMi/publish-unit-test-result-action@v1 + - name: Upload Test Results + uses: actions/upload-artifact@v2 if: always() with: - check_name: Unit Test Results (with flaky tests) - fail_on: errors - files: "artifacts/Unit Test Results */**/*.xml" + name: Unit Test Results - GPU HEADS on Builtkite + path: artifacts/Unit Test Results - GPU HEADS on Builtkite/**/*.xml + + - name: Check Buildkite job state + if: > + always() && + steps.download.conclusion == 'success' && + steps.download.outputs.build-state != 'passed' + run: | + echo "::warning::Buildkite pipeline did not pass: ${{ steps.build.outputs.url }}" + exit 1 docker-config: name: Configure docker build needs: [init-workflow, build-and-test, buildkite] - # build-and-test-cpu, build-gpu and buildkite might have been skipped (! needs.init-workflow.outputs.run_builds_and_tests) + # build-and-test-cpu, build-gpu and buildkite might have been skipped (! 
needs.init-workflow.outputs.run-builds-and-tests) # buildkite might have been skipped (workflow runs for a fork PR), # we still want to build docker images (though we might not want to push them) if: > always() && - needs.init-workflow.outputs.run_at_all == 'true' && - needs.init-workflow.outputs.run_builds_and_tests == 'true' && + needs.init-workflow.outputs.run-at-all == 'true' && + needs.init-workflow.outputs.run-builds-and-tests == 'true' && needs.build-and-test.result == 'success' && ( needs.buildkite.result == 'success' || needs.buildkite.result == 'skipped' ) runs-on: ubuntu-latest diff --git a/.gitmodules b/.gitmodules index c274d0cab6..de526f8fd4 100644 --- a/.gitmodules +++ b/.gitmodules @@ -3,7 +3,7 @@ url = https://github.com/yixuan/LBFGSpp.git [submodule "third_party/eigen"] path = third_party/eigen - url = https://gitlab.com/libeigen/eigen.git + url = https://gitlab.com/cantonios/eigen.git [submodule "third_party/boost/assert"] path = third_party/boost/assert url = https://github.com/boostorg/assert.git diff --git a/CHANGELOG.md b/CHANGELOG.md index 673a7818fd..a0f5a0a1aa 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,20 +8,82 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/). ### Added -- Added process sets for TensorFlow: Concurrently running collective operations on subsets of Horovod processes. ([#2839](https://github.com/horovod/horovod/pull/2839)) +- TensorFlow: Added in-place broadcasting of variables. ([#3128](https://github.com/horovod/horovod/pull/3128)) -- Added terminate_on_nan flag to Spark Lightning estimator. [#3088](https://github.com/horovod/horovod/issues/3088) +### Changed + +### Deprecated + +### Removed + +### Fixed + +- fix the example of pytorch_lightning_mnist.py ([#3245](https://github.com/horovod/horovod/pull/3245)) + +- Call _setup in remote trainers to point to the correct shared lib path ([#3258](https://github.com/horovod/horovod/pull/3258)) +## [v0.23.0] - 2021-10-06 + +### Added + +- Added process sets to concurrently run collective operations on subsets of Horovod processes in TensorFlow, PyTorch, and MXNet. ([#2839](https://github.com/horovod/horovod/pull/2839), [#3042](https://github.com/horovod/horovod/pull/3042), [#3043](https://github.com/horovod/horovod/pull/3043), [#3054](https://github.com/horovod/horovod/pull/3054), [#3083](https://github.com/horovod/horovod/pull/3083), [#3090](https://github.com/horovod/horovod/pull/3090)) + +- Added XLA support for Allreduce via `tf.function(jit_compile=True)`. ([#3053](https://github.com/horovod/horovod/pull/3053)) + +- Added fused buffer scaling and unpack/pack kernels on GPU. ([#2973](https://github.com/horovod/horovod/pull/2973)) + +- Added support for NCCL on CUDA 11.4. ([#3182](https://github.com/horovod/horovod/issues/3182)) + +- Added fp16 compression for MXNet. ([#2987](https://github.com/horovod/horovod/issues/2987)) + +- Added terminate_on_nan flag to Spark Lightning estimator. ([#3088](https://github.com/horovod/horovod/issues/3088)) + +- Added barrier() API to torch module to support simple synchronization among ranks and to achieve parity with PyTorch DDP and similar frameworks. [#3139](https://github.com/horovod/horovod/pull/3139) + +- Added params for customizing Tensorboard callback. ([#3153](https://github.com/horovod/horovod/issues/3153)) + +- Added `hvd.cross_rank()` for keras. 
([#3008](https://github.com/horovod/horovod/issues/3008)) + +- Added barrier() API to torch module to support simple synchronization among ranks and to achieve parity with PyTorch DDP and similar frameworks. [#3139](https://github.com/horovod/horovod/pull/3139) ### Changed +- Implemented more asynchronous dependency handling on GPU. ([#2963](https://github.com/horovod/horovod/pull/2963)) + +- Ray: RayExecutor will now use the current placement group instead of always creating a new one. ([#3134](https://github.com/horovod/horovod/pull/3134)) + +- Lightning: turned off shuffling for validation dataset. ([#2974](https://github.com/horovod/horovod/pull/2974)) + +- Ray: RayExecutor will use the current placement group if one exists. ([#3134](https://github.com/horovod/horovod/pull/3134)) + +- Extended `hvd.join()` to return the last rank that joined. ([#3097](https://github.com/horovod/horovod/pull/3097) ### Deprecated ### Removed +- Spark/Keras: remove bare Keras support. ([#3191](https://github.com/horovod/horovod/pull/3191)) + ### Fixed - Fix Horovod develop/editable install mode and incremental builds. ([#3074](https://github.com/horovod/horovod/pull/3074)) -- Estimator/Lightning: use lightning datamodule ([#3084](https://github.com/horovod/horovod/pull/3084)) + +- Estimator/Lightning: use lightning datamodule. ([#3084](https://github.com/horovod/horovod/pull/3084)) + +- Fix Horovod Spark StringType and numpy type mapping issue. ([#3146](https://github.com/horovod/horovod/pull/3146)) + +- Fixed error in Keras LearningRateScheduler. ([#3135](https://github.com/horovod/horovod/pull/3135)) + +- Fixed bug in Lightning Profiler on Ray. ([#3122](https://github.com/horovod/horovod/pull/3122)) + +- Fixed torch op lazy release to prevent OOM in elastic training. ([#3110](https://github.com/horovod/horovod/pull/3110)) + +- Lightning: Fixed usage of the checkpoint callback. ([#3186](https://github.com/horovod/horovod/pull/3186)) + +- Fixed MPICH support to use Intel MPI's implementation. ([#3148](https://github.com/horovod/horovod/pull/3148)) + + +- Fixed race condition in PyTorch async dataloader. ([#3120](https://github.com/horovod/horovod/pull/3120)) + +- Keras: Fixed learning rate scheduler. ([#3142](https://github.com/horovod/horovod/pull/3142), [#3135](https://github.com/horovod/horovod/pull/3135)) ## [v0.22.1] - 2021-06-10 diff --git a/Dockerfile.test.cpu b/Dockerfile.test.cpu index 99493ffeab..3b3e9d8b20 100644 --- a/Dockerfile.test.cpu +++ b/Dockerfile.test.cpu @@ -77,7 +77,8 @@ RUN if [[ ${SPARK_PACKAGE} != *"-preview"* ]]; then \ fi # Install Ray. -RUN pip install --no-cache-dir ray==1.3.0 +# Updating to 1.7.0 to pass ray tests +RUN pip install --no-cache-dir ray==1.7.0 # Install MPI. RUN if [[ ${MPI_KIND} == "OpenMPI" ]]; then \ @@ -142,6 +143,7 @@ RUN if [[ ${TENSORFLOW_PACKAGE} != "tf-nightly" ]]; then \ if [[ ${TENSORFLOW_PACKAGE} == tensorflow*==1.* ]] || [[ ${TENSORFLOW_PACKAGE} == tensorflow*==2.[01234].* ]]; then \ h5py="h5py<3"; \ fi; \ + pip uninstall -y keras-nightly; \ pip install --no-cache-dir ${KERAS_PACKAGE} ${h5py:-} "scipy!=1.4.0" "pandas<1.1.0"; \ fi; \ mkdir -p ~/.keras; \ @@ -189,6 +191,7 @@ COPY . 
/horovod RUN if [[ ${TENSORFLOW_PACKAGE} == "tf-nightly" ]]; then \ pip install --no-cache-dir ${TENSORFLOW_PACKAGE}; \ if [[ ${KERAS_PACKAGE} != "None" ]]; then \ + pip uninstall -y keras-nightly; \ pip install --no-cache-dir ${KERAS_PACKAGE} "scipy!=1.4.0" "pandas<1.1.0"; \ fi; \ mkdir -p ~/.keras; \ diff --git a/Dockerfile.test.gpu b/Dockerfile.test.gpu index 8fc7772b5c..85d31c47f9 100644 --- a/Dockerfile.test.gpu +++ b/Dockerfile.test.gpu @@ -111,6 +111,7 @@ RUN if [[ ${TENSORFLOW_PACKAGE} != "tf-nightly-gpu" ]]; then \ if [[ ${TENSORFLOW_PACKAGE} == tensorflow*==1.* ]] || [[ ${TENSORFLOW_PACKAGE} == tensorflow*==2.[01234].* ]]; then \ h5py="h5py<3"; \ fi; \ + pip uninstall -y keras-nightly; \ pip install --no-cache-dir ${KERAS_PACKAGE} ${h5py:-} "scipy!=1.4.0" "pandas<1.1.0"; \ fi; \ mkdir -p ~/.keras; \ @@ -159,6 +160,7 @@ COPY . /horovod RUN if [[ ${TENSORFLOW_PACKAGE} == "tf-nightly-gpu" ]]; then \ pip install --no-cache-dir ${TENSORFLOW_PACKAGE}; \ if [[ ${KERAS_PACKAGE} != "None" ]]; then \ + pip uninstall -y keras-nightly; \ pip install --no-cache-dir ${KERAS_PACKAGE} "scipy!=1.4.0" "pandas<1.1.0"; \ fi; \ mkdir -p ~/.keras; \ diff --git a/GOVERNANCE.md b/GOVERNANCE.md index 167c71f794..2d86399d7f 100644 --- a/GOVERNANCE.md +++ b/GOVERNANCE.md @@ -29,11 +29,11 @@ Current non-voting members of the Horovod TSC: * [Leonard Lausen](https://github.com/leezu) - Amazon * [Jonathan Dekhtiar](https://github.com/DEKHTIARJonathan) - NVIDIA * [Richard Liaw](https://github.com/richardliaw) - Anyscale -* [Armand McQueen](https://github.com/armandmcqueen) - Determined AI -* [Neil Conway](https://github.com/neilconway) - Determined AI +* [Neil Conway](https://github.com/neilconway) - Determined AI, HPE * [Min Cai](https://github.com/mincai) - Uber * [Chongxiao Cao](https://github.com/chongxiaoc) - Uber * [Max Gerlach](https://github.com/maxhgerlach) - DeepL +* [Ryan Beethe](https://github.com/rb-determined-ai) - Determined AI, HPE Emeritus members of the Horovod TSC: * [Lin Yuan](https://github.com/apeforest) @@ -43,6 +43,7 @@ Emeritus members of the Horovod TSC: * [Aaron Harlap](https://github.com/aaron276h) * [Jaliya Ekanayake](https://github.com/jaliyae) * [Kaarthik Sivashanmugam](https://github.com/skaarthik) +* [Armand McQueen](https://github.com/armandmcqueen) Non-voting members of the TSC ("maintainers") have commit access to the Horovod GitHub repository, and take part in the standing TSC meetings and mailing lists. 
Emeritus members are no longer active maintainers of the project, but are diff --git a/Jenkinsfile.ppc64le b/Jenkinsfile.ppc64le index 9756c9b1eb..b1bc5df851 100644 --- a/Jenkinsfile.ppc64le +++ b/Jenkinsfile.ppc64le @@ -1,12 +1,12 @@ pipeline { options { buildDiscarder(logRotator(numToKeepStr: '30')) - timeout(time: 15, unit: 'MINUTES') + timeout(time: 30, unit: 'MINUTES') } agent { docker { alwaysPull true - // WMLCE 1.7.0 has CUDA 10.2, NCCL 2.5.6, TensorFlow 2.1.0, and PyTorch 1.3.1 + // WMLCE 1.7.0 has CUDA 10.2, NCCL 2.5.6, TensorFlow 2.1.0, and PyTorch 1.8.0 image 'tensorflowppc64le/tensorflow-ppc64le:osuosl-ubuntu-horovod-wlmce1.7.0-py3.7-ppc64le' args '--cap-add=SYS_PTRACE --shm-size=256g' label 'power8-gpu' @@ -27,7 +27,7 @@ pipeline { conda activate ${CONDA_ENV} conda install -y cmake make set -xe - HOROVOD_WITHOUT_MXNET=1 HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_TENSORFLOW=1 \ + HOROVOD_WITHOUT_MXNET=1 HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITH_TENSORFLOW=1 \ HOROVOD_CUDA_HOME=$CONDA_PREFIX HOROVOD_GPU_OPERATIONS=NCCL MAKEFLAGS="-j1" \ pip install -v . --no-cache-dir --no-deps ''' @@ -47,7 +47,7 @@ pipeline { horovodrun -n 1 -H localhost:1 --mpi-args="-pami_noib" pytest -k 'multi_gpu' -v -s test/parallel/test_tensorflow.py # PyTorch unit tests - horovodrun -n 2 -H localhost:2 --mpi-args="-pami_noib" pytest -v -s test/parallel/test_torch.py + # horovodrun -n 2 -H localhost:2 --mpi-args="-pami_noib" pytest -v -s test/parallel/test_torch.py ''' } } diff --git a/cmake/Modules/FindTensorflow.cmake b/cmake/Modules/FindTensorflow.cmake index a766fd92b7..e8885903e1 100644 --- a/cmake/Modules/FindTensorflow.cmake +++ b/cmake/Modules/FindTensorflow.cmake @@ -20,10 +20,11 @@ if (LEN EQUAL "4") list(GET Tensorflow_OUTPUT 1 Tensorflow_INCLUDE_DIRS) list(GET Tensorflow_OUTPUT 2 Tensorflow_LIBRARIES) string(REPLACE " " ";" Tensorflow_LIBRARIES_LIST "${Tensorflow_LIBRARIES}") - list(GET Tensorflow_LIBRARIES_LIST 0 Tensorflow_LIB_PATH) - if (Tensorflow_VERSION VERSION_GREATER_EQUAL "2.6") - # XLA implementations are in _pywrap_tensorflow_internal.so - set(Tensorflow_LIBRARIES "${Tensorflow_LIBRARIES} ${Tensorflow_LIB_PATH}/python/ -l:_pywrap_tensorflow_internal.so") + list(GET Tensorflow_LIBRARIES_LIST 0 Tensorflow_LIB_PATH_ARGUMENT) + string(REGEX REPLACE "^-L" "" Tensorflow_LIB_PATH ${Tensorflow_LIB_PATH_ARGUMENT}) + if (Tensorflow_VERSION VERSION_GREATER "2.6" OR Tensorflow_VERSION VERSION_EQUAL "2.6") + # XLA implementations and helpers for resource variables are in _pywrap_tensorflow_internal.so + set(Tensorflow_LIBRARIES "${Tensorflow_LIBRARIES} ${Tensorflow_LIB_PATH}/python/_pywrap_tensorflow_internal.so") endif() message("Tensorflow_LIBRARIES := ${Tensorflow_LIBRARIES}") list(GET Tensorflow_OUTPUT 3 Tensorflow_COMPILE_FLAGS) diff --git a/docker-compose.test.yml b/docker-compose.test.yml index 0fba464e70..587a93ae7a 100644 --- a/docker-compose.test.yml +++ b/docker-compose.test.yml @@ -10,7 +10,7 @@ services: MPI_KIND: None PYTHON_VERSION: 3.8 TENSORFLOW_PACKAGE: tensorflow-cpu==2.6.0 - KERAS_PACKAGE: None + KERAS_PACKAGE: keras==2.6.0 PYTORCH_PACKAGE: torch==1.9.0+cpu PYTORCH_LIGHTNING_PACKAGE: pytorch-lightning==1.3.8 TORCHVISION_PACKAGE: torchvision==0.10.0+cpu @@ -22,27 +22,27 @@ services: shm_size: 8gb # our baseline first - test-cpu-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2: + test-cpu-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2: extends: test-cpu-base - 
test-cpu-mpich-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2: + test-cpu-mpich-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2: extends: test-cpu-base build: args: MPI_KIND: MPICH HOROVOD_BUILD_FLAGS: HOROVOD_WITHOUT_GLOO=1 - test-cpu-oneccl-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2: + test-cpu-oneccl-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2: extends: test-cpu-base build: args: MPI_KIND: ONECCL HOROVOD_BUILD_FLAGS: HOROVOD_WITHOUT_GLOO=1 - test-cpu-openmpi-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2: + test-cpu-openmpi-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2: extends: test-cpu-base build: args: MPI_KIND: OpenMPI HOROVOD_BUILD_FLAGS: HOROVOD_WITHOUT_GLOO=1 - test-cpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2: + test-cpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2: extends: test-cpu-base build: args: @@ -65,11 +65,12 @@ services: # however, there is an mxnet-cu101-1.6.0.post0, so we test this with gpu instead of cpu # this cpu test variation is defined as gpu in gpu frameworks variations below # test-cpu-gloo-py3_8-tf2_4_3-keras2_3_1-torch1_7_1-mxnet1_6_0_p0-pyspark3_1_2: - test-cpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_7_0_p2-pyspark3_1_2: + test-cpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_7_0_p2-pyspark3_1_2: extends: test-cpu-base build: args: TENSORFLOW_PACKAGE: tensorflow==2.5.1 + KERAS_PACKAGE: keras==2.4.3 PYTORCH_PACKAGE: torch==1.8.1+cpu TORCHVISION_PACKAGE: torchvision==0.9.1 MXNET_PACKAGE: mxnet==1.7.0.post2 @@ -82,16 +83,17 @@ services: KERAS_PACKAGE: None PYTORCH_PACKAGE: torch-nightly TORCHVISION_PACKAGE: torchvision + PYTORCH_LIGHTNING_PACKAGE: pytorch_lightning MXNET_PACKAGE: mxnet-nightly - test-cpu-gloo-py3_7-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark2_4_8: + test-cpu-gloo-py3_7-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark2_4_8: extends: test-cpu-base build: args: PYTHON_VERSION: 3.7 PYSPARK_PACKAGE: pyspark==2.4.8 SPARK_PACKAGE: spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz - test-cpu-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_0_3: + test-cpu-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_0_3: extends: test-cpu-base build: args: @@ -120,8 +122,9 @@ services: privileged: true shm_size: 8gb - # torch==1.3.1+cu100 requires torchvision==0.4.2+cu100 - test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2: + # okay to mix cuda 10.0 and 10.1 here as pytorch ships its own cuda libs + # torch==1.6.0+cu101 requires torchvision==0.7.0+cu101 + test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2: extends: test-gpu-base build: args: @@ -131,9 +134,9 @@ services: PYTHON_VERSION: 3.7 TENSORFLOW_PACKAGE: tensorflow-gpu==1.15.5 KERAS_PACKAGE: keras==2.2.4 - PYTORCH_PACKAGE: torch==1.3.1+cu100 - PYTORCH_LIGHTNING_PACKAGE: pytorch_lightning==1.1.0 - TORCHVISION_PACKAGE: torchvision==0.4.2+cu100 + PYTORCH_PACKAGE: torch==1.6.0+cu101 + PYTORCH_LIGHTNING_PACKAGE: pytorch_lightning==1.3.8 + TORCHVISION_PACKAGE: torchvision==0.7.0+cu101 MXNET_PACKAGE: mxnet-cu100==1.5.1.post0 # this is required as we cannot test mxnet-1.6.0.post0 with cpu test-gpu-gloo-py3_8-tf2_4_3-keras2_3_1-torch1_7_1-mxnet1_6_0_p0-pyspark3_1_2: @@ -166,7 +169,7 @@ services: MXNET_PACKAGE: mxnet-cu101==1.7.0.post1 # we deviate from mxnet1_7_0_p2 here as other frameworks target CUDA 11.x and # mxnet 1.7.x only supports CUDA 10.x, 
with mxnet 1.8.x we have CUDA 11.x packages - test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2: + test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2: extends: test-gpu-base build: args: @@ -174,12 +177,12 @@ services: CUDNN_VERSION: 8.1.1.33-1+cuda11.2 NCCL_VERSION_OVERRIDE: 2.8.4-1+cuda11.2 TENSORFLOW_PACKAGE: tensorflow-gpu==2.5.1 - KERAS_PACKAGE: None + KERAS_PACKAGE: keras==2.4.3 PYTORCH_PACKAGE: torch==1.8.1+cu111 PYTORCH_LIGHTNING_PACKAGE: pytorch_lightning==1.3.8 TORCHVISION_PACKAGE: torchvision==0.9.1+cu111 MXNET_PACKAGE: mxnet-cu112==1.8.0.post0 - test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2: + test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2: extends: test-gpu-base build: args: @@ -188,7 +191,7 @@ services: NCCL_VERSION_OVERRIDE: 2.8.4-1+cuda11.2 MPI_KIND: OpenMPI TENSORFLOW_PACKAGE: tensorflow-gpu==2.6.0 - KERAS_PACKAGE: None + KERAS_PACKAGE: keras==2.6.0 PYTORCH_PACKAGE: torch==1.9.0+cu111 PYTORCH_LIGHTNING_PACKAGE: pytorch-lightning==1.3.8 TORCHVISION_PACKAGE: torchvision==0.10.0+cu111 @@ -203,11 +206,11 @@ services: TENSORFLOW_PACKAGE: tf-nightly-gpu KERAS_PACKAGE: None PYTORCH_PACKAGE: torch-nightly-cu111 - PYTORCH_LIGHTNING_PACKAGE: pytorch_lightning==1.3.8 + PYTORCH_LIGHTNING_PACKAGE: pytorch_lightning TORCHVISION_PACKAGE: torchvision MXNET_PACKAGE: mxnet-nightly-cu112 - test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2: + test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2: extends: test-gpu-base build: args: @@ -216,7 +219,7 @@ services: NCCL_VERSION_OVERRIDE: 2.8.4-1+cuda11.2 MPI_KIND: OpenMPI TENSORFLOW_PACKAGE: tensorflow-gpu==2.6.0 - KERAS_PACKAGE: None + KERAS_PACKAGE: keras==2.6.0 PYTORCH_PACKAGE: torch==1.9.0+cu111 PYTORCH_LIGHTNING_PACKAGE: pytorch_lightning==1.3.8 TORCHVISION_PACKAGE: torchvision==0.10.0+cu111 diff --git a/docs/mocks.py b/docs/mocks.py index 793e0fd9cc..deafa7229d 100644 --- a/docs/mocks.py +++ b/docs/mocks.py @@ -39,6 +39,7 @@ def _dummy(): 'fsspec', 'fsspec.core', + 'fsspec.utils', 'pyarrow', 'pyarrow.parquet', @@ -60,6 +61,8 @@ def _dummy(): 'ray', 'ray.exceptions', 'ray.services', + 'ray.util', + 'ray.util.placement_group', 'tensorflow', 'tensorflow.python', diff --git a/docs/tensor-fusion.rst b/docs/tensor-fusion.rst index 31b2d758cd..a1ef957a40 100644 --- a/docs/tensor-fusion.rst +++ b/docs/tensor-fusion.rst @@ -10,7 +10,7 @@ Tensor Fusion works by attempting to combine all the tensors that are ready to b one reduction operation. The algorithm of Tensor Fusion is as follows: 1. Determine which tensors are ready to be reduced. Select first few tensors that fit in ``HOROVOD_FUSION_THRESHOLD`` bytes and have the same data type. -2. Allocate fusion buffer of size ``HOROVOD_FUSION_THRESHOLD`` if it was not allocated before. Default fusion buffer size is 64 MB. +2. Allocate fusion buffer of size ``HOROVOD_FUSION_THRESHOLD`` if it was not allocated before. Default fusion buffer size is 128 MB. 3. Copy data of selected tensors into the fusion buffer. 4. Execute the **allreduce** operation on the fusion buffer. 5. Copy data from the fusion buffer into the output tensors. 
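
The tensor-fusion.rst hunk above bumps the documented default fusion buffer size to 128 MB and restates the five-step fusion algorithm. The following minimal, framework-free sketch illustrates the pack/reduce/unpack flow of those steps; it uses numpy, a no-op stand-in for the real allreduce collective, and an assumed 128 MB threshold, and is only an illustration, not Horovod's C++ implementation.

```python
# Minimal sketch of the Tensor Fusion steps described above (illustrative only).
import numpy as np

FUSION_THRESHOLD_BYTES = 128 * 1024 * 1024  # assumed default per the doc change


def allreduce(buf: np.ndarray) -> np.ndarray:
    """Stand-in for the real collective; a no-op here for illustration."""
    return buf


def fused_allreduce(tensors):
    """Fuse same-dtype tensors into one buffer, reduce once, and unpack."""
    if not tensors:
        return []
    dtype = tensors[0].dtype
    # 1. Select the first few ready tensors of the same dtype that fit the threshold.
    selected, nbytes = [], 0
    for t in tensors:
        if t.dtype != dtype or nbytes + t.nbytes > FUSION_THRESHOLD_BYTES:
            break
        selected.append(t)
        nbytes += t.nbytes
    # 2. Allocate the fusion buffer (reused across cycles in the real implementation).
    buf = np.empty(sum(t.size for t in selected), dtype=dtype)
    # 3. Pack the selected tensors into the buffer.
    offset = 0
    for t in selected:
        buf[offset:offset + t.size] = t.ravel()
        offset += t.size
    # 4. One collective over the whole buffer instead of one per tensor.
    buf = allreduce(buf)
    # 5. Unpack the results back into per-tensor outputs.
    outputs, offset = [], 0
    for t in selected:
        outputs.append(buf[offset:offset + t.size].reshape(t.shape))
        offset += t.size
    return outputs


if __name__ == "__main__":
    outs = fused_allreduce([np.ones((4, 4), np.float32), np.zeros(10, np.float32)])
    print([o.shape for o in outs])
```
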
diff --git a/examples/elastic/tensorflow2/tensorflow2_keras_mnist_elastic.py b/examples/elastic/tensorflow2/tensorflow2_keras_mnist_elastic.py index 155a1b2b37..a0ce8d1929 100644 --- a/examples/elastic/tensorflow2/tensorflow2_keras_mnist_elastic.py +++ b/examples/elastic/tensorflow2/tensorflow2_keras_mnist_elastic.py @@ -13,8 +13,25 @@ # limitations under the License. # ============================================================================== +import argparse import tensorflow as tf import horovod.tensorflow.keras as hvd +from distutils.version import LooseVersion + +parser = argparse.ArgumentParser(description='Tensorflow 2.0 Keras MNIST Example') + +parser.add_argument('--use-mixed-precision', action='store_true', default=False, + help='use mixed precision for training') + +args = parser.parse_args() + +if args.use_mixed_precision: + if LooseVersion(tf.__version__) >= LooseVersion('2.4.0'): + from tensorflow.keras import mixed_precision + mixed_precision.set_global_policy('mixed_float16') + else: + policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16') + tf.keras.mixed_precision.experimental.set_policy(policy) # Horovod: initialize Horovod. hvd.init() diff --git a/examples/pytorch/pytorch_lightning_mnist.py b/examples/pytorch/pytorch_lightning_mnist.py index d60c4ae194..e6ffcccf1f 100644 --- a/examples/pytorch/pytorch_lightning_mnist.py +++ b/examples/pytorch/pytorch_lightning_mnist.py @@ -24,22 +24,10 @@ help='input batch size for testing (default: 1000)') parser.add_argument('--epochs', type=int, default=10, metavar='N', help='number of epochs to train (default: 10)') -parser.add_argument('--lr', type=float, default=0.01, metavar='LR', - help='learning rate (default: 0.01)') -parser.add_argument('--momentum', type=float, default=0.5, metavar='M', - help='SGD momentum (default: 0.5)') parser.add_argument('--no-cuda', action='store_true', default=False, help='disables CUDA training') parser.add_argument('--seed', type=int, default=42, metavar='S', help='random seed (default: 42)') -parser.add_argument('--log-interval', type=int, default=10, metavar='N', - help='how many batches to wait before logging training status') -parser.add_argument('--fp16-allreduce', action='store_true', default=False, - help='use fp16 compression during allreduce') -parser.add_argument('--use-adasum', action='store_true', default=False, - help='use adasum algorithm to do reduction') -parser.add_argument('--gradient-predivide-factor', type=float, default=1.0, - help='apply gradient predivide factor in optimizer (default: 1.0)') parser.add_argument('--data-dir', help='location of the training dataset in the local filesystem (will be downloaded if needed)') @@ -205,7 +193,7 @@ def on_train_end(self, trainer, model): callbacks = [MyDummyCallback(), ModelCheckpoint(dirpath=ckpt_path)] trainer = Trainer(accelerator='horovod', - gpus=(1 if torch.cuda.is_available() else 0), + gpus=(1 if args.cuda else 0), callbacks=callbacks, max_epochs=epochs, limit_train_batches=train_percent, @@ -214,6 +202,7 @@ def on_train_end(self, trainer, model): num_sanity_val_steps=0) trainer.fit(model) - + if args.cuda: + model = model.cuda() test() diff --git a/examples/pytorch/pytorch_mnist.py b/examples/pytorch/pytorch_mnist.py index 907bdd39b8..0cece73bd1 100644 --- a/examples/pytorch/pytorch_mnist.py +++ b/examples/pytorch/pytorch_mnist.py @@ -1,5 +1,6 @@ import argparse import os +from distutils.version import LooseVersion from filelock import FileLock import torch.multiprocessing as mp @@ -30,6 +31,8 @@ help='how 
many batches to wait before logging training status') parser.add_argument('--fp16-allreduce', action='store_true', default=False, help='use fp16 compression during allreduce') +parser.add_argument('--use-mixed-precision', action='store_true', default=False, + help='use mixed precision for training') parser.add_argument('--use-adasum', action='store_true', default=False, help='use adasum algorithm to do reduction') parser.add_argument('--gradient-predivide-factor', type=float, default=1.0, @@ -56,6 +59,34 @@ def forward(self, x): x = self.fc2(x) return F.log_softmax(x) +def train_mixed_precision(epoch, scaler): + model.train() + # Horovod: set epoch to sampler for shuffling. + train_sampler.set_epoch(epoch) + for batch_idx, (data, target) in enumerate(train_loader): + if args.cuda: + data, target = data.cuda(), target.cuda() + optimizer.zero_grad() + with torch.cuda.amp.autocast(): + output = model(data) + loss = F.nll_loss(output, target) + + scaler.scale(loss).backward() + # Make sure all async allreduces are done + optimizer.synchronize() + # In-place unscaling of all gradients before weights update + scaler.unscale_(optimizer) + with optimizer.skip_synchronize(): + scaler.step(optimizer) + # Update scaler in case of overflow/underflow + scaler.update() + + if batch_idx % args.log_interval == 0: + # Horovod: use train_sampler to determine the number of examples in + # this worker's partition. + print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}\tLoss Scale: {}'.format( + epoch, batch_idx * len(data), len(train_sampler), + 100. * batch_idx / len(train_loader), loss.item(), scaler.get_scale())) def train(epoch): model.train() @@ -124,7 +155,14 @@ def test(): # Horovod: pin GPU to local rank. torch.cuda.set_device(hvd.local_rank()) torch.cuda.manual_seed(args.seed) + else: + if args.use_mixed_precision: + raise ValueError("Mixed precision is only supported with cuda enabled.") + if (args.use_mixed_precision and LooseVersion(torch.__version__) + < LooseVersion('1.6.0')): + raise ValueError("""Mixed precision is using torch.cuda.amp.autocast(), + which requires torch >= 1.6.0""") # Horovod: limit # of CPU threads to be used per worker. torch.set_num_threads(1) @@ -192,6 +230,14 @@ def test(): op=hvd.Adasum if args.use_adasum else hvd.Average, gradient_predivide_factor=args.gradient_predivide_factor) + if args.use_mixed_precision: + # Initialize scaler in global scale + scaler = torch.cuda.amp.GradScaler() + for epoch in range(1, args.epochs + 1): - train(epoch) + if args.use_mixed_precision: + train_mixed_precision(epoch, scaler) + else: + train(epoch) + # Keep test in full precision since computation is relatively light. 
test() diff --git a/examples/spark/keras/keras_spark_mnist.py b/examples/spark/keras/keras_spark_mnist.py index e03b26e803..b989c623e2 100644 --- a/examples/spark/keras/keras_spark_mnist.py +++ b/examples/spark/keras/keras_spark_mnist.py @@ -115,7 +115,8 @@ batch_size=args.batch_size, epochs=args.epochs, inmemory_cache_all=True, - verbose=1) + verbose=1, + callbacks=[keras.callbacks.TensorBoard(profile_batch=5)]) keras_model = keras_estimator.fit(train_df).setOutputCols(['label_prob']) diff --git a/examples/spark/pytorch/pytorch_lightning_spark_mnist.py b/examples/spark/pytorch/pytorch_lightning_spark_mnist.py index 99e05c981d..369a4bd0e6 100644 --- a/examples/spark/pytorch/pytorch_lightning_spark_mnist.py +++ b/examples/spark/pytorch/pytorch_lightning_spark_mnist.py @@ -43,6 +43,8 @@ help='temporary working directory to write intermediate files (prefix with hdfs:// to use HDFS)') parser.add_argument('--data-dir', default='/tmp', help='location of the training dataset in the local filesystem (will be downloaded if needed)') +parser.add_argument('--enable-profiler', action='store_true', + help='Enable profiler') def train_model(args): @@ -175,14 +177,15 @@ def on_train_end(self, trainer, model): # added EarlyStopping and ModelCheckpoint from pytorch_lightning.callbacks.model_checkpoint import ModelCheckpoint - callbacks.append(ModelCheckpoint(dirpath=args.work_dir)) + callbacks.append(ModelCheckpoint(monitor='val_loss', mode="min", + save_top_k=1, verbose=True)) from pytorch_lightning.callbacks.early_stopping import EarlyStopping callbacks.append(EarlyStopping(monitor='val_loss', - min_delta=0.00, + min_delta=0.001, patience=3, verbose=True, - mode='max')) + mode='min')) torch_estimator = hvd.TorchEstimator(backend=backend, store=store, @@ -195,7 +198,7 @@ def on_train_end(self, trainer, model): validation=0.1, verbose=1, callbacks=callbacks, - profiler="simple") + profiler="simple" if args.enable_profiler else None) torch_model = torch_estimator.fit(train_df).setOutputCols(['label_prob']) diff --git a/horovod/__init__.py b/horovod/__init__.py index 995b732b0f..4c8369c39d 100644 --- a/horovod/__init__.py +++ b/horovod/__init__.py @@ -1,3 +1,3 @@ from horovod.runner import run -__version__ = '0.22.1' +__version__ = '0.23.0' diff --git a/horovod/_keras/callbacks.py b/horovod/_keras/callbacks.py index 8efae8d2eb..839ea6ef0c 100644 --- a/horovod/_keras/callbacks.py +++ b/horovod/_keras/callbacks.py @@ -54,7 +54,7 @@ def __init__(self, backend, device='', *args): self.allreduce_ops = {} self.device = device - if LooseVersion(tf.__version__) >= LooseVersion("2.3"): + if LooseVersion("2.3") <= LooseVersion(tf.__version__) < LooseVersion("2.5"): warnings.warn( "Some callbacks may not have access to the averaged metrics, " "see https://github.com/horovod/horovod/issues/2440") @@ -107,10 +107,10 @@ def __init__(self, backend, initial_lr, multiplier, start_epoch=0, end_epoch=Non self.current_epoch = None # set multiplier, which is a fn(epoch) and is the amount by which self.initial_lr is - # multiplied by on each batch / epoch begin (depending on whether you set staircase or not + # multiplied by on each batch / epoch begin (depending on whether you set staircase or not) if not callable(multiplier): # If multiplier is a constant, it corresponds to exponential decay - self.multiplier = lambda epoch: multiplier ** epoch + self.multiplier = lambda epoch: multiplier ** (epoch - start_epoch) else: self.multiplier = multiplier diff --git a/horovod/common/basics.py b/horovod/common/basics.py index 
52255c62c4..3ff2d35424 100644 --- a/horovod/common/basics.py +++ b/horovod/common/basics.py @@ -150,7 +150,8 @@ def shutdown(self): def is_initialized(self): """Returns True if Horovod is initialized""" - return self.MPI_LIB_CTYPES.horovod_is_initialized() + is_initialized = self.MPI_LIB_CTYPES.horovod_is_initialized() + return bool(is_initialized) def start_timeline(self, file_path, mark_cycles=False): """Creates a timeline file at `file_path` and begins recording. diff --git a/horovod/common/common.h b/horovod/common/common.h index 8b5dd540b4..037e60dd72 100644 --- a/horovod/common/common.h +++ b/horovod/common/common.h @@ -153,6 +153,9 @@ namespace common { // Temporary tensor name for ranks that did Join(). #define JOIN_TENSOR_NAME "join.noname" +// Fixed tensor name for all barrier operations +#define BARRIER_TENSOR_NAME "barrier.noname" + // List of supported frameworks. enum Framework { TENSORFLOW, PYTORCH, MXNET, XLA }; @@ -185,6 +188,7 @@ struct Event { Event(std::shared_ptr event, gpuStream_t stream) : event(event), stream(stream) {}; std::shared_ptr event; + uint64_t event_idx; gpuStream_t stream = nullptr; #endif }; diff --git a/horovod/common/controller.cc b/horovod/common/controller.cc index 284dbdb022..62423f8502 100644 --- a/horovod/common/controller.cc +++ b/horovod/common/controller.cc @@ -97,6 +97,12 @@ ResponseList Controller::ComputeResponseList(bool this_process_requested_shutdow continue; } + // Never cache a barrier request, when all ranks return ready for this message, barrier will be released. + if(message.request_type() == Request::BARRIER) { + cache_coordinator.set_uncached_in_queue(true); + continue; + } + // Keep track of cache hits if (response_cache_.capacity() > 0) { auto cache_ = response_cache_.cached(message); @@ -266,6 +272,13 @@ ResponseList Controller::ComputeResponseList(bool this_process_requested_shutdow } bool reduce = IncrementTensorCount(message, process_set.joined_size); + + // For barrier request, if not ready to reduce, we add it back to tensor queue + // to process in the next cycle. + if(!reduce && message.request_type() == Request::BARRIER) { + tensor_queue_.PushMessageToQueue(message); + } + stall_inspector_.RecordUncachedTensorStart( message.tensor_name(), message.request_rank(), size_); if (reduce) { @@ -283,7 +296,6 @@ ResponseList Controller::ComputeResponseList(bool this_process_requested_shutdow auto received_message_list = ready_list[i]; for (auto& received_message : received_message_list.requests()) { auto& received_name = received_message.tensor_name(); - if (received_message.request_type() == Request::JOIN) { process_set.joined_size++; process_set.last_joined_rank = global_ranks_[i]; @@ -292,6 +304,7 @@ ResponseList Controller::ComputeResponseList(bool this_process_requested_shutdow bool reduce = IncrementTensorCount(received_message, process_set.joined_size); + stall_inspector_.RecordUncachedTensorStart( received_message.tensor_name(), received_message.request_rank(), size_); @@ -498,6 +511,7 @@ Response Controller::ConstructResponse(const std::string& name, int joined_size) // Check that all data types of tensors being processed // are identical. 
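The controller changes above keep BARRIER requests out of the response cache and push them back onto the tensor queue until every rank has submitted one, which is what makes the operation block. From user code the whole mechanism is a single call; the sketch below assumes the Python binding added to the torch module is named hvd.barrier() and that it defaults to the global process set.

import horovod.torch as hvd

hvd.init()

# ... rank-local work that must complete on every rank before continuing ...

hvd.barrier()  # returns only after all ranks in the process set have reached it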
auto data_type = requests[0].tensor_type(); + for (unsigned int i = 1; i < requests.size(); ++i) { auto request_type = requests[i].tensor_type(); if (data_type != request_type) { @@ -756,6 +770,8 @@ Response Controller::ConstructResponse(const std::string& name, int joined_size) response.set_tensor_type(data_type); response.set_prescale_factor(prescale_factor); response.set_postscale_factor(postscale_factor); + } else if (message_type == Request::BARRIER) { + response.set_response_type(Response::BARRIER); } response.set_devices(devices); @@ -971,13 +987,21 @@ bool Controller::IncrementTensorCount(const Request& msg, int joined_size) { timeline_.NegotiateStart(name, msg.request_type()); } else { std::vector& messages = table_iter->second; - messages.push_back(msg); + if(msg.request_type() == Request::BARRIER) { + if(tensor_queue_.IsTensorPresentInTable(name)) { + messages.push_back(msg); + } + } + else { + messages.push_back(msg); + } } timeline_.NegotiateRankReady(name, msg.request_rank()); std::vector& messages = table_iter->second; int count = (int)messages.size(); + bool ready_to_reduce = count == (size_ - joined_size); if (ready_to_reduce) { timeline_.NegotiateEnd(name); diff --git a/horovod/common/message.cc b/horovod/common/message.cc index 8a0b113092..3adbe870ca 100644 --- a/horovod/common/message.cc +++ b/horovod/common/message.cc @@ -111,6 +111,9 @@ const std::string& Request::RequestType_Name(RequestType value) { case RequestType::ALLTOALL: static const std::string alltoall("ALLTOALL"); return alltoall; + case RequestType::BARRIER: + static const std::string barrier("BARRIER"); + return barrier; default: static const std::string unknown(""); return unknown; @@ -300,6 +303,9 @@ const std::string& Response::ResponseType_Name(ResponseType value) { case ResponseType::ALLTOALL: static const std::string alltoall("ALLTOALL"); return alltoall; + case ResponseType::BARRIER: + static const std::string barrier("BARRIER"); + return barrier; case ResponseType::ERROR: static const std::string error("ERROR"); return error; diff --git a/horovod/common/message.h b/horovod/common/message.h index 8665ef8d5f..fe10bd83c1 100644 --- a/horovod/common/message.h +++ b/horovod/common/message.h @@ -50,7 +50,7 @@ std::size_t DataType_Size(DataType value); class Request { public: enum RequestType { - ALLREDUCE = 0, ALLGATHER = 1, BROADCAST = 2, JOIN = 3, ADASUM = 4, ALLTOALL = 5 + ALLREDUCE = 0, ALLGATHER = 1, BROADCAST = 2, JOIN = 3, ADASUM = 4, ALLTOALL = 5, BARRIER = 6 }; @@ -153,7 +153,7 @@ class RequestList { class Response { public: enum ResponseType { - ALLREDUCE = 0, ALLGATHER = 1, BROADCAST = 2, JOIN = 3, ADASUM = 4, ALLTOALL= 5, ERROR = 6 + ALLREDUCE = 0, ALLGATHER = 1, BROADCAST = 2, JOIN = 3, ADASUM = 4, ALLTOALL= 5, BARRIER=6, ERROR = 7 }; static const std::string& ResponseType_Name(ResponseType value); diff --git a/horovod/common/operations.cc b/horovod/common/operations.cc index dd88425a6b..b0b07a490a 100644 --- a/horovod/common/operations.cc +++ b/horovod/common/operations.cc @@ -37,19 +37,19 @@ #include "hashes.h" #include "logging.h" #include "message.h" +#include "nvtx_op_range.h" #include "ops/operation_manager.h" #include "parameter_manager.h" #include "timeline.h" #include "utils/env_parser.h" -#include "nvtx_op_range.h" #if HAVE_MPI #define OMPI_SKIP_MPICXX #include "mpi.h" #include "mpi/mpi_context.h" #include "mpi/mpi_controller.h" -#include "ops/mpi_operations.h" #include "ops/adasum_mpi_operations.h" +#include "ops/mpi_operations.h" #endif #if HAVE_GPU @@ -158,11 +158,11 @@ 
OperationManager* CreateOperationManager(HorovodGlobalState& state) { new MPI_GPUAllreduce(&gpu_context, &state))); #elif HAVE_NCCL && HOROVOD_GPU_ALLREDUCE == 'N' - adasum_ops.push_back(std::shared_ptr(new AdasumGpuAllreduceOp(&global_mpi_context, &nccl_context, &gpu_context, &state))); + adasum_ops.push_back(std::shared_ptr(new AdasumGpuAllreduceOp( + &global_mpi_context, &nccl_context, &gpu_context, &state))); - allreduce_ops.push_back( - std::shared_ptr(new NCCLHierarchicalAllreduce( - &nccl_context, &gpu_context, &state))); + allreduce_ops.push_back(std::shared_ptr( + new NCCLHierarchicalAllreduce(&nccl_context, &gpu_context, &state))); #elif HAVE_DDL && HOROVOD_GPU_ALLREDUCE == 'D' allreduce_ops.push_back(std::shared_ptr( @@ -173,12 +173,12 @@ OperationManager* CreateOperationManager(HorovodGlobalState& state) { allgather_ops.push_back(std::shared_ptr( new MPI_GPUAllgather(&gpu_context, &state))); #endif - allgather_ops.push_back(std::shared_ptr( - new MPIHierarchicalAllgather(&state))); + allgather_ops.push_back( + std::shared_ptr(new MPIHierarchicalAllgather(&state))); #if HOROVOD_GPU_ALLTOALL == 'M' - alltoall_ops.push_back(std::shared_ptr( - new MPI_GPUAlltoall(&gpu_context, &state))); + alltoall_ops.push_back( + std::shared_ptr(new MPI_GPUAlltoall(&gpu_context, &state))); #endif } #endif @@ -189,8 +189,8 @@ OperationManager* CreateOperationManager(HorovodGlobalState& state) { #endif #if HAVE_NCCL && HOROVOD_GPU_BROADCAST == 'N' - broadcast_ops.push_back( - std::shared_ptr(new NCCLBroadcast(&nccl_context, &gpu_context, &state))); + broadcast_ops.push_back(std::shared_ptr( + new NCCLBroadcast(&nccl_context, &gpu_context, &state))); #endif #if HAVE_NCCL && HOROVOD_GPU_ALLGATHER == 'N' @@ -224,15 +224,14 @@ OperationManager* CreateOperationManager(HorovodGlobalState& state) { std::make_shared(&ccl_context, &state)); broadcast_ops.push_back( std::make_shared(&ccl_context, &state)); - alltoall_ops.push_back( - std::make_shared(&ccl_context, &state)); + alltoall_ops.push_back(std::make_shared(&ccl_context, &state)); } #endif #if HAVE_MPI - if (global_mpi_context.IsEnabled()){ - adasum_ops.push_back( - std::shared_ptr(new AdasumMPIAllreduceOp(&global_mpi_context, &state))); + if (global_mpi_context.IsEnabled()) { + adasum_ops.push_back(std::shared_ptr( + new AdasumMPIAllreduceOp(&global_mpi_context, &state))); allreduce_ops.push_back( std::shared_ptr(new MPIAllreduce(&state))); allgather_ops.push_back( @@ -245,11 +244,12 @@ OperationManager* CreateOperationManager(HorovodGlobalState& state) { #endif std::shared_ptr join_op(new JoinOp(&state)); + std::shared_ptr barrier_op(new BarrierOp(&state)); std::shared_ptr error_op(new ErrorOp(&state)); return new OperationManager(&state.parameter_manager, allreduce_ops, allgather_ops, broadcast_ops, alltoall_ops, - join_op, adasum_ops, error_op); + join_op, adasum_ops, barrier_op, error_op); } // Process a Response by doing a reduction, a gather, a broadcast, or @@ -259,7 +259,9 @@ void PerformOperation(Response response, ProcessSet& process_set) { auto& timeline = horovod_global.timeline; process_set.tensor_queue.GetTensorEntriesFromResponse(response, entries, process_set.joined); - if (response.response_type() != Response::JOIN) { + + if (response.response_type() != Response::JOIN && + response.response_type() != Response::BARRIER) { for (auto& e : entries) { timeline.Start(e.tensor_name, response.response_type(), e.tensor->size()); } @@ -276,7 +278,8 @@ void PerformOperation(Response response, ProcessSet& process_set) { [&]() { 
timeline.ActivityStartAll(entries, INIT_FUSION_BUFFER); }, [&]() { timeline.ActivityEndAll(entries); }); if (!status.ok()) { - LOG(DEBUG, horovod_global.global_controller->GetRank()) << "InitializeBuffer Failed"; + LOG(DEBUG, horovod_global.global_controller->GetRank()) + << "InitializeBuffer Failed"; for (auto& e : entries) { timeline.End(e.tensor_name, nullptr); e.FinishWithCallback(status); @@ -289,10 +292,12 @@ void PerformOperation(Response response, ProcessSet& process_set) { Status status; try { // process_set is passed here only for the case of Response::JOIN where - // entries is empty. The other operations can infer process_set from entries. + // entries is empty. The other operations can infer process_set from + // entries. status = op_manager->ExecuteOperation(entries, response, process_set); } catch (const std::exception& ex) { - LOG(DEBUG, horovod_global.global_controller->GetRank()) << "ExecuteOperation Failed"; + LOG(DEBUG, horovod_global.global_controller->GetRank()) + << "ExecuteOperation Failed"; status = Status::UnknownError(ex.what()); } @@ -428,12 +433,12 @@ void BackgroundThreadLoop(HorovodGlobalState& state) { int local_rank = state.global_controller->GetLocalRank(); // Set background thread affinity - parse_and_set_affinity(std::getenv(HOROVOD_THREAD_AFFINITY), local_size, local_rank); + parse_and_set_affinity(std::getenv(HOROVOD_THREAD_AFFINITY), local_size, + local_rank); #if HAVE_GPU // Set number of GPU streams to use - auto horovod_num_nccl_streams = - std::getenv(HOROVOD_NUM_NCCL_STREAMS); + auto horovod_num_nccl_streams = std::getenv(HOROVOD_NUM_NCCL_STREAMS); if (horovod_num_nccl_streams != nullptr && std::stol(horovod_num_nccl_streams, nullptr, 10) > 0) { state.num_nccl_streams = std::atoi(horovod_num_nccl_streams); @@ -441,6 +446,7 @@ void BackgroundThreadLoop(HorovodGlobalState& state) { #if HAVE_NCCL nccl_context.nccl_comms.resize(state.num_nccl_streams); + SetBoolFromEnv(HOROVOD_ELASTIC, nccl_context.elastic, true); #endif gpu_context.streams.resize(state.num_nccl_streams); @@ -465,8 +471,7 @@ void BackgroundThreadLoop(HorovodGlobalState& state) { state.timeline.Initialize(horovod_timeline, static_cast(size)); } else { - state.timeline.Initialize("", - static_cast(size)); + state.timeline.Initialize("", static_cast(size)); } } should_enable_timeline = true; @@ -478,8 +483,7 @@ void BackgroundThreadLoop(HorovodGlobalState& state) { ParseStallInspectorFromEnv( state.process_set_table.Get(0).controller->GetStallInspector()); bool mark_cycles = false; - SetBoolFromEnv(HOROVOD_TIMELINE_MARK_CYCLES, mark_cycles, - true); + SetBoolFromEnv(HOROVOD_TIMELINE_MARK_CYCLES, mark_cycles, true); state.timeline_controller.SetMarkCyclesInTimelinePending(mark_cycles); state.mark_cycles_in_timeline = mark_cycles; @@ -558,18 +562,19 @@ void BackgroundThreadLoop(HorovodGlobalState& state) { } // Set flag to control use of batched memcopy kernel on GPU - auto horovod_batch_d2d_memcopies = - std::getenv(HOROVOD_BATCH_D2D_MEMCOPIES); + auto horovod_batch_d2d_memcopies = std::getenv(HOROVOD_BATCH_D2D_MEMCOPIES); if (horovod_batch_d2d_memcopies != nullptr && std::strtol(horovod_batch_d2d_memcopies, nullptr, 10) == 0) { state.batch_d2d_memcopies = false; } // Check if group fusion should be disabled - SetBoolFromEnv(HOROVOD_DISABLE_GROUP_FUSION, state.disable_group_fusion, true); + SetBoolFromEnv(HOROVOD_DISABLE_GROUP_FUSION, state.disable_group_fusion, + true); // Check if async completion should be enabled - SetBoolFromEnv(HOROVOD_ENABLE_ASYNC_COMPLETION, 
state.enable_async_completion, true); + SetBoolFromEnv(HOROVOD_ENABLE_ASYNC_COMPLETION, state.enable_async_completion, + true); if (enable_xla_ops) { // Enable async completion when XLA ops are enabled. Sine the XLA runtime is // single-threaded, async completion is essential to reduce host overhead. @@ -589,9 +594,11 @@ void BackgroundThreadLoop(HorovodGlobalState& state) { } // Set chunk size for MPI based Adasum allreduce algorithms - auto horovod_adasum_mpi_chunk_size = std::getenv(HOROVOD_ADASUM_MPI_CHUNK_SIZE); + auto horovod_adasum_mpi_chunk_size = + std::getenv(HOROVOD_ADASUM_MPI_CHUNK_SIZE); if (horovod_adasum_mpi_chunk_size != nullptr) { - state.adasum_mpi_chunk_size = std::strtol(horovod_adasum_mpi_chunk_size, nullptr, 10); + state.adasum_mpi_chunk_size = + std::strtol(horovod_adasum_mpi_chunk_size, nullptr, 10); } op_manager.reset(CreateOperationManager(state)); @@ -641,7 +648,8 @@ void BackgroundThreadLoop(HorovodGlobalState& state) { // Iterate until shutdown. try { - while (RunLoopOnce(state)); + while (RunLoopOnce(state)) + ; } catch (const std::exception& ex) { LOG(ERROR, horovod_global.global_controller->GetRank()) << "Horovod background loop uncaught exception: " << ex.what(); @@ -689,11 +697,10 @@ void BackgroundThreadLoop(HorovodGlobalState& state) { #endif #if HAVE_CCL - if (state.cpu_operation == LibType::CCL){ + if (state.cpu_operation == LibType::CCL) { ccl_context.Finalize(); } #endif - } bool RunLoopOnce(HorovodGlobalState& state) { @@ -753,7 +760,8 @@ bool RunLoopOnce(HorovodGlobalState& state) { state.timeline_controller.MarkCyclesInTimelinePending(); } - // Get tensor name and size data for autotuning. // TODO: extend for all process sets? + // Get tensor name and size data for autotuning. // TODO: extend for all + // process sets? int64_t total_tensor_size = 0; std::vector tensor_names; if (process_set_id == 0 && state.parameter_manager.IsAutoTuning()) { @@ -838,8 +846,8 @@ bool InitializeHorovodOnce( #endif // Reset initialization flag horovod_global.initialization_done = false; - horovod_global.background_thread = std::thread( - BackgroundThreadLoop, std::ref(horovod_global)); + horovod_global.background_thread = + std::thread(BackgroundThreadLoop, std::ref(horovod_global)); } // Wait to ensure that the background thread has finished initializing MPI. 
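Most of the knobs parsed in BackgroundThreadLoop above are read once when the background thread starts, so they have to be in the environment before Horovod is initialized. A hedged sketch (the variable names come from the hunks above; the values are only illustrative):

import os

# Must be set before hvd.init(); the background thread reads them at startup.
os.environ["HOROVOD_NUM_NCCL_STREAMS"] = "2"         # number of GPU/NCCL streams
os.environ["HOROVOD_ENABLE_ASYNC_COMPLETION"] = "1"  # async completion (forced on when XLA ops are enabled)
os.environ["HOROVOD_TIMELINE_MARK_CYCLES"] = "1"     # mark cycles in the timeline

import horovod.torch as hvd
hvd.init()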
@@ -914,10 +922,11 @@ bool horovod_init_multi_comm(MPI_Comm* comm, int ncomms, MPI_Group diff_group; MPI_Group_difference(sub_group, global_group, &diff_group); if (diff_group != MPI_GROUP_EMPTY) { - LOG(ERROR) << - "Group of processes in horovod_init_multi_comm argument number " + - std::to_string(i) + - " is not a subset of the assumed global communicator."; + LOG(ERROR) + << "Group of processes in horovod_init_multi_comm argument " + "number " + + std::to_string(i) + + " is not a subset of the assumed global communicator."; return false; } } @@ -1016,8 +1025,8 @@ void horovod_shutdown() { } } -bool horovod_is_initialized() { - return horovod_global.initialization_done; +int horovod_is_initialized() { + return int(horovod_global.initialization_done.load()); } int horovod_start_timeline(const char* file_name, bool mark_cycles) { @@ -1033,7 +1042,8 @@ int horovod_start_timeline(const char* file_name, bool mark_cycles) { std::string(file_name), horovod_global.global_controller->GetSize()); horovod_global.timeline.SetPendingTimelineFile(std::string(file_name)); } - horovod_global.timeline_controller.SetMarkCyclesInTimelinePending(mark_cycles); + horovod_global.timeline_controller.SetMarkCyclesInTimelinePending( + mark_cycles); return 1; } @@ -1041,13 +1051,14 @@ int horovod_stop_timeline() { if (!horovod_global.initialization_done) { return -1; } - if(!horovod_global.timeline_controller.TimelineEnabledPending()){ - LOG(INFO) << " Timeline is already stopped. Please start timeline before stopping it."; + if (!horovod_global.timeline_controller.TimelineEnabledPending()) { + LOG(INFO) << " Timeline is already stopped. Please start timeline before " + "stopping it."; return 1; } bool is_coordinator = horovod_global.global_controller->IsCoordinator(); if (is_coordinator) { - horovod_global.timeline.SetPendingTimelineFile(std::string("")); + horovod_global.timeline.SetPendingTimelineFile(std::string("")); } return 1; } @@ -1184,17 +1195,11 @@ bool horovod_rocm_built() { #endif } -int horovod_reduce_op_average() { - return ReduceOp::AVERAGE; -} +int horovod_reduce_op_average() { return ReduceOp::AVERAGE; } -int horovod_reduce_op_sum() { - return ReduceOp::SUM; -} +int horovod_reduce_op_sum() { return ReduceOp::SUM; } -int horovod_reduce_op_adasum() { - return ReduceOp::ADASUM; -} +int horovod_reduce_op_adasum() { return ReduceOp::ADASUM; } const int HOROVOD_PROCESS_SET_ERROR_INIT = -1; const int HOROVOD_PROCESS_SET_ERROR_DYNAMIC = -2; @@ -1320,7 +1325,6 @@ int horovod_process_set_included(int process_set_id) { return static_cast(process_set.IsCurrentProcessIncluded()); } - int horovod_number_of_process_sets() { return static_cast(horovod_global.process_set_table.Ids().size()); } @@ -1346,7 +1350,6 @@ int horovod_process_set_ranks(int id, int* ranks_prealloc) { } return 0; } - } // Contexts and controller must be initialized and the background thread @@ -1354,13 +1357,10 @@ int horovod_process_set_ranks(int id, int* ranks_prealloc) { Status EnqueueTensorAllreduce(std::shared_ptr context, std::shared_ptr tensor, std::shared_ptr output, - ReadyEventList ready_event_list, - std::string name, const int device, - StatusCallback callback, - ReduceOp reduce_op, - double prescale_factor, - double postscale_factor, - int32_t process_set_id) { + ReadyEventList ready_event_list, std::string name, + const int device, StatusCallback callback, + ReduceOp reduce_op, double prescale_factor, + double postscale_factor, int32_t process_set_id) { // Wrap inputs in std::vector and pass onto multi tensor implementation 
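horovod_is_initialized() now returns an int at the C boundary, while the Python wrapper in horovod/common/basics.py (earlier hunk) converts the result back to a bool, so caller-facing behaviour is unchanged. A minimal usage sketch, assuming the basics binding is exported as hvd.is_initialized():

import horovod.torch as hvd

if not hvd.is_initialized():
    hvd.init()
print(hvd.rank(), hvd.size())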
std::vector> contexts; std::vector> tensors; @@ -1376,27 +1376,24 @@ Status EnqueueTensorAllreduce(std::shared_ptr context, names.emplace_back(std::move(name)); callbacks.emplace_back(std::move(callback)); - return EnqueueTensorAllreduces(contexts, tensors, outputs, ready_event_lists, - names, device, callbacks, reduce_op, - prescale_factor, postscale_factor, - process_set_id); + return EnqueueTensorAllreduces( + contexts, tensors, outputs, ready_event_lists, names, device, callbacks, + reduce_op, prescale_factor, postscale_factor, process_set_id); } -Status EnqueueTensorAllreduces(std::vector>& contexts, - std::vector>& tensors, - std::vector>& outputs, - std::vector& ready_event_lists, - std::vector& names, - const int device, - std::vector& callbacks, - ReduceOp reduce_op, - double prescale_factor, - double postscale_factor, - int32_t process_set_id) { +Status +EnqueueTensorAllreduces(std::vector>& contexts, + std::vector>& tensors, + std::vector>& outputs, + std::vector& ready_event_lists, + std::vector& names, const int device, + std::vector& callbacks, + ReduceOp reduce_op, double prescale_factor, + double postscale_factor, int32_t process_set_id) { if (horovod_global.cpu_operation == LibType::CCL && process_set_id > 0 && - device == CPU_DEVICE_ID) { - return Status::InvalidArgument( - "Process sets are not supported yet with oneCCL operations."); + device == CPU_DEVICE_ID) { + return Status::InvalidArgument( + "Process sets are not supported yet with oneCCL operations."); } if (!horovod_global.process_set_table.Contains(process_set_id)) { return Status::InvalidArgument("Allreduce: Process set provided does not " @@ -1410,7 +1407,8 @@ Status EnqueueTensorAllreduces(std::vector>& contexts // Averaging happens via postscale_factor postscale_factor /= process_set.controller->GetSize(); #else - LOG(ERROR, horovod_global.global_controller->GetRank()) << "Enqueuing AVERAGE allreduce is not allowed."; + LOG(ERROR, horovod_global.global_controller->GetRank()) + << "Enqueuing AVERAGE allreduce is not allowed."; return status.Aborted("AVERAGE not allowed."); #endif } else if (reduce_op == ReduceOp::ADASUM) { @@ -1492,10 +1490,12 @@ Status EnqueueTensorAllreduces(std::vector>& contexts for (const auto& n : names) { tensors_enqueued += n + "; "; } - LOG(TRACE, horovod_global.global_controller->GetRank()) << "Enqueued " << tensors_enqueued; + LOG(TRACE, horovod_global.global_controller->GetRank()) + << "Enqueued " << tensors_enqueued; - // Only create groups larger than 1 tensor, unless disable_group_fusion is requested. - // In that case, even single tensor groups are created to enforce disabling fusion. + // Only create groups larger than 1 tensor, unless disable_group_fusion is + // requested. In that case, even single tensor groups are created to enforce + // disabling fusion. 
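In EnqueueTensorAllreduces above, an AVERAGE op is turned into a SUM whose postscale_factor is divided by the process-set size. The same scaling factors are reachable from the framework API; a hedged sketch using the torch binding (the process set defaults to the global one):

import torch
import horovod.torch as hvd

hvd.init()
t = torch.ones(4)

avg = hvd.allreduce(t, op=hvd.Average)                       # sum, then postscale by 1/size
scaled = hvd.allreduce(t, op=hvd.Sum, postscale_factor=0.5)  # explicit postscaling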
if (tensors.size() > 1 || horovod_global.disable_group_fusion) { auto group_id = process_set.group_table.RegisterGroup(std::move(names)); for (auto& message : messages) { @@ -1517,8 +1517,7 @@ Status EnqueueTensorAllgather(std::shared_ptr context, std::shared_ptr tensor, ReadyEventList ready_event_list, const std::string& name, const int device, - StatusCallback callback, - int32_t process_set_id) { + StatusCallback callback, int32_t process_set_id) { if (horovod_global.cpu_operation == LibType::CCL && process_set_id > 0 && device == CPU_DEVICE_ID) { return Status::InvalidArgument( @@ -1562,7 +1561,8 @@ Status EnqueueTensorAllgather(std::shared_ptr context, } Status status = process_set.tensor_queue.AddToTensorQueue(e, message); if (status.ok()) { - LOG(TRACE, horovod_global.global_controller->GetRank()) << "Enqueued " << name; + LOG(TRACE, horovod_global.global_controller->GetRank()) + << "Enqueued " << name; } return status; } @@ -1574,8 +1574,7 @@ Status EnqueueTensorBroadcast(std::shared_ptr context, std::shared_ptr output, int root_rank, ReadyEventList ready_event_list, const std::string& name, const int device, - StatusCallback callback, - int32_t process_set_id) { + StatusCallback callback, int32_t process_set_id) { if (horovod_global.cpu_operation == LibType::CCL && process_set_id > 0 && device == CPU_DEVICE_ID) { return Status::InvalidArgument( @@ -1592,9 +1591,9 @@ Status EnqueueTensorBroadcast(std::shared_ptr context, root_rank_in_process_set = process_set.controller->GetGlobalRankToControllerRank().at(root_rank); } catch (const std::out_of_range& e) { - return Status::InvalidArgument( - "broadcast received invalid root rank " + std::to_string(root_rank) + - " for provided process set"); + return Status::InvalidArgument("broadcast received invalid root rank " + + std::to_string(root_rank) + + " for provided process set"); } Request message; @@ -1632,7 +1631,8 @@ Status EnqueueTensorBroadcast(std::shared_ptr context, } Status status = process_set.tensor_queue.AddToTensorQueue(e, message); if (status.ok()) { - LOG(TRACE, horovod_global.global_controller->GetRank()) << "Enqueued " << name; + LOG(TRACE, horovod_global.global_controller->GetRank()) + << "Enqueued " << name; } return status; } @@ -1644,8 +1644,7 @@ Status EnqueueTensorAlltoall(std::shared_ptr context, std::shared_ptr splits, ReadyEventList ready_event_list, const std::string& name, const int device, - StatusCallback callback, - int32_t process_set_id) { + StatusCallback callback, int32_t process_set_id) { if (horovod_global.cpu_operation == LibType::CCL && process_set_id > 0 && device == CPU_DEVICE_ID) { return Status::InvalidArgument( @@ -1662,7 +1661,8 @@ Status EnqueueTensorAlltoall(std::shared_ptr context, return Status::InvalidArgument("alltoall expects a 1D splits tensor"); } if (splits->dtype() != HOROVOD_INT32) { - return Status::InvalidArgument("alltoall expects splits to contain 32-bit integer elements."); + return Status::InvalidArgument( + "alltoall expects splits to contain 32-bit integer elements."); } Request message; @@ -1697,18 +1697,20 @@ Status EnqueueTensorAlltoall(std::shared_ptr context, auto splits_data = static_cast(splits->data()); auto sum = std::accumulate(splits_data, splits_data + splits_first_dim, 0); if (sum > tensor_first_dim) { - return Status::InvalidArgument("Sum of splits entries is greater than the first dimension of tensor."); + return Status::InvalidArgument("Sum of splits entries is greater than " + "the first dimension of tensor."); } - e.splits.assign(splits_data, - 
splits_data + splits->shape().num_elements()); + e.splits.assign(splits_data, splits_data + splits->shape().num_elements()); } else if (splits_first_dim == 0) { if (tensor_first_dim % world_size != 0) { - return Status::InvalidArgument("splits not provided, but first dimension of tensor is not an even " - "multiple of the number of workers."); + return Status::InvalidArgument( + "splits not provided, but first dimension of tensor is not an even " + "multiple of the number of workers."); } e.splits.resize(world_size, tensor_first_dim / world_size); } else { - return Status::InvalidArgument("Number of entries in splits does not equal number of workers."); + return Status::InvalidArgument( + "Number of entries in splits does not equal number of workers."); } if (horovod_global.shut_down) { @@ -1716,7 +1718,8 @@ Status EnqueueTensorAlltoall(std::shared_ptr context, } Status status = process_set.tensor_queue.AddToTensorQueue(e, message); if (status.ok()) { - LOG(TRACE, horovod_global.global_controller->GetRank()) << "Enqueued " << name; + LOG(TRACE, horovod_global.global_controller->GetRank()) + << "Enqueued " << name; } return status; } @@ -1725,9 +1728,8 @@ Status EnqueueTensorAlltoall(std::shared_ptr context, // must be running before this function is called. Status EnqueueJoin(std::shared_ptr context, std::shared_ptr output_last_joined_rank, - ReadyEventList ready_event_list, - const std::string& name, const int device, - StatusCallback callback, + ReadyEventList ready_event_list, const std::string& name, + const int device, StatusCallback callback, int32_t process_set_id) { auto& process_set = horovod_global.process_set_table.Get(process_set_id); @@ -1750,8 +1752,45 @@ Status EnqueueJoin(std::shared_ptr context, } Status status = process_set.tensor_queue.AddToTensorQueue(e, message); if (status.ok()) { - LOG(TRACE, horovod_global.global_controller->GetRank()) << "Enqueued " << name; + LOG(TRACE, horovod_global.global_controller->GetRank()) + << "Enqueued " << name; + } + return status; +} + +// Contexts and controller must be initialized and the background thread +// must be running before this function is called. +Status EnqueueBarrier(StatusCallback callback, int32_t process_set_id) { + auto& process_set = horovod_global.process_set_table.Get(process_set_id); + + if (!process_set.IsCurrentProcessIncluded()) { + return Status::InvalidArgument( + "Barrier: Rank " + + std::to_string(horovod_global.global_controller->GetRank()) + + " is not a member of the provided process set."); + } + + Request message; + // Barrier doesn't need a tensor, we set an arbitrary name for tracing + // purposes. 
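The alltoall validation a few lines above accepts a 1-D int32 splits tensor whose entries sum to at most the tensor's first dimension, and falls back to an even split when no splits are given (requiring the first dimension to be a multiple of the worker count). A hedged usage sketch with the torch binding:

import torch
import horovod.torch as hvd

hvd.init()
world = hvd.size()

data = torch.arange(4 * world, dtype=torch.float32)

even = hvd.alltoall(data)                              # no splits: first dim must divide evenly

splits = torch.tensor([4] * world, dtype=torch.int32)  # explicit 1-D int32 splits
uneven = hvd.alltoall(data, splits=splits)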
+ message.set_tensor_name(BARRIER_TENSOR_NAME); + message.set_request_rank(process_set.controller->GetRank()); + message.set_request_type(Request::BARRIER); + + TensorTableEntry e; + e.tensor_name = BARRIER_TENSOR_NAME; + e.process_set_id = process_set_id; + e.callback = callback; + + if (horovod_global.shut_down) { + return SHUT_DOWN_ERROR; + } + Status status = process_set.tensor_queue.AddToTensorQueue(e, message); + if (status.ok()) { + LOG(TRACE, horovod_global.global_controller->GetRank()) + << "Enqueued barrier op"; } + return status; } diff --git a/horovod/common/operations.h b/horovod/common/operations.h index 5737c1f259..21c0dbeb06 100644 --- a/horovod/common/operations.h +++ b/horovod/common/operations.h @@ -233,6 +233,9 @@ Status EnqueueJoin(std::shared_ptr context, StatusCallback callback, int32_t process_set_id = 0); +Status EnqueueBarrier(StatusCallback callback, + int32_t process_set_id = 0); + } // namespace common } // namespace horovod diff --git a/horovod/common/ops/adasum_gpu_operations.cc b/horovod/common/ops/adasum_gpu_operations.cc index e2e3197cab..ef94f7ab62 100644 --- a/horovod/common/ops/adasum_gpu_operations.cc +++ b/horovod/common/ops/adasum_gpu_operations.cc @@ -49,7 +49,8 @@ Status AdasumGpuAllreduceOp::Execute(std::vector& entries, WaitForData(entries); - // Lazily initialize reduction communicators for VHDD algorithm when Adasum reduction is actually called. + // Lazily initialize reduction communicators for VHDD algorithm when Adasum + // reduction is actually called. if (!reduction_comms_initialized) { InitializeVHDDReductionComms(); } @@ -66,7 +67,7 @@ AdasumGpuAllreduceOp::NcclHierarchical(std::vector& entries, const Response& response) { assert(!entries.empty()); auto& first_entry = entries[0]; - assert(first_entry.process_set_id == 0); // TODO: generalize + assert(first_entry.process_set_id == 0); // TODO: generalize auto& process_set = global_state_->process_set_table.Get(first_entry.process_set_id); @@ -88,8 +89,8 @@ AdasumGpuAllreduceOp::NcclHierarchical(std::vector& entries, MemcpyInFusionBuffer(entries, fused_input_data, buffer_data, buffer_len); if (global_state_->timeline.Initialized()) { gpu_context_->RecordEvent(gpu_op_context_.event_queue, - MEMCPY_IN_FUSION_BUFFER, - *gpu_op_context_.stream); + MEMCPY_IN_FUSION_BUFFER, + *gpu_op_context_.stream); } } else { fused_input_data = first_entry.tensor->data(); @@ -97,11 +98,13 @@ AdasumGpuAllreduceOp::NcclHierarchical(std::vector& entries, buffer_len = (size_t)first_entry.output->size(); } - int64_t num_elements = buffer_len / DataType_Size(first_entry.tensor->dtype()); + int64_t num_elements = + buffer_len / DataType_Size(first_entry.tensor->dtype()); if (response.prescale_factor() != 1.0) { // Execute prescaling op - ScaleBuffer(response.prescale_factor(), entries, fused_input_data, buffer_data, num_elements); + ScaleBuffer(response.prescale_factor(), entries, fused_input_data, + buffer_data, num_elements); fused_input_data = buffer_data; // for unfused, scale is done out of place } @@ -134,9 +137,8 @@ AdasumGpuAllreduceOp::NcclHierarchical(std::vector& entries, // non-divisible part (if any), do NCCL Reduce (at rank local_size-1), // MPI Allreduce (across rank (local_size-1)'s), and NCCL Bcast - int64_t num_elements_per_rank = process_set.controller->IsHomogeneous() - ? num_elements / local_size - : 0; + int64_t num_elements_per_rank = + process_set.controller->IsHomogeneous() ? 
num_elements / local_size : 0; size_t buffer_len_per_rank = element_size * num_elements_per_rank; @@ -172,25 +174,28 @@ AdasumGpuAllreduceOp::NcclHierarchical(std::vector& entries, (size_t)num_elements_per_rank, GetNCCLDataType(first_entry.tensor), ncclSum, *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream); - nccl_context_->ErrorCheck("ncclReduceScatter", nccl_result, *nccl_op_context_.nccl_comm_); + nccl_context_->ErrorCheck("ncclReduceScatter", nccl_result, + *nccl_op_context_.nccl_comm_); if (global_state_->timeline.Initialized()) { - gpu_context_->RecordEvent(gpu_op_context_.event_queue, - NCCL_REDUCESCATTER, *gpu_op_context_.stream); + gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_REDUCESCATTER, + *gpu_op_context_.stream); } } if (num_elements_remaining > 0) { // Reduce the remaining data at local_size-1 to append to // existing buffer - auto nccl_result = ncclReduce( - fused_input_data_remainder, buffer_data_remainder, - (size_t)num_elements_remaining, GetNCCLDataType(first_entry.tensor), - ncclSum, root_rank, *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream); - - nccl_context_->ErrorCheck("ncclReduce", nccl_result, *nccl_op_context_.nccl_comm_); + auto nccl_result = + ncclReduce(fused_input_data_remainder, buffer_data_remainder, + (size_t)num_elements_remaining, + GetNCCLDataType(first_entry.tensor), ncclSum, root_rank, + *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream); + + nccl_context_->ErrorCheck("ncclReduce", nccl_result, + *nccl_op_context_.nccl_comm_); if (global_state_->timeline.Initialized()) { gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_REDUCE, - *gpu_op_context_.stream); + *gpu_op_context_.stream); } } @@ -199,8 +204,13 @@ AdasumGpuAllreduceOp::NcclHierarchical(std::vector& entries, // a buffer is not safe since the tensor can be arbitrarily large. host_buffer = GetHostBuffer((uint64_t)total_buffer_len); // Synchronize. - gpu_context_->WaitForEvents(gpu_op_context_.event_queue, entries, - timeline, nullptr, global_state_->elastic_enabled); + if (global_state_->elastic_enabled) { + gpu_context_->WaitForEventsElastic(gpu_op_context_.event_queue, entries, + timeline, nullptr); + } else { + gpu_context_->WaitForEvents(gpu_op_context_.event_queue, entries, + timeline, nullptr); + } // According to https://docs.nvidia.com/cuda/cuda-runtime-api/ // api-sync-behavior.html#api-sync-behavior__memcpy-async, @@ -263,24 +273,24 @@ AdasumGpuAllreduceOp::NcclHierarchical(std::vector& entries, entries, (void*)host_buffer, (void*)recv_buffer, tensor_counts, local_size, // start_level mpi_context_->GetMPICommunicator(process_set.controller->IsHomogeneous() - ? Communicator::GLOBAL - : Communicator::CROSS), + ? 
Communicator::GLOBAL + : Communicator::CROSS), 0, reduction_comms_, first_entry.tensor->dtype(), global_state_); timeline.ActivityEndAll(entries); timeline.ActivityStartAll(entries, MEMCPY_OUT_HOST_BUFFER); - gpu_context_->MemcpyAsyncH2D(buffer_data_at_rank_offset, - host_buffer, total_buffer_len, - *gpu_op_context_.stream); + gpu_context_->MemcpyAsyncH2D(buffer_data_at_rank_offset, host_buffer, + total_buffer_len, *gpu_op_context_.stream); timeline.ActivityEndAll(entries); } if (num_elements_per_rank > 0) { nccl_context_->ErrorCheck( - "ncclAllGather", ncclAllGather(buffer_data_at_rank_offset, buffer_data, - (size_t)num_elements_per_rank, - GetNCCLDataType(first_entry.tensor), - *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream), + "ncclAllGather", + ncclAllGather(buffer_data_at_rank_offset, buffer_data, + (size_t)num_elements_per_rank, + GetNCCLDataType(first_entry.tensor), + *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream), *nccl_op_context_.nccl_comm_); if (global_state_->timeline.Initialized()) { gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_ALLGATHER, @@ -288,12 +298,22 @@ AdasumGpuAllreduceOp::NcclHierarchical(std::vector& entries, } } if (num_elements_remaining > 0) { +#if NCCL_VERSION_CODE >= NCCL_VERSION(2, 2, 12) + nccl_context_->ErrorCheck( + "ncclBroadcast", + ncclBroadcast(buffer_data_remainder, buffer_data_remainder, + (size_t)num_elements_remaining, + GetNCCLDataType(first_entry.tensor), root_rank, + *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream), + *nccl_op_context_.nccl_comm_); +#else nccl_context_->ErrorCheck( "ncclBcast", ncclBcast(buffer_data_remainder, (size_t)num_elements_remaining, - GetNCCLDataType(first_entry.tensor), root_rank, *nccl_op_context_.nccl_comm_, - *gpu_op_context_.stream), + GetNCCLDataType(first_entry.tensor), root_rank, + *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream), *nccl_op_context_.nccl_comm_); +#endif if (global_state_->timeline.Initialized()) { gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_BCAST, *gpu_op_context_.stream); @@ -302,7 +322,8 @@ AdasumGpuAllreduceOp::NcclHierarchical(std::vector& entries, if (response.postscale_factor() != 1.0) { // Execute postscaling op - ScaleBuffer(response.postscale_factor(), entries, buffer_data, buffer_data, num_elements); + ScaleBuffer(response.postscale_factor(), entries, buffer_data, buffer_data, + num_elements); } // Copy memory out of the fusion buffer. 
@@ -319,10 +340,9 @@ AdasumGpuAllreduceOp::NcclHierarchical(std::vector& entries, return gpu_op_context_.FinalizeGPUQueue(entries, false); } -bool AdasumGpuAllreduceOp::Enabled( - const ParameterManager& param_manager, - const std::vector& entries, - const Response& response) const { +bool AdasumGpuAllreduceOp::Enabled(const ParameterManager& param_manager, + const std::vector& entries, + const Response& response) const { return entries[0].device != CPU_DEVICE_ID; } } // namespace common diff --git a/horovod/common/ops/collective_operations.cc b/horovod/common/ops/collective_operations.cc index 747555eb4d..80b5f09f30 100644 --- a/horovod/common/ops/collective_operations.cc +++ b/horovod/common/ops/collective_operations.cc @@ -312,6 +312,22 @@ Status JoinOp::Execute(std::vector& entries, return Status::OK(); } +// Barrier +BarrierOp::BarrierOp(HorovodGlobalState* global_state) : HorovodOp(global_state) {} + +Status BarrierOp::Execute(std::vector& entries, + const Response& response) { + assert(entries.size() == 1); + int& process_set_id = entries[0].process_set_id; + auto& process_set = global_state_->process_set_table.Get(process_set_id); + + + process_set.controller->Barrier(Communicator::GLOBAL); + LOG(TRACE, global_state_->global_controller->GetRank()) << "Released from barrier."; + + return Status::OK(); +} + // Error ErrorOp::ErrorOp(HorovodGlobalState* global_state) : HorovodOp(global_state) {} diff --git a/horovod/common/ops/collective_operations.h b/horovod/common/ops/collective_operations.h index c863b9083a..250b848235 100644 --- a/horovod/common/ops/collective_operations.h +++ b/horovod/common/ops/collective_operations.h @@ -289,6 +289,15 @@ class JoinOp : public HorovodOp { const Response& response, ProcessSet& process_set); }; +class BarrierOp : public HorovodOp { +public: + explicit BarrierOp(HorovodGlobalState* global_state); + + virtual ~BarrierOp() = default; + + virtual Status Execute(std::vector& entries, const Response& response); +}; + class ErrorOp : public HorovodOp { public: explicit ErrorOp(HorovodGlobalState* global_state); diff --git a/horovod/common/ops/cuda_operations.cc b/horovod/common/ops/cuda_operations.cc index 73398974a0..9e0eb53d38 100644 --- a/horovod/common/ops/cuda_operations.cc +++ b/horovod/common/ops/cuda_operations.cc @@ -14,10 +14,10 @@ // limitations under the License. // ============================================================================= -#include "gpu_operations.h" -#include "cuda/cuda_kernels.h" -#include "../message.h" #include "../hashes.h" +#include "../message.h" +#include "cuda/cuda_kernels.h" +#include "gpu_operations.h" #include @@ -39,7 +39,8 @@ class GPUContext::impl { auto& queue = cuda_events[key]; if (!prepopulated[key]) { // On first call for device and stream pair, prepopulate event queue. - // This is to minimize event reuse of callback events passed to framework. + // This is to minimize event reuse of callback events passed to + // framework. 
for (int i = 0; i < N_CUDA_EVENTS_PREPOPULATE; ++i) { cudaEvent_t ev; status = cudaEventCreateWithFlags(&ev, cudaEventDisableTiming); @@ -49,6 +50,7 @@ class GPUContext::impl { } if (!queue.empty()) { *event = queue.front(); + event->event_idx = ++cuda_event_idx[key]; queue.pop(); return cudaSuccess; } @@ -58,6 +60,9 @@ class GPUContext::impl { status = cudaEventCreateWithFlags(&ev, cudaEventDisableTiming); event->event = std::make_shared(ev); event->stream = stream; + auto key2 = std::make_pair(device, stream); + event->event_idx = ++cuda_event_idx[key2]; + return status; } @@ -81,28 +86,62 @@ class GPUContext::impl { void ErrorCheck(std::string op_name, cudaError_t cuda_result) { if (cuda_result != cudaSuccess) { - throw std::logic_error(std::string(op_name) + " failed: " + cudaGetErrorString(cuda_result)); + throw std::logic_error(std::string(op_name) + + " failed: " + cudaGetErrorString(cuda_result)); } } - void RecordEvent(std::queue>& event_queue, std::string name, cudaStream_t& stream) { + void RecordEvent(std::queue>& event_queue, + std::string name, cudaStream_t& stream) { Event event; ErrorCheck("GetGpuEvent", GetGpuEvent(&event, stream)); - ErrorCheck("cudaEventRecord", cudaEventRecord(*(event.event), event.stream)); + ErrorCheck("cudaEventRecord", + cudaEventRecord(*(event.event), event.stream)); event_queue.emplace(name, event); } Event RecordEvent(cudaStream_t& stream) { Event event; ErrorCheck("GetGpuEvent", GetGpuEvent(&event, stream)); - ErrorCheck("cudaEventRecord", cudaEventRecord(*(event.event), event.stream)); + ErrorCheck("cudaEventRecord", + cudaEventRecord(*(event.event), event.stream)); return event; } void WaitForEvents(std::queue>& event_queue, - const std::vector& entries, Timeline& timeline, - const std::function& error_check_callback, - bool elastic) { + const std::vector& entries, + Timeline& timeline, + const std::function& error_check_callback) { + while (!event_queue.empty()) { + std::string name; + Event event; + std::tie(name, event) = event_queue.front(); + event_queue.pop(); + if (name != "") { + timeline.ActivityStartAll(entries, name); + } + + cudaError_t cuda_result = cudaEventSynchronize(*(event.event)); + if (cuda_result != cudaSuccess) { + throw std::logic_error(std::string("cudaEventSynchronize failed: ") + + cudaGetErrorString(cuda_result)); + } + if (error_check_callback) { + error_check_callback(); + } + + if (name != "") { + timeline.ActivityEndAll(entries); + } + ErrorCheck("ReleaseGpuEvent", ReleaseGpuEvent(event)); + } + } + + void + WaitForEventsElastic(std::queue>& event_queue, + const std::vector& entries, + Timeline& timeline, + const std::function& error_check_callback) { while (!event_queue.empty()) { std::string name; Event event; @@ -112,32 +151,24 @@ class GPUContext::impl { timeline.ActivityStartAll(entries, name); } - // Check for async (networking) errors while waiting for the event to complete - if (elastic) { - cudaError_t cuda_result; - while (true) { - cuda_result = cudaEventQuery(*(event.event)); - if (cuda_result == cudaSuccess) { - break; - } - - if (cuda_result != cudaErrorNotReady) { - throw std::logic_error(std::string("cudaEventQuery failed: ") + cudaGetErrorString(cuda_result)); - } - - if (error_check_callback) { - error_check_callback(); - } - std::this_thread::yield(); + // Check for async (networking) errors while waiting for the event to + // complete + cudaError_t cuda_result; + while (true) { + cuda_result = cudaEventQuery(*(event.event)); + if (cuda_result == cudaSuccess) { + break; } - } else { - 
cudaError_t cuda_result = cudaEventSynchronize(*(event.event)); - if (cuda_result != cudaSuccess) { - throw std::logic_error(std::string("cudaEventSynchronize failed: ") + cudaGetErrorString(cuda_result)); + + if (cuda_result != cudaErrorNotReady) { + throw std::logic_error(std::string("cudaEventQuery failed: ") + + cudaGetErrorString(cuda_result)); } + if (error_check_callback) { error_check_callback(); } + std::this_thread::yield(); } if (name != "") { @@ -148,9 +179,10 @@ class GPUContext::impl { } void ClearEvents(std::queue>& event_queue, - const std::vector& entries, Timeline& timeline, - const std::function& error_check_callback, - bool elastic) { + const std::vector& entries, + Timeline& timeline, + const std::function& error_check_callback, + bool elastic) { while (!event_queue.empty()) { std::string name; Event event; @@ -167,12 +199,13 @@ class GPUContext::impl { } } - void StreamCreate(cudaStream_t *stream) { + void StreamCreate(cudaStream_t* stream) { int greatest_priority; ErrorCheck("cudaDeviceGetStreamPriorityRange", - cudaDeviceGetStreamPriorityRange(NULL, &greatest_priority)); + cudaDeviceGetStreamPriorityRange(NULL, &greatest_priority)); ErrorCheck("cudaStreamCreateWithPriority", - cudaStreamCreateWithPriority(stream, cudaStreamNonBlocking, greatest_priority)); + cudaStreamCreateWithPriority(stream, cudaStreamNonBlocking, + greatest_priority)); } void StreamSynchronize(cudaStream_t stream) { @@ -189,30 +222,44 @@ class GPUContext::impl { ErrorCheck("cudaSetDevice", cudaSetDevice(device)); } - void MemcpyAsyncD2D(void* dst, const void* src, size_t count, cudaStream_t stream) { - ErrorCheck("cudaMemcpyAsync", cudaMemcpyAsync(dst, src, count, cudaMemcpyDeviceToDevice, stream)); + void MemcpyAsyncD2D(void* dst, const void* src, size_t count, + cudaStream_t stream) { + ErrorCheck( + "cudaMemcpyAsync", + cudaMemcpyAsync(dst, src, count, cudaMemcpyDeviceToDevice, stream)); } - void MemcpyAsyncH2D(void* dst, const void* src, size_t count, cudaStream_t stream) { - ErrorCheck("cudaMemcpyAsync", cudaMemcpyAsync(dst, src, count, cudaMemcpyHostToDevice, stream)); + void MemcpyAsyncH2D(void* dst, const void* src, size_t count, + cudaStream_t stream) { + ErrorCheck( + "cudaMemcpyAsync", + cudaMemcpyAsync(dst, src, count, cudaMemcpyHostToDevice, stream)); } - void MemcpyAsyncD2H(void* dst, const void* src, size_t count, cudaStream_t stream) { - ErrorCheck("cudaMemcpyAsync", cudaMemcpyAsync(dst, src, count, cudaMemcpyDeviceToHost, stream)); + void MemcpyAsyncD2H(void* dst, const void* src, size_t count, + cudaStream_t stream) { + ErrorCheck( + "cudaMemcpyAsync", + cudaMemcpyAsync(dst, src, count, cudaMemcpyDeviceToHost, stream)); } - void ScaleBufferImpl(const void* fused_input_data, void* buffer_data, int64_t num_elements, - double scale_factor, DataType dtype, cudaStream_t stream) { - ScaleBufferCudaImpl(fused_input_data, buffer_data, num_elements, scale_factor, dtype, stream); + void ScaleBufferImpl(const void* fused_input_data, void* buffer_data, + int64_t num_elements, double scale_factor, + DataType dtype, cudaStream_t stream) { + ScaleBufferCudaImpl(fused_input_data, buffer_data, num_elements, + scale_factor, dtype, stream); // TODO: https://github.com/horovod/horovod/issues/2230 - //ErrorCheck("ScaleBufferCudaImpl", cudaGetLastError()); + // ErrorCheck("ScaleBufferCudaImpl", cudaGetLastError()); } private: - // We reuse CUDA events as it appears that their creation carries non-zero cost. 
- std::unordered_map, std::queue> cuda_events; + // We reuse CUDA events as it appears that their creation carries non-zero + // cost. + std::unordered_map, std::queue> + cuda_events; std::unordered_map, bool> prepopulated; + std::unordered_map, std::atomic> cuda_event_idx; std::mutex cuda_events_mutex; static constexpr int N_CUDA_EVENTS_PREPOPULATE = 128; diff --git a/horovod/common/ops/gpu_context_impl.cc b/horovod/common/ops/gpu_context_impl.cc index d910a819df..751c553817 100644 --- a/horovod/common/ops/gpu_context_impl.cc +++ b/horovod/common/ops/gpu_context_impl.cc @@ -1,15 +1,15 @@ GPUContext::GPUContext() : pimpl{new impl} {} GPUContext::~GPUContext() = default; -void GPUContext::Finalize() { - finalizer_thread_pool.reset(); -} +void GPUContext::Finalize() { finalizer_thread_pool.reset(); } void GPUContext::ErrorCheck(std::string op_name, gpuError_t gpu_result) { pimpl->ErrorCheck(op_name, gpu_result); } -void GPUContext::RecordEvent(std::queue>& event_queue, std::string name, gpuStream_t& stream) { +void GPUContext::RecordEvent( + std::queue>& event_queue, std::string name, + gpuStream_t& stream) { pimpl->RecordEvent(event_queue, name, stream); } @@ -21,19 +21,30 @@ void GPUContext::ReleaseEvent(Event event) { pimpl->ErrorCheck("ReleaseGpuEvent", pimpl->ReleaseGpuEvent(event)); } -void GPUContext::WaitForEvents(std::queue>& event_queue, const std::vector& entries, - Timeline& timeline, const std::function& error_check_callback, - bool elastic) { - pimpl->WaitForEvents(event_queue, entries, timeline, error_check_callback, elastic); +void GPUContext::WaitForEvents( + std::queue>& event_queue, + const std::vector& entries, Timeline& timeline, + const std::function& error_check_callback) { + pimpl->WaitForEvents(event_queue, entries, timeline, error_check_callback); +} + +void GPUContext::WaitForEventsElastic( + std::queue>& event_queue, + const std::vector& entries, Timeline& timeline, + const std::function& error_check_callback) { + pimpl->WaitForEventsElastic(event_queue, entries, timeline, + error_check_callback); } -void GPUContext::ClearEvents(std::queue>& event_queue, const std::vector& entries, - Timeline& timeline, const std::function& error_check_callback, - bool elastic) { - pimpl->ClearEvents(event_queue, entries, timeline, error_check_callback, elastic); +void GPUContext::ClearEvents( + std::queue>& event_queue, + const std::vector& entries, Timeline& timeline, + const std::function& error_check_callback, bool elastic) { + pimpl->ClearEvents(event_queue, entries, timeline, error_check_callback, + elastic); } -void GPUContext::StreamCreate(gpuStream_t *stream) { +void GPUContext::StreamCreate(gpuStream_t* stream) { pimpl->StreamCreate(stream); } @@ -41,28 +52,29 @@ void GPUContext::StreamSynchronize(gpuStream_t stream) { pimpl->StreamSynchronize(stream); } -int GPUContext::GetDevice() { - return pimpl->GetDevice(); -} +int GPUContext::GetDevice() { return pimpl->GetDevice(); } -void GPUContext::SetDevice(int device) { - pimpl->SetDevice(device); -} +void GPUContext::SetDevice(int device) { pimpl->SetDevice(device); } -void GPUContext::MemcpyAsyncD2D(void* dst, const void* src, size_t count, gpuStream_t stream) { +void GPUContext::MemcpyAsyncD2D(void* dst, const void* src, size_t count, + gpuStream_t stream) { pimpl->MemcpyAsyncD2D(dst, src, count, stream); } -void GPUContext::MemcpyAsyncH2D(void* dst, const void* src, size_t count, gpuStream_t stream) { +void GPUContext::MemcpyAsyncH2D(void* dst, const void* src, size_t count, + gpuStream_t stream) { 
pimpl->MemcpyAsyncH2D(dst, src, count, stream); } -void GPUContext::MemcpyAsyncD2H(void* dst, const void* src, size_t count, gpuStream_t stream) { +void GPUContext::MemcpyAsyncD2H(void* dst, const void* src, size_t count, + gpuStream_t stream) { pimpl->MemcpyAsyncD2H(dst, src, count, stream); } -void GPUContext::ScaleBufferImpl(const void* fused_input_data, void* buffer_data, int64_t num_elements, - double scale_factor, DataType dtype, gpuStream_t stream) { - pimpl->ScaleBufferImpl(fused_input_data, buffer_data, num_elements, scale_factor, dtype, stream); +void GPUContext::ScaleBufferImpl(const void* fused_input_data, + void* buffer_data, int64_t num_elements, + double scale_factor, DataType dtype, + gpuStream_t stream) { + pimpl->ScaleBufferImpl(fused_input_data, buffer_data, num_elements, + scale_factor, dtype, stream); } - diff --git a/horovod/common/ops/gpu_operations.cc b/horovod/common/ops/gpu_operations.cc index 03866bbbcc..1fa7e8c42d 100644 --- a/horovod/common/ops/gpu_operations.cc +++ b/horovod/common/ops/gpu_operations.cc @@ -24,7 +24,8 @@ namespace horovod { namespace common { -GPUOpContext::GPUOpContext(GPUContext* context, HorovodGlobalState* global_state) +GPUOpContext::GPUOpContext(GPUContext* context, + HorovodGlobalState* global_state) : gpu_context_(context), global_state_(global_state) {} void GPUOpContext::InitGPU(const std::vector& entries) { @@ -32,23 +33,29 @@ void GPUOpContext::InitGPU(const std::vector& entries) { gpu_context_->SetDevice(first_entry.device); // Ensure stream is in the map before executing reduction. - gpuStream_t& stream = gpu_context_->streams[global_state_->current_nccl_stream][first_entry.device]; + gpuStream_t& stream = + gpu_context_ + ->streams[global_state_->current_nccl_stream][first_entry.device]; if (stream == nullptr) { gpu_context_->StreamCreate(&stream); } } -void GPUOpContext::InitGPUQueue(const std::vector& entries, const Response& response) { +void GPUOpContext::InitGPUQueue(const std::vector& entries, + const Response& response) { event_queue = std::queue>(); - stream = &gpu_context_->streams[global_state_->current_nccl_stream][entries[0].device]; + stream = + &gpu_context_ + ->streams[global_state_->current_nccl_stream][entries[0].device]; if (global_state_->timeline.Initialized()) { gpu_context_->RecordEvent(event_queue, QUEUE, *stream); } } -Status GPUOpContext::FinalizeGPUQueue(std::vector& entries, bool free_host_buffer /*= true*/, - const std::function& error_check_callback) { +Status GPUOpContext::FinalizeGPUQueue( + std::vector& entries, bool free_host_buffer /*= true*/, + const std::function& error_check_callback) { // Use completion marker via event because it's faster than // blocking gpuStreamSynchronize() in this thread. if (!global_state_->enable_async_completion) { @@ -60,67 +67,99 @@ Status GPUOpContext::FinalizeGPUQueue(std::vector& entries, bo auto& evt_queue = event_queue; auto& timeline = global_state_->timeline; auto& gpu_context = gpu_context_; + auto& global_state = global_state_; - // Claim a std::shared_ptr to the fusion buffer to prevent its memory from being reclaimed - // during finalization. + // Claim a std::shared_ptr to the fusion buffer to prevent its memory from + // being reclaimed during finalization. 
auto fusion_buffer = global_state_->fusion_buffer.GetBuffer( - first_entry.device, first_entry.context->framework(), global_state_->current_nccl_stream); + first_entry.device, first_entry.context->framework(), + global_state_->current_nccl_stream); bool elastic = global_state_->elastic_enabled; bool enable_async_completion = global_state_->enable_async_completion; auto current_stream = *stream; - gpu_context_->finalizer_thread_pool.execute([entries, first_entry, cpu_buffer, fusion_buffer, free_host_buffer, - evt_queue, &timeline, &gpu_context, error_check_callback, - elastic, enable_async_completion, current_stream]() mutable { - gpu_context->SetDevice(first_entry.device); - - Event event; - if (!enable_async_completion || timeline.Initialized()) { - // If timeline is enabled, wait for events on CPU for accurate timings. - gpu_context->WaitForEvents(evt_queue, entries, timeline, error_check_callback, elastic); - } else { - gpu_context->ClearEvents(evt_queue, entries, timeline, error_check_callback, elastic); - event = gpu_context->RecordEvent(current_stream); - } - - if (free_host_buffer && cpu_buffer != nullptr) { - free(cpu_buffer); - } - - for (auto& e : entries) { - timeline.End(e.tensor_name, e.output); - auto status = Status::OK(); - status.event = event; - e.FinishWithCallback(status); - } - if (enable_async_completion) { - gpu_context->ReleaseEvent(event); - } - }); + gpu_context_->finalizer_thread_pool.execute( + [entries, first_entry, cpu_buffer, fusion_buffer, free_host_buffer, + evt_queue, &timeline, &gpu_context, error_check_callback, elastic, + enable_async_completion, current_stream, &global_state]() mutable { + gpu_context->SetDevice(first_entry.device); + + Event event; + bool gpu_evt_failed = false; + std::string gpu_evt_err_msg; + if (!enable_async_completion || timeline.Initialized()) { + // If timeline is enabled, wait for events on CPU for accurate + // timings. 
+ if (elastic) { + try { + gpu_context->WaitForEventsElastic(evt_queue, entries, timeline, + error_check_callback); + } catch (std::exception& e) { + // notify background loop to exit and reinit rather than just + // aborting the program + global_state->shut_down = true; + gpu_evt_failed = true; + gpu_evt_err_msg = e.what(); + } + } else { + gpu_context->WaitForEvents(evt_queue, entries, timeline, + error_check_callback); + } + } else { + gpu_context->ClearEvents(evt_queue, entries, timeline, + error_check_callback, elastic); + event = gpu_context->RecordEvent(current_stream); + } + + if (free_host_buffer && cpu_buffer != nullptr) { + free(cpu_buffer); + } + + Status status; + if (gpu_evt_failed) { + status = Status::UnknownError(gpu_evt_err_msg); + } else { + status = Status::OK(); + status.event = event; + } + + for (auto& e : entries) { + timeline.End(e.tensor_name, e.output); + e.FinishWithCallback(status); + } + if (enable_async_completion) { + gpu_context->ReleaseEvent(event); + } + }); // Update current stream - global_state_->current_nccl_stream = (global_state_->current_nccl_stream + 1) % - global_state_->num_nccl_streams; + global_state_->current_nccl_stream = + (global_state_->current_nccl_stream + 1) % + global_state_->num_nccl_streams; return Status::InProgress(); } -GPUAllreduce::GPUAllreduce(GPUContext* context, HorovodGlobalState* global_state) - : AllreduceOp(global_state), gpu_context_(context), gpu_op_context_(context, global_state) {} +GPUAllreduce::GPUAllreduce(GPUContext* context, + HorovodGlobalState* global_state) + : AllreduceOp(global_state), gpu_context_(context), + gpu_op_context_(context, global_state) {} bool GPUAllreduce::Enabled(const ParameterManager& param_manager, - const std::vector& entries, - const Response& response) const { + const std::vector& entries, + const Response& response) const { return entries[0].device != CPU_DEVICE_ID; } #if HAVE_CUDA -void GPUAllreduce::MemcpyInFusionBuffer(const std::vector& entries, const void*& fused_input_data, - void*& buffer_data, size_t& buffer_len) { +void GPUAllreduce::MemcpyInFusionBuffer( + const std::vector& entries, const void*& fused_input_data, + void*& buffer_data, size_t& buffer_len) { // Access the fusion buffer. 
auto& first_entry = entries[0]; auto buffer = global_state_->fusion_buffer.GetBuffer( - first_entry.device, first_entry.context->framework(), global_state_->current_nccl_stream); + first_entry.device, first_entry.context->framework(), + global_state_->current_nccl_stream); buffer_data = const_cast(buffer->AccessData(first_entry.context)); if (global_state_->batch_d2d_memcopies) { @@ -135,18 +174,24 @@ void GPUAllreduce::MemcpyInFusionBuffer(const std::vector& ent // Set input/output pointers and sizes d2d_params.out[idx % BATCHED_D2D_CAPACITY] = buffer_data_at_offset; - d2d_params.in[idx % BATCHED_D2D_CAPACITY] = (void*) e.tensor->data(); + d2d_params.in[idx % BATCHED_D2D_CAPACITY] = (void*)e.tensor->data(); d2d_params.sizes[idx % BATCHED_D2D_CAPACITY] = e.tensor->size(); - offset += BATCHED_D2D_PADDING * ((e.tensor->size() + BATCHED_D2D_PADDING - 1) / BATCHED_D2D_PADDING); + offset += + BATCHED_D2D_PADDING * + ((e.tensor->size() + BATCHED_D2D_PADDING - 1) / BATCHED_D2D_PADDING); idx++; count++; - if (idx % BATCHED_D2D_CAPACITY == 0 || idx == (int) entries.size()) { + if (idx % BATCHED_D2D_CAPACITY == 0 || idx == (int)entries.size()) { // Perform batched d2d memcpy - BatchedD2DMemcpyCudaImpl(d2d_params, count, gpu_context_->streams[global_state_->current_nccl_stream][first_entry.device]); + BatchedD2DMemcpyCudaImpl( + d2d_params, count, + gpu_context_->streams[global_state_->current_nccl_stream] + [first_entry.device]); // TODO: https://github.com/horovod/horovod/issues/2230 - //gpu_context_->ErrorCheck("BatchedD2DMemcpyCudaImpl", cudaGetLastError()); + // gpu_context_->ErrorCheck("BatchedD2DMemcpyCudaImpl", + // cudaGetLastError()); count = 0; } } @@ -155,12 +200,12 @@ void GPUAllreduce::MemcpyInFusionBuffer(const std::vector& ent } else { int64_t offset = 0; for (auto& e : entries) { - void* buffer_data_at_offset = (uint8_t*) buffer_data + offset; + void* buffer_data_at_offset = (uint8_t*)buffer_data + offset; MemcpyEntryInFusionBuffer(entries, e, buffer_data_at_offset); offset += e.tensor->size(); } - buffer_len = (size_t) offset; + buffer_len = (size_t)offset; } // Set the input data to originate from the buffer. @@ -169,12 +214,14 @@ void GPUAllreduce::MemcpyInFusionBuffer(const std::vector& ent #endif #if HAVE_CUDA -void GPUAllreduce::ScaleMemcpyInFusionBuffer(const std::vector& entries, const void*& fused_input_data, - void*& buffer_data, size_t& buffer_len, double scale_factor) { +void GPUAllreduce::ScaleMemcpyInFusionBuffer( + const std::vector& entries, const void*& fused_input_data, + void*& buffer_data, size_t& buffer_len, double scale_factor) { auto& first_entry = entries[0]; // Access the fusion buffer. 
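The batched fusion-buffer copy loop above rounds each tensor up to the device-to-device padding granularity and flushes a batch whenever the per-call capacity is reached or the entries run out. The sketch below reproduces just that bookkeeping on the host, with illustrative constants standing in for BATCHED_D2D_PADDING and BATCHED_D2D_CAPACITY; the actual CUDA batched-memcpy kernels are not invoked here.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Illustrative values; the real constants live alongside the CUDA kernels.
    constexpr int64_t kPadding  = 16;   // stand-in for BATCHED_D2D_PADDING
    constexpr int     kCapacity = 160;  // stand-in for BATCHED_D2D_CAPACITY

    // Plan batched copies: pad each tensor to the alignment granularity and
    // "flush" whenever kCapacity entries are queued or the input is exhausted.
    int64_t PlanBatches(const std::vector<int64_t>& tensor_sizes) {
      int64_t offset = 0;
      int idx = 0, count = 0;
      for (int64_t size : tensor_sizes) {
        offset += kPadding * ((size + kPadding - 1) / kPadding);
        ++idx;
        ++count;
        if (idx % kCapacity == 0 || idx == static_cast<int>(tensor_sizes.size())) {
          std::printf("flush batch of %d entries, padded bytes so far: %lld\n",
                      count, static_cast<long long>(offset));
          count = 0;  // the next batched memcpy call would start a fresh batch
        }
      }
      return offset;  // total padded fusion-buffer length
    }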
auto buffer = global_state_->fusion_buffer.GetBuffer( - first_entry.device, first_entry.context->framework(), global_state_->current_nccl_stream); + first_entry.device, first_entry.context->framework(), + global_state_->current_nccl_stream); buffer_data = const_cast(buffer->AccessData(first_entry.context)); if (global_state_->batch_d2d_memcopies) { @@ -188,19 +235,24 @@ void GPUAllreduce::ScaleMemcpyInFusionBuffer(const std::vector // Set input/output pointers and sizes d2d_params.out[idx % BATCHED_D2D_CAPACITY] = buffer_data_at_offset; - d2d_params.in[idx % BATCHED_D2D_CAPACITY] = (void*) e.tensor->data(); + d2d_params.in[idx % BATCHED_D2D_CAPACITY] = (void*)e.tensor->data(); d2d_params.sizes[idx % BATCHED_D2D_CAPACITY] = e.tensor->size(); - offset += BATCHED_D2D_PADDING * ((e.tensor->size() + BATCHED_D2D_PADDING - 1) / BATCHED_D2D_PADDING); + offset += + BATCHED_D2D_PADDING * + ((e.tensor->size() + BATCHED_D2D_PADDING - 1) / BATCHED_D2D_PADDING); idx++; count++; - if (idx % BATCHED_D2D_CAPACITY == 0 || idx == (int) entries.size()) { + if (idx % BATCHED_D2D_CAPACITY == 0 || idx == (int)entries.size()) { // Perform batched d2d memcpy - BatchedScaledD2DMemcpyCudaImpl(d2d_params, count, scale_factor, first_entry.tensor->dtype(), - gpu_context_->streams[global_state_->current_nccl_stream][first_entry.device]); + BatchedScaledD2DMemcpyCudaImpl( + d2d_params, count, scale_factor, first_entry.tensor->dtype(), + gpu_context_->streams[global_state_->current_nccl_stream] + [first_entry.device]); // TODO: https://github.com/horovod/horovod/issues/2230 - //gpu_context_->ErrorCheck("BatchedScaledD2DMemcpyCudaImpl", cudaGetLastError()); + // gpu_context_->ErrorCheck("BatchedScaledD2DMemcpyCudaImpl", + // cudaGetLastError()); count = 0; } } @@ -209,15 +261,17 @@ void GPUAllreduce::ScaleMemcpyInFusionBuffer(const std::vector } else { int64_t offset = 0; for (auto& e : entries) { - void* buffer_data_at_offset = (uint8_t*) buffer_data + offset; + void* buffer_data_at_offset = (uint8_t*)buffer_data + offset; MemcpyEntryInFusionBuffer(entries, e, buffer_data_at_offset); offset += e.tensor->size(); } - buffer_len = (size_t) offset; - int64_t num_elements = buffer_len / DataType_Size(first_entry.tensor->dtype()); + buffer_len = (size_t)offset; + int64_t num_elements = + buffer_len / DataType_Size(first_entry.tensor->dtype()); if (scale_factor != 1.0) { - ScaleBuffer(scale_factor, entries, buffer_data, buffer_data, num_elements); + ScaleBuffer(scale_factor, entries, buffer_data, buffer_data, + num_elements); } } @@ -226,16 +280,19 @@ void GPUAllreduce::ScaleMemcpyInFusionBuffer(const std::vector } #endif - -void GPUAllreduce::MemcpyEntryInFusionBuffer(const std::vector& entries, - const TensorTableEntry& e, void* buffer_data_at_offset) { +void GPUAllreduce::MemcpyEntryInFusionBuffer( + const std::vector& entries, const TensorTableEntry& e, + void* buffer_data_at_offset) { auto& first_entry = entries[0]; - gpu_context_->MemcpyAsyncD2D(buffer_data_at_offset, e.tensor->data(), (size_t) e.tensor->size(), - gpu_context_->streams[global_state_->current_nccl_stream][first_entry.device]); + gpu_context_->MemcpyAsyncD2D( + buffer_data_at_offset, e.tensor->data(), (size_t)e.tensor->size(), + gpu_context_ + ->streams[global_state_->current_nccl_stream][first_entry.device]); } #if HAVE_CUDA -void GPUAllreduce::MemcpyOutFusionBuffer(const void* buffer_data, std::vector& entries) { +void GPUAllreduce::MemcpyOutFusionBuffer( + const void* buffer_data, std::vector& entries) { if (global_state_->batch_d2d_memcopies) { int64_t 
offset = 0; int idx = 0; @@ -251,15 +308,21 @@ void GPUAllreduce::MemcpyOutFusionBuffer(const void* buffer_data, std::vectorsize(); - offset += BATCHED_D2D_PADDING * ((e.tensor->size() + BATCHED_D2D_PADDING - 1) / BATCHED_D2D_PADDING); + offset += + BATCHED_D2D_PADDING * + ((e.tensor->size() + BATCHED_D2D_PADDING - 1) / BATCHED_D2D_PADDING); idx++; count++; - if (idx % BATCHED_D2D_CAPACITY == 0 || idx == (int) entries.size()) { + if (idx % BATCHED_D2D_CAPACITY == 0 || idx == (int)entries.size()) { // Perform batched d2d memcpy - BatchedD2DMemcpyCudaImpl(d2d_params, count, gpu_context_->streams[global_state_->current_nccl_stream][first_entry.device]); + BatchedD2DMemcpyCudaImpl( + d2d_params, count, + gpu_context_->streams[global_state_->current_nccl_stream] + [first_entry.device]); // TODO: https://github.com/horovod/horovod/issues/2230 - //gpu_context_->ErrorCheck("BatchedD2DMemcpyCudaImpl", cudaGetLastError()); + // gpu_context_->ErrorCheck("BatchedD2DMemcpyCudaImpl", + // cudaGetLastError()); count = 0; } } @@ -267,7 +330,7 @@ void GPUAllreduce::MemcpyOutFusionBuffer(const void* buffer_data, std::vectorsize(); } @@ -276,8 +339,9 @@ void GPUAllreduce::MemcpyOutFusionBuffer(const void* buffer_data, std::vector& entries) { +void GPUAllreduce::ScaleMemcpyOutFusionBuffer( + void* buffer_data, size_t buffer_len, double scale_factor, + std::vector& entries) { auto& first_entry = entries[0]; if (global_state_->batch_d2d_memcopies) { @@ -294,29 +358,36 @@ void GPUAllreduce::ScaleMemcpyOutFusionBuffer(void* buffer_data, size_t buffer_l d2d_params.in[idx % BATCHED_D2D_CAPACITY] = buffer_data_at_offset; d2d_params.sizes[idx % BATCHED_D2D_CAPACITY] = e.tensor->size(); - offset += BATCHED_D2D_PADDING * ((e.tensor->size() + BATCHED_D2D_PADDING - 1) / BATCHED_D2D_PADDING); + offset += + BATCHED_D2D_PADDING * + ((e.tensor->size() + BATCHED_D2D_PADDING - 1) / BATCHED_D2D_PADDING); idx++; count++; - if (idx % BATCHED_D2D_CAPACITY == 0 || idx == (int) entries.size()) { + if (idx % BATCHED_D2D_CAPACITY == 0 || idx == (int)entries.size()) { // Perform batched d2d memcpy - BatchedScaledD2DMemcpyCudaImpl(d2d_params, count, scale_factor, first_entry.tensor->dtype(), - gpu_context_->streams[global_state_->current_nccl_stream][first_entry.device]); + BatchedScaledD2DMemcpyCudaImpl( + d2d_params, count, scale_factor, first_entry.tensor->dtype(), + gpu_context_->streams[global_state_->current_nccl_stream] + [first_entry.device]); // TODO: https://github.com/horovod/horovod/issues/2230 - //gpu_context_->ErrorCheck("BatchedD2DMemcpyCudaImpl", cudaGetLastError()); + // gpu_context_->ErrorCheck("BatchedD2DMemcpyCudaImpl", + // cudaGetLastError()); count = 0; } } } else { - int64_t num_elements = buffer_len / DataType_Size(first_entry.tensor->dtype()); + int64_t num_elements = + buffer_len / DataType_Size(first_entry.tensor->dtype()); if (scale_factor != 1.0) { - ScaleBuffer(scale_factor, entries, buffer_data, buffer_data, num_elements); + ScaleBuffer(scale_factor, entries, buffer_data, buffer_data, + num_elements); } int64_t offset = 0; for (auto& e : entries) { - void* buffer_data_at_offset = (uint8_t*) buffer_data + offset; + void* buffer_data_at_offset = (uint8_t*)buffer_data + offset; MemcpyEntryOutFusionBuffer(entries, buffer_data_at_offset, e); offset += e.tensor->size(); } @@ -324,22 +395,31 @@ void GPUAllreduce::ScaleMemcpyOutFusionBuffer(void* buffer_data, size_t buffer_l } #endif -void GPUAllreduce::MemcpyEntryOutFusionBuffer(const std::vector& entries, - const void* buffer_data_at_offset, TensorTableEntry& 
e) { +void GPUAllreduce::MemcpyEntryOutFusionBuffer( + const std::vector& entries, + const void* buffer_data_at_offset, TensorTableEntry& e) { auto& first_entry = entries[0]; - gpu_context_->MemcpyAsyncD2D((void*) e.output->data(), buffer_data_at_offset, (size_t) e.tensor->size(), - gpu_context_->streams[global_state_->current_nccl_stream][first_entry.device]); + gpu_context_->MemcpyAsyncD2D( + (void*)e.output->data(), buffer_data_at_offset, (size_t)e.tensor->size(), + gpu_context_ + ->streams[global_state_->current_nccl_stream][first_entry.device]); } -void GPUAllreduce::ScaleBuffer(double scale_factor, const std::vector& entries, - const void* fused_input_data, void* buffer_data, int64_t num_elements) { - gpu_context_->ScaleBufferImpl(fused_input_data, buffer_data, num_elements, scale_factor, entries[0].tensor->dtype(), - gpu_context_->streams[global_state_->current_nccl_stream][entries[0].device]); - +void GPUAllreduce::ScaleBuffer(double scale_factor, + const std::vector& entries, + const void* fused_input_data, void* buffer_data, + int64_t num_elements) { + gpu_context_->ScaleBufferImpl( + fused_input_data, buffer_data, num_elements, scale_factor, + entries[0].tensor->dtype(), + gpu_context_ + ->streams[global_state_->current_nccl_stream][entries[0].device]); } -GPUAllgather::GPUAllgather(GPUContext* context, HorovodGlobalState* global_state) - : AllgatherOp(global_state), gpu_context_(context), gpu_op_context_(context, global_state) {} +GPUAllgather::GPUAllgather(GPUContext* context, + HorovodGlobalState* global_state) + : AllgatherOp(global_state), gpu_context_(context), + gpu_op_context_(context, global_state) {} bool GPUAllgather::Enabled(const ParameterManager& param_manager, const std::vector& entries, @@ -347,24 +427,32 @@ bool GPUAllgather::Enabled(const ParameterManager& param_manager, return entries[0].device != CPU_DEVICE_ID; } -void GPUAllgather::MemcpyEntryInFusionBuffer(const std::vector& entries, - const TensorTableEntry& e, void* buffer_data_at_offset) { +void GPUAllgather::MemcpyEntryInFusionBuffer( + const std::vector& entries, const TensorTableEntry& e, + void* buffer_data_at_offset) { auto& first_entry = entries[0]; - gpu_context_->MemcpyAsyncD2D(buffer_data_at_offset, e.tensor->data(), (size_t) e.tensor->size(), - gpu_context_->streams[global_state_->current_nccl_stream][first_entry.device]); + gpu_context_->MemcpyAsyncD2D( + buffer_data_at_offset, e.tensor->data(), (size_t)e.tensor->size(), + gpu_context_ + ->streams[global_state_->current_nccl_stream][first_entry.device]); } -void GPUAllgather::MemcpyEntryOutFusionBuffer(const std::vector& entries, - const void* buffer_data_at_offset, TensorTableEntry& e, - int64_t entry_offset, size_t entry_size) { +void GPUAllgather::MemcpyEntryOutFusionBuffer( + const std::vector& entries, + const void* buffer_data_at_offset, TensorTableEntry& e, + int64_t entry_offset, size_t entry_size) { auto& first_entry = entries[0]; - gpu_context_->MemcpyAsyncD2D((int8_t*)e.output->data() + entry_offset, buffer_data_at_offset, entry_size, - gpu_context_->streams[global_state_->current_nccl_stream][first_entry.device]); + gpu_context_->MemcpyAsyncD2D( + (int8_t*)e.output->data() + entry_offset, buffer_data_at_offset, + entry_size, + gpu_context_ + ->streams[global_state_->current_nccl_stream][first_entry.device]); } GPUBroadcast::GPUBroadcast(GPUContext* context, HorovodGlobalState* global_state) - : BroadcastOp(global_state), gpu_context_(context), gpu_op_context_(context, global_state) {} + : BroadcastOp(global_state), 
gpu_context_(context), + gpu_op_context_(context, global_state) {} bool GPUBroadcast::Enabled(const ParameterManager& param_manager, const std::vector& entries, @@ -372,9 +460,9 @@ bool GPUBroadcast::Enabled(const ParameterManager& param_manager, return entries[0].device != CPU_DEVICE_ID; } -GPUAlltoall::GPUAlltoall(GPUContext* context, - HorovodGlobalState* global_state) - : AlltoallOp(global_state), gpu_context_(context), gpu_op_context_(context, global_state) {} +GPUAlltoall::GPUAlltoall(GPUContext* context, HorovodGlobalState* global_state) + : AlltoallOp(global_state), gpu_context_(context), + gpu_op_context_(context, global_state) {} bool GPUAlltoall::Enabled(const ParameterManager& param_manager, const std::vector& entries, diff --git a/horovod/common/ops/gpu_operations.h b/horovod/common/ops/gpu_operations.h index 09d4abc837..6ab69c2985 100644 --- a/horovod/common/ops/gpu_operations.h +++ b/horovod/common/ops/gpu_operations.h @@ -35,8 +35,8 @@ using gpuEvent_t = hipEvent_t; using gpuStream_t = hipStream_t; #endif -#include "collective_operations.h" #include "../thread_pool.h" +#include "collective_operations.h" namespace horovod { namespace common { @@ -66,36 +66,47 @@ class GPUContext { void ErrorCheck(std::string op_name, gpuError_t gpu_result); - void RecordEvent(std::queue>& event_queue, std::string name, - gpuStream_t& stream); + void RecordEvent(std::queue>& event_queue, + std::string name, gpuStream_t& stream); Event RecordEvent(gpuStream_t& stream); void ReleaseEvent(Event event); - void WaitForEvents(std::queue>& event_queue, - const std::vector& entries, Timeline& timeline, - const std::function& error_check_callback = nullptr, - bool elastic = false); + void + WaitForEvents(std::queue>& event_queue, + const std::vector& entries, + Timeline& timeline, + const std::function& error_check_callback = nullptr); + + void WaitForEventsElastic( + std::queue>& event_queue, + const std::vector& entries, Timeline& timeline, + const std::function& error_check_callback = nullptr); void ClearEvents(std::queue>& event_queue, - const std::vector& entries, Timeline& timeline, + const std::vector& entries, + Timeline& timeline, const std::function& error_check_callback = nullptr, bool elastic = false); - void StreamCreate(gpuStream_t *stream); + void StreamCreate(gpuStream_t* stream); void StreamSynchronize(gpuStream_t stream); int GetDevice(); void SetDevice(int device); - void MemcpyAsyncD2D(void* dst, const void* src, size_t count, gpuStream_t stream); - void MemcpyAsyncH2D(void* dst, const void* src, size_t count, gpuStream_t stream); - void MemcpyAsyncD2H(void* dst, const void* src, size_t count, gpuStream_t stream); + void MemcpyAsyncD2D(void* dst, const void* src, size_t count, + gpuStream_t stream); + void MemcpyAsyncH2D(void* dst, const void* src, size_t count, + gpuStream_t stream); + void MemcpyAsyncD2H(void* dst, const void* src, size_t count, + gpuStream_t stream); - void ScaleBufferImpl(const void* fused_input_data, void* buffer_data, int64_t num_elements, - double scale_factor, DataType dtype, gpuStream_t stream); + void ScaleBufferImpl(const void* fused_input_data, void* buffer_data, + int64_t num_elements, double scale_factor, + DataType dtype, gpuStream_t stream); // Thread pool for finalizer threads ThreadPool finalizer_thread_pool; @@ -107,21 +118,25 @@ class GPUContext { class GPUOpContext { public: - GPUOpContext(GPUContext* context, - HorovodGlobalState* global_state); + GPUOpContext(GPUContext* context, HorovodGlobalState* global_state); void InitGPU(const 
std::vector& entries); - void InitGPUQueue(const std::vector& entries, const Response& response); + void InitGPUQueue(const std::vector& entries, + const Response& response); - Status FinalizeGPUQueue(std::vector& entries, bool free_host_buffer = true, - const std::function& error_check_callback = nullptr); + Status + FinalizeGPUQueue(std::vector& entries, + bool free_host_buffer = true, + const std::function& error_check_callback = nullptr); - // GPU events are used as an alternative to host-device synchronization (which stalls the GPU pipeline) - // for the purpose of recording timing on the Horovod timeline. + // GPU events are used as an alternative to host-device synchronization (which + // stalls the GPU pipeline) for the purpose of recording timing on the Horovod + // timeline. // - // When an event we wish to record occurs (for example, NCCL_ALLREDUCE), the event is enqueued. After the entire - // operation completes, a background thread is spawned to synchronize on the events in the queue and record + // When an event we wish to record occurs (for example, NCCL_ALLREDUCE), the + // event is enqueued. After the entire operation completes, a background + // thread is spawned to synchronize on the events in the queue and record // timing, while allowing Horovod to continue processing additional tensors. // // For more information of CUDA Events, see: @@ -138,8 +153,7 @@ class GPUOpContext { class GPUAllreduce : public AllreduceOp { public: - GPUAllreduce(GPUContext* context, - HorovodGlobalState* global_state); + GPUAllreduce(GPUContext* context, HorovodGlobalState* global_state); bool Enabled(const ParameterManager& param_manager, const std::vector& entries, @@ -147,35 +161,42 @@ class GPUAllreduce : public AllreduceOp { protected: #if HAVE_CUDA - void MemcpyInFusionBuffer(const std::vector& entries, const void*& fused_input_data, - void*& buffer_data, size_t& buffer_len) override; - - void MemcpyOutFusionBuffer(const void* buffer_data, std::vector& entries) override; - - void ScaleMemcpyInFusionBuffer(const std::vector& entries, const void*& fused_input_data, - void*& buffer_data, size_t& buffer_len, double scale_factor); - void ScaleMemcpyOutFusionBuffer(void* buffer_data, size_t buffer_len, double scale_factor, + void MemcpyInFusionBuffer(const std::vector& entries, + const void*& fused_input_data, void*& buffer_data, + size_t& buffer_len) override; + + void MemcpyOutFusionBuffer(const void* buffer_data, + std::vector& entries) override; + + void ScaleMemcpyInFusionBuffer(const std::vector& entries, + const void*& fused_input_data, + void*& buffer_data, size_t& buffer_len, + double scale_factor); + void ScaleMemcpyOutFusionBuffer(void* buffer_data, size_t buffer_len, + double scale_factor, std::vector& entries); #endif void MemcpyEntryInFusionBuffer(const std::vector& entries, - const TensorTableEntry& e, void* buffer_data_at_offset) override; + const TensorTableEntry& e, + void* buffer_data_at_offset) override; void MemcpyEntryOutFusionBuffer(const std::vector& entries, - const void* buffer_data_at_offset, TensorTableEntry& e) override; + const void* buffer_data_at_offset, + TensorTableEntry& e) override; - void ScaleBuffer(double scale_factor, const std::vector& entries, - const void* fused_input_data, void* buffer_data, int64_t num_elements); + void ScaleBuffer(double scale_factor, + const std::vector& entries, + const void* fused_input_data, void* buffer_data, + int64_t num_elements); GPUContext* gpu_context_; GPUOpContext gpu_op_context_; - }; class GPUAllgather : public 
AllgatherOp { public: - GPUAllgather(GPUContext* context, - HorovodGlobalState* global_state); + GPUAllgather(GPUContext* context, HorovodGlobalState* global_state); bool Enabled(const ParameterManager& param_manager, const std::vector& entries, @@ -183,11 +204,13 @@ class GPUAllgather : public AllgatherOp { protected: void MemcpyEntryInFusionBuffer(const std::vector& entries, - const TensorTableEntry& e, void* buffer_data_at_offset) override; + const TensorTableEntry& e, + void* buffer_data_at_offset) override; void MemcpyEntryOutFusionBuffer(const std::vector& entries, - const void* buffer_data_at_offset, TensorTableEntry& e, - int64_t entry_offset, size_t entry_size) override; + const void* buffer_data_at_offset, + TensorTableEntry& e, int64_t entry_offset, + size_t entry_size) override; GPUContext* gpu_context_; GPUOpContext gpu_op_context_; @@ -195,8 +218,7 @@ class GPUAllgather : public AllgatherOp { class GPUBroadcast : public BroadcastOp { public: - GPUBroadcast(GPUContext* context, - HorovodGlobalState* global_state); + GPUBroadcast(GPUContext* context, HorovodGlobalState* global_state); bool Enabled(const ParameterManager& param_manager, const std::vector& entries, @@ -209,11 +231,11 @@ class GPUBroadcast : public BroadcastOp { class GPUAlltoall : public AlltoallOp { public: - GPUAlltoall(GPUContext* context, - HorovodGlobalState* global_state); + GPUAlltoall(GPUContext* context, HorovodGlobalState* global_state); bool Enabled(const ParameterManager& param_manager, const std::vector& entries, const Response& response) const override; + protected: GPUContext* gpu_context_; GPUOpContext gpu_op_context_; @@ -222,4 +244,4 @@ class GPUAlltoall : public AlltoallOp { } // namespace common } // namespace horovod -#endif //HOROVOD_GPU_OPERATIONS_H +#endif // HOROVOD_GPU_OPERATIONS_H diff --git a/horovod/common/ops/hip_operations.cc b/horovod/common/ops/hip_operations.cc index 3ee9e63b52..0479c6151e 100644 --- a/horovod/common/ops/hip_operations.cc +++ b/horovod/common/ops/hip_operations.cc @@ -14,8 +14,8 @@ // limitations under the License. 
// ============================================================================= -#include "gpu_operations.h" #include "../message.h" +#include "gpu_operations.h" #include @@ -64,21 +64,24 @@ class GPUContext::impl { void ErrorCheck(std::string op_name, hipError_t hip_result) { if (hip_result != hipSuccess) { - throw std::logic_error(std::string(op_name) + " failed: " + hipGetErrorString(hip_result)); + throw std::logic_error(std::string(op_name) + + " failed: " + hipGetErrorString(hip_result)); } } - void RecordEvent(std::queue>& event_queue, std::string name, hipStream_t& stream) { + void RecordEvent(std::queue>& event_queue, + std::string name, hipStream_t& stream) { hipEvent_t event; ErrorCheck("GetGpuEvent", GetGpuEvent(&event)); ErrorCheck("hipEventRecord", hipEventRecord(event, stream)); event_queue.emplace(name, event); } - void WaitForEvents(std::queue>& event_queue, - const std::vector& entries, Timeline& timeline, - const std::function& error_check_callback, - bool elastic) { + void + WaitForEvents(std::queue>& event_queue, + const std::vector& entries, + Timeline& timeline, + const std::function& error_check_callback) { while (!event_queue.empty()) { std::string name; hipEvent_t event; @@ -88,7 +91,8 @@ class GPUContext::impl { timeline.ActivityStartAll(entries, name); } - // Check for async (networking) errors while waiting for the event to complete + // Check for async (networking) errors while waiting for the event to + // complete hipError_t hip_result; while (true) { hip_result = hipEventQuery(event); @@ -97,7 +101,8 @@ class GPUContext::impl { } if (hip_result != hipErrorNotReady) { - throw std::logic_error(std::string("hipEventQuery failed: ") + hipGetErrorString(hip_result)); + throw std::logic_error(std::string("hipEventQuery failed: ") + + hipGetErrorString(hip_result)); } if (error_check_callback) { @@ -113,12 +118,20 @@ class GPUContext::impl { } } - void StreamCreate(hipStream_t *stream) { + void WaitForEventsElastic( + std::queue>& event_queue, + const std::vector& entries, Timeline& timeline, + const std::function& error_check_callback) { + WaitForEvents(event_queue, entries, timeline, error_check_callback); + } + + void StreamCreate(hipStream_t* stream) { int greatest_priority; ErrorCheck("hipDeviceGetStreamPriorityRange", - hipDeviceGetStreamPriorityRange(NULL, &greatest_priority)); + hipDeviceGetStreamPriorityRange(NULL, &greatest_priority)); ErrorCheck("hipStreamCreateWithPriority", - hipStreamCreateWithPriority(stream, hipStreamNonBlocking, greatest_priority)); + hipStreamCreateWithPriority(stream, hipStreamNonBlocking, + greatest_priority)); } void StreamSynchronize(hipStream_t stream) { @@ -135,25 +148,34 @@ class GPUContext::impl { ErrorCheck("hipSetDevice", hipSetDevice(device)); } - void MemcpyAsyncD2D(void* dst, const void* src, size_t count, hipStream_t stream) { - ErrorCheck("hipMemcpyAsync", hipMemcpyAsync(dst, src, count, hipMemcpyDeviceToDevice, stream)); + void MemcpyAsyncD2D(void* dst, const void* src, size_t count, + hipStream_t stream) { + ErrorCheck( + "hipMemcpyAsync", + hipMemcpyAsync(dst, src, count, hipMemcpyDeviceToDevice, stream)); } - void MemcpyAsyncH2D(void* dst, const void* src, size_t count, hipStream_t stream) { - ErrorCheck("hipMemcpyAsync", hipMemcpyAsync(dst, src, count, hipMemcpyHostToDevice, stream)); + void MemcpyAsyncH2D(void* dst, const void* src, size_t count, + hipStream_t stream) { + ErrorCheck("hipMemcpyAsync", + hipMemcpyAsync(dst, src, count, hipMemcpyHostToDevice, stream)); } - void MemcpyAsyncD2H(void* dst, const 
void* src, size_t count, hipStream_t stream) { - ErrorCheck("hipMemcpyAsync", hipMemcpyAsync(dst, src, count, hipMemcpyDeviceToHost, stream)); + void MemcpyAsyncD2H(void* dst, const void* src, size_t count, + hipStream_t stream) { + ErrorCheck("hipMemcpyAsync", + hipMemcpyAsync(dst, src, count, hipMemcpyDeviceToHost, stream)); } - void ScaleBufferImpl(const void* fused_input_data, void* buffer_data, int64_t num_elements, - double scale_factor, DataType dtype, hipStream_t stream) { + void ScaleBufferImpl(const void* fused_input_data, void* buffer_data, + int64_t num_elements, double scale_factor, + DataType dtype, hipStream_t stream) { throw std::logic_error("ScaleBuffer not implemented for AMD GPUs."); } private: - // We reuse HIP events as it appears that their creation carries non-zero cost. + // We reuse HIP events as it appears that their creation carries non-zero + // cost. std::unordered_map> hip_events; std::mutex hip_events_mutex; }; diff --git a/horovod/common/ops/nccl_operations.cc b/horovod/common/ops/nccl_operations.cc index d241ec7ca2..29cd5666eb 100644 --- a/horovod/common/ops/nccl_operations.cc +++ b/horovod/common/ops/nccl_operations.cc @@ -26,37 +26,52 @@ namespace common { ncclDataType_t GetNCCLDataType(const std::shared_ptr tensor) { switch (tensor->dtype()) { - case HOROVOD_UINT8: - return ncclUint8; - case HOROVOD_INT8: - return ncclInt8; - case HOROVOD_INT32: - return ncclInt32; - case HOROVOD_INT64: - return ncclInt64; - case HOROVOD_FLOAT16: - return ncclFloat16; - case HOROVOD_FLOAT32: - return ncclFloat32; - case HOROVOD_FLOAT64: - return ncclFloat64; - default: - throw std::logic_error("Type " + DataType_Name(tensor->dtype()) + - " is not supported in NCCL mode."); + case HOROVOD_UINT8: + return ncclUint8; + case HOROVOD_INT8: + return ncclInt8; + case HOROVOD_INT32: + return ncclInt32; + case HOROVOD_INT64: + return ncclInt64; + case HOROVOD_FLOAT16: + return ncclFloat16; + case HOROVOD_FLOAT32: + return ncclFloat32; + case HOROVOD_FLOAT64: + return ncclFloat64; + default: + throw std::logic_error("Type " + DataType_Name(tensor->dtype()) + + " is not supported in NCCL mode."); } } -void NCCLContext::ErrorCheck(std::string op_name, ncclResult_t nccl_result, ncclComm_t& nccl_comm) { +void commDestroyOrAbort(ncclComm_t& nccl_comm, bool elastic) { + ncclResult_t nccl_async_err; + auto nccl_err = ncclCommGetAsyncError(nccl_comm, &nccl_async_err); + if (nccl_err != ncclSuccess) { + return; + } + if (nccl_async_err == ncclSuccess && !elastic) { + ncclCommDestroy(nccl_comm); + } else { + ncclCommAbort(nccl_comm); + } +} + +void NCCLContext::ErrorCheck(std::string op_name, ncclResult_t nccl_result, + ncclComm_t& nccl_comm) { if (nccl_result != ncclSuccess) { ncclCommAbort(nccl_comm); - throw std::logic_error(std::string(op_name) + " failed: " + ncclGetErrorString(nccl_result)); + throw std::logic_error(std::string(op_name) + + " failed: " + ncclGetErrorString(nccl_result)); } } -void NCCLContext::ShutDown(){ - for(auto it = nccl_comms.begin(); it != nccl_comms.end(); ++it) { +void NCCLContext::ShutDown() { + for (auto it = nccl_comms.begin(); it != nccl_comms.end(); ++it) { for (auto entry = it->begin(); entry != it->end(); ++entry) { - ncclCommDestroy(entry->second); + commDestroyOrAbort(entry->second, elastic); } } nccl_comms.clear(); @@ -86,14 +101,16 @@ void NCCLOpContext::InitNCCLComm(const std::vector& entries, ncclUniqueId nccl_id; if (nccl_rank == 0) { - nccl_context_->ErrorCheck("ncclGetUniqueId", ncclGetUniqueId(&nccl_id), nccl_comm); + 
nccl_context_->ErrorCheck("ncclGetUniqueId", ncclGetUniqueId(&nccl_id), + nccl_comm); } process_set.controller->Bcast((void*)&nccl_id, sizeof(nccl_id), 0, nccl_id_bcast_comm); ncclComm_t new_nccl_comm; - auto nccl_result = ncclCommInitRank(&new_nccl_comm, nccl_size, nccl_id, nccl_rank); + auto nccl_result = + ncclCommInitRank(&new_nccl_comm, nccl_size, nccl_id, nccl_rank); nccl_context_->ErrorCheck("ncclCommInitRank", nccl_result, nccl_comm); nccl_comm = new_nccl_comm; @@ -110,15 +127,16 @@ void NCCLOpContext::AsyncErrorCheck() { ncclResult_t nccl_async_err; auto nccl_err = ncclCommGetAsyncError(*nccl_comm_, &nccl_async_err); if (nccl_err != ncclSuccess) { - throw std::logic_error(std::string("ncclGetAsyncError failed: ") + ncclGetErrorString(nccl_err)); + throw std::logic_error(std::string("ncclGetAsyncError failed: ") + + ncclGetErrorString(nccl_err)); } if (nccl_async_err != ncclSuccess) { - ncclCommAbort(*nccl_comm_); - throw std::logic_error(std::string("NCCL async error: ") + ncclGetErrorString(nccl_async_err)); + // do not call ncclCommAbort(*nccl_comm_) from event polling thread to avoid + // race condition + throw std::logic_error(std::string("NCCL async error: ") + + ncclGetErrorString(nccl_async_err)); } - - } void NCCLOpContext::PopulateNCCLCommStrategy(int& nccl_rank, int& nccl_size, @@ -131,8 +149,9 @@ void NCCLOpContext::PopulateNCCLCommStrategy(int& nccl_rank, int& nccl_size, nccl_rank = process_set.controller->GetLocalRank(); nccl_size = process_set.controller->GetLocalSize(); } else { - throw std::logic_error("Communicator type " + std::to_string(communicator_type_) + - " is not supported in NCCL mode."); + throw std::logic_error("Communicator type " + + std::to_string(communicator_type_) + + " is not supported in NCCL mode."); } nccl_id_bcast_comm = communicator_type_; } @@ -169,52 +188,66 @@ Status NCCLAllreduce::Execute(std::vector& entries, // Copy (and possibly scale) tensors into the fusion buffer. if (entries.size() > 1) { - ScaleMemcpyInFusionBuffer(entries, fused_input_data, buffer_data, buffer_len, response.prescale_factor()); + ScaleMemcpyInFusionBuffer(entries, fused_input_data, buffer_data, + buffer_len, response.prescale_factor()); if (global_state_->timeline.Initialized()) { - gpu_context_->RecordEvent(gpu_op_context_.event_queue, MEMCPY_IN_FUSION_BUFFER, *gpu_op_context_.stream); + gpu_context_->RecordEvent(gpu_op_context_.event_queue, + MEMCPY_IN_FUSION_BUFFER, + *gpu_op_context_.stream); } } else { fused_input_data = first_entry.tensor->data(); - buffer_data = (void*) first_entry.output->data(); - buffer_len = (size_t) first_entry.output->size(); - int64_t num_elements = buffer_len / DataType_Size(first_entry.tensor->dtype()); + buffer_data = (void*)first_entry.output->data(); + buffer_len = (size_t)first_entry.output->size(); + int64_t num_elements = + buffer_len / DataType_Size(first_entry.tensor->dtype()); if (response.prescale_factor() != 1.0) { // Execute prescaling op - ScaleBuffer(response.prescale_factor(), entries, fused_input_data, buffer_data, num_elements); + ScaleBuffer(response.prescale_factor(), entries, fused_input_data, + buffer_data, num_elements); fused_input_data = buffer_data; // for unfused, scale is done out of place } } // Do allreduce. 
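The allreduce step that follows issues a single ncclAllReduce over the fused buffer and converts any NCCL error into an exception. A minimal sketch of that call pattern is below; it assumes an already-initialized ncclComm_t and cudaStream_t supplied by the caller, and leaves event recording and finalization to the surrounding machinery.

    #include <cuda_runtime.h>
    #include <nccl.h>
    #include <stdexcept>
    #include <string>

    // Enqueue one sum-allreduce over the fusion buffer on the given stream.
    void AllreduceFused(const void* fused_input, void* fused_output,
                        size_t num_elements, ncclDataType_t dtype,
                        ncclComm_t comm, cudaStream_t stream) {
      ncclResult_t result = ncclAllReduce(fused_input, fused_output, num_elements,
                                          dtype, ncclSum, comm, stream);
      if (result != ncclSuccess) {
        throw std::logic_error(std::string("ncclAllReduce failed: ") +
                               ncclGetErrorString(result));
      }
      // The reduction is only enqueued here; completion is observed later by the
      // finalizer thread via recorded GPU events.
    }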
- int64_t num_elements = buffer_len / DataType_Size(first_entry.tensor->dtype()); - auto nccl_result = ncclAllReduce(fused_input_data, buffer_data, - (size_t) num_elements, - GetNCCLDataType(first_entry.tensor), ncclSum, - *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream); - nccl_context_->ErrorCheck("ncclAllReduce", nccl_result, *nccl_op_context_.nccl_comm_); + int64_t num_elements = + buffer_len / DataType_Size(first_entry.tensor->dtype()); + auto nccl_result = + ncclAllReduce(fused_input_data, buffer_data, (size_t)num_elements, + GetNCCLDataType(first_entry.tensor), ncclSum, + *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream); + nccl_context_->ErrorCheck("ncclAllReduce", nccl_result, + *nccl_op_context_.nccl_comm_); if (global_state_->timeline.Initialized()) { - gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_ALLREDUCE, *gpu_op_context_.stream); + gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_ALLREDUCE, + *gpu_op_context_.stream); } // Copy (and possible scale) tensors out of the fusion buffer. if (entries.size() > 1) { - ScaleMemcpyOutFusionBuffer(buffer_data, buffer_len, response.postscale_factor(), entries); + ScaleMemcpyOutFusionBuffer(buffer_data, buffer_len, + response.postscale_factor(), entries); if (global_state_->timeline.Initialized()) { - gpu_context_->RecordEvent(gpu_op_context_.event_queue, MEMCPY_OUT_FUSION_BUFFER, *gpu_op_context_.stream); + gpu_context_->RecordEvent(gpu_op_context_.event_queue, + MEMCPY_OUT_FUSION_BUFFER, + *gpu_op_context_.stream); } } else { if (response.postscale_factor() != 1.0) { // Execute postscaling op - ScaleBuffer(response.postscale_factor(), entries, buffer_data, buffer_data, num_elements); + ScaleBuffer(response.postscale_factor(), entries, buffer_data, + buffer_data, num_elements); } } - return gpu_op_context_.FinalizeGPUQueue(entries, true, nccl_op_context_.error_check_callback_); + return gpu_op_context_.FinalizeGPUQueue( + entries, true, nccl_op_context_.error_check_callback_); } #if HAVE_MPI -void NCCLHierarchicalAllreduce::WaitForData(std::vector& entries) { +void NCCLHierarchicalAllreduce::WaitForData( + std::vector& entries) { if (global_state_->timeline.Initialized()) { // If timeline is initialized, need to use normal CPU syncing path HorovodOp::WaitForData(entries); @@ -261,19 +294,23 @@ NCCLHierarchicalAllreduce::Execute(std::vector& entries, MemcpyInFusionBuffer(entries, fused_input_data, buffer_data, buffer_len); if (global_state_->timeline.Initialized()) { - gpu_context_->RecordEvent(gpu_op_context_.event_queue, MEMCPY_IN_FUSION_BUFFER, *gpu_op_context_.stream); + gpu_context_->RecordEvent(gpu_op_context_.event_queue, + MEMCPY_IN_FUSION_BUFFER, + *gpu_op_context_.stream); } } else { fused_input_data = first_entry.tensor->data(); - buffer_data = (void*) first_entry.output->data(); - buffer_len = (size_t) first_entry.output->size(); + buffer_data = (void*)first_entry.output->data(); + buffer_len = (size_t)first_entry.output->size(); } - int64_t num_elements = buffer_len / DataType_Size(first_entry.tensor->dtype()); + int64_t num_elements = + buffer_len / DataType_Size(first_entry.tensor->dtype()); if (response.prescale_factor() != 1.0) { // Execute prescaling op - ScaleBuffer(response.prescale_factor(), entries, fused_input_data, buffer_data, num_elements); + ScaleBuffer(response.prescale_factor(), entries, fused_input_data, + buffer_data, num_elements); fused_input_data = buffer_data; // for unfused, scale is done out of place } @@ -306,9 +343,8 @@ 
NCCLHierarchicalAllreduce::Execute(std::vector& entries, // non-divisible part (if any), do NCCL Reduce (at rank local_size-1), // MPI Allreduce (across rank (local_size-1)'s), and NCCL Bcast - int64_t num_elements_per_rank = process_set.controller->IsHomogeneous() - ? num_elements / local_size - : 0; + int64_t num_elements_per_rank = + process_set.controller->IsHomogeneous() ? num_elements / local_size : 0; size_t buffer_len_per_rank = element_size * num_elements_per_rank; @@ -339,28 +375,31 @@ NCCLHierarchicalAllreduce::Execute(std::vector& entries, auto& timeline = global_state_->timeline; if (num_elements_per_rank > 0) { - auto nccl_result = ncclReduceScatter(fused_input_data, - buffer_data_at_rank_offset, - (size_t) num_elements_per_rank, - GetNCCLDataType(first_entry.tensor), - ncclSum, *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream); - nccl_context_->ErrorCheck("ncclReduceScatter", nccl_result, *nccl_op_context_.nccl_comm_); + auto nccl_result = ncclReduceScatter( + fused_input_data, buffer_data_at_rank_offset, + (size_t)num_elements_per_rank, GetNCCLDataType(first_entry.tensor), + ncclSum, *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream); + nccl_context_->ErrorCheck("ncclReduceScatter", nccl_result, + *nccl_op_context_.nccl_comm_); if (global_state_->timeline.Initialized()) { - gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_REDUCESCATTER, *gpu_op_context_.stream); + gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_REDUCESCATTER, + *gpu_op_context_.stream); } } if (num_elements_remaining > 0) { // Reduce the remaining data at local_size-1 to append to // existing buffer - auto nccl_result = ncclReduce(fused_input_data_remainder, - buffer_data_remainder, - (size_t) num_elements_remaining, - GetNCCLDataType(first_entry.tensor), ncclSum, - root_rank, *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream); - nccl_context_->ErrorCheck("ncclReduce", nccl_result, *nccl_op_context_.nccl_comm_); + auto nccl_result = + ncclReduce(fused_input_data_remainder, buffer_data_remainder, + (size_t)num_elements_remaining, + GetNCCLDataType(first_entry.tensor), ncclSum, root_rank, + *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream); + nccl_context_->ErrorCheck("ncclReduce", nccl_result, + *nccl_op_context_.nccl_comm_); if (global_state_->timeline.Initialized()) { - gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_REDUCE, *gpu_op_context_.stream); + gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_REDUCE, + *gpu_op_context_.stream); } } @@ -370,61 +409,85 @@ NCCLHierarchicalAllreduce::Execute(std::vector& entries, gpu_op_context_.host_buffer = malloc(total_buffer_len); // Synchronize. 
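The synchronization block that follows waits on the recorded GPU events, stages this rank's shard in a host buffer, runs an in-place MPI_Allreduce across nodes, and copies the result back to the device. Below is a minimal sketch of that host-staged cross-node reduction, assuming a float shard and a hypothetical inter-node communicator named cross_comm; error checking is omitted for brevity.

    #include <cstdint>
    #include <cstdlib>
    #include <cuda_runtime.h>
    #include <mpi.h>

    // Copy the shard to the host, reduce it in place across nodes, copy it back.
    void CrossNodeAllreduce(float* device_shard, int64_t num_elements,
                            MPI_Comm cross_comm, cudaStream_t stream) {
      size_t bytes = static_cast<size_t>(num_elements) * sizeof(float);
      float* host_buffer = static_cast<float*>(std::malloc(bytes));

      cudaMemcpyAsync(host_buffer, device_shard, bytes, cudaMemcpyDeviceToHost, stream);
      cudaStreamSynchronize(stream);  // ensure the shard has landed on the host

      MPI_Allreduce(MPI_IN_PLACE, host_buffer, static_cast<int>(num_elements),
                    MPI_FLOAT, MPI_SUM, cross_comm);

      cudaMemcpyAsync(device_shard, host_buffer, bytes, cudaMemcpyHostToDevice, stream);
      cudaStreamSynchronize(stream);  // wait before freeing the staging buffer
      std::free(host_buffer);
    }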
- gpu_context_->WaitForEvents(gpu_op_context_.event_queue, entries, timeline, nccl_op_context_.error_check_callback_, - global_state_->elastic_enabled); + if (global_state_->elastic_enabled) { + gpu_context_->WaitForEventsElastic( + gpu_op_context_.event_queue, entries, timeline, + nccl_op_context_.error_check_callback_); + } else { + gpu_context_->WaitForEvents(gpu_op_context_.event_queue, entries, + timeline, + nccl_op_context_.error_check_callback_); + } // According to https://docs.nvidia.com/cuda/cuda-runtime-api/ // api-sync-behavior.html#api-sync-behavior__memcpy-async, // cudaMemcpyAsync is synchronous with respect to the host, so we // memcpy (effectively) synchronously to generate an accurate timeline timeline.ActivityStartAll(entries, MEMCPY_IN_HOST_BUFFER); - gpu_context_->MemcpyAsyncD2H(gpu_op_context_.host_buffer, buffer_data_at_rank_offset, - total_buffer_len, *gpu_op_context_.stream); + gpu_context_->MemcpyAsyncD2H(gpu_op_context_.host_buffer, + buffer_data_at_rank_offset, total_buffer_len, + *gpu_op_context_.stream); timeline.ActivityEndAll(entries); timeline.ActivityStartAll(entries, MPI_ALLREDUCE); int op = MPI_Allreduce(MPI_IN_PLACE, gpu_op_context_.host_buffer, - (int) total_num_elements, + (int)total_num_elements, mpi_context.GetMPIDataType(first_entry.tensor), mpi_context.GetMPISumOp(first_entry.tensor->dtype()), mpi_context.GetMPICommunicator(Communicator::CROSS)); if (op != MPI_SUCCESS) { - throw std::runtime_error("MPI_Allreduce failed, see MPI output for details."); + throw std::runtime_error( + "MPI_Allreduce failed, see MPI output for details."); } timeline.ActivityEndAll(entries); timeline.ActivityStartAll(entries, MEMCPY_OUT_HOST_BUFFER); - gpu_context_->MemcpyAsyncH2D(buffer_data_at_rank_offset, gpu_op_context_.host_buffer, - total_buffer_len, *gpu_op_context_.stream); + gpu_context_->MemcpyAsyncH2D(buffer_data_at_rank_offset, + gpu_op_context_.host_buffer, total_buffer_len, + *gpu_op_context_.stream); timeline.ActivityEndAll(entries); } if (num_elements_per_rank > 0) { - nccl_context_->ErrorCheck("ncclAllGather", - ncclAllGather(buffer_data_at_rank_offset, buffer_data, - (size_t) num_elements_per_rank, - GetNCCLDataType(first_entry.tensor), - *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream), - *nccl_op_context_.nccl_comm_); + nccl_context_->ErrorCheck( + "ncclAllGather", + ncclAllGather(buffer_data_at_rank_offset, buffer_data, + (size_t)num_elements_per_rank, + GetNCCLDataType(first_entry.tensor), + *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream), + *nccl_op_context_.nccl_comm_); if (global_state_->timeline.Initialized()) { - gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_ALLGATHER, *gpu_op_context_.stream); + gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_ALLGATHER, + *gpu_op_context_.stream); } } if (num_elements_remaining > 0) { - nccl_context_->ErrorCheck("ncclBcast", - ncclBcast(buffer_data_remainder, - (size_t) num_elements_remaining, - GetNCCLDataType(first_entry.tensor), root_rank, - *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream), - *nccl_op_context_.nccl_comm_); +#if NCCL_VERSION_CODE >= NCCL_VERSION(2, 2, 12) + nccl_context_->ErrorCheck( + "ncclBroadcast", + ncclBroadcast(buffer_data_remainder, buffer_data_remainder, + (size_t)num_elements_remaining, + GetNCCLDataType(first_entry.tensor), root_rank, + *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream), + *nccl_op_context_.nccl_comm_); +#else + nccl_context_->ErrorCheck( + "ncclBcast", + ncclBcast(buffer_data_remainder, 
(size_t)num_elements_remaining, + GetNCCLDataType(first_entry.tensor), root_rank, + *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream), + *nccl_op_context_.nccl_comm_); +#endif if (global_state_->timeline.Initialized()) { - gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_BCAST, *gpu_op_context_.stream); + gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_BCAST, + *gpu_op_context_.stream); } } if (response.postscale_factor() != 1.0) { // Execute postscaling op - ScaleBuffer(response.postscale_factor(), entries, buffer_data, buffer_data, num_elements); + ScaleBuffer(response.postscale_factor(), entries, buffer_data, buffer_data, + num_elements); } // Copy memory out of the fusion buffer. @@ -432,16 +495,20 @@ NCCLHierarchicalAllreduce::Execute(std::vector& entries, MemcpyOutFusionBuffer(buffer_data, entries); if (global_state_->timeline.Initialized()) { - gpu_context_->RecordEvent(gpu_op_context_.event_queue, MEMCPY_OUT_FUSION_BUFFER, *gpu_op_context_.stream); + gpu_context_->RecordEvent(gpu_op_context_.event_queue, + MEMCPY_OUT_FUSION_BUFFER, + *gpu_op_context_.stream); } } - return gpu_op_context_.FinalizeGPUQueue(entries, true, nccl_op_context_.error_check_callback_); + return gpu_op_context_.FinalizeGPUQueue( + entries, true, nccl_op_context_.error_check_callback_); } -bool NCCLHierarchicalAllreduce::Enabled(const ParameterManager& param_manager, - const std::vector& entries, - const Response& response) const { +bool NCCLHierarchicalAllreduce::Enabled( + const ParameterManager& param_manager, + const std::vector& entries, + const Response& response) const { if (!NCCLAllreduce::Enabled(param_manager, entries, response)) { return false; } @@ -480,25 +547,39 @@ Status NCCLBroadcast::Execute(std::vector& entries, // On root rank, ncclbcast sends data, on other ranks it receives data. void* data_ptr; if (process_set.controller->GetRank() == e.root_rank) { - data_ptr = (void*) e.tensor->data(); + data_ptr = (void*)e.tensor->data(); } else { - data_ptr = (void*) e.output->data(); - } - - // We only use 'ncclChar' for this operation because the type format does not matter for a - // broadcast, only the size of the data. + data_ptr = (void*)e.output->data(); + } + + // We only use 'ncclChar' for this operation because the type format does not + // matter for a broadcast, only the size of the data. 
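The gate that follows picks ncclBroadcast (separate send/receive pointers, available from NCCL 2.2.12) over the older in-place ncclBcast. The sketch below isolates that version gate; it assumes the NCCL_VERSION_CODE / NCCL_VERSION macros are available from nccl.h or the build's compatibility headers, as they are in this patch, and sends raw bytes as ncclChar since only the size matters for a broadcast.

    #include <cuda_runtime.h>
    #include <nccl.h>

    // Broadcast num_bytes from root to all ranks, using whichever API the
    // linked NCCL release provides.
    ncclResult_t BroadcastBytes(void* buffer, size_t num_bytes, int root,
                                ncclComm_t comm, cudaStream_t stream) {
    #if NCCL_VERSION_CODE >= NCCL_VERSION(2, 2, 12)
      return ncclBroadcast(buffer, buffer, num_bytes, ncclChar, root, comm, stream);
    #else
      return ncclBcast(buffer, num_bytes, ncclChar, root, comm, stream);
    #endif
    }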
+#if NCCL_VERSION_CODE >= NCCL_VERSION(2, 2, 12) + nccl_context_->ErrorCheck("ncclBroadcast", + ncclBroadcast(data_ptr, data_ptr, + e.tensor->shape().num_elements() * + DataType_Size(e.tensor->dtype()), + ncclChar, e.root_rank, + *nccl_op_context_.nccl_comm_, + *gpu_op_context_.stream), + *nccl_op_context_.nccl_comm_); +#else nccl_context_->ErrorCheck("ncclBcast", ncclBcast(data_ptr, e.tensor->shape().num_elements() * - DataType_Size(e.tensor->dtype()), + DataType_Size(e.tensor->dtype()), ncclChar, e.root_rank, - *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream), + *nccl_op_context_.nccl_comm_, + *gpu_op_context_.stream), *nccl_op_context_.nccl_comm_); +#endif if (global_state_->timeline.Initialized()) { - gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_BCAST, *gpu_op_context_.stream); + gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_BCAST, + *gpu_op_context_.stream); } - return gpu_op_context_.FinalizeGPUQueue(entries, true, nccl_op_context_.error_check_callback_); + return gpu_op_context_.FinalizeGPUQueue( + entries, true, nccl_op_context_.error_check_callback_); } void NCCLAllgather::WaitForData(std::vector& entries) { @@ -518,7 +599,7 @@ void NCCLAllgather::WaitForData(std::vector& entries) { } Status NCCLAllgather::Execute(std::vector& entries, - const Response& response) { + const Response& response) { assert(!entries.empty()); auto& first_entry = entries[0]; auto& process_set = @@ -531,11 +612,11 @@ Status NCCLAllgather::Execute(std::vector& entries, WaitForData(entries); // Sizes of subcomponents of each entry from all ranks - auto** entry_component_sizes = new int64_t* [entries.size()]; + auto** entry_component_sizes = new int64_t*[entries.size()]; // Offset of each subcomponent of every entry in the final buffer after // allgatherv - auto** entry_component_offsets = new int64_t* [entries.size()]; + auto** entry_component_offsets = new int64_t*[entries.size()]; int global_size = process_set.controller->GetSize(); int global_rank = process_set.controller->GetRank(); @@ -548,12 +629,13 @@ Status NCCLAllgather::Execute(std::vector& entries, } global_state_->timeline.ActivityStartAll(entries, ALLOCATE_OUTPUT); - Status status = AllocateOutput(entries, response, entry_component_sizes, recvcounts); + Status status = + AllocateOutput(entries, response, entry_component_sizes, recvcounts); if (!status.ok()) { for (size_t ec = 0; ec < entries.size(); ++ec) { delete[] entry_component_sizes[ec]; delete[] entry_component_offsets[ec]; - } + } delete[] entry_component_sizes; delete[] entry_component_offsets; delete[] recvcounts; @@ -563,7 +645,8 @@ Status NCCLAllgather::Execute(std::vector& entries, global_state_->timeline.ActivityEndAll(entries); SetDisplacements(recvcounts, displcmnts, global_size); - SetEntryComponentOffsets(entries, entry_component_sizes, recvcounts, entry_component_offsets); + SetEntryComponentOffsets(entries, entry_component_sizes, recvcounts, + entry_component_offsets); size_t element_size = DataType_Size(first_entry.tensor->dtype()); @@ -573,21 +656,25 @@ Status NCCLAllgather::Execute(std::vector& entries, // Copy memory into the fusion buffer. 
if (entries.size() > 1) { MemcpyInFusionBuffer(entries, displcmnts, element_size, buffer_data); - fused_input_data = (uint8_t*)buffer_data + displcmnts[global_rank] * element_size; + fused_input_data = + (uint8_t*)buffer_data + displcmnts[global_rank] * element_size; if (global_state_->timeline.Initialized()) { - gpu_context_->RecordEvent(gpu_op_context_.event_queue, MEMCPY_IN_FUSION_BUFFER, *gpu_op_context_.stream); + gpu_context_->RecordEvent(gpu_op_context_.event_queue, + MEMCPY_IN_FUSION_BUFFER, + *gpu_op_context_.stream); } } else { fused_input_data = first_entry.tensor->data(); - buffer_data = (void*) first_entry.output->data(); + buffer_data = (void*)first_entry.output->data(); } bool same_shape = true; const auto& tensor_sizes = response.tensor_sizes(); for (size_t ec = 0; ec < entries.size(); ++ec) { for (int rc = 1; rc < global_size; ++rc) { - if (tensor_sizes[ec * global_size + rc] != tensor_sizes[ec * global_size]) { + if (tensor_sizes[ec * global_size + rc] != + tensor_sizes[ec * global_size]) { same_shape = false; break; } @@ -599,30 +686,35 @@ Status NCCLAllgather::Execute(std::vector& entries, // Do allgather. if (same_shape) { - auto nccl_result = ncclAllGather(fused_input_data, buffer_data, - recvcounts[0] * element_size, - ncclChar, - *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream); + auto nccl_result = ncclAllGather( + fused_input_data, buffer_data, recvcounts[0] * element_size, ncclChar, + *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream); - nccl_context_->ErrorCheck("ncclAllGather", nccl_result, *nccl_op_context_.nccl_comm_); + nccl_context_->ErrorCheck("ncclAllGather", nccl_result, + *nccl_op_context_.nccl_comm_); if (global_state_->timeline.Initialized()) { - gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_ALLGATHER, *gpu_op_context_.stream); + gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_ALLGATHER, + *gpu_op_context_.stream); } } else { - nccl_context_->ErrorCheck("ncclGroupStart", ncclGroupStart(), *nccl_op_context_.nccl_comm_); + nccl_context_->ErrorCheck("ncclGroupStart", ncclGroupStart(), + *nccl_op_context_.nccl_comm_); for (int rc = 0; rc < global_size; ++rc) { - void* new_buffer_data = (uint8_t*)buffer_data + displcmnts[rc] * element_size; - auto nccl_result = ncclBroadcast(fused_input_data, new_buffer_data, - recvcounts[rc] * element_size, - ncclChar, rc, - *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream); - nccl_context_->ErrorCheck("ncclBroadcast", nccl_result, *nccl_op_context_.nccl_comm_); - } - nccl_context_->ErrorCheck("ncclGroupEnd", ncclGroupEnd(), *nccl_op_context_.nccl_comm_); + void* new_buffer_data = + (uint8_t*)buffer_data + displcmnts[rc] * element_size; + auto nccl_result = ncclBroadcast( + fused_input_data, new_buffer_data, recvcounts[rc] * element_size, + ncclChar, rc, *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream); + nccl_context_->ErrorCheck("ncclBroadcast", nccl_result, + *nccl_op_context_.nccl_comm_); + } + nccl_context_->ErrorCheck("ncclGroupEnd", ncclGroupEnd(), + *nccl_op_context_.nccl_comm_); if (global_state_->timeline.Initialized()) { - gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_BCAST, *gpu_op_context_.stream); + gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_BCAST, + *gpu_op_context_.stream); } } @@ -632,7 +724,9 @@ Status NCCLAllgather::Execute(std::vector& entries, buffer_data, element_size, entries); if (global_state_->timeline.Initialized()) { - gpu_context_->RecordEvent(gpu_op_context_.event_queue, MEMCPY_OUT_FUSION_BUFFER, 
*gpu_op_context_.stream); + gpu_context_->RecordEvent(gpu_op_context_.event_queue, + MEMCPY_OUT_FUSION_BUFFER, + *gpu_op_context_.stream); } } @@ -646,12 +740,13 @@ Status NCCLAllgather::Execute(std::vector& entries, delete[] entry_component_sizes; delete[] entry_component_offsets; - return gpu_op_context_.FinalizeGPUQueue(entries, true, nccl_op_context_.error_check_callback_); + return gpu_op_context_.FinalizeGPUQueue( + entries, true, nccl_op_context_.error_check_callback_); } bool NCCLAllgather::Enabled(const ParameterManager& param_manager, - const std::vector& entries, - const Response& response) const { + const std::vector& entries, + const Response& response) const { return entries[0].device != CPU_DEVICE_ID; } @@ -686,43 +781,58 @@ Status NCCLAlltoall::Execute(std::vector& entries, std::vector sdispls, rdispls; std::vector sendcounts, recvcounts; - Status status = PrepareOutputAndParams(e, sdispls, rdispls, sendcounts, recvcounts); + Status status = + PrepareOutputAndParams(e, sdispls, rdispls, sendcounts, recvcounts); if (!status.ok()) { return status; } auto world_size = process_set.controller->GetSize(); - nccl_context_->ErrorCheck("ncclGroupStart", ncclGroupStart(), *nccl_op_context_.nccl_comm_); + nccl_context_->ErrorCheck("ncclGroupStart", ncclGroupStart(), + *nccl_op_context_.nccl_comm_); for (int i = 0; i < world_size; ++i) { if (recvcounts[i] > 0) { - auto nccl_result = ncclRecv((uint8_t*) e.output->data() + rdispls[i] * DataType_Size(e.tensor->dtype()), - recvcounts[i] * DataType_Size(e.tensor->dtype()), ncclChar, i, - *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream); - nccl_context_->ErrorCheck("ncclRecv", nccl_result, *nccl_op_context_.nccl_comm_); + auto nccl_result = + ncclRecv((uint8_t*)e.output->data() + + rdispls[i] * DataType_Size(e.tensor->dtype()), + recvcounts[i] * DataType_Size(e.tensor->dtype()), ncclChar, + i, *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream); + nccl_context_->ErrorCheck("ncclRecv", nccl_result, + *nccl_op_context_.nccl_comm_); } if (sendcounts[i] > 0) { - auto nccl_result = ncclSend((uint8_t*) e.tensor->data() + sdispls[i] * DataType_Size(e.tensor->dtype()), - sendcounts[i] * DataType_Size(e.tensor->dtype()), ncclChar, i, - *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream); - nccl_context_->ErrorCheck("ncclSend", nccl_result, *nccl_op_context_.nccl_comm_); + auto nccl_result = + ncclSend((uint8_t*)e.tensor->data() + + sdispls[i] * DataType_Size(e.tensor->dtype()), + sendcounts[i] * DataType_Size(e.tensor->dtype()), ncclChar, + i, *nccl_op_context_.nccl_comm_, *gpu_op_context_.stream); + nccl_context_->ErrorCheck("ncclSend", nccl_result, + *nccl_op_context_.nccl_comm_); } } - nccl_context_->ErrorCheck("ncclGroupEnd", ncclGroupEnd(), *nccl_op_context_.nccl_comm_); + nccl_context_->ErrorCheck("ncclGroupEnd", ncclGroupEnd(), + *nccl_op_context_.nccl_comm_); if (global_state_->timeline.Initialized()) { - gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_ALLTOALL, *gpu_op_context_.stream); + gpu_context_->RecordEvent(gpu_op_context_.event_queue, NCCL_ALLTOALL, + *gpu_op_context_.stream); } return gpu_op_context_.FinalizeGPUQueue(entries); #else - throw std::runtime_error("NCCLAlltoall requires NCCL version >= 2.7.0. If your NCCL installation cannot be updated " - "and you installed with HOROVOD_GPU_OPERATIONS=NCCL, reinstall with only supported " - "operations individually specified (i.e. HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL " - "HOROVOD_GPU_ALLGATHER=NCCL). 
Otherwise, exclude HOROVOD_GPU_ALLTOALL=NCCL from your " - "installation command."); + throw std::runtime_error( + "NCCLAlltoall requires NCCL version >= 2.7.0. If your NCCL installation " + "cannot be updated " + "and you installed with HOROVOD_GPU_OPERATIONS=NCCL, reinstall with only " + "supported " + "operations individually specified (i.e. HOROVOD_GPU_ALLREDUCE=NCCL " + "HOROVOD_GPU_BROADCAST=NCCL " + "HOROVOD_GPU_ALLGATHER=NCCL). Otherwise, exclude " + "HOROVOD_GPU_ALLTOALL=NCCL from your " + "installation command."); #endif } diff --git a/horovod/common/ops/nccl_operations.h b/horovod/common/ops/nccl_operations.h index 960b216bf0..1727cbba2f 100644 --- a/horovod/common/ops/nccl_operations.h +++ b/horovod/common/ops/nccl_operations.h @@ -27,8 +27,8 @@ #include #endif -#include "gpu_operations.h" #include "../hashes.h" +#include "gpu_operations.h" #include @@ -43,9 +43,12 @@ struct NCCLContext { std::unordered_map>, ncclComm_t>> nccl_comms; - void ErrorCheck(std::string op_name, ncclResult_t nccl_result, ncclComm_t& nccl_comm); + void ErrorCheck(std::string op_name, ncclResult_t nccl_result, + ncclComm_t& nccl_comm); void ShutDown(); + + bool elastic; }; class NCCLOpContext { @@ -54,8 +57,7 @@ class NCCLOpContext { Communicator communicator_type) : nccl_comm_(nullptr), error_check_callback_(std::bind(&NCCLOpContext::AsyncErrorCheck, this)), - nccl_context_(nccl_context), - global_state_(global_state), + nccl_context_(nccl_context), global_state_(global_state), communicator_type_(communicator_type){}; void InitNCCLComm(const std::vector& entries, @@ -81,8 +83,7 @@ class NCCLAllreduce : public GPUAllreduce { NCCLAllreduce(NCCLContext* nccl_context, GPUContext* gpu_context, HorovodGlobalState* global_state, Communicator communicator_type = Communicator::GLOBAL) - : GPUAllreduce(gpu_context, global_state), - nccl_context_(nccl_context), + : GPUAllreduce(gpu_context, global_state), nccl_context_(nccl_context), nccl_op_context_(nccl_context, global_state, communicator_type), global_state_(global_state){}; @@ -101,8 +102,7 @@ class NCCLBroadcast : public GPUBroadcast { public: NCCLBroadcast(NCCLContext* nccl_context, GPUContext* gpu_context, HorovodGlobalState* global_state) - : GPUBroadcast(gpu_context, global_state), - nccl_context_(nccl_context), + : GPUBroadcast(gpu_context, global_state), nccl_context_(nccl_context), nccl_op_context_(nccl_context, global_state, Communicator::GLOBAL), global_state_(global_state){}; @@ -121,8 +121,7 @@ class NCCLAlltoall : public GPUAlltoall { public: NCCLAlltoall(NCCLContext* nccl_context, GPUContext* gpu_context, HorovodGlobalState* global_state) - : GPUAlltoall(gpu_context, global_state), - nccl_context_(nccl_context), + : GPUAlltoall(gpu_context, global_state), nccl_context_(nccl_context), nccl_op_context_(nccl_context, global_state, Communicator::GLOBAL), global_state_(global_state){}; @@ -143,7 +142,7 @@ class NCCLHierarchicalAllreduce : public NCCLAllreduce { NCCLHierarchicalAllreduce(NCCLContext* nccl_context, GPUContext* gpu_context, HorovodGlobalState* global_state) : NCCLAllreduce(nccl_context, gpu_context, global_state, - Communicator::LOCAL) {}; + Communicator::LOCAL){}; Status Execute(std::vector& entries, const Response& response) override; @@ -183,7 +182,6 @@ class NCCLAllgather : public GPUAllgather { HorovodGlobalState* global_state_; }; - } // namespace common } // namespace horovod diff --git a/horovod/common/ops/operation_manager.cc b/horovod/common/ops/operation_manager.cc index b9bfdcb0bc..3cea39370a 100644 --- 
a/horovod/common/ops/operation_manager.cc +++ b/horovod/common/ops/operation_manager.cc @@ -27,6 +27,7 @@ OperationManager::OperationManager(ParameterManager* param_manager, std::vector> alltoall_ops, std::shared_ptr join_op, std::vector> adasum_ops, + std::shared_ptr barrier_op, std::shared_ptr error_op) : param_manager_(param_manager), allreduce_ops_(std::move(allreduce_ops)), @@ -35,6 +36,7 @@ OperationManager::OperationManager(ParameterManager* param_manager, alltoall_ops_(std::move(alltoall_ops)), join_op_(std::move(join_op)), adasum_ops_(std::move(adasum_ops)), + barrier_op_(std::move(barrier_op)), error_op_(std::move(error_op)) {} Status OperationManager::ExecuteAllreduce(std::vector& entries, @@ -93,6 +95,11 @@ Status OperationManager::ExecuteAdasum(std::vector& entries, throw std::logic_error("No Adasum operation enabled"); } +Status OperationManager::ExecuteBarrier(std::vector& entries, + const Response& response) const { + return barrier_op_->Execute(entries, response); +} + Status OperationManager::ExecuteError(std::vector& entries, const Response& response) const { return error_op_->Execute(entries, response); @@ -113,6 +120,8 @@ Status OperationManager::ExecuteOperation(std::vector& entries return ExecuteJoin(entries, response, process_set); } else if (response.response_type() == Response::ADASUM) { return ExecuteAdasum(entries, response); + } else if (response.response_type() == Response::BARRIER) { + return ExecuteBarrier(entries, response); } else if (response.response_type() == Response::ERROR) { return ExecuteError(entries, response); } else { diff --git a/horovod/common/ops/operation_manager.h b/horovod/common/ops/operation_manager.h index 9d8433bbeb..555f6abc26 100644 --- a/horovod/common/ops/operation_manager.h +++ b/horovod/common/ops/operation_manager.h @@ -33,6 +33,7 @@ class OperationManager { std::vector> alltoall_ops, std::shared_ptr join_op, std::vector> adasum_ops, + std::shared_ptr barrier_op, std::shared_ptr error_op); virtual ~OperationManager() = default; @@ -52,6 +53,8 @@ class OperationManager { Status ExecuteAdasum(std::vector& entries, const Response& response) const; + Status ExecuteBarrier(std::vector& entries, const Response& response) const; + Status ExecuteOperation(std::vector& entries, const Response& response, ProcessSet& process_set) const; @@ -65,6 +68,7 @@ class OperationManager { std::vector> alltoall_ops_; std::shared_ptr join_op_; std::vector> adasum_ops_; + std::shared_ptr barrier_op_; std::shared_ptr error_op_; }; diff --git a/horovod/common/response_cache.cc b/horovod/common/response_cache.cc index 0649c65fab..b3029cf966 100644 --- a/horovod/common/response_cache.cc +++ b/horovod/common/response_cache.cc @@ -39,6 +39,8 @@ static Response::ResponseType RequestTypeToResponseType(Request::RequestType val return Response::ResponseType::ADASUM; case Request::RequestType::ALLTOALL: return Response::ResponseType::ALLTOALL; + case Request::RequestType::BARRIER: + return Response::ResponseType::BARRIER; default: throw std::logic_error("No corresponding ResponseType for provided RequestType."); } diff --git a/horovod/common/tensor_queue.cc b/horovod/common/tensor_queue.cc index 8428e90579..9d933f5b6c 100644 --- a/horovod/common/tensor_queue.cc +++ b/horovod/common/tensor_queue.cc @@ -109,17 +109,18 @@ void TensorQueue::GetTensorEntriesFromResponse( response.response_type() == Response::BROADCAST || response.response_type() == Response::ALLTOALL || response.response_type() == Response::ADASUM || + response.response_type() == Response::BARRIER || 
response.response_type() == Response::ERROR); if (!joined) { // We should never fail at finding this key in the tensor table. auto iter = tensor_table_.find(name); assert(iter != tensor_table_.end()); - entries.push_back(std::move(iter->second)); // Clear the tensor table of this tensor. tensor_table_.erase(iter); + } else if (response.response_type() != Response::ERROR) { // Find Join tensor to use its context. @@ -152,6 +153,12 @@ TensorQueue::GetTensorEntry(const std::string& tensor_name) const{ return iter; } +bool TensorQueue::IsTensorPresentInTable(const std::string& tensor_name) const { + // Lock on the tensor table. + std::lock_guard guard(mutex_); + return tensor_table_.find(tensor_name) != tensor_table_.end(); +} + // Pop out all the messages from the queue void TensorQueue::PopMessagesFromQueue( std::deque& message_queue_buffer) { diff --git a/horovod/common/tensor_queue.h b/horovod/common/tensor_queue.h index d66de52569..1ad7c5eca7 100644 --- a/horovod/common/tensor_queue.h +++ b/horovod/common/tensor_queue.h @@ -43,6 +43,8 @@ class TensorQueue { const TensorTableEntry& GetTensorEntry(const std::string& tensor_name) const; + bool IsTensorPresentInTable (const std::string& tensor_name) const; + void PopMessagesFromQueue(std::deque& message_queue_buffer); void PushMessageToQueue(Request& message); diff --git a/horovod/common/wire/message.fbs b/horovod/common/wire/message.fbs index 45c9a044a9..7eac4da6f2 100644 --- a/horovod/common/wire/message.fbs +++ b/horovod/common/wire/message.fbs @@ -38,7 +38,8 @@ enum RequestType:byte { ALLREDUCE = 0, ALLGATHER = 1, BROADCAST = 2, - JOIN = 3 + JOIN = 3, + BARRIER = 4 } table Request { // The request rank is necessary to create a consistent ordering of results, @@ -81,7 +82,8 @@ enum ResponseType:byte { BROADCAST = 2, JOIN = 3, ADASUM = 4, - ERROR = 5 + BARRIER = 5, + ERROR = 6 } table Response { response_type:ResponseType; diff --git a/horovod/common/wire/message_generated.h b/horovod/common/wire/message_generated.h index ddc23ff28b..2909d6bb18 100644 --- a/horovod/common/wire/message_generated.h +++ b/horovod/common/wire/message_generated.h @@ -97,33 +97,36 @@ enum RequestType : int8_t { RequestType_ALLGATHER = 1, RequestType_BROADCAST = 2, RequestType_JOIN = 3, + RequestType_BARRIER = 4, RequestType_MIN = RequestType_ALLREDUCE, - RequestType_MAX = RequestType_JOIN + RequestType_MAX = RequestType_BARRIER }; -inline const RequestType (&EnumValuesRequestType())[4] { +inline const RequestType (&EnumValuesRequestType())[5] { static const RequestType values[] = { RequestType_ALLREDUCE, RequestType_ALLGATHER, RequestType_BROADCAST, - RequestType_JOIN + RequestType_JOIN, + RequestType_BARRIER }; return values; } inline const char * const *EnumNamesRequestType() { - static const char * const names[5] = { + static const char * const names[6] = { "ALLREDUCE", "ALLGATHER", "BROADCAST", "JOIN", + "BARRIER", nullptr }; return names; } inline const char *EnumNameRequestType(RequestType e) { - if (flatbuffers::IsOutRange(e, RequestType_ALLREDUCE, RequestType_JOIN)) return ""; + if (flatbuffers::IsOutRange(e, RequestType_ALLREDUCE, RequestType_BARRIER)) return ""; const size_t index = static_cast(e); return EnumNamesRequestType()[index]; } @@ -134,30 +137,33 @@ enum ResponseType : int8_t { ResponseType_BROADCAST = 2, ResponseType_JOIN = 3, ResponseType_ADASUM = 4, - ResponseType_ERROR = 5, + ResponseType_BARRIER = 5, + ResponseType_ERROR = 6, ResponseType_MIN = ResponseType_ALLREDUCE, ResponseType_MAX = ResponseType_ERROR }; -inline const ResponseType 
(&EnumValuesResponseType())[6] { +inline const ResponseType (&EnumValuesResponseType())[7] { static const ResponseType values[] = { ResponseType_ALLREDUCE, ResponseType_ALLGATHER, ResponseType_BROADCAST, ResponseType_JOIN, ResponseType_ADASUM, + ResponseType_BARRIER, ResponseType_ERROR }; return values; } inline const char * const *EnumNamesResponseType() { - static const char * const names[7] = { + static const char * const names[8] = { "ALLREDUCE", "ALLGATHER", "BROADCAST", "JOIN", "ADASUM", + "BARRIER", "ERROR", nullptr }; diff --git a/horovod/data/data_loader_base.py b/horovod/data/data_loader_base.py index 6fd0f5f255..b5f3f3f480 100644 --- a/horovod/data/data_loader_base.py +++ b/horovod/data/data_loader_base.py @@ -55,11 +55,12 @@ class AsyncDataLoaderMixin(object): class PytorchAsyncDataLoader(AsyncDataLoaderMixin, PytorchDataLoader): """ - def __init__(self, async_loader_queue_size=64, *args, **kwargs): + def __init__(self, async_loader_queue_size=64, debug_data_loader=False, *args, **kwargs): """ initialize the async data loader. Need to add this in the __init__() of the implementation """ self.async_loader_queue_size = async_loader_queue_size + self.debug_data_loader = debug_data_loader super().__init__(*args, **kwargs) print(f"Apply the AsyncDataLoaderMixin on top of the data loader, async_loader_queue_size={async_loader_queue_size}. ") @@ -75,28 +76,53 @@ def close_async_loader(self): """ Close the async data loader. """ - print("Closing the AsyncDataLoaderMixin.") + print(f"close_async_loader[{self.async_loader_queue_size}], Closing the AsyncDataLoaderMixin.") if self.async_loader_queue_size > 0 and self.started: self.finished_event.set() + c = 0 while True: try: # Drain buffer self.queue.get_nowait() + if self.debug_data_loader: + print(f"close_async_loader[{self.async_loader_queue_size}], discarded batch #{c} from Queue.") + + c += 1 + + # Force break out if hanging. We assume it is hanging once we have already popped more items than twice the size of the queue. + if c > max(2 * self.async_loader_queue_size, 200): + print(f"close_async_loader: ERROR!!! Force break out after {c} times get_nowait.") + break except Empty: break + if self.debug_data_loader: + print(f"close_async_loader[{self.async_loader_queue_size}], joining...") + self.thread.join() + print(f"close_async_loader[{self.async_loader_queue_size}] is closed.") + def _async_worker(self): """ Start worker thread to load data asynchronously. The user needs to implement self._iterate() to read the data. """ try: + c = 0 while not self.finished_event.is_set(): for batch in self._iterate(): if self.finished_event.is_set(): break self.queue.put(batch) + + if self.debug_data_loader: + print(f"_async_worker[{self.async_loader_queue_size}], push batch #{c}.") + c += 1 + + if self.debug_data_loader: + print(f"_async_worker[{self.async_loader_queue_size}], finish reading at #{c}, reset debugging counter, append None to queue.") + c = 0 + self.queue.put(None) except Exception as ex: self.queue.put(ex) @@ -104,6 +130,8 @@ def _async_worker(self): finally: self.queue.put(None) + print(f"_async_worker[{self.async_loader_queue_size}], stopped") + def __iter__(self): """ Override the __iter__() to iterate data asynchronously to produce batches.
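The close_async_loader() hunk above drains the queue with a hard cap on get_nowait() calls so shutdown cannot hang, and _async_worker() marks the end of each pass with a None sentinel. A minimal, self-contained sketch of that same producer / sentinel / bounded-drain pattern, with illustrative names only (TinyAsyncLoader is not part of Horovod or of this patch):

import threading
from queue import Queue, Empty

class TinyAsyncLoader:
    """Illustrative only: a background thread prefetches batches into a bounded queue."""
    def __init__(self, batches, queue_size=4):
        self.queue = Queue(queue_size)
        self.queue_size = queue_size
        self.finished = threading.Event()
        self.thread = threading.Thread(target=self._worker, args=(list(batches),), daemon=True)
        self.thread.start()

    def _worker(self, batches):
        for b in batches:
            if self.finished.is_set():
                break
            self.queue.put(b)
        self.queue.put(None)  # sentinel: no more batches in this pass

    def __iter__(self):
        while True:
            b = self.queue.get()
            if b is None:
                break
            yield b

    def close(self):
        self.finished.set()
        drained = 0
        while True:
            try:
                self.queue.get_nowait()  # drain leftover batches
                drained += 1
                if drained > max(2 * self.queue_size, 200):
                    break  # force break out instead of hanging on a stuck queue
            except Empty:
                break
        self.thread.join()

loader = TinyAsyncLoader([1, 2, 3])
assert list(loader) == [1, 2, 3]
loader.close()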
@@ -115,12 +143,22 @@ def __iter__(self): if not self.started: self.started = True self.thread.start() + + c = 0 while True: batch = self.queue.get() + + if self.debug_data_loader: + print(f"__iter__[{self.async_loader_queue_size}], get batch #{c}.") + c += 1 + if batch is None: + if self.debug_data_loader: + print(f"__iter__[{self.async_loader_queue_size}], get None from queue at #{c}.") break if isinstance(batch, Exception): raise batch + yield self._process_batch(batch) else: for batch in self._iterate(): diff --git a/horovod/mxnet/mpi_ops.cc b/horovod/mxnet/mpi_ops.cc index 3a778bac03..69a525c582 100644 --- a/horovod/mxnet/mpi_ops.cc +++ b/horovod/mxnet/mpi_ops.cc @@ -20,6 +20,10 @@ #include "cuda_util.h" #include "mpi_ops.h" +#if MXNET_MAJOR >= 2 || MXNET_ASYNC_GPU_ENGINE_SUPPORTED +#define MXNET_ASYNC_GPU_ENGINE_SUPPORTED 1 +#endif + namespace horovod { namespace mxnet { @@ -72,7 +76,66 @@ bool IsTensorOnCPU(NDArray* tensor) { return tensor->ctx().dev_mask() == cpu::kDevMask; } +#if HAVE_CUDA +class MXReadyEvent : public common::ReadyEvent { +public: + MXReadyEvent(gpuEvent_t event) : event_(event) {}; + bool Ready() const override { + HVD_GPU_CHECK(gpuEventSynchronize(event_)); + return true; + }; + gpuEvent_t event() const override { + return event_; + } + +private: + gpuEvent_t event_; +}; +#endif + +ReadyEventList FormReadyEventList(NDArray* input, NDArray* output) { + ReadyEventList ready_event_list; + +#if HAVE_CUDA && MXNET_ASYNC_GPU_ENGINE_SUPPORTED + // Get events from input tensor writers + { + auto& sync_obj = input->var()->sync_object; + std::lock_guard l(sync_obj.mutex); + if (!sync_obj.writer_event.empty()) { + auto ev = sync_obj.writer_event[0].event.lock(); + if (ev) { + ready_event_list.AddReadyEvent(std::make_shared(*ev)); + } + } + } + + // Get events from output tensor reader and writers + { + auto& sync_obj = output->var()->sync_object; + std::lock_guard l(sync_obj.mutex); + for (auto& cuda_event : sync_obj.reader_events) { + auto ev = cuda_event.event.lock(); + if (ev) { + ready_event_list.AddReadyEvent(std::make_shared(*ev)); + } + } + if (!sync_obj.writer_event.empty()) { + auto ev = sync_obj.writer_event[0].event.lock(); + if (ev) { + ready_event_list.AddReadyEvent(std::make_shared(*ev)); + } + } + } +#endif + return ready_event_list; +} + +#if MXNET_ASYNC_GPU_ENGINE_SUPPORTED +void DoHorovodOperation(void*, void* on_start_ptr, void* on_complete_ptr, void* param) { + auto on_start = *static_cast(on_start_ptr); +#else void DoHorovodOperation(void*, void* on_complete_ptr, void* param) { +#endif ThrowIfError(common::CheckInitialized()); auto on_complete = *static_cast(on_complete_ptr); @@ -91,14 +154,17 @@ void DoHorovodOperation(void*, void* on_complete_ptr, void* param) { std::vector ready_event_lists; hvd_tensors.reserve(num_tensors); hvd_contexts.reserve(num_tensors); - ready_event_lists.resize(num_tensors); + ready_event_lists.reserve(num_tensors); callbacks.reserve(num_tensors); auto callback_mutex = std::make_shared(); for (int i = 0; i < num_tensors; ++i) { auto input_tensor = ops_param->input_tensors[i].get(); + auto output_tensor = ops_param->output_tensors[i].get(); auto output = ops_param->outputs[i]; + ready_event_lists.emplace_back(FormReadyEventList(input_tensor, output_tensor)); + hvd_tensors.emplace_back(std::make_shared(input_tensor)); if (TensorUtil::GetDevice(input_tensor) != device) { throw std::logic_error("Tensors in list must be on same device."); @@ -109,10 +175,56 @@ void DoHorovodOperation(void*, void* on_complete_ptr, void* param) { } 
hvd_contexts.push_back(ctx); callbacks.emplace_back([on_complete, ops_param, callback_mutex, i](const Status& status) { + auto input_tensor = ops_param->input_tensors[i].get(); + auto output_tensor = ops_param->output_tensors[i].get(); #if HAVE_CUDA auto hvd_event = status.event; if (hvd_event.event) { +#if MXNET_ASYNC_GPU_ENGINE_SUPPORTED + auto async_engine_enabled = dmlc::GetEnv("MXNET_ASYNC_GPU_ENGINE", false); + if (async_engine_enabled) { + { + auto &sync_obj = input_tensor->var()->sync_object; + std::lock_guard l(sync_obj.mutex); + // If some reader event is already recorded on the same stream, + // we want to replace ourselves by it + int i; + for (i = 0; i < sync_obj.reader_events.size(); ++i) { + auto stream = sync_obj.reader_events[i].stream; + if (stream == hvd_event.stream) { + sync_obj.reader_events[i].event = hvd_event.event; + sync_obj.reader_events[i].pool_index = hvd_event.event_idx; + break; + } + } + if (i == sync_obj.reader_events.size()) { + sync_obj.reader_events.push_back({hvd_event.event, hvd_event.stream, hvd_event.event_idx}); + } + } + + { + auto &sync_obj = output_tensor->var()->sync_object; + std::lock_guard l(sync_obj.mutex); + sync_obj.reader_events.clear(); + sync_obj.writer_event.clear(); + sync_obj.writer_event.push_back({hvd_event.event, hvd_event.stream, hvd_event.event_idx}); + } + + if (ops_param->received_splits_tensor) { + { + auto &sync_obj = ops_param->received_splits_tensor.get()->var()->sync_object; + std::lock_guard l(sync_obj.mutex); + sync_obj.reader_events.clear(); + sync_obj.writer_event.clear(); + sync_obj.writer_event.push_back({hvd_event.event, hvd_event.stream, hvd_event.event_idx}); + } + } + } else { + HVD_GPU_CHECK(gpuEventSynchronize(*(hvd_event.event))); + } +#else HVD_GPU_CHECK(gpuEventSynchronize(*(hvd_event.event))); +#endif } #endif @@ -163,6 +275,9 @@ void DoHorovodOperation(void*, void* on_complete_ptr, void* param) { break; case OperationType::ALLTOALL: { +#if MXNET_ASYNC_GPU_ENGINE_SUPPORTED + on_start(); // Need to call on_start to sync on possible D2H copy of splits tensor. 
+#endif auto hvd_splits = std::make_shared(ops_param->splits_tensor.get()); enqueue_result = EnqueueTensorAlltoall( hvd_contexts[0], hvd_tensors[0], hvd_splits, ready_event_lists[0], @@ -281,7 +396,12 @@ inline void PushHorovodOperation(OperationType op_type, NDArray* const * inputs, } } #if HAVE_CUDA +#if MXNET_ASYNC_GPU_ENGINE_SUPPORTED +void DoHorovodOperationCudaOnCPU(void*, void* on_start_ptr, void* on_complete_ptr, void* param) { + auto on_start = *static_cast(on_start_ptr); +#else void DoHorovodOperationCudaOnCPU(void*, void* on_complete_ptr, void* param) { +#endif ThrowIfError(common::CheckInitialized()); auto on_complete = *static_cast(on_complete_ptr); @@ -299,7 +419,7 @@ void DoHorovodOperationCudaOnCPU(void*, void* on_complete_ptr, void* param) { std::vector ready_event_lists; hvd_cpu_buffers.reserve(num_tensors); hvd_contexts.reserve(num_tensors); - ready_event_lists.resize(num_tensors); + ready_event_lists.reserve(num_tensors); callbacks.reserve(num_tensors); auto callback_mutex = std::make_shared(); @@ -307,6 +427,8 @@ void DoHorovodOperationCudaOnCPU(void*, void* on_complete_ptr, void* param) { auto input = ops_param->cpu_input_tensors[i].get(); auto output = ops_param->cpu_output_tensors[i].get(); + ready_event_lists.emplace_back(FormReadyEventList(input, output)); + hvd_cpu_buffers.emplace_back(std::make_shared(input)); if (TensorUtil::GetDevice(input) != device) { throw std::logic_error("Tensors in list must be on same device."); diff --git a/horovod/mxnet/mpi_ops.h b/horovod/mxnet/mpi_ops.h index 482f3592c7..97854c37ff 100644 --- a/horovod/mxnet/mpi_ops.h +++ b/horovod/mxnet/mpi_ops.h @@ -33,6 +33,9 @@ using namespace horovod::common; typedef ::mxnet::NDArray NDArray; typedef ::mxnet::Engine::CallbackOnComplete CallbackOnComplete; +#if MXNET_MAJOR >= 2 || MXNET_ASYNC_GPU_ENGINE_SUPPORTED +typedef ::mxnet::Engine::CallbackOnStart CallbackOnStart; +#endif typedef Request::RequestType OperationType; typedef std::shared_ptr MXTensorSharedPtr; typedef std::shared_ptr NDArraySharedPtr; diff --git a/horovod/mxnet/mpi_ops.py b/horovod/mxnet/mpi_ops.py index c4b6d9cd2b..6994a26556 100644 --- a/horovod/mxnet/mpi_ops.py +++ b/horovod/mxnet/mpi_ops.py @@ -39,7 +39,6 @@ check_installed_version('mxnet', mx.__version__) # import basic methods -init = _basics.init shutdown = _basics.shutdown is_initialized = _basics.is_initialized start_timeline = _basics.start_timeline @@ -61,6 +60,11 @@ cuda_built = _basics.cuda_built rocm_built = _basics.rocm_built +def init(*args, **kwargs): + _basics.init(*args, **kwargs) + # Call set up again to make sure the basics is in sync + _setup_process_sets(_basics) + dll_path = os.path.join(os.path.dirname(__file__), 'mpi_lib' + get_ext_suffix()) MPI_MXNET_LIB_CTYPES = ctypes.CDLL(dll_path, ctypes.RTLD_GLOBAL) diff --git a/horovod/ray/runner.py b/horovod/ray/runner.py index 882f4856ae..595e0361de 100644 --- a/horovod/ray/runner.py +++ b/horovod/ray/runner.py @@ -1,4 +1,5 @@ import ray +from ray.util.placement_group import get_current_placement_group import warnings from collections import defaultdict @@ -10,7 +11,7 @@ from horovod.runner.common.util import secret, timeout, hosts from horovod.runner.http.http_server import RendezvousServer from horovod.ray.utils import detect_nics, nics_to_env_var, map_blocking -from horovod.ray.strategy import ColocatedStrategy, PackStrategy +from horovod.ray.strategy import ColocatedStrategy, PGStrategy logger = logging.getLogger(__name__) @@ -146,6 +147,8 @@ class RayExecutor: ``num_workers``. 
Number of workers to be placed on each machine. Used to enforce equal number of workers on each machine. Only used in conjunction with `num_hosts`. + use_current_placement_group (bool): Whether to use the current + placement group instead of creating a new one. Defaults to True. """ @@ -187,6 +190,7 @@ def __init__( cpus_per_worker: int = 1, use_gpu: bool = False, gpus_per_worker: Optional[int] = None, + use_current_placement_group: bool = True, # Deprecated Args. num_slots: Optional[int] = None, cpus_per_slot: Optional[int] = None, @@ -214,12 +218,12 @@ def __init__( cpus_per_worker = cpus_per_slot gpus_per_worker = gpus_per_slot - if num_workers is None and num_hosts is None: - raise ValueError("Either `num_workers` or `num_hosts` must be " + if not (num_workers or num_hosts): + raise ValueError("One of `num_workers` or `num_hosts` must be " "set.") if num_workers and num_hosts: - raise ValueError("Both `num_workers` and `num_hosts` cannot be " + raise ValueError("Only one of `num_workers` and `num_hosts` must be " "set.") if gpus_per_worker and not use_gpu: @@ -238,6 +242,7 @@ def __init__( cpus_per_worker=cpus_per_worker, use_gpu=use_gpu, gpus_per_worker=gpus_per_worker, + use_current_placement_group=use_current_placement_group ) self._is_remote = False if ray.util.client.ray.is_connected(): @@ -365,7 +370,8 @@ def __init__(self, num_workers_per_host: int = 1, cpus_per_worker: int = 1, use_gpu: bool = False, - gpus_per_worker: Optional[int] = None): + gpus_per_worker: Optional[int] = None, + use_current_placement_group: bool = True): self.settings = settings self.num_workers = num_workers @@ -374,6 +380,7 @@ def __init__(self, self.cpus_per_worker = cpus_per_worker self.use_gpu = use_gpu self.gpus_per_worker = gpus_per_worker or 1 + self.use_current_placement_group = use_current_placement_group self.workers = [] self.strategy = None @@ -388,13 +395,22 @@ def _start_exec(worker): def _create_strategy(self): assert self.num_workers is None or self.num_hosts is None - if self.num_workers: - return PackStrategy( + use_pg = self.use_current_placement_group and get_current_placement_group() + if self.num_workers or use_pg: + if use_pg: + logger.info( + "Found an existing placement group, inheriting. " + "You can disable this behavior by setting " + "`use_current_placement_group=False`." 
+ ) + num_workers = self.num_workers or self.num_workers_per_host * self.num_hosts + return PGStrategy( settings=self.settings, - num_workers=self.num_workers, + num_workers=num_workers, use_gpu=self.use_gpu, cpus_per_worker=self.cpus_per_worker, - gpus_per_worker=self.gpus_per_worker) + gpus_per_worker=self.gpus_per_worker, + force_create_placement_group=not self.use_current_placement_group) else: return ColocatedStrategy( settings=self.settings, diff --git a/horovod/ray/strategy.py b/horovod/ray/strategy.py index 984c547f81..dc98516be2 100644 --- a/horovod/ray/strategy.py +++ b/horovod/ray/strategy.py @@ -3,6 +3,7 @@ from typing import Dict import ray +from ray.util.placement_group import get_current_placement_group from horovod.ray.utils import map_blocking from horovod.ray.worker import BaseHorovodWorker @@ -75,8 +76,8 @@ def __init__(self, *, settings, num_hosts: int, num_workers_per_host: int, self.gpus_per_worker = gpus_per_worker or 1 @property - def num_workers(self): - return self.num_hosts * self.num_workers_per_host + def num_workers(self) -> int: + return int(self.num_hosts * self.num_workers_per_host) def _resources_per_host(self): num_cpus = self.cpus_per_worker * self.num_workers_per_host @@ -135,20 +136,29 @@ def create_workers(self): return self.workers, self.get_node_workers(self.workers) -class PackStrategy(BaseStrategy): - """Packs workers together but does not guarantee balanced hosts.""" +class PGStrategy(BaseStrategy): + """Packs workers together but does not guarantee balanced hosts. + + Will use the current placement group if one is available. + """ def __init__(self, *, settings, num_workers, use_gpu, cpus_per_worker, - gpus_per_worker): + gpus_per_worker, placement_group=None, force_create_placement_group=False): self.settings = settings self._num_workers = num_workers self.cpus_per_worker = cpus_per_worker self.gpus_per_worker = gpus_per_worker or 1 self.use_gpu = use_gpu + if force_create_placement_group: + self.placement_group = None + else: + self.placement_group = placement_group or get_current_placement_group() + self._placement_group_bundles = self.placement_group.bundle_specs if self.placement_group else None + self._created_placement_group = False @property - def num_workers(self): - return self._num_workers + def num_workers(self) -> int: + return int(self._num_workers) def resources_per_worker(self): num_cpus = self.cpus_per_worker @@ -156,25 +166,27 @@ def resources_per_worker(self): return dict(CPU=num_cpus, GPU=num_gpus) def create_workers(self): - self.placement_group, bundles = create_placement_group( - resources_per_bundle=self.resources_per_worker(), - num_bundles=self.num_workers, - pg_strategy="PACK", - pg_timeout=self.settings.placement_group_timeout_s) + if not self.placement_group: + self.placement_group, _ = create_placement_group( + resources_per_bundle=self.resources_per_worker(), + num_bundles=self.num_workers, + pg_strategy="PACK", + pg_timeout=self.settings.placement_group_timeout_s) + self._created_placement_group = True # Placement group has started. Now create the workers. 
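The branch above only builds a fresh placement group when none is inherited; the decision relies on ray.util.placement_group.get_current_placement_group(), which returns a group only for tasks or actors scheduled inside one. A rough, hedged sketch of that check (standard Ray calls; the resource sizes and the inside_pg task are made-up illustrations, and the exact capture semantics depend on the Ray version):

import ray
from ray.util.placement_group import placement_group, get_current_placement_group

ray.init(num_cpus=2)
pg = placement_group([{"CPU": 1}] * 2, strategy="PACK")
ray.get(pg.ready())

@ray.remote(num_cpus=1)
def inside_pg():
    # When scheduled inside the placement group with child-task capture enabled,
    # this should return the group, which is what use_current_placement_group keys off.
    return get_current_placement_group() is not None

print(ray.get(inside_pg.options(
    placement_group=pg,
    placement_group_capture_child_tasks=True).remote()))  # expected: True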
self.workers = [] remote_cls = ray.remote(BaseHorovodWorker) - for bundle_index in range(len(bundles)): + for worker_index in range(self.num_workers): remote_cls_with_options = remote_cls.options( num_cpus=self.cpus_per_worker, num_gpus=self.gpus_per_worker * int(self.use_gpu), placement_group_capture_child_tasks=False, placement_group=self.placement_group, - placement_group_bundle_index=bundle_index) + placement_group_bundle_index=worker_index if self._created_placement_group else -1) worker = remote_cls_with_options.remote( - world_rank=bundle_index, world_size=self.num_workers) + world_rank=worker_index, world_size=self.num_workers) self.workers.append(worker) @@ -202,3 +214,10 @@ def create_workers(self): })) ray.get(futures) return self.workers, self.get_node_workers(self.workers) + + def shutdown(self): + if self._created_placement_group and self.placement_group: + ray.util.remove_placement_group(self.placement_group) + self.placement_group = None + + self.workers = [] diff --git a/horovod/runner/mpi_run.py b/horovod/runner/mpi_run.py index 7b5be2787d..5bcce0bb8c 100644 --- a/horovod/runner/mpi_run.py +++ b/horovod/runner/mpi_run.py @@ -153,7 +153,7 @@ def mpi_run(settings, nics, env, command, stdout=None, stderr=None): if mpi_impl_flags is None: raise Exception(_MPI_NOT_FOUND_ERROR_MSG) - impi = _IMPI_IMPL == mpi + impi_or_mpich = mpi in (_IMPI_IMPL, _MPICH_IMPL) ssh_args = [] if settings.ssh_port: @@ -164,27 +164,27 @@ def mpi_run(settings, nics, env, command, stdout=None, stderr=None): mpi_ssh_args = '' if ssh_args: joined_ssh_args = ' '.join(ssh_args) - mpi_ssh_args = f'-bootstrap=ssh -bootstrap-exec-args \"{joined_ssh_args}\"' if impi else f'-mca plm_rsh_args \"{joined_ssh_args}\"' + mpi_ssh_args = f'-bootstrap=ssh -bootstrap-exec-args \"{joined_ssh_args}\"' if impi_or_mpich else f'-mca plm_rsh_args \"{joined_ssh_args}\"' tcp_intf_arg = '-mca btl_tcp_if_include {nics}'.format( - nics=','.join(nics)) if nics and not impi else '' + nics=','.join(nics)) if nics and not impi_or_mpich else '' nccl_socket_intf_arg = '-{opt} NCCL_SOCKET_IFNAME={nics}'.format( - opt='genv' if impi else 'x', + opt='genv' if impi_or_mpich else 'x', nics=','.join(nics)) if nics else '' # On large cluster runs (e.g. Summit), we need extra settings to work around OpenMPI issues host_names, host_to_slots = hosts.parse_hosts_and_slots(settings.hosts) - if not impi and host_names and len(host_names) >= _LARGE_CLUSTER_THRESHOLD: + if not impi_or_mpich and host_names and len(host_names) >= _LARGE_CLUSTER_THRESHOLD: mpi_impl_flags.append('-mca plm_rsh_no_tree_spawn true') mpi_impl_flags.append('-mca plm_rsh_num_concurrent {}'.format(len(host_names))) # if user does not specify any hosts, mpirun by default uses local host. # There is no need to specify localhost. 
- hosts_arg = '-{opt} {hosts}'.format(opt='hosts' if impi else 'H', - hosts=','.join(host_names) if host_names and impi else settings.hosts) + hosts_arg = '-{opt} {hosts}'.format(opt='hosts' if impi_or_mpich else 'H', + hosts=','.join(host_names) if host_names and impi_or_mpich else settings.hosts) ppn_arg = ' ' - if host_to_slots and impi: + if host_to_slots and impi_or_mpich: ppn = host_to_slots[host_names[0]] for h_name in host_names[1:]: if ppn != host_to_slots[h_name]: @@ -192,19 +192,19 @@ def mpi_run(settings, nics, env, command, stdout=None, stderr=None): Use -machinefile for this purpose.''') ppn_arg = ' -ppn {} '.format(ppn) - if settings.prefix_output_with_timestamp and not impi: + if settings.prefix_output_with_timestamp and not impi_or_mpich: mpi_impl_flags.append('--timestamp-output') - binding_args = settings.binding_args if settings.binding_args and not impi else ' '.join(impl_binding_args) + binding_args = settings.binding_args if settings.binding_args and not impi_or_mpich else ' '.join(impl_binding_args) - basic_args = '-l' if impi else '--allow-run-as-root --tag-output' + basic_args = '-l' if impi_or_mpich else '--allow-run-as-root --tag-output' output = [] if settings.output_filename: - output.append('-outfile-pattern' if impi else '--output-filename') + output.append('-outfile-pattern' if impi_or_mpich else '--output-filename') output.append(settings.output_filename) - env_list = '' if impi else ' '.join( + env_list = '' if impi_or_mpich else ' '.join( '-x %s' % key for key in sorted(env.keys()) if env_util.is_exportable(key)) # Pass all the env variables to the mpirun command. diff --git a/horovod/spark/common/store.py b/horovod/spark/common/store.py index f493f48905..e39ac92992 100644 --- a/horovod/spark/common/store.py +++ b/horovod/spark/common/store.py @@ -29,6 +29,7 @@ import fsspec from fsspec.core import split_protocol +from fsspec.utils import update_storage_options from horovod.spark.common.util import is_databricks, host_hash @@ -338,8 +339,12 @@ def fs(self): #@staticmethod def _get_fs_and_protocol(self): + storage_options = self.storage_options or {} protocol, path = split_protocol(self.prefix_path) - fs = fsspec.filesystem(protocol, **self.storage_options) + cls = fsspec.get_filesystem_class(protocol) + options = cls._get_kwargs_from_urls(self.prefix_path) + update_storage_options(options, storage_options) + fs = cls(**options) return fs, protocol @classmethod diff --git a/horovod/spark/common/util.py b/horovod/spark/common/util.py index 504fd8e4b1..ae2a41a261 100644 --- a/horovod/spark/common/util.py +++ b/horovod/spark/common/util.py @@ -132,7 +132,7 @@ def data_type_to_numpy(dtype): elif dtype == IntegerType: return np.int32 elif dtype == StringType: - return np.uint8 + return np.str elif dtype == FloatType: return np.float32 elif dtype == BinaryType: diff --git a/horovod/spark/keras/estimator.py b/horovod/spark/keras/estimator.py index be9306aafc..b1ec936738 100644 --- a/horovod/spark/keras/estimator.py +++ b/horovod/spark/keras/estimator.py @@ -12,14 +12,9 @@ # See the License for the specific language governing permissions and # limitations under the License. 
# ============================================================================== - -import horovod.spark.common._namedtuple_fix - import numbers import time -from distutils.version import LooseVersion - import numpy as np import tensorflow as tf @@ -35,10 +30,7 @@ from horovod.spark.common.params import EstimatorParams from horovod.spark.common.serialization import HorovodParamsWriter, HorovodParamsReader from horovod.spark.keras import remote -from horovod.spark.keras.util import \ - BARE_KERAS, TF_KERAS, \ - BareKerasUtil, TFKerasUtil, \ - is_instance_of_bare_keras_model, is_instance_of_bare_keras_optimizer +from horovod.spark.keras.util import TFKerasUtil class KerasEstimatorParamsWriter(HorovodParamsWriter): @@ -75,16 +67,6 @@ def load_model_fn(x): # In order to deserialize the model, we need to deserialize the custom_objects param # first. - keras_utils = None - if KerasEstimator._keras_pkg_type.name in dict: - keras_pkg_type = _param_deserializer_fn(KerasEstimator._keras_pkg_type.name, - dict[KerasEstimator._keras_pkg_type.name], - None, None) - if keras_pkg_type == BARE_KERAS: - keras_utils = BareKerasUtil - elif keras_pkg_type == TF_KERAS: - keras_utils = TFKerasUtil - custom_objects = {} if KerasEstimator.custom_objects.name in dict: custom_objects = _param_deserializer_fn(KerasEstimator.custom_objects.name, @@ -92,7 +74,7 @@ def load_model_fn(x): None, None) for key, val in dict.items(): - dict[key] = _param_deserializer_fn(key, val, keras_utils, custom_objects) + dict[key] = _param_deserializer_fn(key, val, TFKerasUtil, custom_objects) return dict @@ -166,7 +148,6 @@ class KerasEstimator(HorovodEstimator, KerasEstimatorParamsReadable, """ custom_objects = Param(Params._dummy(), 'custom_objects', 'custom objects') - _keras_pkg_type = Param(Params._dummy(), '_keras_pkg_type', 'keras package type') checkpoint_callback = Param(Params._dummy(), 'checkpoint_callback', 'model checkpointing callback') inmemory_cache_all = Param(Params._dummy(), 'inmemory_cache_all', @@ -214,7 +195,6 @@ def __init__(self, self._setDefault(optimizer=None, custom_objects={}, - _keras_pkg_type=None, checkpoint_callback=None, inmemory_cache_all=False, backend_env={'LIBHDFS_OPTS': '-Xms2048m -Xmx2048m'}) @@ -223,47 +203,22 @@ def __init__(self, self.setParams(**kwargs) def _get_keras_utils(self): - # This function determines the keras package type of the Estimator based on the passed - # optimizer and model and updates _keras_pkg_type parameter. 
+ # This function checks the keras package type is tensorflow.keras - model_type = None model = self.getModel() if model: - if isinstance(model, tf.keras.Model): - model_type = TF_KERAS - elif is_instance_of_bare_keras_model(model): - model_type = BARE_KERAS - else: + if not isinstance(model, tf.keras.Model): raise ValueError( - "model has to be an instance of tensorflow.keras.Model or keras.Model") + "model has to be an instance of tensorflow.keras.Model") - optimizer_type = None optimizer = self.getOptimizer() if optimizer: if isinstance(optimizer, str): - optimizer_type = None - elif isinstance(optimizer, tf.keras.optimizers.Optimizer): - optimizer_type = TF_KERAS - elif is_instance_of_bare_keras_optimizer(optimizer): - optimizer_type = BARE_KERAS - else: - raise ValueError("invalid optimizer type") - - types = set([model_type, optimizer_type]) - types.discard(None) + pass + elif not isinstance(optimizer, tf.keras.optimizers.Optimizer): + raise ValueError("optimizer has to be an instance of tensorflow.keras.optimizers.Optimizer") - if len(types) > 1: - raise ValueError('mixed keras and tf.keras values for optimizers and model') - elif len(types) == 1: - pkg_type = types.pop() - super(KerasEstimator, self)._set(_keras_pkg_type=pkg_type) - - if pkg_type == TF_KERAS: - return TFKerasUtil - elif pkg_type == BARE_KERAS: - return BareKerasUtil - else: - raise ValueError("invalid keras type") + return TFKerasUtil def setCustomObjects(self, value): return self._set(custom_objects=value) @@ -428,11 +383,6 @@ class KerasModel(HorovodModel, KerasEstimatorParamsReadable, """ custom_objects = Param(Params._dummy(), 'custom_objects', 'custom objects') - - # Setting _keras_pkg_type parameter helps us determine the type of keras package during - # deserializing the transformer - _keras_pkg_type = Param(Params._dummy(), '_keras_pkg_type', 'keras package type') - _floatx = Param(Params._dummy(), '_floatx', 'keras default float type') @keyword_only @@ -466,22 +416,11 @@ def _get_keras_utils(self, model=None): # infer keras package from model model = self.getModel() if model: - if isinstance(model, tf.keras.Model): - pkg_type = TF_KERAS - elif is_instance_of_bare_keras_model(model): - pkg_type = BARE_KERAS - else: + if not isinstance(model, tf.keras.Model): raise ValueError( - "model has to be an instance of tensorflow.keras.Model or keras.Model") - - super(KerasModel, self)._set(_keras_pkg_type=pkg_type) + "model has to be an instance of tensorflow.keras.Model") - if pkg_type == TF_KERAS: - return TFKerasUtil - elif pkg_type == BARE_KERAS: - return BareKerasUtil - else: - raise ValueError("invalid keras type") + return TFKerasUtil raise ValueError("model is not set") @@ -578,4 +517,3 @@ def to_numpy(item): # Use the schema from previous section to construct the final DF with prediction return spark0.createDataFrame(pred_rdd, schema=final_output_schema) - diff --git a/horovod/spark/keras/remote.py b/horovod/spark/keras/remote.py index 1b05a2d9cd..879e80e12b 100644 --- a/horovod/spark/keras/remote.py +++ b/horovod/spark/keras/remote.py @@ -109,6 +109,7 @@ def train(serialized_model, train_rows, val_rows, avg_row_size): hvd = get_horovod() hvd.init() + pin_gpu(hvd, tf, k) if not user_shuffle_buffer_size: @@ -129,6 +130,9 @@ def train(serialized_model, train_rows, val_rows, avg_row_size): # Verbose mode 1 will print a progress bar verbose = user_verbose if hvd.rank() == 0 else 0 + if verbose: + print(f"Shared lib path is pointing to: {_horovod.common.process_sets._basics.MPI_LIB_CTYPES}") + transform_spec = 
None if transformation: transform_spec = TransformSpec(transformation) @@ -151,6 +155,7 @@ def train(serialized_model, train_rows, val_rows, avg_row_size): # TensorBoard, or other metrics-based callbacks. hvd.callbacks.MetricAverageCallback(), ] + callbacks += user_callbacks # Horovod: save checkpoints only on the first worker to prevent other workers from @@ -176,7 +181,18 @@ def train(serialized_model, train_rows, val_rows, avg_row_size): callbacks.append(_checkpoint_callback) if remote_store.saving_runs: - callbacks.append(k.callbacks.TensorBoard(logs_dir)) + tb_callback = None + for i, c in enumerate(callbacks): + if isinstance(c, k.callbacks.TensorBoard): + tb_callback = c + print(f"Found TensorBoard callback, updating log_dir to {logs_dir}") + tb_callback.log_dir = logs_dir + break + if tb_callback: + # Rather than a possibly arbitrary order, we always place the TensorBoard + # callback right before the SyncCallback + callbacks.pop(i) + callbacks.append(tb_callback or k.callbacks.TensorBoard(logs_dir)) callbacks.append(SyncCallback(run_output_dir, remote_store.sync, k)) if train_steps_per_epoch is None: @@ -267,7 +283,7 @@ def train(serialized_model, train_rows, val_rows, avg_row_size): with k.utils.custom_object_scope(custom_objects): model = k.models.load_model(ckpt_file) serialized_model = keras_utils.serialize_model(model) - else: + else: with open(ckpt_file, 'rb') as f: serialized_model = codec.dumps_base64(f.read()) diff --git a/horovod/spark/keras/util.py b/horovod/spark/keras/util.py index 1fd443df7d..238a0598e3 100644 --- a/horovod/spark/keras/util.py +++ b/horovod/spark/keras/util.py @@ -27,7 +27,6 @@ from horovod.spark.keras import optimizer, remote -BARE_KERAS = 'keras' TF_KERAS = 'tf_keras' @@ -217,185 +216,6 @@ def prep(row): return prep -class BareKerasUtil(object): - type = BARE_KERAS - - @staticmethod - def fit_fn(epochs): - def fn(model, train_data, val_data, steps_per_epoch, validation_steps, callbacks, verbose): - return model.fit_generator( - train_data, - validation_data=val_data, - steps_per_epoch=steps_per_epoch, - validation_steps=validation_steps, - callbacks=callbacks, - verbose=verbose, - epochs=epochs) - - return fn - - @staticmethod - def make_dataset_fn(feature_columns, label_columns, sample_weight_col, metadata, - input_shapes, label_shapes, output_names): - batch_generator = BareKerasUtil._batch_generator_fn( - feature_columns, label_columns, sample_weight_col, - input_shapes, label_shapes, metadata) - - def fn(reader, batch_size, shuffle_buffer_size, is_batch_reader, shuffle=False, cache=inmemory_cache_all): - # is_batch_reader and cache are not used for BareKerasUtil. 
- return batch_generator(reader, batch_size, shuffle_buffer_size, shuffle) - - return fn - - @staticmethod - def get_horovod(): - return BareKerasUtil.horovod_fn()() - - @staticmethod - def horovod_fn(): - def fn(): - import horovod.keras as hvd - return hvd - return fn - - @staticmethod - def keras(): - return BareKerasUtil.keras_fn()() - - @staticmethod - def keras_fn(): - def fn(): - import keras - return keras - return fn - - @staticmethod - def serialize_optimizer(*args, **kwargs): - return optimizer.serialize_bare_keras_optimizer(*args, **kwargs) - - @staticmethod - def deserialize_optimizer(*args, **kwargs): - return optimizer.deserialize_bare_keras_optimizer(*args, **kwargs) - - @staticmethod - def serialize_model(*args, **kwargs): - def serialize_keras_model(x): - return _serialize_keras_model(x, BareKerasUtil.keras().models.save_model) - - return serialize_keras_model(*args, **kwargs) - - @staticmethod - def deserialize_model(*args, **kwargs): - return _deserialize_keras_model(*args, **kwargs) - - @staticmethod - def serialize_param_value(*args, **kwargs): - def _serialize_param(x, y): - return _serialize_param_value(x, y, - serialize_model_fn=BareKerasUtil.serialize_model, - serialize_opt_fn=BareKerasUtil.serialize_optimizer) - - return _serialize_param(*args, **kwargs) - - @staticmethod - def _batch_generator_fn(feature_columns, label_columns, sample_weight_col, - input_shapes, label_shapes, metadata): - prepare_data_bare_keras = BareKerasUtil._prepare_data_fn(metadata) - - cols = feature_columns + label_columns - if sample_weight_col: - cols.append(sample_weight_col) - - def batch_generator(reader, batch_size, shuffle_buffer_size, shuffle=False): - while True: - num_rows_read_sofar = 0 - data = None - # Create iterator from reader to start a new iteration from beginning. - reader_iter = iter(reader) - while num_rows_read_sofar < shuffle_buffer_size: - # Each call to next reads one row group at a time. - row_group_data = next(reader_iter, None) - if row_group_data is None: - # No data left in reader, stop filling. 
- break - if not data: - data = {col: getattr(row_group_data, col) for col in cols} - else: - for col in cols: - data[col] = np.concatenate((data[col], - getattr(row_group_data, col))) - num_rows_read_sofar += row_group_data[0].shape[0] - - # Create a permutation of len of data and use it to shuffle each numpy array - perm = np.random.permutation(num_rows_read_sofar) \ - if shuffle else list(range(num_rows_read_sofar)) - - inputs = [prepare_data_bare_keras(data[col][perm], col, shape) for col, shape - in zip(feature_columns, input_shapes)] - labels = [prepare_data_bare_keras(data[col][perm], col, shape) for col, shape - in zip(label_columns, label_shapes)] - - num_outputs = len(label_columns) - sample_weights = None - if sample_weight_col: - sample_weights = data[sample_weight_col][perm] - - batch_count = int(len(inputs[0]) / batch_size) - for i in range(0, batch_count): - if sample_weight_col: - # We use the same sample weight for all the outputs of the sample - sample_weight = sample_weights[i * batch_size:(i + 1) * batch_size] - sample_weight_for_batch = [sample_weight for i in range(num_outputs)] - - yield ( - [input[i * batch_size:(i + 1) * batch_size] for input in inputs], - [label[i * batch_size:(i + 1) * batch_size] for label in labels], - sample_weight_for_batch) - else: - yield ( - [input[i * batch_size:(i + 1) * batch_size] for input in inputs], - [label[i * batch_size:(i + 1) * batch_size] for label in labels]) - - return batch_generator - - @staticmethod - def _prepare_data_fn(metadata): - convert_custom_sparse_to_dense = BareKerasUtil._convert_custom_sparse_to_dense_fn() - CUSTOM_SPARSE = constants.CUSTOM_SPARSE - - def prepare_data(rows, col, shape): - intermediate_format = metadata[col]['intermediate_format'] - if intermediate_format != CUSTOM_SPARSE: - return rows.reshape(shape) - - dense_rows = [] - shape_1d = metadata[col]['shape'] - for row in rows: - dense_row = convert_custom_sparse_to_dense(row, shape_1d) - dense_rows.append(dense_row) - return np.array(dense_rows).reshape(shape) - return prepare_data - - @staticmethod - def _convert_custom_sparse_to_dense_fn(): - def convert_custom_sparse_to_dense(row, shape): - size = int(row[0]) - dense_row = np.zeros(shape) - dense_row[row[1:size + 1].astype(int)] = row[size + 1:2 * size + 1] - return dense_row - return convert_custom_sparse_to_dense - - -def is_instance_of_bare_keras_optimizer(opt): - import keras - return isinstance(opt, keras.optimizers.Optimizer) - - -def is_instance_of_bare_keras_model(model): - import keras - return isinstance(model, keras.models.Model) - - def _serialize_keras_model(model, save_model_fn): """Serialize model into byte array encoded into base 64.""" bio = io.BytesIO() diff --git a/horovod/spark/lightning/datamodule.py b/horovod/spark/lightning/datamodule.py index 534361469a..213e779267 100644 --- a/horovod/spark/lightning/datamodule.py +++ b/horovod/spark/lightning/datamodule.py @@ -12,10 +12,12 @@ class PetastormDataModule(pl.LightningDataModule): """Default DataModule for Lightning Estimator""" def __init__(self, train_dir: str, val_dir: str, num_train_epochs: int=1, has_val: bool=True, train_batch_size: int=32, val_batch_size: int=32, shuffle_size: int=1000, - num_reader_epochs=None, reader_pool_type: str="process", reader_worker_count: int=2, - transform_spec=None, inmemory_cache_all=False, + num_reader_epochs=None, reader_pool_type: str="process", + reader_worker_count: int=2, transform_spec=None, inmemory_cache_all=False, cur_shard: int=0, shard_count: int=1, schema_fields=None, 
storage_options=None, - steps_per_epoch_train: int=1, steps_per_epoch_val: int=1, verbose=True, **kwargs): + steps_per_epoch_train: int=1, steps_per_epoch_val: int=1, verbose=True, + debug_data_loader: bool=False, train_async_data_loader_queue_size: int=None, + val_async_data_loader_queue_size: int=None, **kwargs): super().__init__() self.train_dir = train_dir self.val_dir = val_dir @@ -36,6 +38,13 @@ def __init__(self, train_dir: str, val_dir: str, num_train_epochs: int=1, has_va self.steps_per_epoch_train = steps_per_epoch_train self.steps_per_epoch_val = steps_per_epoch_val self.verbose = verbose + self.debug_data_loader = debug_data_loader + self.train_async_data_loader_queue_size = train_async_data_loader_queue_size + self.val_async_data_loader_queue_size = val_async_data_loader_queue_size + + if debug_data_loader: + print("Creating data_module") + def setup(self, stage=None): # Assign train/val datasets for use in dataloaders @@ -78,6 +87,8 @@ def teardown(self, stage=None): if self.has_val: self.val_reader.stop() self.val_reader.join() + if self.verbose: + print("Tear down: async dataloaders closed.") def train_dataloader(self): if self.verbose: @@ -95,6 +106,18 @@ def train_dataloader(self): dataloader_class = PytorchInfiniteAsyncDataLoader kwargs['shuffling_queue_capacity'] = self.shuffle_size + if self.debug_data_loader: + kwargs['debug_data_loader'] = self.debug_data_loader + + if self.train_async_data_loader_queue_size is not None: + if isinstance(self.train_async_data_loader_queue_size, int): + kwargs['async_loader_queue_size'] = self.train_async_data_loader_queue_size + elif isinstance(self.train_async_data_loader_queue_size, float): + # use async data loader queue size as ratio of total steps. + kwargs['async_loader_queue_size'] = int(kwargs['limit_step_per_epoch'] * self.train_async_data_loader_queue_size) + else: + raise RuntimeError(f"Unsupported type for train_async_data_loader_queue_size={self.train_async_data_loader_queue_size}") + self.train_dl = dataloader_class(**kwargs) return self.train_dl @@ -116,5 +139,17 @@ def val_dataloader(self): dataloader_class = PytorchInfiniteAsyncDataLoader kwargs['shuffling_queue_capacity'] = 0 + if self.debug_data_loader: + kwargs['debug_data_loader'] = self.debug_data_loader + + if self.val_async_data_loader_queue_size is not None: + if isinstance(self.val_async_data_loader_queue_size, int): + kwargs['async_loader_queue_size'] = self.val_async_data_loader_queue_size + elif isinstance(self.val_async_data_loader_queue_size, float): + # use async data loader queue size as ratio of total steps. + kwargs['async_loader_queue_size'] = int(kwargs['limit_step_per_epoch'] * self.val_async_data_loader_queue_size) + else: + raise RuntimeError(f"Unsupported type for val_async_data_loader_queue_size={self.val_async_data_loader_queue_size}") + self.val_dl = dataloader_class(**kwargs) return self.val_dl diff --git a/horovod/spark/lightning/estimator.py b/horovod/spark/lightning/estimator.py index 0b6cedc065..5111ca4cd1 100644 --- a/horovod/spark/lightning/estimator.py +++ b/horovod/spark/lightning/estimator.py @@ -160,6 +160,7 @@ class TorchEstimator(HorovodEstimator, TorchEstimatorParamsWritable, train_steps_per_epoch: (Optional) Number of steps to train each epoch. Useful for testing that model trains successfully. Defaults to training the entire dataset each epoch. + trainer_args: (Optional) Dict of args to pass to trainer, it will be used to add/override the args which estimator gives to trainer. 
transformation_fn: (Optional) function that takes a row as its parameter and returns a modified row that is then fed into the train or validation step. This transformation is applied after batching. See Petastorm @@ -175,6 +176,9 @@ class TorchEstimator(HorovodEstimator, TorchEstimatorParamsWritable, validation_steps_per_epoch: (Optional) Number of validation steps to perform each epoch. verbose: (Optional)Verbosity level, 0 for silent. (default: 1). profiler: (Optional)Lightning profiler to enable. (disabled by default). + debug_data_loader: (Optional)Debugging flag for data loader. + train_async_data_loader_queue_size: (Optional) Size of train async data loader queue. + val_async_data_loader_queue_size: (Optional) Size of val async data loader queue. """ input_shapes = Param(Params._dummy(), 'input_shapes', 'input layer shapes') @@ -206,6 +210,15 @@ class TorchEstimator(HorovodEstimator, TorchEstimatorParamsWritable, profiler = Param(Params._dummy(), 'profiler', 'lightning profiler to use') + trainer_args = Param(Params._dummy(), 'trainer_args', + 'Dict of args to pass to trainer, it will be used to add/override the args which estimator gives to trainer. ') + + debug_data_loader = Param(Params._dummy(), 'debug_data_loader', 'Flag to enable debug for data laoder.') + + train_async_data_loader_queue_size = Param(Params._dummy(), 'train_async_data_loader_queue_size', 'Size of train async data loader queue.') + + val_async_data_loader_queue_size = Param(Params._dummy(), 'val_async_data_loader_queue_size', 'Size of val async data loader queue.') + @keyword_only def __init__(self, num_proc=None, @@ -236,6 +249,7 @@ def __init__(self, validation_steps_per_epoch=None, transformation_fn=None, train_reader_num_workers=None, + trainer_args=None, val_reader_num_workers=None, reader_pool_type=None, label_shapes=None, @@ -246,7 +260,10 @@ def __init__(self, data_module=None, loader_num_epochs=None, terminate_on_nan=False, - profiler=None): + profiler=None, + debug_data_loader=False, + train_async_data_loader_queue_size=None, + val_async_data_loader_queue_size=None): super(TorchEstimator, self).__init__() self._setDefault(loss_constructors=None, @@ -260,7 +277,11 @@ def __init__(self, data_module=None, loader_num_epochs=None, terminate_on_nan=False, - profiler=None) + profiler=None, + trainer_args=None, + debug_data_loader=False, + train_async_data_loader_queue_size=None, + val_async_data_loader_queue_size=None) kwargs = self._input_kwargs @@ -333,12 +354,36 @@ def setTerminateOnNan(self, value): def getTerminateOnNan(self): return self.getOrDefault(self.terminate_on_nan) + def setTrainerArgs(self, value): + return self._set(trainer_args=value) + + def getTrainerArgs(self): + return self.getOrDefault(self.trainer_args) + def getProfiler(self): return self.getOrDefault(self.profiler) def _get_optimizer(self): return self.getOrDefault(self.optimizer) + def setDebugDataLoader(self, value): + return self._set(debug_data_loader=value) + + def getDebugDataLoader(self): + return self.getOrDefault(self.debug_data_loader) + + def setTrainAsyncDataLoaderQueueSize(self, value): + return self._set(train_async_data_loader_queue_size=value) + + def getTrainAsyncDataLoaderQueueSize(self): + return self.getOrDefault(self.train_async_data_loader_queue_size) + + def setValAsyncDataLoaderQueueSize(self, value): + return self._set(val_async_data_loader_queue_size=value) + + def getValAsyncDataLoaderQueueSize(self): + return self.getOrDefault(self.val_async_data_loader_queue_size) + # Overwrites Model's getOptimizer method 
def getOptimizer(self): model = self.getModel() @@ -401,6 +446,7 @@ def _fit_on_prepared_data(self, backend, train_rows, val_rows, metadata, avg_row validation=self.getValidation()) serialized_model = serialize_fn()(model) + # FIXME: checkpoint bytes should be loaded into serialized_model, same as Keras Estimator. ckpt_bytes = self._read_checkpoint(run_id) if self._has_checkpoint(run_id) else None trainer = remote.RemoteTrainer(self, metadata=metadata, @@ -430,14 +476,13 @@ def _create_model(self, run_results, run_id, metadata): best_checkpoint = torch.load(serialized_checkpoint, map_location=torch.device('cpu')) model = copy.deepcopy(self.getModel()) - # optimizer = copy.deepcopy(self.getOptimizer()) - model.load_state_dict(best_checkpoint['model']) model.eval() - # optimizer.load_state_dict(best_checkpoint['optimizer']) - history = None + history = best_checkpoint['logged_metrics'] + + # Optimizer is part of the model, no need to return it to the transformer. + # TODO: (Pengz) Update the latest state of the optimizer in the model for retraining. optimizer = None return self.get_model_class()(**self._get_model_kwargs( diff --git a/horovod/spark/lightning/remote.py b/horovod/spark/lightning/remote.py index 348b32f34e..dce397907e 100644 --- a/horovod/spark/lightning/remote.py +++ b/horovod/spark/lightning/remote.py @@ -56,11 +56,26 @@ def RemoteTrainer(estimator, metadata, ckpt_bytes, run_id, dataset_idx, train_ro train_steps_per_epoch = estimator.getTrainStepsPerEpoch() val_steps_per_epoch = estimator.getValidationStepsPerEpoch() num_gpus = estimator.getNumGPUs() - logger = estimator.getLogger() - log_every_n_steps = estimator.getLogEveryNSteps() data_module = estimator.getDataModule() if estimator.getDataModule() else PetastormDataModule loader_num_epochs = estimator.getLoaderNumEpochs() verbose = (estimator.getVerbose() > 0) + trainer_args = estimator.getTrainerArgs() + debug_data_loader = estimator.getDebugDataLoader() + train_async_data_loader_queue_size = estimator.getTrainAsyncDataLoaderQueueSize() + val_async_data_loader_queue_size = estimator.getValAsyncDataLoaderQueueSize() + + # get logger + logger = estimator.getLogger() + log_every_n_steps = estimator.getLogEveryNSteps() + print(f"logger is configured: {logger}") + + # Comet logger's experiment key is not serialized correctly. Need to remember the key, and + # resume the logger experiment from the GPU instance. + if isinstance(logger, CometLogger): + logger_experiment_key = logger._experiment_key + print(f"logger vars: {vars(logger)}") + else: + logger_experiment_key = None # Data reader parameters train_reader_worker_count = estimator.getTrainReaderNumWorker() @@ -85,43 +100,66 @@ def RemoteTrainer(estimator, metadata, ckpt_bytes, run_id, dataset_idx, train_ro def train(serialized_model): import horovod.torch as hvd + # Horovod: initialize library.
hvd.init() - with tempfile.TemporaryDirectory() as last_ckpt_dir, remote_store.get_local_output_dir() as run_output_dir: - last_ckpt_file = os.path.join(last_ckpt_dir, 'last.ckpt') - if ckpt_bytes: - with open(last_ckpt_file, 'wb') as f: - f.write(ckpt_bytes) + if verbose: + import horovod as _horovod + print(f"Shared lib path is pointing to: {_horovod.common.process_sets._basics.MPI_LIB_CTYPES}") + + _checkpoint_callback = None + require_checkpoint = False - # TODO: Pass the logger from estimator constructor + with remote_store.get_local_output_dir() as run_output_dir: logs_path = os.path.join(run_output_dir, remote_store.logs_subdir) os.makedirs(logs_path, exist_ok=True) print(f"Made directory {logs_path} for horovod rank {hvd.rank()}") + ckpt_dir = run_output_dir + ckpt_filename = remote_store.checkpoint_filename - # Use default logger if no logger is supplied - train_logger = logger - print(f"Train_logger is {train_logger}") - - if train_logger is None: + if logger is None: + # Use default logger if no logger is supplied train_logger = TensorBoardLogger(logs_path) - - # TODO: find out a way to use ckpt_path created from remote store, but all other parameters ingest from estimator config - # ckpt_path = os.path.join(run_output_dir, remote_store.checkpoint_filename) - # os.makedirs(ckpt_path, exist_ok=True) - # model_checkpoint_callback = ModelCheckpoint(dirpath=ckpt_path) - # callbacks.append(model_checkpoint_callback) - - is_model_checkpoint_callback_exist = False + print(f"Setup logger: Using TensorBoardLogger: {train_logger}") + + elif isinstance(logger, CometLogger) and logger._experiment_key is None: + # Resume logger experiment key if passed correctly from CPU. + train_logger = CometLogger( + save_dir=logs_path, + api_key=logger.api_key, + experiment_key=logger_experiment_key, + ) + + print(f"Setup logger: Resume comet logger: {vars(train_logger)}") + else: + # use logger passed in. + train_logger = logger + train_logger.save_dir = logs_path + print(f"Setup logger: Using logger passed from estimator: {train_logger}") + + # Lightning requires to add checkpoint callbacks for all ranks. + # Otherwise we are seeing hanging in training. for cb in callbacks: if isinstance(cb, ModelCheckpoint): - is_model_checkpoint_callback_exist = True + cb.dirpath = ckpt_dir + cb.filename = ckpt_filename + _checkpoint_callback = cb + require_checkpoint = True break + if not _checkpoint_callback: + # By default 'monitor'=None which saves a checkpoint only for the last epoch. + _checkpoint_callback = ModelCheckpoint(dirpath=ckpt_dir, + filename=ckpt_filename, + verbose=True) + callbacks.append(_checkpoint_callback) if remote_store.saving_runs and hvd.rank() == 0: + # Horovod: sync checkpoint and logging files only on rank 0 to + # prevent other ranks from corrupting them. 
class _SyncCallback(Callback):
def on_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
- remote_store.sync(logs_path)
+ remote_store.sync(run_output_dir)
callbacks.append(_SyncCallback())
@@ -133,7 +171,11 @@ def on_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -
_val_steps_per_epoch = val_steps_per_epoch if val_steps_per_epoch else \
int(math.floor(float(val_rows) / val_batch_size / hvd.size()))
- print(f"Training data of rank[{hvd.local_rank()}]: train_rows:{train_rows}, batch_size:{batch_size}, _train_steps_per_epoch:{_train_steps_per_epoch}.")
+ if verbose:
+ print(f"Training data of rank[{hvd.local_rank()}]: Epochs: {epochs}\n"
+ f"Train rows: {train_rows}, Train batch size: {batch_size}, Train_steps_per_epoch: {_train_steps_per_epoch}\n"
+ f"Val rows: {val_rows}, Val batch size: {val_batch_size}, Val_steps_per_epoch: {_val_steps_per_epoch}\n"
+ f"Checkpoint file: {remote_store.checkpoint_path}, Logs dir: {remote_store.logs_path}\n")
cuda_available = torch.cuda.is_available()
# We need to check all ranks have same device type for training.
@@ -152,58 +194,92 @@ def on_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -
if _num_gpus is None:
_num_gpus = 1 if cuda_available else 0
+ # Set the progress bar refresh rate to once per epoch; detailed loss and metrics are available in the logger,
+ # so there is no need to print them on screen here. The user can still override this in trainer_args.
+ progress_bar_refresh_rate = _train_steps_per_epoch
+
kwargs = {'accelerator': 'horovod',
'gpus': _num_gpus,
'callbacks': callbacks,
'max_epochs': epochs,
'logger': train_logger,
'log_every_n_steps': log_every_n_steps,
- 'resume_from_checkpoint': (last_ckpt_file if ckpt_bytes else None),
- 'checkpoint_callback': is_model_checkpoint_callback_exist,
'num_sanity_val_steps': 0,
'reload_dataloaders_every_epoch': False,
- 'progress_bar_refresh_rate': _train_steps_per_epoch // 10,
+ 'progress_bar_refresh_rate': progress_bar_refresh_rate,
'terminate_on_nan': terminate_on_nan,
'profiler': profiler
}
- print("Creating trainer with: \n ", kwargs)
+ if trainer_args:
+ kwargs.update(trainer_args)
+
+ if verbose and hvd.rank() == 0:
+ print("Creating trainer with: \n ", kwargs)
+
trainer = Trainer(**kwargs)
- if trainer.profiler:
- print(f"Set profiler's logs_path to {logs_path}")
+ if profiler != 'simple' and trainer.profiler:
+ print(f"Set profiler's logs_path for {hvd.rank()} to {logs_path}")
trainer.profiler.dirpath = logs_path
+ # filename where the profiler results will be saved instead of
+ # printing to stdout. The .txt extension will be used automatically.
+ trainer.profiler.filename = "profile" + + if verbose and hvd.rank() == 0: + print(f"pytorch_lightning version={pl.__version__}") + + data_module_kwargs = { + 'train_dir': remote_store.train_data_path, + 'val_dir': remote_store.val_data_path, + 'num_train_epochs': epochs, + 'has_val': should_validate is not None, + 'train_batch_size': batch_size, + 'val_batch_size': val_batch_size, + 'shuffle_size': calculate_shuffle_buffer_size(), + 'num_reader_epochs': loader_num_epochs, + 'reader_pool_type': reader_pool_type, + 'reader_worker_count': train_reader_worker_count, + 'transform_spec': transformation, + 'inmemory_cache_all': inmemory_cache_all, + 'cur_shard': hvd.rank(), + 'shard_count': hvd.size(), + 'schema_fields': schema_fields, + 'storage_options': storage_options, + 'steps_per_epoch_train': _train_steps_per_epoch, + 'steps_per_epoch_val': _val_steps_per_epoch, + 'verbose': verbose, + 'debug_data_loader': debug_data_loader, + 'train_async_data_loader_queue_size': train_async_data_loader_queue_size, + 'val_async_data_loader_queue_size': val_async_data_loader_queue_size, + } + if debug_data_loader and hvd.rank() == 0: + print(f"Creating data module with args:\n {data_module_kwargs}") + + dataset = data_module(**data_module_kwargs) - print(f"pytorch_lightning version={pl.__version__}") - - dataset = data_module(train_dir=remote_store.train_data_path, - val_dir=remote_store.val_data_path, - num_train_epochs=epochs, - has_val=should_validate is not None, - train_batch_size=batch_size, val_batch_size=val_batch_size, - shuffle_size=calculate_shuffle_buffer_size(), - num_reader_epochs=loader_num_epochs, - reader_pool_type=reader_pool_type, reader_worker_count=train_reader_worker_count, - transform_spec=transformation, inmemory_cache_all=inmemory_cache_all, - cur_shard=hvd.rank(), shard_count=hvd.size(), - schema_fields=schema_fields, storage_options=storage_options, - steps_per_epoch_train=_train_steps_per_epoch, - steps_per_epoch_val=_val_steps_per_epoch, - verbose=verbose) trainer.fit(model, dataset) - serialized_checkpoint = io.BytesIO() - module = model if not is_legacy else model._model + if hvd.rank() == 0: + if remote_store.saving_runs and trainer.profiler: + # One more file sync to push profiler result. + remote_store.sync(logs_path) - # TODO: find a way to pass trainer.logged_metrics out. - output = {'model': module.state_dict()} + # rank 0 overwrites model with best checkpoint and returns. + if require_checkpoint: + if verbose: + print("load from checkpoint best model path:", + _checkpoint_callback.best_model_path) + best_model = model.load_from_checkpoint(_checkpoint_callback.best_model_path) + else: + best_model = model + serialized_checkpoint = io.BytesIO() + module = best_model if not is_legacy else best_model._model - torch.save(output, serialized_checkpoint) + output = {'model': module.state_dict(), 'logged_metrics': trainer.logged_metrics} - if remote_store.saving_runs and hvd.rank() == 0: - remote_store.sync(logs_path) + torch.save(output, serialized_checkpoint) - serialized_checkpoint.seek(0) - return serialized_checkpoint + return serialized_checkpoint return train diff --git a/horovod/spark/torch/remote.py b/horovod/spark/torch/remote.py index fe38237949..6b2f325caa 100644 --- a/horovod/spark/torch/remote.py +++ b/horovod/spark/torch/remote.py @@ -117,6 +117,10 @@ def train(serialized_model, optimizer_cls, model_opt_state_serialized, # Horovod: initialize library. 
hvd.init() + if user_verbose: + import horovod as _horovod + print(f"Shared lib path is pointing to: {_horovod.common.process_sets._basics.MPI_LIB_CTYPES}") + if not user_shuffle_buffer_size: shuffle_buffer_size = \ calculate_shuffle_buffer_size(hvd, avg_row_size, train_rows / hvd.size()) diff --git a/horovod/tensorflow/__init__.py b/horovod/tensorflow/__init__.py index 002fa5068d..ddf5585ae3 100644 --- a/horovod/tensorflow/__init__.py +++ b/horovod/tensorflow/__init__.py @@ -26,7 +26,7 @@ from horovod.tensorflow import elastic from horovod.tensorflow.compression import Compression from horovod.tensorflow.functions import allgather_object, broadcast_object, broadcast_object_fn, broadcast_variables -from horovod.tensorflow.mpi_ops import allgather, broadcast, _allreduce, _grouped_allreduce, alltoall +from horovod.tensorflow.mpi_ops import allgather, broadcast, broadcast_, _allreduce, _grouped_allreduce, alltoall from horovod.tensorflow.mpi_ops import init, shutdown from horovod.tensorflow.mpi_ops import is_initialized, start_timeline, stop_timeline from horovod.tensorflow.mpi_ops import size, local_size, cross_size, rank, local_rank, cross_rank, is_homogeneous diff --git a/horovod/tensorflow/functions.py b/horovod/tensorflow/functions.py index ebaa2f8cd7..034d3c0a99 100644 --- a/horovod/tensorflow/functions.py +++ b/horovod/tensorflow/functions.py @@ -21,7 +21,7 @@ from tensorflow.python.framework import ops -from horovod.tensorflow.mpi_ops import allgather, broadcast +from horovod.tensorflow.mpi_ops import allgather, broadcast, broadcast_ from horovod.tensorflow.mpi_ops import rank, size from horovod.tensorflow.util import _cache, _executing_eagerly, _make_subgraph from horovod.common.process_sets import ProcessSet, global_process_set @@ -45,20 +45,53 @@ def broadcast_group(variables, root_rank, process_set: ProcessSet): return broadcast_group -def broadcast_variables(variables, root_rank, process_set=global_process_set): +@_cache +def _make_inplace_broadcast_group_fn(): + if _executing_eagerly(): + # These are just a few calls of broadcast_, no need to aggregate them in a tf.function + def broadcast_group(variable_lists, root_rank, process_set: ProcessSet): + for variables in variable_lists: + broadcast_(variables, root_rank, process_set=process_set) + + return broadcast_group + else: + # Graph mode requires an Op + def broadcast_group(variable_lists, root_rank, process_set: ProcessSet): + return tf.group(*[broadcast_(variables, root_rank, process_set=process_set) + for variables in variable_lists]) + + return broadcast_group + + +def broadcast_variables(variables, root_rank, process_set=global_process_set, inplace=False): """ Broadcasts variables from root rank to all other processes in a process set (defaults to all Horovod processes). + Optionally, the broadcast may be performed in-place, which avoids + temporary memory allocations and fragmentation. This is only + supported with TensorFlow 2.6 or later. Reference variables + (legacy support in TF 2) must all be of the same data type. There + is no such restriction for resource variables (default in TF 2). + Arguments: variables: variables for broadcast root_rank: rank of the process from which global variables will be broadcasted to all other processes. process_set: Process set object to limit this operation to a subset of Horovod processes. Default is the global process set. 
+ inplace: whether to perform in-place broadcasts """ - broadcast_group = _make_broadcast_group_fn() - return broadcast_group(variables, root_rank, process_set) + if inplace: + vars_by_device = {} + for var in variables: + vars_by_device.setdefault(var.device, []).append(var) + + inplace_broadcast_group = _make_inplace_broadcast_group_fn() + return inplace_broadcast_group(vars_by_device.values(), root_rank, process_set) + else: + broadcast_group = _make_broadcast_group_fn() + return broadcast_group(variables, root_rank, process_set) def broadcast_object(obj, root_rank=0, session=None, name=None, process_set=global_process_set): diff --git a/horovod/tensorflow/mpi_ops.cc b/horovod/tensorflow/mpi_ops.cc index 6faed38a5a..9cb55d5266 100644 --- a/horovod/tensorflow/mpi_ops.cc +++ b/horovod/tensorflow/mpi_ops.cc @@ -18,16 +18,27 @@ #include #include +#include #include #include +#define EIGEN_USE_THREADS +#if HAVE_CUDA || HAVE_ROCM +#define EIGEN_USE_GPU +#endif // HAVE_CUDA || HAVE_ROCM + #include "tensorflow/core/framework/op.h" #include "tensorflow/core/framework/op_kernel.h" #include "tensorflow/core/framework/shape_inference.h" +#include "tensorflow/core/framework/common_shape_fns.h" -#include "../common/common.h" +#if TENSORFLOW_VERSION >= 2006000000 +#include "tensorflow/core/framework/resource_mgr.h" +#include "tensorflow/core/framework/resource_var.h" +#include "tensorflow/core/kernels/training_op_helpers.h" +#endif // TENSORFLOW_VERSION >= 2006000000 -#define EIGEN_USE_THREADS +#include "../common/common.h" #if HAVE_GPU @@ -831,6 +842,260 @@ Output `tensor` on root rank. )doc"); +#if TENSORFLOW_VERSION >= 2006000000 +namespace { +std::string NormalizeNameForTensorFlow(const std::string& name) { + static const std::regex normalize_re(R"regex([^a-zA-Z0-9_])regex"); + return std::regex_replace(name, normalize_re, "_"); +} + +Status GetInputDataTypeFromVariable(OpKernelContext* ctx, int input, + DataType& out) { + if (ctx->input_dtype(input) == DT_RESOURCE) { + core::RefCountPtr var; + TF_RETURN_IF_ERROR(LookupResource(ctx, HandleFromInput(ctx, input), &var)); + out = var->tensor()->dtype(); + } else { + out = BaseType(ctx->input_dtype(input)); + } + return Status::OK(); +} + +} + +template +class HorovodBroadcastInplaceOp : public OpKernel { +public: + explicit HorovodBroadcastInplaceOp(OpKernelConstruction* context) + : OpKernel(context) { + OP_REQUIRES_OK(context, context->GetAttr("root_rank", &root_rank_)); + OP_REQUIRES_OK(context, + context->GetAttr("process_set_id", &process_set_id_)); + OP_REQUIRES_OK(context, context->GetAttr("num_variables", &num_variables_)); + OP_REQUIRES_OK(context, context->GetAttr("variable_names", &variable_names_)); + OP_REQUIRES(context, (int) variable_names_.size() == num_variables_, + errors::InvalidArgument( + "len(variable_names) needs to be equal to num_variables")); + } + + void Compute(OpKernelContext* context) override { + OP_REQUIRES_OK(context, ConvertStatus(common::CheckInitialized())); + + auto any_failures_and_tensors_done = + std::make_shared, std::atomic>>(); + any_failures_and_tensors_done->first.store(false); + any_failures_and_tensors_done->second.store(0); + + std::vector variable_locks; + variable_locks.reserve(num_variables_); + + for (int tensor_index = 0; tensor_index < num_variables_; ++tensor_index) { + DataType dtype; + OP_REQUIRES_OK( + context, GetInputDataTypeFromVariable(context, tensor_index, dtype)); + + // Functions in tensorflow/core/kernels/training_op_helpers.h that deal + // with resource variables need a template type 
parameter. This requires + // us to branch out to different specializations of a templated helper + // function. + switch (dtype) { +#define PROCESS_CASE(DT, T) \ + case DT: \ + OP_REQUIRES_OK(context, Process(context, tensor_index, variable_locks, \ + any_failures_and_tensors_done)); \ + break; + PROCESS_CASE(DT_UINT8, uint8) + PROCESS_CASE(DT_INT8, int8) + PROCESS_CASE(DT_INT32, int32) + PROCESS_CASE(DT_INT64, int64) + PROCESS_CASE(DT_HALF, Eigen::half) + PROCESS_CASE(DT_FLOAT, float) + PROCESS_CASE(DT_DOUBLE, double) + PROCESS_CASE(DT_BOOL, bool) + // no support for int16 and uint16 because there are no DenseUpdate + // kernels for them + default: + context->CtxFailure(__FILE__, __LINE__,errors::InvalidArgument( + "Horovod inplace broadcast does not support data type ", + DataTypeString(dtype))); + return; + } +#undef PROCESS_CASE + } + + while (!any_failures_and_tensors_done->first.load() && + any_failures_and_tensors_done->second.load() < num_variables_) { + std::this_thread::yield(); + } + } + +private: + int root_rank_ = 0; + int process_set_id_ = 0; + int num_variables_ = 0; + std::vector variable_names_; + + template + Status + Process(OpKernelContext* context, int tensor_index, + std::vector& variable_locks, + const std::shared_ptr, std::atomic>>& + any_failures_and_tensors_done) { + const bool do_lock = true; + const bool sparse = false; + // Here we need to replicate the functionality provided by + // MaybeLockVariableInputMutexesInOrder(). That function currently does + // not work as intended for input_ids not starting at 0. See: + // https://github.com/tensorflow/tensorflow/issues/51686 + { + Var* var; + mutex* mu = GetTrainingVariableMutex(context, tensor_index, + sparse, &var); + std::vector vars; + if (var) { + vars.reserve(1); + vars.push_back(var); + } + std::vector mutexes{mu}; + auto locks = absl::make_unique>(); + locks->reserve(1); + locks->emplace_back(*mu); + auto shared_locks = absl::make_unique>(); + variable_locks.emplace_back(std::move(vars), std::move(locks), + std::move(shared_locks)); + } + + Tensor tensor; + TF_RETURN_IF_ERROR(GetInputTensorFromVariable( + context, tensor_index, do_lock, sparse, &tensor)); + Tensor* output = &tensor; + MaybeForwardRefInputToRefOutput(context, tensor_index, tensor_index); + + std::string var_name = variable_names_[tensor_index]; + if (context->input_dtype(tensor_index) == DT_RESOURCE && var_name.empty()) { + const ResourceHandle& handle = HandleFromInput(context, tensor_index); + // We use handle.name() as a fallback only when we do not have a proper + // name because typically it seems to be something like _AnonymousVar18. + // The Python name attribute of the variable does not appear to be passed + // through automatically. + var_name = handle.name(); + } + + auto device = GetDeviceID(context); + // ReadyEvent makes sure input tensor is ready, and output is allocated. 
+ common::ReadyEventList ready_event_list; +#if HAVE_GPU + ready_event_list.AddReadyEvent( + std::shared_ptr(RecordReadyEvent(context))); +#endif + auto hvd_context = std::make_shared(context); + auto hvd_tensor = std::make_shared(tensor); + auto hvd_output = std::make_shared(*output); + const std::string node_name = + name() + "_" + NormalizeNameForTensorFlow(var_name); + auto enqueue_result = EnqueueTensorBroadcast( + hvd_context, hvd_tensor, hvd_output, root_rank_, ready_event_list, + node_name, device, + [context, any_failures_and_tensors_done](const common::Status& status) { +#if HAVE_GPU + auto hvd_event = status.event; + if (hvd_event.event) { + auto device_context = context->op_device_context(); + if (device_context != nullptr) { + auto stream = stream_executor::gpu::AsGpuStreamValue( + device_context->stream()); + HVD_GPU_CHECK(gpuStreamWaitEvent(stream, *(hvd_event.event), 0)); + } + } +#endif + if (!status.ok()) { + auto prev_failures = any_failures_and_tensors_done->first.load(); + if (!prev_failures) { + // Only keeping failure status of the first broadcast that fails + context->SetStatus(ConvertStatus(status)); + any_failures_and_tensors_done->first.store(false); + } + } + any_failures_and_tensors_done->second.fetch_add(1); + }, + process_set_id_); + return ConvertStatus(enqueue_result); + } +}; + +REGISTER_KERNEL_BUILDER(Name("HorovodBroadcastInplace").Device(DEVICE_CPU), + HorovodBroadcastInplaceOp); +#if HOROVOD_GPU_BROADCAST +REGISTER_KERNEL_BUILDER(Name("HorovodBroadcastInplace").Device(DEVICE_GPU), + HorovodBroadcastInplaceOp); +#endif + +REGISTER_OP("HorovodBroadcastInplace") + .Attr( + "T: {uint8, int8, int32, int64, float16, float32, float64, bool}") + .Attr("root_rank: int") + .Attr("process_set_id: int = 0") + .Attr("num_variables: int") + .Attr("variable_names: list(string)") + .Input("tensor_refs: Ref(num_variables * T)") + .Output("output_refs: Ref(num_variables * T)") + .SetShapeFn(shape_inference::UnchangedShape) + .Doc(R"doc( +Perform an in-place Broadcast on (TF1-style) reference variables. All other +processes that do a broadcast on variables with the same names must have the +same dimensions for those variables. All variables must be located on the same +device and they must be of the same data type. + +This requires TensorFlow 2.6+. + +Arguments + root_rank: Rank that will send data, other ranks will receive data. + variable_names: Names associated to the variables (obtained via Python + framework) + +Input + tensor_refs: Variables to broadcast. They will be updated in-place + to the values from the root rank. +Output + output_refs: The updated variables. +)doc"); + +REGISTER_KERNEL_BUILDER( + Name("HorovodBroadcastInplaceResource").Device(DEVICE_CPU), + HorovodBroadcastInplaceOp); +#if HOROVOD_GPU_BROADCAST +REGISTER_KERNEL_BUILDER(Name("HorovodBroadcastInplaceResource") + .Device(DEVICE_GPU) + .HostMemory("resources"), + HorovodBroadcastInplaceOp); +#endif + +REGISTER_OP("HorovodBroadcastInplaceResource") + .Attr("root_rank: int") + .Attr("process_set_id: int = 0") + .Attr("num_variables: int") + .Attr("variable_names: list(string)") + .Input("resources: num_variables * resource") + .SetShapeFn(shape_inference::NoOutputs) + .Doc(R"doc( +Perform an in-place Broadcast on (TF2-style) resource variables. All other +processes that do a broadcast on variables with the same names must have the +same dimensions for those variables. All variables must be located on the same +device. + +This requires TensorFlow 2.6+. 
+ +Arguments + root_rank: Rank that will send data, other ranks will receive data. + variable_names: Names associated to the variables (obtained via Python + framework) + +Input + resources: Variables to broadcast. They will be updated in-place + to the values from the root rank. +)doc"); +#endif // TENSORFLOW_VERSION >= 2006000000 + class HorovodJoinOp : public AsyncOpKernel { public: explicit HorovodJoinOp(OpKernelConstruction* context) diff --git a/horovod/tensorflow/mpi_ops.py b/horovod/tensorflow/mpi_ops.py index be4c387d4e..8865e5843d 100644 --- a/horovod/tensorflow/mpi_ops.py +++ b/horovod/tensorflow/mpi_ops.py @@ -57,7 +57,6 @@ def _load_library(name): _basics = _HorovodBasics(__file__, 'mpi_lib') # import basic methods -init = _basics.init shutdown = _basics.shutdown is_initialized = _basics.is_initialized start_timeline = _basics.start_timeline @@ -84,6 +83,11 @@ def _load_library(name): Sum = _basics.Sum Adasum = _basics.Adasum +def init(*args, **kwargs): + _basics.init(*args, **kwargs) + # Call set up again to make sure the basics is in sync + _setup_process_sets(_basics) + is_homogeneous = _basics.is_homogeneous handle_average_backwards_compatibility = get_average_backwards_compatibility_fun(_basics) @@ -294,6 +298,43 @@ def _broadcast_grad(op, grad): return grad_reduced +def broadcast_(variables, root_rank, name=None, process_set=global_process_set): + """An op which broadcasts the input variables from the root rank to the same + input variables on all other Horovod processes. The operation is performed + in-place. + + The broadcast operation is keyed by the name of the op combined with the names + of the variables. The variable type and shape must be the same on all Horovod + processes for any given name. The broadcast will not start until all processes + are ready to send and receive all variables. In each process all variables need + to be located on the same device (CPU or GPU). + + Note: This is only supported with TensorFlow 2.6 or later. + + Returns: + The tensor values of the updated `variables` as broadcasted from root rank. + """ + from distutils.version import LooseVersion + if LooseVersion(tf.__version__) < LooseVersion('2.6.0'): + raise NotImplementedError("In-place broadcasts are only supported with TensorFlow 2.6 or later") + + from tensorflow.python.ops import resource_variable_ops + if all(resource_variable_ops.is_resource_variable(var) for var in variables): + with tf.control_dependencies( + [MPI_LIB.horovod_broadcast_inplace_resource([var.handle for var in variables], + variable_names=[var.name for var in variables], + name=name, + root_rank=root_rank, + process_set_id=process_set.process_set_id)]): + return [var.read_value() for var in variables] + elif all(not resource_variable_ops.is_resource_variable(var) for var in variables): + return MPI_LIB.horovod_broadcast_inplace(variables, variable_names=[var.name for var in variables], + name=name, root_rank=root_rank, + process_set_id=process_set.process_set_id) + else: + raise ValueError("All variables passed to broadcast_() should be of the same kind, resource or reference") + + def alltoall(tensor, splits=None, name=None, ignore_name_scope=False, process_set=global_process_set): """An op that scatters slices of the input tensor to all other Horovod processes and returns a tensor of gathered slices from all other Horovod processes. 
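Editor's note: a minimal usage sketch for the in-place broadcast API added above (illustrative only, not part of the patch; the variable names are hypothetical). hvd.broadcast_() updates the listed variables in place from the root rank, and broadcast_variables(..., inplace=True) wraps it by grouping the variables per device:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Two resource variables whose values should match across all workers.
w = tf.Variable(tf.random.normal([4, 4]), name="w")
b = tf.Variable(tf.zeros([4]), name="b")

# In-place broadcast of the listed variables from rank 0 (requires TF 2.6+).
hvd.broadcast_([w, b], root_rank=0)

# Equivalent call through the higher-level helper; with inplace=True it groups
# the variables by device and performs the same in-place broadcast.
hvd.broadcast_variables([w, b], root_rank=0, inplace=True)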
diff --git a/horovod/torch/CMakeLists.txt b/horovod/torch/CMakeLists.txt index dc327fa734..7f25d6810a 100644 --- a/horovod/torch/CMakeLists.txt +++ b/horovod/torch/CMakeLists.txt @@ -54,9 +54,9 @@ if(HAVE_CUDA) endif() endif() parse_version(${Pytorch_VERSION} VERSION_DEC) -add_definitions(-DTORCH_VERSION=${VERSION_DEC} -DTORCH_API_INCLUDE_EXTENSION_H=1) +add_definitions(-DPYTORCH_VERSION=${VERSION_DEC} -DTORCH_API_INCLUDE_EXTENSION_H=1) set(Pytorch_CXX11 ${Pytorch_CXX11} PARENT_SCOPE) -if(NOT Pytorch_VERSION VERSION_LESS "1.3.0") +if(NOT Pytorch_VERSION VERSION_LESS "1.5.0") set(CMAKE_CXX_STANDARD 14) endif() diff --git a/horovod/torch/__init__.py b/horovod/torch/__init__.py index 689e4e4333..df51620dc3 100644 --- a/horovod/torch/__init__.py +++ b/horovod/torch/__init__.py @@ -42,6 +42,7 @@ from horovod.torch.mpi_ops import broadcast, broadcast_async, broadcast_, broadcast_async_ from horovod.torch.mpi_ops import alltoall, alltoall_async from horovod.torch.mpi_ops import join + from horovod.torch.mpi_ops import barrier from horovod.torch.mpi_ops import poll, synchronize from horovod.torch.mpi_ops import init, shutdown from horovod.torch.mpi_ops import is_initialized, start_timeline, stop_timeline diff --git a/horovod/torch/adapter_v2.cc b/horovod/torch/adapter_v2.cc index 4591bf3daa..86bd7b39c1 100644 --- a/horovod/torch/adapter_v2.cc +++ b/horovod/torch/adapter_v2.cc @@ -46,9 +46,9 @@ TorchPersistentBuffer::TorchPersistentBuffer(int device, int64_t size) : device_(device) { with_device device_context(device_); if (device_ == CPU_DEVICE_ID) { - tensor_ = ::torch::empty(size, ::torch::device(::torch::kCPU).dtype(::torch::kByte)); + tensor_ = ::torch::empty({size}, ::torch::device(::torch::kCPU).dtype(::torch::kByte)); } else { - tensor_ = ::torch::empty(size, ::torch::device(::torch::kCUDA).dtype(::torch::kByte)); + tensor_ = ::torch::empty({size}, ::torch::device(::torch::kCUDA).dtype(::torch::kByte)); } } diff --git a/horovod/torch/cuda_util.cc b/horovod/torch/cuda_util.cc index ac4721ea81..9a1e6493eb 100644 --- a/horovod/torch/cuda_util.cc +++ b/horovod/torch/cuda_util.cc @@ -15,6 +15,7 @@ #if HAVE_GPU #include "cuda_runtime.h" +#include #include #else #include @@ -31,8 +32,8 @@ with_device::with_device(int device) { restore_device_ = CPU_DEVICE_ID; } else { #if HAVE_GPU - THCudaCheck(cudaGetDevice(&restore_device_)); - THCudaCheck(cudaSetDevice(device)); + C10_CUDA_CHECK(cudaGetDevice(&restore_device_)); + C10_CUDA_CHECK(cudaSetDevice(device)); #else throw std::logic_error("Internal error. Requested device context manager " "with GPU device but not compiled with CUDA."); @@ -43,7 +44,7 @@ with_device::with_device(int device) { with_device::~with_device() { #if HAVE_GPU if (restore_device_ != CPU_DEVICE_ID) { - THCudaCheck(cudaSetDevice(restore_device_)); + C10_CUDA_CHECK(cudaSetDevice(restore_device_)); } #endif } diff --git a/horovod/torch/elastic/sampler.py b/horovod/torch/elastic/sampler.py index 8a10624066..3468d28200 100644 --- a/horovod/torch/elastic/sampler.py +++ b/horovod/torch/elastic/sampler.py @@ -32,7 +32,7 @@ class ElasticSampler(torch.utils.data.Sampler): In order to use this object successfully it is recommended that the user: 1. Include this object in the `TorchState`. - 2. Call `record_batch` or `record_indices` after processing a set of samples. + 2. Call `record_batch` after processing a set of samples. 3. Call `set_epoch` at the end of each epoch to clear the processed indices. 
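Editor's note: the ElasticSampler docstring above recommends a specific usage pattern; a short sketch follows (illustrative only, not part of the patch; dataset, model, optimizer, train_step and num_epochs are hypothetical placeholders):

import torch
import horovod.torch as hvd

hvd.init()

sampler = hvd.elastic.ElasticSampler(dataset, shuffle=True)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

# 1. Include the sampler in the TorchState so it is saved and restored on rescale.
state = hvd.elastic.TorchState(model=model, optimizer=optimizer,
                               sampler=sampler, epoch=0, batch=0)

@hvd.elastic.run
def train(state):
    for epoch in range(state.epoch, num_epochs):
        for batch_idx, batch in enumerate(loader):
            train_step(batch)
            # 2. Record the samples processed in this batch.
            state.sampler.record_batch(batch_idx, batch_size=32)
            state.batch = batch_idx
            state.commit()
        # 3. Clear the processed-sample count at the end of the epoch.
        state.sampler.set_epoch(epoch + 1)
        state.epoch = epoch + 1

train(state)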
Args: @@ -54,6 +54,7 @@ def __init__(self, dataset, shuffle=True, seed=0): self.remaining_indices = [] self.num_samples = 0 self.total_size = 0 + self.processed_num = 0 self.reset() @@ -71,33 +72,22 @@ def set_epoch(self, epoch): epoch: Epoch number. """ self.epoch = epoch - self.processed_indices = set() + self.processed_num = 0 self.reset() def record_batch(self, batch_idx, batch_size): - """Record indices at batch `batch_idx` with length `batch_size` as processed.""" - indices = set(self.get_indices(batch_idx, batch_size)) - self.record_indices(indices) - - def record_indices(self, indices): - """Record set `indices` as processed.""" - self.processed_indices.update(indices) - - def get_indices(self, batch_idx, batch_size): - """Return list of indices at batch `batch_idx` with length `batch_size`.""" - start_idx = batch_idx * batch_size - end_idx = min(start_idx + batch_size, len(self.indices)) - return self.indices[start_idx:end_idx] + """Record the number of processed samples.""" + self.processed_num += (batch_size * self.num_replicas) def load_state_dict(self, state_dict): self.epoch = state_dict['epoch'] - self.processed_indices = state_dict['processed_indices'] + self.processed_num = state_dict["processed_num"] self.reset() def state_dict(self): return dict( epoch=self.epoch, - processed_indices=self.processed_indices + processed_num=self.processed_num ) def reset(self): @@ -105,18 +95,18 @@ def reset(self): self.rank = rank() # Exclude any samples we have already processed this epoch - self.remaining_indices = [idx for idx in range(len(self.dataset)) - if idx not in self.processed_indices] + all_indices = [idx for idx in range(len(self.dataset))] + if self.shuffle: + # Shuffle indices across workers deterministically in place + seed = self.seed + self.epoch + random.Random(seed).shuffle(all_indices) + self.remaining_indices = all_indices[self.processed_num:] self.num_samples = int(math.ceil(len(self.remaining_indices) * 1.0 / self.num_replicas)) self.total_size = self.num_samples * self.num_replicas def __iter__(self): self.indices = self.remaining_indices[:] - if self.shuffle: - # Shuffle indices across workers deterministically in place - seed = self.seed + self.epoch - random.Random(seed).shuffle(self.indices) # add extra samples to make it evenly divisible self.indices += self.indices[:(self.total_size - len(self.indices))] diff --git a/horovod/torch/elastic/state.py b/horovod/torch/elastic/state.py index 9806bb7cd5..946d987b2a 100644 --- a/horovod/torch/elastic/state.py +++ b/horovod/torch/elastic/state.py @@ -128,12 +128,7 @@ def restore(self): self.value.load_state_dict(self._saved_sampler_state) def sync(self): - # Get the set of processed indices from all workers - world_processed_indices = _union(allgather_object(self.value.processed_indices)) - - # Replace local processed indices with global indices state_dict = self.value.state_dict() - state_dict['processed_indices'] = world_processed_indices # Broadcast and load the state to make sure we're all in sync self.value.load_state_dict(broadcast_object(state_dict)) diff --git a/horovod/torch/mpi_ops.py b/horovod/torch/mpi_ops.py index bd85c54c43..1fbcc3bc3d 100644 --- a/horovod/torch/mpi_ops.py +++ b/horovod/torch/mpi_ops.py @@ -40,8 +40,8 @@ _NULL = "" _basics = _HorovodBasics(__file__, 'mpi_lib_v2') + # import basic methods -init = _basics.init is_initialized = _basics.is_initialized start_timeline = _basics.start_timeline stop_timeline = _basics.stop_timeline @@ -61,10 +61,18 @@ ccl_built = _basics.ccl_built cuda_built = 
_basics.cuda_built rocm_built = _basics.rocm_built + def shutdown(*args, **kwargs): mpi_lib.horovod_torch_reset() return _basics.shutdown(*args, **kwargs) +def init(*args, **kwargs): + global _handle_map + _handle_map = {} + _basics.init(*args, **kwargs) + # Call set up again to make sure the basics is in sync + _setup_process_sets(_basics) + # import reduction op values Average = _basics.Average Sum = _basics.Sum @@ -939,6 +947,7 @@ def synchronize(handle): output = _handle_map.pop(handle)[-1] return output except RuntimeError as e: + _handle_map.pop(handle, None) raise HorovodInternalError(e) @@ -963,3 +972,23 @@ def join(device=-1) -> int: _handle_map[handle] = (None, output) return synchronize(handle).item() + +def barrier(process_set=global_process_set): + """ + A function that acts as a simple sychronization point for ranks specified + in the given process group(default to global group). Ranks that reach + this function call will stall until all other ranks have reached. + + Arguments: + process_set: Process set object to limit this operation to a subset of + Horovod processes. Default is the global process set. + """ + + try: + handle = mpi_lib.horovod_torch_barrier(process_set.process_set_id) + except RuntimeError as e: + raise HorovodInternalError(e) + + _handle_map[handle] = (None, None) + + synchronize(handle) diff --git a/horovod/torch/mpi_ops_v2.cc b/horovod/torch/mpi_ops_v2.cc index 6ef7a47e06..8a11fe552d 100644 --- a/horovod/torch/mpi_ops_v2.cc +++ b/horovod/torch/mpi_ops_v2.cc @@ -16,12 +16,8 @@ // ============================================================================= #if HAVE_GPU -#if TORCH_VERSION >= 1005000000 #include #include -#else -#include -#endif #endif #include @@ -36,12 +32,6 @@ #include "handle_manager.h" #include "ready_event.h" -#if TORCH_VERSION < 1005000000 -#if HAVE_GPU -extern THCState* state; -#endif -#endif - namespace horovod { namespace torch { @@ -67,22 +57,16 @@ int GetDeviceID(const ::torch::Tensor& tensor) { } // namespace void DivideInPlace(::torch::Tensor& tensor, int divisor) { -#if TORCH_VERSION >= 1005000000 if (isIntegralType(tensor.scalar_type())) { tensor.floor_divide_(divisor); return; } -#endif tensor.div_(divisor); } #if HAVE_GPU gpuStream_t GetGPUStream(int device) { - #if TORCH_VERSION >= 1005000000 return c10::cuda::getCurrentCUDAStream(device); - #else - return THCState_getCurrentStreamOnDevice(state, device); - #endif } #endif @@ -361,7 +345,7 @@ int DoAllgatherCudaOnCPU(::torch::Tensor tensor, ::torch::Tensor output, ready_event_list.AddReadyEvent(RecordReadyEvent(device)); #endif - auto cpu_output = ::torch::empty_like(cpu_tensor); + auto cpu_output = ::torch::empty_like(cpu_tensor, LEGACY_CONTIGUOUS_MEMORY_FORMAT); auto hvd_cpu_output = std::make_shared(cpu_output); auto hvd_context = std::make_shared(CPU_DEVICE_ID, cpu_output); @@ -478,7 +462,7 @@ int DoAlltoall(::torch::Tensor tensor, ::torch::Tensor splits, // Deal with possibility of output_received_splits being on GPU auto received_splits_device = GetDeviceID(output_received_splits); auto cpu_received_splits = (received_splits_device != CPU_DEVICE_ID) - ? ::torch::empty_like(cpu_splits) + ? 
::torch::empty_like(cpu_splits, LEGACY_CONTIGUOUS_MEMORY_FORMAT) : output_received_splits; auto hvd_context = std::make_shared(device, output); hvd_context->AddOutput(CPU_DEVICE_ID, cpu_received_splits); @@ -531,13 +515,13 @@ int DoAlltoallCudaOnCPU(::torch::Tensor tensor, ::torch::Tensor splits, ready_event_list.AddReadyEvent(RecordReadyEvent(device)); #endif - auto cpu_output = ::torch::empty_like(cpu_tensor); + auto cpu_output = ::torch::empty_like(cpu_tensor, LEGACY_CONTIGUOUS_MEMORY_FORMAT); auto hvd_cpu_output = std::make_shared(cpu_output); // Deal with possibility of output_received_splits being on GPU auto received_splits_device = GetDeviceID(output_received_splits); auto cpu_received_splits = (received_splits_device != CPU_DEVICE_ID) - ? ::torch::empty_like(cpu_splits) + ? ::torch::empty_like(cpu_splits, LEGACY_CONTIGUOUS_MEMORY_FORMAT) : output_received_splits; auto hvd_context = std::make_shared(CPU_DEVICE_ID, cpu_output); @@ -603,6 +587,20 @@ int DoJoin(::torch::Tensor output_last_joined_rank, int device) { return handle; } +int DoBarrier(int process_set_id = 0) { + ThrowIfError(common::CheckInitialized()); + + auto handle = handle_manager.AllocateHandle(); + + auto enqueue_result = EnqueueBarrier( + [handle](const Status& status) mutable { + handle_manager.MarkDone(handle, status); + }, process_set_id); + ThrowIfError(enqueue_result); + + return handle; +} + int PollHandle(int handle) { return handle_manager.PollHandle(handle) ? 1 : 0; } void WaitAndClear(int handle) { @@ -618,7 +616,6 @@ void Reset() { handle_manager.Reset(); } - PYBIND11_MODULE(mpi_lib_v2, m) { // allreduce m.def("horovod_torch_allreduce_async_torch_IntTensor", &DoAllreduce); @@ -784,6 +781,9 @@ PYBIND11_MODULE(mpi_lib_v2, m) { // join m.def("horovod_torch_join", &DoJoin); + // barrier + m.def("horovod_torch_barrier", &DoBarrier); + // basics m.def("horovod_torch_poll", &PollHandle); m.def("horovod_torch_wait_and_clear", &WaitAndClear); diff --git a/horovod/torch/ready_event.cc b/horovod/torch/ready_event.cc index c8ad9033e6..03f09bac68 100644 --- a/horovod/torch/ready_event.cc +++ b/horovod/torch/ready_event.cc @@ -14,12 +14,8 @@ // ============================================================================= #if HAVE_GPU -#if TORCH_VERSION >= 1005000000 #include #include -#else -#include -#endif #include #include #include @@ -31,12 +27,6 @@ #include "ready_event.h" #include "cuda_util.h" -#if TORCH_VERSION < 1005000000 -#if HAVE_GPU -extern THCState* state; -#endif -#endif - namespace horovod { namespace torch { @@ -59,22 +49,12 @@ TorchReadyEvent::TorchReadyEvent(int device) : device_(device) { cuda_event_ = queue.front(); queue.pop(); } else { - #if TORCH_VERSION >= 1005000000 C10_CUDA_CHECK(cudaEventCreateWithFlags( &cuda_event_, cudaEventBlockingSync | cudaEventDisableTiming)); - #else - THCudaCheck(cudaEventCreateWithFlags( - &cuda_event_, cudaEventBlockingSync | cudaEventDisableTiming)); - #endif } } - #if TORCH_VERSION >= 1005000000 auto stream = c10::cuda::getCurrentCUDAStream(device_); C10_CUDA_CHECK(cudaEventRecord(cuda_event_, stream)); - #else - auto stream = THCState_getCurrentStreamOnDevice(state, device_); - THCudaCheck(cudaEventRecord(cuda_event_, stream)); - #endif } TorchReadyEvent::~TorchReadyEvent() { @@ -86,11 +66,7 @@ TorchReadyEvent::~TorchReadyEvent() { } bool TorchReadyEvent::Ready() const { - #if TORCH_VERSION >= 1005000000 C10_CUDA_CHECK(cudaEventSynchronize(cuda_event_)); - #else - THCudaCheck(cudaEventSynchronize(cuda_event_)); - #endif return true; } diff --git a/setup.py 
b/setup.py index 4d4ff499c4..94357e9c10 100644 --- a/setup.py +++ b/setup.py @@ -29,7 +29,9 @@ _FRAMEWORK_METADATA_FILE = 'horovod/metadata.json' class CMakeExtension(Extension): - def __init__(self, name, cmake_lists_dir='.', sources=[], **kwa): + def __init__(self, name, cmake_lists_dir='.', sources=None, **kwa): + if sources is None: + sources = [] Extension.__init__(self, name, sources=sources, **kwa) self.cmake_lists_dir = os.path.abspath(cmake_lists_dir) @@ -124,7 +126,8 @@ def build_extensions(self): 'pyspark>=3.0.0;python_version>="3.8"'] # Pin h5py: https://github.com/h5py/h5py/issues/1732 spark_require_list = ['h5py<3', 'numpy', 'petastorm>=0.11.0', 'pyarrow>=0.15.0', 'fsspec'] -ray_require_list = ['ray'] +# https://github.com/ray-project/ray/pull/17465 +ray_require_list = ['ray', 'aioredis<2'] pytorch_spark_require_list = pytorch_require_list + \ spark_require_list + \ pyspark_require_list diff --git a/test/integration/test_spark_keras.py b/test/integration/test_spark_keras.py index 747a63154f..2e107ea9fe 100644 --- a/test/integration/test_spark_keras.py +++ b/test/integration/test_spark_keras.py @@ -34,7 +34,7 @@ from horovod.spark.common import constants, util from horovod.spark.keras import remote from horovod.spark.keras.estimator import EstimatorParams -from horovod.spark.keras.util import _custom_sparse_to_dense_fn, _serialize_param_value, BareKerasUtil, TFKerasUtil +from horovod.spark.keras.util import _custom_sparse_to_dense_fn, _serialize_param_value, TFKerasUtil sys.path.append(os.path.join(os.path.dirname(__file__), os.pardir, 'utils')) @@ -424,134 +424,6 @@ def test_custom_sparse_to_dense_fn(self): assert sparse_vector_values[6] == 60 assert len(sparse_vector_values) == dense_shape - def test_convert_custom_sparse_to_dense_bare_keras_fn(self): - convert_custom_sparse_to_dense_bare_keras = BareKerasUtil._convert_custom_sparse_to_dense_fn() - custom_sparse_row = np.array([2, 1, 2, 0.1, 0.2]) - sparse_row = convert_custom_sparse_to_dense_bare_keras(custom_sparse_row, 4) - assert np.array_equal(sparse_row, np.array([0., 0.1, 0.2, 0.])) - - def test_prepare_data_bare_keras_fn(self): - metadata = \ - { - 'col1': { - 'dtype': float, - 'intermediate_format': 'nochange', - 'max_size': 1, - 'shape': 1 - }, - 'col2': { - 'dtype': 'float', - 'intermediate_format': 'nochange', - 'max_size': 1, - 'shape': 1 - }, - 'col3': { - 'dtype': SparseVector, - 'intermediate_format': 'custom_sparse_format', - 'max_size': 7, - 'shape': 10 - } - } - prepare_data_bare_keras = BareKerasUtil._prepare_data_fn(metadata) - - col1 = np.array([1., 2., 3.]) - col1_prepared = prepare_data_bare_keras(col1, 'col1', [-1, 3]) - assert col1_prepared.shape == (1, 3) - assert np.array_equal(col1_prepared, np.array([[1., 2., 3.]])) - - col3 = [np.array([3., 0., 2., 5., 0., 0.2, 0.5, 0, 0]), - np.array([4., 0., 2., 5., 6., 0.2, 0.5, 0.6, 0])] - - col3_prepared = prepare_data_bare_keras(col3, 'col3', [-1, 10]) - - assert col3_prepared.shape == (2, 10) - assert np.array_equal(col3_prepared, np.array( - [[0., 0., 0.2, 0., 0., 0.5, 0., 0., 0., 0.], [0.2, 0., 0.5, 0., 0., 0.6, 0., 0., 0., 0.]])) - - def test_batch_generator_fn(self): - shuffle_buffer_size = 10 - rows_in_row_group = 100 - batch_size = 32 - - def _create_numpy_array(n_rows, shape): - return np.array([[i for i in range(j, j + shape)] for j in range(n_rows)]) - - """A dummy reader class only run 1 epoch (2 rows of data) for each iteration""" - class DummyReader(): - def __init__(self): - self._in_iter = False - - def __iter__(self): - if self._in_iter: - 
raise RuntimeError('Do not support resetting a dummy reader while in the middle of iteration.') - - self._in_iter = True - Row = collections.namedtuple('row', ['col1', 'col2', 'sample_weight', 'label']) - - col11 = _create_numpy_array(rows_in_row_group, 1) - col21 = _create_numpy_array(rows_in_row_group, 10) - label1 = _create_numpy_array(rows_in_row_group, 8) - sw1 = np.array([i / 100. for i in range(rows_in_row_group)]) - - row1 = Row(col1=col11, col2=col21, label=label1, sample_weight=sw1) - - col12 = _create_numpy_array(rows_in_row_group, 1) - col22 = _create_numpy_array(rows_in_row_group, 10) - label2 = _create_numpy_array(rows_in_row_group, 8) - sw2 = np.array([i / 100. for i in range(rows_in_row_group)]) - row2 = Row(col1=col12, col2=col22, label=label2, sample_weight=sw2) - try: - yield row1 - yield row2 - finally: - self._in_iter = False - - metadata = \ - { - 'col1': { - 'dtype': float, - 'intermediate_format': constants.NOCHANGE, - 'max_size': 1, - 'shape': 1 - }, - 'col2': { - 'dtype': DenseVector, - 'intermediate_format': constants.ARRAY, - 'max_size': 10, - 'shape': 10 - }, - 'label': { - 'dtype': float, - 'intermediate_format': constants.NOCHANGE, - 'max_size': 1, - 'shape': 1 - }, - } - - reader = DummyReader() - - feature_columns = ['col1', 'col2'] - label_columns = ['label'] - sample_weight_col = 'sample_weight' - - input_shapes = [[-1, 1], [-1, 2, 5]] - output_shapes = [[-1, 2, 4]] - - batch_generator = BareKerasUtil._batch_generator_fn( - feature_columns, label_columns, sample_weight_col, - input_shapes, output_shapes, metadata) - - for shuffle in [True, False]: - batch_gen = batch_generator(reader, batch_size, shuffle_buffer_size, shuffle=shuffle) - - for _ in range(10): - batch = next(batch_gen) - assert batch[0][0][0].shape == (1,) - assert batch[0][1][0].shape == (2, 5) - assert batch[1][0][0].shape == (2, 4) - # sample weight has to be a singel np array with shape (batch_size,) - assert batch[2][0].shape == (batch_size,) - def test_reshape(self): metadata = \ { diff --git a/test/integration/test_spark_lightning.py b/test/integration/test_spark_lightning.py index 435c7dcf49..6c784b58ae 100644 --- a/test/integration/test_spark_lightning.py +++ b/test/integration/test_spark_lightning.py @@ -913,6 +913,39 @@ def val_dataloader(self): assert len(pred) == 1 assert pred.dtype == torch.float32 + """ + Test override trainer args. + """ + def test_model_override_trainer_args(self): + from pytorch_lightning.callbacks.model_checkpoint import ModelCheckpoint + + with spark_session('test_fit_model') as spark: + df = create_noisy_xor_data(spark) + model = create_xor_model() + + with tempdir() as dir: + + with local_store() as store: + torch_estimator = hvd_spark.TorchEstimator( + num_proc=2, + store=store, + model=model, + input_shapes=[[-1, 2]], + feature_cols=['features'], + label_cols=['y'], + validation=0.2, + batch_size=4, + epochs=2, + verbose=2, + trainer_args={'stochastic_weight_avg': True}) + + torch_model = torch_estimator.fit(df) + + # TODO: Find a way to pass log metrics from remote, and assert base on the logger. 
+ trained_model = torch_model.getModel() + pred = trained_model(torch.ones([1, 2], dtype=torch.int32)) + assert len(pred) == 1 + assert pred.dtype == torch.float32 def check_fail(dir, rank, epoch, batch): if dir: diff --git a/test/integration/test_static_run.py b/test/integration/test_static_run.py index 34e90714de..88322f8667 100644 --- a/test/integration/test_static_run.py +++ b/test/integration/test_static_run.py @@ -29,7 +29,7 @@ from horovod.runner.common.util import safe_shell_exec from horovod.runner import _HorovodArgs from horovod.runner.launch import _check_all_hosts_ssh_successful, _run -from horovod.runner.mpi_run import mpi_available, is_mpich, is_intel_mpi +from horovod.runner.mpi_run import mpi_available sys.path.append(os.path.join(os.path.dirname(__file__), os.pardir, 'utils')) @@ -141,10 +141,6 @@ def test_run_success(self, controller, mode, run): if controller == 'mpi': if not (mpi_built() and mpi_available()): self.skipTest("MPI is not available") - if is_mpich(): - self.skipTest("MPICH is not testable") - if is_intel_mpi(): - self.skipTest("Intel(R) MPI is not testable because it is based on MPICH") self.do_test_run_with_controller_success(controller, mode, run) @@ -156,10 +152,6 @@ def test_run_failure(self, controller, mode, run): if controller == 'mpi': if not (mpi_built() and mpi_available()): self.skipTest("MPI is not available") - if is_mpich(): - self.skipTest("MPICH is not testable") - if is_intel_mpi(): - self.skipTest("Intel(R) MPI is not testable because it is based on MPICH") self.do_test_run_with_controller_failure(controller, mode, run) diff --git a/test/parallel/base_test_mxnet.py b/test/parallel/base_test_mxnet.py index 8a48c32d0b..d2d3f13530 100644 --- a/test/parallel/base_test_mxnet.py +++ b/test/parallel/base_test_mxnet.py @@ -33,7 +33,10 @@ from mxnet.test_utils import almost_equal, same import horovod.mxnet as hvd - has_gpu = mx.context.num_gpus() > 0 + try: + has_gpu = mx.context.num_gpus() > 0 + except AttributeError: + has_gpu = mx.device.num_gpus() > 0 ccl_supported_types = set(['int32', 'int64', 'float32', 'float64']) diff --git a/test/parallel/test_tensorflow.py b/test/parallel/test_tensorflow.py index 05067d820a..35963f0d0e 100644 --- a/test/parallel/test_tensorflow.py +++ b/test/parallel/test_tensorflow.py @@ -29,6 +29,9 @@ import tensorflow as tf from horovod.tensorflow.util import _executing_eagerly from tensorflow.python.framework import ops +from tensorflow.python.ops import resource_variable_ops +from tensorflow.python.ops import variables as tf_ops_variables + import warnings import horovod.tensorflow as hvd @@ -2222,6 +2225,163 @@ def test_horovod_broadcast_gpu(self): tf.cast(root_tensor, tf.int32), tf.cast(broadcasted_tensor, tf.int32)))), "hvd.broadcast produces incorrect broadcasted tensor") + def test_horovod_broadcast_inplace_cpu(self): + """Test that the inplace broadcast correctly broadcasts 1D, 2D, 3D variables on CPU.""" + if LooseVersion(tf.__version__) < LooseVersion('2.6.0'): + self.skipTest("Custom Ops using resource variables only work with TF 2.6+") + + hvd.init() + rank = hvd.rank() + size = hvd.size() + + # This test does not apply if there is only one worker. 
+ if size == 1: + self.skipTest("Only one worker available") + + dtypes = [tf.uint8, tf.int8, + tf.int32, tf.int64, tf.float16, tf.float32, + tf.float64, tf.bool] + dims = [1, 2, 3] + root_ranks = list(range(size)) + for use_resource in [False, True]: + if not use_resource and _executing_eagerly(): + continue + for dtype, dim, root_rank in itertools.product(dtypes, dims, root_ranks): + with tf.device("/cpu:0"): + if dtype == tf.bool: + initial_value = tf.cast((tf.ones([17] * dim) * rank) % 2, dtype) + else: + initial_value = tf.cast(tf.ones([17] * dim) * rank, dtype) + if not hvd._executing_eagerly(): + if use_resource: + var = resource_variable_ops.ResourceVariable(initial_value) + else: + var = tf_ops_variables.RefVariable(initial_value) + init = tf.compat.v1.global_variables_initializer() + self.evaluate(init) + else: + assert use_resource + var = self.tfe.Variable(initial_value) + root_tensor = tf.ones([17] * dim) * root_rank + if dtype == tf.bool: + root_tensor = root_tensor % 2 + broadcasted_tensor, = hvd.broadcast_([var], root_rank) + self.assertEqual(var.dtype.base_dtype, dtype) + self.assertEqual(broadcasted_tensor.dtype.base_dtype, dtype) + np.testing.assert_array_equal(self.evaluate(broadcasted_tensor), self.evaluate(var), + err_msg="broadcasted_var and var may not differ, actually they should have the same underlying buffer") + self.assertTrue( + self.evaluate(tf.reduce_all(tf.equal( + tf.cast(root_tensor, tf.int32), tf.cast(broadcasted_tensor, tf.int32)))), + "Inplace hvd.broadcast_ produces incorrect broadcasted variable value") + + def test_horovod_broadcast_inplace_gpu(self): + """Test that the inplace broadcast correctly broadcasts 1D, 2D, 3D variables on GPU.""" + if LooseVersion(tf.__version__) < LooseVersion('2.6.0'): + self.skipTest("Custom Ops using resource variables only work with TF 2.6+") + + if not tf.test.is_gpu_available(cuda_only=True): + self.skipTest("No GPUs available") + + if int(os.environ.get('HOROVOD_MIXED_INSTALL', 0)): + # Skip if compiled with CUDA but without HOROVOD_GPU_OPERATIONS. + self.skipTest("Not compiled with HOROVOD_GPU_OPERATIONS") + + hvd.init() + rank = hvd.rank() + local_rank = hvd.local_rank() + size = hvd.size() + + # This test does not apply if there is only one worker. 
+ if size == 1: + self.skipTest("Only one worker available") + + # dtypes that are supported both for variable assignments and by Horovod + dtypes = [tf.int64, tf.float16, tf.float32, tf.float64] + dims = [1, 2, 3] + root_ranks = list(range(size)) + for use_resource in [False, True]: + if not use_resource and _executing_eagerly(): + continue + for counter, (dtype, dim, root_rank) in enumerate(itertools.product(dtypes, dims, root_ranks)): + with tf.device("/gpu:%d" % local_rank): + if dtype == tf.bool: + initial_value = tf.cast((tf.ones([17] * dim) * rank) % 2, dtype) + else: + initial_value = tf.cast(tf.ones([17] * dim) * rank, dtype) + root_tensor = tf.ones([17] * dim) * root_rank + if dtype == tf.bool: + root_tensor = root_tensor % 2 + if not hvd._executing_eagerly(): + if use_resource: + var = resource_variable_ops.ResourceVariable(initial_value) + else: + var = tf_ops_variables.RefVariable(initial_value) + init = tf.compat.v1.global_variables_initializer() + self.evaluate(init) + else: + assert use_resource + var = self.tfe.Variable(initial_value) + broadcasted_tensor, = hvd.broadcast_([var], root_rank) + self.assertEqual(var.dtype.base_dtype, dtype) + self.assertEqual(broadcasted_tensor.dtype.base_dtype, dtype) + np.testing.assert_array_equal(self.evaluate(broadcasted_tensor), self.evaluate(var), + err_msg="broadcasted_var and var may not differ, actually they should have the same underlying buffer") + self.assertTrue( + self.evaluate(tf.reduce_all(tf.equal( + tf.cast(root_tensor, tf.int32), tf.cast(broadcasted_tensor, tf.int32)))), + "Inplace hvd.broadcast_ produces incorrect broadcasted variable value") + + def test_horovod_broadcast_inplace_multiple_cpu(self): + """Test that the inplace broadcast correctly broadcasts multiple variables on CPU.""" + if LooseVersion(tf.__version__) < LooseVersion('2.6.0'): + self.skipTest("Custom Ops using resource variables only work with TF 2.6+") + + hvd.init() + rank = hvd.rank() + size = hvd.size() + + # This test does not apply if there is only one worker. 
+ if size == 1: + self.skipTest("Only one worker available") + + dtypes = [tf.float32] + dims = [1, 2, 3] + root_ranks = list(range(size)) + for use_resource in [False, True]: + if not use_resource and _executing_eagerly(): + continue + for dtype, root_rank in itertools.product(dtypes, root_ranks): + with tf.device("/cpu:0"): + variables = [] + root_tensors = [] + for dim in dims: + initial_value = tf.cast(tf.ones([17] * dim) * rank, dtype) + if not hvd._executing_eagerly(): + if use_resource: + var = resource_variable_ops.ResourceVariable(initial_value, name=f"dim_{dim}_var") + else: + var = tf_ops_variables.RefVariable(initial_value, name=f"dim_{dim}_var") + init = tf.compat.v1.global_variables_initializer() + self.evaluate(init) + else: + assert use_resource + var = self.tfe.Variable(initial_value, name=f"dim_{dim}_var") + root_tensor = tf.ones([17] * dim) * root_rank + variables.append(var) + root_tensors.append(root_tensor) + + broadcasted_tensors = hvd.broadcast_(variables, root_rank) + for broadcasted_tensor, var, root_tensor in zip(broadcasted_tensors, variables, root_tensors): + self.assertEqual(var.dtype.base_dtype, dtype) + self.assertEqual(broadcasted_tensor.dtype.base_dtype, dtype) + np.testing.assert_array_equal(self.evaluate(broadcasted_tensor), self.evaluate(var), + err_msg="broadcasted_var and var may not differ, actually they should have the same underlying buffer") + self.assertTrue( + self.evaluate(tf.reduce_all(tf.equal( + tf.cast(root_tensor, tf.int32), tf.cast(broadcasted_tensor, tf.int32)))), + "Inplace hvd.broadcast_ produces incorrect broadcasted variable value") + def test_horovod_broadcast_cpu_process_sets(self): """Test that the broadcast correctly broadcasts 1D, 2D, 3D tensors on CPU if restricted to non-global process sets""" diff --git a/test/parallel/test_tensorflow2_keras.py b/test/parallel/test_tensorflow2_keras.py index c2eef8cb03..df61e2e521 100644 --- a/test/parallel/test_tensorflow2_keras.py +++ b/test/parallel/test_tensorflow2_keras.py @@ -31,13 +31,15 @@ import horovod.tensorflow.keras as hvd -_PRE_TF_2_4_0 = LooseVersion(tf.__version__) < LooseVersion("2.4.0") _PRE_TF_2_2_0 = LooseVersion(tf.__version__) < LooseVersion("2.2.0") -# Set environment variable to enable adding/removing process sets after initializing Horovod. +# Set environment variable to enable adding/removing process sets after +# initializing Horovod. os.environ["HOROVOD_DYNAMIC_PROCESS_SETS"] = "1" -@pytest.mark.skipif(LooseVersion(tf.__version__) < LooseVersion('2.0.0'), reason='TensorFlow v2 tests') + +@pytest.mark.skipif(LooseVersion(tf.__version__) < + LooseVersion('2.0.0'), reason='TensorFlow v2 tests') class Tf2KerasTests(tf.test.TestCase): """ Tests for ops in horovod.tensorflow.keras. 
@@ -52,16 +54,15 @@ def __init__(self, *args, **kwargs): for gpu in gpus: tf.config.experimental.set_memory_growth(gpu, True) if gpus: - tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU') + tf.config.experimental.set_visible_devices( + gpus[hvd.local_rank()], 'GPU') def test_train_model_lr_schedule(self): - lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay( - 0.001 * hvd.size(), - decay_steps=100000, - decay_rate=0.96, - staircase=True) - opt = tf.keras.optimizers.Adam(lr_schedule) + initial_lr = 0.1 * hvd.size() + opt = tf.keras.optimizers.Adam() opt = hvd.DistributedOptimizer(opt) + def linear_multiplier(epoch): + return epoch model = keras.models.Sequential() model.add(keras.layers.Dense(2, input_shape=(3,))) @@ -71,17 +72,64 @@ def test_train_model_lr_schedule(self): optimizer=opt, metrics=[keras.metrics.categorical_accuracy], experimental_run_tf_function=False) - - x = np.random.random((1, 3)) - y = np.random.random((1, 3, 2)) - - # No assertions, we just need to verify that it doesn't hang or error - callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)] - model.fit(x, - y, - steps_per_epoch=10, - callbacks=callbacks, - epochs=1) + x = np.random.random((10, 3)) + y = np.random.random((10, 3, 2)) + + class StoreLearningRateCallback(tf.keras.callbacks.Callback): + def on_epoch_end(self, epoch, logs=None): + # test learning rate warmup + lr = self.model.optimizer.lr.numpy() + if epoch >= 0 and epoch < 5: + assert lr <= initial_lr or np.isclose(lr, initial_lr) + + # # test learning rate schedule callback + if epoch > 5 and epoch < 10: + assert lr <= initial_lr * \ + 1e-1 or np.isclose(lr, initial_lr * 1e-1) + if epoch > 10 and epoch < 15: + assert lr < initial_lr * \ + 1e-2 or np.isclose(lr, initial_lr * 1e-2) + if epoch >= 15 and epoch < 20: + assert np.isclose( + lr, initial_lr * linear_multiplier(epoch)) + + # No assertions needed for BroadcastGlobalVariableCallbacks + # We just need to verify that it doesn't hang or error + callbacks = [ + hvd.callbacks.BroadcastGlobalVariablesCallback(0), + hvd.callbacks.MetricAverageCallback(), + hvd.callbacks.LearningRateWarmupCallback( + initial_lr=initial_lr, + warmup_epochs=5), + hvd.callbacks.LearningRateScheduleCallback( + initial_lr=initial_lr, + multiplier=1e-1, + start_epoch=5, + end_epoch=10), + hvd.callbacks.LearningRateScheduleCallback( + initial_lr=initial_lr, + multiplier=1e-2, + start_epoch=10, + end_epoch=15), + hvd.callbacks.LearningRateScheduleCallback( + initial_lr=initial_lr, + multiplier=linear_multiplier, + start_epoch=15, + end_epoch=20), + StoreLearningRateCallback()] + train_history = model.fit(x, + y, + steps_per_epoch=5, + callbacks=callbacks, + epochs=20) + + # test that the metrics average is being respected + loss_metrics = train_history.history["loss"] + loss_metrics_tensor = tf.convert_to_tensor( + loss_metrics, dtype=tf.float32) + expected_loss_metrics_tensor = hvd.broadcast( + loss_metrics_tensor, root_rank=0) + self.assertAllClose(expected_loss_metrics_tensor, loss_metrics_tensor) def test_sparse_as_dense(self): opt = keras.optimizers.RMSprop(lr=0.0001) @@ -116,7 +164,7 @@ def test_elastic_state(self): ]) model1.build((2, 2)) model1.set_weights( - [np.array([[v, v], [v, v]], dtype=np.float32), + [np.array([[v, v], [v, v]], dtype=np.float32), np.array([v, v], dtype=np.float32)]) model2 = tf.keras.Sequential([ @@ -124,12 +172,18 @@ def test_elastic_state(self): ]) model2.build((2, 2)) model2.set_weights( - [np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32), + 
[np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32), np.array([0.0, 0.0], dtype=np.float32)]) optimizer = tf.optimizers.Adam(0.001 * hvd.size()) - state = hvd.elastic.KerasState(model1, optimizer, batch=20 + hvd.rank(), epoch=10 + hvd.rank()) + state = hvd.elastic.KerasState( + model1, + optimizer, + batch=20 + + hvd.rank(), + epoch=10 + + hvd.rank()) state.sync() model1_weights = model1.get_weights() @@ -166,13 +220,12 @@ def test_elastic_state(self): assert state.batch == 21 assert state.epoch == 11 - @pytest.mark.skipif(LooseVersion(tf.__version__) >= LooseVersion('2.4.0'), - reason='TensorFlow 2.4.0+ does not support this path') def test_gradient_aggregation(self): class TestingOptimizer(optimizer_v2.OptimizerV2): """ Custom optimizer we use for testing gradient aggregation. """ + def get_config(self): config = super(TestingOptimizer, self).get_config() return config @@ -196,8 +249,6 @@ def compute_expected_value(batch_id): sum_per_aggregation = 0.0 for _ in range(backward_passes_per_step): grads_for_batch = 0.0 - for rank in range(hvd.size()): - grads_for_batch += rank # Apply `average_aggregated_gradients`. grads_for_batch /= float(backward_passes_per_step) @@ -205,35 +256,36 @@ def compute_expected_value(batch_id): # Averages across workers. sum_per_aggregation += grads_for_batch / float(hvd.size()) - aggregations_completed = math.floor((batch_id + 1) / backward_passes_per_step) + aggregations_completed = math.floor( + (batch_id + 1) / backward_passes_per_step) return aggregations_completed * sum_per_aggregation @tf.function - def apply_gradients_in_tf_function(gradient_updates, model_variables, **kwargs): + def apply_gradients_in_tf_function(grads_and_vars, **kwargs): # Apply gradient updates in tf.function to reproduce how it is # done inside `model.fit()`. - hvd_optimizer.apply_gradients(zip(gradient_updates, model_variables), **kwargs) + hvd_optimizer.apply_gradients(grads_and_vars, **kwargs) - gradients = [tf.constant([float(hvd.rank())])] - variables = [tf.Variable([0.0])] + var = tf.Variable([0.0]) + variables = [var] + def loss(): + return (var - var) for idx in range(10): if _PRE_TF_2_2_0: - updated_gradients = hvd_optimizer._allreduce(gradients, variables) - apply_gradients_in_tf_function(updated_gradients, variables) - elif _PRE_TF_2_4_0: + grads_and_vars = hvd_optimizer._compute_gradients( + loss, var_list=variables) + apply_gradients_in_tf_function(grads_and_vars, variables) + else: # In 2.2 and 2.3 the horovod optimizer sets `_HAS_AGGREGATE_GRAD = True`. # This configures tf.keras to call `_aggregate_gradients()` outside of # `apply_gradients()` and to set `experimental_aggregate_gradients` to # False when calling `apply_gradients()` to prevent it from calling # `_aggregate_gradients()` again. 
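The comment block above describes when tf.keras hands gradient aggregation to the optimizer; the bookkeeping in compute_expected_value is then plain arithmetic. A standalone sketch of that arithmetic follows, assuming averaging across micro-batches and across workers as in the test; the helper name below is illustrative and not part of the patch.

import math

def expected_value(batch_id, backward_passes_per_step, grad_per_pass, world_size):
    # Each aggregation averages backward_passes_per_step micro-batch gradients
    # and then averages the result across world_size workers.
    sum_per_aggregation = 0.0
    for _ in range(backward_passes_per_step):
        grads_for_batch = grad_per_pass / float(backward_passes_per_step)
        sum_per_aggregation += grads_for_batch / float(world_size)
    # Updates are only applied once every backward_passes_per_step batches.
    aggregations_completed = math.floor((batch_id + 1) / backward_passes_per_step)
    return aggregations_completed * sum_per_aggregation

# With the patched test the loss is (var - var), so every gradient is zero
# and the variable is expected to remain at 0.0 for every batch.
assert expected_value(9, 4, 0.0, 2) == 0.0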
- updated_gradients = hvd_optimizer._aggregate_gradients( - zip(gradients, variables)) + + grads_and_vars = hvd_optimizer._compute_gradients( + loss, var_list=variables) apply_gradients_in_tf_function( - updated_gradients, variables, - experimental_aggregate_gradients=False - ) - else: - raise RuntimeError("This test should be skipped ...") + grads_and_vars, experimental_aggregate_gradients=False) updated_variable_value = variables[0][0].numpy() assert updated_variable_value == compute_expected_value(idx) @@ -251,13 +303,17 @@ def test_process_set_optimizer(self): class TestOptimizer(keras.optimizers.Optimizer): def __init__(self, name, **kwargs): super(TestOptimizer, self).__init__(name, **kwargs) + def get_gradients(self, loss, params): assert len(params) == 1 return [tf.constant([float(hvd.rank())])] + def _create_slots(self, var_list): pass + def _resource_apply_dense(self, grad, var, apply_state): return var.assign_add(grad) + def get_config(self): config = super(TestOptimizer, self).get_config() return config @@ -271,7 +327,10 @@ def get_config(self): computed_value = variable.numpy() if subset.included(): - self.assertAlmostEqual(computed_value, sum(range(0, size, 2)) / subset.size()) + self.assertAlmostEqual( + computed_value, sum( + range( + 0, size, 2)) / subset.size()) else: self.assertAlmostEqual(computed_value, float(hvd.rank())) diff --git a/test/parallel/test_torch.py b/test/parallel/test_torch.py index add8873275..1e5f42932d 100644 --- a/test/parallel/test_torch.py +++ b/test/parallel/test_torch.py @@ -28,6 +28,7 @@ import json from collections.abc import Iterable +from datetime import datetime import numpy as np import pytest @@ -42,6 +43,8 @@ from common import mpi_env_rank_and_size, skip_or_fail_gpu_test, temppath _1_5_api = LooseVersion(torch.__version__) >= LooseVersion('1.5.0') +_1_10_api = LooseVersion(torch.__version__) >= LooseVersion('1.10.0') +_is_mac = platform.system() == 'Darwin' ccl_supported_types = set([torch.ByteTensor, torch.CharTensor, torch.ShortTensor, torch.IntTensor, torch.LongTensor, torch.FloatTensor, @@ -62,6 +65,14 @@ def __init__(self, *args, **kwargs): super(TorchTests, self).__init__(*args, **kwargs) warnings.simplefilter('module') + def setup(self): + hvd.init() + + def tearDown(self): + gloo_rank = int(os.getenv('HOROVOD_RANK', -1)) + if hvd.is_initialized() and not _is_mac and gloo_rank != -1: + hvd.shutdown() + def convert_cpu_fp16_to_fp32(self, *values): # PyTorch doesn't support any CPU ops on FP16 tensors. # In case we need to do ops, we will convert tensor to FP32 here. @@ -87,7 +98,6 @@ def test_gpu_required(self): if not torch.cuda.is_available(): skip_or_fail_gpu_test(self, "No GPUs available") - @pytest.mark.skipif(platform.system() == 'Darwin', reason='Reinit not supported on macOS') def test_horovod_reinit(self): """Test that Horovod can init -> shutdown -> init successfully.""" mpi_rank, _ = mpi_env_rank_and_size() @@ -3170,7 +3180,74 @@ def create_model(opt_class, opt_params, process_set): hvd.remove_process_set(odd_set) hvd.remove_process_set(even_set) + def test_process_set_barrier_op(self): + """Test that process set barrier stalls all ranks in that process set""" + hvd.init() + + # No need to test if only one rank is available + if hvd.size() == 1: + self.skipTest("Number of ranks is 1. 
Skipping test.") + + even_ranks = [rk for rk in range(0, hvd.size()) if rk % 2 == 0] + odd_ranks = [rk for rk in range(0, hvd.size()) if rk % 2 == 1] + even_set = hvd.add_process_set(even_ranks) + odd_set = hvd.add_process_set(odd_ranks) + + # Make sure all ranks are initialized + i = 0 + while hvd.allreduce_(torch.IntTensor([int(hvd.is_initialized())]), None, 'is_initialized{}'.format(i), hvd.Sum) != hvd.size(): + i+=1 + continue + + even_barrier_time = 0 + odd_barrier_time = 0 + even_barrier_time_start = datetime.now() + odd_barrier_time_start = datetime.now() + if hvd.rank() in even_ranks: + # rank 0 sleeps for 5 seconds + if hvd.rank() == 0: + time.sleep(5) + hvd.barrier(even_set) + # barrier time should be at least 5 seconds for all even ranks + even_barrier_time_end = datetime.now() + even_barrier_time = (even_barrier_time_end - even_barrier_time_start).total_seconds() + self.assertTrue(even_barrier_time >= 5) + # No stall time for odd ranks + elif hvd.rank() in odd_ranks: + hvd.barrier(odd_set) + odd_barrier_time_end = datetime.now() + odd_barrier_time = (odd_barrier_time_end - odd_barrier_time_start).total_seconds() + self.assertTrue(odd_barrier_time <= 1) + + hvd.barrier() + + def test_global_barrier_op(self): + """Test that global barrier stalls all ranks""" + hvd.init() + + # No need to test if only one rank is available + if hvd.size() == 1: + self.skipTest("Number of ranks is 1. Skipping test.") + + # Make sure all ranks are initialized + i = 0 + while hvd.allreduce_(torch.IntTensor([int(hvd.is_initialized())]), None, 'is_initialized{}'.format(i), hvd.Sum) != hvd.size(): + i+=1 + continue + + # Sleep rank 0 for 5 seconds, all the other ranks will arrive barrier right away. + barrier_time = 0 + barrier_time_start = datetime.now() + if hvd.rank() == 0: + time.sleep(5) + hvd.barrier() + + # barrier time should be at least 5 seconds for all ranks + barrier_time_end = datetime.now() + barrier_time = (barrier_time_end - barrier_time_start).total_seconds() + + self.assertTrue(barrier_time >= 5) if __name__ == "__main__": unittest.main() diff --git a/test/single/data/expected_buildkite_gpu_heads_pipeline.yaml b/test/single/data/expected_buildkite_gpu_heads_pipeline.yaml new file mode 100644 index 0000000000..1ded040a6a --- /dev/null +++ b/test/single/data/expected_buildkite_gpu_heads_pipeline.yaml @@ -0,0 +1,147 @@ +steps: +- label: ':docker: Build test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2' + plugins: + - docker-compose#v3.5.0: + build: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 + image-repository: 823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite + cache-from: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2:823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite:SLUG-test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2-latest + config: docker-compose.test.yml + push-retries: 5 + - ecr#v1.2.0: + login: true + timeout_in_minutes: 40 + retry: + automatic: true + agents: + queue: cpu +- wait +- wait +- label: ':pytest: Gloo Parallel PyTests (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' + command: bash -c "HOROVOD_TEST_GPU=1 cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" + artifact_paths: "artifacts/**" + plugins: + - docker-compose#v3.5.0: + run: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 + volumes: "./artifacts:/artifacts" + config: 
docker-compose.test.yml + pull-retries: 3 + - ecr#v1.2.0: + login: true + timeout_in_minutes: 10 + retry: + automatic: true + agents: + queue: 4x-gpu-v510 +- label: ':pytest: Gloo Single PyTests (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' + command: bash -c "HOROVOD_TEST_GPU=1 cd /horovod/test/single && (ls -1 test_*.py | xargs -n 1 /bin/bash /pytest_standalone.sh gloo)" + artifact_paths: "artifacts/**" + plugins: + - docker-compose#v3.5.0: + run: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 + volumes: "./artifacts:/artifacts" + config: docker-compose.test.yml + pull-retries: 3 + - ecr#v1.2.0: + login: true + timeout_in_minutes: 15 + retry: + automatic: true + agents: + queue: 4x-gpu-v510 +- label: ':pytest: Gloo Cluster PyTests (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' + command: bash -c "HOROVOD_TEST_GPU=1 /etc/init.d/ssh start && cd /horovod/test/integration && pytest --forked -v --capture=fd --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.static.xml test_static_run.py" + artifact_paths: "artifacts/**" + plugins: + - docker-compose#v3.5.0: + run: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 + volumes: "./artifacts:/artifacts" + config: docker-compose.test.yml + pull-retries: 3 + - ecr#v1.2.0: + login: true + timeout_in_minutes: 10 + retry: + automatic: true + agents: + queue: 4x-gpu-v510 +- wait +- label: ':tensorflow: Gloo TensorFlow 2.0 MNIST (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' + command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/tensorflow2/tensorflow2_mnist.py + artifact_paths: "artifacts/**" + plugins: + - docker-compose#v3.5.0: + run: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 + volumes: "./artifacts:/artifacts" + config: docker-compose.test.yml + pull-retries: 3 + - ecr#v1.2.0: + login: true + timeout_in_minutes: 10 + retry: + automatic: true + agents: + queue: 2x-gpu-v510 +- label: ':tensorflow: Gloo TensorFlow 2.0 Keras MNIST (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' + command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/tensorflow2/tensorflow2_keras_mnist.py + artifact_paths: "artifacts/**" + plugins: + - docker-compose#v3.5.0: + run: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 + volumes: "./artifacts:/artifacts" + config: docker-compose.test.yml + pull-retries: 3 + - ecr#v1.2.0: + login: true + timeout_in_minutes: 10 + retry: + automatic: true + agents: + queue: 2x-gpu-v510 +- label: ':fire: Gloo PyTorch MNIST (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' + command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/pytorch/pytorch_mnist.py --data-dir /data/pytorch_datasets + artifact_paths: "artifacts/**" + plugins: + - docker-compose#v3.5.0: + run: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 + volumes: "./artifacts:/artifacts" + config: docker-compose.test.yml + pull-retries: 3 + - ecr#v1.2.0: + login: true + timeout_in_minutes: 10 + retry: + automatic: true + agents: + queue: 2x-gpu-v510 +- label: ':muscle: Gloo MXNet2 MNIST (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' + command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/mxnet/mxnet2_mnist.py + artifact_paths: "artifacts/**" + plugins: + - docker-compose#v3.5.0: + run: 
test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 + volumes: "./artifacts:/artifacts" + config: docker-compose.test.yml + pull-retries: 3 + - ecr#v1.2.0: + login: true + timeout_in_minutes: 10 + retry: + automatic: true + agents: + queue: 2x-gpu-v510 +- label: ':factory: Elastic Tests (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' + command: bash -c "cd /horovod/test/integration && HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.xml test_elastic_torch.py test_elastic_tensorflow2.py" + artifact_paths: "artifacts/**" + plugins: + - docker-compose#v3.5.0: + run: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 + volumes: "./artifacts:/artifacts" + config: docker-compose.test.yml + pull-retries: 3 + - ecr#v1.2.0: + login: true + timeout_in_minutes: 10 + retry: + automatic: true + agents: + queue: 2x-gpu-v510 diff --git a/test/single/data/expected_buildkite_gpu_pipeline.yaml b/test/single/data/expected_buildkite_gpu_non_heads_pipeline.yaml similarity index 61% rename from test/single/data/expected_buildkite_gpu_pipeline.yaml rename to test/single/data/expected_buildkite_gpu_non_heads_pipeline.yaml index 7f3d63956b..ea83783617 100644 --- a/test/single/data/expected_buildkite_gpu_pipeline.yaml +++ b/test/single/data/expected_buildkite_gpu_non_heads_pipeline.yaml @@ -1,15 +1,15 @@ steps: -- label: ':docker: Build test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2' +- label: ':docker: Build test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2' plugins: - docker-compose#v3.5.0: - build: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2 + build: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2 image-repository: 823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite - cache-from: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2:823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite:SLUG-test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2-latest + cache-from: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2:823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite:SLUG-test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2-latest config: docker-compose.test.yml push-retries: 5 - ecr#v1.2.0: login: true - timeout_in_minutes: 30 + timeout_in_minutes: 40 retry: automatic: true agents: @@ -24,7 +24,7 @@ steps: push-retries: 5 - ecr#v1.2.0: login: true - timeout_in_minutes: 30 + timeout_in_minutes: 40 retry: automatic: true agents: @@ -39,95 +39,80 @@ steps: push-retries: 5 - ecr#v1.2.0: login: true - timeout_in_minutes: 30 + timeout_in_minutes: 40 retry: automatic: true agents: queue: cpu -- label: ':docker: Build test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2' +- label: ':docker: Build test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2' plugins: - docker-compose#v3.5.0: - build: test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 + build: test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 image-repository: 823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite - cache-from: 
test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2:823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite:SLUG-test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2-latest + cache-from: test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2:823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite:SLUG-test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2-latest config: docker-compose.test.yml push-retries: 5 - ecr#v1.2.0: login: true - timeout_in_minutes: 30 + timeout_in_minutes: 40 retry: automatic: true agents: queue: cpu -- label: ':docker: Build test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2' +- label: ':docker: Build test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2' plugins: - docker-compose#v3.5.0: - build: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + build: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 image-repository: 823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite - cache-from: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2:823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite:SLUG-test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2-latest + cache-from: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2:823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite:SLUG-test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2-latest config: docker-compose.test.yml push-retries: 5 - ecr#v1.2.0: login: true - timeout_in_minutes: 30 + timeout_in_minutes: 40 retry: automatic: true agents: queue: cpu -- label: ':docker: Build test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2' +- label: ':docker: Build test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2' plugins: - docker-compose#v3.5.0: - build: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 + build: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 image-repository: 823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite - cache-from: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2:823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite:SLUG-test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2-latest + cache-from: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2:823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite:SLUG-test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2-latest config: docker-compose.test.yml push-retries: 5 - ecr#v1.2.0: login: true - timeout_in_minutes: 30 - retry: - automatic: true - agents: - queue: cpu -- label: ':docker: Build test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2' - plugins: - - docker-compose#v3.5.0: - build: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 - image-repository: 823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite - cache-from: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2:823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite:SLUG-test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2-latest - config: docker-compose.test.yml - 
push-retries: 5 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 30 + timeout_in_minutes: 40 retry: automatic: true agents: queue: cpu - wait - wait -- label: ':pytest: Gloo Parallel PyTests (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2)' +- label: ':pytest: Gloo Parallel PyTests (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2)' command: bash -c "HOROVOD_TEST_GPU=1 cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2 + run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 - ecr#v1.2.0: login: true - timeout_in_minutes: 5 + timeout_in_minutes: 10 retry: automatic: true agents: queue: 4x-gpu-v510 -- label: ':pytest: Gloo Single PyTests (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2)' +- label: ':pytest: Gloo Single PyTests (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2)' command: bash -c "HOROVOD_TEST_GPU=1 cd /horovod/test/single && (ls -1 test_*.py | xargs -n 1 /bin/bash /pytest_standalone.sh gloo)" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2 + run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -138,12 +123,12 @@ steps: automatic: true agents: queue: 4x-gpu-v510 -- label: ':pytest: Gloo Cluster PyTests (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2)' +- label: ':pytest: Gloo Cluster PyTests (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2)' command: bash -c "HOROVOD_TEST_GPU=1 /etc/init.d/ssh start && cd /horovod/test/integration && pytest --forked -v --capture=fd --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.static.xml test_static_run.py" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2 + run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -165,7 +150,7 @@ steps: pull-retries: 3 - ecr#v1.2.0: login: true - timeout_in_minutes: 5 + timeout_in_minutes: 10 retry: automatic: true agents: @@ -213,7 +198,7 @@ steps: pull-retries: 3 - ecr#v1.2.0: login: true - timeout_in_minutes: 5 + timeout_in_minutes: 10 retry: automatic: true agents: @@ -250,28 +235,28 @@ steps: automatic: true agents: queue: 4x-gpu-v510 -- label: ':pytest: Gloo Parallel PyTests (test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':pytest: Gloo Parallel PyTests (test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c "HOROVOD_TEST_GPU=1 cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 + run: 
test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 - ecr#v1.2.0: login: true - timeout_in_minutes: 5 + timeout_in_minutes: 10 retry: automatic: true agents: queue: 4x-gpu-v510 -- label: ':pytest: Gloo Single PyTests (test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':pytest: Gloo Single PyTests (test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c "HOROVOD_TEST_GPU=1 cd /horovod/test/single && (ls -1 test_*.py | xargs -n 1 /bin/bash /pytest_standalone.sh gloo)" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -282,12 +267,12 @@ steps: automatic: true agents: queue: 4x-gpu-v510 -- label: ':pytest: Gloo Cluster PyTests (test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':pytest: Gloo Cluster PyTests (test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c "HOROVOD_TEST_GPU=1 /etc/init.d/ssh start && cd /horovod/test/integration && pytest --forked -v --capture=fd --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.static.xml test_static_run.py" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -298,28 +283,28 @@ steps: automatic: true agents: queue: 4x-gpu-v510 -- label: ':pytest: Gloo Parallel PyTests (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':pytest: Gloo Parallel PyTests (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c "HOROVOD_TEST_GPU=1 cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 - ecr#v1.2.0: login: true - timeout_in_minutes: 5 + timeout_in_minutes: 10 retry: automatic: true agents: queue: 4x-gpu-v510 -- label: ':pytest: Gloo Single PyTests (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':pytest: Gloo Single PyTests (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c "HOROVOD_TEST_GPU=1 cd /horovod/test/single && (ls -1 test_*.py | xargs -n 1 /bin/bash /pytest_standalone.sh gloo)" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -330,12 +315,12 @@ steps: automatic: true agents: queue: 
4x-gpu-v510 -- label: ':pytest: Gloo Cluster PyTests (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':pytest: Gloo Cluster PyTests (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c "HOROVOD_TEST_GPU=1 /etc/init.d/ssh start && cd /horovod/test/integration && pytest --forked -v --capture=fd --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.static.xml test_static_run.py" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -346,28 +331,28 @@ steps: automatic: true agents: queue: 4x-gpu-v510 -- label: ':pytest: MPI Parallel PyTests (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':pytest: MPI Parallel PyTests (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c " HOROVOD_TEST_GPU=1 cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 - ecr#v1.2.0: login: true - timeout_in_minutes: 5 + timeout_in_minutes: 10 retry: automatic: true agents: queue: 4x-gpu-v510 -- label: ':pytest: MPI Single PyTests (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':pytest: MPI Single PyTests (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c " HOROVOD_TEST_GPU=1 cd /horovod/test/single && (ls -1 test_*.py | xargs -n 1 /bin/bash /pytest_standalone.sh mpi)" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -378,12 +363,12 @@ steps: automatic: true agents: queue: 4x-gpu-v510 -- label: ':pytest: MPI Cluster PyTests (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':pytest: MPI Cluster PyTests (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c " HOROVOD_TEST_GPU=1 /etc/init.d/ssh start && cd /horovod/test/integration && pytest --forked -v --capture=fd --continue-on-collection-errors --junit-xml=/artifacts/junit.mpi.static.xml test_static_run.py" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -394,44 +379,12 @@ steps: automatic: true agents: queue: 4x-gpu-v510 -- label: ':pytest: Gloo Parallel PyTests (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' +- 
label: ':pytest: Gloo Parallel PyTests (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c "HOROVOD_TEST_GPU=1 cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 5 - retry: - automatic: true - agents: - queue: 4x-gpu-v510 -- label: ':pytest: Gloo Single PyTests (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' - command: bash -c "HOROVOD_TEST_GPU=1 cd /horovod/test/single && (ls -1 test_*.py | xargs -n 1 /bin/bash /pytest_standalone.sh gloo)" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 15 - retry: - automatic: true - agents: - queue: 4x-gpu-v510 -- label: ':pytest: Gloo Cluster PyTests (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' - command: bash -c "HOROVOD_TEST_GPU=1 /etc/init.d/ssh start && cd /horovod/test/integration && pytest --forked -v --capture=fd --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.static.xml test_static_run.py" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 + run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -442,28 +395,12 @@ steps: automatic: true agents: queue: 4x-gpu-v510 -- label: ':pytest: Gloo Parallel PyTests (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' - command: bash -c "HOROVOD_TEST_GPU=1 cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 5 - retry: - automatic: true - agents: - queue: 4x-gpu-v510 -- label: ':pytest: Gloo Single PyTests (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':pytest: Gloo Single PyTests (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c "HOROVOD_TEST_GPU=1 cd /horovod/test/single && (ls -1 test_*.py | xargs -n 1 /bin/bash /pytest_standalone.sh gloo)" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -474,12 +411,12 @@ steps: automatic: true agents: queue: 4x-gpu-v510 -- label: ':pytest: Gloo Cluster PyTests 
(test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':pytest: Gloo Cluster PyTests (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c "HOROVOD_TEST_GPU=1 /etc/init.d/ssh start && cd /horovod/test/integration && pytest --forked -v --capture=fd --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.static.xml test_static_run.py" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -490,28 +427,28 @@ steps: automatic: true agents: queue: 4x-gpu-v510 -- label: ':pytest: MPI Parallel PyTests (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':pytest: MPI Parallel PyTests (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c " HOROVOD_TEST_GPU=1 cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 \$(cat /mpirun_command) /bin/bash /pytest.sh mpi)" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 - ecr#v1.2.0: login: true - timeout_in_minutes: 5 + timeout_in_minutes: 10 retry: automatic: true agents: queue: 4x-gpu-v510 -- label: ':pytest: MPI Single PyTests (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':pytest: MPI Single PyTests (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c " HOROVOD_TEST_GPU=1 cd /horovod/test/single && (ls -1 test_*.py | xargs -n 1 /bin/bash /pytest_standalone.sh mpi)" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -522,12 +459,12 @@ steps: automatic: true agents: queue: 4x-gpu-v510 -- label: ':pytest: MPI Cluster PyTests (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':pytest: MPI Cluster PyTests (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c " HOROVOD_TEST_GPU=1 /etc/init.d/ssh start && cd /horovod/test/integration && pytest --forked -v --capture=fd --continue-on-collection-errors --junit-xml=/artifacts/junit.mpi.static.xml test_static_run.py" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -539,12 +476,12 @@ steps: agents: queue: 4x-gpu-v510 - wait -- label: ':tensorflow: Gloo TensorFlow MNIST (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2)' +- label: ':tensorflow: 
Gloo TensorFlow MNIST (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2)' command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/tensorflow/tensorflow_mnist.py artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2 + run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -555,12 +492,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':tensorflow: Gloo Keras MNIST (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2)' +- label: ':tensorflow: Gloo Keras MNIST (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2)' command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/keras/keras_mnist_advanced.py artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2 + run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -571,12 +508,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':fire: Gloo PyTorch MNIST (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2)' +- label: ':fire: Gloo PyTorch MNIST (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2)' command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/pytorch/pytorch_mnist.py --data-dir /data/pytorch_datasets artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2 + run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -587,12 +524,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':muscle: Gloo MXNet MNIST (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2)' +- label: ':muscle: Gloo MXNet MNIST (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2)' command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/mxnet/mxnet_mnist.py artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2 + run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -603,92 +540,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':factory: Elastic Tests (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2)' +- label: ':factory: Elastic Tests (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2)' command: bash -c "cd /horovod/test/integration && HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.xml test_elastic_torch.py test_elastic_tensorflow.py test_elastic_tensorflow_keras.py" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: 
test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':spark: Spark Keras Rossmann Run (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2)' - command: bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_rossmann_run.py --num-proc 2 --data-dir file:///data --epochs 3 --sample-rate 0.01" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':spark: Spark Keras Rossmann Estimator (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2)' - command: bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_rossmann_estimator.py --num-proc 2 --work-dir /work --data-dir file:///data --epochs 3 --sample-rate 0.01" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':spark: Spark Keras MNIST (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2)' - command: bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/keras/keras_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':spark: Spark Torch MNIST (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2)' - command: bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':spark: Spark Lightning MNIST (test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2)' - command: bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_lightning_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_3_1-mxnet1_5_1_p0-pyspark3_1_2 + run: test-gpu-gloo-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml 
pull-retries: 3 @@ -779,38 +636,6 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':spark: Spark Torch MNIST (test-gpu-gloo-py3_8-tf2_4_3-keras2_3_1-torch1_7_1-mxnet1_6_0_p0-pyspark3_1_2)' - command: bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tf2_4_3-keras2_3_1-torch1_7_1-mxnet1_6_0_p0-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':spark: Spark Lightning MNIST (test-gpu-gloo-py3_8-tf2_4_3-keras2_3_1-torch1_7_1-mxnet1_6_0_p0-pyspark3_1_2)' - command: bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_lightning_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tf2_4_3-keras2_3_1-torch1_7_1-mxnet1_6_0_p0-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 - label: ':tensorflow: Gloo TensorFlow 2.0 MNIST (test-gpu-gloo-py3_8-tf2_4_3-keras2_3_1-torch1_7_1-mxnet1_7_0_p1-pyspark3_1_2)' command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/tensorflow2/tensorflow2_mnist.py artifact_paths: "artifacts/**" @@ -891,44 +716,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':spark: Spark Torch MNIST (test-gpu-gloo-py3_8-tf2_4_3-keras2_3_1-torch1_7_1-mxnet1_7_0_p1-pyspark3_1_2)' - command: bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tf2_4_3-keras2_3_1-torch1_7_1-mxnet1_7_0_p1-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':spark: Spark Lightning MNIST (test-gpu-gloo-py3_8-tf2_4_3-keras2_3_1-torch1_7_1-mxnet1_7_0_p1-pyspark3_1_2)' - command: bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_lightning_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tf2_4_3-keras2_3_1-torch1_7_1-mxnet1_7_0_p1-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':tensorflow: Gloo TensorFlow 2.0 MNIST (test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':tensorflow: Gloo TensorFlow 2.0 MNIST (test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/tensorflow2/tensorflow2_mnist.py artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 + 
run: test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -939,12 +732,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':tensorflow: Gloo TensorFlow 2.0 Keras MNIST (test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':tensorflow: Gloo TensorFlow 2.0 Keras MNIST (test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/tensorflow2/tensorflow2_keras_mnist.py artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -955,12 +748,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':fire: Gloo PyTorch MNIST (test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':fire: Gloo PyTorch MNIST (test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/pytorch/pytorch_mnist.py --data-dir /data/pytorch_datasets artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -971,12 +764,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':muscle: Gloo MXNet MNIST (test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':muscle: Gloo MXNet MNIST (test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/mxnet/mxnet_mnist.py artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -987,44 +780,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':factory: Elastic Tests (test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':factory: Elastic Tests (test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c "cd /horovod/test/integration && HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.xml test_elastic_torch.py test_elastic_tensorflow2.py" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':spark: Spark Torch MNIST (test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' - command: bash -c 
"OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':spark: Spark Lightning MNIST (test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2)' - command: bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_lightning_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tf2_5_1-keras_none-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-gloo-py3_8-tf2_5_1-keras2_4_3-torch1_8_1-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1035,12 +796,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':tensorflow: Gloo TensorFlow 2.0 MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':tensorflow: Gloo TensorFlow 2.0 MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/tensorflow2/tensorflow2_mnist.py artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1051,12 +812,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':tensorflow: Gloo TensorFlow 2.0 Keras MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':tensorflow: Gloo TensorFlow 2.0 Keras MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/tensorflow2/tensorflow2_keras_mnist.py artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1067,12 +828,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':fire: Gloo PyTorch MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':fire: Gloo PyTorch MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/pytorch/pytorch_mnist.py --data-dir /data/pytorch_datasets artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1083,12 +844,12 @@ steps: automatic: 
true agents: queue: 2x-gpu-v510 -- label: ':muscle: Gloo MXNet MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':muscle: Gloo MXNet MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/mxnet/mxnet_mnist.py artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1099,12 +860,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':factory: Elastic Tests (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':factory: Elastic Tests (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c "cd /horovod/test/integration && HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.xml test_elastic_torch.py test_elastic_tensorflow2.py" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1115,12 +876,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':jupyter: Run PyTests test_interactiverun (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':jupyter: Run PyTests test_interactiverun (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c "cd /horovod/test && pytest -v --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.mpi.integration.xml integration/test_interactiverun.py" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1131,12 +892,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':fire: MPI PyTorch MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':fire: MPI PyTorch MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c " \$(cat /mpirun_command) python /horovod/examples/pytorch/pytorch_mnist.py --data-dir /data/pytorch_datasets" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1147,12 +908,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':muscle: MPI MXNet MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' 
+- label: ':muscle: MPI MXNet MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c " OMP_NUM_THREADS=1 \$(cat /mpirun_command) python /horovod/examples/mxnet/mxnet_mnist.py" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1163,12 +924,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':tensorflow: MPI TensorFlow 2.0 MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':tensorflow: MPI TensorFlow 2.0 MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c " \$(cat /mpirun_command) python /horovod/examples/tensorflow2/tensorflow2_mnist.py" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1179,156 +940,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':tensorflow: MPI TensorFlow 2.0 Keras MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':tensorflow: MPI TensorFlow 2.0 Keras MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c " \$(cat /mpirun_command) python /horovod/examples/tensorflow2/tensorflow2_keras_mnist.py" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':spark: Spark Torch MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' - command: bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':spark: Spark Lightning MNIST (test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' - command: bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_lightning_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':tensorflow: Gloo TensorFlow 2.0 
MNIST (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' - command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/tensorflow2/tensorflow2_mnist.py - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':tensorflow: Gloo TensorFlow 2.0 Keras MNIST (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' - command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/tensorflow2/tensorflow2_keras_mnist.py - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':fire: Gloo PyTorch MNIST (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' - command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/pytorch/pytorch_mnist.py --data-dir /data/pytorch_datasets - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':muscle: Gloo MXNet2 MNIST (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' - command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/mxnet/mxnet2_mnist.py - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':factory: Elastic Tests (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' - command: bash -c "cd /horovod/test/integration && HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.xml test_elastic_torch.py test_elastic_tensorflow2.py" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':spark: Spark Torch MNIST (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' - command: bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 - volumes: 
"./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':spark: Spark Lightning MNIST (test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2)' - command: bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_lightning_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-gpu-gloo-py3_8-tfhead-keras_none-torchhead-mxnethead-pyspark3_1_2 + run: test-gpu-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1339,12 +956,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':tensorflow: Gloo TensorFlow 2.0 MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':tensorflow: Gloo TensorFlow 2.0 MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/tensorflow2/tensorflow2_mnist.py artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1355,12 +972,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':tensorflow: Gloo TensorFlow 2.0 Keras MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':tensorflow: Gloo TensorFlow 2.0 Keras MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/tensorflow2/tensorflow2_keras_mnist.py artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1371,12 +988,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':fire: Gloo PyTorch MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':fire: Gloo PyTorch MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/pytorch/pytorch_mnist.py --data-dir /data/pytorch_datasets artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1387,12 +1004,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':muscle: Gloo MXNet MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':muscle: Gloo MXNet MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: 
horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/mxnet/mxnet_mnist.py artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1403,12 +1020,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':factory: Elastic Tests (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':factory: Elastic Tests (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c "cd /horovod/test/integration && HOROVOD_LOG_LEVEL=DEBUG pytest --forked -v --log-cli-level 10 --log-cli-format '[%(asctime)-15s %(levelname)s %(filename)s:%(lineno)d %(funcName)s()] %(message)s' --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.gloo.elastic.xml test_elastic_torch.py test_elastic_tensorflow2.py" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1419,12 +1036,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':jupyter: Run PyTests test_interactiverun (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':jupyter: Run PyTests test_interactiverun (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c "cd /horovod/test && pytest -v --capture=no --continue-on-collection-errors --junit-xml=/artifacts/junit.mpi.integration.xml integration/test_interactiverun.py" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1435,12 +1052,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':fire: MPI PyTorch MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':fire: MPI PyTorch MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c " \$(cat /mpirun_command) python /horovod/examples/pytorch/pytorch_mnist.py --data-dir /data/pytorch_datasets" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1451,12 +1068,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':muscle: MPI MXNet MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':muscle: MPI MXNet MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c " OMP_NUM_THREADS=1 \$(cat /mpirun_command) python /horovod/examples/mxnet/mxnet_mnist.py" artifact_paths: 
"artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1467,12 +1084,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':tensorflow: MPI TensorFlow 2.0 MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':tensorflow: MPI TensorFlow 2.0 MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c " \$(cat /mpirun_command) python /horovod/examples/tensorflow2/tensorflow2_mnist.py" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 @@ -1483,44 +1100,12 @@ steps: automatic: true agents: queue: 2x-gpu-v510 -- label: ':tensorflow: MPI TensorFlow 2.0 Keras MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' +- label: ':tensorflow: MPI TensorFlow 2.0 Keras MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' command: bash -c " \$(cat /mpirun_command) python /horovod/examples/tensorflow2/tensorflow2_keras_mnist.py" artifact_paths: "artifacts/**" plugins: - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':spark: Spark Torch MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' - command: bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 - volumes: "./artifacts:/artifacts" - config: docker-compose.test.yml - pull-retries: 3 - - ecr#v1.2.0: - login: true - timeout_in_minutes: 10 - retry: - automatic: true - agents: - queue: 2x-gpu-v510 -- label: ':spark: Spark Lightning MNIST (test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2)' - command: bash -c "OMP_NUM_THREADS=1 /spark_env.sh python /horovod/examples/spark/pytorch/pytorch_lightning_spark_mnist.py --num-proc 2 --work-dir /work --data-dir /data --epochs 3" - artifact_paths: "artifacts/**" - plugins: - - docker-compose#v3.5.0: - run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras_none-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 + run: test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_8_0_p0-pyspark3_1_2 volumes: "./artifacts:/artifacts" config: docker-compose.test.yml pull-retries: 3 diff --git a/test/single/test_buildkite.py b/test/single/test_buildkite.py index ee1c847861..4c53fbb369 100644 --- a/test/single/test_buildkite.py +++ b/test/single/test_buildkite.py @@ -47,6 +47,11 @@ def _run(self, cmd, env): stdout = io.StringIO() stderr = io.StringIO() try: + if env is 
not None: + env = { + 'PATH': os.environ['PATH'], + **env, + } exit_code = safe_shell_exec.execute(cmd, env=env, stdout=stdout, stderr=stderr) return exit_code, stdout.getvalue(), stderr.getvalue() finally: @@ -56,44 +61,46 @@ def _run(self, cmd, env): """ Tests the generated GPU buildkite pipeline. - Compares output of .buildkite/gen-pipeline.sh with test/single/data/expected_buildkite_pipeline.yaml. + Compares output of .buildkite/gen-pipeline.sh with test/single/data/expected_buildkite_gpu_heads_pipeline.yaml. To see the changes in the output, run the following in your Horovod project root: - BUILDKITE_PIPELINE_SLUG=SLUG BUILDKITE_BRANCH=BRANCH PIPELINE_MODE="GPU FULL" .buildkite/gen-pipeline.sh > test/single/data/expected_buildkite_gpu_pipeline.yaml + BUILDKITE_PIPELINE_SLUG=SLUG BUILDKITE_BRANCH=BRANCH PIPELINE_MODE="GPU HEADS" .buildkite/gen-pipeline.sh > test/single/data/expected_buildkite_gpu_heads_pipeline.yaml Then run `git diff` to see the changes in the generated pipeline YAML. - Commit `test/single/data/expected_buildkite_gpu_pipeline.yaml` to get those changes into your PR. + Commit `test/single/data/expected_buildkite_gpu_heads_pipeline.yaml` to get those changes into your PR. """ - def test_gen_pipeline(self): - expected_filename = os.path.join(os.path.dirname(__file__), 'data/expected_buildkite_gpu_pipeline.yaml') - with open(expected_filename, 'r') as f: - lines = f.readlines() - expected_pipeline = ''.join(lines) + def test_gen_gpu_heads_pipeline(self): + self.do_test_gen_pipeline(GEN_PIPELINE_FNAME, 'GPU HEADS', {}, 'WARNING:root:no commit (None) or default branch (None) given\n') - gen_pipeline_env = dict(BUILDKITE_PIPELINE_SLUG='SLUG', BUILDKITE_BRANCH='BRANCH', PIPELINE_MODE='GPU FULL') - gen_pipeline_cmd = GEN_PIPELINE_FNAME + """ + Tests the generated GPU buildkite pipeline. - exit_code, actual_pipeline, gen_pipeline_log = self._run(gen_pipeline_cmd, gen_pipeline_env) + Compares output of .buildkite/gen-pipeline.sh with test/single/data/expected_buildkite_gpu_non_heads_pipeline.yaml. + To see the changes in the output, run the following in your Horovod project root: - self.assertEqual(0, exit_code) - self.assertEqual('WARNING:root:no commit (None) or default branch (None) given\n', gen_pipeline_log) - self.assertEqual(expected_pipeline, actual_pipeline) + BUILDKITE_PIPELINE_SLUG=SLUG BUILDKITE_BRANCH=BRANCH PIPELINE_MODE="GPU NON HEADS" .buildkite/gen-pipeline.sh > test/single/data/expected_buildkite_gpu_non_heads_pipeline.yaml + + Then run `git diff` to see the changes in the generated pipeline YAML. + Commit `test/single/data/expected_buildkite_gpu_non_heads_pipeline.yaml` to get those changes into your PR. + """ + def test_gen_gpu_non_heads_pipeline(self): + self.do_test_gen_pipeline(GEN_PIPELINE_FNAME, 'GPU NON HEADS', {}, 'WARNING:root:no commit (None) or default branch (None) given\n') """ - Tests the given command produces the full pipeline. + Tests the given command produces the expected pipeline. 
""" - def do_test_gen_full_pipeline(self, cmd, env=dict()): - expected_filename = os.path.join(os.path.dirname(__file__), 'data/expected_buildkite_gpu_pipeline.yaml') + def do_test_gen_pipeline(self, cmd, flavour='GPU NON HEADS', env=dict(), expected_log=''): + expected_filename = os.path.join(os.path.dirname(__file__), f'data/expected_buildkite_{flavour.lower().replace(" ", "_")}_pipeline.yaml') with open(expected_filename, 'r') as f: lines = f.readlines() expected_pipeline = ''.join(lines) - cmd_env = dict(BUILDKITE_PIPELINE_SLUG='SLUG', BUILDKITE_BRANCH='BRANCH', PIPELINE_MODE='GPU FULL') + cmd_env = dict(BUILDKITE_PIPELINE_SLUG='SLUG', BUILDKITE_BRANCH='BRANCH', PIPELINE_MODE=flavour) cmd_env.update(env) exit_code, pipeline, log = self._run(cmd, cmd_env) self.assertEqual(0, exit_code) - self.assertEqual('', log) + self.assertEqual(expected_log, log) self.assertEqual(expected_pipeline, pipeline) """ @@ -117,7 +124,7 @@ def test_gen_pipeline_with_code_changes(self): with open(os.path.join(dir, 'get_changed_code_files.py'), 'w') as py: py.write("print('{}')".format(filename)) - self.do_test_gen_full_pipeline(tmp_gen_pipeline_sh) + self.do_test_gen_pipeline(tmp_gen_pipeline_sh) """ Tests non-code changes produces the short pipeline. @@ -156,7 +163,7 @@ def test_gen_pipeline_on_default_branch(self): py.write("pass") env = dict(BUILDKITE_BRANCH='default', BUILDKITE_PIPELINE_DEFAULT_BRANCH='default') - self.do_test_gen_full_pipeline(tmp_gen_pipeline_sh, env) + self.do_test_gen_pipeline(tmp_gen_pipeline_sh, env=env) """ Tests a failing get_changed_code_files.py script produces the full pipeline. @@ -174,7 +181,7 @@ def test_gen_pipeline_with_failing_py(self): py.write('import sys\n') py.write('sys.exit(1)') - self.do_test_gen_full_pipeline(tmp_gen_pipeline_sh) + self.do_test_gen_pipeline(tmp_gen_pipeline_sh) """ Tests .buildkite/get_changed_code_files.py identifies files as non-code files. 
diff --git a/test/single/test_ray.py b/test/single/test_ray.py index cbe6b714d7..a51396f80c 100644 --- a/test/single/test_ray.py +++ b/test/single/test_ray.py @@ -14,6 +14,7 @@ from horovod.common.util import gloo_built from horovod.ray.runner import (Coordinator, MiniSettings, RayExecutor) from horovod.ray.worker import BaseHorovodWorker +from horovod.ray.strategy import create_placement_group sys.path.append(os.path.dirname(__file__)) @@ -421,6 +422,38 @@ def simple_fn(worker): hjob.shutdown() +@pytest.mark.skipif( + not gloo_built(), reason='Gloo is required for Ray integration') +def test_horovod_train_in_pg(ray_start_4_cpus): + pg, _ = create_placement_group( + {"CPU": 1, "GPU": int(torch.cuda.is_available())}, 4, 30, "PACK") + + @ray.remote + class _Actor(): + def run(self): + def simple_fn(worker): + local_rank = _train() + return local_rank + + setting = RayExecutor.create_settings(timeout_s=30) + hjob = RayExecutor( + setting, + num_workers=4, + num_hosts=None, + num_workers_per_host=None, + cpus_per_worker=1, + gpus_per_worker=int(torch.cuda.is_available()) or None, + use_gpu=torch.cuda.is_available()) + hjob.start() + assert not hjob.driver.strategy._created_placement_group + result = hjob.execute(simple_fn) + assert set(result) == {0, 1, 2, 3} + hjob.shutdown() + actor = _Actor.options( + num_cpus=0, num_gpus=0, placement_group_capture_child_tasks=True, placement_group=pg).remote() + ray.get(actor.run.remote()) + + @pytest.mark.skipif( not gloo_built(), reason='Gloo is required for Ray integration') def test_remote_client_train(ray_start_client): diff --git a/test/single/test_ray_elastic.py b/test/single/test_ray_elastic.py index 5c70fb0d38..bd7b929867 100644 --- a/test/single/test_ray_elastic.py +++ b/test/single/test_ray_elastic.py @@ -6,11 +6,10 @@ import psutil import os import socket - +import time import mock import pytest import ray - from horovod.common.util import gloo_built from horovod.runner.elastic.discovery import HostDiscovery from horovod.ray.elastic import ElasticRayExecutor, RayHostDiscovery @@ -142,9 +141,14 @@ def create_node_entry(hostname): class SimpleTestDiscovery(HostDiscovery): - def __init__(self, schedule): + def __init__(self, schedule, wait_for_previous_set=True): self._schedule = schedule self._generator = self.host_generator() + self.executor = None + # The previous set of hosts + # We need a reference to them as iterators can only provide the next set + self.prevlist = None + self.wait_for_previous_set = wait_for_previous_set def host_generator(self): for iters, hosts in self._schedule: @@ -154,13 +158,56 @@ def host_generator(self): def find_available_hosts_and_slots(self): hostlist = next(self._generator) + # Ensure discovery waits for the previous set to register + self._wait_for_previous_set_registration(hostlist) + hosts = {} for item in hostlist: host, slots = item.split(":") slots = int(slots) hosts[host] = slots + return hosts + def _wait_for_previous_set_registration(self, hostlist): + """ + Ensure that at least one host from the previous set of hosts has + been registered. + Without this, the discovery script will "discover" the new + set of hosts before the current set can register. + This would result in a race condition. + Consider a discovery schedule: + ``` + discovery_schedule = [ + (10, ['host-1:2']), + (30, ['host-1:2', 'host-2:1', 'host-3:1']), + (None, ['host-2:1']), + ] + ``` + The initial set is: ['host-1:2'].
Before this is registered in the driver, the discovery script + discovers the set: ['host-1:2', 'host-2:1', 'host-3:1'], and adds ['host-2:1', 'host-3:1']. + However, since ['host-1:2'] has not registered, there is no coordinator to notify the workers. + When host-1 and host-3 are removed, driver.resume will call _activate_workers, which will update the host assignments. + It checks the intersection between the previous and current sets of hosts: it finds that the previous + set is ['host-1:2'] and the current set is ['host-2:1'], since there was no notification for the added and removed + hosts. + Waiting here ensures that the previous set of hosts can register before the current set is discovered. + """ + if self.wait_for_previous_set is False: + return + while(self.prevlist and self.executor): + for item in self.prevlist: + host, slots = item.split(":") + slot = self.executor.driver.get_slot_info(host, 0) + # Avoid the empty slot + if (not slot.hostname) or self.executor.driver.get_worker_client(slot): + break + else: + time.sleep(0.001) + continue + break + self.prevlist = hostlist + class StatusCallback: def __init__(self): @@ -215,9 +262,6 @@ def fault_tolerance_patches(): @pytest.mark.skipif( not gloo_built(), reason='Gloo is required for Ray integration') -@pytest.mark.skipif( - os.environ.get('GITHUB_ACTIONS', 'false') == 'true', - reason='This test fails on GitHub Workflow, see https://github.com/horovod/horovod/issues/2813') def test_fault_tolerance_hosts_added_and_removed(ray_8_cpus): with fault_tolerance_patches(): discovery_schedule = [ @@ -231,7 +275,7 @@ def test_fault_tolerance_hosts_added_and_removed(ray_8_cpus): settings.discovery = SimpleTestDiscovery(discovery_schedule) executor = ElasticRayExecutor( settings, cpus_per_slot=1, override_discovery=False) - + settings.discovery.executor = executor training_fn = _create_training_function(iterations=50) executor.start() trace = StatusCallback() @@ -245,9 +289,7 @@ def test_fault_tolerance_hosts_added_and_removed(ray_8_cpus): @pytest.mark.skipif( not gloo_built(), reason='Gloo is required for Ray integration') -@pytest.mark.skipif( - os.environ.get('GITHUB_ACTIONS', 'false') == 'true', - reason='This test fails on GitHub Workflow, see https://github.com/horovod/horovod/issues/2813') +@pytest.mark.skip(reason='https://github.com/horovod/horovod/issues/3197') def test_fault_tolerance_hosts_remove_and_add(ray_8_cpus): with fault_tolerance_patches(): discovery_schedule = [ diff --git a/test/single/test_torch_elastic.py b/test/single/test_torch_elastic.py index c14463a79a..ed3f9e0e91 100644 --- a/test/single/test_torch_elastic.py +++ b/test/single/test_torch_elastic.py @@ -112,7 +112,7 @@ def __len__(self): state.sync() assert state.sampler.epoch == 0 - assert len(state.sampler.processed_indices) == 0 + assert state.sampler.processed_num == 0 # Normal usage, no errors epochs = 2 @@ -120,12 +120,8 @@ def __len__(self): for epoch in range(epochs): sampler.set_epoch(epoch) for batch_idx, batch in enumerate(data_loader): - batch_indices = sampler.get_indices(batch_idx, batch_size) - batch_data = [dataset[idx] for idx in batch_indices] - assert batch_data == batch.numpy().tolist() - sampler.record_batch(batch_idx, batch_size) - assert len(sampler.processed_indices) == batch_size * (batch_idx + 1) + assert sampler.processed_num == batch_size * (batch_idx + 1) total_batches += 1 assert total_batches == (samples_per_worker / batch_size) * epochs @@ -133,47 +129,44 @@ def __len__(self): # Do not reset epoch: processed samples are
retained and data loader repeats total_batches = 0 for _ in enumerate(data_loader): - assert len(sampler.processed_indices) == len(sampler) + assert sampler.processed_num == len(sampler) total_batches += 1 assert total_batches == samples_per_worker / batch_size # Elastic: partial epoch + commit sampler.set_epoch(2) - assert len(sampler.processed_indices) == 0 + assert sampler.processed_num == 0 sampler.record_batch(0, batch_size) sampler.record_batch(1, batch_size) - assert len(sampler.processed_indices) == 2 * batch_size + assert sampler.processed_num == 2 * batch_size - committed_indices = copy.copy(sampler.processed_indices) + committed_num = copy.copy(sampler.processed_num) state.commit() # Elastic: partial epoch + restore sampler.record_batch(2, batch_size) sampler.record_batch(3, batch_size) - assert len(sampler.processed_indices) == 4 * batch_size + assert sampler.processed_num == 4 * batch_size state.restore() - assert len(sampler.processed_indices) == 2 * batch_size - assert sampler.processed_indices == committed_indices + assert sampler.processed_num == 2 * batch_size + assert sampler.processed_num == committed_num # Elastic: sync across workers and verify non-overlap of processed samples sampler.record_batch(2, batch_size) - assert len(sampler.processed_indices) == 3 * batch_size + assert sampler.processed_num == 3 * batch_size state.commit() state.sync() - assert len(sampler.processed_indices) == 3 * batch_size * hvd.size() + assert sampler.processed_num == 3 * batch_size * hvd.size() # After the sync, the remaining indices should be updated and repartitioned total_batches = 0 assert len(sampler) == batch_size for batch_idx, batch in enumerate(data_loader): - batch_indices = sampler.get_indices(batch_idx, batch_size) - overlap_indices = set(batch_indices) & sampler.processed_indices - assert overlap_indices == set() total_batches += 1 assert total_batches == 1
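The assertions above replace the `processed_indices` set with a single `processed_num` counter on the elastic sampler. A simplified, self-contained sketch of the commit/restore bookkeeping these assertions exercise (`ToySampler` is hypothetical and is not Horovod's `ElasticSampler` API):

```python
# Hypothetical stand-in for the sampler state used in the test above;
# it only models the processed_num / commit / restore bookkeeping.
class ToySampler:
    def __init__(self):
        self.processed_num = 0     # samples recorded since the last epoch reset
        self._committed_num = 0    # value captured by the last commit()

    def set_epoch(self, epoch):
        self.processed_num = 0     # a new epoch starts with no processed samples

    def record_batch(self, batch_idx, batch_size):
        self.processed_num += batch_size

    def commit(self):
        self._committed_num = self.processed_num

    def restore(self):
        self.processed_num = self._committed_num


if __name__ == '__main__':
    batch_size = 2
    sampler = ToySampler()
    sampler.set_epoch(2)
    sampler.record_batch(0, batch_size)
    sampler.record_batch(1, batch_size)
    sampler.commit()                      # checkpoint after two batches
    sampler.record_batch(2, batch_size)
    assert sampler.processed_num == 3 * batch_size
    sampler.restore()                     # roll back to the committed count
    assert sampler.processed_num == 2 * batch_size
```

This mirrors the partial-epoch, commit, and restore sequence the test walks through, with the counter standing in for the previously tracked index set.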