ElasticDL: A Kubernetes-native Deep Learning Framework

Travis-CI Build Status Code Coverage License: MIT PyPI Status Badge

ElasticDL is a Kubernetes-native deep learning framework that supports fault-tolerance and elastic scheduling.

IMPORTANT: This repository is deprecated.

  1. The ElasticDL repository is no longer actively maintained. Users are encouraged to switch to DLRover.

  2. In addition to elasticity and fault-tolerance, DLRover also implements auto-scaling for distributed training.

  3. In addition to TensorFlow and Horovod, DLRover supports TorchElastic, so users get elasticity and fault-tolerance without modifying their training code, as in the TorchElastic example.

  4. To deploy a distributed job using kubectl, DLRover implements an ElasticJob CRD.

Main Features

Elastic Scheduling and Fault-Tolerance

Through its Kubernetes-native design, ElasticDL enables fault-tolerance and works with the priority-based preemption of Kubernetes to achieve elastic scheduling for deep learning tasks.

Support TensorFlow and PyTorch

  • TensorFlow Estimator.
  • TensorFlow Keras.
  • PyTorch

Minimalist Interface

Given a model defined with the Keras API, train it as a distributed job with a single command.

elasticdl train \
  --image_name=elasticdl:mnist \
  --model_zoo=model_zoo \
  --model_def=mnist.mnist_functional_api.custom_model \
  --training_data=/data/mnist/train \
  --job_name=test-mnist

Quick Start

Please check out our step-by-step tutorials for running ElasticDL on a local laptop, an on-prem cluster, or a public cloud such as Google Kubernetes Engine.

TensorFlow Estimator on MiniKube

TensorFlow Keras on MiniKube

PyTorch on MiniKube


TensorFlow and PyTorch each have a native distributed computing feature that is fault-recoverable. If some processes fail, the whole distributed job fails; however, we can restart the job and recover its state from the most recent checkpoint files.

ElasticDL supports fault-tolerance during distributed training. If some processes fail, the job keeps running. Therefore, ElasticDL doesn't need to save checkpoints or recover from them.

Fault-tolerance lets ElasticDL work with the priority-based preemption of Kubernetes to achieve elastic scheduling. When Kubernetes kills some processes of a job to free resources for incoming jobs with higher priority, the current job doesn't fail but continues with fewer resources.
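The difference between recovering from a failure and tolerating one can be illustrated with a toy simulation. The sketch below is not ElasticDL's code; it assumes a design in which the training data is split into shards that a master hands out to workers, and every name in it (TaskQueue, run_job, the worker names) is hypothetical. When a worker dies, its unfinished shard goes back to the queue and the surviving workers finish the job, so nothing has to be restarted from a checkpoint.

```python
from collections import deque


class TaskQueue:
    """Hands out data shards and takes them back if a worker dies (hypothetical)."""

    def __init__(self, num_shards):
        self.todo = deque(range(num_shards))  # shards nobody has started
        self.doing = {}                       # shard id -> worker name
        self.done = set()                     # shards fully processed

    def get(self, worker):
        """Assign the next shard to `worker`, or return None if none are left."""
        if not self.todo:
            return None
        shard = self.todo.popleft()
        self.doing[shard] = worker
        return shard

    def report_done(self, shard):
        """Mark a shard as finished."""
        del self.doing[shard]
        self.done.add(shard)

    def recover(self, worker):
        """Requeue every shard the failed worker had not finished."""
        lost = [s for s, w in self.doing.items() if w == worker]
        for shard in lost:
            del self.doing[shard]
            self.todo.append(shard)
        return lost


def run_job(num_shards, workers, failing_worker):
    """Simulate a job in which `failing_worker` dies while holding its first shard."""
    queue = TaskQueue(num_shards)
    in_flight = {}
    for w in workers:
        shard = queue.get(w)
        if shard is not None:
            in_flight[w] = shard

    # The failure: the worker disappears and its shard returns to the queue.
    queue.recover(failing_worker)
    in_flight.pop(failing_worker, None)

    survivors = [w for w in workers if w != failing_worker]
    while survivors and (queue.todo or queue.doing):
        for w in survivors:
            if w in in_flight:
                queue.report_done(in_flight.pop(w))
            shard = queue.get(w)
            if shard is not None:
                in_flight[w] = shard
    return queue
```

For example, run_job(10, ["w0", "w1", "w2"], "w1") finishes all ten shards even though w1 never completes the shard it was given.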

Elastic scheduling could significantly improve the overall utilization of a cluster. Suppose that a cluster has N GPUs, and a job is using one of them. Without elastic scheduling, a new job claiming N GPUs would have to wait for the first job to complete before starting. This pending time could be hours, days, or even weeks. During this very long time, the utilization of the cluster is 1/N. With elastic scheduling, the new job could start running immediately with N-1 GPUs, and Kubernetes might increase its GPU consumption by 1 after the first job completes. In this case, the overall utilization is 100%.
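The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the scenario described above: the function names are made up for illustration, and "without elastic scheduling" is modeled as a job that starts only when its full GPU request fits.

```python
from fractions import Fraction


def utilization_without_elastic(total_gpus, first_job_gpus, new_job_request):
    """The new job starts only if its whole request fits in the free GPUs."""
    free = total_gpus - first_job_gpus
    started = new_job_request if new_job_request <= free else 0
    return Fraction(first_job_gpus + started, total_gpus)


def utilization_with_elastic(total_gpus, first_job_gpus, new_job_request):
    """The new job starts right away with whatever free GPUs exist."""
    free = total_gpus - first_job_gpus
    started = min(new_job_request, free)
    return Fraction(first_job_gpus + started, total_gpus)


N = 8  # cluster size chosen for illustration
print(utilization_without_elastic(N, 1, N))  # 1/8: the new job waits
print(utilization_with_elastic(N, 1, N))     # 1: the new job runs on N-1 GPUs
```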

ElasticDL's elastic scheduling comes from its Kubernetes-native design. It doesn't rely on Kubernetes extensions like Kubeflow to run TensorFlow/PyTorch programs. Instead, the master process of an ElasticDL job calls the Kubernetes API to start workers and parameter servers. The master also watches for events such as a process or pod being killed and reacts to them to realize fault-tolerance.

In short, ElasticDL enhances TensorFlow/PyTorch with fault-tolerance and elastic scheduling when you have a Kubernetes cluster. We provide a tutorial showing how to set up a Kubernetes cluster on Google Cloud and run ElasticDL jobs there. We respect TensorFlow's native distributed computing feature, which doesn't require a specific computing platform like Kubernetes and allows TensorFlow to run on any platform.
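The watch-and-react loop described above can be sketched without a real cluster. The following is a simplified illustration, not ElasticDL's master. The event types ADDED, MODIFIED and DELETED are the ones a Kubernetes watch reports, but the flat event objects, the Master class, and the create_pod callback that stands in for a Kubernetes API call are all assumptions made for this example.

```python
class Master:
    """Keeps a desired number of worker pods alive by reacting to watch events."""

    def __init__(self, desired_workers, create_pod):
        self.desired_workers = desired_workers
        self.create_pod = create_pod  # stand-in for a Kubernetes API call
        self.pods = {}                # pod name -> last known phase
        self._next_id = 0

    def _launch_worker(self):
        name = f"worker-{self._next_id}"
        self._next_id += 1
        self.create_pod(name)
        # A new pod stays Pending until the cluster has room to run it.
        self.pods[name] = "Pending"

    def start(self):
        while len(self.pods) < self.desired_workers:
            self._launch_worker()

    def handle_event(self, event):
        """Process one simplified watch event."""
        name = event["object"]["name"]
        if event["type"] == "DELETED":
            # A killed or preempted worker is replaced; the job itself goes on.
            if self.pods.pop(name, None) is not None:
                self._launch_worker()
        elif name in self.pods:
            self.pods[name] = event["object"]["phase"]

    def running_workers(self):
        return sorted(n for n, p in self.pods.items() if p == "Running")


created = []
master = Master(desired_workers=3, create_pod=created.append)
master.start()
for pod in list(master.pods):
    master.handle_event({"type": "MODIFIED", "object": {"name": pod, "phase": "Running"}})

# Kubernetes preempts worker-1; the master asks for a replacement.
master.handle_event({"type": "DELETED", "object": {"name": "worker-1", "phase": "Failed"}})
print(master.running_workers())  # ['worker-0', 'worker-2']
print(created[-1])               # worker-3
```

In this sketch the replacement pod remains Pending until a later event reports it as Running, which mirrors a job that continues with fewer workers while the cluster is busy and grows again when resources free up.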

Development Guide

Please refer to this document for the development guide.