Kubernetes-native Deep Learning Framework
Branch: develop
brightcoder01 The metadata sample from SQLFlow transform expression and two corresponding models with keras preprocessing layers and feature columns. (#1865)

* Add transform_ops.py file

* Add the sample wide and deep model generated from SQLFlow transform expression - feature column version and keras preprocessing version

* Remove the out-of-date keras_process_layers.py file

* Add more comments

* Remove keras_preprocessing_layers in model_zoo

* Update Lookup to IndexLookup

* Calculate the id offsets from the num_buckets

* Resolve comments
Latest commit 960c5a8 Mar 31, 2020

Files

| Name | Latest commit message | Commit time |
| --- | --- | --- |
| docs | Fix small typo (#1873) | Mar 25, 2020 |
| elasticdl | do not manual_join if no peers (#1884) | Mar 31, 2020 |
| elasticdl_preprocessing | Layer to convert tensor to ragged tensor dropping the ignored value (#… | Mar 30, 2020 |
| model_zoo | The metadata sample from SQLFlow transform expression and two corresp… | Mar 31, 2020 |
| scripts | Config pre-commit and unit test for elasticdl_preprocessing (#1847) | Mar 17, 2020 |
| tools/odps_table_tools | Add ODPS UDF to normalize table with key-value column to wide table (#… | Feb 17, 2020 |
| .clang-format | Add SGD Kernel Go Package (#1637) | Jan 17, 2020 |
| .codecov.yml | Set up code coverage report (#1327) | Oct 18, 2019 |
| .gitignore | Add SGD Kernel Go Package (#1637) | Jan 17, 2020 |
| .isort.cfg | Add the recordio gen for heart dataset (#1549) | Dec 5, 2019 |
| .pre-commit-config.yaml | Add SGD Kernel Go Package (#1637) | Jan 17, 2020 |
| .travis.yml | Retry building consensus when needed and removed unnecessary ssh port ( | Mar 27, 2020 |
| CODE_OF_CONDUCT.md | Add Code of Conduct document (#1226) | Sep 20, 2019 |
| LICENSE | Create LICENSE (#1111) | Sep 4, 2019 |
| README.md | Set up code coverage report (#1327) | Oct 18, 2019 |
| RELEASE.md | Update RELEASE.md and setup.py (#1291) | Oct 10, 2019 |
| index.html | add index.html (#1132) | Sep 8, 2019 |
| setup.py | Add pkg/kernel/cpi to build package (#1823) | Mar 13, 2020 |

README.md

ElasticDL: A Kubernetes-native Deep Learning Framework


ElasticDL is a Kubernetes-native deep learning framework built on top of TensorFlow 2.0 that supports fault-tolerance and elastic scheduling.

| | TensorFlow 1.x graph mode | TensorFlow 2.x eager execution |
| --- | --- | --- |
| No change to the runtime | Uber Horovod | ElasticDL (early stage) |
| Changes the runtime | TensorFlow ps-based distribution | TensorFlow distribution strategies |

Note that ElasticDL is still under active development, and we have not extensively tested it in production environments. We open sourced this early-stage project with the hope of encouraging further work on fault-tolerance and elastic scheduling from the community.

Main Features

Elastic Scheduling and Fault-Tolerance

Through Kubernetes-native design, ElasticDL enables fault-tolerance and works with the priority-based preemption of Kubernetes to achieve elastic scheduling for deep learning tasks.

TensorFlow 2.0 Eager Execution

A distributed deep learning framework needs access to each worker's local gradients before it can update the model. Eager execution lets ElasticDL obtain them without hacking into the graph execution process.
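
To make the "local gradients first, model update second" flow concrete, here is a minimal sketch in plain Python. It is not ElasticDL code: the function names, the linear model, and the learning rate are all assumptions chosen for illustration. In TensorFlow 2.x the local gradient would typically come from `tf.GradientTape`; the sketch computes it by hand so that it stays dependency-free.

```python
# Illustrative only: a plain-Python stand-in for the "local gradient, then
# model update" flow. None of these names come from ElasticDL.


def local_gradient(w, b, batch):
    """Gradient of the mean squared error of y ~ w*x + b on one mini-batch."""
    n = len(batch)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b


def apply_update(w, b, worker_grads, lr=0.05):
    """Average the gradients reported by the workers and take one SGD step."""
    k = len(worker_grads)
    avg_w = sum(g[0] for g in worker_grads) / k
    avg_b = sum(g[1] for g in worker_grads) / k
    return w - lr * avg_w, b - lr * avg_b


def train(shards, steps=500):
    """Each step: every worker computes a local gradient, then one update."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grads = [local_gradient(w, b, shard) for shard in shards]
        w, b = apply_update(w, b, grads)
    return w, b


if __name__ == "__main__":
    # Two hypothetical workers, each holding points on the line y = 2x + 1.
    shards = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
    print(train(shards))
```

The point of the sketch is that the update step only needs the gradient values themselves, which eager execution exposes directly as ordinary tensors.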

Minimalist Interface

Given a model defined with the Keras API, train the model with a single command.

elasticdl train --model_def=mnist_functional_api.custom_model --training_data=/mnist/train --output=output

Integration with SQLFlow

ElasticDL will be integrated seamlessly with SQLFlow, connecting SQL to distributed deep learning tasks.

SELECT * FROM employee LABEL income INTO my_elasticdl_model

Quick Start

Please check out our step-by-step tutorial for running ElasticDL on a local laptop, an on-prem cluster, or a public cloud such as Google Kubernetes Engine.

Background

TensorFlow's native distributed computing feature is fault-recoverable. If some processes fail, the distributed computing job fails; however, we can restart the job and recover its status from the most recent checkpoint files.

ElasticDL, as an enhancement of TensorFlow's distributed training feature, supports fault-tolerance. If some processes fail, the job goes on running. Therefore, ElasticDL doesn't need to checkpoint or recover from checkpoints.

Fault-tolerance lets ElasticDL work with the priority-based preemption of Kubernetes to achieve elastic scheduling. When Kubernetes kills some processes of a job to free resources for incoming jobs with higher priority, the current job doesn't fail but continues with fewer resources.

Elastic scheduling could significantly improve the overall utilization of a cluster. Suppose that a cluster has N GPUs, and a job is using one of them. Without elastic scheduling, a new job claiming N GPUs would have to wait for the first job to complete before starting. This pending time could be hours, days, or even weeks. During this very long time, the utilization of the cluster is 1/N. With elastic scheduling, the new job could start running immediately with N-1 GPUs, and Kubernetes might increase its GPU consumption by 1 after the first job completes. In this case, the overall utilization is 100%.
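
The arithmetic in this example can be written out as a small sketch. The function names and the choice of N = 8 are assumptions made for illustration; only the 1/N and 100% figures come from the text above.

```python
def utilization_without_elastic(n_gpus, first_job_gpus=1):
    """While the N-GPU job waits, only the first job's GPUs are busy."""
    return first_job_gpus / n_gpus


def utilization_with_elastic(n_gpus, first_job_gpus=1):
    """The new job starts on the remaining GPUs, so every GPU is busy."""
    new_job_gpus = n_gpus - first_job_gpus
    return (first_job_gpus + new_job_gpus) / n_gpus


if __name__ == "__main__":
    n = 8  # hypothetical cluster size
    print(utilization_without_elastic(n))  # 0.125, i.e. 1/N
    print(utilization_with_elastic(n))     # 1.0, i.e. 100%
```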

ElasticDL's elastic scheduling comes from its Kubernetes-native design. It doesn't rely on Kubernetes extensions like Kubeflow to run TensorFlow programs; instead, the master process of an ElasticDL job calls the Kubernetes API to start workers and parameter servers. The master also watches events such as process/pod killing and reacts to them to realize fault-tolerance.

In short, ElasticDL enhances TensorFlow with fault-tolerance and elastic scheduling when you have a Kubernetes cluster. We provide a tutorial showing how to set up a Kubernetes cluster on Google Cloud and run ElasticDL jobs there. We respect TensorFlow's native distributed computing feature, which doesn't require a specific computing platform like Kubernetes and allows TensorFlow to run on any platform.
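
As a rough illustration of a master that reacts to pod-kill events, the toy simulation below re-queues the work of a killed worker so that the job keeps running. This README does not describe how the master redistributes work, so the task queue, the `Master` class, and the `on_pod_event` handler are all assumptions, not ElasticDL's API. The `"DELETED"` string mirrors the event type that Kubernetes watches report; a real master would receive such events from the Kubernetes API rather than from a direct call.

```python
from collections import deque


class Master:
    """Toy master: hands out data tasks and re-queues work of killed workers."""

    def __init__(self, tasks):
        self.todo = deque(tasks)
        self.doing = {}  # worker name -> task currently assigned to it
        self.done = []

    def get_task(self, worker):
        """Assign the next pending task, or return None if none are left."""
        if not self.todo:
            return None
        task = self.todo.popleft()
        self.doing[worker] = task
        return task

    def report_done(self, worker):
        """Mark the worker's current task as finished."""
        self.done.append(self.doing.pop(worker))

    def on_pod_event(self, event_type, worker):
        """Hypothetical handler: put a killed worker's task back in the queue."""
        if event_type == "DELETED" and worker in self.doing:
            self.todo.append(self.doing.pop(worker))


def run_demo():
    master = Master(["shard-0", "shard-1", "shard-2"])
    master.get_task("worker-a")  # worker-a takes shard-0
    master.get_task("worker-b")  # worker-b takes shard-1
    master.on_pod_event("DELETED", "worker-a")  # preempted: shard-0 re-queued
    master.report_done("worker-b")
    # The surviving worker drains the queue, including the re-queued shard.
    while master.get_task("worker-b") is not None:
        master.report_done("worker-b")
    return sorted(master.done)


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the job finishes all three shards even though one worker disappears mid-task, which is the behavior the fault-tolerance claim describes.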

Development Guide

Please refer to this document for the development guide.
