TensorFlowOnSpark brings TensorFlow programs onto Apache Spark clusters

TensorFlowOnSpark

What's TensorFlowOnSpark?

TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features from the deep learning framework TensorFlow with the big-data frameworks Apache Spark and Apache Hadoop, TensorFlowOnSpark enables distributed deep learning on clusters of GPU and CPU servers.

TensorFlowOnSpark enables distributed TensorFlow training and inference on Apache Spark clusters. It seeks to minimize the code changes required to run existing TensorFlow programs on a shared grid. Its Spark-compatible API manages the TensorFlow cluster with the following steps:

  1. Reservation - reserves a port for the TensorFlow process on each executor and starts a listener for data/control messages.
  2. Startup - launches the TensorFlow main function on the executors.
  3. Data ingestion
    1. Readers & QueueRunners - leverages TensorFlow's Reader mechanism to read data files directly from HDFS.
    2. Feeding - sends Spark RDD data into the TensorFlow nodes via the feed_dict mechanism. Note that we leverage the Hadoop Input/Output Format for access to TFRecords on HDFS.
  4. Shutdown - shuts down the TensorFlow workers and PS nodes on the executors.
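The Reservation step above can be illustrated with a minimal, hypothetical sketch using plain Python sockets (this is not the actual TensorFlowOnSpark implementation; the helper name and message protocol are illustrative). Each executor binds port 0 so the OS assigns a free ephemeral port, reports that port, and listens for control messages such as a shutdown request:

```python
import socket
import threading

def reserve_port_and_listen():
    """Illustrative sketch of the 'Reservation' step: reserve a free port
    for the TensorFlow process and start a listener for control messages.
    (Hypothetical helper; not the actual TensorFlowOnSpark code.)"""
    # Binding to port 0 asks the OS for an unused ephemeral port.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("0.0.0.0", 0))
    listener.listen(1)
    _, port = listener.getsockname()

    def handle_messages():
        # Accept one control connection and acknowledge a shutdown message.
        conn, _ = listener.accept()
        with conn:
            msg = conn.recv(1024)
            if msg == b"shutdown":
                conn.sendall(b"ok")

    t = threading.Thread(target=handle_messages, daemon=True)
    t.start()
    return port, t

if __name__ == "__main__":
    port, t = reserve_port_and_listen()
    # A driver-side "control" client sends a shutdown message.
    with socket.create_connection(("127.0.0.1", port)) as c:
        c.sendall(b"shutdown")
        print(c.recv(1024).decode())
    t.join(timeout=5)
```

In the real system, the reserved port is what lets the driver assemble a TensorFlow ClusterSpec before any worker or PS process starts.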

We have also enhanced TensorFlow to support remote direct memory access (RDMA) on InfiniBand networks.

TensorFlowOnSpark was developed by Yahoo for large-scale distributed deep learning on our Hadoop clusters in Yahoo's private cloud.

Why TensorFlowOnSpark?

TensorFlowOnSpark provides some important benefits (see our blog) over alternative deep learning solutions.

  • Easily migrate existing TensorFlow programs with fewer than 10 lines of code change;
  • Support all TensorFlow functionality: synchronous/asynchronous training, model/data parallelism, inference, and TensorBoard;
  • Allow server-to-server direct communication for faster learning when available;
  • Allow datasets on HDFS and other sources to be pushed by Spark or pulled by TensorFlow;
  • Easily integrate with your existing data processing pipelines and machine learning algorithms (e.g., MLlib, CaffeOnSpark);
  • Easily deploy on cloud or on-premise, on CPUs or GPUs, over Ethernet or InfiniBand.
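The "<10 lines of code change" claim usually amounts to wrapping an existing TensorFlow main() in a map function that receives a per-executor cluster context. A minimal sketch of that shape, using a hypothetical stand-in context object (the real TensorFlowOnSpark API passes a similar `ctx` carrying the node's job name and task index; all names here are illustrative, and the TensorFlow graph-building itself is elided):

```python
from collections import namedtuple

# Hypothetical stand-in for the context object that TensorFlowOnSpark
# passes to each executor; the real object carries similar fields.
TFContext = namedtuple("TFContext", ["job_name", "task_index"])

def main_fun(argv, ctx):
    """An existing TensorFlow main(), wrapped so each Spark executor can
    run it as one member of the TensorFlow cluster."""
    if ctx.job_name == "ps":
        # Parameter servers join the cluster and serve variables.
        return "ps:%d joined" % ctx.task_index
    # Workers would build the graph and run the training loop here.
    return "worker:%d training" % ctx.task_index

if __name__ == "__main__":
    # Simulate what two different Spark executors would invoke.
    print(main_fun([], TFContext("ps", 0)))
    print(main_fun([], TFContext("worker", 1)))
```

The key design point is that the body of main_fun is the unmodified single-node program; only the entry point and the job-name/task-index plumbing change.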

Using TensorFlowOnSpark

Please check the TensorFlowOnSpark wiki for detailed documentation, including getting-started guides for YARN and AWS EC2 clusters. A Conversion Guide is provided to help you convert your existing TensorFlow programs.

Mailing List

Please join the TensorFlowOnSpark user group for discussions and questions.

License

The use and distribution terms for this software are covered by the Apache 2.0 license. See the LICENSE file for terms.