diff --git a/README.md b/README.md
index 131b925d..fc1aeb96 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ cluster with the following steps:
 1. **Startup** - launches the TensorFlow main function on the executors, along with listeners for data/control messages.
 1. **Data ingestion**
   - **InputMode.TENSORFLOW** - leverages TensorFlow's built-in APIs to read data files directly from HDFS.
-  - **InputMode.SPARK** - sends Spark RDD data to the TensorFlow nodes via the [feed_dict](https://www.tensorflow.org/how_tos/reading_data/#feeding) mechanism. Note that we leverage the [Hadoop Input/Output Format](https://github.com/tensorflow/ecosystem/tree/master/hadoop) to access TFRecords on HDFS.
+  - **InputMode.SPARK** - sends Spark RDD data to the TensorFlow nodes via the `TFNode.DataFeed` class. Note that we leverage the [Hadoop Input/Output Format](https://github.com/tensorflow/ecosystem/tree/master/hadoop) to access TFRecords on HDFS.
 1. **Shutdown** - shuts down the TensorFlow workers and PS nodes on the executors.
 
 ## Table of Contents
 
@@ -36,17 +36,17 @@ cluster with the following steps:
 ## Background
 
 TensorFlowOnSpark was developed by Yahoo for large-scale distributed
-deep learning on our Hadoop clusters in Yahoo's private cloud. 
+deep learning on our Hadoop clusters in Yahoo's private cloud.
 TensorFlowOnSpark provides some important benefits (see [our blog](http://yahoohadoop.tumblr.com/post/157196317141/open-sourcing-tensorflowonspark-distributed-deep))
 over alternative deep learning solutions.
 
- * Easily migrate all existing TensorFlow programs with <10 lines of code change;
- * Support all TensorFlow functionalities: synchronous/asynchronous training, model/data parallelism, inferencing and TensorBoard;
- * Server-to-server direct communication achieves faster learning when available;
- * Allow datasets on HDFS and other sources pushed by Spark or pulled by TensorFlow;
- * Easily integrate with your existing data processing pipelines and machine learning algorithms (ex. MLlib, CaffeOnSpark);
- * Easily deployed on cloud or on-premise: CPU & GPU, Ethernet and Infiniband.
+ * Easily migrate existing TensorFlow programs with <10 lines of code change
+ * Support all TensorFlow functionalities: synchronous/asynchronous training, model/data parallelism, inferencing, and TensorBoard
+ * Achieve faster learning via direct server-to-server communication, when available
+ * Allow datasets on HDFS and other sources to be pushed by Spark or pulled by TensorFlow
+ * Easily integrate with your existing Spark data processing pipelines
+ * Easily deploy on cloud or on-premises, on CPUs or GPUs
 
 ## Install
 
diff --git a/examples/mnist/README.md b/examples/mnist/README.md
index 3d2e5ee5..3a3864bc 100644
--- a/examples/mnist/README.md
+++ b/examples/mnist/README.md
@@ -2,6 +2,105 @@
 
 Original Source: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/python/mnist_replica.py
 
-Note: this has been heavily modified to support different input formats (CSV and TFRecords) as well as to demonstrate the different data ingestion methods (feed_dict and QueueRunner).
+Notes:
+- This assumes that you have already [installed Spark, TensorFlow, and TensorFlowOnSpark](https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_Standalone).
+- This code has been heavily modified to support different input formats (CSV and TFRecords) and different data ingestion methods (`InputMode.TENSORFLOW` and `InputMode.SPARK`); the `InputMode.SPARK` feeding loop is sketched below.
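+
+When run with `InputMode.SPARK`, the executor-side code reads the RDD partitions that Spark pushes to it through the `TFNode.DataFeed` API. Here is a minimal sketch of that feeding loop (the `map_fun` signature and batch size are illustrative; see `mnist_dist.py` for the real training loop):
+
+```python
+from tensorflowonspark import TFNode
+
+def map_fun(args, ctx):
+    """Runs on each Spark executor; `ctx` is supplied by TensorFlowOnSpark."""
+    # Connect to the manager through which Spark feeds RDD partitions.
+    tf_feed = TFNode.DataFeed(ctx.mgr)
+    while not tf_feed.should_stop():
+        # Pull the next batch of (image, label) records pushed by Spark...
+        batch = tf_feed.next_batch(100)
+        # ...and run one training step against it (e.g. via feed_dict).
+```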
-Please follow [these instructions](https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN) to run this example.
+### Download MNIST data
+
+```
+mkdir ${TFoS_HOME}/mnist
+pushd ${TFoS_HOME}/mnist
+curl -O "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
+curl -O "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz"
+curl -O "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz"
+curl -O "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"
+popd
+```
+
+### Start Spark Standalone Cluster
+
+```
+export MASTER=spark://$(hostname):7077
+export SPARK_WORKER_INSTANCES=2
+export CORES_PER_WORKER=1
+export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
+# start a master plus ${SPARK_WORKER_INSTANCES} workers, each with ${CORES_PER_WORKER} core and 3GB of memory
+${SPARK_HOME}/sbin/start-master.sh; ${SPARK_HOME}/sbin/start-slave.sh -c $CORES_PER_WORKER -m 3G ${MASTER}
+```
+
+### Convert the MNIST gzip files using Spark
+
+```
+cd ${TFoS_HOME}
+# clean up any output from prior runs:
+# rm -rf examples/mnist/csv
+${SPARK_HOME}/bin/spark-submit \
+--master ${MASTER} \
+${TFoS_HOME}/examples/mnist/mnist_data_setup.py \
+--output examples/mnist/csv \
+--format csv
+ls -lR examples/mnist/csv
+```
+
+### Run distributed MNIST training using `InputMode.SPARK`
+
+```
+# clean up any model checkpoints from prior runs:
+# rm -rf mnist_model
+${SPARK_HOME}/bin/spark-submit \
+--master ${MASTER} \
+--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
+--conf spark.cores.max=${TOTAL_CORES} \
+--conf spark.task.cpus=${CORES_PER_WORKER} \
+--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
+${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
+--cluster_size ${SPARK_WORKER_INSTANCES} \
+--images examples/mnist/csv/train/images \
+--labels examples/mnist/csv/train/labels \
+--format csv \
+--mode train \
+--model mnist_model
+
+ls -l mnist_model
+```
+
+### Run distributed MNIST inference using `InputMode.SPARK`
+
+```
+# clean up any predictions from prior runs:
+# rm -rf predictions
+${SPARK_HOME}/bin/spark-submit \
+--master ${MASTER} \
+--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
+--conf spark.cores.max=${TOTAL_CORES} \
+--conf spark.task.cpus=${CORES_PER_WORKER} \
+--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
+${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
+--cluster_size ${SPARK_WORKER_INSTANCES} \
+--images examples/mnist/csv/test/images \
+--labels examples/mnist/csv/test/labels \
+--format csv \
+--mode inference \
+--model mnist_model \
+--output predictions
+
+less predictions/part-00000
+```
+
+The prediction results should look like this (timestamps will differ, and an occasional misprediction, e.g. `Label: 5, Prediction: 6`, is expected):
+```
+2017-02-10T23:29:17.009563 Label: 7, Prediction: 7
+2017-02-10T23:29:17.009677 Label: 2, Prediction: 2
+2017-02-10T23:29:17.009721 Label: 1, Prediction: 1
+2017-02-10T23:29:17.009761 Label: 0, Prediction: 0
+2017-02-10T23:29:17.009799 Label: 4, Prediction: 4
+2017-02-10T23:29:17.009838 Label: 1, Prediction: 1
+2017-02-10T23:29:17.009876 Label: 4, Prediction: 4
+2017-02-10T23:29:17.009914 Label: 9, Prediction: 9
+2017-02-10T23:29:17.009951 Label: 5, Prediction: 6
+2017-02-10T23:29:17.009989 Label: 9, Prediction: 9
+2017-02-10T23:29:17.010026 Label: 0, Prediction: 0
+```
+
+### Shut down the Spark Standalone Cluster
+
+```
+${SPARK_HOME}/sbin/stop-slave.sh; ${SPARK_HOME}/sbin/stop-master.sh
+```
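+
+### How `mnist_spark.py` drives the cluster
+
+For reference, the driver-side wiring that `mnist_spark.py` performs looks roughly like the condensed sketch below (the cluster size and data loading here are hypothetical placeholders; the `TFCluster` method names follow the TensorFlowOnSpark API, but consult the example source for the exact arguments):
+
+```python
+from pyspark import SparkConf, SparkContext
+from tensorflowonspark import TFCluster
+import mnist_dist  # shipped via --py-files; defines map_fun(args, ctx)
+
+sc = SparkContext(conf=SparkConf().setAppName("mnist_spark"))
+num_executors = 2  # matches --cluster_size / SPARK_WORKER_INSTANCES above
+num_ps = 1         # one executor acts as the parameter server
+
+# Pair up the CSV images and labels written by mnist_data_setup.py.
+images = sc.textFile("examples/mnist/csv/train/images")
+labels = sc.textFile("examples/mnist/csv/train/labels")
+dataRDD = images.map(lambda ln: [float(x) for x in ln.split(',')]).zip(
+    labels.map(lambda ln: [float(x) for x in ln.split(',')]))
+
+args = None  # the real script forwards its parsed argparse namespace
+
+# Reserve executors, launch the TF main function on each, then feed partitions.
+cluster = TFCluster.run(sc, mnist_dist.map_fun, args, num_executors, num_ps,
+                        tensorboard=False, input_mode=TFCluster.InputMode.SPARK)
+cluster.train(dataRDD, num_epochs=1)  # push RDD partitions to the TF workers
+cluster.shutdown()                    # block until the TF nodes terminate
+```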