TensorFlowOnSpark consumes too much executor memory #71

Closed
mayiming opened this issue Apr 26, 2017 · 10 comments

@mayiming

Hi,

I've tried the solution on a Spark-enabled YARN cluster (CPU only). I noticed that it requires very high executor memory. For the mnist example, the actual model is very small, but I needed to specify around 27GB of executor memory to get it running with 10 executors. The amount of required memory increases when I try a larger dataset.

Could someone explain a bit why it requires so much memory to run on Spark? Is it related to the message exchange between the PS and workers, as in tensorflow/tensorflow#6508?

Thank you very much for the help!
Yiming.

@leewyang
Contributor

@mayiming can you post your command line? Otherwise, you can try to run the example on a single box using the Spark Standalone instructions. Please note that in our YARN instructions, we were using 27GB as a proxy for a GPU, since YARN doesn't support scheduling by GPUs. So, if you're running on CPU, you should be able to run with much smaller memory.
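
For reference, a rough sketch of the single-box Standalone setup (the core/memory values here are illustrative placeholders, not the exact numbers from the instructions):

# start a local master and one worker (cores/memory are just examples)
export MASTER=spark://$(hostname):7077
${SPARK_HOME}/sbin/start-master.sh
${SPARK_HOME}/sbin/start-slave.sh -c 2 -m 4G ${MASTER}

# then submit against it; with CPU-only executors a much smaller heap should do, e.g.:
${SPARK_HOME}/bin/spark-submit --master ${MASTER} --executor-memory 4G ...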

@mayiming
Author

Hi,

Thanks a lot for the reply. The command line is almost identical to the mnist example:

${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 4 \
--executor-memory 27G \
--py-files /export/home/yma/TensorFlowOnSpark/tfspark.zip,/export/home/yma/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server:./Python/lib" \
--archives hdfs:///user/yma/Python.zip#Python \
--conf spark.executorEnv.PYSPARK_PYTHON=./Python/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./Python/bin/python \
--conf spark.driver.extraLibraryPath="$JAVA_HOME/jre/lib/amd64/server:./Python/lib" \
/export/home/yma/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--images /user/yma/mnist/csv/train/images \
--labels /user/yma/mnist/csv/train/labels \
--model "hdfs://default/user/yma/mnist_model"

If I reduce the executor memory to 10GB, the job gets stuck, whereas it should finish in about 5 minutes. I will give local Spark a try.

Thanks again,
Yiming.

@leewyang
Contributor

@mayiming have you been able to get local Spark working? FWIW, I tried running the MNIST example in a "low-memory" configuration, and I was able to run successfully at 2G (without tuning any Spark memory settings). Note that the Spark executor itself requires some amount of memory to run.
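
For reference, the only memory-related flag in that low-memory test was the executor heap itself (value illustrative of what worked here, not a tuning recommendation):

--executor-memory 2G    # covers the Spark executor JVM plus the small mnist example
# no spark.yarn.executor.memoryOverhead override was set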

@mayiming
Author

@leewyang Thank you very much for looking into the issue. It appears that my Linux OS is having some issues with gRPC, which could cause extra memory overhead.

Could you let me know the Linux OS version that you used for your experiment, so I can replicate the result? I'm using a RHEL 6.6 cluster. Are you aware of any issues running TFoS on this platform?

Thanks a lot,
Yiming.

@leewyang
Contributor

@mayiming We're running RHEL, but I'm not sure which specific version, and even then, it's likely customized for our env anyways.

That said, I'm not aware of anyone else reporting similar issues. Which version of tensorflow are you using? Public, pip-installed? Or git-cloned and compiled locally?

@mayiming
Author

@leewyang I built it from source; the TF version is 0.12. To get it to compile, I needed to install devtoolset-4. The Python env is 2.7.
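
For context, the toolchain setup on RHEL 6 was roughly along these lines (the package names come from Software Collections and are an approximation, not an exact recipe):

# enable the Software Collections repo appropriate for RHEL 6, then:
sudo yum install -y devtoolset-4-toolchain
scl enable devtoolset-4 bash     # puts the newer gcc/g++ on PATH for the bazel build
gcc --version                    # devtoolset-4 gcc
python --version                 # Python 2.7.x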

Also, when I increase the number of executors to 10, I typically observe that the last worker just waits for the chief worker indefinitely. Have you observed this behavior?

Thanks a lot,
Yiming.

@leewyang
Contributor

If you can, I'd recommend trying a pre-built pip package for TensorFlow, especially if you aren't using GPUs or RDMA. This should hopefully avoid any build/compile issues you might be seeing (e.g. gRPC).
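
For example, something like the following (the version pin is just an illustration matching the 0.12 line you built from source):

pip install tensorflow            # pre-built CPU-only package from PyPI
# or pin to the release line you were building, e.g.:
pip install tensorflow==0.12.1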

As for the MNIST example hanging with an increased number of executors, you might need to increase the number of --steps and/or --epochs. Note that the default settings are tuned to "not take too long" with 4 executors. By increasing the number of executors, there's a chance that not enough data is being produced to "fill" each worker's queue. You can see if this is the case by looking at the YARN logs of the executors and seeing where it's hanging.
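
Concretely, that means appending something like this to the mnist_spark.py arguments in your spark-submit (the numbers are illustrative; scale them up with the executor count):

--epochs 2 \
--steps 2000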

@mayiming
Author

mayiming commented Jun 2, 2017

@leewyang Thanks again for the pointers. After increasing the number of steps and epochs, it runs well now. However, I still need to specify around 20GB of memory through the yarn.executor.memoryOverhead parameter. Have you observed a similar issue?
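
For reference, the extra setting I'm passing looks like this (the value is approximate; I'm assuming the full Spark property name spark.yarn.executor.memoryOverhead, which takes a value in MB):

--conf spark.yarn.executor.memoryOverhead=20480   # ~20GB off-heap allowance per executor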

@leewyang
Contributor

leewyang commented Jun 7, 2017

@mayiming I just re-tried a similar command-line in my environment, and I'm able to go down to --executor-memory 1G w/ no memoryOverhead setting, so not sure why you're seeing that. Again, you may want to try the pre-built pip package just to see if that helps.

@leewyang
Contributor

leewyang commented Aug 1, 2017

Closing due to inactivity.
