TensorFlowOnSpark consumes too much executor memory #71

Closed
mayiming opened this issue Apr 26, 2017 · 10 comments

@mayiming

Hi,

I've tried the solution on a Spark-enabled YARN cluster (CPU only). I noticed that it requires very high executor memory. For the mnist example, the actual model is very small, but I needed to specify around 27GB of executor memory to get it running with 10 executors. The amount of required memory increases when I try a larger dataset.

Could someone explain a bit why it requires so much memory to run on Spark? Is it related to the message exchange between the PS and workers, as in tensorflow/tensorflow#6508?

Thank you very much for the help!
Yiming.

@leewyang
Contributor

@mayiming can you post your command line? Otherwise, you can try to run the example on a single box using the Spark Standalone instructions. Please note that in our YARN instructions, we were using 27GB as a proxy for a GPU, since YARN doesn't support scheduling by GPUs. So, if you're running on CPU, you should be able to run with much smaller memory.
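
For reference, a rough sketch of the single-box Standalone setup (the core/memory values here are illustrative placeholders, not the exact numbers from the instructions):

# start a local master and one worker (cores/memory are just examples)
export MASTER=spark://$(hostname):7077
${SPARK_HOME}/sbin/start-master.sh
${SPARK_HOME}/sbin/start-slave.sh -c 2 -m 4G ${MASTER}

# then submit against it; with CPU-only executors a much smaller heap should do, e.g.:
${SPARK_HOME}/bin/spark-submit --master ${MASTER} --executor-memory 4G ...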

@mayiming
Author

Hi,

Thanks a lot for the reply. The command line is almost identical to the mnist example:

${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 4 \
--executor-memory 27G \
--py-files /export/home/yma/TensorFlowOnSpark/tfspark.zip,/export/home/yma/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server:./Python/lib" \
--archives hdfs:///user/yma/Python.zip#Python \
--conf spark.executorEnv.PYSPARK_PYTHON=./Python/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./Python/bin/python \
--conf spark.driver.extraLibraryPath="$JAVA_HOME/jre/lib/amd64/server:./Python/lib" \
/export/home/yma/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--images /user/yma/mnist/csv/train/images \
--labels /user/yma/mnist/csv/train/labels \
--model "hdfs://default/user/yma/mnist_model"

If I reduce the executor memory to 10GB, the job gets stuck, whereas it should finish in about 5 minutes. I will give local Spark a try.

Thanks again,
Yiming.

@leewyang
Contributor

@mayiming have you been able to get local Spark working? FWIW, I tried running the MNIST example in a "low-memory" configuration, and I was able to run successfully at 2G (without tuning any Spark memory settings). Note that the Spark executor itself requires some amount of memory to run.
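
For reference, the only memory-related flag in that low-memory test was the executor heap itself (value illustrative of what worked here, not a tuning recommendation):

--executor-memory 2G    # covers the Spark executor JVM plus the small mnist example
# no spark.yarn.executor.memoryOverhead override was set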

@mayiming
Author

@leewyang Thank you very much for looking into the issue. It appears that my Linux OS is having some issues with gRPC, which could cause extra memory overhead.

Could you let me know the Linux OS version that you used for your experiment, so I can replicate the result? I'm using a RHEL 6.6 cluster. Are you aware of any issues running TFoS on this platform?

Thanks a lot,
Yiming.

@leewyang
Contributor

@mayiming We're running RHEL, but I'm not sure which specific version, and even then, it's likely customized for our env anyways.

That said, I'm not aware of anyone else reporting similar issues. Which version of tensorflow are you using? Public, pip-installed? Or git-cloned and compiled locally?

@mayiming
Author

@leewyang I built it from source; the TF version is 0.12. To get it to compile, I needed to install devtoolset-4. The Python env is 2.7.
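
For context, the toolchain setup on RHEL 6 was roughly along these lines (the package names come from Software Collections and are an approximation, not an exact recipe):

# enable the Software Collections repo appropriate for RHEL 6, then:
sudo yum install -y devtoolset-4-toolchain
scl enable devtoolset-4 bash     # puts the newer gcc/g++ on PATH for the bazel build
gcc --version                    # devtoolset-4 gcc
python --version                 # Python 2.7.x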

Also, when I increase the number of executors to 10, I typically observe that the last worker just waits for the chief worker indefinitely. Have you observed this behavior?

Thanks a lot,
Yiming.

@leewyang
Contributor

If you can, I'd recommend trying a pre-built pip package for TensorFlow, especially if you aren't using GPUs or RDMA. This should hopefully avoid any build/compile issues you might be seeing (e.g. gRPC).
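
For example, something like the following (the version pin is just an illustration matching the 0.12 line you built from source):

pip install tensorflow            # pre-built CPU-only package from PyPI
# or pin to the release line you were building, e.g.:
pip install tensorflow==0.12.1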

As for the MNIST example hanging with an increased number of executors, you might need to increase the number of --steps and/or --epochs. Note that the default settings are tuned to "not take too long" with 4 executors. By increasing the number of executors, there's a chance that not enough data is being produced to "fill" each worker's queue. You can see if this is the case by looking at the YARN logs of the executors and seeing where it's hanging.
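
Concretely, that means appending something like this to the mnist_spark.py arguments in your spark-submit (the numbers are illustrative; scale them up with the executor count):

--epochs 2 \
--steps 2000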

@mayiming
Author

mayiming commented Jun 2, 2017

@leewyang Thanks again for the pointers. After increasing the number of steps and epochs, it runs well now. However, I still need to specify around 20GB of memory through the yarn.executor.memoryOverhead parameter. Have you observed a similar issue?
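
For reference, the extra setting I'm passing looks like this (the value is approximate; I'm assuming the full Spark property name spark.yarn.executor.memoryOverhead, which takes a value in MB):

--conf spark.yarn.executor.memoryOverhead=20480   # ~20GB off-heap allowance per executor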

@leewyang
Contributor

leewyang commented Jun 7, 2017

@mayiming I just re-tried a similar command-line in my environment, and I'm able to go down to --executor-memory 1G w/ no memoryOverhead setting, so not sure why you're seeing that. Again, you may want to try the pre-built pip package just to see if that helps.

@leewyang
Contributor

leewyang commented Aug 1, 2017

Closing due to inactivity.
