TensorFlowOnSpark consumes too much executor memory #71
Comments
@mayiming can you post your command line? Otherwise, you can try to run the example on a single box using the Spark Standalone instructions. Please note that in our YARN instructions, we were using 27GB as a proxy for a GPU, since YARN doesn't support scheduling by GPUs. So, if you're running on CPU, you should be able to run with much smaller memory.
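As a sketch of the low-memory Standalone run suggested above (the master URL, paths, and executor sizing are placeholders for your environment; the script path and `--cluster_size`/`--images`/`--labels`/`--model` flags follow the TensorFlowOnSpark MNIST example, but check them against your checkout):

```shell
# Hedged sketch: MNIST example on Spark Standalone with modest executor memory.
# All values below are placeholders -- adjust for your cluster.
${SPARK_HOME}/bin/spark-submit \
  --master spark://your-master:7077 \
  --executor-memory 2G \
  examples/mnist/spark/mnist_spark.py \
  --cluster_size 4 \
  --images mnist/csv/train/images \
  --labels mnist/csv/train/labels \
  --model mnist_model
```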
Hi, thanks a lot for the reply. The command line is almost identical to the MNIST example: ${SPARK_HOME}/bin/spark-submit If I reduce the executor memory to 10GB, the job gets stuck, whereas it should finish in 5 minutes. I will give local Spark a try. Thanks again,
@mayiming have you been able to get local Spark working fine? FWIW, I tried running the MNIST example in a "low-memory" configuration, and I was able to successfully run at 2G (without tuning any Spark memory settings). Note that the Spark executor itself requires some amount of memory to run.
@leewyang Thank you very much for looking into the issue. It appears that my Linux OS is having some issues with gRPC, which could cause extra memory overhead. Could you let me know the Linux OS version that you used for your experiment, so I can replicate the result? I'm using a RHEL 6.6 cluster. Are you aware of any issues running TFoS on this platform? Thanks a lot,
@mayiming We're running RHEL, but I'm not sure which specific version, and even then, it's likely customized for our env anyway. That said, I'm not aware of anyone else reporting similar issues. Which version of TensorFlow are you using? Public, pip-installed? Or git-cloned and compiled locally?
@leewyang I built it from source; the TF version is 0.12. To get it to compile, I needed to install devtoolset-4. The Python env is 2.7. Also, when I increase the number of executors to 10, I typically observe that the last worker just waits for the chief worker indefinitely. Have you observed this behavior? Thanks a lot,
If you can, I'd recommend trying a pre-built pip package for TensorFlow, especially if you aren't using GPUs or RDMA. This should hopefully avoid any build/compile issues you might be seeing (e.g. gRPC). As for the MNIST example hanging with an increased number of executors, you might need to increase the number of steps and/or epochs.
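For reference, the relevant knobs look something like this (a hedged sketch: `--epochs` and `--steps` are the flag names used by the TensorFlowOnSpark MNIST example script; verify them with `--help` in your checkout):

```shell
# Hedged sketch: scaling up steps/epochs when running with more executors,
# so each worker has enough work to reach the shutdown barrier.
${SPARK_HOME}/bin/spark-submit \
  examples/mnist/spark/mnist_spark.py \
  --cluster_size 10 \
  --epochs 2 \
  --steps 2000 \
  --images mnist/csv/train/images \
  --labels mnist/csv/train/labels \
  --model mnist_model
```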
@leewyang Thanks again for the pointers. After increasing the steps and the number of epochs, it runs well now. However, I still need to specify around 20GB of memory through the spark.yarn.executor.memoryOverhead parameter. Have you observed a similar issue?
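To see why the overhead setting matters: YARN kills a container whose total footprint exceeds executor memory plus the overhead allowance, and native TensorFlow allocations land in the overhead portion. A small sketch of the container sizing arithmetic, assuming the max(384 MB, 10% of executor memory) default that Spark 1.x/2.x use for `spark.yarn.executor.memoryOverhead`:

```shell
#!/bin/sh
# Sketch of YARN container sizing for a Spark executor.
# Assumes the default overhead: max(384 MB, 10% of spark.executor.memory).
executor_mem_mb=20480   # e.g. --executor-memory 20G

overhead_mb=$(( executor_mem_mb / 10 ))
if [ "$overhead_mb" -lt 384 ]; then overhead_mb=384; fi

container_mb=$(( executor_mem_mb + overhead_mb ))
echo "container request: ${container_mb} MB"   # -> container request: 22528 MB
```

Native allocations made by the TensorFlow runtime (outside the JVM heap) only fit inside that overhead slice, which is why raising `spark.yarn.executor.memoryOverhead` can matter more than raising executor memory itself.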
@mayiming I just re-tried a similar command-line in my environment, and I'm able to go down to
Closing due to inactivity. |
Hi,
I've tried the solution on a Spark-enabled YARN cluster (CPU only). I noticed that it requires very high executor memory. For the MNIST example the actual model is very small, but I need to specify around 27GB of executor memory to get it running with 10 executors. The amount of required memory increases when I try a larger dataset.
Could someone explain why it requires so much memory to run on Spark? Is it related to the message exchange between the ps and workers, as in tensorflow/tensorflow#6508?
Thank you very much for the help!
Yiming.