ResourceVariable save will lead to OOM in distributed mode #20914
Comments
Thanks for tracking this down. @agarwal-ashish, it looks as if you added this device assignment. Would you take a look? Thanks.
@agarwal-ashish
The intention here is to handle checkpointing of Variables placed on GPU when running with an explicit device placement policy in Eager mode. I am submitting a fix that copies the variable to the CPU on the same machine instead of overriding the device to "/device:CPU:0", which might place it on a different job.
…cate the copy of the variable on the same machine. Addresses Issue #20914. PiperOrigin-RevId: 205317119
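To make the idea behind the fix concrete, here is a minimal sketch of deriving a CPU device on the same machine from a variable's device string. It is not the code from the actual commit: `cpu_on_same_machine` is a hypothetical helper, and it assumes fully spelled-out device strings of the form `/job:.../replica:.../task:.../device:TYPE:N`.

```python
def cpu_on_same_machine(device):
    """Return the CPU:0 device that shares job/replica/task with `device`.

    Hypothetical helper for illustration only; it is not TensorFlow's API.
    """
    # Keep every component except the device type and index, so the
    # job, replica and task of the original placement are preserved.
    kept = [part for part in device.split("/")
            if part and not part.startswith("device:")]
    kept.append("device:CPU:0")
    return "/" + "/".join(kept)


# A GPU variable on ps task 3 is copied to the CPU of ps task 3.
print(cpu_on_same_machine("/job:ps/replica:0/task:3/device:GPU:0"))
# A bare device string carries no job information, so none is added.
print(cpu_on_same_machine("/device:GPU:1"))
```

Overriding the device to a bare "/device:CPU:0" drops the job and task, which is why the copy could be placed on a different machine.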
@agarwal-ashish Thanks for the reply. I made a change in my local repo; it works well and looks simple. Here is the PR: #20985. Perhaps you can check whether it is helpful.
Did you test whether the code after commit 8f130ff works for you?
@agarwal-ashish No, I have only tested my own fix, the indentation one. I will test yours later.
@agarwal-ashish Both fixes work correctly. The only difference is that you parse the device info manually, whereas I use the context.
Thanks for the fix!
System information
Describe the problem
My model is larger than 200GB, so I run it in distributed mode on CPU with 1000 workers and 100 parameter servers (ps). All the variables of my model are ResourceVariables partitioned by size, and they are placed by the default tf.train.replica_device_setter. Training is driven by a MonitoredTrainingSession.
The problem occurs when the model starts saving. The memory of ps0 rises rapidly to 200GB and the process then runs out of memory. With device placement logging enabled, I found that every variable is copied to ps0 through an Identity op when the save op runs.
In ResourceVariableSaveable, line 182 resets the device, which causes all save ops to be placed on ps0. After I removed this line and re-ran the job, saving worked correctly.
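The effect described above can be sketched with simple arithmetic. The simulation below is only an illustration: the shard count, the uniform shard size and the round-robin placement are assumptions, and `peak_save_memory_mb` is a hypothetical function, not part of TensorFlow. It compares the peak memory needed on any one ps when every variable copy is redirected to ps0 with the case where each copy stays on the ps that owns the shard.

```python
from collections import Counter

NUM_PS = 100             # parameter servers, as in the report
TOTAL_MB = 200 * 1000    # roughly 200GB of variables, as in the report
NUM_SHARDS = 1000        # assumed number of equally sized variable shards


def peak_save_memory_mb(redirect_to_ps0):
    """Peak memory (MB) any single ps needs to hold the copies made for saving."""
    shard_mb = TOTAL_MB // NUM_SHARDS
    usage = Counter()
    for shard in range(NUM_SHARDS):
        owner = shard % NUM_PS                  # assumed round-robin placement
        target = 0 if redirect_to_ps0 else owner
        usage[target] += shard_mb
    return max(usage.values())


print(peak_save_memory_mb(redirect_to_ps0=True))    # all copies land on ps0
print(peak_save_memory_mb(redirect_to_ps0=False))   # copies stay with their owner
```

Under these assumptions, redirecting every copy requires the full 200,000 MB on ps0, while keeping each copy on its owner requires 2,000 MB per ps.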
Source code / logs
Enable device placement logging to see that all save ops are placed on ps0. The log contains many such entries.
log.txt