MultiWorkerMirroredStrategy Metrics Incorrectly Aggregating #64471
Comments
Hi @akrupien, I am providing an example of how you can structure your code to use MultiWorkerMirroredStrategy along with saving checkpoints and using callbacks. This example assumes you have a working model training pipeline and focuses on the TensorFlow configuration, strategy setup, and saving checkpoints. Please find the gist for reference. Thank you!
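Since the gist itself is not reproduced here, below is a minimal sketch of that structure; the cluster addresses, model, metric, and checkpoint path are placeholders, not the contents of the gist.

```python
import json
import os
import tensorflow as tf

# TF_CONFIG must be set before the strategy is created; each worker uses its own index.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1.example.com:12345", "host2.example.com:12345"]},
    "task": {"type": "worker", "index": 0},  # index 1 on the second machine
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Build and compile the model inside the strategy scope so its variables are mirrored.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])

callbacks = [
    # Periodic checkpoints; in a multi-worker setup the filepath must be reachable
    # from every worker (shared filesystem or per-worker directory).
    tf.keras.callbacks.ModelCheckpoint(filepath="/tmp/ckpt/weights.{epoch:02d}.keras"),
]

# train_dataset would be a tf.data.Dataset batched with the global batch size.
# model.fit(train_dataset, epochs=400, callbacks=callbacks)
```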
Hi @Venkat6871,
Thank you very much for your response and your example! I have changed my code to match your structure, so I only build and compile my model within strategy.scope(). Synchronous training between my machines is still working, which is great. However, I am still having the issue with my Dice coefficients/metrics. I'll attach a snippet of the training output here so you have an example.
[Training output snippet: Epoch 1/400 through Epoch 6/400; the per-step Dice and loss values did not survive formatting here.]
You'll notice my Dice coefficients are being divided by 2, which I believe is because I am using two machines. It appears to sum the Dice values and losses from each machine and show those summed values throughout the steps of the epoch, and then, when it saves, it averages the Dice values and losses between the two machines. (I believe it is summing because of the loss values: my typical loss on a single machine after the first epoch is ~1.6, so a loss of 3.3 only seems achievable by summing the losses from each machine.) If I turn off checkpoint saving, it seems to average the Dice values and losses on the last step of the epoch.
I would appreciate some clarification on whether this is what is happening. TensorFlow describes NCCL and ring all-reduce as summing the variables between machines, but does not say whether the variables get averaged back out. I'd expect them to be, but I can't find it stated explicitly anywhere. I am also confused as to why the summed values are shown during training rather than the averaged values; it looks as though training proceeds through the epoch on an accumulated Dice and loss rather than on the average between the machines. It would seem correct to me to train throughout the epoch on the average between workers rather than the sum; otherwise the loss is artificially high.
Thank you again,
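To make the sum-versus-average question concrete, the sketch below shows the two cross-replica reductions TensorFlow exposes, `ReduceOp.SUM` and `ReduceOp.MEAN`. It uses a single-machine MirroredStrategy only so it can run anywhere; it illustrates the distinction being asked about, not what `model.fit` does internally.

```python
import tensorflow as tf

# Single-machine MirroredStrategy, used here only so the sketch runs anywhere;
# the same reduction semantics apply conceptually to MultiWorkerMirroredStrategy.
strategy = tf.distribute.MirroredStrategy()

def per_replica_metric():
    # Pretend each replica computed a Dice value of 0.8 on its local shard.
    return tf.constant(0.8)

per_replica_values = strategy.run(per_replica_metric)

# ReduceOp.SUM adds the per-replica values (0.8 * num_replicas);
# ReduceOp.MEAN divides that sum back by the number of replicas.
summed = strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_values, axis=None)
averaged = strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_values, axis=None)
print(summed.numpy(), averaged.numpy())
```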
Any ideas?
Issue type
Support
Have you reproduced the bug with TensorFlow Nightly?
No
Source
source
TensorFlow version
2.15
Custom code
Yes
OS platform and distribution
No response
Mobile device
No response
Python version
No response
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behavior?
Hi,
I'm training a model on multiple machines using MultiWorkerMirroredStrategy. I took my working model from one machine, added it to another machine, and added the necessary TF_CONFIG above the rest of my code. I declare MultiWorkerMirroredStrategy(), and training waits to commence until I run the code on the second machine and it joins the first.
I didn't find TensorFlow's description of this very clear, but synchronization by step and by epoch did not commence unless I declared `with strategy.scope():`.
Regardless, everything is working as hoped, except that my Dice coefficient is being halved on saving or on the last epoch. By step and by epoch my Dice values are synchronized, but when I save them they get halved; if I remove saving, they still get halved on the last step of the epoch.
My first thought is that the all-reduce algorithm is not summing correctly, or is dividing by the number of machines when it shouldn't be. I know there's the global batch size, but I'm not sure where in my code I could use it to correct the issue.
I know my Dice equation isn't incorrect; it has never had an issue before and is the standard Dice equation. Declaring the equation inside strategy.scope() doesn't help.
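For reference, a minimal sketch of the standard Dice coefficient being referred to (the smoothing term and exact tensor handling are illustrative, not the reporter's code):

```python
import tensorflow as tf

def dice_coefficient(y_true, y_pred, smooth=1.0):
    # Standard Dice: 2*|A ∩ B| / (|A| + |B|), computed over the flattened masks,
    # with a small smoothing term to avoid division by zero.
    y_true_f = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred_f = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) + smooth)
```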
Any suggestions or help would be greatly appreciated!
Thank you -
Standalone code to reproduce the issue
Set TF_CONFIG on each worker, create MultiWorkerMirroredStrategy (the issue happens with either NCCL or ring communication), build and compile the model inside `with strategy.scope():`, then train and save the model with checkpoints and callbacks, as sketched below.
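A minimal repro skeleton along those lines might look like the following (hostnames, model, metric, and dataset are placeholders; both workers must run the script with their own `index`, and the checkpoint path must be writable on each):

```python
import json
import os
import tensorflow as tf

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1.example.com:12345", "host2.example.com:12345"]},
    "task": {"type": "worker", "index": 0},  # index 1 on the second machine
})

# The behaviour is reported with either collective implementation.
options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.RING)  # or NCCL
strategy = tf.distribute.MultiWorkerMirroredStrategy(communication_options=options)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])  # the Dice metric would be listed here

# Tiny synthetic dataset, batched with the global batch size.
x = tf.random.uniform((64, 8))
y = tf.cast(tf.random.uniform((64, 1)) > 0.5, tf.float32)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(16)

model.fit(dataset, epochs=2,
          callbacks=[tf.keras.callbacks.ModelCheckpoint("/tmp/ckpt/weights.{epoch:02d}.keras")])
```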
Relevant log output
No response