
MultiWorkerMirrorStrategy Metrics Incorrectly Aggregating #64471

Open
warmbasket opened this issue Mar 26, 2024 · 3 comments
Assignees
Labels
comp:apis (Highlevel API related issues) · TF 2.15 (For issues related to 2.15.x) · type:support (Support issues)

Comments

@warmbasket

Issue type

Support

Have you reproduced the bug with TensorFlow Nightly?

No

Source

source

TensorFlow version

2.15

Custom code

Yes

OS platform and distribution

No response

Mobile device

No response

Python version

No response

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

Hi,

I'm training a model on multiple machines using MultiWorkerMirroredStrategy. I took my working model from one machine, copied it to a second machine, and added the necessary TF_CONFIG environment variables above the rest of my code. I declare MultiWorkerMirroredStrategy(), and training waits to commence until I run the code on the second machine and it joins the first.
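For context, the TF_CONFIG I set on each machine looks roughly like this (the hostnames and port are placeholders; the task index is 0 on the first machine and 1 on the second):

import json
import os

# Placeholder cluster definition; the real hostnames/ports differ.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["machine1.example.com:12345", "machine2.example.com:12345"]
    },
    "task": {"type": "worker", "index": 0}  # "index": 1 on the second machine
})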

I didn't find TensorFlow's description of this very clear, but synchronization by step and by epoch did not commence unless I wrapped the relevant code in with strategy.scope():.

Regardless, everything is working as hoped, except that my dice coefficient is being halved on saving, or on the last step of the epoch. By step and by epoch my dice values are synchronized, but whenever I save they are halved; if I remove saving, they still get halved on the last step of the epoch.

My first thought is that the all-reduce algorithm is not summing correctly, or is dividing by the number of machines when it shouldn't be. I know about the global batch size, but I'm not sure where in my code I could use it to correct the issue.

I know my dice equation isn't the problem; it has never had an issue before and is the standard dice formula. Declaring the equation inside strategy.scope() doesn't help.
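For reference, by "standard dice equation" I mean something along the lines of the usual smoothed form (the exact smoothing constant is arbitrary):

import tensorflow as tf

def dice_coef(y_true, y_pred, smooth=1.0):
    # Standard dice: 2 * |intersection| / (|y_true| + |y_pred|), with a smoothing term.
    y_true_f = tf.reshape(y_true, [-1])
    y_pred_f = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) + smooth
    )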

Any suggestions or help would be greatly appreciated!

Thank you -

Standalone code to reproduce the issue

Pseudocode outline (a fleshed-out sketch follows below):

Set TF_CONFIG on each machine
Create MultiWorkerMirroredStrategy (the issue happens with both NCCL and ring collectives)

with strategy.scope():

    the rest to train and save your model with checkpoints and callbacks
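A minimal runnable sketch of that structure (the model, data, and checkpoint path are placeholders, not my actual pipeline):

import json
import os
import tensorflow as tf

# Placeholder cluster; set "index": 1 on the second machine.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["machine1.example.com:12345", "machine2.example.com:12345"]},
    "task": {"type": "worker", "index": 0},
})

# The issue reproduces with either NCCL or RING as the collective implementation.
strategy = tf.distribute.MultiWorkerMirroredStrategy(
    communication_options=tf.distribute.experimental.CommunicationOptions(
        implementation=tf.distribute.experimental.CommunicationImplementation.NCCL
    )
)

def dice_coef(y_true, y_pred, smooth=1.0):
    # Standard smoothed dice coefficient.
    intersection = tf.reduce_sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth
    )

# Placeholder data; replace with the real input pipeline.
train_ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([64, 8]), tf.cast(tf.random.uniform([64, 1]) > 0.5, tf.float32))
).batch(16)
val_ds = train_ds

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])  # placeholder model
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=[dice_coef])

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "/home/path/model1.h5", monitor="val_loss", save_best_only=True, verbose=1
)
model.fit(train_ds, validation_data=val_ds, epochs=400, callbacks=[checkpoint])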

Relevant log output

No response

@google-ml-butler google-ml-butler bot added the type:support Support issues label Mar 26, 2024
@Venkat6871 Venkat6871 added TF 2.15 For issues related to 2.15.x comp:apis Highlevel API related issues labels Mar 27, 2024
@Venkat6871

Venkat6871 commented Mar 27, 2024

Hi @akrupien ,

I am providing an example of how you can structure your code to use MultiWorkerMirroredStrategy along with saving checkpoints and using callbacks. This example assumes you have a working model training pipeline and focuses on the TensorFlow configuration, strategy setup, and saving checkpoints. Please find the gist for reference.

Thank you!

@Venkat6871 Venkat6871 added the stat:awaiting response Status - Awaiting response from author label Mar 27, 2024
@warmbasket
Author

warmbasket commented Mar 28, 2024

Hi @Venkat6871,

Thank you very much for your response and your example! I have changed my code to match your structure, so I only build and compile my model within strategy.scope(). Synchronous training between my machines is still working, which is great. However, I am still having the issue with my dice coefficient metric. I'll attach a snippet of the training output here so you have an example.

Epoch 1/400
28/28 [==============================] - ETA: 0s - loss: 3.3312 - dice_coef: 0.0686
Epoch 1: val_loss improved from inf to 1.63522, saving model to /home/path/model1.h5
28/28 [==============================] - 184s 4s/step - loss: 1.6656 - dice_coef: 0.0343 - val_loss: 1.6352 - val_dice_coef: 0.0303

Epoch 2/400
28/28 [==============================] - ETA: 0s - loss: 3.1451 - dice_coef: 0.1314
Epoch 2: val_loss improved from 1.63522 to 1.57489, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 833ms/step - loss: 1.5726 - dice_coef: 0.0657 - val_loss: 1.5749 - val_dice_coef: 0.0431

Epoch 3/400
28/28 [==============================] - ETA: 0s - loss: 3.0354 - dice_coef: 0.1781
Epoch 3: val_loss improved from 1.57489 to 1.53716, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 828ms/step - loss: 1.5177 - dice_coef: 0.0890 - val_loss: 1.5372 - val_dice_coef: 0.0577

Epoch 4/400
28/28 [==============================] - ETA: 0s - loss: 2.9451 - dice_coef: 0.2227
Epoch 4: val_loss improved from 1.53716 to 1.51450, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 831ms/step - loss: 1.4726 - dice_coef: 0.1114 - val_loss: 1.5145 - val_dice_coef: 0.0577

Epoch 5/400
28/28 [==============================] - ETA: 0s - loss: 2.8993 - dice_coef: 0.2236
Epoch 5: val_loss improved from 1.51450 to 1.49228, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 829ms/step - loss: 1.4496 - dice_coef: 0.1118 - val_loss: 1.4923 - val_dice_coef: 0.0577

Epoch 6/400
28/28 [==============================] - ETA: 0s - loss: 2.8554 - dice_coef: 0.2237
Epoch 6: val_loss improved from 1.49228 to 1.47076, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 834ms/step - loss: 1.4277 - dice_coef: 0.1119 - val_loss: 1.4708 - val_dice_coef: 0.0577

You'll notice my dice coefficients are being divided by 2; I believe this is because I am using two machines.

It appears to be summing the dice values and losses from each machine and showing those summed values throughout the steps of the epoch, and then averaging them between the two machines when it saves. (I believe it is summing because of the loss values: my typical loss on a single machine after the first epoch is ~1.6, so a loss of 3.3 only seems achievable by summing the losses from both machines.) If I turn off checkpoint saving, it seems to average the dice values and losses on the last step of the epoch instead. I would appreciate some clarification on whether this is what is actually happening.

TensorFlow describes NCCL and ring all-reduce as summing the variables between machines, but doesn't say whether the values get averaged back out. I'd expect them to, but I can't find that stated explicitly anywhere. I am also confused as to why the summed values are shown during training rather than the averaged ones; it looks as though training throughout the epoch is happening on an accumulated dice and loss rather than an average across the machines. Wouldn't it be correct to train throughout the epoch on the average between workers rather than the sum? Otherwise the loss is artificially high.
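For what it's worth, the pattern I've seen documented for custom training loops is to scale the per-example loss by the global batch size so that the cross-replica sum comes out as a proper mean. I'm using model.fit rather than a custom loop, so this is only to illustrate the sum-vs-average question (names and values are placeholders):

import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()
GLOBAL_BATCH_SIZE = 16  # per-replica batch size * number of workers

with strategy.scope():
    # Keep per-example losses so the reduction can be controlled explicitly.
    loss_obj = tf.keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.NONE)

def compute_loss(labels, predictions):
    per_example_loss = loss_obj(labels, predictions)
    # Dividing by the *global* batch size means that when the per-replica
    # losses are summed across workers, the result is the true mean rather
    # than a value inflated by the number of machines.
    return tf.nn.compute_average_loss(per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)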

Thank you again,
@akrupien

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Mar 28, 2024
@warmbasket
Author

Any Ideas?
