
MultiWorkerMirrorStrategy Metrics Incorrectly Aggregating #64471

Open
warmbasket opened this issue Mar 26, 2024 · 3 comments
Assignees
Labels
comp:apis (Highlevel API related issues) · TF 2.15 (For issues related to 2.15.x) · type:support (Support issues)

Comments

@warmbasket

Issue type

Support

Have you reproduced the bug with TensorFlow Nightly?

No

Source

source

TensorFlow version

2.15

Custom code

Yes

OS platform and distribution

No response

Mobile device

No response

Python version

No response

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

Hi,

I'm training a model on multiple machines using MultiWorkerMirroredStrategy. I took my working model from one machine, copied it to a second machine, and added the necessary TF_CONFIG environment variables above the rest of my code. I declare MultiWorkerMirroredStrategy(), and training waits to commence until I run the code on the second machine and it joins the first.
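For context, the TF_CONFIG I set on each machine looks roughly like this (the hostnames and port are placeholders; the task index is 0 on the first machine and 1 on the second):

import json
import os

# Placeholder cluster definition; the real hostnames/ports differ.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["machine1.example.com:12345", "machine2.example.com:12345"]
    },
    "task": {"type": "worker", "index": 0}  # "index": 1 on the second machine
})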

I didn't find TensorFlow's description of this very clear, but synchronization by step and by epoch did not commence unless I wrapped the relevant code in with strategy.scope():.

Regardless, everything is working as hoped, except that my dice coefficient is being halved on saving, or on the last step of the epoch. By step and by epoch my dice values are synchronized, but whenever I save they are halved; if I remove saving, they still get halved on the last step of the epoch.

My first thought is that the all-reduce algorithm is not summing correctly, or is dividing by the number of machines when it shouldn't be. I know about the global batch size, but I'm not sure where in my code I could use it to correct the issue.

I know my dice equation isn't the problem; it has never had an issue before and is the standard dice formula. Declaring the equation inside strategy.scope() doesn't help.
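For reference, by "standard dice equation" I mean something along the lines of the usual smoothed form (the exact smoothing constant is arbitrary):

import tensorflow as tf

def dice_coef(y_true, y_pred, smooth=1.0):
    # Standard dice: 2 * |intersection| / (|y_true| + |y_pred|), with a smoothing term.
    y_true_f = tf.reshape(y_true, [-1])
    y_pred_f = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) + smooth
    )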

Any suggestions or help would be greatly appreciated!

Thank you -

Standalone code to reproduce the issue

Pseudocode outline (a fleshed-out sketch follows below):

Set TF_CONFIG on each machine
Create MultiWorkerMirroredStrategy (the issue happens with both NCCL and ring collectives)

with strategy.scope():

    the rest to train and save your model with checkpoints and callbacks
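A minimal runnable sketch of that structure (the model, data, and checkpoint path are placeholders, not my actual pipeline):

import json
import os
import tensorflow as tf

# Placeholder cluster; set "index": 1 on the second machine.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["machine1.example.com:12345", "machine2.example.com:12345"]},
    "task": {"type": "worker", "index": 0},
})

# The issue reproduces with either NCCL or RING as the collective implementation.
strategy = tf.distribute.MultiWorkerMirroredStrategy(
    communication_options=tf.distribute.experimental.CommunicationOptions(
        implementation=tf.distribute.experimental.CommunicationImplementation.NCCL
    )
)

def dice_coef(y_true, y_pred, smooth=1.0):
    # Standard smoothed dice coefficient.
    intersection = tf.reduce_sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth
    )

# Placeholder data; replace with the real input pipeline.
train_ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([64, 8]), tf.cast(tf.random.uniform([64, 1]) > 0.5, tf.float32))
).batch(16)
val_ds = train_ds

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])  # placeholder model
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=[dice_coef])

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "/home/path/model1.h5", monitor="val_loss", save_best_only=True, verbose=1
)
model.fit(train_ds, validation_data=val_ds, epochs=400, callbacks=[checkpoint])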

Relevant log output

No response

@google-ml-butler google-ml-butler bot added the type:support Support issues label Mar 26, 2024
@Venkat6871 Venkat6871 added TF 2.15 For issues related to 2.15.x comp:apis Highlevel API related issues labels Mar 27, 2024
@Venkat6871

Venkat6871 commented Mar 27, 2024

Hi @akrupien ,

I am providing an example of how you can structure your code to use MultiWorkerMirroredStrategy along with saving checkpoints and using callbacks. This example assumes you have a working model training pipeline and focuses on the TensorFlow configuration, strategy setup, and saving checkpoints. Please find the gist for reference.

Thank you!

@Venkat6871 Venkat6871 added the stat:awaiting response Status - Awaiting response from author label Mar 27, 2024
@warmbasket
Author

warmbasket commented Mar 28, 2024

Hi @Venkat6871,

Thank you very much for your response and your example! I have changed my code to match your structure, so I only build and compile my model within strategy.scope(). Synchronous training between my machines is still working, which is great. However, I am still having the issue with my dice coefficient metric. I'll attach a snippet of the training output here so you have an example.

Epoch 1/400
28/28 [==============================] - ETA: 0s - loss: 3.3312 - dice_coef: 0.0686
Epoch 1: val_loss improved from inf to 1.63522, saving model to /home/path/model1.h5
28/28 [==============================] - 184s 4s/step - loss: 1.6656 - dice_coef: 0.0343 - val_loss: 1.6352 - val_dice_coef: 0.0303

Epoch 2/400
28/28 [==============================] - ETA: 0s - loss: 3.1451 - dice_coef: 0.1314
Epoch 2: val_loss improved from 1.63522 to 1.57489, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 833ms/step - loss: 1.5726 - dice_coef: 0.0657 - val_loss: 1.5749 - val_dice_coef: 0.0431

Epoch 3/400
28/28 [==============================] - ETA: 0s - loss: 3.0354 - dice_coef: 0.1781
Epoch 3: val_loss improved from 1.57489 to 1.53716, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 828ms/step - loss: 1.5177 - dice_coef: 0.0890 - val_loss: 1.5372 - val_dice_coef: 0.0577

Epoch 4/400
28/28 [==============================] - ETA: 0s - loss: 2.9451 - dice_coef: 0.2227
Epoch 4: val_loss improved from 1.53716 to 1.51450, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 831ms/step - loss: 1.4726 - dice_coef: 0.1114 - val_loss: 1.5145 - val_dice_coef: 0.0577

Epoch 5/400
28/28 [==============================] - ETA: 0s - loss: 2.8993 - dice_coef: 0.2236
Epoch 5: val_loss improved from 1.51450 to 1.49228, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 829ms/step - loss: 1.4496 - dice_coef: 0.1118 - val_loss: 1.4923 - val_dice_coef: 0.0577

Epoch 6/400
28/28 [==============================] - ETA: 0s - loss: 2.8554 - dice_coef: 0.2237
Epoch 6: val_loss improved from 1.49228 to 1.47076, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 834ms/step - loss: 1.4277 - dice_coef: 0.1119 - val_loss: 1.4708 - val_dice_coef: 0.0577

You'll notice my dice coefficients are being divided by 2; I believe this is because I am using two machines.

It appears to be summing the dice values and losses from each machine and showing those summed values throughout the steps of the epoch, and then averaging them between the two machines when it saves. (I believe it is summing because of the loss values: my typical loss on a single machine after the first epoch is ~1.6, so a loss of 3.3 only seems achievable by summing the losses from both machines.) If I turn off checkpoint saving, it seems to average the dice values and losses on the last step of the epoch instead. I would appreciate some clarification on whether this is what is actually happening.

TensorFlow describes NCCL and ring all-reduce as summing the variables between machines, but doesn't say whether the values get averaged back out. I'd expect them to, but I can't find that stated explicitly anywhere. I am also confused as to why the summed values are shown during training rather than the averaged ones; it looks as though training throughout the epoch is happening on an accumulated dice and loss rather than an average across the machines. Wouldn't it be correct to train throughout the epoch on the average between workers rather than the sum? Otherwise the loss is artificially high.
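For what it's worth, the pattern I've seen documented for custom training loops is to scale the per-example loss by the global batch size so that the cross-replica sum comes out as a proper mean. I'm using model.fit rather than a custom loop, so this is only to illustrate the sum-vs-average question (names and values are placeholders):

import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()
GLOBAL_BATCH_SIZE = 16  # per-replica batch size * number of workers

with strategy.scope():
    # Keep per-example losses so the reduction can be controlled explicitly.
    loss_obj = tf.keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.NONE)

def compute_loss(labels, predictions):
    per_example_loss = loss_obj(labels, predictions)
    # Dividing by the *global* batch size means that when the per-replica
    # losses are summed across workers, the result is the true mean rather
    # than a value inflated by the number of machines.
    return tf.nn.compute_average_loss(per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)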

Thank you again,
@akrupien

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Mar 28, 2024
@warmbasket
Author

Any Ideas?
