MultiWorkerMirroredStrategy does not work with Keras + accuracy metric #33531
Comments
@vmarkovtsev, |
@rmothukuru You cannot reproduce it in Colab because it requires at least two physical nodes. |
Besides, you need to edit my snippet to proceed with the second epoch because the error happens during an epoch change. |
@vmarkovtsev, thanks for the report and apologies for the delay. I'm looking into this and will get back as soon as I find something. I was wondering how you set |
Looking into it, I have not been able to reproduce this using the attached code (the only difference is that I set TF_CONFIG on the two workers). That said, we can add a check before deleting the attribute. |
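For readers trying to reproduce on two machines, here is a minimal sketch of how `TF_CONFIG` could be set per worker. The host names, the port, and the helper `build_tf_config` are hypothetical and not taken from this thread; only the general `cluster`/`task` JSON layout of `TF_CONFIG` is assumed.

```python
import json
import os


def build_tf_config(worker_hosts, task_index):
    """Return a TF_CONFIG JSON string for one worker of a multi-worker cluster."""
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index must point at one of worker_hosts")
    return json.dumps({
        "cluster": {"worker": list(worker_hosts)},
        "task": {"type": "worker", "index": task_index},
    })


# Hypothetical addresses; replace with the real host:port of each machine.
WORKERS = ["host-a.example:12345", "host-b.example:12345"]

# Use index 0 on the first machine and index 1 on the second one.
# The variable has to be set before the strategy object is created.
os.environ["TF_CONFIG"] = build_tf_config(WORKERS, task_index=0)
```

Every worker gets the same `cluster` section; only the `task` index differs between machines.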
I can verify this. I independently reported #36153 which seems to be the same issue. I haven't seen an influence of the accuracy metric though and it does happen when using a single node and multiple GPUs too. It does NOT happen when using a single GPU only. It does happen when using 2 nodes with 1 GPU each. I tried the code posted here but get multiple warnings:
And then a similar error as mine:
|
So here's what I think is going on, based on seeing the same error message during my runs. TL;DR: your dataset size must be an even multiple of your "total" batch size. Walking through what I saw: I'm using a dataset of size, let's say, 3200. My batch size is 128, and I'm using datasets. Running without a strategy/dp, it works fine. I then switch over to running two nodes in MWMS and get the same error:
I realize that the true batch size is 128 * number of workers = 256. Note that 3200 is evenly divisible by 128, but not by 256. Again, I'm not sure if it's the same problem, so buyer beware. |
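The arithmetic in the comment above can be checked with a short sketch. It only illustrates the commenter's hypothesis (global batch = per-worker batch * number of workers) with the numbers quoted there; the helper name `batch_plan` is made up for this example, and nothing here confirms the root cause.

```python
def batch_plan(dataset_size, per_worker_batch, num_workers):
    """Split a dataset into full global batches and report what is left over."""
    global_batch = per_worker_batch * num_workers
    full_steps, leftover = divmod(dataset_size, global_batch)
    return {
        "global_batch": global_batch,
        "full_steps": full_steps,
        "leftover": leftover,
    }


# Numbers from the comment: 3200 samples, batch size 128.
single = batch_plan(3200, 128, num_workers=1)
multi = batch_plan(3200, 128, num_workers=2)

print(single)  # {'global_batch': 128, 'full_steps': 25, 'leftover': 0}
print(multi)   # {'global_batch': 256, 'full_steps': 12, 'leftover': 128}
```

With two workers the last 128 samples do not fill a global batch, which matches the partial final batch the commenter suspects.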
The actual issue is 2 things (I might have explained that in #36153 ):
Using those 2 it works, but it's of course a pitfall with confusing error messages. |
Based on this comment, MultiWorkerMirroredStrategy can now handle partial batches, and no error is raised with the TF 2.3.0 release. |
Hey, I have a hiccup with the multi-worker strategy when including a validation set during training, just to get a sense of whether the model overfits. Here is the error I am getting: 2020-09-01 13:17:58,695 WARNING (MainThread-32393) |
Here is the code to reproduce this issue:
`def main_fun(args, ctx):`
`    steps_per_epoch = 200`
|
System information
The same environment as in #32654
But with 2 machines instead of 1 and the TensorFlow 2.0 release from PyPI.
Describe the current behavior
I am training DenseNet121 on ImageNet with standard Keras code and a custom dataset pipeline. `model.compile` is called with "accuracy" as the only metric. I am using `MultiWorkerMirroredStrategy` as described in the tutorial. Here is the log. I had to erase ~7,000 warnings, which are all the same:
2019-10-19 12:23:10.615259: W tensorflow/core/framework/op_kernel.cc:309] OpKernelContext is tracking allocations but they are not being consumed by the StepStatsCollector.
Describe the expected behavior
The expected behavior is a successful epoch ending.
Code to reproduce the issue