[TF 2.0] Using keras.metrics in TPU training results in error #33517
Comments
@georgealexandruvlad, as this is more related to TF models, please post this in the TF models repo. Thanks!
@gadagashwini The thing is that the model works with accuracy and it doesn't with other tf.keras.metrics, so the problem shouldn't be in the model's implementation if the only change is adding an additional metric to be evaluated. I implemented a simple example that still produces the error mentioned: https://colab.research.google.com/drive/18okZncYBJOrd9AZxy4tCrHQ3O3s_tNsl. The question is whether this is a bug or I am doing something wrong. There is very little documentation on TPUs, and for the other distribution strategies this case is documented as working in the TensorFlow guide pages.
@ymodak Could I have some insight into what is causing the problem? Is it a bug, or are keras.metrics not yet supported on TPUs? And if they are not supported, is there any alternative?
@gadagashwini Could you give me some information regarding the issue? I would like to know whether this is a problem that can't be resolved in the next couple of days/weeks (I didn't really get a response), or whether I should look for other solutions in the meantime (like using TF 1.x).
Hi, sorry about the breakage. The internal version of this issue got routed to me yesterday, and we should have a fix very soon today (at least in our nightly release). The root cause is that our compiler had trouble handling conditionals with dynamic shapes, which are introduced by the "Assert" operation in Metric. @rxsang also added an option to disable the dynamic-shape behavior; IIRC you can enable that by setting strategy.experimental_enable_dynamic_batch_size = False
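A hedged sketch of that workaround, assuming a Colab/Cloud TPU runtime; the attribute name comes from the comment above and may not exist in all TF versions:

```python
import tensorflow as tf

# Connect to the TPU runtime (assumes one is attached, e.g. on Colab).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

strategy = tf.distribute.experimental.TPUStrategy(resolver)
# Workaround mentioned above: disable dynamic batch-size handling so the
# XLA compiler does not see conditionals with dynamic shapes.
strategy.experimental_enable_dynamic_batch_size = False

with strategy.scope():
    # build and compile the model here
    pass
```

This requires actual TPU hardware to run, and the flag is an experimental attribute, so it may have been renamed or removed in later releases.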
@yunxing Thanks for the update! Looking forward to the next nightly release then.
Hi, I am having a similar dynamic dimension issue when using OneHot:
Is that related, or should I open a separate issue? 😄
Hi ahmadsalim@, we should have already fixed this issue a while ago. Which TF version are you using?
@yunxing Thanks for the response. I am using 1.14, should I upgrade to 2.0 instead? 😄
Missed the notification. This should be fixed in the nightly releases, do you have access to those? I remember we also have a 1.x nightly release which should include the fix. cc @rxsang who is more familiar with this than me.
tf.keras.metrics are not working with TPUs on Colab with TensorFlow 2.3: https://colab.research.google.com/drive/1C09OUXP-7Es4KIthVA6daRcGq_bJKGe8#scrollTo=hbXc0o4p2W0a
TPUs are not working on Colab either, with TensorFlow 2.4: https://colab.research.google.com/drive/1slQKTzSOnE9U70QCQCoGJrcyXCdqRJwz#scrollTo=3Qz6XSPEDsyZ
Could you try nightly? The fix yunxing@ mentioned may not be in the TF 2.4 release yet; it should be in TF 2.5 and nightly.
Tried nightly, still experiencing the same issue, see my notebook on Colab: https://colab.research.google.com/drive/1PVBomGMDz5zbbCgSUTCKm6gIMQ2eG6V-?usp=sharing
Anyone able to resolve the error? I am facing the same issue.
I am trying to train a BERT model from https://github.com/tensorflow/models/tree/master/official/nlp on a TPU in Google Colab. I changed the metrics list passed to the model in the compile method to:
where get_metrics is a function that returns a list of metrics ("accuracy" plus instances of the Recall and Precision metrics built into tensorflow.keras.metrics):
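The get_metrics helper itself is not shown in the thread; a minimal sketch consistent with the description above (function and metric names are assumptions) could be:

```python
import tensorflow as tf

def get_metrics():
    # "accuracy" plus built-in Recall and Precision instances, as
    # described above; the explicit name arguments are illustrative.
    return [
        "accuracy",
        tf.keras.metrics.Recall(name="recall"),
        tf.keras.metrics.Precision(name="precision"),
    ]
```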
Training results in the following error (after one epoch ends, before validation statistics are displayed):
With only "accuracy" returned it works well, finishing all epochs. With custom metrics like:
it also works, but the values returned are not correct. I suppose this happens because the TPU computes X steps per loop, which somehow (I didn't dig too deeply into it) corrupts the output metric. I tried the built-in metrics to verify the behavior, but that resulted in the error mentioned above.
Snippet of the training call (the function is called run_keras_compile_fit in the GitHub link I provided and can be found in bert/run_classifier.py, with almost no custom code added):
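The actual snippet is elided above; a simplified sketch of the compile-and-fit pattern being described (the model, data, and hyperparameters are placeholders, not the BERT classifier from the repo) might look like:

```python
import numpy as np
import tensorflow as tf

# Tiny placeholder binary classifier standing in for the BERT model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Same metrics list as in the issue: accuracy plus Recall and Precision.
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy",
             tf.keras.metrics.Recall(),
             tf.keras.metrics.Precision()],
)

# Random stand-in data; the real call feeds a tf.data pipeline of
# tokenized examples inside a TPUStrategy scope.
x = np.random.rand(32, 8).astype("float32")
y = np.random.randint(0, 2, size=(32, 1)).astype("float32")
history = model.fit(x, y, epochs=1, batch_size=8, verbose=0)
```

On CPU/GPU this runs fine; per the thread, it is only under a TPU strategy that the Recall/Precision metrics triggered the dynamic-shape compiler error.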
In Colab I installed the stable release of TensorFlow 2.0, as the nightly version doesn't work well with Colab's TPUs for now. Are Keras metrics supposed to work with TPUs, or is this not yet a supported feature?