CancelledError: [_Derived_]RecvAsync is cancelled. #45594
Comments
Please share a Colab link or simple standalone code to reproduce the issue in our environment. It helps us localize the issue faster. Thanks!
@ravikyram I am going to try to build a small example, but the nature of this issue does not make this easy, as nobody seems to know where it is coming from (if one follows what people post e.g. in #33721). At this point I think it would help if somebody could help us track the source of this issue down on our side, e.g. somebody who has any clue whatsoever what this error actually means. Could you mention somebody here who you think could help us all track this issue down somehow?
@stefan-falk Is it possible to test with recent TF versions (e.g. tf-nightly)?
@jvishnuvardhan I can certainly try that. Can I simply upgrade to tf-nightly?
@jvishnuvardhan Apparently I cannot do this with tf-nightly just like so. It appears that there are some breaking changes which I have to adapt to first. Update: I switched to 2.4.0.
@jvishnuvardhan @ravikyram I have upgraded to 2.4.0. The problem persists though: after a few hours of training, the program crashes. Is there anything else I can do in order to track this issue down?
One of my latest changes was using bucket_by_sequence_length. With that being said, I have no clue where this is coming from. I am 100% sure I didn't have this in 2.1.0, and I am not even sure if I had it with 2.3.0, but I certainly got the issue with 2.3.1. I am not saying the problem is TensorFlow, maybe I am doing something wrong somewhere, but I have no idea how I could possibly track the source of this down.
@jvishnuvardhan @ravikyram I think I have fixed the issue. The root of this was bucket_by_sequence_length. What seems to happen here is that there are batches which do not have enough samples, such that there weren't enough examples for all cards (GPUs). Since I set drop_remainder=True the problem seems to be gone. I don't know whether it is possible to raise an error or log a warning in such a case, because the current error message is not really a good indicator of where to look.
This code does not reproduce the error from above exactly, but I think this is what is happening. If I run the code below on e.g. 4 GPUs it will simply crash because there will be a batch with just one example. I guess we can expect something like this, but to me it was not very obvious where to look.

```python
import tensorflow as tf  # v2.4.0
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def sample_generator(nb_samples):
    for i in range(nb_samples):
        l = np.random.randint(6, 20)
        yield np.random.rand(l, 8), np.random.rand(1, 1)
    # One example for bucket (1, 5)
    yield np.random.rand(3, 8), np.random.rand(1, 1)


def sample_len(sample, *_):
    return tf.shape(sample)[0]


nb_replica = max(1, len(tf.config.experimental.list_physical_devices('GPU')))
assert nb_replica > 1, f'Number of GPUs must be >1 got {nb_replica}'

dataset = tf.data.Dataset.from_generator(
    lambda: sample_generator(500),
    output_types=(tf.float32, tf.float32),
    output_shapes=((None, 8), (None, 1))
)

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
dataset = dataset.with_options(options)

boundaries = [5, 10]
# One positive batch size per bucket, each a multiple of the replica count
batch_sizes = [(i + 1) * nb_replica for i in range(len(boundaries) + 1)]

bucketing = tf.data.experimental.bucket_by_sequence_length(
    sample_len,
    bucket_boundaries=boundaries,
    bucket_batch_sizes=batch_sizes,
    drop_remainder=False  # the single example in bucket (1, 5) is kept and crashes the run
)

dataset = dataset.apply(bucketing).repeat()

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    inputs = layers.Input(shape=(None, 8))
    x = inputs
    x = layers.LSTM(16)(x)
    x = layers.Dense(1)(x)
    model = keras.Model(inputs=inputs, outputs=x)
    model.compile(loss='mse')

model.fit(
    dataset,
    epochs=2,
    steps_per_epoch=100,
)
```

Output:
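For reference, a short sketch of the fix described above applied to the same pipeline: with drop_remainder=True the single-example bucket batch is dropped, so every batch that reaches MirroredStrategy can be split across all replicas.

```python
bucketing = tf.data.experimental.bucket_by_sequence_length(
    sample_len,
    bucket_boundaries=boundaries,
    bucket_batch_sizes=batch_sizes,
    drop_remainder=True  # drop incomplete bucket batches instead of emitting them
)
dataset = dataset.apply(bucketing).repeat()
```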
I was running an RNN on Kaggle. The error message I encounter is the same CancelledError: [_Derived_]RecvAsync is cancelled. Apparently something was not right with the embedding layer. This is my embedding layer:

```python
layers.Embedding(
    input_dim=SIZE_VOCAB,
    output_dim=EMBED_DIM,
    mask_zero=True,
    input_length=MAX_SEQ_LEN,
),
```

The suspect is mask_zero=True. After I comment out that argument, the error goes away. The code example is available here
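A small sketch of the workaround described above (my reconstruction, not the original Kaggle notebook; SIZE_VOCAB, EMBED_DIM, MAX_SEQ_LEN and the GRU head are placeholders): the same kind of embedding layer, just without mask_zero=True, so padded positions are fed through as ordinary tokens instead of producing a mask for the recurrent layer.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Placeholder hyperparameters, not taken from the original post.
SIZE_VOCAB = 10_000
EMBED_DIM = 128
MAX_SEQ_LEN = 200

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        layers.Input(shape=(MAX_SEQ_LEN,)),
        # mask_zero=True was the suspected trigger; it is omitted here, so
        # padding tokens are treated as ordinary inputs instead of being masked.
        layers.Embedding(input_dim=SIZE_VOCAB, output_dim=EMBED_DIM,
                         input_length=MAX_SEQ_LEN),
        layers.GRU(64),
        layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
```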
Also getting this error after training for ~6 epochs. I am using a GRU with embeddings (tf nightly gpu 2.5.0-dev20210115).
Is there any solution to this for multi-GPU training?
@stefan-falk thanks for digging into this! Definitely looks like a batch issue. Thanks for your help :)
I have also suffered from the same issue in my NER model with a mirrored strategy using 2 GPUs. The environment I used is as follows: The example code uses the MNIST data, which has 60000 examples in the training data and 10000 examples in the test data; the setup I used is roughly sketched below.
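A minimal sketch of that kind of experiment, assuming an LSTM over the 28 image rows (the cuDNN error points at an RNN layer); the model, the GLOBAL_BATCH_SIZE / LAST_BATCH_SIZE names, and the truncation trick are my own reconstruction, not the commenter's exact code.

```python
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype('float32') / 255.0

GLOBAL_BATCH_SIZE = 100                 # 50 per replica with 2 GPUs
LAST_BATCH_SIZE = 30                    # 1..50 reportedly fails, 51..100 reportedly works
num_samples = 59000 + LAST_BATCH_SIZE   # 590 full batches plus one partial batch

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28)),   # 28 timesteps of 28 features
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Per the observation above, when LAST_BATCH_SIZE is at most the per-replica
# batch size (50), one replica seems to get all remaining samples and the other
# gets none, which is where CUDNN_STATUS_BAD_PARAM / RecvAsync errors show up.
model.fit(x_train[:num_samples], y_train[:num_samples],
          batch_size=GLOBAL_BATCH_SIZE, epochs=1)
```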
In this experiment, when the number of samples in the last batch was 1 to 50, model.fit caused the CUDNN_STATUS_BAD_PARAM error, while a number between 51 and 100 did not cause any error (it learned correctly with 2 GPUs). From this experiment, I come to the conclusion that the issue seems to occur when the last batch has no more samples than the replica batch size. It seems that each replica does not get the same number of samples from one batch of data (i.e., a non-even distribution among replicas). For example, assume 2 GPUs (GPU0 and GPU1), a global batch size of 100, and 50 samples in the last batch. GPU0 seems to get all of the first 50 samples and GPU1 gets no samples. I am not sure, but I think this situation is what causes the issue.

To prevent this situation, we could simply set drop_remainder to True as commented by @stefan-falk, or select the missing number of samples from the training data (randomly or based on some criterion like a sample's weight) and add them to the training data so that every batch has the same number of samples (i.e., the global batch size). The former is a very simple solution but loses some training samples, while the latter does not. Through this approach, I could resolve the issue in my NER model training.

Validation / test data may cause the same error. When I used the MNIST test data for validation with Model.fit(), an insufficient number of samples in the last validation batch did not cause any error. However, in my NER model training with CoNLL2003 data, an insufficient number of samples in the last validation batch caused the same error.
For the validation and test data, I didn't use the dropping or adding method because it may make the evaluation results incomparable. Instead of dropping or adding, I dynamically changed the validation or test batch size, increasing it one by one until the number of samples in the last batch gets larger than the replica batch size. Through this method, I could solve the same issue for the validation and test data.
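A minimal sketch of that batch-size search; the helper name and the example numbers are made up, only the rule (increase the batch size until the last batch is larger than the per-replica batch size) is taken from the comment above.

```python
def find_safe_eval_batch_size(num_samples, start_batch_size, num_replicas):
    """Grow the batch size until the last (partial) batch is either empty or
    larger than the per-replica share, so no replica ends up without data."""
    batch_size = start_batch_size
    while True:
        remainder = num_samples % batch_size
        if remainder == 0 or remainder > batch_size // num_replicas:
            return batch_size
        batch_size += 1

# e.g. 3250 validation examples, desired global batch size 100, 2 GPUs:
# 3250 % 100 == 50 is not larger than the per-replica share of 50,
# but 3250 % 102 == 88 is, so the search returns 102.
print(find_safe_eval_batch_size(3250, 100, 2))
```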
Note
I am opening this issue because the error I am describing seems to affect quite a few people. See Related Issues below; those have been closed "due to inactivity".
System information
Describe the current behavior
The training fails seemingly randomly with CancelledError: [_Derived_]RecvAsync is cancelled. All cases seem to have a recurrent layer in common (see Related Issues). After starting, the training will run (in my case) for some time and then just crash with the above error.
Describe the expected behavior
Don't crash.
Standalone code to reproduce the issue
There are some in #33721.
Other info / logs
Related Issues