Training is stuck at some point, I'm not sure if it is a PyTorch problem #140
Comments
@bowenc0221 , I am facing similar issues. This usually happens when GPU memory usage approaches its maximum limit, i.e., when the GPU is almost out of memory.
@ShethR Thanks for your reply.
I also ran into this problem when using TITAN V and V100 GPUs with 2 images per GPU.
As @bowenc0221 mentioned, this problem is caused by a deadlock inside the dataloader when num_workers > 1.
@ShethR If setting num_workers=1 solves the problem, there is probably a problem in the __getitem__ method. (I'm not sure whether the deadlock is caused by conflicting operations across threads.)
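A minimal sketch of that isolation test (the toy dataset below is a placeholder, not the dataset from this issue): with num_workers=0, batches are loaded in the main process, so if training still hangs, the problem is inside __getitem__ rather than a worker-process deadlock.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Placeholder standing in for the real dataset in this issue."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # If training still hangs with num_workers=0, the bug is in
        # this method, not in the worker processes.
        return torch.randn(3, 32, 32), idx % 2

# num_workers=0 loads every batch in the main process: no worker
# subprocesses exist, so a worker-side deadlock cannot occur.
loader = DataLoader(ToyDataset(), batch_size=2, num_workers=0)

for images, labels in loader:
    print(images.shape, labels)
```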
Actually, setting num_workers=1 doesn't solve the problem for me. Is there any other method? Thanks.
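Not a confirmed fix for this issue, but two workarounds commonly suggested for silent DataLoader hangs: disable OpenCV's internal thread pool (a frequently reported cause of deadlocks in forked worker processes), and pass a timeout to the DataLoader so a stuck worker raises a RuntimeError instead of hanging forever. A sketch, assuming OpenCV is used somewhere in the data pipeline and substituting a stand-in TensorDataset:

```python
import cv2
import torch
from torch.utils.data import DataLoader, TensorDataset

# OpenCV's thread pool is a commonly reported source of deadlocks in
# fork()-based DataLoader workers; disabling it is a frequently
# suggested workaround (assumption: may not apply to this issue).
cv2.setNumThreads(0)

# Stand-in data; replace with the real dataset.
data = TensorDataset(torch.randn(64, 3, 32, 32),
                     torch.zeros(64, dtype=torch.long))

# timeout (in seconds) makes batch collection raise a RuntimeError if
# a worker fails to deliver in time, turning a silent hang into a
# visible error.
loader = DataLoader(data, batch_size=2, num_workers=4, timeout=120)

for images, labels in loader:
    pass
```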
Same here, any updates? Thanks.
Same here. In multi-GPU training, one GPU's utilization drops to 0% while the rest stay at 100%. No error message. Any update?
Same here... any solution yet?
Expected results
Training should complete without hanging.
Actual results
Training is stuck at [Step 553061 / 720000]. GPU utilization is 0% but memory is not released.
I waited for 2 days, but it didn't resume, so I killed the job.
It seems the problem is caused by a dataloader deadlock; I got the following message when I killed the job:
Detailed steps to reproduce
System information