Training stops with queue error #4
Training with 8 1080 Ti GPUs, the run errored out after between 100 and 700 iterations:
Any suggestion? Thx.
Comments
Reduced to 4 GPUs and hit the same error again (after a few more iterations). Env: Ubuntu 16.04.5, Nvidia 1080 Ti, PyTorch 0.4.0. |
We only tested CornerNet with Python 3.6. Can you please update your Python and try again? |
Thx, after switching to Python 3.5, training now moves past the previous failure points. It would be good for the README to list Python 3 as a requirement. |
My environment is also Python 3.5, but I have encountered this problem: Exception in thread Thread-1: Have you ever encountered it? Thx! |
I changed the batch_size and chunk_sizes in the config/CornerNet.json file, and it works. |
How do you change the batch_size and chunk_sizes? |
@YiLiangNie How do you change the batch_size and chunk_sizes? |
These two values depend on your GPUs. For example, I only have one 1080 Ti graphics card, so I use batch_size: 2, chunk_sizes: [2]. batch_size must equal the sum of all the values in chunk_sizes. |
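For reference, a minimal sanity check of that relationship, assuming the two fields live under a "system" key in config/CornerNet.json with one chunk_sizes entry per GPU (the key name is an assumption based on this thread):

```python
import json

# Sketch: verify that batch_size equals the sum of chunk_sizes, where each
# chunk_sizes entry is the number of images sent to one GPU.
# The "system" key is an assumption about the config layout.
with open("config/CornerNet.json") as f:
    system_cfg = json.load(f)["system"]

assert system_cfg["batch_size"] == sum(system_cfg["chunk_sizes"]), \
    "batch_size must equal the sum of chunk_sizes"
```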
@YiLiangNie I set batch_size: 2, chunk_sizes: [2]. It does work, but it gets stuck as follows: |
It looks like the code gets stuck at either line 136 (cannot get data) or line 137 (cannot complete one training iteration) in train.py. |
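For anyone debugging the same hang, a minimal sketch that times the two suspect steps separately; pinned_training_queue and nnet are hypothetical stand-ins for whatever the training script actually uses at those lines:

```python
import time
import queue

def timed_step(pinned_training_queue, nnet, timeout=60):
    # Hypothetical helper: fetch one batch and run one training step,
    # timing each phase to see which of the two is hanging.
    t0 = time.time()
    try:
        data = pinned_training_queue.get(block=True, timeout=timeout)
    except queue.Empty:
        raise RuntimeError("data fetch hung: queue empty after %ds" % timeout)
    t1 = time.time()
    loss = nnet.train(**data)
    print("fetch took {:.1f}s, train step took {:.1f}s".format(
        t1 - t0, time.time() - t1))
    return loss
```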
I added prints at lines 136 and 137; the result is as follows:
0%| | 1/500000 [00:18<2583:50:27, 18.60s/it]THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory |
@Iric2018 It looks like it ran out of GPU memory. What GPU are you using? |
[nvidia-smi output; only the table borders survived extraction, details not recoverable] |
How can I continue training from the COCO pretrained model? My own dataset does not have 80 categories, so loading reports an error: size mismatch for module.... The relevant functions are def load_pretrained_params(self, pretrained_model): and def load_params(self, iteration): How can I modify this part of the code? Thank you very much. |
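Not the repository's code, but a common workaround for the size mismatch is to load only the tensors whose shapes still match, skipping the class-prediction heads sized for the 80 COCO categories; a sketch, assuming the checkpoint is a plain state dict:

```python
import torch

def load_pretrained_partial(model, pretrained_path):
    # Sketch: keep only checkpoint tensors whose names and shapes match the
    # current model, so heads sized for 80 classes are silently skipped.
    # Assumes torch.load() returns a plain state dict for this checkpoint.
    pretrained = torch.load(pretrained_path)
    own_state = model.state_dict()
    matched = {k: v for k, v in pretrained.items()
               if k in own_state and v.shape == own_state[k].shape}
    own_state.update(matched)
    model.load_state_dict(own_state)
    print("loaded {} of {} tensors".format(len(matched), len(own_state)))
```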
@Iric2018 Can you set the batch size to 1 and try again? FYI, I can fit 4 images on a GPU with 12 GB of memory. |
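If it still runs out of memory at batch size 1, logging GPU memory right before the failing iteration can help narrow things down; a small sketch using counters that exist in PyTorch 0.4:

```python
import torch

def log_gpu_memory(tag=""):
    # Print the current device's allocated and cached GPU memory in MB.
    mb = 1024.0 ** 2
    print("{}: allocated {:.0f} MB, cached {:.0f} MB".format(
        tag, torch.cuda.memory_allocated() / mb,
        torch.cuda.memory_cached() / mb))
```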
Have you solved this problem? I just ran 'python train.py CornerNet' and got the RuntimeError. I changed the batch_size and chunk_sizes, and then got another problem: Segmentation fault (core dumped). Do you know how to solve it, please? |
loading all datasets... Traceback (most recent call last): |
I have the same problem. |
Help, please. |