
When training on my own dataset, an error occurs when num_classes is changed to match the dataset's categories; the default value also raises an error #42

Open
hx358031364 opened this issue Sep 13, 2021 · 0 comments


AMP not enabled. Training in float32.
Using native Torch DistributedDataParallel.
Scheduled epochs: 310
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [15,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "main.py", line 948, in <module>
    main()
  File "main.py", line 664, in main
    optimizers=optimizers)
  File "main.py", line 782, in train_one_epoch
    output = model(input)
  File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 610, in forward
    self._sync_params()
  File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1048, in _sync_params
    authoritative_rank,
  File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 979, in _distributed_broadcast_coalesced
    self.process_group, tensors, buffer_size, authoritative_rank
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
  what(): NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:136, unhandled cuda error, NCCL version 2.7.8
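Note: this ScatterGatherKernel.cu "index out of bounds" assertion typically fires when a target label falls outside [0, num_classes), e.g. labels numbered 1..N while num_classes is N, or a stray label left over from the original dataset. Running once on CPU, or with CUDA_LAUNCH_BLOCKING=1, usually produces a clearer stack trace pointing at the offending op. Below is a minimal sketch for checking the label range before training; the helper check_label_range and the synthetic dataset are illustrative and not part of this repo's main.py.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def check_label_range(loader, num_classes):
    """Scan a loader and collect every target label outside [0, num_classes)."""
    bad = set()
    for _, target in loader:
        t = torch.as_tensor(target)
        bad.update(t[(t < 0) | (t >= num_classes)].tolist())
    return sorted(bad)

if __name__ == "__main__":
    # Synthetic example: 6 distinct labels (0..5) but num_classes=5, the same
    # kind of mismatch that trips the device-side assert above.
    images = torch.randn(8, 3, 32, 32)
    labels = torch.tensor([0, 1, 2, 3, 4, 5, 1, 0])
    loader = DataLoader(TensorDataset(images, labels), batch_size=4)
    offending = check_label_range(loader, num_classes=5)
    if offending:
        print(f"labels outside [0, 5): {offending}")  # prints [5]
```

If this check reports out-of-range labels, either remap them to a contiguous 0-based range or set num_classes to (max label + 1).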
