
the num_targets and the max label in train.egs.csv are not equal #53

JunLi0514 opened this issue Aug 26, 2022 · 1 comment

@JunLi0514

Hi,
I tried to run the CNCeleb recipe, but it fails with a RuntimeError:

```
#### Training will run for 6 epochs.
Traceback (most recent call last):
  File "/home/ubuntu/kaldi/egs/xmuspeech/sre/subtools/pytorch/libs/training/trainer.py", line 283, in run
    loss, acc = self.train_one_batch(batch)
  File "/home/ubuntu/kaldi/egs/xmuspeech/sre/subtools/pytorch/libs/training/trainer.py", line 182, in train_one_batch
    loss = model.get_loss(model_forward(inputs), targets)
  File "/home/ubuntu/kaldi/egs/xmuspeech/sre/subtools/pytorch/libs/support/utils.py", line 157, in wrapper
    return function(self, *transformed)
  File "/home/ubuntu/kaldi/egs/xmuspeech/sre/exp/SEResnet34_am_train_fbank40/config/resnet-se-xvector.py", line 559, in get_loss
    return self.loss(inputs, targets)
  File "/home/ubuntu/miniconda3/envs/subtools/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/kaldi/egs/xmuspeech/sre/subtools/pytorch/libs/nnet/loss.py", line 360, in forward
    return self.loss_function(outputs/self.t, targets) + self.ring_loss * ring_loss
  File "/home/ubuntu/miniconda3/envs/subtools/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/subtools/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1150, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/ubuntu/miniconda3/envs/subtools/lib/python3.8/site-packages/torch/nn/functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
/opt/conda/conda-bld/pytorch_1634272172048/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:111: operator(): block: [0,0,0], thread: [55,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
```

That suggests the number of output units in the FC classifier is smaller than the largest target label.
Indeed, I find that num_targets in exp/egs/train_sequential/info is 2687, while the max label in train.egs.csv is 2711.
Could you please tell me which script generates exp/egs/train_sequential/info/num_targets?
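For anyone hitting the same opaque device-side assert: the mismatch can be reproduced on CPU, where PyTorch raises a readable out-of-bounds error instead. A minimal sketch, using the numbers from this issue (2687 classes, max label 2711) rather than any code from the recipe itself:

```python
import torch
import torch.nn.functional as F

# Hypothetical minimal repro: a classifier head sized from num_targets
# (2687) fed a label taken from train.egs.csv (2711).
num_targets = 2687
logits = torch.randn(1, num_targets)   # one example, 2687 classes
bad_target = torch.tensor([2711])      # label >= num_targets

try:
    F.cross_entropy(logits, bad_target)
except (IndexError, RuntimeError) as e:
    # On CPU this fails immediately with a clear "Target ... is out of
    # bounds" message; on GPU the same mismatch surfaces asynchronously
    # as the device-side assert shown in the traceback above.
    print(type(e).__name__, e)
```

Running the forward pass once on CPU (or with CUDA_LAUNCH_BLOCKING=1, as the error message suggests) is usually the quickest way to confirm a label/num_classes mismatch.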

@JunLi0514
Author

Thanks to syousen: for offline egs, get_chunk_egs() in subtools/pytorch/pipeline/onestep/get_chunk_egs.py generates the num_targets file.
I find that after the train/val split, the num_spks of the train set shrinks, because a speaker is removed entirely when all of their utterances land in the val set.
However, this has no effect on the utt2spk labels in utt2spk_int, because that attribute is not updated in filter() of KaldiDataset in subtools/pytorch/libs/egs/kaldi_dataset.py; only attributes belonging to 'utt_first_files' and 'spk_first_files' are updated there.
So I recommend that subtools/pytorch/pipeline/onestep/get_chunk_egs.py use dataset.num_spks instead of trainset.num_spks when generating the */info/num_targets file.
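To illustrate the mechanism described above without depending on the subtools code, here is a toy sketch (the dict and utterance names are invented for illustration): filtering drops a speaker from the train set, but the stale integer labels keep their original values, so a class count taken from the filtered set is too small for the labels that remain.

```python
# Toy example: 3 speakers labeled 0..2 before the train/val split.
utt2spk_int = {"utt1": 0, "utt2": 1, "utt3": 2}

# Speaker 1's only utterance goes entirely to the val set,
# so that speaker disappears from the train set.
val_utts = {"utt2"}
train_utts = {u: lab for u, lab in utt2spk_int.items() if u not in val_utts}

# Analogue of trainset.num_spks after filtering: 2 speakers remain...
num_spks_after_filter = len(set(train_utts.values()))

# ...but the labels were never re-indexed, so the max label is still 2,
# which requires 3 output classes. Writing num_targets=2 here is exactly
# the num_targets < max-label mismatch reported in this issue.
max_label = max(train_utts.values())
print(num_spks_after_filter, max_label + 1)  # prints: 2 3
```

This is why sizing num_targets from the original, unfiltered dataset (or re-indexing the labels after filtering) keeps the classifier head consistent with the labels in train.egs.csv.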
