
the num_targets and the max label in train.egs.csv are not equal #53

JunLi0514 opened this issue Aug 26, 2022 · 1 comment

@JunLi0514

Hi,
I tried to run the CNCeleb recipe, but it fails with a RuntimeError:

```
#### Training will run for 6 epochs.
Traceback (most recent call last):
  File "/home/ubuntu/kaldi/egs/xmuspeech/sre/subtools/pytorch/libs/training/trainer.py", line 283, in run
    loss, acc = self.train_one_batch(batch)
  File "/home/ubuntu/kaldi/egs/xmuspeech/sre/subtools/pytorch/libs/training/trainer.py", line 182, in train_one_batch
    loss = model.get_loss(model_forward(inputs), targets)
  File "/home/ubuntu/kaldi/egs/xmuspeech/sre/subtools/pytorch/libs/support/utils.py", line 157, in wrapper
    return function(self, *transformed)
  File "/home/ubuntu/kaldi/egs/xmuspeech/sre/exp/SEResnet34_am_train_fbank40/config/resnet-se-xvector.py", line 559, in get_loss
    return self.loss(inputs, targets)
  File "/home/ubuntu/miniconda3/envs/subtools/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/kaldi/egs/xmuspeech/sre/subtools/pytorch/libs/nnet/loss.py", line 360, in forward
    return self.loss_function(outputs/self.t, targets) + self.ring_loss * ring_loss
  File "/home/ubuntu/miniconda3/envs/subtools/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/subtools/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1150, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/ubuntu/miniconda3/envs/subtools/lib/python3.8/site-packages/torch/nn/functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
/opt/conda/conda-bld/pytorch_1634272172048/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:111: operator(): block: [0,0,0], thread: [55,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
```

That suggests the number of output units in the FC classifier is smaller than the largest target label.
Indeed, I find that num_targets in exp/egs/train_sequential/info is 2687, while the max label in train.egs.csv is 2711.
Could you please tell me which script generates exp/egs/train_sequential/info/num_targets?
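For anyone hitting the same opaque device-side assert: the mismatch can be reproduced on CPU, where PyTorch raises a readable out-of-bounds error instead. A minimal sketch, using the numbers from this issue (2687 classes, max label 2711) rather than any code from the recipe itself:

```python
import torch
import torch.nn.functional as F

# Hypothetical minimal repro: a classifier head sized from num_targets
# (2687) fed a label taken from train.egs.csv (2711).
num_targets = 2687
logits = torch.randn(1, num_targets)   # one example, 2687 classes
bad_target = torch.tensor([2711])      # label >= num_targets

try:
    F.cross_entropy(logits, bad_target)
except (IndexError, RuntimeError) as e:
    # On CPU this fails immediately with a clear "Target ... is out of
    # bounds" message; on GPU the same mismatch surfaces asynchronously
    # as the device-side assert shown in the traceback above.
    print(type(e).__name__, e)
```

Running the forward pass once on CPU (or with CUDA_LAUNCH_BLOCKING=1, as the error message suggests) is usually the quickest way to confirm a label/num_classes mismatch.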

@JunLi0514
Author

Thanks to syousen: for offline egs, get_chunk_egs() in subtools/pytorch/pipeline/onestep/get_chunk_egs.py generates the num_targets file.
I find that after the train/val split, the num_spks of the train set shrinks, because a speaker is removed entirely when all of their utterances land in the val set.
However, this has no effect on the utt2spk labels in utt2spk_int, because that attribute is not updated in filter() of KaldiDataset in subtools/pytorch/libs/egs/kaldi_dataset.py; only attributes belonging to 'utt_first_files' and 'spk_first_files' are updated there.
So I recommend that subtools/pytorch/pipeline/onestep/get_chunk_egs.py use dataset.num_spks instead of trainset.num_spks when generating the */info/num_targets file.
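To illustrate the mechanism described above without depending on the subtools code, here is a toy sketch (the dict and utterance names are invented for illustration): filtering drops a speaker from the train set, but the stale integer labels keep their original values, so a class count taken from the filtered set is too small for the labels that remain.

```python
# Toy example: 3 speakers labeled 0..2 before the train/val split.
utt2spk_int = {"utt1": 0, "utt2": 1, "utt3": 2}

# Speaker 1's only utterance goes entirely to the val set,
# so that speaker disappears from the train set.
val_utts = {"utt2"}
train_utts = {u: lab for u, lab in utt2spk_int.items() if u not in val_utts}

# Analogue of trainset.num_spks after filtering: 2 speakers remain...
num_spks_after_filter = len(set(train_utts.values()))

# ...but the labels were never re-indexed, so the max label is still 2,
# which requires 3 output classes. Writing num_targets=2 here is exactly
# the num_targets < max-label mismatch reported in this issue.
max_label = max(train_utts.values())
print(num_spks_after_filter, max_label + 1)  # prints: 2 3
```

This is why sizing num_targets from the original, unfiltered dataset (or re-indexing the labels after filtering) keeps the classifier head consistent with the labels in train.egs.csv.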
