You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
SUSE Linux 15 SP3 - pyannote.audio-3.1.1 - ALCF Polaris
Issue description
I tried to train a VAD model with 4 GPUs on a single node. The error message is: Traceback (most recent call last): File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/pipelines/vad.py", line 157, in <module> model = train_model(args, protocol) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/pipelines/vad.py", line 118, in train_model trainer.fit(model) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit call._call_and_handle_interrupt( File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch return function(*args, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl self._run(model, ckpt_path=ckpt_path) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 949, in _run call._call_setup_hook(self) # allow user to set up LightningModule in accelerator environment File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 94, in _call_setup_hook _call_lightning_module_hook(trainer, "setup", stage=fn) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook output = fn(*args, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pyannote/audio/core/model.py", line 264, in setup _ = self.example_output File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/functools.py", line 967, in __get__ val = self.func(instance) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pyannote/audio/core/model.py", line 195, in example_output example_output = self(example_input_array) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pyannote/audio/models/segmentation/PyanNet.py", line 172, in forward outputs = self.sincnet(waveforms) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pyannote/audio/models/blocks/sincnet.py", line 81, in forward outputs = self.wav_norm1d(waveforms) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/instancenorm.py", line 87, in forward return self._apply_instance_norm(input) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/instancenorm.py", line 36, in _apply_instance_norm return F.instance_norm( File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/functional.py", line 2526, in instance_norm return torch.instance_norm( RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__cudnn_batch_norm) [rank: 1] Child process with PID 35924 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟 /var/spool/pbs/mom_priv/jobs/1788633.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov.SC: line 37: 35559 Killed python vad.py
It seems to me like the data was not loaded to GPU devices. Below is a simplified version of my training script:
def train_model(args, protocol):
vad = VoiceActivityDetection(protocol, duration=2., batch_size=256,
num_workers=args['gpus'])
model = PyanNet(
task=vad,
sincnet={
'stride': args['stride']
},
lstm={
'num_layers': args['lstm_layer'],
'bidirectional': args['bidirectional'],
'hidden_size': args['hidden_size'],
'dropout': args['dropout']
},
linear={
'num_layers': args['linear_layer']
})
trainer = pl.Trainer(
strategy='deepspeed_stage_2', # deepspeed_stage_2
max_time=args['time'],
max_epochs=max_epoch,
default_root_dir=OUTPUT_DIR+args['corpus_name']+'/',
devices="auto",
accelerator="auto",
use_distributed_sampler=True,
enable_progress_bar=True)
print('start training...')
model.to(torch.device("cuda"))
trainer.fit(model)
print("trained successfully!")
return model
if __name__ == '__main__':
# 1. get module configuration
args = read_config(CONFIGURATION)
# 2. check the allocations and return rank
check_devices(args)
# 3. read data
protocol = load_data(args['corpus_name'], args['sub_corpus'])
# 4. train the model and save checkpoints
model = train_model(args, protocol)
# 5. tune the pipeline
pipeline = tune_pipeline(model, protocol)
# 6. evaluate the performance of the pipeline
test_performance(pipeline, protocol, calculate_detection_error_rate, args['job_id'], args['corpus_name'])
Tested versions
System information
SUSE Linux 15 SP3 - pyannote.audio-3.1.1 - ALCF Polaris
Issue description
I tried to train a VAD model with 4 GPUs on a single node. The error message is:
Traceback (most recent call last): File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/pipelines/vad.py", line 157, in <module> model = train_model(args, protocol) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/pipelines/vad.py", line 118, in train_model trainer.fit(model) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit call._call_and_handle_interrupt( File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch return function(*args, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl self._run(model, ckpt_path=ckpt_path) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 949, in _run call._call_setup_hook(self) # allow user to set up LightningModule in accelerator environment File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 94, in _call_setup_hook _call_lightning_module_hook(trainer, "setup", stage=fn) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook output = fn(*args, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pyannote/audio/core/model.py", line 264, in setup _ = self.example_output File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/functools.py", line 967, in __get__ val = self.func(instance) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pyannote/audio/core/model.py", line 195, in example_output example_output = self(example_input_array) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pyannote/audio/models/segmentation/PyanNet.py", line 172, in forward outputs = self.sincnet(waveforms) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pyannote/audio/models/blocks/sincnet.py", line 81, in forward outputs = self.wav_norm1d(waveforms) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/instancenorm.py", line 87, in forward return self._apply_instance_norm(input) File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/instancenorm.py", line 36, in _apply_instance_norm return F.instance_norm( File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/functional.py", line 2526, in instance_norm return torch.instance_norm( RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__cudnn_batch_norm) [rank: 1] Child process with PID 35924 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟 /var/spool/pbs/mom_priv/jobs/1788633.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov.SC: line 37: 35559 Killed python vad.py
It seems to me like the data was not loaded to GPU devices. Below is a simplified version of my training script:
And here is my bash script:
Minimal reproduction example (MRE)
None
The text was updated successfully, but these errors were encountered: