Training fails : "Expected all tensors to be on the same device" #1538

Closed
tomschelsen opened this issue Nov 10, 2023 · 3 comments
Closed

Training fails : "Expected all tensors to be on the same device" #1538

tomschelsen opened this issue Nov 10, 2023 · 3 comments

Comments

@tomschelsen
Copy link

tomschelsen commented Nov 10, 2023

Using pyannote.audio 3.0.1

The code that I run:

from pyannote.database import FileFinder, registry

# load the custom pyannote.database protocol describing the training data
registry.load_database("train_pyan_database.yml")
protocol = registry.get_protocol('Conversations.SpeakerDiarization.MyProtocol', preprocessors={"audio": FileFinder()})

from pyannote.audio.tasks import SpeakerDiarization
from pyannote.audio.models.segmentation import PyanNet

# speaker diarization task on 10-second chunks
sd_task = SpeakerDiarization(
    protocol,
    duration=10.0,
    max_speakers_per_chunk=4,
    max_speakers_per_frame=1,
)
trained_from_scratch_model = PyanNet(task=sd_task)

import pytorch_lightning as pl

# single-GPU training for one epoch
trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=1)
trainer.fit(trained_from_scratch_model)

The result:

/usr/local/lib/python3.8/dist-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
2023-11-10 13:14:10.661115: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-10 13:14:11.960991: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-11-10 13:14:11.961093: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-11-10 13:14:11.961103: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/usr/local/lib/python3.8/dist-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
/usr/local/lib/python3.8/dist-packages/torch_audiomentations/utils/io.py:27: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
Protocol Conversations.SpeakerDiarization.MyProtocol does not precompute the output of torchaudio.info(): adding a 'torchaudio.info' preprocessor for you to speed up dataloaders. See pyannote.database documentation on how to do that yourself.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A100 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[2], line 15
     13 import pytorch_lightning as pl
     14 trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=1)
---> 15 trainer.fit(trained_from_scratch_model)

File /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py:544, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    542 self.state.status = TrainerStatus.RUNNING
    543 self.training = True
--> 544 call._call_and_handle_interrupt(
    545     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    546 )

File /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py:44, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     42     if trainer.strategy.launcher is not None:
     43         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
---> 44     return trainer_fn(*args, **kwargs)
     46 except _TunerExitException:
     47     _call_teardown_hook(trainer)

File /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py:580, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    573 assert self.state.fn is not None
    574 ckpt_path = self._checkpoint_connector._select_ckpt_path(
    575     self.state.fn,
    576     ckpt_path,
    577     model_provided=True,
    578     model_connected=self.lightning_module is not None,
    579 )
--> 580 self._run(model, ckpt_path=ckpt_path)
    582 assert self.state.stopped
    583 self.training = False

File /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py:950, in Trainer._run(self, model, ckpt_path)
    947 self.strategy.setup_environment()
    948 self.__setup_profiler()
--> 950 call._call_setup_hook(self)  # allow user to setup lightning_module in accelerator environment
    952 # check if we should delay restoring checkpoint till later
    953 if not self.strategy.restore_checkpoint_after_setup:

File /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py:94, in _call_setup_hook(trainer)
     92     _call_lightning_datamodule_hook(trainer, "setup", stage=fn)
     93 _call_callback_hooks(trainer, "setup", stage=fn)
---> 94 _call_lightning_module_hook(trainer, "setup", stage=fn)
     96 trainer.strategy.barrier("post_setup")

File /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py:157, in _call_lightning_module_hook(trainer, hook_name, pl_module, *args, **kwargs)
    154 pl_module._current_fx_name = hook_name
    156 with trainer.profiler.profile(f"[LightningModule]{pl_module.__class__.__name__}.{hook_name}"):
--> 157     output = fn(*args, **kwargs)
    159 # restore current_fx when nested context
    160 pl_module._current_fx_name = prev_fx_name

File /usr/local/lib/python3.8/dist-packages/pyannote/audio/core/model.py:264, in Model.setup(self, stage)
    261     self.task.setup_validation_metric()
    263     # cache for later (and to avoid later CUDA error with multiprocessing)
--> 264     _ = self.example_output
    266 # list of layers after adding task-dependent layers
    267 after = set((name, id(module)) for name, module in self.named_modules())

File /usr/lib/python3.8/functools.py:967, in cached_property.__get__(self, instance, owner)
    965 val = cache.get(self.attrname, _NOT_FOUND)
    966 if val is _NOT_FOUND:
--> 967     val = self.func(instance)
    968     try:
    969         cache[self.attrname] = val

File /usr/local/lib/python3.8/dist-packages/pyannote/audio/core/model.py:195, in Model.example_output(self)
    193 example_input_array = self.__example_input_array()
    194 with torch.inference_mode():
--> 195     example_output = self(example_input_array)
    197 def __example_output(
    198     example_output: torch.Tensor,
    199     specifications: Specifications = None,
    200 ) -> Output:
    201     if specifications.resolution == Resolution.FRAME:

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /usr/local/lib/python3.8/dist-packages/pyannote/audio/models/segmentation/PyanNet.py:172, in PyanNet.forward(self, waveforms)
    160 def forward(self, waveforms: torch.Tensor) -> torch.Tensor:
    161     """Pass forward
    162 
    163     Parameters
   (...)
    169     scores : (batch, frame, classes)
    170     """
--> 172     outputs = self.sincnet(waveforms)
    174     if self.hparams.lstm["monolithic"]:
    175         outputs, _ = self.lstm(
    176             rearrange(outputs, "batch feature frame -> batch frame feature")
    177         )

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /usr/local/lib/python3.8/dist-packages/pyannote/audio/models/blocks/sincnet.py:81, in SincNet.forward(self, waveforms)
     73 def forward(self, waveforms: torch.Tensor) -> torch.Tensor:
     74     """Pass forward
     75 
     76     Parameters
     77     ----------
     78     waveforms : (batch, channel, sample)
     79     """
---> 81     outputs = self.wav_norm1d(waveforms)
     83     for c, (conv1d, pool1d, norm1d) in enumerate(
     84         zip(self.conv1d, self.pool1d, self.norm1d)
     85     ):
     87         outputs = conv1d(outputs)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/instancenorm.py:87, in _InstanceNorm.forward(self, input)
     84 if input.dim() == self._get_no_batch_dim():
     85     return self._handle_no_batch_input(input)
---> 87 return self._apply_instance_norm(input)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/instancenorm.py:36, in _InstanceNorm._apply_instance_norm(self, input)
     35 def _apply_instance_norm(self, input):
---> 36     return F.instance_norm(
     37         input, self.running_mean, self.running_var, self.weight, self.bias,
     38         self.training or not self.track_running_stats, self.momentum, self.eps)

File /usr/local/lib/python3.8/dist-packages/torch/nn/functional.py:2523, in instance_norm(input, running_mean, running_var, weight, bias, use_input_stats, momentum, eps)
   2521 if use_input_stats:
   2522     _verify_spatial_size(input.size())
-> 2523 return torch.instance_norm(
   2524     input, weight, bias, running_mean, running_var, use_input_stats, momentum, eps, torch.backends.cudnn.enabled
   2525 )

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument weight in method wrapper_CUDA__cudnn_batch_norm)

Thanks :)
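For context, the traceback localizes the failure inside Model.setup while it computes self.example_output: the forward pass through SincNet's wav_norm1d ends up mixing tensors on cuda:0 and cpu. Below is a minimal, self-contained sketch of the same class of error, assuming a CUDA device is available (the nn.InstanceNorm1d module and shapes are illustrative, not pyannote internals):

import torch
import torch.nn as nn

# affine weights live on cuda:0 while the input tensor stays on the cpu
norm = nn.InstanceNorm1d(60, affine=True).to("cuda")
waveform = torch.randn(4, 60, 16000)

try:
    norm(waveform)  # weights (cuda:0) vs input (cpu): device mismatch
except RuntimeError as e:
    print(e)  # Expected all tensors to be on the same device, ...

# moving the input to the module's device resolves this class of error
out = norm(waveform.to(next(norm.parameters()).device))
print(out.device)  # cuda:0

Whatever the root cause turns out to be, a fix has to ensure the example input is created on (or moved to) the same device as the model before that forward pass.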

Thank you for your issue. You might want to check the FAQ if you haven't done so already.

Feel free to close this issue if you found an answer in the FAQ.

If your issue is a feature request, please read this first and update your request accordingly, if needed.

If your issue is a bug report, please provide a minimum reproducible example as a link to a self-contained Google Colab notebook containing everything needed to reproduce the bug:

  • installation
  • data preparation
  • model download
  • etc.

Providing an MRE will increase your chance of getting an answer from the community (either maintainers or other power users).

Companies relying on pyannote.audio in production may contact me via email regarding:

  • paid scientific consulting around speaker diarization and speech processing in general;
  • custom models and tailored features (via the local tech transfer office).

This is an automated reply, generated by FAQtory

tomschelsen (Author)

Re-tried from a cleaner environment and, after solving #1400, I got the training working.

So I am closing this one; the issue above remains the place to follow up on the lightning.pytorch / pytorch_lightning and lightning_fabric / lightning.fabric imports that are not in line with the current Lightning version.
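For anyone landing here later: a quick, self-contained way to spot the situation described above is to check which Lightning distributions are importable and at which versions, since having both pytorch_lightning and the unified lightning package installed out of sync is the kind of mismatch #1400 covers (a sketch; interpret the output against what your pyannote.audio version actually imports):

import importlib
import importlib.util

# report which Lightning distributions are present, and their versions
for name in ("pytorch_lightning", "lightning"):
    if importlib.util.find_spec(name) is None:
        print(name, "not installed")
    else:
        module = importlib.import_module(name)
        print(name, getattr(module, "__version__", "unknown"))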


TuanBC commented Apr 2, 2024

I had the same issue when trying the following tutorial on Google Colab:

https://colab.research.google.com/github/pyannote/pyannote-audio/blob/develop/tutorials/training_a_model.ipynb

Python 3.10.12
PyTorch 2.2.1+cu121
pytorch_lightning 2.2.1
