About Distributed Training #9

Closed
YZWYD opened this issue Feb 23, 2024 · 1 comment
YZWYD commented Feb 23, 2024

Hello, first of all, thank you for open-sourcing such a great project. I would like to port your project to the Jetson platform, but I have run into a problem with the distributed-training part of the code. Distributed backends typically include gloo and nccl, but the Jetson can only use the specific PyTorch build compiled by Nvidia, and that build does not include nccl. As a result, when I train the teacher I get AttributeError: module 'torch.distributed' has no attribute 'group', and torch.distributed.is_available() returns False.

I would like to ask how to modify the code to disable the distributed-training parts and train on a single GPU only. Even when I set num_gpus=1, it still reports an error. I am looking forward to your reply. Thank you very much.

Some of my device and conda environment info: Python 3.8, CUDA 11.4, pytorch=1.13.0 (Nvidia-provided build), torchvision=0.14.1, torchaudio=0.13.1.
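
A minimal diagnostic sketch, assuming the ngp_pl conda environment above, to confirm what the Nvidia-built wheel actually ships; the backend queries are guarded because they only exist on builds with distributed support:

```python
# Diagnostic sketch (assumes the ngp_pl conda env described above): checks
# whether the Nvidia-built PyTorch wheel ships distributed support at all.
import torch

print(torch.__version__)                 # 1.13.0 per the report
print(torch.distributed.is_available())  # reported as False on this Jetson build

# The backend queries below only exist on builds with distributed support,
# so guard them to avoid the same kind of AttributeError described here.
if torch.distributed.is_available():
    print(torch.distributed.is_nccl_available())
    print(torch.distributed.is_gloo_available())
```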

Some of the error messages are shown below:
File "train.py", line 427, in
trainer.fit(system, ckpt_path=hparams.ckpt_path)
File "/home/nvidia/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/home/nvidia/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/nvidia/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/nvidia/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
results = self._run_stage()
File "/home/nvidia/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
self._run_train()
File "/home/nvidia/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
self.fit_loop.run()
File "/home/nvidia/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/nvidia/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/nvidia/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.on_advance_end()
File "/home/nvidia/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 251, in on_advance_end
self._run_validation()
File "/home/nvidia/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 310, in _run_validation
self.val_loop.run()
File "/home/nvidia/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 206, in run
output = self.on_run_end()
File "/home/nvidia/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 180, in on_run_end
self._evaluation_epoch_end(self._outputs)
File "/home/nvidia/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 288, in _evaluation_epoch_end
self.trainer._call_lightning_module_hook(hook_name, output_or_outputs)
File "/home/nvidia/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1342, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "train.py", line 368, in validation_epoch_end
mean_psnr = all_gather_ddp_if_available(psnrs).mean()
File "/home/nvidia/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 161, in all_gather_ddp_if_available
return new_all_gather_ddp_if_available(*args, **kwargs)
File "/home/nvidia/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/lightning_lite/utilities/distributed.py", line 197, in _all_gather_ddp_if_available
group = group if group is not None else torch.distributed.group.WORLD
AttributeError: module 'torch.distributed' has no attribute 'group'
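
One possible workaround for the failure above, offered as a sketch rather than the repository's actual fix: guard the all_gather call from validation_epoch_end so it is skipped whenever torch.distributed is unusable. The names psnrs and all_gather_ddp_if_available come from the traceback; the helper name gather_mean_psnr is hypothetical.

```python
# Sketch of a guarded replacement for the failing line in train.py's
# validation_epoch_end. Only psnrs and all_gather_ddp_if_available are taken
# from the traceback; everything else here is an assumption.
import torch

def gather_mean_psnr(psnrs: torch.Tensor) -> torch.Tensor:
    # Only all-gather across ranks when a process group can actually exist.
    # Short-circuiting on is_available() avoids touching attributes that the
    # Jetson build does not define (e.g. torch.distributed.group).
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        from pytorch_lightning.utilities.distributed import all_gather_ddp_if_available
        psnrs = all_gather_ddp_if_available(psnrs)
    return psnrs.mean()
```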

galalalala (Collaborator) commented
Hi, one solution is to use DP instead of DDP. Alternatively, you can simply remove the DDP code, which reverts to single-GPU training.
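
A minimal sketch of what the single-GPU path could look like when constructing the Trainer; the argument names follow pytorch_lightning 1.8-style APIs, and num_gpus stands in for the num_gpus=1 setting mentioned in the issue (the repository's real Trainer arguments may differ):

```python
# Sketch: only select a DDP strategy when more than one GPU is requested, so a
# single-GPU run never touches torch.distributed. num_gpus mirrors the issue's
# num_gpus=1 setting; the strategy string is a standard pytorch_lightning 1.8 alias.
from pytorch_lightning import Trainer

num_gpus = 1  # stand-in for the hyperparameter passed to train.py

trainer = Trainer(
    accelerator="gpu",
    devices=num_gpus,
    strategy="ddp_find_unused_parameters_false" if num_gpus > 1 else None,
)
```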
