
Training on Cityscape #13

Closed
Yan1026 opened this issue Sep 8, 2022 · 4 comments

Yan1026 commented Sep 8, 2022

Sorry to bother you.
I train with `bash ./scripts/train_city.sh -l 372 -g 4 -b 50`, but get this error:

availble_gpus= [0, 1, 2, 3]
  0%|                                                                                                           | 0/93 [00:00<?, ?it/s]
  0%|                                                                                                           | 0/93 [00:05<?, ?it/s]
wandb: Waiting for W&B process to finish... (failed 1).
wandb: - 0.000 MB of 0.000 MB uploaded (0.000 MB deduped)
wandb: \ 0.000 MB of 0.000 MB uploaded (0.000 MB deduped)
wandb: | 0.000 MB of 0.000 MB uploaded (0.000 MB deduped)
wandb:                                                                                
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /homedjy/PS-MT/wandb/offline-run-20220908_195002-2iykpb0m
wandb: Find logs at: ./wandb/offline-run-20220908_195002-2iykpb0m/logs
Traceback (most recent call last):
  File "CityCode/main.py", line 199, in <module>
    main(-1, 1, config, args)
  File "CityCode/main.py", line 116, in main
    trainer.train()
  File "/home/PS-MT/CityCode/Base/base_trainer.py", line 145, in train
    _ = self._warm_up(epoch, id=1)
  File "/homedjy/PS-MT/CityCode/train.py", line 173, in _warm_up
    curr_iter=batch_idx, epoch=epoch-1, id=id, warm_up=True)
  File "/home/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu

I tried to fix it, but with no effect. I want to use GPUs 5, 6, 7, and 8, because GPUs 0-3 are occupied. But when I print availble_gpus, it is still [0, 1, 2, 3].
I can train the model on VOC in the same setup.
Do you have any ideas?
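(For the GPU-selection part, a minimal sketch of the standard CUDA mechanism, not taken from this thread: the indices 4-7 below are an assumption, so check `nvidia-smi` for the devices that are actually free.)

```python
# Sketch: pin the run to the free GPUs via CUDA_VISIBLE_DEVICES.
# This must happen before PyTorch initializes CUDA, e.g. at the very top of CityCode/main.py.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"  # assumed indices; verify with nvidia-smi

import torch
# PyTorch now sees only the four free devices, remapped to cuda:0 .. cuda:3,
# which is consistent with the "availble_gpus= [0, 1, 2, 3]" that the script prints.
print(torch.cuda.device_count())
```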

Yan1026 (Author) commented Sep 9, 2022

I fixed it. Comparing the code of VOC-main and Cityscape-main, I found that Cityscape-main lacks a line of code about DDP:
args.ddp = True if args.gpus > 1 else False
After adding this line, the model can be trained. Maybe your code is a test version, or I made a mistake.
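(For context, a minimal sketch of where such a flag would sit in the entry point's argument parsing; the parser setup and option names here are assumptions, and only the final assignment is the line reported as missing.)

```python
# Hypothetical sketch of the CityCode/main.py argument parsing.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-g", "--gpus", type=int, default=1, help="number of GPUs to use")
args = parser.parse_args()

# Enable DistributedDataParallel whenever more than one GPU is requested
# (the line reported as missing from the Cityscape code).
args.ddp = True if args.gpus > 1 else False
```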

yyliu01 (Owner) commented Sep 9, 2022

Glad to hear you solved it.
In our experiments, I added the "--ddp" flag manually, and I missed this line when I re-organized the code.

Thanks a lot for reporting it.

yyliu01 closed this as completed Sep 9, 2022

Yan1026 (Author) commented Sep 13, 2022

Hi @yyliu01, I train with `bash ./scripts/train_city.sh -l 372 -g 4 -b 50`, but get this error:

Saving a checkpoint: saved/final_test/372_mIoU_0.6137_model_e10.pth ... 
EVAL ID (Model 1) (10) | PixelAcc: 0.9311, Mean IoU: 0.6137 |
Traceback (most recent call last):
  File "CityCode/main.py", line 203, in <module>
    mp.spawn(main, nprocs=config['n_gpu'], args=(config['n_gpu'], config, args))
  File "/home/imu_zhengyuan/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/imu_zhengyuan/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/imu_zhengyuan/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/imu_zhengyuan/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/ PS-MT/CityCode/main.py", line 120, in main
    trainer.train()
  File "/home/ PS-MT/CityCode/Base/base_trainer.py", line 171, in train
    self._save_checkpoint(epoch)
  File "/home/ PS-MT/CityCode/Base/base_trainer.py", line 191, in _save_checkpoint
    upload_checkpoint(local_path=self.checkpoint_dir, prefix=pvc_dir, checkpoint_filepath=ckpt_name)
NameError: name 'upload_checkpoint' is not defined

About CityCode/Base/base_trainer.py, lines 187-194: I found the following code commented out in VOC, but not in Cityscape.

Do you have any ideas? Maybe it is a test version of the code?

```python
        pvc_dir = os.path.join("yy", "exercise_1", self.args.architecture,
                               "resnet{}_ckpt".format(str(self.args.backbone)), "city_cvpr_final",
                               str(self.args.labeled_examples))

        upload_checkpoint(local_path=self.checkpoint_dir, prefix=pvc_dir, checkpoint_filepath=ckpt_name)
        self.logger.info("Uploading current ckpt: mIoU_{}_model.pth to {}".format(str(state['monitor_best']),
```

yyliu01 (Owner) commented Sep 21, 2022

Hi @Yan1026, please comment out that line; it is for Google Cloud uploading and shouldn't be used for your training.

I apologize for the inconvenience. Please reopen the issue if you have any questions.
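(For reference, a sketch of the suggested edit in CityCode/Base/base_trainer.py, `_save_checkpoint`, mirroring how the VOC code already ships these lines commented out; the surrounding code is quoted from the snippet above, not verified against the repository.)

```python
        # Cloud-upload block disabled for local training:
        # pvc_dir = os.path.join("yy", "exercise_1", self.args.architecture,
        #                        "resnet{}_ckpt".format(str(self.args.backbone)), "city_cvpr_final",
        #                        str(self.args.labeled_examples))
        # upload_checkpoint(local_path=self.checkpoint_dir, prefix=pvc_dir, checkpoint_filepath=ckpt_name)
```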
