Error happens with worker>=1 in training phase #22

Closed
Shaosifan opened this issue Jul 30, 2020 · 4 comments

Shaosifan commented Jul 30, 2020

When I run train.py with worker=4, an error occurs. Does anybody know what causes it?

Traceback (most recent call last):
  File "F:/research/codes/Others-projects/fbrs_interactive_segmentation-master/train.py", line 69, in <module>
    main()
  File "F:/research/codes/Others-projects/fbrs_interactive_segmentation-master/train.py", line 16, in main
    model_script.main(cfg)
  File "models/sbd/r34_dh128.py", line 24, in main
    train(model, cfg, model_cfg, start_epoch=cfg.start_epoch)
  File "models/sbd/r34_dh128.py", line 132, in train
    trainer.training(epoch)
  File "F:\research\codes\Others-projects\fbrs_interactive_segmentation-master\isegm\engine\trainer.py", line 119, in training
    for i, batch_data in enumerate(tbar):
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\site-packages\tqdm\std.py", line 1129, in __iter__
    for obj in iterable:
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\site-packages\torch\utils\data\dataloader.py", line 279, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\site-packages\torch\utils\data\dataloader.py", line 719, in __init__
    w.start()
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'train.<locals>.scale_func'

@ptrvilya
Contributor

Hi! I suspect this issue comes from the torch.multiprocessing implementation on Windows. Consider moving scale_func out of the train function to the top level of the training script.
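
For context on the suggestion above: on Windows, worker processes are started with the spawn method, so the objects handed to each worker are pickled first, and pickle cannot serialize a function defined inside another function. The sketch below reproduces only that pickling step with the standard library. The names make_local, top_level_scale_func and can_pickle are hypothetical and are not taken from the repository.

```python
import pickle


def make_local():
    # Mirrors a helper defined inside train(): its qualified name contains
    # "<locals>", so pickle cannot look it up by name.
    def scale_func(x):
        return x * 2
    return scale_func


def top_level_scale_func(x):
    # Defined at module level, so pickle can store it as a reference
    # (module name plus function name).
    return x * 2


def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


local_ok = can_pickle(make_local())           # the nested function is rejected
top_ok = can_pickle(top_level_scale_func)     # succeeds when run as a script

print("local function picklable:", local_ok)
print("top-level function picklable:", top_ok)
```

Note that a top-level function is only picklable if its module can be imported by name in the process that loads it, which matters when the defining script is loaded dynamically.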

@Shaosifan
Author

When I moved scale_func to the top level of the training script, a similar error appeared:

Traceback (most recent call last):
  File "F:/research/codes/Others-projects/fbrs_interactive_segmentation-master/train.py", line 69, in <module>
    main()
  File "F:/research/codes/Others-projects/fbrs_interactive_segmentation-master/train.py", line 16, in main
    model_script.main(cfg)
  File "models/sbd/r34_dh128.py", line 26, in main
    train(model, cfg, model_cfg, start_epoch=cfg.start_epoch)
  File "models/sbd/r34_dh128.py", line 135, in train
    trainer.training(epoch)
  File "F:\research\codes\Others-projects\fbrs_interactive_segmentation-master\isegm\engine\trainer.py", line 119, in training
    for i, batch_data in enumerate(tbar):
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\site-packages\tqdm\std.py", line 1129, in __iter__
    for obj in iterable:
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\site-packages\torch\utils\data\dataloader.py", line 279, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\site-packages\torch\utils\data\dataloader.py", line 719, in __init__
    w.start()
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function scale_func at 0x000001CDBF3F38B8>: import of module 'model_script' failed

Process finished with exit code 1
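
A likely reading of this second error, offered as an assumption rather than a confirmed diagnosis: pickle stores a top-level function as a reference to its module and name, and the model file here appears to be loaded dynamically under the name model_script, which cannot be imported by that name. The sketch below reproduces the same failure with the standard library only. The file name r34_demo.py and the variable names are hypothetical.

```python
import importlib.util
import pickle
import tempfile
from pathlib import Path

SOURCE = "def scale_func(x):\n    return x * 2\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "r34_demo.py"  # hypothetical stand-in for a model file
    path.write_text(SOURCE)

    # Load the file under the name "model_script" without making that name
    # importable (it is neither on sys.path nor registered in sys.modules).
    spec = importlib.util.spec_from_file_location("model_script", str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

# The function itself works in this process...
doubled = module.scale_func(3)

# ...but pickling it requires importing "model_script" by name, which fails.
failure = None
try:
    pickle.dumps(module.scale_func)
except (pickle.PicklingError, ImportError) as exc:
    failure = str(exc)

print("scale_func(3) =", doubled)
print("pickling failed with:", failure)
```

Under this reading, the function would need to live in a module that both parent and worker processes can import by its real name, or be removed altogether.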

Contributor

ptrvilya commented Aug 5, 2020

I believe that the issue is still with torch.multiprocessing on Windows. I suggest replacing this line with this one and removing scale_func completely. You can also try running training with nvidia-docker on Ubuntu.

@Shaosifan
Copy link
Author

Thank you for your help! I removed scale_func completely and it works now.
