Multiprocessing error running optimize_parallel_gpu with pytorch + pytorch-lightning #56

Open
jtamir opened this issue Oct 4, 2019 · 8 comments

@jtamir
Contributor

jtamir commented Oct 4, 2019

I am following the guide to optimize hyperparameters over multiple GPUs: https://towardsdatascience.com/trivial-multi-node-training-with-pytorch-lightning-ff75dfb809bd

However, when I run the hyperparam opt, I get the following error:

RuntimeError: cuda runtime error (3) : initialization error at /pytorch/aten/src/THC/THCGeneral.cpp:54

Based on some reading, it seems to be an issue with initializing CUDA under multiprocessing, and the suggested fix is to add multiprocessing.set_start_method('spawn', force=True).
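
For reference, a minimal sketch of the suggested change in a driver script (the __main__ guard and the ordering comment are assumptions, not test-tube code):

# Sketch: force the 'spawn' start method so child processes do not inherit
# an already-initialized CUDA context via fork. This must run before any
# pools are created or any CUDA call is made.
import multiprocessing

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn', force=True)
    # ... build the HyperOptArgumentParser and call optimize_parallel_gpu here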

Looking at argparse_hopt.py, I see that that specific line is commented out. When I uncomment it, I get through that error but hit a pickle error:

AttributeError: Can't pickle local object 'HyperOptArgumentParser.optimize_parallel_gpu.<locals>.init'

Looking for suggestions on what to try, thanks!

@jtamir jtamir changed the title Multiprocessing running optimize_parallel_gpu with pytorch + pytorch-lightning Multiprocessing error running optimize_parallel_gpu with pytorch + pytorch-lightning Oct 4, 2019
@antvconst

Also having this one

@williamFalcon
Owner

I might need to update that post, but run these demos instead:

https://github.com/williamFalcon/pytorch-lightning/tree/master/examples/multi_node_examples

@williamFalcon
Owner

could you post your code and error?

@antvconst

antvconst commented Oct 11, 2019

I'm kind of looking for a way to evaluate each set of hyperparams on a separate GPU in parallel, not train a single model on multiple GPUs. I've tried this:

def train_one(hparam, gpu_id_set):
    # load data, create model, create logger and checkpoint callback
    trainer = Trainer(logger=tt_logger, checkpoint_callback=checkpoint_callback,
                      gpus=[int(gpu_id_set)], max_nb_epochs=hparam.epochs, weights_summary=None)
    trainer.fit(model)
    trainer.test(model)

hparams.optimize_parallel_gpu(train_one, max_nb_trials=400, gpu_ids=gpu_ids)

Here gpu_ids is ['0', '1'].

And here's the output:

gpu available: True, used: True
VISIBLE GPUS: 0
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp line=54 error=3 : initialization error
Caught exception in worker thread cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp:54
Traceback (most recent call last):
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 37, in optimize_parallel_gpu_private
    results = train_function(trial_params, gpu_id_set)
  File "main.py", line 40, in train_one
    trainer.fit(model)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/pytorch_lightning-0.5.1.3-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 754, in fit
    self.__single_gpu_train(model)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/pytorch_lightning-0.5.1.3-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 793, in __singl
e_gpu_train
    model.cuda(self.root_gpu)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 311, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 208, in _apply
    module._apply(fn)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 208, in _apply
    module._apply(fn)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 230, in _apply
    param_applied = fn(param)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 311, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/cuda/__init__.py", line 179, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp:54

After this, the exact same message appears again but for GPU 1, and then the process seems to hang. When killed with Ctrl-C it also outputs:

Traceback (most recent call last):
  File "main.py", line 65, in <module>
    hparams.optimize_parallel_gpu(train_one, max_nb_trials=400, gpu_ids=gpu_ids)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 323, in optimize_parallel_gpu
    results = self.pool.map(optimize_parallel_gpu_private, self.trials)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/multiprocessing/pool.py", line 651, in get
    self.wait(timeout)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/multiprocessing/pool.py", line 648, in wait
    self._event.wait(timeout)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/threading.py", line 552, in wait
    signaled = self._cond.wait(timeout)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/threading.py", line 296, in wait
    waiter.acquire()
KeyboardInterrupt

@BraveDistribution

BraveDistribution commented Feb 24, 2020

Any help? I'm having a similar issue, but without multi-node (I have two GPUs in my PC).

Traceback (most recent call last):
  File "D:/Users//MFT/MFT/simulation_runner.py", line 40, in <module>
    hparams.optimize_parallel_gpu('test', gpu_ids=['0'], max_nb_trials=1)
  File "D:\Users\\anaconda3\envs\pytorch\lib\site-packages\test_tube\argparse_hopt.py", line 322, in optimize_parallel_gpu
    self.pool = Pool(processes=nb_workers, initializer=init, initargs=(gpu_q,))
  File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\context.py", line 119, in Pool
    context=self.get_context())
  File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\pool.py", line 174, in __init__
    self._repopulate_pool()
  File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\pool.py", line 239, in _repopulate_pool
    w.start()
  File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'HyperOptArgumentParser.optimize_parallel_gpu.<locals>.init'

and the entry point:

def main(hparams):
    early_stopping = EarlyStopping('val_acc', patience=20)

    trainer = Trainer(
                         max_nb_epochs=50,
                         gpus=[0],
                         early_stop_callback=early_stopping,
                         train_percent_check=1,
                         check_val_every_n_epoch=1,
                         val_percent_check=1
                         )
    system = ParkinsonDecisionSystem(hparams)
    if hparams.evaluate:
        trainer.run_evaluation()
    else:
        trainer.fit(system)

if __name__ == '__main__':
    parent_parser = HyperOptArgumentParser(strategy='grid_search')
    parent_parser.opt_list('--augmentation', default="None", type=str, tunable=True,
                           options=["Erosion", "Gaussian", "None", "Median"])
    parent_parser.add_argument('-e', '--evaluate', dest='evaluate', action='store_true',
                               help='evaluate model on validation set')

    parent_parser.add_argument("--model_name", metavar="model_name", type=str, default=None,
                        help="Name od model from model_enum")

    hparams = parent_parser.parse_args()
    hparams.optimize_parallel_gpu(main, gpu_ids=['0'], max_nb_trials=1)

@jtamir
Contributor Author

jtamir commented Feb 24, 2020

I was able to get this working, though I can't remember exactly all the steps. My code is available here: https://github.com/jtamir/deepinpy/blob/master/main.py#L113

Things I remember being important:

  • Set num_workers=0 in your DataLoader, or Python will try to spawn multiple multiprocessing pools
  • Always pass GPU ID 0 to PyTorch Lightning's Trainer, because TestTube already handles the GPU IDs: https://github.com/jtamir/deepinpy/blob/master/main.py#L46
  • Set distributed_backend=None in the trainer for similar reasons (a combined sketch follows below)
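
Put together, the pattern from these points looks roughly like the sketch below. The names my_dataset, model and hparams.epochs are placeholders, and the argument names (num_workers, gpus, distributed_backend, max_nb_epochs) are the ones used elsewhere in this thread for that generation of PyTorch Lightning:

from torch.utils.data import DataLoader
from pytorch_lightning import Trainer

def train_one(hparams, gpu_id_set):
    # num_workers=0: this function already runs inside a test-tube worker
    # process, so don't let the DataLoader spawn its own worker pool
    loader = DataLoader(my_dataset, batch_size=32, num_workers=0)
    # (wire loader into your LightningModule as appropriate)

    # gpus=[0]: test-tube handles the GPU IDs per trial, so index 0 refers to
    # the GPU assigned to this worker; distributed_backend=None keeps
    # Lightning from launching its own multi-GPU processes
    trainer = Trainer(gpus=[0], distributed_backend=None,
                      max_nb_epochs=hparams.epochs)
    trainer.fit(model)

hparams.optimize_parallel_gpu(train_one, gpu_ids=['0', '1'], max_nb_trials=400)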

@BraveDistribution

@jtamir

Thank you for your quick response. I was lucky and managed to solve it myself just before your reply.

You mentioned everything except one thing: in testtube I had to remove the nested functions (defined where the pool creates its new processes), because pickle can't handle them. That removed the error: AttributeError: Can't pickle local object 'HyperOptArgumentParser.optimize_parallel_gpu.<locals>.init'
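
For anyone hitting the same pickle error, the general shape of that kind of fix (a sketch of the pattern, not the actual test_tube patch) is to give the pool a module-level initializer instead of a function defined inside optimize_parallel_gpu, since the spawn start method pickles the initializer and pickle cannot serialize local functions:

from multiprocessing import Manager, Pool

# Module-level initializer: picklable under the 'spawn' start method,
# unlike a function nested inside optimize_parallel_gpu.
def init_worker(gpu_queue):
    global g_gpu_queue
    g_gpu_queue = gpu_queue

if __name__ == '__main__':
    gpu_q = Manager().Queue()  # names mirror the traceback above
    pool = Pool(processes=2, initializer=init_worker, initargs=(gpu_q,))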

After that I followed all your steps and it works!

Maybe I will write a blog post about that or something... Also we should make a pull request for testtube (I will look into it...).

@zuiaichirouya

Using torch=1.13.0 and pytorch-lightning=1.0.8, the output is:

  File "/home/xiaoyang/python/envs/taming/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1265, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'LightningDistributedDataParallel' object has no attribute '_sync_params'
