
Get "AssertionError: can only test a child process" when using distributed TPU cores via Pytorch Lightning [CLI] #1994

Closed
adamDhalla opened this issue Mar 23, 2021 · 32 comments
Labels
a:cli Area: Client ty:bug type of the issue is a bug

Comments

@adamDhalla

Hi All,

I'm getting an error when trying to log my metrics and hyperparameters to W&B via PyTorch Lightning while running on 8 TPU cores.

I first initialize the Weights and Biases run and project using the Lightning WandbLogger class, which effectively calls wandb.init(). That goes fine. But when I run the Trainer on 8 TPU cores with the keyword argument logger=my_WandbLogger, I get the error AssertionError: can only test a child process.

[screenshot of the error]

Note that I tried this on a single TPU core, and that went fine and dandy. So it seems to be a problem with the distributed-processing side of things.

How to reproduce
This isn't my code, but someone had the same issue a while back (though I couldn't find their solution). It's built on the bug-reproducer template ('The BoringModel') that PyTorch Lightning uses. Reproduction HERE.

I'm running on Google Colab, with PyTorch Lightning version 1.2.4 (most recent) and W&B version 0.10.22 (one version behind the latest).

Here's the full error stack trace if you're curious:

GPU available: False, used: False
TPU available: True, using: 8 TPU cores
---------------------------------------------------------------------------
ProcessRaisedException                    Traceback (most recent call last)
<ipython-input-28-6650dc1eec9a> in <module>()
      3 wbLogger = WandbLogger(project='HPA Protein Localization Single Class Subset', name='Adam-128-0.001')
      4 trainer = Trainer(logger=wbLogger, deterministic=True, tpu_cores=8, max_epochs=epochNum, replace_sampler_ddp=False)
----> 5 trainer.fit(model, trainDL, valDL)
      6 
      7 print(time.time() - t0)

/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    148         msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
    149         msg += original_trace
--> 150         raise ProcessRaisedException(msg, error_index, failed_process.pid)
    151 
    152 

ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/lib/python3.7/logging/__init__.py", line 1028, in emit
    stream.write(msg + self.terminator)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
    cb(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 796, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 187, in publish_output
    self._publish_output(o)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 192, in _publish_output
    self._publish(rec)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 517, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.7/multiprocessing/process.py", line 151, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 83, in new_process
    seed_everything(int(seed))
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/seed.py", line 54, in seed_everything
    log.info(f"Global seed set to {seed}")
  File "/usr/lib/python3.7/logging/__init__.py", line 1378, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib/python3.7/logging/__init__.py", line 1514, in _log
    self.handle(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1524, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1586, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 894, in handle
    self.emit(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1033, in emit
    self.handleError(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 946, in handleError
    sys.stderr.write('--- Logging error ---\n')
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
    cb(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 796, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 187, in publish_output
    self._publish_output(o)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 192, in _publish_output
    self._publish(rec)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 517, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.7/multiprocessing/process.py", line 151, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 334, in _mp_start_fn
    file=sys.stderr)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
    cb(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 796, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 187, in publish_output
    self._publish_output(o)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 192, in _publish_output
    self._publish(rec)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 517, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.7/multiprocessing/process.py", line 151, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

I'm wondering if there are any temporary workarounds for now since I need to find a way to connect and things are a bit time-sensitive!

@adamDhalla adamDhalla added the a:cli Area: Client label Mar 23, 2021
@raubitsj
Member

raubitsj commented Mar 23, 2021

@adamDhalla,
We are looking into this issue now. We have some documentation and fixes related to DDP and multiprocess use that we are working on.

I am trying to reproduce the error, but for now (based on that exception) I suggest a temporary workaround of turning console logging off, using the env variable WANDB_CONSOLE=off

@prash-p

prash-p commented Mar 23, 2021

I'm having the same error. I found this issue reported on the huggingface page, and this comment helped me fix the issue temporarily: huggingface/datasets#847 (comment)

@adamDhalla
Author

> @adamDhalla,
> We are looking into this issue now. We have some documentation and fixes related to DDP and multiprocess use that we are working on.
>
> I am trying to reproduce the error, but for now (based on that exception) I suggest a temporary workaround of turning console logging off, using the env variable WANDB_CONSOLE=off

Thanks! But I tried that: setting WANDB_CONSOLE='off' (I assumed 'off' should be a string, since it doesn't work the other way?). I put it right above the wandb.init() call, and I still got the exact same error :(.

@adamDhalla
Author

@raubitsj @prash-p I didn't test these at the same time, but I'm including them both in one screenshot so you can see if I messed up any of the spellings:
[screenshot: both attempted workarounds]

If these implementations are correct, sadly they both (independently) did not solve the problem :((

@borisdayma
Contributor

borisdayma commented Mar 24, 2021

You need to use os.environ['WANDB_CONSOLE'] = 'off'
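To make the placement concrete, here is a minimal sketch of that workaround. The key point is that the variable must be set in the Python process before wandb starts up, i.e. before wandb.init() (and ideally before importing wandb at all):

```python
import os

# Disable wandb's console-output capture. This must run before
# wandb.init() is called; setting it before importing wandb is safest.
os.environ["WANDB_CONSOLE"] = "off"
```

After this, wandb.init() (or Lightning's WandbLogger) should pick up the setting and skip the console-redirect code path that raises the assertion.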

@prash-p

prash-p commented Mar 24, 2021

@adamDhalla I didn't use WANDB_CONSOLE=off. What worked for me instead was commenting out lines 517 and 518 in site-packages/wandb/sdk/interface/interface.py :

    def _publish(self, record: pb.Record, local: bool = None) -> None:
        #if self._process and not self._process.is_alive():
        #    raise Exception("The wandb backend process has shutdown")
        if local:
            record.control.local = local
        if self.record_q:
            self.record_q.put(record)

@raubitsj
Member

@prash-p
Those lines are a safety check to make sure that the process storing console output is the process we expect; in this case it isn't (and we don't fully understand why).

Removing that check should be fine for console output, but for other types of data the code is guarding against a use case we haven't fully tested yet, which could lead to corrupted or incomplete data that we want to avoid.

We are still having trouble reproducing this failure, so we can't yet provide a safe workaround (and get the fix into a release ASAP).

@adamDhalla
Author

@borisdayma oh god yeah of course! Don't know what I was thinking. Still kind of a coding noob, but thank you.

@prash-p

prash-p commented Mar 24, 2021

@raubitsj If it helps, I had this error when running my script as a notebook in vscode, using tqdm on my pytorch Dataloader. The exact same script ran without errors when running from the command line.

@adamDhalla
Author

Any progress on this?

@hongvin

hongvin commented Apr 13, 2021

I encounter the same error, when training ViT with torch_xla using TPUs. Any progress?

@KyleGoyette
Contributor

Hi @khvmaths and @adamDhalla, we're working on a solution. This may require changes to the wandb logger in PyTorch Lightning, which could take a couple of weeks to get into a release. We'll let you folks know here when a branch is ready.

@github-actions

This issue is stale because it has been open 60 days with no activity.

@github-actions github-actions bot added the stale label Jun 21, 2021
@prikmm

prikmm commented Jul 24, 2021

@KyleGoyette Hi, any progress on this issue? I didn't use pytorch-lightning; I got this issue while using transformers and pytorch-xla.

@KyleGoyette
Contributor

@prikmm Either the next release or the release after will have a new experimental mode of running that supports this case.

@github-actions github-actions bot removed the stale label Jul 25, 2021
@github-actions

This issue is stale because it has been open 60 days with no activity.

@github-actions github-actions bot added the stale label Sep 23, 2021
@Tesla-1i

> @raubitsj If it helps, I had this error when running my script as a notebook in vscode, using tqdm on my pytorch Dataloader. The exact same script ran without errors when running from the command line.

I also have this error when using tqdm on my pytorch dataloader

@R0bk

R0bk commented Nov 23, 2021

Hey, I've still got this issue training on TPUs with torch XLA.

@shaunster0

I'm still getting this issue using pytorch-lightning in jupyter-lab ... any updates?

@Malachyiii

Malachyiii commented Dec 22, 2021

Still getting this issue in Kaggle notebooks when using pytorch lightning

@andreimargeloiu

Thumbs up, same issue.

@kptkin
Contributor

kptkin commented Jan 3, 2022

Hi and thanks for reporting this issue.

We are currently working on improvements to our multiprocessing support and were wondering if you would be interested in trying them out. The new logic should be more robust and allow more flexibility when using multiprocessing with wandb.

In case you are interested in trying it out, see this link: https://github.com/wandb/client/blob/master/docs/dev/wandb-service-user.md

One thing to note: this is still experimental, and although we have tested it, it is still being developed and improved all the time.

@borisdayma
Contributor

Hi, we just introduced a tentative fix.
Could you update PyTorch Lightning from the master branch and install the most recent version of wandb?

    pip install --upgrade wandb
    pip install --upgrade git+https://github.com/PytorchLightning/pytorch-lightning.git

@sreuvers

This unfortunately does not work for me; I'm still getting the error.

@borisdayma
Contributor

Could you confirm which versions you are using and whether you have reproducible code?

@tsuga

tsuga commented Mar 17, 2022

I still have the same issue on Colab.
wandb: 0.12.11
pytorch-lightning: 1.6.0dev

Colab reproducible code:
https://colab.research.google.com/drive/1bH8tpR5U7SalfQpbPNVHABh3lSNjt9s9?usp=sharing

@borisdayma
Contributor

borisdayma commented Mar 17, 2022

Looks like it works if you comment out wandb.login().

@raubitsj @kptkin is it possible that this somehow prevents the wandb.require("service")?

@leoleoasd

Hi @borisdayma, I ran into the same problem with torch.multiprocessing.
A minimal reproduction is here: https://gist.github.com/leoleoasd/5d8dd5e1cec5bb2822656e66ede55c55

@borisdayma
Contributor

@leoleoasd I think you can just add wandb.require("service") at the top of your script.
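As a rough sketch of that suggestion (the import guard and the service_enabled flag are additions for illustration, so the snippet also runs where wandb isn't installed; wandb.require("service") needs a wandb version that ships the service mode, roughly 0.12.5+):

```python
service_enabled = False
try:
    import wandb

    # Opt in to the service-based backend before any wandb.init() call.
    # Child processes then talk to a dedicated wandb service process
    # instead of testing a handle to a process they didn't spawn.
    wandb.require("service")
    service_enabled = True
except ImportError:
    # wandb is not installed in this environment
    pass
```

With the service enabled, a subsequent wandb.init() in the main script (or in spawned workers) should avoid the "can only test a child process" assertion.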

@sydholl sydholl added the ty:bug type of the issue is a bug label May 9, 2022
@zplizzi

zplizzi commented Jun 2, 2022

Also just encountered this issue with wandb version 0.12.15. Fixed with wandb.require("service").

@kptkin
Contributor

kptkin commented Jun 2, 2022

@zplizzi glad to hear that this solved your issue. BTW, we have a pre-release with service activated by default if you want to try it out: https://docs.wandb.ai/guides/track/advanced/distributed-training#wandb-service-beta

@kptkin kptkin closed this as completed Jun 17, 2022
timokau added a commit to timokau/avalon that referenced this issue Nov 16, 2022
The wandb service is enabled by default [1][2] starting with wandb sdk
0.13.0. This commit adjusts the requirements accordingly. The setup call
is still necessary, since we later initialize wandb in subprocesses.

[1] wandb/wandb#1994 (comment)
[2] https://docs.wandb.ai/guides/track/advanced/distributed-training#w-and-b-sdk-0.13.0-and-above
timokau added two further commits with the same message to timokau/avalon that referenced this issue on Nov 18 and Nov 23, 2022.
@MoAKgit

MoAKgit commented May 30, 2023

Hi, I got the same issue, using Pytorch Lightning in Google Colab!
