
Get "AssertionError: can only test a child process" when using distributed TPU cores via Pytorch Lightning [CLI] #1994

Closed
adamDhalla opened this issue Mar 23, 2021 · 32 comments
Labels
a:cli Area: Client ty:bug type of the issue is a bug

Comments

@adamDhalla

Hi All,

I'm getting an error when trying to log my metrics and hyperparameters to W&B via PyTorch Lightning while running on 8 TPU cores.

I first initialize the Weights and Biases run and project using the Lightning WandbLogger class, which effectively calls wandb.init(). That goes fine. But when I run the Trainer on 8 TPU cores with the keyword argument logger=my_WandbLogger, I get the error AssertionError: can only test a child process.

[screenshot of the error]

Note that I tried this on a single TPU core, and that went fine and dandy. So it seems to be a problem with the distributed-processing side of things.

How to reproduce
This isn't my code, but someone had the same issue a while back (though I couldn't find their solution). It's built on the bug-reproducer template ('The BoringModel') that PyTorch Lightning uses. Reproduction HERE.

I'm running on Google Colab, with PyTorch Lightning version 1.2.4 (most recent) and W&B version 0.10.22 (one version behind the latest).

Here's the full error stack trace if you're curious:

GPU available: False, used: False
TPU available: True, using: 8 TPU cores
---------------------------------------------------------------------------
ProcessRaisedException                    Traceback (most recent call last)
<ipython-input-28-6650dc1eec9a> in <module>()
      3 wbLogger = WandbLogger(project='HPA Protein Localization Single Class Subset', name='Adam-128-0.001')
      4 trainer = Trainer(logger=wbLogger, deterministic=True, tpu_cores=8, max_epochs=epochNum, replace_sampler_ddp=False)
----> 5 trainer.fit(model, trainDL, valDL)
      6 
      7 print(time.time() - t0)

/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    148         msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
    149         msg += original_trace
--> 150         raise ProcessRaisedException(msg, error_index, failed_process.pid)
    151 
    152 

ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/lib/python3.7/logging/__init__.py", line 1028, in emit
    stream.write(msg + self.terminator)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
    cb(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 796, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 187, in publish_output
    self._publish_output(o)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 192, in _publish_output
    self._publish(rec)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 517, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.7/multiprocessing/process.py", line 151, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 83, in new_process
    seed_everything(int(seed))
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/seed.py", line 54, in seed_everything
    log.info(f"Global seed set to {seed}")
  File "/usr/lib/python3.7/logging/__init__.py", line 1378, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib/python3.7/logging/__init__.py", line 1514, in _log
    self.handle(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1524, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1586, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 894, in handle
    self.emit(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1033, in emit
    self.handleError(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 946, in handleError
    sys.stderr.write('--- Logging error ---\n')
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
    cb(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 796, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 187, in publish_output
    self._publish_output(o)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 192, in _publish_output
    self._publish(rec)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 517, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.7/multiprocessing/process.py", line 151, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 334, in _mp_start_fn
    file=sys.stderr)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
    cb(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 796, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 187, in publish_output
    self._publish_output(o)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 192, in _publish_output
    self._publish(rec)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 517, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.7/multiprocessing/process.py", line 151, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

I'm wondering if there are any temporary workarounds for now since I need to find a way to connect and things are a bit time-sensitive!

@adamDhalla adamDhalla added the a:cli Area: Client label Mar 23, 2021
@raubitsj
Member

raubitsj commented Mar 23, 2021

@adamDhalla,
We are looking into this issue now. We have some documentation and fixes related to DDP and multiprocess use that we are working on.

I am trying to reproduce the error, but for now (based on that exception) I suggest a temporary workaround of turning console logging off, using the env variable WANDB_CONSOLE=off

@prash-p

prash-p commented Mar 23, 2021

I'm having the same error. I found this issue reported on the huggingface page, and this comment helped me fix the issue temporarily: huggingface/datasets#847 (comment)

@adamDhalla
Author

> @adamDhalla,
> We are looking into this issue now. We have some documentation and fixes related to DDP and multiprocess use that we are working on.
>
> I am trying to reproduce the error, but for now (based on that exception) I suggest a temporary workaround of turning console logging off, using the env variable WANDB_CONSOLE=off

Thanks! But I tried that: setting WANDB_CONSOLE='off' (I assumed 'off' should be a string, since it doesn't work the other way?). I put it right above the wandb.init() call, and I still got the exact same error :(.

@adamDhalla
Author

@raubitsj @prash-p I didn't test these at the same time, but I'm including them both in one screenshot so you can see if I messed up any of the spellings:
[screenshot: both attempted workarounds]

If these implementations are correct, sadly they both (independently) did not solve the problem :((

@borisdayma
Contributor

borisdayma commented Mar 24, 2021

You need to use os.environ['WANDB_CONSOLE'] = 'off'
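To make the placement concrete, here is a minimal sketch of that workaround. The key point is that the variable must be set in the Python process before wandb starts up, i.e. before wandb.init() (and ideally before importing wandb at all):

```python
import os

# Disable wandb's console-output capture. This must run before
# wandb.init() is called; setting it before importing wandb is safest.
os.environ["WANDB_CONSOLE"] = "off"
```

After this, wandb.init() (or Lightning's WandbLogger) should pick up the setting and skip the console-redirect code path that raises the assertion.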

@prash-p

prash-p commented Mar 24, 2021

@adamDhalla I didn't use WANDB_CONSOLE=off. What worked for me instead was commenting out lines 517 and 518 in site-packages/wandb/sdk/interface/interface.py :

    def _publish(self, record: pb.Record, local: bool = None) -> None:
        #if self._process and not self._process.is_alive():
        #    raise Exception("The wandb backend process has shutdown")
        if local:
            record.control.local = local
        if self.record_q:
            self.record_q.put(record)

@raubitsj
Member

@prash-p
Those lines are a safety check to make sure that the process storing console output is the process we expect; in this case it isn't (and we don't fully understand why).

Removing that check should be fine for console output, but for other types of data the code is guarding against a use case we haven't fully tested yet, which could lead to corrupted or incomplete data that we want to avoid.

We are still having trouble reproducing this failure, so we can't yet provide a safe workaround (and get the fix into a release ASAP).

@adamDhalla
Author

@borisdayma oh god yeah of course! Don't know what I was thinking. Still kind of a coding noob, but thank you.

@prash-p

prash-p commented Mar 24, 2021

@raubitsj If it helps, I had this error when running my script as a notebook in vscode, using tqdm on my pytorch Dataloader. The exact same script ran without errors when running from the command line.

@adamDhalla
Author

Any progress on this?

@hongvin

hongvin commented Apr 13, 2021

I encounter the same error, when training ViT with torch_xla using TPUs. Any progress?

@KyleGoyette
Contributor

Hi @khvmaths and @adamDhalla, we're working on a solution. This may require changes to the wandb logger in PyTorch Lightning, which could take a couple of weeks to get into a release. We'll let you folks know here when a branch is ready.

@github-actions

This issue is stale because it has been open 60 days with no activity.

@github-actions github-actions bot added the stale label Jun 21, 2021
@prikmm

prikmm commented Jul 24, 2021

@KyleGoyette Hi, any progress on this issue? I didn't use pytorch-lightning; I got this issue while using transformers and pytorch-xla.

@KyleGoyette
Contributor

@prikmm Either the next release or the release after will have a new experimental mode of running that supports this case.

@github-actions github-actions bot removed the stale label Jul 25, 2021
@github-actions

This issue is stale because it has been open 60 days with no activity.

@github-actions github-actions bot added the stale label Sep 23, 2021
@Tesla-1i

> @raubitsj If it helps, I had this error when running my script as a notebook in vscode, using tqdm on my pytorch Dataloader. The exact same script ran without errors when running from the command line.

I also have this error when using tqdm on my pytorch dataloader

@R0bk

R0bk commented Nov 23, 2021

Hey, I've still got this issue training on TPUs with torch XLA.

@shaunster0

I'm still getting this issue using pytorch-lightning in jupyter-lab ... any updates?

@Malachyiii

Malachyiii commented Dec 22, 2021

Still getting this issue in Kaggle notebooks when using pytorch lightning

@andreimargeloiu

Thumbs up, same issue.

@kptkin
Contributor

kptkin commented Jan 3, 2022

Hi and thanks for reporting this issue.

We are currently working on improvements to our multiprocessing support and were wondering if you would be interested in trying them out. The new logic should be more robust and allow more flexibility when using multiprocessing with wandb.

In case you are interested in trying it out, see this link: https://github.com/wandb/client/blob/master/docs/dev/wandb-service-user.md

One thing to note: this is still experimental, and although we have tested it, it is still being developed and improved all the time.

@borisdayma
Contributor

Hi, we just introduced a tentative fix.
Could you update PyTorch Lightning from the master branch and install the most recent version of wandb?

    pip install --upgrade wandb
    pip install --upgrade git+https://github.com/PytorchLightning/pytorch-lightning.git

@sreuvers

This unfortunately does not work for me; I'm still getting the error.

@borisdayma
Contributor

Could you confirm which versions you are using and whether you have reproducible code?

@tsuga

tsuga commented Mar 17, 2022

I still have the same issue on Colab.
wandb: 0.12.11
pytorch-lightning: 1.6.0dev

Colab reproducible code:
https://colab.research.google.com/drive/1bH8tpR5U7SalfQpbPNVHABh3lSNjt9s9?usp=sharing

@borisdayma
Contributor

borisdayma commented Mar 17, 2022

Looks like it works if you comment out wandb.login().

@raubitsj @kptkin is it possible that this somehow prevents the wandb.require("service")?

@leoleoasd

Hi @borisdayma, I ran into the same problem with torch.multiprocessing.
A minimal reproduction is here: https://gist.github.com/leoleoasd/5d8dd5e1cec5bb2822656e66ede55c55

@borisdayma
Contributor

@leoleoasd I think you can just add wandb.require("service") at the top of your script.
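As a rough sketch of that suggestion (the import guard and the service_enabled flag are additions for illustration, so the snippet also runs where wandb isn't installed; wandb.require("service") needs a wandb version that ships the service mode, roughly 0.12.5+):

```python
service_enabled = False
try:
    import wandb

    # Opt in to the service-based backend before any wandb.init() call.
    # Child processes then talk to a dedicated wandb service process
    # instead of testing a handle to a process they didn't spawn.
    wandb.require("service")
    service_enabled = True
except ImportError:
    # wandb is not installed in this environment
    pass
```

With the service enabled, a subsequent wandb.init() in the main script (or in spawned workers) should avoid the "can only test a child process" assertion.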

@sydholl sydholl added the ty:bug type of the issue is a bug label May 9, 2022
@zplizzi

zplizzi commented Jun 2, 2022

Also just encountered this issue with wandb version 0.12.15. Fixed with wandb.require("service").

@kptkin
Contributor

kptkin commented Jun 2, 2022

@zplizzi glad to hear that this solved your issue. BTW, we have a pre-release with service activated by default if you want to try it out: https://docs.wandb.ai/guides/track/advanced/distributed-training#wandb-service-beta

@kptkin kptkin closed this as completed Jun 17, 2022
timokau added a commit to timokau/avalon that referenced this issue Nov 16, 2022
The wandb service is enabled by default [1][2] starting with wandb sdk
0.13.0. This commit adjusts the requirements accordingly. The setup call
is still necessary, since we later initialize wandb in subprocesses.

[1] wandb/wandb#1994 (comment)
[2] https://docs.wandb.ai/guides/track/advanced/distributed-training#w-and-b-sdk-0.13.0-and-above
timokau added two further commits with the same message to timokau/avalon that referenced this issue on Nov 18 and Nov 23, 2022.
@MoAKgit

MoAKgit commented May 30, 2023

Hi, I got the same issue, using Pytorch Lightning in Google Colab!
