[CLI] Wandb stalls training loop #3223

Closed
Guitaricet opened this issue Feb 6, 2022 · 7 comments
Labels
a:cli Area: Client

Comments

@Guitaricet

Description

This is a very frustrating bug, which I guess is also hard to pinpoint, but I lost a couple of thousand $ on cloud compute because wandb stalled the whole program during training 😢. I love wandb, it is genuinely one of the best products I have ever used, but I think it is not safe for me to continue using it for large-scale tasks (where it would help me quite a bit).

About an hour after starting the training script, training stops (the GPUs don't run anything; a CPU process is still alive but does nothing). When I CTRL+C the script, the stacktrace looks like this:

^CProcess wandb_internal:
Traceback (most recent call last):
  File "/home/opc/miniconda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/opc/miniconda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/opc/miniconda/lib/python3.9/site-packages/wandb/sdk/internal/internal.py", line 153, in wandb_internal
    thread.join()
  File "/home/opc/miniconda/lib/python3.9/threading.py", line 1033, in join
    self._wait_for_tstate_lock()
  File "/home/opc/miniconda/lib/python3.9/threading.py", line 1049, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
^CTraceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/opc/miniconda/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/opc/miniconda/lib/python3.9/multiprocessing/spawn.py", line 129, in _main
    return self._bootstrap(parent_sentinel)
  File "/home/opc/miniconda/lib/python3.9/multiprocessing/process.py", line 333, in _bootstrap
    threading._shutdown()
  File "/home/opc/miniconda/lib/python3.9/threading.py", line 1428, in _shutdown
    lock.acquire()
KeyboardInterrupt
^C^CException in thread NetStatThr:
Traceback (most recent call last):
  File "/home/opc/miniconda/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/home/opc/miniconda/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/home/opc/miniconda/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 152, in check_network_status
    status_response = self._interface.communicate_network_status()
  File "/home/opc/miniconda/lib/python3.9/site-packages/wandb/sdk/interface/interface.py", line 125, in communicate_network_status
    resp = self._communicate_network_status(status)
  File "/home/opc/miniconda/lib/python3.9/site-packages/wandb/sdk/interface/interface_shared.py", line 388, in _communicate_network_status
    resp = self._communicate(req, local=True)
  File "/home/opc/miniconda/lib/python3.9/site-packages/wandb/sdk/interface/interface_shared.py", line 213, in _communicate
    return self._communicate_async(rec, local=local).get(timeout=timeout)
  File "/home/opc/miniconda/lib/python3.9/site-packages/wandb/sdk/interface/interface_shared.py", line 218, in _communicate_async
    raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown
Exception in thread ChkStopThr:
Traceback (most recent call last):
  File "/home/opc/miniconda/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/home/opc/miniconda/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/home/opc/miniconda/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 170, in check_status
    status_response = self._interface.communicate_stop_status()
  File "/home/opc/miniconda/lib/python3.9/site-packages/wandb/sdk/interface/interface.py", line 114, in communicate_stop_status
    resp = self._communicate_stop_status(status)
  File "/home/opc/miniconda/lib/python3.9/site-packages/wandb/sdk/interface/interface_shared.py", line 378, in _communicate_stop_status
    resp = self._communicate(req, local=True)
  File "/home/opc/miniconda/lib/python3.9/site-packages/wandb/sdk/interface/interface_shared.py", line 213, in _communicate
    return self._communicate_async(rec, local=local).get(timeout=timeout)
  File "/home/opc/miniconda/lib/python3.9/site-packages/wandb/sdk/interface/interface_shared.py", line 218, in _communicate_async
    raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown

Wandb features

  • Metric logging

How to reproduce

Unfortunately, I did not find a simple way to reproduce it, but I get it every time I run this T5 distillation script.

export TOKENIZERS_PARALLELISM=false
export MODEL_DIR=pretrained_models/distilt5_6l_8h_512d_2048ff
export TEACHER_MODEL=t5-large


python run_lfom_distillation_flax.py \
	--output_dir=$MODEL_DIR \
	--model_type="t5" \
	--config_name="t5-small" \
	--tokenizer_name=$TEACHER_MODEL \
	--teacher_model_name_or_path=$TEACHER_MODEL \
	--dataset_name="c4" \
	--dataset_config_name="en" \
	--preprocessing_num_workers="64" \
	--max_seq_length="256" \
	--temperature 2.0 \
	--per_device_train_batch_size="64" \
	--per_device_eval_batch_size="64" \
	--adafactor \
	--learning_rate="0.01" \
	--weight_decay="0.001" \
	--warmup_steps="1024" \
	--overwrite_output_dir \
	--logging_steps="8" \
	--save_steps="1024" \
	--eval_steps="512" \
	--num_train_epochs "1" \
	--push_to_hub \
	--dataset_fraction="0.1"

Environment

  • OS: Oracle Linux 8
  • Environment: 8x A100 GPUs, Jax 0.2.28
  • Python Version: 3.9.5
@Guitaricet Guitaricet added the a:cli Area: Client label Feb 6, 2022
@amankalra172

I have the same issue, any luck?

@nate-wandb
Contributor

Hi @Guitaricet,
Sorry to hear that W&B is making your run crash. Could you try setting the env variable WANDB_START_METHOD="thread"? This tends to solve a lot of the issues with the backend process shutting down on Linux machines. If that doesn't work, then I will try to replicate your bug and look into it more.
Thank you,
Nate
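
For readers who prefer to apply the suggestion above from inside the training script rather than from the shell, here is a minimal sketch. It assumes the variable is set before wandb is initialized; the helper name `use_thread_start_method` is hypothetical, and the commented-out `wandb.init` call uses a placeholder project name.

```python
import os


def use_thread_start_method() -> str:
    """Set WANDB_START_METHOD to "thread" for the current process.

    Assumption: this runs before wandb.init(), so the setting is
    already in the environment when the wandb backend is started.
    """
    os.environ["WANDB_START_METHOD"] = "thread"
    return os.environ["WANDB_START_METHOD"]


# Call this at the very top of the training script.
use_thread_start_method()

# Afterwards, initialize wandb as usual (placeholder project name):
# import wandb
# run = wandb.init(project="my-project")
```

The shell equivalent would be an `export WANDB_START_METHOD=thread` line next to the other exports in the launch script from the issue description.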

@Guitaricet
Author

Thank you, this sounds like a reasonable solution. I will try it out this week.

@nate-wandb
Contributor

Hi @Guitaricet ,
I wanted to follow up and see whether this solution worked for you.

Thank you,
Nate

@Guitaricet
Author

Hi! The solution seems to be working!

@nate-wandb
Contributor

@Guitaricet Happy to hear that! Thank you for writing in.

@Marroh

Marroh commented Apr 12, 2022

I had the same problem and solved it by setting the env variable WANDB_START_METHOD="thread"! I guess #3031 indicates what exactly caused the bug.
