[CLI]: wandb makes no progress uploading artifacts after hours of waiting #7266

Open
JasonGross opened this issue Mar 30, 2024 · 3 comments
Labels
c:api Public api c:artifacts Candidate for artifact branch

Comments

@JasonGross

Describe the bug

I see

wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=19 sec). Create a new API with an
integer timeout larger than 19, e.g., `api = wandb.Api(timeout=29)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=19 sec). Create a new API with an
integer timeout larger than 19, e.g., `api = wandb.Api(timeout=29)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=19 sec). Create a new API with an
integer timeout larger than 19, e.g., `api = wandb.Api(timeout=29)` to increase the graphql timeout.
Epochs          ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3000/3000 0:32:56 • 0:00:00 1.58it/s
Epoch 2999/2999 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1       0:00:00 • 0:00:00 0.00it/s v_num: a7r7 loss: 0.000 acc: 1.000 periodic_test_loss: 0.000 periodic_test_acc: 1.000
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/root/jason_code/guarantees-based-mechanistic-interpretability/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: PossibleUserWarning:

The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│         test_acc          │            1.0            │
│         test_loss         │   9.035960601977422e-08   │
└───────────────────────────┴───────────────────────────┘
Testing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 0:00:00 • 0:00:00 15.22it/s
Saving to disk...
Saving to WandB...
wandb: WARNING No program path found, not creating job artifact. See https://docs.wandb.ai/guides/launch/create-job
wandb: / 0.000 MB of 633.037 MB uploaded

This is for https://wandb.ai/gbmi/MaxOf2-3000-epochs-adj-0,1,2,17-training-ratio-0.100-with-eos-nondeterministic/runs/ku05a7r7. The media does not display at all, even though plotly.express plots are uploaded on every epoch. The upload has been stuck at 0 MB for hours. I am not 100% sure how to reproduce, but you could try running the following command from https://github.com/JasonGross/guarantees-based-mechanistic-interpretability/:

    poetry run python -m gbmi.exp_max_of_n.train --max-of 2 --non-deterministic --train-for-epochs 3000 --validate-every-epochs 1 --force-adjacent-gap 0,1,2,17 --use-log1p --lr 0.001 --betas 0.9 0.98 --weight-decay 1.0 --optimizer AdamW --training-ratio 0.099609375 --log-matrix-interp --checkpoint-every-epochs 1 --batch-size 408 --log-every-n-steps 1 --use-end-of-sequence --use-kaiming-init

Additional Files

No response

Environment

WandB version: 0.16.3

OS: Linux 0e5569e59be2 5.4.0-172-generic #190-Ubuntu SMP Fri Feb 2 23:24:22 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Python version: 3.10.13

Versions of relevant libraries: ?

Additional Context

No response

@JasonGross
Author

JasonGross commented Mar 31, 2024

Possibly related: after running into this issue with two wandb processes simultaneously, I'm seeing

OpenBLAS blas_thread_init: pthread_create failed
OpenBLAS blas_thread_init: pthread_create failed for thread 10 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 11 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 12 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 13 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 14 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 15 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 16 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 17 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 18 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 19 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 20 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 21 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 22 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 23 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 24 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 25 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 26 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 27 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 28 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 29 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 30 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 31 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 32 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 33 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 34 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 35 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 36 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 37 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 38 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 39 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 40 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 41 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 42 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 43 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 44 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 45 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 46 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 47 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 48 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 49 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 50 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 51 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 52 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 53 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 54 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 55 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 56 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 57 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 58 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 59 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 60 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 61 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 62 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 63 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
Traceback (most recent call last):
  File "/root/jason_code/guarantees-based-mechanistic-interpretability/.venv/lib/python3.10/site-packages/numpy/core/__init__.py", line 24, in <module>
    from . import multiarray
  File "/root/jason_code/guarantees-based-mechanistic-interpretability/.venv/lib/python3.10/site-packages/numpy/core/multiarray.py", line 10, in <module>
    from . import overrides
  File "/root/jason_code/guarantees-based-mechanistic-interpretability/.venv/lib/python3.10/site-packages/numpy/core/overrides.py", line 8, in <module>
    from numpy.core._multiarray_umath import (
ImportError: PyCapsule_Import could not import module "datetime"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/jason_code/guarantees-based-mechanistic-interpretability/.venv/lib/python3.10/site-packages/numpy/__init__.py", line 130, in <module>
    from numpy.__config__ import show as show_config
  File "/root/jason_code/guarantees-based-mechanistic-interpretability/.venv/lib/python3.10/site-packages/numpy/__config__.py", line 4, in <module>
    from numpy.core._multiarray_umath import (
  File "/root/jason_code/guarantees-based-mechanistic-interpretability/.venv/lib/python3.10/site-packages/numpy/core/__init__.py", line 50, in <module>
    raise ImportError(msg)
ImportError:

IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.

We have compiled some common reasons and troubleshooting tips at:

    https://numpy.org/devdocs/user/troubleshooting-importerror.html

Please note and check the following:

  * The Python version is: Python3.10 from "/root/jason_code/guarantees-based-mechanistic-interpretability/.venv/bin/python"
  * The NumPy version is: "1.26.4"

and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.

Original error was: PyCapsule_Import could not import module "datetime"


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/jason_code/guarantees-based-mechanistic-interpretability/gbmi/exp_max_of_n/train.py", line 5, in <module>
    import matplotlib.pyplot as plt
  File "/root/jason_code/guarantees-based-mechanistic-interpretability/.venv/lib/python3.10/site-packages/matplotlib/__init__.py", line 124, in <module>
    import numpy
  File "/root/jason_code/guarantees-based-mechanistic-interpretability/.venv/lib/python3.10/site-packages/numpy/__init__.py", line 135, in <module>
    raise ImportError(msg) from e
ImportError: Error importing numpy: you should not try to import numpy from
        its source directory; please exit the numpy source tree, and relaunch
        your python interpreter from there.
Segmentation fault (core dumped)

when trying to run a third process. Following advice from Stack Exchange, setting OPENBLAS_NUM_THREADS=1 seems to fix the issue with the third process, so maybe wandb is not robust to failures to create enough threads?
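A minimal sketch of that workaround applied from inside Python rather than from the shell. It assumes the variable has to be set before NumPy (and therefore OpenBLAS) is first imported, and the helper name `cap_openblas_threads` is hypothetical, not part of any library:

```python
import os


def cap_openblas_threads(num_threads: int = 1) -> str:
    """Hypothetical helper: cap the OpenBLAS thread pool for this process.

    OpenBLAS reads OPENBLAS_NUM_THREADS when the library is loaded, so this
    should run before the first `import numpy` (or anything that imports it,
    such as matplotlib).
    """
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    os.environ["OPENBLAS_NUM_THREADS"] = str(num_threads)
    return os.environ["OPENBLAS_NUM_THREADS"]


# Call this at the very top of the entry point, before numpy is imported.
cap_openblas_threads(1)
```

The shell equivalent is to prefix the command with `OPENBLAS_NUM_THREADS=1`. Either way, this only sidesteps the `pthread_create` failures in the new process; it does not address whatever was exhausting threads in the first place.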

@JasonGross
Author

This seems to be fixed after upgrading to the latest wandb, but it's pretty unfriendly for recent versions of the CLI to hang forever.

@kptkin kptkin added a:cli Area: Client c:api Public api c:artifacts Candidate for artifact branch and removed c:api Public api labels Apr 2, 2024
@kptkin
Contributor

kptkin commented Apr 2, 2024

@JasonGross sorry that you experienced this issue, and glad to hear it was resolved. Agreed, it is not great if we didn't provide an appropriate indication.
We are still happy to debug what went wrong, but it usually helps if we have access to the debug logs and/or a repro script, as that helps us better understand what we should be debugging.
Thanks!
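For anyone gathering those logs, here is a rough sketch of how they might be located. It assumes the default local layout, where each run directory under `./wandb` has a `logs/` folder containing `debug.log` and `debug-internal.log`; the function name `find_wandb_debug_logs` is hypothetical, and the pattern may need adjusting if runs are stored elsewhere:

```python
from pathlib import Path


def find_wandb_debug_logs(root: str = "wandb") -> list[Path]:
    """Return debug log files found under `root`, sorted by path.

    Assumes the layout <root>/<run-dir>/logs/debug*.log; this is an
    assumption about the default wandb directory structure.
    """
    return sorted(Path(root).glob("*/logs/debug*.log"))


if __name__ == "__main__":
    # Print every candidate log so it can be attached to the issue.
    for log_path in find_wandb_debug_logs():
        print(log_path)
```

Run it from the directory where training was launched. An empty result most likely means the run directory lives somewhere else.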

@kptkin kptkin removed the a:cli Area: Client label Apr 18, 2024