[slurm] wandb hangs at the end of jobs in dryrun mode #919

Open
bknyaz opened this issue Mar 18, 2020 · 15 comments
Labels
a:cli Area: Client c:core Component: Core

Comments

@bknyaz

bknyaz commented Mar 18, 2020

wandb --version && python --version && uname

  • Weights and Biases version: 0.8.21
  • Python version: Python 3.6.8 :: Anaconda, Inc.
  • Operating System: CentOS Linux release 7.7.1908 (Core)

Description

I'm using wandb on a GPU cluster with slurm to run jobs.
After the script finishes, wandb prints the following:

wandb: Waiting for W&B process to finish, PID {some process id}
wandb: Program ended successfully.

The problem is that the slurm scheduler doesn't end the job, so it keeps occupying the GPU node. Perhaps some wandb processes are still running for some reason?

I'm not sure if the issue is with wandb or with the cluster I'm using. The cluster is one of the biggest in Canada, so I can imagine other people hit this issue too, and it can leave a lot of nodes idle for no reason. It would be great to solve this.

Other clusters I've used with Ubuntu and Internet access worked fine.

I use WANDB_MODE=dryrun because the cluster doesn't have access to the external network.
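
For reference, one way to set dryrun mode from Python before wandb.init() is called; this is a minimal sketch, and the project name is a placeholder:

import os

# The compute nodes have no external network access, so force dryrun
# mode before wandb is initialized.
os.environ["WANDB_MODE"] = "dryrun"

import wandb

run = wandb.init(project="my-project")  # "my-project" is a placeholder
# ... training code ...
run.finish()  # tell the W&B background process to shut down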

Update
My impression is that wandb tries to connect to the server after the script finishes, but because there is no connection it raises an exception and the process gets stuck.

In one of my log files I found an additional line printed at the end regarding the connection:

wandb: Waiting for W&B process to finish, PID {some process id}
wandb: Program ended successfully.
wandb: ERROR Failed to connect to W&B. Retrying in the background.

What I Did

See above.

Thanks.

@cvphelps
Contributor

Thanks for reporting this! @raubitsj could you please take a look?

@cvphelps cvphelps changed the title Wandb processes do not finish on a cluster [slurm] wandb hangs at the end of jobs in dryrun mode Mar 18, 2020
@ariG23498
Contributor

Hey @bknyaz
Over the past year we've substantially reworked the CLI and UI for Weights & Biases. We're closing issues older than 6 months. Please comment to reopen.

@lukekenworthy

I am having this problem as well. Has anyone ever figured this out?

@vanpelt
Contributor

vanpelt commented Aug 16, 2021

@lukekenworthy can you provide an example script? If you're using multiprocessing in your scripts you may need to explicitly call wandb.finish() in the process that called wandb.init once processing has completed.
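
A minimal sketch of that pattern, assuming the run is created in the parent process and the heavy work happens in a worker pool (all names are illustrative):

import multiprocessing as mp

import wandb


def worker(x):
    # Child processes do the heavy work; they do not own the wandb run.
    return x * x


if __name__ == "__main__":
    run = wandb.init(project="my-project")  # placeholder project name
    with mp.Pool(processes=4) as pool:
        results = pool.map(worker, range(8))
    run.log({"sum_of_squares": sum(results)})
    # Finish explicitly in the process that called wandb.init(), so the
    # W&B background process exits and the slurm job can end.
    run.finish()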

@mohamedr002

@lukekenworthy can you provide an example script? If you're using multiprocessing in your scripts you may need to explicitly call wandb.finish() in the process that called wandb.init once processing has completed.

I am having the same issue when using wandb.sweep. Where exactly can I put wandb.finish() in the script?
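
One pattern that seems consistent with the advice above is to call wandb.init() and wandb.finish() inside the function passed to wandb.agent(), so each trial closes its own run. A rough sketch, with placeholder sweep config and project name:

import wandb

sweep_config = {  # placeholder sweep definition
    "method": "random",
    "metric": {"name": "loss", "goal": "minimize"},
    "parameters": {"lr": {"values": [1e-3, 1e-4]}},
}


def train():
    run = wandb.init()  # the agent fills run.config from the sweep
    lr = run.config.lr
    # ... train with lr ...
    run.log({"loss": 0.0})  # placeholder metric
    run.finish()  # close this trial's run before the agent starts the next one


sweep_id = wandb.sweep(sweep_config, project="my-project")  # placeholder project
wandb.agent(sweep_id, function=train, count=2)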

@ssadhukha

I'm dealing with the same issue, any solutions for this yet?

@zdhNarsil

Same issue here. Any update?

@vanpelt
Contributor

vanpelt commented Mar 4, 2023

@ssadhukha or @zdhNarsil can you share an example script that gets you into this state?

@vanpelt vanpelt reopened this Mar 4, 2023
@davidoort

Same issue here

@ssadhukha

ssadhukha commented Mar 4, 2023

I'm running into this state when I start a different run inside a for loop. The main parts of my script are below:

## IMPORT PACKAGES + LOCAL MODULES

import os
import sys
from glob import glob  # used below to locate previously trained model weights

import torch
from torch.utils.data import DataLoader
from torch.utils.data import Subset
import pytorch_lightning as pl  # lightning toolbox
from pytorch_lightning.loggers import WandbLogger
import wandb

wandb.login()
wandb_logger = WandbLogger(project="transformer-tcg", log_model=True)
pl.seed_everything(42)

sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../..")))
from tools.analysis_utils import *
from tools.dataset import *

## FUNCTIONS

def main():
    run = wandb.init(name=f"pair{int(pr)}-{dtype}", project="transformer", allow_val_change=True)
    print(f"Num of layers: {nlayers}")
    print(f"Num of heads: {nheads}")
    basename = (
        f"pr{int(pr)}_trl_{trl}_{dtype}_{nlayers}layers_{nheads}heads_{ndims}dims_"
    )
    models = glob(os.path.join(pr_modeldir, "%s*" % basename + "*weights"))
    if models == []:
        ## For the first trained model
        filename = basename + "0000"
    else:
        last_num = max([int(i.split("_")[-2]) for i in models])
        filename = basename + f"{str(last_num + 1).zfill(4)}"

    filename = os.path.join(pr_modeldir, filename)

    if not os.path.isfile(filename + "_weights"):
        print("Training " + filename)
        model, trainer = train(
            train_loader,
            val_loader,
            filename,
            pretrain_filename=pretrain_filename,
            early_stopping_method=early_stopping_method,
            patience=patience,
            verbose=None,
            max_epochs=max_epochs,
            mode=mode,
            min_delta=min_delta,
            input_dim=nfeats,
            max_len=maxlen,
            num_classes=nclasses,
            model_dim=ndims,
            num_heads=nheads,
            num_layers=nlayers,
            max_pool=False,
            dropout=dropout,
            lr=lr,
            warmup=warmup,
            min_acc=min_acc,
            cross_val=cross_val,
            logger=False,
            stop_early=stop_early,
        )
        run.finish()  # only reached when a new model was trained
    return model, trainer, filename

## MAIN SCRIPT
# --------------------------------- Model parameters --------------------------------- #
###
###
### ...

dtype="test"
model_type="test"
projectdir="test"
pairs = [i for i in range(1,27)]
trials = [i for i in range(1,80)]
nlayers=2
nheads=2
ndims=16
for pr in pairs:
    print(f"Pair: {pr}")
    pr_modeldir = os.path.join(
        projectdir, "models", dtype, model_type, str(int(pr))
    )  # Save models here
    if os.path.exists(pr_modeldir):
        pass
    else:
        os.mkdir(pr_modeldir)
    for trl in trials:
        if trl == torch.tensor(1):
            pretrain_filename = None
        else:
            pretrain_filename = os.path.join(
                pr_modeldir,
                f"pr{int(pr)}_trl_{trl-1}_{dtype}_{nlayers}layers_{nheads}heads_{ndims}dims_0000",
            )
        mask = (torch.tensor(dataset.pairs) == pr) & (
            torch.tensor(dataset.trials) == trl
        )
        indices = torch.tensor(mask)
        indices = indices.nonzero().reshape(-1)
        train_set = Subset(dataset, indices)
        train_loader = DataLoader(
            train_set, batch_size=batch_size, shuffle=True, num_workers=0
        )
        if __name__ == "__main__":
            model, trainer, filename = main()
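
Note that in the script above run.finish() is only reached when a new model is actually trained. A rough sketch of closing each iteration's run unconditionally, assuming the rest of the training logic stays the same (ranges shortened, names mirror the script; this is an illustration, not a confirmed fix):

import wandb


def run_one(pr, trl, dtype="test"):
    # pr, trl and dtype mirror the loop variables in the script above.
    run = wandb.init(
        name=f"pair{int(pr)}-{dtype}", project="transformer", allow_val_change=True
    )
    try:
        # ... build the filename and train exactly as in main() above ...
        run.log({"trial": trl})  # placeholder logging
    finally:
        # Always close the run, even when no new model is trained or an
        # exception is raised, so no W&B process is left waiting.
        run.finish()


for pr in range(1, 3):      # shortened placeholder ranges
    for trl in range(1, 3):
        run_one(pr, trl)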


@adrialopezescoriza

Same issue here, still no update?

@cherry-nancy

same issue here, any update?

@kptkin kptkin added the a:cli Area: Client label Mar 25, 2024
@kptkin
Contributor

kptkin commented Apr 19, 2024

@adrialopezescoriza @cherry-nancy @davidoort could you please provide a small repro so we could debug the issue and hopefully resolve it for you?

@kptkin
Contributor

kptkin commented Apr 19, 2024

@ssadhukha just to verify something in your repro (#919 (comment)): if you remove log_model from this line: wandb_logger = WandbLogger(project="transformer-tcg", log_model=True), are you still seeing the issue?
Thanks!
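
For reference, the change being asked about would look roughly like this (same project name as in the repro):

from pytorch_lightning.loggers import WandbLogger

# Same logger as in the repro, but without log_model, so no model
# checkpoints are uploaded when the run finishes.
wandb_logger = WandbLogger(project="transformer-tcg")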

@kptkin kptkin added c:core Component: Core and removed c:offline env:slurm labels Apr 19, 2024
@wandb wandb deleted a comment from exalate-issue-sync bot May 2, 2024 (×17)
@anmolmann

Hey folks, we implemented network logging and a file pusher timeout for better debugging. If you are still running into this issue, please share a small repro as my colleague asked above, and try setting the env vars as suggested in this PR.
