[slurm] wandb hangs at the end of jobs in dryrun mode #919

Open
bknyaz opened this issue Mar 18, 2020 · 15 comments
Labels
a:cli Area: Client c:core Component: Core

Comments

@bknyaz

bknyaz commented Mar 18, 2020

wandb --version && python --version && uname

  • Weights and Biases version: 0.8.21
  • Python version: Python 3.6.8 :: Anaconda, Inc.
  • Operating System: CentOS Linux release 7.7.1908 (Core)

Description

I'm using wandb on a GPU cluster with slurm to run jobs.
After the script finishes, wandb prints the following:

wandb: Waiting for W&B process to finish, PID {some process id}
wandb: Program ended successfully.

The problem is that the slurm scheduler doesn't end the job, so it keeps occupying the GPU node. Perhaps some wandb processes are still running for some reason?

I'm not sure if the issue is with wandb or with the cluster I'm using. The cluster is one of the biggest in Canada, so I can imagine other people hit this issue too, and it can leave a lot of nodes idle for no reason. It would be great to solve this.

Other clusters I've used with Ubuntu and Internet access worked fine.

I use WANDB_MODE=dryrun because the cluster doesn't have access to the external network.
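
For reference, one way to set dryrun mode from Python before wandb.init() is called; this is a minimal sketch, and the project name is a placeholder:

import os

# The compute nodes have no external network access, so force dryrun
# mode before wandb is initialized.
os.environ["WANDB_MODE"] = "dryrun"

import wandb

run = wandb.init(project="my-project")  # "my-project" is a placeholder
# ... training code ...
run.finish()  # tell the W&B background process to shut down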

Update
My impression is that wandb tries to connect to the server after the script finishes, but because there is no connection it raises an exception and the process gets stuck.

In one of my log files I found an additional line printed at the end regarding the connection:

wandb: Waiting for W&B process to finish, PID {some process id}
wandb: Program ended successfully.
wandb: ERROR Failed to connect to W&B. Retrying in the background.

What I Did

See above.

Thanks.

@cvphelps
Contributor

Thanks for reporting this! @raubitsj could you please take a look?

@cvphelps cvphelps changed the title Wandb processes do not finish on a cluster [slurm] wandb hangs at the end of jobs in dryrun mode Mar 18, 2020
@ariG23498
Contributor

Hey @bknyaz
Over the past year we've substantially reworked the CLI and UI for Weights & Biases. We're closing issues older than 6 months. Please comment to reopen.

@lukekenworthy

I am having this problem as well. Has anyone ever figured this out?

@vanpelt
Contributor

vanpelt commented Aug 16, 2021

@lukekenworthy can you provide an example script? If you're using multiprocessing in your scripts you may need to explicitly call wandb.finish() in the process that called wandb.init once processing has completed.
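
A minimal sketch of that pattern, assuming the run is created in the parent process and the heavy work happens in a worker pool (all names are illustrative):

import multiprocessing as mp

import wandb


def worker(x):
    # Child processes do the heavy work; they do not own the wandb run.
    return x * x


if __name__ == "__main__":
    run = wandb.init(project="my-project")  # placeholder project name
    with mp.Pool(processes=4) as pool:
        results = pool.map(worker, range(8))
    run.log({"sum_of_squares": sum(results)})
    # Finish explicitly in the process that called wandb.init(), so the
    # W&B background process exits and the slurm job can end.
    run.finish()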

@mohamedr002

@lukekenworthy can you provide an example script? If you're using multiprocessing in your scripts you may need to explicitly call wandb.finish() in the process that called wandb.init once processing has completed.

I am having the same issue when using wandb.sweep. Where exactly can I put wandb.finish() in the script?
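
One pattern that seems consistent with the advice above is to call wandb.init() and wandb.finish() inside the function passed to wandb.agent(), so each trial closes its own run. A rough sketch, with placeholder sweep config and project name:

import wandb

sweep_config = {  # placeholder sweep definition
    "method": "random",
    "metric": {"name": "loss", "goal": "minimize"},
    "parameters": {"lr": {"values": [1e-3, 1e-4]}},
}


def train():
    run = wandb.init()  # the agent fills run.config from the sweep
    lr = run.config.lr
    # ... train with lr ...
    run.log({"loss": 0.0})  # placeholder metric
    run.finish()  # close this trial's run before the agent starts the next one


sweep_id = wandb.sweep(sweep_config, project="my-project")  # placeholder project
wandb.agent(sweep_id, function=train, count=2)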

@ssadhukha

I'm dealing with the same issue, any solutions for this yet?

@zdhNarsil

Same issue here. Any update?

@vanpelt
Contributor

vanpelt commented Mar 4, 2023

@ssadhukha or @zdhNarsil can you share an example script that gets you into this state?

@vanpelt vanpelt reopened this Mar 4, 2023
@davidoort

Same issue here

@ssadhukha

ssadhukha commented Mar 4, 2023

I'm running into this state when I start a different run inside a for loop. The main parts of my script are below:

## IMPORT PACKAGES + LOCAL MODULES

import os
import sys
from glob import glob  # used below to locate previously trained model weights

import torch
from torch.utils.data import DataLoader
from torch.utils.data import Subset
import pytorch_lightning as pl  # lightning toolbox
from pytorch_lightning.loggers import WandbLogger
import wandb

wandb.login()
wandb_logger = WandbLogger(project="transformer-tcg", log_model=True)
pl.seed_everything(42)

sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../..")))
from tools.analysis_utils import *
from tools.dataset import *

## FUNCTIONS

def main():
    run = wandb.init(name=f"pair{int(pr)}-{dtype}", project="transformer", allow_val_change=True)
    print(f"Num of layers: {nlayers}")
    print(f"Num of heads: {nheads}")
    basename = (
        f"pr{int(pr)}_trl_{trl}_{dtype}_{nlayers}layers_{nheads}heads_{ndims}dims_"
    )
    models = glob(os.path.join(pr_modeldir, "%s*" % basename + "*weights"))
    if models == []:
        ## For the first trained model
        filename = basename + "0000"
    else:
        last_num = max([int(i.split("_")[-2]) for i in models])
        filename = basename + f"{str(last_num + 1).zfill(4)}"

    filename = os.path.join(pr_modeldir, filename)

    if not os.path.isfile(filename + "_weights"):
        print("Training " + filename)
        model, trainer = train(
            train_loader,
            val_loader,
            filename,
            pretrain_filename=pretrain_filename,
            early_stopping_method=early_stopping_method,
            patience=patience,
            verbose=None,
            max_epochs=max_epochs,
            mode=mode,
            min_delta=min_delta,
            input_dim=nfeats,
            max_len=maxlen,
            num_classes=nclasses,
            model_dim=ndims,
            num_heads=nheads,
            num_layers=nlayers,
            max_pool=False,
            dropout=dropout,
            lr=lr,
            warmup=warmup,
            min_acc=min_acc,
            cross_val=cross_val,
            logger=False,
            stop_early=stop_early,
        )
        run.finish()  # only reached when a new model was trained
    return model, trainer, filename

## MAIN SCRIPT
# --------------------------------- Model parameters --------------------------------- #
###
###
### ...

dtype="test"
model_type="test"
projectdir="test"
pairs = [i for i in range(1,27)]
trials = [i for i in range(1,80)]
nlayers=2
nheads=2
ndims=16
for pr in pairs:
    print(f"Pair: {pr}")
    pr_modeldir = os.path.join(
        projectdir, "models", dtype, model_type, str(int(pr))
    )  # Save models here
    if os.path.exists(pr_modeldir):
        pass
    else:
        os.mkdir(pr_modeldir)
    for trl in trials:
        if trl == torch.tensor(1):
            pretrain_filename = None
        else:
            pretrain_filename = os.path.join(
                pr_modeldir,
                f"pr{int(pr)}_trl_{trl-1}_{dtype}_{nlayers}layers_{nheads}heads_{ndims}dims_0000",
            )
        mask = (torch.tensor(dataset.pairs) == pr) & (
            torch.tensor(dataset.trials) == trl
        )
        indices = torch.tensor(mask)
        indices = indices.nonzero().reshape(-1)
        train_set = Subset(dataset, indices)
        train_loader = DataLoader(
            train_set, batch_size=batch_size, shuffle=True, num_workers=0
        )
        if __name__ == "__main__":
            model, trainer, filename = main()
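
Note that in the script above run.finish() is only reached when a new model is actually trained. A rough sketch of closing each iteration's run unconditionally, assuming the rest of the training logic stays the same (ranges shortened, names mirror the script; this is an illustration, not a confirmed fix):

import wandb


def run_one(pr, trl, dtype="test"):
    # pr, trl and dtype mirror the loop variables in the script above.
    run = wandb.init(
        name=f"pair{int(pr)}-{dtype}", project="transformer", allow_val_change=True
    )
    try:
        # ... build the filename and train exactly as in main() above ...
        run.log({"trial": trl})  # placeholder logging
    finally:
        # Always close the run, even when no new model is trained or an
        # exception is raised, so no W&B process is left waiting.
        run.finish()


for pr in range(1, 3):      # shortened placeholder ranges
    for trl in range(1, 3):
        run_one(pr, trl)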


@adrialopezescoriza

Same issue here, still no update?

@cherry-nancy

same issue here, any update?

@kptkin kptkin added the a:cli Area: Client label Mar 25, 2024
@kptkin
Contributor

kptkin commented Apr 19, 2024

@adrialopezescoriza @cherry-nancy @davidoort could you please provide a small repro so we could debug the issue and hopefully resolve it for you?

@kptkin
Contributor

kptkin commented Apr 19, 2024

@ssadhukha just to verify something in your repro (#919 (comment)): if you remove log_model from this line: wandb_logger = WandbLogger(project="transformer-tcg", log_model=True), are you still seeing the issue?
Thanks!
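
For reference, the change being asked about would look roughly like this (same project name as in the repro):

from pytorch_lightning.loggers import WandbLogger

# Same logger as in the repro, but without log_model, so no model
# checkpoints are uploaded when the run finishes.
wandb_logger = WandbLogger(project="transformer-tcg")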

@kptkin kptkin added c:core Component: Core and removed c:offline env:slurm labels Apr 19, 2024
@wandb wandb deleted a comment from exalate-issue-sync bot May 2, 2024 (×17)
@anmolmann

Hey folks, we implemented network logging and a file pusher timeout for better debugging. If you are still running into this issue, please share a small repro as my colleague asked above, and try setting the env vars as suggested in this PR.
