feat(sdk): add shared mode to enable multiple independent writers to the same run #6882

dmitryduev · 2024-01-22T22:49:29Z

Description

Make it possible to log to the same run from multiple writers, each with its own independent wandb-core process (including writers running, for example, on different machines):

import wandb

wandb.require("core")

run = wandb.init(settings=wandb.Settings(mode="shared"))
# define custom x-axis for the train job:
run.define_metric(name="loss", step_metric="train_step")
# for the eval jobs:
run.define_metric(name="eval_accuracy", step_metric="eval_step")

NOTE: This feature only works with the SDK's new backend, wandb-core. As of version 0.17.0, you must explicitly opt in to the new backend by calling wandb.require("core").

To run the example below, do:

pip install -U wandb
# run the fake training script:
python train.py

A simple example that runs a training script and, every 20 training steps, spins up an independent eval job (which uses its own wandb-core, mimicking a situation where it runs on a different machine):

train.py

import argparse
import math
import os
import random
import subprocess
import time

import tqdm

import wandb

wandb.require("core")


def main(
    project: str = "igena",
    sleep: int = 1,
):
    run = wandb.init(
        project=project,
        settings=wandb.Settings(
            init_timeout=60,
            mode="shared",
            console="off",
            _stats_sample_rate_seconds=1,
            _stats_samples_to_average=1,
            # macOS data volume; adjust the path for other operating systems
            _stats_disk_paths=["/System/Volumes/Data"],
            disable_job_creation=True,
        ),
    )
    print("run_id:", run.id)

    run.define_metric(name="loss", step_metric="train_step")
    # for the eval job:
    run.define_metric(name="eval_accuracy", step_metric="eval_step")

    bar = tqdm.tqdm()
    train_step = 0
    eval_step = 0
    while True:
        try:
            value = math.exp(-train_step / 100) + random.random() / 20
            run.log(
                {
                    "train_step": train_step,
                    "loss": value,
                }
            )
            bar.update(1)
            train_step += 1
            time.sleep(sleep)

            # kick-off evaluation
            if train_step % 20 == 0:
                subprocess.run(
                    [
                        "python",
                        "eval.py",
                        "--attach_id",
                        run.id,
                        "--eval_step",
                        str(eval_step),
                    ],
                    # Reset WANDB_SERVICE so that the child spins up its own
                    # wandb-core; this mimics a multi-node scenario.
                    env={**os.environ, "WANDB_SERVICE": ""},
                )
                eval_step += 1

        except KeyboardInterrupt:
            bar.close()
            break

    run.finish()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("--project", type=str, default="igena")
    parser.add_argument("--sleep", type=int, default=1)

    args = parser.parse_args()

    main(**vars(args))

eval.py

import argparse
import math
import random

import wandb

wandb.require("core")


def main(attach_id: str, eval_step: int, project: str):
    run = wandb.init(
        id=attach_id,
        project=project,
        settings=wandb.Settings(
            mode="shared",
            console="off",
            _disable_machine_info=True,
            _disable_stats=True,
        ),
    )

    value = min(math.log(eval_step + 1) / 5 + random.random() / 20, 1)
    run.log(
        {
            "eval_accuracy": value,
            "eval_step": eval_step,
        },
    )

    run.finish()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("--attach_id", type=str, required=True)
    parser.add_argument("--project", type=str, default="igena")
    parser.add_argument("--eval_step", type=int, required=True)

    args = parser.parse_args()

    main(**vars(args))
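
The subprocess launch in train.py can be factored into a small helper, which makes the purpose of clearing WANDB_SERVICE easier to see and to test. This is only a sketch: `build_eval_launch` is a hypothetical name, not part of wandb, and it uses `sys.executable` instead of a bare "python" so that the eval job runs under the same interpreter as the trainer.

```python
import os
import sys
from typing import Dict, List, Mapping, Optional, Tuple


def build_eval_launch(
    run_id: str,
    eval_step: int,
    base_env: Optional[Mapping[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for launching eval.py as an independent writer.

    Hypothetical helper for illustration; it is not part of the wandb API.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Same trick as in train.py: blank out WANDB_SERVICE so that the child
    # process starts its own wandb-core instead of reusing the parent's.
    env["WANDB_SERVICE"] = ""
    argv = [
        sys.executable,
        "eval.py",
        "--attach_id",
        run_id,
        "--eval_step",
        str(eval_step),
    ]
    return argv, env
```

In train.py the call site would then read `argv, env = build_eval_launch(run.id, eval_step)` followed by `subprocess.run(argv, env=env)`.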

TODO

List all features that don't merge well in async mode
Remove the examples train.py and eval.py
Verify behavior when the user tries to use the shared run mode with the same wandb-core process and initializes two runs with the same id.

Testing

How was this PR tested?

codecov · 2024-01-22T22:53:06Z

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (bc2faa3) 79.54% compared to head (8aef6a4) 79.64%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6882      +/-   ##
==========================================
+ Coverage   79.54%   79.64%   +0.10%     
==========================================
  Files         452      452              
  Lines       50667    50712      +45     
==========================================
+ Hits        40302    40390      +88     
+ Misses      10074    10027      -47     
- Partials      291      295       +4
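
The totals in the diff above are internally consistent: the line count equals hits plus misses plus partials, and the coverage percentage is hits divided by lines. A quick check, assuming the percentage is truncated (not rounded) to two decimals; that is an inference from these numbers, not something the report states, and `coverage_percent` is a name made up for this sketch:

```python
import math


def coverage_percent(hits: int, misses: int, partials: int) -> float:
    """Coverage as hits / total lines, truncated to two decimal places."""
    lines = hits + misses + partials
    return math.floor(hits / lines * 10000) / 100


# Numbers taken from the coverage diff: base (bc2faa3) and head (8aef6a4).
base = coverage_percent(hits=40302, misses=10074, partials=291)
head = coverage_percent(hits=40390, misses=10027, partials=295)
print(base, head)
```

With truncation this reproduces the reported 79.54% and 79.64%; rounding to nearest would give 79.65% for the head commit.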

Flag      Coverage Δ
func      50.24% <80.76%> (-1.08%) ⬇️
system    64.39% <36.53%> (+0.01%) ⬆️
unit      58.89% <30.76%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.

Files                                            Coverage Δ
core/pkg/filestream/filestream.go                87.09% <100.00%> (+0.88%) ⬆️
core/pkg/filestream/loop_process.go              69.72% <100.00%> (+2.39%) ⬆️
core/pkg/server/sender.go                        83.31% <100.00%> (+0.10%) ⬆️
wandb/sdk/lib/_settings_toposort_generated.py    100.00% <ø> (ø)
wandb/sdk/wandb_settings.py                      93.22% <100.00%> (+<0.01%) ⬆️
core/pkg/server/history.go                       97.61% <96.42%> (-0.18%) ⬇️
wandb/sdk/wandb_run.py                           91.61% <50.00%> (+<0.01%) ⬆️

... and 9 files with indirect coverage changes

dmitryduev added 2 commits January 22, 2024 14:48

feat(sdk): make attach interplanetary

f6b415b

lint

dbe9db1

dmitryduev requested a review from a team as a code owner January 22, 2024 22:49

github-actions bot added the cc-feat label Jan 22, 2024

kptkin added 4 commits January 22, 2024 15:43

add another derived setting for now

2130662

make partial history work

720cd9b

lint

9ecd896

check work with init

294975e

kptkin reviewed Jan 23, 2024

core/pkg/filestream/loop_process.go (outdated, resolved)

add warning message for the async with step

b115223

kptkin reviewed Jan 23, 2024

core/pkg/filestream/filestream.go (resolved)

kptkin reviewed Jan 23, 2024

core/pkg/filestream/filestream.go (outdated, resolved)

kptkin reviewed Jan 23, 2024

core/pkg/server/sender.go (resolved)

rm attach

46322cc

kptkin changed the title ~~feat(sdk): make attach interplanetary~~ feat(sdk): asyncify my run Jan 23, 2024

codegen

2fce17d

lint

b4606c7

dmitryduev and others added 5 commits January 23, 2024 13:21

add a nay functional test

10714f3

add a nay functional test

b920978

add a nay functional test

f1e5fe8

lint

690a5f9

Merge branch 'main' into async-attach

c8ff462

dmitryduev requested a review from a team January 23, 2024 22:51

kptkin reviewed Jan 23, 2024

core/pkg/server/history.go (outdated, resolved)

dmitryduev and others added 7 commits January 23, 2024 15:45

Merge branch 'main' of https://github.com/wandb/wandb into async-attach

835fc83

Merge branch 'async-attach' of https://github.com/wandb/wandb into async-attach

16ac999

run nay tests

0c8868b

run nay tests

67e79fa

minor doc update

2942906

fix name

26723d5

Merge branch 'async-attach' of https://github.com/wandb/wandb into async-attach

8aef6a4

dmitryduev requested a review from a team January 24, 2024 00:10

kptkin approved these changes Jan 24, 2024

dmitryduev merged commit 3e785da into main Jan 24, 2024
77 of 78 checks passed

dmitryduev deleted the async-attach branch January 24, 2024 00:48
