feat(sdk): add shared mode to enable multiple independent writers to the same run #6882

Merged on Jan 24, 2024 (30 commits)

@dmitryduev dmitryduev commented Jan 22, 2024

Description

Make it possible for multiple writers to log to the same run, each with its own independent wandb-core process (including writers running, for example, on different machines):

import wandb

wandb.require("core")

run = wandb.init(settings=wandb.Settings(mode="shared"))
# define custom x-axis for the train job:
run.define_metric(name="loss", step_metric="train_step")
# for the eval jobs:
run.define_metric(name="eval_accuracy", step_metric="eval_step")

NOTE: This feature only works with the SDK's new backend, wandb-core. As of version 0.17.0, you must explicitly opt in to the new backend by calling wandb.require("core").
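The pattern described above, where one writer creates the run and the others join it by ID, can be sketched roughly as follows. The helper names `shared_init_kwargs` and `init_shared_run` are hypothetical and not part of the SDK; the only wandb calls used are the ones that appear in this PR (`wandb.require`, `wandb.Settings(mode="shared")`, and `wandb.init` with `id`, `project`, and `settings`).

```python
from typing import Any, Dict, Optional


def shared_init_kwargs(project: str, run_id: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for wandb.init in shared mode (hypothetical helper).

    The first writer omits run_id and lets wandb create the run; every
    additional writer passes the ID of the existing run.
    """
    kwargs: Dict[str, Any] = {"project": project, "settings": {"mode": "shared"}}
    if run_id is not None:
        kwargs["id"] = run_id
    return kwargs


def init_shared_run(project: str, run_id: Optional[str] = None):
    """Start or join a shared run (hypothetical wrapper around wandb.init)."""
    # Imported lazily so shared_init_kwargs can be used without wandb installed.
    import wandb

    wandb.require("core")  # shared mode needs the wandb-core backend
    kwargs = shared_init_kwargs(project, run_id)
    kwargs["settings"] = wandb.Settings(**kwargs["settings"])
    return wandb.init(**kwargs)
```

Under these assumptions, the training process would call `init_shared_run(project)` and hand `run.id` to each additional writer, which would then call `init_shared_run(project, run_id)`.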


To run the example below, do:

pip install -U wandb tqdm
# run the fake training script:
python train.py

A simple example that runs a training script and, every 20 training steps, spins up an independent eval job. The eval job uses its own wandb-core process, mimicking a situation where it runs on a different machine:

train.py

import argparse
import math
import os
import random
import subprocess
import time

import tqdm

import wandb

wandb.require("core")


def main(
    project: str = "igena",
    sleep: int = 1,
):
    run = wandb.init(
        project=project,
        settings=wandb.Settings(
            init_timeout=60,
            mode="shared",
            console="off",
            _stats_sample_rate_seconds=1,
            _stats_samples_to_average=1,
            # macOS-specific data volume; adjust for your OS
            _stats_disk_paths=["/System/Volumes/Data"],
            disable_job_creation=True,
        ),
    )
    print("run_id:", run.id)

    run.define_metric(name="loss", step_metric="train_step")
    # for the eval job:
    run.define_metric(name="eval_accuracy", step_metric="eval_step")

    bar = tqdm.tqdm()
    train_step = 0
    eval_step = 0
    while True:
        try:
            value = math.exp(-train_step / 100) + random.random() / 20
            run.log(
                {
                    "train_step": train_step,
                    "loss": value,
                }
            )
            bar.update(1)
            train_step += 1
            time.sleep(sleep)

            # kick-off evaluation
            if train_step % 20 == 0:
                subprocess.run(
                    [
                        "python",
                        "eval.py",
                        "--attach_id",
                        run.id,
                        "--eval_step",
                        str(eval_step),
                    ],
                    # Clear WANDB_SERVICE so the child process starts its own
                    # wandb-core instead of reusing this process's service;
                    # this mimics a multi-node scenario.
                    env={**os.environ, "WANDB_SERVICE": ""},
                )
                eval_step += 1

        except KeyboardInterrupt:
            bar.close()
            break

    run.finish()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("--project", type=str, default="igena")
    parser.add_argument("--sleep", type=int, default=1)

    args = parser.parse_args()

    main(**vars(args))

eval.py

import argparse
import math
import random

import wandb

wandb.require("core")


def main(attach_id: str, eval_step: int, project: str):
    run = wandb.init(
        id=attach_id,
        project=project,
        settings=wandb.Settings(
            mode="shared",
            console="off",
            _disable_machine_info=True,
            _disable_stats=True,
        ),
    )

    value = min(math.log(eval_step + 1) / 5 + random.random() / 20, 1)
    run.log(
        {
            "eval_accuracy": value,
            "eval_step": eval_step,
        },
    )

    run.finish()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("--attach_id", type=str, required=True)
    parser.add_argument("--project", type=str, default="igena")
    parser.add_argument("--eval_step", type=int, required=True)

    args = parser.parse_args()

    main(**vars(args))
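Note that train.py launches the eval job with `subprocess.run`, which blocks the training loop until the eval process exits. A non-blocking variant could look roughly like the sketch below. This is an illustration, not what the PR does: `isolated_env`, `eval_command`, and `launch_eval` are hypothetical helper names, and using `sys.executable` instead of the literal `"python"` is an assumption made here so that the child uses the same interpreter as the parent.

```python
import os
import subprocess
import sys
from typing import Dict, List, Optional


def isolated_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and clear WANDB_SERVICE (hypothetical helper).

    As in train.py, clearing WANDB_SERVICE is what makes the child process
    start its own wandb-core instead of reusing the parent's.
    """
    env = dict(os.environ if base is None else base)
    env["WANDB_SERVICE"] = ""
    return env


def eval_command(run_id: str, eval_step: int, script: str = "eval.py") -> List[str]:
    """Build the argument list for eval.py using the flags defined above."""
    return [
        sys.executable,
        script,
        "--attach_id",
        run_id,
        "--eval_step",
        str(eval_step),
    ]


def launch_eval(run_id: str, eval_step: int) -> subprocess.Popen:
    """Start the eval job in the background and return its process handle."""
    return subprocess.Popen(eval_command(run_id, eval_step), env=isolated_env())
```

With this variant the caller is responsible for eventually calling `wait()` on each returned handle, for example before `run.finish()` in the training script.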

TODO

  • List all features that don't merge well in async mode
  • Remove the examples train.py and eval.py
  • Verify behavior when the user tries to use the shared run mode with the same wandb-core process and initializes two runs with the same id.

Testing

How was this PR tested?

@dmitryduev dmitryduev requested a review from a team as a code owner January 22, 2024 22:49
codecov bot commented Jan 22, 2024

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (bc2faa3) 79.54% compared to head (8aef6a4) 79.64%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6882      +/-   ##
==========================================
+ Coverage   79.54%   79.64%   +0.10%     
==========================================
  Files         452      452              
  Lines       50667    50712      +45     
==========================================
+ Hits        40302    40390      +88     
+ Misses      10074    10027      -47     
- Partials      291      295       +4     
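The numbers in the diff above are internally consistent, which can be checked with a little arithmetic: hits, misses, and partials sum to the line count, and coverage is hits divided by lines. The sketch below assumes the percentage is truncated (not rounded) to two decimals. That assumption is inferred from these particular numbers, since 40390 / 50712 is about 79.6458%, and is not a documented Codecov rule. `coverage_percent` is a hypothetical helper.

```python
def coverage_percent(hits: int, misses: int, partials: int) -> float:
    """Coverage as hits / total lines, truncated to two decimal places."""
    lines = hits + misses + partials
    # Integer arithmetic avoids floating-point surprises when truncating.
    return (hits * 10_000 // lines) / 100


base = coverage_percent(hits=40302, misses=10074, partials=291)
head = coverage_percent(hits=40390, misses=10027, partials=295)
print(f"base={base:.2f}% head={head:.2f}%")
```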
| Flag | Coverage Δ |
| --- | --- |
| func | 50.24% <80.76%> (-1.08%) ⬇️ |
| system | 64.39% <36.53%> (+0.01%) ⬆️ |
| unit | 58.89% <30.76%> (-0.02%) ⬇️ |

| Files | Coverage Δ |
| --- | --- |
| core/pkg/filestream/filestream.go | 87.09% <100.00%> (+0.88%) ⬆️ |
| core/pkg/filestream/loop_process.go | 69.72% <100.00%> (+2.39%) ⬆️ |
| core/pkg/server/sender.go | 83.31% <100.00%> (+0.10%) ⬆️ |
| wandb/sdk/lib/_settings_toposort_generated.py | 100.00% <ø> (ø) |
| wandb/sdk/wandb_settings.py | 93.22% <100.00%> (+<0.01%) ⬆️ |
| core/pkg/server/history.go | 97.61% <96.42%> (-0.18%) ⬇️ |
| wandb/sdk/wandb_run.py | 91.61% <50.00%> (+<0.01%) ⬆️ |

... and 9 files with indirect coverage changes

@github-actions github-actions bot added cc-feat and removed cc-feat labels Jan 23, 2024
@kptkin kptkin changed the title feat(sdk): make attach interplanetary feat(sdk): asyncify my run Jan 23, 2024
@dmitryduev dmitryduev requested a review from a team January 23, 2024 22:51
Review thread on core/pkg/server/history.go (outdated, resolved)
@dmitryduev dmitryduev requested a review from a team January 24, 2024 00:10
@dmitryduev dmitryduev merged commit 3e785da into main Jan 24, 2024
77 of 78 checks passed
@dmitryduev dmitryduev deleted the async-attach branch January 24, 2024 00:48
@github-actions github-actions bot added cc-feat and removed cc-feat labels Jan 26, 2024
@github-actions github-actions bot added cc-feat and removed cc-feat labels Feb 15, 2024
@github-actions github-actions bot added cc-feat and removed cc-feat labels May 1, 2024
2 participants