Experiments: multi node logging #1246

Merged: ngrayluna merged 22 commits into main from multi_node_runs on Apr 18, 2025

Conversation

Contributor

@ngrayluna ngrayluna commented Apr 10, 2025

Cleans up the existing "Distributed logging" doc and adds a section on what the public "Multi node" feature is and how to use it.

Jira ticket: https://wandb.atlassian.net/browse/DOCS-1373

@ngrayluna ngrayluna requested a review from a team as a code owner April 10, 2025 03:35

cloudflare-workers-and-pages Bot commented Apr 10, 2025

Deploying docs with Cloudflare Pages

Latest commit: 649dcea
Status: ✅  Deploy successful!
Preview URL: https://f63c3744.docodile.pages.dev
Branch Preview URL: https://multi-node-runs.docodile.pages.dev

Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Contributor

@noaleetz noaleetz left a comment

See inline comments.

@ngrayluna ngrayluna requested a review from noaleetz April 11, 2025 02:53
@github-actions
Contributor

Images automagically compressed by Calibre's image-actions

Compression reduced images by 44.6%, saving 235.95 KB.

| Filename | Before | After | Improvement | Visual comparison |
| --- | --- | --- | --- | --- |
| assets/images/track/multi_node_system_metrics.png | 529.03 KB | 293.08 KB | -44.6% | View diff |

437 images did not require optimisation.

Comment thread content/guides/models/track/log/distributed-training.md Outdated
Contributor

@noaleetz noaleetz left a comment

Looks good! I commented on including the console log experience for shared mode (it is specific to shared mode / distributed setups).

@noaleetz noaleetz requested review from dmitryduev and kptkin April 11, 2025 16:01
Comment thread content/guides/models/track/log/distributed-training.md Outdated
@github-actions
Contributor

Images automagically compressed by Calibre's image-actions

Compression reduced images by 46.2%, saving 263.42 KB.

| Filename | Before | After | Improvement | Visual comparison |
| --- | --- | --- | --- | --- |
| assets/images/track/multi_node_console_logs.png | 569.74 KB | 306.32 KB | -46.2% | View diff |

438 images did not require optimisation.

Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Contributor

@noaleetz noaleetz left a comment

server release constraint

Comment thread content/guides/models/track/log/distributed-training.md
Contributor

@mdlinville mdlinville left a comment

The rest of my feedback -- please ignore anything where I've introduced errors or it doesn't make sense to you. I'm not so familiar with this area of content.

Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md
Comment thread content/guides/models/track/log/distributed-training.md
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Contributor

@noaleetz noaleetz left a comment

missing server release constraint

Contributor

@mdlinville mdlinville left a comment

Some small things, which I leave up to you whether to change or not. This is a big improvement to this page.

Comment thread content/guides/models/track/log/distributed-training.md Outdated
1. Checks the rank with the `--local_rank` command line argument.
1. If the rank is set to 0, sets up `wandb` logging conditionally in the [`train()`](https://github.com/wandb/examples/blob/master/examples/pytorch/pytorch-ddp/log-ddp.py#L24) function.
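
For readers of this thread, the two steps quoted above can be sketched roughly as follows. This is a minimal illustration, not the contents of `log-ddp.py`: the helper names (`parse_args`, `is_logging_process`), the `init_run` parameter, the project name, and the placeholder loss are all assumptions made for the example. Only `wandb.init`, `run.log`, and `run.finish` are real W&B calls.

```python
import argparse


def parse_args(argv=None):
    """Read the process rank from the --local_rank command line argument."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    return parser.parse_args(argv)


def is_logging_process(local_rank):
    """Only the rank 0 process is allowed to log (hypothetical helper)."""
    return local_rank == 0


def train(args, init_run=None):
    """Run a toy loop, creating a W&B run only when the rank is 0.

    init_run is a hypothetical hook so the sketch can be exercised
    without a W&B account; it defaults to wandb.init.
    """
    run = None
    if is_logging_process(args.local_rank):
        if init_run is None:
            import wandb  # imported lazily so non-zero ranks never need it

            init_run = wandb.init
        run = init_run(project="ddp-example")  # project name is an assumption

    for step in range(3):
        loss = 1.0 / (step + 1)  # placeholder for a real training step
        if run is not None:
            run.log({"loss": loss})

    if run is not None:
        run.finish()
    return run
```

Calling `train(parse_args())` from a script launched with `--local_rank 1` returns without touching W&B, while the same call with `--local_rank 0` creates a single run that receives the logged values.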

```python
Contributor

I still think this might be better to use the Prism shortcode since you explicitly name the script here. You can grep around for some examples.

Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md Outdated
Comment thread content/guides/models/track/log/distributed-training.md
@ngrayluna ngrayluna merged commit 4ff4da8 into main Apr 18, 2025
4 checks passed
@ngrayluna ngrayluna deleted the multi_node_runs branch April 18, 2025 20:40
3 participants