Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(sdk): fix step logic when resuming runs with no metrics logged #6480

Merged
merged 11 commits into from
Oct 30, 2023

Conversation

luisbergua
Copy link
Contributor

@luisbergua luisbergua commented Oct 20, 2023

Description

What does the PR do?

When resuming a run, the step is incremented by 1 to avoid overwriting the last step. The problem is that, if the run didn't log any metrics, the step is still being increased by 1:

run = wandb.init(project='test_resume_step')
id = run.id
print('Final step:', run.step)
run.finish()

run = wandb.init(project='test_resume_step', resume='must', id=id)
print('Step after resuming:', run.step)
run.finish()

prints:

Final step: 0
Step after resuming: 1

This PR handles that specific case by checking if the summary contains any user-logged metrics and only increasing the step by 1 in that case. This is to avoid setting _step to 0 if only 1 step was logged as "_step" = n_steps - 1 so if n_steps=1 then _step will still be 0

馃 Generated by Copilot at c9b9b9a

Fix a bug in SendManager that caused incorrect step increment when resuming a run with only system metrics. Update the logic for setting the resume state step in sender.py to check for user-defined data in the summary.

Testing

How was this PR tested? Checking different cases before and after the fix:

run = wandb.init(project='test_resume_step')
id = run.id
print('Final step:', run.step)
run.finish()

run = wandb.init(project='test_resume_step', resume='must', id=id)
print('Step after resuming:', run.step)
run.finish()

Before:

Final step: 0
Step after resuming: 1

After:

Final step: 0
Step after resuming: 0

run = wandb.init(project='test_resume_step')
id = run.id
run.log({'loss': 1})
print('Final step:', run.step)
run.finish()

run = wandb.init(project='test_resume_step', resume='must', id=id)
print('Step after resuming:', run.step)
run.finish()

Before:

Final step: 1
Step after resuming: 1

After:

Final step: 1
Step after resuming: 1

run = wandb.init(project='test_resume_step')
id = run.id
run.log({'loss': 1})
run.log({'loss': 2})
print('Final step:', run.step)
run.finish()

run = wandb.init(project='test_resume_step', resume='must', id=id)
print('Step after resuming:', run.step)
run.finish()

Before:

Final step: 2
Step after resuming: 2

After:

Final step: 2
Step after resuming: 2

馃 Generated by Copilot at c9b9b9a

SendManager fixes
Resume state bug in metrics
Autumn leaves no trace

@luisbergua luisbergua changed the title Wb 15093 fix(sdk) fix step logic when resuming runs with no metrics logged Oct 20, 2023
@codecov
Copy link

codecov bot commented Oct 20, 2023

Codecov Report

Merging #6480 (c7ce691) into main (ed0922e) will decrease coverage by 0.07%.
The diff coverage is 100.00%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6480      +/-   ##
==========================================
- Coverage   77.03%   76.96%   -0.07%     
==========================================
  Files         389      389              
  Lines       45188    45190       +2     
==========================================
- Hits        34809    34780      -29     
- Misses      10326    10357      +31     
  Partials       53       53              
Flag Coverage 螖
unittest 81.24% <100.00%> (-0.08%) 猬囷笍

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage 螖
wandb/sdk/internal/sender.py 92.62% <100.00%> (+0.01%) 猬嗭笍

... and 17 files with indirect coverage changes

@luisbergua luisbergua changed the title fix(sdk) fix step logic when resuming runs with no metrics logged fix(sdk): fix step logic when resuming runs with no metrics logged Oct 20, 2023
@dmitryduev dmitryduev added this to the sdk-2023-11.1 milestone Oct 24, 2023
Copy link
Contributor

@kptkin kptkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work!

@luisbergua luisbergua merged commit dc7b036 into main Oct 30, 2023
79 checks passed
@luisbergua luisbergua deleted the WB-15093 branch October 30, 2023 18:10
@luisbergua luisbergua self-assigned this Oct 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants