Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(launch): Fix race condition in agent thread clean up #6352

Merged
merged 10 commits into from
Sep 25, 2023

Conversation

KyleGoyette
Copy link
Contributor

@KyleGoyette KyleGoyette commented Sep 22, 2023

Fixes

  • Fixes WB-NNNNN
  • Fixes #NNNN

Description

There was a race condition between the main agent loops update_finished function and the within thread, thread clean up. The agent could try to access a thread that was deleted in update_finished but this would lead to a KeyError and crash the whole agent.

The solution is to move all thread clean up out of the main agent loop, threads all clean up after themselves now.

🤖 Generated by Copilot at 2e337a1

Improved the launch agent logic and fixed a TypeError bug. The file wandb/sdk/launch/agent/agent.py was refactored to use a generator for thread ids and to simplify the job status checking and thread finishing.

Testing

How was this PR tested?

Checklist

  • Include reference to internal ticket "Fixes WB-NNNNN" and/or GitHub issue "Fixes #NNNN" (if applicable)
  • Ensure PR title compliance with the conventional commits standards

🤖 Generated by Copilot at 2e337a1

Sing, O Muse, of the skillful refactorer
Who tamed the agent logic with his code
And made a generator for thread ids
To spawn and join the workers as they strode.

@codecov
Copy link

codecov bot commented Sep 22, 2023

Codecov Report

Merging #6352 (df6fdcd) into main (34c41a5) will decrease coverage by 0.04%.
The diff coverage is 100.00%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6352      +/-   ##
==========================================
- Coverage   77.24%   77.21%   -0.04%     
==========================================
  Files         387      387              
  Lines       44441    44437       -4     
==========================================
- Hits        34329    34310      -19     
- Misses      10059    10074      +15     
  Partials       53       53              
Flag Coverage Δ
unittest 80.97% <100.00%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
wandb/sdk/launch/agent/agent.py 81.57% <100.00%> (+2.93%) ⬆️

... and 16 files with indirect coverage changes

@github-actions github-actions bot added cc-fix and removed cc-fix labels Sep 22, 2023
@github-actions github-actions bot added cc-fix and removed cc-fix labels Sep 22, 2023
@KyleGoyette KyleGoyette marked this pull request as ready for review September 25, 2023 16:39
@KyleGoyette KyleGoyette requested a review from a team as a code owner September 25, 2023 16:39
Copy link
Contributor

@bcsherma bcsherma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, one nit

And I think we might want to acquire the lock before we read the job here https://github.com/wandb/wandb/blob/4a71959e0c9d440b8fdf5567c8db9ee6ca3101ad/wandb/sdk/launch/agent/agent.py#L285C51-L285C51 just to be safe

wandb/sdk/launch/agent/agent.py Outdated Show resolved Hide resolved
@KyleGoyette KyleGoyette merged commit b1f079b into main Sep 25, 2023
78 checks passed
@KyleGoyette KyleGoyette deleted the launch/fix-race-condition-on-threads branch September 25, 2023 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants