Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix broken pipe when shutting down due to daemon threads #3031

Merged
merged 5 commits into from
Dec 11, 2021

Conversation

raubitsj
Copy link
Member

@raubitsj raubitsj commented Dec 11, 2021

Fixes WB-NNNN
Fixes #NNNN

Description

        # Message stop_status is called from a daemon thread started by wandb_run
        # The underlying socket might go away while the thread is still running.
        # Handle this like a timedout message as the daemon thread will eventually
        # be killed.

This can happen with this sequence of events with wandb service:

wandb.require("service")
wandb.init()
sys.exit()

Note there is no explicit finish call.
with wandb service this gets handled by shutting down the internal process in the atexit handler.
The atexit handler has no clean way to shutdown the daemon threads in the user process. In the above case
it could shut them down -- but we dont, so we keep the atexit as simple as possible. The cases where it would
not be possible to shutdown the threads is if the above wandb.init() and no finish was called from a child process.

Testing

Not really tested for this PR.. Will run in the regression and see if it fails again. This was seen at least a few times in the ~160 tests run in the regression.

Checklist

  • Name PR "[WB-NNNN][WB-MMMM] Add support for..." similar to entries in CHANGELOG.md
  • Include reference to internal ticket "Fixes WB-NNNN" (and github issue "Fixes #NNNN" if applicable)

@raubitsj raubitsj requested a review from kptkin December 11, 2021 03:34
@raubitsj raubitsj marked this pull request as ready for review December 11, 2021 03:34
@raubitsj raubitsj added the changelog-ignore Ignore PR from changelog label Dec 11, 2021
@codecov
Copy link

codecov bot commented Dec 11, 2021

Codecov Report

Merging #3031 (2bd68b5) into master (a7fb6f6) will decrease coverage by 0.00%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3031      +/-   ##
==========================================
- Coverage   78.85%   78.85%   -0.01%     
==========================================
  Files         209      209              
  Lines       27470    27482      +12     
==========================================
+ Hits        21661    21670       +9     
- Misses       5809     5812       +3     
Flag Coverage Δ
functest 56.89% <66.66%> (+0.03%) ⬆️
unittest 69.00% <0.00%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
wandb/sdk/interface/interface_sock.py 89.36% <66.66%> (-7.79%) ⬇️
wandb/sdk/internal/stats.py 67.22% <0.00%> (-0.56%) ⬇️
wandb/sdk/interface/interface_shared.py 84.51% <0.00%> (-0.33%) ⬇️
wandb/sdk/wandb_run.py 87.65% <0.00%> (-0.27%) ⬇️
wandb/sdk/lib/git.py 76.35% <0.00%> (ø)
wandb/sdk/internal/sender.py 91.66% <0.00%> (+0.27%) ⬆️
wandb/sdk/internal/meta.py 90.18% <0.00%> (+3.06%) ⬆️

Copy link
Contributor

@kptkin kptkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@raubitsj raubitsj merged commit fbee87b into master Dec 11, 2021
@raubitsj raubitsj deleted the fix-broken-pipe-shutdown branch December 11, 2021 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog-ignore Ignore PR from changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants