Event Logging for NCCL Async Error Handling Process Crash #47244

Closed
osalpekar wants to merge 1 commit into gh/osalpekar/100/base from gh/osalpekar/100/head

Conversation

osalpekar
Member

@osalpekar osalpekar commented Nov 2, 2020

Stack from ghstack:

This is an event-logging-based update that should allow us to collect high-quality data about how many times the NCCL Async Error Handling mechanism is triggered. It logs an event called `ProcessGroupNCCL.WorkNCCL.handleNCCLGuard`, which is recorded as an entry in the `scuba_caffe2_pytorch_usage_stats` Scuba table. Each entry also contains metadata such as workflow status, entitlement, hostname, and workflow name, which gives us insight into which workloads/domains and machines are benefiting from async error handling. It also contains the Flow Run ID, which can be used as a join key with the `fblearner_workflow_run_status` Scuba table for additional information such as the final error message. We can then quantify how many times the async handling code was triggered simply by querying the `scuba_caffe2_pytorch_usage_stats` table.

As a demonstration, I ran the following workflow with this diff patched: f229675892
Since that workflow causes a desync, the `handleNCCLGuard` event is logged to Scuba shortly after the crash. See here for the filtered table: https://www.fburl.com/scuba/scuba_caffe2_pytorch_usage_stats/tmp1uvio

As shown there, the table contains 4 entries. The workflow uses 3 GPUs, 2 of which run into the desync scenario and are crashed by async error handling. The workflow is made to fail twice before succeeding on the 3rd attempt, hence the 4 entries (2 crashed ranks × 2 failed attempts).

Differential Revision: [D24688739](https://our.internmc.facebook.com/intern/diff/D24688739/)

[ghstack-poisoned]
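
For context on where such an event is emitted, here is a minimal C++ sketch, assuming the `C10_LOG_API_USAGE_ONCE` macro from `c10/util/Logging.h` is the logging entry point. Only the event string comes from this PR; the function name and body below are hypothetical stand-ins rather than the exact upstream diff.

```cpp
// Illustrative sketch only: everything except the event name is an assumption.
// A one-time API-usage event is emitted from the async-error-handling path;
// whatever usage-logging backend is registered (internally, the
// scuba_caffe2_pytorch_usage_stats table) receives the event name.
#include <exception>
#include <stdexcept>

#include <c10/util/Logging.h>

// Hypothetical stand-in for the guard that rethrows a stored async NCCL error.
void handleAsyncNcclError(const std::exception_ptr& storedException) {
  if (storedException) {
    // Record that the async-error-handling path fired before the exception
    // propagates and takes the process down.
    C10_LOG_API_USAGE_ONCE("ProcessGroupNCCL.WorkNCCL.handleNCCLGuard");
    std::rethrow_exception(storedException);
  }
}
```

Assuming the once-per-process semantics of the macro, each crashed rank contributes a single table entry, which is why the entry count maps directly to the number of ranks that hit the handler.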
@facebook-github-bot facebook-github-bot added the cla signed and oncall: distributed (Add this issue/PR to distributed oncall triage queue) labels on Nov 2, 2020
osalpekar added a commit that referenced this pull request Nov 2, 2020

ghstack-source-id: 115708632
Pull Request resolved: #47244
@dr-ci

dr-ci bot commented Nov 3, 2020

💊 CI failures summary and remediations

As of commit 6885538 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed



@codecov

codecov bot commented Nov 3, 2020

Codecov Report

Merging #47244 into gh/osalpekar/100/base will not change the reported coverage (+0.00%).
The diff coverage is n/a.

@@                  Coverage Diff                   @@
##           gh/osalpekar/100/base   #47244   +/-   ##
======================================================
  Coverage                  60.81%   60.81%           
======================================================
  Files                       2748     2748           
  Lines                     254027   254027           
======================================================
+ Hits                      154490   154495    +5     
+ Misses                     99537    99532    -5     

@facebook-github-bot
Contributor

This pull request has been merged in 8b13ab9.

@facebook-github-bot facebook-github-bot deleted the gh/osalpekar/100/head branch November 7, 2020 15:17