Event Logging for NCCL Async Error Handling Process Crash #47244

Closed
osalpekar wants to merge 1 commit into gh/osalpekar/100/base from gh/osalpekar/100/head

Conversation

osalpekar
Member

@osalpekar osalpekar commented Nov 2, 2020

Stack from ghstack:

This is an event-logging-based update that should allow us to collect high-quality data about how many times the NCCL Async Error Handling mechanism is triggered. It logs an event called `ProcessGroupNCCL.WorkNCCL.handleNCCLGuard`, which is recorded as an entry in the `scuba_caffe2_pytorch_usage_stats` Scuba table. Each entry also contains metadata such as workflow status, entitlement, hostname, and workflow name, which gives us insight into which workloads/domains and machines are benefiting from async error handling. It also contains the Flow Run ID, which can be used as a join key with the `fblearner_workflow_run_status` Scuba table for additional information such as the final error message. We can then quantify how many times the async handling code was triggered simply by querying the `scuba_caffe2_pytorch_usage_stats` table.

As a demonstration, I ran the following workflow with this diff patched: f229675892
Since that workflow causes a desync, the `handleNCCLGuard` event is logged to Scuba shortly after the crash. See here for the filtered table: https://www.fburl.com/scuba/scuba_caffe2_pytorch_usage_stats/tmp1uvio

As shown there, the table contains 4 entries. The workflow uses 3 GPUs, 2 of which run into the desync scenario and are crashed by async error handling. The workflow is made to fail twice before succeeding on the 3rd attempt, hence the 4 entries (2 crashed ranks × 2 failed attempts).

Differential Revision: [D24688739](https://our.internmc.facebook.com/intern/diff/D24688739/)

[ghstack-poisoned]
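
For context on where such an event is emitted, here is a minimal C++ sketch, assuming the `C10_LOG_API_USAGE_ONCE` macro from `c10/util/Logging.h` is the logging entry point. Only the event string comes from this PR; the function name and body below are hypothetical stand-ins rather than the exact upstream diff.

```cpp
// Illustrative sketch only: everything except the event name is an assumption.
// A one-time API-usage event is emitted from the async-error-handling path;
// whatever usage-logging backend is registered (internally, the
// scuba_caffe2_pytorch_usage_stats table) receives the event name.
#include <exception>
#include <stdexcept>

#include <c10/util/Logging.h>

// Hypothetical stand-in for the guard that rethrows a stored async NCCL error.
void handleAsyncNcclError(const std::exception_ptr& storedException) {
  if (storedException) {
    // Record that the async-error-handling path fired before the exception
    // propagates and takes the process down.
    C10_LOG_API_USAGE_ONCE("ProcessGroupNCCL.WorkNCCL.handleNCCLGuard");
    std::rethrow_exception(storedException);
  }
}
```

Assuming the once-per-process semantics of the macro, each crashed rank contributes a single table entry, which is why the entry count maps directly to the number of ranks that hit the handler.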
@facebook-github-bot facebook-github-bot added the cla signed and oncall: distributed (Add this issue/PR to distributed oncall triage queue) labels on Nov 2, 2020
osalpekar added a commit that referenced this pull request Nov 2, 2020

ghstack-source-id: 115708632
Pull Request resolved: #47244
@dr-ci

dr-ci bot commented Nov 3, 2020

💊 CI failures summary and remediations

As of commit 6885538 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed



@codecov

codecov bot commented Nov 3, 2020

Codecov Report

Merging #47244 into gh/osalpekar/100/base will not change the reported coverage (+0.00%).
The diff coverage is n/a.

@@                  Coverage Diff                   @@
##           gh/osalpekar/100/base   #47244   +/-   ##
======================================================
  Coverage                  60.81%   60.81%           
======================================================
  Files                       2748     2748           
  Lines                     254027   254027           
======================================================
+ Hits                      154490   154495    +5     
+ Misses                     99537    99532    -5     

@facebook-github-bot
Contributor

This pull request has been merged in 8b13ab9.

@facebook-github-bot facebook-github-bot deleted the gh/osalpekar/100/head branch November 7, 2020 15:17