osalpekar
Member

Summary:
Fixes: #27643

This PR notifies workers when a failure occurs during distributed autograd. It propagates the error to all nodes participating in the backward pass and sets the state of each local autograd engine accordingly.

Test Plan: Added two new tests that check error handling when an error is thrown on an intermediate node during distributed autograd. Ensured that all existing distributed autograd tests still pass.

Differential Revision: D20164420
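
As a rough illustration of the propagation described in the summary, here is a toy pure-Python model. It is not the code in this PR. `Worker`, `known_peers`, and `propagate_failure` are hypothetical names, and the sketch assumes that each worker records which peers it has exchanged messages with for a given autograd context and forwards a failure notification to them at most once.

```python
from collections import deque


class Worker:
    """Hypothetical stand-in for one node's per-context autograd state."""

    def __init__(self, name):
        self.name = name
        self.known_peers = {}  # context_id -> set of peer names (assumed bookkeeping)
        self.errors = {}       # context_id -> error message recorded on this node


def propagate_failure(workers, origin, context_id, error):
    """Mark context_id as failed on every worker reachable from origin."""
    queue = deque([origin])
    while queue:
        worker = workers[queue.popleft()]
        if context_id in worker.errors:
            continue  # already notified, so do not forward again
        worker.errors[context_id] = error
        queue.extend(worker.known_peers.get(context_id, ()))
    return [name for name, w in workers.items() if context_id in w.errors]


# worker1 is the intermediate node where the simulated error is raised.
workers = {name: Worker(name) for name in ("worker0", "worker1", "worker2")}
workers["worker0"].known_peers[1] = {"worker1"}
workers["worker1"].known_peers[1] = {"worker0", "worker2"}
workers["worker2"].known_peers[1] = {"worker1"}

notified = propagate_failure(workers, "worker1", 1, "simulated backward error")
```

The "already notified" check is what keeps two workers that know about each other from sending failure messages back and forth indefinitely.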

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D20164420

@dr-ci

dr-ci bot commented Mar 12, 2020

💊 CircleCI build failures summary and remediations

As of commit 94cb64f (more details on the Dr. CI page):


  • 1/2 failures introduced in this PR

  • 1/2 broken upstream at merge base 064f628 on Mar 18 from 2:35pm to 7:57pm (10 commits; d927d58 - c747f09)

    Please rebase on the viable/strict branch.

    Since your merge base is older than viable/strict, run these commands:

    git fetch https://github.com/pytorch/pytorch viable/strict
    git rebase FETCH_HEAD


🚧 1 upstream failure:

These were probably caused by upstream breakages:


This comment was automatically generated by Dr. CI.

This comment has been revised 37 times.

Contributor

@pritamdamania87 pritamdamania87 left a comment

Awesome work!

Contributor

Can we add the number of DIST_AUTOGRAD_FAILURE_REQ messages received per context id to debug info and then verify here that each node gets one of these messages for all the other context ids? Can add this as a separate PR.
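
One way the suggested counter could look is sketched below in plain Python. This is a sketch only: `FailureDebugInfo`, `record_failure_req`, and the `failure_reqs.ctx_<id>` key format are hypothetical and do not come from the PR or from any PyTorch API. The idea is to count DIST_AUTOGRAD_FAILURE_REQ messages per context id and expose the counts as string key/value pairs that a test could assert on.

```python
from collections import Counter


class FailureDebugInfo:
    """Hypothetical per-node counter of failure requests, keyed by context id."""

    def __init__(self):
        self._counts = Counter()

    def record_failure_req(self, context_id):
        # Would be called once per received DIST_AUTOGRAD_FAILURE_REQ message.
        self._counts[context_id] += 1

    def as_debug_info(self):
        # Flatten to string keys and values; the key format is an assumption.
        return {
            f"failure_reqs.ctx_{cid}": str(count)
            for cid, count in sorted(self._counts.items())
        }


info = FailureDebugInfo()
info.record_failure_req(7)
info.record_failure_req(9)
debug_info = info.as_debug_info()
```

A test on each node could then check that every other context id appears exactly once in `debug_info`.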

…ilure during Distributed Autograd (pytorch#34638)

Summary:
Pull Request resolved: pytorch#34638

Fixes: pytorch#27643

This PR notifies workers when a failure occurs during distributed autograd. It propagates the error to all nodes participating in the backward pass and sets the state of each local autograd engine accordingly.

Test Plan: Added two new tests that check error handling when an error is thrown on an intermediate node during distributed autograd. Ensured that all existing distributed autograd tests still pass.

Differential Revision: D20164420

fbshipit-source-id: 5aada5544ed12cd7e24053ba3b93f8b9b38ba021

@facebook-github-bot
Contributor

This pull request has been merged in 5f67c92.

@ezyang
Contributor

ezyang commented Mar 19, 2020

this broke lint

  {
    path: 'test/test_jit.py',
    start_line: 8424,
    end_line: 8424,
    start_column: 33,
    end_column: 33,
    annotation_level: 'failure',
    message: '[E999] SyntaxError: invalid syntax'
  },
  {
    path: 'torch/testing/_internal/distributed/rpc/dist_autograd_test.py',
    start_line: 1,
    end_line: 1,
    start_column: 1,
    end_column: 1,
    annotation_level: 'failure',
    message: "[F401] 'sys' imported but unused"
  }

Development

Successfully merging this pull request may close these issues.

Notify workers of failure of distributed backward pass

5 participants