@mingzhe09088 mingzhe09088 commented Nov 11, 2020

Stack from ghstack:

The NCCL p2p tests previously hung because of unexpected context switches. For example, process 1, which is supposed to use only GPU1, could end up using GPU0 because its device was never set explicitly.
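
To make the failure mode concrete, here is a minimal sketch of pinning each process to its own GPU before any NCCL point-to-point call. It is not the code changed in this PR: `device_index_for_rank` and `run_p2p` are hypothetical names, the one-GPU-per-rank mapping is an assumption, and the rendezvous address and port are placeholder values.

```python
import os


def device_index_for_rank(rank: int, gpus_per_node: int) -> int:
    """Map a process rank to the one GPU it should own (hypothetical helper)."""
    if gpus_per_node <= 0:
        raise ValueError("at least one GPU is required")
    return rank % gpus_per_node


def run_p2p(rank: int, world_size: int) -> None:
    """Send a tensor from rank 0 to rank 1 over NCCL (illustrative only)."""
    # Imported lazily so the mapping helper above works without a CUDA build.
    import torch
    import torch.distributed as dist

    # Placeholder rendezvous settings for a single-node run (assumption).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    device_index = device_index_for_rank(rank, torch.cuda.device_count())
    # Set the device explicitly BEFORE any CUDA or NCCL work. Without this,
    # every process defaults to GPU0, which is the situation described above.
    torch.cuda.set_device(device_index)

    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    tensor = torch.full((4,), float(rank), device=f"cuda:{device_index}")
    if rank == 0:
        dist.send(tensor, dst=1)
    elif rank == 1:
        dist.recv(tensor, src=0)
    torch.cuda.synchronize()
    dist.destroy_process_group()
```

On a machine with at least two GPUs, this could be launched with `torch.multiprocessing.spawn(run_p2p, args=(2,), nprocs=2)`, which passes the process index as `rank`.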

Differential Revision: [D24863808](https://our.internmc.facebook.com/intern/diff/D24863808/)

[ghstack-poisoned]
@facebook-github-bot added the "cla signed" and "oncall: distributed" labels on Nov 11, 2020
mingzhe09088 pushed a commit that referenced this pull request Nov 11, 2020
The NCCL p2p tests previously hung because of unexpected context switches. For example, process 1, which is supposed to use only GPU1, could end up using GPU0 because its device was never set explicitly.

Differential Revision: [D24863808](https://our.internmc.facebook.com/intern/diff/D24863808/)

ghstack-source-id: 116461969
Pull Request resolved: #47797

dr-ci bot commented Nov 12, 2020

💊 CI failures summary and remediations

As of commit 8b00330 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed


codecov bot commented Nov 12, 2020

Codecov Report

Merging #47797 (8b00330) into gh/mingzhe09088/15/base (fcd44ce) will increase coverage by 0.00%.
The diff coverage is 0.00%.

@@                   Coverage Diff                    @@
##           gh/mingzhe09088/15/base   #47797   +/-   ##
========================================================
  Coverage                    64.58%   64.58%           
========================================================
  Files                         1680     1680           
  Lines                       166945   166946    +1     
========================================================
+ Hits                        107815   107822    +7     
+ Misses                       59130    59124    -6     

@facebook-github-bot

This pull request has been merged in 66f9b1d.

@facebook-github-bot facebook-github-bot deleted the gh/mingzhe09088/15/head branch November 16, 2020 15:17
tugsbayasgalan pushed a commit to tugsbayasgalan/pytorch that referenced this pull request Nov 16, 2020
Summary:
Pull Request resolved: pytorch#47797

The NCCL p2p tests previously hung because of unexpected context switches. For example, process 1, which is supposed to use only GPU1, could end up using GPU0 because its device was never set explicitly.
ghstack-source-id: 116461969

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D24863808

fbshipit-source-id: 92bd3a4874be8334210c7c8ee6363648893c963e