Skip to content

Conversation

robieta
Copy link

@robieta robieta commented Dec 18, 2020

Summary:
A bunch of these tests are marked flaky, and have been since time immemorial. (Read: as far back as Buck will build.) However closer inspection reveals that they fail if and only if run on a GPU worker. What seems to be going on is that there are more jobs than GPUs, so the contention causes waits which registers as timeouts on the test.

This diff is kind of hacky, but it basically just drops deadlines if a GPU is present. Because Caffe2 is going away I'm not too terribly concerned about a beautiful solution, but we may as well keep some test coverage if it's easy.

CC Sebastian, Ilia, Min, and Hongzheng who also have tasks for what seems to be the same flakiness.

Test Plan: Turn the tests back on and see if they fall over. (The failure repros reliably on an OnDemand GPU and is fixed by this change, so it's not really just a hail Mary.)

Reviewed By: ngimel

Differential Revision: D25632981

Summary:
A bunch of these tests are marked flaky, and have been since time immemorial. (Read: as far back as Buck will build.) However closer inspection reveals that they fail if and only if run on a GPU worker. What seems to be going on is that there are more jobs than GPUs, so the contention causes waits which registers as timeouts on the test.

This diff is kind of hacky, but it basically just drops deadlines if a GPU is present. Because Caffe2 is going away I'm not too terribly concerned about a beautiful solution, but we may as well keep some test coverage if it's easy.

CC Sebastian, Ilia, Min, and Hongzheng who also have tasks for what seems to be the same flakiness.

Test Plan: Turn the tests back on and see if they fall over. (The failure repros reliably on an OnDemand GPU and is fixed by this change, so it's not really just a hail Mary.)

Reviewed By: ngimel

Differential Revision: D25632981

fbshipit-source-id: 2d815cdf0f2c4556087be4e25e326af66dff6c3e
@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Dec 18, 2020

💊 CI failures summary and remediations

As of commit 5bac067 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

This comment has been revised 3 times.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D25632981

@codecov
Copy link

codecov bot commented Dec 18, 2020

Codecov Report

Merging #49591 (5bac067) into master (ed0489c) will decrease coverage by 0.14%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #49591      +/-   ##
==========================================
- Coverage   80.61%   80.46%   -0.15%     
==========================================
  Files        1879     1879              
  Lines      203684   203684              
==========================================
- Hits       164207   163902     -305     
- Misses      39477    39782     +305     

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in faf6032.

hwangdeyu pushed a commit to hwangdeyu/pytorch that referenced this pull request Jan 6, 2021
…orch#49591)

Summary:
Pull Request resolved: pytorch#49591

A bunch of these tests are marked flaky, and have been since time immemorial. (Read: as far back as Buck will build.) However closer inspection reveals that they fail if and only if run on a GPU worker. What seems to be going on is that there are more jobs than GPUs, so the contention causes waits which registers as timeouts on the test.

This diff is kind of hacky, but it basically just drops deadlines if a GPU is present. Because Caffe2 is going away I'm not too terribly concerned about a beautiful solution, but we may as well keep some test coverage if it's easy.

CC Sebastian, Ilia, Min, and Hongzheng who also have tasks for what seems to be the same flakiness.

Test Plan: Turn the tests back on and see if they fall over. (The failure repros reliably on an OnDemand GPU and is fixed by this change, so it's not really just a hail Mary.)

Reviewed By: ngimel

Differential Revision: D25632981

fbshipit-source-id: 43dcce416fea916ba91f891e9e5b59b2c11cca1a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants