
[release][no_ci] Mark tf mnist gpu benchmark test as unstable. #30475

Merged
amogkam merged 1 commit into ray-project:master from xwjiang2010:skip_tf_mnist_gpu_weekly
Nov 18, 2022


@xwjiang2010 xwjiang2010 commented Nov 18, 2022

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

Why are these changes needed?

Mark the weekly tf mnist gpu benchmark test as unstable, for the following reasons:

  1. The nightly counterpart is already marked unstable.
  2. Ever since it was written, the test has timed out about 20% of the time while acquiring the vanilla tf time. This seems to be an issue native to distributed tf, so we have no reliable way to acquire the vanilla tf time, which makes the test flaky.
  3. Since the cuda version update, both the air timing and the vanilla tf timing have increased greatly, from 400+ to 800+ for 200 epochs. I don't yet have a good understanding of why, but it makes the whole test even more likely to time out.
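To put the 20% figure from item 2 in perspective, here is a minimal sketch of how quickly a per-run timeout rate compounds over repeated runs. It assumes each run times out independently with the same probability, which is a simplification and not something measured in this PR; the function name `prob_at_least_one_timeout` is hypothetical and not part of the Ray release tooling.

```python
def prob_at_least_one_timeout(p_timeout: float, runs: int) -> float:
    """Chance that at least one of `runs` runs times out.

    Assumes every run times out independently with probability
    `p_timeout` (an illustrative assumption, not a measured property).
    """
    if not 0.0 <= p_timeout <= 1.0:
        raise ValueError("p_timeout must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    # Complement of "every run finishes in time".
    return 1.0 - (1.0 - p_timeout) ** runs


if __name__ == "__main__":
    # 0.2 is the per-run timeout rate quoted in item 2.
    for runs in (1, 4, 8):
        print(runs, round(prob_at_least_one_timeout(0.2, runs), 3))
```

Under that independence assumption, four consecutive weekly runs would have roughly a 59% chance of containing at least one timeout, which is consistent with the test giving little usable signal.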

Overall, the test is not providing much signal as to how AIR is doing compared to vanilla tf.

See more details in #29922.

Related issue number

#29922

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
@amogkam amogkam merged commit 4e0d1b4 into ray-project:master Nov 18, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…roject#30475)
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@xwjiang2010 xwjiang2010 deleted the skip_tf_mnist_gpu_weekly branch July 26, 2023 19:54