Fix: All pods is existing when restart count exceed max retry #1719
Welcome @LuBingtan!
/assign @shinytang6
2e98a97 to 91a26be
/cc @wpeng102
I think this issue should be discussed in the community. I'll bring it up in the weekly meeting.
When I tried to reproduce the phenomenon mentioned in #1736, I found that the MPI master had completed but the workers were still running when maxRetry is 1.
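To make the reported scenario concrete, here is a minimal sketch of the cleanup behavior the PR title implies: once the retry budget is used up, any pod that is still running should be removed. This is not Volcano's code. The `Pod` and `Job` classes, the `pods_to_clean_up` helper, and the `>=` comparison against `max_retry` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pod:
    name: str
    phase: str  # hypothetical phases: "Running", "Succeeded", "Failed"


@dataclass
class Job:
    max_retry: int
    retry_count: int = 0
    pods: List[Pod] = field(default_factory=list)


def pods_to_clean_up(job: Job) -> List[Pod]:
    """Return the pods that should be removed once retries are exhausted.

    Assumption: the job is considered exhausted when retry_count >= max_retry.
    """
    if job.retry_count < job.max_retry:
        # Retries remain, so nothing needs to be cleaned up yet.
        return []
    # Retries are exhausted: every pod that is still running is left over.
    return [pod for pod in job.pods if pod.phase == "Running"]


# Scenario from the comment above: master completed, workers still running.
job = Job(
    max_retry=1,
    retry_count=1,
    pods=[
        Pod("mpi-master", "Succeeded"),
        Pod("mpi-worker-0", "Running"),
        Pod("mpi-worker-1", "Running"),
    ],
)
leftover = [pod.name for pod in pods_to_clean_up(job)]
```

In this sketch `leftover` contains the two worker pods, which is the state described in the comment: the job has used up its retries, yet the workers are still alive.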
@LuBingtan Thanks for your PR. Is there an e2e case that covers this scenario? If not, please add one, or add your test results, to make sure it works well with the case in PR 1657.
/cc @wpeng102 Please take a look.
OK, I will add an e2e test case.
Can someone explain why this has not been merged?
Waiting for more test results.
a2865ec to 2bea7b9
2bea7b9 to 1c87728
…e2e test for job retry Signed-off-by: lubingtan <lubingtan@126.com>
1c87728 to 083f2f9
@jasonliu747 @william-wang @Thor-wl Hi, I have updated this PR with an e2e test for job retry.
/lgtm
@jasonliu747 @william-wang @Thor-wl
I'm OK with the current PR.
/lgtm
@shinytang6 @william-wang
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: william-wang
Fix: #1718