Task executors that support specific roles are restarted when they fail #620

Open
zuston opened this issue Nov 25, 2021 · 2 comments

zuston (Member) commented Nov 25, 2021

Why

TonY now supports a sidecar TensorBoard, but it sometimes fails because of hardware problems or an unstable HDFS. For users, it would be better if it were restarted automatically, without them having to notice or intervene.

So we need to introduce a general mechanism that restarts failed task executors for specific roles.

zuston (Member, Author) commented Nov 25, 2021

@oliverhu Please check it.

Furthermore, elastic training may also need this feature.

oliverhu (Member) commented Feb 8, 2022

Sounds good to me

zuston added the enhancement label Mar 12, 2022
zuston self-assigned this May 5, 2022