SM_CURRENT_HOST changes for spot instance training  #1260

Closed
@sivakhno

Description

What did you find confusing? Please describe.
We are trying to set up XGBoost training across multiple models that runs in a distributed manner on spot instances.
We found this AWS tutorial that explains the approach:
https://aws.amazon.com/blogs/machine-learning/running-distributed-tensorflow-training-with-amazon-sagemaker/

I understand that we can use SM_CURRENT_HOST, which is set by default to algo-1, algo-2, algo-3,
so we can map each SM_CURRENT_HOST value to a subset of the data.
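
To illustrate the mapping described above, here is a minimal sketch of how a training script might pick its data subset from the host name. It assumes the container also exposes `SM_HOSTS` as a JSON-encoded list of all host names, alongside `SM_CURRENT_HOST`. The helper names `shard_for_host` and `shard_from_env` and the round-robin split are hypothetical, chosen only for illustration, and are not taken from the tutorial.

```python
import json
import os


def shard_for_host(files, current_host, hosts):
    """Return the files assigned to current_host (hypothetical helper).

    Hosts and files are sorted so that every host computes the same
    assignment independently; files are then dealt out round-robin.
    """
    ordered_hosts = sorted(hosts)
    rank = ordered_hosts.index(current_host)
    return [
        name
        for index, name in enumerate(sorted(files))
        if index % len(ordered_hosts) == rank
    ]


def shard_from_env(files, environ=os.environ):
    """Read the host names from the environment and shard accordingly.

    Assumption: SM_CURRENT_HOST holds this container's name and SM_HOSTS
    holds a JSON list of all container names. The fallbacks cover a
    single-host local run.
    """
    current_host = environ.get("SM_CURRENT_HOST", "algo-1")
    hosts = json.loads(environ.get("SM_HOSTS", '["algo-1"]'))
    return shard_for_host(files, current_host, hosts)
```

With this scheme, the subset a container trains on depends only on the name it is given, which is why the naming of a replacement spot instance matters for the question below.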

Question:
What happens to SM_CURRENT_HOST when a spot instance is suddenly killed and a new one is created in its place? Does the replacement retain the old SM_CURRENT_HOST name, or, if not, what algorithm assigns SM_CURRENT_HOST to the new spot instance?

Describe how documentation can be improved
A brief description of how SM_CURRENT_HOST is set when a spot instance is reclaimed and a new one is created would be helpful.
