
[k8s] handle network error from pulling docker image #2551

Merged

Conversation

@landscapepainter landscapepainter commented Sep 13, 2023

This resolves #2523

Some users have network restrictions that prevent them from pulling the docker image from the registry while creating the node (pod). When this happened, sky launch stalled silently instead of failing early. This PR resolves the issue by failing early and releasing the scheduled pod.

To reproduce the image pull failure caused by a networking error, it was necessary to block the registry domain, us-central1-docker.pkg.dev. This can be done by adding

127.0.0.1    us-central1-docker.pkg.dev

to /etc/hosts.
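
For example, the entry can be appended with a one-liner like the following (a sketch assuming sudo access on the machine; remove the line afterwards to restore access to the registry):

    # Resolve the Artifact Registry domain to localhost so that image pulls fail.
    echo '127.0.0.1    us-central1-docker.pkg.dev' | sudo tee -a /etc/hosts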

To reproduce this with a kind setup, we would have to add

127.0.0.1    us-central1-docker.pkg.dev

in /etc/hosts inside the node container rather than directly on the machine running kind. The node container shows up as skypilot-control-plane in docker ps, and an interactive session can be opened with docker exec -it skypilot-control-plane /bin/bash. This case needs to be handled differently because kind simulates a node with a docker container (in our case, skypilot-control-plane), and kind's kubelet shares its network settings with that node container rather than with the machine running kind.
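
Put together, a rough sequence for the kind case looks like this (a sketch; confirm the node container name with docker ps first):

    # Confirm the name of the kind node container.
    docker ps --filter name=skypilot-control-plane
    # Block the registry domain inside the node container, since kind's kubelet
    # uses the container's network settings rather than the host's.
    docker exec skypilot-control-plane sh -c \
        'echo "127.0.0.1    us-central1-docker.pkg.dev" >> /etc/hosts'
    # Alternatively, open an interactive session and edit /etc/hosts by hand.
    docker exec -it skypilot-control-plane /bin/bash

After this, sky launch on the kind cluster should hit the image pull failure and, with this PR, fail early instead of stalling.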

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manual testing on kind/GKE by blocking the registry domain and verifying that the launch fails over early
  • pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials"

@landscapepainter landscapepainter marked this pull request as draft September 13, 2023 06:19
@landscapepainter landscapepainter marked this pull request as ready for review September 16, 2023 01:32
@landscapepainter landscapepainter marked this pull request as draft September 19, 2023 08:00
@landscapepainter landscapepainter marked this pull request as ready for review September 21, 2023 03:46

landscapepainter commented Sep 21, 2023

@romilbhardwaj This is now ready for a look. Wondering if we should keep if 'rpc error: code = Unknown' in container_status.state.waiting.message: or not, since checking if container_status.state.waiting.reason == 'ErrImagePull': could be enough and allows more general handling.

Update: Depending on whether or not there's a slight delay, waiting.reason can be 'ImagePullBackOff' as well. Removing if 'rpc error: code = Unknown' in container_status.state.waiting.message: and conditioning on waiting.reason == 'ErrImagePull' or waiting.reason == 'ImagePullBackOff' to raise the network error message.

@landscapepainter
Collaborator Author

@romilbhardwaj Did some refactoring on create_node for waiting for pods to schedule, waiting for pods to run, and injecting env vars into pods. This is ready for a look now.

@romilbhardwaj romilbhardwaj left a comment

Thanks @landscapepainter! Left some comments.

sky/utils/kubernetes/create_cluster.sh Outdated
sky/skylet/providers/kubernetes/node_provider.py Outdated
sky/skylet/providers/kubernetes/node_provider.py Outdated
sky/skylet/providers/kubernetes/node_provider.py Outdated
Comment on lines 309 to 315
if waiting and (waiting.reason == 'ErrImagePull' or
waiting.reason == 'ImagePullBackOff'):
raise config.KubernetesError(
'Failed to pull docker image while '
'launching the node. Please check '
'your network connection. Error details: '
f'{container_status.state.waiting.message}.')
Collaborator

Shouldn't this check and raise be moved to after L265, where we already have similar checks in place? It seems to make more sense to have all waiting-related errors handled in one place, and this method should be relegated to simply being a wait loop.

Collaborator Author

Thanks for catching this. I'm now wondering whether the following check from _wait_for_pods_to_schedule should actually run after the pods are scheduled, and so doesn't need to be part of _wait_for_pods_to_schedule at all. waiting.reason can be set to 'ContainerCreating' only after the pods are scheduled, so checking whether the pods reached the ContainerCreating state should be placed in _wait_for_pods_to_run. That would let us replace the original waiting check in _wait_for_pods_to_run with the waiting check from _wait_for_pods_to_schedule.

                    for container_status in pod.status.container_statuses:
                        # If the container wasn't in 'ContainerCreating'
                        # state, then we know pod wasn't scheduled or
                        # had some other error, such as image pull error.
                        # See list of possible reasons for waiting here:
                        # https://stackoverflow.com/a/57886025
                        waiting = container_status.state.waiting
                        if waiting is not None and waiting.reason != 'ContainerCreating':
                            all_pods_scheduled = False
                            break

I updated

                        if waiting and (waiting.reason == 'ErrImagePull' or
                                        waiting.reason == 'ImagePullBackOff'):

from _wait_for_pods_to_run with

                        if waiting is not None and waiting.reason != 'ContainerCreating':

so that post-schedule errors can be handled from _wait_for_pods_to_run. Please correct me if I'm missing anything! Tested with a network error (post-schedule error) and an excessive resource request error (pre-schedule error), and both failed over correctly.

sky/skylet/providers/kubernetes/node_provider.py Outdated

@romilbhardwaj romilbhardwaj left a comment

Thanks @landscapepainter! Left some comments.

@landscapepainter landscapepainter left a comment

@romilbhardwaj This is ready for another look!

@romilbhardwaj romilbhardwaj left a comment

Thanks @landscapepainter - this is good to go!

sky/skylet/providers/kubernetes/node_provider.py Outdated
@landscapepainter landscapepainter merged commit 7b1bf0b into skypilot-org:master Nov 17, 2023
19 checks passed

Successfully merging this pull request may close these issues.

Cluster gets stuck in INIT state when pulling docker image fails