Bootstrap fails if trying to get the Salt Master container while two exist #2434

Closed
gdemonet opened this issue Apr 20, 2020 · 0 comments
Labels
complexity:easy (Something that requires less than a day to fix)
kind:bug (Something isn't working)
topic:deployment (Bugs in or enhancements to deployment stages)
topic:flakiness (Some tests are flaky and cause transient CI failures)

Comments

@gdemonet
Contributor

Component: salt, scripts

What happened:

During execution of bootstrap.sh (observed at least on 2.6):

> Syncing Utility modules on Salt master...
time="2020-04-20T07:14:03-04:00" level=fatal msg="execing command in container failed: rpc error: code = Unknown desc = failed to find container \"112e206b7a22bbfd7eb49ccad329db2558622326a7961e28f91d9530131b6175\\n173aa24cfd67ce8ab440becf5f8fe753b027f02c48230a3888699e761bed3227\" in store: does not exist

What was expected: The bootstrap script should not fail when two Salt master containers briefly coexist.

Steps to reproduce: It's a timing issue, and only rarely happens.

Resolution proposal (optional):

In the get_salt_container function, defined in scripts/common.sh, wait for the crictl ps query to only return a single container (if two are Running, we can't know for sure which one we should use, so better wait than try to guess).
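
The proposal above could be sketched roughly as follows. This is a minimal illustration, not the actual `get_salt_container` code from `scripts/common.sh`: the helper names `list_salt_containers` and `get_single_salt_container`, the label selector, and the retry count and delay defaults are all assumptions made for the example.

```shell
#!/bin/bash
# Hypothetical sketch: function names, the label selector, and the retry
# defaults are assumptions, not taken from scripts/common.sh.

# Print the IDs of running Salt master containers, one per line.
list_salt_containers() {
    crictl ps -q \
        --label io.kubernetes.container.name=salt-master \
        --state Running
}

# Print the container ID once exactly one container matches.
# Returns 1 if that never happens within max_retries attempts.
get_single_salt_container() {
    local -r max_retries=${1:-10} delay=${2:-3}
    local ids='' count=0 attempt

    for (( attempt = 1; attempt <= max_retries; attempt++ )); do
        ids=$(list_salt_containers 2>/dev/null) || ids=''
        # Each non-empty line is one container ID.
        count=$(grep -c . <<< "$ids" || true)
        if [ "$count" -eq 1 ]; then
            echo "$ids"
            return 0
        fi
        # Zero or several matches: wait rather than guess.
        sleep "$delay"
    done

    echo "Expected exactly one Salt master container, found $count" >&2
    return 1
}
```

A caller would then capture the result with something like `salt_container=$(get_single_salt_container) || exit 1`, so that a newline-joined pair of IDs is never passed to a later `crictl exec`.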

gdemonet added the kind:bug, topic:deployment, topic:flakiness, and complexity:easy labels Apr 20, 2020
gdemonet added this to the MetalK8s 2.5.1 milestone Apr 20, 2020
gdemonet added this to To Do in Flakiness Investigations via automation Apr 20, 2020
gdemonet added a commit that referenced this issue Apr 20, 2020
Sometimes, if kubelet restarted the `salt-master` static Pod after an
operation, two containers matching the usual selector will co-exist for
a small time window.
If we use the `scripts/common.sh:get_salt_container` function at that
point in time, we may return a string with two container IDs instead of
just one, and subsequent commands will fail.
Instead, we now wait for a single container to exist (and also add a
sleep between attempts, which we didn't have before).

Fixes: #2434
@bert-e bert-e closed this as completed in 17183dd Apr 21, 2020
Flakiness Investigations automation moved this from To Do to Done Apr 21, 2020