Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding --annotation-ttl for automatic unlock #119

Merged
merged 7 commits into from
May 20, 2020
Merged

Adding --annotation-ttl for automatic unlock #119

merged 7 commits into from
May 20, 2020

Conversation

michalschott
Copy link
Contributor

@michalschott michalschott commented Mar 9, 2020

I have observed this behavior when running kured together with cluster autoscaler, reason why this idea was born.

  • CA adds nodes (CoreOS in my case)
  • Nodes are starting to reboot themselves after fetching updates, ie. annotation has been added (lock owned by node 123.123.123.123)
  • CA decides to downscale ASG (not all nodes had a chance to reboot yet), and it decides to cut off node 123.123.123.123
  • Need to manually take off annotation

This PR introduces --annotationTTL parameter, by default set to 0 (means do not use this feature), which allows other kured pods to take off the lock if it has expired.

Comparing kured to coreos-update-operator latter one is adding lock annotation onto node object which makes such feature as lockTTL completely not required.

@michalschott michalschott changed the title Adding annotationTTL. [WIP] Adding annotationTTL. Mar 16, 2020
@michalschott michalschott changed the title [WIP] Adding annotationTTL. Adding annotationTTL. Mar 16, 2020
@michalschott
Copy link
Contributor Author

@dholbach any feedback on this one please?

@michalschott michalschott changed the title Adding annotationTTL. Adding annotationTTL for automatic unlock Mar 28, 2020
Copy link
Contributor

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.
I think it needs a few more safeguards as noted.
Also, how about unit tests?

README.md Outdated Show resolved Hide resolved
@@ -79,6 +86,11 @@ func (dsl *DaemonSetLock) Test(metadata interface{}) (holding bool, err error) {
if err := json.Unmarshal([]byte(valueString), &value); err != nil {
return false, err
}

if ttlExpired(value.Created, value.TTL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if you have a running kured install, and you add the TTL flag; won't it see the previous annotation as expired immediately?
Similarly if someone takes the lock manually

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you take lock manually as described, then set TTL flag it will invalidate it instantly as there will be no created and ttl values set in annotation:

if ttl > 0 && time.Since(created) >= ttl

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this is acceptable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So double checked this behaviour.

Launch params:

kured-pzs8b kured time="2020-05-05T18:27:40Z" level=info msg="Kubernetes Reboot Daemon: -64ebf53"
kured-pzs8b kured time="2020-05-05T18:27:40Z" level=info msg="Node ID: XXX.compute.internal"
kured-pzs8b kured time="2020-05-05T18:27:40Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
kured-pzs8b kured time="2020-05-05T18:27:40Z" level=info msg="Reboot Sentinel: /var/run/update_engine_autoupdate_completed every 1m0s"
kured-pzs8b kured time="2020-05-05T18:27:40Z" level=info msg="Blocking Pod Selectors: []"
kured-pzs8b kured time="2020-05-05T18:27:40Z" level=info msg="Reboot on: SunMonTueWedThuFriSat between 00:00 and 23:59 UTC"
kured-pzs8b kured time="2020-05-05T18:27:40Z" level=info msg="Force annotation cleanup after: 2m0s"

Manually disabled reboots:

➜  ~ kubectl -n kube-system annotate ds kured weave.works/kured-node-lock='{"nodeID":"manual"}'
daemonset.apps/kured annotated

Output:

kured-pzs8b kured time="2020-05-05T18:29:29Z" level=info msg="Reboot required"
kured-pzs8b kured time="2020-05-05T18:29:29Z" level=warning msg="Lock already held: manual"
kured-pzs8b kured time="2020-05-05T18:30:29Z" level=info msg="Reboot required"
kured-pzs8b kured time="2020-05-05T18:30:29Z" level=warning msg="Lock already held: manual"
kured-pzs8b kured time="2020-05-05T18:31:29Z" level=info msg="Reboot required"
kured-pzs8b kured time="2020-05-05T18:31:29Z" level=warning msg="Lock already held: manual"
kured-pzs8b kured time="2020-05-05T18:32:29Z" level=info msg="Reboot required"
kured-pzs8b kured time="2020-05-05T18:32:29Z" level=warning msg="Lock already held: manual"
kured-pzs8b kured time="2020-05-05T18:33:29Z" level=info msg="Reboot required"
kured-pzs8b kured time="2020-05-05T18:33:29Z" level=warning msg="Lock already held: manual"

Removing lock manually:

➜  ~ kubectl -n kube-system annotate ds kured weave.works/kured-node-lock-
daemonset.apps/kured annotated

Output:

kured-pzs8b kured time="2020-05-05T18:35:29Z" level=info msg="Reboot required"
kured-pzs8b kured time="2020-05-05T18:35:29Z" level=info msg="Acquired reboot lock"
kured-pzs8b kured time="2020-05-05T18:35:29Z" level=info msg="Draining node XXX.compute.internal"
...

So if lock has been taken manually, enabling this won't take over as I though initially.

To answer your question if this is acceptable - turns out an ability to manually disable reboot is not gone despite enabling annotation TTL functionality, so all is ok.

Of course this might not fit in every environments, but ideally I think the better solution would be to annotate node objects instead ds which would not require any cleanup mechanism being implemented.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - I had missed that if ttl > 0 is checking the ttl set by the locker (if any), not Kured's command-line setting.

pkg/daemonsetlock/daemonsetlock.go Outdated Show resolved Hide resolved
cmd/kured/main.go Outdated Show resolved Hide resolved
@@ -79,6 +86,11 @@ func (dsl *DaemonSetLock) Test(metadata interface{}) (holding bool, err error) {
if err := json.Unmarshal([]byte(valueString), &value); err != nil {
return false, err
}

if ttlExpired(value.Created, value.TTL) {
return true, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would print a log message in this case, for when it does something unexpected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can not see any other log references inside this pkg, shall this be changed then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's valid; could add another return value and print at the higher level?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think I could find place for err inside ttlExpired function as this is simple true/false check function.

If I'd return an err value here, it would be errored as fatal - https://github.com/weaveworks/kured/blob/f2a0f8e20dbe6b0cc4a5d03cbe06889333adee98/cmd/kured/main.go#L196-L205

Buuut, I might be blind...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I'll fix it in post.

@michalschott
Copy link
Contributor Author

@bboreham I'd love to extend unit tests if there are any ;)

Addressed your concerns, can you have 2nd look please?

README.md Outdated Show resolved Hide resolved
@bboreham
Copy link
Contributor

bboreham commented May 5, 2020

@michalschott
Copy link
Contributor Author

@bboreham I'd love to extend unit tests if there are any ;)

Yes there are some:
https://github.com/weaveworks/kured/blob/master/pkg/timewindow/days_test.go
https://github.com/weaveworks/kured/blob/master/pkg/timewindow/timewindow_test.go

Added tests for func ttlExpired.

@michalschott michalschott changed the title Adding annotationTTL for automatic unlock Adding --annotation-ttl for automatic unlock May 6, 2020
Copy link
Contributor

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM.

@@ -79,6 +86,11 @@ func (dsl *DaemonSetLock) Test(metadata interface{}) (holding bool, err error) {
if err := json.Unmarshal([]byte(valueString), &value); err != nil {
return false, err
}

if ttlExpired(value.Created, value.TTL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - I had missed that if ttl > 0 is checking the ttl set by the locker (if any), not Kured's command-line setting.

@@ -79,6 +86,11 @@ func (dsl *DaemonSetLock) Test(metadata interface{}) (holding bool, err error) {
if err := json.Unmarshal([]byte(valueString), &value); err != nil {
return false, err
}

if ttlExpired(value.Created, value.TTL) {
return true, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I'll fix it in post.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants