Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add node drain wait period #4506

Merged
merged 25 commits into from
Jan 18, 2022
Merged

Conversation

rayterrill
Copy link
Contributor

Description

Add ability to specify a node drain wait period during a nodegroup drain operation. This helps with applications that may take time to start and do not react well to the "fire and forget" drain operation.

Also discussed in #1699

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the userdocs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes
  • (Core team) Added labels for change area (e.g. area/nodegroup) and kind (e.g. kind/improvement)

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

pkg/drain/nodegroup.go Outdated Show resolved Hide resolved
@Skarlso
Copy link
Contributor

Skarlso commented Dec 2, 2021

Thank you very much for your PR! Let's work to get this in together. :) And welcome to the eksctl contributors community! :)

@Skarlso Skarlso added the kind/feature New feature or request label Dec 2, 2021
@Himangini
Copy link
Collaborator

Himangini commented Jan 5, 2022

@rayterrill HNY 🥳 Any updates on this PR? Please feel free to ask us if you need help with this PR :)
I am removing the review requests, please feel free to tag us again when the PR is ready to be reviewed 👍🏻

@rayterrill
Copy link
Contributor Author

@rayterrill HNY 🥳 Any updates on this PR? Please feel free to ask us if you need help with this PR :)
I am removing the review requests, please feel free to tag us again when the PR is ready to be reviewed 👍🏻

@Himangini I'd still love to get this in somehow - this is something we definitely would like to see in eksctl. Not sure how to proceed - still pretty new to Go from a big project perspective. :)

@Skarlso
Copy link
Contributor

Skarlso commented Jan 6, 2022

@rayterrill No worries. I'll take a look and give some code advice in the morning :) Some proper code, so you can see what I meant about waiting. :) It's not easy, so don't worry about it.

@Skarlso
Copy link
Contributor

Skarlso commented Jan 6, 2022

Sorry @rayterrill but I just realised that we are doing a lot of things now since November or so, and I was wondering if this is still a thing that is required or not working? #1699 (comment)

We are doing a proper eviction and are filtering based on pods which can be evicted or deleted based on status.

@rayterrill
Copy link
Contributor Author

rayterrill commented Jan 6, 2022

Sorry @rayterrill but I just realised that we are doing a lot of things now since November or so, and I was wondering if this is still a thing that is required or not working? #1699 (comment)

We are doing a proper eviction and are filtering based on pods which can be evicted or deleted based on status.

This is really about slowing down the eviction/drain process - we have some apps that need some time to come up once they've been evicted from nodes, and the current process basically rapid fire cordon and drains all nodes in the nodegroup. We're looking to inject some wait time in there in between each operation to allow things to stabilize.

Is there some docs on how the health check piece works?

@Skarlso
Copy link
Contributor

Skarlso commented Jan 7, 2022

Is there some docs on how the health check piece works?

Sadly no. It's just the code, I'm reading. Look at this function:

func (d *Evictor) GetPodsForEviction(nodeName string) (*PodDeleteList, []error) {

in pkg/drain/evictor/evictor.go.

This is really about slowing down the eviction/drain process - we have some apps that need some time to come up once they've been evicted from nodes, and the current process basically rapid fire cordon and drains all nodes in the nodegroup.

Ahhhha. I see. So wait in between... for what exactly? Is there a criteria? We can't check if the pod came up at a different location ofc... so is it really just a hard sleep between node evictions?

@rayterrill
Copy link
Contributor Author

Ahhhha. I see. So wait in between... for what exactly? Is there a criteria? We can't check if the pod came up at a different location ofc... so is it really just a hard sleep between node evictions?

That's what I was thinking - a configurable sleep (an option we could pass in). Def crude, but it would help with some app workloads.

@Skarlso
Copy link
Contributor

Skarlso commented Jan 13, 2022

@rayterrill Alright then. Let's do that.

@Skarlso
Copy link
Contributor

Skarlso commented Jan 13, 2022

@rayterrill Can you take a look at the lint and test errors or do you want me to?

@Skarlso
Copy link
Contributor

Skarlso commented Jan 13, 2022

Oh yeah, you have to update the tests and everywhere where the function is called since you updated the function parameter list. :)

@rayterrill
Copy link
Contributor Author

@Skarlso let me take a look at that this wknd.

Copy link
Contributor

@aclevername aclevername left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Copy link
Collaborator

@cPu1 cPu1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! I have left some comments.

pkg/ctl/drain/nodegroup.go Outdated Show resolved Hide resolved
pkg/ctl/drain/nodegroup.go Outdated Show resolved Hide resolved
pkg/drain/nodegroup.go Outdated Show resolved Hide resolved
pkg/ctl/drain/nodegroup.go Outdated Show resolved Hide resolved
pkg/ctl/drain/nodegroup_test.go Outdated Show resolved Hide resolved
pkg/drain/nodegroup.go Outdated Show resolved Hide resolved
pkg/drain/nodegroup.go Outdated Show resolved Hide resolved
pkg/drain/nodegroup.go Outdated Show resolved Hide resolved
pkg/ctl/drain/nodegroup.go Outdated Show resolved Hide resolved
pkg/actions/nodegroup/drain.go Outdated Show resolved Hide resolved
rayterrill and others added 12 commits January 17, 2022 13:53
Co-authored-by: Jake Klein <jakelarsj@gmail.com>
Co-authored-by: Chetan Patwal <cPu1@users.noreply.github.com>
Co-authored-by: Chetan Patwal <cPu1@users.noreply.github.com>
Co-authored-by: Chetan Patwal <cPu1@users.noreply.github.com>
Co-authored-by: Chetan Patwal <cPu1@users.noreply.github.com>
Co-authored-by: Chetan Patwal <cPu1@users.noreply.github.com>
Co-authored-by: Chetan Patwal <cPu1@users.noreply.github.com>
Co-authored-by: Chetan Patwal <cPu1@users.noreply.github.com>
Co-authored-by: Chetan Patwal <cPu1@users.noreply.github.com>
Co-authored-by: Chetan Patwal <cPu1@users.noreply.github.com>
Co-authored-by: Chetan Patwal <cPu1@users.noreply.github.com>
@rayterrill
Copy link
Contributor Author

I think I made all requested changes.

@aclevername aclevername requested a review from cPu1 January 18, 2022 09:32
pkg/drain/nodegroup.go Outdated Show resolved Hide resolved
pkg/ctl/drain/nodegroup.go Outdated Show resolved Hide resolved
@cPu1
Copy link
Collaborator

cPu1 commented Jan 18, 2022

I have fixed the compilation errors and the merge conflicts for you. We're good to merge.

@cPu1 cPu1 enabled auto-merge (squash) January 18, 2022 10:51
@cPu1 cPu1 merged commit 2ec15dd into eksctl-io:main Jan 18, 2022
@rayterrill rayterrill deleted the add-node-drain-wait-period branch January 18, 2022 16:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants