-
Notifications
You must be signed in to change notification settings - Fork 782
feat(podtemplate): calculate GOMEMLIMIT at 90% of memory limit to reduce OOM #1418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…ce OOM Signed-off-by: ShlomiTubul <shlomi.tubul@placer.ai>
Signed-off-by: ShlomiTubul <shlomi.tubul@placer.ai>
Signed-off-by: ShlomiTubul <shlomi.tubul@placer.ai>
fix(podtemplate-test): fix unit & tests (+comment) Co-authored-by: Rémi BUISSON <remi-buisson@orange.fr>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there!
Thanks a lot for this contribution. I'm personally a great fan of the performance and stability improvements that setting GOMEMLIMIT
introduces, which is one of the reasons I've submitted various PRs elsewhere to add this functionality to Helm charts.
As such, I think it's great to allow more flexibility on this from the chart side. Although it does come with some maintenance burden and slightly complicated helpers, I still consider this beneficial to improve a lot of usecases.
However, I do wonder a bit what would be the most suitable 'default' here to pick. The current implementation is of course the simplest, but as you've clearly seen in your use-case, it may not be the best choice for everyone. But I think that using 80% will effectively be the same: it will be suitable to some use-cases, but might also impact some use-cases which are currently doing fine in a negative way. Given that decreasing the limit may result in more gc runs, which may result in more CPU consumption and thus eventually potential slower response times, I think it is essential that we pick a number that is both generic enough, yet still customisable for people aiming to tweak their own setup.
For the choice of default value, in cert-manager/cert-manager#6977 (comment) I was provided with some great feedback on it. Based on those references, with the (indirectly referenced) Go documentation itself being the most relevant in my opinion, I would suggest that probably a value between 90 and 95 would be more accurate as 'default'.
I think a default deviating from the recommended 90-95% or the simple current 100% is something we should only do if we actually have some very clear benchmarks for a representative use-case that suggests another value is more suitable.
So in that regard I would suggest:
- allowing tuning as @darkweaver87 suggested is essential to introduce this change;
- using a default value of 90% to 95% is more suitable based on current knowledge.
Signed-off-by: ShlomiTubul <shlomi.tubul@placer.ai>
Hey @jnoordsij, I appreciate your comment and review. I applied @darkweaver87 and your suggestion |
Co-authored-by: Jesper Noordsij <45041769+jnoordsij@users.noreply.github.com>
Co-authored-by: Jesper Noordsij <45041769+jnoordsij@users.noreply.github.com>
Signed-off-by: ShlomiTubul <shlomi.tubul@placer.ai>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@shlomitubul You need to update the schema. See |
Signed-off-by: ShlomiTubul <shlomi.tubul@placer.ai>
Signed-off-by: ShlomiTubul <shlomi.tubul@placer.ai>
@mloiseleur done |
@shlomitubul And also "make docs", to update VALUES.md |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (after running make docs
)
What does this PR do?
chart current pod template injects
GOMEMLIMIT = pod memory limits, it leaves no headroom for Go GC to run before the process hits the cgroup ceiling, leading to OOM kills. In my tests, I show great improvement (in terms of oom events and steady memory usage) with lower values, like 90%.
Motivation
Multiple OOMs that were gone once we changed GOMEMLIMIT value
More
make test
and all the tests passed