osd: Prepare job needs significantly more memory for provisioning #11103
Conversation
Memory usage may burst during OSD provisioning, depending on the size of the OSD and similar factors. If the OSD prepare job is OOM-killed, OSD provisioning fails, with various side effects that are difficult to troubleshoot before the OSD can succeed. So we increase the recommendation significantly to avoid the OOM kill. Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
osd: Prepare job needs significantly more memory for provisioning (backport #11103)
All, Apologies for arriving late after this has already been merged. I just tested the value of
And the Job is killed by the OOMKiller. The note/hint about the OSD prepare job potentially being killed is useful. Perhaps a sentence could be added that users may need to increase this value if they have "large" volumes to prepare and don't get OSD pods as expected.
@rajha-korithrien Thanks for your observations! We could keep raising the limit to something like 2Gi. But I'm wondering if we should just not specify memory limits for the OSD prepare job. It's a one-time action, and we don't want memory limits to prevent OSD creation. Is there really any reason to apply limits to it? @satoru-takeuchi @kfox1111 thoughts?
@travisn Now we know it's difficult to estimate the proper memory limit, so not setting a memory limit by default and describing this behavior in ceph-common-issues.md is a reasonable solution for now.
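For context, per-component resources on a Rook cluster are configured under `spec.resources` in the CephCluster CR. A minimal sketch of leaving the prepare job's memory unlimited while still reserving some for scheduling — the `prepareosd` key name is my assumption of the relevant Rook resource key, and the values are illustrative, not recommendations:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  resources:
    # Assumed key for the OSD prepare job. Omitting "limits" lets the job
    # burst during provisioning without being OOM-killed at a fixed ceiling.
    prepareosd:
      requests:
        cpu: "500m"      # illustrative value
        memory: "400Mi"  # illustrative value
```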
Yeah, maybe no limit would be better... though I think someone said on Slack that they had a prepare job push over a running OSD, so maybe a limit does help. It's kind of unclear. It's messy to clean up after a failed one, so reserving (rather than limiting) more than enough memory may be a good default, and then if it's too much, they can always tweak it down?
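The "reserve rather than limit" idea maps onto Kubernetes resource semantics: setting only `requests` makes the scheduler set aside that memory on the node, but the container is never killed at an artificial ceiling — it is only at risk under node-level memory pressure. A generic sketch with illustrative values:

```yaml
# Container resources for the prepare job (illustrative values):
resources:
  requests:
    memory: "2Gi"  # reserved for scheduling; deliberately generous
  # No "limits" stanza: the container may exceed its request and will
  # not be OOM-killed unless the node itself runs out of memory.
```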
Like you said, it's messy to clean up if it gets into a failed state, so allowing it to run unconstrained to completion will be best. If it causes other pods to fall over, there would be a hiccup, but they should recover after the pod comes back up again. It's still possible to set limits or different requests if desired; for the defaults we just need to leave it unconstrained.
* See note: rook/rook#11103 * See note: rook/rook#11109
Description of your changes:
Memory usage may burst during OSD provisioning, depending on the size of the OSD and similar factors. If the OSD prepare job is OOM-killed, OSD provisioning fails, with various side effects that are difficult to troubleshoot before the OSD can succeed. So we increase the recommendation significantly to avoid the OOM kill.
Which issue is resolved by this Pull Request:
Resolves #10219
Checklist: