OSD and MON memory consumption #5811
If it's a new Ceph cluster then you did something wrong. Where is the cluster created? Cloud, on-prem?
@OpsPita The cluster is bare metal and was created more than a year ago.
That is something enormous.
@Antiarchitect It's recommended to add resource limits to the OSDs. See the Cluster CR doc on resource limits. There is also an example in cluster.yaml.
@travisn Thank you for the tip - any recommended values?
@Antiarchitect In the Cluster CR doc it mentions the minimum memory limit is 2G, but a better default is 4G.
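For reference, a minimal sketch of what those limits could look like in the CephCluster CR. The OSD values follow the 4G suggestion above; the mon values are purely illustrative:

```yaml
# Sketch only - resources section of a CephCluster CR; adjust values to your hardware.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  resources:
    osd:
      requests:
        memory: "4Gi"
      limits:
        memory: "4Gi"
    mon:
      requests:
        memory: "1Gi"   # illustrative; see the Cluster CR doc for guidance
      limits:
        memory: "2Gi"
```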
@travisn The situation has normalized, but 2 of 15 OSDs are encountering this:
Original issue: #5814 - will reopen.
@leseb Thoughts on what would cause the admin_socket to be invalid in the liveness probe?
Also, I've set the resources for mon, mgr, osd, and mds but cannot see them in my pod specs. Are they managed differently somehow? By the operator, maybe?
@travisn @leseb Meanwhile, seeing this in the operator logs:
Is it normal that the osd_memory_target value is always exactly the same as my memory resource limit? What about pod / OSD process overhead, etc.? Still getting OOMKilled. I tried 4GB, 8GB, and 16GB limits; only a 32GB limit is now giving a stable result. That is sad, as I have 3 OSDs per node and only 64GB of memory on each node.
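As a side note, a couple of standard Ceph commands can confirm what the daemon actually ended up with (osd.0 below is just a placeholder):

```sh
# From the toolbox: the value the cluster reports for this daemon
ceph config show osd.0 osd_memory_target

# From inside the osd.0 container, via the admin socket
ceph daemon osd.0 config get osd_memory_target
```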
@Antiarchitect It's very unexpected that you would need over 4GB to stabilize.
@travisn It seems like when the cluster is stable the OSDs consume less, but when part of the OSDs is down the others start to eat memory (I haven't faced these amounts before - the record was 45Gi on one of the OSDs), and it is a self-destabilizing process. P.S. Ceph is one of the most beautiful pieces of software I've ever met in practice - it's tenacious like a cockroach.
The cluster has stabilized - no data loss; updated Rook to 1.3.8. The picture:
It's scary that this calm picture can turn into a two-night nightmare momentarily.
@Antiarchitect I ran into something like this with Rook 1.3 (not OCS) a couple of days ago, when I did a really intense fio read workload with 4 NVMe drives/host and a 25-GbE link - the OSDs started caching data furiously until the node ran out of memory, at which point I started seeing OOM kills on the OSDs, which caused the remaining OSDs to work hard to recover during this heavy load. Eventually it recovers, but the point is that to prevent this you need to control caching. I did this with bluestore_default_buffered_read: false, but Josh Durgin suggested that instead of that, bluefs_buffered_io can be set to false so that reads are done with O_DIRECT and do not involve the kernel buffer cache. This will slow down BlueStore RocksDB compaction somewhat (per Mark Nelson) but will result in much more stable memory utilization. You can also ensure that osd_memory_target is set so that the OSD itself is limiting its in-process memory consumption - I'd suggest > 4 GiB - and set the memory cgroup limit to at least 50% higher than osd_memory_target to give OSDs a chance to avoid OOM. Make sure transparent hugepages are disabled for Ceph daemons; this should happen automatically in modern versions of Ceph such as the one you are using. If you are using SSD storage, the cost of a cache miss is much lower. Let us know if this makes sense and if it helps. @travisn FYI
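If it helps, a hedged sketch of applying that bluefs_buffered_io change through Rook's Ceph config override ConfigMap; the name and namespace below assume a default rook-ceph install, and OSD pods need a restart to pick the setting up:

```yaml
# Sketch only - low-level Ceph options injected via Rook's override ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [osd]
    bluefs_buffered_io = false
```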
@bengland2 Thank you so much for the explanation! It does make sense and I will try it. Actually, this could be part of BlueStore fine-tuning in the Rook storage config in future releases of Rook. P.S. We've decided to move the Ceph cluster out to dedicated nodes not managed by K8s and use Rook's Ceph external cluster feature to provide RBD and CephFS storage types.
@bengland2 Thanks for all the input! Rook only sets these two vars depending on the resource requests/limits in the CephCluster CR: Ceph picks up on those env vars in the OSD daemon and will set the cgroup limit accordingly, looks like here. I thought Ceph was setting it to 0.8 of the memory limit, but I don't see that calculation there. Are we missing some other setting that will do that calculation?
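One hedged way to check whether a ratio is being applied at runtime is to query the cluster from the toolbox; the osd_memory_target_cgroup_limit_ratio option below is an assumption about the Ceph release in use:

```sh
# Effective target on a running OSD
ceph config show osd.0 osd_memory_target

# Ratio applied when a cgroup/pod memory limit is detected (if the option exists in this release)
ceph config get osd osd_memory_target_cgroup_limit_ratio
```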
@bengland2 @travisn Just ran |
Found it. Seems false by default. |
@Antiarchitect So is the RSS in your OSDs growing, or is the kernel buffer cache (inactive pages) growing? If the kernel buffer cache, then BlueStore is not doing O_DIRECT. Otherwise, the BlueStore OSD-internal cache must be growing; I think there are counters available for monitoring this via ceph daemon osd.N perf dump (hard to get to with containers). Let's isolate the problem. To set low-level options (e.g. ceph.conf), you can use the ceph_config_overrides configmap; this is documented by rook.io. To change them on the fly, use "ceph tell" with injectargs, but this isn't guaranteed to work for all parameters. Try dropping the cache manually with echo 1 > /proc/sys/vm/drop_caches after you make changes so the cache is clean to begin with.
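Roughly, the checks above translate to the following commands (osd.0 is a placeholder; run the perf dump inside the OSD container and drop_caches on the host):

```sh
# Dump BlueStore cache/memory counters from the daemon's admin socket
ceph daemon osd.0 perf dump

# Change a low-level option on the fly (not guaranteed to apply to every parameter)
ceph tell osd.* injectargs '--bluefs_buffered_io=false'

# Drop the kernel page cache so measurements start from a clean state
echo 1 > /proc/sys/vm/drop_caches
```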
From the rook toolbox container running
From the osd.0 container I get |
Because the toolbox does not run any daemon. The
Ok, I got it.
and I see "osd_memory_target": "1073741824", what's your resources:memory:limit on your OSD pod? Because osd_memory_target seems to be limiting you to 1 GiB, but typically I never want to see that below 4 GiB, not sure what lower limit is. |
https://gist.github.com/Antiarchitect/b0b68a463e021e3dabca1e60dff6f924 and new dump |
I haven't done any tests where osd_memory_target was set to 1 GiB. If your memory limit is 32 Gi, that means you're willing to give the OSD up to 32 GB of RAM before killing it, so why set osd_memory_target to 1 GiB? Since you have 10 OSDs and 64 GB RAM, you have enough memory to provide 4 GB RAM for each OSD, which is usually what I see it set at. Based on the above discussion, since you avoid using the kernel buffer cache, that should free up memory for OSDs. I'm not sure what the minimum value of osd_memory_target is, but Ceph has to be able to cache OSD metadata to run efficiently; if you starve it for RAM then it will constantly have to go to RocksDB to get this metadata and will slow down, at best.
I didn't set osd_memory_target manually; any options I've changed so far were changed via the Rook OSD limits.
Try raising the memory request to 4 GiB and see what happens? request=1GiB is still too low. Basically that means "don't schedule the pod on this node unless there is 1 GiB of free mem". You don't want the OSD running there anyway unless there is more memory available than that. In fact, I'd suggest setting the request to 4 GiB and the limit to 6 GiB (i.e. don't OOM-kill it until it gets to the limit), and watch what happens to osd_memory_target. Hopefully you get an OSD memory target that is some percentage under the limit, so the Ceph OSD trims its memory usage before it gets OOM-killed by the Linux kernel. Make sense?
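As a sketch, that request/limit split would look something like this in the CephCluster CR (values taken from the suggestion above):

```yaml
spec:
  resources:
    osd:
      requests:
        memory: "4Gi"   # scheduling guarantee
      limits:
        memory: "6Gi"   # OOM-kill threshold, ~50% above the request
```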
@bengland2 Thank you for the tip. Here is the result: https://gist.github.com/Antiarchitect/fc1cfd989a58528a33a3c04455a45b3b
I will close this issue as the cluster has been stable for about a week already. If the problem shows itself again, I will reopen. Thank you all for your patience and good advice!
I have this picture of memory consumption by Rook and Ceph:
I have several OSDs per node, Total Raw Capacity 5.2 TiB. One node was offline for a week or so, so the last rebalancing was very slow and lasted for several hours. Is such high memory consumption by OSDs normal? I have only 64GB of RAM on each node, and it seems like more than half is consumed by Ceph OSDs.
Environment:
- Kernel (`uname -a`): Linux worker-5.prod.lwams1.enapter.ninja 5.7.1-1.el7.elrepo.x86_64
- Rook version (`rook version` inside of a Rook Pod):
- Ceph version (`ceph -v`):
- Kubernetes version (`kubectl version`):
- Storage backend status (`ceph health` in the Rook Ceph toolbox): One node is out for maintenance