Rook 1.2 Ceph OSD Pod memory consumption very high #5821

Closed
alexcpn opened this issue Jul 14, 2020 · 11 comments

alexcpn (Contributor) commented Jul 14, 2020

Related issues:
#5811
#2764 --> ceph/ceph#26856

Is this a bug report or feature request?
Bug Report

Deviation from expected behaviour:
There is no guideline for setting the rook-ceph pod memory limits, so we haven't set any. However, even though the internal osd_memory_target is set to the default 4 GB, I could see in the top output of the Ceph OSD pod that it is taking 8 GB as resident set memory, and more as virtual memory.

ceph config get osd.0 osd_memory_target
4294967296

https://ceph.com/releases/v12-2-10-luminous-released/ -- "The bluestore_cache_* options are no longer needed. They are replaced by osd_memory_target, defaulting to 4GB. BlueStore will expand and contract its cache to attempt to stay within this limit."
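For reference, a minimal sketch of how the target could be checked and, if needed, lowered cluster-wide from the toolbox (the 2 GiB value below is purely illustrative and should be sized to the hardware):

# Check the effective target for one OSD (value is in bytes).
ceph config get osd.0 osd_memory_target
# Illustrative only: lower the target for all OSDs to 2 GiB.
ceph config set osd osd_memory_target 2147483648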

Inside the OSD pod - top command:
[screenshot: top output inside the OSD pod]

Grafana dashboard

| Pod Name | Memory Usage | File System Usage | CPU | Network (in) | Network (out) |
| --- | --- | --- | --- | --- | --- |
| rook-ceph-osd-2-bb5b46984-wkstx | 179.27 GB | 311.30 kB | 11.28% | 15 Tbps | 8 Tbps |
| rook-ceph-osd-0-7cf65bdb4d-xs2tw | 183.91 GB | 327.68 kB | 10.64% | 16 Tbps | 9 Tbps |
| rook-ceph-osd-1-6c7fd5b8bf-24bhk | 181.65 GB | 311.30 kB | 10.44% | 18 Tbps | 9 Tbps |
| rook-ceph-mgr-a-bbcf558b7-l4nqg | 702.88 MB | 202.21 MB | 4.74% | 413 Gbps | 93 Gbps |
| rook-ceph-mon-e-5b4bf84f4f-z6dn4 | 2.07 GB | 234.18 MB | 0.63% | 53 Gbps | 133 Gbps |
| rook-ceph-mon-d-57df95d979-8mtq2 | 1.73 GB | 213.33 MB | 0.43% | 62 Gbps | 5 Gbps |
| rook-ceph-mon-f-7c85d58c4f-488bc | 1.70 GB | 210.13 MB | 0.39% | 61 Gbps | 5 Gbps |
| rook-ceph-rgw-rook-ceph-objstore-a-5b9c8dcb54-qjkzr | 899.08 MB | 217.36 MB | 0.27% | 69 Gbps | 70 Gbps |
| rook-ceph-tools-9b657cbf-t2m9k | 42.56 MB | 258.05 kB | 0.07% | 8 Tbps | 7 Tbps |

Memory growth - last 14 days:
[screenshot: Grafana graph of pod memory growth over the last 14 days]

Expected behavior:
At most, the Ceph OSD pod should take 4 GB for the ceph-osd process, plus perhaps 1 or 2 GB more for the other processes running inside the pod.

How to reproduce it (minimal and precise):
Observed after the cluster had been running for a few days.

Environment:

  • OS (e.g. from /etc/os-release):
    CentOS
  • Kernel (e.g. uname -a):
  • Cloud provider or hardware configuration:
    HP Servers
  • Rook version (use rook version inside of a Rook Pod):
    rook 1.2
  • Storage backend version (e.g. for ceph do ceph -v):
    ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)
  • Kubernetes version (use kubectl version):
    Kubernetes 1.16
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
    Custom
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):

HEALTH_WARN crush map has legacy tunables (require firefly, min is hammer); 9 pool(s) have non-power-of-two pg_num; too many PGs per OSD (766 > max 250)

alexcpn added the bug label Jul 14, 2020
alexcpn (Contributor, Author) commented Jul 14, 2020

It seems the Grafana dashboard was wrong. I went onto the node hosting the pod, found the pod's process id, and it is actually taking much less:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
23748 167       20   0 6613980   1.1g  31748 S   0.7   0.9  54:57.30 ceph-osd 
--
sudo pmap -x 23748
23748:   ceph-osd --foreground --id 3 --fsid 14b73e8c-8d87-4887-9921-d83074868f71 --setuser ceph --setgroup ceph --crush-location=root=default host=da-cicd-enc1-bl12 --default-log-to-file false --ms-learn-addr-from-peer=false

---------------- ------- ------- ------- 
total kB         6613984 1173952 1142088
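For a cross-check from the Kubernetes side, something like the following should report the working-set memory that Grafana is meant to be graphing (assuming the default rook-ceph namespace and that metrics-server is installed); the ps line shows the resident size of the ceph-osd processes on the node for comparison:

# Working-set memory per pod, as reported by metrics-server.
kubectl -n rook-ceph top pod | grep rook-ceph-osd
# Resident and virtual size of the ceph-osd processes on the node.
ps -C ceph-osd -o pid,rss,vsz,comm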

alexcpn closed this as completed Jul 14, 2020
alexcpn (Contributor, Author) commented Jul 15, 2020

Correct Grafana metrics calculation:

To calculate container memory utilization we use:

sum(container_memory_working_set_bytes{name!~"POD"})
  by (name)
In the above query we need to exclude the container whose name contains "POD"; that is the parent cgroup for the container and it tracks stats for all the containers in the pod.

https://blog.freshtracks.io/a-deep-dive-into-kubernetes-metrics-part-3-container-resource-metrics-361c5ee46e66

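A sketch of how the corrected query could be run against Prometheus directly, scoped to the OSD pods. The Prometheus service address below is hypothetical, and the exact label names (name vs. container/pod) depend on the kubelet/cAdvisor version, so treat both as assumptions:

# Hypothetical Prometheus address; adjust to your monitoring namespace/service.
PROM=http://prometheus-operated.monitoring.svc:9090
# Per-pod working-set memory for the OSD pods, excluding the parent "POD" cgroup container.
curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=sum(container_memory_working_set_bytes{container!="POD",pod=~"rook-ceph-osd-.*"}) by (pod)'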

JieZeng1993 commented Sep 22, 2020

You closed this issue. If you have already resolved the problem, could you tell us how you resolved it?

galexrt (Member) commented Sep 22, 2020

@JieZeng1993 From what I understand, @alexcpn "fixed" the issue by fixing the query for the graph used to monitor the memory usage of Kubernetes Pods (including the OSD Pods).

raj-katonic commented:

Facing the same issue: every OSD pod is consuming 4 GB of RAM on average. Can anyone please let me know why it needs so much?

travisn (Member) commented Feb 10, 2022

@raj-katonic Did you set memory requests/limits on the OSDs? See this topic. The recommended memory is generally 4GB per OSD in production, but smaller clusters can set it lower if needed. If these limits are not set, the OSD will potentially use a lot more memory since it is not aware of any limits.
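A minimal sketch of what that looks like, assuming the default CephCluster name rook-ceph and purely illustrative sizes; the same resources block can also be set directly in the cluster manifest:

# Illustrative only: give each OSD a 4Gi memory request and a 6Gi limit.
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"spec":{"resources":{"osd":{"requests":{"memory":"4Gi"},"limits":{"memory":"6Gi"}}}}}'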

zhucan (Member) commented Mar 17, 2022

@travisn I have a question: if the default_osd_memory_target is 4G, why do we still need to set requests/limits on the OSDs? And why is the Ceph OSD pod's memory consumption much higher than 4G?

travisn (Member) commented Mar 17, 2022

I'm not sure about the default_osd_memory_target; the OSD must not respect it as strictly as it respects the osd_memory_target that is picked up by the OSD when the resource limits/requests are set.
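One way to compare what is stored centrally with what the running daemon is actually using (a sketch from the toolbox, assuming osd.0 exists):

# Value stored in the mon config database.
ceph config get osd.0 osd_memory_target
# Value the running osd.0 daemon is actually using, including any override
# passed by Rook when pod resource requests/limits are set.
ceph config show osd.0 osd_memory_target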

microyahoo (Member) commented:

@travisn I have noticed that osd_memory_target is 4G, but OSD memory usage is far more than 4G when resource limits/requests are not set, as follows:

[rook@rook-ceph-tools-7ccf879b55-ljlsv /]$ ceph config get osd.10 osd_memory_target
4294967296

travisn (Member) commented Mar 30, 2022

@microyahoo Did you try setting the resource limits? Setting the resource limits is the recommended way to have the OSDs respect memory usage instead of growing so large like that.

microyahoo (Member) commented:

> @microyahoo Did you try setting the resource limits? Setting the resource limits is the recommended way to have the OSDs respect memory usage instead of growing so large like that.

@travisn No, I didn't set the resource limits. I'm just curious why the osd_memory_target setting doesn't take effect in controlling memory usage.
