
OSD and MON memory consumption #5811

Closed
Antiarchitect opened this issue Jul 13, 2020 · 30 comments

@Antiarchitect

Antiarchitect commented Jul 13, 2020

I have this picture of memory consumption by Rook and Ceph:

core-rook                                core-rook-toolbox-64bf577fb6-28jkp                                1m           8Mi             
core-rook                                csi-cephfsplugin-4qn92                                            1m           57Mi            
core-rook                                csi-cephfsplugin-6wwj8                                            1m           73Mi            
core-rook                                csi-cephfsplugin-7bkn6                                            1m           62Mi            
core-rook                                csi-cephfsplugin-mwhdk                                            1m           56Mi            
core-rook                                csi-cephfsplugin-provisioner-8ff65f84d-kflt8                      3m           72Mi            
core-rook                                csi-cephfsplugin-provisioner-8ff65f84d-xsmhs                      5m           77Mi            
core-rook                                csi-cephfsplugin-ztfw9                                            1m           62Mi            
core-rook                                csi-rbdplugin-2zqhz                                               1m           81Mi            
core-rook                                csi-rbdplugin-8kv64                                               1m           86Mi            
core-rook                                csi-rbdplugin-fgbwf                                               2m           79Mi            
core-rook                                csi-rbdplugin-provisioner-594f9bb949-7j8vh                        6m           109Mi           
core-rook                                csi-rbdplugin-provisioner-594f9bb949-wc2cs                        4m           140Mi           
core-rook                                csi-rbdplugin-r9xmf                                               1m           64Mi            
core-rook                                csi-rbdplugin-zjp9x                                               1m           77Mi            
core-rook                                rook-ceph-crashcollector-worker-2-tkkxn   0m           7Mi             
core-rook                                rook-ceph-crashcollector-worker-3-crzn5   0m           7Mi             
core-rook                                rook-ceph-crashcollector-worker-4-2vv22   0m           8Mi             
core-rook                                rook-ceph-crashcollector-worker-5-mcklw   0m           7Mi             
core-rook                                rook-ceph-mds-core-rook-a-5996f955fb-wmwzg                        18m          86Mi            
core-rook                                rook-ceph-mds-core-rook-b-c5dc5d6db-wgg42                         39m          43Mi            
core-rook                                rook-ceph-mgr-a-7f899ddd6f-xp2m9                                  31m          381Mi           
core-rook                                rook-ceph-mon-ba-f7b95fb59-9zxkr                                  47m          1608Mi          
core-rook                                rook-ceph-mon-bd-55448d644b-ddx8h                                 68m          1436Mi          
core-rook                                rook-ceph-mon-be-7b6765557c-tsfbt                                 38m          127Mi           
core-rook                                rook-ceph-operator-78564dd996-jwhv2                               57m          115Mi           
core-rook                                rook-ceph-osd-0-7c9b58679f-hz45f                                  95m          8452Mi          
core-rook                                rook-ceph-osd-10-8647487d5b-s6rvq                                 279m         14526Mi         
core-rook                                rook-ceph-osd-11-bc5cf6579-zkrrh                                  278m         12843Mi         
core-rook                                rook-ceph-osd-12-ffdbb5c67-wlmvm                                  209m         13301Mi         
core-rook                                rook-ceph-osd-14-57995cc8bc-7p28w                                 222m         11337Mi         
core-rook                                rook-ceph-osd-16-55b6b479bc-fcvxv                                 155m         12394Mi         
core-rook                                rook-ceph-osd-4-5f857b5467-xxl66                                  147m         13020Mi         
core-rook                                rook-ceph-osd-5-67dfc5c6d6-gs2nn                                  119m         9645Mi          
core-rook                                rook-ceph-osd-6-7fd969c9fb-fnlkz                                  263m         8997Mi          
core-rook                                rook-ceph-osd-7-9fbbb49cf-x8qw4                                   141m         12402Mi         
core-rook                                rook-ceph-osd-8-594496d97b-2trhp                                  247m         12369Mi         
core-rook                                rook-ceph-osd-9-6bfcd45f9c-kxr8j                                  87m          13761Mi         
core-rook                                rook-discover-2kn6p                                               1m           39Mi            
core-rook                                rook-discover-4lpph                                               968m         56Mi            
core-rook                                rook-discover-bmdxz                                               1m           43Mi            
core-rook                                rook-discover-ppnkm                                               1m           63Mi            
core-rook                                rook-discover-sprm8                                               1m           33Mi            

I have several OSDs per node and a total raw capacity of 5.2 TiB. One node was offline for a week or so, so the last rebalancing was very slow and lasted several hours. Is such high memory consumption by the OSDs normal? I have only 64 GB of RAM on each node, and it seems that more than half of it is consumed by the Ceph OSDs.

Environment:

  • OS (e.g. from /etc/os-release): CentOS 7
  • Kernel (e.g. uname -a): Linux worker-5.prod.lwams1.enapter.ninja 5.7.1-1.el7.elrepo.x86_64
  • Cloud provider or hardware configuration: Bare Metal
  • Rook version (use rook version inside of a Rook Pod):
rook version
rook: v1.3.7
go: go1.13.8
  • Storage backend version (e.g. for ceph do ceph -v):
Rook Toolbox shows: ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)
Actually 15.2.4 (per the Ceph Dashboard, on all nodes)
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.5", GitCommit:"e0fccafd69541e3750d460ba0f9743b90336f24f", GitTreeState:"clean", BuildDate:"2020-04-16T11:44:03Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.5", GitCommit:"e0fccafd69541e3750d460ba0f9743b90336f24f", GitTreeState:"clean", BuildDate:"2020-04-16T11:35:47Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Bare Metal + Puppet + Kubeadm
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
HEALTH_WARN Degraded data redundancy: 116638/924012 objects degraded (12.623%), 214 pgs degraded, 214 pgs undersized; 2 daemons have recently crashed

One node is out for maintenance

@OpsPita

OpsPita commented Jul 13, 2020

If it's a new Ceph cluster, then you did something wrong.
I hit this when I started with Rook.
There are two places to change the version:
the operator and cluster.yaml.

Where is the cluster created? Cloud or on-prem?

@Antiarchitect
Author

@OpsPita The cluster is bare metal and was created more than a year ago.

@Antiarchitect
Author

rook-ceph-osd-10-8647487d5b-xz76v                                 865m         36869Mi

That is enormous.

@travisn
Member

travisn commented Jul 15, 2020

@Antiarchitect It's recommended to add resource limits to the OSDs. See the Cluster CR doc on resource limits; there is also an example in cluster.yaml.

@Antiarchitect
Author

@travisn Thank you for the tip - any recommended values?

@travisn
Member

travisn commented Jul 15, 2020

@Antiarchitect The Cluster CR doc mentions that the minimum memory limit is 2G, but a better default is 4G.
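
For context, a minimal sketch of what that could look like in cluster.yaml (the CephCluster CR), assuming the spec.resources layout from the Rook v1.3 Cluster CR docs; names and values here are illustrative, not a recommendation for this particular cluster:

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph   # this issue uses the core-rook namespace instead
spec:
  resources:
    osd:
      requests:
        cpu: "500m"
        memory: "4Gi"
      limits:
        cpu: "2"
        memory: "4Gi"
    mon:
      requests:
        memory: "1Gi"
      limits:
        memory: "2Gi"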

@Antiarchitect
Author

Antiarchitect commented Jul 15, 2020

@travisn The situation has normalized, but 2 of 15 OSDs are hitting this:

  Warning  Unhealthy  4m43s (x2 over 13m)  kubelet, worker-3.xxx.xxx  Liveness probe failed: no valid command found; 10 closest matches:
0
1
2
abort
assert
bluestore allocator dump block
bluestore allocator dump bluefs-db
bluestore allocator fragmentation block
bluestore allocator fragmentation bluefs-db
bluestore allocator score block
admin_socket: invalid command

Original issue: #5814 - will reopen

@travisn
Member

travisn commented Jul 15, 2020

@leseb Thoughts on what would cause the admin_socket to be invalid in the liveness probe?

@Antiarchitect
Author

Also, I've set the resources for mon, mgr, osd, and mds, but I cannot see them in my pod specs. Are they managed differently somehow? By the operator, maybe?

@Antiarchitect
Author

Antiarchitect commented Jul 15, 2020

@travisn @leseb Meanwhile, I'm seeing this in the operator logs:

2020-07-15 23:02:22.612726 I | util: retrying after 1m0s, last error: failed to check if we can stop the deployment rook-ceph-osd-13: failed to check if rook-ceph-osd-13 was ok to stop: deployment rook-ceph-osd-13 cannot be stopped: exit status 16
2020-07-15 23:03:22.612931 I | op-mon: checking if we can stop the deployment rook-ceph-osd-13
2020-07-15 23:03:32.987387 I | util: retrying after 1m0s, last error: failed to check if we can stop the deployment rook-ceph-osd-13: failed to check if rook-ceph-osd-13 was ok to stop: deployment rook-ceph-osd-13 cannot be stopped: exit status 16
2020-07-15 23:04:32.987649 I | op-mon: checking if we can stop the deployment rook-ceph-osd-13
2020-07-15 23:04:46.519241 I | util: retrying after 1m0s, last error: failed to check if we can stop the deployment rook-ceph-osd-13: failed to check if rook-ceph-osd-13 was ok to stop: deployment rook-ceph-osd-13 cannot be stopped: exit status 16
2020-07-15 23:05:46.519431 I | op-mon: checking if we can stop the deployment rook-ceph-osd-13
2020-07-15 23:06:02.601144 I | util: retrying after 1m0s, last error: failed to check if we can stop the deployment rook-ceph-osd-13: failed to check if rook-ceph-osd-13 was ok to stop: deployment rook-ceph-osd-13 cannot be stopped: exit status 16
2020-07-15 23:07:02.601397 I | op-mon: checking if we can stop the deployment rook-ceph-osd-13
2020-07-15 23:07:12.808486 E | op-osd: failed to update osd deployment 13. failed to check if deployment "rook-ceph-osd-13" can be updated. max retries exceeded, last err: exit status 16
deployment rook-ceph-osd-13 cannot be stopped
github.com/rook/rook/pkg/daemon/ceph/client.okToStopDaemon
	/home/rook/go/src/github.com/rook/rook/pkg/daemon/ceph/client/upgrade.go:203
github.com/rook/rook/pkg/daemon/ceph/client.OkToStop
	/home/rook/go/src/github.com/rook/rook/pkg/daemon/ceph/client/upgrade.go:168
github.com/rook/rook/pkg/operator/ceph/cluster/mon.UpdateCephDeploymentAndWait.func1
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/mon/spec.go:308
github.com/rook/rook/pkg/operator/k8sutil.UpdateDeploymentAndWait.func1
	/home/rook/go/src/github.com/rook/rook/pkg/operator/k8sutil/deployment.go:80
github.com/rook/rook/pkg/util.Retry
	/home/rook/go/src/github.com/rook/rook/pkg/util/retry.go:28
github.com/rook/rook/pkg/operator/k8sutil.UpdateDeploymentAndWait
	/home/rook/go/src/github.com/rook/rook/pkg/operator/k8sutil/deployment.go:79
github.com/rook/rook/pkg/operator/ceph/cluster/mon.UpdateCephDeploymentAndWait
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/mon/spec.go:332
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).startOSDDaemonsOnNode
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/osd.go:566
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).handleStatusConfigMapStatus
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/status.go:265
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).checkNodesCompleted
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/status.go:147
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).completeOSDsForAllNodes
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/status.go:164
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).completeProvision
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/status.go:119
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).startProvisioningOverNodes
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/osd.go:409
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).Start
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/osd.go:209
github.com/rook/rook/pkg/operator/ceph/cluster.(*cluster).doOrchestration
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/cluster.go:297
github.com/rook/rook/pkg/operator/ceph/cluster.(*cluster).createInstance
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/cluster.go:226
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).configureLocalCephCluster.func1
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:471
k8s.io/apimachinery/pkg/util/wait.WaitFor
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:434
k8s.io/apimachinery/pkg/util/wait.pollInternal
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:320
k8s.io/apimachinery/pkg/util/wait.Poll
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:314
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).configureLocalCephCluster
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:455
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).initializeCluster
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:522
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).onAdd
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:290
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd
	/home/rook/go/pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/controller.go:198
k8s.io/client-go/tools/cache.newInformer.func1
	/home/rook/go/pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/controller.go:370
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop
	/home/rook/go/pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/delta_fifo.go:422
k8s.io/client-go/tools/cache.(*controller).processLoop
	/home/rook/go/pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/controller.go:153
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:88
k8s.io/client-go/tools/cache.(*controller).Run
	/home/rook/go/pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/controller.go:125
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1357
failed to check if rook-ceph-osd-13 was ok to stop
github.com/rook/rook/pkg/daemon/ceph/client.OkToStop
	/home/rook/go/src/github.com/rook/rook/pkg/daemon/ceph/client/upgrade.go:170
github.com/rook/rook/pkg/operator/ceph/cluster/mon.UpdateCephDeploymentAndWait.func1
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/mon/spec.go:308
github.com/rook/rook/pkg/operator/k8sutil.UpdateDeploymentAndWait.func1
	/home/rook/go/src/github.com/rook/rook/pkg/operator/k8sutil/deployment.go:80
github.com/rook/rook/pkg/util.Retry
	/home/rook/go/src/github.com/rook/rook/pkg/util/retry.go:28
github.com/rook/rook/pkg/operator/k8sutil.UpdateDeploymentAndWait
	/home/rook/go/src/github.com/rook/rook/pkg/operator/k8sutil/deployment.go:79
github.com/rook/rook/pkg/operator/ceph/cluster/mon.UpdateCephDeploymentAndWait
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/mon/spec.go:332
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).startOSDDaemonsOnNode
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/osd.go:566
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).handleStatusConfigMapStatus
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/status.go:265
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).checkNodesCompleted
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/status.go:147
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).completeOSDsForAllNodes
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/status.go:164
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).completeProvision
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/status.go:119
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).startProvisioningOverNodes
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/osd.go:409
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).Start
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/osd.go:209
github.com/rook/rook/pkg/operator/ceph/cluster.(*cluster).doOrchestration
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/cluster.go:297
github.com/rook/rook/pkg/operator/ceph/cluster.(*cluster).createInstance
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/cluster.go:226
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).configureLocalCephCluster.func1
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:471
k8s.io/apimachinery/pkg/util/wait.WaitFor
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:434
k8s.io/apimachinery/pkg/util/wait.pollInternal
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:320
k8s.io/apimachinery/pkg/util/wait.Poll
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:314
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).configureLocalCephCluster
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:455
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).initializeCluster
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:522
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).onAdd
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:290
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd
	/home/rook/go/pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/controller.go:198
k8s.io/client-go/tools/cache.newInformer.func1
	/home/rook/go/pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/controller.go:370
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop
	/home/rook/go/pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/delta_fifo.go:422
k8s.io/client-go/tools/cache.(*controller).processLoop
	/home/rook/go/pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/controller.go:153
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:88
k8s.io/client-go/tools/cache.(*controller).Run
	/home/rook/go/pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/controller.go:125
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1357
failed to check if we can stop the deployment rook-ceph-osd-13
github.com/rook/rook/pkg/operator/ceph/cluster/mon.UpdateCephDeploymentAndWait.func1
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/mon/spec.go:314
github.com/rook/rook/pkg/operator/k8sutil.UpdateDeploymentAndWait.func1
	/home/rook/go/src/github.com/rook/rook/pkg/operator/k8sutil/deployment.go:80
github.com/rook/rook/pkg/util.Retry
	/home/rook/go/src/github.com/rook/rook/pkg/util/retry.go:28
github.com/rook/rook/pkg/operator/k8sutil.UpdateDeploymentAndWait
	/home/rook/go/src/github.com/rook/rook/pkg/operator/k8sutil/deployment.go:79
github.com/rook/rook/pkg/operator/ceph/cluster/mon.UpdateCephDeploymentAndWait
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/mon/spec.go:332
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).startOSDDaemonsOnNode
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/osd.go:566
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).handleStatusConfigMapStatus
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/status.go:265
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).checkNodesCompleted
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/status.go:147
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).completeOSDsForAllNodes
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/status.go:164
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).completeProvision
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/status.go:119
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).startProvisioningOverNodes
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/osd.go:409
github.com/rook/rook/pkg/operator/ceph/cluster/osd.(*Cluster).Start
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/osd.go:209
github.com/rook/rook/pkg/operator/ceph/cluster.(*cluster).doOrchestration
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/cluster.go:297
github.com/rook/rook/pkg/operator/ceph/cluster.(*cluster).createInstance
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/cluster.go:226
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).configureLocalCephCluster.func1
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:471
k8s.io/apimachinery/pkg/util/wait.WaitFor
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:434
k8s.io/apimachinery/pkg/util/wait.pollInternal
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:320
k8s.io/apimachinery/pkg/util/wait.Poll
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:314
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).configureLocalCephCluster
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:455
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).initializeCluster
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:522
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).onAdd
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:290
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd
	/home/rook/go/pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/controller.go:198
k8s.io/client-go/tools/cache.newInformer.func1
	/home/rook/go/pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/controller.go:370
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop
	/home/rook/go/pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/delta_fifo.go:422
k8s.io/client-go/tools/cache.(*controller).processLoop
	/home/rook/go/pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/controller.go:153
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/home/rook/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:88
k8s.io/client-go/tools/cache.(*controller).Run
	/home/rook/go/pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/controller.go:125
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1357

@Antiarchitect
Author

Is it normal that the osd_memory_target value is always exactly the same as my memory resource limit? What about pod/OSD process overhead, etc.? I'm still getting OOMKilled. I tried limits of 4 GB, 8 GB, and 16 GB; only a limit of 32 GB gives a stable result so far. That is sad, as I have 3 OSDs per node and only 64 GB of memory on each node.

@travisn
Member

travisn commented Jul 16, 2020

@Antiarchitect It's very unexpected that you would need over 4GB to stabilize.
@bengland2 What would help troubleshoot why the OSDs are getting OOMKilled? Looks like the cluster is on 15.2.4.

@Antiarchitect
Author

Antiarchitect commented Jul 16, 2020

@travisn It seems that when the cluster is stable the OSDs consume less, but when some OSDs are down the others start to eat memory (though I hadn't seen amounts like this before; the record was 45Gi on one of the OSDs), and it becomes a self-destabilizing process. I added the system-cluster-critical priority class to the OSDs and the other parts of Rook and left a 16Gi limit (it's currently being applied).
The Ceph cluster capacity in my case is ridiculously small: 5 nodes, 6 TiB total, 15 OSDs (3 per node, each BlueStore, each a separate device, but all attached to a single HPSA disk controller), ~120 PGs. We do occasionally experience trouble with the HPSA disk controller; all disks on a node are attached to it (https://bugzilla.kernel.org/show_bug.cgi?id=208215).

P.S. Ceph is one of the most beautiful pieces of software I've ever encountered in practice; it's as tenacious as a cockroach.

@Antiarchitect
Author

The cluster has stabilized with no data loss, and I've updated Rook to 1.3.8. The picture:

rook-ceph-osd-0-7c5c6dd5c4-dfdsm                                  49m          499Mi
rook-ceph-osd-1-64cdcdfd65-m8xdj                                  48m          442Mi
rook-ceph-osd-11-dcb85cf9-c7w7b                                   39m          648Mi
rook-ceph-osd-12-68c7cf9699-wmf9c                                 32m          812Mi
rook-ceph-osd-13-6dbb7477fb-crw2r                                 54m          219Mi
rook-ceph-osd-15-677b4cd9c-lz2dt                                  108m         304Mi
rook-ceph-osd-2-d8c46d685-f5hbj                                   152m         451Mi
rook-ceph-osd-3-bb64b7fb9-nd9nq                                   49m          492Mi
rook-ceph-osd-4-6c7c698976-z6xxj                                  40m          222Mi
rook-ceph-osd-5-58cc767849-qjhg5                                  45m          678Mi
rook-ceph-osd-6-b4d7558bb-ksxct                                   31m          617Mi
rook-ceph-osd-7-5869bb86f4-7khmx                                  79m          356Mi
rook-ceph-osd-8-6f7c9fff8c-wdznr                                  38m          354Mi
rook-ceph-osd-9-784b55766b-zcrdp                                  108m         439Mi

It's scary that this calm picture can turn into a two-night nightmare at any moment.

@bengland2

@Antiarchitect I ran into something like this with Rook 1.3 (not OCS) a couple of days ago when I ran a really intense fio read workload with 4 NVMe devices per host and a 25 GbE link: the OSDs started caching data furiously until the node ran out of memory, at which point I started seeing OOM kills on the OSDs, which in turn made the remaining OSDs work hard to recover during the heavy load. Eventually it recovers, but the point is that to prevent this you need to control caching.

I did this with bluestore_default_buffered_read: false, but Josh Durgin suggested that, instead of that, bluefs_buffered_io can be set to false so that reads are done with O_DIRECT and do not involve the kernel buffer cache. This will slow down BlueStore RocksDB compaction somewhat (per Mark Nelson) but will result in much more stable memory utilization.

You can also ensure that osd_memory_target is set so that the OSD itself limits its in-process memory consumption; I'd suggest > 4 GiB, and set the memory cgroup limit at least 50% higher than osd_memory_target to give the OSDs a chance to avoid being OOM-killed. Make sure transparent hugepages are disabled for Ceph daemons; this should happen automatically in modern versions of Ceph such as the one you are using. If you are using SSD storage, the cost of a cache miss is much lower. Let us know if this makes sense and if it helps. @travisn FYI

@Antiarchitect
Author

Antiarchitect commented Jul 17, 2020

@bengland2 Thank you so much for the explanation! It makes sense, and I will try it. This could become part of BlueStore fine-tuning in the Rook storage config in future releases of Rook.

P.S. We've decided to move the Ceph cluster out to dedicated nodes not managed by Kubernetes and use Rook's external Ceph cluster feature to provide the RBD and CephFS storage types.

@travisn
Member

travisn commented Jul 17, 2020

@bengland2 Thanks for all the input! Rook only sets these two vars depending on the resource requests/limits in the CephCluster CR: POD_MEMORY_LIMIT and POD_MEMORY_REQUEST.

Ceph picks up those env vars in the OSD daemon and sets the limit accordingly. I thought Ceph was setting it to 0.8 of the memory limit, but I don't see that calculation there. Are we missing some other setting that does that calculation?

@Antiarchitect
Author

Antiarchitect commented Jul 17, 2020

@bengland2 @travisn I just ran ceph config show osd.0 and cannot find bluefs_buffered_io in the output (I'm using the official ceph/ceph:v15.2.4-20200630 Docker image). Am I looking in the wrong place? Anyway, if there is a guide on how these low-level options can be set via Rook YAML files, it would be nice to see it. Thank you all once again :)

@Antiarchitect
Author

Found it. It seems to be false by default.

@bengland2

@Antiarchitect So is the RSS in your OSDs growing, or is the kernel buffer cache (inactive pages) growing? If it's the kernel buffer cache, then BlueStore is not doing O_DIRECT. Otherwise, the BlueStore OSD-internal cache must be growing; I think there are counters available for monitoring this via ceph daemon osd.N perf dump (hard to get to with containers). Let's isolate the problem.

To set low-level options (e.g. ceph.conf settings), you can use the ceph_config_overrides ConfigMap; this is documented by rook.io. To change them on the fly, use ceph tell with injectargs, but this isn't guaranteed to work for all parameters. Try dropping the cache manually with echo 1 > /proc/sys/vm/drop_caches after you make changes, so the cache is clean to begin with.
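
For reference, a rough sketch of such an override ConfigMap; the rook-config-override name and the config key follow the Rook documentation for ceph.conf overrides, and the option values are simply the ones discussed in this thread, so verify both against your Rook and Ceph versions before applying:

apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override   # name the Rook operator looks for (per the docs)
  namespace: core-rook         # the cluster namespace used in this issue
data:
  config: |
    [osd]
    # read via O_DIRECT instead of the kernel buffer cache, as suggested above
    bluefs_buffered_io = false
    bluestore_default_buffered_read = false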

@Antiarchitect
Author

From the rook toolbox container running ceph daemon osd.0 perf dump I get:

admin_socket: exception getting command descriptions: [Errno 2] No such file or directory

From the osd.0 container I get {}.

@leseb
Member

leseb commented Jul 20, 2020

From the rook toolbox container running ceph daemon osd.0 perf dump I get:

admin_socket: exception getting command descriptions: [Errno 2] No such file or directory

From the osd.0 container I get {}.

That's because the toolbox does not run any daemon. The ceph daemon command looks for a local admin socket, so you must run it in the OSD container.

@Antiarchitect
Author

Antiarchitect commented Jul 20, 2020

OK, I got it: ceph tell osd.X perf dump and ceph tell osd.X config show from the Rook toolbox :) The cluster is relatively calm at the moment, but here are the dumps and configs of the two most memory-consuming OSDs right now (~2 GB): https://gist.github.com/Antiarchitect/e83ff05b9dc6a65033618588ffea4e70

"bluefs_buffered_io": "false"
"bluestore_default_buffered_read": "true"

@bengland2

and I see

"osd_memory_target": "1073741824",
"osd_memory_target_cgroup_limit_ratio": "0.800000",

What is the resources.memory limit on your OSD pod? osd_memory_target seems to be limiting you to 1 GiB, but I typically never want to see it below 4 GiB; I'm not sure what the lower limit is.

@Antiarchitect
Author

kubectl -n core-rook get pods rook-ceph-osd-12-675595cbfd-xvr87 -ojson | jq .spec.containers[].resources
{
  "limits": {
    "cpu": "2",
    "memory": "32Gi"
  },
  "requests": {
    "cpu": "500m",
    "memory": "1Gi"
  }
}

https://gist.github.com/Antiarchitect/b0b68a463e021e3dabca1e60dff6f924 and new dump

@bengland2

I haven't done any tests where osd_memory_target was set to 1 GiB. If your memory limit is 32 Gi, that means you're willing to give the OSD up to 32 GB of RAM before killing it, so why set osd_memory_target to 1 GiB? Since you have 10 OSDs and 64 GB of RAM, you have enough memory to provide 4 GB of RAM for each OSD, which is what I usually see it set at. Based on the discussion above, since you now avoid using the kernel buffer cache, that should free up memory for the OSDs. I'm not sure what the minimum value of osd_memory_target is, but Ceph has to be able to cache OSD metadata to run efficiently; if you starve it of RAM, it will constantly have to go to RocksDB to fetch this metadata and will, at best, slow down.

@Antiarchitect
Author

I didn't set osd_memory_target manually; everything I've changed so far has been through the Rook OSD resource limits.

@bengland2

Try raising the memory request to 4 GiB and see what happens. A request of 1 GiB is still too low; basically it means "don't schedule the pod on this node unless there is 1 GiB of free memory", and you don't want the OSD running there anyway unless more memory than that is available. In fact, I'd suggest setting the request to 4 GiB and the limit to 6 GiB (i.e. don't OOM-kill it until it reaches the limit), and watching what happens to osd_memory_target. Hopefully you end up with a memory target that is some percentage under the limit, so the Ceph OSD trims its memory usage before the Linux kernel OOM-kills it. Make sense?
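
For reference, a sketch of what that request/limit headroom could look like in the OSD resources stanza of the CephCluster CR (an excerpt only; field names assumed from the Rook Cluster CR docs, values illustrative):

spec:
  resources:
    osd:
      requests:
        memory: "4Gi"   # scheduling guarantee: only place the OSD where 4 GiB is free
      limits:
        memory: "6Gi"   # OOM-kill threshold, ~50% above the request as suggested above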

@Antiarchitect
Author

@bengland2 Thank you for the tip. Here is the result: https://gist.github.com/Antiarchitect/fc1cfd989a58528a33a3c04455a45b3b
and "osd_memory_target": "4294967296", which is exactly 4 GiB.

@Antiarchitect
Author

I will close this issue, as the cluster has been stable for about a week already. If the problem shows up again, I will reopen it. Thank you all for your patience and good advice!
