
ceph mgr dashboard service responds very slowly and often fails #12877

cyxinda opened this issue Sep 11, 2023 · 15 comments

cyxinda commented Sep 11, 2023

I installed Rook v1.12.3 on Kubernetes v1.24.16.
I found that the mgr service occupies a lot of CPU during operation, and the dashboard service responds very slowly and often fails, like this:
Screenshot 2023-09-11 11-02-27
Screenshot 2023-09-11 10-48-00
I tried to increase the mgr's CPU and memory limits:

        resources:
          limits:
            cpu: "4"
            memory: 6Gi
          requests:
            cpu: "2"
            memory: 1Gi

But the effect on the response speed was not obvious.
Of course, in some cases the dashboard service works normally, just with a slow response.
Screenshot 2023-09-11 10-50-39
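For context, a minimal sketch of where such limits sit when set through the Rook CephCluster CR (the cluster name and namespace below are the common defaults and may differ in your setup):

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  resources:
    mgr:
      limits:
        cpu: "4"
        memory: 6Gi
      requests:
        cpu: "2"
        memory: 1Gi
```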

Expected behavior:
I expect the mgr dashboard service not to fail to respond, and the response speed to improve.

Environment:

  • OS (e.g. from /etc/os-release):
[root@sc-neutral-2 ~]# cat  /etc/os-release
NAME="Rocky Linux"
VERSION="9.2 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.2"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.2 (Blue Onyx)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2032-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
ROCKY_SUPPORT_PRODUCT_VERSION="9.2"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.2"
  • Kernel (e.g. uname -a):
[root@sc-neutral-2 ~]# uname -a
Linux sc-neutral-2 5.14.0-284.11.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Tue May 9 17:09:15 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration:
  • Rook version (use rook version inside of a Rook Pod):
    1.12.3
  • Storage backend version (e.g. for ceph do ceph -v):
bash-4.4$ ceph -v
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
  • Kubernetes version (use kubectl version):
[root@sc-neutral-2 ~]# kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.16", GitCommit:"51e33fadff13065ae5518db94e84598293965939", GitTreeState:"clean", BuildDate:"2023-07-19T12:26:21Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.16", GitCommit:"51e33fadff13065ae5518db94e84598293965939", GitTreeState:"clean", BuildDate:"2023-07-19T12:19:24Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
    kubernetes
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
bash-4.4$ ceph health detail
HEALTH_OK
bash-4.4$ ceph -s
  cluster:
    id:     aa0e7bed-d3e3-49axxxxxxxxxxxxxxxxxxxxxxxxxxxx
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,c,d (age 4d)
    mgr: b(active, since 8h), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 36 osds: 36 up (since 74m), 36 in (since 3d)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 265 pgs
    objects: 985.44k objects, 2.6 TiB
    usage:   7.8 TiB used, 123 TiB / 131 TiB avail
    pgs:     265 active+clean
 
  io:
    client:   852 B/s rd, 4.0 KiB/s wr, 1 op/s rd, 0 op/s wr
@cyxinda cyxinda added the bug label Sep 11, 2023
@rkachach
Contributor

@cyxinda are you still able to reproduce this issue? can you please attach the mgr pod logs?


cyxinda commented Sep 26, 2023

I have the nginx log:

```
172.70.100.1 - - [26/Sep/2023:09:31:25 +0000] "GET /api/health/minimal HTTP/2.0" 499 0 "https://ceph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 670 81.350 [rook-ceph-rook-ceph-mgr-dashboard-https-dashboard] [] 172.70.21.20:8443, 172.70.21.20:8443 0, 0 60.294, 21.055 504, - 1a706b2bb99402d18ab0b4fb584e8617
172.70.100.1 - - [26/Sep/2023:09:31:25 +0000] "GET /api/health/minimal HTTP/2.0" 499 0 "https://ceph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 33 126.356 [rook-ceph-rook-ceph-mgr-dashboard-https-dashboard] [] 172.70.21.20:8443, 172.70.21.20:8443, 172.70.21.20:8443 0, 0, 0 60.424, 60.356, 5.577 504, 504, - 5bb8fbe9ff295048ccfb125ab333a403
172.70.100.1 - - [26/Sep/2023:09:31:25 +0000] "GET /api/health/minimal HTTP/2.0" 499 0 "https://ceph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 33 46.363 [rook-ceph-rook-ceph-mgr-dashboard-https-dashboard] [] 172.70.21.20:8443 0 46.362 - d8ec50b231061755afb82253a961bb4d
172.70.100.1 - - [26/Sep/2023:09:31:25 +0000] "GET /api/health/minimal HTTP/2.0" 499 0 "https://ceph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 33 141.355 [rook-ceph-rook-ceph-mgr-dashboard-https-dashboard] [] 172.70.21.20:8443, 172.70.21.20:8443, 172.70.21.20:8443 0, 0, 0 60.059, 60.323, 20.973 504, 504, - fd53fa34173e8cd84cef69fc33832003
172.70.100.1 - - [26/Sep/2023:09:31:25 +0000] "GET /api/health/minimal HTTP/2.0" 499 0 "https://ceph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 33 76.363 [rook-ceph-rook-ceph-mgr-dashboard-https-dashboard] [] 172.70.21.20:8443, 172.70.21.20:8443 0, 0 60.070, 16.292 504, - e951f378651b8d776bd80dc546153953
172.70.100.1 - - [26/Sep/2023:09:31:25 +0000] "GET /api/health/minimal HTTP/2.0" 499 0 "https://ceph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 33 91.359 [rook-ceph-rook-ceph-mgr-dashboard-https-dashboard] [] 172.70.21.20:8443, 172.70.21.20:8443 0, 0 60.109, 31.250 504, - 3f593ba332b964931765a19484e5a5f1
172.70.100.1 - - [26/Sep/2023:09:31:25 +0000] "GET /api/health/minimal HTTP/2.0" 499 0 "https://ceph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 33 61.363 [rook-ceph-rook-ceph-mgr-dashboard-https-dashboard] [] 172.70.21.20:8443, 172.70.21.20:8443 0, 0 60.451, 0.913 504, - 2fe646b3705d6156a8e27e127b55b87e
172.70.100.1 - - [26/Sep/2023:09:31:25 +0000] "GET /api/health/minimal HTTP/2.0" 499 0 "https://ceph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 33 31.366 [rook-ceph-rook-ceph-mgr-dashboard-https-dashboard] [] 172.70.21.20:8443 0 31.366 - f4ce351d32a10d134910bc29a3e2d4e7
172.70.100.1 - - [26/Sep/2023:09:31:30 +0000] "OPTIONS /query?timeout=20s HTTP/2.0" 200 0 "https://ratel-dgraph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 38 0.015 [dgraph111-dgraph-dgraph-alpha-8080] [] 10.244.58.156:8080 0 0.015 200 140eafec89811d52912ccf763fe8bf59
172.70.100.1 - - [26/Sep/2023:09:31:30 +0000] "OPTIONS /health HTTP/2.0" 200 290 "https://ratel-dgraph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 29 0.015 [dgraph111-dgraph-dgraph-alpha-8080] [] 10.244.116.131:8080 290 0.015 200 3f516dd3053be9c3d8dc32ad49f1f82e
172.70.100.1 - - [26/Sep/2023:09:31:30 +0000] "GET /health HTTP/2.0" 200 290 "https://ratel-dgraph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 22 0.033 [dgraph111-dgraph-dgraph-alpha-8080] [] 10.244.48.107:8080 290 0.033 200 b495a6137d3f848b3028178ed7e8df0b
172.70.100.1 - - [26/Sep/2023:09:31:30 +0000] "OPTIONS /state HTTP/2.0" 200 3044 "https://ratel-dgraph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 28 0.036 [dgraph111-dgraph-dgraph-alpha-8080] [] 10.244.48.107:8080 3056 0.035 200 85bdf91ea08afcee6515aa1ac8d13bce
172.70.100.1 - - [26/Sep/2023:09:31:30 +0000] "POST /query?timeout=20s HTTP/2.0" 200 194 "https://ratel-dgraph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 61 0.031 [dgraph111-dgraph-dgraph-alpha-8080] [] 10.244.58.156:8080 194 0.030 200 869b23c66e24d2f54d81c1d424a50295
172.70.100.1 - - [26/Sep/2023:09:31:30 +0000] "GET /health HTTP/2.0" 200 290 "https://ratel-dgraph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 23 0.019 [dgraph111-dgraph-dgraph-alpha-8080] [] 10.244.116.131:8080 290 0.018 200 1b881ecf9557e15cf4792836b48ff767
172.70.100.1 - - [26/Sep/2023:09:31:30 +0000] "GET /state HTTP/2.0" 200 3044 "https://ratel-dgraph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 22 0.036 [dgraph111-dgraph-dgraph-alpha-8080] [] 10.244.48.107:8080 3056 0.035 200 3e43823b800224ddb515a0298098c93f
^C
```
However, there is no abnormality in the mgr service log:
```bash
debug 2023-09-26T09:31:25.409+0000 7f8be7093700  0 [dashboard INFO request] [::ffff:172.70.21.19:39500] [GET] [200] [63.173s] [admin] [10.0K] /api/health/minimal
debug 2023-09-26T09:31:25.558+0000 7f8be6091700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:25.587+0000 7f8be7894700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:25.741+0000 7f8be508f700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:25.866+0000 7f8be5890700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:25.914+0000 7f8be8095700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:26.309+0000 7f8be6892700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:26.353+0000 7f8cdede9700  0 log_channel(cluster) log [DBG] : pgmap v3576: 488 pgs: 31 active+remapped+backfilling, 457 active+clean; 14 TiB data, 44 TiB used, 87 TiB / 131 TiB avail; 847 B/s rd, 357 KiB/s wr, 42 op/s; 5667607/110061723 objects misplaced (5.149%); 11 MiB/s, 26 objects/s recovering
debug 2023-09-26T09:31:27.082+0000 7f8be308b700  0 [rook INFO rook.rook_cluster] MODIFIED event: sc-node-ceph-3
debug 2023-09-26T09:31:27.252+0000 7f8be408d700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:27.493+0000 7f8be488e700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:27.927+0000 7f8be408d700  0 [dashboard INFO request] [::ffff:172.70.21.19:39546] [GET] [200] [64.426s] [admin] [10.0K] /api/health/minimal
debug 2023-09-26T09:31:28.041+0000 7f8be408d700  0 [dashboard INFO request] [::ffff:172.70.21.19:38740] [GET] [200] [0.114s] [admin] [22.0B] /api/prometheus/notifications
debug 2023-09-26T09:31:28.192+0000 7f8be6091700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:28.397+0000 7f8df855e700  0 client.0 ms_handle_reset on v2:172.70.21.20:6802/3380289298
debug 2023-09-26T09:31:28.449+0000 7f8cdede9700  0 log_channel(cluster) log [DBG] : pgmap v3577: 488 pgs: 31 active+remapped+backfilling, 457 active+clean; 14 TiB data, 44 TiB used, 87 TiB / 131 TiB avail; 843 B/s rd, 354 KiB/s wr, 42 op/s; 5667607/110061723 objects misplaced (5.149%); 9.3 MiB/s, 24 objects/s recovering
debug 2023-09-26T09:31:28.677+0000 7f8be7894700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:28.711+0000 7f8be388c700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:28.844+0000 7f8ceedf9700  0 [devicehealth INFO root] Check health
debug 2023-09-26T09:31:28.855+0000 7f8be5890700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:28.899+0000 7f8be7093700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:29.359+0000 7f8be308b700  0 [rook INFO rook.rook_cluster] MODIFIED event: sc-neutral-2
debug 2023-09-26T09:31:29.461+0000 7f8be408d700  0 [dashboard INFO request] [::ffff:172.70.21.19:38744] [GET] [200] [1.176s] [admin] [1.4K] /api/summary
debug 2023-09-26T09:31:29.504+0000 7f8be388c700  0 [dashboard INFO request] [::ffff:172.70.21.19:39528] [GET] [200] [67.048s] [admin] [10.0K] /api/health/minimal
debug 2023-09-26T09:31:29.505+0000 7f8be408d700  0 [dashboard INFO request] [::ffff:172.70.21.19:38748] [GET] [200] [0.043s] [admin] [69.0B] /api/feature_toggles
debug 2023-09-26T09:31:29.549+0000 7f8be508f700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:29.561+0000 7f8be408d700  0 [dashboard INFO request] [::ffff:172.70.21.19:38758] [GET] [200] [0.043s] [admin] [22.0B] /api/prometheus/notifications
debug 2023-09-26T09:31:29.870+0000 7f8be8095700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:29.884+0000 7f8ceb5f2700  0 [prometheus WARNING root] Collecting data took more time than configured scrape interval. This possibly results in stale data. Please check the `stale_cache_strategy` configuration option. Collecting data took 89.60 seconds but scrape interval is configured to be 15 seconds.
debug 2023-09-26T09:31:30.344+0000 7f8be6892700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:30.477+0000 7f8cdede9700  0 log_channel(cluster) log [DBG] : pgmap v3578: 488 pgs: 31 active+remapped+backfilling, 457 active+clean; 14 TiB data, 44 TiB used, 87 TiB / 131 TiB avail; 1.3 KiB/s rd, 547 KiB/s wr, 66 op/s; 5667208/110061723 objects misplaced (5.149%); 12 MiB/s, 36 objects/s recovering
debug 2023-09-26T09:31:30.718+0000 7f8cf05fc700  0 [balancer INFO root] Optimize plan auto_2023-09-26_09:31:30
debug 2023-09-26T09:31:30.718+0000 7f8cf05fc700  0 [balancer INFO root] Mode upmap, max misplaced 0.050000
debug 2023-09-26T09:31:30.718+0000 7f8cf05fc700  0 [balancer INFO root] Too many objects (0.051491 > 0.050000) are misplaced; try again later
debug 2023-09-26T09:31:30.828+0000 7f8bfa7b8700  0 [volumes INFO mgr_util] scanning for idle connections..
debug 2023-09-26T09:31:30.828+0000 7f8bfa7b8700  0 [volumes INFO mgr_util] cleaning up connections: []
debug 2023-09-26T09:31:30.834+0000 7f8c027c8700  0 [volumes INFO mgr_util] scanning for idle connections..
debug 2023-09-26T09:31:30.834+0000 7f8c027c8700  0 [volumes INFO mgr_util] cleaning up connections: []
debug 2023-09-26T09:31:31.041+0000 7f8be488e700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:31.712+0000 7f8be7894700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:31.802+0000 7f8be6091700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:32.061+0000 7f8be5890700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:32.073+0000 7f8c07fd3700  0 [rbd_support INFO root] TrashPurgeScheduleHandler: load_schedules
debug 2023-09-26T09:31:32.081+0000 7f8c07fd3700  0 [rbd_support INFO root] load_schedules: rbd-pool, start_after=
debug 2023-09-26T09:31:32.192+0000 7f8be7093700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:32.345+0000 7f8cdf5ea700  0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
debug 2023-09-26T09:31:32.354+0000 7f8cdf5ea700  0 [rbd_support INFO root] load_schedules: rbd-pool, start_after=
debug 2023-09-26T09:31:32.687+0000 7f8cdede9700  0 log_channel(cluster) log [DBG] : pgmap v3579: 488 pgs: 31 active+remapped+backfilling, 457 active+clean; 14 TiB data, 44 TiB used, 87 TiB / 131 TiB avail; 912 B/s rd, 353 KiB/s wr, 44 op/s; 5667208/110061723 objects misplaced (5.149%); 7.8 MiB/s, 24 objects/s recovering
debug 2023-09-26T09:31:32.705+0000 7f8be308b700  0 [rook INFO rook.rook_cluster] MODIFIED event: sc-node-app-1
debug 2023-09-26T09:31:33.005+0000 7f8be388c700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:33.072+0000 7f8be408d700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:33.170+0000 7f8be508f700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:33.333+0000 7f8be8095700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:34.227+0000 7f8bfd7be700  0 [volumes INFO mgr_util] scanning for idle connections..
debug 2023-09-26T09:31:34.227+0000 7f8bfd7be700  0 [volumes INFO mgr_util] cleaning up connections: []
debug 2023-09-26T09:31:34.558+0000 7f8be6892700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:34.593+0000 7f8be7894700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:34.699+0000 7f8cdede9700  0 log_channel(cluster) log [DBG] : pgmap v3580: 488 pgs: 31 active+remapped+backfilling, 457 active+clean; 14 TiB data, 44 TiB used, 87 TiB / 131 TiB avail; 12 KiB/s rd, 3.2 MiB/s wr, 81 op/s; 5666827/110061723 objects misplaced (5.149%); 13 MiB/s, 36 objects/s recovering
debug 2023-09-26T09:31:34.705+0000 7f8be488e700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:34.921+0000 7f8be5890700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:35.723+0000 7f8be6091700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:36.027+0000 7f8be388c700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:36.590+0000 7f8be7093700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:36.699+0000 7f8cdede9700  0 log_channel(cluster) log [DBG] : pgmap v3581: 488 pgs: 31 active+remapped+backfilling, 457 active+clean; 14 TiB data, 44 TiB used, 87 TiB / 131 TiB avail; 11 KiB/s rd, 3.0 MiB/s wr, 60 op/s; 5666827/110061723 objects misplaced (5.149%); 11 MiB/s, 26 objects/s recovering
debug 2023-09-26T09:31:37.123+0000 7f8be508f700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:37.368+0000 7f8be408d700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:37.542+0000 7f8be7894700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:37.727+0000 7f8be8095700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:38.299+0000 7f8be6892700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:38.333+0000 7f8be488e700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:38.539+0000 7f8be5890700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:38.700+0000 7f8cdede9700  0 log_channel(cluster) log [DBG] : pgmap v3582: 488 pgs: 31 active+remapped+backfilling, 457 active+clean; 14 TiB data, 44 TiB used, 87 TiB / 131 TiB avail; 11 KiB/s rd, 3.0 MiB/s wr, 60 op/s; 5666827/110061723 objects misplaced (5.149%); 8.6 MiB/s, 24 objects/s recovering
debug 2023-09-26T09:31:39.177+0000 7f8be6091700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:39.184+0000 7f8be6892700  0 [dashboard INFO request] [::ffff:172.70.21.19:39562] [GET] [200] [63.946s] [admin] [10.0K] /api/health/minimal
debug 2023-09-26T09:31:39.381+0000 7f8be6892700  0 [dashboard INFO request] [::ffff:172.70.21.19:41838] [GET] [200] [0.195s] [admin] [22.0B] /api/prometheus/notifications
debug 2023-09-26T09:31:39.521+0000 7f8be7093700  0 [dashboard INFO orchestrator] is orchestrator available: True, 
debug 2023-09-26T09:31:39.536+0000 7f8cecdf5700  0 [pg_autoscaler INFO root] _maybe_adjust
debug 2023-09-26T09:31:39.592+0000 7f8cecdf5700  0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 143999989972992
debug 2023-09-26T09:31:39.592+0000 7f8cecdf5700  0 [pg_autoscaler INFO root] Pool '.mgr' root_id -1 using 2.595266847380288e-08 of space, bias 1.0, pg target 3.1143202168563455e-05 quantized to 1 (current 1)
debug 2023-09-26T09:31:39.592+0000 7f8cecdf5700  0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 143999989972992
debug 2023-09-26T09:31:39.592+0000 7f8cecdf5700  0 [pg_autoscaler INFO root] Pool 'rbd-pool' root_id -1 using 0.0006924952425323288 of space, bias 1.0, pg target 0.8309942910387945 quantized to 32 (current 32)
debug 2023-09-26T09:31:39.592+0000 7f8cecdf5700  0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 143999989972992
debug 2023-09-26T09:31:39.592+0000 7f8cecdf5700  0 [pg_autoscaler INFO root] Pool 'cephfs-metadata' root_id -1 using 5.640708726107163e-09 of space, bias 4.0, pg target 2.7060359995378097e-05 quantized to 16 (current 16)
debug 2023-09-26T09:31:39.593+0000 7f8cecdf5700  0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 143999989972992
debug 2023-09-26T09:31:39.593+0000 7f8cecdf5700  0 [pg_autoscaler INFO root] Pool 'cephfs-cephfs' root_id -1 using 3.291666895872016e-12 of space, bias 1.0, pg target 3.947805830449171e-09 quantized to 32 (current 32)
debug 2023-09-26T09:31:39.593+0000 7f8cecdf5700  0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 143999989972992
debug 2023-09-26T09:31:39.593+0000 7f8cecdf5700  0 [pg_autoscaler INFO root] Pool 'knowdee-s3-object-store.rgw.log' root_id -1 using 2.2696605747076718e-08 of space, bias 1.0, pg target 2.722079582599401e-05 quantized to 8 (current 8)
debug 2023-09-26T09:31:39.593+0000 7f8cecdf5700  0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 143999989972992
debug 2023-09-26T09:31:39.593+0000 7f8cecdf5700  0 [pg_autoscaler INFO root] Pool 'knowdee-s3-object-store.rgw.buckets.index' root_id -1 using 0.0003764210182109483 of space, bias 1.0, pg target 0.451454274507664 quantized to 8 (current 8)
debug 2023-09-26T09:31:39.594+0000 7f8cecdf5700  0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 143999989972992
```

> @cyxinda are you still able to reproduce this issue? can you please attach the mgr pod logs?


rkachach commented Sep 26, 2023

@cyxinda mgr logs seem normal... so I'm not sure why it is consuming so much CPU (if that's still the case!)


cyxinda commented Nov 10, 2023

@rkachach @hunter @rcorrear @weekface
Hi,
I noticed that before the dashboard service returns a 504 HTTP code, the CPU, memory, and socket usage of the mgr service soars.
I have not set a CPU limit for the mgr service, and the memory of the mgr service has not reached its upper limit.

    mgr:
      limits:
        memory: "4Gi"
      requests:
        cpu: "2"
        memory: "4Gi"

So I think this may be related to sockets being heavily occupied.
Screenshot 2023-11-10 11-26-51
socket:
Screenshot 2023-11-10 13-35-55
network:
Screenshot 2023-11-10 13-33-13
memory and io:
Screenshot 2023-11-10 13-33-48
cpu:
Screenshot 2023-11-10 13-34-42

I found that there are a lot of connections in the CLOSE-WAIT state in the mgr pod:

```bash
[root@sc-master-1 ~]# kubectl -n rook-ceph exec -it po/rook-ceph-mgr-b-745f46c4f-4pfq7 -- /bin/bash
Defaulted container "mgr" out of: mgr, watch-active, log-collector, chown-container-data-dir (init)
[root@sc-neutral-1 ceph]# ss -t|grep ESTAB|wc -l
163
[root@sc-neutral-1 ceph]# ss -t|grep CLOSE-WAIT|wc -l
3623
```

I don't know why so many CLOSE-WAIT connections exist in the mgr service.
I think these CLOSE-WAIT connections may be the cause of the problem.

After a period of time, the CPU, memory, network, and socket usage all dropped significantly, and the dashboard returned to normal.
Screenshot 2023-11-10 15-26-50
network and socket:
Screenshot 2023-11-10 15-25-05
memory and io:
Screenshot 2023-11-10 15-26-03
cpu usage:
Screenshot 2023-11-10 15-27-50


rkachach commented Nov 10, 2023

@cyxinda

It appears that there might be a situation where someone (potentially a security scanning tool) is initiating a lot of connections to the dashboard but isn't closing them appropriately. Consequently, the CPU and RAM usage spike for a period, leaving behind a considerable number of connections in the CLOSE-WAIT state, likely because the client did not acknowledge the TCP close.

Could you observe your network activity to determine the source of these connections?
Additionally, are you familiar with any security scanning tools that run periodically in your environment?
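
One quick sketch for that (assuming ss, awk, and sed are available inside the mgr container, as the outputs later in this thread suggest) is to group the CLOSE-WAIT sockets by peer IP:

```bash
# list CLOSE-WAIT TCP sockets, keep the peer address column, strip the port,
# then count connections per source IP
ss -tan | awk '$1 == "CLOSE-WAIT" {print $5}' | sed 's/:[0-9]*$//' | sort | uniq -c | sort -rn
```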

@rkachach rkachach self-assigned this Nov 10, 2023

cyxinda commented Nov 14, 2023

> @cyxinda
>
> It appears that there might be a situation where someone (potentially a security scanning tool) is initiating a lot of connections to the dashboard but isn't closing them appropriately. Consequently, the CPU and RAM usage spike for a period, leaving behind a considerable number of connections in the CLOSE-WAIT state, likely because the client did not acknowledge the TCP close.
>
> Could you observe your network activity to determine the source of these connections? Additionally, are you familiar with any security scanning tools that run periodically in your environment?

@rkachach
I expose the dashboard service via nginx-ingress, and I am sure that almost all connections come from nginx. Nobody except me is accessing the dashboard service.
Generally, connections stuck in the CLOSE-WAIT state are caused by the server-side code.
Therefore, I suspect that these CLOSE-WAIT connections may be caused by a bug in the mgr dashboard code.
We can see the CLOSE-WAIT connections in the mgr pod container:

```bash
[root@sc-neutral-2 ceph]# ss -t |grep CLOSE-WAIT
State      Recv-Q  Send-Q         Local Address:Port                  Peer Address:Port        Process
CLOSE-WAIT 1122    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:53496               
CLOSE-WAIT 1       0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:36722               
CLOSE-WAIT 0       0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:37094               
CLOSE-WAIT 1126    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:35446               
CLOSE-WAIT 1       0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:52502               
CLOSE-WAIT 1       0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:34484               
CLOSE-WAIT 1126    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:44228               
CLOSE-WAIT 1102    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:35352               
CLOSE-WAIT 1       0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:49698               
CLOSE-WAIT 1126    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:54000               
CLOSE-WAIT 1126    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:37142               
CLOSE-WAIT 1       0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:52494               
CLOSE-WAIT 1126    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:35360               
CLOSE-WAIT 1102    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:37108               
CLOSE-WAIT 1102    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:37134               
CLOSE-WAIT 1126    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:44254               
CLOSE-WAIT 1       0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:36700               
CLOSE-WAIT 0       0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:52508               
CLOSE-WAIT 1126    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:44236               
CLOSE-WAIT 1126    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:55096               
CLOSE-WAIT 1       0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:34494               
CLOSE-WAIT 1126    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:55184               
CLOSE-WAIT 1126    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:47170               
CLOSE-WAIT 1102    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:53486               
CLOSE-WAIT 1134    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:35456               
CLOSE-WAIT 1126    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:54024               
CLOSE-WAIT 1126    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:47140               
CLOSE-WAIT 1       0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:34510               
CLOSE-WAIT 1126    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:37122               
CLOSE-WAIT 1122    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:37112               
CLOSE-WAIT 1126    0      [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:44190
```

My nginx-controller pod's address is 172.70.21.19.

@rkachach
Contributor

@cyxinda I see. After talking to some dashboard experts on the team, they pointed me to this open issue: https://tracker.ceph.com/issues/55837. As I understood from them, there are some bugs in cherrypy's cheroot that can lead to a situation where all the threads in the worker pool are consumed because of bad error handling. These problems only appear when SSL is enabled. In your case, since you have nginx in front, you can try to disable SSL on the dashboard side and see whether the problem persists.
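
In a Rook-managed cluster, a minimal sketch of one way to try this is the dashboard section of the CephCluster CR (field names follow the usual Rook spec; the nginx ingress in front would also need to point at the dashboard's plain-HTTP port and protocol):

```yaml
spec:
  dashboard:
    enabled: true
    ssl: false    # serve the dashboard over plain HTTP behind the nginx ingress
    # port: 7000  # optional override; check the rook-ceph-mgr-dashboard Service for the port actually exposed
```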


cyxinda commented Nov 14, 2023

@rkachach
Hi, I'm very happy about your answer.
I changed the dashboard from HTTPS to HTTP in the way you suggested.
I haven't seen a surge in the number of socket connections inside the dashboard pod for almost an hour.
The number of socket connections has remained stable, and there are no more 504 errors on the front end.
So the above basically confirms your explanation that the problem occurs when SSL is enabled.
Screenshot 2023-11-14 18-57-28


rkachach commented Nov 14, 2023

@cyxinda glad that turning off SSL solved your issue. Everything indicates that you were running into the same issue as reported in the Ceph tracker I posted previously. Unfortunately it is still open, and right now there is no solution for it, as the problem is not in the Ceph dashboard code but rather a bug in a library (cherrypy) that the dashboard relies on.


cyxinda commented Nov 15, 2023

> @cyxinda glad that turning off SSL solved your issue. Everything indicates that you were running into the same issue as reported in the Ceph tracker I posted previously. Unfortunately it is still open, and right now there is no solution for it, as the problem is not in the Ceph dashboard code but rather a bug in a library (cherrypy) that the dashboard relies on.

OK, I will close this issue for the reason above.
Thanks.

@cyxinda cyxinda closed this as not planned Nov 15, 2023
@rkachach rkachach reopened this Jan 10, 2024
@rkachach
Contributor

@cyxinda reopening, as the dashboard team is trying to address the SSL issue with this PR: ceph/ceph#55108


cyxinda commented Jan 23, 2024

> @cyxinda reopening, as the dashboard team is trying to address the SSL issue with this PR: ceph/ceph#55108

I'm glad that the community can resolve this issue, and I'm waiting for the further solution.
There is another piece of good news: the cause of the mgr service occupying a lot of CPU during operation seems to have been found. I solved that problem by setting the containerd daemon's LimitNOFILE parameter to 1048576.
This comes from two issues in the rook community: issue 11253 and issue 10110.
I changed the parameter like this:

```
[root@sc-midware-2 ~]# vim /usr/local/lib/systemd/system/containerd.service;
# Copyright The containerd Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target

[Service]
#uncomment to enable the experimental sbservice (sandboxed) version of containerd/cri integration
#Environment="ENABLE_CRI_SANDBOXES=sandboxed"
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/containerd

Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
#LimitNOFILE=infinity
LimitNOFILE=1048576
# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target
```

Then I restarted the containerd daemon on all the k8s nodes:

```bash
[root@sc-midware-2 ~]# systemctl daemon-reload && systemctl restart containerd && sleep 2s && systemctl status containerd
● containerd.service - containerd container runtime
     Loaded: loaded (/usr/local/lib/systemd/system/containerd.service; enabled; preset: disabled)
     Active: active (running) since Tue 2024-01-23 02:45:15 EST; 2s ago
       Docs: https://containerd.io
    Process: 2664780 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 2664781 (containerd)
      Tasks: 893
     Memory: 4.1G
        CPU: 2.751s
     CGroup: /system.slice/containerd.service

I restarted the mgr pod and the osd pods. Comparing the CPU usage before and after the modification, the improvement is obvious for these processes.
Screenshot 2024-01-23 14-56-53
Screenshot 2024-01-23 15-38-00
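
A quick way to confirm the higher limit actually reached the Ceph daemons (a sketch; the deployment and container names follow the mgr pod shown earlier in this thread, adjust them to your cluster):

```bash
# print the effective open-files limit of the mgr container's main process
kubectl -n rook-ceph exec deploy/rook-ceph-mgr-b -c mgr -- \
  sh -c 'grep "open files" /proc/1/limits'
```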


This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.


This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

@github-actions github-actions bot closed this as not planned Mar 30, 2024
@rkachach rkachach reopened this Apr 23, 2024

rkachach commented Apr 23, 2024

Reopening this just to keep track of it. The original Ceph dashboard issue was in fact caused by some cheroot issues (cherrypy depends on cheroot) that should now be solved in this release: https://github.com/cherrypy/cheroot/releases/tag/v10.0.1
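
To check whether a given Ceph container already ships the fixed library, one option (a sketch; it assumes the cheroot module in that image exposes a version attribute) is to ask Python inside the mgr container:

```bash
# print the cheroot version bundled in the mgr container; 10.0.1 or later should contain the fix
kubectl -n rook-ceph exec deploy/rook-ceph-mgr-b -c mgr -- \
  python3 -c "import cheroot; print(cheroot.__version__)"
```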

@github-actions github-actions bot removed the wontfix label Apr 24, 2024