Ceph mgr dashboard service responds very slowly and often fails #12877
Comments
@cyxinda are you still able to reproduce this issue? Can you please attach the mgr pod logs?
Here is my nginx log:

```bash
172.70.100.1 - - [26/Sep/2023:09:31:25 +0000] "GET /api/health/minimal HTTP/2.0" 499 0 "https://ceph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 670 81.350 [rook-ceph-rook-ceph-mgr-dashboard-https-dashboard] [] 172.70.21.20:8443, 172.70.21.20:8443 0, 0 60.294, 21.055 504, - 1a706b2bb99402d18ab0b4fb584e8617
172.70.100.1 - - [26/Sep/2023:09:31:25 +0000] "GET /api/health/minimal HTTP/2.0" 499 0 "https://ceph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 33 126.356 [rook-ceph-rook-ceph-mgr-dashboard-https-dashboard] [] 172.70.21.20:8443, 172.70.21.20:8443, 172.70.21.20:8443 0, 0, 0 60.424, 60.356, 5.577 504, 504, - 5bb8fbe9ff295048ccfb125ab333a403
172.70.100.1 - - [26/Sep/2023:09:31:25 +0000] "GET /api/health/minimal HTTP/2.0" 499 0 "https://ceph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 33 46.363 [rook-ceph-rook-ceph-mgr-dashboard-https-dashboard] [] 172.70.21.20:8443 0 46.362 - d8ec50b231061755afb82253a961bb4d
172.70.100.1 - - [26/Sep/2023:09:31:25 +0000] "GET /api/health/minimal HTTP/2.0" 499 0 "https://ceph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 33 141.355 [rook-ceph-rook-ceph-mgr-dashboard-https-dashboard] [] 172.70.21.20:8443, 172.70.21.20:8443, 172.70.21.20:8443 0, 0, 0 60.059, 60.323, 20.973 504, 504, - fd53fa34173e8cd84cef69fc33832003
172.70.100.1 - - [26/Sep/2023:09:31:25 +0000] "GET /api/health/minimal HTTP/2.0" 499 0 "https://ceph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 33 76.363 [rook-ceph-rook-ceph-mgr-dashboard-https-dashboard] [] 172.70.21.20:8443, 172.70.21.20:8443 0, 0 60.070, 16.292 504, - e951f378651b8d776bd80dc546153953
172.70.100.1 - - [26/Sep/2023:09:31:25 +0000] "GET /api/health/minimal HTTP/2.0" 499 0 "https://ceph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 33 91.359 [rook-ceph-rook-ceph-mgr-dashboard-https-dashboard] [] 172.70.21.20:8443, 172.70.21.20:8443 0, 0 60.109, 31.250 504, - 3f593ba332b964931765a19484e5a5f1
172.70.100.1 - - [26/Sep/2023:09:31:25 +0000] "GET /api/health/minimal HTTP/2.0" 499 0 "https://ceph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 33 61.363 [rook-ceph-rook-ceph-mgr-dashboard-https-dashboard] [] 172.70.21.20:8443, 172.70.21.20:8443 0, 0 60.451, 0.913 504, - 2fe646b3705d6156a8e27e127b55b87e
172.70.100.1 - - [26/Sep/2023:09:31:25 +0000] "GET /api/health/minimal HTTP/2.0" 499 0 "https://ceph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 33 31.366 [rook-ceph-rook-ceph-mgr-dashboard-https-dashboard] [] 172.70.21.20:8443 0 31.366 - f4ce351d32a10d134910bc29a3e2d4e7
172.70.100.1 - - [26/Sep/2023:09:31:30 +0000] "OPTIONS /query?timeout=20s HTTP/2.0" 200 0 "https://ratel-dgraph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 38 0.015 [dgraph111-dgraph-dgraph-alpha-8080] [] 10.244.58.156:8080 0 0.015 200 140eafec89811d52912ccf763fe8bf59
172.70.100.1 - - [26/Sep/2023:09:31:30 +0000] "OPTIONS /health HTTP/2.0" 200 290 "https://ratel-dgraph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 29 0.015 [dgraph111-dgraph-dgraph-alpha-8080] [] 10.244.116.131:8080 290 0.015 200 3f516dd3053be9c3d8dc32ad49f1f82e
172.70.100.1 - - [26/Sep/2023:09:31:30 +0000] "GET /health HTTP/2.0" 200 290 "https://ratel-dgraph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 22 0.033 [dgraph111-dgraph-dgraph-alpha-8080] [] 10.244.48.107:8080 290 0.033 200 b495a6137d3f848b3028178ed7e8df0b
172.70.100.1 - - [26/Sep/2023:09:31:30 +0000] "OPTIONS /state HTTP/2.0" 200 3044 "https://ratel-dgraph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 28 0.036 [dgraph111-dgraph-dgraph-alpha-8080] [] 10.244.48.107:8080 3056 0.035 200 85bdf91ea08afcee6515aa1ac8d13bce
172.70.100.1 - - [26/Sep/2023:09:31:30 +0000] "POST /query?timeout=20s HTTP/2.0" 200 194 "https://ratel-dgraph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 61 0.031 [dgraph111-dgraph-dgraph-alpha-8080] [] 10.244.58.156:8080 194 0.030 200 869b23c66e24d2f54d81c1d424a50295
172.70.100.1 - - [26/Sep/2023:09:31:30 +0000] "GET /health HTTP/2.0" 200 290 "https://ratel-dgraph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 23 0.019 [dgraph111-dgraph-dgraph-alpha-8080] [] 10.244.116.131:8080 290 0.018 200 1b881ecf9557e15cf4792836b48ff767
172.70.100.1 - - [26/Sep/2023:09:31:30 +0000] "GET /state HTTP/2.0" 200 3044 "https://ratel-dgraph.knowdee.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.40" 22 0.036 [dgraph111-dgraph-dgraph-alpha-8080] [] 10.244.48.107:8080 3056 0.035 200 3e43823b800224ddb515a0298098c93f
```
However, there is no abnormality in the mgr service log:
```bash
debug 2023-09-26T09:31:25.409+0000 7f8be7093700 0 [dashboard INFO request] [::ffff:172.70.21.19:39500] [GET] [200] [63.173s] [admin] [10.0K] /api/health/minimal
debug 2023-09-26T09:31:25.558+0000 7f8be6091700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:25.587+0000 7f8be7894700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:25.741+0000 7f8be508f700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:25.866+0000 7f8be5890700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:25.914+0000 7f8be8095700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:26.309+0000 7f8be6892700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:26.353+0000 7f8cdede9700 0 log_channel(cluster) log [DBG] : pgmap v3576: 488 pgs: 31 active+remapped+backfilling, 457 active+clean; 14 TiB data, 44 TiB used, 87 TiB / 131 TiB avail; 847 B/s rd, 357 KiB/s wr, 42 op/s; 5667607/110061723 objects misplaced (5.149%); 11 MiB/s, 26 objects/s recovering
debug 2023-09-26T09:31:27.082+0000 7f8be308b700 0 [rook INFO rook.rook_cluster] MODIFIED event: sc-node-ceph-3
debug 2023-09-26T09:31:27.252+0000 7f8be408d700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:27.493+0000 7f8be488e700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:27.927+0000 7f8be408d700 0 [dashboard INFO request] [::ffff:172.70.21.19:39546] [GET] [200] [64.426s] [admin] [10.0K] /api/health/minimal
debug 2023-09-26T09:31:28.041+0000 7f8be408d700 0 [dashboard INFO request] [::ffff:172.70.21.19:38740] [GET] [200] [0.114s] [admin] [22.0B] /api/prometheus/notifications
debug 2023-09-26T09:31:28.192+0000 7f8be6091700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:28.397+0000 7f8df855e700 0 client.0 ms_handle_reset on v2:172.70.21.20:6802/3380289298
debug 2023-09-26T09:31:28.449+0000 7f8cdede9700 0 log_channel(cluster) log [DBG] : pgmap v3577: 488 pgs: 31 active+remapped+backfilling, 457 active+clean; 14 TiB data, 44 TiB used, 87 TiB / 131 TiB avail; 843 B/s rd, 354 KiB/s wr, 42 op/s; 5667607/110061723 objects misplaced (5.149%); 9.3 MiB/s, 24 objects/s recovering
debug 2023-09-26T09:31:28.677+0000 7f8be7894700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:28.711+0000 7f8be388c700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:28.844+0000 7f8ceedf9700 0 [devicehealth INFO root] Check health
debug 2023-09-26T09:31:28.855+0000 7f8be5890700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:28.899+0000 7f8be7093700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:29.359+0000 7f8be308b700 0 [rook INFO rook.rook_cluster] MODIFIED event: sc-neutral-2
debug 2023-09-26T09:31:29.461+0000 7f8be408d700 0 [dashboard INFO request] [::ffff:172.70.21.19:38744] [GET] [200] [1.176s] [admin] [1.4K] /api/summary
debug 2023-09-26T09:31:29.504+0000 7f8be388c700 0 [dashboard INFO request] [::ffff:172.70.21.19:39528] [GET] [200] [67.048s] [admin] [10.0K] /api/health/minimal
debug 2023-09-26T09:31:29.505+0000 7f8be408d700 0 [dashboard INFO request] [::ffff:172.70.21.19:38748] [GET] [200] [0.043s] [admin] [69.0B] /api/feature_toggles
debug 2023-09-26T09:31:29.549+0000 7f8be508f700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:29.561+0000 7f8be408d700 0 [dashboard INFO request] [::ffff:172.70.21.19:38758] [GET] [200] [0.043s] [admin] [22.0B] /api/prometheus/notifications
debug 2023-09-26T09:31:29.870+0000 7f8be8095700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:29.884+0000 7f8ceb5f2700 0 [prometheus WARNING root] Collecting data took more time than configured scrape interval. This possibly results in stale data. Please check the `stale_cache_strategy` configuration option. Collecting data took 89.60 seconds but scrape interval is configured to be 15 seconds.
debug 2023-09-26T09:31:30.344+0000 7f8be6892700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:30.477+0000 7f8cdede9700 0 log_channel(cluster) log [DBG] : pgmap v3578: 488 pgs: 31 active+remapped+backfilling, 457 active+clean; 14 TiB data, 44 TiB used, 87 TiB / 131 TiB avail; 1.3 KiB/s rd, 547 KiB/s wr, 66 op/s; 5667208/110061723 objects misplaced (5.149%); 12 MiB/s, 36 objects/s recovering
debug 2023-09-26T09:31:30.718+0000 7f8cf05fc700 0 [balancer INFO root] Optimize plan auto_2023-09-26_09:31:30
debug 2023-09-26T09:31:30.718+0000 7f8cf05fc700 0 [balancer INFO root] Mode upmap, max misplaced 0.050000
debug 2023-09-26T09:31:30.718+0000 7f8cf05fc700 0 [balancer INFO root] Too many objects (0.051491 > 0.050000) are misplaced; try again later
debug 2023-09-26T09:31:30.828+0000 7f8bfa7b8700 0 [volumes INFO mgr_util] scanning for idle connections..
debug 2023-09-26T09:31:30.828+0000 7f8bfa7b8700 0 [volumes INFO mgr_util] cleaning up connections: []
debug 2023-09-26T09:31:30.834+0000 7f8c027c8700 0 [volumes INFO mgr_util] scanning for idle connections..
debug 2023-09-26T09:31:30.834+0000 7f8c027c8700 0 [volumes INFO mgr_util] cleaning up connections: []
debug 2023-09-26T09:31:31.041+0000 7f8be488e700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:31.712+0000 7f8be7894700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:31.802+0000 7f8be6091700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:32.061+0000 7f8be5890700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:32.073+0000 7f8c07fd3700 0 [rbd_support INFO root] TrashPurgeScheduleHandler: load_schedules
debug 2023-09-26T09:31:32.081+0000 7f8c07fd3700 0 [rbd_support INFO root] load_schedules: rbd-pool, start_after=
debug 2023-09-26T09:31:32.192+0000 7f8be7093700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:32.345+0000 7f8cdf5ea700 0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
debug 2023-09-26T09:31:32.354+0000 7f8cdf5ea700 0 [rbd_support INFO root] load_schedules: rbd-pool, start_after=
debug 2023-09-26T09:31:32.687+0000 7f8cdede9700 0 log_channel(cluster) log [DBG] : pgmap v3579: 488 pgs: 31 active+remapped+backfilling, 457 active+clean; 14 TiB data, 44 TiB used, 87 TiB / 131 TiB avail; 912 B/s rd, 353 KiB/s wr, 44 op/s; 5667208/110061723 objects misplaced (5.149%); 7.8 MiB/s, 24 objects/s recovering
debug 2023-09-26T09:31:32.705+0000 7f8be308b700 0 [rook INFO rook.rook_cluster] MODIFIED event: sc-node-app-1
debug 2023-09-26T09:31:33.005+0000 7f8be388c700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:33.072+0000 7f8be408d700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:33.170+0000 7f8be508f700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:33.333+0000 7f8be8095700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:34.227+0000 7f8bfd7be700 0 [volumes INFO mgr_util] scanning for idle connections..
debug 2023-09-26T09:31:34.227+0000 7f8bfd7be700 0 [volumes INFO mgr_util] cleaning up connections: []
debug 2023-09-26T09:31:34.558+0000 7f8be6892700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:34.593+0000 7f8be7894700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:34.699+0000 7f8cdede9700 0 log_channel(cluster) log [DBG] : pgmap v3580: 488 pgs: 31 active+remapped+backfilling, 457 active+clean; 14 TiB data, 44 TiB used, 87 TiB / 131 TiB avail; 12 KiB/s rd, 3.2 MiB/s wr, 81 op/s; 5666827/110061723 objects misplaced (5.149%); 13 MiB/s, 36 objects/s recovering
debug 2023-09-26T09:31:34.705+0000 7f8be488e700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:34.921+0000 7f8be5890700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:35.723+0000 7f8be6091700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:36.027+0000 7f8be388c700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:36.590+0000 7f8be7093700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:36.699+0000 7f8cdede9700 0 log_channel(cluster) log [DBG] : pgmap v3581: 488 pgs: 31 active+remapped+backfilling, 457 active+clean; 14 TiB data, 44 TiB used, 87 TiB / 131 TiB avail; 11 KiB/s rd, 3.0 MiB/s wr, 60 op/s; 5666827/110061723 objects misplaced (5.149%); 11 MiB/s, 26 objects/s recovering
debug 2023-09-26T09:31:37.123+0000 7f8be508f700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:37.368+0000 7f8be408d700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:37.542+0000 7f8be7894700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:37.727+0000 7f8be8095700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:38.299+0000 7f8be6892700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:38.333+0000 7f8be488e700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:38.539+0000 7f8be5890700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:38.700+0000 7f8cdede9700 0 log_channel(cluster) log [DBG] : pgmap v3582: 488 pgs: 31 active+remapped+backfilling, 457 active+clean; 14 TiB data, 44 TiB used, 87 TiB / 131 TiB avail; 11 KiB/s rd, 3.0 MiB/s wr, 60 op/s; 5666827/110061723 objects misplaced (5.149%); 8.6 MiB/s, 24 objects/s recovering
debug 2023-09-26T09:31:39.177+0000 7f8be6091700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:39.184+0000 7f8be6892700 0 [dashboard INFO request] [::ffff:172.70.21.19:39562] [GET] [200] [63.946s] [admin] [10.0K] /api/health/minimal
debug 2023-09-26T09:31:39.381+0000 7f8be6892700 0 [dashboard INFO request] [::ffff:172.70.21.19:41838] [GET] [200] [0.195s] [admin] [22.0B] /api/prometheus/notifications
debug 2023-09-26T09:31:39.521+0000 7f8be7093700 0 [dashboard INFO orchestrator] is orchestrator available: True,
debug 2023-09-26T09:31:39.536+0000 7f8cecdf5700 0 [pg_autoscaler INFO root] _maybe_adjust
debug 2023-09-26T09:31:39.592+0000 7f8cecdf5700 0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 143999989972992
debug 2023-09-26T09:31:39.592+0000 7f8cecdf5700 0 [pg_autoscaler INFO root] Pool '.mgr' root_id -1 using 2.595266847380288e-08 of space, bias 1.0, pg target 3.1143202168563455e-05 quantized to 1 (current 1)
debug 2023-09-26T09:31:39.592+0000 7f8cecdf5700 0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 143999989972992
debug 2023-09-26T09:31:39.592+0000 7f8cecdf5700 0 [pg_autoscaler INFO root] Pool 'rbd-pool' root_id -1 using 0.0006924952425323288 of space, bias 1.0, pg target 0.8309942910387945 quantized to 32 (current 32)
debug 2023-09-26T09:31:39.592+0000 7f8cecdf5700 0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 143999989972992
debug 2023-09-26T09:31:39.592+0000 7f8cecdf5700 0 [pg_autoscaler INFO root] Pool 'cephfs-metadata' root_id -1 using 5.640708726107163e-09 of space, bias 4.0, pg target 2.7060359995378097e-05 quantized to 16 (current 16)
debug 2023-09-26T09:31:39.593+0000 7f8cecdf5700 0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 143999989972992
debug 2023-09-26T09:31:39.593+0000 7f8cecdf5700 0 [pg_autoscaler INFO root] Pool 'cephfs-cephfs' root_id -1 using 3.291666895872016e-12 of space, bias 1.0, pg target 3.947805830449171e-09 quantized to 32 (current 32)
debug 2023-09-26T09:31:39.593+0000 7f8cecdf5700 0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 143999989972992
debug 2023-09-26T09:31:39.593+0000 7f8cecdf5700 0 [pg_autoscaler INFO root] Pool 'knowdee-s3-object-store.rgw.log' root_id -1 using 2.2696605747076718e-08 of space, bias 1.0, pg target 2.722079582599401e-05 quantized to 8 (current 8)
debug 2023-09-26T09:31:39.593+0000 7f8cecdf5700 0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 143999989972992
debug 2023-09-26T09:31:39.593+0000 7f8cecdf5700 0 [pg_autoscaler INFO root] Pool 'knowdee-s3-object-store.rgw.buckets.index' root_id -1 using 0.0003764210182109483 of space, bias 1.0, pg target 0.451454274507664 quantized to 8 (current 8)
debug 2023-09-26T09:31:39.594+0000 7f8cecdf5700 0 [pg_autoscaler INFO root] effective_target_ratio 0.0 0.0 0 143999989972992
```
@cyxinda mgr logs seem normal... so I'm not sure why it is consuming so much CPU (if that's still the case!)
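Not sure if it helps, but here is a minimal sketch of how I would narrow down where the CPU goes. It assumes metrics-server is installed (for `kubectl top`), that the mgr pods carry the usual `app=rook-ceph-mgr` label, and that `top` is present in the mgr image:

```bash
# Per-pod CPU/memory as reported by metrics-server.
kubectl -n rook-ceph top pod -l app=rook-ceph-mgr

# One per-thread CPU snapshot inside the mgr container (batch mode, single iteration).
MGR_POD=$(kubectl -n rook-ceph get pod -l app=rook-ceph-mgr -o jsonpath='{.items[0].metadata.name}')
kubectl -n rook-ceph exec "$MGR_POD" -c mgr -- top -b -H -n 1 | head -40
```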
@rkachach @hunter @rcorrear @weekface My mgr resource settings are:

```yaml
mgr:
  limits:
    memory: "4Gi"
  requests:
    cpu: "2"
    memory: "4Gi"
```

So I think this may be related to the sockets being heavily occupied. I found that there are a lot of connections in CLOSE-WAIT status in the mgr pod:

```bash
[root@sc-master-1 ~]# kubectl -n rook-ceph exec -it po/rook-ceph-mgr-b-745f46c4f-4pfq7 -- /bin/bash
Defaulted container "mgr" out of: mgr, watch-active, log-collector, chown-container-data-dir (init)
[root@sc-neutral-1 ceph]# ss -t|grep ESTAB|wc -l
163
[root@sc-neutral-1 ceph]# ss -t|grep CLOSE-WAIT|wc -l
3623
```

I don't know why so many CLOSE-WAIT connections exist in the mgr service. After a period of time, CPU, memory, network transmission, and socket usage all dropped significantly, and the dashboard returned to normal.
It appears that there might be a situation where someone (potentially a security scanning tool) is initiating a lot of connections to the dashboard but isn't closing them appropriately. Consequently, the CPU and RAM usage spike for a period, leaving behind a considerable number of connections in the CLOSE-WAIT state, likely because the client did not acknowledge the TCP close. Could you observe your network activity to determine the source of these connections?
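A hedged sketch of how the peers could be identified from inside the mgr pod (it assumes `ss`, `awk`, and `sed` are available there, as the outputs above suggest):

```bash
# List CLOSE-WAIT sockets, strip the port from the peer address, and count
# connections per peer to see which client is leaving them half-closed.
ss -tn state close-wait | awk 'NR>1 {print $4}' | sed 's/:[^:]*$//' | sort | uniq -c | sort -rn
```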
@rkachach

```bash
[root@sc-neutral-2 ceph]# ss -t |grep CLOSE-WAIT
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
CLOSE-WAIT 1122 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:53496
CLOSE-WAIT 1 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:36722
CLOSE-WAIT 0 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:37094
CLOSE-WAIT 1126 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:35446
CLOSE-WAIT 1 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:52502
CLOSE-WAIT 1 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:34484
CLOSE-WAIT 1126 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:44228
CLOSE-WAIT 1102 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:35352
CLOSE-WAIT 1 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:49698
CLOSE-WAIT 1126 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:54000
CLOSE-WAIT 1126 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:37142
CLOSE-WAIT 1 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:52494
CLOSE-WAIT 1126 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:35360
CLOSE-WAIT 1102 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:37108
CLOSE-WAIT 1102 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:37134
CLOSE-WAIT 1126 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:44254
CLOSE-WAIT 1 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:36700
CLOSE-WAIT 0 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:52508
CLOSE-WAIT 1126 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:44236
CLOSE-WAIT 1126 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:55096
CLOSE-WAIT 1 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:34494
CLOSE-WAIT 1126 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:55184
CLOSE-WAIT 1126 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:47170
CLOSE-WAIT 1102 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:53486
CLOSE-WAIT 1134 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:35456
CLOSE-WAIT 1126 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:54024
CLOSE-WAIT 1126 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:47140
CLOSE-WAIT 1 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:34510
CLOSE-WAIT 1126 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:37122
CLOSE-WAIT 1122 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:37112
CLOSE-WAIT 1126 0 [::ffff:172.70.21.20]:pcsync-https [::ffff:172.70.21.19]:44190
```

My nginx-controller service pod's address is 172.70.21.19.
@cyxinda I see. Talking to some dashboard experts on the team, they pointed me to this open issue: https://tracker.ceph.com/issues/55837. As I understood from them, there are some bugs in the CherryPy/cheroot stack that can lead to a situation where all the threads in the worker pool are consumed because of bad error handling. These problems only appear when SSL is enabled. In your case, since you have nginx in front, you can try to disable SSL on the dashboard side and see whether the problem persists.
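For reference, a minimal sketch of disabling dashboard SSL, assuming you run these from the Rook toolbox and that nginx keeps terminating TLS in front:

```bash
# Turn off TLS inside the dashboard module; nginx in front keeps serving HTTPS.
ceph config set mgr mgr/dashboard/ssl false
# With SSL off the dashboard serves plain HTTP; 8080 is the usual default port.
ceph config set mgr mgr/dashboard/server_port 8080
# Restart the dashboard module so the new settings take effect.
ceph mgr module disable dashboard
ceph mgr module enable dashboard
```

If I remember the CRD correctly, in a Rook cluster the CephCluster CR also exposes `spec.dashboard.ssl`, which may be the cleaner place to change this since the operator manages the dashboard service and port.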
@rkachach
@cyxinda Glad that turning off SSL solved your issue. So everything indicates that you were running into the same issue as reported in the Ceph tracker I posted previously. Unfortunately it is still open, and right now there is no solution for it, as the problem is not in the Ceph dashboard code but rather a bug in a library (CherryPy) that the dashboard relies on.
OK, I will close this issue for the reason above.
@cyxinda Reopening, as the dashboard team is trying to address the SSL issue with this PR: ceph/ceph#55108
I'm glad that the community can resolve this issue, and I'm waiting for a further solution. In the meantime, I edited the containerd systemd unit on every node of the k8s cluster:

```bash
[root@sc-midware-2 ~]# vim /usr/local/lib/systemd/system/containerd.service
```

```ini
# Copyright The containerd Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target
[Service]
#uncomment to enable the experimental sbservice (sandboxed) version of containerd/cri integration
#Environment="ENABLE_CRI_SANDBOXES=sandboxed"
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/containerd
Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
#LimitNOFILE=infinity
LimitNOFILE=1048576
# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999
[Install]
WantedBy=multi-user.target
```

Then I restarted the containerd daemon on every node of the k8s cluster:

```bash
[root@sc-midware-2 ~]# systemctl daemon-reload && systemctl restart containerd && sleep 2s && systemctl status containerd
● containerd.service - containerd container runtime
Loaded: loaded (/usr/local/lib/systemd/system/containerd.service; enabled; preset: disabled)
Active: active (running) since Tue 2024-01-23 02:45:15 EST; 2s ago
Docs: https://containerd.io
Process: 2664780 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 2664781 (containerd)
Tasks: 893
Memory: 4.1G
CPU: 2.751s
   CGroup: /system.slice/containerd.service
```

I then restarted the mgr pod and the osd pods. Comparing CPU usage before and after the modification, the drop is obvious for these processes.
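To check whether the lower `LimitNOFILE` actually applies to the mgr, something like this on the node should work (a sketch; the `ceph-mgr` process name and the availability of `pidof` on the host are assumptions):

```bash
# Find the ceph-mgr PID on the node and show its effective open-files limit.
pid=$(pidof ceph-mgr | awk '{print $1}')
grep "Max open files" /proc/"$pid"/limits
# Count how many file descriptors (including sockets) the mgr currently holds.
ls /proc/"$pid"/fd | wc -l
```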
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
Reopening this just to keep track of it. The original Ceph dashboard issue was in fact caused by some cheroot issues (CherryPy depends on cheroot) that should now be solved in this release: https://github.com/cherrypy/cheroot/releases/tag/v10.0.1
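If it helps, a quick way to check which cheroot/CherryPy versions the mgr is actually running (a sketch; it assumes the mgr pods carry the `app=rook-ceph-mgr` label and that `python3` with both packages is importable inside the `mgr` container):

```bash
# Pick one mgr pod and print the cheroot and cherrypy versions bundled with it.
MGR_POD=$(kubectl -n rook-ceph get pod -l app=rook-ceph-mgr -o jsonpath='{.items[0].metadata.name}')
kubectl -n rook-ceph exec "$MGR_POD" -c mgr -- \
  python3 -c "import cheroot, cherrypy; print(cheroot.__version__, cherrypy.__version__)"
```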
I installed Rook v1.12.3 on Kubernetes v1.24.16.
I found that the mgr service occupies a lot of CPU during operation, and the dashboard service responds very slowly and often fails like this.
I tried to increase the mgr's CPU and memory limits,
but the effect on response speed was not obvious.
Of course, in some cases the dashboard service works normally, just with a slow response.
Expected behavior:
I expect the mgr dashboard service not to fail to respond, and the response speed to improve.
Environment:
- Kernel (e.g. `uname -a`):
- Rook version (use `rook version` inside of a Rook Pod): 1.12.3
- Storage backend version (e.g. for ceph do `ceph -v`):
- Kubernetes version (use `kubectl version`):
- Kubernetes cluster type: kubernetes
- Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox):