Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poller is not picking up the Queued tasks, the Host and Service checks are getting timed out. #2032

Open
sjose1x opened this issue Mar 14, 2024 · 1 comment

Comments

@sjose1x
Copy link

sjose1x commented Mar 14, 2024

Hi All,
I'm blocked due to failure with Shinken monitoring.., someone please help me on this... I can provide all information's and logs for sorting out this issue.

with OS upgrade, we are planning to run container based Shinken components for monitoring Windows 2008, 2012 and 2019 nodes.
individual containers are been created for all the components say Arbiter, Broker, Poller, Scheduler...etc

Configured Webui and Thruk for UI

All the configurations are good, and connectivity is there, but I could notice that the Hosts and Service checks are getting timed out.., after been in Pending state for quite some time... moreover these checks are not executing on time as well...

Enabled Debug log level and I could see queued up tasks are not getting picked by Poller.. Could not understand the reason behind it..

But when I logged in to the Poller-container and executed the command manually it got executed instantly..

System Details

OS - Alma 9.3
CPU - 4
RAM - 8GB
Shinken Version - 2.4.3

Even though the system has sufficient resources and only a few simple services are there to check, still the Poller service is consuming most of the CPU but the tasks are not getting done.

In the Poller log Queued up tasks are not getting picked up

[1710406970] DEBUG: [Shinken] Debug perf: ping [args:5.00679016113e-06] [aqu_lock:0.0][calling:3.09944152832e-06] [json:8.10623168945e-06] [global:1.62124633789e-05]
[1710406970] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710406970] DEBUG: [Shinken] Debug perf: get_broks [args:4.00543212891e-05] [aqu_lock:0.000130891799927][calling:5.50746917725e-05] [json:7.86781311035e-06] [global:0.000233888626099]
[1710406971] DEBUG: [Shinken] Debug perf: ping [args:4.05311584473e-06] [aqu_lock:9.53674316406e-07][calling:2.14576721191e-06] [json:1.00135803223e-05] [global:1.71661376953e-05]
[1710406971] DEBUG: [Shinken] Debug perf: get_external_commands [args:5.00679016113e-06] [aqu_lock:9.53674316406e-07][calling:2.40802764893e-05] [json:9.77516174316e-06] [global:3.981590271e-05]
[1710406971] DEBUG: [Shinken] Debug perf: ping [args:3.09944152832e-06] [aqu_lock:9.53674316406e-07][calling:1.90734863281e-06] [json:8.10623168945e-06] [global:1.4066696167e-05]
[1710406971] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710406971] DEBUG: [Shinken] Debug perf: get_broks [args:5.48362731934e-05] [aqu_lock:0.0001540184021][calling:5.50746917725e-05] [json:1.00135803223e-05] [global:0.000273942947388]
[1710406972] DEBUG: [Shinken]  ========================
[1710406972] DEBUG: [Shinken] [0][shinken-scheduler-1][fork] Stats: Workers:0 (Queued:154 TotalReturnWait:0)
[1710406972] DEBUG: [Shinken] [0][shinken-scheduler-1][fork] Stats: Workers:1 (Queued:0 TotalReturnWait:0)
[1710406972] DEBUG: [Shinken] [0][shinken-scheduler-1][fork] Stats: Workers:2 (Queued:4 TotalReturnWait:0)
[1710406972] DEBUG: [Shinken] [0][shinken-scheduler-1][fork] Stats: Workers:3 (Queued:16 TotalReturnWait:0)
[1710406972] DEBUG: [Shinken] I decide to up wait ratio
[1710406972] DEBUG: [Shinken] Wait ratio: 2.040918
[1710406972] DEBUG: [Shinken] [shinken-poller-1] Trying to adjust worker number. Actual number : 4, min per module : 4, max per module : 4
[1710406972] DEBUG: [Shinken] Ask actions to 0, got 0
[1710406972] DEBUG: [Shinken] Loop turn
[1710406972] DEBUG: [Shinken] Debug perf: ping [args:5.00679016113e-06] [aqu_lock:9.53674316406e-07][calling:3.09944152832e-06] [json:8.82148742676e-06] [global:1.78813934326e-05]
[1710406972] DEBUG: [Shinken] Debug perf: get_external_commands [args:4.05311584473e-06] [aqu_lock:9.53674316406e-07][calling:2.59876251221e-05] [json:9.05990600586e-06] [global:4.00543212891e-05]
[1710406972] DEBUG: [Shinken] Debug perf: ping [args:3.81469726562e-06] [aqu_lock:0.0][calling:2.14576721191e-06] [json:1.78813934326e-05] [global:2.38418579102e-05]
[1710406972] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710406972] DEBUG: [Shinken] Debug perf: get_broks [args:3.91006469727e-05] [aqu_lock:0.000120878219604][calling:5.50746917725e-05] [json:8.10623168945e-06] [global:0.000223159790039]
[1710406973] DEBUG: [Shinken] Debug perf: ping [args:2.86102294922e-06] [aqu_lock:7.15255737305e-06][calling:1.90734863281e-06] [json:1.00135803223e-05] [global:2.19345092773e-05]
[1710406973] DEBUG: [Shinken] Debug perf: get_external_commands [args:5.96046447754e-06] [aqu_lock:9.53674316406e-07][calling:3.09944152832e-05] [json:1.19209289551e-05] [global:4.98294830322e-05]
[1710406973] DEBUG: [Shinken] Debug perf: ping [args:3.81469726562e-06] [aqu_lock:9.53674316406e-07][calling:2.14576721191e-06] [json:1.09672546387e-05] [global:1.78813934326e-05]
[1710406973] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710406973] DEBUG: [Shinken] Debug perf: get_broks [args:5.07831573486e-05] [aqu_lock:0.000138998031616][calling:5.50746917725e-05] [json:1.00135803223e-05] [global:0.00025486946106]
[1710406974] DEBUG: [Shinken]  ========================

Scheduler DEBUG logs

[1710407544] DEBUG: [Shinken] Debug perf: ping [args:5.00679016113e-06] [aqu_lock:9.53674316406e-07][calling:2.14576721191e-06] [json:1.09672546387e-05] [global:1.90734863281e-05]
[1710407544] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710407544] DEBUG: [Shinken] Debug perf: get_broks [args:3.88622283936e-05] [aqu_lock:0.0001380443573][calling:5.19752502441e-05] [json:7.15255737305e-06] [global:0.000236034393311]
[1710407545] DEBUG: [Shinken] Debug perf: ping [args:4.05311584473e-06] [aqu_lock:0.0
2024-03-14 05:12:27,044 DEBG 'shinken-scheduler' stdout output:
][calling:2.86102294922e-06] [json:9.05990600586e-06] [global:1.59740447998e-05]
[1710407545] DEBUG: [Shinken] HTTP: calling lock for get_checks
[1710407545] DEBUG: [Shinken] Debug perf: get_checks [args:8.58306884766e-05] [aqu_lock:0.00012993812561][calling:4.50611114502e-05] [json:5.96046447754e-06] [global:0.000266790390015]
[1710407545] DEBUG: [Shinken] Debug perf: put_results [args:0.000179052352905] [aqu_lock:0.0][calling:1.4066696167e-05] [json:1.50203704834e-05] [global:0.000208139419556]
[1710407545] DEBUG: [Shinken] Load: (sleep) 1.00 (average: 1.00) -> 0%
[1710407545] DEBUG: [Shinken] Time to send 0 broks (after 0 secs)
[1710407545] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407545] DEBUG: [Shinken] Latency (avg/min/max): 1435.98/1.88/2870.07
[1710407545] DEBUG: [Shinken] Check average = 0 checks/s
[1710407546] DEBUG: [Shinken] Debug perf: ping [args:4.05311584473e-06] [aqu_lock:9.53674316406e-07][calling:2.86102294922e-06] [json:1.00135803223e-05] [global:1.78813934326e-05]
[1710407546] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710407546] DEBUG: [Shinken] Debug perf: get_broks [args:3.69548797607e-05] [aqu_lock:0.000128984451294][calling:4.81605529785e-05] [json:5.96046447754e-06] [global:0.000220060348511]
[1710407546] DEBUG: [Shinken] Debug perf: ping [args:4.05311584473e-06] [aqu_lock:9.53674316406e-07][calling:2.14576721191e-06] [json:1.00135803223e-05] [global:1.71661376953e-05]
[1710407546] DEBUG: [Shinken] HTTP: calling lock for get_checks
[1710407546] DEBUG: [Shinken] Debug perf: get_checks [args:7.89165496826e-05] [aqu_lock:0.000149011611938][calling:4.81605529785e-05] [json:6.91413879395e-06] [global:0.000283002853394]
[1710407546] DEBUG: [Shinken] Debug perf: put_results [args:0.000194072723389] [aqu_lock:9.53674316406e-07][calling:1.50203704834e-05] [json:1.59740447998e-05] [global:0.000226020812988]
[1710407546] DEBUG: [Shinken] Debug perf: ping [args:4.05311584473e-06] [aqu_lock:9.53674316406e-07][calling:1.90734863281e-06] [json:1.09672546387e-05] [global:1.78813934326e-05]
[1710407546] DEBUG: [Shinken] HTTP: calling lock for get_checks
[1710407546] DEBUG: [Shinken] Debug perf: get_checks [args:9.67979431152e-05] [aqu_lock:0.0001380443573][calling:6.60419464111e-05] [json:8.10623168945e-06] [global:0.000308990478516]
[1710407546] DEBUG: [Shinken] Debug perf: put_results [args:0.000180959701538] [aqu_lock:0.0][calling:1.4066696167e-05] [json:1.50203704834e-05] [global:0.000210046768188]
[1710407546] DEBUG: [Shinken] Load: (sleep) 1.00 (average: 1.00) -> 0%
[1710407546] DEBUG: [Shinken] Time to send 0 broks (after 0 secs)
[1710407546] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407546] DEBUG: [Shinken] Latency (avg/min/max): 1435.98/1.88/2870.07
[1710407546] DEBUG: [Shinken] Check average = 0 checks/s
[1710407546] DEBUG: [Shinken] Debug perf: ping [args:2.86102294922e-06] [aqu_lock:1.19209289551e-06][calling:2.86102294922e-06] [json:1.09672546387e-05] [global:1.78813934326e-05]
[1710407546] DEBUG: [Shinken] The arbiter asked me what I manage. It's {0: 128736}
[1710407546] DEBUG: [Shinken] Debug perf: what_i_managed [args:3.81469726562e-06] [aqu_lock:1.19209289551e-06][calling:0.000140905380249] [json:2.09808349609e-05] [global:0.000166893005371]
[1710407547] DEBUG: [Shinken] Debug perf: ping [args:6.91413879395e-06] [aqu_lock:1.90734863281e-06][calling:3.09944152832e-06] [json:1.78813934326e-05] [global:2.98023223877e-05]
[1710407547] DEBUG: [Shinken] HTTP: calling lock for get_checks
[1710407547] DEBUG: [Shinken] Debug perf: get_checks [args:8.70227813721e-05] [aqu_lock:0.000164985656738][calling:4.81605529785e-05] [json:5.96046447754e-06] [global:0.000306129455566]
[1710407547] DEBUG: [Shinken] Debug perf: ping [args:4.

In the Scheduler log I could see these WARNING

docker logs shinken-scheduler-1 |grep WARNING

[1710405731] WARNING: [Shinken] 4 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710405791] WARNING: [Shinken] 4 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710405851] WARNING: [Shinken] 4 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710405911] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710405971] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406031] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406091] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406151] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406211] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406932] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406992] WARNING: [Shinken] 5 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710407052] WARNING: [Shinken] 5 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710407112] WARNING: [Shinken] 5 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710407173] WARNING: [Shinken] 5 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling

Scheduler log

docker logs shinken-scheduler-1 |grep 'Checks: total'

[1710407422] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407423] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407424] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407425] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407426] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407427] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0

[root@localhost docker-shinken]# docker ps
CONTAINER ID   IMAGE                              COMMAND                  CREATED       STATUS       PORTS                                                            NAMES
3ab294d19b53   thews/shinken-receiver:latest      "/bin/sh -c 'supervi…"   2 hours ago   Up 2 hours   7773/tcp                                                         shinken-receiver-1
cd1b61bf032e   thews/shinken-thruk:latest         "/bin/sh -c 'supervi…"   2 hours ago   Up 2 hours   0.0.0.0:8000->80/tcp, :::8000->80/tcp                            docker-shinken-shinken-thruk-1
42b05cd36e91   thews/shinken-poller:latest        "/bin/sh -c 'supervi…"   2 hours ago   Up 2 hours   7771/tcp                                                         shinken-poller-1
cbf7ae4065a8   thews/shinken-arbiter:latest       "/bin/sh -c 'supervi…"   2 hours ago   Up 2 hours   7770/tcp                                                         shinken-arbiter-1
33d919544257   thews/shinken-reactionner:latest   "/bin/sh -c 'supervi…"   2 hours ago   Up 2 hours   7769/tcp                                                         shinken-reactionner-1
1b4e9f25fcec   thews/shinken-scheduler:latest     "/bin/sh -c 'supervi…"   2 hours ago   Up 2 hours   7768/tcp                                                         shinken-scheduler-1
b91b9d835c0a   thews/shinken-broker:latest        "/bin/sh -c 'supervi…"   2 hours ago   Up 2 hours   7772/tcp, 50000/tcp, 0.0.0.0:8001->7767/tcp, :::8001->7767/tcp   shinken-broker-1

docker stats

CONTAINER ID   NAME                             CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O     PIDS
3ab294d19b53   shinken-receiver-1               0.80%     42.84MiB / 7.493GiB   0.56%     46.4MB / 42.4MB   0B / 62.7MB   6
cd1b61bf032e   docker-shinken-shinken-thruk-1   0.02%     133.7MiB / 7.493GiB   1.74%     301kB / 839kB     0B / 1.2MB    16
42b05cd36e91   shinken-poller-1                 390.03%   75.05MiB / 7.493GiB   0.98%     52.3MB / 50.8MB   0B / 66.4MB   51
cbf7ae4065a8   shinken-arbiter-1                0.51%     45.05MiB / 7.493GiB   0.59%     94.3MB / 107MB    0B / 1.77MB   8
33d919544257   shinken-reactionner-1            0.96%     49.39MiB / 7.493GiB   0.64%     64.4MB / 68MB     0B / 71.3MB   10
1b4e9f25fcec   shinken-scheduler-1              0.74%     44.74MiB / 7.493GiB   0.58%     70.4MB / 59.4MB   0B / 86.3MB   6
b91b9d835c0a   shinken-broker-1                 1.19%     77.4MiB / 7.493GiB    1.01%     110MB / 109MB     0B / 76.7MB   13

@geektophe Could you please help me on this.. ;)

@sjose1x sjose1x changed the title Poller is not picking up the Queued checks, the Host and Service checks are getting timed out. Poller is not picking up the Queued tasks, the Host and Service checks are getting timed out. Mar 14, 2024
@sjose1x
Copy link
Author

sjose1x commented Mar 21, 2024

I could locate the issue as in, the docker is not support multiprocessing... Poller uses multiprocessing for running worker.. need to figure out how to mitigate this

Workers are running, but work.start() is getting stuck.. from there on no progress

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant