sensu-client process is not getting monitored by the check-process.rb script #55

Open
gopalvd opened this issue Nov 9, 2017 · 20 comments

@gopalvd

gopalvd commented Nov 9, 2017

Team,
I have a strange behaviour. I am trying to monitor the sensu-client process running on my Linux server. For that I have set up the check definition and installed the required gems. This is what I am seeing.

When I run the script manually, I see CheckProcess OK with 1 matching process found.
[root@XXXXXXXX checks]# /opt/sensu/embedded/bin/ruby /etc/sensu/plugins/check-process.rb -p sensu-client
CheckProcess OK: Found 1 matching processes; cmd /sensu-client/

When sensu-client itself is running the check, I see this.
{"timestamp":"2017-11-09T03:53:50.181638-0600","level":"info","message":"publishing check result","payload":{"client":"xxxxxxxxxx","check":{"command":"/etc/sensu/plugins/check-process.rb -p sensu-client","subscribers":[],"standalone":true,"handlers":["default","mailer"],"interval":30,"type":"standard","mail_to":"xxxxxxxxxxxx@target.com","name":"check_process-test","issued":1510221230,"executed":1510221230,"duration":0.17,"output":"CheckProcess CRITICAL: Found 0 matching processes; cmd /sensu-client/\n","status":2}}}

Can someone help me figure out what the issue could be?

@majormoses
Member

majormoses commented Nov 9, 2017

Hmm, that is indeed strange; when I have the time I will try to replicate it in my environment. I never really considered using Sensu to monitor the sensu-client process: how would it even get the request from the transport (rabbitmq|redis) if the process is dead? Standalone checks have the same problem, since the client itself has to schedule them locally.

I think you should probably leverage native functionality in your process supervisor, such as upstart, runit, systemd, etc., to attempt restarting the process if it dies. It's kind of a "who watches the watchers" problem. Here are some other suggestions, which are not mutually exclusive with what was proposed above:

  • Another monitoring system, such as monit, to watch the clients and the health of the Sensu server, API, transport, etc.
  • Something that may make sense if you have a static setup is to use the TTL functionality to check that Sensu is still receiving a "heartbeat"; it will create an alert when the TTL expires because the check was not able to run. You can read about Sensu check TTLs here: https://sensuapp.org/docs/latest/reference/checks.html#check-attributes.
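To make the TTL idea concrete, here is a hypothetical model of how a check TTL behaves (not Sensu's actual code): the server remembers when a check last reported, and if no new result arrives within `ttl` seconds it raises an event itself, so a dead client still produces an alert. The method name, the `expired_status` parameter and the assumption that an expired TTL defaults to a warning (status 1) are illustrative only; see the linked docs for the real attributes.

```ruby
# Hypothetical TTL model, illustration only (not Sensu's implementation).
CheckState = Struct.new(:status, :output)

# last_executed: epoch seconds of the last received result
# ttl:           seconds a result stays "fresh"
# expired_status: assumed default of 1 (warning) when the TTL expires
def evaluate_ttl(last_executed, ttl, now: Time.now.to_i, expired_status: 1)
  age = now - last_executed
  if age > ttl
    # Nothing arrived in time: the server, not the client, reports the problem.
    CheckState.new(expired_status, "no result received for #{age}s (ttl #{ttl}s)")
  else
    CheckState.new(0, "last result #{age}s ago")
  end
end
```

For a heartbeat check that runs every 30 seconds with, say, a 90-second TTL, a few late runs are tolerated, but a client that stops sending results turns the check non-OK without the client having to do anything.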

That being said, if the process is in fact up, then there is definitely a bug, config, or environment issue somewhere.

Can you please include the version of this plugin, sensu-server, and sensu-client?

@gopalvd
Author

gopalvd commented Nov 10, 2017

@majormoses Thanks for looking into this. I totally agree that monitoring sensu-client with Sensu is not a great idea, and we use a different tool for that. And yes, this plugin works perfectly with other processes like crond, chef-client, systemd, etc.
I noticed this strange behaviour only with sensu-client, and I came across it unintentionally: I wanted to test the plugin, gave it the sensu-client process to check, and noticed this.

Yes, to me it seems like a bug.

Here are the details that you are looking for.
gem version for plugins ---- 2.5.0
sensu-client version --- "sensu-0.26.3-1.x86_64"
sensu-client version --- "sensu-1.0.2-1.el7.x86_64"

@gopalvd
Author

gopalvd commented Nov 10, 2017

Sorry, the last one is the sensu-server version:
sensu-server version --- "sensu-1.0.2-1.el7.x86_64"

@majormoses
Member

majormoses commented Nov 10, 2017

Hmm, this is what I see in my environment and I can't explain it; I will look through the code when I have some time. This is what I got manually, which matches your output:

babrams@ip-10-55-141-110:~$ /opt/sensu/embedded/bin/check-process.rb -p sensu-client
CheckProcess OK: Found 1 matching processes; cmd /sensu-client/

I was scratching my head for a minute (clearly tired) as this is what I initially did and got back:

babrams@ip-10-55-141-110:~$ /opt/sensu/embedded/bin/check-process.rb - sensu-client
CheckProcess OK: Found 121 matching processes

@gopalvd
Author

gopalvd commented Nov 10, 2017

So is it a bug, or do some options need to be used in the command?

@majormoses
Member

majormoses commented Nov 10, 2017

I was missing the p after the -, so it was matching all processes! Which basically matches:

babrams@ip-10-55-141-110:~$ ps -Al | wc -l
122

@gopalvd
Author

gopalvd commented Nov 10, 2017

I did the same
CheckProcess OK: Found 402 matching processes

@gopalvd
Author

gopalvd commented Nov 10, 2017

Correct: when I run the command, it gives one matching process. But when sensu-client executes it, I see 0 matching processes and it throws the error:
CheckProcess CRITICAL: Found 0 matching processes; cmd /sensu-client/

@gopalvd
Author

gopalvd commented Nov 27, 2017

@majormoses Any updates you have on this issue?

@majormoses
Member

majormoses commented Nov 28, 2017

No, I honestly have not really thought about it much, as it is such a limited use case to have Sensu check whether Sensu is running. Without using TTLs there is no value in it, so I marked it as low priority to reflect that it will likely not see a quick resolution without another contributor who is more motivated to solve it. It is an intriguing problem, so if nobody volunteers to triage it I will get to it when I have the time.

@xlr5

xlr5 commented Mar 27, 2018

Hello, I ran into this issue today as well.
IMO monitoring the sensu-client process is necessary to define proper dependencies: I don't want any further notifications in case the sensu-client dies.
And it's as described above: executing check-process.rb on the shell returns 1 running process, while doing the same check via Sensu returns no running process.

Can you please take a look?

@majormoses
Member

I agree it's a bug, but I don't see much value in fixing it. By the very nature of this check and how Sensu works, if the sensu-client process is not running this check would never be executed. I would suggest looking at the keepalive, which is a built-in TTL check meant to solve this exact problem. You can also add a very simple check which runs /bin/true with a TTL check on top of it, which is essentially what the built-in keepalive does for you, except that it runs on a fixed 20-second interval. Other options include scheduling a wrapper script via cron that, if there are issues (check $?), updates the result via the server's API, or using monit to update the API. Bottom line: unless it is a TTL check, there is zero value in executing this from within the sensu-client process, as it is a chicken-and-egg scenario.
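The cron-wrapper option could look roughly like the sketch below. It is a hypothetical illustration, not an official tool: it runs a health command, maps `$?` to a check status (any failure becomes critical, 2), and POSTs the result to the Sensu API. The `/results` endpoint, the payload fields, the API address and the method names are assumptions; verify them against the API documentation for your Sensu version.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical cron wrapper, illustration only.

# Runs `command` through the shell and builds a check-result hash from $?.
def build_check_result(name, command, source)
  output = `#{command} 2>&1`
  # Non-zero exit (or death by signal) is reported as critical.
  status = $?.success? ? 0 : 2
  { name: name, source: source, output: output.strip, status: status }
end

# Sends the result to the (assumed) results endpoint, e.g. 'http://sensu-server:4567'.
def post_check_result(api_url, result)
  uri = URI.join(api_url, '/results')
  request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
  request.body = JSON.generate(result)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end
```

A crontab entry could then call something like `post_check_result('http://sensu-server:4567', build_check_result('sensu_client_process', 'systemctl is-active --quiet sensu-client', 'myhost'))` every minute, assuming systemd manages a unit named sensu-client; the host, port, check name and unit name are placeholders. Because the result reaches the server without going through sensu-client, it still gets reported when the client is down.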

I do not work for Sensu, nor am I monetarily compensated for maintaining the plugins. There are lots of issues for me to look at to fix bugs, enhance plugins, and do housekeeping, which means I need to prioritize where I spend my effort. While I find the problem itself intriguing, the use case makes little sense to me, so I will not prioritize this over things that I consider to bring greater value to the community. I am more than happy to review a PR should you or someone else find the issue.

If you think this is important you have a couple options:

  • articulate what use case this has, explain how it could possibly execute if the client is not running, and what this would bring you; maybe my perspective will change
  • look into the problem yourself
  • put up a bounty to motivate others to solve your problem
  • file a bug on the main sensu project and maybe they will feel differently and will look into it

@majormoses
Member

I have pinged the other maintainers as well for another set of 👀 as maybe they will have a different perspective than I do.

@jaredledvina
Member

jaredledvina commented Mar 27, 2018

I agree with @majormoses that the use case here isn't totally clear. Looking through the code, I would be interested to see whether passing -m or possibly -M makes this work, though: https://github.com/sensu-plugins/sensu-plugins-process-checks/blob/master/bin/check-process.rb#L79-L84

@majormoses
Member

majormoses commented Mar 27, 2018

@jaredledvina good call, these options might help, but again this gives you no real value without adding a TTL check, and I would suggest using the keepalive instead, as that is its whole purpose.

The options are defined here:

The code that rejects the processes unless args are changed:

@huynt1979

@majormoses The sane philosophy for this would be using another tool, for example something as simple as cron, to monitor the monitoring tool like sensu-client. It's weird to have a monitoring tool watch itself, because when it dies, what would alert you... you are already DEAD?
Anyway, if you insist, this line may shed more light on why @gopalvd saw what he saw...
https://github.com/sensu-plugins/sensu-plugins-process-checks/blob/master/bin/check-process.rb#L258-L259
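The linked lines are where the plugin drops processes before counting. A hypothetical, simplified model of that filtering (not the plugin's real code) shows why the shell and sensu-client disagree, assuming sensu-client is the check's direct parent: the check's own command line contains "sensu-client", so it must exclude itself, and excluding its parent then removes sensu-client as well. `match_self` and `match_parent` stand in for the -m / -M options mentioned above; their exact semantics here are an assumption. With no pattern at all nothing is filtered, which is what happened with the "- sensu-client" typo earlier in the thread.

```ruby
# Hypothetical, simplified model of check-process.rb's filtering; NOT the real code.
# Each process is a Hash with :pid and :command.
def matching_processes(procs, cmd_pat: nil, self_pid: Process.pid, parent_pid: Process.ppid,
                       match_self: false, match_parent: false)
  matches = procs.dup
  # No -p pattern: nothing is filtered out, so every process "matches".
  matches.select! { |entry| entry[:command] =~ Regexp.new(cmd_pat) } if cmd_pat
  # The check's own command line contains the pattern, so drop it unless match_self.
  matches.reject! { |entry| entry[:pid] == self_pid } unless match_self
  # Drop whatever spawned the check unless match_parent; under sensu-client
  # that parent is assumed to be sensu-client itself.
  matches.reject! { |entry| entry[:pid] == parent_pid } unless match_parent
  matches
end

# Made-up process table for illustration.
SAMPLE_PROCS = [
  { pid: 100, command: '/opt/sensu/embedded/bin/ruby /opt/sensu/bin/sensu-client' },
  { pid: 200, command: '/opt/sensu/embedded/bin/ruby check-process.rb -p sensu-client' },
  { pid: 300, command: '/usr/sbin/crond -n' }
].freeze
```

With the check as PID 200: run from an interactive shell (parent PID 50 in this made-up data), only PID 100 survives; with sensu-client (PID 100) as the parent the list is empty, mirroring the CRITICAL result, and `match_parent: true` brings PID 100 back.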

@majormoses
Member

The sane philosophy for this would be using another tool, for example something as simple as cron, to monitor the monitoring tool like sensu-client. It's weird to have a monitoring tool watch itself, because when it dies, what would alert you... you are already DEAD?

Yup, the only other real alternative native to Sensu that provides value in certain scenarios (such as a client dying while the server remains functional) is a TTL. Bottom line: even if adding those args solves the problem, it will not provide any value without adding a TTL. If you are going to do that, you already have something built into Sensu that does this for you: it's called the keepalive check. It is hardcoded to send an update every 20 seconds, and you can adjust your keepalive thresholds and dependencies as you see fit.
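As a rough mental model of those thresholds (a hypothetical sketch, not Sensu's code): the client sends a keepalive every 20 seconds, and the server compares the age of the newest keepalive with a warning and a critical threshold. The 120/180-second defaults below are an assumption; check the client keepalive documentation for your version before relying on them.

```ruby
# Hypothetical keepalive evaluation, illustration only.
# last_keepalive: epoch seconds of the newest keepalive received from the client.
def keepalive_status(last_keepalive, now: Time.now.to_i, warning: 120, critical: 180)
  age = now - last_keepalive
  return 2 if age >= critical # critical: client presumed dead
  return 1 if age >= warning  # warning: several 20-second keepalives missed
  0
end
```

Lowering the thresholds makes a dead client show up sooner at the cost of more noise from brief network hiccups.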

@majormoses
Member

After reading through the code and seeing those options pointed out, I highly doubt this is a bug anymore. It sounds like intended functionality with sane defaults. The use case is still flawed, but it would appear to be possible.

@xlr5

xlr5 commented Mar 28, 2018

Well, after reading your arguments I came to the conclusion that I need to rethink my approach.
It came from my Nagios installation, where I used service dependencies on the check for whether NRPE is running to suppress further service notifications when the host died.

I admit that I need to learn better how Sensu handles those situations.

@majormoses
Member

You can do the same thing, just use keepalive. As this is clearly something that multiple people have tried and found not to work, I feel like we should add some documentation to help out users.
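The Nagios-style dependency described above can be approximated by keying suppression off the keepalive. The sketch below is a hypothetical handler-side filter, not a Sensu built-in: `suppress_notification?` and `keepalive_status_by_client` (client name mapped to its latest keepalive status, assumed to be looked up elsewhere) are illustrative names. It lets the keepalive alert itself through and silences the client's other events while its keepalive is failing.

```ruby
# Hypothetical notification filter, illustration only.
# event: Hash with :client => { :name } and :check => { :name }
def suppress_notification?(event, keepalive_status_by_client)
  # Always let the keepalive alert itself through.
  return false if event[:check][:name] == 'keepalive'

  # Unknown clients are treated as healthy (status 0).
  keepalive_status_by_client.fetch(event[:client][:name], 0) != 0
end
```

The result is one keepalive notification when the host or its client dies, instead of one per check.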
