Checker is a monitoring service that alerts you when something you care about goes wrong. It ships with sample jobs that can ping TCP ports and HTTP servers, and you can run any other check you like by writing a Python class.
The primary problem with any sort of monitoring service is: "How do I know the monitoring service itself is running?" Checker addresses this with two system jobs.
Checker is designed to send you email alerts when things break. To make sure its emails are getting through, Checker emails itself and confirms that it can receive its own messages. If it can, it assumes that it can successfully email you. It's your job to make sure those alerts don't end up in the spam bin.
Checker can't (reliably) run by itself. It needs at least one peer. Checker asks its peers whether they're still running successfully. There are four answers a peer can give (and that Checker can give to a peer):
- I'm running fine (Success)
- I can't send email (Error)
- I don't seem to be running any jobs (Error)
- (No Response) (Error)
When Checker sees that one of its peers has reported one of the three error conditions, it emails that peer's admin.
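The peer check described above can be sketched roughly as follows. This is an illustration only, not Checker's actual code: the `PeerStatus` enum, the `classify_response` and `check_peer` functions, and the reply strings are all hypothetical, as is the choice to treat an unrecognised reply like no response.

```python
from enum import Enum
from typing import Callable, Optional


class PeerStatus(Enum):
    """The four answers a peer can give (names are hypothetical)."""
    OK = "running fine"
    CANNOT_EMAIL = "can't send email"
    NO_JOBS = "not running any jobs"
    NO_RESPONSE = "no response"


def classify_response(body: Optional[str]) -> PeerStatus:
    """Map a raw peer reply to a status; None means the peer never answered."""
    if body is None:
        return PeerStatus.NO_RESPONSE
    # Placeholder reply strings, not Checker's real wire format.
    known = {
        "ok": PeerStatus.OK,
        "email-error": PeerStatus.CANNOT_EMAIL,
        "no-jobs": PeerStatus.NO_JOBS,
    }
    # Assumption: anything unrecognised is treated like silence.
    return known.get(body.strip().lower(), PeerStatus.NO_RESPONSE)


def check_peer(body: Optional[str], notify_admin: Callable[[str], bool]) -> bool:
    """Return True if the peer is healthy; otherwise email the peer's admin."""
    status = classify_response(body)
    if status is PeerStatus.OK:
        return True
    notify_admin("Peer problem: " + status.value)
    return False
```

Only the first answer counts as success; the other three all lead to the same action, an email to the peer's admin.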
Peering should be symmetrical: you check your friend's instance and your friend checks yours. Alternatively, you can run it yourself on two servers. In that case only one server needs to run all the jobs; the other can just run as a no-job peer.
This wouldn't be any good if you couldn't specify your own custom jobs. There are two ways to do this:
JobBase is the base class for a job, and should be used when you have a single, custom job you want to run. The name of your job needs to match the name of the file it is in. You should override three (or four) functions:
- This should return a JobFrequency constant indicating how often you want the job to run
- This should return a JobFailureNotificationFrequency constant indicating how often you want to be notified about a (continually) failing job
- This should return a JobFailureCountMinimumBeforeNotification constant indicating how many failures should occur in a row before notifying on failure
- This does the work. It returns True if the job succeeded or False if it didn't
- This is called if the job failed and the admin should be notified. It should return whether the email could be sent.
- E.g. return self.sendEmail("Job Failed", self.details_of_error)
- This is called if a) you specified JobFailureNotificationFrequency.ONSTATECHANGE in notifyOnFailureEvery and b) a job was previously failing but just succeeded
- Like onFailure, it should return whether the email could be sent.
- E.g. return self.sendEmail("Job Succeeded", "")
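Put together, a custom job might look something like the sketch below. Treat it as an illustration under assumptions rather than working Checker code: `notifyOnFailureEvery`, `onFailure`, `sendEmail`, and the three constant classes are named in this README, but `executeEvery`, `numberFailuresBeforeNotification`, `execute`, `onStateChangeSuccess`, the constant values, and the `FileExistsJob` class itself are hypothetical names chosen for the example. The stand-in classes at the top exist only so the sketch runs on its own.

```python
import os


# --- Stand-ins so the sketch is self-contained; the real ones live in Checker. ---
class JobFrequency:
    HOUR = "hour"  # hypothetical constant


class JobFailureNotificationFrequency:
    ONSTATECHANGE = "onstatechange"  # name from the README; value is made up


class JobFailureCountMinimumBeforeNotification:
    TWO = 2  # hypothetical constant


class JobBase:
    def __init__(self, config, *args):
        self.config = config

    def sendEmail(self, subject, body):
        # Stand-in: prints instead of sending, and reports success.
        print("EMAIL:", subject, body)
        return True


# Would live in a file whose name matches the job, e.g. FileExistsJob.py.
class FileExistsJob(JobBase):
    """Hypothetical job that succeeds if a given path exists."""

    def __init__(self, config, path):
        JobBase.__init__(self, config, path)
        self.path = path
        self.details_of_error = ""

    def executeEvery(self):  # assumed method name
        return JobFrequency.HOUR

    def notifyOnFailureEvery(self):
        return JobFailureNotificationFrequency.ONSTATECHANGE

    def numberFailuresBeforeNotification(self):  # assumed method name
        return JobFailureCountMinimumBeforeNotification.TWO

    def execute(self):  # assumed method name; does the actual work
        if os.path.exists(self.path):
            return True
        self.details_of_error = "Missing: " + self.path
        return False

    def onFailure(self):
        return self.sendEmail("Job Failed", self.details_of_error)

    def onStateChangeSuccess(self):  # assumed method name
        return self.sendEmail("Job Succeeded", "")
```

The work function records why it failed in `self.details_of_error`, so the failure handler has something useful to put in the email body.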
JobSpawner should be used when you want to run the same logic for multiple servers. Look at HTTPServerChecker and TCPServerChecker for examples of how one can use it.
Be sure to call JobBase.__init__(self, config, ...) in the inner JobBase's __init__. Pass it all the arguments that distinguish one job from another; for the TCP checker, for example, these are the IP and port. If you omit this step, the job state file will get messed up and be unable to distinguish between your individual spawned jobs. (It does not need 'extra' configuration arguments that do not distinguish jobs, such as JobFrequency. If you pass them anyway, you will get extra lines in the state file whenever you change those arguments, which is not bad, just a little messy.)
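To see why the distinguishing arguments matter, here is a toy sketch. Everything in it is hypothetical: the stand-in `JobBase`, the `state_key` scheme, the `TCPPortSpawner` class, and its `get_sub_jobs` method are invented for illustration and are not Checker's real JobSpawner interface (see HTTPServerChecker and TCPServerChecker for that).

```python
class JobBase:
    """Stand-in for Checker's JobBase; the state-key scheme is invented."""

    def __init__(self, config, *identity):
        self.config = config
        # Hypothetical: one state-file entry per (class name, identifying args).
        self.state_key = "|".join([type(self).__name__] + [str(a) for a in identity])


class TCPPortSpawner:
    """Hypothetical spawner: one inner job per (ip, port) pair."""

    class InnerJob(JobBase):
        def __init__(self, config, ip, port, frequency):
            # Pass only what distinguishes this job from its siblings;
            # frequency is deliberately left out.
            JobBase.__init__(self, config, ip, port)
            self.ip, self.port, self.frequency = ip, port, frequency

    def get_sub_jobs(self, config, targets, frequency="hour"):
        return [self.InnerJob(config, ip, port, frequency) for ip, port in targets]


jobs = TCPPortSpawner().get_sub_jobs({}, [("192.0.2.1", 22), ("192.0.2.1", 443)])
print([j.state_key for j in jobs])
# ['InnerJob|192.0.2.1|22', 'InnerJob|192.0.2.1|443']
```

In this toy model, leaving out the IP and port would give both jobs the same key, and adding the frequency would change every key each time the frequency is edited.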
Edit the config file
- Copy settings.cfg.example to settings.cfg
- Fill in servername and alertcontact
- Give it the details of the email account it should send mail from. I create a dedicated Gmail account for this.
- If you don't create the 'auto-delete messages from me' filter, set that setting to False
- If you have no peers, comment out all the peer lines. If you have peers, add them in.
- If you want a peer, I'll give it a shot. Email me.
Checker relies on the system cron to run at every interval. You need one cron job line for each frequency that is defined in JobFrequency.
First edit the ... to your path, then enter the following into your crontab:
    1-59 * * * * .../checker/main.py -m cron -c minute >/dev/null 2>&1
    0 1-11,13-23 * * * .../checker/main.py -m cron -c hour >/dev/null 2>&1
    0 0 * * * .../checker/main.py -m cron -c day >/dev/null 2>&1
    0 12 * * * .../checker/main.py -m cron -c day_noon >/dev/null 2>&1
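Read as written, the four schedules do not overlap: every minute of the day matches exactly one of them. The small sketch below models just the minute and hour fields of those crontab entries to make that visible. The `cron_modes_for` function is a hypothetical helper for illustration and says nothing about what main.py does once it is invoked.

```python
from datetime import datetime
from typing import List


def cron_modes_for(ts: datetime) -> List[str]:
    """Return the -c value(s) the four crontab entries would trigger at ts."""
    modes = []
    if 1 <= ts.minute <= 59:                      # 1-59 * * * *
        modes.append("minute")
    if ts.minute == 0 and ts.hour not in (0, 12):  # 0 1-11,13-23 * * *
        modes.append("hour")
    if ts.minute == 0 and ts.hour == 0:            # 0 0 * * *
        modes.append("day")
    if ts.minute == 0 and ts.hour == 12:           # 0 12 * * *
        modes.append("day_noon")
    return modes


print(cron_modes_for(datetime(2024, 1, 1, 12, 0)))  # ['day_noon']
print(cron_modes_for(datetime(2024, 1, 1, 5, 30)))  # ['minute']
```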