Schedule SNS or HTTP (Webhook) notifications to be sent on a repeating schedule, or in the future with database backed persistence.
- Multiple agents can process jobs and schedules, by taking a lease on a schedule's individual cron or job.
- JSON logging to allow CloudWatch to extract metrics from the logs.
- MySQL 5.6 database (e.g. Aurora)
- Create a database for
callme.CREATE SCHEMAcallmeDEFAULT CHARACTER SET utf8 ;
- Set the
CALLME_CONNECTION_STRINGenvironment variable.export CALLME_CONNECTION_STRING='Server=localhost;Port=3306;Database=callme;Uid=root;Pwd=callme;Allow User Variables=true;multiStatements=true'
- Run the
callmeexecutable directly or run the Docker container (welldigital/callme). docker run welldigital/callme:latest- Interact with the API to setup recurring notifications, or jobs at a specific point in time.
The system is unit tested, and also has different types of integration test. The first is the mysql tests which test that the MySQL queries function as designed, while the second tests the system behaviour when loaded with synthetic data (see ./harness).
- Schedule
- An operation that should happen on a periodic basis.
- A schedule triggers the creation of jobs on a just-in-time basis, i.e. when the schedule is ready for the next item, a job is created at that moment.
- Job
- An operation that should happen at some point in the future.
- Jobs will be created by Schedules, and will happen near to the date and time of the next schedule. The exact date/time can't be guaranteed, since it depends on system load and the regularity of database polling to retrieve upcoming jobs.
- Lease
- A lock acquired by a system process which gives that process the right to process a schedule's cron or a single job for a time limit.
A Prometheus metric endpoint is opened at port 6666 by default, at the /metrics URL. To modify the listening port, set the CALLME_PROMETHEUS_PORT environment variable.
- job_leased_total
- The number of calls to collect jobs, seperated by success, error or none_available.
- job_leased_duration_milliseconds
- How long it took to execute a database claim and retrieve a job to work on.
- job_executed_total
- The number of executions of jobs, split up by success.
- job_executed_duration_milliseconds
- How long it took to execute jobs, split up by success.
- job_executed_delay_milliseconds
- The amount of delay between a job's scheduled start time, and when it actually started.
- job_completed_total
- The number of jobs marked as completed, split up by status.
- job_completed_duration_milliseconds
- How long it took to mark jobs as completed.
- schedule_leased_total
- The number of calls to collect a lease on a single schedule, seperated by success, error or none_available.
- schedule_leased_duration_milliseconds
- How long it took to execute a database claim and retrieve a schedule to work on.
- schedule_executed_total
- The number of times that the schedule's cron expression was parsed, separated by success or error.
- schedule_executed_delay_milliseconds
- The amount of delay between a schedule's update time, and when it actually started.
- schedule_job_started_total
- The count of jobs started by a schedule.
- schedule_job_started_duration_milliseconds
- The amount of time taken to start jobs and mark the schedule as updated.
Checklist for problems:
- Can the callme process access the database to acquire leases?
- Check the
schedule_leased_totalandjob_leased_totalmetrics for the error count. - Check the logs for database connection errors.
- Check the
- Is there work to process?
- The
schedule_leased_totalandjob_leased_totalmetrics have a dimension ofnone_availableif there aren't.
- The
- Is the work being collected successfully?
- The
schedule_leased_totalandjob_leased_totalmetrics have a dimension oferror, which should be zero. - There are also
schedule_leased_duration_millisecondsandjob_leased_duration_millisecondsmetrics which track how long database operations are taking.
- The
- Are jobs being processed?
- The
job_executed_totalmetric tells us how many jobs are processed. Check thesuccess/errordimension though.
- The
- Are jobs taking a long time to process?
- The
job_executed_duration_millisecondsmetric records how long jobs are taking to execute.
- The
- Are jobs starting later than they should?
- The
job_executed_delay_millisecondsmetric records the delta between when jobs should have started, and when they did. If jobs are taking too long to start, you may wish to take action, e.g. increasingCALLME_JOB_WORKER_COUNTto process more on a single box, or increasing the performance of your database instance, depending on where the bottleneck is.
- The
- How long is it taking to send the SNS notification?
- The
job_executed_duration_millisecondsmetric tracks the duration.
- The
- Is there a problem marking jobs as complete resulting in jobs being processed twice?
- The
job_completed_totalmetric tells us whether jobs are being completed insuccessorerrorstates.
- The
- Are database operations slow?
- The
job_completed_duration_millisecondsandjob_leased_duration_millisecondsmetrics record how long each database operation takes.
- The
- Are schedules being processed?
- The work should be completed succesfully (see
schedule_leased_total), after that cron parsing counts are tracked by theschedule_executed_totalmetric.
- The work should be completed succesfully (see
- Are schedules being processed fast enough?
- The
schedule_executed_delay_millisecondsmetric tracks how many milliseconds taken between when a schedule should have been executed, and when it did.
- The
- Are jobs being started based on schedules?
- The
schedule_job_started_totalmetric tracks how many jobs were started from schedules.
- The
- How long is it taking to schedule a new job to start and update the schedule?
- The
schedule_job_started_duration_millisecondsmetric tracks how long it is taking to complete the schedule processing.
- The
- Are any workers logging errors?
- The
error_totalmetric tracks the number of errors being logged by callme processes.
- The
- Failed to get schedule
- Impact: Possible delay to scheduling jobs, track the
schedule_executed_delay_millisecondsmetric to see the impact. - Failed to parse cron
- Impact: Cron will be retried forever every 30 minutes until database record is updated.
- Failed to update schedule and start a job
- Impact: The schedule update and job start is carried out within a transaction using the
sm_startjobandupdatecronprocedure, the operation will be retried by any schedule worker process when the lease expires (by default, 30 minutes), so the scheduled job will be deplayed.
- Failed to get job
- Impact: Possible delay to starting scheduled jobs, track the
job_executed_delay_millisecondsmetric to see the impact.
| Sent SNS OK | Sent SNS failed | |
|---|---|---|
| Marked complete OK | A: No problem | B: Job marked as errored, human intervention required |
| Marked complete failed | C: SNS will be sent again | D: Will be retried again |
This is the normal usage scenario. Jobs with this status do not require action.
Human intervention is required here, since the SNS notification may not have been sent for many reasons. Several errors could be the case:
- The
callmeprocess lacked permission to write to the SNS topic.- Create an instance IAM role which allows the SNS topic to be written to and apply it to the instance running
callme
- Create an instance IAM role which allows the SNS topic to be written to and apply it to the instance running
- The SNS topic didn't exist, so it can't be written to.
- Update the database
jobrecord to change the SNS topic ARN and update any matching schedule which will create more jobs in the future
- Update the database
- SNS was unavailable due to a lack of connectivity to SNS, or an AWS SNS outage.
- Delete the
jobresponserecords associated with the outage, which will trigger the jobs to run again.
- Delete the
C: Sent the SNS notification, but the job could not successfully be marked as complete within the completion timeout
These jobs may be sent multiple times, because the job worker was not able to mark them as complete, resulting in them being retried.
It's the responsibility of the receiver to reject duplicate messages if required, e.g. by creating a hash of the payload content and keeping a cache of previously received messages in Redis.
The most likely scenario here is the backing database has become unavailable. In this case, because jobs are requested from the database one at a time, then it's likely only jobs currently being processed will be sent more than once.
This scenario is likely that after pulling a job from the database, all network connectivity was lost. As a result, the process won't be able to send notifications or mark it as complete. This is equivalent to a no-op.
| Environment Variable | Default | Description |
|---|---|---|
| CALLME_CONNECTION_STRING | None, it's required | The connection string to the database. |
| CALLME_SCHEDULE_WORKER_COUNT | 1 | Number of routines processing schedules |
| CALLME_JOB_WORKER_COUNT | 1 | Number of routines processing jobs. |
| CALLME_LOCK_EXPIRY_MINUTES | 30 | Minutes a routine has to process a job or schedule. |
| CALLME_PROMETHEUS_PORT | 6666 | The port for the metrics HTTP endpoint |
| CALLME_MODE | sns | Whether to send SNS notifications (sns), or Web (web) |
- Uses https://github.com/mattes/migrate
- Use
date -u +"%Y%m%d%H%M%S"to generate an ISO date for naming the migration.
- Run
make docker-build, which outputs thewelldigital/callme:latestcontainer.