# feat(cron): echo-scheduler SQL (#561)
Adds support and documentation for running the echo scheduler in clustered SQL mode for resiliency.
marchello2000 committed May 28, 2019
1 parent d2a4076 commit d845cea
Showing 12 changed files with 371 additions and 130 deletions.
135 changes: 125 additions & 10 deletions README.md
@@ -1,39 +1,38 @@
# Echo
[![Build Status](https://api.travis-ci.org/spinnaker/echo.svg?branch=master)](https://travis-ci.org/spinnaker/echo)
`Echo` serves as a router for events that happen within Spinnaker.

## Outgoing Events

It provides integrations for outgoing notifications in the `echo-notifications` package via:

* email
* Slack
* BearyChat
* Google Chat
* SMS (via Twilio)

`Echo` is also able to send events within Spinnaker to a predefined URL, which is configurable under the `echo-rest` module.

You can extend the way in which `echo` events are sent by implementing the `EchoEventListener` interface.


## Event Types

Currently, `echo` receives build events from [igor](http://www.github.com/spinnaker/igor) and orchestration events from [orca](http://www.github.com/spinnaker/orca).

## Incoming Events
Echo also integrates with [igor](http://www.github.com/spinnaker/igor), [front50](http://www.github.com/spinnaker/front50) and [orca](http://www.github.com/spinnaker/orca) to trigger pipeline executions.

It does so via two modules:

* `pipeline-triggers`: Responsible for firing off events from Jenkins triggers
* `scheduler`: Triggers pipelines off cron expressions. Support for cron expressions is provided by [quartz](http://www.quartz-scheduler.org)
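
Note that Quartz cron expressions use six or seven space-separated fields (seconds, minutes, hours, day-of-month, month, day-of-week, and an optional year), unlike the five-field classic Unix cron. A few illustrative expressions (examples only, not taken from this repository):

```
0 0/30 * * * ?     # every 30 minutes, on the hour and half hour
0 0 9 ? * MON-FRI  # 09:00:00 every weekday
0 15 10 1 * ?      # 10:15:00 on the first day of every month
```

Either day-of-month or day-of-week must be `?` when the other is specified.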

## Running Echo
Echo can be run locally via `./gradlew bootRun`, which will start it with an embedded Cassandra instance, or by following the instructions in the [Spinnaker installation scripts](http://www.github.com/spinnaker/spinnaker).

### Debugging

To start the JVM in debug mode, set the Java system property `DEBUG=true`:
```
./gradlew -DDEBUG=true
```
The JVM will then listen for a debugger to be attached on port 8189. The JVM will wait for
the debugger to be attached before starting Echo; the relevant JVM arguments can be seen and
modified as needed in `build.gradle`.
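
For reference, a JVM that waits for a debugger on port 8189 is typically configured with a JDWP agent argument along these lines (a sketch of the standard flag; the exact arguments used here live in `build.gradle`):

```
-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8189
```

`suspend=y` is what makes the JVM block until a debugger attaches.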

## Configuration
`echo` can run in two modes: **in-memory** and **SQL**.

**In-memory** mode keeps all CRON trigger information in RAM.
While this is simpler to configure (and is the default), in-memory mode does not provide any redundancy because it requires that a single instance of the `echo` scheduler be running. If there are multiple instances, they will all attempt to start executions for a given CRON trigger: in-memory mode has no locking, leader election, or any other kind of coordination between scheduler instances.
If or when this single instance goes down, CRON triggers will not fire.

**SQL** mode keeps all CRON trigger information in a single SQL database. This allows for multiple `echo` scheduler instances to run providing redundancy (only one instance will trigger a given CRON).

To run in SQL mode you will need to initialize the database and provide a connection string in `echo.yml` (these instructions assume MySQL):
1. Create a database.
2. Initialize the database by running the initialization script (MySQL dialect provided [here](echo-scheduler/src/main/resources/db/database-mysql.sql)).
3. Configure SQL mode in `echo.yml`, adjusting the connection strings below to your environment:
```yaml
sql:
  enabled: true
  connectionPool:
    jdbcUrl: jdbc:mysql://localhost:3306/echo?serverTimezone=UTC
    user: echo_service
  migration:
    jdbcUrl: jdbc:mysql://localhost:3306/echo?serverTimezone=UTC
    user: echo_migrate
```
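
Steps 1 and 2 above might look like the following (a sketch, assuming MySQL and the `echo_service`/`echo_migrate` users referenced in the configuration; the passwords and exact grants are placeholders to adapt to your environment):

```sql
CREATE DATABASE echo;

-- runtime service account used by the connection pool
CREATE USER 'echo_service'@'%' IDENTIFIED BY 'CHANGEME';
GRANT SELECT, INSERT, UPDATE, DELETE ON echo.* TO 'echo_service'@'%';

-- migration account; needs DDL privileges to create and alter tables
CREATE USER 'echo_migrate'@'%' IDENTIFIED BY 'CHANGEME';
GRANT ALL PRIVILEGES ON echo.* TO 'echo_migrate'@'%';
```

Then run the schema script from step 2 as the `echo_migrate` user.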

See [Sample deployment topology](#sample-deployment-topology) for additional information.

### Configuration options
`echo` has several configuration options (can be specified in `echo.yml`), key ones are listed below:
* `scheduler.enabled` (default: `false`)
when set to `true` this instance will schedule and trigger CRON events
* `scheduler.pipelineConfigsPoller.enabled` (default: `false`)
when `true`, will synchronize pipeline triggers (set this to `true` if you enable `scheduler` unless running a missed scheduler configuration)
* `scheduler.compensationJob.enabled` (default: false)
when `true` this instance will poll for missed CRON triggers and attempt to re-trigger them (see [Missed CRON scheduler](#Missed-CRON-scheduler))
* `orca.pipelineInitiatorRetryCount` (default: `5`)
Number of retries on `orca` failures (leave at default)
* `orca.pipelineInitiatorRetryDelayMillis` (default: 5000ms)
Number of milliseconds between retries to `orca` (leave at default)

## Missed CRON scheduler
The missed CRON scheduler is a feature in `echo` that ensures that CRON triggers are firing reliably. It is enabled by setting `scheduler.compensationJob.enabled` configuration option.
In an event that a CRON trigger fails to fire or it fires but, for whatever reason, the execution doesn't start the missed CRON scheduler will detect it and attempt to re-trigger the pipeline.
The main scenario when missed cron scheduler is useful is for main scheduler outages either planned (upgrade) or unplanned (hardware failure).
Missed scheduler should be run as a separate instance as that will provide the most benefit and the resilience needed. Most situation likely don't necessitate the need for a missed scheduler instance, especially if you elect to run in SQL mode. (With the SQL mode support and pending additional investigation this feature will likely be removed all-together)

## Sample deployment topology
Here are two examples of configurations in which you can deploy `echo`.

| | Using in-memory | Using SQL |
|-------------------|-----------------------------|---------------------------------|
|**Server Group 1** | 3x `echo`                   | 3x `echo` with `echo-scheduler` |
|**Server Group 2** | 1x `echo-scheduler`         | 1x `echo-missed-scheduler`*     |
|**Server Group 3** | 1x `echo-missed-scheduler`* | n/a                             |

\* _optional `echo-missed-scheduler`; see [Missed CRON scheduler](#missed-cron-scheduler)_

If you opt for the in-memory execution mode, take care when deploying upgrades to `echo`.
Since only one instance should be running at a time, a rolling-push strategy will need to be used. Furthermore, if using `echo-missed-scheduler`, make sure to upgrade `echo-scheduler` followed by `echo-missed-scheduler`, so that pipelines that had a trigger during the deploy window are re-triggered correctly after the deploy.

The following are configuration options for each server group (note that other configuration options will be required, which Halyard will configure):
`echo` (this instance handles general events)
```yaml
scheduler:
  enabled: false
  pipelineConfigsPoller:
    enabled: false
  compensationJob:
    enabled: false
```

`echo-scheduler` (this instance triggers pipelines on a CRON)
```yaml
scheduler:
  enabled: true
  pipelineConfigsPoller:
    enabled: true
  compensationJob:
    enabled: false
```

`echo-missed-scheduler` (this instance triggers "missed" pipelines)
```yaml
scheduler:
  enabled: true
  pipelineConfigsPoller:
    enabled: false
  compensationJob:
    enabled: true
    pipelineFetchSize: 50

    # run every 1 min to minimize the skew between expected and actual trigger times
    recurringPollIntervalMs: 60000

    # look for missed CRON triggers in the last 5 mins (allows for a restart of the service)
    windowMs: 300000
```

## Monitoring
`echo` emits numerous metrics that allow for monitoring its operation.
Some of the key metrics are listed below:

* `orca.trigger.success`
  Number of successful triggers, i.e. when `orca` returns `HTTP 200` for a given trigger.

* `orca.trigger.errors`
  Number of failed triggers, i.e. when `orca` fails to execute a pipeline (returns a non-successful HTTP status code).
  This is a good metric to monitor, as it indicates either invalid pipelines or a system failure in triggering pipelines.

* `orca.trigger.retries`
  Number of retries to `orca`. Failed calls to `orca` will be retried (assuming they are network-type errors).
  Consistent non-zero numbers here mean there is likely a networking issue communicating with `orca`, which should be investigated.

* `echo.triggers.sync.error`,
  `echo.triggers.sync.failedUpdateCount`, and
  `echo.triggers.sync.removeFailCount`
  Indicate a failure during trigger synchronization. This likely means there are pipelines with invalid CRON expressions, which will not trigger.
  `echo` logs should provide additional information as to the cause of the issue.
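
As a starting point, alerting on these counters might look like the following (a hypothetical Prometheus-style rule; the metric names as exported depend on your metrics pipeline and may be spelled differently, e.g. with underscores or `_total` suffixes):

```yaml
groups:
  - name: echo-scheduler
    rules:
      - alert: EchoTriggerErrors
        # any failed pipeline trigger in the last 5 minutes
        expr: increase(orca_trigger_errors[5m]) > 0
        labels:
          severity: warning
      - alert: EchoTriggerRetriesSustained
        # sustained retries suggest a networking issue talking to orca
        expr: rate(orca_trigger_retries[15m]) > 0.1
        labels:
          severity: warning
```

Tune the thresholds and windows to your trigger volume.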
@@ -66,7 +66,7 @@ public class PipelineCache implements MonitoredPoller {

@Autowired
public PipelineCache(
@Value("${front50.polling-interval-ms:30000}") int pollingIntervalMs,
@Value("${front50.polling-sleep-ms:100}") int pollingSleepMs,
ObjectMapper objectMapper,
@NonNull Front50Service front50,
@@ -162,6 +162,14 @@ public void startPipeline(Pipeline pipeline, TriggerSource triggerSource) {
} else {
log.info(
"Would trigger {} due to {} but triggering is disabled", pipeline, pipeline.getTrigger());
registry
.counter(
"orca.trigger.disabled",
"triggerSource",
triggerSource.name(),
"triggerType",
getTriggerType(pipeline))
.increment();
}
}

5 changes: 4 additions & 1 deletion echo-scheduler/echo-scheduler.gradle
@@ -30,6 +30,9 @@ dependencies {
implementation "com.netflix.spinnaker.kork:kork-artifacts"
implementation "com.netflix.spinnaker.kork:kork-sql"

implementation "mysql:mysql-connector-java"
implementation "org.springframework:spring-context-support"
implementation ("org.quartz-scheduler:quartz") {
  exclude group: 'com.zaxxer', module: 'HikariCP-java7'
}
}
@@ -16,19 +16,19 @@

package com.netflix.spinnaker.echo.config

import com.netflix.spinnaker.echo.scheduler.actions.pipeline.AutowiringSpringBeanJobFactory

import com.netflix.spinnaker.echo.scheduler.actions.pipeline.PipelineConfigsPollingJob
import com.netflix.spinnaker.echo.scheduler.actions.pipeline.PipelineTriggerJob
import com.netflix.spinnaker.echo.scheduler.actions.pipeline.TriggerListener
import com.netflix.spinnaker.echo.scheduler.actions.pipeline.TriggerConverter
import com.netflix.spinnaker.kork.sql.config.DefaultSqlConfiguration
import com.squareup.okhttp.OkHttpClient
import org.quartz.JobDetail
import org.quartz.Trigger
import org.quartz.spi.JobFactory
import org.springframework.beans.factory.annotation.Value
import org.springframework.boot.autoconfigure.condition.ConditionalOnExpression
import org.springframework.context.ApplicationContext
import org.springframework.boot.autoconfigure.quartz.QuartzAutoConfiguration
import org.springframework.boot.autoconfigure.quartz.QuartzProperties
import org.springframework.boot.autoconfigure.quartz.SchedulerFactoryBeanCustomizer
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.springframework.context.annotation.Import
@@ -42,46 +42,11 @@ import java.util.concurrent.TimeUnit

@Configuration
@ConditionalOnExpression('${scheduler.enabled:false}')
@Import([DefaultSqlConfiguration, QuartzAutoConfiguration])
class SchedulerConfiguration {
@Value('${scheduler.pipeline-configs-poller.polling-interval-ms:30000}')
long syncInterval

/**
* Job for syncing pipeline triggers
*/
@@ -90,10 +90,11 @@
@Value('${scheduler.cron.timezone:America/Los_Angeles}') String timeZoneId
) {
JobDetailFactoryBean syncJob = new JobDetailFactoryBean()
syncJob.setJobClass(PipelineConfigsPollingJob.class)
syncJob.jobDataMap.put("timeZoneId", timeZoneId)
syncJob.setName("Sync Pipelines")
syncJob.setGroup("Sync")
syncJob.setDurability(true)

return syncJob
}
@@ -104,16 +104,16 @@
@Bean
@ConditionalOnExpression('${scheduler.pipeline-configs-poller.enabled:true}')
SimpleTriggerFactoryBean syncJobTriggerBean(
@Value('${scheduler.pipeline-configs-poller.polling-interval-ms:60000}') long intervalMs,
JobDetail pipelineSyncJobBean
) {
SimpleTriggerFactoryBean triggerBean = new SimpleTriggerFactoryBean()

triggerBean.setName("Sync Pipelines")
triggerBean.setGroup("Sync")
triggerBean.setStartDelay(TimeUnit.SECONDS.toMillis(60 + new Random().nextInt() % 60))
triggerBean.setRepeatInterval(intervalMs)
triggerBean.setJobDetail(pipelineSyncJobBean)

return triggerBean
}
@@ -124,13 +124,37 @@
@Bean
JobDetailFactoryBean pipelineJobBean() {
JobDetailFactoryBean triggerJob = new JobDetailFactoryBean()
triggerJob.setJobClass(PipelineTriggerJob.class)
triggerJob.setName(TriggerConverter.JOB_ID)
triggerJob.setDurability(true)

return triggerJob
}

@Bean
SchedulerFactoryBeanCustomizer echoSchedulerFactoryBeanCustomizer(
Optional<DataSource> dataSourceOptional,
TriggerListener triggerListener,
@Value('${sql.enabled:false}')
boolean sqlEnabled
) {
return new SchedulerFactoryBeanCustomizer() {
@Override
void customize(SchedulerFactoryBean schedulerFactoryBean) {
if (dataSourceOptional.isPresent()) {
schedulerFactoryBean.setDataSource(dataSourceOptional.get())
}

if (sqlEnabled) {
Properties props = new Properties()
props.put("org.quartz.jobStore.isClustered", "true")
schedulerFactoryBean.setQuartzProperties(props)
}
schedulerFactoryBean.setGlobalTriggerListeners(triggerListener)
}
}
}

@Bean
Client retrofitClient(@Value('${retrofit.connect-timeout-millis:10000}') long connectTimeoutMillis,
@Value('${retrofit.read-timeout-millis:15000}') long readTimeoutMillis) {