# feat(cron): echo-scheduler SQL (#561)
Adds support and documentation for running the echo scheduler in clustered SQL mode for resiliency.
marchello2000 committed May 28, 2019
1 parent d2a4076 commit d845cea
Showing 12 changed files with 371 additions and 130 deletions.
135 changes: 125 additions & 10 deletions README.md
@@ -1,39 +1,38 @@
# Echo
[![Build Status](https://api.travis-ci.org/spinnaker/echo.svg?branch=master)](https://travis-ci.org/spinnaker/echo)
`Echo` serves as a router for events that happen within Spinnaker.

## Outgoing Events

It provides integrations for outgoing notifications in the `echo-notifications` package via:

* email
* Slack
* BearyChat
* Google Chat
* SMS (via Twilio)

`Echo` is also able to send events within Spinnaker to a predefined URL, which is configurable under the `echo-rest` module.

You can extend the way in which `echo` events are sent by implementing the `EchoEventListener` interface.


## Event Types

Currently, `echo` receives build events from [igor](http://www.github.com/spinnaker/igor) and orchestration events from [orca](http://www.github.com/spinnaker/orca).

## Incoming Events
Echo also integrates with [igor](http://www.github.com/spinnaker/igor), [front50](http://www.github.com/spinnaker/front50) and [orca](http://www.github.com/spinnaker/orca) to trigger pipeline executions.

It does so via two modules:

* `pipeline-triggers`: Responsible for firing off events from Jenkins triggers
* `scheduler`: Triggers pipelines off cron expressions. Support for cron expressions is provided by [quartz](http://www.quartz-scheduler.org)
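
Note that Quartz cron expressions use six or seven space-separated fields (seconds, minutes, hours, day-of-month, month, day-of-week, and an optional year), unlike the five-field classic Unix cron. A few illustrative expressions (examples only, not taken from this repository):

```
0 0/30 * * * ?     # every 30 minutes, on the hour and half hour
0 0 9 ? * MON-FRI  # 09:00:00 every weekday
0 15 10 1 * ?      # 10:15:00 on the first day of every month
```

Either day-of-month or day-of-week must be `?` when the other is specified.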

## Running Echo
Echo can be run locally via `./gradlew bootRun`, which will start it with an embedded Cassandra instance, or by following the instructions in the [Spinnaker installation scripts](http://www.github.com/spinnaker/spinnaker).

### Debugging

To start the JVM in debug mode, set the Java system property `DEBUG=true`:
```
./gradlew -DDEBUG=true
```
The JVM will then listen for a debugger to be attached on port 8189. The JVM will wait for
the debugger to be attached before starting Echo; the relevant JVM arguments can be seen and
modified as needed in `build.gradle`.
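
For reference, a JVM that waits for a debugger on port 8189 is typically configured with a JDWP agent argument along these lines (a sketch of the standard flag; the exact arguments used here live in `build.gradle`):

```
-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8189
```

`suspend=y` is what makes the JVM block until a debugger attaches.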

## Configuration
`echo` can run in two modes: **in-memory** and **SQL**.

**In-memory** mode keeps all CRON trigger information in RAM.
While this is simpler to configure (and is the default), in-memory mode does not provide any redundancy because it requires that a single instance of the `echo` scheduler be running. If there are multiple instances, they will all attempt to start executions for a given CRON trigger: in-memory mode has no locking, leader election, or any other kind of coordination between scheduler instances.
If or when this single instance goes down, CRON triggers will not fire.

**SQL** mode keeps all CRON trigger information in a single SQL database. This allows for multiple `echo` scheduler instances to run providing redundancy (only one instance will trigger a given CRON).

To run in SQL mode you will need to initialize the database and provide a connection string in `echo.yml` (these instructions assume MySQL):
1. Create a database.
2. Initialize the database by running the initialization script (MySQL dialect provided [here](echo-scheduler/src/main/resources/db/database-mysql.sql)).
3. Configure SQL mode in `echo.yml`, adjusting the connection strings below to your environment:
```yaml
sql:
  enabled: true
  connectionPool:
    jdbcUrl: jdbc:mysql://localhost:3306/echo?serverTimezone=UTC
    user: echo_service
  migration:
    jdbcUrl: jdbc:mysql://localhost:3306/echo?serverTimezone=UTC
    user: echo_migrate
```
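
Steps 1 and 2 above might look like the following (a sketch, assuming MySQL and the `echo_service`/`echo_migrate` users referenced in the configuration; the passwords and exact grants are placeholders to adapt to your environment):

```sql
CREATE DATABASE echo;

-- runtime service account used by the connection pool
CREATE USER 'echo_service'@'%' IDENTIFIED BY 'CHANGEME';
GRANT SELECT, INSERT, UPDATE, DELETE ON echo.* TO 'echo_service'@'%';

-- migration account; needs DDL privileges to create and alter tables
CREATE USER 'echo_migrate'@'%' IDENTIFIED BY 'CHANGEME';
GRANT ALL PRIVILEGES ON echo.* TO 'echo_migrate'@'%';
```

Then run the schema script from step 2 as the `echo_migrate` user.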

See [Sample deployment topology](#sample-deployment-topology) for additional information.

### Configuration options
`echo` has several configuration options (can be specified in `echo.yml`), key ones are listed below:
* `scheduler.enabled` (default: `false`)
when set to `true` this instance will schedule and trigger CRON events
* `scheduler.pipelineConfigsPoller.enabled` (default: `false`)
when `true`, will synchronize pipeline triggers (set this to `true` if you enable `scheduler` unless running a missed scheduler configuration)
* `scheduler.compensationJob.enabled` (default: false)
when `true` this instance will poll for missed CRON triggers and attempt to re-trigger them (see [Missed CRON scheduler](#Missed-CRON-scheduler))
* `orca.pipelineInitiatorRetryCount` (default: `5`)
Number of retries on `orca` failures (leave at default)
* `orca.pipelineInitiatorRetryDelayMillis` (default: 5000ms)
Number of milliseconds between retries to `orca` (leave at default)

## Missed CRON scheduler
The missed CRON scheduler is a feature in `echo` that ensures that CRON triggers are firing reliably. It is enabled by setting `scheduler.compensationJob.enabled` configuration option.
In an event that a CRON trigger fails to fire or it fires but, for whatever reason, the execution doesn't start the missed CRON scheduler will detect it and attempt to re-trigger the pipeline.
The main scenario when missed cron scheduler is useful is for main scheduler outages either planned (upgrade) or unplanned (hardware failure).
Missed scheduler should be run as a separate instance as that will provide the most benefit and the resilience needed. Most situation likely don't necessitate the need for a missed scheduler instance, especially if you elect to run in SQL mode. (With the SQL mode support and pending additional investigation this feature will likely be removed all-together)

## Sample deployment topology
Here are two examples of configurations in which you can deploy `echo`.

| | Using in-memory | Using SQL |
|-------------------|-----------------------------|---------------------------------|
|**Server Group 1** | 3x `echo`                   | 3x `echo` with `echo-scheduler` |
|**Server Group 2** | 1x `echo-scheduler`         | 1x `echo-missed-scheduler`*     |
|**Server Group 3** | 1x `echo-missed-scheduler`* | n/a                             |

\* _optional `echo-missed-scheduler`; see [Missed CRON scheduler](#missed-cron-scheduler)_

If you opt for the in-memory execution mode, take care when deploying upgrades to `echo`.
Since only one instance should be running at a time, a rolling-push strategy will need to be used. Furthermore, if using `echo-missed-scheduler`, make sure to upgrade `echo-scheduler` followed by `echo-missed-scheduler`, so that pipelines that had a trigger during the deploy window are re-triggered correctly after the deploy.

The following are configuration options for each server group (note that other configuration options will be required, which Halyard will configure):
`echo` (this instance handles general events)
```yaml
scheduler:
  enabled: false
  pipelineConfigsPoller:
    enabled: false
  compensationJob:
    enabled: false
```

`echo-scheduler` (this instance triggers pipelines on a CRON)
```yaml
scheduler:
  enabled: true
  pipelineConfigsPoller:
    enabled: true
  compensationJob:
    enabled: false
```

`echo-missed-scheduler` (this instance triggers "missed" pipelines)
```yaml
scheduler:
  enabled: true
  pipelineConfigsPoller:
    enabled: false
  compensationJob:
    enabled: true
    pipelineFetchSize: 50

    # run every 1 min to minimize the skew between expected and actual trigger times
    recurringPollIntervalMs: 60000

    # look for missed CRON triggers in the last 5 mins (allows for a restart of the service)
    windowMs: 300000
```

## Monitoring
`echo` emits numerous metrics that allow for monitoring its operation.
Some of the key metrics are listed below:

* `orca.trigger.success`
  Number of successful triggers, i.e. when `orca` returns `HTTP 200` for a given trigger.

* `orca.trigger.errors`
  Number of failed triggers, i.e. when `orca` fails to execute a pipeline (returns a non-successful HTTP status code).
  This is a good metric to monitor, as it indicates either invalid pipelines or a system failure in triggering pipelines.

* `orca.trigger.retries`
  Number of retries to `orca`. Failed calls to `orca` will be retried (assuming they are network-type errors).
  Consistent non-zero numbers here mean there is likely a networking issue communicating with `orca`, which should be investigated.

* `echo.triggers.sync.error`,
  `echo.triggers.sync.failedUpdateCount`, and
  `echo.triggers.sync.removeFailCount`
  Indicate a failure during trigger synchronization. This likely means there are pipelines with invalid CRON expressions, which will not trigger.
  `echo` logs should provide additional information as to the cause of the issue.
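
As a starting point, alerting on these counters might look like the following (a hypothetical Prometheus-style rule; the metric names as exported depend on your metrics pipeline and may be spelled differently, e.g. with underscores or `_total` suffixes):

```yaml
groups:
  - name: echo-scheduler
    rules:
      - alert: EchoTriggerErrors
        # any failed pipeline trigger in the last 5 minutes
        expr: increase(orca_trigger_errors[5m]) > 0
        labels:
          severity: warning
      - alert: EchoTriggerRetriesSustained
        # sustained retries suggest a networking issue talking to orca
        expr: rate(orca_trigger_retries[15m]) > 0.1
        labels:
          severity: warning
```

Tune the thresholds and windows to your trigger volume.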
@@ -66,7 +66,7 @@ public class PipelineCache implements MonitoredPoller {

@Autowired
public PipelineCache(
@Value("${front50.polling-interval-ms:30000}") int pollingIntervalMs,
@Value("${front50.polling-sleep-ms:100}") int pollingSleepMs,
ObjectMapper objectMapper,
@NonNull Front50Service front50,
@@ -162,6 +162,14 @@ public void startPipeline(Pipeline pipeline, TriggerSource triggerSource) {
} else {
log.info(
"Would trigger {} due to {} but triggering is disabled", pipeline, pipeline.getTrigger());
registry
.counter(
"orca.trigger.disabled",
"triggerSource",
triggerSource.name(),
"triggerType",
getTriggerType(pipeline))
.increment();
}
}

5 changes: 4 additions & 1 deletion echo-scheduler/echo-scheduler.gradle
@@ -30,6 +30,9 @@ dependencies {
implementation "com.netflix.spinnaker.kork:kork-artifacts"
implementation "com.netflix.spinnaker.kork:kork-sql"

implementation "mysql:mysql-connector-java"
implementation "org.springframework:spring-context-support"
implementation ("org.quartz-scheduler:quartz") {
  exclude group: 'com.zaxxer', module: 'HikariCP-java7'
}
}
@@ -16,19 +16,19 @@

package com.netflix.spinnaker.echo.config

import com.netflix.spinnaker.echo.scheduler.actions.pipeline.AutowiringSpringBeanJobFactory

import com.netflix.spinnaker.echo.scheduler.actions.pipeline.PipelineConfigsPollingJob
import com.netflix.spinnaker.echo.scheduler.actions.pipeline.PipelineTriggerJob
import com.netflix.spinnaker.echo.scheduler.actions.pipeline.TriggerListener
import com.netflix.spinnaker.echo.scheduler.actions.pipeline.TriggerConverter
import com.netflix.spinnaker.kork.sql.config.DefaultSqlConfiguration
import com.squareup.okhttp.OkHttpClient
import org.quartz.JobDetail
import org.quartz.Trigger
import org.quartz.spi.JobFactory
import org.springframework.beans.factory.annotation.Value
import org.springframework.boot.autoconfigure.condition.ConditionalOnExpression
import org.springframework.context.ApplicationContext
import org.springframework.boot.autoconfigure.quartz.QuartzAutoConfiguration
import org.springframework.boot.autoconfigure.quartz.QuartzProperties
import org.springframework.boot.autoconfigure.quartz.SchedulerFactoryBeanCustomizer
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.springframework.context.annotation.Import
@@ -42,46 +42,11 @@ import java.util.concurrent.TimeUnit

@Configuration
@ConditionalOnExpression('${scheduler.enabled:false}')
@Import([DefaultSqlConfiguration, QuartzAutoConfiguration])
class SchedulerConfiguration {
@Value('${scheduler.pipeline-configs-poller.polling-interval-ms:30000}')
long syncInterval

/**
* Job for syncing pipeline triggers
*/
@@ -90,10 +90,11 @@
@Value('${scheduler.cron.timezone:America/Los_Angeles}') String timeZoneId
) {
JobDetailFactoryBean syncJob = new JobDetailFactoryBean()
syncJob.setJobClass(PipelineConfigsPollingJob.class)
syncJob.jobDataMap.put("timeZoneId", timeZoneId)
syncJob.setName("Sync Pipelines")
syncJob.setGroup("Sync")
syncJob.setDurability(true)

return syncJob
}
@@ -104,16 +104,16 @@
@Bean
@ConditionalOnExpression('${scheduler.pipeline-configs-poller.enabled:true}')
SimpleTriggerFactoryBean syncJobTriggerBean(
@Value('${scheduler.pipeline-configs-poller.polling-interval-ms:60000}') long intervalMs,
JobDetail pipelineSyncJobBean
) {
SimpleTriggerFactoryBean triggerBean = new SimpleTriggerFactoryBean()

triggerBean.setName("Sync Pipelines")
triggerBean.setGroup("Sync")
triggerBean.setStartDelay(TimeUnit.SECONDS.toMillis(60 + new Random().nextInt() % 60))
triggerBean.setRepeatInterval(intervalMs)
triggerBean.setJobDetail(pipelineSyncJobBean)

return triggerBean
}
@@ -124,13 +124,37 @@
@Bean
JobDetailFactoryBean pipelineJobBean() {
JobDetailFactoryBean triggerJob = new JobDetailFactoryBean()
triggerJob.setJobClass(PipelineTriggerJob.class)
triggerJob.setName(TriggerConverter.JOB_ID)
triggerJob.setDurability(true)

return triggerJob
}

@Bean
SchedulerFactoryBeanCustomizer echoSchedulerFactoryBeanCustomizer(
Optional<DataSource> dataSourceOptional,
TriggerListener triggerListener,
@Value('${sql.enabled:false}')
boolean sqlEnabled
) {
return new SchedulerFactoryBeanCustomizer() {
@Override
void customize(SchedulerFactoryBean schedulerFactoryBean) {
if (dataSourceOptional.isPresent()) {
schedulerFactoryBean.setDataSource(dataSourceOptional.get())
}

if (sqlEnabled) {
Properties props = new Properties()
props.put("org.quartz.jobStore.isClustered", "true")
schedulerFactoryBean.setQuartzProperties(props)
}
schedulerFactoryBean.setGlobalTriggerListeners(triggerListener)
}
}
}

@Bean
Client retrofitClient(@Value('${retrofit.connect-timeout-millis:10000}') long connectTimeoutMillis,
@Value('${retrofit.read-timeout-millis:15000}') long readTimeoutMillis) {