Skip to content
This repository has been archived by the owner on Apr 24, 2023. It is now read-only.

Latest commit

 

History

History
633 lines (500 loc) · 35.8 KB

configuration.adoc

File metadata and controls

633 lines (500 loc) · 35.8 KB

Overview

Cook has a lot of options in its configuration file. This document will attempt to cover all the supported configuration options for the Cook scheduler. Cook is designed to support multiple config file formats (json, yaml, edn), but today, it only supports edn. The edn format (pronounced eden, like the garden) is described at https://github.com/edn-format/edn. The most important thing to know about edn is that there’s nothing separating keys and values (that is, no : like in Python or like in Scala) in maps, and commas are whitespace.

In this guide, configuration will be written as fragments.

Sample Config Files

Before looking over this guide, you should check out the sample config files, dev-config.edn and example-prod-config.edn. The dev config starts up an embedded Zookeeper and Datomic so that you don’t have to. It also enables all of the introspection features, like JMX metrics and nREPL, without sending any emails or metrics to alerting systems. To use the dev config yourself, just make sure that the Mesos cluster’s :master is set correctly, and the username exists on your slaves.

The prod config instead uses external Zookeeper and Datomic to ensure high availability and persistence. It registers with Mesos so that it can fail over successfully without losing any tasks. It also is chattier, sending emails and metrics to other systems to improve visibility.

At this point, you’re probably wondering what all these options actually do, and how to configure them. The next several sections will document what every option does.

Basic Configuration Options

:port

This is the port that the REST API will bind to.

:hostname

This is the hostname of the server Cook is running on. Defaults to the hostname (via getCanonicalHostname)

:database

This configures which database Cook will connect to. Currently, Cook only supports Datomic. Thus, :database must be set to a map with a single key: {:datomic-uri "$DB_URI"}.

Datomic’s pretty awesome because it has an in-process embedded in-memory version, which is specified by using an in-memory backend. To use the in-memory DB, use the URI datomic:mem://cook-jobs. An example URI for connecting to a Datomic free transactor on the host $HOST would be datomic:free://$HOST:4334/cook-jobs. See http://docs.datomic.com/getting-started.html for more information on setting up Datomic.

Tip
Datomic Configuration in Production

Cook uses special transaction functions to maintain distributed consistency. When running a standalone transactor (as in most QA and Production environments), you’ll need to include a copy of the Cook jar on the Datomic transactor’s classpath. This will ensure that all of the transaction functions are available on the transactor.

:zookeeper

This configures which Zookeeper Cook will connect to. You can either have Cook use an embedded Zookeeper (great for development and trying out Cook), or use an external Zookeeper quorum (required for production). To use a production Zookeeper quorum located at $QUORUM (e.g. zk1.example.com,zk2.example.com,zk3.example.com/cook), you should use a map: {:connection "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/cook"}.

To use the embedded Zookeeper for development, use the map {:local? true}. By default, the embedded Zookeeper will bind to port 3291. If need it to bind to another port, you can specify that with the :local-port key: e.g. {:local? true, :local-port 9001}.

:mesos

This key configures how Cook will interact with the Mesos cluster. See Mesos Configuration for details.

:authorization

This key configures how Cook will validate users for multitenant scheduling. Cook currently supports a single-user development mode, HTTP Basic authentication, and Kerberos authentication. See Authorization Configuration for details.

:authorization-fn

This key specifies what function to use to perform user authorization. Two example authorization functions are provided. cook.rest.authorization/open-auth allows any user to do anything, for testing and development. cook.rest.authorization/configfile-admins-auth consults the :admins key in the config file for a list of admins. Admins may do anything to any object; other users may only manipulate their own object. It’s easy to write your own custom authorization function. See the cook.rest.authorization docstrings for more information.

:admins

The value of this key is a set of usernames who should be considered administrators when using the configfile-admins-auth authorization-fn.

:plugins

Cook has two extension points that let plugins reject jobs at job submission time as well as accept or defer the launching of jobs via :job-launch-filter and :job-submission-validator.

  • :job-launch-filter configures the entrypoint for filtering job launches. It has several keys:

    • :factory-fn contains a string with a namespace-qualified path of a function for creating either a JobLaunchFilter. The factory function can fetch the current configuration out of the config defstate in cook.config/config

    • :age-out-first-seen-deadline-minutes controls how we get rid of jobs whose launch is perpetually deferred by a plugin. When a job 'ages out', we force it to launch now, to keep the queue from being cluttered with always deferred jobs. The clock for aging out starts when a job is near enough to the front of the queue to be eligible to run, not when it is added to the queue. A job is eligible for aging out when it was first seen in the scheudler queue at least this long ago. If you want jobs to sit longer in the queue than the default 10 hours before being aged out, increase this number.

    • :age-out-last-seen-deadline-minutes A job is eligible for aging out when it has been seen in the launch queue at least this recently.

    • :age-out-seen-count We must have attempted to schedule the job at least this many times before we age it out.

  • :job-submission-validator configures the entrypoint for filtering job submissions. It has several keys:

    • :factory-fn contains a string with a namespace-qualified path of a function for creating a JobSubmissionValidator. The factory function can fetch the current configuration out of the config defstate in cook.config/config

    • :batch-timeout-seconds: This includes a critical timeout value. We check job launches synchronously, so the plugin has to respond fast. In particular, we must complete all of the launch checks before the HTTP timeout. To do this we implement another timeout, defaulting to 40 seconds. Once we cross this soft timeout, a default accept/reject policy is implemented for submitted jobs, one that cannot look at jobs individually.

  • :job-adjuster configures a plugin for adjusting jobs on submission.

    • :factory-fn contains a string with a namespace-qualified path of a function for creating a JobAdjuster.

:rate-limit

Configure rate limits for the scheduler. rate-limit is a map with three possible keys: :user-limit-per-m , :job-submission, :expire-minutes. These keys control rate limits.

  • :user-limit-per-m is the max REST requests a single user can send in a minute. The default is 600 requests.

  • :job-submission is a dictionary with a token bucket filter configuration that allows per-user restriction of the job submission rate.

  • :expire-minutes For our token-bucket-filter rate limits, we will create one token bucket filter rate limit object for each user. If a user has become very idle, we expire old unused rate-limit entries after a time period. This should be set to several hours, or at least as high as (:bucket-size/:tokens-replenished-per-minute)

A token bucket filter configuration has 3 keys, :enforce, :tokens-replenished-per-minute and :bucket-size. Tokens are added to a bucket of maximum size :bucket-size at a rate of :tokens-replenished-per-minute, :enforce is used to decide whether we reject requests that violate this rate limit. Even if enforcement is off violations of the rate limit are logged. See https://en.wikipedia.org/wiki/Token_bucket.

:agent-query-cache

Configure the cache used to store sandbox locations of tasks on different mesos agents. agent-query-cache is a map with two possible keys, threshold and ttl. max-size is the maximum number of elements in the cache before the LRU eviction semantics apply. The default is 1000. ttl-ms is the default time in milliseconds that entries are allowed to reside in the cache. The default is 60000, i.e. 1 minute.

:exit-code-syncer

The Cook scheduler throttles the rate at which it publishes task exit-codes. This allows us to handle high rate of incoming exit-code messages in a graceful manner. exit-code-syncer is a map with the following possible keys: publish-batch-size and publish-interval-ms. The publish-batch-size is an integer representing the number of facts that are updated in individual datomic instance exit-code directory update transactions. The default value is 100. The publish-interval-ms is an integer representing the number of millisecond intervals at which exit-code directories updates will be published to datomic. The default value is 2500.

:sandbox-syncer

The Cook scheduler throttles the rate at which it publishes task sandbox directories. This allows us to handle high rate of incoming progress messages in a graceful manner. sandbox-syncer is a map with the following possible keys: max-consecutive-sync-failure, publish-batch-size, publish-interval-ms and sync-interval-ms. The max-consecutive-sync-failure represents the maximum number of failures before sandbox sync is not retried on that agent. The default value is 15. The publish-batch-size is an integer representing the number of facts that are updated in individual datomic instance sandbox directory update transactions. The default value is 100. The publish-interval-ms is an integer representing the number of millisecond intervals at which sandbox directories updates will be published to datomic. The default value is 2500. The sync-interval-ms represents the intervals at which the sandbox syncer triggers state lookup on pending mesos agents. This value should ideally be lower than the agent-query-cache ttl-ms. The default value is 15000, i.e. 15 seconds.

Mesos Configuration

Mesos configuration is specified as a map, because there are several properties that can be configured about the way Cook connects to Mesos. We’ll look at the configurable options in turn:

:master

This option sets the Mesos master connection string. For example, if you are running Mesos with a Zookeeper node on the local machine (a common development setup), you’d use the connection string zk://localhost:2181/mesos.

:failover-timeout-ms

This options sets the number of milliseconds that Mesos will wait for the Cook framework to reconnect. In development, you should set this to nil, which means that Mesos will treat any disconnection of Cook as the framework ending; this will kill all of Cook’s tasks when it disconnects. In production, it’s recommended to set this to 1209600000, which is 2 weeks in milliseconds. This means that when the Cook scheduler goes down, you have 2 weeks to reconnect a new instance, during which no tasks will be forcibly killed. Typically, however, you’ll only wait 10-30 seconds for reconnection, since Cook is usually run with hot standbys.

:leader-path

This configures the path that Cook will use for its high-availibility configuration. The Zookeeper quorum is the one configured in the top-level :zookeeper option. As long as the Zookeeper quorum and :leader-path are the same, then multiple instances of Cook will be able to synchronize, perform leader election, and perform framework recovery and failover automatically. For a production deployment, you can just run two or three copies of Cook on different hosts, and even if a host fails, Cook won’t be affected.

:principal

This sets the principal that Cook will connect to Mesos with. You can omit this property unless you’ve enabled security features with Mesos. The value here should match with authorized principals in register_frameworks Action in Mesos Authorization file. See http://mesos.apache.org/documentation/latest/authorization/ for details.

:role

This sets the role that Cook will connect to Mesos with. Default: * You can omit this property unless you’ve enabled security features in Mesos. The value should be in authorized list for the current :principal in register_frameworks Action in Mesos Authorization file. See http://mesos.apache.org/documentation/latest/authorization/ for details.

:run-as-user

When configured, this sets the user that Cook will override and set as the user to run tasks on Mesos with. You can omit this property, in which case the user configured in the job will be used as the user to run the job (the default behavior). Cook’s scheduling algorithm continues to use the user specified in the job to compute job schedules.

:framework-name

This sets part of the name of the framework that Cook will register to Mesos. Default: Cook When connecting to Mesos, Cook will use a framework name like "YourFrameworkName-e254483". It will append the current git hash to the value you specify here.

:enable-gpu-support

This enables GPU support for Cook. It is a boolean value, with default value false. This property will only work with Mesos 1.0 and above, since that’s when GPU support was added. If you enable this on an earlier version of Mesos, Cook will fail to start and print the error in the log. If you enable this and your cluster doesn’t have any GPU machines, Cook will accept GPU jobs, but they’ll never be scheduled. See https://github.com/apache/mesos/blob/master/docs/gpu-support.md for details on configuring the agents, installing external NVidia dependencies, and configuring Docker/GPU integration.

:leader-reports-unhealthy

This configures whether or not the leader reports his status as healthy by returning 200 from the /debug endpoint. This can be used to isolate the leader from query load. If set to true, the leader will return 503 on the /debug endpoint. If set to false, the leader will return 200 on the /debug endpoint. The default value is false.

Authorization Configuration

One of Cook’s most valuable features is its fair-sharing of a cluster. But how does Cook know who submitted which jobs? Every request to Cook’s REST API is authenticated, so that we know which user is making the request. Keep in mind that the username used for authentication is also the username that Cook will run the job as, so make sure that user exists on your Mesos slaves. We’ll look at the three authentication mechanisms supported:

:one-user

When doing development with Cook, it’s nice to be able to use it without any authentication. You can have Cook treat every request as coming from a specific user $USER by configuring the :authorization like so:

{
 ; ... snip ...
 :authorization {:one-user "$USER"}
 ; ... snip ...
}
:http-basic

Most organizations will want to use HTTP Basic authentication. Cook allows you to configure how the user name and password are configured. Currently, Cook supports specifying the logins in the config file or using no validation This also makes it super easy to get started: to use HTTP Basic, simply use {:http-basic true} as your :authorization. This will use no validation. To use config-file validation, set :authorization to: {:http-basic {:validation :config-file :valid-logins #{["user" "password"] ["user2" "password2"]}}}

:kerberos

If you have Kerberos at your organization, then you can use it to authenticate users with Cook. To use Kerberos, simply use {:kerberos true} as your :authorization.

Scheduler Knobs

The Cook scheduler comes with a few knobs to tune its behavior under the :scheduler key.

:offer-incubate-ms

This option configures how long Cook will hold onto offers, in order to try to coalesce offers and find better placements for tasks. We recommend setting this to 15000. If you set this to zero, Cook might not be able to find sufficiently large offers for tasks if you’re running other frameworks on your Mesos cluster at the same time.

:mea-culpa-failure-limit

When an instance fails, it can be for a variety of reasons. Some of these are considered "mea culpa reasons", meaning that Cook itself may be to blame for the failure, and in these cases, a certain number of these failures won’t count against the job’s retry limit. For example, if Cook pre-empts a task, the task will fail, but this won’t count against the retry limit. However, if a task fails for the same reason more than a certain number of times (which you can specify using this configuration setting), the excess failures WILL start to count against the job’s retry limit.
mea-culpa-failure-limit should be a map. The keys of the map should correspond to names of individual mea-culpa failure reasons (e.g. :preempted-by-rebalancer). Each value refers to a number of task failures for the specified reason that can occur occur before subsequent failures begin to count against the job’s retry-limit.
The value associated with key :default will apply to any mea-culpa failure reasons that aren’t mentioned by name.
To enable infinite failures for a given failure reason, set its value to -1.

Example:

:mea-culpa-failure-limit {:default 5
                          :mesos-master-disconnected 8
                          :preempted-by-rebalancer -1}
:fenzo-max-jobs-considered

This controls the number of jobs (ranked in Cook priority order) Fenzo will be able to see when placing jobs on Mesos Agents. Raising this number gives Fenzo more freedom to apply constraints for the purpose of optimization, but may also make it more likely to schedule jobs Cook wouldn’t consider of the highest priority. Default is 1000.

:fenzo-scaleback

If Fenzo fails to place Cook’s most desirable job, Cook will start to limit the number of jobs Fenzo can see until that most desirable job is matched by Fenzo. This number is the factor by which the number of Jobs Fenzo can see is reduced on each iteration which fails to match the most desirable job. Eventually, if the job is NEVER matched, Cook will reduce the number of Jobs Fenzo can see to 1, meaning that Fenzo will ONLY be able to see the most desirable job. Default is 0.95.

:fenzo-floor-iterations-before-warn

If Cook has been allowing Fenzo to see only 1 job for this number of iterations, warning messages will start to appear in the logs. Default is 10.

:fenzo-floor-iterations-before-reset

If Cook has been allowing Fenzo to see only 1 job for this number of iterations, it measn that the cluster is essentially down. In this case, Cook will log an error message and then reset the number of jobs Fenzo can see to the value of "fenzo-max-jobs-considered" (see above).

:fenzo-fitness-calculator

By default, Cook will have Fenzo attempt to bin-pack using a combination of memory and CPU when choosing which hosts will field which tasks. By choosing a different option in fenzo-fitness-calculator, you can specify that Fenzo should use a different implemention of VMTaskFitnessCalculator. This value can either refer to a static member of a Java class on the classpath (e.g. "com.netflix.fenzo.plugins.BinPackingFitnessCalculators/cpuMemBinPacker", the default), or a namespaced clojure symbol (e.g. "cook.mesos.scheduler/dummy-fitness-calculator")

:task-constraints

This option is a map that allows you to configure limits for tasks, to ensure that impossible-to-schedule tasks and tasks that run forever won’t bog down your cluster. It currently supports 4 parameters to defend the Cook scheduler, which are described in Task Constraints.

:estimated-completion-constraint

This allows you to configure an optional constraint which will not launch jobs on VMs where the job is expected to run longer than the host’s expected lifetime (for instance, public cloud spot VMs.) The configuration parameters are described in Estimated Completion Constraint.

Rebalancer configuration

Optionally, you can include a "rebalancer" stanza. If you do, on startup, Cook will update its Rebalancer configuration to match the values you specify here.

:interval-seconds

How often to rebalance the cluster for fairness between users. Default is 300 (5 minutes).

:safe-dru-threshold

See the Rebalancer documentation

:min-dru-diff

See the Rebalancer documentation

:max-preemption

See the Rebalancer documentation

:dru-scale

This is only used to control the metrics reporting of DRU values. On some clusters, the DRU’s may be so small that when the values are fed to clj-metrics, they are treated as 0, which makes it impossible to glean insights into the DRU’s in play, in order to set rebalancer parameters. If you find that this is true on your cluster, it is likely that the user shares are set to a very high value, perhaps the default of Integer.MAX_VALUE. To obtain useful DRU metrics in this situation, you can either adjust your share settings (recommended), or increase the dru-scale setting to e.g. 10^300.

Optimizer configuration

The optimizer is a component that can make more global decisions about the cluster, job placement and autoscaling. By default, Cook will use a no-op optimizer. To plug in an implementation, add the optional "optimizer" stanza. The two pluggable pieces, host-feed and optimizer, each have the same structure to configure. A :create-fn:: key which is a namespaced symbol on the class path and a :config:: key which is an arbitrary map. See the example below.

:host-feed

The implementation for the host feed to use. The host feed returns a list of maps where each map describes a host type that can be purchased. Each map should include the following keys:

* :count

The number of hosts available of that type. Should be a non-negative integer

* :instance-type

The name of the host type. Should be a string

* :cpus

The number of cpus available for the particular host type. Should be a postive number

* :mem

The amount of memory available for the host type, in MB. Should be a positive number

* :gpu

(Optional) The number of GPUs available for the host type. Should be a positive number

:optimizer

The implementation for the optimizer to use. The optimizer accepts the output of the host feed, the queue, the running tasks and the available offers and outputs a schedule of suggestions.

A schedule is a map from T milliseconds in the future to a map of optimizer recommendations. Each recommendation map can contain multiple keys, currently, only one, :suggested-matches. The value of :suggested-matches is a map from a host type map described above to a list of job uuids that the optimizer recommends matching onto that host type at this point in the future.

:optimizer-interval-seconds

The interval to run the optimizer, in seconds. The default is 30.

Example optimizer config:

+

{
 ; ... snip ...
 :optimizer {:host-feed {:create-fn cook.mesos.optimizer/create-dummy-host-feed
                         :config {}}
             :optimizer {:create-fn cook.mesos.optimizer/create-dummy-optimizer
                         :config {}}
             :optimizer-interval-seconds 30}
 ; ... snip ...
}

Task Constraints

:timeout-hours

This specifies the max time that a task is allowed to run for. Any tasks running for longer than this will be automatically killed.

:timeout-interval-minutes

This specifies how often to check for timed-out tasks. Since checking for timed-out tasks is linear in the number of running tasks, this can take a while. On the other hand, if your timeout is one hour, but you only check every 30 minutes, some tasks could end up running for almost one and a half hours!

:memory-gb

This specifies the max amount of memory that a task can request. You should make sure this is small enough that users can’t accidentally submit tasks that are too big for your slaves.

:cpus

This is just like :memory-gb, but for CPUs. You should make sure this is small enough that users can’t accidentally submit tasks that are too big for your slaves.

:retry-limit

This limits the number of retries a job is allowed to request. Something in the low tens is often more than sufficient.

Estimated Completion Constraint

:expected-runtime-multiplier

This is a configurable constant factor which will multiply the expected runtime on each job to compute when the job will complete. For instance, if estimated-runtime-multiplier is 2.5 and a job has an expected runtime of 60000 ms, it will not be scheduled on a host which will die in 2.5 minutes.

:host-lifetime-mins

This is the expected lifetime of hosts in the cluster. To apply the constraint, each host should have a host-start-time attribute in it’s mesos offer which will be used with the host-lifetime-mins parameter to determine when the host is expected to die.

Cook Executor

The Cook executor is a custom executor written in Python. It is enabled when the :command option (see below) is configured to a non-empty string. When enabled, it replaces the default command executor in order to enable a number of features for both operators and end users. Please see the Cook Executor README for more detailed information about the Cook executor.

An example configuration looks like:

{...
 :executor {:command "./cook-executor"
            :default-progress-regex-string "progress:\\s+([0-9]*\\.?[0-9]+)($|\\s+.*)"
            :environment {"EXECUTOR_DEFAULT_PROGRESS_OUTPUT_NAME" "stdout"}
            :log-level "INFO"
            :max-message-length 512
            :progress-sample-interval-ms 1000
            :uri {:cache true
                  :executable true
                  :extract false
                  :value "file:///path/to/cook-executor"}}
 ...}

The configuration values are defined as follows:

:command

A string containing the command executed on the mesos agent to launch the Cook executor. No default value is provided, when missing the use of the Cook executor is disabled.

:default-progress-regex-string

The string representation of the regex used to identify progress update messages. The regex should have one or two capture groups, the first being a number representing the progress percent. The second, when present, being a message about the progress. The executor will use the :max-message-length value to trim the progress message string before sending it to the scheduler. Defaults to "progress:\\s+(\\.?)($|\\s.)".

:environment

A map that represents the additional environment variables passed on to the executor. The default is an empty map.

:log-level

The log level for the executor process. Defaults to "INFO".

:max-message-length

The maximum length for the unencoded string messages sent from a task via the Mesos executor HTTP API. The default is 512.

:progress-sample-interval-ms

The interval in ms after which to send progress updates. Care should be taken to avoid setting this value too low as it can end up causing a high rate of message transfer between the executor and the scheduler. The default is (1000 * 60 * 5), i.e. 5 minutes.

:uri

A description of the uri used to download the executor executable. The default is an empty map, i.e. no executable to download. The uri structure is defined below:

key

type

description

:cache

boolean

Mesos 0.23 and later only: should the URI be cached in the fetcher cache?

:executable

boolean

Should the URI have the executable bit set after download?

:extract

boolean

Should the URI be extracted (must be a tar.gz, zipfile, or similar).

:value

string

The URI to fetch. Supports everything the Mesos fetcher supports, i.e. http://, https://, ftp://, file://, hdfs://

Progress Configuration

The Cook scheduler throttles the rate at which it publishes progress updates from the Cook executor. This allows us to handle high rate of incoming progress messages in a graceful manner. This also protects the scheduler against potentially bad executors that are sending progress messages at a high rate.

An example configuration looks like:

{...
 :progress {:batch-size 100
            :pending-threshold 4000
            :publish-interval-ms 2500
            :sequence-cache-threshold 1000}
 ...}

The configuration values are defined as follows:

:batch-size

An integer representing the number of facts that are updated in individual datomic instance progress update transactions. The default value is 100.

:pending-threshold

An integer representing the maximum number of instances whose pending progress states will be stored in memory. Additional messages (either in the queue or while building the in-memory state) will be dropped. The default value is 4000.

:publish-interval-ms

An integer representing the number of millisecond intervals at which progress updates will be published to datomic. The default value is 2500.

:sequence-cache-threshold

An integer representing the max number of items in the task sequence cache. This cache is used to track the latest sequence number of progress message processed for a given task. In order to avoid the potential for out-of-order progress updates, this cache should be sized to handle the maximum number of active tasks that are reporting progress. The default value is 1000.

Debugging Facilities

Cook is designed to be easy to debug and monitor. We’ll look at the various monitoring and debugging subconfigs:

:metrics

This map configures where and how to report Cook’s internal scheduling and performance metrics. See Metrics for details.

:nrepl

Cook can start an embedded nREPL server. nREPL allows you to log into the Cook server and inspect and modify the code while it’s running. This should not be enabled on untrusted networks, as anyone who connects via nREPL can bypass all of Cook’s security mechanisms. This is really useful for development, though! See nREPL for details.

:log

This section configures Cook’s logging. See Logging for details.

:unhandled-exceptions

This map configures what Cook’s behavior should be when it encounters an exception that doesn’t already have code implemented to handle it. See Unhandled Exceptions for how to configure.

Metrics

Cook can transmit its internal metrics over a variety of transports, such as JMX, Graphite, and Riemann. Internally, Cook uses Dropwizard Metrics 3, so we can easily add support for any Metrics 3 compatible reporter.

JMX Metrics

To enable JMX metrics, set the :metrics key to {:jmx true}.

Graphite Metrics

To enable Graphite metrics, you’ll need to populate the :graphite map. We support setting a prefix on all metrics, choosing which graphite server to connect to, and whether to use the plain-text or pickled transport format. Here’s an example of enabling Graphite metrics:

:metrics {:graphite {:host "my-graphite-server.example.com"
                     :port 2003
                     :prefix "cook"
                     :pickled? false ; defaults to true
                    }}

Also, keep in mind that you can enable multiple metrics reporters simultaneously, if that’s useful in your environment. For example, you could use JMX and graphite together:

:metrics {:graphite {:host "my-graphite-server.example.com"
                     :port 2003
                     :prefix "cook"}
          :jmx true}

nREPL

The :nrepl key takes a map that supports two options:

:enabled?

Set this to true if you’d like to start the embedded nREPL server.

:port

Set this the to the port number you’d like the nREPL server to bind to. You must choose a port to enable nREPL.

Logging

Cook’s logging is configured under :log. Cook automatically rotates its logs daily, and includes information about package, namespace, thread, and the time for every log message.

:file

You must choose a file location for Cook to write its log. It’s strongly recommended to specify a log file under a folder, e.g. log/cook.log, since Cook will rotate the log files by appending .YYYY-MM-dd to the specified path. The path can be relative (from the directory you launch Cook) or absolute.

:levels

You can also specify log levels to increase or decrease verbosity of various components of Cook and libraries it uses. We’ll look at an example, which sets the default logging level to :info, but sets a few Datomic namespaces to use the :warn level. This also happens to be the recommended logging configuration:

:levels {"datomic.db" :warn
         "datomic.peer" :warn
         "datomic.kv-cluster" :warn
         :default :info}}

As you can see, specific packages and namespaces are specified by strings as the map’s keys; their values specify their log level override.

Unhandled Exceptions

Everyone makes mistakes. We’d like to know when errors happen that we didn’t anticipate. That’s what the :unhandled-exceptions key is for. Let’s look at what options it takes:

:log-level

This lets you choose the level to log unhanded error at. Usually :error is the right choice, although you may want to log these at the :fatal level.

:email

You can also choose to receive emails when an unhandled exception occurs. This key takes a map that it uses as a template for the email. Cook uses postal to send email. For advanced configuration, check out the postal’s documentation. Cook will append details to whatever subject line you provide, and it will fill in the body with the stacktrace, thread, and other useful info. Here’s a simple example of setting up email:

:email {:to ["admin@example.com"]
        :from "cook@example.com"
        :subject "Unhandled exception in cook"}

Production JVM Options

It can be intimidating to choose JVM options to enable Cook to run with high performance—​what GC to use, how much heap, which Datomic options? Here’s a table with some options that should work for a cluster with thousands of machines:

Table 1. Cook JVM Options Recommendations for Large Clusters
Options Reasoning

-XX:UseG1GC

This enables the low-pause collector, which gives better API latency characteristics

-XX:MaxGCPauseMillis=50

This means that the JVM will target to never stop the world for move than 50ms

-Ddatomic.readConcurrency=10

Increase datomic read rate to improve table scans

-Ddatomic.writeConcurrency=10

Balance the writes with the read rate for faster job updates

-Ddatomic.memIndexThreshold=256m

This allows Datomic to index much less often

-Ddatomic.memIndexMax=512m

This allows Datomic to accept writes during slow indexing jobs for longer

-Ddatomic.txTimeoutMsec=60000

Sometimes, we generate big and bad transactions—​this helps us to not die

-Ddatomic.peerConnectionTTLMsec=35000

This helps to deal with slow peers

-Ddatomic.objectCacheMax=2g

This accelerates queries by caching a lot of data in memory

-Xmx12g

Set the heap to use 12GB

-Xms12g

Don’t bother scaling the heap up—​just force it to start at full size

License

© Two Sigma Open Source, LLC