
Commit 92e3912
More conceptual documentation
tailhook committed Sep 10, 2015
1 parent 96df1b5 commit 92e3912
Showing 2 changed files with 181 additions and 29 deletions.
208 changes: 180 additions & 28 deletions doc/concepts.rst
@@ -6,7 +6,7 @@ Verwalter is a tool that manages a cluster of services.

Briefly, verwalter provides the following:

* Cluster-wide resource management that is scriptable by lua_
* Limited service discovery

It builds on top of lithos_ (which is isolation, containerization and
@@ -124,11 +124,12 @@ having two leaders is not a problem when used wisely.
The Missing Parts
-----------------

In the current implementation the missing piece of the puzzle is a means to
deliver files to each box. In particular, the following files might need to be
distributed between nodes:

1. Images of containers for lithos
2. Verwalter's configs and configuration templates

We use ansible_ and good old rsync_ for these things for now.

@@ -184,35 +185,178 @@ instance**. The information consists of:

* list of peers in the cluster
* availability of the nodes (i.e. time of last successful ping)
* some minor useful info like round trip time (RTT) between nodes

Verwalter delegates all the work of joining the cluster to cantal.

As described above, verwalter operates in one of two modes: leader and
follower. It starts as a follower and waits until it is reached by a leader.
The leader in turn discovers followers through cantal, i.e. it assumes that
every cantal that joins the cluster has a verwalter instance.

.. note::

   While cantal is joining the cluster and verwalter does its own bootstrapping
   and possible leader election, lithos continues to run. This means that if
   there was any configuration for lithos before a reboot of the system, or
   before you do any maintenance of verwalter/consul, the processes are started
   and supervised. Any processes that crash are restarted, and so on.

   In case you don't want processes to start on boot, you may configure the
   system to clean lithos configs on reboot (for example by putting them on a
   ``tmpfs`` filesystem). This is occasionally useful, but we consider the
   default behaviour of starting all processes that were previously running
   more useful in most cases.


Leader's Job
------------

When a verwalter follower is not reached by a leader for a predefined time (no
matter whether it's on startup or after it had a leader), it starts an election
process. The election process is not described in detail here, because it's a
work in progress. It will be described later in other parts of the
documentation.

When verwalter is elected as a leader:

1. It connects to every node, and ensures that every follower knows the leader
2. After establishing connections it gathers the configuration of all currently
   running processes on every node
3. It connects to the local cantal and requests statistics for all the nodes
4. Then it runs the scheduling algorithm, which produces a new configuration
   for every node
5. At the next step it delivers the configuration to the respective nodes
6. Repeat from step 3 at regular intervals (~10 sec)

In fact, steps 1-3 are done simultaneously. As outlined in the
`cantal documentation`_, cantal gathers and aggregates metrics by itself,
easing the work for verwalter.

Note that at the moment when a new leader is elected, the previous one is
probably not accessible (or there were two of them, so no shared consistent
configuration exists). So it's important to gather all current node
configurations to keep the number of reallocations/movements of processes
between machines at a minimum. It also allows having persistent processes
(i.e. processes which store data on the local filesystem or in local memory,
for example database shards).

Having not only the old configuration but also statistics is very important; we
can use them for the following things:

1. Detect failing processes
2. Find out the number of requests that are processed per second
3. Predict trends, i.e. whether traffic is going up or down

All this info is gathered continuously and asynchronously. Nodes come and leave
at every occasion, so it's too complex to reason about them in a reactive
manner. So from the SysOp's point of view the scheduler is a pure function from
{*set of currently running processes*; *set of metrics*} to the new
configuration. Verwalter itself does all the heavy lifting of keeping all nodes
in contact, synchronizing changes, etc.

The input to the function in simplified human-readable form looks like the
following::

    box1  django: 3 running, 10 requests per second and growing; 80% CPU usage
    box2  flask: 1 running, 7 RPS and declining; django: 2 starting; 20% CPU
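
For illustration only, the same input could be represented as Lua tables along
these lines (the exact field names here are hypothetical, not verwalter's
actual API):

.. code-block:: lua

   -- Hypothetical shape of the scheduler input; names are illustrative.
   local processes = {
       box1 = { django = { running = 3 } },
       box2 = { flask = { running = 1 }, django = { starting = 2 } },
   }
   local metrics = {
       box1 = { django = { rps = 10, trend = "growing" }, cpu_usage = 0.8 },
       box2 = { flask = { rps = 7, trend = "declining" }, cpu_usage = 0.2 },
   }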

In lua code the function looks like this (simplified):

.. code-block:: lua

   function scheduler(processes, metrics)
       ...
       return config
   end

Furthermore, we have helper utilities to actually keep matching processes
running. So in many simple cases the scheduler may just return the number of
processes it wants to run. In simplified form it looks like this:

.. code-block:: lua

   function schedule_simple(metrics)
       local cfg = {
           django_workers = metrics.django.rps / DJANGO_WORKER_CAPACITY,
           flask_workers = metrics.flask.rps / FLASK_WORKER_CAPACITY,
       }
       local total = cfg.django_workers + cfg.flask_workers
       if total > MAX_WORKERS then
           -- not enough capacity, but do our best
           cfg = distribute_fairly(cfg)
       else
           -- have some spare capacity for background tasks
           cfg.background_workers = MAX_WORKERS - total
       end
       return cfg
   end

   make_scheduler(schedule_simple, {
       worker_grow_rate = '5 processes per second',  -- start processes quickly
       worker_decline_rate = '1 process per second', -- but stop at a slower rate
   })

Of course, the example is oversimplified; it's only here to give some spirit of
what scheduling might look like.
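
The ``distribute_fairly`` helper above is left undefined in the example. A
minimal sketch of one possible implementation, assuming it simply scales the
requested worker counts down proportionally so that they fit into
``MAX_WORKERS``, might look like this:

.. code-block:: lua

   -- Illustrative only: scale the requested worker counts down proportionally
   -- so that the total fits into MAX_WORKERS.
   function distribute_fairly(cfg)
       local requested = 0
       for _, count in pairs(cfg) do
           requested = requested + count
       end
       local scale = MAX_WORKERS / requested
       local result = {}
       for name, count in pairs(cfg) do
           result[name] = math.floor(count * scale)
       end
       return result
   end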

By using a proper lua sandbox we ensure that the function is *pure* (has no
side effects), so if you need some external data it must be provided to cantal
or verwalter by implementing their API. In the lua script we do our best to
ensure that the function is idempotent, so we can log all the data and the
resulting configuration for **post mortem debugging**.
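
As an illustration only (this is not verwalter's actual sandboxing mechanism;
``run_sandboxed`` and ``SAFE_ENV`` are invented names), a scheduler chunk could
be evaluated with a restricted environment roughly like this, assuming Lua 5.2
or later:

.. code-block:: lua

   -- Illustration only: load a scheduler chunk with a restricted environment,
   -- so it cannot reach io/os and therefore cannot have side effects.
   local SAFE_ENV = {
       math = math, string = string, table = table,
       pairs = pairs, ipairs = ipairs,
   }

   local function run_sandboxed(scheduler_source, processes, metrics)
       local chunk = assert(load(scheduler_source, "scheduler", "t", SAFE_ENV))
       local scheduler = chunk()  -- the chunk is expected to return a function
       return scheduler(processes, metrics)
   end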

This also allows us to make "shadow" schedulers, i.e. ones that have no real
scheduling abilities but are evaluated on every occasion. This might be useful
for evaluating a new scheduling algorithm before putting it in production.

.. _`cantal documentation`: http://cantal.readthedocs.org/en/latest/concepts.html#aggregated-metrics

Follower's Job
--------------

The follower is much simpler. When leadership is established, it receives
configuration updates from the leader. A configuration may consist of:

1. Application name and number of processes to run
2. Host name to IP address mapping to provide for an application
3. Arbitrary key-value pairs that are needed for configuring the application
4. (Parts of) configurations of other nodes

Note that items (1), (4) and partially (3) provide the **limited form of
service discovery** that was declared at the start of this guide. Item (2) is
there mostly for legacy applications which do not support service discovery.
Item (4) is mostly for proxy servers that need a list of backends, instead of
having backends discover them by host name.
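
For illustration, a single configuration update received by a follower could be
shaped roughly like the Lua table below. The field names are invented for this
sketch and are not verwalter's actual schema:

.. code-block:: lua

   -- Hypothetical shape of one configuration update (names are illustrative).
   local config_update = {
       -- (1) application names and number of processes to run
       applications = { django = { instances = 3 }, flask = { instances = 1 } },
       -- (2) host name to IP address mapping for legacy applications
       hosts = { ["db.internal"] = "10.0.0.12" },
       -- (3) arbitrary key-value pairs used to configure the application
       settings = { request_timeout = 30 },
       -- (4) (parts of) other nodes' configurations, e.g. a proxy's backends
       peers = { box2 = { flask = { instances = 1 } } },
   }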


.. note:: We use an extremely loose description of "legacy" here, because even
   in 2015 most services don't support service discovery out of the box and
   most proxies have a list of backends in the config. This means not just old
   services that are still widely used, but also services created in recent
   years. That is a problem on its own, but not the one verwalter aims to
   solve; it is just designed to work with both modern and old-style services.

Every configuration update is applied by verwalter locally. In the simplest
form this means the following (a rough sketch follows the list):

1. Render textual templates into temporary file(s)
2. Run the configuration checker for the application
3. Atomically move the configuration file or directory to the right place
4. Signal the application to reload its configuration
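
As a rough sketch only (the paths, the check command and the reload command
below are invented for a hypothetical nginx-style service; this is not
verwalter's implementation), the four steps could look like this:

.. code-block:: lua

   -- Rough illustration of the four steps; assumes Lua 5.2+ so that
   -- os.execute() returns true on success.
   local function apply_config(rendered_text)
       local tmp_path = "/etc/nginx/nginx.conf.tmp"
       local final_path = "/etc/nginx/nginx.conf"

       -- 1. render the template into a temporary file
       local f = assert(io.open(tmp_path, "w"))
       f:write(rendered_text)
       f:close()

       -- 2. run the application's configuration checker
       assert(os.execute("nginx -t -c " .. tmp_path))

       -- 3. atomically move the file into place (same filesystem)
       assert(os.rename(tmp_path, final_path))

       -- 4. signal the application to reload its configuration
       assert(os.execute("systemctl reload nginx"))
   end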

For some applications it might be more complex. For lithos, which is the most
common configuration target for verwalter, it's just a matter of writing a
YAML/JSON config to a temporary location and calling the ``lithos_switch``
utility.

.. note:: We're still evaluating whether it's a good idea to support plugins
   for verwalter for complicated configuration scenarios, or whether files are
   a universal transport and you should just implement a separate daemon if you
   need something out of scope. The common case might be making API calls
   instead of reloading configuration, as you might need for docker or any
   cloud provider. Lua scripting at this stage is also an option being
   considered.


Cross Data Center
@@ -228,24 +372,23 @@
The cross data center connection scheme


When crossing data center boundaries, things start to be more complicated. In
particular, verwalter assumes:

1. Links between data centers are an order of magnitude slower than links
   inside one (normal RTT between nodes inside a datacenter is 1 ms, whereas
   between DCs even on the same continent it is 40 ms and sometimes up to
   120-500 ms). In some cases traffic is expensive.
2. The connection between datacenters is less reliable, and when it's down
   clients might be serviced by a single data center too. It should be possible
   to configure partial degradation.
3. There are few data centers (i.e. it's normal to have 100-1000 nodes,
   but almost nobody has more than a dozen DCs)

So verwalter establishes a leader inside every datacenter. On the
cross-data-center boundary all verwalter leaders are treated equally. They form
a full mesh of connections, and when one of them experiences peak load it just
requests some resources from the others.

Let's repeat that again: because verwalter is not a database, consistency is
not important here. I.e. if some resources are provided by DC1 for DC2 and for
@@ -257,6 +400,15 @@ appropriate metrics. So the dialog between data centers looks like the following:
:alt: a dialog between DC1 and DC2 where DC1 requests resources from DC2
:width: 800px

All things here are scriptable. So your logic may, for example, only move
background tasks across data centers, or use cloud APIs to request more virtual
machines.

.. note:: A quick note on the last sentence: you can't access a cloud API
   directly because of sandboxing. But you may produce a configuration for some
   imaginary *cloud provider management daemon* that includes a bigger value in
   the setting *number of virtual machines to provision*.
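
As a purely hypothetical sketch (the ``cloud_manager`` section,
``COMFORTABLE_RPS_PER_VM`` and the extra metrics fields are invented for
illustration; only ``schedule_simple`` comes from the earlier example), a
scheduler could emit such a setting like this:

.. code-block:: lua

   -- Hypothetical: the sandboxed scheduler cannot call a cloud API, so it
   -- only emits a larger target for an imaginary provisioning daemon.
   function schedule_with_autoscale(metrics)
       local cfg = schedule_simple(metrics)
       if metrics.total_rps > COMFORTABLE_RPS_PER_VM * metrics.current_vms then
           cfg.cloud_manager = {
               virtual_machines_to_provision = metrics.current_vms + 1,
           }
       end
       return cfg
   end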


.. _lithos: http://github.com/tailhook/lithos
.. _cantal: http://cantal.readthedocs.org
.. _lua: http://lua.org
2 changes: 1 addition & 1 deletion doc/pic/cross-dc-dialog.svg
(SVG file; diff not displayed)

