docs/cloudtask.txt

=========
CloudTask
=========

----------------------------------------------------
Automatic parallelization of batch jobs in the cloud
----------------------------------------------------

:Author: Liraz Siri <liraz@turnkeylinux.org>
:Date:   2011-07-31
:Manual section: 8
:Manual group: misc

SYNOPSIS
========

cloudtask [ -opts ] [ command ]

cloudtask [ -opts ] --resume=SESSION_ID

ARGUMENTS
=========

`command` := a shell command to execute. Job inputs are appended as
arguments to this command.

OPTIONS
=======

--force                 
  Don't ask for confirmation

--hub-apikey=APIKEY     
  Hub API KEY (required if launching workers)

--ec2-region=REGION     
  Region for instance launch (default: us-east-1)

  Regions::

      us-east-1 (Virginia, USA)
      us-west-1 (California, USA)
      eu-west-1 (Ireland, Europe)
      ap-southeast-1 (Singapore, Asia)

--ec2-size=SIZE         
  Instance launch size (default: m1.small)

  Sizes::

      t1.micro (1 CPU core, 613M RAM, no tmp storage)
      m1.small (1 CPU core, 1.7G RAM, 160G tmp storage)
      c1.medium (2 CPU cores, 1.7G RAM, 350G tmp storage)

--ec2-type=TYPE         
  Instance launch type <s3|ebs> (default: s3)

--sessions=PATH         
  Path where sessions are stored (default: $HOME/.cloudtask)

--user=USERNAME         
  Username to execute commands as (default: root)

--pre=COMMAND          
  Worker setup command

--post=COMMAND         
  Worker cleanup command

--overlay=PATH      
  Path to worker filesystem overlay

--timeout=SECONDS      
  How many seconds to wait before giving up

--split=NUM        
  Number of workers to execute jobs in parallel

--workers=ADDRESSES      
  List of pre-allocated workers to use

                path/to/file | host-1 ... host-N

--report=HOOK       
  Task reporting hook, examples::

    sh: command || py: file || py: code

    mail: from@foo.com to@bar.com

DESCRIPTION
===========

Batch remote execution with automatic cloud server allocation.

Background
----------

Many batch tasks can be easily broken down into units of work that can
be executed in parallel. On cloud services such as Amazon EC2, running a
single server for 100 hours costs the same as running 100 servers for 1
hour. In other words, for problems that can be parallelized this makes
it advantageous to distribute the batch on as many cloud servers as
required to finish execution of the job in just under an hour. To take
full advantage of these economics an easy to use, automatic system for
launching and destroying server instances reliably on demand and
distributing work amongst them is required. 

CloudTask solves this problem by automating the execution of a batch job
on remote worker servers over SSH. The user can split up the batch
amongst an arbitrary number of servers to speed up execution time.
CloudTask can automatically allocate (and later destroy) EC2 cloud
servers as required, or the user may provide a list of suitably
configured "persistent" servers.

Terms and definitions
---------------------

* Task: a sequence of jobs

* Job: a shell command representing an atomic unit of work

* Task template: a pre-configured task.

* Session: the state of a task run at a particular time. This includes
  the task configuration, the status of jobs that have finished
  executing, and a list of jobs still pending execution.
  
* Split: the number of workers the task jobs of task are split amongst.

* Worker: a server running SSH a job on which we execute jobs. This can
  be a persistent server or a dynamically allocated EC2 cloud server
  instance.

* TurnKey Hub: a web service that cloudtask may use to launch and
  destroy TurnKey servers preconfigured to perform a given task.

Features
--------

* Jobs are just simple shell commands executed remotely: there is no
  special API. Shell commands are well understood, language agnostic and
  easy to test and develop.

* Ad-hoc task configuration via command line options / environment:
  cloudtask can be used directly from the command line, which is useful
  for one-off tasks, or for experimenting/debugging a new routine
  task.

* Pre-configured task templates: the configuration parameters for
  routine tasks can be embedded within a pre-configured task template,
  which is itself executable just like cloudtask, and inherits its
  interface.

  Under the hood a task template is implemented by defining a Python
  class that inherits Task::

        #!/usr/bin/python

        from cloudtask import Task

        class HelloWorld(Task):
            DESCRIPTION = "This is a hello world cloudtask template"
            COMMAND = 'echo hello world'
            SPLIT = 2
            REPORT = 'mail: cloudtask@example.com liraz@example.com'

        HelloWorld.main()

* Transparent execution with real-time logging: cloudtask provides
  real-time logging to make it easy for the user to following the
  progress of a task. For example, the progress of any command executed
  over SSH can be followed by tailing the worker's session log::

    cd ~/.cloudtask/$session_id/workers/
    tail -f 1234
  
* Fault tolerance: cloudtask is designed to reliably survive multiple
  types of failure. For example:
  
  - worker servers are continually monitored for failure so that a job
    executing on a failed server may be rerouted to a working server. A
    task will continue executing so long as a single worker survives. 
     
  - the user can specify a per-job timeout so that jobs that freeze up
    for whatever reason will time out gracefully without jamming upt he
    worker indefinitely.

  - In case of Hub API failure cloudtask will wait a few seconds and try
    again.

* Abort and resume capability: a task can be aborted at any time by
  pressing Ctrl-C, or sending the TERM signal to the main process.
  After all automatically launched server instances are destroyed, the
  state of the session is saved so that it may be resumed later from
  where it left off.

* Reporting hook: when the execution of a session finishes a reporting
  hook may be configured to perform an arbitrary action (e.g., sending
  a notification e-mail, updating a database, etc.). Three types of
  reporting handlers are currently supported:

  1) `mail`: send out an e-mail with the session log to one or more
     recipients.

  2) `sh`: execute a shell command. The current working directory is set
     to the session path and the environment is populated with the
     session context. 

  3) `py`: execute an arbitrary snippet of Python code. The session and
     task configuration are accessible as local variables.

Example usage scenario
----------------------

Alon wants to refresh all TurnKey Linux appliances with the latest
security updates. 

He writes a script which accepts the name of an appliance as an
argument, downloads the latest version from Sourceforge, extracts the
root filesystem, installs the security updates, repackages the root
filesystem into an appliance ISO and uploads a new version of the
appliance back to Sourceforge.

After testing the script on his local Ubuntu workstation, he asks the
Hub to launch a new TurnKey Core instance (88.1.2.3), transfers his
script and installs whatever dependencies are required. Once everything
is tested to work, he creates a new TKLBAM backup with captures the
state of his master worker server.

Alon runs his first cloudtask test::

    echo core | cloudtask --workers=88.1.2.3 refresh-iso-security-updates

Once he confirms that this single test job worked correctly, he's ready
for the big batch job that will run on 10 servers in parallel.

Since this is a routine task Alon expects to repeat regularly, he
creates a pre-configured cloudtask template for it in $HOME/cloudtasks::

    $ mkdir $HOME/cloudtasks
    $ cd $HOME/cloudtasks

    $ cat > refresh-iso << 'EOF'
    from cloudtask import Task

    class RefreshISO(Task):
        DESCRIPTION = "This task refreshes security updates on an ISO"
        COMMAND = 'refresh-iso-security-updates'
        SPLIT = 10
        REPORT = 'mail: cloudtask@example.com alon@example.com liraz@example.com'

        HUB_APIKEY = 'BRDUKK3WDXY3CFQ'

    RefreshISO.main()

    EOF

    $ chmod +x ./refresh-iso

    $ cat $PATH_LIST_APPLIANCES | ./refresh-iso
    About to launch 10 cloud servers to execute the following task:

      Parameter       Value
      ---------       -----

      jobs            40 (appengine .. zimbra)
      command         refresh-iso-security-updates
      hub-apikey      5YGVPEMHJHU5EA
      ec2-region      us-east-1
      ec2-size        m1.small
      ec2-type        s3
      user            root
      workers         -
      overlay         -
      post            -
      pre             -
      timeout         -
      report          mail: cloudtask@turnkeylinux.org liraz@turnkeylinux.org

    Is this really what you want? [yes/no] yes

    session 11 (pid 29709)
    88.178.132.231 (29721): launched new worker
    88.214.141.175 (29722): launched new worker
    88.15.179.7 (29724): launched new worker
    88.229.38.128 (29723): launched new worker

    ...

45 minutes later, Alon receives an e-mail from cloudtask that the job
has finished. In the body is the session log detailing if errors were
detected on any job (e.g., non-zero exitcode), how long the session took
to run, etc. 

Had he wanted to, Alon could have followed the execution of the task
jobs in real-time by tailing the worker log files::

    tail -f ~/.cloudtask/11/workers/29721

CONFIGURATION
=============

Any cloudtask configuration option that can be configured from the
command line may also be configured through a template default, or by
defining an environment variable.

Resolution order for options:
1) command line (highest precedence)
2) task-level default
3) CLOUDTASK_{PARAM_NAME} environment variable (lowest precedence)

For example, if you want to configure the ec2 region worker instances
are launched in, you can configure it as:

1) The --ec2-region command line option::

    $ cloudtask --ec2-region ap-southeast-1

2) By defining EC2_REGION in a task template::

    $ cat > foo.py << 'EOF'

    from cloudtask import Task

    class Foo(Task):
        EC2_REGION = 'ap-southeast-1'

    Foo.main()
    EOF

    $ chmod +x ./foo.py
    $ ./foo.py

3) By setting the CLOUDTASK_EC2_REGION environment variable::
    
    export CLOUDTASK_EC2_REGION=ap-southeast-1

EXPLORING CLOUDTASK
===================

Since launching and destroying cloud servers can take a few minutes, the
easiest way to explore and experiment with cloudtask is to run a local
ssh server::

    # you need root privileges to install SSH
    apt-get install openssh-server
    /etc/init.d/ssh start

Add your user's SSH key to root's authorized keys::

    ssh-copy-id root@localhost
    
Then run test tasks with the --workers=localhost option, like this::

    seq 10 | cloudtask --workers=localhost echo

BEST PRACTICES FOR PRODUCTION USE
=================================

For production use, it is recommended to create pre-configured templates
for routine jobs in a Git repository. Templates may inherit shared
definitions such as the Hub APIKEY or the reporting hook from a common
module::

    $ cat > common.py << 'EOF'
    from cloudtask import Task
    class BaseTask(Task):
        HUB_APIKEY = 'BRDUKK3WDXY3CFQ'
        REPORT = 'mail: cloudtask@example.com alon@example.com liraz@example.com'

        # save sessions in the local directory ratehr than
        # $HOME/.cloudtask. That way we can easily track the session
        # logs in Git too.
        SESSIONS = 'sessions/' 
    EOF

    $ cat > helloworld << 'EOF'
    from common import BaseTask
    class HelloWorld(BaseTask):
        COMMAND = 'echo hello world'

    HelloWorld.main()
    EOF

HOW IT WORKS
============

When the user executes a task, the following steps are performed:

1) A temporary SSH session key is created.

   The initial authentication to workers assumes you have set up an SSH
   agent or equivalent (cloudtask does not support password
   authentication). 
   
   The temporary session key will be added to the worker's authorized
   keys for the duration of the task run, and then removed. We need to
   authorize a temporary session key to ensure access to the workers
   without relying on the SSH agent.

2) Workers are allocated.

   Worker cloud servers are launched automatically by cloudtask to
   satisfy the requested split unless enough pre-allocated workers are
   provided via the --workers option.

   A TKLBAM backup id may be provided to install the required job
   execution dependencies (e.g., scripts, packages, etc.) on top of
   TurnKey Core.

3) Worker setup.

   After workers are allocated they are set up. The temporary session
   key is added to the authorized keys, the overlay is applied to the
   root filesystem (if the user has configured an overlay) and the pre
   command is executed (if the user has configured a pre command).

4) Job execution.

   CloudTask feeds a list of all jobs that make up the task into an
   job queue. Every remote worker has a local supervisor process which
   reads a job command from the queue and executes it over SSH on the
   worker.

   The job may time out before it has completed if a --timeout has been
   configured.

   While the job is executing, the supervising process will periodically
   check that the worker is still alive every 30 seconds if the job
   doesn't generate any console output. If a worker is no longer
   reachable, it is destroyed and the aborted job is put back into the
   job queue for execution by another worker.
   
5) Worker cleanup

   When there are no job commands left in the input Queue to provide a
   worker it is cleaned up by running the post command, removing the
   temporary session key from the authorized keys.

   If cloudtask launched the worker, it will also destroy it at this
   point to halt incremental usage fees.

6) Session reporting

   A reporting hook may be configured that performs an action once the
   session has finished executing. 3 types of reporting hooks are
   supported:

   1) mail: uses /usr/sbin/sendmail to send a simple unencrypted e-mail
      containing the session log in the body.

   2) sh: executes a shell command, with the task configuration embedded
      in the environment and the current working directory set to the
      session path. You can test the execution context like this::

        --report='sh: env && pwd'

   3) py: executes a Python code snippet with the session values set as
      local variables. You can test the execution context like this::

        --report='py: import pprint; pprint.pprint(locals())'
      

SEE ALSO
========

``cloudtask-launch-workers`` (8), ``cloudtask-destroy-workers`` (8)