flotilla-os

Introduction

Flotilla is a self-service framework that dramatically simplifies the process of defining and executing containerized jobs. This means you get to focus on the work you're doing rather than how to do it.

Once deployed, Flotilla allows you to:

  • Define containerized jobs by specifying exactly what command to run, what image to run that command in, and what resources that command needs to run
  • Run any previously defined job and access its logs, status, and exit code
  • View and edit job definitions with a flexible UI
  • Run jobs and view execution history and logs within the UI
  • Use the complete REST API for definitions, jobs, and logs to build your own custom workflows

Philosophy

Flotilla is strongly opinionated about self-service for data science.

The core assumption is that you understand your work the best. Therefore, it is you who should own your work end to end. In other words, you shouldn't need to be a "production engineer" to run your jobs or to access logs in case of problems. Flotilla lets you do this.

Quick Start

Minimal Assumptions

Before we can do anything, there are some prerequisites that must be met.

  1. Flotilla by default uses AWS. You must have an AWS account and AWS keys available. This quick-start guide uses AWS keys exported into the environment variables: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. If you've got credentials configured on your machine you can set these easily by running:
export AWS_ACCESS_KEY_ID=$(aws --profile default configure get aws_access_key_id)
export AWS_SECRET_ACCESS_KEY=$(aws --profile default configure get aws_secret_access_key)

Note: When running on AWS EC2 instances or ECS, it's better practice to use an IAM profile for AWS credentials.

  2. The AWS credentials must be authorized. The permissions required are described in the following policy document for AWS (you can attach it to a user or a role depending on how you manage users in AWS).
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "flotilla-policy",
            "Effect": "Allow",
            "Action": [
                "sqs:DeleteMessage",
                "sqs:ListQueues",
                "sqs:GetQueueUrl",
                "logs:DescribeLogGroups",
                "sqs:ReceiveMessage",
                "events:PutRule",
                "sqs:SendMessage",
                "sqs:GetQueueAttributes",
                "ecs:DescribeClusters",
                "ecs:DeregisterTaskDefinition",
                "events:ListRuleNamesByTarget",
                "ecs:RunTask",
                "ecs:RegisterTaskDefinition",
                "sqs:CreateQueue",
                "ecs:ListContainerInstances",
                "ecs:DescribeContainerInstances",
                "ecs:ListClusters",
                "ecs:StopTask"
            ],
            "Resource": "*"
        }
    ]
}
  3. Flotilla uses AWS's Elastic Container Service (ECS) as the execution backend. However, Flotilla does not manage ECS clusters. There must be at least one cluster defined in AWS's ECS service available to you and it must have at least one task node. Most typically this is the default cluster and examples will assume this going forward. You can easily set up a cluster by following the instructions here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/launch_container_instance.html
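As a quick sanity check on the first prerequisite, the sketch below reports which of the two AWS environment variables are unset. The helper name `missingEnv` is hypothetical and not part of Flotilla; it only inspects the environment and does not verify that the keys are valid or authorized.

```go
package main

import (
	"fmt"
	"os"
)

// missingEnv returns the names in required whose value is empty according
// to getenv. Passing the lookup function in keeps the helper easy to test.
// This is an illustrative helper, not part of Flotilla.
func missingEnv(required []string, getenv func(string) string) []string {
	var missing []string
	for _, name := range required {
		if getenv(name) == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// The two variables the quick start expects to be exported.
	required := []string{"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"}
	if missing := missingEnv(required, os.Getenv); len(missing) > 0 {
		fmt.Println("missing AWS credentials in environment:", missing)
		return
	}
	fmt.Println("AWS credentials found in environment")
}
```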

Starting the service locally

You can run the service locally (which will still leverage AWS resources) using the docker-compose tool. From inside the repo run:

docker-compose up -d

You'll notice it builds the code in the repo and starts the flotilla service as well as the default postgres backend.

Verify the service is running by making a GET request with cURL to http://localhost:3000/api/v1/task (or by opening that URL in a web browser). A 200 OK response means things are good!

Note: The default configuration under conf and in docker-compose.yml assumes port 3000. You'll have to change it in both places if you don't want to use port 3000 locally.
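The same check can be scripted. The following is a minimal sketch in Go using only the standard library; `taskEndpoint` and `checkService` are hypothetical helper names, and the five second timeout is an arbitrary choice.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// taskEndpoint builds the task-list URL used in the quick start for a
// given local port.
func taskEndpoint(port int) string {
	return fmt.Sprintf("http://localhost:%d/api/v1/task", port)
}

// checkService issues a GET request and returns nil only when the
// response status is 200 OK.
func checkService(url string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

func main() {
	url := taskEndpoint(3000)
	if err := checkService(url, 5*time.Second); err != nil {
		fmt.Println("flotilla is not reachable:", err)
		return
	}
	fmt.Println("flotilla is up at", url)
}
```

If you changed the port in conf and docker-compose.yml, pass that port to `taskEndpoint` instead of 3000.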

Using the UI

Flotilla has a simple, easy-to-use UI. Here are some example images for basic usage.

Define a task with the UI

The UI allows you to quickly create new tasks.

Define Task

Launch a task with the UI

You can run tasks you've created with the UI as well. Once you've launched a task, the run transitions from Queued to Pending to Running before it finishes and shows Success or Failed (see Task Life Cycle). Once a run is in the Running state the logs should be visible.

  1. Launch

    Run Task

  2. Queued --> Pending

    Queued Task

    Pending Task

  3. View logs

    Running Task

    Finished Task

Basic API Usage

Defining your first task

Before you can run a task you first need to define it. We'll use the example hello world task definition. Here's what that looks like:

hello-world.json

{
  "alias": "hello-flotilla",
  "group_name": "examples",
  "image": "ubuntu:latest",
  "memory": 512,
  "env": [
    {
      "name": "USERNAME",
      "value": "_fill_me_in_"
    }
  ],
  "command": "echo \"hello ${USERNAME}\""
}

It's a simple task that runs in the default ubuntu image, prints your username to the logs, and exits.

Note: While you can use non-public images and images in your own registries with flotilla, credentials for accessing those images must exist on the ECS hosts. This is outside the scope of this doc. See the AWS documentation.

Let's define it:

curl -XPOST localhost:3000/api/v1/task --data @examples/hello-world.json

If you visit the initial URL (http://localhost:3000/api/v1/task) again, the new definition will be in the list.

Running your first task

This is the fun part. You'll make a PUT request to the execution endpoint for the task you just defined and specify any environment variables.

curl -XPUT localhost:3000/api/v1/task/alias/hello-flotilla/execute -d '{
  "cluster":"default",
  "env":[
    {"name":"USERNAME","value":"yourusername"}
  ],
  "run_tags":{"owner_id":"youruser"}
}'

Note: run_tags is required. It ensures that every run has ownership information attached for visibility.
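The execute call can be assembled the same way in Go. This is a sketch only: `ExecutionRequest`, `executeURL`, and `newExecuteRequest` are hypothetical names that mirror the cURL example above, and the client-side check for empty run_tags is an added convenience, not documented server behavior.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/url"
)

// EnvVar is one name/value pair passed to the run.
type EnvVar struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// ExecutionRequest mirrors the JSON body of the execute call (illustrative).
type ExecutionRequest struct {
	Cluster string            `json:"cluster"`
	Env     []EnvVar          `json:"env"`
	RunTags map[string]string `json:"run_tags"`
}

// executeURL builds the execution endpoint for a task alias.
func executeURL(baseURL, alias string) string {
	return baseURL + "/api/v1/task/alias/" + url.PathEscape(alias) + "/execute"
}

// newExecuteRequest builds the PUT request and rejects empty run_tags
// before anything is sent.
func newExecuteRequest(baseURL, alias string, e ExecutionRequest) (*http.Request, error) {
	if len(e.RunTags) == 0 {
		return nil, errors.New("run_tags is required")
	}
	body, err := json.Marshal(e)
	if err != nil {
		return nil, err
	}
	return http.NewRequest(http.MethodPut, executeURL(baseURL, alias), bytes.NewReader(body))
}

func main() {
	req, err := newExecuteRequest("http://localhost:3000", "hello-flotilla", ExecutionRequest{
		Cluster: "default",
		Env:     []EnvVar{{Name: "USERNAME", Value: "yourusername"}},
		RunTags: map[string]string{"owner_id": "youruser"},
	})
	if err != nil {
		fmt.Println("could not build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL)
	// To submit for real, pass req to http.DefaultClient.Do.
}
```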

You'll get a response that contains a run_id field. You can check the status of your task at http://localhost:3000/api/v1/history/<run_id>

curl -XGET localhost:3000/api/v1/history/<run_id>

{
  "instance": {
    "dns_name": "<dns-host-of-task-node>",
    "instance_id": "<instance-id-of-task-node>"
  },
  "run_id": "<run_id>",
  "definition_id": "<definition_id>",
  "alias": "hello-flotilla",
  "image": "ubuntu:latest",
  "cluster": "default",
  "status": "PENDING",
  "env": [
    {
      "name": "FLOTILLA_RUN_OWNER_ID",
      "value": "youruser"
    },
    {
      "name": "FLOTILLA_SERVER_MODE",
      "value": "dev"
    },
    {
      "name": "FLOTILLA_RUN_ID",
      "value": "<run_id>"
    },
    {
      "name": "USERNAME",
      "value": "yourusername"
    }
  ]
}
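A client that waits for a run to finish mostly needs the status field from this response. The sketch below decodes a trimmed copy of the response above; the `Run` type is hypothetical and covers only a few of the fields shown, and treating STOPPED as the finished state follows the Task Life Cycle section.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Run holds a subset of the fields in the history response (illustrative).
type Run struct {
	RunID        string `json:"run_id"`
	DefinitionID string `json:"definition_id"`
	Alias        string `json:"alias"`
	Cluster      string `json:"cluster"`
	Status       string `json:"status"`
}

// parseRun decodes a history response body into a Run.
func parseRun(data []byte) (Run, error) {
	var r Run
	err := json.Unmarshal(data, &r)
	return r, err
}

// isFinished reports whether the run has reached the STOPPED status.
func isFinished(r Run) bool {
	return r.Status == "STOPPED"
}

func main() {
	// Trimmed copy of the example response above.
	sample := []byte(`{"run_id": "<run_id>", "alias": "hello-flotilla", "cluster": "default", "status": "PENDING"}`)
	run, err := parseRun(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("run %s is %s (finished: %t)\n", run.RunID, run.Status, isFinished(run))
}
```

A polling loop would fetch the history URL on an interval and stop once `isFinished` returns true.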

You can get the logs for your task at http://localhost:3000/api/v1/<run_id>/logs. You will not see any logs until your task is at least in the RUNNING state.

curl -XGET localhost:3000/api/v1/<run_id>/logs

{
  "last_seen":"<last_seen_token_used_for_paging>",
  "log":"+ set -e\n+ echo 'hello yourusername'\nhello yourusername"
}
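The log field arrives as a single newline-separated string. A small sketch of decoding it into lines follows; `LogChunk` and `parseLogLines` are hypothetical names, and how the last_seen token is sent back to page through logs is not covered here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// LogChunk mirrors the two fields of the logs response (illustrative).
type LogChunk struct {
	LastSeen string `json:"last_seen"`
	Log      string `json:"log"`
}

// parseLogLines decodes a logs response and splits the log text into
// lines. It also returns the last_seen token unchanged.
func parseLogLines(data []byte) ([]string, string, error) {
	var c LogChunk
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, "", err
	}
	if c.Log == "" {
		return nil, c.LastSeen, nil
	}
	return strings.Split(c.Log, "\n"), c.LastSeen, nil
}

func main() {
	sample := []byte(`{"last_seen":"token","log":"+ set -e\n+ echo 'hello yourusername'\nhello yourusername"}`)
	lines, token, err := parseLogLines(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for i, line := range lines {
		fmt.Printf("%d: %s\n", i+1, line)
	}
	fmt.Println("paging token:", token)
}
```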

Definitions and Task Life Cycle

Definitions

| Name | Definition |
| --- | --- |
| task | A definition of a task that can be executed to create a run |
| run | An instance of a task |

Task Life Cycle

When executed, a task's run goes through several transitions:

  1. QUEUED - this is the first phase of a run and means the run is currently queued and waiting to be allocated to a cluster.
  2. PENDING - every worker.submit_interval (defined in the config) the submit worker pulls queued runs from the queues and submits them for execution. At this point, if the cluster associated with the run has resources, the run gets allocated to the cluster and transitions to the PENDING status. For the default execution engine this stage encapsulates the process of pulling the docker image and starting the container. It can take several minutes depending on whether the image is cached and how large the image is.
  3. RUNNING - once the run starts on a particular execution host it transitions to this stage. At this point logs should become available.
  4. STOPPED - a run enters this stage when it finishes execution. This can mean it either succeeded or failed depending on the existence of an exit_code and the value of that exit code.
  5. NEEDS_RETRY - on occasion, due to host-level conditions (full disk, too many open files, timeouts pulling the image, etc.) the run exits with a null exit code without ever being executed. In this case the reason is analyzed to determine if the run is retriable. If it is, the run transitions to this status, is allocated to the appropriate execution queue again, and repeats the lifecycle.
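The way a STOPPED run is interpreted can be sketched as a small function. This is an illustration of the rules described above, not Flotilla's implementation: treating exit code 0 as success and a null, non-retriable exit as a failure are assumptions, and the `retriable` flag stands in for the reason analysis.

```go
package main

import "fmt"

// Outcome is an illustrative label for how a stopped run is interpreted.
type Outcome string

const (
	Succeeded  Outcome = "SUCCEEDED"
	Failed     Outcome = "FAILED"
	NeedsRetry Outcome = "NEEDS_RETRY"
)

// classify interprets a stopped run. A nil exitCode models the null exit
// code of a run that never executed; retriable stands in for the result
// of analyzing the stop reason.
func classify(exitCode *int, retriable bool) Outcome {
	if exitCode == nil {
		if retriable {
			return NeedsRetry
		}
		return Failed // assumption: a null, non-retriable exit counts as failed
	}
	if *exitCode == 0 { // assumption: zero means success
		return Succeeded
	}
	return Failed
}

func main() {
	zero, one := 0, 1
	fmt.Println(classify(&zero, false))
	fmt.Println(classify(&one, false))
	fmt.Println(classify(nil, true))
}
```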

Normal Lifecycle

QUEUED --> PENDING --> RUNNING --> STOPPED

Retry Lifecycle

... --> PENDING --> STOPPED --> NEEDS_RETRY --> QUEUED --> ...
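Both lifecycles can be expressed as one transition table. The sketch below encodes only the transitions shown in the two diagrams above; the names `transitions`, `canTransition`, and `validPath` are hypothetical, and Flotilla itself may permit transitions not listed here.

```go
package main

import "fmt"

// transitions lists, for each status, the statuses that may follow it
// according to the normal and retry lifecycles.
var transitions = map[string][]string{
	"QUEUED":      {"PENDING"},
	"PENDING":     {"RUNNING", "STOPPED"},
	"RUNNING":     {"STOPPED"},
	"STOPPED":     {"NEEDS_RETRY"},
	"NEEDS_RETRY": {"QUEUED"},
}

// canTransition reports whether a run may move directly from one status
// to another.
func canTransition(from, to string) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

// validPath reports whether every consecutive pair in path is an allowed
// transition.
func validPath(path []string) bool {
	for i := 1; i < len(path); i++ {
		if !canTransition(path[i-1], path[i]) {
			return false
		}
	}
	return true
}

func main() {
	normal := []string{"QUEUED", "PENDING", "RUNNING", "STOPPED"}
	retry := []string{"PENDING", "STOPPED", "NEEDS_RETRY", "QUEUED"}
	fmt.Println("normal lifecycle valid:", validPath(normal))
	fmt.Println("retry lifecycle valid:", validPath(retry))
}
```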

Deploying

In a production deployment you'll want multiple instances of the flotilla service running and postgres running elsewhere (e.g. Amazon RDS). In this case the most salient configuration detail is the DATABASE_URL.

Docker based deploy

The simplest way to deploy for very light usage is to avoid a reverse proxy and deploy directly with docker.

  1. Build and tag an image for flotilla using the Dockerfile provided in this repo:

    docker build -t <your repo name>/flotilla:<version tag> .
    
  2. Run this image wherever you deploy your services:

    docker run -e DATABASE_URL=<your db url> -e FLOTILLA_MODE=prod -p 3000:3000 ...<other standard docker run args>
    

    Notes:

    • Flotilla uses viper for configuration so you can override any of the default configuration under conf/ using run time environment variables passed to docker run
    • In most realistic deploys you'll likely want to configure a reverse proxy to sit in front of the flotilla container. See the docs here

    See docker run for more details

Configuration In Detail

The variables in conf/config.yml are sensible defaults. Most should be left alone unless you're developing flotilla itself. However, there are a few you may want to change in a production environment.

| Variable Name | Description |
| --- | --- |
| worker.retry_interval | Run frequency of the retry worker |
| worker.submit_interval | Poll frequency of the submit worker |
| worker.status_interval | Poll frequency of the status update worker |
| http.server.read_timeout_seconds | Sets the read timeout in seconds for the http server |
| http.server.write_timeout_seconds | Sets the write timeout in seconds for the http server |
| http.server.listen_address | The port for the http server to listen on |
| owner_id_var | The environment variable containing ownership information to inject into the runtime of jobs |
| enabled_workers | A list of the workers that run. Use this to control which workers run when using a multi-container deployment strategy. Valid list items are retry, submit, and status |
| log.namespace | For the default ECS execution engine setup this is the log-group to use |
| log.retention_days | For the default ECS execution engine this is the number of days to retain logs |
| log.driver.options.* | For the default ECS execution engine these map to the awslogs driver options here |
| queue.namespace | For the default ECS execution engine this is the prefix used for SQS to determine which queues to pull job launch messages from |
| queue.retention_seconds | For the default ECS execution engine this configures how long a message will stay in an SQS queue without being consumed |
| queue.process_time | For the default ECS execution engine this configures the length of time allowed to process a job launch message |
| queue.status | For the default ECS execution engine this configures which SQS queue to route ECS cluster status updates to |
| queue.status_rule | For the default ECS execution engine this configures the name of the rule for routing ECS cluster status updates |
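When splitting workers across containers, it can help to check an enabled_workers list before deploying. The sketch below is a hypothetical pre-deploy check, not part of Flotilla; it only knows the three worker names listed in the table above.

```go
package main

import "fmt"

// validWorkers holds the worker names listed for enabled_workers.
var validWorkers = map[string]bool{
	"retry":  true,
	"submit": true,
	"status": true,
}

// invalidWorkers returns the entries of enabled that are not valid
// worker names, in their original order.
func invalidWorkers(enabled []string) []string {
	var bad []string
	for _, name := range enabled {
		if !validWorkers[name] {
			bad = append(bad, name)
		}
	}
	return bad
}

func main() {
	// Example: a container that should only run the submit and status workers.
	enabled := []string{"submit", "status"}
	if bad := invalidWorkers(enabled); len(bad) > 0 {
		fmt.Println("unknown workers:", bad)
		return
	}
	fmt.Println("enabled_workers looks valid:", enabled)
}
```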

Development

API Documentation

See API

Building

Currently Flotilla is built using Go 1.9.3 and uses govendor to manage dependencies.

govendor sync && go build