Drake

Drake is a simple-to-use, extensible, text-based data workflow tool that organizes command execution around data and its dependencies. Data processing steps are defined along with their inputs and outputs and Drake automatically resolves their dependencies and calculates:

  • which commands to execute (based on file timestamps)
  • in what order to execute the commands (based on dependencies)

Drake is similar to GNU Make, but designed especially for data workflow management. It has HDFS support, allows multiple inputs and outputs, and includes a host of features designed to help you bring sanity to your otherwise chaotic data processing workflows.
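The two calculations described above (which commands to run, and in what order) can be sketched in a few lines of Python. This is only an illustration of the general timestamp-and-dependency idea, not Drake's actual implementation: the file names, the `steps` and `mtimes` tables, and the `plan` function are all hypothetical, and the rule that a step is also stale when one of its inputs is about to be rebuilt is an assumption.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each output file maps to the inputs it is built from.
steps = {
    "clean.csv": ["raw.csv"],
    "stats.csv": ["clean.csv"],
    "report.txt": ["stats.csv", "clean.csv"],
}

# Hypothetical modification times (larger = newer). A missing key means the
# file does not exist yet.
mtimes = {"raw.csv": 100, "clean.csv": 90, "stats.csv": 95, "report.txt": 120}


def plan(steps, mtimes):
    """Return the outputs that need rebuilding, in dependency order."""
    # Only step outputs are graph nodes; plain source files are never built.
    graph = {out: [i for i in ins if i in steps] for out, ins in steps.items()}
    order = list(TopologicalSorter(graph).static_order())

    stale = set()
    for out in order:
        ins = steps[out]
        out_time = mtimes.get(out)
        missing = out_time is None
        # Assumption: an input that is being rebuilt makes its consumers stale.
        input_rebuilt = any(i in stale for i in ins)
        input_newer = not missing and any(
            mtimes.get(i) is not None and mtimes[i] > out_time for i in ins
        )
        if missing or input_rebuilt or input_newer:
            stale.add(out)
    return [out for out in order if out in stale]


print(plan(steps, mtimes))
```

Here `raw.csv` is newer than `clean.csv`, so `clean.csv` is rebuilt first and everything derived from it follows in dependency order.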

Installation

Drake is a Clojure project, so you will need Leiningen to build it.

Drake has been tested on Linux and Mac OS X. It has not been tested on Windows.

Clone the project:

$ git clone git@github.com:Factual/drake.git
$ cd drake

Build the uberjar:

$ lein uberjar

Run Drake from the jar

Once you've built the uberjar, you can run Drake like this:

$ java -jar drake.jar

You can pass in arguments and options to Drake by putting them at the end of the above command, e.g.:

$ java -jar drake.jar --version

A nicer way to run Drake

We recommend you "install" Drake in your environment so that you can run it by just typing "drake". For example, you could put an executable script called drake on your path, like this:

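#!/bin/bash
# Run Drake from the drake.jar next to this script; quoting keeps arguments and paths with spaces intact.
java -cp "$(dirname "$0")/drake.jar" drake.core "$@"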

Drake documentation refers to running Drake as "drake". If you are instead running the uberjar, just replace "drake" with "java -jar drake.jar" in the examples.

Basic Usage

The wiki is the home for Drake's documentation, but here are simple notes on usage:

To build a specific target (and any out-of-date dependencies, if necessary):

$ drake mytarget

To build a target and everything that depends on it (a.k.a. "down-tree" mode):

$ drake ^mytarget

To build a specific target only, without any dependencies, up or down the tree:

$ drake =mytarget

To force build a target:

$ drake +mytarget

To force build a target and everything that depends on it (down-tree):

$ drake +^mytarget

To force build the entire workflow:

$ drake +...

To exclude targets:

$ drake ... -sometarget -anothertarget
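To make the selector prefixes above concrete, here is a rough Python sketch of how they might map onto a dependency graph. It is a reading of the descriptions in this section, not Drake's code: the `deps` graph and the `closure` and `select` functions are hypothetical. Two points are assumptions rather than documented behavior: that `+mytarget` forces only the named target while its dependencies are still built only when out of date, and that an excluded name removes the same set of targets that the un-prefixed name would select.

```python
# Hypothetical workflow: each target maps to the targets it is built from.
deps = {
    "raw": [],
    "clean": ["raw"],
    "stats": ["clean"],
    "report": ["stats"],
}


def closure(start, edges):
    """Return start plus everything reachable from it by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen


def select(args, deps):
    """Return (targets to consider, targets to force) for a list of selectors."""
    # Reverse edges: target -> targets that depend on it (down-tree direction).
    dependents = {t: [] for t in deps}
    for target, inputs in deps.items():
        for i in inputs:
            dependents[i].append(target)

    chosen, forced = set(), set()
    for arg in args:
        exclude = arg.startswith("-")
        force = arg.startswith("+")
        name = arg[1:] if (exclude or force) else arg

        if name == "...":                      # the entire workflow
            picked = set(deps)
            force_set = picked
        elif name.startswith("^"):             # target plus its dependents
            picked = closure(name[1:], dependents)
            force_set = picked
        elif name.startswith("="):             # the target only
            picked = {name[1:]}
            force_set = picked
        else:                                  # target plus its dependencies
            picked = closure(name, deps)
            force_set = {name}                 # assumption: force the target only

        if exclude:                            # assumption: mirror of inclusion
            chosen -= picked
            forced -= picked
        else:
            chosen |= picked
            if force:
                forced |= force_set
    return chosen, forced


for example in (["stats"], ["^clean"], ["=stats"], ["+^clean"], ["...", "-raw"]):
    print(example, select(example, deps))
```

Targets in the first set would still be subject to the timestamp check, while targets in the second set would be rebuilt regardless.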

By default, Drake looks for ./workflow.d. The simplest way to run your workflow is to name your workflow file workflow.d and run Drake from the directory that contains it:

$ drake

To specify the workflow file explicitly, use -w or --workflow. E.g.:

$ drake -w /myworkflow/my-fav-workflow.d

Use drake --help for the full list of options.

Documentation, etc.

The wiki is the home for Drake's documentation.

A lot of work went into designing and specifying Drake. To prove it, here's the 60-page specification document. It can be downloaded as a PDF and treated like a user manual.

There are annotated workflow examples in the demos directory.

There's a Google Group for Drake.

If you like screencasts, check out this Drake walk-through video recorded by Artem, Drake's primary designer.

HDFS Compatibility

Drake provides HDFS support by allowing you to specify inputs and outputs like hdfs://my/big_file.txt.

If you plan to use Drake with HDFS, please see the wiki doc on HDFS Compatibility.

License

Source Copyright © 2012-2013 Factual, Inc.

Distributed under the Eclipse Public License, the same license that Clojure uses. See the file COPYING.