Big Data Compute Cluster using Chef, Hadoop and Cassandra in the Amazon AWS/EC2 cloud
Clone or download
Pull request Compare This branch is 1179 commits behind infochimps-labs:master.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.



Chef is a powerful tool for maintaining and describing the software and configurations that let a machine provide its services.

cluster_chef is

  • a clean, expressive way to describe how machines and roles are assembled into a working cluster.
  • Our collection of Industrial-strength, cloud-ready recipes for Hadoop, Cassandra, HBase, Elasticsearch and more.
  • a set of conventions and helpers that make provisioning cloud machines easier.


Here’s a basic 3-node hadoop cluster:

    ClusterChef.cluster 'demohadoop' do
      facet 'master' do
        instances           1
        role                "nfs_server"
        role                "hadoop_master"
        role                "hadoop_worker"
        role                "hadoop_initial_bootstrap"

      facet 'worker' do
        instances           2
        role                "nfs_client"
        role                "hadoop_worker"
        server 0 do
	  chef_node_name 'demohadoop_worker_zero'

      cloud :ec2 do
        image_name          "lucid"
        flavor              "c1.medium"
        availability_zones  ['us-east-1d']
        security_group :logmuncher do
          authorize_group "webnode"

This defines a cluster (group of machines that serve some common purpose) with two facets, or unique configurations of machines within the cluster. (For another example, a webserver farm might have a loadbalancer facet, a database facet, and a webnode facet).

In the example above, the master serves out a home directory over NFS, and runs the processes that distribute jobs to hadoop workers. In this small cluster, the master also has workers itself, and a utility role that helps initialize it out of the box.

There are 2 workers; they use the home directory served out by the master, and run the hadoop worker processes.

Lastly, we define what machines to use for this cluster. Instead of having to look up and type in an image ID, we just say we want the Ubuntu ‘Lucid’ distribution on a c1.medium machine. Cluster_chef understands that this means we need the 32-bit image in the us-east-1 region, and makes the cloud instance request accordingly. It also creates a ‘logmunchers’ security group, opening it so all the ‘webnode’ machines can push their server logs onto the HDFS.

The following commands launch each machine, and once ready, ssh in to install chef and converge all its software.

knife cluster launch demohadoop master --bootstrap
knife cluster launch demohadoop worker --bootstrap

You can also now launch the entire cluster at once with the following

knife cluster launch demohadoop --bootstrap

The cluster launch operation is idempotent: nodes that are running won’t be started!


Some general principles of how we use chef.

  • Chef server is never the repository of truth — it only mirrors the truth.
    – a file is tangible and immediate to access
  • Specifically, we want truth to live in the git repo, and be enforced by the chef server. There is no truth but git, and chef is its messenger.
    – this means that everything is versioned, documented and exchangeable.
  • Systems, services and significant modifications cluster should be obvious from the `clusters` file. I don’t want to have to bounce around nine different files to find out which thing installed a redis:server.
    – basically, the existence of anything that opens a port should be obvious when I look at the cluster file.
  • Roles define systems, clusters assemble systems into a machine.
    – For example, a resque worker queue has a redis, a webserver and some config files — your cluster should invoke a whatever_queue role, and the whatever_queue role should include recipes for the component services.
    – the existence of anything that opens a port or runs as a service should be obvious when I look at the roles file.
  • include_recipe considered harmful Do NOT use include_recipe for anything that a) provides a service, b) launches a daemon or c) is interesting in any way. (so: include_recipe java yes; include_recipe iptables no.) You should note the dependency in the metadata.rb. This seems weird, but the breaking behavior is purposeful: it makes you explicitly state all dependencies.
  • It’s nice when machines are in full control of their destiny.
    – initial setup (elastic IP, attaching a drive) is often best enforced externally
    – but machines should be ablt independently assert things like load balancer registration that that might change at any point in the lifetime.
  • It’s even nicer, though, to have full idempotency from the command line: I can at any time push truth from the git repo to the chef server and know that it will take hold.

Getting Started

This assumes you have installed chef, have a working chef server, and have an AWS account. If you can run knife and use the web browser to see your EC2 console, you can start here. If not, see the instructions below.


Install these gems (you may have several installed already):

sudo gem install chef fog highline configliere net-ssh-multi formatador terminal-table gorillib extlib

(and maybe right_aws and broham?)

Add cluster_chef to your cookbooks repo

This assumes the same setup as described in the knife quickstart walkthrough, but to keep things convenient we’ll use these shortcuts to refer

CHEF_REPO_DIR=$HOME/PATH/TO/chef-repo      # has cookbooks/ and site-cookbooks/
DOT_CHEF_DIR=$HOME/PATH/TO/chef-repo/.chef # has your knife.rb, USERNAME.pem etc

I recommend adding cluster chef as a git submodule; you can instead clone it elsewhere.

git submodule add cluster_chef

Knife setup

In your $DOT_CHEF_DIR/knife.rb, modify the cookbook path (to include cluster_chef/cookbooks and cluster_chef/site-cookbooks) and to add settings for cluster_chef_path, cluster_path and keypair_path. Here’s mine:

current_dir = File.dirname(__FILE__)
organization  = 'CHEF_ORGANIZATION'
username      = 'CHEF_USERNAME'

# The full path to your cluster_chef installation
cluster_chef_path File.expand_path("#{current_dir}/../cluster_chef")
# The list of paths holding clusters
cluster_path      [ File.expand_path("#{current_dir}/../clusters") ]
# The directory holding your cloud keypairs
keypair_path      File.expand_path(current_dir)

log_level                :info
log_location             STDOUT
node_name                username
client_key               "#{keypair_path}/#{username}.pem"
validation_client_name   "#{organization}-validator"
validation_key           "#{keypair_path}/#{organization}-validator.pem"
chef_server_url          "{organization}"
cache_type               'BasicFile'
cache_options( :path => "#{ENV['HOME']}/.chef/checksums" )

# The first things have lowest priority (so, site-cookbooks gets to win)
cookbook_path            [

# If you primarily use AWS cloud services:
knife[:ssh_address_attribute] = 'cloud.public_hostname'
knife[:ssh_user] = 'ubuntu'

# Configure bootstrapping
knife[:bootstrap_runs_chef_client] = true
bootstrap_chef_version   "~> 0.10.0"

# AWS access credentials
knife[:aws_access_key_id]      = "XXXXXXXXXXX"
knife[:aws_secret_access_key]  = "XXXXXXXXXXXXX"

Push to chef server

To send all the cookbooks and role to the chef server, visit your cluster_chef directory and run:

mkdir -p $CHEF_REPO_DIR/site-cookbooks
knife cookbook upload --all
for foo in roles/*.rb ; do knife role from file $foo & sleep 1 ; done

You should see all the cookbooks defined in cluster_chef/cookbooks (ant, apt, …) listed among those it uploads.

Cluster chef’s Knife plugins

mkdir -p $DOT_CHEF_DIR/plugins
ln -s $CHEF_REPO_DIR/cluster_chef/lib/cluster_chef/knife           $DOT_CHEF_DIR/plugins/
ln -s $CHEF_REPO_DIR/cluster_chef/lib/cluster_chef/knife/bootstrap $DOT_CHEF_DIR/

If you type knife cluster launch you should see it found the new scripts:

    knife cluster launch CLUSTER_NAME [FACET_NAME] (options)
    knife cluster show CLUSTER_NAME [FACET_NAME] (options)

Your first cluster

Let’s create a cluster called ‘demosimple’. It’s, well, a simple demo cluster.

Create a simple demo cluster

Create a directory for your clusters; copy the demosimple cluster and its associated roles from cluster_chef:

mkdir -p $CHEF_REPO_DIR/clusters
cp cluster_chef/clusters/{defaults,demosimple}.rb ./clusters/
cp cluster_chef/roles/{big_package,nfs_*,ssh,base_role,chef_client}.rb  ./roles/
for foo in roles/*.rb ; do knife role from file $foo ; done

Symlink in the cookbooks you’ll need, and upload them to your chef server:

cd $CHEF_REPO_DIR/cookbooks
ln -s ../cluster_chef/site-cookbooks/{nfs,big_package,cluster_chef,cluster_service_discovery,firewall,motd} .
knife cookbook upload nfs big_package cluster_chef cluster_service_discovery firewall motd

AWS credentials

Make a cloud keypair, a secure key for communication with Amazon AWS cloud. cluster_chef expects a keypair named after its cluster — this is a best practice that helps keep your environments distinct.

  1. Log in to the AWS console and create a new keypair named demosimple. Your browser will download the private key file.
  2. Move the private key file you just downloaded to your .chef dir, and make it private key unsnoopable, or ssh will complain:
mv ~/downloads/demosimple.pem $DOT_CHEF_DIR/demosimple.pem
chmod 600 $DOT_CHEF_DIR/*.pem

Cluster chef knife commands

knife cluster launch

Hooray! You’re ready to launch a cluster:

    knife cluster launch demosimple homebase --bootstrap

It will kick off a node and then bootstrap it. You’ll see it install a whole bunch of things. Yay.

Extended Installation Notes

Set up Knife on your local machine, and a Chef Server in the cloud

If you already have a working chef installation you can skip this section.

To get started with knife and chef, follow the Chef Quickstart, We use the hosted chef service and are very happy, but there are instructions on the wiki to set up a chef server too. Stop when you get to “Bootstrap the Ubuntu system” — cluster chef is going to make that much easier.

Cloud setup

If you can use the normal knife bootstrap commands to launch a machine, you can skip this step.


Right now cluster chef works well with AWS. If you’re interested in modifying it to work with other cloud providers, see here or get in touch.

Tips and Troubleshooting

Help a git deploy recipe has gone limp

Suppose you are using the git resource to deploy a recipe (george for sake of example). If /var/chef/cache/revision_deploys/var/www/george exists then nothing will get deployed, even if /var/www/george/{release_sha} is empty or screwy. If git deploy is acting up in any way, nuke that cache from orbit — it’s the only way to be sure.

$ sudo rm -rf /var/www/george/{release_sha} /var/chef/cache/revision_deploys/var/www/george
Hadoop namenode bootstrap

Once the master runs to completion with all daemons started, remove the hadoop_initial_bootstrap recipe from its run_list. (Note that you may have to edit the runlist on the machine itself depending on how you bootstrapped the node).

Problems starting NFS server on ubuntu maverick

For problems starting NFS server on ubuntu maverick systems, read, understand and then run /tmp/ — See this thread for more

Setup for Knife versions before 0.10

On older versions of chef that don’t have a plugin mechanism for new commands, we have to do surgery on the knife itself… we’ll just symlink the new commands into chef’s lib/chef/knife directory, and symlink the bootstrap templates into chef’s lib/chef/knife/bootstrap directory. Set the path to your cluster_chef directory and run the following:

    sudo ln -s $CLUSTER_CHEF_DIR/lib/cluster_chef/knife/*.rb            $(dirname `gem which chef`)/chef/knife/
    sudo ln -s $CLUSTER_CHEF_DIR/lib/cluster_chef/knife/bootstrap/*     $(dirname `gem which chef`)/chef/knife/bootstrap/