Infrastructure for managing information aggregation and distribution on a large cluster of servers
Switch branches/tags
Nothing to show
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.


    Clustreduce is designed to aid information discovery and management over a large number of servers.  I've spent years writing paragraphs of shell to answer questions that should have been a single command away, and should have been far more reusable than they actually were once I was done.  This is my proposal for a simple way to answer the following questions:

* Given this specific list of servers, what are they?  What configuration attributes do they share?  Are they all mysql boxes?  Are they all web hosts?  How do they differ from a larger pool?  Are these all the four-core web hosts?
* What's the distribution of load average over the web pool?  min - mean - max?  Can you show me a graph of that?  What about load vs current connections?  Average response time vs load average?  What's the load distribution for this pool broken out by number of cpu cores?  cpu speed?  Which of these attributes I know about show the highest divergence in response time between hosts with different values?
* When I run this command on all hosts, which hosts had which output?
* What's the difference in the contents of this file across all hosts in this pool?  Which hosts have which version?
* Restart apache on all web hosts.  all hosts of this pool.  all web hosts in this pool with a load above 10.  All at once ASAP.  Five at a time.  One at a time, with a five-second delay between each.
* What's the list of hosts currently failing this nagios check?
* Copy this file to all hosts in this class.
* For the following hosts, show me everything important.
* For the following hosts, show me everything you can possibly find out.
* Run this command (size of a specific file, for example) on all hosts in this pool and give me a sum of the output.  Okay, now show me the distribution.  Now give me the sum broken out by version of this library.

There are probably 3 things that we want to work with here:
* Metadata about hosts; a description of your cluster.
    * Search: given a specification of properties, give me a list of hosts.
        (mysql-version=5.1 & ( pool=analytics | pool=marketing ))
    * Classify: given a host, give me all the cluster metadata.
* Commands to run on hosts
    * Filtered and cumulative; each command lists a filter for the type of hosts it can run on, and asking for a command will (by default) run all commands of that name where the host matches the filter.
        stats.load-average makes sense on all hosts, but stats.sql-slave does not.  If I ask for just 'stats', I want to see everything relevant to the host I'm asking about.
* Operations to run on the data we get back.
    * filter, transform, join, group, graph

I haven't yet done much thinking about the details of the last, so I will neglect it for now.

Configured by a set of conf.d-style directories; consider the following example command configs:

    filter = type=host
    type = ssh
    format = json
    command = foo_uptime
    # output like: { "1-min": 0.04, "5-min": 0.08, "15-min": 0.12 }

    filter = class=mysql
    type = sql
    command = "show slave status"
    keep = slave_sql_running, seconds_behind_master

    filter = class=web
    level = -2 # not normally run
    type = cmd
    format = return-code
    command = nmap -sG - -p 80 $target | grep -F 80/open

    filter = proxy-software=nginx
    type = ssh
    format = plain
    ssh.user = root
    command = /etc/init.d/nginx reload

    filter = type=host
    args = 1
    type = cmd
    format = plain
    cmd = ssh $target $arg1

    filter = type=host
    args = 2
    type = cmd
    format = return-code
    command = rsync -Pav $arg1 $target:$arg2