Skip to content


Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
A plugin for flume that allows you to use Cassandra as a sink.
tag: v0.7.0.b1

Fetching latest commit…

Cannot retrieve the latest commit at this time

Failed to load latest commit information.



The flume-cassandra-plugin allows you to use Cassandra as a Flume sink.

Getting Started

1. Copy the flume-cassandra-plugin directory into flume_dir/plugins/.  There
should also be a helloworld directory there.

2. cd into flume-cassandra-plugin

3. Build by running 'ant'.  A cassandra_plugin.jar file should be created.

4. Modify flume-site.xml (you may start out by copying
flume-site.xml.template and removing the body of the file) to include:

        <description>Comma separated list of plugin classes</description>

5. cd into the top-level flume directory (above plugins).

6. Set FLUME_CLASSPATH for all terminals which will run Flume master or node:

    export FLUME_CLASSPATH=`pwd`/plugins/flume-cassandra-plugin/cassandra_plugin.jar:`pwd`/plugins/flume-cassandra-plugin/lib/jug-asl-2.0.0.jar

You may want to just put this in your ~/.bashrc file.  If you do, make sure to start a new terminal or run:

    source ~/.bashrc

in any terminals you will use.


This plugin primarily targets log storage right now.

The Cassandra sink requires four arguments for its constructor:

1. A keyspace (String).  For example, 'Keyspace1'.
2. A column family name (String) for storing data in.
3. A column family name (String) for storing indexes in.
4. A list Cassandra server hostname:port combinations (Strings)

Cassandra must already be configured so that the keyspace and both of the
column families must already exist.  The index column family should use
a TimeUUIDType comparator.  For example, in cassandra.yaml you would have:

    - name: FlumeIndexes
      compare_with: TimeUUIDType
      comment: 'Stores the v1 uuids for log events'

The data storage column family can use BytesType.

When the Cassandra sink receives an event, it does the following:

1. In the index column family:
    a. Creates a column where the name is a type 1 UUID (timestamp based) and the
       value is empty.
    b. Inserts it into row "YYYYMMDDHH" (the current date and hour) in the
       given column family.
2. In the data column family:
    a. Creates a column where the name is 'data' and the value is the
       flume event body.
    b. Inserts it into a row with a key that is the same uuid from step 1.

This allows you to easily fetch all logs for a slice of time. Simply use
something like get_slice() on the index column family to get the uuids you
want for a particular slice of time, and then multiget the data column
family using those uuids as the keys.
Something went wrong with that request. Please try again.