flume-cassandra-plugin
======================

The flume-cassandra-plugin allows you to use Cassandra as a Flume sink.

Getting Started
---------------

1. Copy the flume-cassandra-plugin directory into flume_dir/plugins/. There
   should also be a helloworld directory there.

2. cd into flume-cassandra-plugin

3. Build by running 'ant'. A cassandra_plugin.jar file should be created.

4. Modify flume-site.xml (you may start out by copying
   flume-site.xml.template and removing the body of the file) to include:

        <configuration>
          <property>
            <name>flume.plugin.classes</name>
            <value>org.apache.cassandra.plugins.SimpleCassandraSink</value>
            <description>Comma separated list of plugin classes</description>
          </property>
        </configuration>

5. cd into the top-level flume directory (above plugins).

6. Set FLUME_CLASSPATH for all terminals which will run Flume master or node:

        export FLUME_CLASSPATH=`pwd`/plugins/flume-cassandra-plugin/cassandra_plugin.jar:`pwd`/plugins/flume-cassandra-plugin/lib/jug-asl-2.0.0.jar

   You may want to just put this in your ~/.bashrc file. If you do, make sure
   to start a new terminal or run:

        source ~/.bashrc

   in any terminals you will use.

Usage
-----

This plugin primarily targets log storage right now.

The Cassandra sink requires four arguments for its constructor:

1. A keyspace (String). For example, 'Keyspace1'.
2. A column family name (String) for storing data in.
3. A column family name (String) for storing indexes in.
4. A list of Cassandra server hostname:port combinations (Strings).

Cassandra must already be configured so that the keyspace and both column
families exist before the sink is used. The index column family should use a
TimeUUIDType comparator. For example, in cassandra.yaml you would have:

    - name: FlumeIndexes
      compare_with: TimeUUIDType
      comment: 'Stores the v1 uuids for log events'

The data storage column family can use BytesType.

When the Cassandra sink receives an event, it does the following:

1. In the index column family:
   a. Creates a column where the name is a type 1 UUID (timestamp based) and
      the value is empty.
   b. Inserts it into row "YYYYMMDDHH" (the current date and hour) in the
      given column family.

2. In the data column family:
   a. Creates a column where the name is 'data' and the value is the flume
      event body.
   b. Inserts it into a row with a key that is the same uuid from step 1.

This allows you to easily fetch all logs for a slice of time. Simply use
something like get_slice() on the index column family to get the uuids you
want for a particular slice of time, and then multiget the data column family
using those uuids as the keys.
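
The two writes and the two-step read described above can be sketched in a few lines of Python. This is only an illustration of the storage layout, not the plugin's code: the dictionaries `index_cf` and `data_cf` and the functions `store_event` and `fetch_hour` are hypothetical in-memory stand-ins for the column families, get_slice() and multiget, and the use of UTC for the hour bucket is an assumption.

```python
import uuid
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical in-memory stand-ins for the two column families.
index_cf = defaultdict(dict)   # row key "YYYYMMDDHH" -> {type 1 UUID: b""}
data_cf = {}                   # row key = type 1 UUID -> {"data": event body}


def store_event(body, now=None):
    """Mimic the two writes described above for a single event."""
    now = now or datetime.now(timezone.utc)   # assumption: hour bucket in UTC
    event_id = uuid.uuid1()                   # type 1 (timestamp based) UUID
    row_key = now.strftime("%Y%m%d%H")        # current date and hour
    index_cf[row_key][event_id] = b""         # index column: UUID name, empty value
    data_cf[event_id] = {"data": body}        # data row keyed by the same UUID
    return event_id


def fetch_hour(row_key):
    """Slice one index row, then look up each UUID in the data column family."""
    # A TimeUUIDType comparator orders columns by the UUID's embedded timestamp.
    ids = sorted(index_cf.get(row_key, {}), key=lambda u: u.time)
    return [data_cf[i]["data"] for i in ids]
```

Calling `store_event(b"some log line")` and then `fetch_hour("2011030415")` (with whatever hour bucket the event landed in) returns the bodies stored during that hour. Against a real cluster, the same pattern would be one slice query on the index row followed by a multiget on the data column family.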