Spring XD Batch Hashtag Count Sample

This sample will take an input file with Twitter JSON data and count the occurrences of hashtags.
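
To make the goal concrete, the counting logic can be sketched in a few lines of Python. This is only an illustration, not the sample's actual implementation, which runs as a Spring XD batch job on Hadoop. It assumes one tweet JSON object per line and the usual Twitter layout where hashtags live under `entities.hashtags[].text`; the function name `count_hashtags` and the sample records are made up for this sketch.

```python
import json
from collections import Counter


def count_hashtags(lines):
    """Count hashtag occurrences in an iterable of tweet JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip records that are not valid JSON
        if not isinstance(tweet, dict):
            continue  # skip JSON values that are not objects
        # Assumed layout: {"entities": {"hashtags": [{"text": "..."}]}}
        entities = tweet.get("entities") or {}
        for tag in entities.get("hashtags") or []:
            text = tag.get("text")
            if text:
                counts[text] += 1
    return counts


# Hypothetical input: two tweets with hashtags and one record without any.
sample = [
    '{"text": "a", "entities": {"hashtags": [{"text": "SpringXD"}, {"text": "hadoop"}]}}',
    '{"text": "b", "entities": {"hashtags": [{"text": "SpringXD"}]}}',
    '{"delete": {"status": {"id": 1}}}',
]
print(count_hashtags(sample).most_common())
```

Records without an `entities` block, such as delete notices, simply contribute nothing to the counts.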


To run the sample you will need to have Spring XD and Hadoop installed.

Furthermore you must have your Twitter API credentials ready:

  • Consumer Key

  • Consumer Secret

  • Access Token

  • Access Token Secret

If you are using a Hadoop distribution that uses a different configuration than the default one from Apache Hadoop, then you need to provide additional configuration settings to be used by any MapReduce tasks submitted to the cluster. See this page for details.

Module configuration

The src/main/resources/config/spring-module.xml file defines the location of the input to process and the output directory to use. All relevant properties are defined in the util:properties element:

<util:properties id="myProperties" >
	<prop key="tweets.input.path">/xd/tweets/</prop>
	<prop key="tweets.output.path">/xd/hashtagcount/output/</prop>
</util:properties>

Building with Maven

Build the sample simply by executing:

$ mvn clean package

The project pom declares spring-xd-module-parent as its parent. This adds the dependencies needed to compile and test the module and also configures the Spring Boot Maven Plugin to package the module as an uber-jar, packaging any dependencies that are not already provided by the Spring XD container. In this case there are no additional dependencies so the artifact is built as a common jar. See the Modules section in the Spring XD Reference for more details on module packaging.

Building with Gradle

$ ./gradlew clean bootRepackage

The project’s build.gradle applies the spring-xd-module plugin, providing analogous build and packaging support for Gradle. This plugin also applies the Spring Boot Gradle Plugin as well as the propdeps plugin.

Running the Sample

Now your sample is ready to be executed. The simplest way to run Spring XD is using the singlenode server.

xd/bin>$ ./xd-singlenode

Now start the Spring XD Shell in a separate window:

shell/bin>$ ./xd-shell

Upload the module

In the Spring XD shell:

xd:>module upload --type job --name hashtagCountExample --file [path-to]/batch-hashtag-count-1.0.0.BUILD-SNAPSHOT.jar

Collect Twitter Data

To set up the Twitter stream, you must either provide your Twitter API credentials via the shell:

xd:> stream create --name tweets --definition "twitterstream \
--consumerKey='your_credentials' \
--consumerSecret='your_credentials' \
--accessToken='your_credentials' \
--accessTokenSecret='your_credentials' | hdfs --rollover=2M" --deploy

or, alternatively, provide the credentials in config/modules/modules.yml:

   consumerKey: <your consumer key>
   consumerSecret: <your consumer secret>
   accessToken: <your access token>
   accessTokenSecret: <your token secret>

That way you don’t have to provide your credentials every time you create a stream:

xd:> stream create --name tweets --definition "twitterstream | hdfs --rollover=2M" --deploy

Create the Batch Job

You will now create a new batch job using the Spring XD Shell:

xd:>job create --name hashtagCountJob --definition "hashtagCountExample" --deploy

You should see a message:

Successfully created and deployed job 'hashtagCountJob'

Launch the job using:

xd:>job launch hashtagCountJob

You should see a message:

Successfully submitted launch request for job 'hashtagCountJob'

Verify the result

First specify the Hadoop NameNode for the Spring XD Shell:

xd:>hadoop config fs --namenode hdfs://localhost:8020

We will now take a look at the /xd directory in HDFS:

xd:>hadoop fs ls /xd

You should see output like the following:

Found 2 items
drwxr-xr-x   - hillert supergroup          0 2013-08-12 11:01 /xd/hashtagcount
drwxrwxrwx   - hillert supergroup          0 2013-08-12 11:00 /xd/tweets

As we declared the property tweets.output.path in spring-module.xml to be /xd/hashtagcount/output/, let’s have a look at that directory:

xd:>hadoop fs ls /xd/hashtagcount/output
Found 2 items
-rw-r--r--   3 hillert supergroup          0 2013-08-10 00:07 /xd/hashtagcount/output/_SUCCESS
-rw-r--r--   3 hillert supergroup      31752 2013-08-10 00:07 /xd/hashtagcount/output/part-r-00000

Finally, executing:

xd:>hadoop fs cat /xd/hashtagcount/output/part-r-00000

should yield a long list of hashtags, each with the number of times it occurs in the provided Twitter data.
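
Because the list can be long, it may help to rank it once you have copied the result file out of HDFS. The following is a rough sketch, assuming each line has the form `hashtag<TAB>count`, which is the default text output layout for Hadoop MapReduce jobs; the sample's job may format its output differently. The `top_hashtags` helper and the example lines are hypothetical.

```python
def top_hashtags(lines, n=10):
    """Return the n most frequent (hashtag, count) pairs."""
    pairs = []
    for line in lines:
        # Assumed format: "<hashtag>\t<count>"
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # skip lines that do not match the assumed format
        tag, count = parts
        try:
            pairs.append((tag, int(count)))
        except ValueError:
            continue  # skip lines whose count is not an integer
    # Highest counts first; ties keep their original order.
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:n]


# Hypothetical lines standing in for the contents of part-r-00000.
example = ["hadoop\t3\n", "java\t7\n", "springxd\t5\n", "garbage line\n"]
for tag, count in top_hashtags(example, n=2):
    print(f"{count:>6}  {tag}")
```

To use it on real output, pass an open file object for your local copy of part-r-00000 instead of the `example` list.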