Spring XD Batch Word-count Sample
This is the Spring Batch word-count sample for Hadoop, adapted for Spring XD. The sample takes an input file and counts the occurrences of each word in that document.
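To make the counting step concrete, here is a minimal Python sketch of what "counting the occurrences of each word" means. It is not the sample's Hadoop MapReduce code; the `count_words` name and the whitespace-based, case-sensitive tokenization are assumptions made purely for illustration.

```python
from collections import Counter


def count_words(text: str) -> Counter:
    """Count how often each whitespace-separated token occurs in text."""
    # Assumption: tokens are split on whitespace and compared case-sensitively;
    # the real job may tokenize differently.
    return Counter(text.split())


if __name__ == "__main__":
    counts = count_words("the quick fox jumps over the lazy dog the end")
    # Print one "word<TAB>count" pair per line, sorted by word.
    for word, n in sorted(counts.items()):
        print(f"{word}\t{n}")
```

The job described in this sample performs the same kind of aggregation, but on the Hadoop cluster and with the result written to HDFS.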
In order for the sample to run you will need to have installed:
Note: If you are using a Hadoop distribution that uses a different configuration than the default one from Apache Hadoop, then you need to provide additional configuration settings to be used by any MapReduce tasks submitted to the cluster. See this page for details.
You can build the sample simply by executing:
$ mvn clean package
The project pom declares spring-xd-module-parent as its parent. This adds the dependencies needed to compile and test the module and also configures the Spring Boot Maven Plugin to package the module as an über-jar, packaging any dependencies that are not already provided by the Spring XD container. See the Modules section in the Spring XD Reference for more details on module packaging.
As a result, you will see the following jar being created:
Running the Sample
The wordcount sample is ready to be executed. The simplest way to run Spring XD is using the
Now start the Spring XD Shell in a separate window:
Upload the module
In the Spring XD shell:
xd:>module upload --type job --name wordcount --file [path-to]/batch-wordcount-1.0.0.BUILD-SNAPSHOT.jar
Now create a new batch job using the Spring XD Shell:
xd:>job create --name wordCountJob --definition "wordcount"
The UI, located on the machine where xd-singlenode is running, will show you the jobs that can be deployed. The UI is located at:
Alternatively, you can deploy the job using the shell command:
xd:>job deploy --name wordCountJob
We will now create a stream that polls a local directory for files. By default, the directory is named after the stream, so in this case the directory will be /tmp/xd/input/wordCountFiles. If the directory does not exist, it will be created. You can override the default directory using the
xd:>stream create --name wordCountFiles --definition "file --mode=ref > queue:job:wordCountJob" --deploy
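As a rough mental model of what this stream does, the following Python sketch polls a directory and reports each new file once. It is not Spring XD's implementation: `poll_once` and `seen` are hypothetical names, and the idea that `--mode=ref` hands a reference to the file (its path) to the job queue rather than the file contents is an assumption based on the option name.

```python
import os
import tempfile


def poll_once(directory: str, seen: set) -> list:
    """Return paths of files in directory that were not reported before."""
    # Mirror the documented behavior: create the directory if it is missing.
    os.makedirs(directory, exist_ok=True)
    new_paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and path not in seen:
            seen.add(path)
            # Assumption: only the path is forwarded, not the file contents.
            new_paths.append(path)
    return new_paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        input_dir = os.path.join(root, "wordCountFiles")
        seen = set()
        print(poll_once(input_dir, seen))  # directory is created, no files yet
        with open(os.path.join(input_dir, "sample.txt"), "w") as f:
            f.write("hello world hello")
        print(poll_once(input_dir, seen))  # the new file is reported
        print(poll_once(input_dir, seen))  # the same file is not reported again
```

In the real stream, each reported file would trigger one execution of wordCountJob.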
If you now drop text files into the /tmp/xd/input/wordCountFiles/ directory, they will be picked up, copied to HDFS, and their words counted. You can copy the supplied nietzsche-chapter-1.txt file to the input directory from the shell by executing:
xd:>! cp /path/to/spring-xd-samples/batch-wordcount/data/nietzsche-chapter-1.txt /tmp/xd/input/wordCountFiles
Note: Anything under the /xd/count directory on HDFS will be removed each time the job executes.
Verify the result
First specify the Hadoop NameNode for the Spring XD Shell:
xd:>hadoop config fs --namenode hdfs://localhost:8020
We will now take a look at the /xd directory in HDFS:
xd:>hadoop fs ls /xd/
You should see output like the following:
Found 1 items
drwxr-xr-x   - hillert supergroup          0 2013-08-12 11:01 /xd/count
As we declared the property wordcount.output.path in wordcount.xml to be /xd/count/out/, let's have a look at the respective directory:
xd:>hadoop fs ls /xd/count/out
Found 2 items
-rw-r--r--   3 hillert supergroup          0 2013-08-10 00:07 /xd/count/out/_SUCCESS
-rw-r--r--   3 hillert supergroup      31752 2013-08-10 00:07 /xd/count/out/part-r-00000
Finally, display the contents of the result file:
xd:>hadoop fs cat /xd/count/out/part-r-00000
This should yield a long list of words, each with the number of times it occurs in the provided input text.
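Because the list is long, it can help to look only at the most frequent words. The sketch below assumes the result has been saved locally and that each line has the form `word<TAB>count`, which is the default layout of Hadoop's text output; if the job is configured differently, the parsing will need adjusting. The `parse_counts` and `top_words` names and the sample data are made up for illustration.

```python
def parse_counts(lines):
    """Parse 'word<TAB>count' lines into a list of (word, count) tuples."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Assumption: the count follows the last tab on the line.
        word, _, count = line.rpartition("\t")
        pairs.append((word, int(count)))
    return pairs


def top_words(pairs, n=10):
    """Return the n most frequent words, highest count first, ties by word."""
    return sorted(pairs, key=lambda p: (-p[1], p[0]))[:n]


if __name__ == "__main__":
    sample = ["and\t3\n", "the\t5\n", "will\t2\n"]  # made-up example data
    for word, count in top_words(parse_counts(sample), n=2):
        print(word, count)
```

To use it on real output, pass an open file handle for a local copy of part-r-00000 to `parse_counts` instead of the sample list.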