docs/src/reference/docbook/samples/spring-batch-wordcount.xml

<?xml version="1.0" encoding="UTF-8"?>
<chapter version="5.0" xml:id="batch-wordcount"
         xmlns="http://docbook.org/ns/docbook"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         xmlns:xi="http://www.w3.org/2001/XInclude"
         xmlns:ns5="http://www.w3.org/1998/Math/MathML"
         xmlns:ns4="http://www.w3.org/1999/xhtml"
         xmlns:ns3="http://www.w3.org/2000/svg"
         xmlns:ns="http://docbook.org/ns/docbook">
  <title>Wordcount sample using Spring Batch</title>

  <para>Please read the <link linkend="sample-prereq">sample
  prerequisites</link> before following these instructions.</para>

  <section id="spring:batch-wordcount">
    <title>Introduction</title>

    <para>This sample demonstrates how to execute the wordcount example in the
    context of a Spring Batch application. It serves as a starting point that
    will be expanded upon in other samples. The sample code is located in the
    distribution directory
    <literal>&lt;spring-hadoop-install-dir&gt;/samples/batch-wordcount</literal>.</para>

    <para>The sample uses the Spring Hadoop namespace to define a Spring Batch
    Tasklet that runs a Hadoop job. A <link
    xlink:href="http://static.springsource.org/spring-batch/reference/html/domain.html#domainJob">Spring
    Batch Job</link> (not to be confused with a Hadoop job) combines multiple
    steps together to create a flow of execution, a small workflow. Each step
    in the flow does some processing, which can be as complex or as simple as
    you require. The configuration of the flow from one step to another can be
    very simple, such as a linear sequence of steps, or complex, using
    conditional and programmatic branching as well as sequential and parallel
    step execution.
    A <link
    xlink:href="http://static.springsource.org/spring-batch/apidocs/org/springframework/batch/core/step/tasklet/Tasklet.html">Spring
    Batch Tasklet</link> is the abstraction that represents the processing for
    a Step and is an extensible part of Spring Batch. You can write your own
    implementation of a Tasklet to perform arbitrary processing, but often you
    configure existing Tasklets provided by Spring Batch and Spring
    Hadoop.</para>

    <para>Spring Batch provides Tasklets for reading, writing, and processing
    data from flat files, databases, and messaging systems, and for executing
    system (shell) commands. Spring Hadoop provides Tasklets for running
    Hadoop MapReduce, Streaming, Hive, and Pig jobs, as well as for executing
    script files that have built-in support for easy interaction with
    HDFS.</para>
  </section>

  <section>
    <title>Basic Spring Hadoop configuration</title>

    <para>The part of the configuration file that defines the Spring Batch Job
    flow is shown below and can be found in the file
    <literal>wordcount-context.xml</literal>. The elements
    <literal>&lt;batch:job/&gt;</literal>,
    <literal>&lt;batch:step/&gt;</literal>, and
    <literal>&lt;batch:tasklet/&gt;</literal> come from the XML Schema for
    Spring Batch.</para>

    <programlisting language="xml">&lt;batch:job id="job1"&gt;
    &lt;batch:step id="import" next="wordcount"&gt;
      &lt;batch:tasklet ref="script-tasklet"/&gt;
    &lt;/batch:step&gt;

    &lt;batch:step id="wordcount"&gt;
      &lt;batch:tasklet ref="wordcount-tasklet"/&gt;
    &lt;/batch:step&gt;
&lt;/batch:job&gt;</programlisting>

    <para>This configuration defines a Spring Batch Job named "job1" that
    contains two steps executed sequentially. The first step prepares HDFS
    with sample data and the second runs the Hadoop wordcount MapReduce job.
    The tasklets refer to the "script-tasklet" and "wordcount-tasklet"
    definitions, which are shown later in this section.</para>
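
    <para>For comparison, a conditional flow of the kind mentioned in the
    introduction could be sketched as follows. This is only an illustration
    and not part of the sample: the job id
    <literal>job1-conditional</literal>, the <literal>cleanup</literal> step,
    and the <literal>cleanup-tasklet</literal> bean are hypothetical names.
    The <literal>&lt;batch:next/&gt;</literal> transitions replace the
    <literal>next</literal> attribute and route execution according to the
    exit status of the <literal>import</literal> step.</para>

    <programlisting language="xml">&lt;!-- illustrative sketch only; 'cleanup' and 'cleanup-tasklet' are hypothetical --&gt;
&lt;batch:job id="job1-conditional"&gt;
    &lt;batch:step id="import"&gt;
      &lt;batch:tasklet ref="script-tasklet"/&gt;
      &lt;!-- if preparing HDFS fails, skip the word count and clean up --&gt;
      &lt;batch:next on="FAILED" to="cleanup"/&gt;
      &lt;!-- any other exit status continues with the word count --&gt;
      &lt;batch:next on="*" to="wordcount"/&gt;
    &lt;/batch:step&gt;

    &lt;batch:step id="wordcount"&gt;
      &lt;batch:tasklet ref="wordcount-tasklet"/&gt;
    &lt;/batch:step&gt;

    &lt;batch:step id="cleanup"&gt;
      &lt;batch:tasklet ref="cleanup-tasklet"/&gt;
    &lt;/batch:step&gt;
&lt;/batch:job&gt;</programlisting>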

    <para>The <link
    xlink:href="http://www.springsource.com/developer/sts">SpringSource Tool
    Suite</link> is a free Eclipse-powered development environment that
    provides visualization and authoring help for Spring Batch workflows, as
    shown below.<mediaobject>
        <imageobject role="html">
          <imagedata fileref="images/batch-wordcount-ide.jpg" format="JPG"/>
        </imageobject>
      </mediaobject></para>

    <para>The <link linkend="scripting-tasklet">script tasklet</link> shown
    below uses Groovy to remove any data that is in the input or output
    directories and puts the file "nietzsche-chapter-1.txt" into HDFS.</para>

    <programlisting language="xml">&lt;script-tasklet id="script-tasklet"&gt;
  &lt;script language="groovy"&gt;
    inputPath = "${wordcount.input.path:/user/gutenberg/input/word/}"
    outputPath = "${wordcount.output.path:/user/gutenberg/output/word/}"
    if (fsh.test(inputPath)) {
      fsh.rmr(inputPath)
    }
    if (fsh.test(outputPath)) {
      fsh.rmr(outputPath)
    }
    inputFile = "src/main/resources/data/nietzsche-chapter-1.txt"
    fsh.put(inputFile, inputPath)
  &lt;/script&gt;
&lt;/script-tasklet&gt;</programlisting>

    <para>The script makes use of the predefined variable fsh, which is the
    embedded <link linkend="scripting-fssh">Filesystem Shell</link> that
    Spring Hadoop provides. It also uses Spring's <link
    xlink:href="http://static.springsource.org/spring/docs/3.1.x/spring-framework-reference/html/beans.html#beans-factory-placeholderconfigurer">Property
    Placeholder</link> functionality so that the input and output paths can be
    configured externally to the application, for example using property files.
    The syntax for variables recognized by Spring's Property Placeholder is
    <literal>${key:defaultValue}</literal>, so in this case
    <literal>/user/gutenberg/input/word</literal> and
    <literal>/user/gutenberg/output/word</literal> are the default input and
    output paths. Note you can also use <link
    xlink:href="http://static.springsource.org/spring/docs/3.1.x/spring-framework-reference/html/expressions.html">Spring's
    Expression Language</link> to configure values in the script or in <link
    xlink:href="http://static.springsource.org/spring/docs/3.1.x/spring-framework-reference/html/expressions.html#expressions-beandef">other
    XML definitions</link>.</para>
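
    <para>As a minimal sketch of the Expression Language alternative, a value
    can be computed from a system property inside an ordinary bean definition.
    The bean id <literal>userOutputPath</literal> is hypothetical and not part
    of the sample, and the fragment assumes that the Spring beans namespace is
    the default namespace of the enclosing file.</para>

    <programlisting language="xml">&lt;!-- hypothetical bean: output directory derived from the current user's name --&gt;
&lt;bean id="userOutputPath" class="java.lang.String"&gt;
  &lt;constructor-arg value="/user/#{systemProperties['user.name']}/output/word/"/&gt;
&lt;/bean&gt;</programlisting>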

    <para>The configuration of the tasklet that executes the Hadoop MapReduce
    job is shown below.</para>

    <programlisting language="xml">&lt;hdp:tasklet id="hadoop-tasklet" job-ref="mr-job"/&gt;

&lt;job id="wordcount-job"
  input-path="${wordcount.input.path:/user/gutenberg/input/word/}"
  output-path="${wordcount.output.path:/user/gutenberg/output/word/}" 
  mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
  reducer="org.apache.hadoop.examples.WordCount.IntSumReducer" /&gt;</programlisting>

    <para>The <literal>&lt;hdp:tasklet/&gt;</literal> and
    <literal>&lt;hdp:job/&gt;</literal> elements are from the Spring Hadoop
    XML Schema. The Hadoop tasklet is the bridge between the Spring Batch
    world and the Hadoop world; it in turn refers to your MapReduce job.
    Various common properties of a MapReduce job can be set, such as the
    mapper and reducer. The section <link linkend="hadoop-job">Creating a
    Hadoop Job</link> describes the additional elements you can configure in
    the XML.</para>

    <note>
      <para>If you look at the JavaDocs for the org.apache.hadoop.examples
      package (<link
      xlink:href="org.apache.hadoop.examples.PiEstimator.PiMapper">here</link>),
      you can see the Mapper and Reducer class names for many examples you may
      have previously used from the hadoop command line runner.</para>
    </note>

    <para>The <literal>configuration-ref</literal> attribute of the job
    definition refers to common Hadoop configuration information that can be
    shared across many jobs. That configuration is defined in the file
    <literal>hadoop-context.xml</literal> and is shown below.</para>

    <programlisting language="xml">&lt;!--  default id is 'hadoop-configuration' --&gt;
&lt;hdp:configuration register-url-handler="false"&gt;
  fs.default.name=${hd.fs}
&lt;/hdp:configuration&gt;</programlisting>
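
    <para>To make the link between the two definitions visible, the job could
    name the shared configuration explicitly. The following is a sketch that
    assumes the default id given in the comment above,
    <literal>hadoop-configuration</literal>; the sample itself does not set
    this attribute.</para>

    <programlisting language="xml">&lt;!-- sketch: explicit reference, assuming the default id 'hadoop-configuration' --&gt;
&lt;hdp:job id="wordcount-job"
  configuration-ref="hadoop-configuration"
  input-path="${wordcount.input.path:/user/gutenberg/input/word/}"
  output-path="${wordcount.output.path:/user/gutenberg/output/word/}"
  mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
  reducer="org.apache.hadoop.examples.WordCount.IntSumReducer"/&gt;</programlisting>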

    <para>As mentioned before, because <literal>hadoop-context.xml</literal>
    is a configuration file processed by the Spring container, it supports
    variable substitution through the use of
    <literal>${var}</literal>-style variables. In this case the location for
    HDFS is parameterized and no default value is provided. The property file
    <literal>hadoop.properties</literal> contains the definition of the
    <literal>hd.fs</literal> variable; change the value if you want to refer
    to a different name node location.</para>

    <programlisting>hd.fs=hdfs://localhost:9000</programlisting>
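
    <para>If a fallback is preferred over a mandatory property, the
    <literal>${key:defaultValue}</literal> syntax described earlier can be
    applied to the Hadoop configuration as well. The following sketch is not
    part of the sample and assumes that the local name node address shown
    above is an acceptable default.</para>

    <programlisting language="xml">&lt;!-- sketch: falls back to a local name node when hd.fs is not defined --&gt;
&lt;hdp:configuration register-url-handler="false"&gt;
  fs.default.name=${hd.fs:hdfs://localhost:9000}
&lt;/hdp:configuration&gt;</programlisting>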

    <para>The entire application is put together in the configuration file
    <literal>launch-context.xml</literal>, shown below.</para>

    <programlisting language="xml">&lt;!-- where to read externalized configuration values --&gt;
&lt;context:property-placeholder location="classpath:batch.properties,classpath:hadoop.properties"
  ignore-resource-not-found="true" ignore-unresolvable="true" /&gt;

&lt;!-- simple base configuration for batch components, e.g. JobRepository --&gt;
&lt;import resource="classpath:/META-INF/spring/batch-common.xml" /&gt;
&lt;!-- shared hadoop configuration --&gt;
&lt;import resource="classpath:/META-INF/spring/hadoop-context.xml" /&gt;
&lt;!-- word count workflow --&gt;
&lt;import resource="classpath:/META-INF/spring/wordcount-context.xml" /&gt;</programlisting>
  </section>

  <section>
    <title>Build and run the sample application</title>

    <para>In the directory
    <literal>&lt;spring-hadoop-install-dir&gt;/samples/batch-wordcount</literal>,
    build and run the sample:</para>

    <programlisting>$ ../../gradlew</programlisting>

    <para>If this is the first time you are using gradlew, it will download
    the Gradle build tool and all necessary dependencies to run the
    sample.</para>

    <note>
      <para>If you run into some issues, drop us a line in the <link
      xlink:href="http://forum.springsource.org/forumdisplay.php?27-Data">Spring
      Forums</link>.</para>
    </note>
  </section>

  <section>
    <title>Run the sample application as a standalone Java application</title>

    <para>You can use Gradle to export all required dependencies into a
    directory and create a shell script to run the application. To do this,
    execute the command:</para>

    <programlisting>$ ../../gradlew installApp
</programlisting>

    <para>This places the shell scripts and dependencies under the
    <literal>build/install/batch-wordcount</literal> directory. You can zip up that
    directory and share the application with others.</para>

    <para>The main Java class used is part of Spring Batch:
    <classname>org.springframework.batch.core.launch.support.CommandLineJobRunner</classname>.
    It requires you to specify at least a Spring configuration file and a job
    name. See the <literal>CommandLineJobRunner</literal> <link
    xlink:href="http://static.springsource.org/spring-batch/apidocs/org/springframework/batch/core/launch/support/CommandLineJobRunner.html">JavaDocs</link>
    and <link
    xlink:href="http://static.springsource.org/spring-batch/reference/html/configureJob.html#runningJobsFromCommandLine">this
    section in the reference docs</link> for Spring Batch to find out which
    command-line options it supports.</para>

    <programlisting>$ ./build/install/batch-wordcount/bin/batch-wordcount classpath:/launch-context.xml job1
</programlisting>

    <para>You can then verify that the output from wordcount is present and
    cat it to the screen:</para>

    <programlisting>$ hadoop dfs -ls /user/gutenberg/output/word
Warning: $HADOOP_HOME is deprecated.

Found 2 items
-rw-r--r--   3 mpollack supergroup          0 2012-01-31 19:29 /user/gutenberg/output/word/_SUCCESS
-rw-r--r--   3 mpollack supergroup     918472 2012-01-31 19:29 /user/gutenberg/output/word/part-r-00000

$ hadoop dfs -cat /user/gutenberg/output/word/part-r-00000 | more
Warning: $HADOOP_HOME is deprecated.

"'Spells 1
"'army'  1
"(1)     1
"(Lo)cra"1
"13      4
"1490    1
"1498,"  1
"35"     1
"40,"    1
"A       9
"AS-IS". 1
"AWAY    1
"A_      1
"Abide   1
"About   1
</programlisting>
  </section>
</chapter>