This repository has been archived by the owner on Apr 5, 2022. It is now read-only.
/
spring-batch-wordcount.xml
259 lines (217 loc) · 12 KB
/
spring-batch-wordcount.xml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
<?xml version="1.0" encoding="UTF-8"?>
<chapter version="5.0" xml:id="batch-wordcount"
xmlns="http://docbook.org/ns/docbook"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:ns5="http://www.w3.org/1998/Math/MathML"
xmlns:ns4="http://www.w3.org/1999/xhtml"
xmlns:ns3="http://www.w3.org/2000/svg"
xmlns:ns="http://docbook.org/ns/docbook">
<title>Wordcount sample using Spring Batch</title>
<para>Please read the <link linkend="sample-prereq">sample
prerequistes</link> before following these instructions.</para>
<section id="spring:batch-wordcount">
<title>Introduction</title>
<para>This sample demonstrates how to execute the wordcount example in the
context of a Spring Batch application. It serves as a starting point that
will be expanded upon in other samples. The sample code is located in the
distribution directory
<literal><spring-hadoop-install-dir>/samples/batch-wordcount</literal></para>
<para>The sample uses the Spring Hadoop namespace to define a Spring Batch
Tasklet that runs a Hadoop job. A <link
xlink:href="http://static.springsource.org/spring-batch/reference/html/domain.html#domainJob">Spring
Batch Job</link> (not to be confused with a Hadoop job) combines multiple
steps together to create a flow of execution, a small workflow. Each step
in the flow does some processing which can be as complex or as simple as
your require. The configuration of the flow from one step to another can
very simple, a linear sequence of steps, or complex using conditional and
programmatic branching as well as sequential and parallel step execution.
A <link
xlink:href="http://static.springsource.org/spring-batch/apidocs/org/springframework/batch/core/step/tasklet/Tasklet.html">Spring
Batch Tasklet</link> is the abstraction that represents the processing for
a Step and is an extensible part of Spring Batch. You can write your own
implementation of a Tasklet to perform arbitrary processing, but often you
configure existing Tasklets provided by Spring Batch and Spring
Hadoop.</para>
<para>Spring Batch provides Tasklets for reading, writing, and processing
data from flat files, databases, messaging systems and executing system
(shell) commands. Spring Hadoop provides Tasklets for running Hadoop
MapReduce, Streaming, Hive, and Pig jobs as well as executing script files
that have build in support for ease of use interaction with HDFS.</para>
</section>
<section>
<title>Basic Spring Hadoop configuration</title>
<para>The part of the configuration file that defines the Spring Batch Job
flow is shown below and can be found in the file
<literal>wordcount-context.xml</literal>. The elements
<literal><batch:job/>, <batch:step/>,
<batch:tasklet></literal> come from the XML Schema for Spring
Batch</para>
<programlisting language="xml"><batch:job id="job1">
<batch:step id="import" next="wordcount">
<batch:tasklet ref="script-tasklet"/>
</batch:step>
<batch:step id="wordcount">
<batch:tasklet ref="wordcount-tasklet" />
</batch:step>
</batch:job></programlisting>
<para>This configuration defines a Spring Batch Job named "job1" that
contains two steps executed sequentially. The first one prepares HDFS with
sample data and the second runs the Hadoop wordcount mapreduce job. The
tasklet's reference to "script-tasklet" and "wordcount-tasklet"
definitions that will be shown a little later.</para>
<para>The <link
xlink:href="http://www.springsource.com/developer/sts">Spring Source
Toolsuite</link> is a free Eclipse-powered development environment which
provides a nice visualization and authoring help for Spring Batch
workflows as shown below.<mediaobject>
<imageobject role="html">
<imagedata fileref="images/batch-wordcount-ide.jpg" format="JPG"/>
</imageobject>
</mediaobject></para>
<para>The <link linkend="scripting-tasklet">script tasklet</link> shown
below uses Groovy to remove any data that is in the input or output
directories and puts the file "nietzsche-chapter-1.txt" into HDFS.</para>
<programlisting language="xml"><script-tasklet id="script-tasklet">
<script language="groovy">
inputPath = "${wordcount.input.path:/user/gutenberg/input/word/}"
outputPath = "${wordcount.output.path:/user/gutenberg/output/word/}"
if (fsh.test(inputPath)) {
fsh.rmr(inputPath)
}
if (fsh.test(outputPath)) {
fsh.rmr(outputPath)
}
inputFile = "src/main/resources/data/nietzsche-chapter-1.txt"
fsh.put(inputFile, inputPath)
</script>
</script-tasklet></programlisting>
<para>The script makes use of the predefined variable fsh, which is the
embedded <link linkend="scripting-fssh">Filesystem Shell</link> that
Spring Hadoop provides. It also uses Spring's <link
xlink:href="http://static.springsource.org/spring/docs/3.1.x/spring-framework-reference/html/beans.html#beans-factory-placeholderconfigurer">Property
Placeholder</link> functionality so that the input and out paths can be
configured external to the application, for example using property files.
The syntax for variables recognized by Spring's Property Placeholder is
<literal>${key:defaultValue}</literal>, so in this case
<literal>/user/gutenberg/input/word</literal> and
<literal>/user/gutenberg/output/word</literal> are the default input and
output paths. Note you can also use <link
xlink:href="http://static.springsource.org/spring/docs/3.1.x/spring-framework-reference/html/expressions.html">Spring's
Expression Language</link> to configure values in the script or in <link
xlink:href="http://static.springsource.org/spring/docs/3.1.x/spring-framework-reference/html/expressions.html#expressions-beandef">other
XML definitions</link>.</para>
<para>The configuration of the tasklet to execute the Hadoop MapReduce
jobs is shown below.</para>
<programlisting language="xml"><hdp:tasklet id="hadoop-tasklet" job-ref="mr-job"/>
<job id="wordcount-job"
input-path="${wordcount.input.path:/user/gutenberg/input/word/}"
output-path="${wordcount.output.path:/user/gutenberg/output/word/}"
mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
reducer="org.apache.hadoop.examples.WordCount.IntSumReducer" /></programlisting>
<para>The <literal><hdp:tasklet></literal> and
<literal><hdp:job/></literal> elements are from Spring Haddop XML
Schema. The hadoop tasklet is the bridge between the Spring Batch world
and the Hadoop world. The hadoop tasklet in turn refers to your map reduce
job. Various common properties of a map reduce job can be set, such as the
mapper and reducer. The section <link linkend="hadoop-job">Creating a
Hadoop Job</link> describes the additional elements you can configure in
the XML.</para>
<note>
<para>If you look at the JavaDocs for the org.apache.hadoop.examples
package (<link
xlink:href="org.apache.hadoop.examples.PiEstimator.PiMapper">here</link>),
you can see the Mapper and Reducer class names for many examples you may
have previously used from the hadoop command line runner.</para>
</note>
<para>The <literal>configuration-ref</literal> element in the job
definition refers to common Hadoop configuration information that can be
shared across many jobs. It is defined in the file
<literal>hadoop-context.xml</literal> and is shown below</para>
<programlisting language="xml"><!-- default id is 'hadoop-configuration' -->
<hdp:configuration register-url-handler="false">
fs.default.name=${hd.fs}
</hdp:configuration></programlisting>
<para>As mentioned before, as this is a configuration file processed by
the Spring container, it supports variable substitution through the use of
<literal>${var} </literal>style variables. In this case the location for
HDFS is parameterized and no default value is provided. The property file
<literal>hadoop.properties</literal> contains the definition of the hd.fs
variable, change the value if you want to refer to a different name node
location.</para>
<programlisting>hd.fs=hdfs://localhost:9000</programlisting>
<para>The entire application is put together in the configuration file
<literal>launch-context.xml</literal>, shown below.</para>
<programlisting language="xml"><!-- where to read externalized configuration values -->
<context:property-placeholder location="classpath:batch.properties,classpath:hadoop.properties"
ignore-resource-not-found="true" ignore-unresolvable="true" />
<!-- simple base configuration for batch components, e.g. JobRepository -->
<import resource="classpath:/META-INF/spring/batch-common.xml" />
<!-- shared hadoop configuration -->
<import resource="classpath:/META-INF/spring/hadoop-context.xml" />
<!-- word count workflow -->
<import resource="classpath:/META-INF/spring/wordcount-context.xml" /></programlisting>
</section>
<section>
<title>Build and run the sample application</title>
<para>In the directory
<literal><spring-hadoop-install-dir>/samples/batch-wordcount</literal>
build and run the sample</para>
<programlisting>$ ../../gradlew</programlisting>
<para>If this is the first time you are using gradlew, it will download
the Gradle build tool and all necessary dependencies to run the
sample.</para>
<note>
<para>If you run into some issues, drop us a line in the <link
xlink:href="http://forum.springsource.org/forumdisplay.php?27-Data">Spring
Forums</link>.</para>
</note>
</section>
<section>
<title>Run the sample application as a standlone Java application</title>
<para>You can use Gradle to export all required dependencies into a
directory and create a shell script to run the application. To do this
execute the command</para>
<programlisting>$ ../../gradlew installApp
</programlisting>
<para>This places the shell scripts and dependencies under the
<literal>build/install/batch-wordcount</literal> directory. You can zip up that
directory and share the application with others.</para>
<para>The main Java class used is part of Spring Batch. The class is
<classname>org.springframework.batch.core.launch.support.CommandLineJobRunner</classname>.
This main app requires you to specify at least a Spring configuration file
and a job instance name. You can read the <literal>CommandLineJobRunner</literal> <link
xlink:href="http://static.springsource.org/spring-batch/apidocs/org/springframework/batch/core/launch/support/CommandLineJobRunner.html">JavaDocs</link>
for more information as well as <link
xlink:href="http://static.springsource.org/spring-batch/reference/html/configureJob.html#runningJobsFromCommandLine">this
section in the reference docs</link> for Spring Batch to find out more of
what command line options it supports.</para>
<programlisting>$ ./build/install/wordcount/bin/wordcount classpath:/launch-context.xml job1
</programlisting>
<para>You can then verify the output from work count is present and cat it
to the screen</para>
<programlisting>$ hadoop dfs -ls /user/gutenberg/output/word
Warning: $HADOOP_HOME is deprecated.
Found 2 items
-rw-r--r-- 3 mpollack supergroup 0 2012-01-31 19:29 /user/gutenberg/output/word/_SUCCESS
-rw-r--r-- 3 mpollack supergroup 918472 2012-01-31 19:29 /user/gutenberg/output/word/part-r-00000
$ hadoop dfs -cat /user/gutenberg/output/word/part-r-00000 | more
Warning: $HADOOP_HOME is deprecated.
"'Spells 1
"'army' 1
"(1) 1
"(Lo)cra"1
"13 4
"1490 1
"1498," 1
"35" 1
"40," 1
"A 9
"AS-IS". 1
"AWAY 1
"A_ 1
"Abide 1
"About 1
</programlisting>
</section>
</chapter>