Restartable Batch YARN Application :: Learn how to build a restartable Spring Batch YARN application

tags: hadoop, yarn, boot, batch
projects: spring-hadoop

This guide walks you through the process of executing a Spring Batch job and its partitioned steps on Hadoop YARN. We also simulate an error in a partitioned step to demonstrate restarting a job.

What you’ll build

You’ll build a simple Hadoop YARN application with Spring Hadoop and Spring Boot. This application contains a job with two master steps whose actual work is executed on YARN as partitioned steps. We also simulate an error during step execution to demonstrate a job restart, so that execution continues from the failed steps.

What you’ll need

Note
Testing this sample application doesn’t require an existing or running Hadoop instance.

Set up the project

First you set up a basic build script. You can use any build system you like when building apps with Spring, but the code you need to work with Gradle is included here. If you’re not familiar with it, refer to Building Java Projects with Gradle.

Create a Gradle build file

Below is the initial Gradle build file. If you are using Spring Tool Suite (STS), you can import the guide directly.

build.gradle

link:complete/build.gradle[]

In the above Gradle build file we simply create three different jars (client, appmaster and container), each containing classes for its specific role. These jars are then repackaged by Spring Boot’s Gradle plugin to create executable jars.

You also need a settings.gradle file to define the sub-projects.

settings.gradle

link:complete/settings.gradle[]
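
The linked file above is the authoritative version; as a rough sketch, a settings.gradle for this layout would simply include the sub-projects referenced throughout this guide:

// Hedged sketch of settings.gradle; the sub-project names are taken from the
// module paths referenced in this guide, so verify them against the linked file.
include 'gs-yarn-batch-restart-client'
include 'gs-yarn-batch-restart-appmaster'
include 'gs-yarn-batch-restart-container'
include 'gs-yarn-batch-restart-dist'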

Create a Remote Batch Step

First we create an HdfsTasklet class.

gs-yarn-batch-restart-container/src/main/java/hello/container/HdfsTasklet.java

link:complete/gs-yarn-batch-restart-container/src/main/java/hello/container/HdfsTasklet.java[]
  • We @Autowired Hadoop’s Configuration to be able to use FsShell.

  • We simply check whether a file exists in HDFS and throw a RuntimeException if it doesn’t, to simulate an error in the Tasklet. If the file does exist, we return FINISHED from the Tasklet.

  • To figure out which file name to use, we read the step name, which looks something like remoteStep1:partition0, and strip out any characters that are not valid in a file name (see the sketch below).
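
The linked file above is the real implementation; as a minimal sketch of the idea, assuming the standard Spring Batch Tasklet contract and Spring for Apache Hadoop’s FsShell, the class might look roughly like this (the /tmp prefix and the exact character filtering are assumptions made for illustration):

package hello.container;

import org.apache.hadoop.conf.Configuration;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.hadoop.fs.FsShell;

public class HdfsTasklet implements Tasklet {

	// Hadoop's Configuration is autowired so we can create an FsShell
	@Autowired
	private Configuration configuration;

	@Override
	public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
		// The step name looks like remoteStep1:partition0; strip characters that
		// are not legal in a file name to build the HDFS path to check
		String stepName = chunkContext.getStepContext().getStepName();
		String fileName = "/tmp/" + stepName.replaceAll("[^A-Za-z0-9]", "");

		FsShell shell = new FsShell(configuration);
		if (!shell.test(fileName)) {
			// Simulate a step failure when the marker file is missing from HDFS
			throw new RuntimeException("File " + fileName + " missing from HDFS");
		}
		// Marker file found, so the step completes normally
		return RepeatStatus.FINISHED;
	}
}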

Next we create a ContainerApplication class.

gs-yarn-batch-restart-container/src/main/java/hello/container/ContainerApplication.java

link:complete/gs-yarn-batch-restart-container/src/main/java/hello/container/ContainerApplication.java[]
  • We simply create two steps named remoteStep1 and remoteStep2 and attach the HdfsTasklet to both of them, roughly as sketched below.
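
A hedged sketch of that configuration, assuming Spring Batch’s StepBuilderFactory is made available by the batch auto-configuration (the linked file shows the exact wiring):

package hello.container;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableAutoConfiguration
public class ContainerApplication {

	@Autowired
	private StepBuilderFactory stepBuilderFactory;

	// The tasklet that checks for the marker file in HDFS
	@Bean
	public HdfsTasklet hdfsTasklet() {
		return new HdfsTasklet();
	}

	// Two remote steps; the appmaster dispatches their partitions to this container
	@Bean
	protected Step remoteStep1() {
		return stepBuilderFactory.get("remoteStep1").tasklet(hdfsTasklet()).build();
	}

	@Bean
	protected Step remoteStep2() {
		return stepBuilderFactory.get("remoteStep2").tasklet(hdfsTasklet()).build();
	}

	public static void main(String[] args) {
		SpringApplication.run(ContainerApplication.class, args);
	}
}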

Next we create an application.yml file for the container app.

gs-yarn-batch-restart-container/src/main/resources/application.yml

link:complete/gs-yarn-batch-restart-container/src/main/resources/application.yml[]
  • We disable the batch functionality in Spring Boot core in favour of the YARN-specific batch features.

  • We add Hadoop configuration for HDFS. This can be customized for accessing a real cluster.

  • We enable batch processing on YARN by using the spring.yarn.batch.enable property (see the sketch below).
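
Purely as an illustration of those three points, and not a copy of the linked file, the container configuration might look roughly like this (the HDFS URI is a placeholder):

spring:
    batch:
        job:
            enabled: false            # let Spring YARN drive job execution instead of Boot
    hadoop:
        fsUri: hdfs://localhost:8020  # placeholder; point at a real cluster as needed
    yarn:
        batch:
            enable: true              # enable batch processing on YARN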

Create a Batch Job

Now we create an AppmasterApplication class.

gs-yarn-batch-restart-appmaster/src/main/java/hello/appmaster/AppmasterApplication.java

link:complete/gs-yarn-batch-restart-appmaster/src/main/java/hello/appmaster/AppmasterApplication.java[]
  • We simply create two master steps named master1 and master2, configure them to be partitioned on YARN, and set the grid size to two (roughly as sketched below).
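
The linked file is the real implementation; as a rough sketch, assuming that the YARN-aware PartitionHandler and the builder factories are supplied by Spring YARN’s Boot auto-configuration (assumptions made for illustration), the configuration could look something like this:

package hello.appmaster;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.PartitionHandler;
import org.springframework.batch.core.partition.support.SimplePartitioner;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableAutoConfiguration
public class AppmasterApplication {

	@Autowired
	private JobBuilderFactory jobBuilderFactory;

	@Autowired
	private StepBuilderFactory stepBuilderFactory;

	// Assumed: a YARN-aware handler that ships the partitions to containers,
	// provided by Spring YARN's batch auto-configuration
	@Autowired
	private PartitionHandler partitionHandler;

	// The job named "job" runs the two master steps in sequence
	@Bean
	public Job job() {
		return jobBuilderFactory.get("job")
				.start(master1())
				.next(master2())
				.build();
	}

	// master1 partitions remoteStep1; with an externally provided handler the
	// effective grid size may come from the handler's own configuration
	@Bean
	protected Step master1() {
		return stepBuilderFactory.get("master1")
				.partitioner("remoteStep1", new SimplePartitioner())
				.partitionHandler(partitionHandler)
				.gridSize(2)
				.build();
	}

	@Bean
	protected Step master2() {
		return stepBuilderFactory.get("master2")
				.partitioner("remoteStep2", new SimplePartitioner())
				.partitionHandler(partitionHandler)
				.gridSize(2)
				.build();
	}

	public static void main(String[] args) {
		SpringApplication.run(AppmasterApplication.class, args);
	}
}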

Next we create an application.yml file for the appmaster.

gs-yarn-batch-restart-appmaster/src/main/resources/application.yml

link:complete/gs-yarn-batch-restart-appmaster/src/main/resources/application.yml[]
  • Again, we disable the batch functionality in Boot core in favour of the YARN-specific features.

  • We add Hadoop configuration for HDFS. This can be customized for accessing a real cluster.

  • We enable batch processing on YARN by using the spring.yarn.batch.enable property.

  • We define a job named job to run automatically.

  • We enable the job named job and allow it to run a next operation, indicating that job execution should not fail if the job parameters cannot be incremented.

  • We enable job restart, indicating that the job should not fail if it cannot be restarted (see the sketch below).
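
As an illustration only, the configuration described above might look roughly like the following. The job-flag property names (next, failNext, restart, failRestart) and the datasource settings are recalled from Spring YARN’s Boot configuration and from the HSQL instance used later in this guide, so treat them as assumptions and verify them against the linked file:

spring:
    batch:
        job:
            enabled: false
    datasource:
        url: jdbc:hsqldb:hsql://localhost/testdb   # the bundled HSQL instance (assumption)
        username: sa
        password:
        driverClassName: org.hsqldb.jdbc.JDBCDriver
    hadoop:
        fsUri: hdfs://localhost:8020
    yarn:
        batch:
            enable: true
            name: job                 # run the job named "job" automatically
            jobs:
              - name: job
                enabled: true
                next: true            # allow a next operation
                failNext: false       # don't fail if parameters can't be incremented
                restart: true         # allow restart of a failed execution
                failRestart: false    # don't fail if the job can't be restarted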

Create a Yarn Client

Now we create a ClientApplication class.

gs-yarn-batch-restart-client/src/main/java/hello/client/ClientApplication.java

link:complete/gs-yarn-batch-restart-client/src/main/java/hello/client/ClientApplication.java[]
  • @EnableAutoConfiguration tells Spring Boot to start adding beans based on classpath setting, other beans, and various property settings.

  • Specific auto-configuration for Spring YARN components takes place in the same way it does in a regular Spring Boot app.

The main() method uses Spring Boot’s SpringApplication.run() method to launch an application. From there, we simply request a bean of type YarnClient and execute its submitApplication() method. What happens next depends on application configuration, which we go through later in this guide. Did you notice that there wasn’t a single line of XML?
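
The linked class is the real one; a minimal sketch consistent with the description above would be:

package hello.client;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.context.ApplicationContext;
import org.springframework.context.annotation.Configuration;
import org.springframework.yarn.client.YarnClient;

@Configuration
@EnableAutoConfiguration
public class ClientApplication {

	public static void main(String[] args) {
		// Bootstrap the Boot app, then ask the auto-configured YarnClient
		// to submit the application to YARN
		ApplicationContext context = SpringApplication.run(ClientApplication.class, args);
		YarnClient client = context.getBean(YarnClient.class);
		client.submitApplication();
	}
}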

Next we create an application.yml file for the client.

gs-yarn-batch-restart-client/src/main/resources/application.yml

link:complete/gs-yarn-batch-restart-client/src/main/resources/application.yml[]
  • Here we simply define all the application files that are needed for submission (see the sketch below).
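
Illustratively, and with property names recalled from Spring YARN’s Boot configuration rather than copied from the linked file (so verify them there), the client configuration tends to look something like this; the host names and jar paths are placeholders:

spring:
    hadoop:
        fsUri: hdfs://localhost:8020
        resourceManagerHost: localhost
    yarn:
        appName: gs-yarn-batch-restart
        client:
            files:
              - "file:path/to/gs-yarn-batch-restart-container-0.1.0.jar"
              - "file:path/to/gs-yarn-batch-restart-appmaster-0.1.0.jar"
            launchcontext:
                archiveFile: gs-yarn-batch-restart-appmaster-0.1.0.jar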

Run the Application

Now that you’ve successfully compiled and packaged your application, it’s time for the fun part of executing it on Hadoop YARN.

Because we need to persist the Spring Batch job status, a database is needed. We’ve bundled an instance of HSQL, which is easy to run in in-memory mode. In a separate terminal window run the following:

$ cd db/
$ unzip hsqldb-2.3.1.zip
$ cd hsqldb-2.3.1/hsqldb/data/
$ java -cp ../lib/hsqldb.jar org.hsqldb.server.Server --database.0 mem:testdb --dbname.0 testdb --silent false --trace true

Note: If you build this from scratch you can download the HSQL zip file from http://sourceforge.net/projects/hsqldb/files/hsqldb/hsqldb_2_3/.

Back to setting up the actual application run. First create two empty files, /tmp/remoteStep1partition0 and /tmp/remoteStep1partition1, in HDFS:

$ hdfs dfs -touchz /tmp/remoteStep1partition0
$ hdfs dfs -touchz /tmp/remoteStep1partition1

Then run the application:

$ cd gs-yarn-batch-restart-dist
$ java -jar target/gs-yarn-batch-restart-dist/gs-yarn-batch-restart-client-0.1.0.jar

If you check the application status from the YARN resource manager, you’ll see that the application FAILED because the partitioned steps of the second phase failed. Now create the files /tmp/remoteStep2partition0 and /tmp/remoteStep2partition1:

$ hdfs dfs -touchz /tmp/remoteStep2partition0
$ hdfs dfs -touchz /tmp/remoteStep2partition1

Run the application again:

$ java -jar target/gs-yarn-batch-restart-dist/gs-yarn-batch-restart-client-0.1.0.jar

You should now see that the application finished successfully and that only the previously failed partitioned steps were executed.

Test the Application

Below is a class which can be used to execute this application as a JUnit test without a running Hadoop cluster.

gs-yarn-batch-restart-dist/src/test/java/hello/AppIT.java

link:complete/gs-yarn-batch-restart-dist/src/test/java/hello/AppIT.java[]

Running the JUnit test doesn’t require an existing database instance because, as seen in the example above, an HSQL instance is created within the test itself.
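
For orientation only (the linked AppIT contains the real logic, including starting the embedded HSQL instance and driving the failure-and-restart cycle), a Spring YARN mini-cluster test typically has roughly this shape; the class and method names below are recalled from Spring YARN’s test support and should be verified against the linked file:

package hello;

import static org.hamcrest.CoreMatchers.is;
import static org.junit.Assert.assertThat;

import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.junit.Test;
import org.springframework.yarn.boot.test.junit.AbstractBootYarnClusterTests;
import org.springframework.yarn.test.context.MiniYarnClusterTest;
import org.springframework.yarn.test.junit.ApplicationInfo;

import hello.client.ClientApplication;

// Runs the client against an in-process mini YARN cluster provided by
// Spring YARN's test support; no external Hadoop installation is needed.
@MiniYarnClusterTest
public class AppIT extends AbstractBootYarnClusterTests {

	@Test
	public void testAppSubmission() throws Exception {
		// Submit the application and wait until it completes
		ApplicationInfo info = submitApplicationAndWait(ClientApplication.class, new String[0]);
		assertThat(info.getYarnApplicationState(), is(YarnApplicationState.FINISHED));
	}
}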

Summary

Congratulations! You’ve just developed a Spring YARN application!