Batch YARN Application :: Learn how to build a Spring Batch YARN application

This guide walks you through the process of executing a Spring Batch job on Hadoop YARN.

What you’ll build

You’ll build a simple Hadoop YARN application with Spring Hadoop, Spring Batch and Spring Boot.

What you’ll need

Note
Testing this sample application does not require an existing or running Hadoop instance.

Set up the project

First you set up a basic build script. You can use any build system you like when building apps with Spring, but the code you need to work with Gradle is included here. If you’re not familiar with it, refer to Building Java Projects with Gradle.

Create a Gradle build file

Below is the initial Gradle build file. If you are using Spring Tool Suite (STS), you can import the guide directly.

build.gradle

link:complete/build.gradle[]

In the Gradle build file above, we create three different jars, each containing the classes for its specific role. These jars are then repackaged by Spring Boot's Gradle plugin to create executable jars.

You also need a settings.gradle file to define the sub-projects.

settings.gradle

link:complete/settings.gradle[]

Spring Batch Intro

Many batch processing problems can be solved with single-threaded, single-process jobs, so it is always a good idea to check whether that meets your needs before thinking about more complex implementations. When you are ready to start implementing a job with some parallel processing, Spring Batch offers a range of options. At a high level there are two modes of parallel processing: single-process, multi-threaded; and multi-process.

Spring Hadoop contains support for running Spring Batch jobs on a Hadoop cluster using YARN. For better parallel processing, Spring Batch partitioned steps can be executed on YARN as remote steps.

The starting point for running a Spring Batch job on YARN is always the Application Master, whether or not the job uses partitioning. When partitioning is not used, the whole job runs within the Application Master and no additional YARN containers are launched. It may seem odd to run something on YARN without using containers, but remember that the Application Master is itself a resource allocated from the Hadoop cluster and runs in a YARN container.

To run Spring Batch jobs on a Hadoop cluster, a few constraints exist:

  • Job Context - The Application Master is the main entry point for running the job.

  • Job Repository - The Application Master needs access to a repository that is located either in memory or in a relational database. These are the two types natively supported by Spring Batch.

  • Remote Steps - Because of how Spring Batch partitioning works, remote steps need access to the job repository.

Let’s take a quick look at how Spring Batch partitioning is handled. Running a partitioned job involves three things: remote steps, a PartitionHandler and a Partitioner. Simplifying a little, a remote step is like any other step from the user's point of view. Spring Batch itself does not contain implementations for any proprietary grid or remoting fabrics. It does, however, provide a useful implementation of PartitionHandler that executes steps locally in separate threads of execution, using the TaskExecutor strategy from Spring. Spring Hadoop provides an implementation that executes steps remotely on a Hadoop cluster using YARN.

Note
For more background information about Spring Batch partitioning, read the Spring Batch reference documentation.
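
To make the roles described above more concrete, here is a minimal plain-Java sketch of the partitioning idea. It is not Spring Batch or Spring Hadoop code: the `Partitioner`, `WorkerStep` and `handle` names below are hypothetical stand-ins chosen for illustration, and the thread pool plays the part that YARN containers play when steps are executed remotely.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java illustration only; these types are hypothetical stand-ins,
// not the Spring Batch or Spring Hadoop API.
class PartitionSketch {

    /** Splits the work into named partitions, each with its own context. */
    interface Partitioner {
        Map<String, Map<String, Object>> partition(int gridSize);
    }

    /** The work that runs once per partition (the "remote step" role). */
    interface WorkerStep {
        String execute(String partitionName, Map<String, Object> context);
    }

    /** Hands every partition to a worker thread and collects the results. */
    static List<String> handle(Partitioner partitioner, WorkerStep step, int gridSize) {
        ExecutorService pool = Executors.newFixedThreadPool(gridSize);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Map.Entry<String, Map<String, Object>> e
                    : partitioner.partition(gridSize).entrySet()) {
                futures.add(pool.submit(() -> step.execute(e.getKey(), e.getValue())));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // wait for each partition to finish
            }
            return results;
        } catch (InterruptedException | ExecutionException ex) {
            throw new IllegalStateException("partition failed", ex);
        } finally {
            pool.shutdown();
        }
    }

    /** Builds gridSize partitions and runs a trivial step for each one. */
    static List<String> run(int gridSize) {
        Partitioner partitioner = size -> {
            Map<String, Map<String, Object>> partitions = new LinkedHashMap<>();
            for (int i = 0; i < size; i++) {
                partitions.put("partition" + i, Map.<String, Object>of("index", i));
            }
            return partitions;
        };
        WorkerStep step = (name, ctx) -> name + " done, index=" + ctx.get("index");
        return handle(partitioner, step, gridSize);
    }

    public static void main(String[] args) {
        run(2).forEach(System.out::println);
    }
}
```

In this sketch the handler waits for every partition before returning, which mirrors the idea that the partitioned step only completes once all of its partitions have reported back.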

Create a Remote Batch Step

Here you create a PrintTasklet class.

gs-yarn-batch-processing-container/src/main/java/hello/container/PrintTasklet.java

link:complete/gs-yarn-batch-processing-container/src/main/java/hello/container/PrintTasklet.java[]

A Tasklet is one of the easiest Spring Batch concepts to use when something needs to be executed in a step as part of a job. This simple tasklet demonstrates how easy it is to execute real partitioned steps on Hadoop YARN without introducing more complex job processing.

In the above PrintTasklet we simply write a log message when the Tasklet itself is executed.
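
For readers new to the tasklet idea, the following plain-Java sketch mimics its shape: a single callback that a step keeps invoking until it reports that it is finished. The `SimpleTasklet`, `Status`, `PrintingTasklet` and `runStep` names are hypothetical stand-ins invented for this illustration; the actual PrintTasklet in this guide implements the Spring Batch Tasklet interface instead.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; these types are hypothetical stand-ins that
// mimic the shape of a tasklet, not the Spring Batch API.
class TaskletSketch {

    /** Signals whether the tasklet is done or should be called again. */
    enum Status { FINISHED, CONTINUABLE }

    /** One unit of work; a step calls it until it reports FINISHED. */
    interface SimpleTasklet {
        Status execute(List<String> log);
    }

    /** Rough counterpart of a printing tasklet: log one message, then finish. */
    static class PrintingTasklet implements SimpleTasklet {
        @Override
        public Status execute(List<String> log) {
            log.add("PrintingTasklet says Hello");
            return Status.FINISHED;
        }
    }

    /** Minimal "step": keeps invoking the tasklet until it is finished. */
    static List<String> runStep(SimpleTasklet tasklet) {
        List<String> log = new ArrayList<>();
        while (tasklet.execute(log) == Status.CONTINUABLE) {
            // loop again; each call performs one more unit of work
        }
        return log;
    }

    public static void main(String[] args) {
        runStep(new PrintingTasklet()).forEach(System.out::println);
    }
}
```

A tasklet that needs several passes would return `Status.CONTINUABLE` until its work is complete; the printing example finishes after a single call.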

Now it’s time to create a ContainerApplication class.

gs-yarn-batch-processing-container/src/main/java/hello/container/ContainerApplication.java

link:complete/gs-yarn-batch-processing-container/src/main/java/hello/container/ContainerApplication.java[]
  • The @EnableYarnRemoteBatchProcessing annotation tells Spring to enable batch processing for YARN containers. Spring Batch's @EnableBatchProcessing is then included automatically, which means all the JavaConfig builders are available.

  • We use @Autowired to inject the step builder, which is then used to create steps as beans.

  • We define PrintTasklet as a @Bean.

  • We create a step as a @Bean and instruct it to execute a tasklet.

Next we create an application.yml file for the container.

gs-yarn-batch-processing-container/src/main/resources/application.yml

link:complete/gs-yarn-batch-processing-container/src/main/resources/application.yml[]
  • We disable the batch functionality in Spring Boot core in order to use the YARN-specific features.

  • We add Hadoop configuration for HDFS. This can be customized for accessing a real cluster.

  • We enable batch processing on YARN by using the spring.yarn.batch.enabled property.

Create a Batch Job

Here we create an AppmasterApplication class.

gs-yarn-batch-processing-appmaster/src/main/java/hello/appmaster/AppmasterApplication.java

link:complete/gs-yarn-batch-processing-appmaster/src/main/java/hello/appmaster/AppmasterApplication.java[]
  • @EnableYarnBatchProcessing tells Spring to enable batch processing for the appmaster.

  • We use @Autowired to inject the builders for steps and jobs.

Next we create an application.yml file for the appmaster.

gs-yarn-batch-processing-appmaster/src/main/resources/application.yml

link:complete/gs-yarn-batch-processing-appmaster/src/main/resources/application.yml[]
  • We disable the batch functionality in Spring Boot core in order to use the YARN-specific features.

  • We add Hadoop configuration for HDFS. This can be customized for accessing a real cluster.

  • We enable batch processing on YARN by using the spring.yarn.batch.enabled property.

  • We define a job named job to run automatically.

  • We enable the job named job and allow it to perform a next operation.

Create a Yarn Client

Here we create a ClientApplication class.

gs-yarn-batch-processing-client/src/main/java/hello/client/ClientApplication.java

link:complete/gs-yarn-batch-processing-client/src/main/java/hello/client/ClientApplication.java[]

The ClientApplication is similar to what we’ve used in other guides and its only purpose is to submit a YARN application.

Here you create an application.yml file for the client.

gs-yarn-batch-processing-client/src/main/resources/application.yml

link:complete/gs-yarn-batch-processing-client/src/main/resources/application.yml[]
  • Here we simply define all the application files that are needed for submission.

Run the Application

Now that we’ve successfully compiled and packaged our application, it’s time to do the fun part and execute it on Hadoop YARN.

Simply run the executable client jar.

$ cd gs-yarn-batch-processing-dist
$ java -jar target/gs-yarn-batch-processing-dist/gs-yarn-batch-processing-client-0.1.0.jar

If the application executes successfully, we should be able to see two steps executed on YARN.

Create a JUnit Test Class

Below is a class that can be used to execute this application as a JUnit test without a running Hadoop cluster.

gs-yarn-batch-processing-dist/src/test/java/hello/AppIT.java

link:complete/gs-yarn-batch-processing-dist/src/test/java/hello/AppIT.java[]

Summary

Congratulations! You’ve just developed a Spring YARN application executing a Spring Batch job!