layout | title | permalink |
---|---|---|
default | Apache Apex Runner | /documentation/runners/apex/ |
The Apex Runner executes Apache Beam pipelines using Apache Apex as an underlying engine. The runner has broad support for the [Beam model and supports streaming and batch pipelines]({{ site.baseurl }}/documentation/runners/capability-matrix/).
Apache Apex is a stream processing platform and framework for low-latency, high-throughput and fault-tolerant analytics applications on Apache Hadoop. Apex has a unified streaming architecture and can be used for real-time and batch processing. With its stateful stream processing architecture, Apex can support all of the concepts in the Beam model (event time, triggers, watermarks etc.).
You may set up your own Hadoop cluster and set up Apache Apex on top of it, or choose a vendor-specific distribution that includes Hadoop and Apex pre-installed. Please see the distribution information on the Apache Apex website.
Download some data for processing and put it on HDFS:
curl http://www.gutenberg.org/cache/epub/1128/pg1128.txt > /tmp/kinglear.txt
hdfs dfs -mkdir -p /tmp/input/
hdfs dfs -put /tmp/kinglear.txt /tmp/input/
The output directory must not already exist on HDFS. Delete it if it exists:
hdfs dfs -rm -r -f /tmp/output/
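The cleanup step above can also be written as a guarded check, which makes it explicit whether anything was removed. This is a minimal sketch, not part of the runner: the `OUTPUT_DIR` variable and the `clean_output_dir` function are hypothetical names introduced for illustration, and the script assumes the `hdfs` command is on the `PATH` of the machine where it runs.

```shell
# Hypothetical variable; the path matches the one used in this guide.
OUTPUT_DIR="/tmp/output/"

# Hypothetical helper: remove the output directory only if it exists.
clean_output_dir() {
  # `hdfs dfs -test -d` exits with 0 when the path exists and is a directory.
  if hdfs dfs -test -d "$OUTPUT_DIR"; then
    echo "Removing existing output directory $OUTPUT_DIR"
    hdfs dfs -rm -r -f "$OUTPUT_DIR"
  else
    echo "Output directory $OUTPUT_DIR does not exist; nothing to remove"
  fi
}

# Only call the helper when the hdfs client is actually installed.
if command -v hdfs >/dev/null 2>&1; then
  clean_output_dir
else
  echo "hdfs command not found; skipping cleanup" >&2
fi
```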
Run the WordCount example:
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--inputFile=/tmp/input/ --output=/tmp/output/ --runner=ApexRunner --embeddedExecution=false --configFile=beam-runners-apex.properties" -Papex-runner
This will launch an Apex application.
Because the sample program processes only a small amount of data, it should finish quickly. You can check the contents of /tmp/output/ on HDFS:
hdfs dfs -ls /tmp/output/
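Beyond listing the directory, it can be useful to peek at the first few result lines. The following is a sketch under assumptions: it assumes the pipeline wrote its output shards as plain text files directly under /tmp/output/, and the `OUTPUT_PREFIX` variable and `show_output` function are hypothetical names used only for this example.

```shell
# Hypothetical variable; the path matches the --output argument used in this guide.
OUTPUT_PREFIX="/tmp/output/"

# Hypothetical helper: list the output files, then print the first 20 lines.
show_output() {
  # The glob is quoted so that HDFS, not the local shell, expands it.
  hdfs dfs -ls "$OUTPUT_PREFIX" &&
    hdfs dfs -cat "${OUTPUT_PREFIX}*" | head -n 20
}

# Only call the helper when the hdfs client is actually installed.
if command -v hdfs >/dev/null 2>&1; then
  show_output
else
  echo "hdfs command not found; cannot show output" >&2
fi
```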
Depending on your installation, you may be able to monitor the progress of your job on the Hadoop cluster. Alternatively, you have the following options:
- YARN: Use the YARN web UI, which generally runs on port 8088 on the node running the ResourceManager.
- Apex CLI: Use the Apex command-line interface to get information about running applications.
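The monitoring options above can be sketched from a terminal as follows. This is illustrative only: `RM_HOST` and `yarn_ui_url` are hypothetical names, `localhost` is a placeholder for your ResourceManager host, and the script assumes the `yarn` client is on the `PATH`. The `list-apps` command mentioned in the comment is an assumption about the interactive Apex CLI; check its built-in help on your installation.

```shell
# Hypothetical placeholder; replace with the host running the ResourceManager.
RM_HOST="localhost"

# Hypothetical helper: print the address of the YARN web UI (default port 8088).
yarn_ui_url() {
  echo "http://${RM_HOST}:8088/cluster"
}

echo "YARN web UI: $(yarn_ui_url)"

# List applications that YARN currently reports as running.
if command -v yarn >/dev/null 2>&1; then
  yarn application -list -appStates RUNNING
else
  echo "yarn command not found; skipping application listing" >&2
fi

# Assumption: inside the interactive Apex CLI (started with `apex`),
# `list-apps` shows the applications known to the cluster.
```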