A set of example streaming and batch job implementations with Apache Beam.
Dataflow brings life to data lakes.
- Monorepo (apps, libs) project to showcase a workspace setup with multiple apps and shared libraries
- Polyglot - supports multiple languages (Java, Kotlin)
- Supports building a FatJar for submitting jobs from a CI environment
- Cloud native (run locally, run on the cloud, deploy as a template for Google Cloud Dataflow)
- Multiple runtimes (Flink, Spark, Google Cloud Dataflow, Hazelcast Jet)
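
A common way to produce the FatJar mentioned above is the Gradle Shadow plugin. The snippet below is a minimal sketch only, assuming that plugin; the plugin version and the main class name are placeholders, not taken from this repo:

```groovy
// build.gradle - hypothetical FatJar setup using the Shadow plugin
plugins {
    id 'application'
    id 'com.github.johnrengelman.shadow' version '7.1.2'
}

application {
    // Placeholder main class; substitute the pipeline's actual entry point.
    mainClass = 'org.example.wordcount.WordCountKt'
}

shadowJar {
    // Merge META-INF/services descriptors so Beam runners remain
    // discoverable via ServiceLoader inside the merged jar.
    mergeServiceFiles()
    zip64 true
}
```

With this in place, `gradle shadowJar` produces a single self-contained jar that a CI job can submit to any of the supported runners.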
See PLAYBOOK for details.
Run the WordCount Kotlin example:

```shell
gradle :apps:wordcount:run --args="--runner=DirectRunner --inputFile=./src/test/resources/data/input.txt --output=./build/output.txt"
```
The WordCount pipeline will run locally and produce the output file in the apps/wordcount/build directory.

The WordCount pipeline can also run on Google Cloud Dataflow if you have a project set up locally:
```shell
PROJECT_ID=<my-project-id>
GCS_BUCKET=<my-project-gcs-bucket>
export GOOGLE_APPLICATION_CREDENTIALS=<full-path-to-your-json>

gradle :apps:wordcount:run --args="--runner=DataflowRunner --project=$PROJECT_ID --gcpTempLocation=gs://$GCS_BUCKET/dataflow/wordcount/temp/ --stagingLocation=gs://$GCS_BUCKET/dataflow/wordcount/staging/ --inputFile=gs://$GCS_BUCKET/dataflow/wordcount/input/shakespeare.txt --output=gs://$GCS_BUCKET/dataflow/wordcount/output/output.txt"
```
The inputFile option has a default value defined in the WordCount options, so the pipeline will run even if it is omitted, producing output files in gs://your-cloud-storage-bucket.
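
Such a default is typically declared on the pipeline's options interface. Below is a minimal sketch of that pattern using Beam's PipelineOptions from Kotlin; the interface name, descriptions, and default path are illustrative assumptions, not taken from this repo (it requires the Beam SDK on the classpath):

```kotlin
import org.apache.beam.sdk.options.Default
import org.apache.beam.sdk.options.Description
import org.apache.beam.sdk.options.PipelineOptions
import org.apache.beam.sdk.options.Validation

// Hypothetical options interface; the repo's actual one may differ.
interface WordCountOptions : PipelineOptions {
    // @Default.String supplies the value used when --inputFile is omitted.
    @get:Description("Path of the file to read from")
    @get:Default.String("gs://apache-beam-samples/shakespeare/kinglear.txt")
    var inputFile: String

    // No default here, so --output must be passed on the command line.
    @get:Description("Path of the file to write to")
    @get:Validation.Required
    var output: String
}
```

Beam discovers the getter/setter pairs generated for these Kotlin properties via reflection, so command-line flags like --inputFile map onto them automatically.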