Building Big Data Pipelines with Apache Beam

This is the code repository for Building Big Data Pipelines with Apache Beam, published by Packt.

Use a single programming model for both batch and stream data processing

What is this book about?

This book describes both batch and real-time processing pipelines. You’ll learn how to implement basic and advanced big data use cases with ease and develop a deep understanding of the Apache Beam model. You’ll also discover how the portability layer works and what the building blocks of an Apache Beam runner are.

This book covers the following exciting features:

Understand the core concepts and architecture of Apache Beam
Implement stateless and stateful data processing pipelines
Use state and timers for real-time event processing
Structure your code for reusability
Use streaming SQL to process real-time data, increasing productivity and data accessibility
Run a pipeline using a portable runner and implement data processing using the Apache Beam Python SDK
Implement Apache Beam I/O connectors using the Splittable DoFn API

If you feel this book is for you, get your copy today!

Instructions and Navigations

All of the code is organized into folders.

The code will look like the following:

ClassLoader loader = FirstPipeline.class.getClassLoader();
String file = loader.getResource("lorem.txt").getFile();
List<String> lines = Files.readAllLines(Paths.get(file), StandardCharsets.UTF_8);
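
The snippet above is a fragment and depends on a `lorem.txt` resource being on the classpath. As a rough, self-contained sketch of the same file-reading step, the example below reads the lines of a UTF-8 text file using only the Java standard library. The class name `ReadLinesSketch`, the helper `readLines`, and the temporary file are assumptions made for illustration; they are not part of the book's code, and the sketch does not use Apache Beam.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; not part of the book's repository.
public class ReadLinesSketch {

  // Reads every line of a UTF-8 encoded text file into memory.
  static List<String> readLines(Path path) throws IOException {
    return Files.readAllLines(path, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    // Stand-in for lorem.txt: a temporary file with two sample lines.
    Path tmp = Files.createTempFile("lorem", ".txt");
    try {
      Files.write(tmp, Arrays.asList("lorem ipsum", "dolor sit amet"), StandardCharsets.UTF_8);
      List<String> lines = readLines(tmp);
      System.out.println("Read " + lines.size() + " lines");
    } finally {
      Files.delete(tmp);
    }
  }
}
```

Taking a `Path` instead of resolving the file through the class loader lets the sketch run without the repository's resources.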

Who this book is for: data engineers, data scientists, and data analysts who want to learn how Apache Beam works. Intermediate-level knowledge of the Java programming language is assumed.

The software and hardware listed below are enough to run all of the code files in the book (Chapters 1-7).

Software and Hardware List

Chapter	Software required	OS required
1-7	Java 11, Python 3	Windows, Mac OS X, and Linux (Any)
1-7	Bash	Windows, Mac OS X, and Linux (Any)
1-7	Docker	Windows, Mac OS X, and Linux (Any)

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. Click here to download it.

Related products

Data Engineering with Apache Spark, Delta Lake, and Lakehouse [Packt] [Amazon]
Data Engineering with Python [Packt] [Amazon]

Get to Know the Author

Jan Lukavský is a freelance big data architect and engineer who is also an Apache Beam committer. He is a certified Apache Hadoop professional. He works on open source big data systems that combine batch and streaming data pipelines in a unified model, enabling the rise of real-time, data-driven applications.

