tomstepp/big-data-beam

Building Big Data Pipelines with Apache Beam

This is the code repository for Building Big Data Pipelines with Apache Beam, published by Packt.

Use a single programming model for both batch and stream data processing

What is this book about?

This book describes both batch processing and real-time processing pipelines. You’ll learn how to implement basic and advanced big data use cases with ease and develop a deep understanding of the Apache Beam model. In addition to this, you’ll discover how the portability layer works and the building blocks of an Apache Beam runner.

This book covers the following exciting features:

  • Understand the core concepts and architecture of Apache Beam
  • Implement stateless and stateful data processing pipelines
  • Use state and timers for real-time event processing
  • Structure your code for reusability
  • Use streaming SQL to process real-time data, increasing productivity and data accessibility
  • Run a pipeline using a portable runner and implement data processing using the Apache Beam Python SDK
  • Implement Apache Beam I/O connectors using the Splittable DoFn API

If you feel this book is for you, get your copy today!

https://www.packtpub.com/

Instructions and Navigation

All of the code is organized into folders.

The code will look like the following:

ClassLoader loader = FirstPipeline.class.getClassLoader();
String file = loader.getResource("lorem.txt").getFile();
List<String> lines = Files.readAllLines(Paths.get(file), StandardCharsets.UTF_8);
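The snippet above depends on the FirstPipeline class and a lorem.txt resource from the repository. As a minimal, self-contained sketch of the same line-reading step, the example below writes a small temporary file, reads it back with Files.readAllLines, and counts the words. The class name ReadLinesSketch, the countWords helper, and the sample text are hypothetical and introduced only for illustration. Word counting is an assumed stand-in for whatever the book's pipeline does with the lines, and the sketch uses only the Java standard library rather than the Apache Beam SDK.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class ReadLinesSketch {

  // Hypothetical helper: split each line on non-word characters and count
  // how often each non-empty token occurs. A TreeMap keeps the keys sorted.
  static Map<String, Long> countWords(List<String> lines) {
    return lines.stream()
        .flatMap(line -> Arrays.stream(line.split("\\W+")))
        .filter(word -> !word.isEmpty())
        .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
  }

  public static void main(String[] args) throws IOException {
    // Assumption: a temporary file stands in for the "lorem.txt" classpath resource.
    Path file = Files.createTempFile("lorem", ".txt");
    try {
      Files.write(
          file,
          Arrays.asList("lorem ipsum dolor", "ipsum lorem ipsum"),
          StandardCharsets.UTF_8);

      // Same call as in the snippet above: read the whole file as UTF-8 lines.
      List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);

      System.out.println(countWords(lines));
    } finally {
      Files.deleteIfExists(file);
    }
  }
}
```

With the sample text used here, running the class prints {dolor=1, ipsum=3, lorem=2}. In a Beam pipeline the list of lines would typically become the input of the first transform instead of being processed with Java streams.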

What you need for this book: This book is for data engineers, data scientists, and data analysts who want to learn how Apache Beam works. Intermediate-level knowledge of the Java programming language is assumed.

With the following software and hardware, you can run all the code files in the book (Chapters 1-7).

Software and Hardware List

Chapter | Software required | OS required
1-7     | Java 11, Python 3 | Windows, Mac OS X, and Linux (Any)
1-7     | Bash              | Windows, Mac OS X, and Linux (Any)
1-7     | Docker            | Windows, Mac OS X, and Linux (Any)

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. Click here to download it.

Get to Know the Author

Jan Lukavský is a freelance big data architect and engineer who is also a committer of Apache Beam. He is a certified Apache Hadoop professional. He is working on open source big data systems combining batch and streaming data pipelines in a unified model, enabling the rise of real-time, data-driven applications.

License

Apache-2.0 (LICENSE). The file license-header-spotless.txt is detected as an unknown license.

Languages

  • Java 88.6%
  • Python 11.2%
  • Other 0.2%