A list of articles that are essential for understanding stream processing.
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Martin Kleppmann. This is the book we had been waiting 10 years for: a definitive guide for entering the field of distributed systems and stream data processing. It covers fundamental concepts, techniques, and challenges in continuously processing large volumes of data.
- Japanese translation of the book 「データ指向アプリケーションデザイン」 is also available
- Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing. A book by Tyler Akidau (the author of Streaming 101), Slava Chernyak, and Reuven Lax. It introduces how batch processing and stream processing can be unified within a single system and covers the basic ideas of streaming SQL.
- One SQL to Rule Them All. Edmon Begoli, Tyler Akidau, Fabian Hueske, Julian Hyde, Kathryn Knight, Kenneth Knowles SIGMOD 2019 A proposal to extend the current SQL semantics to support both batch and stream processing by adding time-varying relations (TVR), event time, and materialization control.
- DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. Y. Yu, et al. OSDI 2008. The origin of declarative data processing operators (e.g., map, filter, and groupBy) in modern programming languages.
- OpenMessaging: Common Use Cases. Illustrations of typical stream processing patterns from OpenMessaging.
- The world beyond batch: Streaming 101. An article written by the author of Google Dataflow. Streaming is actually a superset of batch processing for unbounded data. This article explains what unbounded data is and how to manage late-arriving data. You can also learn what streaming systems can and cannot do, the difference between event time and processing time, and the varieties of time-window-based processing.
- Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. SIGMOD 2018. A good summary of the challenges around continuous stream processing and unifying APIs for batch and stream processing.
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale. NSDI 2013. An approach for applying micro-batch-style stream processing in Spark. This model was redesigned in Spark 2.0 as Structured Streaming.
- Continuous Applications: Evolving Streaming in Apache Spark 2.0
- Drizzle: Fast and Adaptable Stream Processing at Scale. SOSP 2017 An approach for reducing the overhead of the coordination between stream processing tasks.
- Dataflow/Beam & Spark: A Programming Model Comparison
- ReactiveX. Stream processing patterns for functional programming.
- Akka Stream. A stream processing DSL for Akka.
- A Practical Guide to Selecting A Stream Processing Technology. A good explanation of stream processing.
- Trill: A High-Performance Incremental Query Processor for Diverse Analytics. B. Chandramouli, et al. VLDB 2014
- Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Utilizing scalable object storage on the cloud (e.g., S3), Delta Lake provides a single table format with versioning (time travel) and transaction support.
- Iceberg
- Big Metadata: When Metadata is Big Data. (VLDB 2021) Columnar table catalog with partition statistics used in Google's BigQuery.
- Apache Hudi. Apache Hudi provides a file layout that supports both streaming and batch data processing with transaction support. You can merge fragmented partition data or use it as is for faster real-time data processing. It was previously called Uber Hoodie.
- The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. Akidau et al. (Google), VLDB 2015. The original paper behind Google Cloud Dataflow, which describes how to cope with delayed data arrival (late-arriving data) and periodic data processing in a unified API for batch and stream processing. You can also find a summary of this paper at the morning paper.
- Watermarks in Stream Processing Systems: Semantics and Comparative Analysis of Apache Flink and Google Cloud Dataflow. VLDB 2021. Describes basic definitions of watermarks and shows challenges and trade-offs in managing watermarks.
- Watermarking in stream processing | Course in Spark Structured Streaming 3.0 | Lesson 7 A good tutorial explaining the notions of stream processing and watermark management.
- Towards a Learning Optimizer for Shared Clouds (VLDB 2019). Learns cardinality models from previous job executions in order to optimize the overall workload. This work uses a multi-layer perceptron (MLP) neural network to learn models from query execution features (e.g., job name, input cardinality, average row length, and input dataset names).
- CrocodileDB: Efficient Database Execution through Intelligent Deferment (CIDR 2020). This paper introduces the Intermittent Query Processing (IQP) approach, which combines knowledge about new data, query semantics, and users' expectations to reduce the overall processing cost. It uses Deep Q-Materialization (DQM) to make trade-offs under given resource constraints (e.g., memory, CPUs, storage) and decide how much data will be cached, pre-computed, pre-loaded, and so on.
- Peregrine: Workload Optimization for Cloud Query Engines (SoCC 2019). Analyzes the workload of historical queries and optimizes recurring queries, similar queries, and coordinating queries by extracting common subexpressions that can be materialized. To support various query engines including Spark, Microsoft has created a common intermediate representation (IR) of workloads.
- Olston, C. et al. 2011. Nova: continuous Pig/Hadoop workflows. (Jun. 2011)
- Naiad: A Timely Dataflow System (SOSP 2013 best paper). Differential data processing developed at Microsoft. Naiad Project Page. The proposed approach is now implemented in Rust as Timely Dataflow.
- Apache Flink: Spinning Fast Iterative Data Flows. PVLDB 2012
- DBSP: Automatic Incremental View Maintenance for Rich Query Languages (VLDB 2023 Best Paper) By introducing Integration (I) and Differentiation (D) operators, DBSP allows the composition of complex incremental SQL queries, including joins, aggregations, recursive loops, etc.
- Keep Your Distributed Data Warehouse Consistent at a Minimal Cost (PACMMOD 2023) Proposes a method for maintaining consistency in large data warehouses while minimizing updates. It solves a dynamic programming problem to balance computational cost and data freshness. Implemented at the YouTube Data Warehouse, the method has cut update requests by 25% by removing non-trivial duplicates, significantly saving computing resources.
- What’s the Difference? Incremental Processing with Change Queries in Snowflake (ACM Management of Data 2023). Snowflake introduces CHANGE queries and STREAM table objects for subscribing to changes in a table.
- dbt: Incremental Models. dbt, a tool for compiling a sequence of queries from SQL templates, supports simple incremental processing by conditionally switching between SQL queries.
- DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views (VLDB 2012). An efficient way of finding the delta of delta queries (higher-order deltas) for computing materialized views as new update records arrive. Demo source code written in OCaml and Scala can be found at https://github.com/dbtoaster
- How to Win a Hot Dog Eating Contest: Distributed Incremental View Maintenance with Batch Updates. M. Nikolic et al. (SIGMOD 2016). Extends DBToaster to batched and distributed incremental view maintenance settings.
- Generalized Scale Independence Through Incremental Precomputation. M. Armbrust, et al. SIGMOD 2013 An approach for guaranteeing the response time of queries by classifying query types and preparing materialized views if necessary.
- Comet: batched stream processing for data intensive distributed computing (SoCC 2010). A basic style of incremental processing.
- Continuous queries over append-only databases. SIGMOD 1992
- Selecting Subexpressions to Materialize at Datacenter Scale. PVLDB 2018. Microsoft SCOPE: automatically finds common subexpressions among queries and materializes their results to reduce the overhead of recurring queries.
- Tempura: A General Cost-Based Optimizer Framework for Incremental Data Processing (VLDB 2020). A cost-based optimizer for choosing the right incremental processing methods. Demo source code extending Apache Calcite is available.
- Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google. VLDB 2021. Controls the timing of eager materialization of queries based on the user's requirements (favoring freshness or performance).
- Amazon Redshift: Automatic Query Rewriting with Materialized Views
- BigQuery: Smart tuning. BigQuery automatically rewrites queries to use materialized views whenever possible. Automatic rewriting improves query performance and cost, and does not change query results.
- OpenIVM: a SQL-to-SQL Compiler for Incremental Computations (SIGMOD-Companion 2024). Performs incremental view maintenance (IVM) at the SQL level and is implemented as a DuckDB extension (source code).
- Incremental Refresh for Materialized Views (Databricks)
- Fluentd. A unified logging layer that collects logs from various data sources.
- Kafka is often used for providing replayable streaming data sources.
- Apache Pulsar A distributed pub-sub messaging system originally created at Yahoo!
- OpenMessaging Cloud-oriented, simple, flexible, vendor-neutral and language-independent standards for messaging
- Uber Hoodie Hybrid storage: Avro for streaming import, Parquet for analysis. This project has been moved to Apache Hudi
- MQTT A machine-to-machine (M2M)/"Internet of Things" connectivity protocol.
- Robust, Scalable, Real-Time Event Time Series Aggregation at Twitter. SIGMOD 2018
- Questioning the Lambda Architecture. The Lambda Architecture is commonly used for managing recent data and archived data. However, combining two types of systems for batch and streaming is still painful because analysts need to understand both systems (e.g., Hadoop + Storm, Spark + Spark Streaming, Kafka + other data store).
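
Several entries in this list (DBToaster, DBSP, OpenIVM, and the materialized view features of the cloud warehouses) revolve around incremental view maintenance: updating a materialized result from a batch of changes instead of recomputing it from scratch. The following is a minimal Python sketch of that idea for a single `SUM ... GROUP BY` view. It is not the algorithm of any of the papers listed here; the class name `IncrementalSumView`, the `(key, value, weight)` delta format (weight `+1` for an insert, `-1` for a delete), and the helper `recompute` are all assumptions made for illustration.

```python
from collections import defaultdict


class IncrementalSumView:
    """Toy materialized view for: SELECT key, SUM(value) FROM t GROUP BY key.

    Hypothetical illustration only: deltas are (key, value, weight) tuples,
    where weight is +1 for an inserted row and -1 for a deleted row.
    """

    def __init__(self):
        self._sums = defaultdict(int)    # key -> running SUM(value)
        self._counts = defaultdict(int)  # key -> number of live rows

    def apply(self, delta):
        """Update the view from a batch of changes without rescanning the table."""
        for key, value, weight in delta:
            self._sums[key] += weight * value
            self._counts[key] += weight
            if self._counts[key] == 0:
                # The group has no rows left, so it disappears from the view.
                del self._sums[key]
                del self._counts[key]

    def snapshot(self):
        """Return the current view contents as a plain dict."""
        return dict(self._sums)


def recompute(rows):
    """Reference implementation: recompute the view from all (key, value) rows."""
    full = defaultdict(int)
    for key, value in rows:
        full[key] += value
    return dict(full)


view = IncrementalSumView()
view.apply([("a", 1, +1), ("b", 5, +1), ("a", 2, +1)])
first = view.snapshot()

# Second batch: delete the only row of group "b" and insert one more row for "a".
view.apply([("b", 5, -1), ("a", 10, +1)])
second = view.snapshot()

# The incrementally maintained view matches a full recomputation.
expected = recompute([("a", 1), ("a", 2), ("a", 10)])
```

The sketch only covers a linear aggregate, where applying a delta is trivial. Joins, nested aggregates, and recursion are where the systems in this list do the real work.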
Real-time stream processing usually refers to ultra-low-latency applications that must satisfy SLAs for returning results within a few seconds.
- The Stratosphere Platform for Big Data Analytics. Stratosphere is the former name of Apache Flink.
- The 8 Requirements of Real-Time Stream Processing. M. Stonebraker, et al. SIGMOD Record 2005. A summary is also available in the morning paper
- MacroBase: Prioritizing Attention in Fast Data. P. Bailis, et al. SIGMOD 2017. A data analytics engine that prioritizes end-user attention in high-volume fast data streams.
- A prototype implementation on GitHub
- Foundations of Streaming SQL by Tyler Akidau. Good illustrations for understanding how regular table-based SQL and streaming SQL are different.
- Microsoft Azure Stream Analytics
- Norikra Schema-less Stream Processing with SQL
- Esper
- KSQL A Streaming SQL Engine for Apache Kafka
- OpenMessaging
- Apache Beam A unified model for defining both batch and streaming data-parallel processing pipelines.
- spotify/scio: A Scala API for Google Cloud Dataflow
- twitter/heron
- Microsoft Naiad
- Norikra
- KSQL
- Apache Apex Unified stream and batch processing engine.
- List of projects related to stream-processing: https://github.com/manuzhang/awesome-streaming
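
Many of the articles and frameworks in this list (Streaming 101, the Dataflow Model, the watermark papers, Apache Beam) build on event-time windows and watermarks. As a rough illustration, here is a small Python sketch of tumbling-window counting with a heuristic watermark. It is a simplification made for this list, not the API of any framework: the function name `tumbling_window_counts`, the `allowed_lateness` parameter, and the rule "watermark = largest event time seen minus allowed lateness" are assumptions chosen for the example.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, allowed_lateness):
    """Count events per event-time tumbling window.

    events: iterable of (event_time, payload) in arrival (processing-time) order.
    Assumed heuristic: the watermark is the largest event time seen so far
    minus allowed_lateness. A window [start, start + window_size) is emitted
    once the watermark reaches its end; events that arrive for an already
    emitted window are reported as dropped.
    """
    counts = defaultdict(int)   # window start -> count of buffered events
    emitted = []                # (window_start, count) in emission order
    dropped = []                # events that arrived after their window closed
    watermark = float("-inf")

    for event_time, payload in events:
        start = (event_time // window_size) * window_size
        if start + window_size <= watermark:
            # The window for this event has already been emitted.
            dropped.append((event_time, payload))
            continue
        counts[start] += 1
        watermark = max(watermark, event_time - allowed_lateness)
        # Emit every buffered window whose end the watermark has passed.
        for s in sorted(w for w in counts if w + window_size <= watermark):
            emitted.append((s, counts.pop(s)))

    # End of input: flush whatever is still buffered.
    for s in sorted(counts):
        emitted.append((s, counts.pop(s)))
    return emitted, dropped


# Events arrive out of order: (4, "d") is late but within the allowed lateness,
# while (8, "f") arrives after its window [0, 10) has been emitted.
events = [(1, "a"), (3, "b"), (12, "c"), (4, "d"), (25, "e"), (8, "f")]
emitted, dropped = tumbling_window_counts(events, window_size=10, allowed_lateness=5)
```

With these inputs, `emitted` is `[(0, 3), (10, 1), (20, 1)]` and `dropped` is `[(8, "f")]`. A larger `allowed_lateness` would keep window `[0, 10)` open longer and accept the late event, at the cost of delayed results. This is the correctness versus latency trade-off that the watermark articles discuss.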