<img src="./img/uva_seal.png">  

## Streaming Systems

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: November 6, 2024
---  


### SOURCES

*Streaming Systems* by Tyler Akidau, Slava Chernyak, and Reuven Lax


### OBJECTIVES
- Understand the definition of a streaming system, and how it is different from a batch system
- Differentiate tables from streams
- Understand why persistent state is essential in streaming systems
- Understand the tradeoffs between different methods of persisting data
- Differentiate between perfect watermarks and heuristic watermarks
- Understand how different accumulation modes work
- Present a short example of a windowed calculation using `Apache Beam`

### CONCEPTS

- Batch processing
- Streaming system
- Streams and Tables
- Persistent State
- Event Time vs Processing Time
- Windowing
- Sessions
- Triggers
- Watermarks
- Accumulation
- Exactly-Once Processing

---


### Batch Processing

Before learning about streaming systems, let's briefly define *batch processing*.  

Batch processing takes a finite dataset and processes it fully and all at once.  

An alternative would be to separately process each record one at a time.

The introduction of datasets designed to be infinite in cardinality poses a challenge to batch processing, since there is never a point at which the full dataset is available.

### Streaming System

A *streaming system* is a data processing engine designed with infinite datasets in mind. Examples include social data on the web, such as posts on Twitter and Facebook.  These websites can be treated as continuously accumulating data streams.   

Streaming systems include *microbatch* implementations such as *Spark Streaming*, discussed later.  

In microbatch, a batch processing engine is called repeatedly at a fairly high frequency, such as every 500 milliseconds.  
Each microbatch consists of a chunk of data, formed from a window of processing time.  

Treating data as infinite adds complexity, such as deciding:  
- what to persist (not all data can be stored)
- when to aggregate and report results

---

### Streams and Tables

**Tables** are data *at rest,* reflecting the data at a point in time.  Think of tables in a relational database.  

**Streams** are data *in motion*.  They capture the evolution of data over time.  

Aggregating a stream of updates over time produces a table.  

Observing changes to a table over time produces a stream.
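
The two statements above can be sketched in a few lines of Python. This is a minimal illustration, assuming a table is a plain dictionary and a stream is a sequence of `(key, value)` updates; the names `stream_to_table` and `table_to_stream` are hypothetical, and deletions are not handled.

```python
def stream_to_table(updates):
    """Aggregate a stream of (key, value) updates into a table (a dict)."""
    table = {}
    for key, value in updates:
        table[key] = value  # the latest update for a key wins
    return table


def table_to_stream(snapshots):
    """Observe successive snapshots of a table and emit each change."""
    previous = {}
    for snapshot in snapshots:
        for key, value in snapshot.items():
            if previous.get(key) != value:
                yield (key, value)  # a new or changed row becomes a stream element
        previous = dict(snapshot)


# Stream -> table
updates = [("alice", 1), ("bob", 5), ("alice", 3)]
table = stream_to_table(updates)

# Table (observed over time) -> stream
snapshots = [{"alice": 1}, {"alice": 1, "bob": 5}, {"alice": 3, "bob": 5}]
changes = list(table_to_stream(snapshots))

print(table)
print(changes)
```

In this small case, the changes recovered from the snapshots are the same updates that built the table.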

**MapReduce Viewed as Streams and Tables**

The stages look like this:  

1. Consume a table of input data, such as sentences from a corpus.
2. Preprocess the input data into key/value form.  Preprocessing transforms the data into a stream.
3. Consume the key/value pairs, outputting modified key/value pairs of the form `(word, 1)`. These are streams.
4. Shuffle the data, sending key/value pairs with the same key to the same worker.
5. Reduce by key, which is an aggregation.
6. Persist the data, producing a table.

The process starts with a table and ends with a table, using streams in between to transform the data.
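
As a rough sketch of these stages, here is a single-process word count in Python. It is not a real MapReduce job: the "workers" are just dictionaries, and the function names and the two-sentence corpus are made up for illustration.

```python
from collections import defaultdict


def map_phase(sentences):
    """Table -> stream: emit a (word, 1) pair for every word."""
    for sentence in sentences:
        for word in sentence.lower().split():
            yield (word, 1)


def shuffle(pairs, num_workers=2):
    """Send pairs with the same key to the same (simulated) worker."""
    workers = [defaultdict(list) for _ in range(num_workers)]
    for key, value in pairs:
        workers[hash(key) % num_workers][key].append(value)
    return workers


def reduce_phase(workers):
    """Stream -> table: aggregate the values for each key."""
    table = {}
    for groups in workers:
        for key, values in groups.items():
            table[key] = sum(values)
    return table


corpus = ["the cat sat", "the dog sat"]  # input table
word_counts = reduce_phase(shuffle(map_phase(corpus)))  # output table
print(word_counts)
```

The map phase is a generator, so nothing is stored until the grouping step collects values by key.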

**Transformations**

Transformations tell us what the pipeline is computing.  There are two kinds:

- *Nongrouping* transformations take a stream as input and produce a new stream as output.  Examples include `filter`, `map`.  


- *Grouping* transformations take a stream as input and perform aggregation, which transforms data into a table.  An example is `ReduceByKey`.

**To repeat, grouping is what produces tables.**

---

### Persistent State

The purpose of persisting tables is to capture data that otherwise would vanish.
There are a few reasons we need to do this:

- Durability upon Interruption

Streaming systems are supposed to run forever, but this isn't realistic in practice.  Interruptions happen for many reasons, such as machine failure, planned maintenance, code changes, and bad commands.

Persistent state helps the system to recover from an interruption.

For expensive, complex pipelines, *checkpointing* can be used to periodically save results.  In the event of a failure, the system loses only the data processed since the last checkpoint.  Persistence must be strategic, since it must be assumed that data that has already passed through the system cannot be reloaded.

- Correctness and Efficiency

By persisting the necessary intermediate quantities, the system can recover from an interruption.  It is important to save only what is necessary, since extra data just takes up space.
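
To make the checkpointing idea concrete, here is a minimal sketch, assuming the pipeline computes a running sum, the checkpoint is an in-memory dictionary standing in for durable storage, and the input is an iterator that cannot be rewound. The `fail_after` parameter is a hypothetical hook that simulates a crash.

```python
def process(stream, checkpoint, every=3, fail_after=None):
    """Sum a stream, saving the running total to `checkpoint` every `every` records."""
    total = checkpoint.get("total", 0)  # resume from the last checkpoint, if any
    seen = 0
    for value in stream:
        total += value
        seen += 1
        if seen % every == 0:
            checkpoint["total"] = total  # persist the intermediate result
        if fail_after is not None and seen == fail_after:
            raise RuntimeError("simulated machine failure")
    checkpoint["total"] = total
    return total


stream = iter([1, 2, 3, 4, 5, 6, 7])  # an iterator: consumed data cannot be replayed
checkpoint = {}

try:
    process(stream, checkpoint, every=3, fail_after=5)
except RuntimeError:
    pass  # the crash happens after records 4 and 5 were read but not checkpointed

recovered_total = process(stream, checkpoint, every=3)
print(recovered_total)
```

The first run checkpoints a total of 6 and then fails. The second run resumes from 6 and processes the remaining records, so the values 4 and 5, which arrived after the last checkpoint, are lost.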

**Raw Grouping versus Incremental Combining**

*Raw grouping* appends each new element to the stored group.  This can incur massive storage costs.

*Incremental combining* will incrementally compute and checkpoint an intermediate result.

*Running Average*  
As an example, if we are interested in the average of a vector of values, it will be more efficient to store a running sum and count (incremental combining) than to store all of the values (raw grouping).

*Histogram*  
A histogram will be a more complex accumulator than a running average, but it provides a better description of the distribution than the mean.  To persist a histogram, store a running count for each bucket range.

*Parallelization*  
Parallelization can be used to optimize aggregation.  Specifically, subgroups of data can be distributed across multiple machines, with each machine computing a partial aggregate.
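
The sketch below illustrates incremental combining for the running average and the histogram, and shows how partial aggregates from two partitions can be merged. The class and function names are hypothetical, and the two partitions stand in for two machines.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MeanAccumulator:
    """Stores only a running sum and count, never the raw values."""
    total: float = 0.0
    count: int = 0

    def add(self, value):
        self.total += value
        self.count += 1

    def merge(self, other):
        """Combine two partial aggregates, e.g. from two machines."""
        return MeanAccumulator(self.total + other.total, self.count + other.count)

    def result(self):
        return self.total / self.count if self.count else float("nan")


def accumulate(values):
    acc = MeanAccumulator()
    for value in values:
        acc.add(value)
    return acc


def histogram(values, bucket_width):
    """Running count per bucket, keyed by the lower edge of the bucket."""
    counts = Counter()
    for value in values:
        counts[int(value // bucket_width) * bucket_width] += 1
    return counts


partition_a = [2.0, 4.0, 6.0]
partition_b = [8.0]

mean = accumulate(partition_a).merge(accumulate(partition_b)).result()
hist = histogram(partition_a, 5) + histogram(partition_b, 5)  # Counters merge by addition
print(mean, dict(hist))
```

Both accumulators have a fixed, small size regardless of how many values have been seen, which is what makes them cheap to checkpoint.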

---

### Conversion Attribution

Conversion attribution is used in the advertising technology (AdTech) space to provide concrete feedback on the efficacy of advertisements.  It is a common use case for streaming systems.  In the ideal situation, a desired outcome (consumer purchases a product) can be precisely traced back to an advertisement.  In practice, this task is often challenging for reasons such as a complex path between the advertisement (a.k.a. the *impression*) and the outcome, and *attribution fraud*.  

The figure below depicts several user paths, with one conversion path leading from an impression to the outcome, or goal.  In this path, there are two additional steps between the impression and goal, illustrating that the task is tricky.  Other paths include: 

1) impression leading to a sequence of site visits, none ending with the goal state  
2) user going directly to the goal, without ever getting the impression

<img src="./img/conv_attrib.png">  

**Conversion Attribution with Apache Beam**

The authors of the *Streaming Systems* text have developed `Apache Beam` for batch and streaming data processing jobs.
The system is designed so that the same pipeline can run on a variety of execution engines, which Beam calls *runners*.  For details, see: https://beam.apache.org  

One of the use cases explored in the text, for which source code is included, is conversion attribution.
The source code is lengthy and written in Java.  Without the textbook, it might be a challenging read, but I will provide the link to the repo:
https://github.com/takidau/streamingbook/blob/master/src/main/java/net/streamingbook/StateAndTimers.java

---

### Correctness

For a streaming system to be at parity with batch processing, it needs *correctness*.  Batch systems process finite datasets, so it is straightforward to recompute results if needed.  With infinite datasets, care must be taken to retain the data needed for correct results.  This is solved by checkpointing persistent state over time.

### Event Time versus Processing Time

For any data processing system, there are two times of interest:

1. *event time*, which is the time at which events actually occurred

2. *processing time*, which is the time at which events are observed by the system

Ideally these times would always coincide (the black dashed line in the **Processing Time vs Event Time** figure below), but in practice we observe something like the red curve.  For example, a user might use an app offline in airplane mode.  When the app switches back online, it uploads user statistics to the system for processing.  This causes the processing time to fall after the event time; the delay is known as *processing-time lag*.  *Event-time skew* is the same gap viewed along the event-time axis: it measures how far behind the ideal (in event time) the pipeline currently is.

The degree of event-time skew can be affected by several factors including:

- shared resource limitations like network congestion, shared CPU in a nondedicated environment  
- software causes such as distributed system logic
- variance in throughput

Not all use cases care about event times, but there are important cases that do, such as billing and anomaly detection.  
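
As a small numeric illustration, the sketch below computes the processing-time lag for three events and checks whether they arrive out of order with respect to event time. The timestamps are made up; the second event plays the role of the offline app that uploads five minutes late.

```python
from datetime import datetime

# (event_time, processing_time) pairs; the values are hypothetical
events = [
    (datetime(2024, 11, 6, 12, 0, 0), datetime(2024, 11, 6, 12, 0, 1)),
    (datetime(2024, 11, 6, 12, 0, 30), datetime(2024, 11, 6, 12, 5, 30)),  # offline, uploaded late
    (datetime(2024, 11, 6, 12, 1, 0), datetime(2024, 11, 6, 12, 1, 2)),
]

# Processing-time lag: how long after it occurred each event was observed
lags = [(processed - occurred).total_seconds() for occurred, processed in events]

# Event times in the order the system actually sees them
arrival_order = [occurred for occurred, _ in sorted(events, key=lambda e: e[1])]
out_of_order = arrival_order != sorted(arrival_order)

print(lags, out_of_order)
```

The lag is not constant (1, 300, and 2 seconds), and the delayed event is observed after an event that occurred later.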

**Processing Time vs Event Time**

<img src="./img/event_time_vs_proc_time.png">

### Batching Data and Windowing

To deal with infinite datasets, processing systems use *windowing*, which chops the dataset into finite pieces along temporal boundaries.  Many systems (including Spark Streaming) define these temporal boundaries using processing time, as this is easier than defining by event time.

Forming windows by processing time can lead to incorrect conclusions when the results are interpreted in terms of event time.  Events might arrive for processing out of order with respect to their event times.  Further, it may be uncertain when all of the events from a given time window have arrived for processing (there could be late data).  Streaming systems like `Apache Beam` aim to provide event-time correctness.

In the case of processing finite (bounded) data with a batch engine, unstructured data arrives and is run through the engine once, producing structured data.

**Bounded Data Processing with a Batch Engine**

<img src='./img/bounded_data.png'>

When processing infinite (unbounded) data, a popular approach is to window the data into fixed-size windows and process each window as a bounded data source.  For example, each window can be written to a file whose name contains the time window (such as `process_x_1000_1010`).

A limitation of this approach is that delayed data will be placed in the wrong bucket, due to processing-time lag. To account for the lag, the system could attempt to:

- delay processing until all events for that window have been collected.  
However, the delay may not be practical, and it may never be known when all events from a window have been collected.

- reprocess the batch for a window when data arrives late.  
This may be computationally expensive or prohibitive.
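
The bucket problem can be seen in a small sketch. It assumes fixed windows of 10 time units (matching the `1000_1010` style of window above) and three made-up events, one of which is delayed. The function names are hypothetical.

```python
from collections import defaultdict

WINDOW = 10  # window size, in the same units as the timestamps


def window_start(t, size=WINDOW):
    """Start of the fixed window that contains time t."""
    return t - (t % size)


def sum_by_window(events, time_field):
    """Sum event values per fixed window, windowing on the chosen time field."""
    totals = defaultdict(int)
    for event in events:
        totals[window_start(event[time_field])] += event["value"]
    return dict(totals)


events = [
    {"event_time": 1001, "processing_time": 1002, "value": 5},
    {"event_time": 1004, "processing_time": 1013, "value": 9},  # delayed
    {"event_time": 1012, "processing_time": 1014, "value": 7},
]

by_processing_time = sum_by_window(events, "processing_time")
by_event_time = sum_by_window(events, "event_time")
print(by_processing_time)
print(by_event_time)
```

Windowing by processing time credits the delayed value of 9 to the 1010 window, even though the event occurred in the 1000 window.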

**Processing Windows with a Batch Engine**

<img src="./img/unbounded_fixed_windows.png">

*Sessions* are a more sophisticated windowing strategy.  A session begins when a user interacts with the system, and ends when the user hasn't interacted with the system for some predefined period of time, or *session timeout*.

When a batch engine is used to process sessions, the sessions oftentimes get split into multiple windows, destroying structure.  This can be mitigated by increasing batch size (at the cost of increased latency), or repairing split sessions with logic (at the cost of increased complexity).
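
A minimal sketch of session windows, and of how a batch boundary can split a session, is given below. It assumes a single user, timestamps in seconds, and a hypothetical session timeout of 300 seconds.

```python
def sessionize(timestamps, timeout):
    """Group timestamps into sessions; a gap longer than `timeout` starts a new session."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= timeout:
            sessions[-1].append(t)  # still within the timeout: same session
        else:
            sessions.append([t])    # gap too long (or first event): new session
    return sessions


clicks = [0, 20, 50, 400, 410, 1000]  # one user's activity, in seconds

# Sessions computed over all of the data at once
sessions = sessionize(clicks, timeout=300)

# Sessions computed per batch, with a batch boundary falling at t = 30
batch_1 = [t for t in clicks if t < 30]
batch_2 = [t for t in clicks if t >= 30]
split_sessions = sessionize(batch_1, timeout=300) + sessionize(batch_2, timeout=300)

print(sessions)
print(split_sessions)
```

The first session `[0, 20, 50]` is broken into `[0, 20]` and `[50]` when the batches are sessionized independently, which is the loss of structure described above.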


**Processing Sessions with a Batch Engine**

<img src="./img/session.png">

In conclusion, both the fixed window approach and session approach have shortcomings in terms of correctness.  This is because:

- data may arrive for processing out of order with respect to their event times

- the lateness of the data will vary, meaning the system cannot say that X% of the events will arrive within Y time units

---

### Triggers

A *trigger* is a mechanism for declaring when the output for a window should be materialized.  For example, when sufficient events have been collected for a window, a trigger will signal that a summary statistic should be computed and reported.

There are generally two useful types of triggers:

- *Repeated update triggers*  
These periodically generate updated window panes as more data arrives.  
These triggers are the most common type of trigger in streaming systems.  


- *Completeness triggers*  
These materialize a window pane after the input for that window is believed to be sufficient.
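
The two trigger types can be contrasted with a toy sketch for a single window whose aggregate is a sum. It assumes the repeated update trigger fires after every `every_n` values, and that "believed to be complete" simply means the list of values has ended; real systems use richer conditions, and these function names are hypothetical.

```python
def repeated_update_trigger(values, every_n):
    """Yield an updated pane (the running sum) after every `every_n` values."""
    total = 0
    for count, value in enumerate(values, start=1):
        total += value
        if count % every_n == 0:
            yield total


def completeness_trigger(values):
    """Yield a single pane once the window's input is believed complete."""
    yield sum(values)


window_values = [5, 7, 3, 4]  # values arriving for one window

early_panes = list(repeated_update_trigger(window_values, every_n=2))
final_pane = list(completeness_trigger(window_values))
print(early_panes, final_pane)
```

The repeated update trigger gives earlier, partial answers that are refined over time, while the completeness trigger gives one answer later.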

---

### Watermarks

When event times and processing times coincide, things are straightforward.  However, when event data can arrive late, the problem becomes tricky.  How do we know when to close a time window?  This motivates the concept of *watermarks*.

A watermark is a temporal notion of input completeness in the event-time domain.  A watermark with value `T` makes the statement that all data with event times `<T` have been observed.  

Sometimes watermarks are exact, but other times the best we can do is design approximate (heuristic) watermarks.

Watermarks work together with triggers: when the watermark reaches a given level, the trigger fires, producing results. That is, output for one or more windows is materialized.

Oftentimes, it makes sense to provide a bound on how late an event can arrive.  This allows the system to close the window after the fixed period, dropping any points exceeding the bound.

**A Watermark Example**

The watermark illustration below shows an example of a perfect watermark and a heuristic watermark.  In each case, the x-axis shows the event time domain sliced into fixed window panes of 2 minutes. The y-axis shows the processing time domain.  The circles denote data points, with their values inside.  The curves are the watermarks.

Once a window is closed (with time denoted on the y-axis), results can be reported.  The numbers in yellow denote the values reported from each window.  For the perfect watermark, a value of 14 is reported for the 12:00-12:02 event time window a bit before 12:09.  This contrasts with the value of 5 reported for that same window using the heuristic watermark.  This is due to the red data point with value 9 arriving very late; the heuristic watermark missed that point, and it was ignored.  One advantage, however, is the earlier reporting time at 12:06. Thus, **the heuristic watermark faces a tradeoff between correctness and latency**.

In real systems, perfect watermarks are often impractical.  The more that is known about the source, the better the heuristic that can be designed, resulting in fewer late data points.

The most important distinction between perfect watermarks and heuristic watermarks is this: **perfect watermarks account for all data, while heuristic watermarks admit late data.**
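
The tradeoff can be reproduced with a toy simulation. All timestamps here are hypothetical, expressed as minutes after 12:00. The heuristic watermark is assumed to be "the newest event time seen so far", and the perfect watermark is simulated with an oracle that looks at the data still to come, which is only possible because the whole input is known in this sketch.

```python
WINDOW_END = 2  # the event-time window is [0, 2), i.e. 12:00-12:02

# (processing_time, event_time, value), in order of arrival
arrivals = [(5, 0, 5), (6, 2, 7), (8, 1, 9)]


def heuristic_watermark(seen_event_times):
    """Guess: nothing older than the newest event time seen so far will arrive."""
    return max(seen_event_times)


def perfect_watermark(pending_event_times):
    """Oracle: the oldest event time still to arrive (infinity if none)."""
    return min(pending_event_times, default=float("inf"))


def window_result(arrivals, use_perfect):
    """Return (processing_time, sum) at the moment the watermark closes the window."""
    total, seen = 0, []
    for i, (proc_time, event_time, value) in enumerate(arrivals):
        if event_time < WINDOW_END:
            total += value
        seen.append(event_time)
        pending = [e for _, e, _ in arrivals[i + 1:]]
        watermark = perfect_watermark(pending) if use_perfect else heuristic_watermark(seen)
        if watermark >= WINDOW_END:
            return proc_time, total  # window closes; later data for it is ignored
    return None


heuristic = window_result(arrivals, use_perfect=False)
perfect = window_result(arrivals, use_perfect=True)
print(heuristic, perfect)
```

The heuristic watermark closes the window at processing time 6 with a sum of 5, missing the late value of 9. The perfect watermark waits until processing time 8 and reports 14.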

**Illustration of Watermarks**

<img src="./img/watermarks.jpg">

**Apache Beam Code Example**

The code below shows how a windowed computation is implemented in `Apache Beam`.
A `PCollection` is a (possibly massive) dataset.  We parse key/value data where the key is named `Group` and the value is named `Score`; the values are integers.  The data are windowed into fixed, two-minute windows.  The summation is done by key.  After understanding the concepts (windows, reduce-by-key aggregation), the code is reasonably clear and appealing.

**Windowed summation code**
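```java
// Assumes TWO_MINUTES is an org.joda.time.Duration of two minutes, and that
// Score stands in for Integer, since Sum.integersPerKey() sums Integer values.
PCollection<KV<Group, Score>> totals = input
  .apply(Window.into(FixedWindows.of(TWO_MINUTES)))  // fixed two-minute windows
  .apply(Sum.integersPerKey());                      // sum the values for each key
```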

---

### Accumulation

When multiple panes are produced for a single window over time (say a statistic is refreshed with more data), we need an *accumulation* method.  Examples include:

- *discarding*: each time a window pane is materialized, any stored (earlier) state is discarded, so each pane reflects only the data that arrived since the previous pane.
- *accumulating*: stored state is retained, so each pane reflects the accumulation of all data seen so far for the window.
- *accumulating and retracting*: similar to accumulating mode, each pane returns the accumulation, and also a retraction of the previous pane(s).

**Accumulation Example**

Consider the example below where two panes are produced for a window. The second pane is an update which occurs at a later processing time.  

In `discarding` mode, pane 2 contains only the value that arrived after pane 1.  The correct total for the window is found by summing over the panes (3 + 9 = 12).  

In `accumulating` mode, the last pane already holds the total, and there is no need to sum over the panes. In fact, summing over the panes would produce 15, which is incorrect.  

In `accumulating & retracting` mode, the last pane once again holds the total.  Here, however, a sum over the panes also gives the correct answer of 12, since it sums [3, 12, -3].

|    | Discarding | Accumulating | Accumulating & Retracting |
|----------|:-------------:|:-------------:|:-------------:|
| Pane 1 | 3 | 3 | 3 |
| Pane 2 | 9 | 12 | 12, -3 |
| Last pane value | 9 | 12 | 12 |
| Sum over panes | 12 | 15 | 12 |
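
The table can be reproduced with a short sketch. It assumes the aggregate is a sum and that each pane is triggered by one batch of new input (3, then 9); the function name and mode strings are hypothetical.

```python
def emitted_values(deltas, mode):
    """Values emitted across panes, given the new input that arrives before each pane."""
    emitted, running = [], 0
    for delta in deltas:
        previous = running
        running += delta
        if mode == "discarding":
            emitted.append(delta)            # only what arrived since the last pane
        elif mode == "accumulating":
            emitted.append(running)          # everything seen so far
        elif mode == "accumulating_and_retracting":
            had_previous_pane = bool(emitted)
            emitted.append(running)          # everything seen so far...
            if had_previous_pane:
                emitted.append(-previous)    # ...plus a retraction of the previous pane
        else:
            raise ValueError(f"unknown mode: {mode}")
    return emitted


deltas = [3, 9]
for mode in ("discarding", "accumulating", "accumulating_and_retracting"):
    values = emitted_values(deltas, mode)
    print(mode, values, "sum over panes =", sum(values))
```

A downstream consumer that sums everything it receives gets the right answer in discarding mode and in accumulating and retracting mode, but double counts in accumulating mode.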

---

### Exactly-Once Processing

Streaming systems often advertise *exactly-once processing*, meaning that every record is processed exactly once.  

Historically, no guarantees were made about record processing; it was done on a *best-effort* basis.  Records might be duplicated, producing incorrect aggregations.  If a machine crashed, aggregations might be lost, and records might need to be processed more than once.

`Apache Beam` is designed so that records are not erroneously dropped or duplicated.
The user can configure how long the system should wait for late data.  Any records arriving later than this deadline are explicitly dropped.  This feature leads to potential inaccuracy, but ensures that all records showing up on time for processing are processed exactly once.
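
To see why duplicates matter for aggregations, consider the sketch below. It assumes every record carries a unique ID and that a retry redelivers one record. Remembering the IDs already seen is one common way to make processing idempotent; it is shown here only as an illustration, not as a description of how `Apache Beam` or any particular runner implements its guarantee.

```python
def sum_exactly_once(deliveries):
    """Sum values, ignoring redelivered records identified by a unique ID."""
    seen_ids = set()
    total = 0
    for record_id, value in deliveries:
        if record_id in seen_ids:
            continue  # duplicate delivery, e.g. a retry after a crash
        seen_ids.add(record_id)
        total += value
    return total


# Record "a" is delivered twice (hypothetical retry)
deliveries = [("a", 3), ("b", 9), ("a", 3)]

naive_total = sum(value for _, value in deliveries)  # counts the duplicate
exact_total = sum_exactly_once(deliveries)
print(naive_total, exact_total)
```

The naive sum counts the redelivered record twice, while the deduplicated sum reflects each record once.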