## Windows - What are they good for?

When we work on a _batch_ (e.g. a `.csv` file) we are dealing with a _bounded_ input. There are a finite number of records 
in our `PCollection`. Working with a bounded input comes naturally to most people, so much so that there are a number of 
concepts that we probably take for granted when doing so. For example, imagine you are processing a `.csv` file containing
the scores that players recorded on an online game. The `.csv` file is published daily (the number of records in the file
varies from one day to the next). Say you wanted to determine the average score for players aged 16-18. You could perform 
a `GroupByKey` operation on the `PCollection` records whose age satisfies `16 <= age <= 18` and calculate the `mean`. In this 
scenario, we usually assume that the `GroupByKey` operation will be performed on _all_ of the matching records in the 
`.csv` file.  
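
To make the bounded case concrete, here is a minimal sketch in plain Python (not Beam). The column names and the sample rows are made up for illustration. The point is that the mean is computed only once every matching row in the file has been read.

```python
import csv
import io
from statistics import mean

# Hypothetical file contents; the column names are assumptions.
SAMPLE_CSV = """player,age,score
ana,17,120
ben,25,300
cy,16,80
dee,18,100
"""


def average_score(csv_text, min_age=16, max_age=18):
    """Mean score over every row whose age lies in [min_age, max_age]."""
    rows = csv.DictReader(io.StringIO(csv_text))
    scores = [
        int(row["score"])
        for row in rows
        if min_age <= int(row["age"]) <= max_age
    ]
    # Safe only because the input is bounded: we have seen all of the rows.
    return mean(scores)


print(average_score(SAMPLE_CSV))  # 100
```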

Now consider an alternative scenario where the data arrives as a stream via a Kafka Topic. In this scenario our input is 
_unbounded_; there are an infinite (or potentially infinite) number of records in our `PCollection`. We can't simply 
declare a `GroupByKey` operation on the stream where `16 <= age <= 18`, because these types of records could keep 
arriving forever (apparently, a lot of teenagers are playing video games these days). Instead, we also need to declare a
_boundary_ on the data we want processed. Windows! Windows let us explicitly define boundaries on our input data. 

## Types of Windows

### Fixed Time Windows (aka Tumbling Windows)

Given a timestamped `PCollection` we declare a window to capture all of the elements whose timestamps lie within the 
specified time range. For example, we might declare a fixed window with a duration of 30 seconds on our stream data. 
Then, any elements with a timestamp in the range `[00:00:00, 00:00:30)` (i.e. up to but not including `00:00:30`) would 
get averaged as part of Window 0. Likewise, any elements with a timestamp in the range `[00:00:30, 00:01:00)` would get 
averaged as part of Window 1, and so on.  

<table class="image">
<caption align="bottom" style="text-align: center">https://beam.apache.org/documentation/programming-guide</caption>
<tr><td><img src="https://beam.apache.org/images/fixed-time-windows.png"></td></tr>
</table>
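
The assignment rule for fixed windows can be sketched in plain Python. This is only an illustration of the arithmetic, with timestamps given as seconds and made-up scores. In the Beam Python SDK the equivalent declaration should be `beam.WindowInto(beam.window.FixedWindows(30))`, but treat the sketch below as a model of the behavior, not as Beam's implementation.

```python
from collections import defaultdict
from statistics import mean


def fixed_window(timestamp, size=30):
    """Return the [start, end) bounds of the fixed window holding timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def mean_per_window(events, size=30):
    """Average the scores of (timestamp, score) events per fixed window."""
    buckets = defaultdict(list)
    for timestamp, score in events:
        buckets[fixed_window(timestamp, size)].append(score)
    return {window: mean(scores) for window, scores in sorted(buckets.items())}


# Hypothetical (timestamp in seconds, score) pairs.
events = [(0, 10), (29, 30), (30, 50), (59, 70), (60, 90)]
print(mean_per_window(events))
# {(0, 30): 20, (30, 60): 60, (60, 90): 90}
```

Note that the element stamped `30` lands in the second window, because the upper bound of each window is exclusive.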
 
### Sliding Time Windows

Sliding windows can overlap. For example, we could declare a sliding window _duration_ of 60 seconds, and declare that a 
new window should start every 30 seconds (called the _period_). In this case, each element belongs to two overlapping 
windows. This type of windowing can be used to create rolling averages.

<table class="image">
<caption align="bottom" style="text-align: center">https://beam.apache.org/documentation/programming-guide</caption>
<tr><td><img src="https://beam.apache.org/images/sliding-time-windows.png"></td></tr>
</table>
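
A rough way to see why elements land in more than one sliding window is to compute the windows by hand. The sketch below is plain Python with timestamps in seconds. It assumes window starts are aligned to multiples of the period, which is how the Beam Python SDK's `SlidingWindows(size, period)` behaves by default as far as I know.

```python
def sliding_windows(timestamp, size=60, period=30):
    """Return every [start, end) sliding window that contains timestamp."""
    # The latest window that can contain the timestamp starts at the
    # nearest multiple of `period` at or below it.
    start = timestamp - (timestamp % period)
    windows = []
    # Walk backwards one period at a time while the window still covers
    # the timestamp, i.e. while timestamp < start + size.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


print(sliding_windows(45))  # [(0, 60), (30, 90)]
print(sliding_windows(10))  # [(-30, 30), (0, 60)]
```

With a 60 second duration and a 30 second period, every timestamp falls into `60 / 30 = 2` windows, so a per-window mean behaves like a rolling average that is refreshed every 30 seconds.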

### Session Windows

A session window creates a boundary around a burst of events in which consecutive events are separated by less than a 
specified duration of time (i.e. the _gap_). For example, imagine we are collecting user input data (e.g. keyboard 
strokes, joystick movement, touch input) for the players in our online game. We might expect to see bursts of data for 
each player, followed by gaps with no activity (time for a soda, time for homework). When data arrives after a period of 
inactivity at least as long as the gap duration, a new window is created. Note that session windows are applied on a 
per-key basis, so each player gets their own sessions.   

<table class="image">
<caption align="bottom" style="text-align: center">
    https://beam.apache.org/documentation/programming-guide
</caption>
<tr><td><img src="https://beam.apache.org/images/session-windows.png"></td></tr>
</table>
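
The per-key behavior of session windows can be modeled in a few lines of plain Python. The player names, timestamps (in seconds) and the 600 second gap are invented for the example. The sketch also assumes that an event arriving exactly one gap after the previous one starts a new session. In the Beam Python SDK the corresponding window function is `beam.window.Sessions(gap_size)`.

```python
from collections import defaultdict


def sessionize(events, gap=600):
    """Group (key, timestamp) events into per-key sessions.

    A new session starts when an event arrives `gap` or more seconds
    after the previous event for the same key.
    """
    by_key = defaultdict(list)
    for key, timestamp in events:
        by_key[key].append(timestamp)

    sessions = {}
    for key, stamps in by_key.items():
        stamps.sort()
        current = [stamps[0]]
        result = []
        for timestamp in stamps[1:]:
            if timestamp - current[-1] < gap:
                current.append(timestamp)   # still inside the session
            else:
                result.append(current)      # gap exceeded: close it
                current = [timestamp]
        result.append(current)
        sessions[key] = result
    return sessions


# Hypothetical input events: (player, timestamp in seconds).
events = [("ana", 0), ("ana", 100), ("ben", 50),
          ("ana", 1000), ("ben", 400), ("ana", 1200)]
print(sessionize(events))
# {'ana': [[0, 100], [1000, 1200]], 'ben': [[50, 400]]}
```

Ana's 900 second pause splits her activity into two sessions, while Ben's events stay in one. Neither player's sessions are affected by the other's activity.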

### Global Windows

This is the default window if your pipeline doesn't explicitly create one of the aforementioned windows. When we 
considered our batch data example, we relied on a global window. Because our datasource was a `.csv` file, the data was 
bounded so we could safely perform aggregation operations (e.g. `GroupByKey`, `Combine`). Actually, you _can_ 
use a global window on streaming data, under a couple of circumstances:

- You aren't performing any aggregation operations in your pipeline. For example, your pipeline might only perform simple 
transformations on individual `PCollection` elements as they arrive on the stream. 
- You provide a non-default `Trigger` for the global window. Triggers are the mechanism used by Beam to determine when 
to emit the results of a window. So for example, you might use a custom trigger that says to emit the results of your 
global window every time 50 elements arrive.
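
The second case can be pictured with a small plain-Python generator that emits a result every time 50 elements have arrived. This only simulates the idea. In the Beam Python SDK the real thing would likely be expressed with `trigger.Repeatedly(trigger.AfterCount(50))` on a `GlobalWindows()` windowing, together with an accumulation mode. The sketch assumes fired elements are discarded after each emission, which is one of the possible modes.

```python
def emit_every(stream, count=50):
    """Yield a list of buffered elements each time `count` have arrived."""
    pane = []
    for element in stream:
        pane.append(element)
        if len(pane) == count:
            yield pane
            pane = []  # assumption: discard elements once they have fired
    # Elements left in `pane` are never emitted here. On a truly
    # unbounded stream the loop above would simply never end.


# A stand-in for a stream: 120 made-up scores.
scores = range(120)
for pane in emit_every(scores, count=50):
    print(len(pane), sum(pane) / len(pane))
```

Each emitted list stands in for one firing of the trigger. The downstream aggregation (here a mean) sees 50 elements at a time instead of waiting for a stream that never ends.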

## References

https://beam.apache.org/documentation/programming-guide