##### Copyright 2020 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License").
<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->



# Windows

Windows subdivide a `PCollection` based on each element's timestamp and/or some other given logic. When aggregating unbounded data, you must subdivide it with windows so that each aggregation runs over a bounded chunk of data and can finish.

Apache Beam has three predefined time-based windows: `FixedWindows`, `SlidingWindows` and session windows (`Sessions` in the Python SDK). Here we talk about the first two. There is another type of window called `GlobalWindow`, which all elements fall into.

You can create your own window type, but we don't cover this here. You can find this information in the [Apache Beam Documentation](https://beam.apache.org/documentation/programming-guide/#setting-your-pcollections-windowing-function).

Let's first import the needed packages.

In [None]:
import logging
import time
from datetime import datetime

import apache_beam as beam
from apache_beam import Create, FlatMap, Map, ParDo, Filter, Flatten, Partition
from apache_beam import Keys, Values, GroupByKey, CoGroupByKey, CombineGlobally, CombinePerKey
from apache_beam import pvalue, window, WindowInto
from apache_beam.transforms.combiners import Top, Mean, Count, MeanCombineFn

from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

**`FixedWindows`** creates windows of a given duration; when one window closes, the next one starts right after it, so every element falls into exactly one window. They are good for measuring events within a given time span.

![Fixed Windows](./images/fixed-time-windows.png)
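
As a rough illustration of how an element ends up in a fixed window, here is a minimal plain-Python sketch. It is not Beam code: `fixed_window` is a hypothetical helper, and it assumes windows are aligned to multiples of the window size since the Unix epoch (no offset), with an exclusive end bound.

```python
def fixed_window(timestamp, size):
    """Return the (start, end) bounds of the fixed window containing `timestamp`.

    Assumes windows are aligned to multiples of `size` seconds since the
    Unix epoch (offset 0). The end bound is exclusive.
    """
    start = timestamp - timestamp % size
    return (start, start + size)


# An event at 16:35 UTC (1581870900) with 10-minute windows.
print(fixed_window(1581870900, 600))
# (1581870600, 1581871200), that is, 16:30 to 16:40 UTC
```

Because each timestamp maps to a single start value, an element can never belong to two fixed windows of the same size.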

**`SlidingWindows`** creates a window of a given duration and, after a given period, creates another overlapping window of the same duration. This means that an element can fall into more than one window. They are useful when measuring trends.

![Sliding Windows](./images/sliding-time-windows.png)
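
The overlap can be sketched in plain Python as well. Again, this is an illustration rather than Beam code: `sliding_windows` is a hypothetical helper, and it assumes window starts are aligned to multiples of the period since the Unix epoch (no offset).

```python
def sliding_windows(timestamp, size, period):
    """Return the (start, end) bounds of every sliding window containing `timestamp`.

    Assumes window starts are aligned to multiples of `period` seconds since
    the Unix epoch (offset 0). The end bound is exclusive.
    """
    # Latest window start that is not after the timestamp.
    last_start = timestamp - timestamp % period
    # Walk backwards one period at a time while the window still covers the timestamp.
    return [(start, start + size)
            for start in range(last_start, timestamp - size, -period)]


# An event at 16:35 UTC (1581870900) with 10-minute windows every 5 minutes.
print(sliding_windows(1581870900, 600, 300))
# [(1581870900, 1581871500), (1581870600, 1581871200)]
```

With a size of 600 seconds and a period of 300 seconds, every element lands in `size / period = 2` windows.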

See the Apache Beam programming guide for more [details](https://beam.apache.org/documentation/programming-guide/#windowing).


The `InteractiveRunner` has another feature that allows us to see the window info in the output.

In [None]:
p = beam.Pipeline(InteractiveRunner())

def key_value_fn(element):
    key = element["user"]
    value = element["product"]
    return (key, value)

elements = [
    {"user": "john", "product": "Laptop", "time": 1581870000}, #16:20 UTC 
    {"user": "rebecca", "product": "Videogame", "time": 1581870180}, #16:23 UTC
    {"user": "john", "product": "Movie", "time": 1581870420}, #16:27 UTC
    {"user": "rebecca", "product": "Snacks", "time": 1581871200}, #16:40 UTC
    {"user": "rebecca", "product": "Controller", "time": 1581870900}, #16:35 UTC
]

create = (p | "Create" >> Create(elements)
            | 'With timestamps' >> Map(lambda x: window.TimestampedValue(x, x["time"]))
            | "Add keys" >> Map(key_value_fn))

fixed = (create | "FixedWindow" >> WindowInto(window.FixedWindows(600))  # 10 min windows
                | "GBK Fixed" >> GroupByKey())

sliding = (create | "Window" >> WindowInto(window.SlidingWindows(600, period=300))  # 10 min windows, 5 min period
                  | "GBK Sliding" >> GroupByKey())

ib.show_graph(p)
ib.show(fixed, sliding, include_window_info=True)

The next cell will tell you the times in your time zone

In [None]:
timestamps = [1581870000, 1581870180, 1581870420, 1581871200, 1581870900]

for timestamp in timestamps:
    local_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(timestamp))
    print(local_time)

****
**Take a good look at the output**. Note, for example, that Rebecca buying a controller appears in two `SlidingWindows` (16:30 to 16:40 and 16:35 to 16:45, UTC time), but in only one of the `FixedWindows`.
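
If your local time zone is not UTC, the printed local times will not match the UTC times quoted in the comments. A small sketch for printing the same timestamps in UTC, using only the standard library (`to_utc` is a hypothetical helper name):

```python
from datetime import datetime, timezone


def to_utc(timestamp):
    """Format a Unix timestamp (in seconds) as a UTC date and time string."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


for timestamp in [1581870000, 1581870180, 1581870420, 1581871200, 1581870900]:
    print(to_utc(timestamp))
```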


In the `With timestamps` step, we see how to modify an element's metadata by adding or changing its timestamp. Some I/O connectors (such as `PubSubIO`) have this built in: by default, the element timestamp is the time the message was published.

## Exercise

We are generating elements with a player, their score and a timestamp, stored as Python dictionaries. We want to know the total score per player for every hour. We also need the average score of all players over the last hour, but we want it every 20 minutes.

### Important note

You need to use windows for this exercise, so you must add something to global combiners (e.g., `Count.Globally()`). The explanation of why you need it is below the solution of the exercise.

<details><summary>Spoiler</summary>
<p>

Instead of using `Mean.Globally()` or `Count.Globally()` by itself, we need to chain `without_defaults()`:
 
 ```
    Mean/Count.Globally().without_defaults()
 ```
<br>
    
Note that `without_defaults()` for `Mean` was added in [this pull request](https://github.com/apache/beam/pull/11943), and for the other built-in combiners (such as `Count` or `Top`) in [this pull request](https://github.com/apache/beam/pull/12074). For SDK versions that do not include those pull requests, another approach is to wrap the combine function yourself (`CountCombineFn` lives in `apache_beam.transforms.combiners`):
    
 ```
   CombineGlobally(CountCombineFn()).without_defaults()
 ```
</p>    
</details>
<br>


Since we are going to test the pipeline, be sure to name the final `PCollections` `total` and `avg`.

In [None]:
from apache_beam.testing.util import assert_that
from apache_beam.testing.util import matches_all, equal_to
from utils.solutions import solutions

In [None]:
p = beam.Pipeline(InteractiveRunner())
    
scores = [
    {"player":"Marina", "score":1000, "timestamp": "2020-04-30 16:10"},
    {"player":"Cristina", "score":2000, "timestamp": "2020-04-30 15:00"},
    {"player":"Cristina", "score":2000, "timestamp": "2020-04-30 15:45"},
    {"player":"Marina", "score":3000, "timestamp": "2020-04-30 16:30"},
    {"player":"Juan", "score":2000, "timestamp": "2020-04-30 15:15"},
    {"player":"Cristina", "score":2000, "timestamp": "2020-04-30 16:50"},
    {"player":"Juan", "score":1000, "timestamp": "2020-04-30 16:59"},     
]

def date2unix(string):
    unix = int(time.mktime(datetime.strptime(string, "%Y-%m-%d %H:%M").timetuple()))
    return unix

# TODO: Finish the pipeline 
create = (p | "Create" >> Create(scores)
            | "Add timestamps" >> Map(lambda x: window.TimestampedValue(x, date2unix(x["timestamp"])))
         )


ib.show_graph(p)
ib.show(total, avg, include_window_info=True)

# For testing the solution - Don't modify
assert_that(avg, equal_to(solutions[4]["avg"]), label="Average")
assert_that(total, equal_to(solutions[4]["total"]), label="total")

### Hints

**Prepare elements for total per player**
<details><summary>Hint</summary>
<p>

Our input elements are dictionaries, but you need key-value pairs to process those. A `Map` function does this.
</p>
</details>


<details><summary>Code</summary>
<p>

```
def toKV(element):
    return (element["player"], element["score"])

total = create | "To KV" >> Map(toKV) 
```
</p>
</details>

**Prepare elements for average**
<details><summary>Hint</summary>
<p>

Because the `timestamp` field has already been used to set the element timestamp, you only need to work with the scores. You can extract the scores with a `Map` function.
</p>
</details>


<details><summary>Code</summary>
<p>

```
avg = create | "Get Score" >> Map(lambda x: x["score"])
```
</p>
</details>

**Group elements for total per player**
<details><summary>Hint</summary>
<p>

For the total per player you need to group every hour. `FixedWindows` is the way to go.
</p>
</details>

<details><summary>Code</summary>
<p>

```
   | "FixedWindow" >> WindowInto(window.FixedWindows(60 * 60))
```
</p>
</details>

**Group elements for average**
<details><summary>Hint</summary>
<p>

In this case you want to group over a full hour but get a value every 20 minutes, which means that the same element can contribute to several overlapping results. In this situation, you need to use `SlidingWindows`.
</p>
</details>


<details><summary>Code</summary>
<p>

```
    | "SlidingWindow" >> WindowInto(window.SlidingWindows(60 * 60, period=60 * 20))
```

</p>
</details>

**Process elements for total per player**
<details><summary>Hint</summary>
<p>

You want to know the total score per player, so you are going to need a per-key combiner: in this case, `CombinePerKey` with a function that sums the values of each key.
</p>
</details>


<details><summary>Code</summary>
<p>

```
   | "Total Per Key" >> CombinePerKey(sum))
```
</p>
</details>

**Process elements for average**
<details><summary>Hint</summary>
<p>

This one is a bit trickier because of the important note above. You are going to need a global combiner, since there are no keys (we only have scores as elements).
</p>
</details>

<details><summary>Code</summary>
<p>

```
    | Mean.Globally().without_defaults())      
```
</p>
</details>

**Full code**
<details><summary>Code</summary>
<p>

```
p = beam.Pipeline(InteractiveRunner())

scores = [
    {"player":"Marina", "score":1000, "timestamp": "2020-04-30 16:10"},
    {"player":"Cristina", "score":2000, "timestamp": "2020-04-30 15:00"},
    {"player":"Cristina", "score":2000, "timestamp": "2020-04-30 15:45"},
    {"player":"Marina", "score":3000, "timestamp": "2020-04-30 16:30"},
    {"player":"Juan", "score":2000, "timestamp": "2020-04-30 15:15"},
    {"player":"Cristina", "score":2000, "timestamp": "2020-04-30 16:50"},
    {"player":"Juan", "score":1000, "timestamp": "2020-04-30 16:59"},      
]

def date2unix(string):
    unix = int(time.mktime(datetime.strptime(string, "%Y-%m-%d %H:%M").timetuple()))
    return unix

def toKV(element):
    return (element["player"], element["score"])

create = (p | "Create" >> Create(scores)
            | "Add timestamps" >> Map(lambda x: window.TimestampedValue(x, date2unix(x["timestamp"]))))

total = (create | "To KV" >> Map(toKV) 
                | "FixedWindow" >> WindowInto(window.FixedWindows(60 * 60))
                | "Total Per Key" >> CombinePerKey(sum))

avg = (create | "Get Score" >> Map(lambda x: x["score"])
              | "SlidingWindow" >> WindowInto(window.SlidingWindows(60 * 60, period=60 * 20))
              | Mean.Globally().without_defaults())

ib.show_graph(p)
ib.show(total, avg, include_window_info=True)

# For testing the solution - Don't modify
assert_that(avg, equal_to(solutions[4]["avg"]), label="Average")
assert_that(total, equal_to(solutions[4]["total"]), label="total")
```    

</p>
</details>

## Explanation of `without_defaults`

See the [Apache Beam documentation](https://beam.apache.org/documentation/programming-guide/#core-beam-transforms) for details.

When you use a global combiner, the default behavior of Beam is to return a `PCollection` containing one element, even if the input is empty (the value depends on the combiner; for example, the sum function returns 0). If the input uses any windowing other than the `GlobalWindow`, Beam does not provide this default, so you have to say what should happen instead. Using `without_defaults()` makes the output for a window empty if that window doesn't have elements.
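
To make the "empty output for empty windows" idea concrete, here is a plain-Python analogy. It is only a sketch and does not use Beam: the data and the `mean_per_fixed_window` helper are hypothetical, and windows are assumed to be aligned to multiples of the window size.

```python
from collections import defaultdict


def mean_per_fixed_window(timestamped_scores, size):
    """Average the scores in each fixed window.

    Only windows that received at least one element appear in the result,
    which mirrors what `without_defaults()` gives you in Beam.
    """
    buckets = defaultdict(list)
    for timestamp, score in timestamped_scores:
        start = timestamp - timestamp % size
        buckets[(start, start + size)].append(score)
    return {win: sum(vals) / len(vals) for win, vals in sorted(buckets.items())}


# Hypothetical (timestamp, score) pairs; nothing falls in the window [3600, 7200).
scores = [(0, 1000), (1800, 3000), (7200, 2000)]
means = mean_per_fixed_window(scores, 3600)
print(means)
# {(0, 3600): 2000.0, (7200, 10800): 2000.0}
```

The window `(3600, 7200)` is simply absent from the result, instead of carrying an invented default value.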