##### Copyright 2020 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License").
<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->

# Aggregations

Previous notebooks covered element-wise operations. Aggregating data requires operations that work across a whole `PCollection`.

First, import the necessary resources.

In [None]:
import logging

import apache_beam as beam
from apache_beam import Create, Map, ParDo, Flatten, Keys
from apache_beam import Values, GroupByKey, CoGroupByKey, CombineGlobally, CombinePerKey
from apache_beam import pvalue, window, WindowInto
from apache_beam.transforms.util import WithKeys
from apache_beam.transforms.combiners import Top, Mean, Count

from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

**`GroupByKey`** takes a `PCollection` of key-value pairs and outputs each key with all values associated with that key.

In [None]:
p = beam.Pipeline(InteractiveRunner())

elements = [
    {"country": "China", "population": 1389, "continent": "Asia"},
    {"country": "India", "population": 1311, "continent": "Asia"},
    {"country": "USA", "population": 331, "continent": "America"},
    {"country": "Australia", "population": 25, "continent": "Oceania"},
    {"country": "Brazil", "population": 212, "continent": "America"},
]

gbk = (p | "Create" >> Create(elements)
         | "Add Keys" >> WithKeys(lambda x: x["continent"])
         | GroupByKey())

ib.show(gbk)

<br>

Note that the output is the key and an iterable containing the values.
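As a rough mental model (not how Beam implements it), the grouping can be sketched in plain Python. The helper name `group_by_key` is hypothetical, and country names stand in for the full dictionaries used above to keep the example short.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect every value that shares a key into one list (hypothetical helper)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# (continent, country) pairs, a simplified stand-in for the keyed elements above.
pairs = [
    ("Asia", "China"),
    ("Asia", "India"),
    ("America", "USA"),
    ("Oceania", "Australia"),
    ("America", "Brazil"),
]

grouped = group_by_key(pairs)
for continent, countries in grouped.items():
    print(continent, countries)
```

Unlike this sketch, Beam gives no guarantee about the order of the values inside each group.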

Some basic combiners are built in:

- **`Count`** takes a `PCollection` and outputs the number of elements.  
- **`Top`** outputs the *n* largest (or smallest) elements of a `PCollection` according to a comparison.  
- **`Mean`** outputs the arithmetic mean of a `PCollection`.

Combiners can aggregate over the whole `PCollection` or per key, using these methods:

- **`.Globally`** applies the combiner to the whole `PCollection`.
- **`.PerKey`** applies the combiner to the values of each key in a `PCollection` of key-value pairs.
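To make the difference between the two methods concrete, here is a plain-Python sketch of what a global count and per-key count, top and mean compute. It uses a small hypothetical list of `(continent, population)` pairs and standard-library functions rather than Beam itself, so it only illustrates the results, not the execution model.

```python
import heapq
from collections import defaultdict
from statistics import mean

# Hypothetical (continent, population) pairs.
pairs = [
    ("Asia", 1389),
    ("Asia", 1311),
    ("Asia", 126),
    ("America", 331),
    ("America", 212),
]

# "Globally": one result for the whole collection.
count_globally = len(pairs)

# "PerKey": first gather the values of each key, then aggregate each group.
values_by_key = defaultdict(list)
for key, value in pairs:
    values_by_key[key].append(value)

count_per_key = {k: len(v) for k, v in values_by_key.items()}
top2_per_key = {k: heapq.nlargest(2, v) for k, v in values_by_key.items()}
mean_per_key = {k: mean(v) for k, v in values_by_key.items()}

print(count_globally)
print(count_per_key)
print(top2_per_key)
print(mean_per_key)
```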

In [None]:
p = beam.Pipeline(InteractiveRunner())

def key_value_fn(element):
    return (element['continent'], element['population'])

elements = [
    {"country": "China", "population": 1389, "continent": "Asia"},
    {"country": "India", "population": 1311, "continent": "Asia"},
    {"country": "Japan", "population": 126, "continent": "Asia"},        
    {"country": "USA", "population": 331, "continent": "America"},
    {"country": "Ireland", "population": 5, "continent": "Europe"},
    {"country": "Indonesia", "population": 273, "continent": "Asia"},
    {"country": "Brazil", "population": 212, "continent": "America"},
    {"country": "Egypt", "population": 102, "continent": "Africa"},
    {"country": "Spain", "population": 47, "continent": "Europe"},
    {"country": "Ghana", "population": 31, "continent": "Africa"},
    {"country": "Australia", "population": 25, "continent": "Oceania"},
]

create = (p | "Create" >> Create(elements)
            | "Map Keys" >> Map(key_value_fn))

element_count_total = create | "Total Count" >> Count.Globally()

element_count_grouped = create | "Count Per Key" >> Count.PerKey()

top_grouped = create | "Top" >> Top.PerKey(n=2)  # Keep the 2 largest values per key

mean_grouped = create | "Mean" >> Mean.PerKey()


ib.show_graph(p)
ib.show(element_count_total, element_count_grouped, top_grouped, mean_grouped)

<br>

**`CoGroupByKey`** aggregates all input elements by their key and lets downstream processing consume all values associated with that key. While `GroupByKey` operates on a single input collection, and thus a single type of input values, `CoGroupByKey` operates on multiple input collections. The result for each key is therefore a tuple holding the values associated with that key in each input collection.

In [None]:
p = beam.Pipeline(InteractiveRunner())

jobs = [
    ("John", "Data Scientist"),
    ("Rebecca", "Full Stack Engineer"),
    ("John", "Data Engineer"),
    ("Alice", "CEO"),
    ("Charles", "Web Designer"),
]

hobbies = [
    ("John", "Baseball"),
    ("Rebecca", "Football"),
    ("John", "Piano"),
    ("Alice", "Photoshop"),
    ("Charles", "Coding"),
    ("Rebecca", "Acting"),
    ("Rebecca", "Reading")
]

jobs_create = p | "Create Jobs" >> Create(jobs)
hobbies_create = p | "Create Hobbies" >> Create(hobbies)

cogbk = (jobs_create, hobbies_create) | CoGroupByKey()

ib.show_graph(p)
ib.show(cogbk)

<br>

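Conceptually, this operation works like a `Flatten` followed by a `GroupByKey`, except that the values for each key stay separated by the input collection they came from.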
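The grouping that `CoGroupByKey` performs can be sketched in plain Python. This is only an illustration of the output shape, with a hypothetical helper `co_group_by_key` and shortened input lists, not Beam's implementation.

```python
from collections import defaultdict


def co_group_by_key(*inputs):
    """Group several lists of (key, value) pairs by key (hypothetical helper).

    Each key maps to a tuple with one list per input, in input order.
    """
    result = defaultdict(lambda: tuple([] for _ in inputs))
    for index, pairs in enumerate(inputs):
        for key, value in pairs:
            result[key][index].append(value)
    return dict(result)


jobs = [
    ("John", "Data Scientist"),
    ("Rebecca", "Full Stack Engineer"),
    ("John", "Data Engineer"),
]
hobbies = [
    ("John", "Baseball"),
    ("Rebecca", "Football"),
    ("Alice", "Photoshop"),
]

joined = co_group_by_key(jobs, hobbies)
for name, (person_jobs, person_hobbies) in joined.items():
    print(name, person_jobs, person_hobbies)
```

A key that is missing from one input (here, Alice has no job) still appears in the result, with an empty list in that position.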

Sometimes you need to add your own logic to aggregate data, either in a global way or per key. For this, you can build your own combiners.  

**`CombineGlobally`** takes a `PCollection` and outputs the aggregated value of the given function.
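When the aggregate needs intermediate state (a mean needs a running sum and a count), Beam's Python SDK also lets you write the combiner as a subclass of `beam.CombineFn` with the four methods `create_accumulator`, `add_input`, `merge_accumulators` and `extract_output`. The sketch below mirrors those method names in a plain class so that it runs without Beam. `MeanCombiner` and the driver `run_locally` are hypothetical names used only for illustration, and the driver is a simplified assumption about how a runner might call the methods.

```python
class MeanCombiner:
    """Plain class that mirrors the four CombineFn method names."""

    def create_accumulator(self):
        # Running (sum, count) for one bundle of elements.
        return (0.0, 0)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return (total + value, count + 1)

    def merge_accumulators(self, accumulators):
        accumulators = list(accumulators)
        total = sum(t for t, _ in accumulators)
        count = sum(c for _, c in accumulators)
        return (total, count)

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")


def run_locally(combiner, bundles):
    """Hypothetical driver: combine each bundle, then merge the partial results."""
    partials = []
    for bundle in bundles:
        accumulator = combiner.create_accumulator()
        for value in bundle:
            accumulator = combiner.add_input(accumulator, value)
        partials.append(accumulator)
    return combiner.extract_output(combiner.merge_accumulators(partials))


result = run_locally(MeanCombiner(), [[1389, 1311], [126]])
print(result)
```

To use the same logic in a pipeline, the class would inherit from `beam.CombineFn` and be passed as `CombineGlobally(MeanCombiner())`.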


In [None]:
p = beam.Pipeline(InteractiveRunner())

elements = ["Lorem ipsum dolor sit amet. Consectetur adipiscing elit",
            "Sed eu velit nec sem vulputate loborti",
            "In lobortis augue vitae sagittis molestie. Mauris volutpat tortor non purus elementum",
            "Ut blandit massa et risus sollicitudin auctor"]

combine = (p | "Create" >> Create(elements)
             | "Join" >> CombineGlobally(lambda x: ". ".join(x)))

ib.show(combine)

<br>

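Note that the order of the joined sentences may differ from the input order. Beam can combine elements in any order and in partial groups, so combiner functions should be commutative (i.e., *a + b = b + a*) and associative (i.e., *a + (b + c) = (a + b) + c*). String joining is not commutative, which is why the output order is not fixed.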
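The following plain-Python sketch illustrates why these two properties matter. It assumes a simplified model in which a runner splits the elements into bundles, applies the function to each bundle, and then applies it again to the partial results. The helper `combine_in_bundles` is hypothetical.

```python
def combine_in_bundles(fn, bundles):
    """Apply fn to each bundle, then to the list of partial results (simplified model)."""
    partials = [fn(bundle) for bundle in bundles]
    return fn(partials)


def join(parts):
    return ". ".join(parts)


# The same elements, split and ordered in two different ways.
sum_a = combine_in_bundles(sum, [[1, 2], [3, 4]])
sum_b = combine_in_bundles(sum, [[4], [2, 1, 3]])

join_a = combine_in_bundles(join, [["a", "b"], ["c"]])
join_b = combine_in_bundles(join, [["c"], ["b", "a"]])

print(sum_a, sum_b)    # sum does not depend on the split or the order
print(join_a, join_b)  # join depends on the order in which elements arrive
```

Addition gives the same total for both splits, while joining produces the same sentences in a different order.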

A custom combiner can also be applied per key:


**`CombinePerKey`** takes a `PCollection` of key-value pairs and outputs, for each key, the aggregated value of the given function.

In [None]:
p = beam.Pipeline(InteractiveRunner())

elements = [
            ("Latin", "Lorem ipsum dolor sit amet. Consectetur adipiscing elit. Sed eu velit nec sem vulputate loborti"),
            ("Latin", "In lobortis augue vitae sagittis molestie. Mauris volutpat tortor non purus elementum"),
            ("English", "From fairest creatures we desire increase"),
            ("English", "That thereby beauty's rose might never die"),
            ("English", "But as the riper should by time decease"),
            ("Spanish", "En un lugar de la Mancha, de cuyo nombre no quiero acordarme, no ha mucho"),
            ("Spanish", "tiempo que vivía un hidalgo de los de lanza en astillero, adarga antigua"),
]

combine_key = (p | "Create" >> Create(elements)
                 | "Join By Language" >> CombinePerKey(lambda x: ". ".join(x)))

ib.show_graph(p)
ib.show(combine_key)

## Exercise

The pipeline creates key-value pairs of buyers and items. From these key-value pairs the pipeline extracts three things: the items each person bought, how many times each item was bought, and how many total items were bought.

**Example**

From the values `(Bob, TV)`, `(Alice, TV)` and `(Bob, Speakers)`, the output is that the TV was bought twice and the Speakers once, three items were bought in total, and Bob bought a TV and Speakers while Alice bought just a TV.
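The three expected results for this small example can be checked with plain Python. This sketch uses the standard library only and is not the Beam solution. The variable names are illustrative and are not the ones the tests expect.

```python
from collections import Counter, defaultdict

purchases = [("Bob", "TV"), ("Alice", "TV"), ("Bob", "Speakers")]

# Items each person bought.
items_per_buyer = defaultdict(list)
for buyer, item in purchases:
    items_per_buyer[buyer].append(item)

# How many times each item was bought.
times_per_item = Counter(item for _, item in purchases)

# How many items were bought in total.
total_items = sum(times_per_item.values())

print(dict(items_per_buyer))
print(dict(times_per_item))
print(total_items)
```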

Since the pipeline is tested below, be sure to name the final `PCollection`s `buyers`, `total_per_items` and `total_sum`.

In [None]:
from apache_beam.testing.util import assert_that
from apache_beam.testing.util import matches_all, equal_to
from utils.solutions import solutions

In [None]:
p = beam.Pipeline(InteractiveRunner())

kvs = [("Bob", "TV"),
       ("Alice", "TV"),
       ("Pedro", "Speaker"),
       ("Bob", "Speaker"),
       ("Bob", "HDMI"),
       ("Alice", "Controller")]

# TODO: Finish the pipeline 
create = p | "Create" >> Create(kvs)



ib.show_graph(p)
ib.show(buyers, total_per_items, total_sum)

# For testing the solution - Don't modify
assert_that(total_per_items, equal_to(solutions[3]["total_per_items"]), label="Total per item")
assert_that(total_sum, equal_to(solutions[3]["total_sum"]), label="Total sum")
assert_that(buyers | beam.MapTuple(lambda k, v: (k, sorted(v))), equal_to(solutions[3]["buyers"]), label="Buyers")

### Hints

**Get items per buyer**
<details><summary>Hint</summary>
<p>

You need to take the `PCollection` and output the grouped values per key; this calls for a `GroupByKey`.
</p>
</details>

<details><summary>Code</summary>
<p>

```
buyers = create | "GBK Buyer" >> GroupByKey()
```

</p>
</details>

**Count times each item was bought**
<details><summary>Hint</summary>
<p>

Since the input is the same `create`, just branch it out. You need to aggregate the elements by key and count them, so you need `Count.PerKey`. But since the input key is the buyer rather than the item, swap key and value first (there is a built-in `KvSwap` transform, but a `Map` or `MapTuple` also works).
</p>
</details>


<details><summary>Code</summary>
<p>
    
```
total_per_items = (create | "Invert keys" >> Map(lambda x: (x[1], x[0]))
                          | "Count per key" >> Count.PerKey())    
```
</p>
</details>

**Count total sales**
<details><summary>Hint</summary>
<p>

There is more than one way to do this. You can take the input from `Create`, but then each element is aggregated three times (`GroupByKey`, `Count.PerKey`, and this aggregation). A more efficient way is to sum the counts that `Count.PerKey` outputs, since they are already aggregated, using only the values of the key-value pairs. Because the keys no longer matter, you can use `CombineGlobally`.
</p>
</details>


<details><summary>Code</summary>
<p>

```
total_sum = (total_per_items | Values()
                             | CombineGlobally(sum))    
```
</p>
</details>

**Full code**
<details><summary>Code</summary>
<p>

```
p = beam.Pipeline(InteractiveRunner())

kvs = [("Bob", "TV"),
       ("Alice", "TV"),
       ("Pedro", "Speaker"),
       ("Bob", "Speaker"),
       ("Bob", "HDMI"),
       ("Alice", "Controller")]


# Create the input PCollection of (buyer, item) pairs
create = p | "Create" >> Create(kvs)

buyers = create | "GBK Buyer" >> GroupByKey()

total_per_items = (create | "Invert keys" >> Map(lambda x: (x[1], x[0]))
                          | "Count per key" >> Count.PerKey())

total_sum = (total_per_items | Values()
                             | CombineGlobally(sum))

ib.show_graph(p)
ib.show(buyers, total_per_items, total_sum)

# For testing the solution - Don't modify
assert_that(total_per_items, equal_to(solutions[3]["total_per_items"]), label="Total per item")
assert_that(total_sum, equal_to(solutions[3]["total_sum"]), label="Total sum")
assert_that(buyers | beam.MapTuple(lambda k, v: (k, sorted(v))), equal_to(solutions[3]["buyers"]), label="Buyers")
```
</p>
</details>
