# GroupBy

Takes a collection of elements and produces a collection grouped
by properties of those elements.

Unlike `GroupByKey`, the key is dynamically created from the elements themselves.

## Setup

To run a code cell, you can click the **Run cell** button at the top left of the cell,
or select it and press **`Shift+Enter`**.
Try modifying a code cell and re-running it to see what happens.

First, let's install the `apache-beam` module.

In [None]:
!pip install --quiet -U apache-beam

## Grouping Examples

In the following example, we create a pipeline with a `PCollection` of fruits.

We use `GroupBy` to group all fruits by the first letter of their name.

In [4]:
import apache_beam as beam

with beam.Pipeline() as p:
  grouped = (
      p
      | beam.Create(['strawberry', 'raspberry', 'blueberry', 'blackberry', 'banana'])
      | beam.GroupBy(lambda s: s[0])
      | beam.Map(print)
  )

('s', ['strawberry'])
('r', ['raspberry'])
('b', ['blueberry', 'blackberry', 'banana'])
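
To make the shape of the result concrete, the following plain-Python sketch reproduces the same grouping with a dictionary. It only illustrates the semantics and is not how Beam executes the transform. `group_by` is a hypothetical helper, not part of the Beam API. A real pipeline also makes no promise about the order in which groups are emitted.

```python
from collections import defaultdict

def group_by(elements, key_fn):
    """Hypothetical helper: group elements in memory by key_fn(element)."""
    groups = defaultdict(list)
    for element in elements:
        groups[key_fn(element)].append(element)
    return dict(groups)

fruits = ['strawberry', 'raspberry', 'blueberry', 'blackberry', 'banana']

# Same key function as the pipeline above: the first letter of the name.
by_letter = group_by(fruits, lambda s: s[0])
for key, values in by_letter.items():
    print((key, values))
```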


We can group by a composite key consisting of multiple properties if desired. The resulting key is a named tuple with the two requested attributes, and the values are grouped accordingly.

In [5]:
import apache_beam as beam

with beam.Pipeline() as p:
  grouped = (
      p
      | beam.Create(['strawberry', 'raspberry', 'blueberry', 'blackberry', 'banana'])
      | beam.GroupBy(letter=lambda s: s[0], is_berry=lambda s: 'berry' in s)
      | beam.Map(print)
  )

(Key(letter='s', is_berry=True), ['strawberry'])
(Key(letter='r', is_berry=True), ['raspberry'])
(Key(letter='b', is_berry=True), ['blueberry', 'blackberry'])
(Key(letter='b', is_berry=False), ['banana'])
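
The composite key can be pictured as an ordinary named tuple built from each element. The sketch below is a plain-Python illustration under that assumption. The `Key` type defined here is a stand-in that mirrors the printed output, not the class Beam generates.

```python
from collections import defaultdict, namedtuple

# Stand-in for the named-tuple key seen in the output above.
Key = namedtuple('Key', ['letter', 'is_berry'])

fruits = ['strawberry', 'raspberry', 'blueberry', 'blackberry', 'banana']

groups = defaultdict(list)
for s in fruits:
    # Each keyword argument of GroupBy becomes one attribute of the key.
    groups[Key(letter=s[0], is_berry='berry' in s)].append(s)

composite = dict(groups)
for key, values in composite.items():
    print((key, values))
```

Because named tuples compare and hash by value, `blueberry` and `blackberry` land in one group, while `banana` gets a key of its own.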


If the property to group by is an attribute of the elements, a string naming
that attribute can be passed to `GroupBy` in place of a callable. For example,
suppose we have the following data:

In [6]:
GROCERY_LIST = [
    beam.Row(recipe='pie', fruit='strawberry', quantity=3, unit_price=1.50),
    beam.Row(recipe='pie', fruit='raspberry', quantity=1, unit_price=3.50),
    beam.Row(recipe='pie', fruit='blackberry', quantity=1, unit_price=4.00),
    beam.Row(recipe='pie', fruit='blueberry', quantity=1, unit_price=2.00),
    beam.Row(recipe='muffin', fruit='blueberry', quantity=2, unit_price=2.00),
    beam.Row(recipe='muffin', fruit='banana', quantity=3, unit_price=1.00),
]

We can then group by the `recipe` attribute by name:

In [20]:
import pprint

with beam.Pipeline() as p:
  grouped = (
      p
      | beam.Create(GROCERY_LIST)
      | beam.GroupBy('recipe')
      | beam.Map(pprint.pprint)
  )

('pie',
 [BeamSchema_a1e96ac9_005e_4869_92b9_4b8f4424b798(recipe='pie', fruit='strawberry', quantity=3, unit_price=1.5),
  BeamSchema_a1e96ac9_005e_4869_92b9_4b8f4424b798(recipe='pie', fruit='raspberry', quantity=1, unit_price=3.5),
  BeamSchema_a1e96ac9_005e_4869_92b9_4b8f4424b798(recipe='pie', fruit='blackberry', quantity=1, unit_price=4.0),
  BeamSchema_a1e96ac9_005e_4869_92b9_4b8f4424b798(recipe='pie', fruit='blueberry', quantity=1, unit_price=2.0)])
('muffin',
 [BeamSchema_a1e96ac9_005e_4869_92b9_4b8f4424b798(recipe='muffin', fruit='blueberry', quantity=2, unit_price=2.0),
  BeamSchema_a1e96ac9_005e_4869_92b9_4b8f4424b798(recipe='muffin', fruit='banana', quantity=3, unit_price=1.0)])


Attributes and expressions can be mixed in a single `GroupBy`, for example:

In [21]:
import pprint

with beam.Pipeline() as p:
  grouped = (
      p 
      | beam.Create(GROCERY_LIST)
      | beam.GroupBy('recipe', is_berry=lambda x: 'berry' in x.fruit)
      | beam.Map(pprint.pprint)
  )

(Key(recipe='pie', is_berry=True),
 [BeamSchema_95dffbb2_72c5_4487_9218_cee271a8ec85(recipe='pie', fruit='strawberry', quantity=3, unit_price=1.5),
  BeamSchema_95dffbb2_72c5_4487_9218_cee271a8ec85(recipe='pie', fruit='raspberry', quantity=1, unit_price=3.5),
  BeamSchema_95dffbb2_72c5_4487_9218_cee271a8ec85(recipe='pie', fruit='blackberry', quantity=1, unit_price=4.0),
  BeamSchema_95dffbb2_72c5_4487_9218_cee271a8ec85(recipe='pie', fruit='blueberry', quantity=1, unit_price=2.0)])
(Key(recipe='muffin', is_berry=True),
 [BeamSchema_95dffbb2_72c5_4487_9218_cee271a8ec85(recipe='muffin', fruit='blueberry', quantity=2, unit_price=2.0)])
(Key(recipe='muffin', is_berry=False),
 [BeamSchema_95dffbb2_72c5_4487_9218_cee271a8ec85(recipe='muffin', fruit='banana', quantity=3, unit_price=1.0)])


## Aggregation

Grouping is often used in conjunction with aggregation, and the
`aggregate_field` method of the `GroupBy` transform makes this easy.
This method takes three parameters: the field (or expression) to
aggregate, the `CombineFn` (or associative `callable`) to aggregate with,
and the name of the field in which to store the result.
For example, to compute the total quantity of each fruit to buy,
one could write:

In [12]:
with beam.Pipeline() as p:
  grouped = (
      p
      | beam.Create(GROCERY_LIST)
      | beam.GroupBy('fruit')
          .aggregate_field('quantity', sum, 'total_quantity')
      | beam.Map(print)
  )

Result(fruit='strawberry', total_quantity=3)
Result(fruit='raspberry', total_quantity=1)
Result(fruit='blackberry', total_quantity=1)
Result(fruit='blueberry', total_quantity=3)
Result(fruit='banana', total_quantity=3)
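
As a rough mental model, `aggregate_field` groups the rows by key and then reduces one field of each group with the given function. The sketch below does this in plain Python. `Item` is a stand-in for `beam.Row` and `aggregate_by` is a hypothetical helper, both introduced only for illustration. The sketch collects each group's values into a list before combining them, which a `CombineFn`-based pipeline does not need to do.

```python
from collections import defaultdict, namedtuple

# Stand-in for beam.Row so the sketch runs without Beam installed.
Item = namedtuple('Item', ['recipe', 'fruit', 'quantity', 'unit_price'])

grocery_list = [
    Item('pie', 'strawberry', 3, 1.50),
    Item('pie', 'raspberry', 1, 3.50),
    Item('pie', 'blackberry', 1, 4.00),
    Item('pie', 'blueberry', 1, 2.00),
    Item('muffin', 'blueberry', 2, 2.00),
    Item('muffin', 'banana', 3, 1.00),
]

def aggregate_by(rows, key_field, field, combine, dest):
    """Hypothetical helper: group rows by key_field, then combine one field per group."""
    values_per_key = defaultdict(list)
    for row in rows:
        values_per_key[getattr(row, key_field)].append(getattr(row, field))
    return [{key_field: key, dest: combine(values)}
            for key, values in values_per_key.items()]

totals = aggregate_by(grocery_list, 'fruit', 'quantity', sum, 'total_quantity')
for result in totals:
    print(result)
```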


As with the parameters of `GroupBy`, one can aggregate several fields at once
and aggregate expressions as well as named fields.

In [13]:
with beam.Pipeline() as p:
  grouped = (
      p
      | beam.Create(GROCERY_LIST)
      | beam.GroupBy('recipe')
          .aggregate_field('quantity', sum, 'total_quantity')
          .aggregate_field(lambda x: x.quantity * x.unit_price, sum, 'price')
      | beam.Map(print)
  )

Result(recipe='pie', total_quantity=6, price=14.0)
Result(recipe='muffin', total_quantity=5, price=7.0)
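
The `price` values can be checked by hand: for each row the expression yields `quantity * unit_price`, and these per-row costs are summed per recipe. A minimal plain-Python check, with the rows copied from `GROCERY_LIST` as simple tuples:

```python
# (recipe, quantity, unit_price) triples copied from GROCERY_LIST.
rows = [
    ('pie', 3, 1.50),
    ('pie', 1, 3.50),
    ('pie', 1, 4.00),
    ('pie', 1, 2.00),
    ('muffin', 2, 2.00),
    ('muffin', 3, 1.00),
]

price_per_recipe = {}
for recipe, quantity, unit_price in rows:
    cost = quantity * unit_price  # the value the lambda computes for this row
    price_per_recipe[recipe] = price_per_recipe.get(recipe, 0.0) + cost

print(price_per_recipe)
```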


One can, of course, aggregate the same field multiple times as well.
This example also illustrates a global grouping, as the grouping key is empty.

In [16]:
with beam.Pipeline() as p:
  grouped = (
      p
      | beam.Create(GROCERY_LIST)
      | beam.GroupBy()
          .aggregate_field('unit_price', min, 'min_price')
          .aggregate_field('unit_price', beam.transforms.combiners.MeanCombineFn(), 'mean_price')
          .aggregate_field('unit_price', max, 'max_price')
      | beam.Map(print)
  )

Result(min_price=1.0, mean_price=2.3333333333333335, max_price=4.0)
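
With an empty grouping key, every element falls into one group, so each aggregation runs over the whole collection. The plain-Python sketch below mirrors that, with `statistics.mean` standing in for `MeanCombineFn`. This is an illustrative substitution, not the Beam implementation.

```python
from statistics import mean

# unit_price of every row in GROCERY_LIST, in order.
unit_prices = [1.50, 3.50, 4.00, 2.00, 2.00, 1.00]

# One group containing everything: three aggregations over the same field.
summary = {
    'min_price': min(unit_prices),
    'mean_price': mean(unit_prices),
    'max_price': max(unit_prices),
}
print(summary)
```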


## Related transforms

* [CombinePerKey](/documentation/transforms/python/aggregation/combineperkey) for combining with a single CombineFn.
* [GroupByKey](/documentation/transforms/python/aggregation/groupbykey) for grouping with a known key.
* [CoGroupByKey](/documentation/transforms/python/aggregation/cogroupbykey) for multiple input collections.