In [None]:
#import findspark
import pyspark

#findspark.init()

spark = pyspark.sql.SparkSession.builder \
        .master('local') \
        .appName('Aggregates with PySpark') \
        .getOrCreate()

sc = spark.sparkContext

# Aggregates

In this notebook we will be using data from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Online%20Retail).  
For convenience, we'll use [this version](https://www.kaggle.com/carrie1/ecommerce-data) of the dataset, which comes in CSV format.

In [None]:
from pathlib import Path
INPUT_PATH = Path(*['..'] * 3, 'input', 'Online-Retail-Dataset')
filename = 'data.csv'
filepath = INPUT_PATH / filename
filepath

In [None]:
sales = spark.read.format("csv").options(inferSchema=True, header=True).load("dbfs:/FileStore/shared_uploads/thibaudchevrier@gmail.com/data.csv")
sales.printSchema()
print("Shape:", (sales.count(), len(sales.columns)))
sales.show(5)

Before we get started, let's import PySpark's SQL functions.

In [None]:
from pyspark.sql import functions as F

We're **ready to roll!**

What's the type of `sales.groupBy('CustomerID')`?

In [None]:
type(sales.groupBy('CustomerID'))

It's a `GroupedData` object; here's the link to the [documentation](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.GroupedData).

Let's go through what we can do with it.

Compute the average of all numeric columns for each group with `.avg()`.

In [None]:
sales.groupBy('CustomerID').avg().show(5)

`.mean()` is an alias of `.avg()`.

In [None]:
sales.groupBy('CustomerID').mean().show()

`.sum()` computes the sum of all numeric columns for each group.

In [None]:
sales.groupBy('customerID').sum().show(5)

We can restrict the aggregation to a specific column

In [None]:
sales.groupBy('customerID').mean('Quantity').show(5)

or to several columns

In [None]:
sales.groupBy('customerID').mean('Quantity', 'UnitPrice').show(5)

Passing a list, a tuple or a generator won't work.  
These need to be unpacked with `*`.

In [None]:
# Unpack the tuple with *; passing the tuple itself would fail
sales.groupBy('customerID').sum(*('Quantity', 'UnitPrice')).show(5)
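Unpacking is needed because these methods take the column names as separate positional arguments. The calling convention can be sketched in pure Python, with no Spark involved. The function `describe_cols` below is a hypothetical stand-in, not part of PySpark.

```python
def describe_cols(*cols):
    """Hypothetical stand-in for a method that accepts column names as *cols."""
    return cols


names = ['Quantity', 'UnitPrice']

# Without unpacking, the whole list arrives as ONE argument.
packed = describe_cols(names)

# With unpacking, each name arrives as its own argument.
unpacked = describe_cols(*names)

print(packed)    # (['Quantity', 'UnitPrice'],)
print(unpacked)  # ('Quantity', 'UnitPrice')
```

In the first call the function receives a single list where it expects strings, which is the situation that makes the PySpark call fail.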

`.count()` is a bit different: it doesn't take any column, it counts the number of rows in each group.  
This differs from pandas, where `.count()` counts the number of non-null values.

In [None]:
sales.groupBy('customerID').count().show(5)

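Other `GroupedData` methods include `.max()`, `.min()` and `.pivot()`. Note that `.pivot()` is not an aggregate by itself: it reshapes the groups, and an aggregate is then applied to the result.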

And a last one: `.agg()`. This one is more general: it computes the aggregates that are given to it as parameters.  
From the [Documentation](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.GroupedData.agg)

> The available aggregate functions are `avg`, `max`, `min`, `sum`, `count`.

1. If exprs is a single dict mapping from string to string, then the key is the column to perform aggregation on, and the value is the aggregate function.

In [None]:
agg_dict = {'Quantity': 'mean', 'UnitPrice': 'sum'}
sales.groupBy('customerId').agg(agg_dict).show(5)

`'*'` can be used as the key when the aggregate function is `count`, but not with the other functions.

In [None]:
sales.groupBy('customerId').agg({'*': 'count'}).show(5)
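One limitation of the dict form is worth keeping in mind: Python dict keys are unique, so a dict can carry only one aggregate per column. This is plain Python behaviour and can be checked without Spark:

```python
# Writing the same key twice keeps only the last value
agg_dict = {'Quantity': 'mean', 'Quantity': 'sum'}

print(agg_dict)  # {'Quantity': 'sum'}
```

Passing this dict to `.agg()` would therefore request only the sum of `Quantity`. Computing several aggregates on the same column calls for the Column-expression form of `.agg()`.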

2. Alternatively, exprs can also be a "list" of aggregate Column expressions.  
This requires unpacking with `*`.

In [None]:
agg_exprs = [F.mean('Quantity'), F.sum('UnitPrice')]
sales.groupBy('customerId').agg(*agg_exprs).show()

You can also rename the resulting columns with `.alias()`. This works because `.alias()` also returns a _Column Expression_, which can be passed to `.agg()` like any other.

In [None]:
type(F.mean('Quantity').alias('meanQuantity'))

In [None]:
sales.groupBy('customerId').agg(
    F.mean('Quantity').alias('meanQuantity'),
    F.sum('UnitPrice').alias('totalPrice')
).show(5)