# Aggregations

Spark allows us to create several different grouping types for aggregation. This notebook will discuss some of these grouping and aggregation techniques.

Let's first load some retail data. We have over 300 CSV files, each representing one day of transactions in a retail store:

In [1]:
data = "gs://is843-demo/notebooks/jupyter/data/"

## Data import

In [2]:
df = spark.read.format("csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load(data + "retail-data/by-day/*.csv")

                                                                                

Here’s a sample of the data:

In [3]:
df.show(5, False)

df.printSchema()

+---------+---------+-------------------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description                    |Quantity|InvoiceDate        |UnitPrice|CustomerID|Country       |
+---------+---------+-------------------------------+--------+-------------------+---------+----------+--------------+
|580538   |23084    |RABBIT NIGHT LIGHT             |48      |2011-12-05 08:38:00|1.79     |14075.0   |United Kingdom|
|580538   |23077    |DOUGHNUT LIP GLOSS             |20      |2011-12-05 08:38:00|1.25     |14075.0   |United Kingdom|
|580538   |22906    |12 MESSAGE CARDS WITH ENVELOPES|24      |2011-12-05 08:38:00|1.65     |14075.0   |United Kingdom|
|580538   |21914    |BLUE HARMONICA IN BOX          |24      |2011-12-05 08:38:00|1.25     |14075.0   |United Kingdom|
|580538   |22467    |GUMBALL COAT RACK              |6       |2011-12-05 08:38:00|2.55     |14075.0   |United Kingdom|
+---------+---------+---------------------------

`InvoiceDate` is being recognized as a string. Let's convert it to a date. We won't need the time of day, so it's fine to drop it.

We will also cast `CustomerID` to string and `Quantity` to long:

In [4]:
from pyspark.sql.functions import col, to_date

df = df.withColumn("InvoiceDate", to_date(col("InvoiceDate"), "yyyy-MM-dd HH:mm:ss"))

df = df.withColumn("CustomerID", col("CustomerID").cast("string"))
df = df.withColumn("Quantity", col("Quantity").cast("long"))

df.createOrReplaceTempView("dfTable")

df.show(5)
df.printSchema()

+---------+---------+--------------------+--------+-----------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-----------+---------+----------+--------------+
|   580538|    23084|  RABBIT NIGHT LIGHT|      48| 2011-12-05|     1.79|   14075.0|United Kingdom|
|   580538|    23077| DOUGHNUT LIP GLOSS |      20| 2011-12-05|     1.25|   14075.0|United Kingdom|
|   580538|    22906|12 MESSAGE CARDS ...|      24| 2011-12-05|     1.65|   14075.0|United Kingdom|
|   580538|    21914|BLUE HARMONICA IN...|      24| 2011-12-05|     1.25|   14075.0|United Kingdom|
|   580538|    22467|   GUMBALL COAT RACK|       6| 2011-12-05|     2.55|   14075.0|United Kingdom|
+---------+---------+--------------------+--------+-----------+---------+----------+--------------+
only showing top 5 rows

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nulla

## Caching

Caching lets the DataFrame be loaded once and persisted in memory. Without it, every time we execute an action the DataFrame is read again from Cloud Storage, which adds to our execution time.

**Note:** Caching is lazy. The data is cached the first time you execute an action against the DataFrame, not when you call `cache()`.

In [5]:
df.cache()

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: bigint, InvoiceDate: date, UnitPrice: double, CustomerID: string, Country: string]

Basic aggregations apply to an entire DataFrame. The simplest example is the `count` method:

In [6]:
df.count()

                                                                                

541909

It returns the total number of rows in this DataFrame. Used this way, `count` is an action, so it returns its result immediately. We will also encounter cases where `count` acts as a lazy transformation.

Now that we have performed an action on `df`, it should be cached in memory. Go ahead and execute the previous command again to see the performance gain:

In [7]:
df.count()

541909

Once we are done with the DataFrame, we can free up the memory by removing it from the cache with `unpersist`:
```python
df.unpersist()
```

## Aggregation Functions

All aggregations are available as functions, in addition to the special cases that can appear on DataFrames or via `.stat`. You can find most aggregation functions in the `pyspark.sql.functions` module.

**Note:** There are some gaps between the available SQL functions and the functions that we can import in Scala and Python. This changes every release, so it’s impossible to include a definitive list. This section covers the most common functions.

### `count`
The first function worth going over is `count`, except that here it acts as a transformation instead of an action. We can do one of two things: specify a particular column to count, or count every row by using `count(*)` or `count(1)` (which counts each row as the literal one). The example below counts a single column:

In [8]:
from pyspark.sql.functions import count

df.select(count("StockCode")).show()

+----------------+
|count(StockCode)|
+----------------+
|          541909|
+----------------+



In SQL
```sql
SELECT COUNT(*) FROM dfTable
```

**Warning**

When performing a `count(*)`, Spark counts rows that contain null values (including rows containing all nulls). However, when counting an individual column, Spark does not count the null values. For instance, if we repeat this for the `CustomerID` column, we get a different value because of its null values:

In [9]:
df.select(count("CustomerID")).show()

+-----------------+
|count(CustomerID)|
+-----------------+
|           406829|
+-----------------+
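
The difference between the two counts can be sketched in plain Python (this is not Spark code). The list below is a hypothetical sample, with `None` standing in for a SQL null:

```python
# Hypothetical CustomerID values; None plays the role of a SQL NULL.
customer_ids = ["14075.0", None, "17811.0", None, "14287.0"]

# Like COUNT(*): every row counts, whether or not the value is null.
count_star = len(customer_ids)

# Like COUNT(CustomerID): only non-null values count.
count_column = sum(1 for c in customer_ids if c is not None)

print(count_star, count_column)
```

The gap between the two numbers is the number of null values in the column.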



### `countDistinct`

Sometimes, the total number is not relevant; rather, it’s the number of unique groups that you want. To get this number, you can use the `countDistinct` function. This is a bit more relevant for individual columns:

In [10]:
from pyspark.sql.functions import countDistinct

df.select(countDistinct("StockCode")).show()



+-------------------------+
|count(DISTINCT StockCode)|
+-------------------------+
|                     4070|
+-------------------------+



                                                                                

In SQL
```sql
SELECT COUNT(DISTINCT StockCode) FROM dfTable
```

`approx_count_distinct`: This function estimates the distinct count. On large datasets it can be much faster than an exact `countDistinct`, at the cost of some accuracy.

### `min` and `max`

To extract the minimum and maximum values from a DataFrame column, use the `min` and `max` functions:

In [11]:
from pyspark.sql.functions import min, max

df.select(min("Quantity"), max("Quantity")).show()

+-------------+-------------+
|min(Quantity)|max(Quantity)|
+-------------+-------------+
|       -80995|        80995|
+-------------+-------------+



### `sum`

Another simple task is to add all the values in a column using the `sum` function:

In [12]:
from pyspark.sql.functions import sum

df.select(sum("Quantity")).show()

+-------------+
|sum(Quantity)|
+-------------+
|      5176450|
+-------------+



### `avg`

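The `avg` function computes the average of a column; `mean` is an alias for it. The example below checks both against the sum divided by the count: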

In [13]:
from pyspark.sql.functions import sum, count, avg, expr

df.select(
    count("Quantity").alias("total_transactions"),
    sum("Quantity").alias("total_purchases"),
    avg("Quantity").alias("avg_purchases"),
    expr("mean(Quantity)").alias("mean_purchases"))\
  .selectExpr(
    "total_purchases/total_transactions",
    "avg_purchases",
    "mean_purchases").show()

+--------------------------------------+----------------+----------------+
|(total_purchases / total_transactions)|   avg_purchases|  mean_purchases|
+--------------------------------------+----------------+----------------+
|                      9.55224954743324|9.55224954743324|9.55224954743324|
+--------------------------------------+----------------+----------------+



### Variance and Standard Deviation

Calculating the mean naturally brings up questions about the variance and standard deviation. These are both measures of the spread of the data around the mean. The variance is the average of the squared differences from the mean, and the standard deviation is the square root of the variance. You can calculate these in Spark by using their respective functions:

In [14]:
from pyspark.sql.functions import variance, stddev

df.select(variance("Quantity"), stddev("Quantity")).show()

+------------------+---------------------+
|var_samp(Quantity)|stddev_samp(Quantity)|
+------------------+---------------------+
| 47559.39140929847|   218.08115785023352|
+------------------+---------------------+



Something to note is that Spark supports both the sample standard deviation and the population standard deviation. These are different statistical formulae, and we need to differentiate between them. By default, the `variance` and `stddev` functions use the sample formula.

You can also specify these explicitly or refer to the population standard deviation or variance:

In [15]:
from pyspark.sql.functions import var_pop, stddev_pop
from pyspark.sql.functions import var_samp, stddev_samp

df.select(var_pop("Quantity"), var_samp("Quantity"),
  stddev_pop("Quantity"), stddev_samp("Quantity")).show()

+-----------------+------------------+--------------------+---------------------+
|var_pop(Quantity)|var_samp(Quantity)|stddev_pop(Quantity)|stddev_samp(Quantity)|
+-----------------+------------------+--------------------+---------------------+
|47559.30364660878| 47559.39140929847|  218.08095663447733|   218.08115785023352|
+-----------------+------------------+--------------------+---------------------+



In SQL
```sql
SELECT var_pop(Quantity), var_samp(Quantity),
  stddev_pop(Quantity), stddev_samp(Quantity)
FROM dfTable
```
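
The two formulas differ only in the divisor: the sample version divides by `n - 1`, the population version by `n`. A minimal plain-Python sketch using the standard library's `statistics` module (not Spark, and on a hypothetical list of quantities) shows the relationship:

```python
import math
import statistics

quantities = [6, 20, 24, 24, 48]  # hypothetical sample, not the retail data
n = len(quantities)

var_samp = statistics.variance(quantities)   # divides by n - 1
var_pop = statistics.pvariance(quantities)   # divides by n

stddev_samp = math.sqrt(var_samp)
stddev_pop = math.sqrt(var_pop)

print(var_samp, var_pop)
print(stddev_samp, stddev_pop)
```

On a dataset with hundreds of thousands of rows the two values are nearly identical, which is why the Spark results above differ only after several digits.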

### Covariance and Correlation

We discussed single-column aggregations, but some functions compare the interactions of the values in two different columns. Two of these functions are `covar_samp` and `corr`, for covariance and correlation, respectively. Correlation measures the Pearson correlation coefficient, which is scaled between –1 and +1. The covariance is scaled according to the inputs in the data. There also exists a covariance function for the population, `covar_pop`.

In [16]:
from pyspark.sql.functions import corr, covar_samp

df.select(corr("UnitPrice", "Quantity"), covar_samp("UnitPrice", "Quantity")).show()

+-------------------------+-------------------------------+
|corr(UnitPrice, Quantity)|covar_samp(UnitPrice, Quantity)|
+-------------------------+-------------------------------+
|     -0.00123492454487...|            -26.058761257936645|
+-------------------------+-------------------------------+
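
To see why correlation is bounded while covariance depends on the scale of the inputs, here is a small plain-Python sketch (not Spark code). The helper `covar_and_corr` and the sample lists are hypothetical and exist only for illustration:

```python
import math

def covar_and_corr(xs, ys):
    """Return (sample covariance, Pearson correlation) for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covar_samp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return covar_samp, covar_samp / (std_x * std_y)

unit_price = [1.0, 2.0, 3.0, 4.0]   # hypothetical values
quantity = [8.0, 6.0, 4.0, 2.0]     # falls as the price rises

cov, corr = covar_and_corr(unit_price, quantity)

# Multiplying one input by 10 scales the covariance but leaves the correlation unchanged.
cov_scaled, corr_scaled = covar_and_corr(unit_price, [q * 10 for q in quantity])

print(cov, corr)
print(cov_scaled, corr_scaled)
```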



## Grouping

Thus far, we have performed only DataFrame-level aggregations. A more common task is to perform calculations based on groups in the data. This is typically done on categorical data for which we group our data on one column and perform some calculations on the other columns that end up in that group.

The best way to explain this is to begin performing some groupings. The first will be a count, just as we did before. We will group by each unique invoice number and get the count of items on that invoice. Note that this returns another DataFrame and is lazily performed.

We do this grouping in two phases. First we specify the column(s) on which we would like to group, and then we specify the aggregation(s). The first step returns a `GroupedData` object (the PySpark counterpart of Scala's `RelationalGroupedDataset`), and the second step returns a `DataFrame`.

As mentioned, we can specify any number of columns on which we want to group:

In [17]:
df.groupBy("InvoiceNo", "CustomerId")

<pyspark.sql.group.GroupedData at 0x7fd1bcf26c20>

In [18]:
df.groupBy("InvoiceNo", "CustomerId").count().show(5)



+---------+----------+-----+
|InvoiceNo|CustomerId|count|
+---------+----------+-----+
|   577728|   17811.0|   46|
|   580060|   14287.0|    5|
|   580066|   14309.0|   18|
|   577170|   13456.0|    7|
|   580193|   16899.0|   25|
+---------+----------+-----+
only showing top 5 rows



                                                                                

In SQL
```sql
SELECT InvoiceNo, CustomerId, count(*) FROM dfTable GROUP BY InvoiceNo, CustomerId
```
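
The two phases can be mimicked in plain Python (not Spark code) to make the idea concrete. The rows below are hypothetical `(InvoiceNo, CustomerId)` pairs:

```python
from collections import defaultdict

# Hypothetical (InvoiceNo, CustomerId) pairs, one per line item.
rows = [
    ("580538", "14075.0"),
    ("577728", "17811.0"),
    ("580538", "14075.0"),
    ("580060", "14287.0"),
    ("580538", "14075.0"),
    ("577728", "17811.0"),
]

# Phase 1: collect rows into groups keyed by the grouping columns.
groups = defaultdict(list)
for invoice_no, customer_id in rows:
    groups[(invoice_no, customer_id)].append((invoice_no, customer_id))

# Phase 2: apply an aggregation (here, a count) to each group.
counts = {key: len(members) for key, members in groups.items()}

print(counts)
```

Unlike this eager sketch, Spark only records the grouping and aggregation and runs them when an action such as `show` is called.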

## Grouping with Expressions

As we saw earlier, counting is a bit of a special case because it exists as a method. Usually, though, we prefer to use the `count` function. Rather than passing that function as an expression into a `select` statement, we specify it within `agg`. This makes it possible to pass in arbitrary expressions, as long as each one specifies an aggregation. You can even `alias` a column after transforming it for later use in your data flow:

In [19]:
from pyspark.sql.functions import count, expr

df.groupBy("InvoiceNo")\
  .agg(count("Quantity").alias("quan"),
       expr("count(Quantity) as quan2")
      ).show(5)



+---------+----+-----+
|InvoiceNo|quan|quan2|
+---------+----+-----+
|   573409|   1|    1|
|   572458|  26|   26|
|  C577362|   1|    1|
|   568711|   4|    4|
|   538041|   1|    1|
+---------+----+-----+
only showing top 5 rows



                                                                                

In [20]:
df.groupBy("InvoiceNo").agg(expr("avg(Quantity)"),expr("stddev_pop(Quantity)")).show(5)



+---------+------------------+--------------------+
|InvoiceNo|     avg(Quantity)|stddev_pop(Quantity)|
+---------+------------------+--------------------+
|   574966|               6.0|   3.640054944640259|
|   575091|11.552631578947368|   5.008925551458656|
|   578057| 4.607142857142857|   8.755974636597271|
|   537252|              31.0|                 0.0|
|   578459|              28.0|                26.0|
+---------+------------------+--------------------+
only showing top 5 rows



                                                                                

## Window Functions

Window functions operate on a set of rows and return a single value for each row from the underlying query. The term window describes the set of rows on which the function operates. A window function uses values from the rows in a window to calculate the returned values.

The following SQL query adds a new column (`overall_dataset_Q`) that holds the total quantity for the entire dataset. This value is the same for every row:

In [21]:
spark.sql("""
SELECT CustomerId, InvoiceDate, Quantity,

  sum(Quantity) OVER () as overall_dataset_Q
  
FROM dfTable WHERE CustomerId IS NOT NULL 
ORDER BY CustomerId, InvoiceDate DESC
""").show(5)

23/03/21 21:22:39 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

+----------+-----------+--------+-----------------+
|CustomerId|InvoiceDate|Quantity|overall_dataset_Q|
+----------+-----------+--------+-----------------+
|   12346.0| 2011-01-18|  -74215|          4906888|
|   12346.0| 2011-01-18|   74215|          4906888|
|   12347.0| 2011-12-07|      12|          4906888|
|   12347.0| 2011-12-07|      24|          4906888|
|   12347.0| 2011-12-07|      24|          4906888|
+----------+-----------+--------+-----------------+
only showing top 5 rows



                                                                                

The `OVER()` clause indicates that `sum()` should be evaluated over a window.

The `OVER()` clause has the following capabilities:

* Defines window partitions to form groups of rows. (`PARTITION BY` clause)
* Orders rows within a partition. (`ORDER BY` clause)

In the above query, since we haven't defined a `PARTITION BY` within `OVER()`, the aggregate function applies to the entire dataset. This is also why Spark logs the `No Partition Defined for Window operation` warning.
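
The key difference from a `GROUP BY` is that a window aggregate keeps every input row and attaches the aggregate to it. A plain-Python sketch (not Spark code, on hypothetical `(CustomerId, Quantity)` rows) of an unpartitioned window next to a window partitioned by customer:

```python
from collections import defaultdict

# Hypothetical (CustomerId, Quantity) rows.
rows = [
    ("12346.0", 74215),
    ("12346.0", -74215),
    ("12347.0", 12),
    ("12347.0", 24),
]

# sum(Quantity) OVER (): one window spanning every row.
overall_q = sum(quantity for _, quantity in rows)

# sum(Quantity) OVER (PARTITION BY CustomerId): one window per customer.
total_by_customer = defaultdict(int)
for customer_id, quantity in rows:
    total_by_customer[customer_id] += quantity

# Every input row survives; the window values are added as extra fields.
windowed = [
    (customer_id, quantity, overall_q, total_by_customer[customer_id])
    for customer_id, quantity in rows
]

for row in windowed:
    print(row)
```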

We can do a lot more useful calculations with the window functions. For instance, we can add a total quantity per customer and a total quantity per customer by date, specifying the appropriate `PARTITION`s:

In [22]:
spark.sql("""
SELECT CustomerId, InvoiceDate, Quantity,

  sum(Quantity) OVER () as overall_dataset_Q,
  
  sum(Quantity) OVER (PARTITION BY CustomerId) as total_Q,
  
  sum(Quantity) OVER (PARTITION BY CustomerId, InvoiceDate) as total_Q_by_date
  
FROM dfTable WHERE CustomerId IS NOT NULL 
ORDER BY CustomerId, InvoiceDate DESC
""").show()

23/03/21 21:22:41 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

+----------+-----------+--------+-----------------+-------+---------------+
|CustomerId|InvoiceDate|Quantity|overall_dataset_Q|total_Q|total_Q_by_date|
+----------+-----------+--------+-----------------+-------+---------------+
|   12346.0| 2011-01-18|   74215|          4906888|      0|              0|
|   12346.0| 2011-01-18|  -74215|          4906888|      0|              0|
|   12347.0| 2011-12-07|      20|          4906888|   2458|            192|
|   12347.0| 2011-12-07|      12|          4906888|   2458|            192|
|   12347.0| 2011-12-07|       6|          4906888|   2458|            192|
|   12347.0| 2011-12-07|      16|          4906888|   2458|            192|
|   12347.0| 2011-12-07|      20|          4906888|   2458|            192|
|   12347.0| 2011-12-07|      24|          4906888|   2458|            192|
|   12347.0| 2011-12-07|      24|          4906888|   2458|            192|
|   12347.0| 2011-12-07|      24|          4906888|   2458|            192|
|   12347.0|

                                                                                

Using the `ORDER BY` clause, we can sort the rows within each partition and then use the `rank()` function to rank them:

In [23]:
spark.sql("""
SELECT CustomerId, InvoiceDate, Quantity,

  sum(Quantity) OVER () as overall_dataset_Q,
  
  sum(Quantity) OVER (PARTITION BY CustomerId) as total_Q,
  
  sum(Quantity) OVER (PARTITION BY CustomerId, InvoiceDate) as total_Q_by_date,
  
  RANK() OVER (PARTITION BY CustomerId, InvoiceDate 
               ORDER BY Quantity DESC) as rank
  
FROM dfTable WHERE CustomerId IS NOT NULL 
ORDER BY CustomerId, InvoiceDate DESC, rank
""").show()

23/03/21 21:22:43 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

+----------+-----------+--------+-----------------+-------+---------------+----+
|CustomerId|InvoiceDate|Quantity|overall_dataset_Q|total_Q|total_Q_by_date|rank|
+----------+-----------+--------+-----------------+-------+---------------+----+
|   12346.0| 2011-01-18|   74215|          4906888|      0|              0|   1|
|   12346.0| 2011-01-18|  -74215|          4906888|      0|              0|   2|
|   12347.0| 2011-12-07|      24|          4906888|   2458|            192|   1|
|   12347.0| 2011-12-07|      24|          4906888|   2458|            192|   1|
|   12347.0| 2011-12-07|      24|          4906888|   2458|            192|   1|
|   12347.0| 2011-12-07|      24|          4906888|   2458|            192|   1|
|   12347.0| 2011-12-07|      20|          4906888|   2458|            192|   5|
|   12347.0| 2011-12-07|      20|          4906888|   2458|            192|   5|
|   12347.0| 2011-12-07|      16|          4906888|   2458|            192|   7|
|   12347.0| 2011-12-07|    

                                                                                

Notice how `rank()` behaves when it comes across tied values: tied rows receive the same rank, and the ranks that follow skip ahead, leaving gaps (1, 1, 1, 1, 5, 5, 7 above).

`dense_rank()` returns the rank of rows within a window partition, without any gaps.

`row_number()` returns a sequential number starting at 1 within a window partition. Note that tied rows are numbered in an arbitrary order that may change between runs. This could have downstream consequences!
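
To make the tie handling concrete, here is a plain-Python sketch (not Spark code) that computes all three numbers for one already-sorted partition. The function `rank_rows` is a hypothetical helper, and the input mirrors the quantities of customer 12347.0 shown earlier:

```python
def rank_rows(sorted_values):
    """Return (value, rank, dense_rank, row_number) for values sorted in ranking order."""
    results = []
    rank = 0
    dense_rank = 0
    previous = object()  # sentinel that never equals a real value
    for row_number, value in enumerate(sorted_values, start=1):
        if value != previous:
            rank = row_number   # rank jumps to the current position, leaving gaps
            dense_rank += 1     # dense rank only ever advances by one
            previous = value
        results.append((value, rank, dense_rank, row_number))
    return results

quantities = [24, 24, 24, 24, 20, 20, 16]  # sorted descending
ranked = rank_rows(quantities)

for row in ranked:
    print(row)
```

The four rows with quantity 24 share rank 1 and dense rank 1 but get row numbers 1 through 4; which tied row gets which row number depends only on the order in which they arrive.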

In the example below we have added both `dense_rank()` and `row_number()` to the previous query:

In [24]:
spark.sql("""
SELECT CustomerId, InvoiceDate, Quantity,

  sum(Quantity) OVER () as overall_dataset_Q,
  
  sum(Quantity) OVER (PARTITION BY CustomerId) as total_Q,
  
  max(Quantity) OVER (PARTITION BY CustomerId) as max_Q,

  sum(Quantity) OVER (PARTITION BY CustomerId, InvoiceDate) as total_Q_by_date,
    
  RANK() OVER (PARTITION BY CustomerId, InvoiceDate 
               ORDER BY Quantity DESC) as rank,
  
  DENSE_RANK() OVER (PARTITION BY CustomerId, InvoiceDate 
                     ORDER BY Quantity DESC) as d_rank,
  
  ROW_NUMBER() OVER (PARTITION BY CustomerId, InvoiceDate 
                     ORDER BY Quantity DESC) as row_number
  
FROM dfTable WHERE CustomerId IS NOT NULL 
ORDER BY CustomerId, InvoiceDate DESC, row_number
""").show()

23/03/21 21:22:45 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

+----------+-----------+--------+-----------------+-------+-----+---------------+----+------+----------+
|CustomerId|InvoiceDate|Quantity|overall_dataset_Q|total_Q|max_Q|total_Q_by_date|rank|d_rank|row_number|
+----------+-----------+--------+-----------------+-------+-----+---------------+----+------+----------+
|   12346.0| 2011-01-18|   74215|          4906888|      0|74215|              0|   1|     1|         1|
|   12346.0| 2011-01-18|  -74215|          4906888|      0|74215|              0|   2|     2|         2|
|   12347.0| 2011-12-07|      24|          4906888|   2458|  240|            192|   1|     1|         1|
|   12347.0| 2011-12-07|      24|          4906888|   2458|  240|            192|   1|     1|         2|
|   12347.0| 2011-12-07|      24|          4906888|   2458|  240|            192|   1|     1|         3|
|   12347.0| 2011-12-07|      24|          4906888|   2458|  240|            192|   1|     1|         4|
|   12347.0| 2011-12-07|      20|          4906888|   2

                                                                                

A few points regarding ranking functions:

* `ROW_NUMBER()`, `RANK()`, and other ranking functions must always be windowed and therefore cannot appear without a corresponding OVER clause.

* Give consideration to how ties should be handled with ranking functions. If you need contiguous ranking, you should use `DENSE_RANK()` instead.

* The `ORDER BY` clause is mandatory for this class of functions because it determines how the results are sequenced or ranked.

### PySpark Syntax

Now that we are familiar with the SQL syntax of window functions let's have a look at its PySpark equivalent.

The first step to a window function is to create a window specification. 

Note that the `partition by` is unrelated to the partitioning scheme concept that we have covered thus far. It’s just a similar concept that describes how we will be breaking up our group. The ordering determines the ordering within a given partition. 

In the following example, we will reproduce our last SQL query:

In [25]:
from pyspark.sql.window import Window
from pyspark.sql.functions import desc

windowSpec0 = Window.partitionBy()
windowSpec1 = Window.partitionBy("CustomerId")
windowSpec2 = Window.partitionBy("CustomerId", "InvoiceDate")
windowSpec3 = Window.partitionBy("CustomerId", "InvoiceDate").orderBy(desc("Quantity"))

Now we want to use an aggregation function to learn more about each specific customer. An example might be establishing the maximum purchase quantity for each customer. We indicate the window specification that defines to which frames of data this function will apply:

In [26]:
from pyspark.sql.functions import sum, max, rank, dense_rank, row_number

overall_dataset_Q = sum(col("Quantity")).over(windowSpec0)
total_Q = sum(col("Quantity")).over(windowSpec1)
max_Q = max(col("Quantity")).over(windowSpec1)
total_Q_by_date = sum(col("Quantity")).over(windowSpec2)
rank = rank().over(windowSpec3)
d_rank = dense_rank().over(windowSpec3)
row_number = row_number().over(windowSpec3)

This returns columns that we can use in `select` statements. Now we can perform a select to view the calculated window values:

In [27]:
from pyspark.sql.functions import col

df.where("CustomerId IS NOT NULL")\
  .select(
    col("CustomerId"),
    col("InvoiceDate"),
    col("Quantity"),
    overall_dataset_Q.alias("overall_dataset_Q"),
    total_Q.alias("total_Q"),
    max_Q.alias("max_Q"),
    total_Q_by_date.alias("total_Q_by_date"),
    rank.alias("rank"),
    d_rank.alias("d_rank"),
    row_number.alias("row_number"))\
  .orderBy("CustomerId", desc("InvoiceDate"), "row_number").show()

23/03/21 21:22:47 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

+----------+-----------+--------+-----------------+-------+-----+---------------+----+------+----------+
|CustomerId|InvoiceDate|Quantity|overall_dataset_Q|total_Q|max_Q|total_Q_by_date|rank|d_rank|row_number|
+----------+-----------+--------+-----------------+-------+-----+---------------+----+------+----------+
|   12346.0| 2011-01-18|   74215|          4906888|      0|74215|              0|   1|     1|         1|
|   12346.0| 2011-01-18|  -74215|          4906888|      0|74215|              0|   2|     2|         2|
|   12347.0| 2011-12-07|      24|          4906888|   2458|  240|            192|   1|     1|         1|
|   12347.0| 2011-12-07|      24|          4906888|   2458|  240|            192|   1|     1|         2|
|   12347.0| 2011-12-07|      24|          4906888|   2458|  240|            192|   1|     1|         3|
|   12347.0| 2011-12-07|      24|          4906888|   2458|  240|            192|   1|     1|         4|
|   12347.0| 2011-12-07|      20|          4906888|   2

                                                                                

## Rollups

A rollup is a multidimensional aggregation that performs a variety of group-by style calculations for us.

Let’s create a rollup that looks across time (with our new `InvoiceDate` column) and space (with the `Country` column) and creates a new DataFrame that includes
- the grand total over all dates and countries
- the total for each date across all countries
- and the subtotal for each country on each date
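
The three levels listed above can be sketched in plain Python (not Spark code). The function `rollup_sum` and the sample rows are hypothetical, and `None` marks a rolled-up column the same way null does in Spark's output:

```python
from collections import defaultdict

def rollup_sum(rows):
    """Sum quantities at the three hierarchical levels of a (date, country) rollup."""
    totals = defaultdict(int)
    for date, country, quantity in rows:
        totals[(date, country)] += quantity  # subtotal per country on each date
        totals[(date, None)] += quantity     # total per date across all countries
        totals[(None, None)] += quantity     # grand total
    return dict(totals)

# Hypothetical (InvoiceDate, Country, Quantity) rows.
rows = [
    ("2010-12-01", "France", 10),
    ("2010-12-01", "EIRE", 5),
    ("2010-12-02", "France", 7),
]

rolled_up = rollup_sum(rows)
for key, total in rolled_up.items():
    print(key, total)
```

Because a rollup is hierarchical, there is no `(None, "France")` entry: totals per country across all dates are not part of it.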

In [28]:
dfNoNull = df.na.drop()
dfNoNull.createOrReplaceTempView("dfNoNull")

In [29]:
rolledUpDF = dfNoNull.rollup("InvoiceDate", "Country")\
  .agg(sum("Quantity"))\
  .orderBy("InvoiceDate")

rolledUpDF.show(10)



+-----------+--------------+-------------+
|InvoiceDate|       Country|sum(Quantity)|
+-----------+--------------+-------------+
|       null|          null|      4906888|
| 2010-12-01|        France|          449|
| 2010-12-01|          EIRE|          243|
| 2010-12-01|       Germany|          117|
| 2010-12-01|        Norway|         1852|
| 2010-12-01|     Australia|          107|
| 2010-12-01|   Netherlands|           97|
| 2010-12-01|          null|        24032|
| 2010-12-01|United Kingdom|        21167|
| 2010-12-02|          null|        20855|
+-----------+--------------+-------------+
only showing top 10 rows



                                                                                

Rows with a null in `Country` (but not in `InvoiceDate`) hold the total for each date across all countries:

In [30]:
rolledUpDF.where("Country IS NULL").where("InvoiceDate IS NOT NULL").show(5)

+-----------+-------+-------------+
|InvoiceDate|Country|sum(Quantity)|
+-----------+-------+-------------+
| 2010-12-01|   null|        24032|
| 2010-12-02|   null|        20855|
| 2010-12-03|   null|        11548|
| 2010-12-05|   null|        16394|
| 2010-12-06|   null|        16095|
+-----------+-------+-------------+
only showing top 5 rows



A null in both rollup columns specifies the grand total across both of those columns:

In [31]:
rolledUpDF.where("InvoiceDate IS NULL").show()

+-----------+-------+-------------+
|InvoiceDate|Country|sum(Quantity)|
+-----------+-------+-------------+
|       null|   null|      4906888|
+-----------+-------+-------------+



## Cube

A `cube` takes the `rollup` a level deeper. Rather than treating elements hierarchically, a cube aggregates across every combination of the dimensions. This means that it totals not only by date across all countries, but also by country across all dates. To pose this as a question again, can you make a table that includes the following?

* The total across all dates and countries
* The total for each date across all countries
* The total for each country on each date
* The total for each country across all dates
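
These four kinds of totals correspond to every combination of keeping or dropping each grouping column. A plain-Python sketch (not Spark code; `cube_sum` and the sample rows are hypothetical, with `None` marking a dropped column):

```python
from collections import defaultdict
from itertools import product

def cube_sum(rows):
    """Sum quantities over every combination of keeping or dropping date and country."""
    totals = defaultdict(int)
    for date, country, quantity in rows:
        for keep_date, keep_country in product((True, False), repeat=2):
            key = (date if keep_date else None, country if keep_country else None)
            totals[key] += quantity
    return dict(totals)

# Hypothetical (InvoiceDate, Country, Quantity) rows.
rows = [
    ("2010-12-01", "France", 10),
    ("2010-12-01", "EIRE", 5),
    ("2010-12-02", "France", 7),
]

cubed = cube_sum(rows)
for key, total in cubed.items():
    print(key, total)
```

With two grouping columns there are four combinations per row; with `k` columns a cube produces `2**k`, so the output can grow quickly.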

The method call is quite similar, but instead of calling `rollup`, we call `cube`:

In [32]:
from pyspark.sql.functions import sum

cubbedDf = dfNoNull.cube("InvoiceDate", "Country")\
  .agg(sum("Quantity"))\
  .orderBy("InvoiceDate")

cubbedDf.show(5)



+-----------+--------------------+-------------+
|InvoiceDate|             Country|sum(Quantity)|
+-----------+--------------------+-------------+
|       null|United Arab Emirates|          982|
|       null|               Japan|        25218|
|       null|             Germany|       117448|
|       null|           Lithuania|          652|
|       null|                 USA|         1034|
+-----------+--------------------+-------------+
only showing top 5 rows



                                                                                

In [33]:
cubbedDf.where("Country IS NULL").show(5)

+-----------+-------+-------------+
|InvoiceDate|Country|sum(Quantity)|
+-----------+-------+-------------+
|       null|   null|      4906888|
| 2010-12-01|   null|        24032|
| 2010-12-02|   null|        20855|
| 2010-12-03|   null|        11548|
| 2010-12-05|   null|        16394|
+-----------+-------+-------------+
only showing top 5 rows



                                                                                

A cube is a great way to create a quick summary table that others can use later on.
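The four totals listed at the start of this section correspond to the four possible subsets of the two grouping columns. The following plain-Python sketch shows that idea under stated assumptions: the sample rows and the helper name `cube_sum` are hypothetical, `None` stands in for Spark's null, and this is not how Spark evaluates `cube` internally.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample rows: (InvoiceDate, Country, Quantity)
rows = [
    ("2010-12-01", "France", 10),
    ("2010-12-01", "Norway", 5),
    ("2010-12-02", "France", 7),
]

def cube_sum(rows):
    """Sum Quantity over every subset of the grouping columns;
    None marks a column that has been aggregated away."""
    totals = defaultdict(int)
    for *dims, qty in rows:
        # Keep or null out each dimension independently: 2**n grouping sets
        for mask in product([True, False], repeat=len(dims)):
            key = tuple(d if keep else None for d, keep in zip(dims, mask))
            totals[key] += qty
    return dict(totals)

cubed = cube_sum(rows)
for key, total in cubed.items():
    print(key, total)
```

Compared with a rollup over the same columns, the only additional keys are those of the form `(None, country)`, which hold the total for each country across all dates.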

## Pivot

Pivots make it possible for you to convert the values of a row into columns. For example, our current data has a `Country` column. With a `pivot`, we can aggregate according to some function for each country and display the results in an easy-to-query way:

In [34]:
df.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: long (nullable = true)
 |-- InvoiceDate: date (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: string (nullable = true)
 |-- Country: string (nullable = true)



In [35]:
pivoted = df.groupBy("InvoiceDate").pivot("Country").sum()

This DataFrame now has a column specifying the date plus one column for every combination of country and numeric variable. For example, for USA we have the columns `USA_sum(Quantity)` and `USA_sum(UnitPrice)`: one for each numeric column in our dataset, because we just performed an aggregation over all of them.

Here’s an example query and result from this data:

In [36]:
pivoted.show(5)


In [37]:
pivoted.where("InvoiceDate > '2011-12-05'").show(5)


In [38]:
pivoted.where("InvoiceDate > '2011-12-01'").select("InvoiceDate", "United Kingdom_sum(Quantity)").show()

+-----------+----------------------------+
|InvoiceDate|United Kingdom_sum(Quantity)|
+-----------+----------------------------+
| 2011-12-06|                       27191|
| 2011-12-05|                       42414|
| 2011-12-09|                        9534|
| 2011-12-02|                       24457|
| 2011-12-08|                       32576|
| 2011-12-07|                       27611|
| 2011-12-04|                       10816|
+-----------+----------------------------+
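To see where column names such as `United Kingdom_sum(Quantity)` come from, here is a small plain-Python sketch of the same reshaping. It only illustrates the idea, assuming a handful of made-up rows; the helper name `pivot_sum` is hypothetical and the code does not reflect Spark's actual implementation.

```python
from collections import defaultdict

# Hypothetical sample rows: (InvoiceDate, Country, Quantity, UnitPrice)
rows = [
    ("2011-12-05", "United Kingdom", 4, 2.5),
    ("2011-12-05", "France", 3, 1.0),
    ("2011-12-05", "United Kingdom", 6, 1.5),
    ("2011-12-06", "France", 2, 4.0),
]

def pivot_sum(rows):
    """Group by date and turn each (country, numeric column) pair
    into its own column holding the sum for that date."""
    out = defaultdict(dict)
    for date, country, qty, price in rows:
        row = out[date]
        for name, value in (("Quantity", qty), ("UnitPrice", price)):
            col = f"{country}_sum({name})"
            row[col] = row.get(col, 0) + value
    return dict(out)

pivoted_rows = pivot_sum(rows)
for date, columns in pivoted_rows.items():
    print(date, columns)
```

In this sketch a date with no rows for a given country simply lacks that key, whereas the pivoted DataFrame shows a null in the corresponding cell.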

