# Use aggregate functions

**Data Source**

* English Wikipedia pageviews by second
* Size on Disk: ~255 MB
* Type: Parquet files

**Technical Accomplishments:**
* Introduce the various aggregate functions.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

In [1]:
from pyspark.sql import SparkSession

In [2]:
# Initialize Spark Session
spark = (SparkSession.builder
         .appName("Create DataFrame from Dummy Data")
         .getOrCreate())

In [3]:
spark

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) The Data Source

This notebook uses the **Pageviews By Seconds** data set.

In [6]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

partitions = 8

# Make sure wide operations don't repartition to 200
spark.conf.set("spark.sql.shuffle.partitions", str(partitions))

In [5]:
# The directory containing our parquet files.
parquetFile = "../dataset/pageviews_by_second.parquet/"

In [7]:
# Create our initial DataFrame. Parquet files store their own
# schema, so there is nothing to infer and reading it is cheap.
initialDF = (spark.read
  .option("inferSchema", "true") # No effect on Parquet; the schema comes from the file metadata
  .parquet(parquetFile)          # Read the data in
  .repartition(partitions)       # From 7 >>> 8 partitions
  .cache()                       # Cache the expensive operation
)
# materialize the cache
initialDF.count()

# rename the timestamp column and cast to a timestamp data type
pageviewsDF = (initialDF
  .withColumnRenamed("timestamp", "capturedAt")
  .withColumn("capturedAt", unix_timestamp( col("capturedAt"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp") )
)

# cache the transformations on our new DataFrame by marking the DataFrame as cached and then materialize the result
pageviewsDF.cache().count()

7200000

In [8]:
pageviewsDF.show(10)

+-------------------+-------+--------+
|         capturedAt|   site|requests|
+-------------------+-------+--------+
|2015-03-22 22:41:36| mobile|    1667|
|2015-03-16 08:51:07|desktop|    2223|
|2015-03-17 07:13:16|desktop|    2189|
|2015-03-16 05:54:59| mobile|    1097|
|2015-03-17 14:21:38| mobile|    1342|
|2015-03-23 17:43:40|desktop|    3202|
|2015-03-18 00:44:12| mobile|    1524|
|2015-03-21 05:10:01| mobile|    1156|
|2015-03-19 12:11:03| mobile|    1096|
|2015-03-23 11:43:22| mobile|     994|
+-------------------+-------+--------+
only showing top 10 rows



## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) groupBy()

Aggregating data is one of the more common tasks when working with big data.
* How many customers are over 65?
* What is the ratio of men to women?
* Group all emails by their sender.

The function `groupBy()` is one tool that we can use for this purpose.

If you look at the API docs, `groupBy(..)` is described like this:
> Groups the Dataset using the specified columns, so that we can run aggregation on them.

This function is a **wide** transformation: the aggregation that follows it triggers a shuffle, which marks a stage boundary.

Unlike all of the other transformations we've seen so far, this transformation does not return a `DataFrame`.
* In Scala it returns `RelationalGroupedDataset`
* In Python it returns `GroupedData`

This is because the call `groupBy(..)` is only half of the transformation.

To see the other half, we need to take a look at its return type, `RelationalGroupedDataset`.

### RelationalGroupedDataset

If we take a look at the API docs for `RelationalGroupedDataset`, we can see that it supports the following aggregations:

| Method | Description |
|--------|-------------|
| `avg(..)` | Compute the mean value of each numeric column for each group. |
| `count()` | Count the number of rows in each group. |
| `sum(..)` | Compute the sum of each numeric column for each group. |
| `min(..)` | Compute the min value of each numeric column for each group. |
| `max(..)` | Compute the max value of each numeric column for each group. |
| `mean(..)` | Compute the mean value of each numeric column for each group (an alias for `avg(..)`). |
| `agg(..)` | Compute aggregates by specifying a series of aggregate columns. |


Together, `groupBy(..)` and `RelationalGroupedDataset` (or `GroupedData` in Python) give us what we need to answer some basic questions.

For example, how many more requests did the desktop site receive than the mobile site?

For this, all we need to do is group all records by **site** and then sum all the requests.

In [9]:
(pageviewsDF
  .groupBy( col("site") )
  .sum()
  .show()
)

+-------+-------------+
|   site|sum(requests)|
+-------+-------------+
| mobile|   4605797962|
|desktop|   8737180972|
+-------+-------------+



In [13]:
pageviewsDF.groupBy(col("site"))

GroupedData[grouping expressions: [site], value: [capturedAt: timestamp, site: string ... 1 more field], type: GroupBy]

Notice that in the first `sum()` call above we didn't actually specify which column to sum.

In that case, Spark returns a total for every numeric column.

There is a performance catch: if the DataFrame has 2, 5, or 10 numeric columns, all of them are summed even though I may only need one.

I can either reduce the DataFrame to the columns I want first, or simply specify which column(s) to sum.

In [10]:
(pageviewsDF
  .groupBy( col("site") )
  .sum("requests")
  .show()
)

+-------+-------------+
|   site|sum(requests)|
+-------+-------------+
| mobile|   4605797962|
|desktop|   8737180972|
+-------+-------------+



And because I don't like the resulting column name, **sum(requests)**, I can easily rename it...

In [11]:
(pageviewsDF
  .groupBy( col("site") )
  .sum("requests")
  .withColumnRenamed("sum(requests)", "totalRequests")
  .show()
)

+-------+-------------+
|   site|totalRequests|
+-------+-------------+
| mobile|   4605797962|
|desktop|   8737180972|
+-------+-------------+



How about the number of records per site, mobile vs. desktop? Note that `count()` counts rows, not requests.

In [12]:
(pageviewsDF
  .groupBy( col("site") )
  .count()
  .show()
)

+-------+-------+
|   site|  count|
+-------+-------+
| mobile|3600000|
|desktop|3600000|
+-------+-------+



## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) sum(), count(), avg(), min(), max()

The `groupBy(..)` operation is not our only option for aggregating.

The `...sql.functions` package defines a large number of aggregate functions:
* `org.apache.spark.sql.functions` in the case of Scala & Java
* `pyspark.sql.functions` in the case of Python


Let's take a look at this in the Scala API docs (only because the documentation is a little easier to read).

Let's revisit our last two examples...

We saw the count of records and the sum of requests.

Let's do this a slightly different way...

This time with the `...sql.functions` operations.

And just for fun, let's throw in the average, minimum and maximum.

In [14]:
(pageviewsDF
  .filter("site = 'mobile'")
  .select( sum( col("requests")), count(col("requests")), avg(col("requests")), min(col("requests")), max(col("requests")) )
  .show()
)
          
(pageviewsDF
  .filter("site = 'desktop'")
  .select( sum( col("requests")), count(col("requests")), avg(col("requests")), min(col("requests")), max(col("requests")) )
  .show()
)

+-------------+---------------+------------------+-------------+-------------+
|sum(requests)|count(requests)|     avg(requests)|min(requests)|max(requests)|
+-------------+---------------+------------------+-------------+-------------+
|   4605797962|        3600000|1279.3883227777778|          645|         3292|
+-------------+---------------+------------------+-------------+-------------+

+-------------+---------------+------------------+-------------+-------------+
|sum(requests)|count(requests)|     avg(requests)|min(requests)|max(requests)|
+-------------+---------------+------------------+-------------+-------------+
|   8737180972|        3600000|2426.9947144444445|         1312|         5695|
+-------------+---------------+------------------+-------------+-------------+



And let's just address one more pet peeve...

Was that 3.6M records or 360K records?

In [15]:
(pageviewsDF
  .filter("site = 'mobile'")
  .select( 
    format_number(sum(col("requests")), 0).alias("sum"), 
    format_number(count(col("requests")), 0).alias("count"), 
    format_number(avg(col("requests")), 2).alias("avg"), 
    format_number(min(col("requests")), 0).alias("min"), 
    format_number(max(col("requests")), 0).alias("max") 
  )
  .show()
)

(pageviewsDF
  .filter("site = 'desktop'")
  .select( 
    format_number(sum(col("requests")), 0), 
    format_number(count(col("requests")), 0), 
    format_number(avg(col("requests")), 2), 
    format_number(min(col("requests")), 0), 
    format_number(max(col("requests")), 0) 
  )
  .show()
)

+-------------+---------+--------+---+-----+
|          sum|    count|     avg|min|  max|
+-------------+---------+--------+---+-----+
|4,605,797,962|3,600,000|1,279.39|645|3,292|
+-------------+---------+--------+---+-----+

+-------------------------------+---------------------------------+-------------------------------+-------------------------------+-------------------------------+
|format_number(sum(requests), 0)|format_number(count(requests), 0)|format_number(avg(requests), 2)|format_number(min(requests), 0)|format_number(max(requests), 0)|
+-------------------------------+---------------------------------+-------------------------------+-------------------------------+-------------------------------+
|                  8,737,180,972|                        3,600,000|                       2,426.99|                          1,312|                          5,695|
+-------------------------------+---------------------------------+-------------------------------+-------------------------------+-------------------------------+

In [16]:
(pageviewsDF.groupBy("site")
 .agg(
    format_number(sum(col("requests")), 0).alias("sum"),
    format_number(count(col("requests")), 0).alias("count"), 
    format_number(avg(col("requests")), 2).alias("avg"), 
    format_number(min(col("requests")), 0).alias("min"), 
    format_number(max(col("requests")), 0).alias("max") 
 ).show()
 )

+-------+-------------+---------+--------+-----+-----+
|   site|          sum|    count|     avg|  min|  max|
+-------+-------------+---------+--------+-----+-----+
| mobile|4,605,797,962|3,600,000|1,279.39|  645|3,292|
|desktop|8,737,180,972|3,600,000|2,426.99|1,312|5,695|
+-------+-------------+---------+--------+-----+-----+



In [17]:
spark.stop()