<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Dataset" data-toc-modified-id="Dataset-1">Dataset</a></span></li><li><span><a href="#Summarizing-data-with-over" data-toc-modified-id="Summarizing-data-with-over-2">Summarizing data with <code>over</code></a></span><ul class="toc-item"><li><span><a href="#Q:-When-was-the-lowest-temperature-recorded-each-year?" data-toc-modified-id="Q:-When-was-the-lowest-temperature-recorded-each-year?-2.1">Q: <strong>When</strong> was the lowest temperature recorded each year?</a></span></li><li><span><a href="#Using-a-window-function" data-toc-modified-id="Using-a-window-function-2.2">Using a window function</a></span></li></ul></li><li><span><a href="#Ranking-functions" data-toc-modified-id="Ranking-functions-3">Ranking functions</a></span><ul class="toc-item"><li><span><a href="#rank-&amp;-dense_rank" data-toc-modified-id="rank-&amp;-dense_rank-3.1"><code>rank</code> &amp; <code>dense_rank</code></a></span></li><li><span><a href="#percent_rank" data-toc-modified-id="percent_rank-3.2"><code>percent_rank</code></a></span></li><li><span><a href="#ntile()" data-toc-modified-id="ntile()-3.3"><code>ntile()</code></a></span></li><li><span><a href="#row_number()" data-toc-modified-id="row_number()-3.4"><code>row_number()</code></a></span></li></ul></li><li><span><a href="#Analytic-functions:-looking-back-and-ahead" data-toc-modified-id="Analytic-functions:-looking-back-and-ahead-4">Analytic functions: looking back and ahead</a></span><ul class="toc-item"><li><span><a href="#lag-and-lead" data-toc-modified-id="lag-and-lead-4.1"><code>lag</code> and <code>lead</code></a></span></li><li><span><a href="#cume_dist()" data-toc-modified-id="cume_dist()-4.2"><code>cume_dist()</code></a></span></li></ul></li></ul></div>

# Window functions

## Dataset

We will use the National Oceanic and Atmospheric Administration (NOAA) Global Surface Summary of the Day (GSOD) dataset.

In [6]:
# Setup

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
import pyspark.sql.functions as F


conf = SparkConf()
conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Read from local parquet
gsod = spark.read.parquet("data/gsod_noaa/gsod*.parquet")

## Summarizing data with `over`

### Q: __When__ was the lowest temperature recorded each year?

In [13]:
# Using vanilla groupBy, we can get the lowest temperature but not when.

coldest_temp = gsod.groupby("year").agg(F.min("temp").alias("temp"))
coldest_temp.orderBy("temp").show()


# Use a left-semi self-join to get the "when".
# Self-joins are generally an anti-pattern because they are slow.

coldest_when = gsod.join(coldest_temp, how="left_semi", on=["year", "temp"]) \
                   .select("stn", "year", "mo", "da", "temp")
coldest_when.orderBy("year", "mo", "da").show()

+----+------+
|year|  temp|
+----+------+
|2019|-114.7|
|2017|-114.7|
|2012|-113.5|
|2018|-113.5|
|2016|-111.7|
|2013|-110.7|
|2010|-110.7|
|2014|-110.5|
|2015|-110.2|
|2011|-106.8|
|2020|-105.0|
+----+------+

+------+----+---+---+------+
|   stn|year| mo| da|  temp|
+------+----+---+---+------+
|896060|2010| 06| 03|-110.7|
|896060|2011| 05| 19|-106.8|
|896060|2012| 06| 11|-113.5|
|895770|2013| 07| 31|-110.7|
|896060|2014| 08| 20|-110.5|
|895360|2015| 07| 12|-110.2|
|896060|2015| 08| 21|-110.2|
|896060|2015| 08| 27|-110.2|
|896060|2016| 07| 11|-111.7|
|896250|2017| 06| 20|-114.7|
|896060|2018| 08| 27|-113.5|
|895770|2019| 06| 15|-114.7|
|896060|2020| 08| 11|-105.0|
|896250|2020| 08| 13|-105.0|
+------+----+---+---+------+



In [8]:
# Using a window function instead

from pyspark.sql.window import Window

# To partition according to the values of one or more columns, 
# we pass the column name (or a Column object) to the partitionBy() method.
each_year = Window.partitionBy("year")

# Window is a builder class, just like SparkSession.builder
print(each_year)

<pyspark.sql.window.WindowSpec object at 0x139ed3250>


### Using a window function

- `F.min("temp").over(each_year)` computes the minimum temperature within each year's partition, rather than over the entire data frame.
- Unlike `groupBy`, the window does not collapse rows: every row receives the minimum temperature for its year in a new `min_temp` column. Filtering on `temp = min_temp` then keeps only the rows that recorded that minimum.
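The two steps described above can be mimicked in plain Python, without Spark, to make the semantics concrete. This is only an illustrative sketch: the sample `rows` and the helper name `min_over_partition` are made up for the example and are not part of PySpark.

```python
from collections import defaultdict

# Hypothetical sample rows (year, stn, temp); values are made up for illustration.
rows = [
    ("2019", "A", -10.0),
    ("2019", "B", -12.5),
    ("2020", "A", -8.0),
    ("2020", "C", -8.0),
    ("2020", "B", -3.0),
]


def min_over_partition(rows):
    """Attach each year's minimum temp to every row of that year."""
    min_by_year = defaultdict(lambda: float("inf"))
    for year, _, temp in rows:
        min_by_year[year] = min(min_by_year[year], temp)
    # Unlike a groupBy, no rows are collapsed: each row keeps its own columns.
    return [(year, stn, temp, min_by_year[year]) for year, stn, temp in rows]


with_min = min_over_partition(rows)
# Equivalent of .where("temp = min_temp"): all tied rows are kept.
coldest = [(year, stn, temp) for year, stn, temp, min_temp in with_min if temp == min_temp]
print(coldest)
```

Because the filter compares every row against its partition's minimum, years with ties (here the hypothetical 2020) return more than one row, just as 2015 and 2020 do in the real output.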

In [9]:
# Apply the each_year window spec with over()

gsod.withColumn("min_temp", F.min("temp").over(each_year)).where(
    "temp = min_temp"
).select("year", "mo", "da", "stn", "temp").orderBy(
    "year", "mo", "da"
).show()

+----+---+---+------+------+
|year| mo| da|   stn|  temp|
+----+---+---+------+------+
|2010| 06| 03|896060|-110.7|
|2011| 05| 19|896060|-106.8|
|2012| 06| 11|896060|-113.5|
|2013| 07| 31|895770|-110.7|
|2014| 08| 20|896060|-110.5|
|2015| 07| 12|895360|-110.2|
|2015| 08| 21|896060|-110.2|
|2015| 08| 27|896060|-110.2|
|2016| 07| 11|896060|-111.7|
|2017| 06| 20|896250|-114.7|
|2018| 08| 27|896060|-113.5|
|2019| 06| 15|895770|-114.7|
|2020| 08| 11|896060|-105.0|
|2020| 08| 13|896250|-105.0|
+----+---+---+------+------+



Bonus:
- `partitionBy()` can be used on more than one column
- You can also directly use a window function inside a `select`:

In [10]:
# Using window function inside a select
gsod.select(
    "year",
    "mo",
    "da",
    "stn",
    "temp",
    F.min("temp").over(each_year).alias("min_temp"),
).where("temp = min_temp").drop("min_temp").orderBy(
    "year", "mo", "da"
).show()

+----+---+---+------+------+
|year| mo| da|   stn|  temp|
+----+---+---+------+------+
|2010| 06| 03|896060|-110.7|
|2011| 05| 19|896060|-106.8|
|2012| 06| 11|896060|-113.5|
|2013| 07| 31|895770|-110.7|
|2014| 08| 20|896060|-110.5|
|2015| 07| 12|895360|-110.2|
|2015| 08| 21|896060|-110.2|
|2015| 08| 27|896060|-110.2|
|2016| 07| 11|896060|-111.7|
|2017| 06| 20|896250|-114.7|
|2018| 08| 27|896060|-113.5|
|2019| 06| 15|895770|-114.7|
|2020| 08| 11|896060|-105.0|
|2020| 08| 13|896250|-105.0|
+----+---+---+------+------+



## Ranking functions

- Rank functions rank records based on the value of a field.
- Functions: `rank()`, `dense_rank()`, `percent_rank()`, `ntile()` and `row_number()`

In [16]:
# Load lightweight dataset
gsod_light = spark.read.parquet("data/Window/gsod_light.parquet")

In [17]:
# Inspect
gsod_light.printSchema()
gsod_light.show()

root
 |-- stn: string (nullable = true)
 |-- year: string (nullable = true)
 |-- mo: string (nullable = true)
 |-- da: string (nullable = true)
 |-- temp: double (nullable = true)
 |-- count_temp: long (nullable = true)

+------+----+---+---+----+----------+
|   stn|year| mo| da|temp|count_temp|
+------+----+---+---+----+----------+
|994979|2017| 12| 11|21.3|        21|
|998012|2017| 03| 02|31.4|        24|
|719200|2017| 10| 09|60.5|        11|
|917350|2018| 04| 21|82.6|         9|
|076470|2018| 06| 07|65.0|        24|
|996470|2018| 03| 12|55.6|        12|
|041680|2019| 02| 19|16.1|        15|
|949110|2019| 11| 23|54.9|        14|
|998252|2019| 04| 18|44.7|        11|
|998166|2019| 03| 20|34.8|        12|
+------+----+---+---+----+----------+



### `rank` & `dense_rank`
- `rank` gives Olympic ranking: tied records share a rank, and the next rank is offset by the number of ties, so ranks can be non-consecutive (for example 1, 1, 3).
- `dense_rank` ranks consecutively: ties still share a rank, but there won’t be any gap between the ranks (for example 1, 1, 2). Useful when you just want a gap-free position over a window.
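The difference between the two rankings can be sketched in plain Python. This is not how Spark implements them; `rank_and_dense_rank` is a hypothetical helper written only to show the arithmetic on one ordered window.

```python
def rank_and_dense_rank(values):
    """Return (value, rank, dense_rank) tuples for values sorted ascending."""
    ordered = sorted(values)
    result = []
    rank = dense = 0
    previous = object()  # sentinel that never equals a real value
    for position, value in enumerate(ordered, start=1):
        if value != previous:
            rank = position  # rank jumps to the row position after a tie
            dense += 1       # dense_rank only ever moves up by one
            previous = value
        result.append((value, rank, dense))
    return result


# count_temp values for month "03" in gsod_light: two rows tie at 12.
print(rank_and_dense_rank([12, 12, 24]))
# [(12, 1, 1), (12, 1, 1), (24, 3, 2)]
```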


In [31]:
# Create two new windows, partitioned by year and by month,
# both ordered by the number of temperature readings
temp_per_year_asc = Window.partitionBy("year").orderBy("count_temp")
temp_per_month_asc = Window.partitionBy("mo").orderBy("count_temp")


# Using rank() with the window, we get the rank according to the value of the count_temp column
print("Using rank()")
gsod_light.withColumn("rank_tpm", F.rank().over(temp_per_month_asc)).show()


# Using dense_rank() instead to get consecutive ranking by month
print("Using dense_rank()")
gsod_light.withColumn("rank_tpm", F.dense_rank().over(temp_per_month_asc)).show()

Using rank()
+------+----+---+---+----+----------+--------+
|   stn|year| mo| da|temp|count_temp|rank_tpm|
+------+----+---+---+----+----------+--------+
|949110|2019| 11| 23|54.9|        14|       1|
|996470|2018| 03| 12|55.6|        12|       1|
|998166|2019| 03| 20|34.8|        12|       1|
|998012|2017| 03| 02|31.4|        24|       3|
|041680|2019| 02| 19|16.1|        15|       1|
|076470|2018| 06| 07|65.0|        24|       1|
|719200|2017| 10| 09|60.5|        11|       1|
|994979|2017| 12| 11|21.3|        21|       1|
|917350|2018| 04| 21|82.6|         9|       1|
|998252|2019| 04| 18|44.7|        11|       2|
+------+----+---+---+----+----------+--------+

Using dense_rank()
+------+----+---+---+----+----------+--------+
|   stn|year| mo| da|temp|count_temp|rank_tpm|
+------+----+---+---+----+----------+--------+
|949110|2019| 11| 23|54.9|        14|       1|
|996470|2018| 03| 12|55.6|        12|       1|
|998166|2019| 03| 20|34.8|        12|       1|
|998012|2017| 03| 02|31.4| 

### `percent_rank`

For every window, `percent_rank()` computes a percentage rank between 0 and 1 based on the ordered value.

formula = (# records with a lower value than the current record) / (# records in the window - 1)
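As a quick check of the formula, here is a plain-Python sketch applied to the four 2019 temperatures from `gsod_light`. The helper `percent_rank` below is a hypothetical stand-in for illustration, not the PySpark function.

```python
def percent_rank(values):
    """For each value: (# strictly lower values) / (window size - 1)."""
    n = len(values)
    if n == 1:
        return [0.0]  # a single-row window has nothing to compare against
    return [sum(other < v for other in values) / (n - 1) for v in values]


temps_2019 = [16.1, 34.8, 44.7, 54.9]
for temp, pr in zip(temps_2019, percent_rank(temps_2019)):
    print(temp, pr)
```

The second record has one lower value in a window of four, so it gets 1 / (4 - 1), which is the 0.333... seen in the Spark output for 2019.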

In [34]:
temp_each_year = each_year.orderBy("temp")


gsod_light.withColumn("rank_tpm", F.percent_rank().over(temp_each_year)).show()

+------+----+---+---+----+----------+------------------+
|   stn|year| mo| da|temp|count_temp|          rank_tpm|
+------+----+---+---+----+----------+------------------+
|041680|2019| 02| 19|16.1|        15|               0.0|
|998166|2019| 03| 20|34.8|        12|0.3333333333333333|
|998252|2019| 04| 18|44.7|        11|0.6666666666666666|
|949110|2019| 11| 23|54.9|        14|               1.0|
|994979|2017| 12| 11|21.3|        21|               0.0|
|998012|2017| 03| 02|31.4|        24|               0.5|
|719200|2017| 10| 09|60.5|        11|               1.0|
|996470|2018| 03| 12|55.6|        12|               0.0|
|076470|2018| 06| 07|65.0|        24|               0.5|
|917350|2018| 04| 21|82.6|         9|               1.0|
+------+----+---+---+----+----------+------------------+



### `ntile()`

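`ntile(n)` splits the ordered records of each window into `n` buckets that are as even as possible and returns the bucket number (1 to `n`) of each record.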

![](notes/img/ntile.png)
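The bucket assignment can be sketched in plain Python. The helper below is hypothetical and assumes the usual SQL convention that, when rows do not divide evenly, the earlier buckets receive the extra rows.

```python
def ntile(n, row_count):
    """Bucket number (1..n) for each of row_count ordered rows."""
    base, extra = divmod(row_count, n)
    buckets = []
    for bucket in range(1, n + 1):
        # The first `extra` buckets take one additional row.
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets


print(ntile(2, 4))  # a year with four rows  -> [1, 1, 2, 2]
print(ntile(2, 3))  # a year with three rows -> [1, 1, 2]
```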

In [35]:
gsod_light.withColumn("rank_tpm", F.ntile(2).over(temp_each_year)).show()

+------+----+---+---+----+----------+--------+
|   stn|year| mo| da|temp|count_temp|rank_tpm|
+------+----+---+---+----+----------+--------+
|041680|2019| 02| 19|16.1|        15|       1|
|998166|2019| 03| 20|34.8|        12|       1|
|998252|2019| 04| 18|44.7|        11|       2|
|949110|2019| 11| 23|54.9|        14|       2|
|994979|2017| 12| 11|21.3|        21|       1|
|998012|2017| 03| 02|31.4|        24|       1|
|719200|2017| 10| 09|60.5|        11|       2|
|996470|2018| 03| 12|55.6|        12|       1|
|076470|2018| 06| 07|65.0|        24|       1|
|917350|2018| 04| 21|82.6|         9|       2|
+------+----+---+---+----+----------+--------+



### `row_number()`

Given an ordered window, `row_number()` assigns an increasing sequence number to each record, regardless of ties.
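A minimal plain-Python sketch of the idea, using a hypothetical helper `row_numbers`: after sorting, each record simply gets the next integer, so tied values receive different numbers where `rank` would repeat one.

```python
def row_numbers(values):
    """Pair each value, sorted ascending, with a 1-based sequence number."""
    return [(value, number) for number, value in enumerate(sorted(values), start=1)]


# Two tied values still get distinct numbers.
print(row_numbers([12, 24, 12]))
# [(12, 1), (12, 2), (24, 3)]
```

In Spark, which of two tied records gets the lower number is not guaranteed unless the window's `orderBy` includes a tie-breaking column.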

In [39]:
gsod_light.withColumn("row_number", F.row_number().over(temp_each_year)).show()

+------+----+---+---+----+----------+----------+
|   stn|year| mo| da|temp|count_temp|row_number|
+------+----+---+---+----+----------+----------+
|041680|2019| 02| 19|16.1|        15|         1|
|998166|2019| 03| 20|34.8|        12|         2|
|998252|2019| 04| 18|44.7|        11|         3|
|949110|2019| 11| 23|54.9|        14|         4|
|994979|2017| 12| 11|21.3|        21|         1|
|998012|2017| 03| 02|31.4|        24|         2|
|719200|2017| 10| 09|60.5|        11|         3|
|996470|2018| 03| 12|55.6|        12|         1|
|076470|2018| 06| 07|65.0|        24|         2|
|917350|2018| 04| 21|82.6|         9|         3|
+------+----+---+---+----+----------+----------+



In [38]:
# Creating a window with a descending ordered column

temp_per_month_desc = Window.partitionBy("mo").orderBy(F.col("count_temp").desc())

gsod_light.withColumn("row_number", F.row_number().over(temp_per_month_desc)).show()

+------+----+---+---+----+----------+----------+
|   stn|year| mo| da|temp|count_temp|row_number|
+------+----+---+---+----+----------+----------+
|949110|2019| 11| 23|54.9|        14|         1|
|998012|2017| 03| 02|31.4|        24|         1|
|996470|2018| 03| 12|55.6|        12|         2|
|998166|2019| 03| 20|34.8|        12|         3|
|041680|2019| 02| 19|16.1|        15|         1|
|076470|2018| 06| 07|65.0|        24|         1|
|719200|2017| 10| 09|60.5|        11|         1|
|994979|2017| 12| 11|21.3|        21|         1|
|998252|2019| 04| 18|44.7|        11|         1|
|917350|2018| 04| 21|82.6|         9|         2|
+------+----+---+---+----+----------+----------+



## Analytic functions: looking back and ahead


### `lag` and `lead`

> The two most important functions of the analytics functions family are called `lag(col, n=1, default=None)` and `lead(col, n=1, default=None)`, which will give you the value of the col column of the n-th record before and after the record you’re over, respectively.
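To make the offsets concrete, here is a plain-Python sketch over a single ordered window. The `lag` and `lead` helpers below are hypothetical list-based stand-ins, not the PySpark functions, and assume a non-negative offset `n`.

```python
def lag(values, n=1, default=None):
    """Value n positions before each element, or default if there is none."""
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]


def lead(values, n=1, default=None):
    """Value n positions after each element, or default if there is none."""
    return [values[i + n] if i + n < len(values) else default for i in range(len(values))]


temps_2019 = [16.1, 34.8, 44.7, 54.9]  # 2019 rows, already ordered by temp
print(lag(temps_2019))      # [None, 16.1, 34.8, 44.7]
print(lag(temps_2019, 2))   # [None, None, 16.1, 34.8]
print(lead(temps_2019))     # [34.8, 44.7, 54.9, None]
```

Records near the edge of the window have no neighbour at the requested offset, which is why they receive the default (`None` here, `null` in Spark).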

In [50]:
# Get temp of previous two records using lag()

print("Temp of previous two records over each year")
gsod_light.withColumn(
    "previous_temp", F.lag("temp").over(temp_each_year)
).withColumn(
    "previous_temp_2", F.lag("temp", 2).over(temp_each_year)
).show()


print("Temp delta of previous record over each year")
gsod_light.withColumn(
    "previous_temp_delta", F.round(F.col("temp") - F.lag("temp").over(temp_each_year), 2)
).select(["year", "mo", "temp", "previous_temp_delta"]).show()

Temp of previous two records over each year
+------+----+---+---+----+----------+-------------+---------------+
|   stn|year| mo| da|temp|count_temp|previous_temp|previous_temp_2|
+------+----+---+---+----+----------+-------------+---------------+
|041680|2019| 02| 19|16.1|        15|         null|           null|
|998166|2019| 03| 20|34.8|        12|         16.1|           null|
|998252|2019| 04| 18|44.7|        11|         34.8|           16.1|
|949110|2019| 11| 23|54.9|        14|         44.7|           34.8|
|994979|2017| 12| 11|21.3|        21|         null|           null|
|998012|2017| 03| 02|31.4|        24|         21.3|           null|
|719200|2017| 10| 09|60.5|        11|         31.4|           21.3|
|996470|2018| 03| 12|55.6|        12|         null|           null|
|076470|2018| 06| 07|65.0|        24|         55.6|           null|
|917350|2018| 04| 21|82.6|         9|         65.0|           55.6|
+------+----+---+---+----+----------+-------------+---------------+

Tem

### `cume_dist()`

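- Provides the cumulative distribution rather than a ranking. Useful for exploratory data analysis (EDA) of a variable's cumulative distribution.
- For each record it returns the cumulative distribution function `F(x)`: the fraction of records in the window whose value is less than or equal to the current one.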
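The definition can be checked with a short plain-Python sketch on the four 2019 temperatures from `gsod_light`. The helper `cume_dist` here is a hypothetical stand-in for illustration, not the PySpark function.

```python
def cume_dist(values):
    """Fraction of rows in the window whose value is <= the current value."""
    n = len(values)
    return [sum(other <= v for other in values) / n for v in values]


temps_2019 = [16.1, 34.8, 44.7, 54.9]
print(cume_dist(temps_2019))
# [0.25, 0.5, 0.75, 1.0]
```

Unlike `percent_rank`, the lowest record does not start at 0: it counts itself, so it gets 1 / 4 in a window of four.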

In [52]:
print("Percent rank vs. Cumulative distribution of temperature over each year")
gsod_light.withColumn(
    "percent_rank", F.percent_rank().over(temp_each_year)
).withColumn("cume_dist", F.cume_dist().over(temp_each_year)).show()

Percent rank vs. Cumulative distribution of temperature over each year
+------+----+---+---+----+----------+------------------+------------------+
|   stn|year| mo| da|temp|count_temp|      percent_rank|         cume_dist|
+------+----+---+---+----+----------+------------------+------------------+
|041680|2019| 02| 19|16.1|        15|               0.0|              0.25|
|998166|2019| 03| 20|34.8|        12|0.3333333333333333|               0.5|
|998252|2019| 04| 18|44.7|        11|0.6666666666666666|              0.75|
|949110|2019| 11| 23|54.9|        14|               1.0|               1.0|
|994979|2017| 12| 11|21.3|        21|               0.0|0.3333333333333333|
|998012|2017| 03| 02|31.4|        24|               0.5|0.6666666666666666|
|719200|2017| 10| 09|60.5|        11|               1.0|               1.0|
|996470|2018| 03| 12|55.6|        12|               0.0|0.3333333333333333|
|076470|2018| 06| 07|65.0|        24|               0.5|0.6666666666666666|
|917350|2018| 04|