# Finding latest unique row per primary key

During a merge, events need to be compressed down to one row per id, because a single id may receive any combination of the following events: `INSERT + UPDATE * n_times + DELETE`.

We can make some assumptions:
* DELETE should always override the other two.
* UPDATE must always be more recent than any INSERT.
* Competing UPDATEs can often be ranked by `dms_timestamp` or by a modified field.
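As a rough illustration of these assumptions, here is a minimal pure-Python sketch that picks one winning event per id. The `OP_RANK` mapping and the `latest_event` helper are hypothetical names introduced only for this example. The sketch assumes the newest timestamp wins and that the operation precedence (I < U < D) only breaks ties between events sharing a timestamp.

```python
from datetime import datetime, timedelta

# Hypothetical precedence: a higher number wins when timestamps tie.
OP_RANK = {"I": 1, "U": 2, "D": 3}


def latest_event(events):
    """Return the winning (id, op, timestamp) tuple for each id."""
    winners = {}
    for event in events:
        key, op, ts = event
        current = winners.get(key)
        # Compare (timestamp, op rank): newer wins, op rank breaks ties.
        if current is None or (ts, OP_RANK[op]) > (current[2], OP_RANK[current[1]]):
            winners[key] = event
    return winners


t0 = datetime(2021, 4, 1, 12, 34, 0)
t1 = t0 + timedelta(seconds=1)
events = [(1, "I", t0), (1, "U", t0), (1, "U", t1), (1, "D", t1)]

result = latest_event(events)
print(result[1][1])  # D
```

Two UPDATEs with the same timestamp remain tied under this key, so a further tie-breaker is needed if such rows can occur.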

The base logic follows the Delta docs example: https://docs.delta.io/latest/delta-update.html#-merge-in-cdc

In [1]:
import pandas as pd
import pyspark.sql.functions as F
from datetime import datetime, timedelta
from pyspark.sql import SparkSession

In [2]:
spark = (SparkSession.builder
         .appName("LatestUniqueFinder")
         .config('spark.sql.session.timeZone', 'UTC')
         .getOrCreate())

## Generate DataSet

In [11]:
# Time Data
offset_0 = datetime(2021, 4, 1, 12, 34, 0)
offset_1 = offset_0 + timedelta(seconds=1)

# Column Names
columns=["id", "Op", "dms_timestamp", "value_a", "value_b", "value_c"]

# Rows
rows = [
    [1, "I", offset_0, 1, 1, 9],
    [1, "U", offset_0, 2, 1, 8],
    [1, "U", offset_1, 3, 1, 8],
    [1, "U", offset_1, 4, 2, 8],
    [1, "U", offset_1, 4, 1, 8],
    # [1, "D", offset_1, 7],
]

# Create
df = spark.createDataFrame(pd.DataFrame(rows, columns=columns))

# Display for Sanity Check
pd.DataFrame(rows, columns=columns)

   id Op        dms_timestamp  value_a  value_b  value_c
0   1  I  2021-04-01 12:34:00        1        1        9
1   1  U  2021-04-01 12:34:00        2        1        8
2   1  U  2021-04-01 12:34:01        3        1        8
3   1  U  2021-04-01 12:34:01        4        2        8
4   1  U  2021-04-01 12:34:01        4        1        8


## Max using GroupBy

In [14]:
# Add op_numeral: a numeric precedence for Op (I < U < D) so operations can be ranked
df_mod = (
    df
    .withColumn("op_numeral", F.when(F.col("Op") == "I", 1)
                               .when(F.col("Op") == "U", 2)
                               .when(F.col("Op") == "D", 3).cast("int"))
)

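# op_numeral and dms_temp are ranking helpers that do not exist in the target Delta table,
# so they are dropped after aggregation. Op is not in the target table either, but it is
# not dropped in this example.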
cols_to_drop = ["op_numeral", "dms_temp"]

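latest_uniques = (
    df_mod
    # Pack each row into a struct whose leading fields are the ranking keys.
    .selectExpr("id", "struct(dms_timestamp as dms_temp, op_numeral, *) as others")
    .groupBy("id")
    # max() on a struct compares fields left to right: dms_temp, then op_numeral,
    # then the remaining columns in their original order.
    .agg(F.max("others").alias("latest"))
    .select("latest.*")
    .drop(*cols_to_drop)
)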


In [15]:
latest_uniques.show()

+---+---+-------------------+-------+-------+-------+
| id| Op|      dms_timestamp|value_a|value_b|value_c|
+---+---+-------------------+-------+-------+-------+
|  1|  U|2021-04-01 12:34:01|      4|      2|      8|
+---+---+-------------------+-------+-------+-------+
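Three rows share the latest timestamp and the same operation, so it is worth spelling out why this one wins. Spark orders struct values field by field from left to right, much as Python orders tuples. The sketch below reproduces the selection in plain Python under that assumption. `OP_RANK` and `sort_key` are hypothetical names used only here and are not part of the notebook.

```python
from datetime import datetime, timedelta

t0 = datetime(2021, 4, 1, 12, 34, 0)
t1 = t0 + timedelta(seconds=1)

# Hypothetical stand-in for the op_numeral column.
OP_RANK = {"I": 1, "U": 2, "D": 3}

# Same rows as the notebook: (id, Op, dms_timestamp, value_a, value_b, value_c)
rows = [
    (1, "I", t0, 1, 1, 9),
    (1, "U", t0, 2, 1, 8),
    (1, "U", t1, 3, 1, 8),
    (1, "U", t1, 4, 2, 8),
    (1, "U", t1, 4, 1, 8),
]


def sort_key(row):
    # Mirrors struct(dms_timestamp as dms_temp, op_numeral, *):
    # timestamp first, then operation rank, then every original column.
    return (row[2], OP_RANK[row[1]], *row)


winner = max(rows, key=sort_key)
print(winner[3:])  # (4, 2, 8)
```

Because the timestamp and operation tie, the winner is decided by `value_a` and then `value_b`. That makes the result deterministic, but the choice is driven by column order and data values, not by recency. If a true ordering column exists (for example a modified field), placing it in the struct ahead of `*` would be the more meaningful tie-breaker.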

