# Window function
Windowing is one of the grouping operations, and it is a bit trickier to salt when the groups are skewed. There is an efficient way to fix the skew, though, and we will look into it in this notebook.

In this example we are using the events table. For each row, the query needs data from the next row, assuming all rows are sorted by event date. To get it we use a window function: over a window of `[row, row + 1]` we calculate `LEAD(event_date)`.
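
To make the semantics of `LEAD` concrete, here is a minimal plain-Python model of it (not Spark code). The function name `lead_per_user` and the tiny sample data are made up for illustration; event dates are plain integers to keep the sketch short.

```python
from collections import defaultdict


def lead_per_user(events):
    """Model of LEAD(event_date) partitioned by user and ordered by event date.

    `events` is a list of (user_id, event_date) pairs (hypothetical sample shape).
    Returns (user_id, event_date, next_event_date) triples; the last event
    of each user has no successor, so its next_event_date is None.
    """
    by_user = defaultdict(list)
    for user_id, event_date in events:
        by_user[user_id].append(event_date)

    result = []
    for user_id, dates in by_user.items():
        dates.sort()
        # Pair every date with the one that follows it; pad the end with None.
        for current, following in zip(dates, dates[1:] + [None]):
            result.append((user_id, current, following))
    return result


sample = [("a", 30), ("a", 10), ("b", 5), ("a", 20)]
leads = lead_per_user(sample)
for row in leads:
    print(row)
```

A user with a very large number of events ends up in one big group here, which is the skew problem the rest of the notebook deals with.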

## The goal
The current implementation needs tweaking. You are welcome to experiment and try any of your own ideas. However, if you need some inspiration, try following this plan:
1. Bucket the input data into N buckets. You can use `F.unix_timestamp` to turn the `event_date` column into epoch seconds, which are easy to bucket.
2. Run the window function. You will (or at least should) see gaps at the edges of each bucket, where the last row of a bucket cannot see the next row.
3. Identify the problematic rows and address them separately. This leaves you with far less data than the full dataset.
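
The plan above can be modelled in a few lines of plain Python (again, not Spark code) to show where the gaps come from. This sketch assumes a single user, integer event dates, and fixed-width buckets computed as `event_date // bucket_size`; the names `bucketed_lead` and `bucket_size` are illustrative only.

```python
def lead(sorted_dates):
    """Pair each date with its successor; the last one gets None."""
    return list(zip(sorted_dates, sorted_dates[1:] + [None]))


def bucketed_lead(event_dates, bucket_size):
    """Compute the lead separately inside each fixed-width bucket."""
    buckets = {}
    for date in sorted(event_dates):
        buckets.setdefault(date // bucket_size, []).append(date)

    rows = []
    for bucket_id in sorted(buckets):
        for current, following in lead(buckets[bucket_id]):
            rows.append((bucket_id, current, following))
    return rows


event_dates = [1, 4, 9, 12, 15, 23, 27]
rows = bucketed_lead(event_dates, bucket_size=10)

# Rows whose lead is None: one per bucket, always the last row of the bucket.
gaps = [row for row in rows if row[2] is None]
print(gaps)
```

Only the gap in the final bucket is genuine. The other two rows do have a next event, but it lives in the following bucket, so the bucketed window cannot see it.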

## Mending the holes
1. For each bucket, pick the rows on the border (edge cases, literally): the first one and the last one.
2. Recompute `LEAD(event_date)` for those rows only, without bucketing. This way the event at the end of each bucket will see the first event of the next bucket, and so gets the correct date instead of `NULL`.
3. With the last row of each bucket correctly computed, purge the first rows: they now have an incorrect lead value. That does not matter, because the bucketed dataset already has them correct.
4. Merge the main dataset with the fixes.
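
The four steps above can be sketched end to end in plain Python (a model of the logic, not the Spark solution). It assumes a single user, distinct integer event dates, and fixed-width buckets; `lead_with_mended_buckets` and `bucket_size` are hypothetical names. A bucket holding a single row is treated as a last row, so it is kept rather than purged.

```python
def lead(sorted_dates):
    """Pair each date with its successor; the last one gets None."""
    return list(zip(sorted_dates, sorted_dates[1:] + [None]))


def lead_with_mended_buckets(event_dates, bucket_size):
    buckets = {}
    for date in sorted(event_dates):
        buckets.setdefault(date // bucket_size, []).append(date)

    main = []   # bucketed results, minus the last row of every bucket
    edges = []  # (event_date, is_last_in_bucket) border rows, in date order
    for bucket_id in sorted(buckets):
        dates = buckets[bucket_id]
        main.extend(lead(dates)[:-1])
        # Step 1: pick the first and the last row of the bucket.
        if len(dates) > 1:
            edges.append((dates[0], False))
        edges.append((dates[-1], True))

    # Step 2: recompute the lead over the border rows only, without bucketing.
    recomputed = lead([date for date, _ in edges])

    # Step 3: keep the last rows, purge the first rows (their lead is wrong here).
    fixes = [row for row, (_, is_last) in zip(recomputed, edges) if is_last]

    # Step 4: merge the main dataset with the fixes.
    return sorted(main + fixes, key=lambda row: row[0])


event_dates = [1, 4, 9, 12, 15, 23, 27]
mended = lead_with_mended_buckets(event_dates, bucket_size=10)
expected = lead(sorted(event_dates))
print(mended == expected)
```

In step 2 the first row of bucket 0 (date 1) is paired with 9, although its true successor is 4. That is the incorrect value step 3 throws away.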

In [0]:
import pyspark.sql.functions as F
from pyspark.sql import Window

# We are using the events dataset: 1B events associated with 100k users, skewed.
INPUT_PATH = "/Volumes/tantusdata_playground/default/bde-2023/input/events/events.parquet"

# Volume path - make sure you have created a volume for yourself.
VOLUME_PATH = "/Volumes/tantusdata_playground/default/test-user-001"

# Pick any path you like within your volume.
WRITE_PATH = f"{VOLUME_PATH}/04-window/events.parquet"

In [0]:
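events = spark.read.parquet(INPUT_PATH)

# One window per user, ordered by event time. Skewed users produce very large windows.
window = Window.partitionBy("userID").orderBy("eventTimestamp")

# For each event, fetch the timestamp of the same user's next event (NULL for the last one).
windowed_events = events.withColumn("nextEvent", F.lead("eventTimestamp", 1).over(window))

windowed_events.write.parquet(WRITE_PATH)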