# Window Functions in PySpark
Window functions compute a value for each row from a set of related rows (the window), which is defined by a partition and, optionally, an ordering and a frame. Unlike `groupBy`, which collapses each group into a single row, a window function keeps every input row and attaches the computed value to it.
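
The difference in output shape can be sketched in plain Python, without Spark. This is only an illustration of the idea, not how Spark evaluates it, and the `rows` data here is a made-up example rather than the notebook's DataFrame.

```python
from collections import defaultdict

# Hypothetical sample rows: (category, value)
rows = [("A", 1), ("A", 2), ("B", 5), ("B", 4)]

# Sum the values per category
totals = defaultdict(int)
for category, value in rows:
    totals[category] += value

# groupBy-style result: one output row per group
grouped = sorted(totals.items())

# Window-style result: every input row is kept, with its group total attached
windowed = [(category, value, totals[category]) for category, value in rows]

print(grouped)   # [('A', 3), ('B', 9)]
print(windowed)  # [('A', 1, 3), ('A', 2, 3), ('B', 5, 9), ('B', 4, 9)]
```

The grouped result has two rows, while the windowed result still has four.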

In [21]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

## Creating a Spark Session

In [22]:
spark = SparkSession.builder.appName('WindowFunctionsExample').getOrCreate()

## Creating a Sample DataFrame
Let's create a sample dataset to demonstrate window functions.

In [23]:
data = [
    ('A', 'X', 1, '2023-01-01'),
    ('A', 'X', 2, '2023-01-02'),
    ('A', 'Y', 3, '2023-01-01'),
    ('A', 'Y', 3, '2023-01-02'),
    ('B', 'X', 5, '2023-01-01'),
    ('B', 'X', 4, '2023-01-02'),
]

columns = ['category', 'sub_category', 'value', 'timestamp']
df = spark.createDataFrame(data, columns)
df.show()

+--------+------------+-----+----------+
|category|sub_category|value| timestamp|
+--------+------------+-----+----------+
|       A|           X|    1|2023-01-01|
|       A|           X|    2|2023-01-02|
|       A|           Y|    3|2023-01-01|
|       A|           Y|    3|2023-01-02|
|       B|           X|    5|2023-01-01|
|       B|           X|    4|2023-01-02|
+--------+------------+-----+----------+



## Defining a Window Specification
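A window specification defines how rows are grouped into partitions and how they are ordered within each partition. Here, rows are partitioned by `category` and `sub_category` and ordered by `timestamp`, then by `value`. The `timestamp` column holds strings, but ISO-formatted dates still sort chronologically.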

In [24]:
window_spec = (
    Window.partitionBy('category', 'sub_category')
    .orderBy(F.col('timestamp'), F.col('value'))
)

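As a rough mental model, partitioning and ordering can be mimicked in plain Python by sorting the rows and grouping them by the partition key. This is a sketch of the semantics only. The helper names `partition_key` and `order_key` are hypothetical, and Spark does not necessarily process partitions in sorted key order.

```python
from itertools import groupby

# Rows of the sample DataFrame, deliberately shuffled:
# (category, sub_category, value, timestamp)
rows = [
    ("B", "X", 4, "2023-01-02"),
    ("A", "Y", 3, "2023-01-02"),
    ("A", "X", 1, "2023-01-01"),
    ("B", "X", 5, "2023-01-01"),
    ("A", "X", 2, "2023-01-02"),
    ("A", "Y", 3, "2023-01-01"),
]

def partition_key(row):
    # Mirrors partitionBy('category', 'sub_category')
    return (row[0], row[1])

def order_key(row):
    # Mirrors orderBy('timestamp', 'value')
    return (row[3], row[2])

# Sorting by partition key first makes each partition contiguous for groupby
ordered = sorted(rows, key=lambda r: (partition_key(r), order_key(r)))
partitions = {key: list(part) for key, part in groupby(ordered, key=partition_key)}

for key, part in partitions.items():
    print(key, [(r[3], r[2]) for r in part])
```

Each window function in the following sections is then evaluated independently inside each of these three partitions.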

## Applying Window Functions
### 1. Row Number
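Assigns a sequential integer, starting at 1, to each row within its partition, following the window ordering. Rows that tie on the ordering columns still receive distinct numbers, in an arbitrary order.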

In [25]:
df = df.withColumn('row_number', F.row_number().over(window_spec))

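The numbering logic amounts to restarting a counter for every partition. A minimal plain-Python sketch, assuming the rows are already sorted by partition and ordering keys (the `numbered` list is a hypothetical name):

```python
from itertools import groupby

# Sample rows, already sorted by (category, sub_category) and then by timestamp:
# (category, sub_category, value, timestamp)
rows = [
    ("A", "X", 1, "2023-01-01"),
    ("A", "X", 2, "2023-01-02"),
    ("A", "Y", 3, "2023-01-01"),
    ("A", "Y", 3, "2023-01-02"),
    ("B", "X", 5, "2023-01-01"),
    ("B", "X", 4, "2023-01-02"),
]

numbered = []
for _, part in groupby(rows, key=lambda r: (r[0], r[1])):
    # The counter restarts at 1 for every partition
    for n, row in enumerate(part, start=1):
        numbered.append(row + (n,))

print([r[-1] for r in numbered])  # [1, 2, 1, 2, 1, 2]
```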

### 2. Rank
Assigns the same rank to rows that tie on the ordering columns and skips the ranks those ties would have occupied. For example, a two-way tie at rank 2 is followed by rank 4.

In [26]:
df = df.withColumn('rank', F.rank().over(window_spec))

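The sample data has no ties within a partition, so the gap behaviour is not visible in the results. The following plain-Python sketch reproduces the ranking rule on a hypothetical list of already sorted ordering keys. `rank_values` is an illustrative helper, not a PySpark function.

```python
def rank_values(sorted_keys):
    """Rank with gaps: tied keys share a rank, the next distinct key gets its row position."""
    ranks = []
    for i, key in enumerate(sorted_keys):
        if i > 0 and key == sorted_keys[i - 1]:
            ranks.append(ranks[-1])  # tie: reuse the previous rank
        else:
            ranks.append(i + 1)      # new key: rank equals the 1-based row position
    return ranks

print(rank_values([10, 20, 20, 30]))  # [1, 2, 2, 4]
```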

### 3. Dense Rank
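Like `rank()`, but the ranks stay consecutive after ties, so no gaps appear. In this sample no two rows in a partition tie on `timestamp` and `value`, so `row_number`, `rank`, and `dense_rank` produce identical columns.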

In [27]:
df = df.withColumn('dense_rank', F.dense_rank().over(window_spec))

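A plain-Python sketch of the dense ranking rule, again on a hypothetical list of already sorted ordering keys with a tie (`dense_rank_values` is an illustrative helper, not a PySpark function):

```python
def dense_rank_values(sorted_keys):
    """Rank without gaps: each new distinct key increases the rank by exactly one."""
    ranks = []
    for i, key in enumerate(sorted_keys):
        if i == 0:
            ranks.append(1)
        elif key == sorted_keys[i - 1]:
            ranks.append(ranks[-1])      # tie: reuse the previous rank
        else:
            ranks.append(ranks[-1] + 1)  # new key: next consecutive rank
    return ranks

print(dense_rank_values([10, 20, 20, 30]))  # [1, 2, 2, 3]
```

For the same keys, a ranking with gaps would end at 4 instead of 3.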

### 4. Lead and Lag Functions
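- `lead()` returns the value from the next row in the window ordering, or null if the current row is the last one in its partition.
- `lag()` returns the value from the previous row, or null if the current row is the first one in its partition.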

In [28]:
df = df.withColumn('next_value', F.lead('value').over(window_spec))
df = df.withColumn('previous_value', F.lag('value').over(window_spec))

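Both functions are essentially an index shift within the ordered partition. The sketch below uses plain Python lists and assumes a non-negative offset. The helpers `lead` and `lag` defined here are stand-ins that only mimic the behaviour described above.

```python
def lag(values, offset=1, default=None):
    # Look `offset` rows back; rows near the start of the partition get the default
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]

def lead(values, offset=1, default=None):
    # Look `offset` rows ahead; rows near the end of the partition get the default
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]

partition = [5, 4]  # values of partition (B, X), ordered by timestamp

print(lead(partition))  # [4, None]
print(lag(partition))   # [None, 5]
```

The `None` entries correspond to the `NULL` cells in the `next_value` and `previous_value` columns.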

### 5. Aggregation Functions
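Aggregate functions such as `avg` can also be evaluated over a window. Because `window_spec` includes an `orderBy`, the default frame runs from the start of the partition to the current row, so `avg_value` is a running average rather than the average of the whole partition.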

In [29]:
df = df.withColumn('avg_value', F.avg('value').over(window_spec))

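The effect of the frame can be sketched in plain Python by comparing a running average with a whole-partition average. This assumes no ties in the ordering keys, which holds for the sample data. Both helper names are hypothetical.

```python
def running_average(values):
    # Frame: start of partition up to and including the current row
    averages = []
    total = 0
    for count, value in enumerate(values, start=1):
        total += value
        averages.append(total / count)
    return averages

def partition_average(values):
    # Frame: the entire partition, so every row gets the same value
    mean = sum(values) / len(values)
    return [mean] * len(values)

partition = [5, 4]  # values of partition (B, X), ordered by timestamp

print(running_average(partition))    # [5.0, 4.5]
print(partition_average(partition))  # [4.5, 4.5]
```

In PySpark, the whole-partition variant would correspond to a window spec without `orderBy`, or to one that sets the frame explicitly with `.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)`.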

## Displaying Final Results

In [30]:
df.show()

+--------+------------+-----+----------+----------+----+----------+----------+--------------+---------+
|category|sub_category|value| timestamp|row_number|rank|dense_rank|next_value|previous_value|avg_value|
+--------+------------+-----+----------+----------+----+----------+----------+--------------+---------+
|       A|           X|    1|2023-01-01|         1|   1|         1|         2|          NULL|      1.0|
|       A|           X|    2|2023-01-02|         2|   2|         2|      NULL|             1|      1.5|
|       A|           Y|    3|2023-01-01|         1|   1|         1|         3|          NULL|      3.0|
|       A|           Y|    3|2023-01-02|         2|   2|         2|      NULL|             3|      3.0|
|       B|           X|    5|2023-01-01|         1|   1|         1|         4|          NULL|      5.0|
|       B|           X|    4|2023-01-02|         2|   2|         2|      NULL|             5|      4.5|
+--------+------------+-----+----------+----------+----+----------+----------+--------------+---------+