# Window Functions in PySpark

Window functions perform calculations across a set of rows related to the current row, within a specified partition. Unlike `groupBy` aggregations, window functions do not reduce the number of rows in the result; instead, they compute a value for each row based on its window.




In [4]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # namespace used by the F.* snippets below
# Explicit names used by the notebook cells; avoids shadowing built-ins such as sum/min/max
from pyspark.sql.functions import avg, col, dense_rank, lag, lead, rank, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()


## Window Specification
A window specification defines how the rows will be grouped (partitioned) and ordered within each group.

**Example:**
```
window_spec = Window.partitionBy("category").orderBy("timestamp")
```

```
window_spec = Window.partitionBy("category", "sub_category").orderBy(col("timestamp"), col("score"))
```

## List of Window Functions

* Row Number: row_number()
Assigns a unique, sequential number to each row within its partition, starting at 1.
```
df = df.withColumn("row_number", F.row_number().over(window_spec))
```

* Rank: rank()
Assigns the same rank to rows that tie on the ordering columns. The rank after a tie skips ahead, leaving a gap.
```
df = df.withColumn("rank", F.rank().over(window_spec))
```

* Dense Rank: dense_rank()
Similar to rank(), but does not leave gaps in the ranking.
```
df = df.withColumn("dense_rank", F.dense_rank().over(window_spec))
```

* Lead and Lag Functions: lead(), lag()
  * **lead()**: Returns the value of the next row within the window.
  * **lag()**: Returns the value of the previous row within the window.

```
df = df.withColumn("next_value", F.lead("value").over(window_spec))

df = df.withColumn("previous_value", F.lag("value").over(window_spec))
```


* Aggregation Functions
Standard aggregate functions can also be applied over a window, so each row receives the aggregated value for its window.
  * Sum: F.sum("column_name").over(window_spec)
  * Min: F.min("column_name").over(window_spec)
  * Max: F.max("column_name").over(window_spec)
  * Avg: F.avg("column_name").over(window_spec)

```
df = df.withColumn("avg_value", F.avg("value").over(window_spec))
```






In [5]:
# Create a DataFrame with sample data
data = [
    ("A", "X", 1, "2023-01-01"),
    ("A", "X", 2, "2023-01-02"),
    ("A", "Y", 3, "2023-01-01"),
    ("A", "Y", 3, "2023-01-02"),
    ("B", "X", 5, "2023-01-01"),
    ("B", "X", 4, "2023-01-02"),
    ]
columns = ["category", "sub_category", "value", "timestamp"]
df = spark.createDataFrame(data, columns)
df.show()


+--------+------------+-----+----------+
|category|sub_category|value| timestamp|
+--------+------------+-----+----------+
|       A|           X|    1|2023-01-01|
|       A|           X|    2|2023-01-02|
|       A|           Y|    3|2023-01-01|
|       A|           Y|    3|2023-01-02|
|       B|           X|    5|2023-01-01|
|       B|           X|    4|2023-01-02|
+--------+------------+-----+----------+



In [11]:
# Define the window specification
window_spec = Window.partitionBy("category", "sub_category").orderBy(col("timestamp"), col("value"))

# Apply the window functions
df = df.withColumn("row_number", row_number().over(window_spec))\
      .withColumn("rank", rank().over(window_spec))\
      .withColumn("dense_rank", dense_rank().over(window_spec))\
      .withColumn("next_value", lead("value").over(window_spec))\
      .withColumn("previous_value", lag("value").over(window_spec))\
      .withColumn("avg_value", avg("value").over(window_spec))

df.show()


+--------+------------+-----+----------+----------+----+----------+----------+--------------+---------+
|category|sub_category|value| timestamp|row_number|rank|dense_rank|next_value|previous_value|avg_value|
+--------+------------+-----+----------+----------+----+----------+----------+--------------+---------+
|       A|           X|    1|2023-01-01|         1|   1|         1|         2|          NULL|      1.0|
|       A|           X|    2|2023-01-02|         2|   2|         2|      NULL|             1|      1.5|
|       A|           Y|    3|2023-01-01|         1|   1|         1|         3|          NULL|      3.0|
|       A|           Y|    3|2023-01-02|         2|   2|         2|      NULL|             3|      3.0|
|       B|           X|    5|2023-01-01|         1|   1|         1|         4|          NULL|      5.0|
|       B|           X|    4|2023-01-02|         2|   2|         2|      NULL|             5|      4.5|
+--------+------------+-----+----------+----------+----+----------+----------+--------------+---------+

## Window Functions in PySpark Part 2

In [13]:
# Sample data
data = [
    ("Alice", 100),
    ("Bob", 200),
    ("Charlie", 200),
    ("David", 300),
    ("Eve", 400),
    ("Frank", 500),
    ("Grace", 500),
    ("Hank", 600),
    ("Ivy", 700),
    ("Jack", 800)
]
columns = ["Name", "Score"]

# Create a DataFrame
df = spark.createDataFrame(data, columns)
df.show()



+-------+-----+
|   Name|Score|
+-------+-----+
|  Alice|  100|
|    Bob|  200|
|Charlie|  200|
|  David|  300|
|    Eve|  400|
|  Frank|  500|
|  Grace|  500|
|   Hank|  600|
|    Ivy|  700|
|   Jack|  800|
+-------+-----+



### Define a window specification

In [14]:
# No partitionBy: Spark moves all rows into a single partition (fine for small data)
window_spec = Window.orderBy("Score")

### Using rank() to calculate rank

In [15]:
df1 = df.withColumn("Rank", rank().over(window_spec))
df1.show()

+-------+-----+----+
|   Name|Score|Rank|
+-------+-----+----+
|  Alice|  100|   1|
|    Bob|  200|   2|
|Charlie|  200|   2|
|  David|  300|   4|
|    Eve|  400|   5|
|  Frank|  500|   6|
|  Grace|  500|   6|
|   Hank|  600|   8|
|    Ivy|  700|   9|
|   Jack|  800|  10|
+-------+-----+----+



### Using dense_rank() to calculate dense rank

In [16]:
df2 = df.withColumn("DenseRank", dense_rank().over(window_spec))
df2.show()

+-------+-----+---------+
|   Name|Score|DenseRank|
+-------+-----+---------+
|  Alice|  100|        1|
|    Bob|  200|        2|
|Charlie|  200|        2|
|  David|  300|        3|
|    Eve|  400|        4|
|  Frank|  500|        5|
|  Grace|  500|        5|
|   Hank|  600|        6|
|    Ivy|  700|        7|
|   Jack|  800|        8|
+-------+-----+---------+



### Using row_number() to calculate row number

In [19]:
df3 = df.withColumn("Row_Number", row_number().over(window_spec))
df3.show()

+-------+-----+----------+
|   Name|Score|Row_Number|
+-------+-----+----------+
|  Alice|  100|         1|
|    Bob|  200|         2|
|Charlie|  200|         3|
|  David|  300|         4|
|    Eve|  400|         5|
|  Frank|  500|         6|
|  Grace|  500|         7|
|   Hank|  600|         8|
|    Ivy|  700|         9|
|   Jack|  800|        10|
+-------+-----+----------+



### Using lead() to calculate the difference with the next row

In [21]:
df4 = df.withColumn("ScoreDifferenceWithNext", lead("Score").over(window_spec) - df["Score"])
df4.show()

+-------+-----+-----------------------+
|   Name|Score|ScoreDifferenceWithNext|
+-------+-----+-----------------------+
|  Alice|  100|                    100|
|    Bob|  200|                      0|
|Charlie|  200|                    100|
|  David|  300|                    100|
|    Eve|  400|                    100|
|  Frank|  500|                      0|
|  Grace|  500|                    100|
|   Hank|  600|                    100|
|    Ivy|  700|                    100|
|   Jack|  800|                   NULL|
+-------+-----+-----------------------+



### Using lag() to calculate the difference with the previous row

In [26]:
df5 = df.withColumn("ScoreDifferenceWithPrevious", df["Score"] - lag("Score").over(window_spec))
df5.show()

+-------+-----+---------------------------+
|   Name|Score|ScoreDifferenceWithPrevious|
+-------+-----+---------------------------+
|  Alice|  100|                       NULL|
|    Bob|  200|                        100|
|Charlie|  200|                          0|
|  David|  300|                        100|
|    Eve|  400|                        100|
|  Frank|  500|                        100|
|  Grace|  500|                          0|
|   Hank|  600|                        100|
|    Ivy|  700|                        100|
|   Jack|  800|                        100|
+-------+-----+---------------------------+

