# Window Functions in PySpark

Window functions perform calculations across a set of rows related to the current row, within a specified partition. Unlike groupBy aggregations, window functions do not reduce the number of rows in the result; instead, they compute a value for every row based on its window.
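To make the contrast with groupBy concrete, here is a minimal plain-Python sketch (no Spark involved; the sample rows and the names `grouped` and `windowed` are made up for illustration). The grouped result collapses each category to one row, while the window-style result keeps every input row and attaches the per-category total to it.

```python
from collections import defaultdict

# Hypothetical sample rows: (category, value)
rows = [("A", 1), ("A", 2), ("B", 5), ("B", 4)]

# Per-category totals, computed once
totals = defaultdict(int)
for category, value in rows:
    totals[category] += value

# groupBy-style result: one row per category
grouped = sorted(totals.items())

# Window-style result: every input row is kept, with the total attached
windowed = [(category, value, totals[category]) for category, value in rows]

print(grouped)   # [('A', 3), ('B', 9)]
print(windowed)  # [('A', 1, 3), ('A', 2, 3), ('B', 5, 9), ('B', 4, 9)]
```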




In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # needed for the F.* snippets below
from pyspark.sql.functions import *  # wildcard import; note that it shadows built-ins such as sum, min and max
from pyspark.sql.types import *
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()


## Window Specification
A window specification defines how the rows will be grouped (partitioned) and ordered within each group.

**Example:**
```
window_spec = Window.partitionBy("category").orderBy("timestamp")
```

Partitioning and ordering can each use more than one column:

```
window_spec = Window.partitionBy("category", "sub_category").orderBy(col("timestamp"), col("score"))
```
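As a rough mental model of what a window specification does, the following plain-Python sketch groups rows by a partition key and sorts each group by an ordering key. It is an illustration only, not how Spark executes windows; `build_windows` and the sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (category, timestamp, value)
rows = [
    ("B", "2023-01-02", 4),
    ("A", "2023-01-02", 2),
    ("A", "2023-01-01", 1),
    ("B", "2023-01-01", 5),
]

def build_windows(rows, partition_key, order_key):
    """Group rows by partition_key and sort each group by order_key."""
    windows = {}
    # Sorting by (partition, order) lets groupby see each partition as one run
    ordered = sorted(rows, key=lambda r: (partition_key(r), order_key(r)))
    for key, group in groupby(ordered, key=partition_key):
        windows[key] = list(group)
    return windows

windows = build_windows(rows, partition_key=itemgetter(0), order_key=itemgetter(1))
for category, window_rows in windows.items():
    print(category, window_rows)
```

Each window function is then evaluated row by row inside one of these ordered groups.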

## List of Window Functions

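* Row Number : row_number()
Assigns a unique, sequential number (starting at 1) to each row within its partition, following the window's ordering.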
```
df = df.withColumn("row_number", F.row_number().over(window_spec))
```

* Rank: rank()
Assigns the same rank to rows that tie on the ordering columns. The rank after a tie skips ahead, leaving a gap.
```
df = df.withColumn("rank", rank().over(window_spec))
```

* Dense Rank : dense_rank()
Similar to rank(), but does not leave gaps in the ranking.
```
df = df.withColumn("dense_rank", dense_rank().over(window_spec))
```

* Lead and Lag Functions : lead(), lag()
  * **lead()**: Returns the value from the next row within the window, or NULL if there is no next row.
  * **lag()**: Returns the value from the previous row within the window, or NULL if there is no previous row.

```
df = df.withColumn("next_value", F.lead("value").over(window_spec))

df = df.withColumn("previous_value", F.lag("value").over(window_spec))
```


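* Aggregation Functions
Aggregate functions can also be applied over a window. When the window specification includes an orderBy, the default frame runs from the start of the partition to the current row, so the result is a running aggregate.
  * Sum: F.sum("column_name").over(window_spec)
  * Min: F.min("column_name").over(window_spec)
  * Max: F.max("column_name").over(window_spec)
  * Avg: F.avg("column_name").over(window_spec)
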
```
df = df.withColumn("avg_value", F.avg("value").over(window_spec))
```
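Because an ordered window defaults to a frame that grows from the start of the partition to the current row, `avg` over such a window behaves like a running average. The sketch below emulates that in plain Python for one partition, assuming the values are already sorted and there are no ties in the ordering column (with ties, Spark's default range frame would include all tied rows together). `running_average` is a hypothetical helper, not a PySpark function.

```python
def running_average(values):
    """Average of values[0..i] for each position i (values already ordered)."""
    averages = []
    total = 0
    for count, value in enumerate(values, start=1):
        total += value
        averages.append(total / count)
    return averages

print(running_average([1, 2]))  # [1.0, 1.5]
print(running_average([5, 4]))  # [5.0, 4.5]
```

These values match the avg_value column for the (A, X) and (B, X) partitions in the example output further down.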






In [2]:
# Create a DataFrame with sample data
data = [
    ("A", "X", 1, "2023-01-01"),
    ("A", "X", 2, "2023-01-02"),
    ("A", "Y", 3, "2023-01-01"),
    ("A", "Y", 3, "2023-01-02"),
    ("B", "X", 5, "2023-01-01"),
    ("B", "X", 4, "2023-01-02"),
    ]
columns = ["category", "sub_category", "value", "timestamp"]
df = spark.createDataFrame(data, columns)
df.show()


+--------+------------+-----+----------+
|category|sub_category|value| timestamp|
+--------+------------+-----+----------+
|       A|           X|    1|2023-01-01|
|       A|           X|    2|2023-01-02|
|       A|           Y|    3|2023-01-01|
|       A|           Y|    3|2023-01-02|
|       B|           X|    5|2023-01-01|
|       B|           X|    4|2023-01-02|
+--------+------------+-----+----------+



In [3]:
# Define the window specification
window_spec = Window.partitionBy("category", "sub_category").orderBy(col("timestamp"), col("value"))

# Apply the window functions
df = df.withColumn("row_number", row_number().over(window_spec))\
      .withColumn("rank", rank().over(window_spec))\
      .withColumn("dense_rank", dense_rank().over(window_spec))\
      .withColumn("next_value", lead("value").over(window_spec))\
      .withColumn("previous_value", lag("value").over(window_spec))\
      .withColumn("avg_value", avg("value").over(window_spec))

df.show()


+--------+------------+-----+----------+----------+----+----------+----------+--------------+---------+
|category|sub_category|value| timestamp|row_number|rank|dense_rank|next_value|previous_value|avg_value|
+--------+------------+-----+----------+----------+----+----------+----------+--------------+---------+
|       A|           X|    1|2023-01-01|         1|   1|         1|         2|          NULL|      1.0|
|       A|           X|    2|2023-01-02|         2|   2|         2|      NULL|             1|      1.5|
|       A|           Y|    3|2023-01-01|         1|   1|         1|         3|          NULL|      3.0|
|       A|           Y|    3|2023-01-02|         2|   2|         2|      NULL|             3|      3.0|
|       B|           X|    5|2023-01-01|         1|   1|         1|         4|          NULL|      5.0|
|       B|           X|    4|2023-01-02|         2|   2|         2|      NULL|             5|      4.5|
+--------+------------+-----+----------+----------+----+----------+----------+--------------+---------+

## Window Functions in PySpark Part 2

In [4]:
#Sample Data
data = [
    ("Alice", 100),
    ("Bob", 200),
    ("Charlie", 200),
    ("David", 300),
    ("Eve", 400),
    ("Frank", 500),
    ("Grace", 500),
    ("Hank", 600),
    ("Ivy", 700),
    ("Jack", 800)
]
columns = ["Name", "Score"]

# Create a DataFrame
df = spark.createDataFrame(data, columns)
df.show()



+-------+-----+
|   Name|Score|
+-------+-----+
|  Alice|  100|
|    Bob|  200|
|Charlie|  200|
|  David|  300|
|    Eve|  400|
|  Frank|  500|
|  Grace|  500|
|   Hank|  600|
|    Ivy|  700|
|   Jack|  800|
+-------+-----+



### Define a window specification

In [5]:
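# No partitionBy: the whole DataFrame forms a single window, so Spark moves all rows
# to one partition. That is fine for small data but does not scale.
window_spec = Window.orderBy("Score")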

### Using rank() to calculate rank

In [6]:
df1 = df.withColumn("Rank", rank().over(window_spec))
df1.show()

+-------+-----+----+
|   Name|Score|Rank|
+-------+-----+----+
|  Alice|  100|   1|
|    Bob|  200|   2|
|Charlie|  200|   2|
|  David|  300|   4|
|    Eve|  400|   5|
|  Frank|  500|   6|
|  Grace|  500|   6|
|   Hank|  600|   8|
|    Ivy|  700|   9|
|   Jack|  800|  10|
+-------+-----+----+



### Using dense_rank() to calculate dense rank

In [7]:
df2 = df.withColumn("DenseRank", dense_rank().over(window_spec))
df2.show()

+-------+-----+---------+
|   Name|Score|DenseRank|
+-------+-----+---------+
|  Alice|  100|        1|
|    Bob|  200|        2|
|Charlie|  200|        2|
|  David|  300|        3|
|    Eve|  400|        4|
|  Frank|  500|        5|
|  Grace|  500|        5|
|   Hank|  600|        6|
|    Ivy|  700|        7|
|   Jack|  800|        8|
+-------+-----+---------+



### Using row_number() to calculate row number

In [8]:
df3 = df.withColumn("Row_Number", row_number().over(window_spec))
df3.show()

+-------+-----+----------+
|   Name|Score|Row_Number|
+-------+-----+----------+
|  Alice|  100|         1|
|    Bob|  200|         2|
|Charlie|  200|         3|
|  David|  300|         4|
|    Eve|  400|         5|
|  Frank|  500|         6|
|  Grace|  500|         7|
|   Hank|  600|         8|
|    Ivy|  700|         9|
|   Jack|  800|        10|
+-------+-----+----------+
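The difference between the three ranking functions comes down to how ties are handled. The following plain-Python sketch emulates them for an already sorted list of scores; `ranking_columns` is a hypothetical helper written for illustration, not part of PySpark.

```python
def ranking_columns(sorted_scores):
    """Return (row_number, rank, dense_rank) for each score in a sorted list."""
    result = []
    rank = dense = 0
    previous = object()  # sentinel that never equals a real score
    for row_number, score in enumerate(sorted_scores, start=1):
        if score != previous:
            rank = row_number  # rank jumps to the row position, leaving gaps after ties
            dense += 1         # dense_rank advances by exactly one
            previous = score
        result.append((row_number, rank, dense))
    return result

scores = [100, 200, 200, 300, 400, 500, 500, 600, 700, 800]
for score, (rn, rk, dr) in zip(scores, ranking_columns(scores)):
    print(score, rn, rk, dr)
```

For the tied scores 200 and 500 this gives the same rank and dense rank to both rows but distinct row numbers, which is the pattern visible in the three tables above. Note that in Spark the order of tied rows under row_number() is not guaranteed unless the ordering includes a tie-breaking column.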



### Using lead() to calculate the difference with the next row

In [9]:
df4 = df.withColumn("ScoreDifferenceWithNext", lead("Score").over(window_spec) - df["Score"])
df4.show()

+-------+-----+-----------------------+
|   Name|Score|ScoreDifferenceWithNext|
+-------+-----+-----------------------+
|  Alice|  100|                    100|
|    Bob|  200|                      0|
|Charlie|  200|                    100|
|  David|  300|                    100|
|    Eve|  400|                    100|
|  Frank|  500|                      0|
|  Grace|  500|                    100|
|   Hank|  600|                    100|
|    Ivy|  700|                    100|
|   Jack|  800|                   NULL|
+-------+-----+-----------------------+



### Using lag() to calculate the difference with the previous row

In [10]:
df5 = df.withColumn("ScoreDifferenceWithPrevious", df["Score"] - lag("Score").over(window_spec))
df5.show()

+-------+-----+---------------------------+
|   Name|Score|ScoreDifferenceWithPrevious|
+-------+-----+---------------------------+
|  Alice|  100|                       NULL|
|    Bob|  200|                        100|
|Charlie|  200|                          0|
|  David|  300|                        100|
|    Eve|  400|                        100|
|  Frank|  500|                        100|
|  Grace|  500|                          0|
|   Hank|  600|                        100|
|    Ivy|  700|                        100|
|   Jack|  800|                        100|
+-------+-----+---------------------------+
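The two difference columns above can be reproduced with a small plain-Python emulation of lead and lag. `py_lead` and `py_lag` are hypothetical helpers that shift an already ordered list by one position and use None where Spark would return NULL.

```python
def py_lead(values, offset=1):
    """Value `offset` rows ahead, or None when there is no such row."""
    return [values[i + offset] if i + offset < len(values) else None
            for i in range(len(values))]

def py_lag(values, offset=1):
    """Value `offset` rows behind, or None when there is no such row."""
    return [values[i - offset] if i - offset >= 0 else None
            for i in range(len(values))]

scores = [100, 200, 200, 300, 400, 500, 500, 600, 700, 800]

# Arithmetic with a missing neighbour stays missing, as NULL does in Spark
diff_with_next = [None if n is None else n - s
                  for s, n in zip(scores, py_lead(scores))]
diff_with_previous = [None if p is None else s - p
                      for s, p in zip(scores, py_lag(scores))]

print(diff_with_next)      # [100, 0, 100, 100, 100, 0, 100, 100, 100, None]
print(diff_with_previous)  # [None, 100, 0, 100, 100, 100, 0, 100, 100, 100]
```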



## Part 3

In [21]:
data6 = [
    ("Alice", "Math", 90, 1),
    ("Alice", "Science", 85, 1),
    ("Alice", "History", 78, 1),
    ("Bob", "Math", 80, 1),
    ("Bob", "Science", 81, 1),
    ("Bob", "History", 77, 1),
    ("Charlie", "Math", 75, 1),
    ("Charlie", "Science", 82, 1),
    ("Charlie", "History", 79, 1),
    ("Alice", "Physics", 86, 2),
    ("Alice", "Chemistry", 92, 2),
    ("Alice", "Biology", 80, 2),
    ("Bob", "Physics", 94, 2),
    ("Bob", "Chemistry", 91, 2),
    ("Bob", "Biology", 96, 2),
    ("Charlie", "Physics", 89, 2),
    ("Charlie", "Chemistry", 88, 2),
    ("Charlie", "Biology", 85, 2),
    ("Alice", "Computer Science", 95, 3),
    ("Alice", "Electronics", 91, 3),
    ("Alice", "Geography", 97, 3),
    ("Bob", "Computer Science", 88, 3),
    ("Bob", "Electronics", 66, 3),
    ("Bob", "Geography", 92, 3),
    ("Charlie", "Computer Science", 92, 3),
    ("Charlie", "Electronics", 97, 3),
    ("Charlie", "Geography", 99, 3)
  ]
columns = ["First Name", "Subject", "Marks", "Semester"]
# Create DataFrame
df7 = spark.createDataFrame(data6, columns)
df7.show(100)


+----------+----------------+-----+--------+
|First Name|         Subject|Marks|Semester|
+----------+----------------+-----+--------+
|     Alice|            Math|   90|       1|
|     Alice|         Science|   85|       1|
|     Alice|         History|   78|       1|
|       Bob|            Math|   80|       1|
|       Bob|         Science|   81|       1|
|       Bob|         History|   77|       1|
|   Charlie|            Math|   75|       1|
|   Charlie|         Science|   82|       1|
|   Charlie|         History|   79|       1|
|     Alice|         Physics|   86|       2|
|     Alice|       Chemistry|   92|       2|
|     Alice|         Biology|   80|       2|
|       Bob|         Physics|   94|       2|
|       Bob|       Chemistry|   91|       2|
|       Bob|         Biology|   96|       2|
|   Charlie|         Physics|   89|       2|
|   Charlie|       Chemistry|   88|       2|
|   Charlie|         Biology|   85|       2|
|     Alice|Computer Science|   95|       3|
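|     Alice|     Electronics|   91|       3|
|     Alice|       Geography|   97|       3|
|       Bob|Computer Science|   88|       3|
|       Bob|     Electronics|   66|       3|
|       Bob|       Geography|   92|       3|
|   Charlie|Computer Science|   92|       3|
|   Charlie|     Electronics|   97|       3|
|   Charlie|       Geography|   99|       3|
+----------+----------------+-----+--------+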

### 1. Which student scored max marks in each semester considering all subjects

In [22]:
window_spec_max_marks = Window.partitionBy("Semester").orderBy(desc("Marks"))
max_mark_df = df7.withColumn("Rank", rank().over(window_spec_max_marks))
max_mark_df.show(100)
top_scorer = max_mark_df.filter(col("Rank") == 1)
print('Top Scorer Each Semester')
top_scorer.show()

+----------+----------------+-----+--------+----+
|First Name|         Subject|Marks|Semester|Rank|
+----------+----------------+-----+--------+----+
|     Alice|            Math|   90|       1|   1|
|     Alice|         Science|   85|       1|   2|
|   Charlie|         Science|   82|       1|   3|
|       Bob|         Science|   81|       1|   4|
|       Bob|            Math|   80|       1|   5|
|   Charlie|         History|   79|       1|   6|
|     Alice|         History|   78|       1|   7|
|       Bob|         History|   77|       1|   8|
|   Charlie|            Math|   75|       1|   9|
|       Bob|         Biology|   96|       2|   1|
|       Bob|         Physics|   94|       2|   2|
|     Alice|       Chemistry|   92|       2|   3|
|       Bob|       Chemistry|   91|       2|   4|
|   Charlie|         Physics|   89|       2|   5|
|   Charlie|       Chemistry|   88|       2|   6|
|     Alice|         Physics|   86|       2|   7|
|   Charlie|         Biology|   85|       2|   8|


### 2. Percentage of each student considering all subjects

In [20]:
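# No orderBy here, so each aggregate covers the whole (student, semester) partition
window_spec_total_marks = Window.partitionBy("First Name", "Semester")

# sum is PySpark's sum (from the wildcard import), not the Python built-in
df8 = df7.withColumn("Total_Marks", sum("Marks").over(window_spec_total_marks))
# Assumes 3 subjects per semester, each out of 100. The cast to decimal(5,2) is applied
# before multiplying by 100, so percentages come out as whole numbers (253/300 -> 0.84 -> 84.00)
df8 = df8.withColumn("Percentage", (col("Total_Marks")/(3*100)).cast("decimal(5,2)")*100)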

df9 = df8.groupBy("First Name", "Semester").agg(max("Total_Marks"), max("Percentage"))
df9.show()

+----------+--------+----------------+---------------+
|First Name|Semester|max(Total_Marks)|max(Percentage)|
+----------+--------+----------------+---------------+
|     Alice|       1|             253|          84.00|
|     Alice|       2|             258|          86.00|
|     Alice|       3|             283|          94.00|
|       Bob|       1|             238|          79.00|
|       Bob|       2|             281|          94.00|
|       Bob|       3|             246|          82.00|
|   Charlie|       1|             236|          79.00|
|   Charlie|       2|             262|          87.00|
|   Charlie|       3|             288|          96.00|
+----------+--------+----------------+---------------+
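Every percentage in this table ends in .00 because the fraction is cast to two decimal places before it is multiplied by 100. The sketch below reproduces the effect with Python's `decimal` module for Alice's first semester; `to_decimal_5_2` is a hypothetical helper, and its half-up rounding is an assumption about how the Spark cast behaves, chosen because it agrees with the values shown.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_decimal_5_2(x):
    """Round to two decimal places, half up (assumed to mirror cast('decimal(5,2)'))."""
    return Decimal(str(x)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

total_marks = 253    # Alice, semester 1
max_marks = 3 * 100  # three subjects, assumed to be out of 100 each

# Order used in the cell above: round the fraction first, then scale by 100
rounded_first = to_decimal_5_2(total_marks / max_marks) * 100

# Alternative order: scale to a percentage first, then round
scaled_first = to_decimal_5_2(total_marks / max_marks * 100)

print(rounded_first)  # 84.00
print(scaled_first)   # 84.33
```

If two decimal places of precision are wanted in the percentage, multiplying by 100 before the cast is one way to keep them.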



### 3. Who is the top rank holder in each semester considering all subjects

In [31]:
window_spec_rank = Window.partitionBy("Semester").orderBy(desc("Percentage"))
rank_df = df8.withColumn("Rank", dense_rank().over(window_spec_rank))

top_rank_holder = rank_df.filter(col("Rank") == 1).select("First Name", "Semester", "Percentage", "Rank" ).distinct()
print('Top Rank Holder Each Semester')
top_rank_holder.show()

Top Rank Holder Each Semester
+----------+--------+----------+----+
|First Name|Semester|Percentage|Rank|
+----------+--------+----------+----+
|     Alice|       1|     84.00|   1|
|       Bob|       2|     94.00|   1|
|   Charlie|       3|     96.00|   1|
+----------+--------+----------+----+



### 4. Who scored max marks in each subject in each semester

In [37]:
window_spec_max_subject_marks = Window.partitionBy("Semester", "Subject").orderBy(desc("Marks"))
max_subject_mark_df = df8.withColumn("Rank", rank().over(window_spec_max_subject_marks))


max_subject_scorer = max_subject_mark_df.filter(col("Rank") == 1)
print("Max Subject Scorer")
max_subject_scorer.show()

Max Subject Scorer
+----------+----------------+-----+--------+-----------+----------+----+
|First Name|         Subject|Marks|Semester|Total_Marks|Percentage|Rank|
+----------+----------------+-----+--------+-----------+----------+----+
|   Charlie|         History|   79|       1|        236|     79.00|   1|
|     Alice|            Math|   90|       1|        253|     84.00|   1|
|     Alice|         Science|   85|       1|        253|     84.00|   1|
|       Bob|         Biology|   96|       2|        281|     94.00|   1|
|     Alice|       Chemistry|   92|       2|        258|     86.00|   1|
|       Bob|         Physics|   94|       2|        281|     94.00|   1|
|     Alice|Computer Science|   95|       3|        283|     94.00|   1|
|   Charlie|     Electronics|   97|       3|        288|     96.00|   1|
|   Charlie|       Geography|   99|       3|        288|     96.00|   1|
+----------+----------------+-----+--------+-----------+----------+----+

