# Window Functions

* Window functions operate on a set of rows and return a single value for each row, unlike a plain aggregate with GROUP BY, which collapses the set into one row. The term window describes the set of rows on which the function operates.

* The window (the set of rows on which the function operates) is defined with an OVER() clause.
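To make the per-row behaviour concrete, here is a plain-Python sketch (not Spark) that contrasts a GROUP BY style aggregate with what `SUM(salary) OVER (PARTITION BY dept)` would return. The sample rows and the helper names `group_by_sum` and `window_sum` are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical sample rows: (name, dept, salary)
rows = [
    ("Lisa", "Sales", 10000),
    ("Evan", "Sales", 32000),
    ("Fred", "Engineering", 21000),
    ("Tom", "Engineering", 23000),
]


def group_by_sum(rows):
    """Mimics SELECT dept, SUM(salary) ... GROUP BY dept: one result per department."""
    totals = defaultdict(int)
    for _name, dept, salary in rows:
        totals[dept] += salary
    return dict(totals)


def window_sum(rows):
    """Mimics SUM(salary) OVER (PARTITION BY dept): one result per input row."""
    totals = group_by_sum(rows)
    # Every input row is kept and annotated with the total of its own partition.
    return [(name, dept, salary, totals[dept]) for name, dept, salary in rows]


print(group_by_sum(rows))
for row in window_sum(rows):
    print(row)
```

The grouped version yields two results (one per department), while the windowed version yields four rows, each carrying its department total alongside the original columns.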

### Types of Window functions

* Aggregate window functions: SUM(), MAX(), MIN(), AVG(), COUNT()
* Ranking window functions: RANK(), DENSE_RANK(), ROW_NUMBER(), NTILE()
* Value window functions: LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE()
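As a rough illustration of the three families, the plain-Python sketch below (not Spark) computes one function from each over a single partition that is assumed to be already sorted by the window's ORDER BY column. The list `salaries` and the variable names are assumptions made for the example.

```python
from itertools import accumulate

# Hypothetical salaries for one partition, already sorted in window order.
salaries = [10000, 30000, 32000]

# Aggregate family: SUM(salary) with the frame
# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, i.e. a running total.
running_sum = list(accumulate(salaries))

# Ranking family: ROW_NUMBER() numbers the rows 1..n in window order.
row_number = list(range(1, len(salaries) + 1))

# Value family: LAG(salary) is the previous row's value (None stands in for NULL),
# LEAD(salary) is the next row's value.
lag = [None] + salaries[:-1]
lead = salaries[1:] + [None]

for values in zip(salaries, running_sum, row_number, lag, lead):
    print(values)
```

Each list has one entry per input row, which is the defining property shared by all three families.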

#### OVER
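* Specifies the window (partitioning, ordering, and frame) over which a window function is evaluated. It applies to ranking and value functions as well as to aggregate functions.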

#### PARTITION BY partition_list
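* Divides the rows into partitions (groups of rows that share the same values in partition_list). The window function is evaluated separately within each partition. If PARTITION BY is omitted, all rows form a single partition.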
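The effect of PARTITION BY can be sketched in plain Python (not Spark): rows are grouped by the partition column, and the function restarts in every group. The sketch below mimics `ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC)`; the helper name `row_number_by_dept` and the four sample rows are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows: (name, dept, salary)
rows = [
    ("Lisa", "Sales", 10000),
    ("Evan", "Sales", 32000),
    ("Jane", "Marketing", 29000),
    ("Jeff", "Marketing", 35000),
]


def row_number_by_dept(rows):
    """Number rows within each department, highest salary first."""
    # Sort by partition key, then by salary descending inside each partition.
    ordered = sorted(rows, key=lambda r: (r[1], -r[2]))
    result = []
    for _dept, partition in groupby(ordered, key=itemgetter(1)):
        # The counter restarts at 1 for every partition.
        for n, (name, dept, salary) in enumerate(partition, start=1):
            result.append((name, dept, salary, n))
    return result


for row in row_number_by_dept(rows):
    print(row)
```

Note that the numbering restarts at 1 in each department; without the grouping step, the four rows would simply be numbered 1 to 4.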




In [11]:
from pyspark.sql import SparkSession

In [14]:
spark = SparkSession.builder.appName("Python Spark SQL Window Example").getOrCreate()

In [15]:
spark.version

'3.0.1'

### Create a sample Spark DataFrame

In [25]:
data = [("Lisa", "Sales", 10000, 35),
        ("Evan", "Sales", 32000, 38),
        ("Fred", "Engineering", 21000, 28),
        ("Alex", "Sales", 30000, 33),
        ("Tom", "Engineering", 23000, 33),
        ("Jane", "Marketing", 29000, 28),
        ("Jeff", "Marketing", 35000, 38),
        ("Paul", "Engineering", 29000, 23),
        ("Chloe", "Engineering", 23000, 25)]

df = spark.createDataFrame(data, "name STRING, dept STRING, salary INT, age INT")
df.printSchema()
df.show(10)
df.createOrReplaceTempView("employees")

root
 |-- name: string (nullable = true)
 |-- dept: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- age: integer (nullable = true)

+-----+-----------+------+---+
| name|       dept|salary|age|
+-----+-----------+------+---+
| Lisa|      Sales| 10000| 35|
| Evan|      Sales| 32000| 38|
| Fred|Engineering| 21000| 28|
| Alex|      Sales| 30000| 33|
|  Tom|Engineering| 23000| 33|
| Jane|  Marketing| 29000| 28|
| Jeff|  Marketing| 35000| 38|
| Paul|Engineering| 29000| 23|
|Chloe|Engineering| 23000| 25|
+-----+-----------+------+---+



### RANK()

In [32]:
# RANK() leaves gaps after ties: two rows tied at rank 2 are followed by rank 4.
df2_rank = spark.sql("SELECT name, dept, salary, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rank FROM employees")
df2_rank.show(10)

+-----+-----------+------+----+
| name|       dept|salary|rank|
+-----+-----------+------+----+
| Evan|      Sales| 32000|   1|
| Alex|      Sales| 30000|   2|
| Lisa|      Sales| 10000|   3|
| Paul|Engineering| 29000|   1|
|  Tom|Engineering| 23000|   2|
|Chloe|Engineering| 23000|   2|
| Fred|Engineering| 21000|   4|
| Jeff|  Marketing| 35000|   1|
| Jane|  Marketing| 29000|   2|
+-----+-----------+------+----+
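The gap in the Engineering ranks above (1, 2, 2, 4) follows from how RANK() is defined: a row's rank is one plus the number of rows in its partition that sort strictly ahead of it. A minimal plain-Python sketch of that rule (not Spark; the helper name `rank_desc` is an assumption) using the Engineering rows from the sample data:

```python
# Engineering rows from the sample data: (name, salary)
engineering = [("Fred", 21000), ("Tom", 23000), ("Paul", 29000), ("Chloe", 23000)]


def rank_desc(rows):
    """Mimics RANK() OVER (ORDER BY salary DESC) within one partition."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    return [
        # rank = 1 + number of rows with a strictly higher salary
        (name, salary, 1 + sum(1 for _, other in rows if other > salary))
        for name, salary in ordered
    ]


for row in rank_desc(engineering):
    print(row)
```

Tom and Chloe each have exactly one row ahead of them, so both get rank 2. Fred has three rows ahead of him, so he gets rank 4 and rank 3 is skipped.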



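### DENSE_RANK()

In [35]:
# DENSE_RANK() leaves no gaps after ties. The explicit ROWS frame matches the frame
# Spark uses for ranking functions by default, so it does not change the result.
df3_dense_rank = spark.sql("SELECT name, dept, salary, DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS dense_rank FROM employees")
df3_dense_rank.show(10)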

+-----+-----------+------+----------+
| name|       dept|salary|dense_rank|
+-----+-----------+------+----------+
| Evan|      Sales| 32000|         1|
| Alex|      Sales| 30000|         2|
| Lisa|      Sales| 10000|         3|
| Paul|Engineering| 29000|         1|
|  Tom|Engineering| 23000|         2|
|Chloe|Engineering| 23000|         2|
| Fred|Engineering| 21000|         3|
| Jeff|  Marketing| 35000|         1|
| Jane|  Marketing| 29000|         2|
+-----+-----------+------+----------+
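DENSE_RANK() differs from RANK() only in how it counts what comes before a row: it counts distinct values that sort ahead, not rows, so no rank is skipped after a tie. A plain-Python sketch of that rule (not Spark; the helper name `dense_rank_desc` is an assumption), again on the Engineering rows:

```python
# Engineering rows from the sample data: (name, salary)
engineering = [("Fred", 21000), ("Tom", 23000), ("Paul", 29000), ("Chloe", 23000)]


def dense_rank_desc(rows):
    """Mimics DENSE_RANK() OVER (ORDER BY salary DESC) within one partition."""
    # Each distinct salary gets the next consecutive rank, highest salary first.
    distinct = sorted({salary for _, salary in rows}, reverse=True)
    position = {salary: i for i, salary in enumerate(distinct, start=1)}
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    return [(name, salary, position[salary]) for name, salary in ordered]


for row in dense_rank_desc(engineering):
    print(row)
```

Fred now receives 3 instead of 4, because only two distinct salaries (29000 and 23000) are higher than his.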

