## Window Functions
<img src="https://th.bing.com/th/id/OIP.LZowY8jSj7ARbxCYNfyU9wHaC6?w=328&h=137&c=7&r=0&o=5&dpr=1.5&pid=1.7">


- To follow along, we use a sales dataset.
- Read the dataset from DBFS (the Databricks File System).
- Create a temporary view for it.
- Select the first three records to check the data.

In [0]:
sales_data = spark.read.csv("dbfs:/FileStore/sales.csv", header=True, inferSchema=True)
sales_data.createOrReplaceTempView('sales_data')
spark.sql("select * from sales_data limit 3").show()

+---------+---------+--------------------+--------+---------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|    InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+---------------+---------+----------+--------------+
|   536365|     null|WHITE HANGING HEA...|       6|01-12-2010 8.26|     2.55|     17850|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|01-12-2010 8.26|     3.39|     17850|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|01-12-2010 8.26|     2.75|     17850|United Kingdom|
+---------+---------+--------------------+--------+---------------+---------+----------+--------------+



### Syntax of Window Functions

<img src="https://learnsql.com/blog/mysql-window-functions/1.png">

`Window Function Name:` The window function to apply, such as `RANK()` or `ROW_NUMBER()`.

`OVER:` The `OVER` clause defines the window: how rows are grouped into partitions and how rows are ordered within each partition. It can contain three sub-clauses: partition, order, and frame.

`PARTITION BY:` Divides the rows into partitions; the window function operates on each partition separately.

`ORDER BY:` The order clause specifies the order of rows within each partition on which the window function operates.
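To make the parts of the clause concrete, here is a minimal sketch that runs one window function over a tiny table. It uses Python's built-in `sqlite3` module instead of Spark (an assumption, made so the snippet runs without a cluster; SQLite 3.25 or newer is needed for window functions), and the `sales` table and its three rows are made up for illustration. The `OVER (PARTITION BY ... ORDER BY ...)` text itself should carry over to `spark.sql` unchanged.

```python
import sqlite3

# Hypothetical in-memory table standing in for sales_data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (InvoiceNo TEXT, StockCode TEXT, Quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("A1", "x", 6), ("A1", "y", 8), ("B2", "z", 2)],
)

query = """
SELECT InvoiceNo, StockCode, Quantity,
       ROW_NUMBER() OVER (            -- window function name + OVER
           PARTITION BY InvoiceNo     -- one window per invoice
           ORDER BY Quantity DESC     -- row order inside each window
       ) AS RowNum
FROM sales
ORDER BY InvoiceNo, RowNum
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Numbering restarts at 1 for invoice `B2` because it is a separate partition.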

**Some commonly used window functions:**
```sql
CUME_DIST()
NTILE()
PERCENT_RANK()
DENSE_RANK()
RANK()
ROW_NUMBER()
```
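The first three functions in this list are not demonstrated later, so here is a small, hedged sketch of what they return. It runs on SQLite through Python's `sqlite3` module rather than on Spark (an assumption; SQLite 3.25+ is required), and the `sales` table with its `Id` and `Quantity` columns is hypothetical.

```python
import sqlite3

# Hypothetical four-row table; two rows tie on Quantity.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (Id INTEGER, Quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 20), (4, 40)],
)

query = """
SELECT Id, Quantity,
       -- NTILE(2) splits the ordered rows into 2 buckets of equal size
       NTILE(2) OVER (ORDER BY Quantity, Id)              AS Half,
       -- PERCENT_RANK = (rank - 1) / (rows in partition - 1)
       ROUND(PERCENT_RANK() OVER (ORDER BY Quantity), 2)  AS PctRank,
       -- CUME_DIST = (rows with value <= current) / (rows in partition)
       ROUND(CUME_DIST() OVER (ORDER BY Quantity), 2)     AS CumeDist
FROM sales
ORDER BY Id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The two rows with `Quantity` 20 are peers, so they share the same `PctRank` and `CumeDist`; `NTILE` still places them in different buckets because it only counts rows.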
**Let's explore them one by one, from bottom to top**

#### ROW_NUMBER():
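- Assigns a unique, sequential number to each row within its partition, starting at 1.

**Q. Assign a unique row number to each row with the same InvoiceNo.**
- The `Row_Number` column in the output shows the row number within each partition. Note that the query orders by the partition column itself, so all rows in a partition tie and the order in which they are numbered is not guaranteed.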

In [0]:
spark.sql("""
select *, 
ROW_NUMBER() OVER(PARTITION BY InvoiceNo ORDER BY InvoiceNo ASC) as Row_Number
from sales_data
""").show()

+---------+---------+--------------------+--------+----------------+---------+----------+--------------+----------+
|InvoiceNo|StockCode|         Description|Quantity|     InvoiceDate|UnitPrice|CustomerID|       Country|Row_Number|
+---------+---------+--------------------+--------+----------------+---------+----------+--------------+----------+
|   536366|    22633|HAND WARMER UNION...|       6| 01-12-2010 8.28|     1.85|     17850|United Kingdom|         1|
|   536366|    22632|HAND WARMER RED P...|       6| 01-12-2010 8.28|     1.85|     17850|United Kingdom|         2|
|   536374|    21258|VICTORIAN SEWING ...|      32| 01-12-2010 9.09|    10.95|     15100|United Kingdom|         1|
|   536386|    84880|WHITE WIRE EGG HO...|      36| 01-12-2010 9.57|     4.95|     16029|United Kingdom|         1|
|   536386|   85099C|JUMBO  BAG BAROQU...|     100| 01-12-2010 9.57|     1.65|     16029|United Kingdom|         2|
|   536386|   85099B|JUMBO BAG RED RET...|     100| 01-12-2010 9.57|    
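
Because the query above orders by the same column it partitions by, the numbering inside an invoice can come out in any order. One way to make it repeatable is to order by a second column. The sketch below uses `StockCode` as that tie-breaker (an assumed choice, any column that is unique within an invoice would do) and runs on SQLite via Python's `sqlite3` module rather than Spark, so that it is self-contained. The three rows are copied from the output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_data (InvoiceNo TEXT, StockCode TEXT)")
conn.executemany(
    "INSERT INTO sales_data VALUES (?, ?)",
    [("536366", "22633"), ("536366", "22632"), ("536374", "21258")],
)

query = """
SELECT InvoiceNo, StockCode,
       ROW_NUMBER() OVER (
           PARTITION BY InvoiceNo
           ORDER BY StockCode ASC   -- tie-breaker: fixes the order inside each invoice
       ) AS Row_Number
FROM sales_data
ORDER BY InvoiceNo, Row_Number
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With the tie-breaker in place, stock code `22632` always gets row number 1 within invoice `536366`, regardless of how the rows were read.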

**Q. Find the first two products of each invoice.**

In [0]:
spark.sql("""
select * from
(select s.*, 
ROW_NUMBER() OVER(PARTITION BY s.InvoiceNo ORDER BY s.InvoiceNo ASC) as Row_Number
from sales_data s)
where Row_Number < 3
""").show()

+---------+---------+--------------------+--------+----------------+---------+----------+--------------+----------+
|InvoiceNo|StockCode|         Description|Quantity|     InvoiceDate|UnitPrice|CustomerID|       Country|Row_Number|
+---------+---------+--------------------+--------+----------------+---------+----------+--------------+----------+
|   536366|    22633|HAND WARMER UNION...|       6| 01-12-2010 8.28|     1.85|     17850|United Kingdom|         1|
|   536366|    22632|HAND WARMER RED P...|       6| 01-12-2010 8.28|     1.85|     17850|United Kingdom|         2|
|   536374|    21258|VICTORIAN SEWING ...|      32| 01-12-2010 9.09|    10.95|     15100|United Kingdom|         1|
|   536386|    84880|WHITE WIRE EGG HO...|      36| 01-12-2010 9.57|     4.95|     16029|United Kingdom|         1|
|   536386|   85099C|JUMBO  BAG BAROQU...|     100| 01-12-2010 9.57|     1.65|     16029|United Kingdom|         2|
|   536387|    79321|       CHILLI LIGHTS|     192| 01-12-2010 9.58|    
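
The subquery in the query above is needed because a window function cannot be referenced directly in the `WHERE` clause of the same `SELECT`. The same "top N per group" pattern can be wrapped in a small helper. This is only a sketch: `top_n_per_invoice` is a hypothetical name, the rows are invented, "first" is taken to mean "largest quantity" (an assumption), and it runs on SQLite through `sqlite3` instead of Spark.

```python
import sqlite3

def top_n_per_invoice(conn, n):
    """Return the n largest-quantity lines of each invoice (hypothetical helper)."""
    query = """
    SELECT InvoiceNo, StockCode, Quantity
    FROM (
        SELECT s.*,
               ROW_NUMBER() OVER (
                   PARTITION BY InvoiceNo
                   ORDER BY Quantity DESC, StockCode
               ) AS RowNum
        FROM sales s
    ) AS numbered
    WHERE RowNum <= ?              -- filter on the window result in the outer query
    ORDER BY InvoiceNo, RowNum
    """
    return conn.execute(query, (n,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (InvoiceNo TEXT, StockCode TEXT, Quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("A1", "x", 5), ("A1", "y", 9), ("A1", "z", 7), ("B2", "w", 3)],
)

top_two = top_n_per_invoice(conn, 2)
for row in top_two:
    print(row)
```

Invoice `B2` has only one line, so it contributes a single row even when `n` is 2.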

#### RANK():
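- `RANK()` assigns a rank to each row within its partition; the ranking restarts at each partition boundary.
- Rows with equal values receive the same rank, and the ranks that follow are skipped accordingly, so the ranking has gaps.
- The query below has no `PARTITION BY`, so the whole result set is one partition. In the output, `CustomerRank` jumps from 1 to 3 because two rows share rank 1.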

In [0]:
spark.sql("""
(
select *, 
RANK() OVER(ORDER BY CustomerID) as CustomerRank
from sales_data where CustomerID is not null
)
""").show()

+---------+---------+--------------------+--------+----------------+---------+----------+--------------+------------+
|InvoiceNo|StockCode|         Description|Quantity|     InvoiceDate|UnitPrice|CustomerID|       Country|CustomerRank|
+---------+---------+--------------------+--------+----------------+---------+----------+--------------+------------+
|   541431|    23166|MEDIUM CERAMIC TO...|   74215|18-01-2011 10.01|     1.04|     12346|United Kingdom|           1|
|  C541433|    23166|MEDIUM CERAMIC TO...|  -74215|18-01-2011 10.17|     1.04|     12346|United Kingdom|           1|
|   537626|   84997C|BLUE 3 PIECE POLK...|       6|07-12-2010 14.57|     3.75|     12347|       Iceland|           3|
|   537626|    22212|FOUR HOOK  WHITE ...|       6|07-12-2010 14.57|      2.1|     12347|       Iceland|           3|
|   537626|    20782|CAMOUFLAGE EAR MU...|       6|07-12-2010 14.57|     5.49|     12347|       Iceland|           3|
|   537626|   85167B|BLACK GRAND BAROQ...|      30|07-12
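
The query above ranks over a single partition, so it does not show the restart at partition boundaries. A minimal sketch of that behaviour follows, assuming a made-up `sales` table partitioned by `Country` and run on SQLite via Python's `sqlite3` module rather than Spark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (Country TEXT, Quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("United Kingdom", 10),
        ("United Kingdom", 10),
        ("United Kingdom", 5),
        ("Iceland", 6),
        ("Iceland", 4),
    ],
)

query = """
SELECT Country, Quantity,
       RANK() OVER (
           PARTITION BY Country      -- ranking restarts for each country
           ORDER BY Quantity DESC
       ) AS QtyRank
FROM sales
ORDER BY Country, QtyRank
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Within `United Kingdom` the two rows with quantity 10 tie at rank 1 and the next row gets rank 3; `Iceland` starts again from rank 1.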

#### DENSE_RANK():
- `DENSE_RANK()` is a window function that assigns ranks to rows in each partition with no gaps in the ranking values.
- If two or more rows in a partition have the same value, they receive the same rank; the next distinct value gets the next consecutive rank.
- In the output, `CustomerDenseRank` goes from 1 to 2, while `CustomerRank` jumps from 1 to 3.

In [0]:
spark.sql("""
(
select *, 
RANK() OVER(ORDER BY CustomerID) as CustomerRank,
DENSE_RANK() OVER(ORDER BY CustomerID) as CustomerDenseRank
from sales_data where CustomerID is not null
)
""").show()

+---------+---------+--------------------+--------+----------------+---------+----------+--------------+------------+-----------------+
|InvoiceNo|StockCode|         Description|Quantity|     InvoiceDate|UnitPrice|CustomerID|       Country|CustomerRank|CustomerDenseRank|
+---------+---------+--------------------+--------+----------------+---------+----------+--------------+------------+-----------------+
|   541431|    23166|MEDIUM CERAMIC TO...|   74215|18-01-2011 10.01|     1.04|     12346|United Kingdom|           1|                1|
|  C541433|    23166|MEDIUM CERAMIC TO...|  -74215|18-01-2011 10.17|     1.04|     12346|United Kingdom|           1|                1|
|   537626|   84997C|BLUE 3 PIECE POLK...|       6|07-12-2010 14.57|     3.75|     12347|       Iceland|           3|                2|
|   537626|    22212|FOUR HOOK  WHITE ...|       6|07-12-2010 14.57|      2.1|     12347|       Iceland|           3|                2|
|   537626|    20782|CAMOUFLAGE EAR MU...|      
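
To see why one ranking has gaps and the other does not, the logic of both functions can be written out in a few lines of plain Python. This is an illustrative sketch of the ranking rules, not how Spark computes them; `rank_and_dense_rank` is a hypothetical helper and the customer IDs are sample values.

```python
def rank_and_dense_rank(sorted_values):
    """Return (value, rank, dense_rank) tuples for an already-sorted list."""
    result = []
    rank = 0
    dense = 0
    previous = object()  # sentinel that never equals a real value
    for position, value in enumerate(sorted_values, start=1):
        if value != previous:
            rank = position  # RANK jumps to the row's position, which leaves gaps
            dense += 1       # DENSE_RANK moves to the next consecutive integer
            previous = value
        result.append((value, rank, dense))
    return result

customer_ids = [12346, 12346, 12347, 12347, 12347, 12348]
ranked = rank_and_dense_rank(customer_ids)
for row in ranked:
    print(row)
```

Both counters change only when the value changes; the difference is that `RANK` takes the row's position while `DENSE_RANK` adds one.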

# Thank You