Window functions compute a value for every row from a group of related rows (the window), so the DataFrame keeps all of its rows. PySpark supports 3 types of window functions:
- Ranking functions
- Analytic functions
- Aggregate functions

Documentation:

**LIBRARIES**

In [0]:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

**DATAFRAME**

In [0]:
dados = [
         ("Anderson", "Vendas", "SP", 1500.00, 34, 1000.00),
         ("Kennedy", "Vendas", "CE", 1200.00, 56, 2000.00),
         ("Bruno", "Vendas", "SP", 1100.00, 30, 2300.00),
         ("Maria", "Financas", "CE", 3600.00, 24, 2300.00),
         ("Eduardo", "Financas", "CE", 4500.00, 40, 2400.00),
         ("Mendes", "Financas", "RS", 8000.00, 36, 1900.00),
         ("Kethlyn", "Financas", "RS", 1200.00, 53, 1500.00),
         ("Thiago", "Marketing", "GO", 1100.00, 25, 1800.00),
         ("Carla", "Marketing", "GO", 2600.00, 50, 2100.00)
]

schema = ["Nome", "Departamento", "Estado", "Salario", "Idade", "Bonus"]

df = spark.createDataFrame(data=dados, schema=schema)
df.printSchema()
display(df)

Nome,Departamento,Estado,Salario,Idade,Bonus
Anderson,Vendas,SP,1500.0,34,1000.0
Kennedy,Vendas,CE,1200.0,56,2000.0
Bruno,Vendas,SP,1100.0,30,2300.0
Maria,Financas,CE,3600.0,24,2300.0
Eduardo,Financas,CE,4500.0,40,2400.0
Mendes,Financas,RS,8000.0,36,1900.0
Kethlyn,Financas,RS,1200.0,53,1500.0
Thiago,Marketing,GO,1100.0,25,1800.0
Carla,Marketing,GO,2600.0,50,2100.0


**row_number Window Function**

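Returns a sequential row number, starting at 1, for each row within its partition, following the order defined by the window (here, ascending salary within each department).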

In [0]:
# Window reused by every example: one partition per department, ordered by ascending salary
w0 = Window.partitionBy(F.col("Departamento")).orderBy("Salario")
df.withColumn("row_number", F.row_number().over(w0)).display()

Nome,Departamento,Estado,Salario,Idade,Bonus,row_number
Kethlyn,Financas,RS,1200.0,53,1500.0,1
Maria,Financas,CE,3600.0,24,2300.0,2
Eduardo,Financas,CE,4500.0,40,2400.0,3
Mendes,Financas,RS,8000.0,36,1900.0,4
Thiago,Marketing,GO,1100.0,25,1800.0,1
Carla,Marketing,GO,2600.0,50,2100.0,2
Bruno,Vendas,SP,1100.0,30,2300.0,1
Kennedy,Vendas,CE,1200.0,56,2000.0,2
Anderson,Vendas,SP,1500.0,34,1000.0,3


**rank Window Function**

In [0]:
df.withColumn("rank", F.rank().over(w0)).display()

Nome,Departamento,Estado,Salario,Idade,Bonus,rank
Kethlyn,Financas,RS,1200.0,53,1500.0,1
Maria,Financas,CE,3600.0,24,2300.0,2
Eduardo,Financas,CE,4500.0,40,2400.0,3
Mendes,Financas,RS,8000.0,36,1900.0,4
Thiago,Marketing,GO,1100.0,25,1800.0,1
Carla,Marketing,GO,2600.0,50,2100.0,2
Bruno,Vendas,SP,1100.0,30,2300.0,1
Kennedy,Vendas,CE,1200.0,56,2000.0,2
Anderson,Vendas,SP,1500.0,34,1000.0,3


**dense_rank Window Function**

In [0]:
df.withColumn("dense_rank", F.dense_rank().over(w0)).display()

Nome,Departamento,Estado,Salario,Idade,Bonus,dense_rank
Kethlyn,Financas,RS,1200.0,53,1500.0,1
Maria,Financas,CE,3600.0,24,2300.0,2
Eduardo,Financas,CE,4500.0,40,2400.0,3
Mendes,Financas,RS,8000.0,36,1900.0,4
Thiago,Marketing,GO,1100.0,25,1800.0,1
Carla,Marketing,GO,2600.0,50,2100.0,2
Bruno,Vendas,SP,1100.0,30,2300.0,1
Kennedy,Vendas,CE,1200.0,56,2000.0,2
Anderson,Vendas,SP,1500.0,34,1000.0,3
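
The three outputs above are identical because no department has two equal salaries. The functions only diverge on ties. The following is a minimal plain-Python sketch (not Spark code) of that behavior; the helper name `ranking_columns` and the sample salaries are hypothetical and are not taken from the DataFrame above.

```python
# Plain-Python sketch of row_number, rank and dense_rank inside ONE partition
# whose values are already sorted in ascending order.
def ranking_columns(sorted_salaries):
    rows = []
    rank = 0
    dense = 0
    previous = object()  # sentinel that never compares equal to a salary
    for position, salary in enumerate(sorted_salaries, start=1):
        if salary != previous:
            rank = position   # rank jumps to the row position after a tie
            dense += 1        # dense_rank only ever advances by one
            previous = salary
        # position plays the role of row_number: always 1, 2, 3, ...
        rows.append((salary, position, rank, dense))
    return rows

# Hypothetical partition with a tie at 1200.0
for row in ranking_columns([1100.0, 1200.0, 1200.0, 1500.0]):
    print(row)
# (1100.0, 1, 1, 1)
# (1200.0, 2, 2, 2)
# (1200.0, 3, 2, 2)
# (1500.0, 4, 4, 3)
```

Each tuple is (salary, row_number, rank, dense_rank). After the tie, `rank` skips to 4 while `dense_rank` continues at 3. In Spark, which of two tied rows receives the lower `row_number` is not guaranteed unless the window ordering includes a tie-breaking column.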


**percent_rank Window Function**

In [0]:
df.withColumn("percent_rank", F.percent_rank().over(w0)).display()

Nome,Departamento,Estado,Salario,Idade,Bonus,percent_rank
Kethlyn,Financas,RS,1200.0,53,1500.0,0.0
Maria,Financas,CE,3600.0,24,2300.0,0.3333333333333333
Eduardo,Financas,CE,4500.0,40,2400.0,0.6666666666666666
Mendes,Financas,RS,8000.0,36,1900.0,1.0
Thiago,Marketing,GO,1100.0,25,1800.0,0.0
Carla,Marketing,GO,2600.0,50,2100.0,1.0
Bruno,Vendas,SP,1100.0,30,2300.0,0.0
Kennedy,Vendas,CE,1200.0,56,2000.0,0.5
Anderson,Vendas,SP,1500.0,34,1000.0,1.0
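
The values above follow the formula (rank - 1) / (rows in partition - 1). The following plain-Python sketch (not Spark code) reproduces them for one partition; the helper name `percent_rank` is chosen here for illustration, and a single-row partition is assumed to yield 0.0.

```python
# Plain-Python sketch of percent_rank inside ONE partition whose values
# are already sorted in ascending order.
def percent_rank(sorted_values):
    n = len(sorted_values)
    result = []
    rank = 0
    previous = object()  # sentinel that never compares equal to a value
    for position, value in enumerate(sorted_values, start=1):
        if value != previous:
            rank = position  # tied values share the rank of the first one
            previous = value
        # Guard against division by zero for a single-row partition
        result.append((rank - 1) / (n - 1) if n > 1 else 0.0)
    return result

# Salaries of the Financas department, in ascending order
print(percent_rank([1200.0, 3600.0, 4500.0, 8000.0]))
# [0.0, 0.3333333333333333, 0.6666666666666666, 1.0]
```

The first row of a partition always gets 0.0 and, when there are no ties at the top, the last row gets 1.0.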


**lag Window Function**

`lag` returns the value of the selected column from an earlier row in the same partition. In this example it looks two rows back in salary order and writes that salary to the `lag` column; rows with fewer than two preceding rows get null.

In [0]:
df.withColumn("lag", F.lag("Salario", 2).over(w0)).display()

Nome,Departamento,Estado,Salario,Idade,Bonus,lag
Kethlyn,Financas,RS,1200.0,53,1500.0,
Maria,Financas,CE,3600.0,24,2300.0,
Eduardo,Financas,CE,4500.0,40,2400.0,1200.0
Mendes,Financas,RS,8000.0,36,1900.0,3600.0
Thiago,Marketing,GO,1100.0,25,1800.0,
Carla,Marketing,GO,2600.0,50,2100.0,
Bruno,Vendas,SP,1100.0,30,2300.0,
Kennedy,Vendas,CE,1200.0,56,2000.0,
Anderson,Vendas,SP,1500.0,34,1000.0,1100.0
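
The offset logic can be sketched in plain Python (not Spark code) as a shifted lookup inside one partition. The helper names `lag` and `lead` mirror the PySpark functions but are hypothetical stand-ins that operate on a sorted list.

```python
# Plain-Python sketch of lag and lead inside ONE partition whose values
# are already in window order.
def lag(values, offset=1, default=None):
    # Row i reads the value at i - offset, or `default` when that
    # position falls outside the partition.
    return [
        values[i - offset] if 0 <= i - offset < len(values) else default
        for i in range(len(values))
    ]

def lead(values, offset=1, default=None):
    # lead is lag looking in the opposite direction
    return lag(values, -offset, default)

financas = [1200.0, 3600.0, 4500.0, 8000.0]  # Financas salaries, ascending
print(lag(financas, 2))   # [None, None, 1200.0, 3600.0]
print(lead(financas, 1))  # [3600.0, 4500.0, 8000.0, None]
```

`F.lag` and `F.lead` also accept an optional third argument, `default`, which replaces null for rows that have no row at the requested offset.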


**lead Window Function**

In [0]:
df.withColumn("lead", F.lead("Salario", 1).over(w0)).display()

Nome,Departamento,Estado,Salario,Idade,Bonus,lead
Kethlyn,Financas,RS,1200.0,53,1500.0,3600.0
Maria,Financas,CE,3600.0,24,2300.0,4500.0
Eduardo,Financas,CE,4500.0,40,2400.0,8000.0
Mendes,Financas,RS,8000.0,36,1900.0,
Thiago,Marketing,GO,1100.0,25,1800.0,2600.0
Carla,Marketing,GO,2600.0,50,2100.0,
Bruno,Vendas,SP,1100.0,30,2300.0,1200.0
Kennedy,Vendas,CE,1200.0,56,2000.0,1500.0
Anderson,Vendas,SP,1500.0,34,1000.0,


**Window Aggregate Function**

In [0]:
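# w0 is ordered, so each aggregate covers the rows from the start of the
# partition up to the current row (a running aggregate).
# The lowercase column names resolve because Spark is case-insensitive by default.
(df.withColumn("row", F.row_number().over(w0))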
   .withColumn("avg", F.avg(F.col("salario")).over(w0))
   .withColumn("sum", F.sum(F.col("salario")).over(w0))
   .withColumn("min", F.min(F.col("salario")).over(w0))
   .withColumn("max", F.max(F.col("salario")).over(w0))
   .select("row","departamento", "salario", "avg", "sum", "min", "max").display()
)

row,departamento,salario,avg,sum,min,max
1,Financas,1200.0,1200.0,1200.0,1200.0,1200.0
2,Financas,3600.0,2400.0,4800.0,1200.0,3600.0
3,Financas,4500.0,3100.0,9300.0,1200.0,4500.0
4,Financas,8000.0,4325.0,17300.0,1200.0,8000.0
1,Marketing,1100.0,1100.0,1100.0,1100.0,1100.0
2,Marketing,2600.0,1850.0,3700.0,1100.0,2600.0
1,Vendas,1100.0,1100.0,1100.0,1100.0,1100.0
2,Vendas,1200.0,1150.0,2300.0,1100.0,1200.0
3,Vendas,1500.0,1266.6666666666667,3800.0,1100.0,1500.0
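
The `avg`, `sum` and `max` columns grow row by row because `w0` has an `orderBy`: for an ordered window Spark defaults to a frame that runs from the start of the partition to the current row. The following plain-Python sketch (not Spark code) reproduces that cumulative behavior for one partition; the helper name `running_aggregates` is hypothetical, and the sketch assumes there are no tied salaries.

```python
from itertools import accumulate

# Plain-Python sketch of running avg, sum, min and max inside ONE partition
# whose values are already sorted in ascending order (no ties assumed).
def running_aggregates(sorted_salaries):
    sums = list(accumulate(sorted_salaries))        # cumulative sum
    mins = list(accumulate(sorted_salaries, min))   # smallest value so far
    maxs = list(accumulate(sorted_salaries, max))   # largest value so far
    avgs = [total / count for count, total in enumerate(sums, start=1)]
    return list(zip(sorted_salaries, avgs, sums, mins, maxs))

# Salaries of the Financas department, in ascending order
for row in running_aggregates([1200.0, 3600.0, 4500.0, 8000.0]):
    print(row)
# (1200.0, 1200.0, 1200.0, 1200.0, 1200.0)
# (3600.0, 2400.0, 4800.0, 1200.0, 3600.0)
# (4500.0, 3100.0, 9300.0, 1200.0, 4500.0)
# (8000.0, 4325.0, 17300.0, 1200.0, 8000.0)
```

Each tuple is (salario, avg, sum, min, max) and matches the Financas rows of the output above. To get one value per department on every row instead, a window without ordering such as `Window.partitionBy("Departamento")`, or an explicit frame via `rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)`, can be used.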
