In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").getOrCreate()

Window Functions or Windowed Aggregates
Window functions in PySpark compute a value for every row from a group of related rows, called the window. Unlike a groupBy aggregation, the rows are not collapsed: each row is kept and receives its own result. A classic example is computing aggregates for a user across different sessions.

PySpark supports three types of window functions:
• Aggregations
• Ranking
• Analytics

In [3]:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

In [5]:
df = spark.read.options(delimiter=',', inferSchema=True, header=True).csv("data/Invistico_Airline.csv")

In [6]:
# No partitionBy here: Spark moves all rows into a single partition to order them globally
win = Window.orderBy(df['Flight Distance'].desc())

In [7]:
# withColumn already names the new column, so no alias is needed
df = df.withColumn('rank', row_number().over(win))

In [8]:
df.show(5)

+------------+------+-----------------+---+---------------+--------+---------------+------------+---------------------------------+--------------+-------------+---------------------+----------------------+--------------+----------------------+----------------+----------------+----------------+---------------+-----------+---------------+--------------------------+------------------------+----+
|satisfaction|Gender|    Customer Type|Age| Type of Travel|   Class|Flight Distance|Seat comfort|Departure/Arrival time convenient|Food and drink|Gate location|Inflight wifi service|Inflight entertainment|Online support|Ease of Online booking|On-board service|Leg room service|Baggage handling|Checkin service|Cleanliness|Online boarding|Departure Delay in Minutes|Arrival Delay in Minutes|rank|
+------------+------+-----------------+---+---------------+--------+---------------+------------+---------------------------------+--------------+-------------+---------------------+----------------------+---

A common requirement is to find the top three rows within each category, for example the three longest flights in each Class. Partitioning the window by the category column makes the numbering restart for every category.

In [10]:
win_1 = Window.partitionBy("Class").orderBy(df['Flight Distance'].desc())

In [11]:
# Overwrites the earlier global rank with a rank computed separately within each Class
df = df.withColumn('rank', row_number().over(win_1))

The rank column now holds each row's position within its Class, so we can keep the top three rows for each Class. As a quick check, every rank value should appear once per Class, which is three times in this dataset.

In [12]:
df.groupBy('rank').count().orderBy('rank').show()

+----+-----+
|rank|count|
+----+-----+
|   1|    3|
|   2|    3|
|   3|    3|
|   4|    3|
|   5|    3|
|   6|    3|
|   7|    3|
|   8|    3|
|   9|    3|
|  10|    3|
|  11|    3|
|  12|    3|
|  13|    3|
|  14|    3|
|  15|    3|
|  16|    3|
|  17|    3|
|  18|    3|
|  19|    3|
|  20|    3|
+----+-----+
only showing top 20 rows



In [13]:
df.filter(col('rank') < 4).show()

+------------+------+-----------------+---+---------------+--------+---------------+------------+---------------------------------+--------------+-------------+---------------------+----------------------+--------------+----------------------+----------------+----------------+----------------+---------------+-----------+---------------+--------------------------+------------------------+----+
|satisfaction|Gender|    Customer Type|Age| Type of Travel|   Class|Flight Distance|Seat comfort|Departure/Arrival time convenient|Food and drink|Gate location|Inflight wifi service|Inflight entertainment|Online support|Ease of Online booking|On-board service|Leg room service|Baggage handling|Checkin service|Cleanliness|Online boarding|Departure Delay in Minutes|Arrival Delay in Minutes|rank|
+------------+------+-----------------+---+---------------+--------+---------------+------------+---------------------------------+--------------+-------------+---------------------+----------------------+---