## This notebook is part of the Apache Spark training delivered by CERN IT


Run this notebook from Jupyter with a Python kernel
- When running on CERN SWAN, do not attach the notebook to a Spark cluster, but rather run it locally on the SWAN container (which is the default)
- If running this outside CERN SWAN, please make sure to have PySpark installed: `pip install pyspark`


In order to run this notebook as slides:
 - on SWAN click on the button "Enter/Exit RISE slideshow" in the ribbon
 - on other environments, please make sure to have the RISE extension installed: `pip install RISE`

### Spark DataFrame Hands-On Lab
Contact: Luca.Canali@cern.ch

### Objective: Perform Basic DataFrame Operations
1. Creating DataFrames
2. Select columns
3. Add, rename and drop columns
4. Filtering rows
5. Aggregations

## Getting started: create the SparkSession

In [1]:
!pip install pyspark



In [2]:
# !pip install pyspark

from pyspark.sql import SparkSession

spark = (SparkSession.builder
          .master("local[*]")
          .appName("DataFrame HandsOn 1")
          .config("spark.ui.showConsoleProgress", "false")
          .getOrCreate()
        )

spark

The master `local[*]` runs Spark in local mode: the driver and the executor share a single JVM on the node that runs the notebook. The `*` tells Spark to use as many worker threads as there are logical cores available.
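
As a rough check of what `*` resolves to on a given machine, the logical core count can be read from plain Python. This is a minimal sketch, assuming the JVM sees the same cores as Python's `os.cpu_count()`; that is usually the case but not guaranteed, for example inside containers with CPU limits.

```python
import os

# Logical cores visible to this Python process; os.cpu_count() may return None
n_cores = os.cpu_count() or 1

# With master("local[*]") Spark is expected to use about this many worker threads
print(f"logical cores: {n_cores}")
print(f'roughly equivalent master setting: "local[{n_cores}]"')
```

Once the session exists, `spark.sparkContext.defaultParallelism` reports the value Spark actually picked.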

### Hands-On 1 - Construct a DataFrame from csv file
This demonstrates how to read a csv file and construct a DataFrame.  
We will use the online retail dataset from Kaggle, credits: https://www.kaggle.com/datasets/vijayuv/onlineretail


#### First, let's inspect the csv content

In [3]:
# Adjust the path below to point at the downloaded dataset; the command fails here because the file is not at this path

!gzip -cd ../data/online-retail-dataset.csv.gz 2>&1| head -n3

gzip: ../data/online-retail-dataset.csv.gz: No such file or directory


In [5]:
online_retail_schema="InvoiceNo int, StockCode string, Description string, Quantity int,\
InvoiceDate timestamp,UnitPrice float,CustomerId int, Country string"

In [6]:
df = (spark.read
        .option("header", "true")
        .option("timestampFormat", "M/d/yyyy H:m")
        .csv("OnlineRetail.csv",
             schema=online_retail_schema)
     )

#### Inspect the data

In [7]:
df.show(2, False)

+---------+---------+----------------------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description                       |Quantity|InvoiceDate        |UnitPrice|CustomerId|Country       |
+---------+---------+----------------------------------+--------+-------------------+---------+----------+--------------+
|536365   |85123A   |WHITE HANGING HEART T-LIGHT HOLDER|6       |2010-12-01 08:26:00|2.55     |17850     |United Kingdom|
|536365   |71053    |WHITE METAL LANTERN               |6       |2010-12-01 08:26:00|3.39     |17850     |United Kingdom|
+---------+---------+----------------------------------+--------+-------------------+---------+----------+--------------+
only showing top 2 rows



#### Show columns

In [8]:
df.printSchema()

root
 |-- InvoiceNo: integer (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: float (nullable = true)
 |-- CustomerId: integer (nullable = true)
 |-- Country: string (nullable = true)



### Hands-On 2 - Spark Transformations - select, add, rename and drop columns

Select dataframe columns

In [9]:
# select single column

df.select("Country").show(2)

+--------------+
|       Country|
+--------------+
|United Kingdom|
|United Kingdom|
+--------------+
only showing top 2 rows



Select multiple columns


In [10]:
df.select("StockCode","Description","UnitPrice").show(n=2, truncate=False)

+---------+----------------------------------+---------+
|StockCode|Description                       |UnitPrice|
+---------+----------------------------------+---------+
|85123A   |WHITE HANGING HEART T-LIGHT HOLDER|2.55     |
|71053    |WHITE METAL LANTERN               |3.39     |
+---------+----------------------------------+---------+
only showing top 2 rows



In [11]:
df.columns

['InvoiceNo',
 'StockCode',
 'Description',
 'Quantity',
 'InvoiceDate',
 'UnitPrice',
 'CustomerId',
 'Country']

In [12]:
# select first 5 columns
df.select(df.columns[0:5]).show(2)

+---------+---------+--------------------+--------+-------------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|
+---------+---------+--------------------+--------+-------------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|
+---------+---------+--------------------+--------+-------------------+
only showing top 2 rows



In [13]:
# selects all the original columns and adds a new column that specifies high value item
(df.selectExpr(
   "*", # all original columns
   "(UnitPrice > 100) as HighValueItem")
   .show(2)
)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+-------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerId|       Country|HighValueItem|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+-------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|     17850|United Kingdom|        false|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|     17850|United Kingdom|        false|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+-------------+
only showing top 2 rows



In [14]:
# aggregate over the whole DataFrame: total quantity and the sum of unit prices (cast to int)
(df.selectExpr(
  "sum(Quantity) as TotalQuantity",
  "cast(sum(UnitPrice) as int) as InventoryValue")
  .show()
)

+-------------+--------------+
|TotalQuantity|InventoryValue|
+-------------+--------------+
|      3283820|       1637281|
+-------------+--------------+



#### Adding, renaming and dropping columns

In [15]:
# add a new column called InvoiceValue
from pyspark.sql.functions import expr
df_1 = (df
        .withColumn("InvoiceValue", expr("UnitPrice * Quantity"))
        .select("InvoiceNo","Description","UnitPrice","Quantity","InvoiceValue")
       )
df_1.show(2, False)

# rename InvoiceValue to LineTotal
df_2 = df_1.withColumnRenamed("InvoiceValue","LineTotal")
df_2.show(2, False)

# drop a column
df_2.drop("LineTotal").show(2, False)

+---------+----------------------------------+---------+--------+------------+
|InvoiceNo|Description                       |UnitPrice|Quantity|InvoiceValue|
+---------+----------------------------------+---------+--------+------------+
|536365   |WHITE HANGING HEART T-LIGHT HOLDER|2.55     |6       |15.299999   |
|536365   |WHITE METAL LANTERN               |3.39     |6       |20.34       |
+---------+----------------------------------+---------+--------+------------+
only showing top 2 rows

+---------+----------------------------------+---------+--------+---------+
|InvoiceNo|Description                       |UnitPrice|Quantity|LineTotal|
+---------+----------------------------------+---------+--------+---------+
|536365   |WHITE HANGING HEART T-LIGHT HOLDER|2.55     |6       |15.299999|
|536365   |WHITE METAL LANTERN               |3.39     |6       |20.34    |
+---------+----------------------------------+---------+--------+---------+
only showing top 2 rows

+---------+---------

### Hands-On 3 - Spark Transformations - filter, sort and cast

In [16]:
from pyspark.sql.functions import col

# select invoice lines with quantity > 20 and unitprice > 50 (three equivalent forms)
df.where(col("Quantity") > 20).where(col("UnitPrice") > 50).show(2)
df.filter(df.Quantity > 20).filter(df.UnitPrice > 50).show(2)
df.filter("Quantity > 20 and UnitPrice > 50").show(2)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerId|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   556444|    22502|PICNIC BASKET WIC...|      60|2011-06-10 15:28:00|    649.5|     15098|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerId|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   556444|    22502|PICNIC BASKET WIC...|      60|2011-06-10 15:28:00|    649.5|     15098|United Kingdom|
+---------+---------+------

In [17]:
# select invoice lines with quantity > 100 or unitprice > 20
df.where((col("Quantity") > 100) | (col("UnitPrice") > 20)).show(2)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerId|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536378|    21212|PACK OF 72 RETROS...|     120|2010-12-01 09:37:00|     0.42|     14688|United Kingdom|
|     NULL|        D|            Discount|      -1|2010-12-01 09:41:00|     27.5|     14527|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 2 rows



In [18]:
from pyspark.sql.functions import desc, asc

# sort in the default order: ascending
df.orderBy(expr("UnitPrice")).show(2)

# sort by Quantity descending, then by UnitPrice ascending
df.orderBy(col("Quantity").desc(), col("UnitPrice").asc()).show(10)

+---------+---------+---------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|    Description|Quantity|        InvoiceDate|UnitPrice|CustomerId|       Country|
+---------+---------+---------------+--------+-------------------+---------+----------+--------------+
|     NULL|        B|Adjust bad debt|       1|2011-08-12 14:51:00|-11062.06|      NULL|United Kingdom|
|     NULL|        B|Adjust bad debt|       1|2011-08-12 14:52:00|-11062.06|      NULL|United Kingdom|
+---------+---------+---------------+--------+-------------------+---------+----------+--------------+
only showing top 2 rows

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerId|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   541431|    23166|MEDIUM CERAM

### Hands-On 4 - Spark Transformations - aggregations
Full list of built-in functions: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions

In [19]:
%%time
# Count distinct customers
from pyspark.sql.functions import countDistinct
df.select(countDistinct("CustomerID")).show()

+--------------------------+
|count(DISTINCT CustomerID)|
+--------------------------+
|                      3444|
+--------------------------+

CPU times: user 21.3 ms, sys: 3.14 ms, total: 24.4 ms
Wall time: 2.79 s


In [20]:
%%time
# approximate count of distinct customers (second argument: maximum relative standard deviation allowed)
from pyspark.sql.functions import approx_count_distinct
df.select(approx_count_distinct("CustomerID", 0.1)).show()

+---------------------------------+
|approx_count_distinct(CustomerID)|
+---------------------------------+
|                             3286|
+---------------------------------+

CPU times: user 10.4 ms, sys: 1.42 ms, total: 11.8 ms
Wall time: 1.62 s
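
To put the two results side by side, the relative error of the approximate count can be worked out from the numbers printed above. This is a small arithmetic sketch; note that `rsd` is a statistical target for the estimator, not a hard bound on any single result.

```python
# Counts taken from the two cell outputs above
exact_count = 3444    # countDistinct("CustomerID")
approx_count = 3286   # approx_count_distinct("CustomerID", 0.1)

# The second argument of approx_count_distinct is rsd,
# the maximum relative standard deviation allowed
rsd = 0.1

relative_error = abs(approx_count - exact_count) / exact_count
print(f"relative error: {relative_error:.3f}")
print(f"within rsd: {relative_error <= rsd}")
```

Here the approximate answer is off by about 4.6% and took less wall time in this run; a smaller `rsd` trades speed and memory for accuracy.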


In [21]:
# average, maximum and minimum purchase quantity
# note: this import shadows Python's built-in max and min in the notebook
from pyspark.sql.functions import avg, max, min
( df.select(
    avg("Quantity").alias("avg_purchases"),
    max("Quantity").alias("max_purchases"),
    min("Quantity").alias("min_purchases"))
   .show()
)

+-----------------+-------------+-------------+
|    avg_purchases|max_purchases|min_purchases|
+-----------------+-------------+-------------+
|9.688956816277395|        74215|       -74215|
+-----------------+-------------+-------------+



### Hands-On 5 - Spark Transformations - grouping and windows

In [22]:
# number of lines per invoice and customer
df.groupBy("InvoiceNo", "CustomerId").count().show(5)

# grouping with expressions
df.groupBy("InvoiceNo").agg(expr("avg(Quantity)"),expr("stddev_pop(Quantity)"))\
  .show(5)

+---------+----------+-----+
|InvoiceNo|CustomerId|count|
+---------+----------+-----+
|   536573|     17025|    4|
|   537228|     17677|    1|
|   537419|     13495|   14|
|   538093|     12682|   33|
|   538648|     17937|    5|
+---------+----------+-----+
only showing top 5 rows

+---------+------------------+--------------------+
|InvoiceNo|     avg(Quantity)|stddev_pop(Quantity)|
+---------+------------------+--------------------+
|   536532| 25.36986301369863|  16.850272831671976|
|   537632|               1.0|                 0.0|
|   538708| 10.61111111111111|   7.150282736359209|
|   538877|14.258278145695364|   27.56989037543246|
|   538993| 9.333333333333334|   2.748737083745107|
+---------+------------------+--------------------+
only showing top 5 rows



### Read the csv file into a DataFrame

`%%time` is an IPython magic: https://ipython.readthedocs.io/en/stable/interactive/magics.html


It is possible to read files without specifying the schema. Some file formats (Parquet is one of them) store the schema with the data, so Spark can read them directly. For formats without an embedded schema (csv, json...), Spark can infer the schema from the data. Let's compare the two approaches in terms of time and results:

In [23]:
online_retail_schema="InvoiceNo int, StockCode string, Description string, Quantity int,\
InvoiceDate timestamp,UnitPrice float,CustomerId int, Country string"

In [25]:
%%time
df = spark.read \
        .option("header", "true") \
        .option("timestampFormat", "M/d/yyyy H:m")\
        .csv("OnlineRetail.csv",
             schema=online_retail_schema)

CPU times: user 3.05 ms, sys: 1.03 ms, total: 4.08 ms
Wall time: 34.9 ms


In [26]:
%%time
df_infer = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .csv("OnlineRetail.csv")

CPU times: user 26.9 ms, sys: 3.89 ms, total: 30.8 ms
Wall time: 4.81 s
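
The timing gap is consistent with schema inference needing an extra pass over the file before any type is known, whereas a declared schema is simply applied as rows are read. The plain-Python sketch below illustrates the idea, and also how the two approaches can disagree on results. It is not Spark's inference algorithm; the helper names `infer_type` and `cast_to_int` and the sample values are hypothetical.

```python
def infer_type(values):
    """Toy inference: scan every value and return "int" only if all of them parse."""
    try:
        for v in values:
            int(v)
        return "int"
    except ValueError:
        return "string"


def cast_to_int(values):
    """Toy declared schema: force int and turn values that do not parse into None."""
    result = []
    for v in values:
        try:
            result.append(int(v))
        except ValueError:
            result.append(None)
    return result


# Hypothetical raw column: two numeric identifiers and one with a letter prefix
raw_column = ["536365", "536366", "C536379"]

inferred = infer_type(raw_column)   # has to look at all the values first
declared = cast_to_int(raw_column)  # no scan needed up front, but may produce nulls

print(inferred)
print(declared)
```

In the notebook, comparing `df.printSchema()` with `df_infer.printSchema()` shows which columns the two reads disagree on. A mismatch of this kind could explain the NULL `InvoiceNo` values seen in earlier outputs, though that is an assumption worth checking against `df_infer`.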


## Exercises

Reminder: documentation at
https://spark.apache.org/docs/latest/api/python/index.html

If you didn't run the previous cells, run the following one:

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .master("local[*]") \
        .appName("DataFrame HandsOn 1") \
        .config("spark.ui.showConsoleProgress","false") \
        .getOrCreate()

online_retail_schema="InvoiceNo int, StockCode string, Description string, Quantity int,\
InvoiceDate timestamp,UnitPrice float,CustomerId int, Country string"

df = spark.read \
        .option("header", "true") \
        .option("timestampFormat", "M/d/yyyy H:m")\
        .csv("../data/online-retail-dataset.csv.gz",
             schema=online_retail_schema)

Task: Show 5 lines of the "Description" column

In [27]:
df.select("Description").show(5)

+--------------------+
|         Description|
+--------------------+
|WHITE HANGING HEA...|
| WHITE METAL LANTERN|
|CREAM CUPID HEART...|
|KNITTED UNION FLA...|
|RED WOOLLY HOTTIE...|
+--------------------+
only showing top 5 rows



Task: Count the number of distinct invoices in the dataframe

In [28]:
df.select("InvoiceNo").distinct().count()

22062

Task: Find out in which month most invoices have been issued

In [33]:
from pyspark.sql.functions import col,countDistinct, to_timestamp, month

# InvoiceDate timestamp format conversion
# (the column was already read as a timestamp via the schema, so its values are unchanged here)
df = df.withColumn("InvoiceDate", to_timestamp(col("InvoiceDate"), "M/d/yyyy H:m"))
# getting month details
df = df.withColumn("Month", month(col("InvoiceDate")))

# checking new column
df.printSchema()

# validating rows to see the data
df.select("InvoiceDate", "Month").show(5)

# getting distinct invoices per month
invoice_counts = (df.groupBy("Month")
                    .agg(countDistinct("InvoiceNo").alias("InvoiceNoCount"))
                    .orderBy(col("InvoiceNoCount").desc())
                 )

# final result
invoice_counts.show()


root
 |-- InvoiceNo: integer (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: float (nullable = true)
 |-- CustomerId: integer (nullable = true)
 |-- Country: string (nullable = true)
 |-- Month: integer (nullable = true)

+-------------------+-----+
|        InvoiceDate|Month|
+-------------------+-----+
|2010-12-01 08:26:00|   12|
|2010-12-01 08:26:00|   12|
|2010-12-01 08:26:00|   12|
|2010-12-01 08:26:00|   12|
|2010-12-01 08:26:00|   12|
+-------------------+-----+
only showing top 5 rows

+-----+--------------+
|Month|InvoiceNoCount|
+-----+--------------+
|   11|          3021|
|   12|          2568|
|   10|          2275|
|    9|          1994|
|    5|          1848|
|    6|          1683|
|    3|          1665|
|    7|          1657|
|    4|          1504|
|    8|          1456|
|    1|          1216|
|    2|          1174|

Task: Filter the lines where the Quantity is more than 30

In [34]:
df.filter(col("Quantity") > 30).show(5)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+-----+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerId|       Country|Month|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+-----+
|   536367|    84879|ASSORTED COLOUR B...|      32|2010-12-01 08:34:00|     1.69|     13047|United Kingdom|   12|
|   536370|    10002|INFLATABLE POLITI...|      48|2010-12-01 08:45:00|     0.85|     12583|        France|   12|
|   536370|    22492|MINI PAINT SET VI...|      36|2010-12-01 08:45:00|     0.65|     12583|        France|   12|
|   536371|    22086|PAPER CHAIN KIT 5...|      80|2010-12-01 09:00:00|     2.55|     13748|United Kingdom|   12|
|   536374|    21258|VICTORIAN SEWING ...|      32|2010-12-01 09:09:00|    10.95|     15100|United Kingdom|   12|
+---------+---------+--------------------+--------+-------------------+---------+-------

Task: Show the four most sold items (by quantity)

Bonus question: why do these two operations return different results? Hint: look at the documentation

In [35]:
print(df.select("InvoiceNo").distinct().count())
from pyspark.sql.functions import countDistinct
df.select(countDistinct("InvoiceNo")).show()

22062
+-------------------------+
|count(DISTINCT InvoiceNo)|
+-------------------------+
|                    22061|
+-------------------------+



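As per the documentation, `distinct().count()` keeps null as one of the distinct values and counts it, whereas `countDistinct` ignores null values. This accounts for the difference of one between the two results.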
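
The same distinction can be reproduced in plain Python, with `None` standing in for a Spark null. This is only an analogy on hypothetical values, not the Spark implementation.

```python
# Hypothetical invoice numbers; None stands in for a Spark null
invoice_numbers = [536365, 536365, 536366, None, None, 536367]

# Analogous to df.select("InvoiceNo").distinct().count():
# null is kept as one of the distinct values
distinct_then_count = len(set(invoice_numbers))

# Analogous to countDistinct("InvoiceNo"):
# nulls are dropped before counting
count_distinct = len({v for v in invoice_numbers if v is not None})

print(distinct_then_count, count_distinct)
```

However many nulls the column holds, they collapse into a single distinct value, so the two counts differ by at most one.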

In [40]:
from pyspark.sql.functions import col
print("Null Show")
df.filter(col("InvoiceNo").isNull()).show(5)
print("Total Count")
print(df.select("InvoiceNo").count())
print("Null value count")
df.filter("InvoiceNo IS NULL").count()

Null Show
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+-----+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerId|       Country|Month|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+-----+
|     NULL|        D|            Discount|      -1|2010-12-01 09:41:00|     27.5|     14527|United Kingdom|   12|
|     NULL|   35004C|SET OF 3 COLOURED...|      -1|2010-12-01 09:49:00|     4.65|     15311|United Kingdom|   12|
|     NULL|    22556|PLASTERS IN TIN C...|     -12|2010-12-01 10:24:00|     1.65|     17548|United Kingdom|   12|
|     NULL|    21984|PACK OF 12 PINK P...|     -24|2010-12-01 10:24:00|     0.29|     17548|United Kingdom|   12|
|     NULL|    21983|PACK OF 12 BLUE P...|     -24|2010-12-01 10:24:00|     0.29|     17548|United Kingdom|   12|
+---------+---------+--------------------+--------+-------------------+-------

9291