- Because caching is lazy, the data is cached in memory only when you first run an action on the DataFrame.

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *

spark = SparkSession.builder.appName("chapter-19-perf").getOrCreate()

In [2]:
import os
SPARK_BOOK_DATA_PATH = os.environ['SPARK_BOOK_DATA_PATH']

file_path = SPARK_BOOK_DATA_PATH + "/data/flight-data/csv/2015-summary.csv"

In [3]:
DF1 = (spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load(file_path))

In [4]:
DF1.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+
only showing top 5 rows



In [5]:
DF1.storageLevel

StorageLevel(False, False, False, False, 1)

A DataFrame that has not been cached reports `StorageLevel(False, False, False, False, 1)`: no disk, no memory, no off-heap storage, not deserialized, one replica.

In [6]:
%%time
result = DF1.groupBy("DEST_COUNTRY_NAME").count().collect() 

CPU times: user 4.87 ms, sys: 2.25 ms, total: 7.12 ms
Wall time: 1.57 s


In [7]:
DF1.cache()

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: int]

In [8]:
DF1.storageLevel

StorageLevel(True, True, False, True, 1)

After `cache()`, the storage level is `StorageLevel(True, True, False, True, 1)`: disk and memory enabled, no off-heap storage, deserialized, one replica.

In [9]:
DF1.count()

256

In [10]:
DF1.is_cached

True
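The cache-then-action sequence above can be folded into one helper. This is a minimal sketch, not part of the original notebook: `cache_and_materialize` is a hypothetical name, and it relies only on the `cache()` and `count()` methods already used above. One caveat: PySpark sets `is_cached` as soon as `cache()` is called, so `is_cached == True` shows that caching was requested, not that the data has been materialized. An action such as `count()`, which scans every partition, is what fills the cache.

```python
def cache_and_materialize(df):
    """Mark df for caching, then run an action so the cache is actually filled.

    Hypothetical helper: works with any object exposing cache() and count(),
    such as a PySpark DataFrame.
    """
    df.cache()          # lazy: only records the requested storage level
    rows = df.count()   # action: scans every partition and populates the cache
    return df, rows
```

With the notebook's DataFrame this would be called as `DF1, n_rows = cache_and_materialize(DF1)`.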

In [12]:
%%time
result = DF1.groupBy("DEST_COUNTRY_NAME").count().collect()    # first run with DF1 cached

CPU times: user 5.86 ms, sys: 442 µs, total: 6.3 ms
Wall time: 563 ms


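With `DF1` cached, the same `groupBy` finished in 563 ms of wall time in this run, versus 1.57 s for the first, uncached run. The CPU times reported by `%%time` cover only the Python driver process, so wall time is the more meaningful figure. Part of the gap is likely one-time warm-up cost rather than caching: the run after `unpersist()` below takes 493 ms.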

In [13]:
%%time
result = DF1.groupBy("DEST_COUNTRY_NAME").count().collect()    # second run with DF1 cached

CPU times: user 0 ns, sys: 5.55 ms, total: 5.55 ms
Wall time: 513 ms


In [14]:
DF1.unpersist()

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: int]

In [15]:
DF1.is_cached

False

In [16]:
%%time
result = DF1.groupBy("DEST_COUNTRY_NAME").count().collect()    # after unpersist(), DF1 is no longer cached

CPU times: user 1.7 ms, sys: 4.45 ms, total: 6.15 ms
Wall time: 493 ms
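Single `%%time` measurements like the ones above are noisy, especially on a 256-row dataset. A minimal sketch of a repeated-timing helper follows; `time_action` and `summarize` are hypothetical names, not part of Spark or of the original notebook, and they use only the Python standard library.

```python
import statistics
import time

def time_action(action, repeats=5):
    """Run a zero-argument callable `repeats` times; return wall times in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings

def summarize(timings):
    """Return (min, median) of a list of wall times."""
    return min(timings), statistics.median(timings)
```

In the notebook, the cached and uncached cases could each be measured with `summarize(time_action(lambda: DF1.groupBy("DEST_COUNTRY_NAME").count().collect()))`. Comparing medians over several runs reduces the influence of one-time warm-up on the first query.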


#### persist(storageLevel)

https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/storage/StorageLevel.html

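If no storage level is specified, `persist()` defaults to `MEMORY_AND_DISK`. Note that the level reported for the cached DataFrame above, `StorageLevel(True, True, False, True, 1)`, has the deserialized flag set, whereas the PySpark constant `StorageLevel.MEMORY_AND_DISK` prints with it unset.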

storageLevel parameters: 
- useDisk
- useMemory
- useOffHeap
- deserialized
- replication (1 or 2)


|Storage Level | Equivalent | Description|
|---------------|---------------|------------|
|`MEMORY_AND_DISK` (default for DataFrames) | StorageLevel(True, True, False, False, 1) | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.|
|`MEMORY_ONLY` | StorageLevel(False, True, False, False, 1) | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level for `RDD.cache()`.|
|`DISK_ONLY` | StorageLevel(True, False, False, False, 1) | Store the RDD partitions only on disk.|
|`MEMORY_AND_DISK_2` | StorageLevel(True, True, False, False, 2) | Same as `MEMORY_AND_DISK`, but replicate each partition on two cluster nodes.|
|`MEMORY_ONLY_2` | StorageLevel(False, True, False, False, 2) | Same as `MEMORY_ONLY`, but replicate each partition on two cluster nodes.|
|`OFF_HEAP` | StorageLevel(True, True, True, False, 1) | Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to be enabled.|
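The notebook only calls `cache()`, so here is a minimal sketch of `persist()` with an explicit level. `persist_with_level` is a hypothetical helper name. It also rests on an assumption worth checking against your Spark version: that calling `persist()` on a DataFrame that is already cached leaves the existing level in place, which is why the helper calls `unpersist()` first.

```python
def persist_with_level(df, level):
    """Persist df at `level`, dropping any existing cache entry first.

    Hypothetical helper; `level` would be a pyspark.StorageLevel constant
    such as StorageLevel.DISK_ONLY.
    """
    if df.is_cached:
        df.unpersist()   # assumption: otherwise the previous level would be kept
    df.persist(level)    # lazy, like cache(): nothing is stored yet
    df.count()           # action that materializes the data at the new level
    return df
```

After `from pyspark import StorageLevel`, a call such as `persist_with_level(DF1, StorageLevel.DISK_ONLY)` should leave `DF1.storageLevel` reporting the disk-only flags listed in the table.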

In [16]:
from pyspark import StorageLevel

In [24]:
StorageLevel.MEMORY_AND_DISK

StorageLevel(True, True, False, False, 1)

In [27]:
StorageLevel.MEMORY_AND_DISK_2

StorageLevel(True, True, False, False, 2)

In [25]:
StorageLevel.MEMORY_ONLY

StorageLevel(False, True, False, False, 1)

In [28]:
StorageLevel.MEMORY_ONLY_2

StorageLevel(False, True, False, False, 2)

In [26]:
StorageLevel.DISK_ONLY

StorageLevel(True, False, False, False, 1)

In [29]:
StorageLevel.DISK_ONLY_2

StorageLevel(True, False, False, False, 2)

In [30]:
StorageLevel.OFF_HEAP

StorageLevel(True, True, True, False, 1)
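The five positional flags printed above are easy to misread. The following sketch decodes them into words. `describe_level` is a hypothetical helper, not a PySpark function; it reads the `useDisk`, `useMemory`, `useOffHeap`, `deserialized`, and `replication` attributes that `StorageLevel` objects expose.

```python
def describe_level(level):
    """Turn StorageLevel flags into a short readable description."""
    places = []
    if level.useMemory:
        places.append("off-heap memory" if level.useOffHeap else "memory")
    if level.useDisk:
        places.append("disk")
    if not places:
        return "not persisted"
    form = "deserialized" if level.deserialized else "serialized"
    return f"{' and '.join(places)}, {form}, replication={level.replication}"
```

In the notebook, `describe_level(DF1.storageLevel)` or `describe_level(StorageLevel.DISK_ONLY_2)` would give a one-line summary of where and how the data is kept.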

### [Apache Spark Optimization Toolkit](https://towardsdatascience.com/apache-spark-optimization-toolkit-17cf3e491992)

In [17]:
file_path = SPARK_BOOK_DATA_PATH + "/data/retail-data/by-day/2010-12-01.csv"
# Original loading code that does *not* cache DataFrame
df1 = spark.read.format("csv")\
  .option("inferSchema", "true")\
  .option("header", "true")\
  .load(file_path)

In [18]:
(
df1.withColumn("partition_id", F.spark_partition_id())
  .groupBy("partition_id")
  .count().show()
)

+------------+-----+
|partition_id|count|
+------------+-----+
|           0| 3108|
+------------+-----+



In [17]:
spark.stop()