# Caching

**Technical Accomplishments:**
* Understand how caching works
* Explore the various caching mechanisms
* Discuss tips for the best use of the cache

## A Fresh Start
Before starting this section, we need to clear the existing cache.

There are several ways to accomplish this:
  * Remove each cache one by one, which is fairly problematic
  * Restart the cluster - takes a fair while to come back online
  * Just blow the entire cache away - this will affect each and every user on the cluster!!

In [None]:
#!!! DO NOT RUN THIS ON A SHARED CLUSTER !!!

# spark.catalog.clearCache()

#!!! It will delete your cache and your co-workers' caches !!!

This will ensure that any caches produced by other exercises will be removed.

Next, open the **Spark UI** and go to the **Storage** tab - it should be empty.

## The Data Source

This lesson uses the **Pageviews By Seconds** data set.

The data is stored in HDFS as a tab-separated file at **/home/Downloads/data/pageviews_by_second.tsv**, the path the next cell reads.

In [None]:
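# Import only the types the schema needs; a wildcard import of
# pyspark.sql.functions would shadow Python built-ins such as sum() and max().
from pyspark.sql.types import StructType, StructField, StringType, IntegerType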

schema = StructType(
  [
    StructField("timestamp", StringType(), False),
    StructField("site", StringType(), False),
    StructField("requests", IntegerType(), False)
  ]
)

fileName = "/home/Downloads/data/pageviews_by_second.tsv"

pageviewsDF = (spark.read
  .option("header", "true")
  .option("sep", "\t")
  .schema(schema)
  .csv(fileName)
)

The 255 MB pageviews data is currently in HDFS, which means each time you scan through it, your Spark cluster has to read the 255 MB of data remotely over the network.

Once again, use the `count()` action to scan the entire 255 MB file and count the total number of rows in the data set:

In [None]:
total = pageviewsDF.count()

print("Record Count: {0:,}".format( total ))

Record Count: 7,200,000


The pageviews DataFrame contains 7.2 million rows.

Make a note of how long the previous operation takes.

Re-run it several times to establish an average.
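Averaging by hand gets tedious, so here is a minimal timing sketch. `time_action` is a hypothetical helper name (not part of Spark); it takes any zero-argument callable, so it works for `count()` or for any other action.

```python
import time

def time_action(action, runs=3):
    """Run a zero-argument callable `runs` times.

    Returns (result_of_last_run, average_seconds_per_run).
    """
    durations = []
    result = None
    for _ in range(runs):
        start = time.perf_counter()
        result = action()
        durations.append(time.perf_counter() - start)
    return result, sum(durations) / len(durations)

# In the notebook (assumes `pageviewsDF` from the cell above):
# total, avg_seconds = time_action(lambda: pageviewsDF.count())
# print("Record Count: {0:,} (avg {1:.2f}s)".format(total, avg_seconds))
```

The first run often includes one-off start-up costs, so a higher `runs` value gives a steadier average.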

Now let's try a slightly more complicated operation, such as sorting, which induces an "expensive" shuffle.

In [None]:
(pageviewsDF
  .orderBy("requests")
  .count()
)

7200000

Again, make a note of how long the operation takes.

Rerun it several times to get an average time.

Every time we re-run these operations, Spark goes all the way back to the original data store.

This requires pulling all the data across the network for every execution.

In most cases, this network I/O is the most expensive part of a job.

## cache()

We can avoid all of this overhead by caching the data on the executors.

Just go ahead and run the following command.

Don't forget to make a note of how long it takes to execute.

In [None]:
pageviewsDF.cache()

DataFrame[timestamp: string, site: string, requests: int]

The `cache(..)` operation does nothing elaborate; it only marks a `DataFrame` as cacheable.

And while it does return an instance of `DataFrame`, it is not technically a transformation or an action.

To actually cache the data, Spark has to process every single record.

As Spark processes each record, the cache is materialized.

A very common method for materializing the cache is to execute a `count()`.

**BUT BEFORE YOU DO**, check the **Storage** tab in the **Spark UI** to make sure it's still empty even after calling `cache()`.
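The lazy marking can also be checked from code. This is a minimal sketch, assuming `df` is any PySpark `DataFrame`; `cache_status` is a hypothetical helper name, not part of Spark. It reads `DataFrame.is_cached` and `DataFrame.storageLevel`, which report what was requested, not whether the cache has been materialized, so the **Storage** tab is still the place to confirm that.

```python
def cache_status(df):
    """Summarize how a DataFrame is marked for caching (not what is stored)."""
    level = df.storageLevel  # the storage level requested for this DataFrame
    return {
        "is_cached": df.is_cached,        # True once cache()/persist() was called
        "use_memory": level.useMemory,
        "use_disk": level.useDisk,
        "replication": level.replication,
    }

# In the notebook (assumes `pageviewsDF` from the cells above):
# print(cache_status(pageviewsDF))
```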

In [None]:
pageviewsDF.count()

7200000

This last `count()` will take a little longer than normal.

It has to read the data as before and also do the work of materializing the cache.

Now the `pageviewsDF` is cached **AND** the cache has been materialized.

Before we rerun our queries, check the **Spark UI** and the **Storage** tab.

Now, run the two queries and compare their execution time to the ones above.

In [None]:
pageviewsDF.count()

7200000

In [None]:
(pageviewsDF
  .orderBy("requests")
  .count()
)

7200000

Was it faster?

All of our data is being stored in cache on the executors.

We are no longer making network calls. Our plain `count()` should be sub-second. Our `orderBy()` & `count()` should be around 3-4 seconds.

## Spark UI - Storage

Now that the pageviews `DataFrame` is cached in memory, let's review the **Spark UI** in more detail.

In the **RDDs** table, you should see only one record - multiple if you re-run the `cache()` operation.

Let's review the **Spark UI**'s **Storage** in detail:

* RDD Name
* Storage Level
* Cached Partitions
* Fraction Cached
* Size in Memory
* Size on Disk

Now, dig deeper into the storage details.

Click on the link provided in the **RDD Name** column to open the **RDD Storage Info**.

Review the **RDD Storage Info**:

* Size in Memory
* Size on Disk
* Executors

If you recall:

* We should have 8 partitions.
* With 255MB of data divided into 8 partitions.
* The first seven partitions should be 32MB each.
* The last partition holds the remainder (255 - 7 × 32 = 31MB), so it will be slightly smaller than the others.
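Taking the figures in the list at face value (255 MB split into 32 MB partitions), the arithmetic can be sketched as follows. This is only arithmetic; the sizes Spark actually produces depend on its file-split settings and on the cluster, so treat the numbers as an approximation.

```python
total_mb = 255      # size of the pageviews file, from the text above
partition_mb = 32   # nominal partition size, from the text above

# divmod gives the number of full partitions and the leftover megabytes
full_partitions, remainder_mb = divmod(total_mb, partition_mb)

sizes_mb = [partition_mb] * full_partitions
if remainder_mb:
    sizes_mb.append(remainder_mb)  # the last partition holds what is left

print("Partitions: {0}, sizes (MB): {1}".format(len(sizes_mb), sizes_mb))
```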

**Question:** Why is the **Size in Memory** nowhere near 32MB?

**Question:** What is the difference between **Size in Memory** and **Size on Disk**?

## persist()

`cache()` is just an alias for `persist()`.

Let's take a look at the API docs for:

* `Dataset.persist()` - Scala
* `DataFrame.persist()` - Python

`persist()` accepts an additional parameter, the storage level, which indicates how the data is cached:

* DISK_ONLY
* DISK_ONLY_2
* MEMORY_AND_DISK
* MEMORY_AND_DISK_2
* MEMORY_AND_DISK_SER
* MEMORY_AND_DISK_SER_2
* MEMORY_ONLY
* MEMORY_ONLY_2
* MEMORY_ONLY_SER
* MEMORY_ONLY_SER_2
* OFF_HEAP

***Note:*** *The default storage level for:*

* *RDDs is **MEMORY_ONLY**.*
* *DataFrames is **MEMORY_AND_DISK**.*
* *Streaming is **MEMORY_AND_DISK_2**.*

Before we can use the various storage levels, it's necessary to import the enumerations...

In [None]:
from pyspark import StorageLevel

**Question:** How do we purge data from the cache?

`unpersist()` or `uncache()`?

Want to try it?

In [None]:
# Only one of these exists on a DataFrame; the other raises AttributeError.
# pageviewsDF.uncache()
# pageviewsDF.unpersist()

Real quick, go check the **Storage** tab in the **Spark UI** and confirm that the cache has been removed.
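With the cache cleared, a different storage level can be tried. This is a minimal sketch, assuming `df` is a PySpark `DataFrame` and `level` is one of the `StorageLevel` constants; `repersist` is a hypothetical helper name. It removes any existing cache first because, as far as I know, asking Spark to cache a `DataFrame` that is already cached leaves the old storage level in place.

```python
def repersist(df, level):
    """Drop any existing cache for `df`, persist it at `level`, materialize it."""
    df.unpersist(blocking=True)  # wait until the old cached blocks are removed
    df.persist(level)            # only marks the DataFrame; nothing is stored yet
    df.count()                   # an action over every row materializes the cache
    return df.storageLevel       # the level now in effect

# In the notebook (assumes `pageviewsDF` and the StorageLevel import above):
# print(repersist(pageviewsDF, StorageLevel.DISK_ONLY))
```

After running it, the **Storage** tab should show the new level in the **Storage Level** column.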

**Question:** What will happen if you use 75% of the cache and then I come along and try to use 50% (of the total)...

* with **MEMORY_ONLY**?
* with **MEMORY_AND_DISK**?
* with **DISK_ONLY**?

## RDD Name

If you haven't noticed yet, the **RDD Name** on the **Storage** tab in the **Spark UI** is a big ugly name.

It's a bit hacky, but there is a workaround for assigning a name:

1. Create your `DataFrame`.
2. From that `DataFrame`, create a temporary view with any name.
3. Cache the table specifically via the `SparkSession` and its `Catalog`.
4. Materialize the cache.

In [None]:
pageviewsDF.unpersist()

pageviewsDF.createOrReplaceTempView("Pageviews_DF_Python")
spark.catalog.cacheTable("Pageviews_DF_Python")

pageviewsDF.count()

7200000

And now to clean up after ourselves:

In [None]:
pageviewsDF.unpersist()

DataFrame[timestamp: string, site: string, requests: int]
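The `unpersist()` call removes the cached data but leaves the temporary view registered. A minimal sketch of a fuller cleanup, assuming `spark` is the active `SparkSession`; `drop_named_cache` is a hypothetical helper name built on the `Catalog` methods `isCached`, `uncacheTable`, and `dropTempView`.

```python
def drop_named_cache(spark, view_name):
    """Uncache a temporary view if it is cached, then drop the view.

    Returns True if the view was cached before the call.
    """
    was_cached = spark.catalog.isCached(view_name)
    if was_cached:
        spark.catalog.uncacheTable(view_name)
    spark.catalog.dropTempView(view_name)
    return was_cached

# In the notebook (assumes the view created in the cell above):
# drop_named_cache(spark, "Pageviews_DF_Python")
```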

## In this lesson you learned about:
 - Analyzing the performance of caching RDDs w.r.t DataFrames
 - Comparing and contrasting the various storage level options

In [None]:
df = spark.read.csv("/home/Downloads/data/people-with-header-10m.txt", header="true", sep=":")

In [None]:
df.cache() # cache is a lazy operation

DataFrame[id: string, firstName: string, middleName: string, lastName: string, gender: string, birthDate: string, ssn: string, salary: string]

`count()` is an action, so it materializes the cache.

In [None]:
df.count() # cache for whole dataset

10000000

Don't ignore data types. How big is the file compared to in-memory?

In [None]:
df.unpersist()

DataFrame[id: string, firstName: string, middleName: string, lastName: string, gender: string, birthDate: string, ssn: string, salary: string]

It's bigger in memory than on disk! Why? Due to Java string object storage.

<img src="https://files.training.databricks.com/images/tuning/java-string.png" alt="Java String Memory allocation"/><br/>


- A regular 4-byte string ends up taking 48 bytes.
- The diagram shows how the 40 bytes are allocated; byte usage must also be rounded up to a multiple of 8 because of JVM padding.
- This is a very bloated representation, given that of these 48 bytes we are actually after only 4.
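The figures above can be reproduced with a rough estimate. This sketch assumes 40 bytes of fixed overhead per string, 2 bytes per character (UTF-16 storage), and 8-byte alignment; `jvm_string_bytes` is a hypothetical helper, and real JVMs vary by version and settings, so treat it as an approximation.

```python
def jvm_string_bytes(n_chars, overhead=40, bytes_per_char=2, alignment=8):
    """Estimate the heap footprint of a Java String with `n_chars` characters."""
    raw = overhead + n_chars * bytes_per_char
    # Round up to the next multiple of `alignment` (JVM object padding)
    return -(-raw // alignment) * alignment

print(jvm_string_bytes(4))  # the 4-character case from the list above
```

A 4-character string gives 40 + 4 × 2 = 48 bytes, matching the first bullet; a 5-character string would be 50 bytes before padding and 56 after.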

Let's try with `inferSchema` instead.

In [None]:
df2 = spark.read.csv("/home/Downloads/data/people-with-header-10m.txt", header="true", inferSchema="true", sep=":")

df2.cache().count()

10000000

Only takes up ~230MB vs ~330MB...

In [None]:
df2.unpersist() # we have to unpersist here otherwise the next cell won't re-cache the same dataset

DataFrame[id: int, firstName: string, middleName: string, lastName: string, gender: string, birthDate: timestamp, ssn: string, salary: int]
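`inferSchema` needs an extra pass over the file to work out the types. If the types are already known, they can be supplied up front, as was done for the pageviews data. This is a minimal sketch: the column names and types are copied from the output above, `read_people` is a hypothetical helper name, and it assumes the `birthDate` values parse with Spark's default timestamp format.

```python
# Schema as a DDL string, mirroring the inferred types shown above
PEOPLE_SCHEMA = (
    "id INT, firstName STRING, middleName STRING, lastName STRING, "
    "gender STRING, birthDate TIMESTAMP, ssn STRING, salary INT"
)

def read_people(spark, path):
    """Read the colon-separated people file with an explicit schema."""
    return spark.read.csv(path, schema=PEOPLE_SCHEMA, header=True, sep=":")

# In the notebook (assumes the same file as the cells above):
# df2 = read_people(spark, "/home/Downloads/data/people-with-header-10m.txt")
```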

Let's use MEMORY_AND_DISK.

In [None]:
from pyspark import StorageLevel

df3 = spark.read.csv("/home/Downloads/data/people-with-header-10m.txt", header="true", inferSchema="true", sep=":")

df3.persist(StorageLevel.MEMORY_AND_DISK).count()

10000000

Now only ~336MB, almost half of what we started with on storage! Let's compare that to an RDD.

In [None]:
myRDD = df3.rdd
myRDD.setName("myRDD").cache().count()

10000000

Wow! The RDD took up significantly less space. Let's unpersist both of them and see how we can cache DataFrames with cleaner names.

In [None]:
df3.unpersist()
myRDD.unpersist()

df.createOrReplaceTempView("df")

In [None]:
spark.sql("CACHE TABLE df")

DataFrame[]
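Unlike `df.cache()`, the SQL `CACHE TABLE` statement is eager by default: the cache is materialized as soon as the statement runs. Spark SQL also offers `CACHE LAZY TABLE` and `UNCACHE TABLE`. The sketch below only builds the statements as strings; `cache_table_sql` and `uncache_table_sql` are hypothetical helper names.

```python
def cache_table_sql(name, lazy=False):
    """Build a CACHE TABLE statement; lazy=True defers materialization."""
    return "CACHE {0}TABLE {1}".format("LAZY " if lazy else "", name)

def uncache_table_sql(name):
    """Build the statement that removes a table from the cache."""
    return "UNCACHE TABLE {0}".format(name)

# In the notebook (assumes the `df` view created above):
# spark.sql(cache_table_sql("df", lazy=True))
# spark.sql(uncache_table_sql("df"))
```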

## End of Exercise