<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 400px">
</div>

# Caching
1. Clear cache
1. Cache DataFrame
1. Remove Cache
1. Cache table for RDD name
1. Spark UI - Storage

##### Methods
- DataFrame (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html" target="_blank">Scala</a>): `union`, `cache`, `unpersist`
- Catalog (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=catalog#pyspark.sql.Catalog" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/catalog/Catalog.html" target="_blank">Scala</a>): `clearCache`, `cacheTable`

In [0]:
%run ./Includes/Classroom-Setup

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Clear all cache
You can clear the cache on your cluster either by restarting the cluster or by calling the method below.

In [0]:
# DO NOT RUN ON A SHARED CLUSTER - CLEARS YOUR CACHE AND YOUR COWORKERS' CACHE
spark.catalog.clearCache()

Let's use the BedBricks events dataset.

In [0]:
eventsJsonPath = "/mnt/training/ecommerce/events/events-1m.json"

df = (spark.read
  .option("inferSchema", True)
  .json(eventsJsonPath))

In [0]:
df.orderBy("event_timestamp").count()

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Cache DataFrame

#### `cache()`
Persists this DataFrame with the default storage level (memory and disk for DataFrames).

##### Alias for `persist`

In [0]:
df.cache()

:NOTE: A call to `cache()` does not immediately materialize the data in cache.

An action using the DataFrame must be executed for Spark to actually cache the data.

Check the Storage tab in the Spark UI before and after materializing the cache below.

In [0]:
df.count()

Observe the change in execution time below, compared with the same query run before caching.

In [0]:
df.orderBy("event_timestamp").count()

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Remove cache
:NOTE: As a best practice, you should always evict your DataFrames from cache when you no longer need them.

#### `unpersist()`
Removes the DataFrame from the cache, releasing the memory and disk space it occupied.

In [0]:
df.unpersist()

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Cache table for RDD name
Caching a temporary view by name gives the cached RDD a readable name in the Storage tab of the Spark UI.

In [0]:
df.createOrReplaceTempView("Pageviews_DF_Python")
spark.catalog.cacheTable("Pageviews_DF_Python")

df.count()

In [0]:
df.unpersist()

### Clean up classroom

In [0]:
%run ./Includes/Classroom-Cleanup
