### Disk Cache

[Optimize performance with caching on Databricks](https://docs.databricks.com/aws/en/optimizations/disk-cache)

Databricks uses **disk caching** to **accelerate data reads by creating copies of remote Parquet data files in nodes' local storage** using a fast intermediate data format. 
- The data is cached automatically whenever a file has to be fetched from a remote location. 
- Successive reads of the same data are then performed locally, which results in significantly improved reading speed. 
- The cache works for all Parquet data files (including Delta Lake tables).

**Disk cache consistency**

The disk cache automatically detects when data files are created, deleted, modified, or overwritten and updates its content accordingly. You can write, modify, and delete table data with no need to explicitly invalidate cached data. Any stale entries are automatically invalidated and evicted from the cache.

Disabling the cache does not result in dropping the data that is already in the local storage. Instead, it prevents queries from adding new data to the cache and reading data from the cache.
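Because disabling the cache only stops reads and writes against it, the setting can be toggled temporarily, for example to time a query without the cache. The following is a minimal sketch, assuming a `spark` session where `spark.databricks.io.cache.enabled` is already set; the helper name `disk_cache_setting` is hypothetical and not part of any Databricks API.

```python
from contextlib import contextmanager

# Configuration key used elsewhere in this notebook.
CACHE_KEY = "spark.databricks.io.cache.enabled"


@contextmanager
def disk_cache_setting(spark, enabled):
    """Hypothetical helper: set the disk cache flag, then restore the old value."""
    previous = spark.conf.get(CACHE_KEY)  # assumes the key is already set
    spark.conf.set(CACHE_KEY, "true" if enabled else "false")
    try:
        yield
    finally:
        # Restore whatever value was in effect before the block ran.
        spark.conf.set(CACHE_KEY, previous)
```

In a notebook cell this could be used as `with disk_cache_setting(spark, False): spark.sql("SELECT * FROM baby_names").collect()`. Data already in local storage stays there; it is just not read while the flag is off.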

In [0]:
file_path = "dbfs:/FileStore/Baby_Names.csv"

In [0]:
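# An explicit schema avoids an extra pass over the CSV to infer column types.
schema = "year INT, name STRING, county STRING, sex STRING, count INT"
df = spark.read.option("header", "true").schema(schema).csv(file_path)
# Overwrite so this cell can be re-run without a "table already exists" error.
df.write.mode("overwrite").saveAsTable("baby_names")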

In [0]:
%%time
spark.sql("SELECT * FROM baby_names").display()

In [0]:
# Check whether the disk cache is currently enabled.
spark.conf.get("spark.databricks.io.cache.enabled")

In [0]:
# Enable the disk cache for this session.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

In [0]:
%%time
spark.sql("SELECT * FROM baby_names").display()

In [0]:
%%time
# Repeat the same query: the data files should now be read from the local disk cache.
spark.sql("SELECT * FROM baby_names").display()

In [0]:
%sql
-- CACHE SELECT preloads the data read by the query into the disk cache
CACHE SELECT * FROM baby_names;
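`CACHE SELECT` also accepts a column list and an optional `WHERE` clause, so only the part of a table that will actually be queried needs to be preloaded. The sketch below builds such a statement from Python; the helper name `cache_select_sql` and the example filter are hypothetical, and the inputs are interpolated as-is, so it should only be given trusted identifiers.

```python
def cache_select_sql(table, columns=None, where=None):
    """Hypothetical helper: build a CACHE SELECT statement as a string."""
    # Fall back to all columns when no explicit list is given.
    column_list = ", ".join(columns) if columns else "*"
    statement = f"CACHE SELECT {column_list} FROM {table}"
    if where:
        statement += f" WHERE {where}"
    return statement
```

For example, `spark.sql(cache_select_sql("baby_names", ["year", "name"], "year >= 2010"))` would preload only those two columns for the matching rows.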

#### Spark Persistence vs. Disk Cache

- **Spark persistence** is a manual step: you call `.cache()` or `.persist()` in your code to store a DataFrame or RDD in memory or on disk. It is best suited to reusing intermediate DataFrame computations.

- **Databricks disk cache** is an automatic background process that caches remote Parquet files on the worker nodes' local SSDs for faster reads. It is enabled through configuration rather than code and is best suited to large Parquet tables.

| Feature | Spark Persistence | Databricks Disk Cache |
|---------|-------------------|-----------------------|
| **Trigger** | Manual (e.g. `df.persist()`) | Automatic, on first read of a file |
| **Data storage** | JVM memory or disk, depending on storage level | Local SSDs on worker nodes |
| **Data type** | Any DataFrame or RDD | Remote Parquet files |
| **Use case** | Reusing intermediate DataFrame/RDD computations | Accelerating reads of frequently accessed Parquet tables |
| **Eviction** | LRU or manual `unpersist()` | LRU or when the underlying file changes |
| **Control** | Fine-grained control over storage level and cache location | Primarily enabled/disabled via configuration |
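To make the "manual trigger" column concrete, here is a minimal sketch of Spark persistence around two actions on the same DataFrame. The function name `count_rows_and_names` is hypothetical, and it assumes a DataFrame with a `name` column such as the `baby_names` table used earlier.

```python
def count_rows_and_names(df):
    """Persist df, run two actions against it, then release it."""
    # persist() is lazy: nothing is stored until the first action runs.
    # With no argument it uses the DataFrame's default storage level.
    df.persist()
    try:
        total_rows = df.count()  # first action materializes the persisted data
        distinct_names = df.select("name").distinct().count()  # second action can reuse it
        return total_rows, distinct_names
    finally:
        # Release the persisted data explicitly instead of waiting for LRU eviction.
        df.unpersist()
```

It could be called as `count_rows_and_names(spark.table("baby_names"))`. None of this is needed for the disk cache, which populates itself when the table's data files are first read.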