### 🔹 Why Use Cache in PySpark?

Transformations in Spark are lazy: Spark computes nothing until an action (such as .show(), .count(), or .collect()) is called.
If the same DataFrame is then used by several actions, Spark re-runs its whole lineage for each one, which can be expensive.

👉 Caching tells Spark to keep the computed DataFrame in memory (or on disk) once the first action has materialized it, so later actions read the stored result instead of recomputing it.
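To make the recompute-versus-reuse behavior concrete, here is a minimal pure-Python sketch. `LazyFrame` and `total_by_category` are hypothetical names invented for this illustration; this is a toy model of the idea, not Spark's implementation or API.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (not a Spark API)."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the rows
        self._cached = None       # materialized rows, once cached
        self._use_cache = False
        self.runs = 0             # how many times the computation ran

    def cache(self):
        # Only marks the data for caching; nothing is computed yet.
        self._use_cache = True
        return self

    def unpersist(self):
        self._use_cache = False
        self._cached = None
        return self

    def _rows(self):
        if self._use_cache and self._cached is not None:
            return self._cached   # reuse the stored result
        self.runs += 1
        rows = self._compute()
        if self._use_cache:
            self._cached = rows
        return rows

    # "Actions": these force evaluation.
    def count(self):
        return len(self._rows())

    def collect(self):
        return list(self._rows())


def total_by_category():
    orders = [("Electronics", 1000), ("Electronics", 1500),
              ("Furniture", 800), ("Clothing", 400), ("Clothing", 600)]
    totals = {}
    for category, amount in orders:
        totals[category] = totals.get(category, 0) + amount
    return sorted(totals.items())


uncached = LazyFrame(total_by_category)
uncached.count()
uncached.collect()
print(uncached.runs)  # 2: each action recomputed the result

cached = LazyFrame(total_by_category).cache()
print(cached.runs)    # 0: cache() alone triggers nothing
cached.count()
cached.collect()
print(cached.runs)    # 1: the second action reused the stored rows
```

The counter plays the role that job and stage counts play in the Spark UI: without caching, every action pays for the full computation again.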

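### 🔹 Methods for Caching in PySpark

| Method | Storage level | Description |
|---|---|---|
| df.cache() | MEMORY_AND_DISK | Default caching for DataFrames: keeps data in memory and spills to disk what does not fit. (RDD.cache() defaults to MEMORY_ONLY.) |
| df.persist() | Custom | Lets you specify the storage level (e.g., MEMORY_AND_DISK or DISK_ONLY); with no argument it uses the same default as cache(). |
| df.unpersist() | n/a | Removes the DataFrame from the cache. |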

In [0]:
# 🔹 Example: Using Cache in PySpark

# Here's a sample you can run in Databricks, where `spark` is predefined.
# In other PySpark environments, create a SparkSession first and use .show() instead of .display().

from pyspark.sql.functions import col, sum as _sum

# ✅ Create a sample DataFrame
data = [
    (1, "Electronics", 1000),
    (2, "Electronics", 1500),
    (3, "Furniture", 800),
    (4, "Clothing", 400),
    (5, "Clothing", 600)
]

columns = ["order_id", "category", "amount"]

df = spark.createDataFrame(data, columns)

In [0]:
# ✅ Perform a transformation
sales_by_category = df.groupBy("category").agg(_sum("amount").alias("total_sales"))

# ✅ Mark the transformed DataFrame for caching (lazy: nothing is computed yet)
sales_by_category.cache()

# ⚡ First action triggers computation and caches the result
print("Initial Action:")
sales_by_category.display()  # Databricks-only; use .show() elsewhere

Initial Action:


category,total_sales
Electronics,2500
Clothing,1000
Furniture,800


In [0]:
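# ⚡ Second action reads from the cache instead of recomputing the aggregation
# (on a dataset this small the speedup is negligible; it matters for expensive lineages)
print("Reusing Cached DataFrame:")
# count() is not the cell's last expression, so its result is not displayed; wrap it in print() to see it
sales_by_category.count()

# ✅ Remove from cache if not needed anymore
sales_by_category.unpersist()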

Reusing Cached DataFrame:
Out[4]: DataFrame[category: string, total_sales: bigint]

In [0]:
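# 🔹 Verify Cache Storage Level
# Run this while the DataFrame is still cached; after unpersist() every flag is False.
from pyspark import StorageLevel
sales_by_category.storageLevel
# Fields: (useDisk, useMemory, useOffHeap, deserialized, replication).
# For a DataFrame cached with cache(), the Spark 3.x docs show StorageLevel(True, True, False, True, 1).
# The line below is a hand-written literal (equivalent to DISK_ONLY), not the result of cache();
# as the cell's last expression, it is what appears in the output.
StorageLevel(True, False, False, False, 1)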

Out[8]: StorageLevel(True, False, False, False, 1)

In [0]:
# 🔹 Using persist() for More Control
# You can persist with an explicit storage level, for example:
from pyspark import StorageLevel
# Store in memory, spilling to disk what does not fit
sales_by_category.persist(StorageLevel.MEMORY_AND_DISK)
# A disk-backed level is safer when the dataset may not fit in memory.
# Note: persist() on an already-cached DataFrame keeps the existing level; call unpersist() first to change it.

Out[7]: DataFrame[category: string, total_sales: bigint]
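The five values in a printed StorageLevel are, in order, useDisk, useMemory, useOffHeap, deserialized, and replication. The sketch below decodes them into words. `Flags` and `describe` are hypothetical helpers written for this illustration, not part of PySpark, and the code deliberately avoids importing pyspark so it runs anywhere.

```python
from typing import NamedTuple


class Flags(NamedTuple):
    """Mirrors the field order of a printed StorageLevel (illustrative only)."""
    use_disk: bool
    use_memory: bool
    use_off_heap: bool
    deserialized: bool
    replication: int = 1


def describe(flags: Flags) -> str:
    """Turn the five StorageLevel fields into a readable description."""
    if not (flags.use_disk or flags.use_memory or flags.use_off_heap):
        return "not cached"
    places = []
    if flags.use_memory:
        places.append("memory")
    if flags.use_off_heap:
        places.append("off-heap")
    if flags.use_disk:
        places.append("disk")
    fmt = "deserialized" if flags.deserialized else "serialized"
    return f"{' + '.join(places)}, {fmt}, replication={flags.replication}"


# Flags the Spark 3.x docs show for a DataFrame after cache()
print(describe(Flags(True, True, False, True)))
# Flags equivalent to DISK_ONLY
print(describe(Flags(True, False, False, False)))
# Flags of a DataFrame that is not cached (for example, after unpersist())
print(describe(Flags(False, False, False, False)))
```

Reading the flags this way makes it easy to spot when a DataFrame is not cached at all, or is cached at a different level than intended.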

### 🔹 When to Use Cache / Persist

✅ Use cache() or persist() when:

- You reuse the same DataFrame multiple times in a job.
- You run iterative algorithms (like ML model training or graph computations).
- You materialize intermediate results that are expensive to recompute.

❌ Avoid caching when:

- The DataFrame is used only once.
- The DataFrame is too large for the available memory, so caching it would spill to disk or evict other cached data.

### 🔹 Bonus: Check Cached Tables in Spark UI

In the Spark UI (in Databricks or any other Spark deployment):

Go to the Storage tab.

You'll see the cached DataFrames, their size, and their storage level. An entry appears only after an action has materialized the cache.