# **Databricks Spark Cache vs Persist**


`cache` and `persist` are two methods in Apache Spark that keep a DataFrame or RDD in memory or on disk so it can be reused across multiple stages of a Spark application. Both are used to speed up iterative algorithms and repeated computations on the same dataset.

*   Cache
    *   It is a convenient, concise way to cache a DataFrame when you do not need to think about storage level details.
    *   It stores the data across the worker nodes using the default storage level: `MEMORY_ONLY` for RDDs and `MEMORY_AND_DISK` for DataFrames.
    *   We call it on a DataFrame or RDD without specifying a storage level. For **example: df.cache()**
*   Persist
    *   It offers more flexibility because you can choose the storage level explicitly. This is valuable when you have specific requirements regarding memory usage, serialization, or disk storage.
    *   It lets you store the data in memory, on disk, or both, across the worker nodes.
    *   The persist method accepts a storage level argument, which makes the syntax slightly more verbose (calling it with no argument uses the default level). For example: **df.persist(StorageLevel.MEMORY_ONLY).**

Both cache and persist can be followed by an unpersist to remove the DataFrame or RDD from the cache. The unpersist method removes the data from memory or disk, depending on the storage level used.

**Example -**\
df1= df.select()\
df2=df1.filter()\
df3=df2.withColumn()\
df4=df2.withColumn()

*  Without cache or persist
    *  df2 is recomputed every time an action runs on df3 or df4, which costs time and resources.
*  With cache
    *  Spark executes the transformations once, saves the result in memory or on disk, and fetches it when needed.
    *  For example, df3 depends on df2, so Spark checks whether df2 is already cached and, if it is, reads the result from the cache. This saves the time and resources that recomputing df2 would take.
    *  To keep computations efficient, we should not cache every intermediate result. Memory is limited and shared with execution, so caching too much can leave too little for other computations. Choose carefully which data to cache.

### **Storage levels provided by Spark in Persist**
*   **MEMORY_ONLY**
    *   The same level that `cache()` uses for RDDs.
    *   Data is stored in deserialized form.
*   **MEMORY_ONLY_SER**
    *   Persists the RDD in serialized (binary) form, which reduces its size and leaves room for more RDDs in cache memory. It is space efficient but not time efficient, because the data must be deserialized on every read.
*   **MEMORY_AND_DISK**
    *   Stores partitions that do not fit in memory on disk.
*   **MEMORY_AND_DISK_SER**
    *   Same as MEMORY_AND_DISK, but in serialized form.
*   **DISK_ONLY**
    *   Persists data only on disk, which requires disk input and output operations and is therefore slower than memory. It still usually performs better than recomputing through the DAG each time.
*   Replicated variants: same as the levels above, but each partition is replicated on a second node, so the data is still accessible from another node if one node fails.
    *   **DISK_ONLY_2**
    *   **MEMORY_AND_DISK_2**
    *   **MEMORY_ONLY_2**
    *   **MEMORY_AND_DISK_SER_2**
    *   **MEMORY_ONLY_SER_2**
*   The `_SER` levels belong to the Scala/Java API. PySpark always stores data in serialized form, so recent PySpark versions do not expose them on `StorageLevel`.
   

**Reason why serialized data takes less memory than deserialized data**

When you serialize data, you convert the in-memory representation of objects or data structures into a format that can be easily stored or transmitted, such as a byte stream or a text format like JSON or XML. In memory, each object carries extra bookkeeping such as object headers and pointers to other objects, while the serialized form typically drops this overhead and packs the data more compactly. This reduction in size is one of the reasons why serialized data can take less space than the original in-memory representation.

*   For example, a complex object with various attributes and relationships might be serialized into a simple JSON or binary format where only essential information is included.
*   Some serialization formats or libraries include compression techniques to further reduce the size of the serialized data.
*   During serialization, certain metadata or non-essential information may be stripped away.
*   Serialization may omit default values for attributes, since these can be assumed when deserializing the data. This can result in smaller serialized representations.
*   Some domain-specific formats, especially for image or audio data, also apply lossy compression. The serialized storage levels in Spark do not rely on this: the data read back is the same as the data stored.

The degree of space reduction during serialization can vary depending on the specific serialization format, the data being serialized, and the serialization library or tool used.

**Note -**
*   While serialized data may be smaller for storage or transmission, there is typically some overhead involved in deserialization to reconstruct the original in-memory representation.
*   To read the data, we first need to deserialize it.
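
As a rough illustration of the size difference, the following plain-Python sketch compares a list of integer objects with the same values packed into a byte string. It is an analogy only: Spark stores JVM objects and uses its own serializers, and the sample values are made up.

```python
import sys
from array import array

values = list(range(1000, 2000))  # hypothetical sample data

# "Deserialized" footprint: the list itself plus one full int object per value
object_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# "Serialized" footprint: the same values packed back to back, 8 bytes each
packed = array("q", values).tobytes()
packed_bytes = len(packed)

# Reading the data again requires a deserialization step
restored = array("q")
restored.frombytes(packed)

print(f"as objects: {object_bytes} bytes, packed: {packed_bytes} bytes")
```

The packed form is smaller because it holds only the values, not the per-object overhead, but the values have to be unpacked before they can be used, which is the time cost mentioned in the note.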
  

In [None]:
import findspark
findspark.init()

In [None]:
from pyspark.sql import SparkSession

In [None]:
spark=SparkSession.builder.config("spark.driver.host","localhost").getOrCreate()

### **Without caching the DataFrame**

In [None]:
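# Read the CSV; without caching, every action re-reads and re-parses the file
df2 = spark.read.format("csv").option("header","true").option("sep",",").load("data.csv")
# This first filter is overwritten by the next line before any action runs on it
result2 = df2.filter(df2["collision_id"] > 4456624)
result2 = df2.filter((df2["collision_id"] < 4456624) & (df2["collision_id"]>4456355))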

result2.show()

In [None]:
result3 = df2.filter((df2["collision_id"] %2==0)).count()
print(result3)


### **With the DataFrame cached, the execution time reduces**

In [None]:

# Assume df is your DataFrame
df = spark.read.format("csv").option("header","true").option("sep",",").load("data.csv")

# Mark the DataFrame for caching (lazy: nothing is stored until the first action runs)
df.cache()

# Perform operations on the DataFrame
result = df.filter(df["collision_id"] > 4456624)

# The first action below materializes the cache; later actions on df reuse it and run faster
# ...
result = df.filter((df["collision_id"] < 4456624) & (df["collision_id"]>4456355))

result.show()




In [None]:
result = df.filter((df["collision_id"] %2==0)).count()
print(result)

**By running the same operations on the DataFrame that is cached (df) and the one that is not (df2), you can compare the difference in execution time.**
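
One simple way to measure that difference is a small timing helper. The sketch below is an assumption, not part of the notebook: `timed` is a hypothetical helper name, and the workload is a plain-Python stand-in so that the block runs without Spark.

```python
import time

def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start

# Stand-in workload so the sketch runs on its own
total, seconds = timed(lambda: sum(range(1_000_000)))
print(f"result={total}, took {seconds:.4f}s")
```

In the notebook, the same helper could wrap an action on each DataFrame, for example `timed(lambda: df.filter(df["collision_id"] % 2 == 0).count())` and the equivalent call on `df2`. Since caching is lazy, the first action on `df` pays the cost of filling the cache, so the comparison is fairer from the second action onward.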

### **Persist**

In [None]:
from pyspark.storagelevel import StorageLevel

# Assume df is your DataFrame
df = spark.read.format('csv').options(inferschema=True,header=True,sep=',').load('data.csv')

# A DataFrame keeps the first storage level it is given, so pick ONE of the
# levels below (call df.unpersist() before switching to another one).

# 1. MEMORY_ONLY: keep partitions in memory only; partitions that do not fit are recomputed.
df.persist(StorageLevel.MEMORY_ONLY)

# 2. MEMORY_AND_DISK: keep partitions in memory and spill to disk if necessary.
# df.persist(StorageLevel.MEMORY_AND_DISK)

# 3. DISK_ONLY: keep partitions on disk only.
# df.persist(StorageLevel.DISK_ONLY)

# 4. Replicated variants keep a copy of each partition on a second node.
# df.persist(StorageLevel.MEMORY_AND_DISK_2)

# Note: MEMORY_ONLY_SER and MEMORY_AND_DISK_SER belong to the Scala/Java API.
# PySpark always stores data serialized, and DISK_ONLY_SER is not a Spark storage level.

# Perform operations on the DataFrame, and Spark will use the cached data
result = df.filter(df["collision_id"] > 4456624).collect()

# Unpersist the DataFrame when it's no longer needed
df.unpersist()
