### Caching and Persisting DataFrames in PySpark


##### When working with large datasets in PySpark, reusing a DataFrame across several actions can be slow: transformations are lazy, so each action (such as count() or show()) recomputes the DataFrame from its source unless the result has been stored. To avoid this, PySpark provides cache() and persist(), which keep intermediate results in memory or on disk for reuse.

##### Using cache()

In [0]:
data = [(1, "Alice", 23), (2, "Bob", 34), (3, "Charlie", 29)]
columns = ["id", "name", "age"]

In [0]:
df = spark.createDataFrame(data, columns)


In [0]:
df.cache()


Out[4]: DataFrame[id: bigint, name: string, age: bigint]

In [0]:
df.count()  # cache() is lazy; this first action materializes the cached data

Out[5]: 3

In [0]:
df.display()

id,name,age
1,Alice,23
2,Bob,34
3,Charlie,29


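##### Using persist(storageLevel)

##### persist() gives more control by accepting a storage level, such as StorageLevel.MEMORY_ONLY, StorageLevel.MEMORY_AND_DISK, or StorageLevel.DISK_ONLY. Calling cache() is equivalent to calling persist() with the default storage level.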



In [0]:
data = [
    (1, 'sonu vishwakarma', 'data engineer'),
    (2, 'mohit jain', 'python developer'),
    (3, 'hitesh', 'web developer'),
]

schema = ['id', 'name', 'Role']

df = spark.createDataFrame(data, schema)

In [0]:
df.display()

id,name,Role
1,sonu vishwakarma,data engineer
2,mohit jain,python developer
3,hitesh,web developer


In [0]:
from pyspark import StorageLevel


In [0]:
df.persist(StorageLevel.MEMORY_AND_DISK)


Out[11]: DataFrame[id: bigint, name: string, Role: string]

In [0]:
df.count() 


Out[12]: 3

In [0]:
df.show()


+---+----------------+----------------+
| id|            name|            Role|
+---+----------------+----------------+
|  1|sonu vishwakarma|   data engineer|
|  2|      mohit jain|python developer|
|  3|          hitesh|   web developer|
+---+----------------+----------------+



### When to Use cache() or persist()?


### Use cache() when:

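##### You need to reuse the DataFrame across multiple actions.

##### The default storage level is acceptable. For DataFrames, cache() uses a memory-and-disk level, so partitions that do not fit in memory are written to disk.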

### Use persist(storageLevel) when:

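##### You need a storage level other than the default, for example DISK_ONLY when the cached data should not occupy executor memory.

##### You want a different storage strategy, such as MEMORY_ONLY, where partitions that do not fit in memory are recomputed instead of being written to disk.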