# RDD Persistence and Reuse
RDDs are created and held predominantly in memory on the executors. By default, RDDs are
transient objects that exist only while they are required: once an RDD has been transformed into
a new RDD and is not needed by any other operation, it is discarded. This becomes a problem when
the same RDD is required by more than one action, because its entire lineage must be reevaluated
for each action. One way to address this is to cache the RDD with the persist() method, so that
the first action materializes it and later actions reuse the stored result.
The examples below run the same operations without and with persist().

In [1]:
#Importing SparkContext
from pyspark.context import SparkContext


sc = SparkContext("local")
sc

In [2]:
#MULTIPLE OPERATIONS WITHOUT PERSIST METHOD
rangeRDD = sc.range(0, 100000, 1, 2)  # values 0..99999 in 2 partitions
#Keeping only the even numbers out of the 100,000 entries
even = rangeRDD.filter(lambda x: x % 2 == 0)
#First action: count the elements (evaluates the filter)
count = even.count()
print(f"There are {count} elements in the collection")
#Second action: collect the elements (evaluates the filter again)
list_of_elements = even.collect()
print(f"The first five elements include: {list_of_elements[0:5]}")

There are 50000 elements in the collection
The first five elements include: [0, 2, 4, 6, 8]


In [3]:
#WITH PERSISTENCE
numbers = sc.range(0, 100000, 1, 2)
evenNum = numbers.filter(lambda x: x % 2 == 0)
evenNum.persist()  # marks evenNum for caching; it is stored when the first action computes it
#First action: count the elements (computes and caches evenNum)
counter = evenNum.count()
print(f"There are {counter} elements in the collection")

listElem = evenNum.collect()  # reuses the cached evenNum instead of recomputing it
print(f"First five elements in the collection are: {listElem[0:5]}")

There are 50000 elements in the collection
First five elements in the collection are: [0, 2, 4, 6, 8]
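
Both cells print the same results, so the benefit of persist() is not visible in the output. The toy model below, written in plain Python, is one way to make the saved work observable. It is only a sketch of the idea and not how Spark is implemented: `LazyDataset`, `run`, and the call counter are hypothetical names introduced for this illustration, and the dataset is shortened to 1,000 elements.

```python
# Toy model of lazy evaluation with optional caching.
# Assumption: this mimics the behaviour described above; it is NOT Spark's code.
class LazyDataset:
    """A lazily evaluated collection that re-runs its lineage on every action."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function that produces the elements
        self._persist = False     # set by persist()
        self._cache = None        # filled by the first action after persist()

    def filter(self, predicate):
        # Transformations are lazy: return a new dataset that wraps this one.
        return LazyDataset(lambda: [x for x in self._materialize() if predicate(x)])

    def persist(self):
        # Only marks the dataset; nothing is computed until an action runs.
        self._persist = True
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache        # reuse the stored result
        data = self._compute()        # re-run the whole lineage
        if self._persist:
            self._cache = data
        return data

    # Actions: these trigger evaluation.
    def count(self):
        return len(self._materialize())

    def collect(self):
        return list(self._materialize())


def run(persist):
    """Run two actions on a filtered dataset and return how often the predicate ran."""
    calls = {"n": 0}

    def is_even(x):
        calls["n"] += 1
        return x % 2 == 0

    evens = LazyDataset(lambda: list(range(1000))).filter(is_even)
    if persist:
        evens.persist()
    evens.count()     # first action
    evens.collect()   # second action
    return calls["n"]


without_persist = run(persist=False)
with_persist = run(persist=True)
print(without_persist, with_persist)  # 2000 1000
```

Without persist() the predicate runs once per element for each of the two actions; with persist() the second action reads the cached elements, so the predicate runs only during the first action.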
