<a href="https://colab.research.google.com/github/sandeepgundeboina/LearningSpark/blob/main/SparkCachePersist.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pyspark
import pyspark
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('SparkCachePersist').getOrCreate()
spark



    Adding colors to a DataFrame in pandas

In [2]:
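import pandas as pd

def color_df(val):
    # Negative values are shown in red, all others in green
    color = 'red' if val < 0 else 'green'
    return f'color: {color}'

df = pd.DataFrame(dict(a=[1, -20, 3], b=[-1, 2, -34], c=[100, 20, -3]))
# Styler.map requires pandas 2.1+; older versions use Styler.applymap
df.style.map(color_df)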

Unnamed: 0,a,b,c
0,1,-1,100
1,-20,2,20
2,3,-34,-3


In [3]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType
data=[('Ram',2339943,'24-03-2022','04-03-1998',40000,'A'),
      ('Raju',244335,'04-06-2021','22-08-1994',20000,'A'),
      ('Charan',254335,'29-03-2024','11-07-1996',25000,'A'),
      ('Devdas',12343,'21-01-2021','04-05-1993',55000,'R'),
      ('Rezza',124335,'15-08-2024','03-12-1994',65000,'A')
      ]
# Column names only: Spark infers the types, and the dates stay strings for now
schema=['Name','Emp_id','Joining_date','DOB','Salary','Status']
# Explicit schema kept for reference; it is not passed below because the date
# values are strings, which DateType fields would reject
schem1=StructType([
    StructField('Name',StringType()),
    StructField('Emp_id',IntegerType()),
    StructField('Joining_date',DateType()),
    StructField('DOB',DateType()),
    StructField('Salary',IntegerType()),
    StructField('Status',StringType())
])
df=spark.createDataFrame(data=data,schema=schema)
df.show()

+------+-------+------------+----------+------+------+
|  Name| Emp_id|Joining_date|       DOB|Salary|Status|
+------+-------+------------+----------+------+------+
|   Ram|2339943|  24-03-2022|04-03-1998| 40000|     A|
|  Raju| 244335|  04-06-2021|22-08-1994| 20000|     A|
|Charan| 254335|  29-03-2024|11-07-1996| 25000|     A|
|Devdas|  12343|  21-01-2021|04-05-1993| 55000|     R|
| Rezza| 124335|  15-08-2024|03-12-1994| 65000|     A|
+------+-------+------------+----------+------+------+



    Transformations

In [4]:
from pyspark.sql.functions import to_date, year, current_date
# Parse the dd-MM-yyyy strings into date columns
df=df.withColumn('DOB',to_date(df.DOB,'dd-MM-yyyy'))
df=df.withColumn('Joining_date',to_date(df.Joining_date,'dd-MM-yyyy'))
# Keep only the rows whose Status is 'A'
df1=df.filter(df.Status=='A')
# df2 is the shared base for both df3 and df4
df2=df1.withColumn('Bonus',df1.Salary*0.2)
# Age and YOE are plain year differences, so month and day are ignored
df3=df2.withColumn('Age',year(current_date())-year(df2.DOB))
df4=df2.withColumn('YOE',year(current_date())-year(df2.Joining_date))

    Actions

In [5]:
df3.show()

+------+-------+------------+----------+------+------+-------+---+
|  Name| Emp_id|Joining_date|       DOB|Salary|Status|  Bonus|Age|
+------+-------+------------+----------+------+------+-------+---+
|   Ram|2339943|  2022-03-24|1998-03-04| 40000|     A| 8000.0| 27|
|  Raju| 244335|  2021-06-04|1994-08-22| 20000|     A| 4000.0| 31|
|Charan| 254335|  2024-03-29|1996-07-11| 25000|     A| 5000.0| 29|
| Rezza| 124335|  2024-08-15|1994-12-03| 65000|     A|13000.0| 31|
+------+-------+------------+----------+------+------+-------+---+



In [6]:
df4.show()

+------+-------+------------+----------+------+------+-------+---+
|  Name| Emp_id|Joining_date|       DOB|Salary|Status|  Bonus|YOE|
+------+-------+------------+----------+------+------+-------+---+
|   Ram|2339943|  2022-03-24|1998-03-04| 40000|     A| 8000.0|  3|
|  Raju| 244335|  2021-06-04|1994-08-22| 20000|     A| 4000.0|  4|
|Charan| 254335|  2024-03-29|1996-07-11| 25000|     A| 5000.0|  1|
| Rezza| 124335|  2024-08-15|1994-12-03| 65000|     A|13000.0|  1|
+------+-------+------------+----------+------+------+-------+---+



In [7]:
df2.show()

+------+-------+------------+----------+------+------+-------+
|  Name| Emp_id|Joining_date|       DOB|Salary|Status|  Bonus|
+------+-------+------------+----------+------+------+-------+
|   Ram|2339943|  2022-03-24|1998-03-04| 40000|     A| 8000.0|
|  Raju| 244335|  2021-06-04|1994-08-22| 20000|     A| 4000.0|
|Charan| 254335|  2024-03-29|1996-07-11| 25000|     A| 5000.0|
| Rezza| 124335|  2024-08-15|1994-12-03| 65000|     A|13000.0|
+------+-------+------------+----------+------+------+-------+





*   Spark evaluates transformations lazily: it records the logic of each transformation as a plan and runs it only when an action is called.
*   Above, df2 is the shared base for df3 and df4. Without caching, every action on df3 or df4 (or on df2 itself) re-runs all the transformations that lead up to df2, which wastes computing power. Persist and Cache are used in such cases.
*   In general, caching is a common optimisation in many databases.
*   In Spark, cache() keeps the data in executor memory. For DataFrames, the default storage level also spills to disk when memory runs short.



In [8]:
df2.cache()

DataFrame[Name: string, Emp_id: bigint, Joining_date: date, DOB: date, Salary: bigint, Status: string, Bonus: double]

In [9]:
df3.show()

+------+-------+------------+----------+------+------+-------+---+
|  Name| Emp_id|Joining_date|       DOB|Salary|Status|  Bonus|Age|
+------+-------+------------+----------+------+------+-------+---+
|   Ram|2339943|  2022-03-24|1998-03-04| 40000|     A| 8000.0| 27|
|  Raju| 244335|  2021-06-04|1994-08-22| 20000|     A| 4000.0| 31|
|Charan| 254335|  2024-03-29|1996-07-11| 25000|     A| 5000.0| 29|
| Rezza| 124335|  2024-08-15|1994-12-03| 65000|     A|13000.0| 31|
+------+-------+------------+----------+------+------+-------+---+



In [10]:
df4.show()

+------+-------+------------+----------+------+------+-------+---+
|  Name| Emp_id|Joining_date|       DOB|Salary|Status|  Bonus|YOE|
+------+-------+------------+----------+------+------+-------+---+
|   Ram|2339943|  2022-03-24|1998-03-04| 40000|     A| 8000.0|  3|
|  Raju| 244335|  2021-06-04|1994-08-22| 20000|     A| 4000.0|  4|
|Charan| 254335|  2024-03-29|1996-07-11| 25000|     A| 5000.0|  1|
| Rezza| 124335|  2024-08-15|1994-12-03| 65000|     A|13000.0|  1|
+------+-------+------------+----------+------+------+-------+---+



In [11]:
df5=df2.withColumn('Bonus',df2.Salary*0.1)
df5.show()

+------+-------+------------+----------+------+------+------+
|  Name| Emp_id|Joining_date|       DOB|Salary|Status| Bonus|
+------+-------+------------+----------+------+------+------+
|   Ram|2339943|  2022-03-24|1998-03-04| 40000|     A|4000.0|
|  Raju| 244335|  2021-06-04|1994-08-22| 20000|     A|2000.0|
|Charan| 254335|  2024-03-29|1996-07-11| 25000|     A|2500.0|
| Rezza| 124335|  2024-08-15|1994-12-03| 65000|     A|6500.0|
+------+-------+------------+----------+------+------+------+



* After df2 is cached, the output is returned faster when an action is called, because df2 is read from the cache instead of being recomputed.

* Persist is another way to save data programmatically: in memory, on disk, or on both.
* The storage level passed to persist() determines where and how the data is stored.
* Available levels include MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_AND_DISK_DESER and DISK_ONLY (read https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.StorageLevel.html for more).


In [12]:

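from pyspark import StorageLevel
# df2 is already cached from In [8]; unpersist first, otherwise Spark keeps the existing storage level
df2.unpersist()
df2.persist(StorageLevel.MEMORY_ONLY)
# The first action after persist() materializes the data
df3.show()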


+------+-------+------------+----------+------+------+-------+---+
|  Name| Emp_id|Joining_date|       DOB|Salary|Status|  Bonus|Age|
+------+-------+------------+----------+------+------+-------+---+
|   Ram|2339943|  2022-03-24|1998-03-04| 40000|     A| 8000.0| 27|
|  Raju| 244335|  2021-06-04|1994-08-22| 20000|     A| 4000.0| 31|
|Charan| 254335|  2024-03-29|1996-07-11| 25000|     A| 5000.0| 29|
| Rezza| 124335|  2024-08-15|1994-12-03| 65000|     A|13000.0| 31|
+------+-------+------------+----------+------+------+-------+---+

