In [2]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("example") \
    .getOrCreate()

# Practicing caching: part 1

A dataframe df1 is loaded from a CSV file. Several processing steps are performed on it. Because df1 is used more than once, it is a candidate for caching.

A second dataframe df2 is created by performing additional compute-intensive steps on df1. It is also a candidate for caching.

In [3]:
df1 = spark.read.csv("dataset/sherlock.txt")
print(df1.is_cached)

# Cache df1
df1.cache()

# Prove df1 is cached
print(df1.is_cached)

False
True


# Practicing caching: the SQL

Previously, we examined two DataFrames: df1 and df2 (which is created from df1). We tried caching df1, but not df2. In this exercise, we'll examine the effects of caching df2, but not df1.

In [6]:
import pyspark

df2 = spark.read.csv("dataset/trainsched.txt")
print(df2.is_cached)


# Persist df2 using memory and disk storage level 
df2.persist(storageLevel=pyspark.StorageLevel.MEMORY_AND_DISK)

print(df2.is_cached)



False
True


# Practicing caching: putting it all together

What was the best approach to caching df1 and df2 and why?

Your results will vary, but here is one sample result for each of the two approaches:

First answer (cache df1):
```
df1_1st : 2.4s
df1_2nd : 0.1s
df2_1st : 0.3s
df2_2nd : 0.2s
Overall elapsed : 3.9
```
Second answer (cache df2):
```
df1_1st : 2.3s
df1_2nd : 1.1s
df2_1st : 1.7s
df2_2nd : 0.1s
Overall elapsed : 6.4
```

- Cache df1, because it improves the time of the 2nd, 3rd, and 4th action.

# Caching and uncaching tables

In the lesson we learned that tables can be cached. Whereas a dataframe is cached using a cache or persist operation, a table is cached using a cacheTable operation.

In [7]:
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
# List the tables
print("Tables:\n", spark.catalog.listTables())

# Cache table1 and confirm that it is cached
spark.catalog.cacheTable('table1')
print("table1 is cached: ", spark.catalog.isCached('table1'))

# Uncache table1 and confirm that it is uncached
spark.catalog.uncacheTable('table1')
print("table1 is cached: ", spark.catalog.isCached('table1'))

Tables:
 [Table(name='table1', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True), Table(name='table2', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]
table1 is cached:  True
table1 is cached:  False


# Spark UI storage tab

A folder sherlock_parts exists on disk containing thirteen text files.
```
ls sherlock_parts
sherlock_part0.txt   sherlock_part2.txt   sherlock_part7.txt
sherlock_part1.txt   sherlock_part3.txt   sherlock_part8.txt
sherlock_part10.txt  sherlock_part4.txt   sherlock_part9.txt
sherlock_part11.txt  sherlock_part5.txt
sherlock_part12.txt  sherlock_part6.txt
```
When loaded, this creates a dataframe having seven partitions.
```
partitioned_df = spark.read.text('sherlock_parts')
partitioned_df.rdd.getNumPartitions()
7
```
A table is created, and the table is cached:
```
partitioned_df.createOrReplaceTempView('text')
spark.catalog.cacheTable('text')
```
What will appear in the Spark UI Storage tab once the cache operation is triggered by an action?
<center><img src="images/03.01.png"  style="width: 400px, height: 300px;"/></center>

<center><img src="images/03.02.png"  style="width: 400px, height: 300px;"/></center>


# Inspecting cache in the Spark UI

A dataframe partitioned_df is available. It is used to register a temporary table called text. text is then cached using spark.catalog.cacheTable('text'). If you were running Spark locally, then the Spark UI would be available at http://localhost:4040/storage/. For the purpose of this exercise, examine the following image. It shows what the Spark UI would display once the cache for text is loaded:

<center><img src="images/03.03.png"  style="width: 400px, height: 300px;"/></center>


This shows that a table called text having seven partitions is cached in memory. Which of the following would immediately cause the above to appear in Spark UI?

1. Performing a transform on the underlying dataframe, for example df = partitioned_df.distinct().
2. Counting the underlying dataframe, for example: partitioned_df.count()
3. Querying the table using, say: spark.sql("select count(*) from text")
4. Querying and showing the result, say: spark.sql("select count(*) from text").show()

- (2) and (4)

# Practice logging

You will now practice these logging operations.

In [None]:
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, format='%(levelname)s - %(message)s')

# Log columns of df1 (text_df in the exercise) as a debug message
logging.debug("text_df columns: %s", df1.columns)

# Log whether table1 is cached as an info message
logging.info("table1 is cached: %s", spark.catalog.isCached(tableName="table1"))

# Log first row of df1 as a warning message (first() triggers an action)
logging.warning("The first row of text_df:\n %s", df1.first())

# Log selected columns of df1 as an error message (select() is lazy, so this logs the dataframe description)
logging.error("Selected columns: %s", df1.select("_c0"))

DEBUG - text_df columns: ['_c0']
DEBUG - Command to send: c
o73
isCached
stable1
e

DEBUG - Answer received: !ybfalse
INFO - table1 is cached: False
DEBUG - Command to send: c
o39
limit
i1
e

DEBUG - Answer received: !yro110
DEBUG - Command to send: c
o13
setCallSite
sfirst at C:\\Users\\88016\\AppData\\Local\\Temp/ipykernel_12448/2156017374.py:12
e

DEBUG - Answer received: !yv
DEBUG - Command to send: c
o110
collectToPython
e

DEBUG - Answer received: !yto111
DEBUG - Command to send: c
o13
setCallSite
n
e

DEBUG - Answer received: !yv
DEBUG - Command to send: a
e
o111
e

DEBUG - Answer received: !yi3
DEBUG - Command to send: a
g
o111
i0
e

DEBUG - Answer received: !yi57774
DEBUG - Command to send: a
e
o111
e

DEBUG - Answer received: !yi3
DEBUG - Command to send: a
g
o111
i1
e

DEBUG - Answer received: !ys0961058a4144d1c421e124e522ecfb9047eb4b6ce8c18ecd3c484e7565325ad2
 Row(_c0='The Project Gutenberg EBook of The Adventures of Sherlock Holmes')
DEBUG - Command to send: r
u
functions


DEBUG - Command to send: m
d
o111
e

DEBUG - Answer received: !yv
DEBUG - Command to send: m
d
o113
e

DEBUG - Answer received: !yv
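
The "Command to send" and "Answer received" lines in the output above come from Py4J, the bridge between Python and the JVM, because the root logger is set to DEBUG. One possible way to keep your own debug messages while silencing that traffic is to raise the level of the `py4j` logger only. The sketch below writes to an in-memory buffer (an assumption made so the effect can be inspected; a notebook would use `sys.stdout`) and emits a fake Py4J message by hand rather than starting Spark.

```python
import io
import logging

# Capture log output in a buffer so the effect can be inspected
buffer = io.StringIO()
logging.basicConfig(stream=buffer, level=logging.DEBUG,
                    format="%(levelname)s - %(message)s", force=True)

# Raise the threshold for Py4J's loggers only; our own DEBUG messages still pass
logging.getLogger("py4j").setLevel(logging.WARNING)

# Simulated Py4J traffic: suppressed, because child loggers inherit the level
logging.getLogger("py4j.java_gateway").debug("Command to send: c")

# Our own message on the root logger: kept
logging.debug("text_df columns: %s", ["_c0"])

print(buffer.getvalue())
```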


# Practice logging 2

In the lesson we learned that Spark operations that trigger an action must be logged with care to avoid stealth loss of compute resources. You will now practice identifying logging statements that trigger an action on a dataframe or table.
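
The reason care is needed is that Python evaluates the arguments of a logging call before the logging module checks the level, so an action passed as an argument runs even when the message is discarded. The sketch below illustrates this with `expensive_action`, a hypothetical stand-in for something like `text_df.first()`, and shows one possible guard using `isEnabledFor`.

```python
import io
import logging

buffer = io.StringIO()
logging.basicConfig(stream=buffer, level=logging.WARNING,
                    format="%(levelname)s - %(message)s", force=True)

calls = []

def expensive_action():
    """Hypothetical stand-in for an action such as text_df.first()."""
    calls.append("ran")
    return "Row(_c0='...')"

# Unguarded: the argument is evaluated even though DEBUG is below the threshold
logging.debug("First row: %s", expensive_action())
unguarded_calls = len(calls)

# Guarded: the action runs only if a DEBUG record would actually be emitted
if logging.getLogger().isEnabledFor(logging.DEBUG):
    logging.debug("First row: %s", expensive_action())
guarded_calls = len(calls) - unguarded_calls

print(unguarded_calls, guarded_calls, repr(buffer.getvalue()))
```

Nothing is written to the log in either case, yet the unguarded call still paid for the action.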

In [13]:
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, format='%(levelname)s - %(message)s')

In [15]:
text_df = df1
# Uncomment the 5 statements that do NOT trigger an action on text_df
logging.debug("text_df columns: %s", text_df.columns)
logging.info("table1 is cached: %s", spark.catalog.isCached(tableName="table1"))
# logging.warning("The first row of text_df: %s", text_df.first())
logging.error("Selected columns: %s", text_df.select("_c0"))
logging.info("Tables: %s", spark.sql("show tables").collect())
logging.debug("First row: %s", spark.sql("SELECT * FROM table1 limit 1"))
# logging.debug("Count: %s", spark.sql("SELECT COUNT(*) AS count FROM table1").collect())

DEBUG - text_df columns: ['_c0']
DEBUG - Command to send: c
o73
isCached
stable1
e

DEBUG - Answer received: !ybfalse
INFO - table1 is cached: False
DEBUG - Command to send: r
u
functions
rj
e

DEBUG - Answer received: !ycorg.apache.spark.sql.functions
DEBUG - Command to send: r
m
org.apache.spark.sql.functions
col
e

DEBUG - Answer received: !ym
DEBUG - Command to send: c
z:org.apache.spark.sql.functions
col
s_c0
e

DEBUG - Answer received: !yro131
DEBUG - Command to send: r
u
PythonUtils
rj
e

DEBUG - Answer received: !ycorg.apache.spark.api.python.PythonUtils
DEBUG - Command to send: r
m
org.apache.spark.api.python.PythonUtils
toSeq
e

DEBUG - Answer received: !ym
DEBUG - Command to send: i
java.util.ArrayList
e

DEBUG - Answer received: !ylo132
DEBUG - Command to send: c
o132
add
ro131
e

DEBUG - Answer received: !ybtrue
DEBUG - Command to send: c
z:org.apache.spark.api.python.PythonUtils
toSeq
ro132
e

DEBUG - Answer received: !yro133
DEBUG - Command to send: c
o39
select
ro133
e


DEBUG - Command to send: m
d
o132
e

DEBUG - Answer received: !yv
DEBUG - Command to send: m
d
o138
e

DEBUG - Answer received: !yv
DEBUG - Command to send: m
d
o141
e

DEBUG - Answer received: !yv
DEBUG - Command to send: m
d
o142
e

DEBUG - Answer received: !yv


# Practice query plans

A dataframe text_df is available. This dataframe is registered as a table called table1.

In [17]:
logging.disable(logging.DEBUG)
# Run explain on text_df
text_df.explain()


== Physical Plan ==
FileScan csv [_c0#17] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/c:/Datacamp/Python/Introduction to Spark SQL in Python/dataset/s..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c0:string>




In [18]:

# Run explain on "SELECT COUNT(*) AS count FROM table1" 
spark.sql("SELECT COUNT(*) AS count FROM table1").explain()



== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=147]
      +- HashAggregate(keys=[], functions=[partial_count(1)])
         +- FileScan csv [] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/c:/Datacamp/Python/Introduction to Spark SQL in Python/dataset/s..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>




In [20]:
# Run explain on "SELECT COUNT(DISTINCT _c0) AS words FROM table1"
spark.sql("SELECT COUNT(DISTINCT _c0) AS words FROM table1").explain()


== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(distinct _c0#17)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=170]
      +- HashAggregate(keys=[], functions=[partial_count(distinct _c0#17)])
         +- HashAggregate(keys=[_c0#17], functions=[])
            +- Exchange hashpartitioning(_c0#17, 200), ENSURE_REQUIREMENTS, [plan_id=166]
               +- HashAggregate(keys=[_c0#17], functions=[])
                  +- FileScan csv [_c0#17] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/c:/Datacamp/Python/Introduction to Spark SQL in Python/dataset/s..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c0:string>




# Practice reading query plans 2

Three dataframes are available: part2_df, part3_df, and part4_df. The questions posed in this exercise can be answered by inspecting the explain() output of each dataframe.

In [21]:
df1.explain()

== Physical Plan ==
FileScan csv [_c0#17] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/c:/Datacamp/Python/Introduction to Spark SQL in Python/dataset/s..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c0:string>


