# Chapter 8: Tuning and Debugging Spark

In this Notebook, we will explore the main ways to tune the Spark configuration as well as how to debug Spark jobs.

## Configuring Spark with SparkConf

We can configure some characteristics (app name, master URL, maximum number of cores per executor, ...) of our SparkSession when creating it through SparkConf.

In [1]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

In [2]:
val conf = new SparkConf()
conf.set("spark.app.name", "Spark App")
conf.set("spark.master", "local[*]")

conf = org.apache.spark.SparkConf@72b68a71


org.apache.spark.SparkConf@72b68a71

In [4]:
val spark = SparkSession.builder.config(conf).getOrCreate()
val sc = spark.sparkContext

spark = org.apache.spark.sql.SparkSession@747045db
sc = org.apache.spark.SparkContext@3a69d421
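
As a complement to the cells above, the following is a minimal sketch of building a `SparkConf` with chained setters and reading the values back. The name `tunedConf` and the values for `spark.executor.memory` and `spark.executor.cores` are illustrative assumptions, not settings used elsewhere in this notebook. Passing `false` to the constructor is also a choice made only for this sketch, so that `spark.*` JVM system properties do not leak into the example.

```scala
import org.apache.spark.SparkConf

// loadDefaults = false: do not pick up spark.* JVM system properties
val tunedConf = new SparkConf(false)
  .setAppName("Spark App")              // shorthand for set("spark.app.name", ...)
  .setMaster("local[*]")                // shorthand for set("spark.master", ...)
  .set("spark.executor.memory", "1g")   // illustrative value (assumption)
  .set("spark.executor.cores", "2")     // illustrative value (assumption)

// Read individual settings back
println(tunedConf.get("spark.app.name"))       // prints "Spark App"
println(tunedConf.getOption("spark.ui.port"))  // prints "None": never set here

// Dump every explicitly set key=value pair, one per line
println(tunedConf.toDebugString)
```

Printing the configuration this way is a quick check that a setting was actually applied before the session is created.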


## Components of Execution: Jobs, Tasks, and Stages

We can inspect the execution plan of a Spark job from our Notebook by calling the `toDebugString` method on an RDD. Let's see an example:

In [5]:
val lines = sc.textFile("../data/README.md")
val words = lines.filter(_.size > 0).flatMap(_.split(" "))
val counts = words.map(x => (x, 1)).reduceByKey(_ + _)

lines = ../data/README.md MapPartitionsRDD[1] at textFile at <console>:32
words = MapPartitionsRDD[3] at flatMap at <console>:33
counts = ShuffledRDD[5] at reduceByKey at <console>:34


ShuffledRDD[5] at reduceByKey at <console>:34

In [8]:
counts.toDebugString

(2) ShuffledRDD[5] at reduceByKey at <console>:34 []
 +-(2) MapPartitionsRDD[4] at map at <console>:34 []
    |  MapPartitionsRDD[3] at flatMap at <console>:33 []
    |  MapPartitionsRDD[2] at filter at <console>:33 []
    |  ../data/README.md MapPartitionsRDD[1] at textFile at <console>:32 []
    |  ../data/README.md HadoopRDD[0] at textFile at <console>:32 []
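
In the output above, the indented `+-` marks a shuffle boundary, which is where Spark splits a job into stages. The same information can be read programmatically from an RDD's dependencies. The sketch below is only an illustration: `countShuffles`, `session`, and the `sample*` values are hypothetical names, and it uses a small in-memory collection instead of the README file. For a simple linear lineage like this one, a job should run in one stage more than the number of shuffles.

```scala
import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Reuses the active session if one exists; the local master is an assumption
val session = SparkSession.builder.master("local[*]").getOrCreate()

// Hypothetical helper: count the shuffle dependencies in an RDD's lineage
def countShuffles(rdd: RDD[_]): Int =
  rdd.dependencies.map { dep =>
    val boundary = if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) 1 else 0
    boundary + countShuffles(dep.rdd)
  }.sum

val sampleWords  = session.sparkContext.parallelize(Seq("a", "b", "a", "c"))
val samplePairs  = sampleWords.map(x => (x, 1))   // narrow dependency, no shuffle
val sampleCounts = samplePairs.reduceByKey(_ + _) // wide dependency, one shuffle

println(countShuffles(samplePairs))   // prints 0
println(countShuffles(sampleCounts))  // prints 1
```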

## Level of Parallelism

Among other aspects, we can control the number of partitions of an RDD. Sometimes, especially when the size of the RDD changes, it may be convenient to reduce or increase its number of partitions. Let's see an example.

In [9]:
words.count()

527

In [11]:
words.getNumPartitions

2

In [12]:
val wordsFiltered = words.filter(_.size > 8)

wordsFiltered = MapPartitionsRDD[6] at filter at <console>:35


MapPartitionsRDD[6] at filter at <console>:35

In [13]:
wordsFiltered.count()

86

In [15]:
wordsFiltered.getNumPartitions

2

In [16]:
val wordsFilteredRepar = wordsFiltered.coalesce(1)

wordsFilteredRepar = CoalescedRDD[7] at coalesce at <console>:37


CoalescedRDD[7] at coalesce at <console>:37

In [17]:
wordsFilteredRepar.getNumPartitions

1
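
The cells above only reduce the number of partitions. As a minimal sketch of the other direction, the example below compares `coalesce` and `repartition` on a small in-memory RDD. The names `session`, `numbers`, `fewer`, `unchanged`, and `more` are hypothetical. Without a shuffle, `coalesce` can only merge partitions, so asking it for more partitions leaves the count as it is. `repartition` always shuffles and can therefore increase the count.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the active session if one exists; the local master is an assumption
val session = SparkSession.builder.master("local[*]").getOrCreate()

// Start from an RDD with exactly 2 partitions
val numbers = session.sparkContext.parallelize(1 to 100, numSlices = 2)

val fewer     = numbers.coalesce(1)     // merges partitions without a shuffle
val unchanged = numbers.coalesce(4)     // cannot grow without a shuffle
val more      = numbers.repartition(4)  // full shuffle, can grow or shrink

println(fewer.getNumPartitions)      // prints 1
println(unchanged.getNumPartitions)  // prints 2
println(more.getNumPartitions)       // prints 4
```

Since `repartition` moves data across the cluster, `coalesce` is usually the cheaper choice when the goal is only to reduce the number of partitions after a selective filter.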