# Chapter 8: Tuning and Debugging Spark

In this notebook, we explore the main ways to tune Spark's configuration and to debug Spark jobs.

## Configuring Spark with SparkConf

We can configure some characteristics of our SparkSession (app name, master address, maximum number of cores in the executors, ...) when creating it, by passing in a SparkConf object.

In [6]:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

In [7]:
conf = SparkConf()
conf.set("spark.app.name", "Spark App")
conf.set("spark.master", "local[*]")

<pyspark.conf.SparkConf at 0x7f1eb0084080>

In [8]:
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext
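As a minimal sketch of the same idea, several properties can be collected in a dict and applied in one call with `SparkConf.setAll`. The helper name `build_conf` and the values below are assumptions made for illustration. The two `spark.executor.*` keys are standard Spark properties, but they mainly matter when running on a cluster manager rather than with `local[*]`.

```python
# Hypothetical settings for illustration; the values are assumptions,
# not recommendations.
SETTINGS = {
    "spark.app.name": "Spark App",
    "spark.master": "local[*]",
    "spark.executor.cores": "2",     # cores per executor (cluster deployments)
    "spark.executor.memory": "2g",   # memory per executor (cluster deployments)
}


def build_conf(settings):
    """Return a SparkConf populated from a dict of property names to values."""
    # Imported lazily so the settings dict can be inspected without Spark.
    from pyspark.conf import SparkConf

    # setAll expects an iterable of (key, value) pairs and returns the conf.
    return SparkConf().setAll(list(settings.items()))
```

In the notebook, `conf = build_conf(SETTINGS)` could then replace the individual `conf.set(...)` calls. `conf.get("spark.app.name")` reads a single value back, and `print(conf.toDebugString())` lists every key and value that was set.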

## Components of Execution: Jobs, Tasks, and Stages

From our notebook, we can inspect the lineage of an RDD, which determines how Spark will execute a job, by calling the `toDebugString` method. Let's see an example:

In [22]:
lines = sc.textFile("../data/README.md")
words = lines.filter(lambda x: len(x) > 0).flatMap(lambda line: line.split(" "))
counts = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)

In [28]:
counts.toDebugString().decode("utf-8").split("\n")

['(2) PythonRDD[27] at RDD at PythonRDD.scala:49 []',
 ' |  MapPartitionsRDD[26] at mapPartitions at PythonRDD.scala:129 []',
 ' |  ShuffledRDD[25] at partitionBy at NativeMethodAccessorImpl.java:0 []',
 ' +-(2) PairwiseRDD[24] at reduceByKey at <ipython-input-22-4aa73400c3be>:3 []',
 '    |  PythonRDD[23] at reduceByKey at <ipython-input-22-4aa73400c3be>:3 []',
 '    |  ../data/README.md MapPartitionsRDD[22] at textFile at NativeMethodAccessorImpl.java:0 []',
 '    |  ../data/README.md HadoopRDD[21] at textFile at NativeMethodAccessorImpl.java:0 []']
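In this output, the indented `+-` line marks a shuffle boundary: the RDDs below it are computed in one stage, and the RDDs above it in the next. The following sketch counts those markers to estimate the number of stages. The helper `count_stages` is hypothetical, not part of PySpark, and it assumes a simple linear lineage like the one above.

```python
def count_stages(debug_lines):
    """Estimate the number of stages from toDebugString() lines.

    Assumes a linear lineage, where each '+-' marker introduces one
    shuffle boundary and therefore one additional stage.
    """
    shuffle_boundaries = sum(
        1 for line in debug_lines if line.lstrip().startswith("+-")
    )
    return shuffle_boundaries + 1


# Lineage lines copied from the output above.
lineage = [
    "(2) PythonRDD[27] at RDD at PythonRDD.scala:49 []",
    " |  MapPartitionsRDD[26] at mapPartitions at PythonRDD.scala:129 []",
    " |  ShuffledRDD[25] at partitionBy at NativeMethodAccessorImpl.java:0 []",
    " +-(2) PairwiseRDD[24] at reduceByKey at <ipython-input-22-4aa73400c3be>:3 []",
    "    |  PythonRDD[23] at reduceByKey at <ipython-input-22-4aa73400c3be>:3 []",
    "    |  ../data/README.md MapPartitionsRDD[22] at textFile at NativeMethodAccessorImpl.java:0 []",
    "    |  ../data/README.md HadoopRDD[21] at textFile at NativeMethodAccessorImpl.java:0 []",
]

print(count_stages(lineage))  # prints 2
```

The two stages correspond to the work before and after `reduceByKey`, which is the only operation in this example that requires a shuffle.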

## Level of Parallelism

Among other aspects, we can control the number of partitions of an RDD. Sometimes, especially when the size of the RDD changes significantly (for example, after a selective filter), it may be convenient to reduce or increase its number of partitions. Let's see an example.

In [29]:
words.count()

527

In [31]:
words.getNumPartitions()

2

In [37]:
words_filtered = words.filter(lambda x: len(x) > 8)

In [38]:
words_filtered.count()

86

In [39]:
words_filtered.getNumPartitions()

2

In [40]:
words_filtered_repar = words_filtered.coalesce(1)

In [41]:
words_filtered_repar.getNumPartitions()

1
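One way to pick the new partition count is to keep the number of records per partition roughly constant. The sketch below applies that idea to the counts above (527 words before filtering, 86 after). The function `suggest_partitions` and the heuristic itself are assumptions made for illustration, not a rule that Spark applies.

```python
import math


def suggest_partitions(old_partitions, old_count, new_count):
    """Scale the partition count in proportion to the change in record count.

    Hypothetical heuristic: keeps records per partition roughly constant
    and never returns fewer than one partition.
    """
    if old_count <= 0:
        return 1
    scaled = old_partitions * new_count / old_count
    return max(1, math.ceil(scaled))


# 2 partitions held 527 words; only 86 remain after filtering.
print(suggest_partitions(2, 527, 86))  # prints 1
```

The result could be passed straight to `words_filtered.coalesce(...)`. Note that `coalesce` avoids a full shuffle and, by default, can only reduce the number of partitions. To increase it, `repartition` can be used instead, at the cost of a shuffle.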