
Cluster Topology Matters!


In this session we're going to talk about the anatomy of a Spark job, and look at how the clusters that Spark runs on are typically organized. This is actually important: it comes back to the programming model once again. You can't just pretend you have sequential collections sitting on one machine; you have to think about how your program is spread out across the cluster.

Anatomy of a Spark Job

Spark jobs are organized in a master/worker topology: usually there is one master and many workers. The node that acts as the master runs the driver program and holds the SparkContext, so this is the node that we, i.e. our program, interact with. The worker nodes host the executors, which execute the tasks that make up a job.

How do the master and the workers communicate? They do so via a cluster manager (e.g. YARN or Mesos), which allocates resources, schedules work, and so on.

(Figure: spark_anatomy — diagram of the Spark driver program, cluster manager, and executors on the worker nodes.)

Thus, the Spark application is a set of processes running on a cluster.

The driver program:

  • coordinates all the processes.
  • holds the process where the main() method of our program runs.
  • holds the process that creates SparkContext, creates RDDs, and stages up or sends off transformations and actions.

The executors:

  • run the tasks that make up the application
  • return computed results to the driver
  • provide in-memory storage for cached RDDs (see the sketch after this list).
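
To make that last bullet concrete, here is a minimal sketch of how cached RDD partitions end up in executor memory. It assumes an already-created SparkContext named sc, and the HDFS path is purely hypothetical.

import org.apache.spark.rdd.RDD

// Assumes an existing SparkContext `sc`; the path below is hypothetical.
val lines: RDD[String] = sc.textFile("hdfs:///data/events.log")  // partitions are distributed across the executors
val errors = lines.filter(_.contains("ERROR"))                   // transformation: nothing runs yet
errors.cache()                                                    // ask the executors to keep these partitions in memory
errors.count()                                                    // first action computes and caches the partitions
errors.count()                                                    // second action is served from the executors' cache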

How Spark Jobs are executed

  1. The driver program runs the Spark application, which creates a SparkContext upon start-up.
  2. The SparkContext connects to a cluster manager, which allocates resources.
  3. Spark acquires executors on nodes in the cluster.
  4. The driver program sends your application code to the executors.
  5. Finally, the SparkContext sends tasks for the executors to run.
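
As a rough illustration of these five steps, here is a minimal, self-contained sketch. The application name, the local[4] master URL, and the example computation are all made up for illustration; on a real cluster the master URL would point at YARN, Mesos, or a standalone master.

import org.apache.spark.{SparkConf, SparkContext}

object TopologyDemo {
  def main(args: Array[String]): Unit = {
    // Step 1: the driver program creates a SparkContext on start-up.
    val conf = new SparkConf().setAppName("topology-demo").setMaster("local[4]")
    val sc   = new SparkContext(conf)   // steps 2-3: connect to the cluster manager, acquire executors

    // Steps 4-5: the code below is shipped to the executors and run as tasks.
    val nums = sc.parallelize(1 to 1000)
    println(nums.sum())                 // the action's result comes back to the driver

    sc.stop()
  }
}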

Example 1: println

Assume we have an RDD populated with Person objects.

import org.apache.spark.rdd.RDD

case class Person(name: String, age: Int)

val people: RDD[Person] = ...   // an RDD of Person records, defined elsewhere
people.foreach(println)

What happens?

  • Nothing happens on the driver. This is because foreach is an action with a Unit return type, so it is eagerly executed on the executors, not on the driver. Any calls to println therefore go to the stdout of the worker nodes, not to the stdout of the driver node. See the sketch below for how to bring the output back to the driver.
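
If you actually want to see the output on the driver, you first have to bring the data back to it. Continuing with the people RDD from above, a small sketch (note that collect is an action that ships the entire RDD to the driver, so it is only safe for small data sets):

// Prints on the executors: output ends up scattered across the workers' stdout/logs.
people.foreach(println)

// Prints on the driver: collect() first ships every element back to the driver
// as a local Array[Person] (only do this when the RDD is small enough to fit).
people.collect().foreach(println)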

Example 2: take

import org.apache.spark.rdd.RDD

case class Person(name: String, age: Int)

val people: RDD[Person] = ...   // an RDD of Person records, defined elsewhere
val first10 = people.take(10)

What happens? Where will the Array[Person] representing first10 end up?

It ends up on the driver program. In general, executing an action involves communication between the worker nodes and the node running the driver program: since take is an action, the workers compute their parts of the result and send them to the driver, where they are combined into a single Array[Person].
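
As a quick follow-up sketch: because first10 is now an ordinary local array in the driver's JVM, anything you do with it runs on the driver and launches no Spark job.

// first10 is a plain Array[Person] on the driver, so this println runs
// on the driver, in contrast to Example 1.
first10.foreach(println)

// Ordinary Scala collection operations apply from here on; no tasks are sent to the executors.
val names = first10.map(_.name)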

Moral of the story: to make effective use of RDDs, you have to understand a little bit about how Spark works. Because of the mix of lazy transformations and eager actions, it is not obvious at first glance which part of the cluster a given line of code will run on.