 <img src="uva_seal.png"> 

## Running Spark on a Cluster

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: August 20, 2023

---  

### SOURCES 

1. Learning Spark 1st ed., Chapter 7: Running on a Cluster
2. [Cluster overview](https://spark.apache.org/docs/latest/cluster-overview.html)

### OBJECTIVES
- Learn how to run distributed Spark
- Learn about some of the common deployment environments


### CONCEPTS AND FUNCTIONS
- Cluster manager (Hadoop YARN, Apache Mesos, Standalone)
- Driver and worker/executor
- Spark application
- Jobs, stages, and tasks
- Directed acyclic graph (DAG)
- Amazon Web Services tools for running Spark: EC2, EMR

---  

### Spark Architecture

One benefit of Spark is the ability to scale computation by adding more machines and running in cluster mode.

The *workers (aka executors)* receive code and data chunks, do the processing, and send results back to the driver. Strictly speaking, a worker is a machine in the cluster and an executor is a process running on it.

The *driver* is in charge of coordinating the workers.

Driver + Workers = Spark application

### Driver

The `main()` method of the program runs on the driver

Converts the program into a logical *directed acyclic graph* (DAG) of operations

Converts the DAG into a physical plan of stages and tasks

Coordinates scheduling of tasks on executors (like a manager)

### Executors

Run the individual tasks

Launch at start of application and run for lifetime of app

Provide in-memory (RAM) storage for RDDs

Keeping data in RAM speeds up computation compared with reading from slower disk
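The effect of in-memory storage can be illustrated without a cluster. The following is a toy sketch in plain Python, not Spark code: `ToyDataset` and `load_from_disk` are hypothetical names, and the method names `map`, `cache`, and `collect` are chosen only to echo the RDD methods. It shows that a cached dataset touches its source once, no matter how many times it is collected.

```python
class ToyDataset:
    """Plain-Python stand-in for an RDD (hypothetical class, not part of Spark)."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function that rebuilds the data
        self._keep = False        # switched on by cache()
        self._stored = None       # in-memory copy, filled on first collect()

    def map(self, fn):
        # Lazy: nothing runs until collect() is called.
        return ToyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        self._keep = True
        return self

    def collect(self):
        if self._stored is not None:
            return self._stored   # served from memory, no recompute
        data = self._compute()
        if self._keep:
            self._stored = data
        return data


reads = {"count": 0}


def load_from_disk():
    """Pretend to read a file and count how often the source is touched."""
    reads["count"] += 1
    return [1, 2, 3]


squares = ToyDataset(load_from_disk).map(lambda x: x * x).cache()
first = squares.collect()
second = squares.collect()
print(first, second, reads["count"])  # the source is read only once
```

Without the call to `cache()`, every `collect()` would walk back to `load_from_disk` and read the source again.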

### Cluster Manager

External service that Spark uses to acquire resources (executors) on the cluster.  

Spark is packaged with the Standalone cluster manager.

Manages the resources between Spark applications.  
Can manage queues if there is more demand than resources for executors.

---

#### Cluster Overview. Source: Apache Spark.

The driver program sends application code and tasks to the workers.  
Each worker runs one or more executors.  
It is possible to cache data in RAM for speedup (avoiding recompute).  
The Cluster Manager is responsible for allocating resources across applications.

 <img src="cluster_manager.png"> 

#### Implementing a Job

To carry out the work, each *Job* is divided into *Stages*, which are further divided into *Tasks*.  
The Task is the smallest unit of work.  
Executors run the Tasks.
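The Job, Stage, and Task hierarchy can be sketched in plain Python. This is a simplified model under stated assumptions, not Spark's scheduler: the names `Stage`, `plan_stages`, and `WIDE_OPS` are hypothetical, a new stage starts at each operation that needs a shuffle, and every stage is assumed to run one task per partition with the partition count held constant.

```python
from dataclasses import dataclass, field

# Operations that need a shuffle end the current stage.
# Illustrative subset only, not Spark's full list.
WIDE_OPS = {"reduceByKey", "groupByKey", "sortByKey", "join", "distinct"}


@dataclass
class Stage:
    stage_id: int
    num_tasks: int                      # assumed: one task per partition
    ops: list = field(default_factory=list)


def plan_stages(ops, num_partitions):
    """Split a linear chain of operations into stages at shuffle boundaries."""
    stages = [Stage(0, num_partitions)]
    for op in ops:
        if op in WIDE_OPS and stages[-1].ops:
            # Shuffle boundary: the remaining work belongs to a new stage.
            stages.append(Stage(len(stages), num_partitions))
        stages[-1].ops.append(op)
    return stages


# A word-count style chain; an action such as collect() would trigger this job.
job = plan_stages(["textFile", "flatMap", "map", "reduceByKey"], num_partitions=4)
for stage in job:
    print(stage.stage_id, stage.ops, stage.num_tasks)
```

In this model the chain above yields two stages of four tasks each. In a real application the `Stages` tab of the Web UI reports the actual split.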

**Example:**  
Consider this line of code which reads a text file into an RDD and collects the data to the driver.

<img src="code_read_textfile_into_rdd.png"> 

Spark comes with a built-in Web UI.  There are several tabs such as `Jobs` and `Stages` which provide details about the running application.  
Useful information such as resources used at each stage of the computation is available here.

When running jobs locally (*local mode*), you should be able to view the UI at this URL:  
http://localhost:4040/jobs/
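The same information is also exposed as JSON through Spark's monitoring REST API, served under `/api/v1` on the UI port. The sketch below uses only the Python standard library. The helper names `api_url` and `fetch_json` are hypothetical, and the default host and port assume a local-mode application on the default UI port.

```python
import json
from urllib.request import urlopen


def api_url(path="applications", host="localhost", port=4040):
    """Build a URL for the monitoring REST API served by the Spark UI."""
    return f"http://{host}:{port}/api/v1/{path.lstrip('/')}"


def fetch_json(url, timeout=2.0):
    """Return the parsed JSON response, or None if the UI is not reachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection failures and timeouts;
        # ValueError covers a response that is not valid JSON.
        return None


print(api_url())
```

While an application is running, `fetch_json(api_url())` should return a list of applications. When nothing is listening on the port, it returns `None`.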


From the UI, here are details on the Stages:

<img src="stages_read_textfile_into_rdd.png"> 

From the UI, here are details on the Executors:

<img src="executor_info.png"> 

### Launching a Program

We generally run code from notebooks in this course.

To run from the command line, call `spark-submit` to launch a Spark app.

**Run in local mode using single core**

`$ bin\spark-submit --master local python_scripts\textAnalysis1.py`

**Run in local mode using 4 cores**

`$ bin\spark-submit --master local[4] python_scripts\textAnalysis1.py`

**Run in local mode using all cores**

`$ bin\spark-submit --master local[*] python_scripts\textAnalysis1.py`
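The three local-mode `--master` values above differ only in how many worker threads they request. As a minimal sketch in plain Python, the hypothetical helper `local_threads` spells out that mapping. It deliberately ignores other forms such as `local[K,F]`, and it uses `os.cpu_count()` as a stand-in for the number of logical cores Spark would detect.

```python
import os
import re


def local_threads(master):
    """Return the number of worker threads a local-mode master string asks for."""
    if master == "local":
        return 1                           # single thread
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a local-mode master: {master!r}")
    if match.group(1) == "*":
        return os.cpu_count() or 1         # as many threads as logical cores
    return int(match.group(1))             # exactly K threads


for value in ("local", "local[4]", "local[*]"):
    print(value, local_threads(value))
```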

**Run on Spark Standalone cluster at default port**

`$ bin\spark-submit --master spark://host:7077 python_scripts\textAnalysis1.py`

**Run on Spark Standalone cluster at default port, specifying memory to allocate**

`$ bin\spark-submit --master spark://host:7077 --executor-memory 10g python_scripts\textAnalysis1.py`

**Generic Form to run Spark App**

`$ bin\spark-submit [options] <app jar | python file> [app options]`
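When a launch has to be scripted, the generic form can be assembled as an argument list in Python. This is a sketch, not an official wrapper: the function `spark_submit_command` and its parameter names are hypothetical, the default `bin/spark-submit` path uses forward slashes and assumes the current directory is the Spark home, and only three real options (`--master`, `--executor-memory`, `--py-files`) are covered.

```python
import shlex


def spark_submit_command(app, master, executor_memory=None, py_files=(),
                         app_args=(), spark_submit="bin/spark-submit"):
    """Build the argument list: spark-submit [options] <app> [app options]."""
    cmd = [spark_submit, "--master", master]
    if executor_memory:
        cmd += ["--executor-memory", executor_memory]
    if py_files:
        # --py-files takes a single comma-separated list
        cmd += ["--py-files", ",".join(py_files)]
    cmd.append(app)            # the application comes after all Spark options
    cmd.extend(app_args)       # anything after the app is passed to the app itself
    return cmd


cmd = spark_submit_command(
    "python_scripts/textAnalysis1.py",
    master="spark://host:7077",
    executor_memory="10g",
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd)` as is. Keeping the arguments as a list avoids shell quoting problems with values such as `local[*]`.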

### Packaging Code and Dependencies  

**Python**  
PySpark uses the Python installation on the worker machines, so you can use `pip` to install packages there  
You can also ship libraries with the job using the `--py-files` argument to `spark-submit`  
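`--py-files` accepts `.py`, `.zip`, and `.egg` files. One way to ship a local package is to zip it first. The sketch below uses only the standard library. The function name `zip_package` and the package name `helpers` are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_package(package_dir, zip_path):
    """Zip the .py files of a package so it can be passed to --py-files."""
    package_dir = Path(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for py_file in sorted(package_dir.rglob("*.py")):
            # Keep the package folder as the top-level entry so that
            # `import helpers` still resolves from inside the archive.
            arcname = py_file.relative_to(package_dir.parent).as_posix()
            zf.write(py_file, arcname)
    return Path(zip_path)


# Demo on a throwaway package created in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "helpers"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "clean.py").write_text("def strip(s):\n    return s.strip()\n")
    archive = zip_package(pkg, Path(tmp) / "helpers.zip")
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())
    print(names)
```

The resulting archive would then be submitted with `--py-files helpers.zip`.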

### Hadoop YARN

**Y**et **A**nother **R**esource **N**egotiator 

`YARN` is a cluster manager introduced in `Hadoop 2.0`  
It does the following:
- allocates system resources to various applications running in a `Hadoop` cluster.  
- schedules tasks to be executed on different cluster nodes  

### Amazon EC2 (Elastic Compute Cloud)

One of many services from Amazon Web Services (AWS) is EC2

Spark 1.x shipped with a script to launch clusters on EC2: `spark-ec2`. Since Spark 2.0 the script is maintained as a separate project.

You will need an Amazon Web Services (AWS) account  
Export the *access key ID* and *secret access key* as environment variables  
By default, launching the cluster produces one master and one worker  
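Before launching anything on AWS, it helps to confirm that both keys are actually exported. A minimal check in Python, assuming the standard variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` (the helper name `missing_aws_credentials` is hypothetical):

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(env=None):
    """Return the names of required AWS variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_credentials()
if missing:
    print("Export these variables first:", ", ".join(missing))
else:
    print("AWS credentials found in the environment.")
```

The check only reports whether the variables are set. It never prints their values, which should stay out of notebooks and logs.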

### AWS and the Free Tier

AWS offers over 200 services in storage, compute, AI/ML, and many other areas of tech.  

There is an AWS Free Tier where some of the services are completely free.

We will use this Free Tier for the course. Visit here to sign up:

https://aws.amazon.com/free/?all-free-tier.sort-by=item.additionalFields.SortRank&all-free-tier.sort-order=asc


**Amazon Elastic MapReduce (EMR)**

`Amazon EMR` provides a managed `Hadoop` framework on AWS for processing vast amounts of data with parallel, distributed, elastic execution.  
`EMR` leverages `S3`, Amazon's elastic, highly reliable cloud storage product (covered later in the course). 
  
Here is a very short overview (1 min) of EMR:  
https://www.youtube.com/watch?v=AM8WZb2Xj2g


### Summary

In this notebook, you learned about Spark's architecture and several options for running a Spark cluster.  

The terminology of worker, executor, and driver will come up throughout the course.

---