---
<span style="color:#000; font-family: 'Arial'; font-size: 2em;">BIG DATA</span>

<span style="color:#f00; font-family: 'Arial'; font-size: 1.5em;">Unit 7: Cluster Management </span>

<h4 style="color:darkblue"> Universidad de Deusto</h4>

<span style="color:#300; font-family: 'Arial'; font-size: 1em;">m.varo@deusto.es</span>

<h5 style="color:black">  April 11, 2025 - Donostia </h5>

---


1. **Deploying a Multi-Node Cluster**: Set up a Spark Standalone or YARN cluster (depending on resource availability).
2. **Job Submission**: Run Spark applications on the configured cluster.
3. **Monitoring Tools**: Utilize Spark UI and Ganglia to monitor and analyze job performance.
4. **Fault Tolerance Simulation**: Simulate failures and test Spark's job recovery mechanisms.


### **Optional Additions**  
#### Code Block (example):  
```bash
spark-submit --master yarn --deploy-mode cluster your_app.py
```

#### Table (comparison):  
| Cluster Type  | Pros                     | Cons                     |  
|--------------|--------------------------|--------------------------|  
| Standalone   | Easy setup               | Limited resource sharing |  
| YARN         | Better resource management | More complex configuration |  

#### Links:  
- [Spark UI Guide](https://spark.apache.org/docs/latest/web-ui.html)  
- [Ganglia Monitoring](http://ganglia.sourceforge.net/)  

---

# Apache Spark: Cluster Deployment and Performance Optimization

### 1. Introduction to Apache Spark
Apache Spark is one of the most active and widely adopted open-source projects in the big data ecosystem...

### 2. Spark Memory Management

It's important to note that Spark does **not** automatically cache input data in memory. A common misconception is that Spark cannot be used effectively unless the input data fits entirely in memory. This is **not true**. Spark is capable of processing terabytes of data even on clusters with limited memory—for example, a cluster with only 100 GB of total memory.

Deciding what data to cache, and when to cache it during a data processing pipeline, is the responsibility of the application developer. In fact, if a Spark application only makes a single pass over the data, caching may not be necessary at all.
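
The effect of that decision can be illustrated with a toy model. The sketch below is plain Python, not the Spark API: `ToyDataset`, `read_source`, and the `loads` counter are hypothetical names made up for this example. In PySpark the corresponding calls would be `cache()` or `persist()` on an RDD or DataFrame.

```python
class ToyDataset:
    """Toy stand-in for a lazily evaluated dataset (not a Spark class)."""

    def __init__(self, load_fn):
        self._load_fn = load_fn       # simulates an expensive read from disk
        self._cache_enabled = False
        self._cached = None
        self.loads = 0                # how many times the source was read

    def cache(self):
        # Only marks the dataset for caching; nothing is read yet.
        self._cache_enabled = True
        return self

    def _rows(self):
        if self._cached is not None:
            return self._cached       # served from memory
        self.loads += 1               # simulated disk read
        rows = list(self._load_fn())
        if self._cache_enabled:
            self._cached = rows
        return rows

    def count(self):
        return len(self._rows())

    def total(self):
        return sum(self._rows())


def read_source():
    """Hypothetical input: five numbers standing in for a large file."""
    return range(1, 6)


# Two passes without caching: the source is read twice.
uncached = ToyDataset(read_source)
uncached.count()
uncached.total()

# Two passes with caching: the source is read only once.
cached = ToyDataset(read_source).cache()
cached.count()
cached.total()

print(uncached.loads, cached.loads)   # 2 1
```

With a single pass over the data, both variants would read the source exactly once, which is why caching brings no benefit in that case.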

Another reason Spark outperforms Hadoop MapReduce is its advanced job execution engine. Like MapReduce, Spark represents jobs as Directed Acyclic Graphs (DAGs) of stages, but it processes these DAGs more efficiently, enabling better performance and reduced execution time.


### 3. Optimization in Spark vs. Hadoop MapReduce

Hadoop MapReduce creates a Directed Acyclic Graph (DAG) with exactly two predefined stages—Map and Reduce—for every job. A complex data processing algorithm in MapReduce may require multiple jobs to be executed sequentially, which prevents any optimization across jobs.

In contrast, Spark offers greater flexibility. It does not force the developer to break a complex algorithm into multiple jobs. A Spark DAG can contain any number of stages, allowing both simple jobs with just one stage and more complex jobs with several stages. This ability enables Spark to perform optimizations that are not possible in MapReduce.

Spark executes a complex multi-stage job in a single run, using its knowledge of all stages to optimize execution. For example, it can minimize disk I/O and data shuffling. Shuffling transfers data across the network and can significantly increase application execution time. By reducing these costly operations, Spark improves overall job performance.
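
One way to picture the saving in disk I/O is to compare a chain of separate jobs with a single pipelined pass. The following sketch is a simplified analogy in plain Python, not Spark code: the `storage` list is a hypothetical stand-in for the intermediate files written between jobs, and both function names are made up for illustration.

```python
def run_as_separate_jobs(records, storage):
    """Each step is its own job; its full output is stored before the next starts."""
    parsed = [int(r) for r in records]
    storage.append(parsed)                          # written out between job 1 and job 2
    evens = [x for x in storage[-1] if x % 2 == 0]
    storage.append(evens)                           # written out between job 2 and job 3
    return sum(x * x for x in storage[-1])


def run_as_one_pipeline(records):
    """All steps are known up front, so each record flows through them in one pass."""
    parsed = (int(r) for r in records)              # generators: nothing is stored
    evens = (x for x in parsed if x % 2 == 0)
    return sum(x * x for x in evens)


records = ["1", "2", "3", "4"]
storage = []                                        # stands in for intermediate files
print(run_as_separate_jobs(records, storage), len(storage))   # 20 2
print(run_as_one_pipeline(records))                            # 20
```

Both variants compute the same result, but only the first one materializes intermediate datasets. This models the pipelining of consecutive steps only. It does not model shuffles, which still require data movement between stages.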

### 4. Scalability in Spark

Spark is highly scalable, allowing you to increase the data processing capacity of a cluster simply by adding more nodes. You can start with a small cluster and, as your dataset grows, scale your infrastructure by adding more computing resources. This flexibility makes Spark an economical choice for handling growing datasets.

One of Spark's key features is that it automatically handles scaling without requiring any changes to the application code. When you add nodes to a Spark cluster, the application can take advantage of the additional resources without any code modifications, making it easy to scale as needed.

### 5. Fault Tolerance in Spark

Spark is designed to be fault-tolerant, ensuring reliable execution even in the face of hardware failures. In a cluster with hundreds of nodes, the probability of a node failure on any given day is significant—whether due to a hard disk crash or other hardware issues. However, Spark automatically handles the failure of a node in the cluster, ensuring that the application continues running.

While the failure of a node may cause some performance degradation, it will not cause the application to crash. This built-in fault tolerance means that application developers do not need to explicitly handle node failures in their code, simplifying the application development process and increasing reliability.
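
The behavior described above can be sketched with a toy scheduler. This is an illustration in plain Python under simplifying assumptions, not Spark's actual recovery mechanism: the worker names, the round-robin assignment, and the `run_job` function are all hypothetical.

```python
def run_job(partitions, workers, failed_workers):
    """Assign one task per partition; rerun tasks whose worker has failed."""
    healthy = [w for w in workers if w not in failed_workers]
    if not healthy:
        raise RuntimeError("no healthy workers left")

    results = {}
    attempts = 0
    for index, partition in enumerate(partitions):
        worker = workers[index % len(workers)]      # first choice (round robin)
        attempts += 1
        if worker in failed_workers:
            # The task is lost with its worker; reschedule it on a healthy one.
            worker = healthy[index % len(healthy)]
            attempts += 1
        results[index] = (worker, sum(partition))
    return results, attempts


partitions = [[1, 2], [3, 4], [5, 6], [7, 8]]
workers = ["worker-a", "worker-b"]

results, attempts = run_job(partitions, workers, failed_workers={"worker-b"})
total = sum(value for _, value in results.values())
print(total, attempts)   # 36 6
```

The job still produces the correct total, but it needs six task attempts instead of four. That extra work corresponds to the performance degradation mentioned above, and the application code (`sum(partition)`) contains no failure handling.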


### 6. Iterative Algorithms in Spark

Iterative algorithms are data processing algorithms that repeatedly iterate over the same data. Applications such as machine learning and graph processing commonly use iterative algorithms, running tens or even hundreds of iterations over the same dataset. Spark is particularly well-suited for these types of applications.

The reason iterative algorithms run efficiently on Spark is its in-memory computing capabilities. Spark allows applications to cache data in memory, so even if an iterative algorithm performs 100 iterations, it only needs to read the data from disk during the first iteration. Subsequent iterations can read the data from memory, which is typically **100 times faster** than reading from disk. This dramatically speeds up the execution of these applications, often resulting in orders of magnitude improvements in performance.
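
A back-of-the-envelope calculation makes the gain concrete. The sketch below is a simplified cost model with assumed values: it takes the 100x memory-versus-disk ratio from the paragraph above, uses arbitrary time units, and counts only the time spent reading the dataset, ignoring computation. The function name and its parameters are illustrative.

```python
def total_read_time(iterations, disk_read, memory_speedup, cached):
    """Time spent reading the dataset across all iterations (arbitrary units)."""
    if not cached:
        return iterations * disk_read
    memory_read = disk_read / memory_speedup
    # The first iteration reads from disk; the rest read the cached copy.
    return disk_read + (iterations - 1) * memory_read


without_cache = total_read_time(100, disk_read=1.0, memory_speedup=100, cached=False)
with_cache = total_read_time(100, disk_read=1.0, memory_speedup=100, cached=True)
speedup = without_cache / with_cache

print(f"{without_cache:.2f} {with_cache:.2f} {speedup:.1f}")   # 100.00 1.99 50.3
```

Under these assumptions, 100 iterations spend about 50 times less time reading data, rather than 100 times less, because the first iteration still has to read from disk. The more iterations the algorithm runs, the closer the overall gain gets to the memory-versus-disk ratio.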

### 7. Interactive Data Analysis with Spark

Interactive data analysis involves exploring a dataset in real-time, allowing for quick insights before running long and resource-intensive batch processing jobs. For instance, before executing a time-consuming job that might run for hours, a data analyst might perform summary analysis on a large dataset. Similarly, business analysts often require the ability to interactively analyze data using BI or visualization tools, running multiple queries on the same data. Spark is an ideal platform for such interactive analysis of large datasets.

The key advantage of Spark in interactive analysis is its **in-memory computing capabilities**. When an application caches the data to be interactively analyzed, the first query will read data from disk, but subsequent queries will access the cached data in memory. Since reading from memory is orders of magnitude faster than reading from disk, Spark can dramatically reduce query execution time. A query that would normally take over an hour when reading from disk can often be completed in just a few seconds when the data is cached in memory.


### 8. High-level architecture

<div style="text-align: center;">
    <img src="spark_diagram1.png" width="500"/>
</div>


### Key Components in Spark

- **Workers**: A worker provides CPU, memory, and storage resources to a Spark application. Workers run Spark applications as distributed processes across a cluster of nodes, enabling parallel computation.

- **Cluster Managers**: Spark uses a cluster manager to acquire and manage cluster resources for executing jobs. A cluster manager is responsible for managing computing resources across a cluster of worker nodes. It provides low-level scheduling of cluster resources and lets multiple applications share them, so that different applications can run on the same worker nodes. Spark supports the following cluster managers:
  - **Standalone**: Spark's native cluster manager.
  - **Mesos**: A general-purpose cluster manager (deprecated since Spark 3.2).
  - **YARN**: The Hadoop cluster manager.
  - **Kubernetes**: A container orchestrator, supported natively since Spark 2.3.

  Mesos and YARN allow Spark to run alongside Hadoop applications on the same worker nodes.

- **Driver Programs**: A driver program is an application that uses Spark as a library to process data. The driver provides the data processing code that Spark executes on the worker nodes. It can launch one or more jobs on a Spark cluster.

- **Executors**: An executor is a JVM (Java Virtual Machine) process created by Spark on each worker node for an application. It executes application code concurrently in multiple threads and can also cache data in memory or on disk. The lifespan of an executor is tied to the lifespan of the application. When the Spark application terminates, all executors associated with it are also terminated.

- **Tasks**: A task is the smallest unit of work that Spark sends to an executor. It is executed by a thread in an executor on a worker node. Each task performs computations to either return a result to the driver program or partition its output for shuffling. Spark creates one task per data partition, and an executor runs multiple tasks concurrently. The number of partitions sets the upper bound on parallelism: more partitions allow more tasks to run in parallel, up to the number of cores available to the executors.
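
The relationship between partitions, tasks, and executor threads can be sketched in plain Python. This is an analogy, not Spark code: a standard-library thread pool plays the role of a single executor with two task slots, and `split_into_partitions` and `task` are hypothetical helpers written for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal contiguous partitions."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def task(partition):
    """One task: compute a partial result for a single partition."""
    return sum(partition)


data = list(range(1, 11))                      # the numbers 1..10
partitions = split_into_partitions(data, 4)    # 4 partitions -> 4 tasks

# The pool plays the role of one executor with 2 task slots (threads).
with ThreadPoolExecutor(max_workers=2) as executor:
    partial_sums = list(executor.map(task, partitions))

# The "driver" combines the partial results returned by the tasks.
print(partitions)                        # [[1, 2, 3], [4, 5, 6], [7, 8], [9, 10]]
print(partial_sums, sum(partial_sums))   # [6, 15, 15, 19] 55
```

Four partitions produce four tasks, but only two of them run at any moment because the pool has two threads. Raising the number of partitions beyond the available threads adds more tasks without adding more simultaneous work.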


### 9. Application execution

This section briefly describes how data processing code is executed on a Spark cluster.

#### Terminology

Before we dive into the execution details, let's define some key terms:

- **Shuffle**: A shuffle is the process of redistributing data across the nodes of a cluster. It is an expensive operation because it involves moving data over the network. However, a shuffle does not randomly distribute data; instead, it groups data elements into buckets based on specific criteria. Each bucket forms a new partition.
  
- **Job**: A job is a set of computations that Spark performs to return results to the driver program. Essentially, it represents the execution of a data processing algorithm on a Spark cluster. An application can launch multiple jobs. The specifics of how a job is executed will be covered later in this chapter.
  
- **Stage**: A stage is a collection of tasks. Spark splits a job into a Directed Acyclic Graph (DAG) of stages, and stages may depend on one another. For example, a job could be divided into two stages—stage 0 and stage 1—where stage 1 cannot begin until stage 0 has completed. Spark groups

