<a href="https://colab.research.google.com/github/victorviro/Big-Data/blob/main/Introduction_to_Spark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 💥 Introduction to Apache Spark

In this notebook, we introduce [**Apache Spark**](https://spark.apache.org/docs/latest/index.html), an open-source, **distributed processing system** used for **big data** workloads. First, we will see the main characteristics and 👍 benefits this framework offers us. Later, we will explain some 🗝 key concepts around it. Finally, we explore the 🏗 architecture and the components that make it up.

The table of contents of this notebook is as follows:

1. [ℹ️ Introduction](#1)
2. [🗃 RDD, DataFrame and Dataset](#2)
3. [🎬⚙️ Actions and transformations](#3)
    1. [⚡ Caching](#3.1)
    2. [↕️ RDD lineage graph](#3.2)
    3. [🏎🚌 Narrow and wide transformations](#3.3)
    4. [⚙️ Jobs, stages and tasks](#3.4)
    5. [↔️ DAG (Directed Acyclic Graph)](#3.5)
4. [🏗 Spark architecture](#4)
    1. [🚗 The driver](#4.1)
    2. [👷 Executors](#4.2)
    3. [🕹 Cluster manager](#4.3)
    4. [🔛 Launching a program](#4.4)
    5. [🔃 Workflow](#4.5)
5. [📕 References](#5)

# ℹ️ Introduction <a name="1"></a>

[***Apache Spark***](https://spark.apache.org/docs/latest/index.html) is a unified **analytics ⚙️ engine for large-scale data processing**. It provides ⬆ high-level APIs in Java, Scala, Python, and R, and a rich set of higher-level 🛠 tools including 

- **Spark SQL** for SQL and structured data processing
- **MLlib** for machine learning
- **GraphX** for graph processing
- Spark **Streaming** for incremental computation and stream processing

These high-level tools are built on 🔝 top of **Spark Core**, which is the base engine for distributed data processing. 

<center><img src='https://i.ibb.co/Dr8d908/spark-ecosystem.png'></center>

Spark works by **distributing the processing over a cluster** of servers or nodes. It can perform computations on **large data sets**. Briefly, Spark ✂️ splits a computation into smaller tasks and distributes them to the nodes of the cluster. These nodes execute the tasks and ↩️ deliver the result to the main node.

Let's summarize the **main characteristics** of Spark.

🚄 ***Speed***

Spark can process petabytes of data distributed across thousands of servers. It does so largely through **in-memory processing**, which lets it 🚚 deliver analysis in near **real-time** at high ✈ speed. Spark extends the MapReduce model by supporting more types of computations (e.g. stream processing or machine learning).

😀 ***Friendly APIs***

Spark provides instructions at a high level of abstraction. It is **compatible** with Java, Python, Scala and R. Spark can create distributed datasets from different file storage systems like HDFS, Amazon S3, HBase, or Cassandra.

😴 ***Lazy evaluation*** 

Spark delays evaluation until it is necessary. This is a 🗝 key factor that contributes to its speed. It builds a 📋 list of tasks (transformations) and performs nothing until we ask for the final result. Spark **aggregates the transformations into a computational DAG** (Directed Acyclic Graph), and they are **executed only when the driver program asks for some data**.

💁🏻 ***Scalability***

Apache Spark is highly scalable. When we need more processing 💪 power, we ➕ **add more servers to the cluster**. Instead of buying a super 🖥 computer able to accommodate our dataset (scaling up), we rely on multiple computers, ✂ splitting the job between them (scaling out), which is usually less expensive and often faster. To achieve this, Spark can run over different *cluster managers* as we'll see later.

☝️ ***Unified stack***

Spark contains **multiple** tightly integrated **components**, which provide various 👍 benefits:

- Higher-level components (e.g. SQL or machine learning) benefit from improvements in the lower layers (core engine).

- The 💰 cost of development, maintenance, and 🚀 deployment is ⬇️ reduced drastically since we **don't have to manage independent systems** for performing different processing models.

# 🗃 RDD, DataFrame and Dataset <a name="2"></a>

**RDDs**

The main programming abstraction of Spark is the **resilient distributed dataset (RDD)**, which is a collection of elements **✂️ partitioned across the nodes** of the cluster that **can be operated on in parallel**. RDDs can contain any type of Python, Java, or Scala objects. An RDD can be **persisted in-memory** on worker nodes, allowing it to be ♻️ reused efficiently across parallel operations. RDD stands for:

- 💪 ***Resilient***: They can recover from 🐛 failures because Spark tracks how each RDD was derived (its lineage) and can recompute lost partitions. Thus, even if one executor node ❌ fails, the affected partitions can be rebuilt and processed on another node.
- ***Distributed***: Data is distributed among different nodes in a cluster.
- ***Dataset***: Collection of partitioned data with values.

Once an RDD is created, it becomes **immutable**, that is, its state cannot be modified after it is created, but it can be transformed.

In Spark all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark 🤖 automatically distributes the data contained in RDDs across our cluster and parallelizes the operations we perform on them.

We can create RDDs by loading an external dataset or by distributing a collection of objects in the driver program. The last way is usually used only for 🚧 development/testing since it requires having an entire dataset in memory in one 💻 machine.
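To make the idea of partitioning more concrete, here is a minimal sketch in plain Python (it is a toy model, not the Spark API). The helper names `parallelize` and `map_partitions` are hypothetical; `parallelize` only echoes the name of Spark's `SparkContext.parallelize`, and threads stand in for the worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def parallelize(collection, num_partitions):
    """Split a local collection into roughly equal partitions (toy version)."""
    items = list(collection)
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions get one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


def map_partitions(partitions, fn):
    """Apply fn to every element; each partition is handled by its own thread."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # A new list of partitions is returned; the input is left untouched.
        return list(pool.map(lambda part: [fn(x) for x in part], partitions))


partitions = parallelize(range(10), num_partitions=3)
doubled = map_partitions(partitions, lambda x: x * 2)
flat = [x for part in doubled for x in part]
```

Note that `map_partitions` never modifies `partitions`; it builds a new collection, which mirrors the immutability of RDDs described above.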

**Dataset and DataFrame** APIs provide the 👍 **benefits of RDDs** with the 👍 **benefits of Spark SQL’s** optimized execution engine. Like an RDD, they are immutable distributed collections of data. Unlike in an RDD, the data is structured. When we 👨‍💻 develop Spark applications, **we typically use DataFrames and Datasets**. RDDs still work internally within these APIs and are important for the efficiency of Spark, but they are now used primarily for ⬇ low-level tasks.

In a **DataFrame**, data is **organized into named columns**, like a table in a relational database. It allows developers to **impose a structure** onto a distributed collection of data, and it provides a **friendly language API to manipulate our distributed data** (👉 select columns, filter, 🔗 join, aggregate, etc) that allows us to solve common data analysis problems efficiently. 

**Datasets** are an extension of the DataFrame API that provides static type safety (applications can be ⚠️ checked for type errors before they are run) and an object-oriented programming interface. The typed Dataset API is available in Scala and Java.

<center><img src='https://i.ibb.co/HhwNJxY/spark-rdd-df-dataset.jpg'></center>

Further reading:
- [A Tale of Three Apache Spark APIs: RDDs vs DataFrames and Datasets](https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)

- [Apache Spark: 3 Reasons Why You Should Not Use RDDs](https://dzone.com/articles/apache-spark-3-reasons-why-you-should-not-use-rdds)

- [Apache Spark RDD vs DataFrame vs DataSet](https://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset/)

- [RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)

# 🎬⚙️ Actions and transformations <a name="3"></a>

RDDs offer 2️⃣ **types of operations**:

- ***⚙️ Transformations*** are operations that **create a new RDD from an existing one**. For example, `map` is a transformation that passes each dataset element through a function and returns a new RDD representing the results. Other examples of transformations are:
 - Adding a column to a DataFrame
 - Performing an aggregation or filtering
 - Computing summary statistics on a dataset

- ***🎬 Actions***, on the other hand, **compute a result based on an RDD**, and either ↩️ **return it** to the driver program or save it to an external storage system (e.g., HDFS). For example, `reduce` is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program. Other examples of actions are:
 - Printing information on the screen (e.g. `show` method)
 - Writing data to a hard drive or cloud bucket (e.g. `write` method)
 - Counting the number of elements in a dataset (`count` method)

**Transformations** in Spark are 😴 **lazily computed**. Spark just remembers the transformations applied to a dataset (📝 instructions), and they are only computed when an action needs them. For example, a dataset transformed through `map` and then aggregated with `reduce` will return only the result of the `reduce` to the driver, rather than the larger mapped dataset.

Actions force the evaluation of the transformations required for the RDD they were called on, since they need to actually produce output.
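The following is a minimal sketch of lazy evaluation in plain Python, assuming a toy `LazyDataset` class (a hypothetical name, not part of Spark). Transformations only record a step; the recorded steps run when an action such as `reduce` or `count` is called.

```python
from functools import reduce as fold


class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, pending=()):
        self._source = source            # the original data (any iterable)
        self._pending = tuple(pending)   # recorded steps, not yet executed

    def map(self, fn):
        # Transformation: return a NEW dataset with one more recorded step.
        return LazyDataset(self._source, self._pending + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._pending + (("filter", predicate),))

    def _evaluate(self):
        # Stream each element through every recorded step, one at a time.
        for item in self._source:
            keep = True
            for kind, fn in self._pending:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):       # a filter step rejected the element
                    keep = False
                    break
            if keep:
                yield item

    def reduce(self, fn):
        # Action: forces evaluation and returns a single value.
        return fold(fn, self._evaluate())

    def count(self):
        return sum(1 for _ in self._evaluate())


calls = []  # records when the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset(range(1, 6))
squares = numbers.map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)   # still 0: nothing has been computed
total = squares.reduce(lambda a, b: a + b)
calls_after_action = len(calls)    # square ran once per input element
```

Only the single value `total` comes back from the action; the intermediate mapped and filtered elements are streamed through and never stored.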

<center><img src='https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2018/09/Picture1-5-768x266.png'></center>


This way of dealing with computations has many 👍 **benefits** with large scale data:

- Storing 📋 instructions in memory takes **less space** than storing intermediate results. If we perform many operations on a dataset and materialize the resulting data at each step, we'll exhaust our storage faster, even though we don't need the intermediate results.

- By having the list of operations to be performed, Spark can 💡 optimize the work between executors more efficiently.

## ⚡ Caching <a name="3.1"></a>

RDDs are by default ↩️ recomputed each time we run an 🎬 action on them. If we want to ♻️ reuse an RDD in multiple actions, we can **persist it in memory** (partitioned across the machines of the cluster). If we won't reuse the RDD, there is 🙅 no reason to persist it, since we would waste storage space when Spark could instead stream through the data once and just compute the result.
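As a rough illustration, the sketch below models this behaviour in plain Python with a hypothetical `ToyRDD` class (not the Spark API). It counts how many times the data is rebuilt with and without caching; in PySpark the corresponding calls would be `cache()` or `persist()` on an RDD.

```python
class ToyRDD:
    """Toy dataset that recomputes on every action unless cache() was called."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the data
        self._use_cache = False
        self._cached = None
        self.computations = 0     # how many times the data was actually built

    def cache(self):
        # Only marks the dataset; the data is stored on the first action.
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        data = list(self._compute())
        if self._use_cache:
            self._cached = data
        return data

    def count(self):
        return len(self._materialize())

    def sum(self):
        return sum(self._materialize())


def expensive():
    # Stand-in for a costly chain of transformations.
    return (x * x for x in range(1000))


uncached = ToyRDD(expensive)
uncached_results = (uncached.count(), uncached.sum())   # two actions, two rebuilds

cached = ToyRDD(expensive).cache()
cached_results = (cached.count(), cached.sum())         # two actions, one rebuild
```

Both datasets return the same results, but the cached one pays for the computation once and keeps the data in memory afterwards, which is the trade-off described above.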

## ↕️ RDD lineage graph <a name="3.2"></a>

As we derive new RDDs from each other using transformations, Spark keeps 🗒 track of the set of 🖇 dependencies between different RDDs, called the **lineage graph**.

<center><img src='https://i.ibb.co/7Wk1R6y/spark-lineage-graph-example.png'></center>

An RDD lineage graph is hence a **graph of** the **transformations** that need to be executed after an 🎬 action has been called. It starts with the 🔝 earliest RDDs (those that have no dependencies on other RDDs or that reference cached data) and ends with the RDD that produces the result of the action.
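A minimal sketch of lineage tracking in plain Python, assuming a hypothetical `Node` class in which every dataset remembers its parent and the function that derives it. Replaying that chain from the source is also how lost data can be rebuilt. This is a toy model of a linear lineage, not Spark's implementation.

```python
class Node:
    """Toy RDD that remembers which function and parent produced it."""

    def __init__(self, name, parent=None, fn=None, data=None):
        self.name = name
        self.parent = parent   # None for a source dataset
        self.fn = fn           # per-element function applied to the parent
        self.data = data       # only source nodes hold data directly

    def map(self, name, fn):
        # Deriving a new node records the dependency instead of computing it.
        return Node(name, parent=self, fn=fn)

    def lineage(self):
        """Names from the earliest ancestor down to this node."""
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

    def compute(self):
        """Rebuild this node's data by replaying the lineage from the source."""
        if self.parent is None:
            return list(self.data)
        return [self.fn(x) for x in self.parent.compute()]


lines = Node("lines", data=["a b", "c d e"])
words = lines.map("word_lists", str.split)
counts = words.map("word_counts", len)

lineage = counts.lineage()
result = counts.compute()
```

Here `lineage` lists the datasets in the order they must be computed, and `result` is obtained purely from the source data plus the recorded functions.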

## 🏎🚌 Narrow and wide transformations <a name="3.3"></a>

There are 2️⃣ types of transformations:

- 🏎 **Narrow transformations** are transformations where all the elements required to compute the records in a single partition live in a single partition of the parent RDD. For example, `map` and `filter` are narrow transformations.

- 🚌 **Wide transformations** are transformations where the elements required to compute the records in a single partition may live in many partitions of the parent RDD. For example, `groupByKey` and `join` are wide transformations.


<center><img src='https://i.ibb.co/qFQvqPV/narrow-and-wide-transformations-spark.png'></center>

**Narrow** transformations are ✈ **faster** than wide transformations because they **do not require any data 🔀 shuffling** (data movement over the cluster network). It is good to keep in mind which transformations might require data shuffling (and hence slow 🐢 down the process), and to ⬇ reduce the usage of wide transformations where possible.
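The difference can be sketched with plain Python lists standing in for partitions. The function names `narrow_map` and `wide_group_by_key` are hypothetical, and routing records by `hash(key)` is an assumption made for illustration (integer keys are used so that the hash is deterministic).

```python
from collections import defaultdict


def narrow_map(partitions, fn):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[fn(record) for record in part] for part in partitions]


def wide_group_by_key(partitions, num_output_partitions):
    """Wide: records sharing a key must meet, so data is redistributed by key."""
    shuffled = [defaultdict(list) for _ in range(num_output_partitions)]
    moved = 0  # records that end up at a different partition index
    for source_index, part in enumerate(partitions):
        for key, value in part:
            target_index = hash(key) % num_output_partitions
            shuffled[target_index][key].append(value)
            if target_index != source_index:
                moved += 1
    return [dict(groups) for groups in shuffled], moved


partitions = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
upper = narrow_map(partitions, lambda kv: (kv[0], kv[1].upper()))
grouped, moved = wide_group_by_key(partitions, num_output_partitions=2)
```

In `narrow_map` every partition is processed in isolation, while in `wide_group_by_key` the values for key `1` start in two different partitions and have to be brought together, which is the data movement that a shuffle performs over the network.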

## ⚙️ Jobs, stages and tasks <a name="3.4"></a>

When we invoke an 🎬 action on an RDD, a **job** is created. A job is divided into one or more stages, and stages are further ✂️ divided into individual tasks.

- A **job is divided into stages** based on 🔀 shuffle boundaries. That is, when Spark encounters a transformation that requires a shuffle (🚌 wide transformation), it creates a new stage.

- Each stage is divided into tasks based on the 🔢 number of partitions in the RDD.

- The **tasks** within a stage **run in parallel**. **Every task** in the stage executes the same 📋 set of instructions **over a single partition**. So tasks are the smallest units of work in Spark.
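The rules above can be sketched in plain Python for a simple linear chain of operations. This is a simplification under stated assumptions: the `WIDE` set, the helper names, and a partition count that stays fixed across stages are all hypothetical, and Spark's real scheduler handles far more cases.

```python
WIDE = {"groupByKey", "reduceByKey", "join"}  # operations assumed to need a shuffle


def split_into_stages(operations):
    """Cut a linear chain of operations into stages at every wide operation."""
    stages, current = [], []
    for op in operations:
        if op in WIDE and current:
            stages.append(current)   # shuffle boundary: close the running stage
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


def tasks_for(stages, num_partitions):
    """One task per (stage, partition) pair."""
    return [
        (stage_id, partition)
        for stage_id in range(len(stages))
        for partition in range(num_partitions)
    ]


job = ["textFile", "map", "filter", "reduceByKey", "map"]
stages = split_into_stages(job)
tasks = tasks_for(stages, num_partitions=4)
```

The narrow operations before the shuffle end up in the same stage, and each stage yields one task per partition, so this hypothetical job has two stages and eight tasks.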

<center><img src='https://i.ibb.co/Bzbwhk7/spark-job.png'></center>


The figure below illustrates an example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are cached.

<center><img src='https://i.ibb.co/c39WQMk/stages-spark.png'></center>

To run an action on RDD *G*, the scheduler builds stages at 🚌 wide dependencies and **pipelines** 🏎 narrow transformations inside each stage. In this case, stage 1 does not need to run since B is cached, so we run 2 and then 3.

## ↔️ DAG (Directed Acyclic Graph) <a name="3.5"></a>

A **DAG** (Directed Acyclic Graph) is a directed graph with no 🔄 directed cycles. In Spark, the ✳️ **vertices represent the RDDs**, and the ➖ **edges** represent the **operations** to be applied to the RDDs.

During execution, Spark **transforms a ↕️ logical execution plan ([RDD lineage](#3.2)) into a physical execution plan (↔️ DAG of stages)**, performing several ♻️ optimizations (such as "pipelining" transformations).
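Because the graph has no cycles, there is always at least one order in which every RDD can be computed after its parents. A small sketch using Python's standard `graphlib` module (Python 3.9+) on a hypothetical job, where RDDs `A` and `B` are joined into `C` and `C` is mapped into `D`:

```python
from graphlib import CycleError, TopologicalSorter

# Each key lists the RDDs it is computed from (its parents).
parents = {"C": {"A", "B"}, "D": {"C"}}


def execution_order(parents):
    """Return one valid order in which the RDDs can be computed."""
    return list(TopologicalSorter(parents).static_order())


def is_acyclic(parents):
    """True if the dependency graph is a DAG."""
    try:
        execution_order(parents)
        return True
    except CycleError:
        return False


order = execution_order(parents)

# Making A depend on D would close a loop (A -> C -> D -> A): no longer a DAG.
with_cycle = {"C": {"A", "B"}, "D": {"C"}, "A": {"D"}}
```

Any order returned places `A` and `B` before `C`, and `C` before `D`; the graph in `with_cycle` has no such order, which is why dependencies between RDDs must stay acyclic.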


<center><img src='https://i.ibb.co/0s0vy8M/dagscheduler-rdd-lineage-stage-dag.png'></center>

# 🏗 Spark architecture <a name="4"></a>

In distributed mode, Spark follows a **master-slave architecture** with a cluster manager. A cluster has a master node and several 👷 worker nodes. The master node is where the process called 🚗 ***driver*** runs. The driver 🗣️ communicates with the distributed workers, which run processes called ⚙️ ***executors***.

<center><img src='https://i.ibb.co/dkN1Tj6/spark-archi.png'></center>

## 🚗 The driver <a name="4.1"></a>

The 🚗 **driver program** is the process that runs the 👨‍💻 code of our application: it creates a Spark Context and RDDs, and performs transformations and 🎬 actions.

> The ***Spark Context*** is like a 🚪 **gateway to the Spark functionalities**. It's similar to a database connection. Any command we execute in our database goes through the database connection. Likewise, anything we do on Spark goes through Spark context.

The driver performs two duties:
- **Convert** the user **program ➡️ into tasks**. When the driver runs and finds an action on an RDD, it will 🔜 trigger a job to be run. 

  The driver then **converts the lineage graph into a set of stages**, performing several optimizations (such as "pipelining" transformations). The tasks of the stages are bundled up and prepared to be ➡️ sent to the cluster.

- 🕒 **Schedule tasks on 👷 executors**. The driver, which has a complete 👀 view of the application's executors, coordinates the scheduling of individual tasks on executors. The driver will look at the current executors and try to schedule each task in an appropriate location, based on data placement.
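The idea of scheduling based on data placement can be sketched as follows. This is a toy heuristic in plain Python with hypothetical names and inputs (prefer an executor that already holds the task's partition, otherwise pick the least loaded one); it is not Spark's actual scheduling algorithm.

```python
def schedule(tasks, executors):
    """Assign each task to an executor, preferring one that holds its partition.

    tasks:     {task_id: partition_id}
    executors: {executor_id: set of partition ids stored on that executor}
    """
    load = {executor_id: 0 for executor_id in executors}
    assignment = {}
    for task_id, partition in sorted(tasks.items()):
        local = [e for e, parts in executors.items() if partition in parts]
        candidates = local or list(executors)   # fall back to any executor
        # Least loaded candidate first; ties are broken by executor id.
        chosen = min(candidates, key=lambda e: (load[e], e))
        assignment[task_id] = chosen
        load[chosen] += 1
    return assignment


tasks = {"t0": 0, "t1": 1, "t2": 2, "t3": 3}
executors = {"exec-1": {0, 1}, "exec-2": {2}}
assignment = schedule(tasks, executors)
```

Tasks `t0`, `t1` and `t2` run where their partitions already live, while `t3`, whose partition is not held by any executor, goes to the executor with the lighter load.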

Moreover, the driver exposes information about the running Spark application through a **web interface**.

## 👷 Executors <a name="4.2"></a>

The executors are worker processes responsible for 🔛 **running the individual tasks** in a given Spark job. Executors have two roles:

- They run the tasks and ↩️ return results to the 🚗 driver.

- They provide **in-memory storage** for RDDs that are cached by 👨‍💻 user programs. Since RDDs are cached inside the executors, tasks can run alongside the cached data.

A driver and its executors are 🔗 together termed a ***Spark application***.

## 🕹 Cluster manager <a name="4.3"></a>

A Spark application is launched using an external service called the cluster manager. The cluster manager launches the 👷 executor processes. Spark can run 🔝 on top of different cluster managers, such as YARN, Mesos, and Spark's built-in Standalone cluster manager.

## 🔛 Launching a program <a name="4.4"></a>

Spark provides the script `spark-submit` to ⬆️ **submit a program**. Through various options, `spark-submit` can 🔗 connect to different cluster managers and 🕹 control how many resources the application gets. For some cluster managers, `spark-submit` can run the 🚗 driver within the cluster (e.g., on a YARN worker node), while for others, it can run it only on our 💻 local machine.

## 🔃 Workflow <a name="4.5"></a>

Let's 🚶 walk through the steps that occur when we 🔛 run a Spark application on a cluster:

1. The 👨‍💻 user submits an application using the `spark-submit` command.
2. `spark-submit` launches the 🚗 driver program and invokes the `main()` method specified by the user.
3. The 🚗 driver program 🔊 contacts the 🕹 cluster manager to ask for resources to launch executors.
4. The cluster manager launches 👷 executors on behalf of the driver program.
5. The 🚗 driver process runs through the user application. Based on the RDD 🎬 actions and ⚙️ transformations in the program, the driver ➡️ sends work to executors in the form of tasks.
6. Tasks are run on 👷 executor processes to compute and save results or return them to the driver.
7. If the driver’s `main()` method exits or it calls `SparkContext.stop()`, it will 🏁 terminate the executors and release resources from the cluster manager.

# 📕 References <a name="5"></a>

- [Spark documentation](https://spark.apache.org/docs/latest/cluster-overview.html)

- [Book "Learning Spark: Lightning-Fast Big Data Analysis"](https://www.oreilly.com/library/view/learning-spark/9781449359034/)

- [Online book "apache-spark-internals"](https://books.japila.pl/apache-spark-internals/)

- [Databricks documentation](https://docs.databricks.com/getting-started/spark/index.html)

- [Book "Data Analysis with Python and PySpark"](https://www.manning.com/books/data-analysis-with-python-and-pyspark)

- [A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)](https://youtu.be/dmL0N3qfSc8)

- [Advanced Apache Spark Training - Sameer Farooqui (Databricks)](https://youtu.be/7ooZ4S7Ay6Y)