# Spark Architecture and Applications{background-color="black" background-image="https://miro.medium.com/max/1400/1*arBqq7O7umskV4O7JjhdrA.jpeg" background-size="75%" background-opacity="0.5" background-position="top"}

## Install pyspark locally

:::{.fragment}
Download Spark 
```bash
wget https://dlcdn.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz
tar xzf spark-3.5.3-bin-hadoop3.tgz
ln -s spark-3.5.3-bin-hadoop3 spark
```
:::

:::{.fragment}
Install pyspark/findspark 
```bash
pip install pyspark findspark
```
:::

:::{.fragment}
Test on bash
```bash
export SPARK_HOME=/home/tap/spark
$SPARK_HOME/bin/run-example SparkPi
```

:::

:::{.fragment}
Test on jupyter
```bash
export SPARK_HOME=/home/tap/spark
jupyter notebook
```
:::

In [2]:
import findspark
findspark.init()  # locate the Spark installation via SPARK_HOME
import pyspark
conf = pyspark.SparkConf().setAppName('Tap').setMaster('local[8]')
sc = pyspark.SparkContext(conf=conf)
sc

24/11/17 15:24:52 WARN Utils: Your hostname, pappanics resolves to a loopback address: 127.0.1.1; using 192.168.45.65 instead (on interface ens160)
24/11/17 15:24:52 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/17 15:24:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Spark Variables
![](https://programmerhumor.io/wp-content/uploads/2023/02/programmerhumor-io-programming-memes-20822be2f46de63-608x776.png){.fragment .lightbox}

### Shared Variables

- Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function.
- These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program.
- Supporting general, read-write shared variables across tasks would be inefficient.
- Spark does provide two limited types of shared variables for two common usage patterns: **broadcast variables** and **accumulators**.
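
The copy semantics described above can be imitated in plain Python, without Spark. The sketch below is only a model: `ship_and_run` and `driver_state` are hypothetical names, and `pickle` stands in for whatever serialization carries a task's variables to a remote node.

```python
import pickle

def ship_and_run(state_bytes, values):
    """Simulate a task on a remote node: it only sees a deserialized copy."""
    local_state = pickle.loads(state_bytes)
    for v in values:
        local_state["total"] += v   # updates the copy, not the driver's object
    return local_state["total"]

driver_state = {"total": 0}          # variable living in the "driver"
shipped = pickle.dumps(driver_state)

# Two "tasks", one per partition, each working on its own copy.
partition_totals = [ship_and_run(shipped, part) for part in ([1, 2], [3, 4])]

print(partition_totals)  # [3, 7]
print(driver_state)      # {'total': 0}
```

Each task sees its own total, and the driver's value never changes. This is the gap that accumulators are meant to fill.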

#### Broadcast
- Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
- They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
- Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
- Spark actions are executed through a set of stages, separated by distributed “shuffle” operations.
- Spark automatically broadcasts the common data needed by tasks within each stage.
- The data broadcasted this way is cached in serialized form and deserialized before running each task.
- This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

In [3]:
euroRate = sc.broadcast(1.05)
euroRate.value

1.05

In [4]:
import random
# Generate 10 distinct random numbers between 1 and 99
sales = random.sample(range(1, 100), 10)
salesRDD=sc.parallelize(sales)
salesRDD.collect()

[92, 59, 68, 37, 76, 52, 21, 70, 67, 80]

In [5]:
# Use euroRate to convert each sale to dollars
salesRDD.map(lambda x: str(round(x*euroRate.value,2))+"$").collect()

['96.6$',
 '61.95$',
 '71.4$',
 '38.85$',
 '79.8$',
 '54.6$',
 '22.05$',
 '73.5$',
 '70.35$',
 '84.0$']

In [6]:
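# Caution: this function is not associative, so the rate is re-applied at every merge and the result depends on partitioning
salesRDD.reduce(lambda a,b:(a+b)*euroRate.value)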

781.788424774219
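
The value above is not `sum(sales) * 1.05`, because the function passed to `reduce` applies the rate again at every merge. The plain-Python sketch below (no Spark involved) imitates a partitioned reduce to show the effect. The 8-slice layout is an assumption about how the 10 values were split under `local[8]`, and `partitioned_reduce` is a hypothetical helper, not a Spark API.

```python
from functools import reduce

RATE = 1.05
sales = [92, 59, 68, 37, 76, 52, 21, 70, 67, 80]

def convert_and_add(a, b):
    # Same function as in the cell above: not associative.
    return (a + b) * RATE

def partitioned_reduce(partitions, func):
    """Reduce each partition locally, then merge the partial results in order."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

# Assumed layout: 10 values spread over 8 partitions.
eight_parts = [[92], [59], [68], [37, 76], [52], [21], [70], [67, 80]]
one_part = [sales]

skewed = partitioned_reduce(eight_parts, convert_and_add)
single = partitioned_reduce(one_part, convert_and_add)

# Associative alternative: sum first, convert once.
correct = reduce(lambda a, b: a + b, sales) * RATE

print(skewed)   # about 781.79 with the assumed 8-partition layout
print(single)   # a different value with a single partition
print(correct)  # about 653.1
```

With the assumed layout and in-order merging, the sketch reproduces the value shown in the cell output. Changing the number of partitions changes the result, which is why `reduce` expects an associative and commutative function.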

##### Notes
- A broadcast variable is created from a value v by calling sc.broadcast(v). After it is created, it should be used instead of v in any functions run on the cluster so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later).

- To release the resources that the broadcast variable copied onto executors, call .unpersist(). If the broadcast is used again afterwards, it will be re-broadcast. To permanently release all resources used by the broadcast variable, call .destroy(). The broadcast variable can’t be used after that. Note that these methods do not block by default. To block until resources are freed, specify blocking=True when calling them.

#### Accumulators
- Accumulators are variables that are only “added” to through an associative and commutative operation and can therefore be efficiently supported in parallel.
- They can be used to implement counters (as in MapReduce) or sums.
- Spark natively supports accumulators of numeric types, and programmers can add support for new types.
- For accumulator updates performed inside **actions** only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value.
- In **transformations**, users should be aware that each task’s update may be applied more than once if tasks or job stages are re-executed.

In [6]:
accum = sc.accumulator(0)
accum

Accumulator<id=0, value=0>

In [7]:
data=sc.parallelize([1, 2, 3, 4])
data.foreach(lambda x: accum.add(x))

In [8]:
accum.value

10

**Accumulator and lazy evaluation**

- Accumulators do not change the lazy evaluation model of Spark.
- If they are being updated within an operation on an RDD, their value is only updated once that RDD is computed as part of an action.
- Consequently, accumulator updates are not guaranteed to be executed when made within a lazy transformation like map(). 

**Example**

In [9]:
accum = sc.accumulator(0) # Reset
accum

Accumulator<id=1, value=0>

In [10]:
data.collect()

[1, 2, 3, 4]

In [11]:
def g(x):
    accum.add(x)
    return x
data.map(g) 
# Here, accum is still 0 because no actions have caused the `map` to be computed.
accum

Accumulator<id=1, value=0>

In [12]:
data.map(g).foreach(lambda x: accum.add(1))

**What is the value of accum?**

In [13]:
accum

Accumulator<id=1, value=14>
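
The sequence above can be mimicked in plain Python, without Spark. This is a toy model under stated assumptions: `ToyAccumulator` and `LazySeq` are hypothetical stand-ins, not PySpark classes. `map` only records a function and `foreach` plays the role of an action.

```python
class ToyAccumulator:
    """Hypothetical stand-in for a Spark accumulator: tasks can only add to it."""
    def __init__(self, value=0):
        self.value = value

    def add(self, amount):
        self.value += amount


class LazySeq:
    """Toy model of an RDD: map() records work, foreach() performs it."""
    def __init__(self, items, funcs=()):
        self._items = list(items)
        self._funcs = tuple(funcs)

    def map(self, func):
        # Transformation: nothing is computed here.
        return LazySeq(self._items, self._funcs + (func,))

    def foreach(self, action):
        # Action: recompute the recorded chain for every item, then apply action.
        for item in self._items:
            for func in self._funcs:
                item = func(item)
            action(item)


accum = ToyAccumulator()

def g(x):
    accum.add(x)
    return x

mapped = LazySeq([1, 2, 3, 4]).map(g)
after_map = accum.value                  # nothing has run yet

mapped.foreach(lambda x: accum.add(1))
after_first_action = accum.value         # g added 1+2+3+4, foreach added 4 * 1

mapped.foreach(lambda x: accum.add(1))
after_second_action = accum.value        # the map was recomputed, so g ran again

print(after_map, after_first_action, after_second_action)  # 0 14 28
```

The first two numbers match the notebook cells. The third mirrors the earlier warning: an update made inside a transformation is applied again whenever that transformation is recomputed.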

# Spark Application

[Spark Documentation](https://spark.apache.org/docs/latest/rdd-programming-guide.html)

A Spark application consists of a **driver** program that runs the user’s main function and executes various parallel operations on a cluster. 

Let's look at the anatomy of a Spark application.

![](https://miro.medium.com/max/700/1*B9lbB8uU7a_Xi0a1uDImRw.jpeg){.lightbox}

![](https://spark.apache.org/docs/latest/img/cluster-overview.png){.lightbox}

![[Anatomy](https://medium.com/@meenakshisundaramsekar/anatomy-of-a-spark-application-in-a-nutshell-2e542d5f334e)](images/anatomyofsparkapp.png){.lightbox}

## Driver

- The life of a Spark program starts and ends with the Spark driver.
- The Spark driver is the process that clients use to submit the Spark program.
- The driver is also responsible for planning and executing the Spark program and for returning the status/results to the client.

:::: {.columns}

::: {.fragment .column width="50%"}
[Apache Spark Architecture](https://www.dezyre.com/article/apache-spark-architecture-explained-in-detail/338)
![](images/sparkarchiteture.png)
::: 

::: {.fragment .column width="50%"}
- It is the central point and the entry point of the Spark Shell (Scala, Python, and R).
- The driver program runs the main() function of the application and is the place where the Spark Context is created.
- The Spark driver contains various components responsible for translating user code into the Spark jobs executed on the cluster.
:::
::::



## DAG Scheduler

### What is the DAG Scheduler (1)
[SparkBasic](https://medium.com/@goyalsaurabh66/spark-basics-rdds-stages-tasks-and-dag-8da0f52f0454)

:::: {.columns}

::: {.fragment .column width="50%"}
- DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling. 

- It transforms a logical execution plan (i.e. RDD lineage of dependencies built using RDD transformations) to a physical execution plan (using stages).

:::{.fragment}
```scala
val input = sc.textFile("log.txt")
val splitedLines = input
  .map(line => line.split(" "))
  .map(words => (words(0), 1))
  .reduceByKey { (a, b) => a + b }
```
:::
::: 

::: {.fragment .column width="50%"}
![](https://miro.medium.com/max/700/1*1WfneX6c7Lc9fqAaR9MaGA.png)
:::
::::
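
To make the two-stage physical plan more concrete, here is a plain-Python sketch (not Spark code) of how the first-word count above could execute. The two maps run per input partition in a first stage, and a second stage sums the values per key after a shuffle. The log lines, `map_stage`, and `reduce_stage` are hypothetical, and bucketing by `hash(key) % num_reducers` is only an illustrative partitioning rule.

```python
from collections import defaultdict

def map_stage(partition, num_reducers):
    """Stage 1 task: emit (first word, 1) and bucket the pairs by key hash."""
    buckets = [[] for _ in range(num_reducers)]
    for line in partition:
        key = line.split(" ")[0]
        buckets[hash(key) % num_reducers].append((key, 1))
    return buckets

def reduce_stage(pairs):
    """Stage 2 task: sum the values of each key found in one bucket."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Hypothetical log lines, already split into two input partitions.
partitions = [
    ["INFO start", "WARN disk", "INFO ready"],
    ["ERROR crash", "INFO restart", "WARN disk"],
]
num_reducers = 2

# Stage 1: one task per input partition.
map_outputs = [map_stage(p, num_reducers) for p in partitions]

# Shuffle + stage 2: reducer r fetches bucket r from every map output.
result = {}
for r in range(num_reducers):
    fetched = [pair for out in map_outputs for pair in out[r]]
    result.update(reduce_stage(fetched))

print(result)
```

All pairs with the same key land in the same bucket, so each reducer can sum its keys independently. That exchange between the two loops is the stage boundary.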


### What is the DAG Scheduler (2)
[SparkByExample](https://sparkbyexamples.com/spark/what-is-dag-in-spark/#:~:text=In%20Spark%2C%20the%20DAG%20Scheduler,across%20a%20cluster%20of%20machines.)

:::: {.columns}

::: {.fragment .column width="40%"}
In Spark, the DAG Scheduler is responsible for transforming a sequence of RDD transformations and actions into a directed acyclic graph (DAG) of stages and tasks, which can be executed in parallel across a cluster of machines. 
::: 

::: {.fragment .column width="60%"}
1. **Stages**: there are shuffle stages and non-shuffle stages. Shuffle stages involve the exchange of data between nodes, while non-shuffle stages do not.
2. **Tasks**: A task represents a single unit of work that can be executed on a single partition of an RDD. Tasks are the smallest units of parallelism in Spark.
3. **Dependencies**: The dependencies between RDDs determine the order in which tasks are executed.
   - Narrow dependencies indicate that each partition of the parent RDD is used by at most one partition of the child RDD.
   - Wide dependencies indicate that each partition of the parent RDD can be used by multiple partitions of the child RDD.
:::
::::
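
The relationship between dependencies, stages and tasks can be sketched in a few lines of plain Python. This is a simplified model, not the DAGScheduler's algorithm: it assumes a linear lineage and the same number of partitions in every stage, and `split_into_stages` is a hypothetical helper.

```python
def split_into_stages(ops):
    """ops: list of (name, dependency), dependency being "narrow" or "wide".

    A wide dependency needs a shuffle, so it starts a new stage;
    narrow operations are pipelined into the current stage.
    """
    stages = [[]]
    for name, dependency in ops:
        if dependency == "wide" and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages

# Lineage of the word-count style example: two maps, then reduceByKey.
lineage = [
    ("textFile", "narrow"),
    ("map", "narrow"),
    ("map", "narrow"),
    ("reduceByKey", "wide"),
]
stages = split_into_stages(lineage)

# One task per (stage, partition) pair.
num_partitions = 4
tasks = [(stage_id, p) for stage_id in range(len(stages)) for p in range(num_partitions)]

print(stages)      # [['textFile', 'map', 'map'], ['reduceByKey']]
print(len(tasks))  # 8
```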

## TaskScheduler
<https://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/TaskScheduler.html>
```javadoc
public interface TaskScheduler
```

- Low-level task scheduler interface, currently implemented exclusively by TaskSchedulerImpl.
- This interface allows plugging in different task schedulers.
- Each TaskScheduler schedules tasks for a single SparkContext.
- These schedulers get sets of tasks submitted to them from the DAGScheduler for each stage, and are responsible for sending the tasks to the cluster, running them, retrying if there are failures, and mitigating stragglers.
- They return events to the DAGScheduler.
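
The "retrying if there are failures" part can be pictured with a small plain-Python sketch. It is not how TaskSchedulerImpl is written. `run_with_retries` and `flaky_task` are hypothetical, and the default of 4 attempts is chosen to echo Spark's `spark.task.maxFailures` setting.

```python
def run_with_retries(task, max_failures=4):
    """Run task(attempt) until it succeeds or has failed max_failures times."""
    errors = []
    for attempt in range(max_failures):
        try:
            return task(attempt), errors
        except Exception as exc:
            # Record the failure and try again, as a scheduler would resubmit the task.
            errors.append(str(exc))
    raise RuntimeError(f"task failed {max_failures} times: {errors}")

def flaky_task(attempt):
    # Hypothetical task that fails on its first two attempts.
    if attempt < 2:
        raise IOError(f"lost executor on attempt {attempt}")
    return "partition-0 done"

result, errors = run_with_retries(flaky_task)
print(result)       # partition-0 done
print(len(errors))  # 2
```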


### What does it do?
[Source](https://mallikarjuna_g.gitbooks.io/spark/content/spark-taskscheduler.html)

A TaskScheduler schedules tasks for a single Spark application according to the scheduling mode.

![](https://mallikarjuna_g.gitbooks.io/spark/content/images/sparkstandalone-sparkcontext-taskscheduler-schedulerbackend.png)

## [SchedulerBackend](https://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulerBackend.html)

::: {.r-stack}
![](images/SparkSchedulerBackend1.png){.fragment .lightbox width="75%"}

![](images/SparkSchedulerBackend.png){.fragment .lightbox width="75%"}
:::

## BlockManager 
Spark's storage system is managed by the BlockManager, which runs in both the driver and the executor instances.

It is a key-value store of blocks of data (block storage), each identified by a block ID.

Among the types of data stored in blocks we can find:

- RDD 
- shuffle: in this category we can distinguish shuffle data, shuffle index and temporary shuffle files (intermediate results)
- broadcast: broadcast data is organized in blocks too
- task results
- stream data
- temp data (including swap)
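
The "key-value store of blocks" idea can be illustrated with a toy in-memory store. This is a sketch only: `ToyBlockStore` is hypothetical, it ignores memory limits, disk spill and replication, and the block IDs are illustrative strings.

```python
class ToyBlockStore:
    """Minimal key-value store: block ID (str) -> block data (bytes)."""
    def __init__(self):
        self._blocks = {}

    def put(self, block_id, data):
        self._blocks[block_id] = data

    def get(self, block_id):
        # Returns None when the block is not stored locally.
        return self._blocks.get(block_id)

    def remove(self, block_id):
        return self._blocks.pop(block_id, None) is not None

    def ids_with_prefix(self, prefix):
        return sorted(b for b in self._blocks if b.startswith(prefix))


store = ToyBlockStore()
store.put("rdd_0_0", b"partition 0 of RDD 0")
store.put("rdd_0_1", b"partition 1 of RDD 0")
store.put("broadcast_0", b"serialized broadcast value")

print(store.ids_with_prefix("rdd_"))  # ['rdd_0_0', 'rdd_0_1']
print(store.get("broadcast_0"))
print(store.remove("rdd_0_1"), store.get("rdd_0_1"))  # True None
```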

## Repetita Iuvant
<https://databricks.com/glossary/what-are-spark-applications>

:::: {.columns}

::: {.fragment .column width="40%"}
![](https://databricks.com/wp-content/uploads/2018/05/Spark-Applications.png)
::: 

::: {.fragment .column width="60%"}
The driver process:

- runs your main() function
- sits on a node in the cluster
- is responsible for three things: 
   1. maintaining information about the Spark Application;
   2. responding to a user’s program or input; 
   3. analyzing, distributing, and scheduling work across the executors.
- The driver process is absolutely essential: it is the heart of a Spark Application and maintains all relevant information during the lifetime of the application.
:::
::::


## Spark Context 
- The Spark Context is the application's instance, created by the Spark driver for each Spark program when the user first submits it. It exists for the entire life of the Spark application.
- It gives the Spark driver access to the cluster through a cluster resource manager, and it is used to create RDDs, accumulators and broadcast variables on the cluster. It also keeps track of live executors through the heartbeat messages they send regularly.
- It is usually referred to by the variable name sc.
- The Spark Context terminates once the Spark application completes. Only one Spark Context can be active per JVM: you must stop() the active Spark Context before creating a new one.

## Deploy, Package and Launch

:::: {.columns}

::: {.fragment .column width="33%"}
**Deploy**

- The spark-submit script in Spark’s bin directory is used to launch applications on a cluster.
- It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application specially for each one.
::: 

::: {.fragment .column width="33%"}
**Package**

- Create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins; mark Spark and Hadoop as provided dependencies, since the cluster manager supplies them at runtime.
- Call the bin/spark-submit script, passing your jar.

:::
::: {.fragment .column width="33%"}
**Launch**

- Once a user application is bundled, it can be launched using the bin/spark-submit script. 

- This script takes care of setting up the classpath with Spark and its dependencies, and can support different cluster managers and deploy modes that Spark supports:

::: {.fragment}
```bash
./bin/spark-submit --class <mainclass> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  <application-jar> [args]
```
:::
:::
::::



### Example of app submit
```bash
# Run application locally on 8 cores
# Run from the Spark home directory (cd $SPARK_HOME)
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master local[8] examples/jars/spark-examples_2.12-3.5.3.jar 1000
```
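
When submissions are scripted, the same command line can be assembled from Python. The helper below is a sketch: `build_spark_submit` is a hypothetical function, the jar name assumes the Spark 3.5.3 download from the install step, and `spark.ui.port` is used only as an example `--conf` entry. It builds the argument list but does not run it.

```python
import shlex

def build_spark_submit(app, app_args=(), main_class=None, master="local[*]",
                       deploy_mode=None, conf=None):
    """Assemble a spark-submit argument list following the option template."""
    cmd = ["./bin/spark-submit"]
    if main_class:
        cmd += ["--class", main_class]
    cmd += ["--master", master]
    if deploy_mode:
        cmd += ["--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)                       # application jar or .py file
    cmd += [str(a) for a in app_args]     # application arguments come last
    return cmd

cmd = build_spark_submit(
    "examples/jars/spark-examples_2.12-3.5.3.jar",
    app_args=[1000],
    main_class="org.apache.spark.examples.SparkPi",
    master="local[8]",
    conf={"spark.ui.port": "4040"},
)

# shlex.join quotes arguments such as local[8] for safe pasting into a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, cwd=...)` from the Spark home directory would avoid shell quoting altogether.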

More at <https://spark.apache.org/docs/latest/submitting-applications.html>

##  Spark Tap Docker
- Code in tap/spark/code is copied into Docker
- The dataset is inside spark and linked into the tap root (check the previous example); reference it as spark/dataset...
- [Hint] Test on your machine first, then test on Docker (including dependencies)
- <https://github.com/apache/spark/tree/master/examples/src/main/python> is a good source of examples

:::{.fragment}
```bash
docker run --hostname spark -p 4040:4040 -it --rm \
-v /home/tap/tap-workspace/tap2024/spark/code/:/opt/tap/ \
-v /home/tap/tap-workspace/tap2024/spark/dataset:/tmp/dataset \
apache/spark /opt/spark/bin/spark-submit /opt/tap/simpleapp.py
```
:::