# Overview
In this notebook we explore the basics of Spark and get ready to put hands on keyboard.

# Terminology

- SparkContext - Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.



Spark provides a helpful [glossary](https://spark.apache.org/docs/3.2.0/cluster-overview.html#glossary).

Spark also provides a [list of configurations for the SparkContext](https://spark.apache.org/docs/latest/configuration.html).

# Object Model


- Application - While never explicitly defined, a Spark application is an executable (code file or snippet) which utilizes a Spark cluster. Spark applications are submitted to a Spark cluster using one of the two deploy modes described below. As mentioned in the [README](README.md), Spark provides APIs in Java, Scala, R, and Python which allow users to define their applications.

- Deploy Mode (Execution Mode) - Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.

- [Cache](https://spark.apache.org/docs/latest/quick-start.html#caching) - Apache Spark provides a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank.


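- SparkSession - The unified entry point introduced in Spark 2.x, which combines the functionality of SparkContext, SQLContext and HiveContext. See Entry Points below.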


# Process

- Submitting an application

See [Cluster vs. client execution modes for a Spark application](https://blog.knoldus.com/cluster-vs-client-execution-modes-for-a-spark-application/).
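As a rough sketch of what submitting an application looks like, the helper below assembles a `spark-submit` command line for either deploy mode. The helper name `build_spark_submit`, the file `my_app.py` and the memory setting are made up for illustration; `--master`, `--deploy-mode` and `--conf` are standard `spark-submit` options.

```python
import shlex


def build_spark_submit(app_path, master, deploy_mode="client", conf=None):
    """Return the argv list for a spark-submit call (hypothetical helper)."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode!r}")
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # Each Spark property is passed as its own --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_path)
    return cmd


# Client mode: the driver runs on the machine that submits the application.
print(shlex.join(build_spark_submit("my_app.py", "local[2]")))

# Cluster mode: the driver is launched inside the (here: YARN) cluster.
print(shlex.join(build_spark_submit(
    "my_app.py", "yarn", "cluster", {"spark.executor.memory": "2g"})))
```

The function only builds the command. Passing the returned list to `subprocess.run` would actually launch the job, provided `spark-submit` is on the `PATH`.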

# Software Architecture

<center><img src="images/software-architecture.png" width="400px"></center>

## Java and The JVM
Spark runs on the JVM. To understand the Spark architecture we need to understand what that means.

Spark is written mainly in Scala, with some Java. Both languages are compiled into Java bytecode, which is what actually runs inside the JVM.

The Java Virtual Machine (JVM) is an engine that executes Java bytecode (it is called a virtual machine, which is confusing because it is not a virtual machine in the hardware-virtualization sense). The JVM together with the standard class libraries makes up the Java Runtime Environment (JRE). In order to execute Java code, it must first be compiled into bytecode (`.class` files, usually packaged into a jar file), which is then executed inside the JVM. The Java Development Kit (JDK) bundles the JRE with development tools such as the compiler.

## Py4j
Py4J is a Python library which allows Python code to interact with objects living in a JVM. PySpark uses it to talk to Spark; it is the backbone of the PySpark API.

On the [homepage](https://www.py4j.org/index.html) Py4j is described as:

> A Bridge between Python and Java
>
> Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine. Methods are called as if the Java objects resided in the Python interpreter and Java collections can be accessed through standard Python collection methods. Py4J also enables Java programs to call back Python objects.

Reading a bit deeper into the [documentation](https://www.py4j.org/getting_started.html) we see that the interop between Python and Java is provided by a **GatewayServer** instance. This gateway server runs in the JVM and lets Python programs communicate with it through a local network socket and send it instructions. It also has a callback functionality so that objects in the JVM can update objects in Python in an event-based manner. This gateway is referred to as the Java Gateway at various points in the source code.

This is very important. Recall that the Driver runs in the JVM. If we look at the PySpark source code on [github](https://github.com/apache/spark/blob/e91ef1929201d4e493bb451fef0fb1b45800adae/python/pyspark/java_gateway.py#L214) we can see that a Driver is created in the JVM, and a Python wrapper provided by Py4J and the Java Gateway allows us to manipulate the driver.
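A minimal sketch of what this looks like from the Python side, modeled on the `java.util.Random` example from the Py4J homepage. It assumes the `py4j` package is installed and that a Java program has already started a `GatewayServer` on Py4J's default port; the function names `connect_to_jvm` and `jvm_random_int` are hypothetical.

```python
def connect_to_jvm():
    """Connect to a GatewayServer that is already listening on the default port."""
    # Imported lazily so this module still loads where py4j is not installed.
    from py4j.java_gateway import JavaGateway
    return JavaGateway()


def jvm_random_int(gateway, bound):
    """Create a java.util.Random inside the JVM and draw an int in [0, bound)."""
    rng = gateway.jvm.java.util.Random()  # Python proxy for an object living in the JVM
    return rng.nextInt(bound)             # the method call is forwarded over the socket
```

With a gateway running, `jvm_random_int(connect_to_jvm(), 10)` returns an integer that was produced by the JVM, not by the Python interpreter. PySpark relies on the same mechanism to drive Spark objects in the JVM.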



## Spark Shell
The Spark Shell is an analog of the traditional operating system shell (like BASH, CMD, or PowerShell). The Spark shell provides an interactive command line interface (CLI) through which a user can interact with the Spark API and thus a Spark Cluster (once properly configured). The Spark Shell's CLI allows users to type and execute ad-hoc lines of Scala, Python, or R code. Like any Shell, the Spark Shell follows the REPL (read-evaluate-print loop) pattern.

An important point of note is that the Spark Shell acts as the Driver while a user incrementally defines their Application using the CLI (note: I am using the term Application very precisely. See Application defined elsewhere in this document.). We will see later that the Spark Shell is leveraged in a few ways:

- `spark-submit`
- SparkContext

More on the Spark Shell can be found in the official [documentation](https://spark.apache.org/docs/latest/quick-start.html).

## Entry Points
As mentioned previously, users develop applications which leverage Apache Spark. As such, many articles and documentation pages talk about the entry point of a Spark application. This was a bit confusing at first; traditionally applications are said to have entry points or *main()* functions. These entry points are the gateway where the execution of an application begins and the user gets access to the runtime environment and/or API that the application was built to interact with. With Spark, the entry point gives the program access to the Spark environment.

In Spark 1.x, three entry points were introduced: SparkContext, SQLContext and HiveContext. Since Spark 2.x, a new entry point called SparkSession has been available that essentially combines all functionality of the three aforementioned contexts. Note that all contexts are still available even in the newest Spark releases, mostly for backward compatibility purposes.

More information on these entrypoints can be found [here](https://towardsdatascience.com/sparksession-vs-sparkcontext-vs-sqlcontext-vs-hivecontext-741d50c9486a).
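To make the relationship between the entry points concrete, here is a small sketch, assuming `pyspark` and Java are installed. The function names, the application name `entry-point-demo` and the `local[*]` master are illustrative choices; `SparkSession.builder`, `getOrCreate()` and the `sparkContext` attribute are part of the PySpark API.

```python
def create_session(app_name="entry-point-demo", master="local[*]"):
    """Build (or reuse) a SparkSession, the entry point since Spark 2.x."""
    # Imported lazily so this module still loads where pyspark is not installed.
    from pyspark.sql import SparkSession
    return SparkSession.builder.master(master).appName(app_name).getOrCreate()


def describe_entry_points(spark):
    """Show that the older SparkContext is still reachable from the session."""
    sc = spark.sparkContext  # the Spark 1.x entry point, wrapped by the session
    return {
        "spark_version": spark.version,
        "app_name": sc.appName,
        "master": sc.master,
    }
```

Calling `describe_entry_points(create_session())` in a notebook starts a local Spark application and reports the settings the session's underlying SparkContext ended up with.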

### SparkContext

The SparkContext is an integral part of Spark. Unfortunately, there is no explicit definition anywhere in the official documentation, and there are a lot of confusing, contradictory, muddled definitions floating around in third-party articles.

Note: The SparkContext is language specific, and we will be looking at the Python SparkContext. That being said, I think it's safe to say most of what we say about the Python API will apply to the other language bindings.

I decided to have a look at the [source code](https://github.com/apache/spark/blob/e91ef1929201d4e493bb451fef0fb1b45800adae/python/pyspark/context.py#L66) to really understand what this object is. In the class definition I see some comments but they are not particularly helpful:

> Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster.

This doesn't really help tie it in with the rest of the components, so I decided to look at the SparkContext object's constructor. According to the code, it does the following:

1. First the SparkContext ensures it is not running in a task on the worker node (it will raise an exception if it is). If it is not in a worker, the code assumes it is running in a driver. This phrasing was a bit confusing until I got deeper and understood how the py4j/JavaGateway worked (we will come back to this). 

2. Next the SparkContext initializes. This starts by using a SparkConf object (and some other low-level configurations for our driver and Spark cluster) to construct a shell command. Based on the SparkConf, the shell command runs the spark-submit utility. If no target is specified (like when running from a Jupyter notebook) the spark-submit utility is instructed to run the Spark Shell (in the case of Python this is the pyspark shell). The Spark Shell acts as the Driver. Once the Spark Shell program is running, the SparkContext determines the port it is listening on and leverages the magic of py4j. It creates the JavaGateway and configures it so that the SparkContext can communicate with the driver (and other objects) running in the JVM. Specifically, the SparkContext running in the Python interpreter gets a reference to the JavaSparkContext object running in the JVM (by way of py4j). This Java object is what actually connects an API to the Spark cluster.

Note: I have seen a lot of confusing statements made about the driver and the SparkContext. For example, I have seen statements that the Driver creates the SparkContext. This is a half truth. In the case of Python, Scala, or R, the SparkContext exists in both the JVM and the respective language. The driver does create the JavaSparkContext but not the accompanying SparkContext for the language binding.

I have found that running multiple SparkContexts in a single JVM is not recommended.
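The two halves of the SparkContext can be inspected from Python. The sketch below is an assumption-laden illustration: `_jvm` and `_jsc` are private attributes of the PySpark SparkContext (internal details that may change between releases), and the function name `driver_jvm_info` is made up.

```python
def driver_jvm_info(sc):
    """Collect a few facts from the JVM that backs a PySpark SparkContext."""
    # sc._jvm is the py4j view of the driver JVM; this call runs in Java.
    java_version = sc._jvm.java.lang.System.getProperty("java.version")
    # sc._jsc is the py4j proxy for the JavaSparkContext; .sc() returns the
    # underlying Scala SparkContext, whose appName is read inside the JVM.
    jvm_app_name = sc._jsc.sc().appName()
    return {
        "java_version": java_version,
        "jvm_app_name": jvm_app_name,
        "python_app_name": sc.appName,  # plain attribute on the Python object
    }
```

Given a live context `sc`, the two application names returned by `driver_jvm_info(sc)` should agree, since the Python object is a wrapper around the Java one.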

## The Driver

The Driver is a Java process that runs in its own JVM. The Driver utilizes several components, including the DAGScheduler, TaskScheduler, SchedulerBackend and BlockManager, to interpret and translate the user-defined code in the Application into actual Spark Jobs which can be executed on the cluster. For example, Python function calls become Transformations and Actions; each Action triggers a Spark Job, which the Driver breaks down into stages and tasks. Once translated, the Driver comes up with an execution plan and schedules the work with the Cluster Manager. Additionally, the Application may create data or cache data within the cluster. The Driver is also responsible for keeping track of these resources.

The Driver also hosts the Spark Web UI which allows admins to monitor the utilization of the cluster by the Application.

The Driver creates the SparkContext (the JavaSparkContext, in the case of PySpark), connecting the user program to a given Spark master.
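To illustrate how user code turns into work for the driver, here is a short sketch using the RDD API. It takes an existing SparkContext as the `sc` argument; the function name and the computation are arbitrary examples, while `parallelize`, `filter`, `map` and `sum` are RDD API methods.

```python
def sum_of_even_squares(sc, numbers):
    """Square the even numbers and add them up on the cluster."""
    rdd = sc.parallelize(numbers)             # describe a distributed dataset
    evens = rdd.filter(lambda n: n % 2 == 0)  # transformation: recorded, not executed
    squares = evens.map(lambda n: n * n)      # transformation: recorded, not executed
    return squares.sum()                      # action: the driver now schedules a job
```

Only the final `sum()` makes the driver build an execution plan and hand tasks to the cluster. The two transformations before it merely extend the description of the computation.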

### Deploy Mode
Spark offers two methods to configure how and where the Driver runs. This is referred to as Deploy Mode or Execution Mode. 

Note: Most of our examples will deal with the "Client Mode" method of using Spark. This is the method by which we can leverage Spark through a Jupyter notebook. Even if we tell the notebook to run in cluster mode, we will see that this setting is ignored.
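One way to check which deploy mode a running context actually ended up with is to read it back from the configuration. This is a small sketch: the function name is hypothetical, `spark.submit.deployMode` is the Spark property that records the mode, and falling back to `"client"` when the property is unset is an assumption based on client being Spark's default.

```python
def deploy_mode_of(sc):
    """Return the deploy mode recorded in a SparkContext's configuration."""
    # getConf() returns a copy of the SparkConf the context was started with.
    return sc.getConf().get("spark.submit.deployMode", "client")
```

In a notebook, calling `deploy_mode_of(sc)` on the active SparkContext is a quick way to confirm where the driver is running.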


#### Client Mode
In **client mode**, the driver is launched in the same process as the client that submits the application. An example of this is when using Spark from a Jupyter notebook. Here, the notebook spins up a SparkContext which launches the PySpark program as the driver in a subprocess. This is why we previously installed Java on our machines; the driver requires Java.


#### Cluster Mode

In **cluster mode**, however, the driver runs somewhere on the cluster (on a worker node). Exactly where depends on the type of Cluster Manager. This is useful as it allows the client to "fire and forget": the client can submit the application, walk away and come back to a set of completed results. This method of execution is also useful when one needs to minimize the network latency between the driver and the workers.

# System Architecture
Recall that Spark is a distributed system and thus Spark applications are distributed applications. In order to understand how to use Spark we need to understand what it is and how it works. First we will focus on the system architecture before discussing the software that runs on top of the system.


<center><img src="images/cluster-overview.png" width="400px"></center>

A Spark cluster is designed according to the traditional master-slave pattern. Typically, a master node is responsible for organizing, provisioning resources on, and distributing work to the worker nodes, which perform the actual computations and return results. Spark calls the master the cluster manager and the slaves workers.

## The Cluster Manager (The master node)
The Cluster Manager communicates with the SparkContext to understand what work needs to be accomplished. It takes the instructions and coordinates the execution with the worker nodes. In some cases, the Cluster Manager also provisions the instances of the worker node based on configurations passed to it from the SparkConf object.

Spark supports the following cluster managers:

- [Standalone](https://spark.apache.org/docs/latest/spark-standalone.html) – a simple cluster manager included with Spark that makes it easy to set up a cluster.
- [Apache Mesos](https://spark.apache.org/docs/latest/running-on-mesos.html) – a general cluster manager that can also run Hadoop MapReduce and service applications. (Deprecated)
- [Hadoop YARN](https://spark.apache.org/docs/latest/running-on-yarn.html) – the resource manager in Hadoop 2.
- [Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html) – an open-source system for automating deployment, scaling, and management of containerized applications.

See the [Cluster Mode Overview](https://spark.apache.org/docs/latest/cluster-overview.html) for more detail.
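Which cluster manager an application talks to is chosen through the master URL it is given. The sketch below maps the documented master URL formats to the managers listed above; the function name and the host names are hypothetical, and the less common URL variants are deliberately left out.

```python
def cluster_manager_for(master_url):
    """Name the cluster manager implied by a Spark master URL (illustrative)."""
    if master_url == "yarn":
        return "Hadoop YARN"
    if master_url == "local" or master_url.startswith("local["):
        return "local mode (no cluster manager)"
    prefixes = {
        "spark://": "Standalone",
        "mesos://": "Apache Mesos",
        "k8s://": "Kubernetes",
    }
    for prefix, name in prefixes.items():
        if master_url.startswith(prefix):
            return name
    raise ValueError(f"unrecognized master URL: {master_url!r}")


# Host names and ports below are placeholders.
for url in ("local[*]", "spark://master-host:7077", "yarn",
            "k8s://https://k8s-host:6443"):
    print(f"{url:30} -> {cluster_manager_for(url)}")
```

The same URL is what gets passed to `--master` on the `spark-submit` command line or to the `spark.master` configuration property.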

## The Worker Node
A worker node performs the set of operations assigned to it by the Cluster Manager and returns the result.