# Overview
In this notebook we explore the basics of Spark and get ready to put hands on keyboard.

# Terminology

- SparkContext - Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.



Spark provides a helpful [glossary](https://spark.apache.org/docs/3.2.0/cluster-overview.html#glossary).

Spark also provides a [list of configurations for the SparkContext](https://spark.apache.org/docs/latest/configuration.html).
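As a minimal sketch of how such configuration reaches a SparkContext, the snippet below collects a few properties in a plain dictionary and hands them to `SparkConf`. The property keys come from the configuration list linked above; the values, the `SETTINGS` name, and the `make_context` helper are assumptions made for illustration, and running it requires PySpark plus a local Java install.

```python
# Illustrative values only; the keys are documented Spark properties.
SETTINGS = {
    "spark.app.name": "spark-basics",
    "spark.master": "local[2]",        # run locally with two worker threads
    "spark.executor.memory": "1g",
}


def make_context(settings):
    """Build a SparkContext from a dict of Spark properties (hypothetical helper)."""
    # Imported here so the settings can be inspected without PySpark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAll(list(settings.items()))
    return SparkContext.getOrCreate(conf)


# Example use, assuming PySpark is installed:
#   sc = make_context(SETTINGS)
#   rdd = sc.parallelize(range(10))        # an RDD
#   total = sc.accumulator(0)              # an accumulator
#   lookup = sc.broadcast({"a": 1})        # a broadcast variable
#   sc.stop()
```

The commented lines at the end show the three kinds of objects named in the SparkContext definition above.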

# Object Model


- Application - While never explicitly defined, a Spark application is an executable (code file or snippet) which utilizes a Spark cluster. Spark applications are submitted to a Spark cluster using one of the two execution modes described below. As mentioned in the [README](README.md), Spark provides APIs in Java, Scala, R, and Python which allow users to define their applications.

- Deploy Mode (Execution Mode) - Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.

- [Cache](https://spark.apache.org/docs/latest/quick-start.html#caching) - Apache Spark provides a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank.


- SparkSession
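To make the Cache entry in the list above concrete, here is a toy model in plain Python, not Spark itself. The `ToyDataset` class and its `computations` counter are invented for illustration; they mimic how an uncached dataset is recomputed for every action, while a cached one is computed once and then reused.

```python
class ToyDataset:
    """Toy stand-in for a dataset that is recomputed unless it has been cached."""

    def __init__(self, compute):
        self._compute = compute      # function that (re)produces the rows
        self._use_cache = False
        self._cached = None
        self.computations = 0        # how many times the rows were recomputed

    def cache(self):
        """Mark the dataset for caching; nothing is computed yet."""
        self._use_cache = True
        return self

    def _rows(self):
        if self._cached is not None:
            return self._cached      # served from the cache
        self.computations += 1
        rows = list(self._compute())
        if self._use_cache:
            self._cached = rows      # keep the rows for later actions
        return rows

    def count(self):
        return len(self._rows())

    def total(self):
        return sum(self._rows())


def expensive_source():
    return (n * n for n in range(1_000))


plain = ToyDataset(expensive_source)
plain.count(), plain.total()         # two actions, two computations

hot = ToyDataset(expensive_source).cache()
hot.count(), hot.total()             # two actions, one computation
```

Comparing `plain.computations` with `hot.computations` shows the saving, which grows with every additional pass over the same data.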


# Process

- Submitting an application

https://blog.knoldus.com/cluster-vs-client-execution-modes-for-a-spark-application/

# Software Architecture

<center><img src="images/software-architecture.png" width="400px"></center>

## Java and The JVM
Spark runs on the JVM. To understand the Spark architecture, we first need to understand what that means.

Spark is written in Scala and Java. Both languages are compiled into Java bytecode, which runs inside the JVM.

The Java Virtual Machine (JVM) is an engine that executes Java bytecode (confusingly, it is not a virtual machine in the sense of a full emulated computer). The Java Runtime Environment (JRE) bundles the JVM with the standard class libraries needed to run programs. To execute Java code, it must first be compiled into bytecode (`.class` files, usually packaged into a JAR file), which the JVM then loads and runs. The Java Development Kit (JDK) is a superset of the JRE that adds development tools such as the `javac` compiler.

## Py4j
Py4J is a Python library that lets Python programs call into a JVM, and this is how Python interacts with Spark. It is the backbone of the PySpark API.

On the [homepage](https://www.py4j.org/index.html), Py4J is described as:

> A Bridge between Python and Java
>
> Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine. Methods are called as if the Java objects resided in the Python interpreter and Java collections can be accessed through standard Python collection methods. Py4J also enables Java programs to call back Python objects.

Reading a bit deeper into the [documentation](https://www.py4j.org/getting_started.html), we see that the interop between Python and Java is provided by a **GatewayServer** instance. This gateway server allows Python programs to communicate with the JVM through a local network socket and send it instructions. This gateway is referred to as the Java Gateway.

This is very important. Recall that the Driver runs in the JVM. If we look at the PySpark source code on [GitHub](https://github.com/apache/spark/blob/e91ef1929201d4e493bb451fef0fb1b45800adae/python/pyspark/java_gateway.py#L214), we can see that a Driver is created in the JVM, and a Python wrapper provided by Py4J and the Java Gateway allows us to manipulate the Driver.
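The idea of driving remote objects through a local socket can be sketched with the standard library alone. The code below is a toy stand-in, not Py4J's implementation or wire protocol: `ToyGatewayServer`, `ToyGateway`, and `Counter` are hypothetical names, and a Python thread plays the part of the JVM. It only illustrates how a method call on one side can become a message on a local socket that is executed on the other side.

```python
import socket
import threading


class ToyGatewayServer:
    """Stands in for the JVM side: owns an object and runs calls received over a socket."""

    def __init__(self, target):
        self._target = target
        self._listener = socket.socket()
        self._listener.bind(("127.0.0.1", 0))    # port 0: let the OS pick a free port
        self._listener.listen(1)
        self.port = self._listener.getsockname()[1]
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        conn, _ = self._listener.accept()
        with self._listener, conn, conn.makefile("r") as reader, conn.makefile("w") as writer:
            for line in reader:                   # one request per line: "method arg arg ..."
                method, *args = line.split()
                result = getattr(self._target, method)(*args)
                writer.write(f"{result}\n")
                writer.flush()


class ToyGateway:
    """Stands in for the Python side: turns method calls into messages on the socket."""

    def __init__(self, port):
        self._conn = socket.create_connection(("127.0.0.1", port), timeout=5)
        self._reader = self._conn.makefile("r")
        self._writer = self._conn.makefile("w")

    def __getattr__(self, method):
        if method.startswith("_"):
            raise AttributeError(method)

        def call(*args):
            self._writer.write(" ".join([method, *map(str, args)]) + "\n")
            self._writer.flush()
            return self._reader.readline().strip()   # results come back as text

        return call

    def close(self):
        self._writer.close()
        self._reader.close()
        self._conn.close()


class Counter:
    """The 'remote' object; in PySpark this role is played by objects inside the JVM."""

    def __init__(self):
        self.total = 0

    def add(self, amount):
        self.total += int(amount)                 # arguments arrive as text
        return self.total


server = ToyGatewayServer(Counter())
gateway = ToyGateway(server.port)
first = gateway.add(2)      # looks like a local call, runs on the server side
second = gateway.add(3)
gateway.close()
```

The state (`total`) lives only on the server side; the client holds nothing but a socket, which is the essence of the wrapper relationship described above.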



## Spark Shell
The Spark Shell is an analog of the traditional operating system shell (like BASH, CMD, or PowerShell). The Spark shell provides an interactive command line interface (CLI) through which a user can interact with the Spark API and thus a Spark Cluster (once properly configured). The Spark Shell's CLI allows users to type and execute ad-hoc lines of Scala, Python, or R code. Like any Shell, the Spark Shell follows the REPL (read-evaluate-print loop) pattern.

An important point to note is that the Spark Shell acts as the Driver while a user incrementally defines their Application using the CLI (note: I am using the term Application very precisely; see Application as defined elsewhere in this document). We will see later how the Spark Shell is leveraged by the SparkContext.
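The read-evaluate-print loop itself is simple enough to sketch in a few lines of plain Python. The `toy_repl` function below is a hypothetical illustration, not how the Spark Shell is implemented: it reads lines from a list instead of a terminal, but it shows how state defined on one line stays available to later lines, which is what lets a user build something up incrementally.

```python
def toy_repl(lines):
    """Run each line through read -> evaluate -> print and return what was printed."""
    namespace = {}                             # state shared by all lines of the session
    printed = []
    for line in lines:                         # read
        try:
            value = eval(line, namespace)      # evaluate an expression
        except SyntaxError:
            exec(line, namespace)              # statements such as assignments have no value
            value = None
        if value is not None:
            printed.append(repr(value))        # print
    return printed


session = toy_repl(["x = 21", "x * 2", "words = ['spark', 'shell']", "len(words)"])
```

Only the expression lines produce output; the assignments silently update the session state.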

More on the Spark Shell can be found in the official [documentation](https://spark.apache.org/docs/latest/quick-start.html).

## SparkContext



Looking at the code, the SparkContext object's constructor does the following:
- Accepts a SparkConf object (and some other low-level configurations)
- Uses the spark-submit utility to launch the pyspark-shell program within the JVM
- Determines what port the pyspark-shell is listening on
- Creates a Java Gateway using Py4J configured to attach to the JVM running pyspark-shell

It then creates a Driver.
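The launch, discover-port, connect sequence in the steps above can be imitated with the standard library. This is a toy walk-through under stated assumptions, not PySpark's code: a second Python process stands in for the JVM running pyspark-shell, the `CHILD` script and `launch_and_connect` function are hypothetical, and reporting the port on standard output is a simplification chosen for this sketch.

```python
import socket
import subprocess
import sys

# Code for the child process, which stands in for the JVM running pyspark-shell.
CHILD = """
import socket
server = socket.socket()
server.bind(("127.0.0.1", 0))                  # port 0: the OS picks a free port
server.listen(1)
print(server.getsockname()[1], flush=True)     # report the chosen port to the parent
conn, _ = server.accept()
with conn:
    name = conn.recv(1024).decode()
    conn.sendall(("hello " + name).encode())
"""


def launch_and_connect(name):
    """Launch the child, learn its port, connect to it, and exchange one message."""
    with subprocess.Popen([sys.executable, "-c", CHILD],
                          stdout=subprocess.PIPE, text=True) as child:
        try:
            port = int(child.stdout.readline())            # step: determine the port
            with socket.create_connection(("127.0.0.1", port), timeout=5) as conn:
                conn.sendall(name.encode())                # step: talk over the socket
                reply = conn.recv(1024).decode()
        finally:
            child.kill()          # make sure the child never outlives the call
    return port, reply


port, reply = launch_and_connect("driver")
```

The parent never knows the port in advance; it learns it from the child after launch, which is why port discovery appears as its own step in the list.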


Note: I have seen a lot of misinformation

The SparkContext can only run on the driver. In fact, the code has checks to make sure we are not creating a spark context from the worker.

## The Driver

The Driver is a Java process that runs in its own JVM. The Driver uses several components, including the DAGScheduler, TaskScheduler, SchedulerBackend, and BlockManager, to translate the user-defined code in the Application into Spark Jobs that can be executed on the cluster. For example, Python function calls become Transformations and Actions; each Action triggers a Job, which the Driver breaks down into Stages and Tasks. Once translated, the Driver comes up with an execution plan and schedules the work with the Cluster Manager. Additionally, the Application may create or cache data within the cluster. The Driver is also responsible for keeping track of these resources.

The Driver also hosts the Spark Web UI, which allows admins to monitor the utilization of the cluster by the Application.

The Driver creates the SparkContext, connecting the user program to a given Spark Master.
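The split between Transformations and Actions is easier to see in a toy model. The `ToyRDD` class below is an invented illustration in plain Python, not Spark's scheduler: transformations only record a step in a lineage, and nothing runs until an action (`collect`) asks for a result, which loosely mirrors how the Driver turns recorded operations into work only when a result is needed.

```python
class ToyRDD:
    """Toy dataset whose transformations are recorded first and executed later."""

    def __init__(self, source, lineage=()):
        self._source = source
        self.lineage = tuple(lineage)    # recorded transformations, not yet executed

    def map(self, fn):                   # transformation: returns a new plan
        return ToyRDD(self._source, self.lineage + (("map", fn),))

    def filter(self, fn):                # transformation: returns a new plan
        return ToyRDD(self._source, self.lineage + (("filter", fn),))

    def collect(self):                   # action: replays the lineage over the data
        rows = iter(self._source)
        for kind, fn in self.lineage:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []


def square(n):
    calls.append(n)                      # record that user code actually ran
    return n * n


plan = ToyRDD(range(6)).map(square).filter(lambda n: n % 2 == 0)
steps = [kind for kind, _ in plan.lineage]
ran_before_action = len(calls)           # still zero: only a plan exists so far
result = plan.collect()                  # the action triggers execution
```

`ran_before_action` stays at zero because defining the plan runs no user code; only `collect` does.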

### Deploy Mode
Spark offers two Deploy Modes (Execution Modes) which configure how and where the Driver runs. 

In **client mode**, the driver is launched in the same process as the client that submits the application. If using a Jupyter notebook, the notebook spins up a subprocess (which is why we previously installed Java on our machines).

In **cluster mode**, the driver runs somewhere on the cluster. Exactly where depends on the type of Cluster Manager. The spark-submit utility is typically used to submit the application.
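As a small sketch of how the deploy mode is chosen at submission time, the helper below assembles a `spark-submit` command line. The `--master`, `--deploy-mode`, and `--conf` flags are real `spark-submit` options; the `spark_submit_command` function, the `app.py` file name, and the example values are assumptions made for illustration. The command is only built here, not executed.

```python
import shlex

DEPLOY_MODES = ("client", "cluster")


def spark_submit_command(app, master, deploy_mode="client", conf=None):
    """Return the spark-submit argument list for the given application (hypothetical helper)."""
    if deploy_mode not in DEPLOY_MODES:
        raise ValueError(f"deploy_mode must be one of {DEPLOY_MODES}")
    command = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        command += ["--conf", f"{key}={value}"]
    command.append(app)                  # the application file comes last
    return command


# Driver runs alongside the submitter.
client_cmd = spark_submit_command("app.py", "local[2]")

# Driver runs inside the cluster.
cluster_cmd = spark_submit_command(
    "app.py", "yarn", "cluster", {"spark.executor.memory": "2g"}
)
printable = shlex.join(cluster_cmd)      # shell-safe string form of the command
```

The resulting list can be passed to `subprocess.run` on a machine where Spark is installed.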

# System Architecture
Recall that Spark is a distributed system and thus Spark applications are distributed applications. In order to understand how to use Spark we need to understand what it is and how it works. First we will focus on the system architecture before discussing the software that runs on top of the system.

A Spark cluster is designed according to the master-slave pattern. A master node is responsible for organizing, provisioning resources on, and distributing work to the worker nodes, which perform the actual computations and return results. Spark has renamed the master and slaves to cluster manager and workers, respectively.

<center><img src="images/cluster-overview.png" width="400px"></center>
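The pattern of one coordinator handing partitions of work to several workers and combining their results can be sketched on a single machine. In the toy below, threads from Python's standard library stand in for worker nodes; the `partition` and `run_job` functions are hypothetical and say nothing about how Spark actually schedules work.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, parts):
    """Deal the rows into `parts` roughly equal partitions."""
    return [rows[i::parts] for i in range(parts)]


def run_job(rows, task, combine, workers=3):
    """Coordinator: split the data, let each worker process one partition, merge the results."""
    partitions = partition(list(rows), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(task, partitions))   # one task per partition
    return combine(partial_results)


# Each worker sums its own partition; the coordinator sums the partial sums.
total = run_job(range(1, 101), task=sum, combine=sum)
```

The workers never see the whole dataset, only their own partition, and the coordinator never touches individual rows.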

The SparkContext object is the gateway


Spark supports the following cluster managers:

- [Standalone](https://spark.apache.org/docs/latest/spark-standalone.html) – a simple cluster manager included with Spark that makes it easy to set up a cluster.
- [Apache Mesos](https://spark.apache.org/docs/latest/running-on-mesos.html) – a general cluster manager that can also run Hadoop MapReduce and service applications. (Deprecated)
- [Hadoop YARN](https://spark.apache.org/docs/latest/running-on-yarn.html) – the resource manager in Hadoop 2.
- [Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html) – an open-source system for automating deployment, scaling, and management of containerized applications.
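Which cluster manager an application talks to is selected through the master URL. The helper below is a small illustrative sketch that maps the documented master URL prefixes to the managers listed above; the `cluster_manager_for` function and the example host names are assumptions, and the function only inspects the string without contacting anything.

```python
def cluster_manager_for(master):
    """Name the cluster manager implied by a Spark master URL (hypothetical helper)."""
    if master.startswith("local"):
        return "none (local mode)"       # e.g. local, local[4], local[*]
    if master.startswith("spark://"):
        return "Standalone"
    if master.startswith("mesos://"):
        return "Apache Mesos"
    if master == "yarn":
        return "Hadoop YARN"             # cluster location comes from the Hadoop configuration
    if master.startswith("k8s://"):
        return "Kubernetes"
    raise ValueError(f"unrecognized master URL: {master!r}")


examples = {
    url: cluster_manager_for(url)
    for url in ("local[*]", "spark://host:7077", "yarn", "k8s://https://host:6443")
}
```

Local mode is included because it is the usual starting point in a notebook, even though no cluster manager is involved.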

https://spark.apache.org/docs/latest/cluster-overview.html