In [None]:
%%HTML
<link rel="stylesheet" type="text/css" href="https://fonts.googleapis.com/css?family=Quicksand:300,700" />
<link rel="stylesheet" type="text/css" href="https://fonts.googleapis.com/css?family=Fira Code" />
<link rel="stylesheet" type="text/css" href="rise.css">

# Spark basics

![footer_logo_new](images/logo_new.png)

### In this chapter we will cover the following topics:
- Spark components
- SparkSession
- DataFrames
- Transformations
- Laziness
- Actions
- Lineage

# Spark components

Spark's execution environment is divided into the following components:

- Driver
    - Runs the application's main program and schedules tasks on the executors; it can be launched from outside or inside a cluster.
- Executors
    - Separate execution engines or containers on the worker nodes of a cluster.
    - Tasks (units of work) run within the executors.
- Cluster manager
    - Allocates computing resources (CPU/Memory) in the distributed system the Spark application is run on.
    - Examples are YARN, Mesos, Spark's own standalone manager, and Kubernetes.
- History Server
    - Loads event logs from completed applications and shows runtime details such as query plans, tasks, and executors.
    - Useful for diagnosing and improving the performance of a job.

![Spark execution model](images/spark-cluster.png "Spark execution model")

# Why the manager though?
We just learned that Spark has a cluster manager, but don't we already have a driver?
Let's go over some more history of how this came about.

# Kubernetes
In recent years, it's been hard to miss the rise of Kubernetes.
Spark can run on Kubernetes too:

![](images/spark-on-kubernetes.png)

## Running Spark on Kubernetes
There are a few things to point out here:
+ Data locality (i.e. moving the code to the data) tends to get lost on Kubernetes, as your storage generally is external to the Kubernetes cluster.
+ The driver pod will stay available in the _completed_ state until Kubernetes decides to garbage collect it. A completed pod uses no resources, although it does clutter your cluster.
+ Spark 3.0 promises to be better optimized for Kubernetes.

# SparkSession

The SparkSession is the main entry point for (new) Spark applications and a handle to the execution environment.

```python
from pyspark.sql import SparkSession

# Reuse the active session if there is one; otherwise create a new one.
spark = (
    SparkSession.builder
    .getOrCreate()
)
```
The SparkSession provides built-in support for Hive features, including writing queries in HiveQL, access to Hive UDFs, and reading Hive tables.

In [1]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-2")
    .master("spark://spark-master:7077")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "/opt/workspace/history")
    .enableHiveSupport()
    .getOrCreate()
)

In [2]:
spark

In [3]:
spark.version

'3.0.0'

In [4]:
spark.catalog.listDatabases()

[Database(name='default', description='Default Hive database', locationUri='file:/opt/workspace/spark-warehouse')]

# Intermezzo: Hive
- Data warehouse infrastructure on top of Hadoop.
- Imposes table structure and querying capabilities:
    - Metastore: store metadata on tables (schema, location in HDFS etc.)
    - HiveQL: query language, basically an SQL dialect
- Originally it generated MapReduce jobs to run each query.
- Later versions can also run queries on other execution engines, such as Tez or Spark.

Hive isn't necessarily a crazy choice for production jobs, even though people treat it as old-fashioned.
It has a reputation for reliability. It also has quite a few sophisticated knobs for tuning the way data is laid out.

# DataFrames

 - At its heart, Spark is a tool for manipulating large amounts of data.
 - The data structure most often used to do this is a DataFrame.
 - We can create a DataFrame in 3 ways:
 
   1. Using existing data (for example: read from files).
   2. Generating data in memory. (We'll see how to do this later.)
   3. By transforming another DataFrame.   
   
 - DataFrames are also immutable: the data they represent does not change.

# Transforming a DataFrame

There are many ways of transforming DataFrames to produce new ones. Some key transformations:

 - `select()`
 - `drop()`
 - `where()`
 - `groupby()`
 - `join()`
 - `distinct()`
 - `fillna()`
 
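The API documentation describes all of these, along with many other methods. We'll cover some of them later. As a rule of thumb, transformations are the methods that *return a DataFrame*. (`groupby()` is a slight exception: it returns a grouped object that becomes a DataFrame again once you aggregate it.)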

# DataFrames are Lazy

 - When we create a DataFrame, its content is *not* evaluated.
 - Instead a *lineage* is constructed: each DataFrame knows what parent DataFrame it depends on, and what it needs to do, but won't actually do anything until the content of the DataFrame is actually required.
 
When is the content of the DataFrame actually required?

In the next slide, the empty boxes mean *Unevaluated DataFrame here*

![lineage](images/spark_df.svg)

# DataFrame Actions

- Actions trigger a DataFrame to evaluate its content, which normally means recursively evaluating the parents in its lineage.
- These are normally methods on the DataFrame that _don't_ return a DataFrame.
- Some examples include:
    - `toPandas()`
    - `count()`
    - `collect()`
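    - `write` (an attribute rather than a method; saving through it, for example with `df.write.parquet(path)`, triggers the evaluation)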
- Note: some actions may be problematic to run due to memory issues. Can you identify which ones?

Below we can see that nothing gets evaluated until we call `.count()`. The same applies to all actions:

![lineage](images/spark_action.svg)

# What if things go south?

Spark keeps the lineage of these various boxes (DataFrames) in the memory of the driver. If something goes wrong on a worker (a hard drive crash, an out-of-memory error, etc.), Spark can use the lineage to recompute the lost parts of the DataFrame.
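
The idea can be illustrated with a toy model in plain Python. This is not how Spark is implemented; `ToyFrame` and its methods are hypothetical names invented for this sketch. Each object remembers its parent and the step needed to compute its rows, so a lost result can be rebuilt from the recipe.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToyFrame:
    """Toy stand-in for a DataFrame: it stores how to compute rows, not the rows."""

    compute: Callable[[], List[int]]
    parent: Optional["ToyFrame"] = None
    description: str = "source"
    cached: Optional[List[int]] = field(default=None, repr=False)

    def map(self, fn: Callable[[int], int], description: str) -> "ToyFrame":
        # Transformation: record the parent and the step; do no work yet.
        return ToyFrame(lambda: [fn(x) for x in self.collect()], self, description)

    def collect(self) -> List[int]:
        # Action: compute from the lineage unless a result is still available.
        if self.cached is None:
            self.cached = self.compute()
        return self.cached

    def lineage(self) -> List[str]:
        # Walk from this frame back to the source, then report oldest first.
        steps, node = [], self
        while node is not None:
            steps.append(node.description)
            node = node.parent
        return list(reversed(steps))


source = ToyFrame(lambda: [1, 2, 3])
doubled = source.map(lambda x: x * 2, "double")
shifted = doubled.map(lambda x: x + 1, "add one")

first_result = shifted.collect()

# Simulate a failure: every computed result is lost...
source.cached = doubled.cached = shifted.cached = None

# ...but the lineage is intact, so the same result can be rebuilt.
rebuilt_result = shifted.collect()
```

In this toy, losing the computed rows costs only the time needed to recompute them, because the recipe for producing them was never stored alongside the data.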

# Summary

In this chapter we covered:

- The execution model of Spark.
- What the SparkSession is and does.
- DataFrame operations/transformation/actions/lineage.