
# Spark Advanced

![footer_logo_new](images/logo_new.png)

## Overview
- Spark Memory Model
- Java and Kryo Serializers
- Garbage Collection

## Spark Memory Model

Unlike Hadoop MapReduce jobs, Spark applications are memory-heavy.  
That's why understanding how Spark uses memory is crucial.

### Spark Worker and Executor memory configs

`SPARK_WORKER_MEMORY=6g` - amount of memory available to a Worker node  
`spark.executor.memory=4g` - amount of memory available to an Executor process

### Executor Memory structure
Assume an Executor with a 4GB heap and default memory settings. The memory regions then look like this:

![spark-memory](images/spark-memory.png)

When it comes to memory optimization, usually these properties are tuned:

- `spark.memory.fraction` - defaults to 0.75 (the Spark 1.6 value used throughout this chapter; Spark 2.0 and later default to 0.6)
- `spark.memory.storageFraction` - defaults to 0.5

### Reserved Memory fraction

This region is reserved for the system: Spark's internal objects such as classes, services, and network connections.

It is hardcoded to **300MB**, regardless of the size of the other regions.

### User Memory fraction

Stores user defined structures, like UDFs and other functions.  
This region is not managed by Spark.

Formula: `(JVM Heap - 300MB) * (1 - spark.memory.fraction)`

For a 4GB heap: `(4096MB - 300MB) * 0.25 = 949MB`


### Spark Memory

Managed by Spark and used for intermediate state, computations, serialization, joins, broadcast variables, etc.  
Data that is cached or persisted in memory goes to the **storage** segment of this region.

Formula: `(JVM Heap - Reserved Memory) * spark.memory.fraction`

For a 4GB heap: `(4096MB - 300MB) * 0.75 = 2847MB`

#### Storage Memory

Used for cached data and broadcast variables.

The following persistence options use Storage Memory:
- MEMORY_ONLY
- MEMORY_AND_DISK
- MEMORY_ONLY_SER
- MEMORY_AND_DISK_SER
- MEMORY_ONLY_2
- MEMORY_AND_DISK_2
- MEMORY_ONLY_SER_2
- MEMORY_AND_DISK_SER_2

Broadcast variables, for example, use the MEMORY_AND_DISK persistence option.

Storage Memory works in LRU (Least Recently Used) mode.  
New data is kept in memory and the least recently used blocks are evicted: they are either written to disk or dropped and later recomputed from the query plan.
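The LRU behaviour described above can be illustrated with a toy block store. This is only a sketch of the eviction policy, not Spark's actual storage code: the class name `LruBlockStore`, the block ids, and the sizes are all made up for illustration.

```python
from collections import OrderedDict


class LruBlockStore:
    """Toy LRU store: block id -> size in MB, least recently used first."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.blocks = OrderedDict()

    def used_mb(self):
        return sum(self.blocks.values())

    def get(self, block_id):
        # Reading a block makes it the most recently used one.
        self.blocks.move_to_end(block_id)
        return self.blocks[block_id]

    def put(self, block_id, size_mb):
        """Cache a block and return the ids of the blocks evicted to make room."""
        evicted = []
        if size_mb > self.capacity_mb:
            return evicted  # the block can never fit; leave the store unchanged
        self.blocks.pop(block_id, None)  # replacing a block frees its old space
        while self.used_mb() + size_mb > self.capacity_mb:
            old_id, _ = self.blocks.popitem(last=False)  # least recently used
            evicted.append(old_id)
        self.blocks[block_id] = size_mb
        return evicted


store = LruBlockStore(capacity_mb=100)
store.put("rdd_0_0", 40)
store.put("rdd_0_1", 40)
store.get("rdd_0_0")              # rdd_0_0 is now the most recently used block
print(store.put("rdd_1_0", 40))   # ['rdd_0_1']
```

Whether an evicted block is written to disk or simply dropped depends on its storage level; the sketch only tracks which block leaves memory.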

Formula: `(JVM Heap - Reserved Memory) * spark.memory.fraction * spark.memory.storageFraction`

For a 4GB heap: `(4096MB - 300MB) * 0.75 * 0.5 ≈ 1423MB`

#### Execution Memory

This segment is used for storing objects which are relevant to the Task execution.

For example:
- aggregations
- shuffle intermediate buffer
- serialization/deserialization

This segment can also spill to disk when there is not enough memory in the buffer.

It is not LRU memory: tasks do not evict each other's memory.

Formula: `(JVM Heap - Reserved Memory) * spark.memory.fraction * (1.0 - spark.memory.storageFraction)`

For a 4GB heap: `(4096MB - 300MB) * 0.75 * (1.0 - 0.5) ≈ 1423MB`
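All four formulas from this section can be combined into one small calculator. This is a sketch for checking the arithmetic, assuming the 300MB reserved region and the 0.75 / 0.5 defaults used in this chapter; the function name `memory_regions` is hypothetical and not part of any Spark API.

```python
RESERVED_MB = 300  # fixed Reserved Memory


def memory_regions(heap_mb, memory_fraction=0.75, storage_fraction=0.5):
    """Split an executor heap (in MB) into the regions described above."""
    usable = heap_mb - RESERVED_MB
    spark = usable * memory_fraction      # Spark Memory
    storage = spark * storage_fraction    # Storage segment of Spark Memory
    return {
        "reserved": RESERVED_MB,
        "user": usable - spark,           # same as usable * (1 - memory_fraction)
        "storage": storage,
        "execution": spark - storage,
    }


regions = memory_regions(4096)
for name, size in regions.items():
    print(f"{name:>9}: {size:7.1f} MB")
```

The storage and execution segments come out at 1423.5MB each, which the text rounds down to 1423MB. Passing a different `memory_fraction` or `storage_fraction` shows how tuning either property moves the borders.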


### Crossing memory borders

In some situations the border between the Execution and Storage segments can be crossed.

1. Storage can use Execution memory if Execution has no blocks in use at the moment.
2. Execution can use Storage memory if there are unused blocks that can be evicted.
3. Execution can evict Storage blocks if Storage holds blocks in the Execution region and Execution needs more memory.
4. If Storage needs more memory while Execution occupies part of the Storage region, Storage cannot evict Execution blocks. It has to wait until Execution releases them.
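The four rules above can be played through with a toy model. It is a deliberately simplified sketch, not Spark's memory manager: the class `UnifiedMemory` and its methods are invented for illustration, sizes are plain numbers in MB, and a refused Storage request simply returns `False`.

```python
class UnifiedMemory:
    """Toy model of the shared Storage/Execution pool (sizes in MB)."""

    def __init__(self, total_mb, storage_fraction=0.5):
        self.total = total_mb
        # Storage below this threshold cannot be evicted by Execution.
        self.storage_region = total_mb * storage_fraction
        self.storage_used = 0.0
        self.execution_used = 0.0

    def free(self):
        return self.total - self.storage_used - self.execution_used

    def acquire_storage(self, mb):
        """Rules 1 and 4: use any free memory, but never evict Execution."""
        if mb > self.free():
            return False  # Storage must wait until Execution releases memory
        self.storage_used += mb
        return True

    def acquire_execution(self, mb):
        """Rules 2 and 3: use free memory, then evict borrowed Storage blocks."""
        shortfall = mb - self.free()
        if shortfall > 0:
            borrowed = max(0.0, self.storage_used - self.storage_region)
            self.storage_used -= min(shortfall, borrowed)
        granted = min(mb, self.free())
        self.execution_used += granted
        return granted


pool = UnifiedMemory(total_mb=1000)
print(pool.acquire_storage(800))     # Storage borrows 300 MB from Execution
print(pool.acquire_execution(400))   # evicts 200 MB of borrowed Storage
print(pool.acquire_storage(100))     # refused: Execution is never evicted
print(pool.storage_used, pool.execution_used)
```

The asymmetry is the design point: cached blocks can be rebuilt or reloaded, while evicting the working memory of a running task would break the task.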

## Serialization
Spark uses a serialization mechanism to convert Java objects into bytes, for example to save storage space.

Serialization formats that are slow or produce large output will hurt the performance of an application.  
There is a tradeoff between usability and efficiency.

### Java Serializable

The default in Spark.

Java's built-in object serialization mechanism. It can serialize any object whose class implements `java.io.Serializable`.

Drawbacks: it is slow and produces large output.

### Kryo Serialization

[Kryo](https://github.com/EsotericSoftware/kryo) is a library, outside of the JDK.

Switch Spark to Kryo:  
`conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")`


Benefits:  
Faster and more compact than Java serialization, often by as much as 10x.

Drawbacks:  
Classes need to be registered for best performance.

To get the most out of Kryo, register the classes of the objects you serialize:  
`conf.registerKryoClasses(Array(classOf[Class1]))`  

If a class is not registered, Kryo still works but has to store the full class name with each object, which is much less efficient.  
You can make registration mandatory:  
`spark.kryo.registrationRequired=true`
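Why registration matters can be shown with a toy encoding. The sketch below is not Kryo's wire format; it only contrasts writing a full class name in front of every object with writing a short numeric id agreed on up front. The class name `com.example.Class1`, the id `1`, and both helper functions are hypothetical.

```python
import struct


def encode_unregistered(class_name, value):
    """Write the full class name in front of every object."""
    name = class_name.encode("utf-8")
    return struct.pack(">H", len(name)) + name + struct.pack(">i", value)


def encode_registered(class_id, value):
    """Write only a small numeric id that stands for the class."""
    return struct.pack(">H", class_id) + struct.pack(">i", value)


registry = {"com.example.Class1": 1}  # hypothetical class name and id
values = range(1000)

unregistered = sum(len(encode_unregistered("com.example.Class1", v)) for v in values)
registered = sum(len(encode_registered(registry["com.example.Class1"], v)) for v in values)
print(unregistered, registered)  # 24000 6000
```

In this toy format the class name accounts for three quarters of the output, and the overhead grows with the length of the name and the number of objects.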

When serializing large objects, you may need to increase this config parameter:  
`spark.kryoserializer.buffer`
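The Kryo properties from this section can also be passed on the command line. The sketch below only assembles `spark-submit` arguments as strings; `to_submit_args` is a hypothetical helper, `com.example.Class1` is a placeholder class name, and the buffer value is illustrative rather than a recommendation.

```python
kryo_conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # Comma-separated class names; com.example.Class1 is a placeholder.
    "spark.kryo.classesToRegister": "com.example.Class1",
    "spark.kryo.registrationRequired": "true",
    # Illustrative value only; size it to the largest object you serialize.
    "spark.kryoserializer.buffer": "1m",
}


def to_submit_args(conf):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(kryo_conf)))
```

`spark.kryo.classesToRegister` is the property-based counterpart of `registerKryoClasses`, which makes it usable when the configuration is not built in Scala code.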

### Serialization advice

Prefer Kryo whenever possible, because serialization affects two major aspects:
1. Shuffle operations depend heavily on the size and speed of serialization.
2. Caching depends on serialization, especially when caching to disk, when data spills from memory to disk, and when the MEMORY_ONLY_SER storage level is set.

## Garbage Collection

JVM GC is a mechanism for cleaning up unused memory.


#### JVM Memory Regions

![jvm_memory](images/jvm_memory.png)

The heap is primarily divided into two parts: Young Generation and Old Generation.

`-Xms` - initial heap size when the JVM starts  
`-Xmx` - maximum heap size  
`-Xmn` - size of the Young Generation; the rest of the heap goes to the Old Generation  
`-XX:PermSize` - initial size of the Permanent Generation (Java 7 and earlier)  
`-XX:MaxPermSize` - maximum size of the Permanent Generation (Java 7 and earlier)  
`-XX:SurvivorRatio` - ratio of Eden space to one Survivor space. For example, with a 10m Young Generation and `-XX:SurvivorRatio=2`, 5m is reserved for Eden and 2.5m for each of the two Survivor spaces. The default value is 8.  
`-XX:NewRatio` - ratio of Old to Young Generation sizes. The default value is 2.
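The two ratio flags are easy to mix up, so here is the arithmetic as a short sketch. It is an idealized calculation: a real JVM aligns and may adapt these sizes, and both function names are made up for illustration.

```python
def young_generation_layout(young_mb, survivor_ratio=8):
    """Split the Young Generation into Eden and two Survivor spaces.

    SurvivorRatio is Eden / one Survivor space, so the Young Generation
    is (survivor_ratio + 2) Survivor-sized parts in total.
    """
    survivor = young_mb / (survivor_ratio + 2)
    return {"eden": survivor * survivor_ratio, "survivor_each": survivor}


def heap_layout(heap_mb, new_ratio=2):
    """Split the heap by NewRatio, which is Old / Young."""
    young = heap_mb / (new_ratio + 1)
    return {"young": young, "old": heap_mb - young}


print(young_generation_layout(10, survivor_ratio=2))  # {'eden': 5.0, 'survivor_each': 2.5}
print(heap_layout(3072))                              # {'young': 1024.0, 'old': 2048.0}
```

The first call reproduces the 10m example from the flag description; the second shows that the default `NewRatio` of 2 gives the Young Generation one third of the heap.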

To log GC activity on the Executors, add `-XX:+PrintGCDetails` to `spark.executor.extraJavaOptions`, together with any of the memory flags above that you want to tune.
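A value for `spark.executor.extraJavaOptions` is just a space-separated list of JVM flags. The helper below is a hypothetical sketch that joins the flags and refuses `-Xmx`, because Spark does not allow the maximum heap size to be set through this property (use `spark.executor.memory` instead). The chosen flags are examples, and the `PrintGC*` logging flags assume a Java 8 or earlier JVM.

```python
def executor_java_options(*flags):
    """Join JVM flags into a value for spark.executor.extraJavaOptions."""
    for flag in flags:
        if flag.startswith("-Xmx"):
            # The executor heap size is controlled by spark.executor.memory.
            raise ValueError("set the heap size with spark.executor.memory, not -Xmx")
    return " ".join(flags)


opts = executor_java_options(
    "-XX:+PrintGCDetails",
    "-XX:+PrintGCTimeStamps",
    "-XX:NewRatio=3",  # example sizing flag, not a recommendation
)
print(f"--conf spark.executor.extraJavaOptions='{opts}'")
```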

#### GC Types

**Serial GC** (-XX:+UseSerialGC): Serial GC uses a simple mark-sweep-compact approach for both Young and Old Generation garbage collection, i.e. Minor and Major GC. It is useful on client machines, such as simple stand-alone applications on machines with few CPU cores, and works well for small applications with a low memory footprint.

**Parallel GC** (-XX:+UseParallelGC): Parallel GC is the same as Serial GC except that it spawns N threads for Young Generation garbage collection, where N is the number of CPU cores in the system. The number of threads can be controlled with the -XX:ParallelGCThreads=n JVM option.

**Parallel Old GC** (-XX:+UseParallelOldGC): The same as Parallel GC except that it uses multiple threads for both Young Generation and Old Generation garbage collection.

**Concurrent Mark Sweep** (CMS) Collector (-XX:+UseConcMarkSweepGC): The CMS Collector is also referred to as the concurrent low-pause collector. It collects the Old Generation and tries to minimize GC pauses by doing most of the garbage collection work concurrently with the application threads.

**G1 Garbage Collector** (-XX:+UseG1GC): The G1 collector is a parallel, concurrent, and incrementally compacting low-pause garbage collector. Unlike the other collectors, Garbage First does not split the heap into fixed, contiguous Young and Old Generation spaces. It divides the heap into multiple equal-sized regions, and each region is assigned a role (Eden, Survivor, or Old) dynamically. When a garbage collection is invoked, it first collects the regions with the least live data, hence “Garbage First”.

#### Profilers

Profilers make it possible to monitor the execution metrics of a JVM.

##### jstat
A command-line tool that ships with the JDK, suitable for in-place monitoring on a node.

```
# Find the PID of the JVM process to monitor
ps -eaf | grep MyJavaApp

# Print GC statistics for PID 5324 every 1000 ms
jstat -gc 5324 1000
```

and you will get metrics like:

```
S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT
1024.0 1024.0  0.0    0.0    8192.0   7933.3   42108.0    23401.3   20480.0 19990.9    157    0.274  40      1.381    1.654
1024.0 1024.0  0.0    0.0    8192.0   8026.5   42108.0    23401.3   20480.0 19990.9    157    0.274  40      1.381    1.654
```
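The raw columns are easier to read once they are turned into ratios. The sketch below parses one sample line from the output above and derives region utilization and average pause times. The function `parse_jstat` is a hypothetical helper, and it assumes the column layout shown here (capacities and usage in KB, GC times in seconds).

```python
JSTAT_OUTPUT = """\
S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT
1024.0 1024.0  0.0    0.0    8192.0   7933.3   42108.0    23401.3   20480.0 19990.9    157    0.274  40      1.381    1.654
"""


def parse_jstat(text):
    """Return one dict per sample line, keyed by the jstat column names."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, map(float, line.split()))) for line in lines[1:]]


sample = parse_jstat(JSTAT_OUTPUT)[-1]
eden_used = sample["EU"] / sample["EC"]          # fraction of Eden in use
old_used = sample["OU"] / sample["OC"]           # fraction of Old Generation in use
avg_young_gc = sample["YGCT"] / sample["YGC"]    # seconds per Minor GC
avg_full_gc = sample["FGCT"] / sample["FGC"]     # seconds per Full GC

print(f"Eden {eden_used:.1%}, Old {old_used:.1%}")
print(f"avg Minor GC {avg_young_gc * 1000:.1f} ms, avg Full GC {avg_full_gc * 1000:.1f} ms")
```

A process that has not yet run a Full GC reports `FGC` as 0, so a real script would need to guard the last division.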

##### Java VisualVM

Also ships with the JDK (bundled up to JDK 8, a separate download since then) and lets you attach a UI to a local or remote Java process.

![jvisualvm-jmx-connection](images/jvisualvm-jmx-connection.png)


Observe the state of the JVM.

![jvisualvm-monitoring](images/jvisualvm-monitoring.png)


Analyze memory allocations.

![jvisualvm-profiler-memory](images/jvisualvm-profiler-memory.png)


Perform CPU snapshots.

![jvisualvm-profiler-cpu-snapshot](images/jvisualvm-profiler-cpu-snapshot.png)


# Summary

In this chapter we learned about:
- How Spark memory is organized, fractions and regions
- What serialization is and which serializers are available
- When we need to think about the JVM Garbage Collection
