# Resources
This notebook is based on information gleaned from the following documents:

* [Spark official document: Tuning Spark](https://spark.apache.org/docs/latest/tuning.html#memory-tuning)
* [Spark Memory Management](https://0x0fff.com/spark-memory-management/)
* [Cloudera: How to tune your Apache Spark jobs (part 1)](http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/)
* [Cloudera: How to tune your Apache Spark jobs (part 2)](http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/)
* [Databricks: Tuning Java garbage collection for Spark applications](https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html)


## Setup

In this notebook we change the configuration of Spark. The pyspark notebook creates a fully configured `SparkContext` before the notebook starts, so to control the configuration you need to open this notebook as a plain IPython (Jupyter) notebook.

In addition, you need to set the following environment variables. We recommend adding these lines to your `.bashrc` or `.bash_login` script and then starting a new terminal.

```bash
######################################################################
# setting SPARK pointers
######################################################################
# Set this to wherever you installed Spark.
# Double quotes are needed so that $HOME is expanded.
export SPARK_HOME="$HOME/spark-latest"

# Make the pyspark and py4j modules importable from a plain Jupyter notebook.
# The py4j version must match the archive shipped with your Spark release.
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$SPARK_HOME/python:$PYTHONPATH
```
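
The `py4j` archive name in the `PYTHONPATH` line above depends on the Spark release, so the hard-coded `py4j-0.9-src.zip` may not exist in every installation. The sketch below is not part of the original notebook. It looks the archive up instead of hard-coding it, assuming the layout `$SPARK_HOME/python/lib/py4j-*-src.zip`. The helper names `find_py4j_zip` and `pyspark_paths` are hypothetical.

```python
import glob
import os


def find_py4j_zip(spark_home):
    """Return the path of the py4j source archive under spark_home, or None."""
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    # If several archives are present, take the last one in sorted order.
    return matches[-1] if matches else None


def pyspark_paths(spark_home):
    """Build the list of entries to put on PYTHONPATH (hypothetical helper)."""
    paths = [os.path.join(spark_home, "python")]
    py4j = find_py4j_zip(spark_home)
    if py4j is not None:
        paths.insert(0, py4j)
    return paths


if __name__ == "__main__":
    home = os.environ.get("SPARK_HOME")
    if home is None:
        print("SPARK_HOME is not set")
    else:
        print(os.pathsep.join(pyspark_paths(home)))
```

Running the script prints a value that can be pasted into the `export PYTHONPATH=...` line.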

# Memory Management

Configuring memory management can have a large impact on the performance of a Spark job.

#### Three considerations in tuning memory usage: 

Taken from [Spark official document](https://spark.apache.org/docs/latest/tuning.html#memory-tuning):

1. the amount of memory used by your objects (you may want your entire dataset to fit in memory), 
1. the cost of accessing those objects, and 
1. the overhead of garbage collection (if you have high turnover in terms of objects).

### Determining Memory Consumption

The "Storage" page in the Spark web UI shows how much memory an RDD consumes after it is **put into cache**.

In [63]:
from pyspark import SparkContext, SparkConf
from pyspark import StorageLevel

In [66]:
# Only one SparkContext can run in a notebook at a time.

# Stop any previous SparkContext so that a new configuration can be applied.
# If no SparkContext has been created yet, `sc` is undefined and a NameError is raised.
try:
    sc.stop()
except NameError:
    pass


In [67]:
# Start a Spark Context using 3 out of the 4 cores on my laptop
sc = SparkContext(master="local[3]")

Open the Spark web UI at http://localhost:4040/.

![](images/ui-application.png)

![](images/empty-storage.png)

## Tuning Data Structures

### Serialized RDD Storage

Quote from [Spark official document](https://spark.apache.org/docs/latest/tuning.html#serialized-rdd-storage):

    When your objects are too large to efficiently store despite this tuning, a much simpler way to reduce memory usage is to store them in serialized form, using the serialized StorageLevels in the RDD persistence API, such as MEMORY_ONLY_SER. Spark will then store each RDD partition as one large byte array. The only downside of storing data in serialized form is slower access times, due to having to deserialize each object on the fly.

In [38]:
data = ["hello world"] * (10**5)

In [39]:
rdd1 = sc.parallelize(data) \
         .persist(StorageLevel.MEMORY_ONLY_SER)
rdd1.count()

100000

In [40]:
rdd2 = sc.parallelize(data) \
         .persist(StorageLevel.MEMORY_ONLY)
rdd2.count()

100000

![](images/serialization.png)
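
To get a feel for why one byte array per partition is compact, the sketch below compares two representations of the same records in plain Python, without Spark. It is an illustration under stated assumptions, not a measurement of Spark itself. `sys.getsizeof` reports CPython object sizes rather than JVM sizes, and the helper names are made up for this example. PySpark does pickle Python objects by default. The records are distinct strings because a list such as `["hello world"] * n` holds `n` references to a single object, which would distort both numbers.

```python
import pickle
import sys


def object_bytes(records):
    """Approximate in-memory size of a list of distinct strings (CPython)."""
    return sys.getsizeof(records) + sum(sys.getsizeof(r) for r in records)


def serialized_bytes(records):
    """Size of the same records pickled into a single byte string."""
    return len(pickle.dumps(records, protocol=2))


# Distinct strings, so neither measurement benefits from shared references.
records = ["hello world %d" % i for i in range(1000)]

as_objects = object_bytes(records)
as_bytes = serialized_bytes(records)
print("as separate objects: %d bytes" % as_objects)
print("as one byte array:   %d bytes" % as_bytes)
```

The "Storage" page of the Spark web UI remains the authoritative place to read the actual size of a cached RDD.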

# Garbage Collection
Garbage collection is one of the most complex parts of the Java Virtual Machine, and it has a large impact on performance.

The problem that garbage collection solves is freeing the memory occupied by objects that are no longer needed.

In some object-oriented languages such as **C++** the programmer has direct control of memory allocation. Heap objects are created with `new`, which allocates space and runs the class **constructor**, and released with `delete`, which runs the **destructor** and *de-allocates* the space when the object is no longer needed.

This type of memory management allows the programmer to write highly optimized code. On the other hand, it places a significant burden on the programmer, because keeping track of which objects are still needed can be very complex. Failing to deallocate memory in a timely manner leads to *memory leaks*, which are hard to debug.

**Java** (Spark runs on the Java Virtual Machine) takes a different approach to memory management than **C++**. It does not require the programmer to write destructors. The only requirement is that, once an object is no longer needed, it can no longer be reached from any variable in the program. For example, if a function assigns an object to a local variable and stores no other reference to it, nothing points to that object once the function returns.

More precisely, at any time the program holds a set of pointers to objects on the **heap**. This set of pointers is referred to as the *root set of references* in the figure below. When the heap gets too full, the *garbage collector* is invoked. It identifies the objects in the heap that cannot be reached from the root set of references. Those objects are then added to the "free" part of the heap and made available for new objects.

![](images/Garbage-Collection.gif)

For more details, see [Oracle's Java SE 6 garbage collection tuning guide](http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html).
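
The reachability rule described above can be sketched in a few lines of Python. This is a toy model of the marking step only, assuming the heap is a dictionary that maps each object id to the ids it references. The function names are invented for this example, and real JVM collectors are far more elaborate.

```python
def reachable(heap, roots):
    """Return the set of object ids reachable from the root set.

    heap maps an object id to the list of object ids it references.
    """
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue
        marked.add(obj)
        stack.extend(heap.get(obj, []))
    return marked


def collect(heap, roots):
    """Return (surviving heap, ids of the objects that were freed)."""
    live = reachable(heap, roots)
    garbage = set(heap) - live
    survivors = {obj: refs for obj, refs in heap.items() if obj in live}
    return survivors, garbage


# A toy heap: A and B are reachable from the roots. C and D reference each
# other, but nothing reachable points at them, so they are garbage.
heap = {"A": ["B"], "B": [], "C": ["D"], "D": ["C"]}
roots = ["A"]
survivors, garbage = collect(heap, roots)
print(sorted(survivors))  # ['A', 'B']
print(sorted(garbage))    # ['C', 'D']
```

Note that `C` and `D` are collected even though each is still referenced by the other. What matters is reachability from the root set, not whether any reference exists.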

#### Measuring the Impact of GC

The impact of GC can be viewed using either the Spark UI or the terminal log messages.

![](images/ui-gc.png)
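
For the log-message route, the Spark tuning guide suggests adding `-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps` to the Java options. The sketch below only assembles that setting as key/value pairs. The helper name `gc_logging_settings` is hypothetical, and the flags are the classic HotSpot ones, which newer JVMs have replaced with unified logging. In local mode the executor runs inside the already started driver JVM, so this executor option is expected to matter mainly on a real cluster.

```python
# GC logging flags quoted in the Spark tuning guide.
GC_LOGGING_FLAGS = ["-verbose:gc", "-XX:+PrintGCDetails", "-XX:+PrintGCTimeStamps"]


def gc_logging_settings(extra_flags=()):
    """Return (key, value) pairs suitable for SparkConf().setAll(...)."""
    flags = list(GC_LOGGING_FLAGS) + list(extra_flags)
    return [("spark.executor.extraJavaOptions", " ".join(flags))]


settings = gc_logging_settings()
for key, value in settings:
    print("%s=%s" % (key, value))

# With pyspark available, the pairs can be applied like this:
#   conf = SparkConf().setMaster("local[2]").setAll(settings)
```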

## Configuring Garbage Collection
*  [Spark configuration](http://spark.apache.org/docs/latest/configuration.html#viewing-spark-properties)
*  [Memory management overview](http://spark.apache.org/docs/latest/tuning.html#memory-management-overview)

#### From Memory Management Overview

Memory usage in Spark largely falls under one of two categories: **execution** and **storage**. Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations, while storage memory refers to that used for caching and propagating internal data across the cluster. In Spark, execution and storage share a unified region (M). When no execution memory is used, storage can acquire all the available memory and vice versa. Execution may evict storage if necessary, but only until total storage memory usage falls under a certain threshold (R). In other words, R describes a subregion within M where cached blocks are never evicted. Storage may not evict execution due to complexities in implementation.

<img src="images/Spark-Memory-Management-1.6.0.png" width="400">

Although there are two relevant configurations, the typical user should not need to adjust them as the default values are applicable to most workloads:

* **spark.memory.fraction** expresses the size of M as a fraction of (JVM heap space - 300MB). The default is 0.75 in Spark 1.6; Spark 2.0 and later default to 0.6. The rest of the space (25% with the 1.6 default) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
* **spark.memory.storageFraction** expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.
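
As a worked example of the two fractions, the sketch below computes the sizes of M and R for a given heap. It uses only the formulas stated in the bullets above. The function name `memory_regions` and the 4 GB heap are assumptions chosen for illustration.

```python
RESERVED_MB = 300  # reserved memory subtracted from the heap, per the text above


def memory_regions(heap_mb, memory_fraction=0.75, storage_fraction=0.5):
    """Return the sizes in MB of M, R, and the remaining user space."""
    usable = heap_mb - RESERVED_MB
    if usable <= 0:
        raise ValueError("heap must be larger than the reserved %d MB" % RESERVED_MB)
    m = usable * memory_fraction  # unified execution + storage region
    r = m * storage_fraction      # storage that execution cannot evict
    user = usable - m             # user data structures, Spark metadata
    return {"M": m, "R": r, "user": user}


# Example: a 4 GB heap with the default fractions quoted above.
regions = memory_regions(4096)
for name in ("M", "R", "user"):
    print("%-4s %8.1f MB" % (name, regions[name]))
```

With these inputs, M is 2847 MB, R is 1423.5 MB, and 949 MB is left for user data structures.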

In [None]:
from pyspark import SparkContext, SparkConf
from pyspark import StorageLevel

#### First, try with very restricted memory

In [60]:
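conf = SparkConf().setMaster("local[2]") \
                  .set("spark.memory.fraction", 0.1**10)  # Shrink the unified region M to almost nothing

# Stop the previous SparkContext, if any, so the new configuration takes effect.
try:
    sc.stop()
except NameError:
    pass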
sc = SparkContext(conf=conf)

In [61]:
import time

# trigger_gc yields every ordered pair of words in a given line s.
# It is named trigger_gc because it produces a large amount of output, which triggers GC.
def trigger_gc(s):
    ws = s.split()
    for a in ws:
        for b in ws:
            yield (a, b)

st = time.time()
rdd = sc.textFile('./Moby-Dick-Edited.txt', minPartitions=1000) \
         .flatMap(trigger_gc)
print(rdd.first())
print(rdd.count())
print("It takes %.2f seconds." % (time.time() - st))

(u'\ufeffMOBY', u'\ufeffMOBY')
2581129
It takes 11.22 seconds.


![](images/many-gc.png)
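
To see why `trigger_gc` produces so much output, note that a line with n words yields n² pairs. The first pair combines the first word with itself, which is why `rdd.first()` above returns the same word twice. The short check below runs the same generator in plain Python on a made-up line.

```python
def trigger_gc(s):
    """Yield every ordered pair of words in the line s."""
    ws = s.split()
    for a in ws:
        for b in ws:
            yield (a, b)


line = "call me Ishmael"  # made-up 3-word input
pairs = list(trigger_gc(line))
print(len(pairs))  # 3 words -> 9 pairs
print(pairs[0])    # ('call', 'call')
```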

In [42]:
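conf = SparkConf().setMaster("local[2]") \
                  .set("spark.memory.fraction", 1.0)  # Let M span the whole heap minus the reserved 300MB

# Stop the previous SparkContext, if any, so the new configuration takes effect.
try:
    sc.stop()
except NameError:
    pass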
sc = SparkContext(conf=conf)

In [43]:
st = time.time()
rdd = sc.textFile('./Moby-Dick-Edited.txt', minPartitions=1000) \
         .flatMap(trigger_gc)
rdd.count()
print("It takes %.2f seconds." % (time.time() - st))

It takes 9.23 seconds.


![](images/few-gc.png)