<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="../../resources/logo.png" alt="Intellinum Bootcamp" style="width: 600px; height: 163px">
</div>

# ETL Optimizations

Apache Spark&trade; clusters can be optimized using compression and caching best practices along with autoscaling clusters.

## In this lesson you:
* Compare trade-offs in compression algorithms in terms of parallelization, data transfer, file types, and read vs write time
* Cache data at the optimal point in a workload to limit data transfer across the network
* Configure clusters based on the demands of your workload including size, location, and types of machines
* Employ an autoscaling strategy to dynamically adapt to changes in data size
* Monitor clusters using the Ganglia UI

### Optimizing ETL Workloads

Optimizing Spark jobs boils down to a few common themes, many of which were addressed in other lessons in this course.  Additional optimizations include compression, caching, and hardware choices.

Many aspects of computer science are about trade-offs rather than a single, optimal solution.  For instance, data compression is a trade-off between compression and decompression speed, splittability, and compression ratio.  Hardware choice depends on whether a workload is CPU bound or IO bound.  This lesson explores these trade-offs and offers rules of thumb for applying them to any Spark workload.

While a number of compression algorithms exist, the basic intuition behind them is that they reduce space by reducing redundancies in the data.  Compression matters in big data environments because many jobs are IO bound (or the bottleneck is data transfer) rather than CPU bound.  Compressing data allows for the reduction of data transfer as well as reduction of storage space needed to store that data.
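The intuition that compression works by removing redundancy can be checked with a minimal sketch using Python's standard `zlib` module. The sample payloads below are made up for illustration; real ratios depend entirely on the data.

```python
import os
import zlib

# Highly redundant input: the same CSV-style line repeated many times.
redundant = b"2019-01-01,en,Main_Page,42\n" * 10_000

# Low-redundancy input of the same length: random bytes.
random_like = os.urandom(len(redundant))

def ratio(raw: bytes) -> float:
    """Return uncompressed size divided by compressed size."""
    return len(raw) / len(zlib.compress(raw))

print(f"redundant:   {ratio(redundant):.1f}x")
print(f"random-like: {ratio(random_like):.2f}x")
```

The repeated lines shrink dramatically, while the random bytes barely change size because there is no redundancy to remove.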

<div><img src="../../resources/data-compression.png" style="height: 400px; margin: 20px"/></div>

Run the following cell to create the lab environment:

In [None]:
#MODE = "LOCAL"
MODE = "CLUSTER"

import sys
import os
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark import SparkConf
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.storagelevel import StorageLevel
from matplotlib import interactive
interactive(True)
import matplotlib.pyplot as plt
%matplotlib inline
import json
import math
import numbers
import numpy as np
import plotly
import uuid
import time
plotly.offline.init_notebook_mode(connected=True)

sys.path.insert(0,'../../src')
from settings import *

# Download the packaged Python environment only if it is not already present
if not os.path.exists('../../libs/pyspark24_py36.zip'):
    !aws s3 cp s3://devops.intellinum.co/bins/pyspark24_py36.zip ../../libs/pyspark24_py36.zip

try:
    spark.stop()
    print("Stopped a SparkSession")
except Exception as e:
    print("No existing SparkSession detected")
    print("Creating a new SparkSession")

SPARK_DRIVER_MEMORY= "1G"
SPARK_DRIVER_CORE = "1"
SPARK_EXECUTOR_MEMORY= "1G"
SPARK_EXECUTOR_CORE = "1"
SPARK_EXECUTOR_INSTANCES = 12



conf = None
if MODE == "LOCAL":
    os.environ["PYSPARK_PYTHON"] = "/home/yuan/anaconda3/envs/pyspark24_py36/bin/python"
    conf = SparkConf().\
            setAppName("pyspark_etl_14-ETL-optimizations").\
            setMaster('local[*]').\
            set('spark.driver.maxResultSize', '0').\
            set('spark.jars', '../../libs/mysql-connector-java-5.1.45-bin.jar').\
            set('spark.jars.packages','net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1')
else:
    os.environ["PYSPARK_PYTHON"] = "./MN/pyspark24_py36/bin/python"
    conf = SparkConf().\
            setAppName("pyspark_etl_14-ETL-optimizations").\
            setMaster('yarn-client').\
            set('spark.executor.cores', SPARK_EXECUTOR_CORE).\
            set('spark.executor.memory', SPARK_EXECUTOR_MEMORY).\
            set('spark.driver.cores', SPARK_DRIVER_CORE).\
            set('spark.driver.memory', SPARK_DRIVER_MEMORY).\
            set("spark.executor.instances", SPARK_EXECUTOR_INSTANCES).\
            set('spark.sql.files.ignoreCorruptFiles', 'true').\
            set('spark.yarn.dist.archives', '../../libs/pyspark24_py36.zip#MN').\
            set('spark.driver.maxResultSize', '0').\
            set('spark.jars.packages','io.delta:delta-core_2.11:0.2.0,org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.2,org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.2,net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1'). \
            set('spark.jars', 's3://devops.intellinum.co/bins/mysql-connector-java-5.1.45-bin.jar')
        

spark = SparkSession.builder.\
    config(conf=conf).\
    getOrCreate()


sc = spark.sparkContext

sc.addPyFile('../../src/settings.py')

hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

def display(df, limit=10):
    try:
        # For spark-core 
        result = df.limit(limit).toPandas()
    except Exception as e:
        # For structured-streaming
        stream_name = str(uuid.uuid1()).replace("-","")
        query = (
          df
            .writeStream
            .format("memory")        # write to an in-memory table (for debugging only)
            .queryName(stream_name)  # name of the in-memory table
            .trigger(processingTime='1 seconds')  # trigger a micro-batch every second
            .outputMode("append")    # emit only newly arrived rows
            .start()
        )
        while query.isActive:
            time.sleep(1)
            result = spark.sql(f"select * from {stream_name} limit {limit}").toPandas()
            print("Wait until the stream is ready...")
            if not result.empty:
                break
        result = spark.sql(f"select * from {stream_name} limit {limit}").toPandas()
    
    return result

def untilStreamIsReady(name):
    queries = list(filter(lambda query: query.name == name, spark.streams.active))

    if len(queries) == 0:
        print("The stream is not active.")

    else:
        while (queries[0].isActive and len(queries[0].recentProgress) == 0):
            pass # wait until there is any type of progress

        if queries[0].isActive:
            queries[0].awaitTermination(5)
            print("The stream is active and ready.")
        else:
            print("The stream is not active.")
            
            
def dfTest(id, expected, result):
    assert str(expected) == str(result), "{} does not equal expected {}".format(result, expected)

### Compression Best Practices

Three compression algorithms are commonly used in Spark environments: GZIP, Snappy, and bzip2.  Choosing among them is a trade-off between the compression ratio, the CPU usage needed to compress and decompress the data, and whether the compressed output is splittable and can therefore be read in parallel.

|                   | GZIP   | Snappy | bzip2 |
|:------------------|:-------|:-------|:------|
| Compression ratio | high   | medium | high  |
| CPU usage         | medium | low    | high  |
| Splittable        | no     | yes    | yes   |

GZIP offers a high compression ratio at moderate CPU cost, but a GZIP file is not splittable, so each file must be decoded by a single task on a single core.  Snappy compresses less but is quick to encode and decode, and it is splittable when used inside a container format such as Parquet (a raw Snappy-compressed text file is not).  bzip2 offers a high compression ratio but at high CPU costs, making it the preferred choice only if storage space and/or network transfer is extremely limited.
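The ratio versus CPU trade-off can be observed outside Spark with a minimal sketch that uses only the Python standard library. Snappy is left out because it is not in the standard library, and the synthetic CSV-like payload is an assumption for illustration, so the numbers will differ on real data.

```python
import bz2
import gzip
import time

# Synthetic, repetitive CSV-like payload (illustrative only).
rows = [f"en,Page_{i % 500},{i % 97},{i * 31 % 10007}\n" for i in range(100_000)]
payload = "".join(rows).encode("utf-8")

def measure(name, compress, decompress):
    """Time one compress/decompress round trip and report the size ratio."""
    start = time.perf_counter()
    packed = compress(payload)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompress_s = time.perf_counter() - start

    assert restored == payload  # both codecs are lossless
    return {
        "codec": name,
        "ratio": len(payload) / len(packed),
        "compress_s": compress_s,
        "decompress_s": decompress_s,
    }

results = [
    measure("gzip", gzip.compress, gzip.decompress),
    measure("bzip2", bz2.compress, bz2.decompress),
]
for r in results:
    print(f"{r['codec']:>5}: ratio {r['ratio']:.1f}x, "
          f"compress {r['compress_s']:.3f}s, decompress {r['decompress_s']:.3f}s")
```

Comparing the printed ratios against the timings gives a rough feel for how much extra CPU each codec spends for the space it saves.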

Parquet already applies its own encodings, and adding Snappy compression on top can shrink a file much further (in some cases to around half its size) while improving job performance.  Unlike plain-text formats such as CSV, Parquet remains splittable regardless of the compression codec because compression is applied to chunks inside the file rather than to the file as a whole.
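To see why splittability matters for parallelism, here is a simplified estimate of how many tasks can read one file at once. It assumes a 128 MiB split size and ignores the finer details of how Spark actually plans input partitions; `input_tasks` is a hypothetical helper, not a Spark API.

```python
import math

SPLIT_BYTES = 128 * 1024 * 1024  # assumed split size (128 MiB)

def input_tasks(file_bytes: int, splittable: bool, split_bytes: int = SPLIT_BYTES) -> int:
    """Estimate how many tasks can read a single non-empty file in parallel."""
    if not splittable:
        return 1  # the whole file must be decoded by a single task
    return max(1, math.ceil(file_bytes / split_bytes))

one_gib = 1024 ** 3
print(input_tasks(one_gib, splittable=False))  # 1
print(input_tasks(one_gib, splittable=True))   # 8
```

Under these assumptions a 1 GiB non-splittable file keeps one core busy while the rest of the cluster waits, whereas a splittable file of the same size can occupy eight cores.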

<a href="https://issues.apache.org/jira/browse/SPARK-14482" target="_blank">The default compression codec for Parquet is Snappy.</a>

Import a file to test compression options on.

In [None]:
pagecountsEnAllDF = spark.read.parquet("s3a://data.intellinum.co/bootcamp/common/wikipedia/pagecounts/staging_parquet_en_only_clean/")

display(pagecountsEnAllDF)    

Write the file as CSV using no compression, Snappy, and GZIP.

In [None]:
YOUR_FIRST_NAME = # FILL_IN
userhome = f"s3a://temp.intellinum.co/{YOUR_FIRST_NAME}"

In [None]:
YOUR_FIRST_NAME = "yuan"
userhome = f"s3a://temp.intellinum.co/{YOUR_FIRST_NAME}"

In [None]:
pathTemplate = userhome + "/elt-optimizations/pageCounts{}.csv"
compressions = ["Uncompressed", "Snappy", "GZIP"]

uncompressedPath, snappyPath, GZIPPath = [pathTemplate.format(i) for i in compressions]

pagecountsEnAllDF.write.mode("OVERWRITE").csv(uncompressedPath)
pagecountsEnAllDF.write.mode("OVERWRITE").csv(snappyPath, compression = "snappy")
pagecountsEnAllDF.write.mode("OVERWRITE").csv(GZIPPath, compression = "GZIP")

Observe the size differences between the compression techniques.

In [None]:
for compressionType in compressions:
    metadata = !aws s3 ls --summarize --human-readable --recursive {pathTemplate.format(compressionType).replace('s3a','s3')}
    size = metadata[-1]
    print("{}:  \t{}".format(compressionType, size))

### Caching

Caching data is one way to improve query performance.  Cached data is maintained on a cluster rather than forgotten at the end of a query.  Without caching, Spark reads the data from its source again after every action. 

There are a number of different storage levels for caching, which are variants on memory, disk, or a combination of the two.  For DataFrames, the default storage level used by `cache()` is `MEMORY_AND_DISK`.

`cache()` is the most common way of caching data while `persist()` allows for setting the storage level.

It's worth noting that caching should be done with care since data cached at the wrong time can lead to less performant clusters.
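The behavior described above (lazy marking, materialization on the first action, and rereading the source when nothing is cached) can be modeled with a plain-Python analogy. `LazyDataset` is a hypothetical toy class, not a Spark API, and only mimics the bookkeeping.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not a Spark API)."""

    def __init__(self, load_fn):
        self._load_fn = load_fn        # simulates an expensive read from the source
        self._cache_requested = False
        self._cached_rows = None

    def cache(self):
        # Like DataFrame.cache(), this only marks the dataset; nothing is loaded yet.
        self._cache_requested = True
        return self

    def _rows(self):
        if self._cached_rows is not None:
            return self._cached_rows
        rows = self._load_fn()
        if self._cache_requested:
            self._cached_rows = rows   # materialized by the first action
        return rows

    def count(self):
        # An "action": forces the data to be produced.
        return len(self._rows())

    def unpersist(self):
        self._cache_requested = False
        self._cached_rows = None
        return self


reads = {"count": 0}

def read_source():
    reads["count"] += 1
    return list(range(1000))

ds = LazyDataset(read_source)
ds.count()
ds.count()
uncached_reads = reads["count"]                 # every action reread the source

ds.cache()
ds.count()
ds.count()
ds.count()
cached_reads = reads["count"] - uncached_reads  # only the first action read it
print(uncached_reads, cached_reads)
```

The toy model leaves out what makes caching risky in practice: cached data occupies executor memory and disk that other stages may need.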

Import a DataFrame to test caching performance trade-offs.

In [None]:
pagecountsEnAllDF = spark.read.parquet("s3a://data.intellinum.co/bootcamp/common/wikipedia/pagecounts/staging_parquet_en_only_clean/")

display(pagecountsEnAllDF)    

Use the `%timeit` magic to see the average time for counting an uncached DataFrame.  Recall that Spark has to reread the data from its source each time.

In [None]:
%timeit pagecountsEnAllDF.count()

Cache the DataFrame.  An action will materialize the cache (store it on the Spark cluster).

In [None]:
(pagecountsEnAllDF
  .cache()         # Mark the DataFrame as cached
  .count()         # Materialize the cache
) 

Perform the same operation on the cached DataFrame.

In [None]:
%timeit pagecountsEnAllDF.count()

What was the change?  Now unpersist the DataFrame.


In [None]:
pagecountsEnAllDF.unpersist()

### Cluster Configuration

Choosing the optimal cluster for a given workload depends on a variety of factors.  Some general rules of thumb include:

* **Fewer, large instances** are better than more, smaller instances since they reduce network shuffle
* With jobs that have varying data size, **autoscale the cluster** to elastically vary the size of the cluster
* Price sensitive solutions can use **spot pricing resources** at first, falling back to on demand resources when spot instances are unavailable
* Run a job with a small cluster to get an idea of the number of tasks, then choose a cluster whose **number of cores evenly divides the number of tasks** so that no cores sit idle in the final wave of tasks
* Production jobs should take place on **isolated, new clusters**
* **Colocate** the cluster in the same region and availability zone as your data

Available instance types are generally compute, memory, or storage optimized.  Normally start with compute-optimized clusters and fall back to the others based on the demands of the workload (e.g. storage-optimized for analytics or memory-optimized for iterative algorithms such as machine learning).

When running a job, examine CPU and network traffic using the Spark UI, together with a cluster monitor such as Ganglia, to confirm the demands of your job and adjust the cluster configuration accordingly.
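The rule of thumb about matching cores to tasks can be checked with simple arithmetic. The sketch below assumes one task per core and tasks of similar duration; `waves` is a hypothetical helper and the task count of 200 is only an example.

```python
def waves(num_tasks: int, total_cores: int):
    """Return (number of scheduling waves, idle cores in the final wave)."""
    full_waves, remainder = divmod(num_tasks, total_cores)
    if remainder == 0:
        return full_waves, 0
    return full_waves + 1, total_cores - remainder

for cores in (12, 16, 25, 50):
    n_waves, idle = waves(200, cores)
    print(f"{cores:>2} cores: {n_waves} waves, {idle} idle cores in the last wave")
```

With 200 tasks, 25 or 50 cores finish with every core busy in every wave, while 12 or 16 cores leave part of the cluster idle during the final wave.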

## Exercise 1: Comparing Compression Algorithms

Compare algorithms on compression ratio and time to compress for different file types and number of partitions.

### Step 1: Create and Cache a DataFrame for Analysis

Create a DataFrame `pagecountsEnAllDF`, cache it, and perform a count to materialize the cache.

In [None]:
pagecountsEnAllDF = spark.read.parquet("s3a://data.intellinum.co/bootcamp/common/wikipedia/pagecounts/staging_parquet_en_only_clean/")
pagecountsEnAllDF.cache()
pagecountsEnAllDF.count()

### Step 2: Examine the Functions for Comparing Writes and Reads

Examine the following functions, defined for you, for comparing write and read times by file type, number of partitions, and compression type.

In [None]:
# perf_counter is imported by name so the `time` module used by display() is not shadowed
from time import perf_counter

def write_read_time(df, file_type, partitions=1, compression=None, outputPath=userhome + "/bootcamp/comparisonTest"):
    '''
    Prints write time and read time for a given DataFrame with given params
    '''
    # Time the write
    start_time = perf_counter()
    _df = df.repartition(partitions).write.mode("OVERWRITE")

    if compression:
        _df = _df.option("compression", compression)
    if file_type == "csv":
        _df.csv(outputPath)
    elif file_type == "parquet":
        _df.parquet(outputPath)

    total_time = round(perf_counter() - start_time, 1)
    print("Save time of {}s for\tfile_type: {}\tpartitions: {}\tcompression: {}".format(total_time, file_type, partitions, compression))

    # Time reading the same files back; count() forces a full read
    start_time = perf_counter()
    if file_type == "csv":
        spark.read.csv(outputPath).count()
    elif file_type == "parquet":
        spark.read.parquet(outputPath).count()

    total_time = round(perf_counter() - start_time, 2)
    print("\tRead time of {}s".format(total_time))
  
  
def time_all(df, file_type_list=["csv", "parquet"], partitions_list=[1, 16, 32, 64], compression_list=[None, "gzip", "snappy"]):
    '''
    Wrapper function for write_read_time() to gridsearch lists of file types, partitions, and compression types
    '''
    for file_type in file_type_list:
        for partitions in partitions_list:
            for compression in compression_list:
                write_read_time(df, file_type, partitions, compression)

### Step 3: Examine the Output

Apply `time_all()` to `pagecountsEnAllDF` and examine the results.  Why do you see these changes across different file types, partition numbers, and compression algorithms?

In [None]:
time_all(pagecountsEnAllDF)

## Review
**Question:** Why does compression matter when storage space is so inexpensive?  
**Answer:** Compression matters in big data environments largely due to the IO bound nature of workloads.  Compression allows for less data transfer across the network, speeding up tasks significantly.  It can also reduce storage costs substantially as data size grows.

**Question:** What should or shouldn't be compressed?  
**Answer:** One best practice is to compress files using Snappy by default, which balances compute and compression ratio trade-offs nicely.  How much a file shrinks depends largely on the data being compressed, so a text format like CSV will likely compress significantly more than a Parquet file of integers, for example, since Parquet already stores them in a compact binary encoding.

**Question:** When should I cache my data?  
**Answer:** Caching should take place with any iterative workload where you read data multiple times.  Reexamining when you cache your data often leads to performance improvements since poor caching can have negative downstream effects.

**Question:** What kind of cluster should I use?  
**Answer:** Choosing a cluster depends largely on workload but a good rule of thumb is to use larger and fewer compute-optimized machines at first.  You can then tune the cluster size and type depending on the workload.  Autoscaling also allows for dynamic resource allocation.

## IMPORTANT Next Steps
* Please complete the <a href="https://docs.google.com/forms/d/e/1FAIpQLSd5whqoFBjNEEMvgwW5KRr-PeMyv6Lsczxk1p0es9s3IigEYQ/viewform?vc=0&c=0&w=1" target="_blank">short feedback survey</a>.  Your input is extremely important and shapes future course development.
* Congratulations, you have completed ETL Part 3!

&copy; 2019 [Intellinum Analytics, Inc](http://www.intellinum.co). All rights reserved.<br/>