## Table Of Contents
* [Categories of Optimizations](#Categories-of-Optimizations)
  * [Cluster Level Optimizations](#Cluster-Level-Optimizations)
  * [Balanced Approach](#Balanced-Approach)
  * [Practical](#Practical)
* [Salting](#Salting)
* [Static Resource Allocation](#Static-Resource-Allocation)
* [Memory in Apache Spark](#Memory-in-Apache-Spark)
* [Cache and Persist](#Cache-and-Persist)
* [Serializer](#Serializer)

### Categories of Optimizations

Spark performance tuning comes down to two main areas:

1. **Cluster configuration level** - Tuning from the resource perspective: CPU, memory, and so on.

2. **Application code level** - How we write the code: partitioning, bucketing, cache and persist, avoiding shuffles, join optimizations, using optimized file formats, using `reduceByKey` instead of `groupByKey`, and so on.





### Cluster Level Optimizations

When we talk about resources, we mainly mean memory (RAM) and CPU cores (compute). Optimizing at this level means making sure each job gets the right amount of both.

* Let's say we have a 10-node cluster (10 worker nodes).
* Each node has 16 CPU cores.
* Each node has 64GB RAM.

**Executor** - A container of resources (a JVM) that holds both RAM and CPU cores. It is up to us to decide how big an executor should be.

*Can a single node have more than one executor?*</BR>
Yes, a single node can host multiple executors (containers).

*How many executors can we create on a node?*</BR>
Each node has 16 cores and 64GB RAM. 1 core is set aside for background processes and 1GB RAM for the OS, which leaves 15 cores and 63GB RAM for executors.</BR>

There are two strategies:

1. **Thin Executor** - Create as many executors as possible, each holding the minimum possible resources. With the above configuration we can create 15 executors per node, each holding 1 core and 63/15 = 4.2GB RAM.

Drawbacks:
* Multithreading within an executor is not possible because each container has only one core.
* Many copies of a broadcast variable are required, since each executor must receive its own copy.

2. **Fat Executor** - Give the maximum resources to each container/executor. With the above configuration we can allocate all 15 cores and 63GB RAM to a single executor per node.

Drawbacks:
* It has been observed that HDFS throughput suffers when an executor holds more than 5 CPU cores.
* If an executor holds a huge amount of memory, garbage collection takes a long time. Garbage collection removes unused objects from memory, and the larger the heap, the longer each collection takes.

Instead of going with the thin or fat approach, we should take a balanced approach to allocating resources, which avoids the bottlenecks mentioned above.


### Balanced Approach
10-node cluster</BR>
Each node: 16 cores, 64GB RAM. 1 core is set aside for background processes and 1GB RAM for the OS.</BR>

-> We want multithreading, which requires more than one CPU core per executor.</BR>
-> We do not want HDFS throughput to suffer, which happens when an executor uses more than 5 cores.</BR>
-> That makes 5 CPU cores per executor the right choice.</BR>

* With 15 usable cores per node, that gives 15/5 = 3 executors per node, each with 5 CPU cores and 63/3 = 21GB RAM.
* Part of this 21GB goes to overhead (off-heap memory), raw memory that is not used by the JVM heap. It is max(384MB, 7% of the executor's memory) ~ 1.5GB. (Recent Spark versions default to 10% instead of 7%.)
* So each executor ends up with 5 CPU cores and 21 - 1.5 = 19.5GB of memory.
* Number of executors = 10 nodes * 3 = 30.
* Out of these 30 executors, one is given to the YARN Application Master, leaving 29 for the job.
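
The sizing arithmetic above can be sketched as a small helper. This is only an illustration of the calculation in these notes: `ExecutorPlan` and `planExecutors` are made-up names, and the 7% overhead fraction is the figure used in the text rather than a value read from any cluster.

```scala
// Illustrative sketch of the executor sizing arithmetic; not part of any Spark API.
case class ExecutorPlan(
  executorsPerNode: Int,
  totalExecutors: Int,
  memoryPerExecutorGb: Double,
  overheadGb: Double,
  heapGb: Double
)

def planExecutors(nodes: Int, coresPerNode: Int, ramPerNodeGb: Int,
                  coresPerExecutor: Int, overheadFraction: Double): ExecutorPlan = {
  val usableCores = coresPerNode - 1   // 1 core set aside for background processes
  val usableRamGb = ramPerNodeGb - 1   // 1GB set aside for the OS
  val executorsPerNode = usableCores / coresPerExecutor
  val memoryPerExecutorGb = usableRamGb.toDouble / executorsPerNode
  // Overhead is max(384MB, fraction of the executor's memory)
  val overheadGb = math.max(384.0 / 1024, overheadFraction * memoryPerExecutorGb)
  ExecutorPlan(executorsPerNode, nodes * executorsPerNode,
    memoryPerExecutorGb, overheadGb, memoryPerExecutorGb - overheadGb)
}

// Thin, fat and balanced layouts for the 10-node, 16-core, 64GB example
val thin = planExecutors(10, 16, 64, coresPerExecutor = 1, overheadFraction = 0.07)
val fat = planExecutors(10, 16, 64, coresPerExecutor = 15, overheadFraction = 0.07)
val balanced = planExecutors(10, 16, 64, coresPerExecutor = 5, overheadFraction = 0.07)
```

With these inputs, `balanced` works out to 3 executors per node, 30 in total, with 21GB each, of which about 1.47GB is overhead.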

**Note:** Data stored in raw/off-heap memory is not subject to garbage collection, since it lives outside the executor's JVM heap, so we save GC time. The trade-off is that we have to do our own memory management.


### Practical

Build a larger file by appending several copies of `bigLogNew.txt` to `bigLogLatest.txt`, then upload it to HDFS:

```

[itv002768@g02 ~]$ cat bigLogNew.txt bigLogNew.txt bigLogNew.txt >> bigLogLatest.txt
[itv002768@g02 ~]$ cat bigLogNew.txt bigLogNew.txt bigLogNew.txt >> bigLogLatest.txt

[itv002768@g02 ~]$ ll -h bigLogLatest.txt
-rw-r--r-- 1 itv002768 students 8.2G Sep 13 09:48 bigLogLatest.txt

[itv002768@g02 ~]$ hadoop fs -put bigLogLatest.txt /user/itv002768/week13_practicals
[itv002768@g02 ~]$ hadoop fs -head /user/itv002768/week13_practicals/bigLogLatest.txt
ERROR: Thu Jun 04 10:37:51 BST 2015
WARN: Sun Nov 06 10:37:51 GMT 2016
WARN: Mon Aug 29 10:37:51 BST 2016
ERROR: Thu Dec 10 10:37:51 GMT 2015
ERROR: Fri Dec 26 10:37:51 GMT 2014
ERROR: Thu Feb 02 10:37:51 GMT 2017
WARN: Fri Oct 17 10:37:51 BST 2014
ERROR: Wed Jul 01 10:37:51 BST 2015
WARN: Thu Jul 27 10:37:51 BST 2017
WARN: Thu Oct 19 10:37:51 BST 2017
WARN: Wed Jul 30 10:37:51 BST 2014
ERROR: Fri Jan 12 10:37:51 GMT 2018
```


In [12]:
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.column
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window


val sparkConfig = new SparkConf()
sparkConfig.set("spark.app.name", "My Application")
sparkConfig.set("spark.master", "yarn")



val spark = SparkSession.builder().config(sparkConfig).getOrCreate()

print(spark.sparkContext.uiWebUrl)


Some(http://g02.itversity.com:4040)

sparkConfig = org.apache.spark.SparkConf@2818e77
spark = org.apache.spark.sql.SparkSession@6dd9169d




org.apache.spark.sql.SparkSession@6dd9169d

In [10]:
sys.props.update("spark.ui.proxyBase", "") // to fix the broken UI

In [8]:
spark.stop()

If you look at the Spark UI when running in local mode (`local[2]`), you'll see a single executor. In local mode the driver and the executor are the same process: one container is allocated on the local machine, so everything runs on the edge node.

This is fine for development but not preferred for production.

Now we'll start the session with `master=yarn`. After this, the UI shows one driver and two executors.

*Why are two executors added?*</BR>
There are two ways to allocate resources:

* Allocating manually (static allocation)

* Allocating dynamically - On the itversity cluster, dynamic allocation is enabled (`spark.dynamicAllocation.enabled=true`). With it come a few configured properties: `initialExecutors` (2), `maxExecutors` (10) and `minExecutors` (0).

**Initially 2 executors are allocated. If the job is big, the count can grow to a maximum of 10, and when the job finishes or no executors are needed it drops back to 0. Executors are released after they have been idle for a set time, which is itself a property (`spark.dynamicAllocation.executorIdleTimeout`).**
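
For reference, the dynamic allocation properties described above can be written out as plain key/value pairs. This is a sketch: the executor counts come from the text, while the `60s` idle timeout is Spark's documented default and only an assumption about this cluster, whose real value may differ. The name `dynamicAllocationSettings` is made up for the example.

```scala
// Dynamic allocation settings as plain key/value pairs (no Spark dependency needed to read them).
// Executor counts are from the notes; the idle timeout is an assumed example value.
val dynamicAllocationSettings: Map[String, String] = Map(
  "spark.dynamicAllocation.enabled" -> "true",
  "spark.dynamicAllocation.initialExecutors" -> "2",
  "spark.dynamicAllocation.minExecutors" -> "0",
  "spark.dynamicAllocation.maxExecutors" -> "10",
  "spark.dynamicAllocation.executorIdleTimeout" -> "60s"
)
```

In a notebook cell, such a map could be applied with `sparkConfig.setAll(dynamicAllocationSettings)` before the `SparkSession` is built.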

You can check for the default configuration of the executor in Ambari.

* **Storage Memory** - Whenever memory is allocated to the container, 300MB of it is reserved. In our case that leaves 1024 - 300 = 724MB, which is divided into two parts: **storage & execution memory** (60% of 724MB ~ 434MB) and **additional buffer memory** (40% of 724MB ~ 290MB) for other purposes.


In [11]:

val inputRdd = spark.sparkContext.textFile("/user/itv002768/week13_practicals/bigLogLatest.txt")
val splitRdd = inputRdd.map(x => (x.split(":")(0), x.split(":")(0)))
val rdd3 = splitRdd.groupByKey
val rdd4 = rdd3.map(x=>(x._1, x._2.size))
rdd4.collect()

/*
file size - 8.2GB
total partitions = (8.2*1024)/128 = 65.6, rounded up to 66
There are two stages, so 66*2 = 132 tasks

Stage 0
10 executors have to execute 66 tasks, so some executors have to execute multiple tasks.
With one CPU core per executor (our case), an executor runs only one task at a time, so at most 10 tasks run in parallel.
Parallelism = num_executors * num_cpu_cores

Stage 0 had 66 tasks and each partition held an equal amount of data.

Stage 1 - groupByKey triggered a shuffle. There are only two keys, ERROR and WARN, so only two of the 66 partitions
        created in stage 1 hold any data. If the data splits evenly across the two keys, each of those partitions holds
        about 4.1GB, and with an uneven split one of them holds even more. All the data is processed by two tasks and the
        remaining 64 tasks do nothing. An executor cannot handle this much data: it can handle roughly 400MB and we are
        asking it to handle about 4.1GB, which results in an Out Of Memory error.
*/

org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 1.0 failed 4 times, most recent failure: Lost task 6.3 in stage 1.0 (TID 137, w01.itversity.com, executor 16): ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container from a bad node: container_1658918988971_19299_01_000017 on host: w01.itversity.com. Exit status: 143. Diagnostics: [2022-09-13 11:59:07.054]Container killed on request. Exit code is 143
[2022-09-13 11:59:07.054]Container exited with a non-zero exit code 143. 
[2022-09-13 11:59:07.054]Killed by external signal
.
Driver stacktrace:
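
The task and parallelism numbers from the comment above can be reproduced with a few lines of plain Scala. This is a sketch of the arithmetic only; the helper names are made up, and the 128MB block size and one core per executor are the values assumed in these notes.

```scala
// Arithmetic behind the partition, task and parallelism figures (illustrative helper names).
val blockSizeMb = 128

def partitionsFor(fileSizeGb: Double): Int =
  math.ceil(fileSizeGb * 1024 / blockSizeMb).toInt

def parallelism(numExecutors: Int, coresPerExecutor: Int): Int =
  numExecutors * coresPerExecutor

def waves(tasks: Int, slots: Int): Int =
  math.ceil(tasks.toDouble / slots).toInt

val partitions = partitionsFor(8.2)           // 66 partitions for the 8.2GB file
val totalTasks = partitions * 2               // 132 tasks across the two stages
val slots = parallelism(10, 1)                // 10 tasks can run at the same time
val stage0Waves = waves(partitions, slots)    // 7 rounds of tasks to finish stage 0
val skewedPartitionGb = 8.2 / 2               // about 4.1GB per key after the shuffle
```

The last value is the problem: about 4.1GB lands in a single task, far more than the few hundred MB an executor of this size can work with.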

### Salting
When the data is skewed like this, we can append a number (say 1 to 10) to every key so that the data is distributed more evenly,
e.g. WARN1, WARN2 ..... WARN10, ERROR1, ERROR2 ...... ERROR10.

That gives 20 distinct keys, which can use up to 20 different partitions.
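
To see why salting spreads the load, here is a plain-Scala sketch that assigns keys to partitions by hash, the same idea a hash partitioner uses (non-negative `hashCode` modulo the number of partitions). `partitionFor` and the other names are hypothetical, and 66 is the partition count from the practical above.

```scala
// Assign a key to a partition by hash, mimicking hash partitioning (illustrative, not the Spark class).
def partitionFor(key: String, partitionCount: Int): Int =
  Math.floorMod(key.hashCode, partitionCount)

val partitionCount = 66
val plainKeys = Seq("WARN", "ERROR")

// Salted keys: WARN1 .. WARN10, ERROR1 .. ERROR10
val saltedKeys = for (level <- plainKeys; salt <- 1 to 10) yield s"$level$salt"

// How many distinct partitions actually receive data in each case
val plainPartitionsUsed = plainKeys.map(partitionFor(_, partitionCount)).distinct.size
val saltedPartitionsUsed = saltedKeys.map(partitionFor(_, partitionCount)).distinct.size
```

Two plain keys can never occupy more than two partitions, while the 20 salted keys are spread over many more. Hash collisions can still put two salted keys in the same partition, which is why the text says "up to" 20.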

### Static Resource Allocation

To disable dynamic allocation and request resources statically:

`spark2-shell --conf spark.dynamicAllocation.enabled=false --master yarn --num-executors 20 --executor-cores 2 --executor-memory 2G`

In [None]:
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.column
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window


val sparkConfig = new SparkConf()
sparkConfig.set("spark.app.name", "My Application")
sparkConfig.set("spark.master", "yarn")
sparkConfig.set("spark.dynamicAllocation.enabled", "false")
sparkConfig.set("spark.executor.memory", "3500m")
sparkConfig.set("spark.executor.cores", "2")
sparkConfig.set("spark.executor.instances", "20")



val spark = SparkSession.builder().config(sparkConfig).getOrCreate()

print(spark.sparkContext.uiWebUrl)





In [15]:
spark

org.apache.spark.sql.SparkSession@10d25f40

In [None]:
spark.stop()

In [11]:
val inputRdd = spark.sparkContext.textFile("/user/itv002768/week13_practicals/bigLogLatest.txt")
val splitRdd = inputRdd.map(x => (x.split(":")(0), x.split(":")(0)))
val rdd3 = splitRdd.groupByKey
val rdd4 = rdd3.map(x=>(x._1, x._2.size))
rdd4.collect()


org.apache.spark.SparkException: Job 0 cancelled 

* For a long-running job, it's better to go with dynamic resource allocation.

For the above practical we would need a container with more memory, which is not possible on this cluster, so we'll use **salting** instead.

We'll append a random number in the range 1 to 60 to both the WARN and ERROR keys.

In [1]:
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.column
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window


val sparkConfig = new SparkConf()
sparkConfig.set("spark.app.name", "My Application")
sparkConfig.set("spark.master", "yarn")

val spark = SparkSession.builder().config(sparkConfig).getOrCreate()

print(spark.sparkContext.uiWebUrl)
sys.props.update("spark.ui.proxyBase", "") // to fix the broken UI

Some(http://g02.itversity.com:4046)

sparkConfig = org.apache.spark.SparkConf@2764efff
spark = org.apache.spark.sql.SparkSession@6732a97f


org.apache.spark.sql.SparkSession@6732a97f

In [2]:
val random = new scala.util.Random
val start = 1
val end = 60

val inputRdd = spark.sparkContext.textFile("/user/itv002768/week13_practicals/bigLogLatest.txt")

val rdd1 = inputRdd.map(x => {
    val num = start + random.nextInt((end - start) + 1)
    (x.split(":")(0) + num, x.split(":")(1))
})

val rdd2 = rdd1.groupByKey

val rdd3 = rdd2.map(x=>(x._1, x._2.size))

val rdd4 = rdd3.map(x => {
    if(x._1.substring(0, 4)=="WARN"){
        ("WARN", x._2)
    }else{
        ("ERROR", x._2)
    }
})

val rdd5 = rdd4.reduceByKey(_ + _)

rdd5.collect()

random = scala.util.Random@28ed973c
start = 1
end = 60
inputRdd = /user/itv002768/week13_practicals/bigLogLatest.txt MapPartitionsRDD[1] at textFile at <console>:40
rdd1 = MapPartitionsRDD[2] at map at <console>:42
rdd2 = ShuffledRDD[3] at groupByKey at <console>:47
rdd3 = MapPartitionsRDD[4] at map at <console>:49
rdd4 = MapPartitionsRDD[5] at map at <console>:51
rdd5 = ShuffledRDD[6] at reduceByKey at <console>:59


Array((WARN,119973264), (ERROR,120026736))
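
The two-phase aggregation in the cell above (count per salted key, then strip the salt and sum) can be checked on a small in-memory sample without a cluster. This is a sketch using ordinary Scala collections: the 1000 generated levels stand in for the log file, the fixed seed is only there to make the run repeatable, and stripping the salt with a regex is an alternative to the `substring(0, 4)` check used in the cell.

```scala
import scala.util.Random

val rng = new Random(42)    // fixed seed, only to keep the sketch repeatable
val saltRange = 60          // same range as `end` in the cell above

// Stand-in for the log file: 1000 randomly chosen log levels
val levels: Seq[String] = Seq.fill(1000)(if (rng.nextBoolean()) "WARN" else "ERROR")

// Phase 1: salt each key and count per salted key (many small groups)
val saltedCounts: Map[String, Int] =
  levels.map(level => level + (1 + rng.nextInt(saltRange)))
    .groupBy(identity)
    .map { case (saltedKey, group) => saltedKey -> group.size }

// Phase 2: strip the trailing digits and add the partial counts back together
val finalCounts: Map[String, Int] =
  saltedCounts.toSeq
    .map { case (saltedKey, count) => saltedKey.replaceAll("\\d+$", "") -> count }
    .groupBy(_._1)
    .map { case (level, partials) => level -> partials.map(_._2).sum }

// Reference result without salting, for comparison
val directCounts: Map[String, Int] =
  levels.groupBy(identity).map { case (level, group) => level -> group.size }
```

`finalCounts` matches `directCounts`, which is the point of the technique: salting changes how the work is distributed, not the result.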

### Memory in Apache Spark

**Spark Optimization - session 10**

Memory use in Spark falls into two broad categories:

* Execution Memory</BR>
Memory required for computations such as joins, shuffles, sorts and aggregations.


* Storage Memory</BR>
Memory used for caching.

In Spark, execution and storage memory share a common region.</BR>

**Benefit of the common region**</BR>
When no execution is happening, storage can acquire all the available memory, and vice versa.

Common Region - 2GB</BR>
* If necessary, execution may evict storage, meaning execution takes precedence. But this eviction can only happen until total storage memory usage falls to a certain threshold, so incoming executions/computations cannot acquire the full 2GB.

* Storage cannot evict execution. Execution cannot be stopped just because storage needs the space.

Advantage

This design ensures several desirable properties:
* Applications that do not use caching can use the entire space for execution.
* Applications that do use caching can reserve a minimum storage space in which their data blocks are immune to eviction.

* Let's say we request a container with 4GB. The container then gets 4GB + max(384MB, 10% of heap). The 4GB is the JVM heap and the rest is off-heap (reserved for VM overheads).
* 300MB of the heap is again reserved for running the executor.
* We are left with 3.7GB.
* Out of this 3.7GB, 60% goes to unified (storage + execution) memory ~ 2.2GB.
* The remaining 40% of 3.7GB (~1.5GB) goes to user memory, which holds user data structures and Spark-related metadata and safeguards against Out Of Memory errors.

* Out of the 2.2GB of storage + execution memory, 50% is the threshold for storage. It means we can cache up to about 1.1GB of data without worrying about eviction by executions/computations.
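
The breakdown above can be sketched as a small calculation. `MemoryRegions` and `memoryRegions` are made-up names for illustration. The default fractions of 0.6 and 0.5 are the 60% and 50% from the text and correspond to Spark's `spark.memory.fraction` and `spark.memory.storageFraction` settings. Values are in MB, so they differ slightly from the rounded GB figures.

```scala
// Illustrative breakdown of an executor heap into the regions described above (values in MB).
case class MemoryRegions(
  reservedMb: Double,          // fixed reserved memory
  unifiedMb: Double,           // storage + execution
  userMb: Double,              // user data structures, metadata
  protectedStorageMb: Double   // cached data below this size is safe from eviction
)

def memoryRegions(heapMb: Double,
                  memoryFraction: Double = 0.6,
                  storageFraction: Double = 0.5): MemoryRegions = {
  val reservedMb = 300.0
  val usableMb = heapMb - reservedMb
  val unifiedMb = usableMb * memoryFraction
  MemoryRegions(reservedMb, unifiedMb, usableMb - unifiedMb, unifiedMb * storageFraction)
}

val fourGbHeap = memoryRegions(4096)   // the 4GB container from this section
val oneGbHeap = memoryRegions(1024)    // the 1GB executor from the practical
```

For the 4GB heap this gives roughly 2278MB of unified memory, 1518MB of user memory and 1139MB of protected storage. For the 1GB executor the unified region is only about 434MB, which is why the skewed job earlier ran out of memory.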


### Cache and Persist

**Spark Optimization - session 10**</BR>
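Run the following code, then go through the Spark UI. Once `rdd3` is cached, later actions that depend on it reuse the stored partitions instead of recomputing everything up to the shuffle.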

In [1]:
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.column
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window


val sparkConfig = new SparkConf()
sparkConfig.set("spark.app.name", "My Application")
sparkConfig.set("spark.master", "yarn")

val spark = SparkSession.builder().config(sparkConfig).getOrCreate()

print(spark.sparkContext.uiWebUrl)
sys.props.update("spark.ui.proxyBase", "") // to fix the broken UI

Some(http://g02.itversity.com:4040)

sparkConfig = org.apache.spark.SparkConf@15822350
spark = org.apache.spark.sql.SparkSession@14c5aff7


org.apache.spark.sql.SparkSession@14c5aff7

In [2]:
val random = new scala.util.Random
val start = 1
val end = 60

val inputRdd = spark.sparkContext.textFile("/user/itv002768/week13_practicals/bigLogLatest.txt")

val rdd1 = inputRdd.map(x => {
    val num = start + random.nextInt((end - start) + 1)
    (x.split(":")(0) + num, x.split(":")(1))
})

val rdd2 = rdd1.groupByKey

/*
rdd3 sits after the shuffle and is expensive to recompute, so we cache it for reuse by later actions
*/
val rdd3 = rdd2.map(x=>(x._1, x._2.size))

rdd3.cache

val rdd4 = rdd3.map(x => {
    if(x._1.substring(0, 4)=="WARN"){
        ("WARN", x._2)
    }else{
        ("ERROR", x._2)
    }
})

//val rdd5 = rdd4.reduceByKey(_ + _)

rdd4.collect()

random = scala.util.Random@13a5434c
start = 1
end = 60
inputRdd = /user/itv002768/week13_practicals/bigLogLatest.txt MapPartitionsRDD[1] at textFile at <console>:40
rdd1 = MapPartitionsRDD[2] at map at <console>:42
rdd2 = ShuffledRDD[3] at groupByKey at <console>:47
rdd3 = MapPartitionsRDD[4] at map at <console>:52
rdd4 = MapPartitionsRDD[5] at map at <console>:56


Array((WARN,1994728), (WARN,2000208), (WARN,1998905), (WARN,2002428), (WARN,2004661), (WARN,1990811), (WARN,1986703), (WARN,2002536), (WARN,1989488), (WARN,20...

In [3]:
rdd4.collect()

Array((WARN,1994728), (WARN,2000208), (WARN,1998905), (WARN,2002428), (WARN,2004661), (WARN,1990811), (WARN,1986703), (WARN,2002536), (WARN,1989488), (WARN,2006403), (WARN,1999421), (WARN,1993077), (WARN,2004461), (WARN,2009241), (WARN,2006287), (WARN,1999164), (WARN,2027641), (WARN,2000844), (WARN,1990925), (WARN,1995284), (ERROR,1989047), (WARN,1997947), (ERROR,2010830), (WARN,1985123), (WARN,2002821), (ERROR,1999714), (WARN,1994828), (ERROR,2008639), (ERROR,1995269), (ERROR,1996067), (ERROR,1997713), (ERROR,1998259), (ERROR,2003380), (ERROR,2003405), (ERROR,1993746), (ERROR,2001933), (ERROR,1986944), (ERROR,2004486), (ERROR,1992079), (ERROR,2008356), (ERROR,2003895), (ERROR,1994750), (ERROR,2002428), (ERROR,2008659), (ERROR,2006985), (ERROR,2001868), (ER...

In [7]:
/*
To uncache, use unpersist.

Afterwards you can verify the result in the Storage tab of the Spark UI.
*/
rdd3.unpersist()

MapPartitionsRDD[6] at map at <console>:43

In [8]:
rdd3.persist

MapPartitionsRDD[6] at map at <console>:43

In [9]:
rdd4.collect()

Array((WARN,1994728), (WARN,2000208), (WARN,1998905), (WARN,2002428), (WARN,2004661), (WARN,1990811), (WARN,1986703), (WARN,2002536), (WARN,1989488), (WARN,2006403), (WARN,1999421), (WARN,1993077), (WARN,2004461), (WARN,2009241), (WARN,2006287), (WARN,1999164), (WARN,2027641), (WARN,2000844), (WARN,1990925), (WARN,1995284), (ERROR,1989047), (WARN,1997947), (ERROR,2010830), (WARN,1985123), (WARN,2002821), (ERROR,1999714), (WARN,1994828), (ERROR,2008639), (ERROR,1995269), (ERROR,1996067), (ERROR,1997713), (ERROR,1998259), (ERROR,2003380), (ERROR,2003405), (ERROR,1993746), (ERROR,2001933), (ERROR,1986944), (ERROR,2004486), (ERROR,1992079), (ERROR,2008356), (ERROR,2003895), (ERROR,1994750), (ERROR,2002428), (ERROR,2008659), (ERROR,2006985), (ERROR,2001868), (ER...
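
As a rough plain-Scala analogy (not Spark API), the effect of caching can be pictured as computing an expensive result once and reusing it, instead of recomputing it for every action. Everything here is made up for illustration: `expensiveCounts` stands in for the work needed to produce `rdd3`, and the counter tracks how often that work runs.

```scala
// Analogy only: a counter shows how often the "expensive" step is executed.
var computations = 0

def expensiveCounts(): Map[String, Int] = {
  computations += 1
  Map("WARN" -> 3, "ERROR" -> 2)   // made-up result standing in for rdd3
}

// Without caching: every "action" recomputes the result from scratch
val first = expensiveCounts()
val second = expensiveCounts()
val runsWithoutCache = computations

// With caching: compute once on first use, then reuse the stored result
lazy val cached = expensiveCounts()
val third = cached
val fourth = cached
val runsWithCache = computations - runsWithoutCache
```

Here the uncached path runs the expensive step twice and the cached path runs it once. In Spark the same saving shows up as skipped stages on the second `collect()`.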

### Serializer

We should prefer the Kryo serializer over the Java serializer.

Whenever data is stored on disk or has to be transferred over the network, it has to be in serialized form.

Data serialized with the Java serializer is larger than the same data serialized with Kryo. In other words, Kryo produces a much smaller serialized size, and it is also much faster than the Java serializer.
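
Switching to Kryo is a configuration change. A minimal sketch of the setting as a key/value pair follows; the map name `kryoSettings` is made up for the example, while the property and class name are the standard Spark ones.

```scala
// Serializer setting as a plain key/value pair
val kryoSettings: Map[String, String] = Map(
  "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer"
)
```

In the notebook cells above this could be applied with `sparkConfig.setAll(kryoSettings)`, or directly with `sparkConfig.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")`, before the `SparkSession` is created.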