# Apache Spark Executor Cores & Memory

Here are the notes on Apache Spark Executor Cores & Memory from the video transcript:

*   If CPU and memory resources are not correctly allocated, Spark jobs may take a long time to complete.

*   **Executor Tuning** involves deciding the number of executors to create, the amount of memory, and the number of cores to allocate to those executors.

*   To decide the number of executors, cores, and memory to assign, consider some examples.

*   **Cluster Configuration Example**: Let's say you have five nodes (machines), each with 12 cores and 48 GB of RAM. The question is how to decide the number of executors, cores per executor, and memory per executor.

*   There are generally three options to consider when deciding on executors: **thin executors, fat executors, and optimally sized executors**.

*   **Fat Executors**
    *   Occupy a large portion of the resources on a node.
    *   Calculation: First, leave out one core and 1 GB of RAM for the operating system and other processes.
    *   If a machine has 12 cores and 48 GB of RAM, you will have 11 cores and 47 GB of RAM available per node.
    *   One fat executor will take all 11 cores and 47 GB of RAM.
    *   One node will have one executor.
    *   If the cluster has five nodes, you will have five executors.
    *   The number of executors is five, executor cores is 11, and executor memory is 47 GB.
    *   Each of the five executors will have 11 cores and 47 GB of RAM.
    *   Advantages:
        *   Increased task level parallelism. With a lot of cores, fat executors can run many tasks together.
        *   Can load datasets that require large amounts of memory.
        *   Managing a large number of executors is not a concern, because each node runs only one executor (or very few).
        *   Enhanced data locality. The executor can hold many partitions in memory, which reduces network traffic and improves application speed.
    *   Disadvantages:
        *   If the executor is not fully utilized, resources will sit idle.
        *   Weaker fault tolerance. If an executor fails, a large amount of computation is lost, which reduces application reliability.
        *   For those using HDFS, running more than five cores per executor degrades HDFS throughput and can lead to excessive garbage collection, which pauses the program and hurts performance.
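
The fat-executor arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation in these notes; the function name `fat_executor_config` and the dictionary keys are hypothetical, not part of any Spark API.

```python
def fat_executor_config(nodes: int, cores_per_node: int, ram_gb_per_node: int) -> dict:
    """One executor per node, using everything left after the OS reservation."""
    # Reserve 1 core and 1 GB of RAM per node for the OS and other processes.
    usable_cores = cores_per_node - 1
    usable_ram_gb = ram_gb_per_node - 1
    return {
        "num_executors": nodes,            # one fat executor per node
        "executor_cores": usable_cores,    # all remaining cores on the node
        "executor_memory_gb": usable_ram_gb,
    }


# Five nodes, each with 12 cores and 48 GB of RAM (the cluster from the notes).
print(fat_executor_config(5, 12, 48))
# {'num_executors': 5, 'executor_cores': 11, 'executor_memory_gb': 47}
```
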
*   **Thin Executors**
    *   Occupy minimal resources from the node.
    *   Calculation:
        *   Again, leaving out one core and 1 GB of RAM per node, you are left with 11 cores and 47 GB of RAM.
        *   Give one core to one executor. Therefore, one node will have 11 executors.
        *   With 11 executors and 47 GB of RAM on each node, each executor will have close to 4 GB of RAM (47 / 11 ≈ 4.3 GB).
        *   One executor will contain one core and 4 GB of RAM.
        *   One node has 11 executors, and with five nodes, you will end up with 55 executors.
        *   The number of executors will be 55, the executor cores will be one, and the executor memory will be close to 4 GB.
    *   Advantages:
        *   Executor level parallelism.
        *   Fault tolerance. If an executor is lost, only a small amount of work is lost, so it is cheap to recompute.
    *   Disadvantages:
        *   High network traffic. The data a task needs might not be fully present on its executor, so data must be moved across the cluster.
        *   Reduced data locality. Because each executor has a small amount of memory, the number of partitions that are local will be small.
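
As a rough sketch, the thin-executor calculation above looks like this in Python. The function name `thin_executor_config` and the dictionary keys are made up for illustration, and memory is rounded down to whole gigabytes as in the notes.

```python
def thin_executor_config(nodes: int, cores_per_node: int, ram_gb_per_node: int) -> dict:
    """One core per executor, with each node's memory split evenly between them."""
    # Reserve 1 core and 1 GB of RAM per node for the OS and other processes.
    usable_cores = cores_per_node - 1
    usable_ram_gb = ram_gb_per_node - 1
    executors_per_node = usable_cores      # one executor per remaining core
    return {
        "num_executors": executors_per_node * nodes,
        "executor_cores": 1,
        # Integer division: 47 GB / 11 executors is about 4.3 GB, taken as 4 GB.
        "executor_memory_gb": usable_ram_gb // executors_per_node,
    }


# Five nodes, each with 12 cores and 48 GB of RAM (the cluster from the notes).
print(thin_executor_config(5, 12, 48))
# {'num_executors': 55, 'executor_cores': 1, 'executor_memory_gb': 4}
```
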
*   **Optimally Sized Executors**
    *   To size or create an optimal executor, keep in mind the following rules:
        1.  Leave out one core and 1 GB of RAM for Hadoop, YARN, and operating system processes.
        2.  Leave out one executor, or one core and 1 GB of RAM, for the YARN application master, which negotiates resources from the resource manager. The application master generally works well with one core and 1 GB of RAM. If the executors are small, you can instead subtract one executor when you set `--num-executors`. However, giving up a whole executor is wasteful when the executors are fat.
        3.  Use three to five cores per executor, so that each executor runs three to five tasks at a time. This is a general good practice because HDFS throughput deteriorates with more than five cores per executor, and garbage collection overhead grows.
        4.  The executor memory you configure should exclude the memory overhead, which is used for internal system processes. Subtract the overhead from each executor's share of memory first.
    *   **Example**:
        *   Five node cluster, each with 12 cores and 48 GB of RAM.
        *   Leave out one core and 1 GB of RAM per node for Hadoop daemons and other operating system processes.
        *   Per node, after subtracting, you are left with 11 cores and 47 GB of RAM.
        *   Calculate the total memory and cores you have in the cluster.
            *   The total memory is 47 GB * 5 = 235 GB.
            *   The total cores is 11 * 5 = 55 cores.
        *   Subtract out one core and 1 GB of RAM for the application master.
            *   235 GB - 1 GB = 234 GB
            *   55 cores - 1 core = 54 cores
        *   Assign five cores for each executor.
            *   The total executors is the total cores divided by the cores per executor.
            *   54 cores / 5 cores per executor = 10.8, rounded down to 10 executors.
        *   To find the memory per executor, divide the total memory, 234 GB, by the number of executors, 10. This gives 23.4 GB, taken as 23 GB.
        *   Subtract the overhead memory from this figure to get the actual executor memory. The overhead memory is the larger of 384 MB and 10% of the executor memory.
            *   23 GB - max(384 MB, 10% of 23 GB)
            *   23 GB - max(384 MB, 2.3 GB)
            *   23 GB - 2.3 GB = 20.7 GB, rounded down to 20 GB
        *   The number of executors is 10, the cores per executor is five cores, and the memory per executor is 20 GB.
        *   It is important to focus on the memory per core.
            *   With five cores and 20 GB of memory, one core will get 4 GB of memory.
            *   One core can process one partition.
            *   As long as a partition is less than or equal to 4 GB, the processing should happen seamlessly.
        *   With data partitioned into 128 MB chunks, this configuration leaves plenty of headroom, because each core can handle up to 4 GB of data.
    *   Benefits:
        *   Maintains a balance between thin and fat executors.
        *   Provides good parallelism: 10 executors with five cores each can run up to 50 tasks at once.
        *   Avoids HDFS throughput issues, because each executor has five cores, which is within the recommended limit.
        *   Data locality is preserved and enhanced, because with 20 GB of RAM each executor can hold a good number of partitions in memory.
    *   **Example 2:**
        *   Three nodes, each with 16 cores and 48 GB of RAM.
            1.  Leave out one core and 1 GB of RAM per node.
                *   You will be left with 15 cores and 47 GB of RAM per node.
            2.  Calculate the total memory and total cores.
                *   Total cores: 15 * 3 = 45 cores
                *   Total memory: 47 * 3 = 141 GB
            3.  Leave out 1 GB of RAM and one core for the application master.
                *   45 cores - 1 core = 44 cores
                *   141 GB - 1 GB = 140 GB
            4.  Find out how many executors we want to create, the number of cores, and the memory.
                *   Give four cores per executor.
                    *   The number of executors is the total cores / cores per executor, so 44 / 4 = 11 executors.
                *   Memory per executor: 140 GB / 11 executors ≈ 12.7 GB, taken as 12 GB.
                *   Subtract out the memory overhead.
                    *   The memory overhead is the maximum of 384 MB or 10% of executor memory.
                    *   10% of 12 GB is 1.2 GB.
                    *   Assume the overhead memory is 1 GB to make the calculation simple.
                *   The actual memory is 12 GB - 1 GB = 11 GB.
        *   The number of executors is 11, the executor cores is four, and the executor memory is 11 GB.
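
The sizing rules and both examples above can be captured in one small helper. This is a minimal sketch under the assumptions stated in these notes (1 core and 1 GB reserved per node, 1 core and 1 GB for the application master, overhead of max(384 MB, 10%), and rounding down to whole gigabytes). The names `size_executors` and `spark_submit_flags` are hypothetical. The flags `--num-executors`, `--executor-cores` and `--executor-memory` are standard `spark-submit` options. For Example 2 the helper returns 10 GB rather than 11 GB, because it applies the full 10% overhead (1.2 GB) instead of the simplified 1 GB used in the notes.

```python
import math

OVERHEAD_MIN_GB = 384 / 1024   # 384 MB floor for the memory overhead
OVERHEAD_FRACTION = 0.10       # otherwise 10% of the executor's memory share


def size_executors(nodes: int, cores_per_node: int, ram_gb_per_node: int,
                   cores_per_executor: int) -> dict:
    """Apply the four sizing rules from the notes to a homogeneous cluster."""
    # Rule 1: reserve 1 core and 1 GB per node for the OS and Hadoop/YARN daemons.
    total_cores = (cores_per_node - 1) * nodes
    total_ram_gb = (ram_gb_per_node - 1) * nodes
    # Rule 2: reserve 1 core and 1 GB for the YARN application master.
    total_cores -= 1
    total_ram_gb -= 1
    # Rule 3: fixed cores per executor (three to five is the suggested range).
    num_executors = total_cores // cores_per_executor
    share_gb = total_ram_gb // num_executors          # whole GB per executor
    # Rule 4: executor memory excludes the overhead.
    overhead_gb = max(OVERHEAD_MIN_GB, OVERHEAD_FRACTION * share_gb)
    executor_memory_gb = math.floor(share_gb - overhead_gb)
    return {
        "num_executors": num_executors,
        "executor_cores": cores_per_executor,
        "executor_memory_gb": executor_memory_gb,
        "memory_per_core_gb": executor_memory_gb / cores_per_executor,
    }


def spark_submit_flags(cfg: dict) -> list:
    """Render a sizing result as spark-submit command-line arguments."""
    return [
        "--num-executors", str(cfg["num_executors"]),
        "--executor-cores", str(cfg["executor_cores"]),
        "--executor-memory", f"{cfg['executor_memory_gb']}g",
    ]


example_1 = size_executors(nodes=5, cores_per_node=12, ram_gb_per_node=48,
                           cores_per_executor=5)
print(" ".join(spark_submit_flags(example_1)))
# --num-executors 10 --executor-cores 5 --executor-memory 20g

example_2 = size_executors(nodes=3, cores_per_node=16, ram_gb_per_node=48,
                           cores_per_executor=4)
print(example_2["num_executors"], example_2["executor_memory_gb"])
# 11 10
```

The `memory_per_core_gb` field makes the memory-per-core check explicit: for Example 1 it is 4.0 GB, comfortably above a 128 MB partition.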


# Questions

## Apache Spark Executor Cores & Memory Allocation MCQs

Here are some multiple-choice questions (MCQs) to help you revise the concepts of Apache Spark Executor Cores & Memory allocation based on the information from the video:

**Question 1:** What is the primary goal of executor tuning in Apache Spark?
* a) To minimize the number of executors
* b) To optimally allocate CPU and memory resources for Spark jobs
* c) To maximize the memory usage of the driver program
* d) To reduce the amount of data processed

**Question 2:** Which of the following is NOT a general type of executor configuration discussed in the video?
* a) Thin Executors
* b) Fat Executors
* c) Medium Executors
* d) Optimally Sized Executors

**Question 3:** What is a characteristic of "Fat Executors"?
* a) They occupy minimal resources from a node.
* b) They are suitable for lightweight jobs.
* c) They occupy a large portion of the resources on a node.
* d) They increase network traffic.

**Question 4:** What is a key advantage of Fat Executors?
* a) Increased task level parallelism
* b) Lower memory requirements
* c) Reduced data locality
* d) Lower fault tolerance

**Question 5:** What is a primary disadvantage of Fat Executors?
* a) Low network traffic
* b) Efficient resource utilization
* c) Potential for resource wastage if not fully utilized
* d) Enhanced data locality

**Question 6:** What is a defining characteristic of "Thin Executors"?
* a) They occupy a large portion of the resources on a node.
* b) They are not fault-tolerant.
* c) They occupy minimal resources from a node.
* d) They are suitable for memory-intensive tasks.

**Question 7:** What is an advantage of using Thin Executors?
* a) Reduced network traffic
* b) Better fault tolerance
* c) Enhanced data locality
* d) Ability to process large amounts of data in each executor

**Question 8:** What is a key disadvantage of Thin Executors?
* a) High fault tolerance
* b) Efficient data locality
* c) High network traffic
* d) Limited parallelism

**Question 9:** When sizing optimal executors, what resources should be reserved per node for the OS, YARN, and Hadoop daemons?
* a) Leave out five cores and 5 GB of RAM
* b) Leave out one core and 1 GB of RAM
* c) Allocate all available resources
* d) This is not necessary for optimal sizing.

**Question 10:** According to the video, what is the general recommendation for the number of cores per executor to optimize HDFS throughput?
* a) More than 10 cores
* b) More than 7 cores
* c) Three to five cores
* d) Only one core

**Question 11:** What is the purpose of the YARN application master?
* a) To manage data locality
* b) To execute tasks within executors
* c) To negotiate resources with the resource manager
* d) To monitor executor health

**Question 12:** When calculating executor memory, what should be excluded from the total memory allocated?
* a) Memory used by the OS
* b) Memory used by YARN
* c) Memory overhead for internal system processes
* d) Memory used for caching data

**Question 13:** If an executor has five cores and 20 GB of memory, how much memory is allocated per core?
* a) 2 GB
* b) 4 GB
* c) 5 GB
* d) 20 GB

**Question 14:** Cluster configuration: 3 nodes, each with 16 cores and 48 GB RAM. After subtracting resources for OS and YARN daemons, how many cores are left per node?
* a) 16 cores
* b) 15 cores
* c) 14 cores
* d) 12 cores

**Question 15:** In the above scenario, after also reserving resources for the application master, what are the total available cores in the cluster?
* a) 40
* b) 45
* c) 44
* d) 48


# Answers

Here are the answers to the multiple-choice questions (MCQs) with the correct option and a brief explanation, based on the video transcript:

*   Question 1: What is the primary goal of executor tuning in Apache Spark?
    *   b) **To optimally allocate CPU and memory resources for Spark jobs**.
        *   Executor tuning focuses on correctly allocating CPU and memory resources.
*   Question 2: Which of the following is NOT a general type of executor configuration discussed in the video?
    *   c) **Medium Executors**.
        *   The video discusses thin, fat, and optimally sized executors.
*   Question 3: What is a characteristic of "Fat Executors"?
    *   c) They **occupy a large portion of the resources on a node**.
        *   Fat executors are defined as those that use a significant amount of resources on a node.
*   Question 4: What is a key advantage of Fat Executors?
    *   a) **Increased task level parallelism**.
        *   Fat executors have many cores, enabling multiple tasks to run in parallel within the same executor.
*   Question 5: What is a primary disadvantage of Fat Executors?
    *   c) **Potential for resource wastage if not fully utilized**.
        *   If a fat executor isn't fully utilized, the allocated resources may sit idle.
*   Question 6: What is a defining characteristic of "Thin Executors"?
    *   c) They **occupy minimal resources from a node**.
        *   Thin executors use only a small amount of resources from the node.
*   Question 7: What is an advantage of using Thin Executors?
    *   b) **Better fault tolerance**.
        *   Because thin executors process smaller amounts of data, the impact of an executor failure is reduced.
*   Question 8: What is a key disadvantage of Thin Executors?
    *   c) **High network traffic**.
        *   Thin executors have small memories, potentially requiring data to be moved across the cluster, increasing network traffic.
*   Question 9: When sizing optimal executors, what resources should be reserved per node for the OS, YARN, and Hadoop daemons?
    *   b) **Leave out one core and 1 GB of RAM**.
        *   It is recommended to leave out one core and 1 GB of RAM per node for these processes.
*   Question 10: According to the video, what is the general recommendation for the number of cores per executor to optimize HDFS throughput?
    *   c) **Three to five cores**.
        *   The recommendation is three to five cores per executor, because HDFS throughput deteriorates and garbage collection increases with more than five cores per executor.
*   Question 11: What is the purpose of the YARN application master?
    *   c) To **negotiate resources with the resource manager**.
        *   The application master negotiates for resources from the resource manager.
*   Question 12: When calculating executor memory, what should be excluded from the total memory allocated?
    *   c) **Memory overhead for internal system processes**.
        *   Executor memory should exclude the memory overhead used for internal system processes.
*   Question 13: If an executor has five cores and 20 GB of memory, how much memory is allocated per core?
    *   b) **4 GB**.
        *   With 20 GB of memory for five cores, each core has 4 GB of memory (20/5 = 4).
*   Question 14: Cluster configuration: 3 nodes, each with 16 cores and 48 GB RAM. After subtracting resources for OS and YARN daemons, how many cores are left per node?
    *   b) **15 cores**.
        *   After subtracting one core per node for OS and YARN, 15 cores remain (16 - 1 = 15).
*   Question 15: In the above scenario, after also reserving resources for the application master, what are the total available cores in the cluster?
    *   c) **44**.
        *   With 15 cores per node across three nodes, there are 45 total cores. After subtracting one core for the application master, 44 cores are available (15 * 3 - 1 = 44).
