### grp

# Spark: The Definitive Guide

## PART 4: Production Applications

## dataPaths

In [1]:
online = '/Users/grp/sparkTheDefinitiveGuide/data/retail-data/all/online-retail-dataset.csv'
flights2105 = '/Users/grp/sparkTheDefinitiveGuide/data/flight-data/csv/2015-summary.csv'

## _Chapter #15 - How Spark Runs on a Cluster_

### Spark Application Architecture:
-  Spark Driver:
    -  process controlling execution of Spark Application
    -  requests resources from cluster manager
    -  maintains state of the application running on cluster
-  Spark Executors:
    -  processes that perform the _tasks_ assigned by Spark Driver
    -  reports back _task_ state (success or failure) to Spark Driver
-  Cluster Manager:
    -  manages physical cluster of machines running Spark Application
    -  contains resources that Spark Application requests
    -  YARN, Mesos, Spark Standalone
    -  _edge (gateway) nodes_ are machines that are not co-located on the cluster
    -  Spark Driver process exists in Application Master ("cluster driver")
-  SparkSession:
    -  entry point to programming Spark
    -  combines separate contexts [SparkContext and SQLContext] from Spark 1.x

### Execution Modes:
-  Cluster mode:
    -  submit scripts (.jar, .py, .r) to cluster manager
    -  CM launches driver process on worker node in cluster as well as executor processes across worker nodes
-  Client mode:
    -  Spark driver remains on the client machine that submitted the application
    -  client machine is responsible for maintaining the Spark driver process; cluster manager maintains the executor processes
-  Local mode:
    -  runs entire Spark Application on a single machine while achieving parallelism through threads on that single machine
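
As a rough illustration of how the three execution modes differ at submission time, the sketch below builds the matching `spark-submit` command line. The helper name `spark_submit_command` and the file name `main.py` are hypothetical; only the `--master` and `--deploy-mode` flags come from Spark itself.

```python
import shlex

def spark_submit_command(app, mode, master="yarn", app_args=()):
    """Build a spark-submit command line for one of the three execution modes."""
    if mode == "local":
        # Local mode: driver and executors run as threads on this single machine.
        options = ["--master", "local[*]"]
    elif mode in ("client", "cluster"):
        # Client mode keeps the driver on the submitting machine;
        # cluster mode asks the cluster manager to launch it on a worker node.
        options = ["--master", master, "--deploy-mode", mode]
    else:
        raise ValueError(f"unknown execution mode: {mode!r}")
    # shlex.join quotes arguments such as local[*] so the shell does not expand them.
    return shlex.join(["spark-submit", *options, app, *app_args])

print(spark_submit_command("main.py", "cluster"))
print(spark_submit_command("main.py", "client"))
print(spark_submit_command("main.py", "local"))
```

The only difference between client and cluster mode on the command line is the value of `--deploy-mode`; where the driver ends up running is decided by that one flag.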

## The Life Cycle of a _Spark Application_ (Outside Spark => Infrastructure):
1.  submit application (pre-compiled jar or library)
2.  application makes request to _CM Driver_ node asking for resources for _Spark Driver Process_
3.  _Spark Driver Process_ is placed on node within the cluster
4.  code starts running and _SparkSession_ initializes a _Spark Cluster_ [driver + executors] / communicates with CM to orchestrate _Spark Executor Processes_
5.  CM launches _Spark Executor Processes_ and returns information back to the _Spark Driver Process_
6.  _Spark Cluster_ is now in session
7.  _Spark Application_ is running with _Spark Driver_ scheduling tasks onto each worker and each worker responds back to _Spark Driver_ with status of tasks assigned (success or failure)
8.  _CM_ shuts down _Spark Executors_ within _Spark Cluster_ for _Spark Driver_ when application has completed

## The Life Cycle of a _Spark Application_ (Inside Spark => Internal Code Process):
-  all Spark code compiles down to RDDs

    1.  create _SparkSession_ for _Spark Application_
    2.  compile transformation(s)
    3.  trigger action(s)

#### Spark Job(s):
-  each application is made up of 1 or more _Spark Jobs_
-  _Spark Jobs_ are executed serially unless multiple actions are launched in parallel from separate threads
-  an _Action_ triggers the execution of a _Spark Job_ [**individually consists of _Spark Stages_ and _Spark Tasks_**]
-  generally 1 _Spark Job_ per _Spark Action_

#### Spark Stage(s):
-  each _Spark Job_ breaks down into a series of _Spark Stages_ [**# of stages depends on how many shuffle operations need to take place**]

#### Spark Task(s):
-  _Spark Stages_ contain groups of _Spark Tasks_ [**compute operations on multiple machines**]:
-  engine starts new _Spark Stages_ after operations called _Spark Shuffles_ [**physical re-partitioning of data**]:
-  each _Spark Task_ corresponds to a **combination of blocks of data and a set of transformations that will run on a single executor**
-  number of partitions = number of tasks (ex: 1,000 small partitions means 1,000 tasks that can be executed in parallel)
-  more partitions mean more parallelism
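
The partition-to-task relationship can be made concrete with a little arithmetic. This is a minimal sketch, assuming each task needs one core (the default `spark.task.cpus` of 1) and that all tasks take about the same time; the function name and the cluster sizes are made up for illustration.

```python
import math

def task_waves(num_partitions, num_executors, cores_per_executor):
    """Return (tasks, waves): one task per partition, run in batches of free cores."""
    slots = num_executors * cores_per_executor  # tasks that can run at the same time
    tasks = num_partitions                      # one task per partition
    waves = math.ceil(tasks / slots)            # rounds needed to finish the stage
    return tasks, waves

print(task_waves(1000, 10, 4))   # (1000, 25)
print(task_waves(1000, 250, 4))  # (1000, 1)
```

With 40 cores, the 1,000 tasks run in 25 rounds; they only run fully in parallel once the cluster offers at least 1,000 cores.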

#### Pipelining:
-  _Spark Stages_ and _Spark Tasks_ are _Pipelined_ via map operations
-  _Pipelining_ performs data dependent operations under the hood by collapsing them into single stage of tasks

#### Shuffle Persistence:
-  occurs when operations have to move data across nodes
-  all shuffle operations will write data to disk for stable storage to use across multiple jobs


#### spark.sql.shuffle.partitions:
-  rule of thumb:
    -  number of partitions > number of executors in cluster
        -  cluster mode:
            -  recommended to set according to the number of cores in cluster to ensure efficient execution
        -  local mode:
            -  recommended to set to low value since single machine cannot execute many tasks in parallel
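
One way to apply the rule of thumb is to derive the setting from the resources at hand and pass it at submit time. The helper below is a hypothetical sketch: the function name is invented and `local_partitions=5` is only an illustrative low value, not a Spark default (the default is 200).

```python
def shuffle_partitions_conf(total_cores, local_mode=False, local_partitions=5):
    """Return a spark-submit --conf flag following the rule of thumb above."""
    # Local mode: keep the value low, a single machine cannot run many tasks at once.
    # Cluster mode: start from the number of cores available in the cluster.
    value = local_partitions if local_mode else total_cores
    return f"--conf spark.sql.shuffle.partitions={value}"

print(shuffle_partitions_conf(total_cores=64))
print(shuffle_partitions_conf(total_cores=8, local_mode=True))
```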

### _Chapter #15 Exercises (Spark Application)_

### _Spark Submit Example_

In [2]:
'''
./bin/spark-submit \
    --class <main-class> \
    --master <master-url> \
    --deploy-mode cluster \
    --conf <key>=<value> \
    # other options
    <application.jar> \
    [application-arguments]
'''


### _SparkSession (Manual Approach) Application Example_

In [3]:
'''
from pyspark.sql import SparkSession

spark = SparkSession\
.builder\
.master("local")\
.appName("Word Count")\
.config("spark.some.config.option", "some-value")\
.getOrCreate()
'''


### _SparkContext (Old Method) Example_

In [4]:
'''
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
'''


### _Spark Code Internal Execution Example_

In [5]:
df1 = spark.range(2, 10000000, 2)
df2 = spark.range(2, 10000000, 4)
step1 = df1.repartition(5)
step12 = df2.repartition(6)
step2 = step1.selectExpr("id * 5 as id")
step3 = step2.join(step12, ["id"])
step4 = step3.selectExpr("sum(id)")

step4.collect() # 2500000000000

[Row(sum(id)=2500000000000)]
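
The result can be cross-checked without Spark. The sketch below re-computes the same pipeline in plain Python: multiply every id of the first range by 5, keep the values that also appear in the second range (the inner join), and sum them. Repartitioning changes where rows live, not which rows exist, so it is ignored here.

```python
# Pure-Python re-computation of step1 .. step4 (no Spark needed).
left = range(2, 10_000_000, 2)   # ids of df1
right = range(2, 10_000_000, 4)  # ids of df2

# step2 multiplies every id by 5; the inner join keeps ids present on both sides.
# `x in range(...)` is a constant-time arithmetic check for integers.
total = sum(i * 5 for i in left if i * 5 in right)

print(total)  # 2500000000000
```

Ids are unique on both sides, so every match contributes exactly one row to the sum.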

#### Stage 1 w/ 8 Tasks [created DF; range function by default has 8 partitions]
#### Stage 2 w/ 8 Tasks [created DF; range function by default has 8 partitions]
#### Stage 3 w/ 6 Tasks [changed # of partitions to 6 by shuffling data]
#### Stage 4 w/ 5 Tasks [changed # of partitions to 5 by shuffling data]
#### Stage 5 w/ 200 Tasks [computed shuffle join with default 200 partitions; spark.sql.shuffle.partitions]
#### Stage 6 w/ 1 Task [final aggregation of the partial sums in a single partition]

In [6]:
step4.explain()

== Physical Plan ==
*(7) HashAggregate(keys=[], functions=[sum(id#6L)])
+- Exchange SinglePartition
   +- *(6) HashAggregate(keys=[], functions=[partial_sum(id#6L)])
      +- *(6) Project [id#6L]
         +- *(6) SortMergeJoin [id#6L], [id#2L], Inner
            :- *(3) Sort [id#6L ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(id#6L, 200)
            :     +- *(2) Project [(id#0L * 5) AS id#6L]
            :        +- Exchange RoundRobinPartitioning(5)
            :           +- *(1) Range (2, 10000000, step=2, splits=8)
            +- *(5) Sort [id#2L ASC NULLS FIRST], false, 0
               +- Exchange hashpartitioning(id#2L, 200)
                  +- Exchange RoundRobinPartitioning(6)
                     +- *(4) Range (2, 10000000, step=4, splits=8)
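
The stage breakdown listed earlier can be read off this plan. As a heuristic sketch (it assumes a simple tree-shaped plan with no broadcast or reused exchanges), counting the `Exchange` operators gives the number of shuffles, and the number of stages is one more than that. The plan text is copied from the output above.

```python
import re

# Physical plan copied from the step4.explain() output above.
PLAN = """\
*(7) HashAggregate(keys=[], functions=[sum(id#6L)])
+- Exchange SinglePartition
   +- *(6) HashAggregate(keys=[], functions=[partial_sum(id#6L)])
      +- *(6) Project [id#6L]
         +- *(6) SortMergeJoin [id#6L], [id#2L], Inner
            :- *(3) Sort [id#6L ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(id#6L, 200)
            :     +- *(2) Project [(id#0L * 5) AS id#6L]
            :        +- Exchange RoundRobinPartitioning(5)
            :           +- *(1) Range (2, 10000000, step=2, splits=8)
            +- *(5) Sort [id#2L ASC NULLS FIRST], false, 0
               +- Exchange hashpartitioning(id#2L, 200)
                  +- Exchange RoundRobinPartitioning(6)
                     +- *(4) Range (2, 10000000, step=4, splits=8)
"""

# Every Exchange operator is a shuffle, and a shuffle ends one stage and starts the next.
exchanges = [line.split("Exchange", 1)[1].strip()
             for line in PLAN.splitlines() if "Exchange" in line]
stage_count = len(exchanges) + 1

# Partition counts requested by repartition() and by the shuffle join.
repartition_widths = [int(n) for n in re.findall(r"RoundRobinPartitioning\((\d+)\)", PLAN)]
join_widths = [int(n) for n in re.findall(r"hashpartitioning\([^,]+, (\d+)\)", PLAN)]

print(len(exchanges), stage_count)      # 5 6
print(repartition_widths, join_widths)  # [5, 6] [200, 200]
```

Note that the `*(n)` prefixes in the plan are whole-stage code generation ids, not stage numbers, which is why they run up to 7 while the job has 6 stages.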


## _Chapter #16 - Developing Spark Applications_

-  Spark Applications consist of:
    -  Spark Cluster
    -  code
-  Spark Scala Applications:
    -  build applications via JVM-based build tools:
        -  sbt:
            -  specify "build.sbt" containing package information:
                -  project metadata (package name, version, information, etc.)
                -  where to resolve dependencies
                -  library dependencies
            -  either use "sbt assembly" to build a single "uber" .JAR containing all dependencies or ...
            -  use "sbt package" to gather all dependencies into target folder (won't package all in 1 .JAR)
        -  Apache Maven
-  Spark Python Applications:
    -  execute .py scripts since Spark doesn't have a build method for Python
    -  possible to package multiple Python files into egg or ZIP files of Spark code via --py-files argument which:
        -  adds .py, .zip, .egg files to be distributed with application
-  Testing Spark Applications:
    -  things to keep in mind ... :
        -  input data resilience
        -  business logic resilience
        -  output data resilience
-  Spark Unit Testing:
    -  JUnit
    -  ScalaTest
-  Spark Development Process:
    -  interactive applications (initial development) => shell
    -  production applications (submit to cluster) => spark-submit
-  Launching Spark Applications:
    -  client mode or cluster mode:
        -  cluster mode is recommended to reduce latency between executors and driver
        -  however, the client mode node is sometimes a part of the cluster
    -  _Table 16.1 => Spark Submit Help_
    -  _Table 16.2 => Spark Deployment Configuration_
-  Spark Configuration:
    -  SparkConf:
        -  manages all of the application's configurations
        -  controls how the Spark Application runs and how the Spark Cluster is configured
-  Spark Application Properties:
    -  set via:
        -  spark-submit during launch
        -  code within Spark Application
    -  confirm parameters via Spark UI "Environment" tab
    -  _Table 16.3 => Spark Application Properties_
    -  _Runtime Properties => http://spark.apache.org/docs/latest/configuration.html#runtime-environment _
    -  _Execution Properties => http://spark.apache.org/docs/latest/configuration.html#execution-behavior _
    -  _Memory Properties => http://spark.apache.org/docs/latest/configuration.html#memory-management _
    -  _Shuffle Properties => http://spark.apache.org/docs/latest/configuration.html#shuffle-behavior _
    -  Environment Variables:
        -  JAVA_HOME =>  location where java is installed
        -  PYSPARK_PYTHON => python binary executable location for pyspark in driver and workers
        -  PYSPARK_DRIVER_PYTHON => python binary executable location for pyspark in driver only
        -  SPARKR_DRIVER_R => r binary executable for sparkR
        -  SPARK_LOCAL_IP => ip address of the machine to bind
        -  SPARK_PUBLIC_DNS => hostname spark program will advertise to other machines
-  Spark Application Job Scheduling:
    -  runs in FIFO (first in first out) fashion
    -  set _spark.scheduler.mode_ to FAIR to enable a fair scheduler

### _Chapter #16 Exercises (Spark Application Development)_

### _Spark Scala Application Build Example_

In [7]:
'''
// Package Information

name := "example" // change to project name
organization := "com.databricks" // change to your org
version := "0.1-SNAPSHOT"
scalaVersion := "2.11.8"

// Spark Information
val sparkVersion = "2.1.0"

// allows us to include spark packages
resolvers += "bintray-spark-packages" at
  "https://dl.bintray.com/spark-packages/maven/"

resolvers += "Typesafe Simple Repository" at
  "http://repo.typesafe.com/typesafe/simple/maven-releases/"

resolvers += "MavenRepository" at
  "https://mvnrepository.com/"

libraryDependencies ++= Seq(
  // spark core
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,

  // spark-modules
  "org.apache.spark" %% "spark-graphx" % sparkVersion,
  // "org.apache.spark" %% "spark-mllib" % sparkVersion,

  // spark packages
  "graphframes" % "graphframes" % "0.4.0-spark2.1-s_2.11",

  // testing
  "org.scalatest" %% "scalatest" % "2.2.4" % "test",
  "org.scalacheck" %% "scalacheck" % "1.12.2" % "test",

  // logging
  "org.apache.logging.log4j" % "log4j-api" % "2.4.1",
  "org.apache.logging.log4j" % "log4j-core" % "2.4.1"
)
'''


### _Spark Scala Application Directory Structure Example_

In [8]:
'''
src/
    main/
        resources/
            <files to include in main jar here>
        scala/
            <main Scala sources>
        java/
            <main Java sources>
    test/
        resources/
            <files to include in test jar here>
        scala/
            <test Scala sources>
        java/
            <test Java sources>
'''


### _Spark Scala Application SparkSession Initializer Example_

In [9]:
'''
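import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

object DataFrameExample extends Serializable {
    def main(args: Array[String]): Unit = {

        // first application argument: folder that contains data.json
        val pathToDataFolder = args(0)

        // start up the SparkSession
        // along with explicitly setting a given config
        val spark = SparkSession.builder().appName("Spark Example")
            .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
            .getOrCreate()

        // udf registration
        // (someUDF: String => String is assumed to be defined elsewhere in the project)
        spark.udf.register("myUDF", someUDF(_: String): String)
        val df = spark.read.json(pathToDataFolder + "data.json")

        // collect() brings the aggregated rows back to the driver
        val manipulated = df.groupBy(expr("myUDF(group)")).sum().collect()
        manipulated.foreach(x => println(x))

    }
}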
'''


### _Running Spark Scala Application Example_

In [10]:
'''
$SPARK_HOME/bin/spark-submit \
--class com.databricks.example.DataFrameExample \
--master local \
target/scala-2.11/example_2.11-0.1-SNAPSHOT.jar "hello"
'''


### _Spark Python Application Execution Example_

In [11]:
'''
# in Python
from __future__ import print_function
if __name__ == '__main__':
    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
        .master("local") \
        .appName("Word Count") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
        
    print(spark.range(5000).where("id > 500").selectExpr("sum(id)").collect())
'''



### _Running Spark Python Application Example_

In [12]:
'''
$SPARK_HOME/bin/spark-submit --master local pyspark_template/main.py
'''

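
When the application spans several Python files, the extra modules can be zipped and shipped with `--py-files`. The following is a minimal sketch of building such an archive with the standard library; the package name `helpers`, its contents, and the function name are all hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path

def build_py_files_zip(package_dir, zip_path):
    """Zip every .py file under package_dir, keeping the package name as the top folder."""
    package_dir = Path(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for source in sorted(package_dir.rglob("*.py")):
            # Store paths relative to the parent so `import helpers` works from the zip.
            archive.write(source, source.relative_to(package_dir.parent).as_posix())
    return zip_path

# Demo with a throwaway package (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "helpers"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "cleaning.py").write_text("def tidy(s):\n    return s.strip()\n")
    archive_path = build_py_files_zip(pkg, Path(tmp) / "helpers.zip")
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()

print(names)  # ['helpers/__init__.py', 'helpers/cleaning.py']
```

The archive would then be passed along at submission, for example `$SPARK_HOME/bin/spark-submit --master local --py-files helpers.zip pyspark_template/main.py`.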

### _Spark Submit Help Example_

In [13]:
'''
> spark-submit --help
Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files
                              in executors can be accessed via SparkFiles.get(fileName).

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Cluster deploy mode only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.
'''


### _SparkConf Example_

In [14]:
'''
from pyspark import SparkConf
conf = SparkConf().setMaster("local[2]").setAppName("DefinitiveGuide")\
  .set("some.conf", "to.some.value")
'''


## _Chapter #17 - Deploying Spark_

### Cluster Managers:
-  cluster manager documentation => http://spark.apache.org/docs/latest/cluster-overview.html
-  manages set of machines to deploy Spark Applications including:
    -  Standalone:
        -  built specifically for Spark workloads
        -  only runs Spark framework
        -  run multiple Spark Applications on the same cluster
        -  environment variables documentation => http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts
        -  set --master to the master URI (spark://HOST:PORT)
    -  YARN:
        -  set --master to 'yarn'
        -  deployment modes:
            -  cluster mode:
                -  Spark Driver process managed by YARN
                -  client can exit after creating application
                -  YARN picks a machine (may not be the machine used to manually execute the application) as Master
            -  client mode:
                -  Spark Driver process runs in client process
                -  YARN is only responsible for granting executor resources to application
                -  YARN does not maintain Spark Driver
        -  enable HDFS read/write for Spark:
            -  set HADOOP_CONF_DIR in $SPARK_HOME/conf/spark-env.sh to location containing hdfs-site.xml & core-site.xml
        -  YARN configurations documentation => http://spark.apache.org/docs/latest/running-on-yarn.html#configuration
    -  Mesos:
        -  abstracts cpu, memory, storage and other resources from machines (physical and virtual)
        -  uses coarse-grained mode:
            -  each Spark executor runs as a single Mesos task
        -  supports client and cluster mode
        -  Mesos configurations documentation => http://spark.apache.org/docs/latest/running-on-mesos.html#configuration
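
Each cluster manager expects a different `--master` value. The sketch below maps the three managers to the URL formats shown in the `spark-submit --help` output; the function name, host, and port values are placeholders chosen for illustration.

```python
def master_url(cluster_manager, host=None, port=None):
    """Return the --master value for a given cluster manager."""
    manager = cluster_manager.lower()
    if manager == "yarn":
        # YARN needs no host: the cluster location is read from HADOOP_CONF_DIR.
        return "yarn"
    if manager in ("standalone", "mesos"):
        if host is None or port is None:
            raise ValueError(f"{manager} needs both host and port")
        scheme = "spark" if manager == "standalone" else "mesos"
        return f"{scheme}://{host}:{port}"
    raise ValueError(f"unsupported cluster manager: {cluster_manager!r}")

print(master_url("standalone", "10.0.0.5", 7077))  # spark://10.0.0.5:7077
print(master_url("mesos", "10.0.0.5", 5050))       # mesos://10.0.0.5:5050
print(master_url("yarn"))                          # yarn
```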

### Spark Cluster Deployment Options:
-  On Premise Cluster:
    -  pros:
        -  secure private datacenters
        -  full control over hardware to optimize workloads
    -  cons:
        -  fixed cluster size
        -  resource sharing
        -  elastic resource demands (OP cluster may not have enough horsepower to support ML/data analytics)
        -  operate own storage system (ex: HDFS [distributed file system]; Cassandra [key-value store])
        -  setup of georeplication and disaster recovery
-  Public Cloud Cluster:
    -  pros:
        -  provide each application its own cluster
        -  customize cluster size per job to optimize cost performance
        -  launch/shut down resources elastically
        -  utilize GPUs for DL jobs
        -  elastic low cost storage (ex: AWS; Azure; GCP)
    -  cons:
        -  fixed cluster size and file system lacks elasticity (ex: EMR)
    -  alternative:
        -  recommended to use global storage systems (ex: S3, Azure Blob, GCS) decoupled from specific cluster
        -  **decouple compute and storage to spin up machines dynamically for each Spark workload**
-  Secure Deployment Configuration:
    -  security configurations documentation => http://spark.apache.org/docs/latest/configuration.html#security
-  Cluster Networking Configuration:
    -  networking configurations documentation => https://spark.apache.org/docs/latest/configuration.html#networking

### Spark Application Scheduling:
-  each Spark Application runs an independent set of executor processes
-  Spark has the ability via configuration to use a Fair Scheduler to schedule resources within each application
-  use Static Partitioning of resources if multiple users share your cluster and run different Spark Applications:
    -  static partitioning allocates a maximum amount of resources each application can utilize
-  job scheduling configurations documentation => https://spark.apache.org/docs/latest/configuration.html#scheduling

### Spark Dynamic Allocation:
-  allows applications to scale resources up and down dynamically based on workload needs
-  disabled by default except in some vendor distributions (Cloudera, Hortonworks)
-  configure via parameters:
    -  _spark.dynamicAllocation.enabled_ to true
    -  _spark.shuffle.service.enabled_ to true
-  dynamic allocation configurations documentation => https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
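
A possible way to keep these settings in one place is to generate the corresponding `spark-defaults.conf` lines. This is a sketch under assumptions: the two `enabled` properties come from the notes above, while the executor bounds (`minExecutors`, `maxExecutors`) are optional Spark properties added here for illustration, with made-up values.

```python
def dynamic_allocation_properties(min_executors, max_executors):
    """Properties that switch on dynamic allocation with explicit executor bounds."""
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.shuffle.service.enabled": "true",
        # Optional bounds; the values passed in are illustrative, not recommendations.
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }

def to_spark_defaults(properties):
    """Format properties as whitespace-separated key/value lines (spark-defaults.conf style)."""
    width = max(len(key) for key in properties)
    return "\n".join(f"{key.ljust(width)}  {value}" for key, value in properties.items())

print(to_spark_defaults(dynamic_allocation_properties(2, 20)))
```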

### _Chapter #17 Exercises (Spark Application Deployment)_

### _Spark Standalone Deployment By Hand Example_:

In [15]:
# how to start cluster by hand? :
    #1 - start cluster manager master process => $SPARK_HOME/sbin/start-master.sh
    #2 - master prints URI => spark://HOST:PORT
    #3 - log into each worker machine & start node w/ URI => $SPARK_HOME/sbin/start-slave.sh <master-spark-URI>
    #4 - submit applications via spark-submit => spark://URI of master

### _Spark Standalone Deployment By Automated Scripts Example_:

In [16]:
# how to start cluster by automated scripts? :
        #1 - create file called conf/slaves in Spark directory containing all hostnames of machines
        #2 - launch and stop cluster via shell scripts available in $SPARK_HOME/sbin

# $SPARK_HOME/sbin/start-master.sh
    # Starts a master instance on the machine on which the script is executed.
# $SPARK_HOME/sbin/start-slaves.sh
    # Starts a slave instance on each machine specified in the conf/slaves file.
# $SPARK_HOME/sbin/start-slave.sh
    # Starts a slave instance on the machine on which the script is executed.
# $SPARK_HOME/sbin/start-all.sh
    # Starts both a master and a number of slaves as described earlier.
# $SPARK_HOME/sbin/stop-master.sh
    # Stops the master that was started via the sbin/start-master.sh script.
# $SPARK_HOME/sbin/stop-slaves.sh
    # Stops all slave instances on the machines specified in the conf/slaves file.
# $SPARK_HOME/sbin/stop-all.sh
    # Stops both the master and the slaves as described earlier.      

## _Chapter #18 - Monitoring and Debugging_

### Spark Monitoring Landscape:
-  Spark Application & Jobs:
    -  Spark UI
    -  Spark Logs
-  JVM:
    -  where Spark Executors run
    -  utilities:
        -  jstack => provides stack traces
        -  jmap => creates heap-dumps
        -  jstat => reports time-series stats
        -  jconsole => visually explores many JVM properties
        -  jvisualvm => can be used to help profile Spark jobs
-  OS/Machine:
    -  host OS where JVMs run
    -  monitor health of CPU, network, I/O
    -  available tools like dstat, iostat, iotop
-  Cluster:
    -  depends on CM
    -  things like YARN UI, Ganglia, and Prometheus are options

### Spark Monitoring Components:
-  Processes running application (CPU usage, memory usage, etc.):
    -  keep an eye on driver [**state where application lives**]
    -  also monitor state of executors
-  Query Execution inside process (jobs / stages / tasks):
    -  Spark Logs
    -  Spark UI:
        -  ui configurations documentation => https://spark.apache.org/docs/latest/configuration.html#spark-ui; https://spark.apache.org/docs/latest/monitoring.html#spark-configuration-options
        -  tabs:
            -  Jobs => shows Spark Jobs
            -  Stages => shows Spark Stages and their associated Tasks
            -  Storage => shows cached data in Spark Application
            -  Environment => shows configuration information and current settings of Spark Application
            -  Executors => shows information about each Spark Executor running Spark Application
            -  SQL => shows Structured API queries (SQL / DFs)
    -  Spark REST API:
        -  another method to access Spark's status and metrics => http://localhost:4040/api/v1
        -  useful for custom built reporting solutions
        -  rest api monitoring documentation => https://spark.apache.org/docs/latest/monitoring.html#rest-api
    -  Spark History Server:
        -  stores historical Spark logs
        -  configure application to store event logs to a certain location:
            -  _spark.eventLog.enabled_
            -  _spark.eventLog.dir_
        -  runs as a standalone application
        -  history server configurations documentation => https://spark.apache.org/docs/latest/monitoring.html#spark-configuration-options 
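
For custom reporting, the REST API mentioned above can be queried with nothing more than the standard library. The sketch below assumes a driver UI at the default `http://localhost:4040`; the helper names are hypothetical, and the sample payload is hand-written to mimic the shape of a jobs response rather than captured from a real run.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:4040/api/v1"  # default driver UI address (assumption)

def endpoint(*parts, base=BASE_URL):
    """Join path segments onto the REST API base URL."""
    return "/".join([base.rstrip("/"), *parts])

def get_json(url, timeout=5):
    """Fetch and decode a JSON document; needs a running Spark application."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)

def failed_jobs(jobs):
    """Return the ids of jobs whose status is FAILED."""
    return [job["jobId"] for job in jobs if job.get("status") == "FAILED"]

# Hand-written sample in the shape of /applications/<app-id>/jobs.
sample_jobs = [
    {"jobId": 0, "name": "collect", "status": "SUCCEEDED"},
    {"jobId": 1, "name": "count", "status": "FAILED"},
]

print(endpoint("applications"))  # http://localhost:4040/api/v1/applications
print(failed_jobs(sample_jobs))  # [1]
```

Against a live application, `get_json(endpoint("applications"))` lists the running applications, and each application id can then be used to request its `jobs`, `stages`, or `executors`.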

### Spark Debugging - Common Problems:
1.  Spark Jobs Not Starting:
    -  resources requested not available
    -  cluster misconfiguration
2.  Errors Before Execution:
    -  typo / incorrect column name
    -  network disconnection
3.  Errors During Execution:
    -  bad input data:
        -  null values
        -  incorrect schema
        -  row of data doesn't match schema
4.  Slow Tasks or Stragglers:
    -  data is partitioned unevenly across cluster => increase # of partitions to have less data per partition
    -  not enough memory allocated to executors => allocate more memory
    -  machine may have a hardware problem / disk full
5.  Slow Aggregations:
    -  data skews with keys being grouped => increase # of partitions prior to aggregation
    -  "empty" values => change to _null_
6.  Slow Joins:
     -  "empty" values => change to _null_
     -  inefficient join type => optimize with another join type / filter data first then adjust join order
     -  data skews => partition prior to joining / increase executor memory
7.  Slow Reads and Writes:
    -  bad network connectivity
8.  Driver OutOfMemoryError or Driver Unresponsive / GC Messages:
    -  too much data being collected back to driver (runs out of memory) => increase driver memory allocation
9.  Executor OutOfMemoryError or Executor Unresponsive / GC Messages:
    -  executors crash => increase executor memory and # of executors
    -  python memory problem => increase PySpark worker size
    -  garbage collection errors => repartition data to increase parallelism to reduce amount of records per task and ensure executors are getting same amount of work to process
    -  "empty" values => change to _null_
    -  avoid using UDFs if possible
10.  Unexpected Nulls in Results:
    -  volatile data
11.  No Space Left on Disk Errors:
    -  not enough space => add more disk space
12.  Serialization Errors:
    -  data cannot be serialized => usually via RDDs or UDFs
    -  default serialization => change to Kryo serialization
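
Several of the problems above (stragglers, slow aggregations, slow joins) trace back to uneven partitions. A rough sketch of a skew check on per-partition record counts follows. The function names and the 2.0 threshold are assumptions chosen for illustration, not Spark settings, and the counts are made up.

```python
from statistics import mean

def skew_ratio(partition_sizes):
    """Return max/mean of per-partition record counts (1.0 means perfectly even)."""
    sizes = list(partition_sizes)
    if not sizes or max(sizes) == 0:
        return 1.0
    return max(sizes) / mean(sizes)

def is_skewed(partition_sizes, threshold=2.0):
    """Flag a layout whose largest partition exceeds `threshold` times the average."""
    return skew_ratio(partition_sizes) > threshold

# In PySpark the counts could come from: df.rdd.glom().map(len).collect()
# The numbers below are invented for illustration.
even = [1000, 1010, 990, 1000]
skewed = [100, 120, 90, 3690]
print(skew_ratio(even), is_skewed(even))
print(skew_ratio(skewed), is_skewed(skewed))
```

A high ratio suggests repartitioning (possibly on a different key) before the aggregation or join.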

### _Chapter #18 Exercises (Spark Application Monitoring)_

In [17]:
spark.read\
.option("header", "true")\
.csv(online)\
.repartition(2)\
.selectExpr("instr(Description, 'GLASS') >= 1 as is_glass")\
.groupBy("is_glass")\
.count()\
.collect()

[Row(is_glass=None, count=1454),
 Row(is_glass=True, count=12861),
 Row(is_glass=False, count=527594)]

In [18]:
'''
# Spark UI metrics:

// SQL TAB:
    // Summary Statistics => metrics for query
    // DAG of Spark Stages => each BLUE box represents a Spark Stage of Spark Tasks
    // Spark Job => entire group of Spark Stages represent a Spark Job

// JOBS TAB:
    // shows Spark Jobs
    // shows Spark Stages within Job ID
    // shows Spark Tasks within Stage ID
    // Summary Metrics => monitoring statistics (be on the lookout for outliers / distribution of values)
    // Aggregated Metrics by Executor => examine Spark Executor performance
    // Show Additional Metrics => provides more advanced metrics
'''


### _Spark Log Level Example_

In [19]:
# set the log level to INFO so more detailed messages appear in the logs
spark.sparkContext.setLogLevel("INFO")

## _Chapter #19 - Performance Tuning_

### Discussed Topics:
-  DFs vs RDDs:
    -  recommended to use Scala/Java for RDDs
    -  Python expensively serializes data to and from Python process when running RDD code
-  RDD Object Serialization:
    -  set _spark.serializer_ to _org.apache.spark.serializer.KryoSerializer_
-  Dynamic Allocation:
    -  dynamically allocate resources
    -  set _spark.dynamicAllocation.enabled_ to _true_
-  Scheduling:
    -  set _spark.scheduler.mode_ to _FAIR_
    -  FAIR provides better sharing of resources across multiple users
    -  set --total-executor-cores (or _spark.cores.max_) to cap the total # of executor cores the application may use (the book refers to this setting as --max-executor-cores)
    -  capping cores ensures the application won't consume all the resources on the cluster
-  Data Storage Format:
    -  favor structured binary type for frequent access
    -  Apache Parquet is one of the most efficient formats
-  Splittable File Types / Compression:
    -  "splittable" => different tasks can read different parts of the file in parallel
    -  use compression like gzip, snappy
    -  **try to keep individual read files no larger than a few hundred MB for "splittable" purpose**
-  Table Partitioning:
    -  stores files in separate directories based on a key (column)
    -  improves speed, filtering, and spread of data across cluster
-  Bucketing:
    -  allows Spark to "pre-partition" data according to how joins or aggregations are performed
    -  helps with partitioning / prevent shuffle before join / data access speed
-  Number of Files:
    -  avoid many small files and many large files
    -  **rule of thumb is to aim for each written file to be around a few tens of MB**
    -  _maxRecordsPerFile_ => controls how many records go into each file
-  Data Locality:
    -  specifies a preference for certain nodes that hold certain data
    -  avoids exchanging blocks of data over the network
    -  local storage (ex: HDFS) marked as "local" in Spark UI tasks
-  Table Statistics:
    -  table level stats => ANALYZE TABLE tableName COMPUTE STATISTICS
    -  column level stats => ANALYZE TABLE tableName COMPUTE STATISTICS FOR COLUMNS column1, column2, etc.
-  Shuffle Configurations (Spark's External Shuffle Service):
    -  helps increase performance because nodes can read shuffle data from remote machines even when the executors on those machines are busy
    -  too few output partitions lead to only a few nodes doing work and possible data skews
    -  too many output partitions lead to overhead from launching each task
    -  **aim for around a few tens of MB of data per output partition in shuffle**
-  OOM / GC:
    -  gather statistics via _spark.executor.extraJavaOptions_ => -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
    -  logs located on cluster's worker nodes (in stdout files)
-  Parallelism:
    -  increase parallelism to speed up stage
    -  recommendations:
        -  use 2 or 3 tasks per CPU core in cluster if stage processes large amount of data:
            -  set _spark.default.parallelism_
            -  tune _spark.sql.shuffle.partitions_ => according to number of cores in cluster
-  Filtering:
    -  always filter as early as possible
-  Repartitioning / Coalescing:
    -  repartition:
        -  incurs a shuffle
        -  optimizes execution/parallelism by load balancing data across cluster
        -  helpful for joins/cached data
    -  coalesce:
        -  reduces # of partitions without a shuffle by merging partitions on the same node
-  UDFs:
    -  expensive operations
    -  force representing data as objects in JVM
-  Temporary Data Storage (Caching):
    -  _Table 19.1 => Data Cache Storage Levels_
    -  reuse same dataset over and over
    -  places DF, table, RDD into temporary storage (memory or disk) across executors in cluster
    -  helps with faster reads
    -  negative impacts => incurs serialization, deserialization, storage costs
    -  lazy operation / only cached when used (action)
    -  default cache is in memory
    -  RDD Cache:
        -  physical data (bits) is cached as object
    -  DF Cache:
        -  physical plan is cached => physical plan is stored as key and performs lookup prior to execution of Structured job
    -  Storage Levels:
        -  MEMORY_ONLY (default)
        -  MEMORY_AND_DISK
        -  MEMORY_ONLY_SER
        -  MEMORY_AND_DISK_SER
        -  DISK_ONLY
        -  MEMORY_ONLY_2
        -  MEMORY_AND_DISK_2
        -  OFF_HEAP
        -  https://spark.apache.org/docs/latest/rdd-programming-guide.html#which-storage-level-to-choose
-  Joins:
    -  equi joins are best when possible
    -  cartesian joins and full outer joins should be avoided when possible
    -  bucketing helps with avoiding shuffles prior to joins
-  Aggregations:
    -  via RDDs use reduceByKey when possible over groupByKey
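
Several of the settings named above can be kept in one place and rendered as `spark-submit` arguments. This is a sketch under assumptions: `to_submit_args` and `my_app.py` are hypothetical names, and the value 200 for _spark.sql.shuffle.partitions_ is only a placeholder to be tuned per cluster.

```python
import shlex

# Settings named in the notes above. The shuffle partition count is a
# placeholder value; tune it according to the cores in the cluster.
tuning_conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.dynamicAllocation.enabled": "true",
    "spark.scheduler.mode": "FAIR",
    "spark.sql.shuffle.partitions": "200",
}

def to_submit_args(conf):
    """Render a conf dict as repeated `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

# my_app.py is a hypothetical application script.
command = ["spark-submit"] + to_submit_args(tuning_conf) + ["my_app.py"]
print(" ".join(shlex.quote(part) for part in command))
```

The same key/value pairs could instead be applied in code through `SparkSession.builder.config(key, value)`. Dynamic allocation usually also needs the external shuffle service (_spark.shuffle.service.enabled_) to be available on the cluster.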

### Best Spark Performance Advice In A Nutshell To Prioritize:
1.  read as little data as possible through partitioning and efficient binary formats
2.  sufficient parallelism and no data skews on cluster using partitioning
3.  use high-level Structured APIs for optimized code
4.  utilize monitoring tools (ex: Spark UI) to efficiently and effectively troubleshoot and optimize Spark Jobs
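
The two rules of thumb from this chapter (2 or 3 tasks per CPU core, and a few tens of MB per shuffle output partition) can be combined into a rough sizing helper. This is a sketch only: `suggest_partitions` is a hypothetical name, 50 MB is one assumed reading of "a few tens of MB", and the cluster figures are invented.

```python
import math

def suggest_partitions(total_cores, data_size_mb, tasks_per_core=3, target_mb=50):
    """Pick a partition count from two rules of thumb and return the larger one.

    tasks_per_core: the "2 or 3 tasks per CPU core" guideline.
    target_mb: an assumed reading of "a few tens of MB" per partition.
    """
    by_cores = total_cores * tasks_per_core
    by_size = math.ceil(data_size_mb / target_mb)
    return max(by_cores, by_size, 1)

# Hypothetical cluster: 16 cores processing about 10 GB of shuffle data.
print(suggest_partitions(total_cores=16, data_size_mb=10_000))
# Small input on the same cluster: the core-based rule dominates.
print(suggest_partitions(total_cores=16, data_size_mb=500))
```

The result is a starting point for _spark.sql.shuffle.partitions_ or `repartition(n)`; the Spark UI remains the way to confirm whether tasks are actually balanced.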

### _Chapter #19 Exercises (Spark Application Tuning)_

### _Garbage Collection Tuning Example_

In [20]:
'''
memory management in the JVM:
Java heap space is divided into two regions: Young and Old. The Young generation is meant
to hold short-lived objects whereas the Old generation is intended for objects with longer
lifetimes.

The Young generation is further divided into three regions: Eden, Survivor1, and Survivor2.

garbage collection procedure:
1. When Eden is full, a minor garbage collection is run on Eden and objects that are alive from
Eden and Survivor1 are copied to Survivor2.
2. The Survivor regions are swapped.
3. If an object is old enough or if Survivor2 is full, that object is moved to Old.
4. Finally, when Old is close to full, a full garbage collection is invoked. This involves tracing
through all the objects on the heap, deleting the unreferenced ones, and moving the others to
fill up unused space, so it is generally the slowest garbage collection operation.

The goal of garbage collection tuning in Spark is to ensure that only long-lived cached datasets are
stored in the Old generation and that the Young generation is sufficiently sized to store all short-lived
objects.

If a full garbage collection is invoked multiple times before a task completes, it means that there isn’t enough memory
available for executing tasks, so you should decrease the amount of memory Spark uses for caching
(spark.memory.fraction).

Try the G1GC garbage collector with -XX:+UseG1GC. It can improve performance in some situations
in which garbage collection is a bottleneck and you don’t have a way to reduce it further by sizing the
generations. Note that with large executor heap sizes, it can be important to increase the G1 region
size with -XX:G1HeapRegionSize.
'''

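
A small sketch of assembling a value for _spark.executor.extraJavaOptions_ from the flags mentioned in these notes. The helper name `gc_java_options` is hypothetical; the range check reflects that G1 region sizes are powers of two between 1 MB and 32 MB.

```python
def gc_java_options(use_g1=True, region_size_mb=None, log_gc=True):
    """Build a value for spark.executor.extraJavaOptions from the flags in the notes."""
    opts = []
    if log_gc:
        # GC logging flags from the OOM / GC notes (Java 8 style).
        opts += ["-verbose:gc", "-XX:+PrintGCDetails", "-XX:+PrintGCTimeStamps"]
    if use_g1:
        opts.append("-XX:+UseG1GC")
        if region_size_mb is not None:
            # G1 region sizes are powers of two between 1 MB and 32 MB.
            is_power_of_two = (region_size_mb & (region_size_mb - 1)) == 0
            if not (1 <= region_size_mb <= 32 and is_power_of_two):
                raise ValueError("region_size_mb must be a power of two from 1 to 32")
            opts.append(f"-XX:G1HeapRegionSize={region_size_mb}m")
    return " ".join(opts)

print(gc_java_options(region_size_mb=16))
```

On newer JVMs (Java 9 and later) the `-XX:+PrintGC*` flags are superseded by unified logging such as `-Xlog:gc*`, so the logging part may need adjusting for the JVM in use.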

### _Data Cache Storage Levels Example_

In [21]:
'''
MEMORY_ONLY:
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions
will not be cached and will be recomputed on the fly each time they’re needed. This is the default level.

MEMORY_AND_DISK:
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the
partitions that don’t fit on disk, and read them from there when they’re needed.

MEMORY_ONLY_SER (Java and Scala):
Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient
than deserialized objects, especially when using a fast serializer, but more CPU-intensive to
read.

MEMORY_AND_DISK_SER (Java and Scala):
Similar to MEMORY_ONLY_SER, but spill partitions that don’t fit in memory to disk instead of recomputing
them on the fly each time they’re needed.

DISK_ONLY:
Store the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.:
Same as the previous levels, but replicate each partition on two cluster nodes.

OFF_HEAP (experimental):
Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to
be enabled.
'''

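
The storage levels above differ only in a handful of flags. The lookup table below is a plain-Python illustration, not the PySpark API: the `Level` tuple and `describe` helper are hypothetical, and the flag values are written to mirror the Scala `StorageLevel` constants as I understand them.

```python
from collections import namedtuple

# Field order mirrors the Scala StorageLevel constructor arguments.
Level = namedtuple("Level", "use_disk use_memory use_off_heap deserialized replication")

STORAGE_LEVELS = {
    "MEMORY_ONLY":         Level(False, True,  False, True,  1),
    "MEMORY_AND_DISK":     Level(True,  True,  False, True,  1),
    "MEMORY_ONLY_SER":     Level(False, True,  False, False, 1),
    "MEMORY_AND_DISK_SER": Level(True,  True,  False, False, 1),
    "DISK_ONLY":           Level(True,  False, False, False, 1),
    "MEMORY_ONLY_2":       Level(False, True,  False, True,  2),
    "MEMORY_AND_DISK_2":   Level(True,  True,  False, True,  2),
    "OFF_HEAP":            Level(True,  True,  True,  False, 1),
}

def describe(name):
    """Summarize where a level keeps data, in what form, and how many copies."""
    lvl = STORAGE_LEVELS[name]
    places = (("memory", lvl.use_memory), ("disk", lvl.use_disk), ("off-heap", lvl.use_off_heap))
    where = "+".join(label for label, used in places if used)
    form = "deserialized" if lvl.deserialized else "serialized"
    return f"{name}: {where}, {form}, x{lvl.replication}"

for name in ("MEMORY_ONLY", "MEMORY_AND_DISK_2", "DISK_ONLY"):
    print(describe(name))
```

In PySpark a level is applied with `df.persist(StorageLevel.MEMORY_AND_DISK)` after `from pyspark import StorageLevel`; note that the `_SER` variants are listed above as Java and Scala only.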

### _Cache DF Example_

In [22]:
# Original loading code that does *not* cache DataFrame
DF1 = spark.read.format("csv")\
.option("inferSchema", "true")\
.option("header", "true")\
.load(flights2105)

DF2 = DF1.groupBy("DEST_COUNTRY_NAME").count().collect() # refers to original file
DF3 = DF1.groupBy("ORIGIN_COUNTRY_NAME").count().collect() # refers to original file
DF4 = DF1.groupBy("count").count().collect() # refers to original file

DF1.cache()
DF1.count()

DF2 = DF1.groupBy("DEST_COUNTRY_NAME").count().collect() # refers to new cached data in memory
DF3 = DF1.groupBy("ORIGIN_COUNTRY_NAME").count().collect() # refers to new cached data in memory
DF4 = DF1.groupBy("count").count().collect() # refers to new cached data in memory
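
To see whether caching actually helps, the same action can be timed before and after `DF1.cache()`. The `timed` helper below is a hypothetical sketch; it uses a stand-in workload so it runs without a cluster.

```python
import time

def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start

# Stand-in workload so the sketch runs anywhere; with the cell above it could be:
#   timed(lambda: DF1.groupBy("DEST_COUNTRY_NAME").count().collect())
result, seconds = timed(lambda: sum(range(100_000)))
print(result, f"{seconds:.4f}s")
```

Once the cached data is no longer needed, `DF1.unpersist()` releases the storage on the executors.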
