
## Spark Intro
▪ Apache Spark is one of the most actively developed open-source engines for data processing on
computer clusters.
▪ This engine is now a standard data processing tool for developers and data scientists who want
to work with big data.
▪ Spark supports several common languages (Python, Java, Scala, and R) and includes libraries for
a variety of tasks, from ETL and streaming to machine learning and graph processing.
▪ Spark can run anywhere from a single laptop to a cluster of thousands of servers. It can be
deployed on Mesos, on Hadoop via YARN, or on Spark’s own cluster manager.

## What is Spark?
▪ Spark is an open-source framework that processes large amounts of unstructured,
semi-structured, and structured data for analytics.
▪ Spark’s architecture is regarded as an alternative to Hadoop and MapReduce architectures for
big data processing.
▪ It is built on two abstractions: the Resilient Distributed Dataset (RDD), which holds the data,
and the Directed Acyclic Graph (DAG), which describes how that data is processed.
▪ The Spark architecture consists of four components: the Spark driver, executors, the cluster
manager, and worker nodes.
▪ It uses Datasets and DataFrames as its fundamental data abstractions, which lets Spark optimize
big data computation.

## Why Spark?
▪ Spark is used to apply transformations on Big Data
▪ Two scenarios in which it is particularly useful:
• When data is too large (Big Data)
• When we want to accelerate a calculation


## Spark Features
The following are the main features of Spark:
▪ Speed
Spark can run up to 100 times faster than MapReduce when processing large amounts of data
in memory. It can also split the data into partitions in a controlled way.
▪ Powerful Caching
A simple programming layer offers powerful caching and disk persistence capabilities.
▪ Deployment
Spark can be deployed on Mesos, on Hadoop via YARN, or on its own cluster manager.
▪ Real-Time
Because it processes data in memory, it offers near-real-time computation with low latency.
▪ Polyglot
Spark provides APIs for Java, Scala, Python, and R, so you can write Spark code in any of these
four languages. Spark also provides an interactive shell in Scala and Python.


## Spark vs. MapReduce
The following are the main differences between Spark and MapReduce:
▪ Spark is faster because it processes data in RAM (memory), while Hadoop MapReduce reads and
writes files to HDFS (on disk) between steps
▪ Spark is optimized for better parallelism, CPU utilization, and faster startup
▪ Spark has a richer functional programming model
▪ Spark is especially useful for iterative algorithms

![image.png](attachment:image.png)

## Spark Architecture
At a high level, the architecture of an Apache Spark application consists of the following components:
▪ Driver
▪ Cluster Manager
▪ Worker Nodes
▪ Executors


## Driver
▪ The Driver Program is a process that runs the main() function of the application and creates
the SparkContext object.
▪ Spark applications run as independent sets of processes on a cluster, coordinated by the
SparkContext.
▪ To run on a cluster, the SparkContext connects to one of several types of cluster manager and
then performs the following tasks:
• It acquires executors on nodes in the cluster.
• Then, it sends your application code to the executors. Here, the application code is
defined by the JAR or Python files passed to the SparkContext.
• Finally, the SparkContext sends tasks to the executors to run.
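
The three steps above can be loosely sketched in plain Python. This is only an analogy, not the Spark API: a thread pool stands in for the executors (real executors are separate processes on worker nodes), and `run_job`, `partitions`, and `num_executors` are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(partitions, task, num_executors=2):
    """Toy 'driver': acquire workers, hand each one a task, gather the results."""
    # Step 1: "acquire executors" (here: threads in a pool).
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        # Steps 2 and 3: ship the task function and run it once per partition.
        # pool.map returns results in the same order as the input partitions.
        return list(pool.map(task, partitions))


partitions = [[1, 2, 3], [4, 5], [6]]
partial_sums = run_job(partitions, sum)  # one partial result per partition
total = sum(partial_sums)                # the driver combines the partial results
print(partial_sums, total)
```

The driver itself does little computation here: it splits the work, dispatches it, and combines what comes back.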

## Cluster Manager
▪ The role of the cluster manager is to allocate resources across applications. Spark can run
on several kinds of clusters.
▪ Supported cluster managers include Hadoop YARN, Apache Mesos, and the Standalone
Scheduler.
▪ The Standalone Scheduler is Spark’s built-in cluster manager, which makes it possible to install
Spark on an empty set of machines.

## Worker Nodes
▪ A worker node is a subordinate node in the cluster.
▪ Its role is to run the application code in the cluster.

## Executors
▪ An executor is a process launched for an application on a worker node.
▪ It runs tasks and keeps data in memory or disk storage across them.
▪ It reads data from and writes data to external sources.
▪ Every application has its own executors.

## Spark Components
▪ The Spark project consists of different types of tightly integrated components.
▪ At its core, Spark is a computational engine that can schedule, distribute and monitor multiple
applications.
▪ The following are Spark components:
• Spark Core
• Spark SQL
• Spark Streaming
• MLlib
• GraphX


## Spark Core
▪ Spark Core is the heart of Spark and provides its core functionality.
▪ It holds the components for task scheduling, fault recovery, interacting with storage systems
and memory management.

## Spark SQL
▪ Spark SQL is built on top of Spark Core. It provides support for structured data.
▪ It lets you query data via SQL (Structured Query Language) as well as HQL (Hive Query
Language), the Apache Hive variant of SQL.
▪ It supports JDBC and ODBC connections, which link Spark to existing databases, data
warehouses, and business intelligence tools.
▪ It also supports various sources of data like Hive tables, Parquet, and JSON.

## Spark Streaming
▪ Spark Streaming is a Spark component that supports scalable and fault-tolerant processing of
streaming data.
▪ It uses Spark Core's fast scheduling capability to perform streaming analytics.
▪ It accepts data in mini-batches and performs RDD transformations on that data.
▪ Its design ensures that the applications written for streaming data can be reused to analyze
batches of historical data with little modification.
▪ The log files generated by web servers are a real-world example of a data
stream.
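
The mini-batch idea, and the claim that streaming logic can be reused on historical data, can be illustrated with a small pure-Python sketch. It does not use the Spark Streaming API; `micro_batches`, `count_errors`, and the sample log lines are hypothetical and exist only for this example.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut a (possibly unbounded) iterator of events into small lists."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


def count_errors(lines):
    """Ordinary batch logic: count log lines with HTTP status 500."""
    return sum(1 for line in lines if " 500 " in line)


# Made-up web server log lines used as the "stream".
log_lines = [
    "GET /index.html 200 12ms",
    "GET /api/items 500 87ms",
    "POST /api/items 201 40ms",
    "GET /api/users 500 91ms",
    "GET /health 200 1ms",
]

# Streaming use: apply the batch function to each mini-batch as it arrives.
per_batch = [count_errors(b) for b in micro_batches(iter(log_lines), batch_size=2)]

# Historical use: the same function runs unchanged over the full data set.
historical_total = count_errors(log_lines)
print(per_batch, historical_total)
```

Because `count_errors` knows nothing about streaming, the same function serves both the live and the historical case.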


## MLlib
▪ MLlib is a machine learning library that contains various machine learning algorithms.
▪ These include correlations and hypothesis testing, classification and regression, clustering, and
principal component analysis.
▪ It is reported to be up to nine times faster than the disk-based implementation used by Apache Mahout.


## GraphX
▪ GraphX is a library for manipulating graphs and performing graph-parallel
computations.
▪ It lets you create a directed graph with arbitrary properties attached to each vertex and
edge.
▪ To manipulate graphs, it supports various fundamental operators such as subgraph, joinVertices,
and aggregateMessages.


## Spark Concepts
Spark Jobs / Spark Tasks
▪ A Spark job is composed of tasks.
▪ A Spark task is the actual unit of computation: it applies the job’s operations to one partition
of the data.
Actions & Transformations
▪ Every Spark operation is either a transformation or an action.
▪ If an operation returns a DataFrame, Dataset, or RDD, it is a transformation.
▪ If an operation returns anything else, or does not return a value at all, it is an action.
Lazy Evaluation
▪ Lazy evaluation is a technique commonly used for large-scale data processing.
▪ Lazy evaluation means that processing is triggered only when a Spark action is run, not when a
Spark transformation is declared.
▪ This allows Spark to prepare a logical and physical execution plan to perform the action
efficiently.
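
The split between transformations and actions can be mimicked in a few lines of plain Python. This is a toy model, not Spark code: `LazyDataset` is a hypothetical class that records transformations and only executes them when an action is called.

```python
from typing import Any, Callable, Iterable, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD/DataFrame: builds a plan, runs it only on an action."""

    def __init__(self, source: Iterable[Any], plan: Optional[List[Callable]] = None):
        self._source = source
        self._plan = list(plan or [])  # recorded steps, not yet executed

    # Transformations: return a new LazyDataset and do no work.
    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        step = lambda rows: (fn(r) for r in rows)
        return LazyDataset(self._source, self._plan + [step])

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        step = lambda rows: (r for r in rows if pred(r))
        return LazyDataset(self._source, self._plan + [step])

    # Actions: execute the recorded plan and return a plain Python value.
    def collect(self) -> list:
        rows = iter(self._source)
        for step in self._plan:
            rows = step(rows)
        return list(rows)

    def count(self) -> int:
        return len(self.collect())


calls = []


def double(x):
    calls.append(x)  # side effect that reveals when work really happens
    return x * 2


ds = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
work_before_action = len(calls)  # still 0: only a plan has been built
result = ds.collect()            # the action triggers the whole pipeline
print(work_before_action, result)
```

`map` and `filter` return another `LazyDataset`, so they are transformations. `collect` and `count` return ordinary values, so they are actions.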

## Spark Concepts
Wide and Narrow Transformations
▪ The Spark transformations are divided into:
• Wide Transformations: require a data shuffle between partitions and are therefore the most expensive.
• Narrow Transformations: do not require a data shuffle.
Maximize Parallelism In Spark
▪ Spark’s efficiency is based on processing several tasks in parallel.
▪ This is why optimizing a Spark job often means reading and processing as much data as
possible in parallel.
▪ And to achieve this goal, it is necessary to split a dataset into several partitions.
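
To make the narrow/wide distinction concrete, here is a minimal pure-Python sketch in which a data set is a list of partitions. It is an illustration under simplifying assumptions, not Spark’s implementation; `narrow_map` and `wide_group_by_key` are hypothetical helpers.

```python
import zlib
from collections import defaultdict


def narrow_map(partitions, fn):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[fn(row) for row in part] for part in partitions]


def wide_group_by_key(partitions, num_out):
    """Wide: rows with the same key must meet, so every row is routed by its key."""
    out = [defaultdict(list) for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            # A deterministic hash sends equal keys to the same output partition.
            target = zlib.crc32(key.encode("utf-8")) % num_out
            out[target][key].append(value)  # the "shuffle": data crosses partitions
    return [dict(groups) for groups in out]


partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
scaled = narrow_map(partitions, lambda kv: (kv[0], kv[1] * 10))
grouped = wide_group_by_key(scaled, num_out=2)
print(scaled)
print(grouped)
```

The narrow step could run on each partition independently and in parallel. The wide step needs the values for key "a" from both input partitions, which is why it requires data movement.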


## Spark Concepts
Partitions
▪ A partition is a logical chunk of your DataFrame.
▪ When reading files, the default maximum partition size is 128 MB.
▪ Methods to modify partitioning: repartition and coalesce
▪ Coalesce: can only reduce the number of partitions in a DataFrame; avoids a full shuffle
▪ Repartition: can either increase or decrease the number of partitions; does a full shuffle
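
The difference between the two methods can be sketched on plain Python lists. The functions below are simplified stand-ins that borrow the names `coalesce` and `repartition` for illustration; Spark’s actual placement strategy is more involved.

```python
def coalesce(partitions, n):
    """Merge whole neighbouring partitions; individual rows are not redistributed."""
    if n >= len(partitions):
        return [list(p) for p in partitions]  # coalesce cannot increase the count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Redistribute every row (a full shuffle), here round-robin, into n partitions."""
    out = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


parts = [[1, 2, 3, 4], [5], [6], [7, 8]]
fewer = coalesce(parts, 2)        # cheap, but partition sizes stay uneven
balanced = repartition(parts, 3)  # more expensive, but evenly sized
print(fewer)
print(balanced)
```

In this toy model `coalesce` only glues existing partitions together, while `repartition` touches every row, mirroring the no-shuffle versus full-shuffle trade-off.
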
Caching
▪ Spark can execute part of the DAG to store expensive intermediate results for downstream
operations.
▪ Like transformations, caching is lazy: it is applied only when an action is executed.
▪ Caching is appropriate if you use the same DataFrame multiple times (for example, in EDA or
with an ML model).
▪ Otherwise, you should not cache, because unnecessary caching degrades performance.
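
A minimal sketch of this behaviour in plain Python follows. `Node` is a hypothetical class, not a Spark type: calling `cache()` only marks the data set, and the result is stored the first time an action runs.

```python
class Node:
    """Toy lazily evaluated data set with optional caching."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function that produces the rows
        self._use_cache = False
        self._cached = None
        self.evaluations = 0     # how many times the expensive work actually ran

    def cache(self):
        self._use_cache = True   # only a marker; nothing is computed yet
        return self

    def collect(self):
        """Action: compute the rows, or reuse the stored result if caching is on."""
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        rows = list(self._compute())
        if self._use_cache:
            self._cached = rows
        return rows


def expensive():
    return (x * x for x in range(4))


uncached = Node(expensive)
uncached.collect()
uncached.collect()                                 # recomputed on every action

cached = Node(expensive).cache()
evaluations_after_cache_call = cached.evaluations  # 0: cache() did no work
first = cached.collect()                           # computed and stored
second = cached.collect()                          # served from the cache
print(uncached.evaluations, cached.evaluations)
```

The stored rows occupy memory for as long as they are kept, which is the cost that makes caching worthwhile only when the same result is reused.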

## Spark Concepts
Resilient Distributed Datasets (RDD)
▪ The RDD is Spark’s key data structure for computation.
▪ It acts as an interface to immutable, distributed data.
▪ It is resilient: data lost in a failure can be recomputed from the operations that produced it.
▪ There are two kinds of operations on RDDs: transformations and actions. Because RDDs are
immutable, a transformation returns a new RDD rather than modifying the existing one.
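
Recomputation after a failure can be illustrated with a toy model in plain Python. `ToyRDD` and `lose_partition` are hypothetical names for this sketch only; the point is that each data set remembers how to rebuild any of its partitions from its parent.

```python
class ToyRDD:
    """Immutable toy data set that remembers how it was derived (its lineage)."""

    def __init__(self, num_partitions, compute_partition):
        self.num_partitions = num_partitions
        self._compute_partition = compute_partition  # partition index -> list of rows
        self._materialized = {}                      # partitions currently "in memory"

    @classmethod
    def from_source(cls, partitions):
        frozen = [tuple(p) for p in partitions]      # the source data never changes
        return cls(len(frozen), lambda i: list(frozen[i]))

    def map(self, fn):
        # A transformation returns a new ToyRDD; the parent is left untouched.
        return ToyRDD(self.num_partitions, lambda i: [fn(x) for x in self.partition(i)])

    def partition(self, i):
        if i not in self._materialized:
            # Missing partition: rebuild it from the lineage.
            self._materialized[i] = self._compute_partition(i)
        return self._materialized[i]

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]

    def lose_partition(self, i):
        """Simulate a failure that drops one materialized partition."""
        self._materialized.pop(i, None)


base = ToyRDD.from_source([[1, 2], [3, 4]])
squares = base.map(lambda x: x * x)
before = squares.collect()
squares.lose_partition(1)   # "executor failure"
after = squares.collect()   # partition 1 is recomputed from base
print(before, after)
```

No replica of the lost partition is needed: the recorded transformation plus the unchanged parent data are enough to rebuild it.
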
Directed Acyclic Graph (DAG)
▪ The driver converts the program into a DAG for each job.
▪ The Apache Spark ecosystem includes various components such as Spark Core (the core API), Spark SQL,
Spark Streaming for real-time processing, MLlib, and GraphX.
▪ In the DAG, the nodes represent datasets and the edges represent the operations that connect them.
▪ You can read large volumes of data interactively using the Spark shell.
▪ You can also use the SparkContext to run or stop a job or task.
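
As a rough illustration of what a DAG of work looks like, the sketch below orders a small dependency graph with Python’s standard `graphlib` module (Python 3.9+). The step names are invented for this example, and this is not how Spark’s scheduler is implemented.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps it depends on.
# All step names are hypothetical.
dag = {
    "read_logs": set(),
    "parse": {"read_logs"},
    "filter_errors": {"parse"},
    "read_users": set(),
    "join": {"filter_errors", "read_users"},
    "count": {"join"},
}

# Because the graph has no cycles, a valid execution order always exists.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Any order produced this way runs a step only after everything it depends on. Steps with no path between them, such as `read_logs` and `read_users`, could run in parallel.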