# 02 - Introduction to Spark

### Apache Spark

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/Apache_Spark_logo-f31fc9de-9456-4351-b459-2fe24f92628b.png)

Apache Spark is an open-source distributed cluster-computing system designed to be fast and general-purpose.

*Notes*

*Created in 2009 at Berkeley's AMPLab by Matei Zaharia, the Spark codebase was donated in 2013 to the Apache Software Foundation. It has since become one of its most active projects.*

*Spark provides high-level APIs in Scala, Java, Python and R, and an optimized execution engine. On top of this engine sit higher-level tools including Spark SQL, MLlib, GraphX and Spark Streaming.*

---

**Numbers everyone should know**

> Remember that there is a "computer" in "computer science".
— Peter Norvig

- CPU ≈ 1 ns
- Memory ≈ 100 ns
- Disk ≈ 20 μs
- Network ≈ 150 ms

*Notes*

*Before we dive into Spark, it's important to understand some computer science fundamentals, in particular the latency numbers every programmer should know, as listed by Peter Norvig in his famous [Teach yourself programming in 10 years](http://norvig.com/21-days.html#answers).*
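To get a feel for these orders of magnitude, one common trick is to rescale them so that a single CPU operation lasts one second. The sketch below does this with the approximate figures from the slide; the dictionary and the helper names are purely illustrative.

```python
# Approximate latencies from the slide, expressed in nanoseconds.
LATENCY_NS = {
    "CPU": 1,
    "Memory": 100,
    "Disk": 20_000,          # 20 μs
    "Network": 150_000_000,  # 150 ms
}


def relative_to_cpu(latencies):
    """Return how many times slower each operation is than a CPU operation."""
    cpu = latencies["CPU"]
    return {name: ns / cpu for name, ns in latencies.items()}


def humanize(seconds):
    """Render a duration in seconds using the largest unit that fits."""
    for unit, size in (("years", 365 * 24 * 3600), ("days", 24 * 3600),
                       ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.0f} seconds"


# If one CPU operation took 1 second, how long would the others feel?
for name, factor in relative_to_cpu(LATENCY_NS).items():
    print(f"{name:<8} {humanize(factor)}")
```

On that scale a memory access takes a couple of minutes, a disk access several hours and a network round trip several years, which is why the rest of this course keeps coming back to where the data lives.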

---

### Hadoop vs Spark

- Faster through In-Memory computation
- Simpler (high-level APIs) and execution engine optimisation

*Notes*

**Faster through In-Memory computation**

*Because memory access is much faster than disk access (see previous slide), Spark's in-memory computation makes it much faster than Hadoop MapReduce, which writes intermediate results to disk between steps.*

**Simpler (high-level APIs) and execution engine optimisation**

*Spark's high-level APIs combined with lazy computation mean we don't have to optimize each query by hand. Spark's execution engine takes care of building an optimized physical execution plan.*

*Also, code you write in "local" mode will work on a cluster "out-of-the-box" thanks to Spark's high-level APIs.*

*That doesn't mean it will be easy to write Spark code, but Spark makes it much easier to write optimized code that will run at big data scale.*
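To make "lazy computation" more concrete, here is a toy model in plain Python. It is not Spark's API or implementation: `LazyDataset` is a hypothetical class written for this example, although its method names deliberately mirror the `map`, `filter` and `collect` operations found on Spark datasets. Transformations only record a step in a plan, and nothing runs until an action asks for a result.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not Spark's actual API)."""

    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan  # tuple of ("map" | "filter", function) steps

    def map(self, fn):
        # Transformation: only record the step and return a new dataset.
        return LazyDataset(self._data, self._plan + (("map", fn),))

    def filter(self, predicate):
        # Transformation: same idea, nothing is computed here.
        return LazyDataset(self._data, self._plan + (("filter", predicate),))

    def collect(self):
        # Action: the whole plan is known, so it can run in a single pass.
        result = []
        for item in self._data:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                result.append(item)
        return result


calls = []


def square(x):
    calls.append(x)  # record every real invocation
    return x * x


ds = LazyDataset(range(5)).map(square).filter(lambda x: x % 2 == 0)
print(len(calls))    # 0: nothing has been computed yet
print(ds.collect())  # [0, 4, 16]
print(len(calls))    # 5: the work happened during collect()
```

Because the full plan is known before anything executes, an engine is free to reorder or fuse steps. Here the toy version simply runs every step in one pass over the data.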

*Links*

- The [official documentation](http://spark.apache.org/docs/latest/)

---

**The need for distributed storage**

If compute is distributed, every machine needs access to the data. Without distributed storage, that would be **very tedious**.

Unlike Hadoop, Spark doesn't come with its own file system, but it can interface with many existing storage systems, such as the Hadoop Distributed File System (HDFS), Cassandra, Amazon S3 and many more.

*Notes*

*Spark also supports a pseudo-distributed local mode (for development or testing purposes). In this case, Spark runs on a single machine, typically with one worker thread per CPU core, and distributed file storage is not required.*

---

**Spark mechanics**

> At a high level, every Spark application consists of a driver program that launches various parallel operations on a cluster. The driver program contains your application's main function and defines distributed datasets on the cluster, then applies operations on them.
- Learning Spark, page 14

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/cluster-overview-273ddf73-9063-47bb-9060-e094443700eb.png)

Source: [https://spark.apache.org/docs/latest/cluster-overview.html](https://spark.apache.org/docs/latest/cluster-overview.html)

*Notes*

*Interesting notes on cluster: [https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-cluster.html](https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-cluster.html)*
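The driver/executor split described above can be mimicked on a single machine. The sketch below is only an analogy built on Python's standard `concurrent.futures` module, with threads standing in for executors; the function names are made up for this example and say nothing about how Spark itself is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Cut a list into roughly equal, contiguous chunks."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def task(partition):
    """Work done by one 'executor' on its own partition."""
    return sum(x * x for x in partition)


def driver(data, num_workers=4):
    """Plays the role of the driver: partition, dispatch, combine."""
    partitions = split_into_partitions(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(task, partitions))
    return sum(partial_results)


print(driver(list(range(1, 101))))  # 338350
```

The driver never touches individual elements. It only decides how the data is split, hands out tasks and merges the partial results.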

---

**DAG Scheduling**

To distribute the computation among the worker nodes, Spark transforms the logical execution plan into a physical execution plan (how the computation will actually take place). While doing so, it picks a plan that maximizes performance, in particular by avoiding moving data across the network because, as we've seen, network latency is the worst.

*Notes*

*You can take a look at [Spark Basics : RDDs,Stages,Tasks and DAG](https://medium.com/@goyalsaurabh66/spark-basics-rdds-stages-tasks-and-dag-8da0f52f0454) but this covers concepts we haven't seen yet.*
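To see why the choice of physical plan matters, consider a word count spread over two nodes. The sketch below compares two hypothetical plans that give the same answer but move different amounts of data. It is a plain Python illustration of the idea, not a description of Spark's scheduler, and the function names are invented for the example.

```python
from collections import Counter

# Each inner list stands for the data held by one worker node.
partitions = [
    ["spark", "hadoop", "spark", "spark"],
    ["hadoop", "spark", "flink", "spark"],
]


def shuffle_raw(partitions):
    """Naive plan: send every (word, 1) pair over the network."""
    return [(word, 1) for part in partitions for word in part]


def shuffle_combined(partitions):
    """Better plan: pre-aggregate on each node, then send one pair per word."""
    records = []
    for part in partitions:
        records.extend(Counter(part).items())
    return records


def reduce_counts(records):
    """Final aggregation, done after the data has been moved."""
    totals = Counter()
    for word, count in records:
        totals[word] += count
    return dict(totals)


raw = shuffle_raw(partitions)
combined = shuffle_combined(partitions)

assert reduce_counts(raw) == reduce_counts(combined)  # same answer either way
print(len(raw), "records moved without local aggregation")   # 8
print(len(combined), "records moved with local aggregation")  # 5
```

The saving is small here, but on real data with many repeated keys, aggregating locally before moving anything can cut network traffic by orders of magnitude.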

---

### The Spark Stack

One of Spark's promises is to deliver a unified analytics system. On top of its powerful distributed processing engine (Spark Core) sits a collection of higher-level libraries that all benefit from improvements to the core library.

*Notes*

*That's broadly true, but it comes with some caveats. In particular, Spark Streaming's performance can't rival that of Storm and Flink.*

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/spark-stack-oreilly-674376df-ecdf-45f2-8ef7-539393568c0e.png)

Source: Learning Spark (O'Reilly - Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia)

**Spark Core**

Spark Core is the underlying general execution engine for the Spark platform, on top of which all other functionality is built.

It provides many core functionalities such as task dispatching and scheduling, memory management and basic I/O functionalities, exposed through an application programming interface.

**Spark SQL**

Spark module for structured data processing

Spark SQL provides a programming abstraction called DataFrame and can also act as a distributed SQL query engine.

Although they're both called "DataFrames", Spark's DataFrames are quite different from those of pandas that you might be familiar with.

---

At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features (such as Scala’s pattern matching and quasi quotes) to build an extensible query optimizer.

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/Catalyst-Optimizer-diagram-152974c4-e1fc-4bb5-a788-c1ee71657ecd.png)

Source: [https://databricks.com/glossary/catalyst-optimizer](https://databricks.com/glossary/catalyst-optimizer)

*Links*

- [What is Spark SQL](https://databricks.com/glossary/what-is-spark-sql)
- [Deep dive into Spark SQL's Catalyst Optimizer](https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html)
- [SparkSqlAstBuilder](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-SparkSqlAstBuilder.html)
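To give an idea of what a rule-based optimizer such as Catalyst does, the sketch below represents an expression as a tree and applies one textbook rewrite rule, constant folding. This is a simplified illustration only: the `Literal`, `Attribute` and `Add` classes are defined here for the example, and Python's `isinstance` checks stand in for the Scala pattern matching mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def fold_constants(expr):
    """Rewrite the tree bottom-up, replacing Add(Literal, Literal) by a Literal."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    return expr  # literals and attributes are left untouched


# x + (1 + 2)  becomes  x + 3
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
print(fold_constants(tree))
```

A real optimizer chains many such rules and applies them repeatedly until the tree stops changing. The point here is only that each rule is a small, local tree transformation.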

---

**GraphX**

Spark module for graph computations

Unlike Spark SQL, GraphX is built on RDDs rather than on DataFrames.

GraphX is a graph computation engine built on top of Spark that enables users to interactively build, transform and reason about graph-structured data at scale. It comes complete with a library of common algorithms.

**MLlib**

Spark module for Machine Learning

Machine Learning library for Spark, influenced by scikit-learn (in particular, its pipeline system).

Historically an RDD-based API, MLlib now comes with a DataFrame-based API that has become the primary API, while the RDD-based API is in [maintenance mode](https://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api).

**Spark Streaming**

Spark module for stream processing

Streaming, also called stream processing, is used to query a continuous data stream and process the data within a short time of receiving it.

*Notes*

*Spark Streaming uses Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same application code written for batch analytics to be used in streaming analytics. It comes at the cost of latency: each event has to wait for its whole mini-batch to be processed, while alternatives like Apache Storm and Apache Flink process data event by event and achieve lower latency.*
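The mini-batch idea can be sketched in a few lines of plain Python. This is a simplified model, not Spark Streaming's API: `micro_batches` is a hypothetical helper that assumes events arrive ordered by timestamp, and `word_count` is ordinary batch code that gets reused unchanged on each mini-batch.

```python
def word_count(records):
    """Ordinary batch logic: count words in a list of text lines."""
    counts = {}
    for line in records:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def micro_batches(events, batch_interval):
    """Group (timestamp, payload) events into consecutive time windows.

    Assumes events arrive ordered by timestamp (in seconds). The first
    window starts at the first event's timestamp.
    """
    batch, window_end = [], None
    for timestamp, payload in events:
        if window_end is None:
            window_end = timestamp + batch_interval
        while timestamp >= window_end:
            yield batch  # may be empty if no event fell in this window
            batch, window_end = [], window_end + batch_interval
        batch.append(payload)
    if batch:
        yield batch


events = [
    (0.1, "spark streaming"),
    (0.7, "spark"),
    (1.2, "flink"),
    (3.4, "storm spark"),
]

# The same batch function is reused on every mini-batch of the stream.
for batch in micro_batches(events, batch_interval=1.0):
    print(word_count(batch))
```

An event arriving right after a window opens is only processed once that window closes, so the batch interval puts a floor on latency. That is the trade-off described in the note above.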

*Links*

- [*A Gentle Introduction to Stream Processing*](https://medium.com/stream-processing/what-is-stream-processing-1eadfca11b97)

---

**Mixed language**

Apache Spark is written in Scala and runs on the Java Virtual Machine (JVM). It can be used from Scala (primary), Java, Python (PySpark) and R.

*Notes*

*Because Spark is written in Scala, PySpark, the Python interface, tends to follow Scala's conventions, whether for small details like naming (PySpark's API is frequently inconsistent with standard Python good practices, for example using camelCase instead of snake_case) or for the overall programming paradigm, such as functional programming.*

*The functional paradigm is particularly well suited to distributed computing, as it relies on concepts like immutability.*
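A small plain-Python sketch shows why immutability helps in a distributed setting. The data and function names below are invented for the example. Because the input is never modified and the functions have no side effects, a task can be re-run anywhere and always gives the same result.

```python
from functools import reduce

# An immutable input: tuples cannot be modified in place.
partition = (3, 1, 4, 1, 5, 9, 2, 6)


def double(x):
    """Pure function: the result depends only on the argument."""
    return 2 * x


def process(part):
    """Build new values at each step instead of mutating the input."""
    doubled = tuple(map(double, part))
    large = tuple(filter(lambda x: x > 4, doubled))
    return reduce(lambda acc, x: acc + x, large, 0)


first_attempt = process(partition)
# Pretend the worker crashed: the task can simply be run again elsewhere,
# because the input was never altered and the functions have no side effects.
second_attempt = process(partition)

assert first_attempt == second_attempt
print(first_attempt)  # 54
```

With mutable shared state, a retried or duplicated task could see data already half-modified by a previous attempt. Avoiding mutation removes that whole class of problems.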