# Apache Spark

## Unified engine for large-scale data analytics
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
![](https://spark.apache.org/images/spark-logo-trademark.png)

## History of the Apache Spark Project
[A Gentle Introduction to Apache Spark on Databricks](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346304/2168141618055043/484361/latest.html)

Spark was originally written by the founders of Databricks during their time at UC Berkeley. 

![](https://i.imgflip.com/8o8ag0.jpg)
[Nicsmeme](https://imgflip.com/i/8o8ag0)

The Spark project started in 2009, was open sourced in 2010, and in 2013 its code was donated to Apache, becoming Apache Spark. 

![](images/SparkTrends.png)

The employees of Databricks have written over 75% of the code in Apache Spark and have contributed more than 10 times more code than any other organization.

![](https://www.databricks.com/en-website-assets/static/0ba77ee7cfc7bfe140a683b947071484/19830.png)
https://www.databricks.com/spark/about

Apache Spark is a sophisticated distributed computation framework for executing code in parallel across many different machines. 

![](https://media.licdn.com/dms/image/D4D12AQG1hNjnq0uHdw/article-cover_image-shrink_720_1280/0/1686844447449?e=1719446400&v=beta&t=T1hG-QVH7brUS4xkrkh-mu7ommgnibbgPzfRKqehp24)

https://www.linkedin.com/pulse/exploring-world-distributed-computing-frameworks-empowering-nath/

While the abstractions and interfaces are simple, managing clusters of computers and ensuring production-level stability is not. Databricks makes big data simple by providing Apache Spark as a hosted solution.

**Everyone** sells Spark as a service in the cloud:

- https://cloud.google.com/solutions/spark
- https://aws.amazon.com/it/emr/features/spark/
- https://learn.microsoft.com/it-it/azure/hdinsight/spark/apache-spark-overview
- https://www.oracle.com/it/big-data/data-flow/


## The Genesis of Spark

## Hadoop (2004)

Doug Cutting:

> The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.

Hadoop is hardly the first unusual name to be attached to a tech company, of course. Google was born from a misspelling of "googol" (1 followed by 100 zeros), which itself was invented when a mathematician was playing with his nephew and together they came up with a name for really big numbers.

![](https://camo.githubusercontent.com/7b15805c76844c10826e2c17f20f644a0aa3201d87083e864441ed84d9681f5f/68747470733a2f2f666d2e636e62632e636f6d2f6170706c69636174696f6e732f636e62632e636f6d2f7265736f75726365732f696d672f656469746f7269616c2f323031332f30352f32332f3130303736323131302d6861646f6f702e353330783239382e6a70673f763d31333639373537303830)







Doug Cutting

[Source](https://www.cnbc.com/id/100769719)

![](https://minimalistquotes.com/wp-content/uploads/2022/08/simple-things-should-be-simple-and-complex-things-.jpg)


The question then became:
is there a way to make Hadoop and MapReduce (MR) simpler and faster?

# Development History

- Spark’s Early Years at AMPLab (2009) 

- First paper: 10-20x faster than MapReduce (2010)

- Spark 1.0 Released (2014)

- Spark 2.0: Unifying DataFrame and Dataset. Structured Streaming (2016)

- Spark 3.0: Hadoop 3.0 support, support for pandas, faster SQL engine (2020)

- Spark 3.4: [Introducing Spark Connect](https://spark.apache.org/docs/latest/spark-connect-overview.html) (2023)

- Spark 3.5: Spark Connect GA, distributed training with [DeepSpeed](https://www.deepspeed.ai/) (2023)

## At the end

![](https://i.imgflip.com/7dcyqm.jpg)
[NicsMeme](https://imgflip.com/i/7dcyqm)

# Design Philosophy

Spark’s design philosophy centers around four key characteristics:

- Speed
- Ease of use
- Modularity
- Extensibility

https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/ch01.html

## 1. Speed

![](https://cc-media-foxit.fichub.com/image/fox-it-mondofox/0177f439-3c0f-44ae-9803-c25f8bfac0dd/flash-vs-superman-game-2jpg-maxw-824.jpg)

https://www.reddit.com/r/DCcomics/comments/271ueb/the_definitive_answer_to_flash_vs_superman/

### Run workloads 100x faster.

![Logistic Regression](https://spark.apache.org/images/logistic-regression.png)

Apache Spark achieves high performance for both batch and streaming data, using:
- a state-of-the-art DAG scheduler
- a query optimizer
- a physical execution engine

## Why is Spark faster?

### 1. Hardware improvements

Today’s commodity servers come cheap, with hundreds of gigabytes of memory, multiple cores, and the underlying Unix-based operating system taking advantage of efficient multithreading and parallel processing.

![](https://5.imimg.com/data5/SELLER/Default/2023/10/353578866/VS/YQ/LQ/200119173/top-quality-trimmed-gold-ram-finger-scrap-5-tons-500x500.jpg)
[RAM SCRAP](https://m.indiamart.com/proddetail/top-quality-trimmed-gold-ram-finger-scrap-5-tons-2852693519773.html)

### 2. Directed Acyclic Graph (DAG) Scheduler and Query Optimizer

Spark builds its query computations as a DAG. The DAG scheduler and the query optimizer construct an efficient computational graph that can usually be decomposed into tasks that are executed in parallel across workers on the cluster.

![](https://www.researchgate.net/publication/336769100/figure/fig2/AS:817393752371221@1571893265396/Spark-DAG-for-a-WordCount-application-with-two-stages-each-consisting-of-three-tasks.png)

https://www.researchgate.net/publication/336769100_Artificial_neural_networks_based_techniques_for_anomaly_detection_in_Apache_Spark
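
The idea of decomposing a DAG into tasks that can run in parallel can be sketched in plain Python. This is only an illustration, not Spark's scheduler: the step names and the `parallel_waves` helper are hypothetical, and the sketch assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical word-count job. Each key is a step, each value is the set of
# steps it depends on. The names are illustrative, not Spark identifiers.
dag = {
    "read_part_0": set(),
    "read_part_1": set(),
    "read_part_2": set(),
    "count_part_0": {"read_part_0"},
    "count_part_1": {"read_part_1"},
    "count_part_2": {"read_part_2"},
    "merge_counts": {"count_part_0", "count_part_1", "count_part_2"},
}


def parallel_waves(graph):
    """Group steps into waves; steps in the same wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Every step whose dependencies are all finished is ready to run now
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        # Mark the whole wave as finished so dependent steps become ready
        sorter.done(*ready)
    return waves


waves = parallel_waves(dag)
for i, wave in enumerate(waves):
    print(f"wave {i}: {wave}")
```

Steps in the same wave have no dependencies on each other, so a cluster could in principle run them at the same time on different workers.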

## Ease of Use

![](http://www.quickmeme.com/img/4d/4d4759d82ce65de86834ff151bc8b419f89f4e2f0d003f10a54b236785e3e6d2.jpg)

### Modularity

Write applications quickly in Java, Scala, Python, R, and SQL.

Spark offers over 80 high-level operators that make it easy to build parallel apps. 

And you can use it **interactively** from the Scala, Python, R, and SQL shells.

#### Scala Example


```scala
val df = spark.read.json("logs.json")
df.where("age > 21").select("name.first").show()
```

## Generality

### Combine SQL, streaming, and complex analytics.

Spark powers a stack of libraries including:
- **Spark SQL**: module for working with structured data
- **Spark Streaming**: build streaming applications and pipelines
- **MLlib**: scalable machine learning library
- **GraphX**: API for graphs and graph-parallel computation

New:
- **Pandas API**: use pandas syntax on Spark
- **Spark Connect**: client applications that communicate with a remote Spark server


![](https://spark.apache.org/images/spark-stack.png)

You can combine these libraries seamlessly in the same application.

## Runs everywhere

![](https://images2.corriereobjects.it/methode_image/socialshare/2014/10/07/f143a1aa-4e22-11e4-b38c-5070a4632162.jpg)

https://www.corriere.it/foto-gallery/esteri/14_ottobre_07/nuovo-attrezzo-fare-sport-ruota-criceti-misura-d-uomo-809cb22a-4e22-11e4-b38c-5070a4632162.shtml

### Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. 

You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes.

![](https://spark.apache.org/images/spark-runs-everywhere.png)

### It can access diverse external data sources


#### Analyse
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets

#### Query
Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on using relational transformations and can also be used to create a temporary view. Registering a DataFrame as a temporary view allows you to run SQL queries over its data.

https://spark.apache.org/docs/latest/sql-data-sources.html

# In short


- Apache Spark is a fast and general-purpose cluster computing system. 

- It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. 

- It supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

# Spark (2024)


## Batch/streaming data

Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.

![](https://spark.apache.org/images/batch-sstreaming-data-icon.svg)

## SQL analytics

Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses.

![](https://spark.apache.org/images/sql-analytics-icon.svg)

## Data science at scale

Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.

![](https://spark.apache.org/images/data-science-scale-icon.svg)

## Machine learning

Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.

![](https://spark.apache.org/images/machine-learning-icon.svg)

## Running a basic Spark example in Docker

**nics vanilla**

~~Download from https://spark.apache.org/downloads.html into spark/setup
We are going to use Spark 3.4.0 Prebuilt for Hadoop 3.3 and later 
https://dlcdn.apache.org/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
~~

**official image for spark**

Since 27/06/23

https://github.com/apache/spark-docker


## SparkPI
https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py

The example estimates Pi with the Monte Carlo method: https://theabbie.github.io/blog/estimate-pi-using-random-numbers.html

Run sparkExamplePi.sh
```bash
docker run -it --rm apache/spark /opt/spark/bin/run-example SparkPi 10
```
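
To see what the example computes, here is a minimal single-machine sketch of the same Monte Carlo estimate in plain Python. It follows the shape of the linked `pi.py` (a map over random points followed by a sum), but it uses the built-in `map` and `functools.reduce` instead of an RDD. The function names and the fixed seed are assumptions made for illustration.

```python
import random
from functools import reduce
from operator import add


def inside_unit_circle(_):
    """Draw one random point in the square [-1, 1] x [-1, 1].

    Returns 1 if the point falls inside the unit circle, 0 otherwise.
    """
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0


def estimate_pi(n, seed=42):
    """Estimate Pi from n random points (seeded only to make runs repeatable)."""
    random.seed(seed)
    # "map" step: one 0/1 value per point; "reduce" step: sum them up
    hits = reduce(add, map(inside_unit_circle, range(n)))
    # The circle covers Pi/4 of the square, so Pi is about 4 * hits / n
    return 4.0 * hits / n


pi_estimate = estimate_pi(100_000)
print(f"Pi is roughly {pi_estimate}")
```

In Spark the same map and reduce steps are spread over the partitions of an RDD, so the number of random points can grow well beyond what a single process handles comfortably.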

# Spark Shell

[Spark Shell](https://spark.apache.org/docs/latest/quick-start.html#interactive-analysis-with-the-spark-shell) provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. 

It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python.

## Run a Scala Spark shell in docker
```bash
# Change the dataset path according to your setup; remember it must be an absolute path
docker run --hostname spark -p 4040:4040 -it --rm -v /home/tap/tap-workspace/tap2024/spark/dataset:/tmp/dataset  apache/spark /opt/spark/bin/spark-shell
```

### Spark shell in action

```
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/27 14:11:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://spark:4040
Spark context available as 'sc' (master = local[*], app id = local-1714227106342).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.1
      /_/

Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 11.0.22)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```


### Execute some commands
```scala
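scala> // Read the file as a Dataset[String], one element per line
scala> val textFile = spark.read.textFile("file:///tmp/dataset/lotr_characters.csv")
scala> textFile
scala> textFile.count()   // number of lines in the file
scala> textFile.first()   // first line of the file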
```


## Start PySpark in Docker
```bash
docker run --hostname spark -p 4040:4040 -it --rm -v /home/tap/tap-workspace/tap2024/spark/dataset:/tmp/dataset  apache/spark /opt/spark/bin/pyspark
```

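### Create an RDD from a Python list
```python
# Create a range of 10000 integers (0..9999)
data = range(10000)
# Create an RDD using parallelize
distData = sc.parallelize(data)
# What is sc? (the SparkContext created by the shell)
sc
# And distData? (an RDD)
distData
# List all the elements (collect brings them back to the driver)
distData.collect()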
```

### Create an RDD from a text file
```python
# An RDD can also be created from external storage
# textFile creates an RDD of strings, one per line (recall spark.read.textFile in the Scala shell)
distFile = sc.textFile("/tmp/dataset/The Return Of The King_djvu.txt")
distFile

# Take the first 10 lines
distFile.take(10)
```

### A simple map-reduce function
```python
# Total number of characters in the book: line lengths (map), then their sum (reduce)
sizeOfBook = distFile.map(lambda s: len(s)).reduce(lambda a, b: a + b)
sizeOfBook

# Let's do it in steps

# First step: for each line, compute the length of the line
mappa = distFile.map(lambda s: len(s))
# Show some elements
mappa.take(20)
# Then sum all the elements of the RDD
reduce = mappa.reduce(lambda a, b: a + b)
reduce
```
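
The same two steps can be tried without a cluster. The sketch below is plain Python, not PySpark: `lines` is a hypothetical stand-in for the book, and `reduce_in_partitions` is an illustrative helper that mimics reducing each partition first and then combining the partial results. That second step is why the function passed to `reduce` should be associative and commutative.

```python
from functools import reduce
from operator import add

# Hypothetical stand-in for the lines of the book
lines = ["Three Rings for the Elven-kings", "", "One Ring to rule them all"]

# "map" step: one length per line
lengths = list(map(len, lines))
# "reduce" step: fold all the lengths into a single total
total = reduce(add, lengths)
print(lengths)
print(total)


def reduce_in_partitions(values, func, num_partitions):
    """Reduce each chunk separately, then reduce the partial results."""
    chunks = [values[i::num_partitions] for i in range(num_partitions)]
    partials = [reduce(func, chunk) for chunk in chunks if chunk]
    return reduce(func, partials)


# Because addition is associative and commutative, the split does not matter
print(reduce_in_partitions(lengths, add, 2) == total)
```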

# Key-Value Pairs

- Most Spark operations work on RDDs containing any type of object.

- A few special operations are only available on RDDs of key-value pairs.

- The most common ones are distributed “shuffle” operations, such as grouping or aggregating the elements by a key.

- In Python, these operations work on RDDs containing built-in Python [tuples](https://realpython.com/python-tuple/) such as (1, 2).

- In PySpark, create an RDD of tuples and then call the desired operation.
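
Aggregating by key can be sketched on a single machine in plain Python. The `reduce_by_key` helper below is hypothetical and only mimics the result of an operation like `reduceByKey`; it does no shuffling or partitioning, and the sample pairs are made up for illustration.

```python
def reduce_by_key(pairs, func):
    """Merge the values of each key with func, like a local, single-machine reduceByKey."""
    result = {}
    for key, value in pairs:
        # First time we see the key: keep the value; afterwards: combine with func
        result[key] = func(result[key], value) if key in result else value
    return result


# Hypothetical (key, value) tuples, each occurrence counted as 1
pairs = [("frodo", 1), ("sam", 1), ("frodo", 1), ("gandalf", 1), ("frodo", 1)]
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)
```

In Spark the pairs live in different partitions, so records with the same key have to be brought together across the cluster before they can be merged. This is the "shuffle" mentioned above.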

### A line count example
```python
# Create an RDD of (line, 1) pairs from the file
pairs = distFile.map(lambda s: (s, 1))
pairs
# We have created a new RDD, let's see what it contains
pairs.take(50)
# Now we can use a reduce function to count how many times each line appears in the document
counts = pairs.reduceByKey(lambda a, b: a + b)
counts.take(50)
# Let's order by key
ordered = counts.sortByKey()
# and take the first 10 in order
ordered.takeOrdered(10)
```

# Let's do a better analysis
What is the most frequent word in the book?

### Most frequent word in the book
```python
# Split each line into words (splitting on a single space also yields empty strings)
words = distFile.flatMap(lambda line: line.split(" "))
words.take(100)
# Great, let's assign a counter and then sum
wordCounters = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
wordCounters.take(10)
# Now take the 200 most frequent words, sorted by descending count
wordsSorted = wordCounters.takeOrdered(200, key=lambda x: -x[1])
wordsSorted
```
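
Splitting on a single space keeps punctuation attached to words, treats "The" and "the" as different keys, and produces empty strings for blank lines. One possible refinement is a small tokenizer, sketched here in plain Python. The `tokenize` function, its regular expression, and the sample lines are assumptions made for illustration, not part of the original exercise.

```python
import re
from collections import Counter


def tokenize(line):
    """Lowercase a line and return its words, dropping punctuation and empty strings."""
    return re.findall(r"[a-z']+", line.lower())


# Hypothetical sample lines standing in for the book
sample = ["The Road goes ever on and on,", "", "Down from the  door where it began."]

# Local equivalent of flatMap + map + reduceByKey: count every token
word_counts = Counter(word for line in sample for word in tokenize(line))
print(word_counts.most_common(3))
```

In the PySpark shell, the same function could presumably be passed to `flatMap` in place of the lambda, as in `distFile.flatMap(tokenize)`, leaving the rest of the word count unchanged.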