# Apache Spark

## Apache Spark™ is a unified analytics engine for large-scale data processing.
![](https://spark.apache.org/images/spark-logo-trademark.png)

# Features

## Speed

![](https://cc-media-foxit.fichub.com/image/fox-it-mondofox/0177f439-3c0f-44ae-9803-c25f8bfac0dd/flash-vs-superman-game-2jpg-maxw-824.jpg)

### Run workloads up to 100x faster than Hadoop MapReduce.

![Logistic Regression](https://spark.apache.org/images/logistic-regression.png)

Apache Spark achieves high performance for both batch and streaming data, using:

- a state-of-the-art DAG scheduler

- a query optimizer

- a physical execution engine.

## Why is Spark faster?

## In-Memory

![](https://external-preview.redd.it/RVpCIxhliY2p5vKF8I-AoCLIoI48yIEpVPXDduTG6Fc.jpg?auto=webp&s=88a001359893e5533423e9886d4d55cfd2dbdf62)

## Lazy

![](https://miro.medium.com/max/4096/1*KiC1gf3x3Ia_2PBYqfkLBg.jpeg)
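
Laziness means that transformations only describe a computation; nothing runs until an action asks for a result. The following is a minimal plain-Python sketch of that idea using generators. It does not use Spark, and the names `lazy_map`, `lazy_filter`, and `log` are hypothetical helpers introduced only for illustration.

```python
# Plain-Python imitation of lazy transformations (not Spark code).
log = []  # records when elements are actually processed

def lazy_map(func, items):
    """Describe a map; work happens only when the result is iterated."""
    for item in items:
        log.append(f"map {item}")
        yield func(item)

def lazy_filter(pred, items):
    """Describe a filter; it is also evaluated on demand."""
    for item in items:
        if pred(item):
            yield item

data = [1, 2, 3, 4, 5]

# "Transformations": building the pipeline processes nothing yet.
pipeline = lazy_filter(lambda x: x % 2 == 0, lazy_map(lambda x: x * x, data))
work_before_action = len(log)

# "Action": asking for the results triggers the whole chain.
result = list(pipeline)

print(work_before_action)  # 0
print(result)              # [4, 16]
print(len(log))            # 5
```

Deferring the work in this way is what lets an engine see the whole chain of steps before it runs any of them.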

## Ease of Use

![](http://www.quickmeme.com/img/4d/4d4759d82ce65de86834ff151bc8b419f89f4e2f0d003f10a54b236785e3e6d2.jpg)

### Write applications quickly in Java, Scala, Python, R, and SQL.

Spark offers over 80 high-level operators that make it easy to build parallel apps. 

And you can use it **interactively** from the Scala, Python, R, and SQL shells.

#### Scala Example


```scala
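// Read a JSON file into a DataFrame
val df = spark.read.json("logs.json")
// Keep rows with age > 21 and show the nested field name.first
df.where("age > 21").select("name.first").show()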
```

## Generality

![](https://upload.wikimedia.org/wikipedia/commons/3/3d/%D0%92%D1%80%D0%B5%D0%BC%D0%B5%D0%BD%D0%B0_%D0%B3%D0%BE%D0%B4%D0%B0._%D0%94%D0%B6%D1%83%D0%B7%D0%B5%D0%BF%D0%BF%D0%B5_%D0%90%D1%80%D1%87%D0%B8%D0%BC%D0%B1%D0%BE%D0%BB%D1%8C%D0%B4%D0%BE.jpg)

> Arcimboldo - 1563 - Quattro stagioni

### Combine SQL, streaming, and complex analytics.

Spark powers a stack of libraries including 

- SQL and DataFrames

- MLlib for machine learning

- GraphX

- Spark Streaming. 

You can combine these libraries seamlessly in the same application.

![](https://spark.apache.org/images/spark-stack.png)

## Runs everywhere

![](https://www.nextme.it/images/societa/next-economy/Criceti_umani.jpg)

### Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. 

You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes.

![](https://spark.apache.org/images/spark-runs-everywhere.png)

### It can access diverse data sources

Access data in:

- HDFS

- Alluxio

- Apache Cassandra

- Apache HBase

- Apache Hive

- hundreds of other data sources.

# Overview


Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

## Docker

```Dockerfile
FROM openjdk:8-jre-alpine

# Define the variables first so that PATH can reference SPARK_DIR
ENV SPARK_VERSION=3.1.1
ENV SPARK_DIR=/opt/spark
ENV PATH=$SPARK_DIR/bin:$PATH

ADD setup/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz /opt

RUN apk update && apk add --no-cache bash
# Create Sym Link 
RUN ln -s /opt/spark-${SPARK_VERSION}-bin-hadoop2.7 ${SPARK_DIR} 

ADD spark-manager.sh $SPARK_DIR/bin/spark-manager

WORKDIR ${SPARK_DIR}
ENTRYPOINT [ "spark-manager" ]
```

```bash
#!/bin/bash

[[ -z "${SPARK_ACTION}" ]] && { echo "SPARK_ACTION required"; exit 1; }

 
echo "Running action ${SPARK_ACTION}"
case "${SPARK_ACTION}" in
"example")
    # Forward all remaining arguments to Spark's run-example launcher
    echo "Running example ARGS $*"
    ./bin/run-example "$@"
    ;;
esac
```


## SparkPi

Run `sparkExamplePi.sh`:

```bash
#!/usr/bin/env bash
# Stop
docker stop sparkPi

# Remove previous container
docker container rm sparkPi

docker build ../spark/ --tag tap:spark
docker run -e SPARK_ACTION=example --network tap --name sparkPi -it tap:spark SparkPi 10
```

```
Running action example
Running example ARGS SparkPi 100
21/04/18 15:31:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/04/18 15:31:59 INFO SparkContext: Running Spark version 3.1.1
21/04/18 15:31:59 INFO ResourceUtils: ==============================================================
21/04/18 15:31:59 INFO ResourceUtils: No custom resources configured for spark.driver.
21/04/18 15:31:59 INFO ResourceUtils: ==============================================================
21/04/18 15:31:59 INFO SparkContext: Submitted application: Spark Pi

...
Pi is roughly 3.1419099141909914
21/04/18 15:32:09 INFO SparkUI: Stopped Spark web UI at http://958a11429922:4040
21/04/18 15:32:09 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/04/18 15:32:09 INFO MemoryStore: MemoryStore cleared
21/04/18 15:32:09 INFO BlockManager: BlockManager stopped
21/04/18 15:32:09 INFO BlockManagerMaster: BlockManagerMaster stopped
21/04/18 15:32:09 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/04/18 15:32:09 INFO SparkContext: Successfully stopped SparkContext
21/04/18 15:32:09 INFO ShutdownHookManager: Shutdown hook called
21/04/18 15:32:09 INFO ShutdownHookManager: Deleting directory /tmp/spark-bc591bec-ca63-4e7b-86f4-191684261e8f
21/04/18 15:32:09 INFO ShutdownHookManager: Deleting directory /tmp/spark-4dc0f0c2-7124-4fc7-9122-b4ea4789f57f
```


# Spark Shell

```bash
./sparkShell.sh
# Equivalent to: docker run -e SPARK_ACTION=spark-shell --network tap -it tap:spark
```

```
Spark context Web UI available at http://17827060ca34:4040
Spark context available as 'sc' (master = local[2], app id = local-1586889603227).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```


```scala
scala> val textFile = spark.read.textFile("/opt/tap/spark/dataset/lotr_characters.csv")
scala> textFile.count()
res0: Long = 912

scala> textFile.first();

res3: String = birth,death,gender,hair,height,name,race,realm,spouse

```


# Spark Data Structures

# Resilient Distributed Dataset (RDD)

> https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a

![](https://1.bp.blogspot.com/-wMroEy8Ow-k/WdCUxRefTTI/AAAAAAAABNM/Z14px-DgqGYqPfAfwNIILI9EX-ozLGplQCLcBGAs/s640/apache-spark-streaming-13-638.jpg)

fault-tolerant 

![](https://i.imgflip.com/1dzjjc.jpg)

collection of elements

![](https://mallikarjuna_g.gitbooks.io/spark/content/diagrams/spark-rdds.png)

> https://books.japila.pl/apache-spark-internals/

that can be operated on in parallel. 

![](https://mallikarjuna_g.gitbooks.io/spark/content/diagrams/spark-rdd-partitioned-distributed.png)

> Learning Spark 

An RDD in Spark is simply an immutable distributed collection of objects. 

Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. 

RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
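
To make the partitioning idea concrete, here is a small plain-Python sketch, not Spark code: a list is split into partitions, each partition is processed independently (a thread pool stands in for cluster nodes), and the partial results are combined. The helper `split_into_partitions` and the choice of three partitions are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions

data = list(range(1, 11))  # 1..10
partitions = split_into_partitions(data, 3)

# Each partition is summed independently; threads stand in for worker nodes.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum, partitions))

total = sum(partial_sums)
print(partitions)    # [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]
print(partial_sums)  # [10, 18, 27]
print(total)         # 55
```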

There are two ways to create RDDs:

- parallelizing an existing collection in your driver program

- referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

# PySpark

# Start PySpark Docker

```bash
./pyspark.sh 
```

In [1]:
import findspark
import pyspark
findspark.find() 
findspark

<module 'findspark' from '/home/nics/anaconda3/lib/python3.7/site-packages/findspark.py'>

In [2]:
conf = pyspark.SparkConf().setAppName('Tap').setMaster('local')
sc = pyspark.SparkContext(conf=conf)
sc

In [3]:
# List
data = [1, 2, 3, 4, 5] 
data

[1, 2, 3, 4, 5]

In [4]:
distData = sc.parallelize(data)
distData

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274

In [7]:
# An RDD can also be created from external storage
# textFile creates an RDD of strings, one per line (recall spark.read.textFile in the Spark shell)
distFile = sc.textFile("../dataset/The Lord Of The Ring 1-The Fellowship Of The Ring_djvu.txt") # Path may be different in your local env
distFile

/mnt/c/Dev/tap2021/spark/dataset/The Lord Of The Ring 1-The Fellowship Of The Ring_djvu.txt MapPartitionsRDD[7] at textFile at NativeMethodAccessorImpl.java:0

In [8]:
sizeOfBook=distFile.map(lambda s: len(s)).reduce(lambda a, b: a + b)
sizeOfBook

997146
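
The cell above maps each line to its length and then reduces the lengths with addition. The same two steps on a small in-memory list look like this in plain Python, without Spark. The three sample lines are made up for illustration.

```python
from functools import reduce

# Hypothetical sample lines standing in for the lines of the book.
lines = ["THE FELLOWSHIP", "OF THE RING", ""]

# Map step: one length per line.
lengths = list(map(lambda s: len(s), lines))

# Reduce step: fold the lengths pairwise with addition.
size = reduce(lambda a, b: a + b, lengths)

print(lengths)  # [14, 11, 0]
print(size)     # 25
```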

# Key-Value Pairs

While most Spark operations work on RDDs containing any type of objects, a few special operations are only available on RDDs of key-value pairs. The most common ones are distributed “shuffle” operations, such as grouping or aggregating the elements by a key.

In Python, these operations work on RDDs containing built-in Python tuples such as (1, 2). Simply create such tuples and then call your desired operation.
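
As a rough mental model of what a by-key aggregation computes, the plain-Python sketch below groups values by key and folds each group with a binary function. It only illustrates the semantics: `reduce_by_key` is a hypothetical helper, the sample tuples are made up, and a real Spark job performs this across partitions with a shuffle.

```python
def reduce_by_key(pairs, func):
    """Combine the values of each key with func, keeping one value per key."""
    combined = {}
    for key, value in pairs:
        if key in combined:
            combined[key] = func(combined[key], value)
        else:
            combined[key] = value
    return combined

# Hypothetical (key, value) tuples, in the same shape Spark expects in Python.
sample_pairs = [("frodo", 1), ("sam", 1), ("frodo", 1), ("gandalf", 1), ("frodo", 1)]
sample_counts = reduce_by_key(sample_pairs, lambda a, b: a + b)

print(sample_counts)  # {'frodo': 3, 'sam': 1, 'gandalf': 1}
```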

In [9]:
pairs = distFile.map(lambda s: (s, 1))
pairs

PythonRDD[9] at RDD at PythonRDD.scala:53

We have created a new RDD; let's see what it contains.

In [10]:
pairs.take(10)

[('', 1),
 ("“THE LORD OF THE RINGS' ", 1),
 ('', 1),
 ('V*art One ', 1),
 ('', 1),
 ('THE FELLOWSHIP ', 1),
 ('OF THE RING ', 1),
 ('', 1),
 ('J.R.R.ToIkien ', 1),
 ('', 1)]

Now we can use a reduce function to count how many times each line appears in the document.

In [11]:
counts = pairs.reduceByKey(lambda a, b: a + b)
counts

PythonRDD[15] at RDD at PythonRDD.scala:53

In [12]:
counts.take(10)

[('', 5807),
 ("“THE LORD OF THE RINGS' ", 1),
 ('V*art One ', 1),
 ('THE FELLOWSHIP ', 1),
 ('OF THE RING ', 1),
 ('J.R.R.ToIkien ', 1),
 ('Complete Table of Contents ', 1),
 ('Foreword ', 2),
 ('Prologue ', 1),
 ('1 . Concerning Hobbits ', 1)]

Let's order by key

In [13]:
ordered=counts.sortByKey()

In [14]:
ordered.take(10)

[('', 5807),
 ('" "The enemy must have some great need or purpose," said Radagast; "but ',
  1),
 ('" "The time of my thought is my own to spend," answered Dbin. ', 1),
 ('"\'Give us that, Deal, my love," said Smjagol, over his friend\'s ', 1),
 ('"Ass! Fool! Thrice worthy and beloved Barliman! " said I. "It\'s the ', 1),
 ('"At the worst," said he, "our Enemy knows that we have it not and ', 1),
 ('"I am afraid we must go back to the Road here for a while,\' said ', 1),
 ('"I will do that," he said, and rode off as if the Nine were after ', 1),
 ('"So you have come, Gandalf," he said to me gravely; but in his eyes ', 1),
 ('"Talking,\' said Bilbo. \'There was a deal of talk, and everyone had an ',
  1)]

# Let's do a better analysis

Which is the most frequent word in the book?

In [13]:
words=distFile.flatMap(lambda line:line.split(" "))

In [14]:
words.take(100)

['',
 '“THE',
 'LORD',
 'OF',
 'THE',
 "RINGS'",
 '',
 '',
 'V*art',
 'One',
 '',
 '',
 'THE',
 'FELLOWSHIP',
 '',
 'OF',
 'THE',
 'RING',
 '',
 '',
 'J.R.R.ToIkien',
 '',
 '',
 '',
 '',
 'Complete',
 'Table',
 'of',
 'Contents',
 '',
 '',
 '',
 '',
 'Foreword',
 '',
 '',
 'Prologue',
 '',
 '',
 '1',
 '.',
 'Concerning',
 'Hobbits',
 '',
 '',
 '2.',
 'Concerning',
 'Pipe-weed',
 '',
 '',
 '3.',
 'Of',
 'the',
 'Ordering',
 'of',
 'the',
 'Shire',
 '',
 '',
 '4.',
 'Of',
 'the',
 'Finding',
 'of',
 'the',
 'Ring',
 '',
 '',
 'note',
 'on',
 'the',
 'shire',
 'records',
 '',
 '',
 '',
 '',
 '',
 'Book',
 'I',
 '',
 '',
 'Chapter',
 '1',
 '',
 'Chapter',
 '2',
 '',
 'Chapter',
 '3',
 '',
 'Chapter',
 '4',
 '',
 'Chapter',
 '5',
 '',
 'Chapter',
 '6',
 '']
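
`flatMap` applies the function to every line and flattens the resulting lists into a single sequence of words, whereas `map` would keep one list per line. A plain-Python sketch of the difference, using two lines shaped like the output above (no Spark involved):

```python
from itertools import chain

# Two sample lines with trailing spaces, as in the book's text file.
lines = ["THE FELLOWSHIP ", "OF THE RING "]

# map-like: one list of tokens per line.
mapped = [line.split(" ") for line in lines]

# flatMap-like: the per-line lists are concatenated into one flat list.
flat_mapped = list(chain.from_iterable(mapped))

print(mapped)       # [['THE', 'FELLOWSHIP', ''], ['OF', 'THE', 'RING', '']]
print(flat_mapped)  # ['THE', 'FELLOWSHIP', '', 'OF', 'THE', 'RING', '']
```

Splitting on a single space is also why empty strings show up among the words: a trailing space or a blank line produces an empty token.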

Great, let's pair each word with a counter of 1 and then sum the counters per word.

In [15]:
wordCounters=words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

In [16]:
wordCounters.take(10)

[('', 21555),
 ('“THE', 1),
 ('LORD', 1),
 ('OF', 3),
 ('THE', 9),
 ("RINGS'", 1),
 ('V*art', 1),
 ('One', 67),
 ('FELLOWSHIP', 1),
 ('RING', 1)]

Now let's sort the words by count, in descending order.

In [17]:
wordsSorted=wordCounters.takeOrdered(50, key = lambda x: -x[1])
wordsSorted

[('', 21555),
 ('the', 10695),
 ('and', 7045),
 ('of', 5025),
 ('to', 3857),
 ('a', 3522),
 ('in', 2702),
 ('was', 2430),
 ('that', 2256),
 ('he', 2234),
 ('I', 2191),
 ('it', 1644),
 ('they', 1518),
 ('his', 1494),
 ('not', 1305),
 ('you', 1280),
 ('is', 1260),
 ('as', 1254),
 ('had', 1246),
 ('for', 1241),
 ('said', 1159),
 ('on', 1122),
 ('with', 1067),
 ('were', 1016),
 ('at', 1015),
 ('have', 988),
 ('The', 973),
 ('but', 961),
 ('be', 841),
 ('from', 722),
 ('all', 704),
 ('we', 693),
 ('or', 668),
 ('He', 660),
 ('their', 655),
 ('are', 627),
 ('if', 613),
 ('Frodo', 586),
 ('there', 554),
 ('But', 553),
 ('by', 534),
 ('no', 518),
 ('will', 514),
 ('out', 511),
 ('up', 482),
 ('them', 459),
 ('my', 457),
 ('now', 456),
 ('into', 451),
 ('It', 446)]
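
`takeOrdered(n, key=...)` returns the n smallest elements according to the key, so negating the count puts the most frequent words first. The plain-Python sketch below performs the same selection with `heapq.nsmallest` on a handful of counts copied from the output above. Dropping the empty token first is an optional extra step shown here as an assumption about what a cleaner ranking might want, not something the notebook does.

```python
import heapq

# A few (word, count) pairs taken from the results above.
word_counts = [("", 21555), ("Frodo", 586), ("the", 10695),
               ("of", 5025), ("and", 7045), ("RING", 1)]

# Same selection rule as takeOrdered(3, key=lambda x: -x[1]).
top3 = heapq.nsmallest(3, word_counts, key=lambda x: -x[1])

# Optional: ignore the empty token produced by splitting on single spaces.
non_empty = [wc for wc in word_counts if wc[0]]
top3_words = heapq.nsmallest(3, non_empty, key=lambda x: -x[1])

print(top3)        # [('', 21555), ('the', 10695), ('and', 7045)]
print(top3_words)  # [('the', 10695), ('and', 7045), ('of', 5025)]
```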

In [16]:
sc.stop()
