# History - Spark

- **2009**: Started as a research project at UC Berkeley's AMPLab.
- **2010**: Open-sourced under a BSD license.
- **2013**: Spark became an Apache project.
- **2014**: Used by Databricks to sort large-scale datasets and set a world record.

Spark is an open-source data processing engine that processes large datasets, in batch or in near real time, across clusters of computers using simple programming constructs.

## Hadoop vs. Spark

- Processing data with Hadoop MapReduce is slow because it is batch-oriented and writes intermediate results to disk. Spark can run some workloads up to 100 times faster than MapReduce because it keeps intermediate data in memory.

- Hadoop performs batch processing whereas Spark can perform both batch processing and real-time processing of data.

- Hadoop is written in Java and has a larger codebase. Spark is written mainly in Scala and has fewer lines of code.

## Spark features

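- Spark is built around RDDs (resilient distributed datasets). Think of an RDD as one dataset stored across multiple computers.

- In-memory computing: data is kept in RAM so it can be accessed quickly.

- Lazy evaluation: transformations are not executed until an action needs their result.

- Fault tolerance: a lost partition can be recomputed from the chain of transformations that produced it.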

## Spark components

<b>Spark Core</b> (contains RDDs, core engine) + <b>Spark SQL</b> (for structured data, has DataFrames) + <b>Spark Streaming</b> (works with data that is constantly being generated) + <b>Spark MLlib</b> (contains libraries for ML development) + <b>Spark GraphX</b> (data with a network structure)

a) Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for memory management, fault recovery, scheduling, distributing and monitoring jobs on a cluster, and interacting with storage systems. Spark does not have its own storage; the data can live in HDFS, HBase, an RDBMS, and so on. <br><br>
b) Spark SQL is the component used for structured and semi-structured data processing. <br><br>
c) Spark Streaming is a lightweight API that lets developers perform batch processing and real-time streaming of data with ease. It provides secure, reliable, and fast processing of live data streams.<br><br>
d) MLlib is a low-level ML library that is simple to use, scalable, and compatible with various programming languages.<br><br>
e) GraphX is Spark's own graph computation engine and data store. It provides a uniform tool for ETL.

## Spark Architecture

Apache Spark uses a master-slave (master-worker) architecture that consists of a driver, which runs on the master node, and multiple executors, which run on the worker nodes in the cluster.

https://www.simplilearn.com/tutorials/apache-spark-tutorial/apache-spark-architecture

- The master node runs the driver program.

- Your Spark application code acts as the driver program and creates a `SparkContext`, which is the gateway to all Spark functionality.

## Spark cluster managers

Options:<br>
a) Apache Spark standalone mode<br>
b) Apache Mesos<br>
c) Hadoop YARN<br>
d) Kubernetes
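The cluster manager is normally selected through the master URL given to Spark. The sketch below lists the usual URL shapes for each option; the host names and ports are hypothetical placeholders, and `master_url` is a helper invented for this example, not part of PySpark.

```python
# Typical master URL formats. Hosts and ports are placeholders, not real endpoints.
MASTER_URLS = {
    "local": "local[*]",                               # no cluster manager, all local cores
    "standalone": "spark://master-host:7077",          # Spark standalone mode
    "mesos": "mesos://mesos-host:5050",                # Apache Mesos
    "yarn": "yarn",                                    # Hadoop YARN (cluster found via Hadoop config)
    "kubernetes": "k8s://https://k8s-apiserver:6443",  # Kubernetes API server
}


def master_url(cluster_manager: str) -> str:
    """Return the example master URL for a cluster manager name (hypothetical helper)."""
    try:
        return MASTER_URLS[cluster_manager]
    except KeyError:
        raise ValueError(f"unknown cluster manager: {cluster_manager!r}") from None


print(master_url("yarn"))        # yarn
print(master_url("standalone"))  # spark://master-host:7077
```

The returned string is what you would pass to `SparkSession.builder.master(...)` or to the `--master` option of `spark-submit`.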

## Resilient distributed datasets (RDDs)

Spark Core is built around RDDs: immutable, fault-tolerant, distributed collections of objects that can be operated on in parallel.

RDD -> Transformation + Action


- **Transformation**: Operations such as map, filter, join, and union that are performed on an RDD and yield a new RDD containing the result.
- **Action**: Operations such as reduce, first, and count that return a value after running a computation on an RDD.

Transformations only build up the execution logic, in the form of new RDDs; they do not compute anything by themselves. Execution is triggered only when you call an action.

An RDD (Resilient Distributed Dataset) is the fundamental building block of PySpark: a fault-tolerant, immutable, distributed collection of objects. Immutable means that once you create an RDD you cannot change it. Each RDD is divided into logical partitions, which can be computed on different nodes of the cluster.

In other words, an RDD is a collection of objects similar to a list in Python. The difference is that an RDD is computed by several processes spread across multiple physical servers (the nodes of a cluster), while a Python collection lives and is processed in a single process.

## PySpark

PySpark is the Python API for Apache Spark. It lets you write Python applications that use Spark's capabilities and run in parallel on a distributed cluster (multiple nodes).

## RDDs

A PySpark RDD (Resilient Distributed Dataset) is a fundamental data structure of PySpark: a fault-tolerant, immutable, distributed collection of objects. Once you create an RDD you cannot change it.

To create an RDD, you first need a `SparkSession`, which is the entry point to a PySpark application. A SparkSession can be created with the `SparkSession.builder` pattern, or from an existing session with its `newSession()` method.

```python
from pyspark.sql import SparkSession

# Create a Spark session.
# master(): on a cluster, pass your cluster manager's master URL; usually "yarn"
#   or a "mesos://..." URL, depending on your cluster setup.
#   Use local[x] to run locally; x must be an integer greater than 0 and is the
#   number of worker threads, which also sets the default number of partitions
#   for RDDs and DataFrames. Ideally, x is the number of CPU cores you have.
# appName(): sets your application name.
# getOrCreate(): returns the existing SparkSession if there is one, otherwise
#   creates a new one.
spark = SparkSession.builder.master("local[1]").appName("PySpark-learn").getOrCreate()
```

You can also create a new SparkSession using the `newSession()` method of an existing session. The new session uses the same app name and master as the existing one. The underlying SparkContext is the same for both sessions, because you can have only one context per PySpark application.

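```python
# Create a new SparkSession from the existing one; both share the same SparkContext
spark = SparkSession.builder.getOrCreate()  # reuse the current session, or create one
spark2 = spark.newSession()  # newSession() is an instance method and must be called
print(spark2)
```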

---