# **Spark**

* Big Data is data characterized by velocity, variety, volume, and veracity.
* Hadoop MapReduce works only with **'On Disk'** computation and **Batch Data**. Its framework is lengthy and complex. Low cost.
* Hadoop has two main components -
    * **HDFS** - Stores data in a distributed fashion. Scaling is easier here.
    * **MapReduce** - Used for distributed processing.
* If you want to run SQL on Hadoop, you need to learn Hive.
* Related tools: HBase, Apache Storm (handling real-time data), Oozie, Sqoop, Pig.
* Spark supports both real-time and batch processing. High cost.
* In-memory computation is supported, i.e. transformations are done in RAM, while reads and writes happen on disk. Spark includes libraries such as Spark SQL, MLlib, GraphX, and Spark Streaming.
* Spark is a simple and user-friendly system.
* Without Spark, doing 10 different things means operating 10 different tools; Spark was introduced to overcome this. It can be up to 100x faster than Hadoop MapReduce for in-memory workloads, which is made possible by reducing the number of read/write operations on disk.
* There are 350+ projects under the Apache Software Foundation; Spark is one of them.
* Spark can be used with Java, Scala, Python, SQL, and R.
* Databricks was introduced as a managed platform for running Spark.
* Other platforms that can run Spark: Microsoft Fabric, Azure Synapse, AWS Glue.
* Databricks is preferred for running Spark. (Why??)
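
To make the MapReduce model mentioned above more concrete, here is a minimal plain-Python sketch of a word count split into map, shuffle, and reduce phases. It runs in a single process and only illustrates the data flow; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample lines are assumptions made for this example, not Hadoop or Spark APIs.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical sample input
lines = ["spark is fast", "hadoop is batch", "spark is in memory"]
word_counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(word_counts["spark"], word_counts["is"])  # 2 3
```

In a real cluster, the map and reduce steps run on different machines, and intermediate results are written to disk between phases, which is the read/write overhead that in-memory processing tries to avoid.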


Apache Spark unifies
  * Batch Processing
  * Stream Analytics
  * Machine Learning
  * SQL Processing



#### **Spark's Basic Architecture**
------
![Alt text](https://hacked.work/blog/wp-content/uploads/2015/03/spark-cluster.png)



***Apache Spark works in a master-slave architecture where the master is called “Driver” and slaves are called “Workers”. When you run a Spark application, Spark Driver creates a context (Spark Context) that is an entry point to your application, and all operations (transformations and actions) are executed on worker nodes, and the resources are managed by Cluster Manager.***


* **Driver Program** – The process running the main() function of the application and creating the SparkContext. It is also the program/job, written by the developers, that is submitted to Spark for processing. The driver turns the job into tasks, one per partition, and schedules them on the executors. There is exactly one driver program per Spark application.


* **Spark Context** – Spark Context is the entry point to Spark Core services and features. It sets up internal services and establishes a connection to a Spark execution environment. It communicates with the cluster and is used to create RDDs. Every Spark application creates a SparkContext object before it can do any processing. It lets your Spark application access the cluster with the help of the resource manager. It is created by the driver program. ***Only one SparkContext can be active per JVM.***



* **Cluster Manager** – Spark uses a cluster manager to acquire resources across the cluster for executing a job. Spark itself is agnostic of the cluster manager and does not care how it obtains cluster resources. It supports the following cluster managers:

    * Spark standalone cluster manager - A simple cluster manager included with Spark that makes it easy to set up a cluster
    * YARN - The resource manager in Hadoop 2
    * Mesos
    * Kubernetes


* **Worker Node** – Worker Nodes are nodes which actually do data processing/heavy lifting on data.

* **Executor** – Executors are independent processes which run inside the Worker Nodes in their own JVMs. Data processing is actually done by these executor processes.


* **Cache** – Data stored in physical memory. Jobs can cache intermediate data so that RDDs do not need to be recomputed, which improves performance.


* **Task** – A task is a unit of work performed independently by the executor on one partition.


* **Partition** – Spark manages its data by splitting data into manageable chunks across the nodes in a cluster. These chunks are called partitions. The splitting of data is done in a way so that it leads to reduction of network traffic and also optimise the operations to be performed on the data.
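
The relationship between partitions, tasks, and the driver described above can be sketched in plain Python. This is only an analogy under stated assumptions: a thread pool stands in for the executors, and `split_into_partitions` and `task` are hypothetical helpers written for this example. Spark's own partitioning and scheduling are considerably more involved.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def task(partition):
    """One task: process a single partition independently of the others."""
    return sum(x * 2 for x in partition)


data = list(range(1, 10))
partitions = split_into_partitions(data, 3)

# The thread pool stands in for the executors; each worker runs one task at a time.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(task, partitions))

# The "driver" combines the per-partition results.
total = sum(partial_results)
print(partitions)       # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(partial_results)  # [12, 30, 48]
print(total)            # 90
```

Because each task touches only its own partition, the tasks need no coordination until the final combine step, which is what lets the work spread across nodes.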

[Imp Link](https://www.mrstonewallin.com/post/spark-knowledge-series-i)



#### **Spark Deployment Modes: Client Mode vs Cluster Mode**
---
<img src="https://th.bing.com/th/id/R.5b4223cfa8490f2a8ac960b3e3d3738b?rik=sN29WUb1k7JWxw&riu=http%3a%2f%2fblog.brainlounge.de%2fmemoryleaks%2f2018-12-getting-started-with-spark-on-kubernetes-deploy-modes.png&ehk=zTXeqqjcdNpkjexQ77%2bl3JSIvFN1ljY4scGGGNdGo6Y%3d&risl=&pid=ImgRaw&r=0" width="550" height="300" />

* **Cluster Mode:** In cluster mode, the driver runs on one of the worker nodes, and this node shows as a driver on the Spark Web UI of your application. Cluster mode is used to run production jobs.
* **Client Mode:** In client mode, the driver runs locally on the machine from which you submit your application using the `spark-submit` command. Client mode is mainly used for interactive and debugging purposes. Note that in client mode only the driver runs locally; all tasks still run on the cluster's worker nodes.
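
The two modes are selected with the `--deploy-mode` option of `spark-submit`. The sketch below only builds the argument list and does not launch anything. The helper `build_spark_submit`, the file name `my_job.py`, and the default master `yarn` are assumptions made for illustration.

```python
def build_spark_submit(app_path, deploy_mode, master="yarn"):
    """Return the spark-submit argument list for the given deploy mode."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode}")
    return ["spark-submit", "--master", master, "--deploy-mode", deploy_mode, app_path]


# my_job.py is a hypothetical application file.
cluster_cmd = build_spark_submit("my_job.py", "cluster")
client_cmd = build_spark_submit("my_job.py", "client")
print(" ".join(cluster_cmd))
print(" ".join(client_cmd))
```

A list like this could be passed to `subprocess.run` on a machine where Spark is installed.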


#### **Spark Toolset**

-------



<img src="https://miro.medium.com/max/1104/1*_Dy9w0lUXIeH6WHALkQC-g.png" width="400" height="400" />

#### **Data Structures in Spark: RDD, DataFrame, Dataset**
------

* **Resilient Distributed Dataset (RDD)**: a resilient, immutable, distributed collection of data.
  * **Resilient:** RDDs are fault tolerant.
  * **Collection of Data:** An RDD holds data and looks like a Scala collection.
  * **Partition:** Spark breaks an RDD into smaller chunks of data.
  * **Distributed:** Spark distributes the partitions across the cluster.

* **DataFrame:** The most common Structured API; it simply represents a table of data with rows and columns, similar to a database table. The list that defines the columns and the types within those columns is called the schema.


* **Dataset:**


#### **Notes:**
---

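* All Spark transformations are lazily evaluated; nothing is computed until an action (such as `collect()`) is called.
* SparkContext creates RDDs; DataFrames are created through a SparkSession (or the older SQLContext).
* **Spark Session:** can be used to create DataFrames and Datasets, and gives access to the underlying SparkContext for creating RDDs.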
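
Lazy evaluation can be illustrated with a plain-Python analogy: the built-in `map` returns a lazy iterator, so the function runs only when the result is materialized. This is not how Spark implements laziness (Spark builds an execution plan and runs it on an action); the `calls` list is a hypothetical device used here only to observe when the work happens.

```python
calls = []


def double(x):
    calls.append(x)  # record when the function really runs
    return x * 2


numbers = [1, 2, 3, 4]

# "Transformation": map() returns a lazy iterator, nothing is computed yet.
doubled = map(double, numbers)
calls_before_action = len(calls)

# "Action": materializing the iterator triggers the work.
result = list(doubled)
calls_after_action = len(calls)

print(calls_before_action)  # 0
print(calls_after_action)   # 4
print(result)               # [2, 4, 6, 8]
```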

### **Hands On**

In [None]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.4.1.tar.gz (310.8 MB)
Building wheels for collected packages: pyspark
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.1


In [None]:
import pyspark
pyspark.__version__

'3.4.1'

In [None]:
import pyspark
# import findspark
from pyspark import SparkContext
pyspark.__version__

'3.4.1'

In [None]:
# findspark.init('usr/local/spark')

In [None]:
conf = pyspark.SparkConf().setMaster('local').setAppName("first")
# Creating a Spark Context
sc = SparkContext(conf=conf)

In [None]:
# Creating an RDD on the Spark Context
rdd = sc.parallelize([1, 2, 3])  # Distribute a local Python list as an RDD

In [None]:
rdd.collect()

[1, 2, 3]

In [None]:
sc

In [None]:
rdd2 = sc.parallelize(['Python','SQL','PySpark'])
rdd2.collect()

['Python', 'SQL', 'PySpark']

In [None]:
type(rdd2)

pyspark.rdd.RDD

In [None]:
rdd3 = sc.parallelize([1,2,3,4,5,6,7,8,9])
rdd3.collect()
type(rdd3)

pyspark.rdd.RDD

In [None]:
rdd4 = rdd3.map(lambda x: x*2)

In [None]:
rdd4.collect()

[2, 4, 6, 8, 10, 12, 14, 16, 18]

In [None]:
rdd5 = rdd3.filter(lambda x:x%2==0)

In [None]:
rdd5.collect()

[2, 4, 6, 8]

In [None]:
numbers = [1, 2, 3, 4, 5, 6]
even_sum = 0
for num in numbers:
  if num % 2 == 0:
    even_sum += num

print(even_sum)

12


In [None]:

numbers = [1, 2, 3, 4, 5, 6]
even_sum = sum([x for x in numbers if x % 2 == 0])
print(even_sum)

12


In [None]:
numbers = [1, 2, 3, 4, 5, 6]
even_sum = 0
for i in range(len(numbers)):
  if numbers[i] % 2 == 0:
    even_sum += numbers[i]

print(even_sum)

12


In [None]:
numbers = [1, 2, 3, 4, 5, 6]
even_sum = 0
while len(numbers) > 0:
  num = numbers.pop()
  if num % 2 == 0:
    even_sum += num

print(even_sum)

12
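
The loops above can also be written in the filter-then-reduce style used in the RDD cells earlier. The sketch below uses only the Python standard library. An RDD exposes similarly named `filter` and `reduce` methods, but here everything runs locally on a plain list.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]

# Keep only the even numbers (lazy, like the RDD filter used earlier).
evens = filter(lambda x: x % 2 == 0, numbers)

# Fold the remaining values into a single sum, starting from 0.
even_sum = reduce(lambda a, b: a + b, evens, 0)
print(even_sum)  # 12
```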
