# Apache Spark Architecture 


## Spark Cluster Architecture

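Apache Spark is a **distributed compute engine**. A Spark application normally runs across a **cluster** (a group of machines), whether you operate the cluster yourself or use a managed platform such as Databricks or EMR. For development, the same application can also run in local mode on a single machine.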

### The big picture

- **Driver**: the brain of your Spark application (plans and coordinates work)
- **Executors**: the workers that run tasks and process partitions
- **Cluster Manager**: allocates resources (YARN / Kubernetes / Standalone)
- **Storage**: where input/output data lives (HDFS, S3, ADLS, GCS, DBFS, etc.)


![Spark Cluster Architecture](../images/spark_cluster_architecture.png)



### What each component does

#### Driver
- Creates the SparkSession / SparkContext
- Builds the logical plan (what to do)
- Builds the physical execution plan (how to do it)
- Splits the work into **jobs → stages → tasks**
- Sends tasks to executors and tracks progress

#### Executors
- JVM processes on worker nodes
- Run **tasks** on **partitions**
- Store cached data (if you use cache/persist)
- Write shuffle files during wide transformations

#### Cluster manager
- Controls resource allocation (CPU/memory) for executors
- Examples: **YARN**, **Kubernetes**, **Standalone**



## How a Spark Application Runs 

Spark evaluates **lazily**: transformations only build a plan, and nothing executes until an **action** is called.

### Typical workflow

1. **Driver starts** and creates SparkSession/SparkContext  
2. You define transformations (map/filter/select/withColumn/...)  
3. Spark builds a **DAG** (Directed Acyclic Graph) of transformations  
4. When an **action** happens (count/collect/write/show/...), Spark creates a **job**  
5. Job is split into **stages** (based on shuffle boundaries)  
6. Each stage is split into **tasks** (1 task per partition)  
7. Tasks run on executors; shuffle may occur between stages  
8. Results are returned to the driver or written to storage

### DAG → stages → tasks
![DAG, Stages, Tasks](../images/spark_dag_stages_tasks.png)



### Narrow vs Wide transformations (why stages form)

- **Narrow transformation**: each output partition depends on **one** input partition  
  Examples: `map`, `filter`, `flatMap`, `select`, `withColumn`  
  ✅ Usually stays within the same stage

- **Wide transformation**: output partition depends on **many** input partitions  
  Examples: `groupByKey`, `reduceByKey`, `join`, `groupBy`, `distinct`  
  ⚠️ Requires **shuffle** and creates a new stage

Spark docs explain shuffle as the mechanism that re-distributes data across partitions and executors, often involving disk + network IO.
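
To make the idea concrete, here is a toy model of a hash-partitioned shuffle in plain Python. It is not Spark's implementation: the helper names (`partition_for`, `shuffle`) and the use of `zlib.crc32` as the hash are assumptions chosen so the example is deterministic. It only shows why every record with the same key must end up in the same output partition.

```python
import zlib
from collections import defaultdict

def partition_for(key: str, num_partitions: int) -> int:
    """Pick the output partition for a key (toy hash partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def shuffle(map_outputs, num_partitions):
    """Regroup (key, value) pairs so each key lands in exactly one partition."""
    reduce_inputs = [defaultdict(list) for _ in range(num_partitions)]
    for partition in map_outputs:
        for key, value in partition:
            reduce_inputs[partition_for(key, num_partitions)][key].append(value)
    return reduce_inputs

# Two input partitions; "hello" appears in both
map_outputs = [
    [("hello", 1), ("spark", 1)],
    [("hello", 1), ("world", 1)],
]

reduce_inputs = shuffle(map_outputs, num_partitions=2)

# After the shuffle, each partition can aggregate its keys independently
counts = {
    key: sum(values)
    for partition in reduce_inputs
    for key, values in partition.items()
}
print(counts)
```

In a real cluster the regrouping step moves data between executors, which is where the disk and network cost comes from.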

### Shuffle
![Shuffle](../images/spark_shuffle.png)



## Key Terminologies 

- **Application**: one run of a Spark program (one driver plus one or more executors)
- **Job**: created by an **action** (e.g., `count()`, `collect()`, `write`, `show()`)
- **Stage**: a set of tasks that run the same code on different partitions without a shuffle; stage boundaries fall at shuffles
- **Task**: smallest unit of work; processes **one partition**
- **Partition**: a slice of data; the unit of parallelism
- **DAG**: directed acyclic graph of transformations; the scheduler splits it into stages and tasks
- **Executor**: process on workers that runs tasks
- **Driver**: coordinator; schedules stages/tasks; holds SparkSession



## SparkSession

**SparkSession** is the modern entry point (Spark 2.0+). It unifies:
- DataFrame / Dataset APIs
- Spark SQL
- Structured Streaming, plus catalog and configuration access

In Databricks notebooks, a `spark` session is usually already available.

### SparkSession as unified entry point
![SparkSession](../images/spark_session_unified_entry.png)


In [None]:
# SparkSession is usually available in Databricks as `spark`
spark



## SparkContext

**SparkContext** represents the connection to the Spark cluster and is the classic entry point for RDD operations.

In modern Spark, you typically access it via:

`spark.sparkContext`

SparkContext still matters when:
- working with RDDs
- low-level operations (broadcast variables, accumulators, etc.)
- legacy codebases


In [None]:
sc = spark.sparkContext
sc



## SQLContext 

Before Spark 2.0, structured operations were done via SQLContext / HiveContext.
From Spark 2.0 onward, **SparkSession replaces SQLContext** for most use cases.

You may still see SQLContext in older tutorials or older Spark code.



## Spark Execution Methods

### Interactive execution (explore + learn)
- `spark-shell` (Scala REPL)
- `pyspark` shell
- Notebooks (Databricks / Jupyter / Zeppelin)

✅ Best for learning, debugging, exploratory analysis.

### Job submission (production runs)
- `spark-submit` (most common)
- REST APIs (platform-dependent)

✅ Best for scheduled batch pipelines and production workloads.

Spark docs for application submission: https://spark.apache.org/docs/latest/submitting-applications.html



### Deploy modes (client vs cluster)

Where does the **driver** run?

- **client mode**: driver runs where you run `spark-submit`
- **cluster mode**: driver runs inside the cluster (e.g., YARN ApplicationMaster / a Kubernetes driver pod)

Spark’s YARN docs describe these modes: https://spark.apache.org/docs/latest/running-on-yarn.html





## RDD example (stages + shuffle)

This example demonstrates a classic **word count** flow:

- `flatMap` → `map` are **narrow** transformations (no shuffle)
- `reduceByKey` is a **wide** transformation (shuffle) → creates a new stage
- `collect` is an **action** → triggers execution

### Wordcount DAG
![WordCount DAG](../images/spark_dag_stages_tasks.png)


In [None]:
# RDD word count (small demo)
text = ["hello spark", "hello lakehouse", "hello world", "spark is fast"]
rdd = sc.parallelize(text, 2)

words = rdd.flatMap(lambda line: line.split(" "))           # narrow
pairs = words.map(lambda w: (w, 1))                         # narrow
counts = pairs.reduceByKey(lambda a, b: a + b)              # wide (shuffle)

counts.collect()                                            # action



## Spark Web UI (what to look for)

Every running Spark application serves a Web UI from its driver (port 4040 by default; if that port is taken, Spark tries 4041, 4042, and so on). It shows:
- Jobs, Stages, Tasks
- Executors
- Storage (cached datasets)
- SQL tab (for Spark SQL queries)
- Environment and event timeline

Official docs: https://spark.apache.org/docs/latest/monitoring.html




## Official references (Spark docs)

- Cluster Mode Overview: https://spark.apache.org/docs/latest/cluster-overview.html  
- Submitting Applications (spark-submit): https://spark.apache.org/docs/latest/submitting-applications.html  
- RDD Programming Guide (transformations, actions, shuffle): https://spark.apache.org/docs/latest/rdd-programming-guide.html  
- Monitoring / Spark Web UI: https://spark.apache.org/docs/latest/monitoring.html  
- SparkSession API (Python): https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.html  
- SparkContext API (Python): https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.html  
- SQLContext (legacy): https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SQLContext.html  
- Running on YARN (deploy modes): https://spark.apache.org/docs/latest/running-on-yarn.html  
