# Apache Spark Fundamentals

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.

---

## Table of Contents
1. [Spark Architecture](#spark-architecture)
2. [Core Abstractions: RDDs, DataFrames, Datasets](#core-abstractions)
3. [Transformations and Actions](#transformations-and-actions)
4. [Spark SQL](#spark-sql)
5. [Performance Tuning](#performance-tuning)
6. [Key Takeaways](#key-takeaways)

---

## Spark Architecture <a id='spark-architecture'></a>

Spark follows a **master-worker** architecture with three main components:

### 1. Driver Program
- The **central coordinator** of a Spark application
- Runs the `main()` function and creates the `SparkContext` (wrapped by `SparkSession` since Spark 2.0)
- Converts user code into tasks and schedules them on executors
- Maintains information about the Spark application

### 2. Cluster Manager
- Responsible for **resource allocation** across the cluster
- Types: Standalone, YARN, Kubernetes, and Mesos (Mesos support is deprecated as of Spark 3.2)
- Negotiates resources between Driver and Worker nodes

### 3. Executors
- **Worker processes** that run on cluster nodes
- Execute tasks assigned by the Driver
- Store data in memory/disk for caching
- Report task status back to the Driver

```
┌─────────────────────────────────────────────────────────────────┐
│                        SPARK APPLICATION                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│    ┌─────────────────┐                                          │
│    │     DRIVER      │                                          │
│    │  ┌───────────┐  │                                          │
│    │  │SparkContext│ │                                          │
│    │  └───────────┘  │                                          │
│    │  ┌───────────┐  │         ┌──────────────────────┐         │
│    │  │ DAG       │  │         │   CLUSTER MANAGER    │         │
│    │  │ Scheduler │  │◄───────►│ (YARN/Mesos/K8s)     │         │
│    │  └───────────┘  │         └──────────────────────┘         │
│    │  ┌───────────┐  │                   │                      │
│    │  │Task       │  │                   │                      │
│    │  │Scheduler  │  │                   ▼                      │
│    │  └───────────┘  │    ┌─────────────────────────────┐       │
│    └────────┬────────┘    │         WORKER NODE         │       │
│             │             │  ┌───────────────────────┐  │       │
│             │             │  │      EXECUTOR         │  │       │
│             └────────────►│  │ ┌─────┐ ┌─────┐       │  │       │
│                           │  │ │Task1│ │Task2│ ...   │  │       │
│                           │  │ └─────┘ └─────┘       │  │       │
│                           │  │ ┌─────────────────┐   │  │       │
│                           │  │ │  Cache/Storage  │   │  │       │
│                           │  │ └─────────────────┘   │  │       │
│                           │  └───────────────────────┘  │       │
│                           └─────────────────────────────┘       │
└─────────────────────────────────────────────────────────────────┘
```
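
The division of labor in the diagram can be imitated in a few lines of plain Python. This is only a toy model under loud assumptions: `run_job` is a hypothetical helper, threads stand in for executor processes, and there is no cluster manager. It shows the driver creating one task per partition and combining the results that come back.

```python
from concurrent.futures import ThreadPoolExecutor

def run_job(partitions, task_fn, num_executors=2):
    """Toy driver: create one task per partition and run them on a worker pool."""
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        # pool.map keeps partition order, like task results reported to the driver
        return list(pool.map(task_fn, partitions))

# Data already split into partitions (hypothetical layout)
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]

partial_sums = run_job(partitions, sum)   # each task handles exactly one partition
total = sum(partial_sums)                 # the driver combines the task results
print(partial_sums, total)
```

Real executors are separate JVM processes on worker nodes, so task code and data must be serialized and sent over the network, which this sketch ignores.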

### DAG (Directed Acyclic Graph)

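Spark records transformations as a DAG and uses it to plan execution:

1. **DAG Construction**: Each transformation adds a node to the graph; nothing runs until an action is called
2. **Stage Division**: When an action is called, the DAG scheduler divides the graph into stages at **shuffle boundaries**
3. **Task Creation**: Each stage is divided into tasks (one per partition)
4. **Optimization**: For DataFrame, Dataset, and SQL code, the Catalyst optimizer rewrites the plan and the Tungsten engine handles memory layout and code generation before stages are built; plain RDD code bypasses Catalyst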

```
User Code → Logical Plan → Optimized Logical Plan → Physical Plan → DAG → Tasks
```
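
Stage division can be sketched for a linear chain of operations. The helper below is a hypothetical simplification, not Spark's DAG scheduler: it assumes every operation in `WIDE_OPS` always shuffles, that the plan is a straight line rather than a graph, and that every stage has the same partition count.

```python
# Operations assumed to need a shuffle in this toy model
WIDE_OPS = {"groupByKey", "reduceByKey", "join", "distinct", "repartition", "sortByKey"}

def split_into_stages(ops):
    """Group a linear chain of operations into stages, cutting before each wide op."""
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])      # shuffle boundary: the wide op starts a new stage
        stages[-1].append(op)
    return stages

pipeline = ["map", "filter", "reduceByKey", "map", "sortByKey"]
stages = split_into_stages(pipeline)

num_partitions = 4                             # assumed to be the same for every stage
total_tasks = len(stages) * num_partitions     # one task per partition per stage
print(stages, total_tasks)
```

Narrow operations stay in the same stage because they can be pipelined over a partition without moving data; only a shuffle forces Spark to finish one stage before starting the next.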

---

## Core Abstractions: RDDs, DataFrames, Datasets <a id='core-abstractions'></a>

### RDD (Resilient Distributed Dataset)

The **fundamental data structure** of Spark - an immutable, distributed collection of objects.

| Property | Description |
|----------|-------------|
| **Resilient** | Fault-tolerant via lineage graph |
| **Distributed** | Data partitioned across cluster nodes |
| **Dataset** | Collection of partitioned data |
| **Immutable** | Cannot be changed once created |
| **Lazy** | Transformations are not executed until an action is called |
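
The **Resilient** and **Immutable** properties are easier to see in a toy model. `ToyRDD` below is a hypothetical class, not part of PySpark: each dataset remembers its parent and the function that produced it (its lineage), so a lost partition can be rebuilt instead of restored from a replica. The internal `_cache` exists only to make the loss observable; Spark does not keep computed partitions unless asked to.

```python
class ToyRDD:
    """Immutable toy dataset that remembers how each partition was derived."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self._source = partitions      # only set for the root dataset
        self.parent = parent           # lineage: the dataset this one came from
        self.fn = fn                   # lineage: the function applied per element
        self._cache = {}               # computed partitions, which may be lost

    def map(self, fn):
        # A transformation returns a new ToyRDD; the original is never modified
        return ToyRDD(parent=self, fn=fn)

    def compute(self, index):
        """Return one partition, rebuilding it from lineage if it is missing."""
        if index in self._cache:
            return self._cache[index]
        if self.parent is None:
            result = list(self._source[index])
        else:
            result = [self.fn(x) for x in self.parent.compute(index)]
        self._cache[index] = result
        return result

    def lose_partition(self, index):
        """Simulate an executor failure that drops a computed partition."""
        self._cache.pop(index, None)

base = ToyRDD(partitions=[[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)

before = squared.compute(1)
squared.lose_partition(1)
after = squared.compute(1)     # rebuilt from `base` through the recorded lineage
print(before, after)
```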

### DataFrame

A **distributed collection of data organized into named columns** - like a table in a relational database.

- Built on top of RDDs with schema information
- Optimized by Catalyst query optimizer
- API available in Python, Scala, Java, and R
- **Preferred for most use cases** due to optimizations

### Dataset (Scala/Java only)

A **strongly-typed** collection of domain-specific objects.

- Combines benefits of RDDs (type-safety) and DataFrames (optimization)
- Compile-time type checking
- Not available in Python (PySpark uses DataFrames)

### Comparison

| Feature | RDD | DataFrame | Dataset |
|---------|-----|-----------|----------|
| **Type Safety** | Yes | No | Yes |
| **Optimization** | No | Yes (Catalyst) | Yes (Catalyst) |
| **Schema** | No | Yes | Yes |
| **Python Support** | Yes | Yes | No |
| **Serialization** | Java/Kryo | Tungsten | Tungsten |
| **Use Case** | Low-level control | SQL-like operations | Type-safe OOP |

In [None]:
# Initialize SparkSession (entry point for Spark 2.0+)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark Fundamentals") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

# Access SparkContext from SparkSession
sc = spark.sparkContext

print(f"Spark Version: {spark.version}")
print(f"App Name: {sc.appName}")
print(f"Master: {sc.master}")

In [None]:
# Creating RDDs

# Method 1: From a Python collection (parallelize)
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rdd_from_list = sc.parallelize(data, numSlices=4)  # 4 partitions

print(f"Number of partitions: {rdd_from_list.getNumPartitions()}")
print(f"First 5 elements: {rdd_from_list.take(5)}")

# Method 2: From external data source
# rdd_from_file = sc.textFile("hdfs://path/to/file.txt")

# Method 3: From existing RDD (transformation)
rdd_squared = rdd_from_list.map(lambda x: x ** 2)
print(f"Squared values: {rdd_squared.collect()}")

In [None]:
# Creating DataFrames
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Method 1: From Python list with schema inference
data = [
    ("Alice", "Engineering", 75000),
    ("Bob", "Marketing", 65000),
    ("Charlie", "Engineering", 80000),
    ("Diana", "Sales", 70000),
    ("Eve", "Marketing", 72000)
]

df_inferred = spark.createDataFrame(data, ["name", "department", "salary"])
df_inferred.show()
df_inferred.printSchema()

In [None]:
# Method 2: With explicit schema definition
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("department", StringType(), nullable=True),
    StructField("salary", IntegerType(), nullable=True)
])

df_explicit = spark.createDataFrame(data, schema)
df_explicit.printSchema()

# Method 3: From files (CSV, JSON, Parquet, etc.)
# df_csv = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
# df_json = spark.read.json("path/to/file.json")
# df_parquet = spark.read.parquet("path/to/file.parquet")

---

## Transformations and Actions <a id='transformations-and-actions'></a>

Spark operations are divided into two categories:

### Transformations (Lazy)
- Create a **new RDD/DataFrame** from an existing one
- **Not executed immediately** - just build the DAG
- Two types:
  - **Narrow**: Each input partition contributes to one output partition (e.g., `map`, `filter`)
  - **Wide**: Input partitions contribute to multiple output partitions (e.g., `groupBy`, `join`) - **requires shuffle**

### Actions (Eager)
- **Trigger execution** of the DAG
- Return results to the driver or write to storage
- Examples: `collect()`, `count()`, `saveAsTextFile()` (RDD), `write.save()` (DataFrame)
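
The split between lazy transformations and eager actions can be demonstrated without a cluster. `LazyPipeline` is a hypothetical stand-in for an RDD: `map` and `filter` only record a step, and `collect` is the single method that does any work. The `calls` list is there purely to show when the mapped function actually runs.

```python
class LazyPipeline:
    """Toy model of lazy evaluation: transformations are recorded, actions run them."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)     # the recorded plan, never mutated

    def map(self, fn):
        return LazyPipeline(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyPipeline(self._data, self._steps + (("filter", pred),))

    def collect(self):
        """Action: execute every recorded step and return the results."""
        rows = iter(self._data)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

calls = []

def traced_square(x):
    calls.append(x)          # side effect reveals when work really happens
    return x * x

plan = LazyPipeline(range(1, 6)).map(traced_square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)   # nothing has run yet
result = plan.collect()            # the action triggers execution
print(calls_before_action, result)
```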

```
┌──────────────────────────────────────────────────────────────┐
│                    NARROW TRANSFORMATIONS                    │
│  ┌─────────┐     ┌─────────┐     ┌─────────┐                │
│  │Partition│────►│Partition│────►│Partition│                │
│  │    1    │     │    1'   │     │    1''  │                │
│  └─────────┘     └─────────┘     └─────────┘                │
│                                                              │
│  ┌─────────┐     ┌─────────┐     ┌─────────┐                │
│  │Partition│────►│Partition│────►│Partition│                │
│  │    2    │     │    2'   │     │    2''  │                │
│  └─────────┘     └─────────┘     └─────────┘                │
│       map()          filter()        map()                   │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                    WIDE TRANSFORMATIONS                      │
│  ┌─────────┐                     ┌─────────┐                │
│  │Partition│─────────┬──────────►│Partition│                │
│  │    1    │         │           │    A    │                │
│  └─────────┘         │           └─────────┘                │
│                      │  SHUFFLE                              │
│  ┌─────────┐         │           ┌─────────┐                │
│  │Partition│─────────┴──────────►│Partition│                │
│  │    2    │                     │    B    │                │
│  └─────────┘                     └─────────┘                │
│                  groupByKey()                                │
└──────────────────────────────────────────────────────────────┘
```
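
The two diagrams translate into a short sketch. Both functions are hypothetical helpers that operate on plain lists of partitions, and `zlib.crc32` is used as a stable stand-in for Spark's partitioner. The narrow version touches one input partition per output partition; the wide version may route records from any input partition to any output partition.

```python
import zlib

def narrow_map(partitions, fn):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[fn(x) for x in part] for part in partitions]

def wide_group_by_key(partitions, num_out):
    """Wide: every input partition may send records to every output partition."""
    out = [dict() for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            # Stable hash so a given key always lands in the same output partition
            target = zlib.crc32(key.encode("utf-8")) % num_out
            out[target].setdefault(key, []).append(value)
    return out

pairs = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

doubled = narrow_map(pairs, lambda kv: (kv[0], kv[1] * 2))
grouped = wide_group_by_key(pairs, num_out=2)
print(doubled)
print(grouped)
```

After the wide step, all values for a key sit in one partition, which is what makes per-key aggregation possible and also why the data has to move.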

In [None]:
# ============================================
# RDD Transformations Examples
# ============================================

numbers = sc.parallelize(range(1, 11))

# map() - Apply function to each element
squared = numbers.map(lambda x: x ** 2)
print(f"map (squared): {squared.collect()}")

# filter() - Keep elements that satisfy condition
evens = numbers.filter(lambda x: x % 2 == 0)
print(f"filter (evens): {evens.collect()}")

# flatMap() - Map + flatten results
words = sc.parallelize(["hello world", "spark is great"])
split_words = words.flatMap(lambda line: line.split(" "))
print(f"flatMap: {split_words.collect()}")

# distinct() - Remove duplicates
duplicates = sc.parallelize([1, 1, 2, 2, 3, 3, 3])
unique = duplicates.distinct()
print(f"distinct: {unique.collect()}")

In [None]:
# ============================================
# RDD Pair Operations (Key-Value RDDs)
# ============================================

# Create pair RDD
sales = sc.parallelize([
    ("Electronics", 1000),
    ("Clothing", 500),
    ("Electronics", 1500),
    ("Food", 300),
    ("Clothing", 700),
    ("Food", 400)
])

# reduceByKey() - Aggregate values by key (preferred over groupByKey because
# values are combined within each partition before the shuffle)
total_by_category = sales.reduceByKey(lambda a, b: a + b)
print(f"reduceByKey: {total_by_category.collect()}")

# groupByKey() - Group values by key (yields an iterable of values per key)
grouped = sales.groupByKey().mapValues(list)
print(f"groupByKey: {grouped.collect()}")

# sortByKey() - Sort by key
sorted_sales = total_by_category.sortByKey()
print(f"sortByKey: {sorted_sales.collect()}")

# mapValues() - Apply function to values only
doubled = sales.mapValues(lambda x: x * 2)
print(f"mapValues: {doubled.collect()}")

In [None]:
# ============================================
# RDD Actions Examples
# ============================================

numbers = sc.parallelize(range(1, 11))

# collect() - Return all elements to driver (use with caution!)
all_data = numbers.collect()
print(f"collect: {all_data}")

# count() - Count number of elements
print(f"count: {numbers.count()}")

# first() - Return first element
print(f"first: {numbers.first()}")

# take(n) - Return first n elements
print(f"take(3): {numbers.take(3)}")

# reduce() - Aggregate all elements
total = numbers.reduce(lambda a, b: a + b)
print(f"reduce (sum): {total}")

# aggregate() - More flexible aggregation
# (initial_value, seq_op, comb_op)
sum_count = numbers.aggregate(
    (0, 0),
    lambda acc, val: (acc[0] + val, acc[1] + 1),
    lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])
)
print(f"aggregate (sum, count): {sum_count}")
print(f"average: {sum_count[0] / sum_count[1]}")
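
How `aggregate()` uses its three arguments can be shown in plain Python. `toy_aggregate` is a hypothetical helper, and the partition layout is an assumption: `seq_op` folds the elements inside each partition, then `comb_op` merges the per-partition results.

```python
from functools import reduce

def toy_aggregate(partitions, zero, seq_op, comb_op):
    """Fold each partition with seq_op, then merge the partial results with comb_op."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)

# The numbers 1..10 split into three partitions (assumed layout)
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]

sum_count = toy_aggregate(
    partitions,
    (0, 0),                                         # (running sum, running count)
    lambda acc, val: (acc[0] + val, acc[1] + 1),    # seq_op: add one element
    lambda a, b: (a[0] + b[0], a[1] + b[1]),        # comb_op: merge two partials
)
average = sum_count[0] / sum_count[1]
print(sum_count, average)
```

Because the zero value is applied once per partition and again when merging, it should be neutral for both functions, otherwise the result depends on the number of partitions.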

In [None]:
# ============================================
# DataFrame Transformations
# ============================================
from pyspark.sql.functions import col, avg, sum as spark_sum, count, when, lit, upper

# Sample employee data
employees = spark.createDataFrame([
    (1, "Alice", "Engineering", 75000, 5),
    (2, "Bob", "Marketing", 65000, 3),
    (3, "Charlie", "Engineering", 80000, 7),
    (4, "Diana", "Sales", 70000, 4),
    (5, "Eve", "Marketing", 72000, 6),
    (6, "Frank", "Engineering", 90000, 10),
    (7, "Grace", "Sales", 68000, 2)
], ["id", "name", "department", "salary", "years_exp"])

employees.show()

In [None]:
# select() - Choose columns
employees.select("name", "salary").show()

# selectExpr() - Select with SQL expressions
employees.selectExpr("name", "salary * 1.1 as new_salary").show()

# filter() / where() - Filter rows
high_earners = employees.filter(col("salary") > 70000)
high_earners.show()

# Multiple conditions
experienced_engineers = employees.filter(
    (col("department") == "Engineering") & (col("years_exp") >= 5)
)
experienced_engineers.show()

In [None]:
# withColumn() - Add or modify columns
employees_with_bonus = employees.withColumn(
    "bonus",
    when(col("years_exp") >= 5, col("salary") * 0.15)
    .otherwise(col("salary") * 0.10)
)
employees_with_bonus.show()

# drop() - Remove columns
employees.drop("years_exp").show()

# withColumnRenamed() - Rename column
employees.withColumnRenamed("salary", "annual_salary").show()

In [None]:
# ============================================
# Aggregations
# ============================================

# groupBy() with aggregation functions
dept_stats = employees.groupBy("department").agg(
    count("*").alias("employee_count"),
    avg("salary").alias("avg_salary"),
    spark_sum("salary").alias("total_salary")
)
dept_stats.show()

# orderBy() / sort() - Sort results
employees.orderBy(col("salary").desc()).show()

# Multiple sort columns
employees.orderBy("department", col("salary").desc()).show()

In [None]:
# ============================================
# Joins
# ============================================

# Department details
departments = spark.createDataFrame([
    ("Engineering", "Building A", 100),
    ("Marketing", "Building B", 50),
    ("Sales", "Building C", 75),
    ("HR", "Building D", 25)
], ["dept_name", "location", "budget_k"])

# Inner Join (default)
joined = employees.join(
    departments,
    employees.department == departments.dept_name,
    "inner"
)
joined.select("name", "department", "location").show()

# Left Outer Join
left_joined = employees.join(
    departments,
    employees.department == departments.dept_name,
    "left_outer"
)
left_joined.select("name", "department", "location").show()

# Join types: inner, left_outer, right_outer, full_outer, cross, semi, anti

---

## Spark SQL <a id='spark-sql'></a>

Spark SQL allows you to query structured data using SQL syntax. It provides:

- **Seamless integration** between SQL and DataFrame API
- **Catalyst Optimizer** for query optimization
- **Unified data access** across various data sources
- **Hive compatibility** for existing Hive workloads

In [None]:
# ============================================
# Spark SQL Examples
# ============================================

# Register DataFrame as temporary view
employees.createOrReplaceTempView("employees")
departments.createOrReplaceTempView("departments")

# Basic SQL query
result = spark.sql("""
    SELECT name, department, salary
    FROM employees
    WHERE salary > 70000
    ORDER BY salary DESC
""")
result.show()

In [None]:
# Aggregation query
dept_summary = spark.sql("""
    SELECT 
        department,
        COUNT(*) as emp_count,
        ROUND(AVG(salary), 2) as avg_salary,
        MAX(salary) as max_salary,
        MIN(salary) as min_salary
    FROM employees
    GROUP BY department
    ORDER BY avg_salary DESC
""")
dept_summary.show()

In [None]:
# Join query
joined_data = spark.sql("""
    SELECT 
        e.name,
        e.department,
        e.salary,
        d.location,
        d.budget_k
    FROM employees e
    LEFT JOIN departments d ON e.department = d.dept_name
    ORDER BY e.salary DESC
""")
joined_data.show()

In [None]:
# Window functions
window_query = spark.sql("""
    SELECT 
        name,
        department,
        salary,
        RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank,
        AVG(salary) OVER (PARTITION BY department) as dept_avg_salary,
        salary - AVG(salary) OVER (PARTITION BY department) as salary_diff_from_avg
    FROM employees
    ORDER BY department, dept_rank
""")
window_query.show()
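
For readers new to window functions, the semantics of `RANK() OVER (PARTITION BY ... ORDER BY ...)` can be spelled out in plain Python. This is a sketch of what the query computes, not of how Spark executes it; `rank_within_department` is a hypothetical helper, and the rows repeat the employee data used above.

```python
from collections import defaultdict

rows = [
    ("Alice", "Engineering", 75000), ("Bob", "Marketing", 65000),
    ("Charlie", "Engineering", 80000), ("Diana", "Sales", 70000),
    ("Eve", "Marketing", 72000), ("Frank", "Engineering", 90000),
    ("Grace", "Sales", 68000),
]

def rank_within_department(rows):
    """Return (name, department, salary, dept_rank, dept_avg) for every input row."""
    by_dept = defaultdict(list)
    for row in rows:
        by_dept[row[1]].append(row)                  # PARTITION BY department
    out = []
    for dept in sorted(by_dept):
        # ORDER BY salary DESC inside each partition
        members = sorted(by_dept[dept], key=lambda r: r[2], reverse=True)
        dept_avg = sum(r[2] for r in members) / len(members)
        rank, prev_salary = 0, None
        for position, (name, _, salary) in enumerate(members, start=1):
            if salary != prev_salary:
                rank = position                      # ties share a rank, then a gap follows
                prev_salary = salary
            out.append((name, dept, salary, rank, dept_avg))
    return out

ranked = rank_within_department(rows)
for row in ranked:
    print(row)
```

Unlike `GROUP BY`, the window version keeps one output row per input row and attaches the per-department values to each of them.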

In [None]:
# CTE (Common Table Expression)
cte_query = spark.sql("""
    WITH dept_avg AS (
        SELECT department, AVG(salary) as avg_sal
        FROM employees
        GROUP BY department
    )
    SELECT 
        e.name,
        e.department,
        e.salary,
        ROUND(da.avg_sal, 2) as dept_avg,
        CASE 
            WHEN e.salary > da.avg_sal THEN 'Above Average'
            ELSE 'Below Average'
        END as salary_category
    FROM employees e
    JOIN dept_avg da ON e.department = da.department
    ORDER BY e.department, e.salary DESC
""")
cte_query.show()

---

## Performance Tuning <a id='performance-tuning'></a>

### Key Performance Concepts

#### 1. Partitioning
- **Number of partitions** affects parallelism
- Rule of thumb: 2-4 partitions per CPU core
- Use `repartition()` to increase the partition count (full shuffle) or `coalesce()` to decrease it (avoids a full shuffle)
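
The rule of thumb is simple arithmetic. The helper below is hypothetical and the cluster size is made up for illustration; treat the result as a starting range to tune, not a fixed answer.

```python
def suggested_partition_range(num_executors, cores_per_executor):
    """Apply the 2-4 partitions-per-core rule of thumb to a cluster size."""
    total_cores = num_executors * cores_per_executor
    return 2 * total_cores, 4 * total_cores

# Hypothetical cluster: 5 executors with 4 cores each
low, high = suggested_partition_range(num_executors=5, cores_per_executor=4)
print(f"20 cores: between {low} and {high} partitions")
```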

#### 2. Caching and Persistence
- Cache frequently accessed data to avoid recomputation
- Choose appropriate storage level based on memory availability
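
Why caching helps can be shown by counting recomputations. `CountingDataset` is a hypothetical toy class: without caching, every action reruns the whole computation; with caching, the first action materializes the data and later actions reuse it.

```python
class CountingDataset:
    """Toy dataset that counts how often its expensive computation runs."""

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn
        self._cached = None
        self._use_cache = False
        self.compute_calls = 0

    def cache(self):
        self._use_cache = True         # marking is lazy: nothing is computed here
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.compute_calls += 1
        rows = self._compute_fn()
        if self._use_cache:
            self._cached = rows
        return rows

    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())

def expensive():
    return [x * x for x in range(1000)]

uncached = CountingDataset(expensive)
uncached.count()
uncached.total()

cached = CountingDataset(expensive).cache()
cached.count()       # first action fills the cache
cached.total()       # second action reuses it

print(uncached.compute_calls, cached.compute_calls)
```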

#### 3. Broadcast Variables
- Send small read-only data to all executors
- Useful for small lookup tables in joins
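
The idea behind broadcasting a lookup table can be sketched as a local join. Everything here is illustrative: `broadcast_join` is a hypothetical helper, the partition layout is assumed, and the `Zed`/`Legal` row is invented to show an unmatched key. Each partition of the large side joins against its own full copy of the small table, so no large-side record has to move.

```python
def broadcast_join(large_partitions, small_table):
    """Join each partition locally against a full copy of the small table."""
    joined = []
    for part in large_partitions:
        lookup = dict(small_table)     # each "executor" gets its own read-only copy
        joined.append([
            (name, dept, lookup[dept])
            for name, dept in part
            if dept in lookup          # inner-join semantics: drop unmatched rows
        ])
    return joined

employee_partitions = [
    [("Alice", "Engineering"), ("Bob", "Marketing")],
    [("Diana", "Sales"), ("Zed", "Legal")],      # "Legal" has no match
]
locations = {"Engineering": "Building A", "Marketing": "Building B", "Sales": "Building C"}

result = broadcast_join(employee_partitions, locations)
print(result)
```

The output keeps the same number of partitions as the large input, which is the point: the small table travels to the data rather than the data being reshuffled by key.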

#### 4. Avoid Shuffles
- Shuffles are expensive (network I/O, disk I/O)
- Use `reduceByKey()` instead of `groupByKey()`
- Use broadcast joins for small tables
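
The advantage of `reduceByKey()` over `groupByKey()` comes from combining values before the shuffle. The sketch below counts how many records would cross the network in each case. It is a toy model with hypothetical helper names, and it assumes the sales data from the earlier example is split into two partitions.

```python
from collections import defaultdict

def shuffled_records_group_by_key(partitions):
    """groupByKey-style: every (key, value) record crosses the shuffle."""
    return sum(len(part) for part in partitions)

def shuffled_records_reduce_by_key(partitions):
    """reduceByKey-style: values are pre-combined per partition before the shuffle."""
    total = 0
    for part in partitions:
        combined = defaultdict(int)
        for key, value in part:
            combined[key] += value      # map-side combine
        total += len(combined)          # one record per distinct key per partition
    return total

sales = [
    [("Electronics", 1000), ("Clothing", 500), ("Electronics", 1500)],
    [("Food", 300), ("Clothing", 700), ("Food", 400)],
]

group_count = shuffled_records_group_by_key(sales)
reduce_count = shuffled_records_reduce_by_key(sales)
print(group_count, reduce_count)
```

The gap grows with the number of repeated keys per partition, so the saving is largest for low-cardinality keys on large datasets.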

In [None]:
# ============================================
# Partitioning
# ============================================

rdd = sc.parallelize(range(100), 10)  # 10 partitions
print(f"Initial partitions: {rdd.getNumPartitions()}")

# Increase partitions (triggers shuffle)
rdd_repartitioned = rdd.repartition(20)
print(f"After repartition(20): {rdd_repartitioned.getNumPartitions()}")

# Decrease partitions (no shuffle - more efficient)
rdd_coalesced = rdd.coalesce(5)
print(f"After coalesce(5): {rdd_coalesced.getNumPartitions()}")

# DataFrame partitioning
print(f"\nDataFrame partitions: {employees.rdd.getNumPartitions()}")
df_repartitioned = employees.repartition(4, "department")  # Partition by column
print(f"After repartition by department: {df_repartitioned.rdd.getNumPartitions()}")

In [None]:
# ============================================
# Caching and Persistence
# ============================================
from pyspark import StorageLevel

# Cache the DataFrame. DataFrame.cache() persists at a memory-and-disk level;
# RDD.cache() defaults to MEMORY_ONLY instead.
employees.cache()

# Trigger caching with an action
employees.count()

# Check if cached
print(f"Is cached: {employees.is_cached}")

# Persist with specific storage level
# StorageLevel options documented for PySpark include:
# - MEMORY_ONLY: Keep partitions in memory; recompute those that do not fit
# - MEMORY_AND_DISK: Spill to disk if memory is insufficient
# - DISK_ONLY: Store only on disk
# - MEMORY_ONLY_2, MEMORY_AND_DISK_2: Same as above, replicated on two nodes
# Note: the *_SER levels (MEMORY_ONLY_SER, MEMORY_AND_DISK_SER) are Scala/Java
# levels. In Python, stored objects are always serialized with pickle, so
# PySpark's documented levels omit them.

# Example: persist with disk spillover
# df.persist(StorageLevel.MEMORY_AND_DISK)

# Unpersist to free memory
employees.unpersist()
print(f"Is cached after unpersist: {employees.is_cached}")

In [None]:
# ============================================
# Broadcast Variables
# ============================================
from pyspark.sql.functions import broadcast

# RDD broadcast example
lookup_dict = {"Engineering": "ENG", "Marketing": "MKT", "Sales": "SAL"}
broadcast_lookup = sc.broadcast(lookup_dict)

# Use broadcast variable in transformation
def map_department(emp):
    dept_code = broadcast_lookup.value.get(emp[2], "UNK")
    return (emp[0], emp[1], emp[2], dept_code, emp[3])

emp_rdd = employees.rdd
# take(3) fetches only three rows instead of collecting everything to the driver
result = emp_rdd.map(map_department).take(3)
print("With broadcast lookup:", result)

In [None]:
# DataFrame broadcast join (for small tables)
# This avoids shuffling the large table

# Hint Spark to broadcast the smaller table
broadcast_join = employees.join(
    broadcast(departments),
    employees.department == departments.dept_name
)

# Explain the query plan
print("Query Plan with Broadcast Join:")
broadcast_join.explain()

In [None]:
# ============================================
# Performance Configuration Tips
# ============================================

# Important Spark configurations for performance:
spark_configs = {
    "spark.sql.shuffle.partitions": "Default 200, reduce for smaller datasets",
    "spark.sql.autoBroadcastJoinThreshold": "Max size (bytes) for auto-broadcast, default 10MB",
    "spark.default.parallelism": "Default parallelism for RDD operations",
    "spark.executor.memory": "Memory per executor (e.g., '4g')",
    "spark.executor.cores": "Cores per executor",
    "spark.dynamicAllocation.enabled": "Enable dynamic executor allocation",
    "spark.sql.adaptive.enabled": "Adaptive Query Execution (AQE); on by default since Spark 3.2",
    "spark.serializer": "Use 'org.apache.spark.serializer.KryoSerializer' for better performance"
}

for config, description in spark_configs.items():
    current_value = spark.conf.get(config, "Not set")
    print(f"{config}:")
    print(f"  Current: {current_value}")
    print(f"  Tip: {description}\n")

In [None]:
# ============================================
# Query Execution Plans
# ============================================

# Understanding explain() output
complex_query = employees.filter(col("salary") > 70000) \
    .groupBy("department") \
    .agg(avg("salary").alias("avg_salary"))

# Simple explain
print("=== Simple Explain ===")
complex_query.explain()

# Extended explain (includes logical plans)
print("\n=== Extended Explain ===")
complex_query.explain(extended=True)

### Performance Best Practices Summary

| Area | Best Practice |
|------|---------------|
| **Partitioning** | 2-4 partitions per core; use `coalesce()` for reducing |
| **Caching** | Cache iteratively used DataFrames; use appropriate storage level |
| **Joins** | Broadcast small tables; filter before joining |
| **Shuffles** | Minimize; use `reduceByKey` over `groupByKey` |
| **Serialization** | Use Kryo serializer for better performance |
| **Data Format** | Use columnar formats (Parquet, ORC) |
| **Filtering** | Push filters as early as possible (predicate pushdown) |
| **AQE** | Keep Adaptive Query Execution enabled (available since Spark 3.0, on by default since 3.2) |

In [None]:
# Clean up
spark.stop()

---

## Key Takeaways <a id='key-takeaways'></a>

### Architecture
- Spark uses a **master-worker architecture** with Driver, Cluster Manager, and Executors
- **DAG execution model** enables optimization and fault tolerance
- Jobs are divided into **stages** (at shuffle boundaries) and **tasks** (per partition)

### Data Abstractions
- **RDDs**: Low-level, unstructured, full control but no optimization
- **DataFrames**: Structured, optimized by Catalyst, preferred for most use cases
- **Datasets**: Type-safe DataFrames (Scala/Java only)

### Operations
- **Transformations are lazy** - they build the DAG but don't execute
- **Actions trigger execution** - `collect()`, `count()`, `saveAsTextFile()`, `write.save()`, etc.
- **Narrow transformations** (map, filter) don't require shuffle
- **Wide transformations** (groupBy, join) require shuffle - expensive!

### Spark SQL
- Provides **SQL interface** to structured data
- Supports **joins, aggregations, window functions, CTEs**
- Seamlessly integrates with DataFrame API

### Performance
- **Partition data appropriately** for parallelism
- **Cache frequently accessed data** to avoid recomputation
- **Use broadcast joins** for small lookup tables
- **Minimize shuffles** - they are often the most expensive part of a job
- **Use columnar formats** (Parquet, ORC) for storage efficiency
- **Keep AQE enabled** (Adaptive Query Execution; available since Spark 3.0, on by default since 3.2)

---

### Further Learning Resources
- [Apache Spark Official Documentation](https://spark.apache.org/docs/latest/)
- [Spark SQL Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)
- [Spark Performance Tuning](https://spark.apache.org/docs/latest/tuning.html)
- [Databricks Spark Knowledge Base](https://kb.databricks.com/)