# Batch Processing — Overview

## Purpose
Understand the fundamentals of batch processing, when to apply it, and how Apache Spark enables scalable data transformations on large datasets.

## Key Questions
1. What is batch processing and how does it differ from real-time processing?
2. When should you choose batch over streaming?
3. How does Apache Spark architecture enable distributed batch processing?
4. What are the core PySpark APIs for batch workloads?

---
## What Is Batch Processing?

**Batch processing** is a method of running high-volume, repetitive data jobs on a scheduled or on-demand basis. Data is collected over a period, then processed as a single unit (batch).

### Characteristics
| Aspect | Description |
|--------|-------------|
| **Latency** | Minutes to hours (not real-time) |
| **Volume** | Large datasets (GBs to PBs) |
| **Scheduling** | Cron, Airflow, or orchestrators |
| **Use Cases** | ETL, reporting, ML training, data warehousing |

### When to Use Batch Processing
- **Historical analysis**: Aggregate logs, sales, or events over days/weeks
- **Cost efficiency**: Process during off-peak hours
- **Complex transformations**: Joins, aggregations, and ML pipelines
- **No need for instant results**: Reports and dashboards that are refreshed periodically

---
## Apache Spark — Overview & Architecture

**Apache Spark** is a unified analytics engine for large-scale data processing with built-in modules for SQL, streaming, ML, and graph processing.

### Core Components
```
┌─────────────────────────────────────────────────────────┐
│                    Driver Program                       │
│  ┌─────────────┐                                        │
│  │ SparkContext│ ──► Cluster Manager (YARN/K8s/Mesos)   │
│  └─────────────┘                                        │
└─────────────────────────────────────────────────────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
    ┌──────────┐    ┌──────────┐    ┌──────────┐
    │ Executor │    │ Executor │    │ Executor │
    │  (Node)  │    │  (Node)  │    │  (Node)  │
    │ ┌──────┐ │    │ ┌──────┐ │    │ ┌──────┐ │
    │ │Tasks │ │    │ │Tasks │ │    │ │Tasks │ │
    │ └──────┘ │    │ └──────┘ │    │ └──────┘ │
    └──────────┘    └──────────┘    └──────────┘
```
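
The diagram can be approximated with a plain-Python analogy (this is not Spark itself): a "driver" function splits the data into partitions, a thread pool stands in for the executors and runs one task per partition, and the driver merges the partial results. The names `partition`, `sum_task`, and `run_job` are hypothetical, and a real cluster adds scheduling, shuffles, and fault tolerance that this sketch leaves out.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split a list into roughly equal, contiguous partitions."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_task(chunk):
    """One task: compute a partial result for a single partition."""
    return sum(chunk)


def run_job(data, num_executors=3):
    """The 'driver': plan partitions, hand out tasks, merge partial results."""
    partitions = partition(data, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_sums = list(pool.map(sum_task, partitions))
    return sum(partial_sums)


total = run_job(list(range(1, 101)))
print(total)  # 5050
```

Each call to `sum_task` sees only its own partition, which is what lets the work spread across machines once the pool is replaced by real executors.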

### Key Concepts
| Concept | Description |
|---------|-------------|
| **RDD** | Resilient Distributed Dataset — immutable, partitioned collection |
| **DataFrame** | Distributed collection with named columns (like a table) |
| **Transformation** | Lazy operations (map, filter, join) that define a computation |
| **Action** | Operations that trigger execution (collect, count, write) |
| **DAG** | Directed Acyclic Graph of stages built from transformations |
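
The split between lazy transformations and actions can be felt in plain Python with generators. This is an analogy under simplifying assumptions, not Spark's implementation: Spark also builds and optimizes a DAG, which generators do not. The `log` list and `read_orders` function are hypothetical helpers used only to show when work happens.

```python
log = []  # records when rows are actually read


def read_orders():
    """Source: yields (product, quantity, price) rows one at a time."""
    rows = [("Laptop", 1, 999.99), ("Mouse", 2, 29.99), ("Monitor", 2, 299.99)]
    for row in rows:
        log.append(f"read {row[0]}")
        yield row


# "Transformations": building the pipeline runs nothing yet.
orders = read_orders()
bulk = (row for row in orders if row[1] > 1)                  # like filter
totals = (round(qty * price, 2) for _, qty, price in bulk)    # like a computed column

work_before_action = list(log)  # still empty: no row has been read

# "Action": consuming the pipeline triggers the whole chain.
result = list(totals)           # like collect()

print(work_before_action)
print(result)
print(log)
```

Nothing is read until `list(totals)` runs, which mirrors how a chain of DataFrame transformations does no work until an action such as `count` or `write` is called.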

---
## PySpark Basics — Code Examples

Below are foundational PySpark patterns for batch processing.

In [None]:
# Initialize a Spark Session
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("BatchProcessingDemo") \
    .master("local[*]") \
    .getOrCreate()

print(f"Spark version: {spark.version}")

In [None]:
# Create a DataFrame from sample data
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Define schema
schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("customer_id", StringType(), False),
    StructField("product", StringType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("price", DoubleType(), True)
])

# Sample data
data = [
    ("ORD001", "C100", "Laptop", 1, 999.99),
    ("ORD002", "C101", "Mouse", 2, 29.99),
    ("ORD003", "C100", "Keyboard", 1, 79.99),
    ("ORD004", "C102", "Monitor", 2, 299.99),
    ("ORD005", "C101", "Laptop", 1, 999.99)
]

df = spark.createDataFrame(data, schema)
df.show()

In [None]:
# Transformations: filter, select, compute
from pyspark.sql.functions import col, sum as spark_sum, count

# Filter orders with quantity > 1
high_qty_orders = df.filter(col("quantity") > 1)
print("Orders with quantity > 1:")
high_qty_orders.show()

# Add a total column
df_with_total = df.withColumn("total", col("quantity") * col("price"))
print("Orders with total:")
df_with_total.show()

In [None]:
# Aggregations: Group by customer
customer_summary = df_with_total.groupBy("customer_id").agg(
    count("order_id").alias("num_orders"),
    spark_sum("total").alias("total_spent")
)

print("Customer spending summary:")
customer_summary.orderBy(col("total_spent").desc()).show()

In [None]:
# Reading and Writing Data (common batch patterns)

# Read from CSV (example path)
# df = spark.read.csv("data/orders.csv", header=True, inferSchema=True)

# Read from Parquet (columnar, efficient)
# df = spark.read.parquet("data/orders.parquet")

# Write to Parquet (partitioned by customer_id; the sample data has no date column)
# df_with_total.write \
#     .mode("overwrite") \
#     .partitionBy("customer_id") \
#     .parquet("output/orders_processed")

print("Common I/O patterns shown above (commented for demo)")

In [None]:
# Stop Spark session when done
spark.stop()
print("Spark session stopped.")

---
## Batch vs Stream Processing Comparison

| Dimension | Batch Processing | Stream Processing |
|-----------|------------------|-------------------|
| **Latency** | Minutes to hours | Milliseconds to seconds |
| **Data Scope** | Bounded (finite dataset) | Unbounded (continuous flow) |
| **Processing Model** | Process all at once | Process as events arrive |
| **Complexity** | Simpler (no state management) | Complex (windowing, watermarks) |
| **Fault Tolerance** | Retry failed tasks or rerun the job | Checkpointing; exactly-once where the engine and sinks support it |
| **Use Cases** | ETL, ML training, reports | Real-time alerts, dashboards |
| **Tools** | Spark, Hive, Flink (batch) | Kafka Streams, Flink, Spark Structured Streaming |
| **Cost** | Often cheaper (scheduled) | Higher (always-on infrastructure) |
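
The "Processing Model" row can be sketched in plain Python, independent of any engine. The batch function aggregates a bounded list in one pass once all the data is available, while the streaming class keeps running state and updates it per event. `batch_totals` and `StreamingTotals` are hypothetical names, and the event values are derived from the sample orders used earlier.

```python
from collections import defaultdict

# (customer_id, order_total) events derived from the sample orders.
events = [("C100", 999.99), ("C101", 59.98), ("C100", 79.99),
          ("C102", 599.98), ("C101", 999.99)]


def batch_totals(bounded_events):
    """Batch: the full dataset is available, so aggregate it in one pass."""
    totals = defaultdict(float)
    for customer, amount in bounded_events:
        totals[customer] += amount
    return dict(totals)


class StreamingTotals:
    """Stream: keep running state and update it as each event arrives."""

    def __init__(self):
        self.state = defaultdict(float)

    def on_event(self, customer, amount):
        self.state[customer] += amount
        return self.state[customer]  # an up-to-date answer after every event


batch_result = batch_totals(events)

stream = StreamingTotals()
for customer, amount in events:
    stream.on_event(customer, amount)

print(batch_result == dict(stream.state))  # True
```

Both approaches reach the same totals here. The difference is when an answer is available and how long state must be kept alive, which is where the latency and complexity rows of the table come from.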

### When to Choose Each
```
┌─────────────────────────────────────────────────────────────────┐
│  Choose BATCH when:              Choose STREAM when:           │
│  • Latency of hours is OK        • Need sub-second response    │
│  • Data arrives in bulk          • Data is continuous          │
│  • Complex joins/aggregations    • Simple event processing     │
│  • Cost optimization matters     • Real-time decisions needed  │
└─────────────────────────────────────────────────────────────────┘
```

---
## Takeaways

1. **Batch processing** handles large volumes of data efficiently when real-time results aren't required

2. **Apache Spark** provides a distributed computing framework with:
   - Lazy evaluation via transformations
   - Optimized execution through DAG scheduling
   - Unified APIs for SQL, ML, and graph processing

3. **PySpark essentials**:
   - `SparkSession` is the entry point
   - `DataFrame` API for structured data
   - Transformations are lazy; actions trigger execution

4. **Batch vs Stream** is a latency and complexity trade-off — use batch for cost-effective, high-volume historical processing

5. **Best practices**:
   - Use Parquet for efficient columnar storage
   - Partition data by common query dimensions
   - Cache intermediate results for iterative algorithms
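
To see why partitioning by a common query dimension helps, recall that Spark's `partitionBy` writes one `column=value` directory per distinct value, so a read that filters on that column can skip the other directories. The sketch below imitates that layout with CSV files in a temporary directory. It is a plain-Python illustration, not Spark's writer, and `write_partitioned` and `read_partition` are hypothetical helpers.

```python
import csv
import tempfile
from pathlib import Path

rows = [
    {"order_id": "ORD001", "customer_id": "C100", "product": "Laptop"},
    {"order_id": "ORD002", "customer_id": "C101", "product": "Mouse"},
    {"order_id": "ORD003", "customer_id": "C100", "product": "Keyboard"},
]


def write_partitioned(rows, root, key):
    """Write rows into <root>/<key>=<value>/part-0.csv, one directory per value."""
    for value in sorted({row[key] for row in rows}):
        part_dir = Path(root) / f"{key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        matching = [r for r in rows if r[key] == value]
        # The partition value lives in the directory name, not in the file.
        fields = [f for f in matching[0] if f != key]
        with open(part_dir / "part-0.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(matching)


def read_partition(root, key, value):
    """Read only the directory for one partition value; others are never opened."""
    with open(Path(root) / f"{key}={value}" / "part-0.csv", newline="") as fh:
        return list(csv.DictReader(fh))


with tempfile.TemporaryDirectory() as root:
    write_partitioned(rows, root, "customer_id")
    partitions = sorted(p.name for p in Path(root).iterdir())
    c100_orders = read_partition(root, "customer_id", "C100")

print(partitions)
print([order["order_id"] for order in c100_orders])
```

A high-cardinality key such as `customer_id` produces many small directories on real data, so columns like date or region are often a better fit for partitioning.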