# Lesson 12 - Performance Tuning Techniques


---

**Technical Notes: PySpark Performance Tuning Fundamentals**

**Objective:** These notes provide practical techniques and conceptual understanding for optimizing the performance of PySpark applications. Effective tuning is crucial for processing large datasets efficiently, reducing job completion times, and minimizing resource consumption in distributed environments. We will focus on optimizing data reads, mitigating shuffle costs, and utilizing diagnostic tools like `explain()` alongside caching strategies.

---

**1. Optimizing Data Input: Reading Data Efficiently**

*   **Theory:**
    The initial phase of reading data is often a significant bottleneck in Spark jobs. Optimizing this step involves choosing appropriate file formats, defining schemas correctly, and leveraging data layout strategies like partitioning and predicate pushdown.

    *   **File Formats:**
        *   **Text-Based (CSV, JSON):** Convenient but generally slower. Spark needs to parse entire rows even if only a few columns are needed. Schema inference can be slow and unreliable. Limited predicate pushdown capabilities.
        *   **Columnar (Parquet, ORC):** **Highly recommended** for analytics workloads. Store data column-wise, enabling:
            *   **Column Pruning:** Spark reads only the data for columns referenced in the query, significantly reducing I/O.
            *   **Efficient Compression:** Often achieve better compression ratios, further reducing I/O.
            *   **Predicate Pushdown:** Filters (predicates) in the query can often be pushed down to the file reading layer, skipping entire chunks (row groups in Parquet) of data that don't match the filter, drastically reducing data scanned.
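            *   **Schema Evolution:** The schema is stored as metadata inside the files, so no inference pass is needed, and files written with different but compatible schemas (for example, with added columns) can be read together via the `mergeSchema` option.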
    *   **Schema Definition:**
        *   **`inferSchema=True`:** Convenient for exploration, but requires an extra pass over (part of) the data, can be slow, and may infer incorrect types, leading to runtime errors or suboptimal plans. **Avoid in production.**
        *   **Manual Schema:** Providing an explicit schema (`StructType`) is faster (no inference pass) and safer (ensures data type consistency). It allows Spark to optimize read operations more effectively.
    *   **Storage Partitioning:**
        *   Organizing data in storage (e.g., HDFS, S3) into directories based on the values of one or more columns (e.g., `/data/sales/year=2023/month=10/`).
        *   When queries filter on partition columns (e.g., `WHERE year = 2023 AND month = 10`), Spark can prune entire directories, reading only the relevant data partitions. This drastically reduces the amount of data listed and scanned. Choose partition columns with low-to-moderate cardinality that are frequently used in filters.
    *   **Predicate Pushdown:**
        *   The ability of Spark (and the underlying data source/format) to apply filter conditions directly at the data reading stage *before* data is loaded into Spark executors' memory.
        *   Columnar formats like Parquet are particularly effective at this, using internal metadata (min/max statistics per column chunk) to skip reading irrelevant data blocks.
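
To make the two levels of skipping concrete, here is a minimal pure-Python model of a scan planner. It is a sketch of the idea only, not Spark's or Parquet's implementation: `RowGroup`, `partition_values`, `plan_scan`, and the file names are hypothetical, and real row-group statistics are kept per column in the Parquet footer.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for one Parquet row group and its statistics."""
    path: str
    min_quantity: int
    max_quantity: int


def partition_values(path: str) -> dict:
    """Parse Hive-style `key=value` directory segments out of a file path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


def plan_scan(row_groups, sale_date: str, min_quantity: int):
    """Return the row groups that must be read for:
    WHERE sale_date = <sale_date> AND quantity >= <min_quantity>
    """
    selected = []
    for rg in row_groups:
        # Level 1: partition pruning, decided from the directory name alone.
        if partition_values(rg.path).get("sale_date") != sale_date:
            continue
        # Level 2: statistics-based skipping. If the largest quantity in the
        # row group is below the threshold, no row inside it can match.
        if rg.max_quantity < min_quantity:
            continue
        selected.append(rg)
    return selected


# Hypothetical layout: one file per row group, two partitions.
row_groups = [
    RowGroup("sales/sale_date=2023-10-26/part-0.parquet", 50, 100),
    RowGroup("sales/sale_date=2023-10-27/part-0.parquet", 10, 60),
    RowGroup("sales/sale_date=2023-10-27/part-1.parquet", 75, 200),
]

to_read = plan_scan(row_groups, sale_date="2023-10-27", min_quantity=70)
print([rg.path for rg in to_read])
```

Only one of the three row groups survives: the first is dropped by its directory name, the second by its max statistic. Neither decision required reading any row data.
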

*   **Code Example: Demonstrating Partitioning and Schema**

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType
    import os
    import shutil

    spark = SparkSession.builder \
        .appName("EfficientReading") \
        .master("local[*]") \
        .getOrCreate()

    # --- Setup: Create Sample Data and Write Partitioned Parquet ---
    data = [
        ("P101", "Widget", 100, "2023-10-26"),
        ("P102", "Gadget", 50, "2023-10-26"),
        ("P201", "Doohickey", 200, "2023-10-27"),
        ("P103", "Widget Pro", 75, "2023-10-27")
    ]
    # Define Schema Manually - crucial for performance and reliability
    schema = StructType([
        StructField("product_id", StringType(), True),
        StructField("name", StringType(), True),
        StructField("quantity", IntegerType(), True),
        StructField("sale_date", StringType(), True) # Read as string initially for partitioning
    ])

    initial_df = spark.createDataFrame(data, schema)

    # Define the output path and partition column
    output_path = "partitioned_sales.parquet"
    partition_col = "sale_date"

    print(f"--- Writing Partitioned Parquet Data to: {output_path} by '{partition_col}' ---")
    initial_df.write \
        .mode("overwrite") \
        .partitionBy(partition_col) \
        .parquet(output_path)

    print("Listing created partition directories:")
    # Verify directory structure (e.g., sale_date=2023-10-26/, sale_date=2023-10-27/)
    if os.path.exists(output_path):
        for item in os.listdir(output_path):
            if os.path.isdir(os.path.join(output_path, item)) and "=" in item:
                print(f"- {item}")

    # --- Reading Partitioned Data Efficiently ---
    # Define schema for reading (correct type for date now)
    read_schema = StructType([
        StructField("product_id", StringType(), True),
        StructField("name", StringType(), True),
        StructField("quantity", IntegerType(), True),
        StructField("sale_date", DateType(), True) # Read partition col with correct type
    ])

    print(f"\n--- Reading all data from: {output_path} ---")
    all_sales_df = spark.read.schema(read_schema).parquet(output_path)
    all_sales_df.show()
    print("Physical Plan (All Data):")
    all_sales_df.explain() # Note the PartitionFilters: [], PushedFilters: [] initially

    print(f"\n--- Reading data with filter on partition column ('sale_date' = '2023-10-27') ---")
    filtered_sales_df = spark.read.schema(read_schema).parquet(output_path) \
        .filter("sale_date = '2023-10-27'") # Written after the read, but Catalyst pushes it into the scan
        # Alternative: read the partition directory directly. Set `basePath` so that
        # `sale_date` is still discovered as a partition column:
        # filtered_sales_df = spark.read.option("basePath", output_path).schema(read_schema).parquet(f"{output_path}/sale_date=2023-10-27")

    filtered_sales_df.show()
    print("Physical Plan (Filtered Data):")
    # CRITICAL: Observe the Physical Plan output for 'PartitionFilters'
    filtered_sales_df.explain()
    # Expect PartitionFilters: [isnotnull(sale_date#...), (sale_date#... = 19657)]
    # This shows Spark will only scan the relevant partition directory.

    # Clean up
    if os.path.exists(output_path):
        shutil.rmtree(output_path)

    spark.stop()
    ```

*   **Explanation:**
    *   We first create sample data and write it using `.partitionBy("sale_date")`. This creates subdirectories like `sale_date=2023-10-26/` and `sale_date=2023-10-27/` within `partitioned_sales.parquet/`.
    *   A `read_schema` is defined manually, specifying the correct `DateType` for `sale_date`.
    *   When reading *all* data, `explain()` shows no partition filters applied initially.
    *   When reading and then applying `.filter("sale_date = '2023-10-27'")`, the `explain()` output is key. Look for the `PartitionFilters` section in the `FileScan parquet` part of the plan. It will indicate that Spark has pushed the `sale_date` filter down and will *only* scan the `sale_date=2023-10-27` directory, skipping the other partition entirely. This is a massive I/O saving on large datasets.
    *   Using columnar formats like Parquet also enables predicate pushdown on non-partition columns (seen as `PushedFilters` in `explain()`), further reducing data read *within* the selected files/partitions.
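
The integer `19657` in the expected `PartitionFilters` comment is not arbitrary: Spark represents `DateType` values internally as days since 1970-01-01. A quick standard-library check of that arithmetic (depending on the Spark version, the plan may print the literal as a date rather than as this integer):

```python
from datetime import date

# DateType values are stored as the number of days since the Unix epoch.
epoch = date(1970, 1, 1)
days = (date(2023, 10, 27) - epoch).days
print(days)  # 19657
```
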

*   **Use Case:** Essential for Data Lake architectures, large-scale ETL pipelines, and any scenario involving querying large, structured datasets where filtering is common.

---

**2. Understanding and Minimizing Shuffles**

*   **Theory:**
    A **shuffle** is the process of redistributing data across partitions, often moving it between executors over the network. This occurs during operations that need to group or relate data with the same key from different partitions.

    **Why Shuffles are Expensive:**
    *   **Network I/O:** Serializing and sending data over the network is slow.
    *   **Disk I/O:** Intermediate shuffle data is often written to disk on executors.
    *   **Serialization/Deserialization:** CPU overhead for converting objects to/from byte streams.
    *   **Garbage Collection:** Increased memory pressure on executors can lead to more frequent GC pauses.

    **Common Shuffle-Inducing Operations:**
    *   `groupByKey()`, `reduceByKey()`, `aggregateByKey()`
    *   `join()` (except for Broadcast Hash Joins)
    *   `distinct()`
    *   `repartition()`
    *   Window functions (rows are shuffled by the `partitionBy` keys; without `partitionBy`, all rows are moved into a single partition).
    *   `sort()`, `orderBy()` (unless data is already partitioned and sorted correctly).

    **Minimization Strategies:**
    1.  **Avoid `groupByKey()`:** When possible, use `reduceByKey()` or `aggregateByKey()`. These perform partial aggregation locally on each partition *before* shuffling, reducing the amount of data transferred. `groupByKey()` shuffles *all* values for a given key to a single executor, potentially causing OOM errors for keys with many values.
    2.  **Use Broadcast Joins:** As covered previously, broadcast small DataFrames in joins to avoid shuffling the large DataFrame. Check `explain()` for `BroadcastHashJoin`.
    3.  **Filter Early:** Reduce the amount of data entering shuffle operations by applying filters (`where`, `filter`) *before* joins or aggregations.
    4.  **Optimize Window Functions:** Specify `Window.partitionBy(...)` whenever the logic allows. A window without it pulls *all* rows into a single partition and removes parallelism. Where possible, use the same partitioning keys across consecutive windows, joins, and aggregations so Spark can reuse an existing distribution instead of shuffling again.
    5.  **Use `coalesce()` instead of `repartition()` for *reducing* partitions:** `coalesce(n)` merges existing partitions without a full shuffle, so it is cheap but can leave partitions unevenly sized. `repartition(n)` always incurs a full shuffle (round-robin for `repartition(n)`, hash partitioning when columns are passed), which makes it suitable for *increasing* parallelism or evening out skewed partitions. Use `repartition` intentionally, not by default.
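
The trade-off in strategy 5 can be seen in a small pure-Python model. This is a sketch under simplifying assumptions, not Spark's algorithm: the `coalesce` and `repartition` functions below are hypothetical stand-ins operating on plain lists, and Spark's real `coalesce` also takes data locality into account when grouping partitions.

```python
def coalesce(partitions, n):
    """Model of a shuffle-free coalesce: each parent partition is assigned
    whole to one of n children, so no record is routed individually."""
    children = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Contiguous grouping of parents, chosen here for simplicity.
        children[i * n // len(partitions)].extend(part)
    return children


def repartition(partitions, n):
    """Model of a full shuffle: every record is redistributed round-robin."""
    children = [[] for _ in range(n)]
    records = [rec for part in partitions for rec in part]
    for i, rec in enumerate(records):
        children[i % n].append(rec)
    return children


# Four input partitions with a deliberately skewed layout (sizes 9, 1, 1, 1).
parts = [list(range(9)), [100], [200], [300]]

print([len(p) for p in coalesce(parts, 2)])     # [10, 2]
print([len(p) for p in repartition(parts, 2)])  # [6, 6]
```

The coalesced result keeps the skew because whole partitions are merged; the repartitioned result is balanced, at the price of moving every record.
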

*   **Code Example: `reduceByKey` vs. `groupByKey` (RDD level illustration)**

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ShuffleAvoidance").master("local[*]").getOrCreate()
    sc = spark.sparkContext # Need SparkContext for RDD example

    data = ["apple", "banana", "apple", "orange", "banana", "apple"]
    rdd = sc.parallelize(data, 4) # Create an RDD with 4 partitions

    # Map to key-value pairs
    pairs_rdd = rdd.map(lambda x: (x, 1))
    print(f"Number of partitions: {pairs_rdd.getNumPartitions()}")

    # --- Using groupByKey (Causes more shuffling) ---
    print("\n--- groupByKey() - Conceptually Shuffles All Values ---")
    # Internally: All 'apple' values shuffled to one task, all 'banana' to another...
    grouped_rdd = pairs_rdd.groupByKey()
    # Now apply mapValues to sum - aggregation happens AFTER shuffle
    counts_grouped = grouped_rdd.mapValues(sum)
    print(f"Counts via groupByKey: {counts_grouped.collectAsMap()}")
    # Note: check Spark UI Stage details to see shuffle read/write for groupByKey

    # --- Using reduceByKey (More efficient - map-side combine) ---
    print("\n--- reduceByKey() - Combines Locally Before Shuffle ---")
    # Internally: Sums occur within each partition first, then results are shuffled & summed
    counts_reduced = pairs_rdd.reduceByKey(lambda a, b: a + b)
    print(f"Counts via reduceByKey: {counts_reduced.collectAsMap()}")
    # Note: check Spark UI Stage details - shuffle write may be smaller than groupByKey

    spark.stop()
    ```

*   **Explanation:**
    *   While both methods yield the same final count, their execution differs significantly.
    *   `groupByKey()` collects *all* values associated with each key (`apple -> [1, 1, 1]`, `banana -> [1, 1]`) onto a single executor task *after* a shuffle. The summation (`mapValues(sum)`) happens afterward. This can be very memory-intensive if keys have many values.
    *   `reduceByKey(lambda a, b: a + b)` performs the summation *locally within each original partition first*. For example, if one partition has `[("apple", 1), ("apple", 1)]`, it calculates `("apple", 2)` locally. Only these partially aggregated results (`("apple", 2)`) are shuffled, significantly reducing the data volume transferred and memory pressure on the receiving tasks.
    *   **DataFrame Equivalent:** While DataFrames abstract this, using aggregate functions like `count()`, `sum()`, `avg()` within a `groupBy().agg(...)` operation leverages similar optimized aggregation strategies (like partial aggregation) under the hood, generally avoiding the pitfalls of RDD `groupByKey`.
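
The difference in shuffled volume can be counted directly in a pure-Python model. This is an illustrative sketch, not Spark code: the partition contents and both function names are hypothetical, and "shuffle" here simply means the list of records that would cross the partition boundary.

```python
from collections import Counter, defaultdict

# Four hypothetical input partitions of (word, 1) pairs.
partitions = [
    [("apple", 1), ("apple", 1), ("banana", 1)],
    [("apple", 1), ("orange", 1)],
    [("banana", 1), ("banana", 1)],
    [("apple", 1)],
]


def shuffle_group_by_key(parts):
    """groupByKey-style: every record crosses the shuffle unchanged."""
    shuffled = [rec for part in parts for rec in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    # Aggregation happens only after all values have been moved.
    return {k: sum(v) for k, v in grouped.items()}, len(shuffled)


def shuffle_reduce_by_key(parts):
    """reduceByKey-style: combine within each partition, then shuffle."""
    shuffled = []
    for part in parts:
        local = Counter()
        for key, value in part:
            local[key] += value         # map-side combine
        shuffled.extend(local.items())  # one record per key per partition
    totals = Counter()
    for key, value in shuffled:
        totals[key] += value            # reduce-side merge
    return dict(totals), len(shuffled)


grouped_counts, grouped_records = shuffle_group_by_key(partitions)
reduced_counts, reduced_records = shuffle_reduce_by_key(partitions)

print(grouped_counts == reduced_counts)  # True
print(grouped_records, reduced_records)  # 8 6
```

On this tiny input the saving is 8 records versus 6; the gap widens as the number of repeated keys per partition grows, since the combined output is bounded by the number of distinct keys per partition.
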

*   **Use Case:** Optimizing aggregations, avoiding OOM errors during large-scale grouping operations, improving join performance via broadcasting.

---

**3. Leveraging Caching and Persistence (`cache()`, `persist()`)**

*   **Theory:**
    Spark DataFrames (and RDDs) are computed lazily and are typically recomputed each time an action is called on them or their derivatives. If a specific DataFrame is used multiple times in a workflow (e.g., in iterative algorithms, interactive analysis, or branching logic), recomputing it can be wasteful.

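    *   **`cache()`:** For DataFrames, a shorthand for `persist()` with the default storage level, which is memory-and-disk (`MEMORY_AND_DISK`, reported as `MEMORY_AND_DISK_DESER` in recent Spark 3.x releases). For RDDs, `cache()` means `persist(StorageLevel.MEMORY_ONLY)`. Either way, it requests Spark to store the computed partitions on the executors.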
    *   **`persist(storageLevel)`:** Provides more control over *how* the DataFrame is stored. Common `StorageLevel`s (from `pyspark.StorageLevel`):
        *   `MEMORY_ONLY`: Fast access, but partitions might be evicted if memory is insufficient (requiring recomputation).
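        *   `MEMORY_ONLY_SER`: Stores *serialized* objects in memory. More space-efficient than `MEMORY_ONLY` but requires CPU time for deserialization on access. This level belongs to the Scala/Java API; PySpark always serializes Python objects, and current PySpark releases do not expose the `_SER` constants.
        *   `MEMORY_AND_DISK`: Stores partitions in memory; spills excess partitions to disk. Slower access for disk-spilled partitions but more robust to memory pressure.
        *   `MEMORY_AND_DISK_SER`: Serialized version of `MEMORY_AND_DISK` (Scala/Java API only, for the same reason).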
        *   `DISK_ONLY`: Stores partitions only on disk. Slowest access, but reliable for very large DataFrames that don't fit in memory and whose recomputation is extremely costly.
        *   Replicated versions (`_2` suffix): Store partitions on two nodes for better fault tolerance (can survive one node failure without recomputation).
    *   **Lazy Caching:** Calling `cache()` or `persist()` marks the DataFrame for caching, but it's only actually populated when an action is executed on it for the first time.
    *   **`unpersist()`:** **Crucial** to manually release the cached data from memory/disk when it's no longer needed, freeing up resources for subsequent stages.
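
The lazy semantics of `persist()` and `unpersist()` can be modeled in a few lines of plain Python. This is a toy sketch, not how Spark stores blocks: `LazyDataset` and its `compute_calls` counter are hypothetical and exist only to show *when* the lineage is executed.

```python
class LazyDataset:
    """Hypothetical toy model of a lazily evaluated, optionally persisted dataset."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the rows
        self._persist = False     # set by persist(); no work happens yet
        self._cached = None       # populated by the first action
        self.compute_calls = 0    # how many times the lineage was executed

    def persist(self):
        self._persist = True      # only a marker: caching is lazy
        return self

    def unpersist(self):
        self._persist = False
        self._cached = None       # release the stored rows
        return self

    def _rows(self):
        if self._cached is not None:
            return self._cached
        self.compute_calls += 1
        rows = self._compute()
        if self._persist:
            self._cached = rows
        return rows

    def count(self):              # an "action"
        return len(self._rows())


ds = LazyDataset(lambda: [x * x for x in range(1000)])

ds.count()
ds.count()
print(ds.compute_calls)  # 2: not persisted, so each action recomputes

ds.persist()
print(ds.compute_calls)  # 2: persist() alone triggers no computation
ds.count()
ds.count()
print(ds.compute_calls)  # 3: first action computes and stores, second reuses

ds.unpersist()
ds.count()
print(ds.compute_calls)  # 4: cache released, so the lineage runs again
```
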

*   **When to Cache:**
    *   DataFrames used repeatedly in iterative algorithms (e.g., machine learning training loops).
    *   Intermediate DataFrames accessed multiple times in a complex workflow or interactive session.
    *   When a computationally expensive DataFrame needs to be queried multiple times.

*   **Code Example: Caching an Intermediate DataFrame**

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, upper
    from pyspark.storagelevel import StorageLevel
    import time

    spark = SparkSession.builder \
        .appName("CachingExample") \
        .master("local[*]") \
        .getOrCreate()

    # Simulate reading a somewhat large DataFrame
    # In reality, this would be spark.read.parquet(...) or similar
    data_range = spark.range(0, 10 * 1000 * 1000).withColumnRenamed("id", "value") # 10M rows

    # Define a moderately expensive transformation.
    # This function only builds the query plan: its body (print and sleep included)
    # runs once, when it is called, not each time an action executes the plan.
    def complex_transform(df):
        print("--> Building complex_transform plan...")
        time.sleep(1) # Driver-side delay at definition time; not part of the Spark job
        return df.withColumn("value_str", col("value").cast("string")) \
                 .withColumn("value_upper", upper(col("value_str")))

    # Apply the transformation
    transformed_df = complex_transform(data_range)

    # --- Scenario 1: Without Caching ---
    print("\n--- Running actions without caching ---")
    start = time.time()
    count1 = transformed_df.filter(col("value") < 1000).count() # Action 1
    duration1 = time.time() - start
    print(f"Action 1 Count: {count1}, Duration: {duration1:.2f}s") # Spark executes the lineage

    start = time.time()
    count2 = transformed_df.filter(col("value") > 9999000).count() # Action 2
    duration2 = time.time() - start
    print(f"Action 2 Count: {count2}, Duration: {duration2:.2f}s") # Spark re-executes the lineage from scratch

    # --- Scenario 2: With Caching ---
    print("\n--- Caching the transformed DataFrame ---")
    # Cache using MEMORY_AND_DISK for robustness
    transformed_df.persist(StorageLevel.MEMORY_AND_DISK)
    # transformed_df.cache() # Shorthand for persist() with the DataFrame default level (memory and disk)

    print("\n--- Running actions with caching ---")
    start = time.time()
    # First action triggers computation AND caching
    count3 = transformed_df.filter(col("value") < 1000).count() # Action 3
    duration3 = time.time() - start
    print(f"Action 3 Count (computes & caches): {count3}, Duration: {duration3:.2f}s") # lineage executed once, result stored

    start = time.time()
    # Second action should read from cache (memory or disk)
    count4 = transformed_df.filter(col("value") > 9999000).count() # Action 4
    duration4 = time.time() - start
    print(f"Action 4 Count (reads from cache): {count4}, Duration: {duration4:.2f}s") # Should be faster

    # IMPORTANT: Unpersist when done
    transformed_df.unpersist()
    print("\n--- DataFrame unpersisted ---")

    spark.stop()
    ```

*   **Explanation:**
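    *   We simulate an expensive transformation (`complex_transform`). The Python function itself runs only once, when `transformed_df` is defined, so its `time.sleep(1)` falls outside the timed sections. What Spark re-executes on each action is the *lineage* the function built.
    *   Without caching, each action (`count()`) re-executes the `range()` and transformation lineage, so the durations for Action 1 and Action 2 should be similar.
    *   With `persist(StorageLevel.MEMORY_AND_DISK)`, the first action (Action 3) computes the DataFrame and also stores the resulting partitions in memory or on disk. Its duration includes computation plus caching overhead, and it can be slower than the uncached actions because caching materializes every column, whereas an uncached `count()` lets Catalyst skip columns it does not need.
    *   The second action (Action 4) finds the persisted data and reads directly from the cache instead of recomputing the lineage. Its duration should be clearly shorter than Action 3; with a lineage as cheap as this one, the gain over the uncached runs may be modest.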
    *   `unpersist()` is called to free up the storage resources used by the cache.

*   **Use Case:** Iterative machine learning, interactive data analysis in notebooks, complex ETL pipelines where intermediate results are reused.

---

**4. Using `explain()` for Diagnosis**

*   **Theory:**
    The `explain()` method is an indispensable tool for understanding how Spark translates your DataFrame operations into a physical execution plan that runs on the cluster. Analyzing this plan helps identify potential bottlenecks like unnecessary shuffles, missed broadcast joins, or inefficient data scans.

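    *   **Plan Stages:** `explain()` prints only the physical plan by default. `explain(extended=True)` (or `explain(mode="extended")` in Spark 3.0+) shows all four: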
        *   **Parsed Logical Plan:** Initial plan representing the user's code logic.
        *   **Analyzed Logical Plan:** Schema information is resolved, attributes are bound.
        *   **Optimized Logical Plan:** Catalyst optimizer applies rules (predicate pushdown, projection pruning, etc.) to refine the logical plan.
        *   **Physical Plan:** The **most important** plan for performance tuning. It shows the physical operators Spark will actually run: join algorithms (e.g., `BroadcastHashJoin`, `SortMergeJoin`), data exchanges (shuffles), and scan methods.
    *   **What to Look For in the Physical Plan:**
        *   **Scan Operations:** `FileScan parquet`, `Scan JDBC`, etc. Look for `PartitionFilters` (shows partition pruning) and `PushedFilters` (shows predicate pushdown).
        *   **Join Operations:** `BroadcastHashJoin` (good if intended), `SortMergeJoin`, `ShuffledHashJoin` (indicate shuffling).
        *   **Shuffle Indicators:** `Exchange hashpartitioning(...)`, `Exchange RoundRobinPartitioning(...)`, etc., clearly mark shuffle boundaries between stages.
        *   **Aggregations:** `HashAggregate` (often indicates efficient map-side aggregation combined with final aggregation).
        *   **Whole-Stage Code Generation:** Operators are often prefixed with `*(N)` (e.g., `*(1) Project`, `*(2) Filter`). The asterisk indicates that Tungsten's whole-stage code generation is active for that operator, and `N` identifies the generated-code stage it belongs to. This is generally good for CPU performance.
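
The checklist above can be turned into a rough text scan. The following is a minimal sketch, not a Spark API: `summarize_plan` and `INDICATORS` are hypothetical names, the patterns only cover the operator names listed above, and `sample_plan` is a hand-written, abbreviated stand-in for plan text rather than captured Spark output.

```python
import re

# One regular expression per indicator from the checklist.
INDICATORS = {
    # \b keeps BroadcastExchange from being counted as a shuffle Exchange.
    "shuffle_exchanges": r"\bExchange\b",
    "broadcast_joins": r"\bBroadcastHashJoin\b",
    "shuffle_joins": r"\b(?:SortMergeJoin|ShuffledHashJoin)\b",
    # Only non-empty filter lists count as evidence of pruning/pushdown.
    "partition_filters": r"PartitionFilters: \[[^\]]+\]",
    "pushed_filters": r"PushedFilters: \[[^\]]+\]",
}


def summarize_plan(plan_text: str) -> dict:
    """Count how often each indicator appears in a physical-plan string."""
    return {
        name: len(re.findall(pattern, plan_text))
        for name, pattern in INDICATORS.items()
    }


# Hand-written, abbreviated illustration of a shuffle join plan (NOT real output).
sample_plan = """
SortMergeJoin [dept_id], [dept_id], Inner
:- Sort [dept_id ASC]
:  +- Exchange hashpartitioning(dept_id, 200)
:     +- FileScan parquet [name,dept_id] PartitionFilters: [], PushedFilters: [IsNotNull(dept_id)]
+- Sort [dept_id ASC]
   +- Exchange hashpartitioning(dept_id, 200)
      +- FileScan parquet [dept_id,dept_name] PartitionFilters: [], PushedFilters: [IsNotNull(dept_id)]
"""

summary = summarize_plan(sample_plan)
print(summary)
```

A scan like this is only a first-pass triage aid: two shuffle exchanges and one shuffle join in the summary are a prompt to read the full plan, not a replacement for doing so.
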

*   **Code Example: Analyzing a Simple Query Plan**

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, broadcast

    spark = (
        SparkSession.builder
        .appName("ExplainExample")
        .master("local[*]")
        # Disable auto-broadcast for the demo so the default join shuffles
        .config("spark.sql.autoBroadcastJoinThreshold", "-1")
        .getOrCreate()
    )

    # Sample DataFrames from join lesson
    employees_data = [(1, "Alice", 101), (2, "Bob", 102), (3, "Charlie", 101), (5,"Eve", 104)]
    employees_df = spark.createDataFrame(employees_data, ["emp_id", "name", "dept_id"])
    departments_data = [(101, "Eng"), (102, "Sales"), (103, "HR")]
    departments_df = spark.createDataFrame(departments_data, ["dept_id", "dept_name"])

    # --- Scenario 1: Default Join (Likely Shuffle) ---
    print("\n--- Plan for Default Join (Shuffle Expected) ---")
    joined_df_shuffle = employees_df.join(departments_df, on="dept_id", how="inner") \
                                   .filter(col("emp_id") > 1) \
                                   .select("name", "dept_name")
    joined_df_shuffle.explain()
    # Look for: SortMergeJoin or ShuffledHashJoin, Exchange operators

    # --- Scenario 2: Broadcast Join Hint ---
    print("\n--- Plan for Broadcast Join ---")
    joined_df_broadcast = employees_df.join(broadcast(departments_df), on="dept_id", how="inner") \
                                     .filter(col("emp_id") > 1) \
                                     .select("name", "dept_name")
    joined_df_broadcast.explain(extended=True) # Use extended=True for more details
    # Look for: BroadcastHashJoin, BroadcastExchange (for the small table)
    # Notice NO Exchange on the larger (employee) side for the join itself.

    spark.stop()
    ```

*   **Explanation:**
    *   We disable automatic broadcasting (`autoBroadcastJoinThreshold=-1`) to force a shuffle join initially.
    *   The `explain()` output for `joined_df_shuffle` will likely show `SortMergeJoin` (or `ShuffledHashJoin`) and associated `Exchange` operations, indicating data is being shuffled based on `dept_id`.
    *   In the second scenario, we explicitly add `broadcast(departments_df)`. The `explain()` output for `joined_df_broadcast` should now clearly show `BroadcastHashJoin`. You'll see a `BroadcastExchange` operation responsible for distributing the small `departments_df`, but the potentially large `employees_df` avoids a shuffle for the join itself.
    *   `explain(extended=True)` provides all plan levels (Parsed, Analyzed, Optimized, Physical).
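
Why the large side needs no shuffle under a broadcast join can be shown with a pure-Python model. This is a sketch of the idea, not Spark's operator: `broadcast_hash_join` is a hypothetical function, the two-partition layout of the employee rows is assumed for illustration, and a single dict stands in for the copy each executor would receive.

```python
def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join each partition of the large side against the full small side."""
    # "Broadcast": build one lookup table from the small side.
    lookup = {dept_id: dept_name for dept_id, dept_name in small_rows}
    joined = []
    for part in large_partitions:
        # Rows of the large side are probed where they already live;
        # none of them is moved to another partition.
        joined.append([
            (name, lookup[dept_id])
            for _, name, dept_id in part
            if dept_id in lookup
        ])
    return joined


# Same rows as the example above, split into two assumed partitions.
employees = [
    [(1, "Alice", 101), (2, "Bob", 102)],
    [(3, "Charlie", 101), (5, "Eve", 104)],
]
departments = [(101, "Eng"), (102, "Sales"), (103, "HR")]

result = broadcast_hash_join(employees, departments)
print(result)
# [[('Alice', 'Eng'), ('Bob', 'Sales')], [('Charlie', 'Eng')]]
```

The output keeps the input's partition structure, which is the point: only the small table travels. Eve is dropped because department 104 has no match in an inner join.
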

*   **Use Case:** Essential for *diagnosing* performance issues, *validating* optimizations (did my broadcast hint work? did filters get pushed down?), and understanding *how* Spark is executing a query. It's the primary tool for understanding the impact of code changes on execution strategy.

---

**Summary & Further Steps:**

Performance tuning in PySpark is an iterative process involving understanding the execution model, identifying bottlenecks, and applying appropriate optimization techniques. Key strategies include:

1.  **Optimize I/O:** Use efficient columnar formats (Parquet), define schemas manually, and leverage storage partitioning with predicate pushdown.
2.  **Minimize Shuffles:** Prefer shuffle-optimized operations (`reduceByKey`), use broadcast joins for small tables, filter data early, and use partitioning effectively in window functions.
3.  **Use Caching Wisely:** Persist intermediate DataFrames that are reused, choosing appropriate storage levels, and remember to `unpersist`.
4.  **Diagnose with `explain()`:** Regularly inspect physical execution plans to understand join strategies, shuffles, and filter application.

Beyond these, monitoring the **Spark UI** is crucial for observing stage/task durations, shuffle read/write sizes, data skew, GC time, and executor resource usage, providing further insights for targeted tuning.