# Spark Optimization Notebook

This notebook walks through practical examples and code snippets for optimizing Spark jobs, covering partitioning, shuffle operations, memory management, and Adaptive Query Execution (AQE).

## 1. Understanding Partitioning
Partitioning allows Spark to divide data into smaller, manageable chunks for parallel processing. Proper partitioning can improve task parallelism and reduce shuffling.

**Best Practices for Partitioning:**
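- Use `repartition(n)` to increase the number of partitions for large datasets. Note that it performs a full shuffle.
- Use `coalesce(n)` to decrease the number of partitions for smaller datasets. It merges existing partitions and avoids a full shuffle.
- Leverage data locality by keeping related rows in the same partition (for example, partitioning on the key used in later joins or aggregations) to minimize shuffle overhead.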

In [None]:
# Example: Adjusting Partitions
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("Optimization Example").getOrCreate()

# Create a large dataset
data = [(x, x**2) for x in range(1000000)]
df = spark.createDataFrame(data, ["number", "square"])

# Repartition into 100 partitions to increase parallelism (this triggers a full shuffle)
df = df.repartition(100)
print("Number of partitions after repartition:", df.rdd.getNumPartitions())

## 2. Diagnosing Performance Bottlenecks
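Use `explain()` to inspect a query's physical plan and identify inefficiencies such as excessive shuffling; each `Exchange` operator in the plan marks a shuffle. Skewed partitions do not show up in the plan itself and are easier to spot in the per-task metrics of the Spark UI.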

In [None]:
# Analyze Execution Plan
df.groupBy("number").count().explain()

## 3. Optimizing Shuffle Operations
Shuffling involves data movement across partitions, which can be expensive. The following strategies can reduce shuffle overhead:

- On RDDs, use `reduceByKey` instead of `groupByKey` for aggregations; it combines values within each partition before the shuffle.
- Avoid wide transformations where possible.
- Apply **salting** to mitigate data skew.

In [None]:
# Example: Reduce Shuffle with reduceByKey
rdd = spark.sparkContext.parallelize([(1, 2), (3, 4), (3, 6)])
reduced_rdd = rdd.reduceByKey(lambda a, b: a + b)
print(reduced_rdd.collect())

### Mitigating Data Skew with Salting
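Data skew occurs when a few keys hold most of the rows, so a handful of tasks do most of the work. Salting adds an extra component (the salt) to the key so that rows sharing a hot key are spread across several partitions.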

In [None]:
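# Example: Salting for Skew Mitigation
from pyspark.sql.functions import col, lit, concat, floor, rand

# Add a random salt in [0, 10) so rows that share a key are spread over up to 10 salted keys.
# A salt derived from the key itself (e.g. number % 10) would map every row of a key
# to the same salted key and would not break up a hot key.
salted_df = df.withColumn("salt", floor(rand() * 10).cast("string"))
salted_df = salted_df.withColumn("salted_key", concat(col("number"), lit("_"), col("salt")))
salted_df.show(5)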

## 4. Adaptive Query Execution (AQE)
Adaptive Query Execution dynamically optimizes execution plans at runtime, addressing issues such as skewed partitions and suboptimal join strategies.

In [None]:
# Enable AQE (already on by default since Spark 3.2)
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Execute a query with AQE enabled
df.groupBy("number").count().show()

## 5. Memory Management
Efficient memory management is critical for Spark performance. Best practices include:

- Adjusting executor and driver memory (`spark.executor.memory`, `spark.driver.memory`).
- Caching intermediate results that are reused, to avoid recomputation.
- Using `persist()` with an explicit storage level when the default used by `cache()` is not suitable.

In [None]:
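# Example: Caching DataFrames
from pyspark import StorageLevel

cached_df = df.cache()  # cache() uses the default storage level
cached_df.count()  # An action triggers the actual caching

# Persist with a specific storage level (DISK_ONLY is just an example).
# Unpersist first: Spark keeps the existing level if the DataFrame is already cached.
df.unpersist()
persisted_df = df.persist(StorageLevel.DISK_ONLY)
persisted_df.count()  # Materialize with the new storage level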

## 6. Summary and Best Practices
- **Partitioning**: Adjust partitions for parallelism and data locality.
- **Shuffling**: Avoid unnecessary wide transformations and use salting to mitigate skew.
- **AQE**: Enable dynamic query optimization for runtime adjustments.
- **Memory Management**: Use caching and persisting for efficient resource utilization.