**Performance Tuning**

# Indirect Performance Enhancements

Indirect enhancements come from setting configuration values or changing the runtime environment. They improve performance across Spark applications or across Spark jobs.

**Design Choices**

*  Language used for application development - Scala versus Java versus Python versus R
*  DataFrames versus SQL versus Datasets versus RDDs
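*  Object serialization - Kryo serialization is faster and more compact than the default Java serialization. To register your classes, use the SparkConf and pass in your classes: `conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))`
*  Dynamic allocation - set `spark.dynamicAllocation.enabled` to `true` so Spark can scale the number of executors up and down with the workload. See https://spark.apache.org/docs/latest/job-scheduling.html#configuration-and-setup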

**Scheduling**

Inside a Spark application (one SparkContext instance), multiple jobs can run in parallel if they are submitted from separate threads.

Example with the same SparkContext and different threads:


```
thread1 -> df1.count()

thread2 -> df2.write.parquet(...)
```
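
The thread-per-job pattern above can be sketched with Python's standard `ThreadPoolExecutor`. This is a minimal sketch, not a definitive implementation: the helper name `run_jobs_concurrently` and the `output_path` variable in the usage comment are assumptions made for illustration, and the callables stand in for Spark actions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_jobs_concurrently(jobs):
    """Run each zero-argument callable in its own thread and collect results.

    `jobs` maps a label to a callable. With Spark, each callable would wrap an
    action such as `df1.count` so that the jobs share one SparkContext.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = {name: pool.submit(fn) for name, fn in jobs.items()}
        # result() blocks until the job finishes and re-raises its exception, if any
        return {name: future.result() for name, future in futures.items()}


# Hypothetical usage with the DataFrames from the example above:
# results = run_jobs_concurrently({
#     "count": df1.count,
#     "write": lambda: df2.write.parquet(output_path),  # output_path is assumed
# })
```

Each action blocks only its own thread, so the scheduler sees both jobs at once.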







Spark’s DAGScheduler can schedule these jobs at the same time. By default, the scheduler runs jobs in FIFO order.

Each job is divided into stages. The first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, and so on.

If the jobs at the head of the queue don’t need the whole cluster, later jobs can start right away. If the jobs at the head of the queue are large, later jobs may be delayed significantly.

To switch to fair sharing, where tasks from different jobs are assigned in a round-robin fashion, set the `spark.scheduler.mode` property to `FAIR` when configuring the SparkContext:
`conf.set("spark.scheduler.mode", "FAIR")`

**Fair Scheduler Pools**

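Fair Scheduler Pools are a feature of Spark’s FAIR scheduling mode. They group jobs into pools, each with its own scheduling options.

Example with 2 pools, `analytics` and `etl`, each configured with a minimum share of 4 cores: Spark tries to give each pool its minimum share, unless one pool is idle, in which case the other can use more. A thread selects its pool by setting the `spark.scheduler.pool` local property before submitting jobs: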


```
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "etl")
df_etl.write.format("delta").save(...)

spark.sparkContext.setLocalProperty("spark.scheduler.pool", "analytics")
df_analytics.count()
```
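
The pools themselves are declared in an XML allocation file that Spark locates through the `spark.scheduler.allocation.file` property. The sketch below generates such a file for the two pools. The helper name `build_fair_scheduler_xml` and the weight and minimum-share values are assumptions for illustration; the element names (`allocations`, `pool`, `schedulingMode`, `weight`, `minShare`) follow Spark's job scheduling documentation.

```python
import xml.etree.ElementTree as ET


def build_fair_scheduler_xml(pools):
    """Render pool definitions as a fair scheduler allocation file.

    `pools` maps a pool name to a dict with optional keys
    `schedulingMode`, `weight`, and `minShare`.
    """
    root = ET.Element("allocations")
    for name, options in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        for key in ("schedulingMode", "weight", "minShare"):
            if key in options:
                ET.SubElement(pool, key).text = str(options[key])
    return ET.tostring(root, encoding="unicode")


# Illustrative values: two equally weighted pools with a minimum share of 4 each.
xml_text = build_fair_scheduler_xml({
    "etl": {"schedulingMode": "FAIR", "weight": 1, "minShare": 4},
    "analytics": {"schedulingMode": "FAIR", "weight": 1, "minShare": 4},
})
print(xml_text)
```

Saving the output to a file and pointing `spark.scheduler.allocation.file` at it should make both pool names available to `setLocalProperty`.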



**Shuffle Configurations**

A shuffle happens when Spark needs to redistribute data across partitions, e.g. during `groupBy`, `reduceByKey`, `join`, `distinct`, `orderBy`, and `repartition`.

Shuffles are expensive (disk and network I/O), so tuning matters.

`spark.sql.shuffle.partitions` sets the number of partitions used when shuffling data for joins and aggregations. The default is 200.

*  For small/medium data: 50–100
*  For large data: total cores × 2–4

On Databricks: `SET spark.sql.shuffle.partitions = 100;`


**Shuffle File Buffer** - in-memory buffer size used when writing shuffle files to disk. A larger buffer reduces the number of disk seeks and system calls.

`spark.shuffle.file.buffer` = 32k (default)

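**Shuffle Spill Compression** - `spark.shuffle.spill.compress` compresses data spilled to disk during shuffles, and `spark.shuffle.compress` compresses the map output files that are transferred over the network.

```
spark.shuffle.spill.compress = true
spark.shuffle.compress = true
```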


**Shuffle I/O Retry & Timeout**

Useful in:

*  Cloud environments (Azure/AWS)
*  Intermittent network issues

The defaults are 3 retries with a 5s wait. Increase both for stability:

```
# Defaults
spark.shuffle.io.maxRetries = 3
spark.shuffle.io.retryWait = 5s

# Increased for stability
spark.shuffle.io.maxRetries = 10
spark.shuffle.io.retryWait = 10s
```
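
One way to apply the shuffle settings discussed above is to pass them to `spark-submit` as `--conf key=value` pairs. The sketch below builds those arguments from a dictionary. The helper name `to_submit_args` and the script name `my_job.py` are hypothetical, and the values are only the examples from this section, not recommendations.

```python
def to_submit_args(settings):
    """Turn a settings dict into a flat list of `--conf key=value` arguments."""
    args = []
    for key, value in settings.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


# Values copied from the settings discussed above; tune them for your workload.
shuffle_settings = {
    "spark.sql.shuffle.partitions": 100,
    "spark.shuffle.file.buffer": "32k",
    "spark.shuffle.io.maxRetries": 10,
    "spark.shuffle.io.retryWait": "10s",
}

# my_job.py is a placeholder for the application script.
print(" ".join(["spark-submit", *to_submit_args(shuffle_settings), "my_job.py"]))
```

Keeping the settings in one dictionary makes it easier to compare runs with different values.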




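**Adaptive Query Execution** (Very Important)

AQE re-optimizes the query plan at runtime using statistics collected during execution. The settings below turn it on, let it merge small shuffle partitions, and let it split skewed partitions in joins: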

```
spark.sql.adaptive.enabled = true
spark.sql.adaptive.coalescePartitions.enabled = true
spark.sql.adaptive.skewJoin.enabled = true
```



# Direct Performance Enhancements

**Parallelism**

To process large amounts of data efficiently, aim for at least 2 or 3 tasks per CPU core in the cluster.

Set the `spark.default.parallelism` property and tune `spark.sql.shuffle.partitions` according to the number of cores in the cluster.



```
example :
Cluster details
Worker nodes: 2
Cores per node: 4
Total cores = 8

Parallelism ≈ 2–4 × total executor cores

spark.default.parallelism = 16
spark.sql.shuffle.partitions = 16
```
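
The arithmetic in the example can be written as a small helper. This is a sketch under the stated rule of thumb (a fixed number of tasks per core); the function name `suggest_partitions` and its parameters are assumptions for illustration.

```python
def suggest_partitions(worker_nodes, cores_per_node, tasks_per_core=2):
    """Return a partition count of `tasks_per_core` tasks per available core."""
    if worker_nodes < 1 or cores_per_node < 1 or tasks_per_core < 1:
        raise ValueError("all arguments must be positive")
    total_cores = worker_nodes * cores_per_node
    return total_cores * tasks_per_core


# The example cluster: 2 worker nodes with 4 cores each, 2 tasks per core.
partitions = suggest_partitions(worker_nodes=2, cores_per_node=4)
settings = {
    "spark.default.parallelism": partitions,
    "spark.sql.shuffle.partitions": partitions,
}
print(settings)
```

Raising `tasks_per_core` to 4 gives the upper end of the 2–4 × range for the same cluster.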



**Filtering**

Move filters to the earliest possible point in the Spark job so that later stages process less data.



```
df.filter("country = 'IN' OR country = 'US'")
```



**Repartitioning**

A `repartition` call incurs a shuffle, but it can still improve the overall job by balancing data across the cluster.



```
# Rows with the same customer_id go to the same partition
df_repart = df.repartition(8, "customer_id")

# Repartition by the grouping key before aggregating, so rows for each region share a partition
df_agg = (
    df.repartition("region")
      .groupBy("region")
      .sum("revenue")
)
```



**Coalescing**

`coalesce` does not perform a full shuffle. It merges partitions on the same node into fewer partitions.

Reduce partitions after filtering:

```
df_filtered = df.filter("year = 2024")
df_final = df_filtered.coalesce(4)
```

```
# Coalesce before writing (most common use case).
# coalesce(1) reduces the DataFrame to exactly one partition, without doing a full shuffle.
df.coalesce(1).write.mode("overwrite").parquet("/output/report")
```





**User-Defined Functions**

In general, avoiding UDFs is a good optimization opportunity. UDFs are expensive because they force Spark to represent data as objects in the JVM, sometimes multiple times per record in a query.

Because the data is handled as objects, it has to be converted between Spark's internal format and those objects, which adds overhead for every record.
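
Where a built-in function exists, it can often replace a UDF. The sketch below upper-cases a column through the built-in SQL function `upper` via `selectExpr`, instead of a hand-written UDF. The helper name `add_upper_name` and the column name `name` are assumptions for illustration.

```python
def add_upper_name(df, column="name"):
    """Add an upper-cased copy of `column` using Spark's built-in `upper`.

    `df` is expected to be a PySpark DataFrame. The expression is evaluated
    inside Spark's engine, so no Python UDF is involved.
    """
    return df.selectExpr("*", f"upper({column}) AS {column}_upper")


# Hypothetical usage, assuming a DataFrame `df` with a string column `name`:
# df_with_upper = add_upper_name(df)
```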

**Temporary Data Storage (Caching)**

Caching will place a DataFrame, table, or RDD into temporary storage (either memory or disk) across the executors in the cluster, and make subsequent reads faster.

Caching data incurs a serialization, deserialization, and storage cost.

Caching is a lazy operation, meaning that data is cached only when it is first accessed.
You might expect to read the raw data, but if someone else has already cached it, you are actually reading their cached version. Keep that in mind when using this feature.
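
Because caching is lazy, a common pattern is to call `cache()` and then run an action so the data is actually stored. This is a minimal sketch of that pattern; the helper name `cache_and_materialize` is an assumption, while `cache()`, `count()`, and `unpersist()` are standard DataFrame methods.

```python
def cache_and_materialize(df):
    """Mark `df` for caching and run an action so the cache is populated.

    `df` is expected to be a PySpark DataFrame. `cache()` alone is lazy;
    the `count()` action forces Spark to compute and store the data.
    """
    cached = df.cache()
    cached.count()
    return cached


# Hypothetical usage, assuming `df` is reused by several later queries:
# hot = cache_and_materialize(df)
# ... run the queries that reuse `hot` ...
# hot.unpersist()  # release the storage once the data is no longer needed
```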

**Joins**

Joins are a common area for optimization.

Optimizing joins largely comes down to understanding what each join type does and how Spark performs it.

**Aggregations**

Filter data before the aggregation so that fewer rows are shuffled and aggregated.
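
A minimal sketch of filtering ahead of an aggregation follows. The helper name `revenue_by_region` and the column names `year`, `region`, and `revenue` are assumptions borrowed from the earlier examples.

```python
def revenue_by_region(df, year):
    """Sum revenue per region for one year, filtering before the aggregation.

    `df` is expected to be a PySpark DataFrame with `year`, `region`, and
    `revenue` columns (assumed names).
    """
    return (
        df.filter(f"year = {int(year)}")  # drop unneeded rows first
          .groupBy("region")
          .sum("revenue")
    )


# Hypothetical usage:
# df_agg = revenue_by_region(df, 2024)
```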