### PySpark coalesce Interview Question

Given a PySpark DataFrame with 8 partitions, use the coalesce function to reduce the number of partitions to 4.

Explain how this impacts performance and provide the code to show the result.

In [0]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CoalesceFunction").getOrCreate()
spark

In [0]:
# sample data
data = [
    (1, 'Apple'),
    (2, 'Banana'),
    (3, 'Orange'),
    (4, 'Grapes'),
    (5, 'Mango'),
    (6, 'Pineapple')
]

columns = ["ID", "Fruit"]

df = spark.createDataFrame(data, columns)
df.show()

+---+---------+
| ID|    Fruit|
+---+---------+
|  1|    Apple|
|  2|   Banana|
|  3|   Orange|
|  4|   Grapes|
|  5|    Mango|
|  6|Pineapple|
+---+---------+



In [0]:
# check the initial number of partitions
# (for a local list this follows Spark's default parallelism, so it varies by machine)
print("Initial number of partitions:", df.rdd.getNumPartitions())

Initial number of partitions: 8


In [0]:
# decrease the number of partitions to 4
reduced_df = df.coalesce(4)

In [0]:
# check the number of partitions after coalesce
print("Number of partitions after coalesce:", reduced_df.rdd.getNumPartitions())

Number of partitions after coalesce: 4


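**Explanation:**
- `coalesce(4)` merges the existing partitions into fewer ones without performing a full shuffle, which makes it cheaper than `repartition(4)`, where every row is redistributed across the cluster.
- The trade-off is that rows are not rebalanced, so the merged partitions can be uneven in size, and fewer partitions also means fewer tasks can run in parallel.
- This is useful when writing output to disk, because fewer partitions produce fewer output files, and when the dataset is small enough that scheduling many tiny tasks costs more than the parallelism gains.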