### 🚀 PySpark coalesce() Function
🔹 What is coalesce() in PySpark?

The coalesce() function reduces the number of partitions in a DataFrame without a full shuffle: it merges existing partitions instead of redistributing every row across the cluster.
It's primarily used before writing data to storage, such as Parquet, Delta, or CSV, to keep the number of output files small.

⚡ Use coalesce() when you want to reduce partitions (e.g., from 10 → 2).
⚠️ Don't use it to increase partitions; use repartition() for that.
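
To make "no full shuffle" concrete, here is a toy model in plain Python. It is only a sketch: `toy_coalesce` and `toy_repartition` are hypothetical helpers, and the contiguous grouping is an assumption for illustration, not Spark's actual partition-grouping algorithm. The point is that coalescing keeps each input partition whole, while repartitioning moves rows individually.

```python
def toy_coalesce(partitions, num_partitions):
    """Merge whole input partitions into at most num_partitions groups."""
    n = len(partitions)
    if num_partitions >= n:
        # Asking for more partitions than exist changes nothing.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        # Contiguous grouping (an assumption): input i goes to output i * k // n.
        merged[i * num_partitions // n].extend(part)
    return merged


def toy_repartition(partitions, num_partitions):
    """Send every row to a new partition round-robin, splitting input partitions."""
    out = [[] for _ in range(num_partitions)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % num_partitions].append(row)
    return out


parts = [[1, 2], [3], [4, 5], [6]]
print(toy_coalesce(parts, 2))     # [[1, 2, 3], [4, 5, 6]]
print(toy_repartition(parts, 2))  # [[1, 3, 5], [2, 4, 6]]
```

In the coalesced result, rows that started together stay together. In the repartitioned result they are spread out, which is what requires moving data between executors in a real cluster.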

DataFrame.coalesce(numPartitions)
numPartitions → Target number of partitions. If it is greater than or equal to the current count, the DataFrame keeps its current number of partitions.

In [0]:
# Sample data
data = [
    (1, "Alice", 3000),
    (2, "Bob", 4000),
    (3, "Cathy", 2500),
    (4, "David", 3500),
    (5, "Eve", 2000),
    (6, "Frank", 2800)
]

# Create DataFrame
df = spark.createDataFrame(data, ["emp_id", "name", "salary"])

print("Initial DataFrame:")
df.display()  # Databricks notebook helper; use df.show() in plain PySpark

# Check initial partitions
print("Initial partitions:", df.rdd.getNumPartitions())

Initial DataFrame:


emp_id,name,salary
1,Alice,3000
2,Bob,4000
3,Cathy,2500
4,David,3500
5,Eve,2000
6,Frank,2800


Initial partitions: 8


In [0]:
# Reduce partitions to 2
df_coalesced = df.coalesce(2)

print("Partitions after coalesce:", df_coalesced.rdd.getNumPartitions())

Partitions after coalesce: 2


In [0]:
# 💾 Use Case: Writing Optimized Output

# When saving small data to storage, reducing partitions helps avoid many tiny files.

# Example: write to Parquet with optimized partition count
df_coalesced.write.mode("overwrite").parquet("/tmp/optimized_output/")

### ⚖️ coalesce() vs repartition()

| Feature                 | `coalesce()`                       | `repartition()`                             |
| ----------------------- | ---------------------------------- | ------------------------------------------- |
| Shuffles Data           | ❌ No (merges existing partitions) | ✅ Yes (full shuffle)                       |
| Can Increase Partitions | ❌ No                              | ✅ Yes                                      |
| Best For                | Reducing partitions                | Increasing or evenly rebalancing partitions |
| Performance             | Usually faster (no shuffle)        | Usually slower (shuffle involved)           |


In [0]:
# Reduce partitions using coalesce
df_coalesce = df.coalesce(2)
print("Coalesce partitions:", df_coalesce.rdd.getNumPartitions())

# Repartition data (full shuffle)
df_repartition = df.repartition(4)
print("Repartition partitions:", df_repartition.rdd.getNumPartitions())


💡 Tips

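Use coalesce() before writing to reduce the number of output files.

Avoid reducing to 1 unless you truly need a single output file: a single partition means a single task does all the writing, and because coalesce() adds no shuffle boundary, the upstream work may also run with less parallelism.

Combine with .cache() or .persist() if you reuse the coalesced DataFrame across several actions.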