# 🔹 PySpark repartition() Function Tutorial

Goal: Optimize data partitioning for better parallelism and performance in Spark jobs.

### 🔸 What is repartition() in PySpark?

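repartition() is a wide transformation that performs a full shuffle to redistribute data across a specified number of partitions.
It can increase or decrease the partition count. Called with only a number, it spreads rows round-robin, so partitions come out roughly equal in size. Called with columns, it hash-partitions on those columns, so rows with the same key land in the same partition and a skewed key can still produce uneven partitions.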

DataFrame.repartition(numPartitions, *cols)

numPartitions → Number of partitions you want.

cols (optional) → Columns to partition by (for hash partitioning).
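To make the two modes concrete, here is a minimal pure-Python sketch of how rows could be assigned to partitions. It is an illustration only, not Spark's implementation: `round_robin` and `hash_partition` are hypothetical helpers, and `zlib.crc32` is a stand-in for the hash function Spark uses internally.

```python
import zlib
from collections import defaultdict

# Same sample rows as the tutorial: (emp_id, name, dept, salary)
rows = [
    (1, "John", "Sales", 3000),
    (2, "Jane", "Finance", 4000),
    (3, "Sam", "Sales", 2500),
    (4, "Sara", "HR", 4500),
    (5, "Mike", "Finance", 3500),
]

def round_robin(rows, num_partitions):
    """Hypothetical helper: deal rows out one by one, like repartition(n)."""
    parts = defaultdict(list)
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return dict(parts)

def hash_partition(rows, num_partitions, key_index):
    """Hypothetical helper: place rows by hash(key) mod n, like repartition(n, col).

    crc32 is an assumption for illustration; Spark uses its own hash function.
    """
    parts = defaultdict(list)
    for row in rows:
        key = str(row[key_index]).encode("utf-8")
        parts[zlib.crc32(key) % num_partitions].append(row)
    return dict(parts)

rr = round_robin(rows, 3)
hp = hash_partition(rows, 3, key_index=2)  # key_index=2 is the dept column

print("round-robin sizes:", sorted(len(p) for p in rr.values()))
for pid, part in sorted(hp.items()):
    print("hash partition", pid, "->", sorted({r[2] for r in part}))
```

Round-robin keeps partition sizes within one row of each other, while hash partitioning keeps every row of a given department together, at the cost of sizes that depend on how the keys are distributed.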

### 🔸 When to Use repartition()

Use it when:

You need to increase partitions for parallel processing.

You want to repartition based on a specific column (e.g., by employee_id).

You plan to join or aggregate large datasets on a key and want rows with the same key placed in the same partition beforehand.

In [0]:
# Sample data
data = [
    (1, "John", "Sales", 3000),
    (2, "Jane", "Finance", 4000),
    (3, "Sam", "Sales", 2500),
    (4, "Sara", "HR", 4500),
    (5, "Mike", "Finance", 3500)
]

# Create DataFrame
columns = ["emp_id", "name", "dept", "salary"]
df = spark.createDataFrame(data, columns)

# The initial count depends on the environment (for example, the number of cores available)
print("Initial Number of Partitions:", df.rdd.getNumPartitions())


Initial Number of Partitions: 8


In [0]:
df_repart = df.repartition(4)
print("After Repartition:", df_repart.rdd.getNumPartitions())

After Repartition: 4


In [0]:
df_dept = df.repartition(3, "dept")
print("Repartitioned by 'dept':", df_dept.rdd.getNumPartitions())


Repartitioned by 'dept': 3


In [0]:
# coalesce() merges existing partitions without a full shuffle; it can only reduce the count
df_coalesce = df.coalesce(2)
print("After Coalesce:", df_coalesce.rdd.getNumPartitions())

After Coalesce: 2


| Function          | Type                  | Shuffle? | Use Case                                             |
| ----------------- | --------------------- | -------- | ---------------------------------------------------- |
| **repartition()** | Wide transformation   | ✅ Yes    | Even distribution, both increase/decrease partitions |
| **coalesce()**    | Narrow transformation | ❌ No     | Reduce partitions efficiently                        |


### Performance Tips

Repartition on the key column before joins or aggregations on that key, so rows that share a key are already in the same partition.

Avoid excessive repartitioning; each repartition triggers a shuffle, which is expensive.

If the repartitioned DataFrame is reused by several actions, combine it with cache() or persist() so the shuffle is not recomputed each time.