## Spark Repartition() vs Coalesce()

In Apache Spark, both `repartition()` and `coalesce()` are methods used to control the partitioning of data in an `RDD` or a `DataFrame`, but they serve different purposes and have different performance costs.

Proper partitioning can have a significant impact on the performance and efficiency of your Spark job.

### Repartition()
- Used for increasing or decreasing the number of partitions.
- Always performs a full shuffle across the cluster, which can be an expensive operation.
- **Syntax**: `df.repartition(num_partitions)`
- **Example**: If you have a DataFrame with 4 partitions and want to increase it to 8, `df.repartition(8)` will redistribute the data across 8 partitions.

#### Scenarios:

##### 1. Balancing Workload with Skewed Data

- *Problem:* Skewed data in join operations causing performance issues.
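- *Solution:* Use repartition to spread rows evenly across partitions before the join, so that no single input partition dominates the work. Note that the join shuffles the data by key again, so this evens out input partition sizes but does not split up a single hot key.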
- *Example:* 
  ```python
  skewed_rdd.repartition(10).join(normal_rdd)
  ```

##### 2. Optimizing Grouping Operations

- *Problem:* Uneven data distribution affecting groupByKey or reduceByKey.
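- *Solution:* Apply repartition before grouping to spread the input evenly across tasks. The grouping itself still shuffles by key, so this adds an extra shuffle and pays off mainly when the input partitions are very uneven.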
- *Example:*
  ```python
  original_rdd.repartition(20).groupByKey()
  ```


### Coalesce()
- Used for decreasing the number of partitions.
- Avoids a full shuffle when reducing partitions: it merges existing partitions instead of redistributing individual rows, so the resulting partitions can be uneven in size.
- Primarily used to reduce the partition count in final stages, such as before saving data to storage. `df.coalesce` cannot increase the partition count (requesting more partitions than currently exist leaves the count unchanged), and it does not balance data across partitions.
- **Syntax**: `df.coalesce(num_partitions)`
- **Example**: If you have a DataFrame with 8 partitions and want to reduce it to 4, `df.coalesce(4)` will reduce the partitions without a full shuffle.

#### Scenarios: 

##### 1. Final Stage Data Reduction
- *Problem:* High partition count at the end of processing, leading to numerous small output files.
- *Solution:* Use coalesce to decrease partitions before saving the final result.
- *Example:* 
  ```python
  intermediate_rdd.coalesce(1).saveAsTextFile("final_output")
  ```

##### 2. Aggregating Small Files
- *Problem:* Numerous small files causing storage and reading inefficiencies.
- *Solution:* Utilize coalesce to reduce output file count.
- *Example:*
  ```python
  processed_data.coalesce(5).write.parquet("output_data.parquet")
  ```
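
One way to pick the argument passed to `coalesce` is to work backwards from a target file size. The sketch below is plain Python; the helper name `target_partition_count` and the 128 MB default are illustrative assumptions, not Spark settings or APIs.

```python
import math

def target_partition_count(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Hypothetical helper: number of output files needed so that each
    file is roughly target_file_bytes in size (never fewer than 1)."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_file_bytes > 0")
    return max(1, math.ceil(total_bytes / target_file_bytes))

# Example: about 600 MB of output with a 128 MB target per file.
n_files = target_partition_count(600 * 1024 * 1024)
print(n_files)  # 5
```

The result could then be used as `processed_data.coalesce(n_files)` before writing.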


### Choosing between them:
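- Use `repartition` to increase the partition count, to rebalance unevenly sized partitions, or to partition by a column. It always costs a full shuffle.
- Use `coalesce` to reduce the partition count without a full shuffle. It is cheaper, but the resulting partitions may be uneven in size.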
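
The rule of thumb above can be written down as a small decision helper. This is a plain-Python sketch; `choose_partition_op` and its parameters are hypothetical names used only for illustration.

```python
def choose_partition_op(current, target, need_even_sizes=False):
    """Hypothetical helper: suggest which method to call when going
    from `current` partitions to `target` partitions."""
    if current < 1 or target < 1:
        raise ValueError("partition counts must be at least 1")
    if target == current and not need_even_sizes:
        return "none"          # nothing to change
    if target > current or need_even_sizes:
        return "repartition"   # needs a full shuffle
    return "coalesce"          # fewer partitions, uneven sizes are acceptable

print(choose_partition_op(8, 4))                        # coalesce
print(choose_partition_op(4, 8))                        # repartition
print(choose_partition_op(8, 4, need_even_sizes=True))  # repartition
```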



## Practical Examples

In [0]:
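# import only what this notebook uses; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import spark_partition_id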

In [0]:
flights_df = spark.read.format("csv") \
          .option("header", "true") \
          .option("inferSchema", "true") \
          .load("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/2010_summary.csv")

In [0]:
flights_df.show()

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|       United States|            Romania|    1|
|       United States|            Ireland|  264|
|       United States|              India|   69|
|               Egypt|      United States|   24|
|   Equatorial Guinea|      United States|    1|
|       United States|          Singapore|   25|
|       United States|            Grenada|   54|
|          Costa Rica|      United States|  477|
|             Senegal|      United States|   29|
|       United States|   Marshall Islands|   44|
|              Guyana|      United States|   17|
|       United States|       Sint Maarten|   53|
|               Malta|      United States|    1|
|             Bolivia|      United States|   46|
|            Anguilla|      United States|   21|
|Turks and Caicos ...|      United States|  136|
|       United States|        Afghanistan|    2|
|Saint Vincent and..

In [0]:
flights_df.count()

Out[4]: 255

In [0]:
# checking the number of partitions: currently we have only 1 partition
flights_df.rdd.getNumPartitions()

Out[6]: 1

### Partitioning the data using repartition()

In [0]:
partition_flights_df = flights_df.repartition(4)

In [0]:
# now you can see that we have 4 partitions
partition_flights_df.rdd.getNumPartitions()

Out[9]: 4

In [0]:
# now let's check how many rows each partition holds
partition_flights_df.withColumn("partitionID", spark_partition_id()).groupBy("partitionID").count().show()


# as you can see repartition() distributed the data evenly across all the partitions

+-----------+-----+
|partitionID|count|
+-----------+-----+
|          0|   63|
|          1|   64|
|          2|   64|
|          3|   64|
+-----------+-----+
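
The near-equal counts follow from how rows are dealt out when `repartition(n)` is called without columns: rows are handed to the target partitions in turn. The plain-Python model below reproduces the sizes in the output above. It is a simplified sketch; `round_robin_sizes` and its `start` offset are hypothetical names, and Spark picks its own starting partition.

```python
def round_robin_sizes(num_rows, num_partitions, start=0):
    """Model of round-robin distribution: row i goes to partition
    (start + i) % num_partitions. Returns the row count per partition."""
    sizes = [0] * num_partitions
    for i in range(num_rows):
        sizes[(start + i) % num_partitions] += 1
    return sizes

# 255 rows dealt into 4 partitions, starting at partition 1 (assumed offset).
sizes = round_robin_sizes(255, 4, start=1)
print(sizes)  # [63, 64, 64, 64]
```

Whatever the starting offset, the partition sizes in this model differ by at most one row.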



##### spark_partition_id()
- `spark_partition_id()` is a column function that returns the ID of the partition each row of a DataFrame belongs to.
- It is useful for inspecting how data is distributed across partitions when debugging and optimizing Spark jobs.
- It is available in PySpark through `pyspark.sql.functions`. Add it as a column with `withColumn` to see which partition each row is in.

In [0]:
# we can also repartition by one or more specific columns
partition_on_column_df = flights_df.repartition(300, "DEST_COUNTRY_NAME")

In [0]:
partition_on_column_df.rdd.getNumPartitions()

Out[12]: 300

In [0]:
# now let's check how many rows each partition holds
partition_on_column_df.withColumn("partitionID", spark_partition_id()).groupBy("partitionID").count().show()

+-----------+-----+
|partitionID|count|
+-----------+-----+
|          0|    1|
|          2|    2|
|          7|    1|
|         10|    1|
|         13|    1|
|         15|    2|
|         16|    2|
|         21|    1|
|         22|    1|
|         28|    1|
|         31|    1|
|         39|    1|
|         42|    1|
|         43|    1|
|         44|    1|
|         45|    2|
|         48|    1|
|         53|    1|
|         54|    1|
|         55|    1|
+-----------+-----+
only showing top 20 rows



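#### NOTE:
- In the above example, we repartitioned the data into 300 partitions by column, but the DataFrame has only 255 records.
- Spark does not fill the extra partitions with null values. Rows are assigned to partitions by a hash of `DEST_COUNTRY_NAME`, so all rows with the same value land in the same partition, and any partition that no value hashes to is simply left empty. This is why the output above lists only some of the partition IDs.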
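
When a column is passed, each row's target partition is derived from a hash of the column value, so rows with the same value always land together and many partitions can stay empty. The plain-Python model below illustrates this. It is only a sketch: it uses `zlib.crc32` as a stand-in hash (an assumption for illustration, since Spark uses its own internal hash function), so the partition IDs will not match the output above.

```python
import zlib
from collections import Counter

def hash_partition(key, num_partitions):
    """Model of hash partitioning: same key -> same partition.
    zlib.crc32 is a stand-in for Spark's internal hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# A few destination countries taken from the sample output shown earlier.
dest_countries = ["United States", "Egypt", "United States", "Costa Rica",
                  "Senegal", "United States", "Guyana", "Malta"]

num_partitions = 300
rows_per_partition = Counter(hash_partition(c, num_partitions) for c in dest_countries)

# There can never be more non-empty partitions than distinct key values.
non_empty = len(rows_per_partition)
print(non_empty <= len(set(dest_countries)))  # True
print(num_partitions - non_empty, "partitions received no rows")
```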

### Partitioning the data using coalesce()

In [0]:
# first let's create 8 partitions

partition_flights_df = flights_df.repartition(8)
partition_flights_df.withColumn("partitionID", spark_partition_id()).groupBy("partitionID").count().show()

+-----------+-----+
|partitionID|count|
+-----------+-----+
|          0|   32|
|          1|   31|
|          2|   32|
|          3|   32|
|          4|   32|
|          5|   32|
|          6|   32|
|          7|   32|
+-----------+-----+



In [0]:
# now let's create 3 partitions using coalesce
three_coalesce_df = partition_flights_df.coalesce(3)

In [0]:
# now let's see how coalesce distributed the data
three_coalesce_df.withColumn("partitionID", spark_partition_id()).groupBy("partitionID").count().show()

# as you can see, it just merged the existing partitions into 3 partitions

+-----------+-----+
|partitionID|count|
+-----------+-----+
|          0|   63|
|          1|   96|
|          2|   96|
+-----------+-----+
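
The uneven counts are what merging whole partitions produces. The plain-Python model below groups neighbouring input partitions into contiguous ranges and reproduces the sizes above. It is a simplified sketch of the no-shuffle case that ignores data locality; `coalesce_sizes` is a hypothetical name, and the range formula is an assumption about how the groups are formed.

```python
def coalesce_sizes(partition_sizes, target):
    """Model of coalesce without a shuffle: whole input partitions are
    merged into `target` contiguous groups; no partition is split.
    Returns the row count per output partition."""
    if target < 1:
        raise ValueError("target must be at least 1")
    n = len(partition_sizes)
    if target >= n:
        # coalesce cannot increase the partition count
        return list(partition_sizes)
    merged = []
    for i in range(target):
        start = (i * n) // target
        end = ((i + 1) * n) // target
        merged.append(sum(partition_sizes[start:end]))
    return merged

# Sizes of the 8 partitions from the repartition(8) output above.
eight = [32, 31, 32, 32, 32, 32, 32, 32]
print(coalesce_sizes(eight, 3))  # [63, 96, 96]
```

Because 8 input partitions cannot be split evenly into 3 groups, one group receives two partitions and the other two receive three each.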



In [0]:
# if we repartition to 3 instead, we get evenly sized partitions (at the cost of a full shuffle)
repartition_df = partition_flights_df.repartition(3)
repartition_df.withColumn("partitionID", spark_partition_id()).groupBy("partitionID").count().show()

+-----------+-----+
|partitionID|count|
+-----------+-----+
|          0|   85|
|          1|   85|
|          2|   85|
+-----------+-----+

