# **Databricks Spark: Repartition vs Coalesce**

In [33]:
import findspark
findspark.init()

Partitioning plays an important role in performance, error handling, and debugging.

Need for a partition strategy
* A good partitioning strategy is central to designing a well-performing Spark application.
* It is important to choose the right number of partitions and the right size for each partition.
  * e.g.
    * Suppose the data has 10 partitions and the cluster has 16 cores. Each partition is processed by one core, and a partition cannot be broken down further to share it among cores, so only 10 cores do any work. The remaining 6 cores sit idle, which wastes resources. This is why we try to match the number of partitions to the number of cores.
* Evenly distributed partitions improve performance; unevenly distributed partitions hurt it.
  * With an even distribution, resources are used effectively because all cores stay busy.
  * Try to make the number of partitions equal to the number of cores, or a multiple of it.
* The size of each partition should be chosen carefully.
  * Suppose only one 500 MB partition is created on a worker node with 16 cores. One partition cannot be shared among cores, so one core processes all 500 MB while the other 15 cores stay idle.
  * Or suppose one partition is 1 GB and another is 100 MB. The 100 MB partition finishes quickly while the 1 GB partition takes much longer, so the core that processed the 100 MB partition sits idle for a long time. Partitions should therefore be of similar, reasonable size.
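
The core-count arithmetic above can be sketched in plain Python. This is a simplified model, not how Spark's scheduler is implemented: it assumes one task per partition and equal task durations, and the helper name `core_utilization` is hypothetical.

```python
import math

def core_utilization(num_partitions: int, num_cores: int) -> dict:
    """Estimate scheduling waves and idle cores in the final wave.

    Simplified model (assumption): one task per partition, all tasks take equal time.
    """
    waves = math.ceil(num_partitions / num_cores)
    busy_in_last_wave = num_partitions - (waves - 1) * num_cores
    return {
        "waves": waves,
        "idle_cores_in_last_wave": num_cores - busy_in_last_wave,
    }

# 10 partitions on 16 cores: one wave, 6 cores never get a task
print(core_utilization(10, 16))
# 32 partitions on 16 cores: two full waves, no idle cores
print(core_utilization(32, 16))
# a single large partition on 16 cores: 15 cores idle
print(core_utilization(1, 16))
```

A partition count that is a multiple of the core count leaves no idle cores in the last wave under this model, which is the reasoning behind the rule of thumb in the list above.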

**Default partitions for RDD and DataFrame**
* The parameter **sc.defaultParallelism** determines the number of partitions when data is created within Spark. Its value depends on the environment: in local mode it is the number of cores available to Spark, and on a cluster it is the total number of executor cores. On an 8-core machine it is 8, so 8 partitions are created by default.
  * sc.defaultParallelism - used when creating data within Spark.
* When reading data from an external system, partitions are created based on the parameter **spark.sql.files.maxPartitionBytes**, which is 128 MB by default.
  * spark.sql.files.maxPartitionBytes - used when reading files from an external source. Spark splits the input into partitions of at most 128 MB each.
  * This applies only when the file is splittable. A file in a non-splittable compressed format (e.g. gzip) is read into a single partition.

In [41]:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.config("spark.driver.host",'localhost').getOrCreate()

### **Changing the number of partitions with spark.sql.files.maxPartitionBytes**

**Changing the maxPartitionBytes parameter changes the number of partitions.**

In [42]:
spark.conf.get("spark.sql.files.maxPartitionBytes")

'134217728b'

In [46]:
spark.conf.set("spark.sql.files.maxPartitionBytes",100000)

In [49]:
df_conf= spark.read.format("csv").option("inferschema","true").option("header","True").option("sep",',').load("data.csv")
df_conf.rdd.getNumPartitions()

1

Suppose data.csv is 477486 bytes. With spark.conf.set("spark.sql.files.maxPartitionBytes",100000), Spark would create 5 partitions: four of about 100 KB each and one holding the remaining 77486 bytes.

**To reset the parameter**

In [50]:
spark.conf.unset("spark.sql.files.maxPartitionBytes")

In [51]:
spark.conf.get("spark.sql.files.maxPartitionBytes")

'134217728b'

<hr>

**Example - how spark.sql.files.maxPartitionBytes affects the number of partitions**
>**dbutils.fs.ls('dbfs:/FileStore/tables/')**

Suppose the folder contains two files: **2011-12-05.csv** of size 477486 bytes and **2011-12-08.csv** of size 437903 bytes. The total size of both files is 915,389 bytes.

>**spark.conf.set("spark.sql.files.maxPartitionBytes",200000)**

This sets the configuration parameter spark.sql.files.maxPartitionBytes to 200,000 bytes.

>**df=spark.read.format("csv").option("inferschema","true").option("header","true").option("sep",",").load('dbfs:/FileStore/tables/')**


This reads all the files at the given path and creates a DataFrame.\
**df.rdd.getNumPartitions()**\
-- this will give o/p = 6

That means the DataFrame has six partitions. Each file is split separately into chunks of at most 200000 bytes, so there are four partitions of 200000 bytes and two smaller partitions holding the remainder of each file (77486 and 37903 bytes).
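
The partition counts in these examples can be reproduced with a small plain-Python estimate. It is a simplified model under stated assumptions: the files are splittable, and Spark's other inputs to split sizing (the per-file open cost and the default parallelism) do not change the result, which holds for these numbers but not in general. The helper names are hypothetical.

```python
def split_sizes(file_size: int, max_partition_bytes: int) -> list:
    """Split one file into chunks of at most max_partition_bytes (simplified model)."""
    full_chunks, remainder = divmod(file_size, max_partition_bytes)
    return [max_partition_bytes] * full_chunks + ([remainder] if remainder else [])

def estimate_partitions(file_sizes: list, max_partition_bytes: int) -> int:
    """Files are split independently, so the estimate is the total number of chunks."""
    return sum(len(split_sizes(size, max_partition_bytes)) for size in file_sizes)

# Two CSV files with maxPartitionBytes = 200000
files = [477486, 437903]
for size in files:
    print(size, "->", split_sizes(size, 200000))
print("estimated partitions:", estimate_partitions(files, 200000))

# Single file with maxPartitionBytes = 100000
print("estimated partitions:", estimate_partitions([477486], 100000))
```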




<hr>

### **Repartition and coalesce**

**Note -**
*   Both repartition and coalesce are transformations, and they do not trigger immediate execution. They are evaluated lazily, and the actual data movement occurs during subsequent actions, such as writing to disk or performing certain operations.

### **Repartition**


* The repartition function is used to increase or decrease the number of partitions in Spark.
* It performs a full shuffle of the data and creates partitions based on the user's input. When only a number is given, rows are distributed round-robin across the new partitions; when columns are given, the data is hash partitioned on those columns.
* Repartition always shuffles the data and builds the new partitions from scratch.
  * Shuffling is a costly process.
  * The full shuffle happens as part of every repartition.
* Repartitioning by number results in almost equal-sized partitions, because the shuffled rows are spread evenly over the new partitions.

**What happens when we want to decrease the number of partitions?**

*   When you use repartition to decrease the number of partitions, it performs a full shuffle of the data.
*   A full shuffle means that the data is redistributed across the specified number of partitions, involving data movement across the network.
*   This operation is more expensive in terms of computational resources compared to coalesce.
*   repartition is often used when you need to adjust the number of partitions and achieve a more balanced distribution of data.


**Its effect**\
In PySpark, the repartition operation redistributes data across partitions in memory, and it also determines the physical layout on disk when the data is saved. When you repartition a DataFrame and then write it to a storage system (e.g., HDFS, S3, local file system), each non-empty partition is written as a separate file inside the output directory.

* **Repartitioning in Memory:**
    * When you call repartition on a DataFrame, it shuffles the data across the specified number of partitions in memory. This is done to create a more balanced distribution of data among the partitions, which can improve the efficiency of subsequent operations.

*  **Writing to Disk:**
    * When you write a DataFrame to an external storage system using methods like write.parquet(), write.csv(), or others, the data is physically stored on disk. The number of data files created corresponds to the number of non-empty partitions in the DataFrame.

In [35]:
# Create a sample DataFrame
data = [("John", 25), ("Jane", 30), ("Bob", 22), ("Alice", 28), ("Charlie", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data=data,schema=columns)

# Show the original DataFrame
print("Original DataFrame:")

df.printSchema()
df.show(5)



Original DataFrame:
root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)



+-------+---+
|   Name|Age|
+-------+---+
|   John| 25|
|   Jane| 30|
|    Bob| 22|
|  Alice| 28|
|Charlie| 35|
+-------+---+



**To get number of partitions in the original Dataframe**

In [36]:
# Get the number of partitions in the original DataFrame
original_partitions = df.rdd.getNumPartitions()
print(f"Number of original partitions: {original_partitions}")

# o/p - Number of original partitions: 12


Number of original partitions: 12



**Repartition the DataFrame into a specified number of partitions**

In [37]:
num_partitions = 3
df_repartitioned = df.repartition(num_partitions)

**df_repartitioned = df.repartition(num_partitions)**
* This repartitions the DataFrame into 3 partitions (the value of num_partitions).
* We can write these partitions to a specific location.
* e.g.  
  * df.repartition(2).write.mode('overwrite').parquet('/new_test')
    * This splits the df DataFrame into two partitions and saves them at the given location.
    * parquet('/new_test')
      * The DataFrame is stored in Parquet format at the location /new_test.
        * /new_test is  - dbfs:/new_test
      * Up to two files are created at this location, one per non-empty partition, because the DataFrame was split into two partitions.

In [38]:
print(df_repartitioned)

DataFrame[Name: string, Age: bigint]


**Get the number of partitions in the repartitioned DataFrame**

In [39]:

new_partitions = df_repartitioned.rdd.getNumPartitions()
print(f"Number of new partitions: {new_partitions}")

# o/p -Number of new partitions: 3

Number of new partitions: 3


### **To check the data present in the partitions**

**File 1**

In [None]:

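path='dbfs:/new_test/part-00000.parquet'

# part-00000.parquet - illustrative name of the first file created by the write.
# Actual part file names are longer (part-00000-<generated id>...parquet); list the folder to get them.


# Read the parquet file for this partition into a DataFrame (parquet has no header option)
df_parquet = spark.read.parquet(path)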

# Show the DataFrame
df_parquet.show()

**File 2**

In [None]:

path='dbfs:/new_test/part-00001.parquet'

# part-00001.parquet - illustrative name of the second file created by the write.


# Read the parquet file for this partition into a DataFrame (parquet has no header option)
df_parquet = spark.read.parquet(path)

# Show the DataFrame
df_parquet.show()

**File 3**

In [None]:
path='dbfs:/new_test/part-00002.parquet'

# part-00002.parquet - illustrative name of the third file; it exists only if the
# DataFrame was written with three partitions.


# Read the parquet file for this partition into a DataFrame (parquet has no header option)
df_parquet = spark.read.parquet(path)

# Show the DataFrame
df_parquet.show()

### **Coalesce**

* The coalesce function only reduces the number of partitions. Asking for more partitions than currently exist leaves the count unchanged.
* Coalesce doesn't require a full shuffle.
* Coalesce combines existing partitions, moving data from only some of them, and thus avoids a full shuffle.
* Because partitions are merged, the resulting partitions can be uneven in size.

 When you use coalesce, it tries to minimize data movement and achieve the target number of partitions by merging adjacent partitions without a full shuffle. It is a more efficient operation compared to repartition when decreasing the number of partitions. However, it doesn't guarantee an even distribution.


### **Reducing the number of partitions with coalesce**

**1. When we want to coalesce a DataFrame directly**

Suppose df has 4 partitions. Using coalesce, we decrease the number of partitions.
>df_col = df.coalesce(2)
*   df_col is a new DataFrame with 2 partitions; df itself is unchanged.
*   coalesce is a transformation, so calling df.coalesce(2) returns a new DataFrame.
*   The variable df_col now holds a reference to a new DataFrame that has been coalesced to 2 partitions. The actual computation and data movement associated with the coalesce transformation will occur when you perform an action on this DataFrame.


When you use a transformation like coalesce in PySpark, it creates a new DataFrame whose logical execution plan includes that transformation. The logical plan holds the series of transformations to be applied, but the actual execution (evaluation of the plan and data movement) is deferred until an action is called.

In the case of df_col = df.coalesce(2), the variable df_col is a reference to a new DataFrame with the coalesce transformation specified. The execution plan, including the coalesce operation, is part of the DataFrame's logical plan.

When you perform an action, such as calling show(), collect(), or writing to a file, PySpark triggers the physical execution of the logical plan. At this point, the coalesce operation is executed, and the data is moved accordingly.



**To check the number of partitions after coalesce:**
>new_partitions_coalesce = df_col.rdd.getNumPartitions()\
>print(f"Number of new partitions: {new_partitions_coalesce}")

**2. When a folder contains a number of partitioned files**

*   We first read all the files into a DataFrame and then reduce its partitions using coalesce().
*   We can write the coalesced DataFrame to disk in a particular format.
*   The number of files created equals the number of non-empty partitions.


In [40]:
# Read the files written earlier, then coalesce to 2 partitions (minimizing data movement)
df=spark.read.parquet('/new_test')
df_col = df.coalesce(2)
# write returns None, so its result is not assigned to a variable
df_col.write.mode('overwrite').parquet('/new_col')


new_partitions_coalesce = df_col.rdd.getNumPartitions()
print(f"Number of new partitions: {new_partitions_coalesce}")


Number of new partitions: 2


*  **spark.read.parquet('/new_test')**
   *  /new_test - location from where we will read the files
*  **df_col = df.coalesce(2)**
   *  It returns a new DataFrame with the coalesce recorded in its logical plan. As with any transformation, execution is deferred until an action is called; here that action is the write to /new_col.

### **Repartition vs coalesce**

*  repartition:
    *  When you use repartition, a full shuffle occurs and data is moved between partitions across the network. The goal is to build a new set of partitions with the specified count, so the data is reorganized and there is substantial network communication.
    *  The full shuffle makes it a relatively expensive operation, especially for large datasets.

    *  Use repartition when you need to increase the number of partitions, or when you change the number of partitions and want a more balanced distribution of data.

*  coalesce:

    *  coalesce is a transformation that reduces the number of partitions without a full shuffle of data.

    *  It works by merging adjacent partitions, trying to minimize data movement.

    *  coalesce is generally faster and more efficient than repartition when decreasing the number of partitions.

    *  Use coalesce when you want to decrease the number of partitions and a perfectly balanced distribution is not critical.

Both repartition and coalesce are transformations and do not trigger immediate execution. The actual data movement occurs during subsequent actions, such as writing to disk or performing certain operations.