# Partitioning in Spark

Partitioning matters for the following reasons:

- It affects parallelism
- It affects resiliency: a failure while computing even a single record causes the whole partition to be retried.
- It affects data shuffling
- It affects operations efficiency
- It affects memory usage: `mapPartitions` and `foreachPartition` work at the partition level, so more records per partition means more demand on available RAM.
- It affects data skew
- It affects storage in files: Spark usually produces one file per partition.

In [1]:
import org.apache.commons.lang3.time.{DateFormatUtils, FastDateFormat}
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.log4j._
import com.tccc.dna.synapse.dataset.NewYorkTaxiYellow
import com.tccc.dna.synapse.{StorageFormat, AzStorage}
import com.tccc.dna.synapse.Logs._

import com.tccc.dna.synapse.spark.{SynapseSpark, DataFrames, Partitions, Writers, Catalogs}

import java.nio.charset.StandardCharsets
import java.util.Date

sc.setLogLevel("DEBUG")

val notebookName = SynapseSpark.getCurrentNotebookName
val log = org.apache.log4j.LogManager.getLogger(s"com.aravind.notebook.$notebookName")
log.setLevel(Level.DEBUG)

## Setup

The setup and initialization logic has been factored out into the notebook below.

In [2]:
%run /Operations/00-Setup-NewYorkYellowTaxi

## Partitions while reading from Source

- When reading from a source, **the number of partitions is determined** by pre-built rules. These rules determine the number of partitions of the input RDD. The input source can be a runtime collection of records, a set of files from a filesystem, or a table in a database.
- The files, or parts of a file, from the source are then mapped to these pre-determined partitions.
- Repartitioning an existing RDD, by contrast, is accomplished through a `Partitioner` implementation.

Rules:
- The number of CPU cores per Spark executor is defined by the `spark.executor.cores` configuration.
- A single CPU core reads one file, or one split of a splittable file, at a time.
- Once read, a file is transformed into one or more partitions in memory.

**_Procedure for selecting the input partitions of a Dataset while reading from a source:_**

1. _Calculate and identify file chunks: split each file at `maxSplitBytes` boundaries if the file is splittable._
2. _Pack partitions: one or more file chunks are packed into a partition. If adding the next chunk would push the partition past `maxSplitBytes`, the partition is considered complete and a new partition is started for the remaining chunks._

```
maxSplitBytes = Minimum(spark.sql.files.maxPartitionBytes [default 128 MB], Maximum(spark.sql.files.openCostInBytes [default 4 MB], bytesPerCore))
bytesPerCore = (Sum of sizes of all files + No. of files * spark.sql.files.openCostInBytes) / spark.default.parallelism
```
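
The two steps and the formula can be tried out without a cluster. The following is a minimal sketch in plain Scala, not Spark's own code: the helper names (`maxSplitBytes`, `splitFile`, `packChunks`) and the two example file sizes are assumptions made for illustration. It assumes chunks are packed largest first, that each packed chunk also adds the open cost to the running partition size, and that `bytesPerCore` is never taken to be smaller than the open cost.

```scala
// Hypothetical helpers for illustration; sizes are in bytes.
val MiB: Long = 1024L * 1024L

/** Upper bound on the size of a single read split, per the formula above. */
def maxSplitBytes(fileSizes: Seq[Long], maxPartitionBytes: Long,
                  openCostInBytes: Long, parallelism: Int): Long = {
  val totalBytes = fileSizes.map(_ + openCostInBytes).sum
  val bytesPerCore = totalBytes / parallelism
  math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
}

/** Step 1: cut one splittable file into chunks of at most maxSplit bytes. */
def splitFile(fileSize: Long, maxSplit: Long): Seq[Long] =
  (0L until fileSize by maxSplit).map(offset => math.min(maxSplit, fileSize - offset))

/** Step 2: pack chunks, largest first, into partitions. */
def packChunks(chunks: Seq[Long], maxSplit: Long, openCostInBytes: Long): Seq[Seq[Long]] = {
  val partitions = scala.collection.mutable.ArrayBuffer.empty[Seq[Long]]
  var current = Vector.empty[Long]
  var currentSize = 0L
  chunks.sortBy(size => -size).foreach { chunk =>
    // Close the current partition if the next chunk would not fit.
    if (current.nonEmpty && currentSize + chunk > maxSplit) {
      partitions += current
      current = Vector.empty[Long]
      currentSize = 0L
    }
    current = current :+ chunk
    currentSize += chunk + openCostInBytes
  }
  if (current.nonEmpty) partitions += current
  partitions.toSeq
}

// Example: one 300 MiB file and one 10 MiB file, default settings, parallelism of 2.
val fileSizes = Seq(300 * MiB, 10 * MiB)
val maxSplit = maxSplitBytes(fileSizes, 128 * MiB, 4 * MiB, parallelism = 2)
val chunks = fileSizes.flatMap(splitFile(_, maxSplit))
val partitions = packChunks(chunks, maxSplit, 4 * MiB)
println(s"maxSplitBytes = ${maxSplit / MiB} MiB, partitions = ${partitions.size}")
// prints "maxSplitBytes = 128 MiB, partitions = 3"
```

In this example the 300 MiB file is cut into chunks of 128, 128, and 44 MiB, and the 44 MiB chunk shares a partition with the 10 MiB file.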

#### Optimizations

Optimizing read parallelism:
- If the number of cores equals the number of files, the files are not splittable, and some of them are much larger than others, the larger files become a bottleneck: cores that read the smaller files sit idle for some time.
- If there are more cores than files, the cores that have no file assigned sit idle. Unless the data is repartitioned after reading, those cores also remain idle during the processing stages.

**Rule of thumb:** 

- Set the number of cores to about half the number of files being read, then adjust for your situation.
- Use the `spark.sql.files.maxPartitionBytes` configuration to cap the partition size when reading files. Larger splittable files are split into multiple partitions accordingly.
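
As a configuration sketch, the read split size can be changed on the running session before the files are read. This assumes the notebook's built-in `spark` session; the 64 MB value is only an illustration, not a recommendation.

```scala
// Assumes the notebook-provided `spark` session; 64 MB is an illustrative value only.
spark.conf.set("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)

// Read the value back to confirm what later file reads in this session will use.
println(spark.conf.get("spark.sql.files.maxPartitionBytes"))
```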

In [3]:
SynapseSpark.printPartitionSplitRelatedProps()

The cell below logs `None` because there is **no partitioner** when reading from a source; the pre-built rules described above are used instead.

In [6]:
//Cache as it is used in rest of the notebook
val statgeTaxiDeltaDf = DataFrames.getDataFrame(tcccStorageAcct, tcccContainer, yellowTaxiDeltaPath, StorageFormat.Delta).cache
logDebug(log, s"Partitioner used: ${statgeTaxiDeltaDf.rdd.partitioner}")
//Create a view to work using SQL
statgeTaxiDeltaDf.createOrReplaceTempView("stage_nyc_yellow_taxi_trips_delta")

val stageTaxiDelta2001Df = statgeTaxiDeltaDf.filter("puYear == 2001")
logDebug(log, s"Partitioner used: ${stageTaxiDelta2001Df.rdd.partitioner}")

val shuffledTaxiDf = spark.sql("""
    SELECT 
        puYear, COUNT(*) as trips_count 
    FROM 
        stage_nyc_yellow_taxi_trips_delta 
    GROUP BY puYear 
    ORDER BY puYear""")
logDebug(log, s"Partitioner used: ${shuffledTaxiDf.rdd.partitioner}")

## Partitions in Memory

- By default, Spark creates as many partitions as there are CPU cores in the cluster.
- Spark creates one task per partition.
- Shuffle operations move data from one partition to others. By default, Spark SQL shuffle operations create `200` partitions (`spark.sql.shuffle.partitions`).
- `repartition()`, its variants such as `repartitionByRange()`, and `coalesce()` are generally used to change the number of in-memory partitions at runtime.
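
The effect of `repartition()` and `coalesce()` on the partition count can be checked directly. The sketch below uses a generated range rather than the taxi data, and the names `session`, `base`, `widened`, and `narrowed` are assumptions for illustration. In a notebook, `getOrCreate()` returns the existing session; the local master only matters for a standalone run.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the active session when one exists; the local master applies to standalone runs only.
val session = SparkSession.builder()
  .master("local[4]")
  .appName("partition-count-sketch")
  .getOrCreate()

// One million rows, explicitly requested as 8 partitions.
val base = session.range(0, 1000000, 1, 8)

// repartition() performs a full shuffle and can raise or lower the partition count.
val widened = base.repartition(16)

// coalesce() merges existing partitions without a full shuffle, so it can only lower the count.
val narrowed = widened.coalesce(4)

println(s"base=${base.rdd.getNumPartitions}, " +
  s"repartitioned=${widened.rdd.getNumPartitions}, " +
  s"coalesced=${narrowed.rdd.getNumPartitions}")
```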

In [3]:
{
    //Partitions from open dataset
    val yellowTaxiOpenDf = NewYorkTaxiYellow.getOpenDatasetDataFrame
    val inMemPartitionCount = Partitions.getNumPartitionsInMem(yellowTaxiOpenDf)
    logDebug(log, s"Open dataset In-memory partition count: $inMemPartitionCount")
    var minMaxPartDf = Partitions.getRecordsPerPartitionInMem(yellowTaxiOpenDf)
    minMaxPartDf.agg(min("record_count"), max("record_count")).show

    //Partitions from local storage acct
    var df = DataFrames.getDataFrame(tcccStorageAcct, tcccContainer, yellowTaxiCsvPath, StorageFormat.Csv)
    logDebug(log, s"$yellowTaxiCsvPath In-memory partition count: ${Partitions.getNumPartitionsInMem(df)}")
    minMaxPartDf = Partitions.getRecordsPerPartitionInMem(df)
    minMaxPartDf.agg(min("record_count"), max("record_count")).show

    df = DataFrames.getDataFrame(tcccStorageAcct, tcccContainer, yellowTaxiParquetPath, StorageFormat.Parquet)
    logDebug(log, s"$yellowTaxiParquetPath In-memory partition count: ${Partitions.getNumPartitionsInMem(df)}")
    minMaxPartDf = Partitions.getRecordsPerPartitionInMem(df)
    minMaxPartDf.agg(min("record_count"), max("record_count")).show

    //cache this 
    df = DataFrames.getDataFrame(tcccStorageAcct, tcccContainer, yellowTaxiDeltaPath, StorageFormat.Delta)
    logDebug(log, s"$yellowTaxiDeltaPath In-memory partition count: ${Partitions.getNumPartitionsInMem(df)}")
    minMaxPartDf = Partitions.getRecordsPerPartitionInMem(df)
    minMaxPartDf.agg(min("record_count"), max("record_count")).show
}

## Partitions during Processing

Use the `spark.default.parallelism` (RDD operations) and `spark.sql.shuffle.partitions` (DataFrame and SQL operations) configurations to set the number of partitions created by **wide** transformations.

**Rule of thumb:** set `spark.default.parallelism` to `spark.executor.cores` **X** the number of executors **X** a small factor between 2 and 8, then tune it for the specific Spark job.
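
The rule of thumb is simple arithmetic. The sketch below uses a hypothetical helper, `suggestedParallelism`, and an assumed cluster of 10 executors with 4 cores each to show the range of values it yields.

```scala
/** Hypothetical helper: cores per executor x number of executors x tuning factor. */
def suggestedParallelism(executorCores: Int, numExecutors: Int, factor: Int): Int = {
  require(factor >= 2 && factor <= 8, "the rule of thumb uses a factor between 2 and 8")
  executorCores * numExecutors * factor
}

// Assumed cluster: 10 executors with 4 cores each.
val low = suggestedParallelism(executorCores = 4, numExecutors = 10, factor = 2)
val high = suggestedParallelism(executorCores = 4, numExecutors = 10, factor = 8)
println(s"Try values between $low and $high")
// prints "Try values between 80 and 320"
```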

In [6]:
%%sql
--SHOW PARTITIONS stage_nyc_yellow_taxi_trips_delta;
--SELECT puYear, puMonth, COUNT(*) as trips_count FROM stage_nyc_yellow_taxi_trips_delta GROUP BY puYear, puMonth ORDER BY puYear, puMonth;
SELECT puYear, COUNT(*) as trips_count FROM stage_nyc_yellow_taxi_trips_delta GROUP BY puYear ORDER BY puYear;

## Partitions on Disk

### Tables: SQL Managed Table

In [7]:
{
    logDebug(log, s"\nPartition info for $yellowTaxiParquetBackedTable")
    logDebug(log, s"Is partitioned: ${Partitions.isTablePartitioned(schemaName, yellowTaxiParquetBackedTable)}")
    var partitionCols = Partitions.getTablePartitionCols(schemaName, yellowTaxiParquetBackedTable)
    logDebug(log, s"Partitioned on col(s): $partitionCols")
    println()
    logDebug(log, s"\nPartition info for $yellowTaxiDeltaBackedTable")
    logDebug(log, s"Is partitioned: ${Partitions.isTablePartitioned(schemaName, yellowTaxiDeltaBackedTable)}")
    partitionCols = Partitions.getTablePartitionCols(schemaName, yellowTaxiDeltaBackedTable)
    logDebug(log, s"Partitioned on col(s): $partitionCols")
}

In [8]:
%%sql
--describe extended silver.mt_nyc_yellow_taxi_trips_delta
describe extended silver.mt_nyc_yellow_taxi_trips_parquet

In [10]:
val fullName = schemaName+"."+yellowTaxiDeltaBackedTable

In [20]:
%%sql
--DESCRIBE EXTENDED silver.mt_nyc_yellow_taxi_trips_delta;
--DESCRIBE FORMATTED silver.mt_nyc_yellow_taxi_trips_delta;
--DESCRIBE DETAIL silver.mt_nyc_yellow_taxi_trips_delta;

In [None]:
//Overwrite partition 2001
//incrementalOverwrite(trips2001Df, schemaName, managedTableName)

In [None]:
//incrementalAppend(trips2001Df, schemaName, managedTableName)

In [None]:
%%sql
select count(*) from silver.c;
select count(*) from silver.mt_nyc_yellow_taxi_trips_delta;

In [None]:
Writers.saveAsManagedTable(taxiOpenDatasetDf, schemaName, managedTableName, partitioCols, "Data of New York Yellow Taxi Trips. There are about 1.5B rows (50 GB) in total as of 2018.")

In [None]:
val descDf=spark.sql(s"DESCRIBE TABLE EXTENDED $schemaName.$yellowTaxiDeltaBackedTable")//.show(truncate = false,numRows = 1000)
display(descDf)

In [29]:
%%sql
-- All Delta queries fail with the following error because the SHOW PARTITIONS command is not yet supported for Delta.
--Error: Table spark_catalog.delta.`abfss://tlfs@xxx.dfs.core.windows.net/synapse/workspaces/syn-tccc-cdl-use2-dev-01/warehouse/silver.db/mt_nyc_yellow_taxi_trips_delta` does not support partition management.;
--SHOW partitions delta.`abfss://tlfs@xxx.dfs.core.windows.net/synapse/workspaces/syn-tccc-cdl-use2-dev-01/warehouse/silver.db/mt_nyc_yellow_taxi_trips_delta`;
-- show partitions silver.mt_nyc_yellow_taxi_trips_delta;
--SHOW PARTITIONS silver.mt_nyc_yellow_taxi_trips_parquet
--DESCRIBE FORMATTED silver.mt_nyc_yellow_taxi_trips_delta

### File: CSV

In [3]:
logDebug(log, s"\nPartition info for $yellowTaxiCsvPath")
logDebug(log, s"Is partitioned: ${Partitions.isCsvFilePartitioned(yellowTaxiCsvPath)}")
logDebug(log, s"Partitioned on col(s): ${Partitions.getCsvFilePartitionCols(yellowTaxiCsvPath)}")

### File: Parquet

In [20]:
// TODO: does not work accurately yet
/*logDebug(log, s"\nPartition info for $yellowTaxiParquetPath")
//logDebug(log, s"Is partitioned: ${isParquetFilePartitioned(yellowTaxiParquetPath)}")
logDebug(log, s"Is partitioned: ${isParquetFilePartitioned("/poc/parquet")}")
//logDebug(log, s"Partitioned on col(s): ${Partitions.getParquetFilePartitionCols(yellowTaxiParquetPath)}")*/


### File: Delta

In [21]:
//TODO - No support for Delta. Maybe read the _delta_log JSON and check for partition metadata in Delta 1.x, and use the Delta 2.0 API
/*logDebug(log, s"\nPartition info for $yellowTaxiDeltaPath")
logDebug(log, s"Is partitioned: ${Partitions.isDeltaFilePartitioned(yellowTaxiDeltaPath)}")
logDebug(log, s"Partitioned on col(s): ${Partitions.getDeltaFilePartitionCols(yellowTaxiDeltaPath)}")*/

In [None]:
NewYorkTaxiYellow.write("/poc/delta/nyc_yellow_taxi_trips", StorageFormat.Delta)
NewYorkTaxiYellow.write("/poc/parquet/nyc_yellow_taxi_trips", StorageFormat.Parquet)
NewYorkTaxiYellow.write("/poc/csv/nyc_yellow_taxi_trips", StorageFormat.Csv)

In [None]:
val nycTaxiParquetDf = spark.read.parquet("abfss://tlfs@xxx.dfs.core.windows.net/poc/parquet/nyc_yellow_taxi_trips/")
logDebug(log, "Total rows: "+nycTaxiParquetDf.count)

In [None]:
val nycTaxiDeltaDf = spark.read.format("delta").load("abfss://tlfs@xxx.dfs.core.windows.net/poc/delta/nyc_yellow_taxi_trips/")
logDebug(log, "Total rows: "+nycTaxiDeltaDf.count)

In [None]:
%%sql

select count(*) from test.emp_table where ldt >= '2023-02-26' and ldt < '2023-02-27';
select count(*) from test.emp_table where ldt >= '2023-02-26-00' and ldt < '2023-02-26-23';

## Types

### Round-robin

`repartition(numPartitions: Int)` uses `RoundRobinPartitioning`.
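
The idea behind round-robin assignment can be shown in plain Scala. This is a simplified sketch, not Spark's implementation: the `roundRobin` helper and its `startPartition` parameter are assumptions for illustration, and Spark's own choice of starting partition may differ for each input partition.

```scala
/** Hypothetical helper: deals rows out to partitions in turn, starting at startPartition. */
def roundRobin[T](rows: Seq[T], numPartitions: Int, startPartition: Int = 0): Map[Int, Seq[T]] =
  rows.zipWithIndex
    .groupBy { case (_, index) => (startPartition + index) % numPartitions }
    .map { case (partition, indexed) => partition -> indexed.map(_._1) }

// Ten rows dealt across three partitions.
val assignment = roundRobin(1 to 10, numPartitions = 3)
assignment.toSeq.sortBy(_._1).foreach { case (partition, rows) =>
  println(s"partition $partition -> ${rows.mkString(", ")}")
}
// prints:
// partition 0 -> 1, 4, 7, 10
// partition 1 -> 2, 5, 8
// partition 2 -> 3, 6, 9
```

Because rows are dealt out in turn regardless of their values, partition sizes differ by at most one row, which is why this scheme helps even out skew.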

### Hash

### Range

## Reduce partitions

## Partition Management

Adding, deleting, and relocating specific partitions.

## Partition Pruning

### Static

### Dynamic

## Additional Reading

- [Why partitioning is needed and internals](https://medium.com/@vladimir.prus/spark-partitioning-the-fine-print-5ee02e7cb40b)
- [Understanding Partitioning in Spark](https://sparkbyexamples.com/spark/spark-partitioning-understanding/)
- [Scalable Partition Handling in Cloud-Native Spark Architectures](https://databricks.com/blog/2016/12/15/scalable-partition-handling-for-cloud-native-architecture-in-apache-spark-2-1.html)
- [Catalog based partition handling of DataSource Tables](http://www.gatorsmile.io/spark-2-1-catalog-based-partition-handling-for-data-source-tables/)
- [Dynamic Partition Pruning](https://www.youtube.com/watch?v=ME1KCAYO44o)