What is the difference between 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝗶𝗻𝗴 𝗮𝗻𝗱 𝗕𝘂𝗰𝗸𝗲𝘁𝗶𝗻𝗴 𝗶𝗻 𝗦𝗽𝗮𝗿𝗸?
When working with big data, how the data is laid out both on disk and in memory matters a great deal. We should be able to answer questions like:
➡️ Can we achieve the desired parallelism?
➡️ Can we skip reading parts of the data? → Addressed by both partitioning and bucketing.
➡️ How is the data colocated on disk? → Mostly addressed by bucketing.
So what exactly are Partitioning and Bucketing? 𝗟𝗲𝘁'𝘀 𝘇𝗼𝗼𝗺 𝗶𝗻.
𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝗶𝗻𝗴.
➡️ Partitioning in the Spark API is implemented by the .partitionBy() method of the DataFrameWriter class.
➡️ You provide the method with one or more columns to partition by.
➡️ On write, the dataset is split by the partitioning column(s): each distinct value is saved into a separate folder on disk.
➡️ Each folder can contain multiple files; the number of files per folder is determined by how many in-memory partitions hold rows for that value (spark.sql.shuffle.partitions only comes into play when a shuffle happens before the write).
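To make the on-disk result concrete, here is a minimal pure-Python sketch (not Spark itself) of the Hive-style folder layout that a call like df.write.partitionBy("country") produces; the column and values are illustrative:

```python
# Pure-Python sketch (not Spark) of the folder layout produced by
# df.write.partitionBy("country"): one folder per distinct value.
from collections import defaultdict

rows = [
    {"country": "US", "user": "alice"},
    {"country": "DE", "user": "bob"},
    {"country": "US", "user": "carol"},
]

def partitioned_layout(rows, column):
    """Group rows into folders named <column>=<value>, Hive-style."""
    folders = defaultdict(list)
    for row in rows:
        # Spark writes each distinct value into its own folder,
        # e.g. /data/country=US/part-00000.parquet
        folders[f"{column}={row[column]}"].append(row)
    return dict(folders)

layout = partitioned_layout(rows, "country")
print(sorted(layout))             # ['country=DE', 'country=US']
print(len(layout["country=US"]))  # 2 rows land in the US folder
```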
✅ Partitioning enables Partition Pruning: given we filter on a column that the dataframe was partitioned by, Spark can plan to skip reading the folders that do not satisfy the filter condition.
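The pruning idea can be sketched in a few lines of plain Python (again, a simulation, not Spark internals): folders whose name does not match the filter are never opened at all.

```python
# Pure-Python sketch of partition pruning: a filter on the partitioning
# column lets the planner skip whole folders on disk.
folders = {
    "country=US": ["part-00000.parquet", "part-00001.parquet"],
    "country=DE": ["part-00000.parquet"],
    "country=FR": ["part-00000.parquet"],
}

def files_to_read(folders, column, wanted_value):
    """Only folders matching the filter are ever opened."""
    return [
        f"{folder}/{name}"
        for folder, files in folders.items()
        if folder == f"{column}={wanted_value}"  # the pruning step
        for name in files
    ]

# A filter like WHERE country = 'US' touches 2 of the 4 files;
# the other folders are pruned without being read.
print(files_to_read(folders, "country", "US"))
```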
𝗕𝘂𝗰𝗸𝗲𝘁𝗶𝗻𝗴.
➡️ Bucketing in the Spark API is implemented by the .bucketBy() method of the DataFrameWriter class.
𝟭: We have to save the dataset as a table, since the bucket metadata has to be stored somewhere; usually a Hive metastore is leveraged here.
𝟮: You need to provide the number of buckets you want to create. The bucket for a given row is assigned by hashing the bucket column and taking that hash modulo the number of buckets.
𝟯: Rows assigned to the same bucket are collocated when the dataset is saved to disk.
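The hash-modulo assignment in step 𝟮 can be sketched as follows. Note that Spark actually uses Murmur3 hashing internally; zlib.crc32 stands in here only to keep the example deterministic and dependency-free:

```python
# Sketch of bucket assignment: hash(value) mod num_buckets.
# Spark uses Murmur3; zlib.crc32 is a deterministic stand-in.
import zlib

def bucket_for(value, num_buckets):
    """Assign a row to a bucket based on its bucket-column value."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

num_buckets = 4
user_ids = ["u1", "u2", "u3", "u1"]
buckets = [bucket_for(u, num_buckets) for u in user_ids]

# Identical keys always land in the same bucket; this is what
# makes rows with the same key collocated on disk.
assert buckets[0] == buckets[3]
assert all(0 <= b < num_buckets for b in buckets)
```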
✅ When Spark performs a wide transformation (such as a join) between two dataframes bucketed on the relevant columns, it might not need to shuffle the data: the rows are already collocated correctly across executors, and Spark is able to plan for that.
⚠️ There are conditions that need to be met between the two datasets for bucketing to have the desired effect; for example, both sides must be bucketed on the join columns with the same number of buckets.
𝗪𝗵𝗲𝗻 𝘁𝗼 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻 𝗮𝗻𝗱 𝘄𝗵𝗲𝗻 𝘁𝗼 𝗕𝘂𝗰𝗸𝗲𝘁?
✅ If you will often perform filtering on a given column and it is of low cardinality, partition on that column.
✅ If you will be performing complex operations like joins, groupBys, and windowing, and the column is of high cardinality, consider bucketing on that column.
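That rule of thumb can be expressed as a tiny helper; the cardinality threshold below is an illustrative assumption on my part, not a Spark rule:

```python
# Rule-of-thumb from the post, as code. The threshold is an
# illustrative assumption, not anything Spark prescribes.

def suggest_layout(distinct_values, threshold=10_000):
    """Low-cardinality filter column -> partition;
    high-cardinality join/groupBy column -> bucket."""
    return "partitionBy" if distinct_values < threshold else "bucketBy"

assert suggest_layout(distinct_values=50) == "partitionBy"       # e.g. country
assert suggest_layout(distinct_values=900_000) == "bucketBy"     # e.g. user_id
```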
⚠️ Bucketing is complicated to get right, as there are many caveats and nuances you need to know. More on them in future posts.