
What is the difference between Partitioning and Bucketing in Spark?


When working with big data, there are many important considerations about how the data is stored, both on disk and in memory. We should try to answer questions like:

โžก๏ธ Can we achieve desired parallelism?

โžก๏ธ Can we skip reading parts of the data? โœ… The question is addressed by partitioning and bucketing procedures

โžก๏ธ How is the data colocated on disk? โœ… The question is mostly addressed by bucketing.

So how do Partitioning and Bucketing actually work? Let's zoom in.

๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป๐—ถ๐—ป๐—ด.

โžก๏ธ Partitioning in Spark API is implemented by .partitionBy() method of the DataFrameWriter class. โžก๏ธ You provide the method one or multiple columns to partition by. โžก๏ธ The dataset is written to disk split by the partitioning column, each of the partitions is saved into a separate folder on disk. โžก๏ธ Each folder can maintain multiple files, the amount of resulting files is controlled by the setting spark.sql.shuffle.partitions.

✅ Partitioning enables Partition Pruning: if we filter on a column that the dataframe was partitioned by, Spark can plan to skip reading the files that do not fall under the filter condition.
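Continuing the sketch above, a filter on the partitioning column shows pruning in action (same hypothetical path and columns):

```python
# Spark resolves the filter against the folder structure, so only
# the files under country=US/ need to be read at all.
us_sales = spark.read.parquet("/tmp/sales").where("country = 'US'")

# The physical plan should list the condition under PartitionFilters
# rather than as a row-level filter on the scanned data.
us_sales.explain()
```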

๐—•๐˜‚๐—ฐ๐—ธ๐—ฒ๐˜๐—ถ๐—ป๐—ด.

โžก๏ธ Bucketing in Spark API is implemented by .bucketBy() method of the DataFrameWriter class.

1: We have to save the dataset as a table, since the bucket metadata has to be stored somewhere; usually a Hive metastore is leveraged here.

2: You need to provide the number of buckets you want to create. The bucket number for a given row is assigned by computing a hash of the bucketing column and taking that hash modulo the number of buckets.

3: Rows assigned to the same bucket are collocated when the dataset is saved to disk.
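A minimal sketch of a bucketed write in PySpark, assuming a metastore is available to hold the table definition (the table name, column names, and bucket count are illustrative):

```python
from pyspark.sql import SparkSession

# enableHiveSupport() assumes a Hive metastore is configured to store
# the table (and bucket) metadata.
spark = SparkSession.builder.appName("bucketing-demo").enableHiveSupport().getOrCreate()

# Hypothetical events dataset; `user_id` is the bucketing column.
events = spark.createDataFrame(
    [(1, "click"), (2, "view"), (1, "purchase"), (3, "click")],
    ["user_id", "event_type"],
)

# Each row lands in bucket hash(user_id) % 16; rows sharing a bucket
# are collocated in the same files. bucketBy() requires saveAsTable().
(events.write
       .bucketBy(16, "user_id")
       .sortBy("user_id")
       .mode("overwrite")
       .saveAsTable("events_bucketed"))
```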

✅ If Spark performs a wide transformation (such as a join) between two dataframes bucketed this way, it may not need to shuffle the data, since the rows are already collocated on the executors correctly and Spark is able to plan for that.

โ—๏ธThere are conditions that need to be met between two datasets in order for bucketing to have desired effect.

๐—ช๐—ต๐—ฒ๐—ป ๐˜๐—ผ ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป ๐—ฎ๐—ป๐—ฑ ๐˜„๐—ต๐—ฒ๐—ป ๐˜๐—ผ ๐—•๐˜‚๐—ฐ๐—ธ๐—ฒ๐˜?

✅ If you will often filter on a given column and it is of low cardinality, partition on that column.

✅ If you will be performing complex operations like joins, groupBys, and windowing, and the column is of high cardinality, consider bucketing on that column.

โ—๏ธBucketing is complicated to nail as there are many caveats and nuances you need to know when it comes to it. More on it in future posts.

