
Parallelism in Spark and tuning

Vaquar Khan edited this page Apr 3, 2023 · 1 revision

What do you need to know about Spark Parallelism?

When optimizing Spark applications, you usually tune two things: performance and resource utilization.

Understanding parallelism in Spark, and tuning it to fit the situation, helps with both.

Some Facts:

โžก๏ธ Spark Executor can have multiple CPU Cores assigned to it. โžก๏ธ Number of CPU Cores per Spark executor is defined by ๐˜€๐—ฝ๐—ฎ๐—ฟ๐—ธ.๐—ฒ๐˜…๐—ฒ๐—ฐ๐˜‚๐˜๐—ผ๐—ฟ.๐—ฐ๐—ผ๐—ฟ๐—ฒ๐˜€ configuration. โžก๏ธ Single CPU Core can read one file or partition of a splittable file at a single point in time. โžก๏ธ Once read a file is transformed into one or multiple partitions in memory.

Optimizing Read Parallelism:

โ—๏ธ If number of cores is equal to the number of files, files are not splittable and some of them are larger in size - larger files become a bottleneck, Cores responsible for reading smaller files will idle for some time. โ—๏ธ If there are more Cores than the number of files - Cores that do not have files assigned to them will Idle. If we do not perform repartition after reading the files - the cores will remain Idle during processing stages.

✅ Rule of thumb: set the number of cores to about half the number of files being read. Adjust according to your situation.
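To get a feel for how much idle time the situations above can cause, here is a small simulation in plain Python. It is only a sketch under stated assumptions: `simulate_read` is a hypothetical helper, read time is taken to be proportional to file size, files are not splittable, and the largest files are handed out first. Real Spark scheduling is more involved.

```python
import heapq


def simulate_read(file_sizes, num_cores):
    """Greedy model of cores reading unsplittable files.

    Returns (makespan, idle_fraction). Time is in the same units as file
    size, assuming (hypothetically) one size unit is read per time unit.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    # Each heap entry is the time at which one core becomes free.
    free_at = [0.0] * num_cores
    heapq.heapify(free_at)
    # Largest files first: a choice made for this sketch.
    for size in sorted(file_sizes, reverse=True):
        start = heapq.heappop(free_at)  # the earliest free core takes the file
        heapq.heappush(free_at, start + size)
    makespan = max(free_at)
    if makespan == 0:
        return 0.0, 0.0
    # Share of total core time in which cores had nothing to read.
    idle_fraction = 1.0 - sum(file_sizes) / (makespan * num_cores)
    return makespan, idle_fraction


if __name__ == "__main__":
    files = [10] * 8  # eight equally sized files
    for cores in (16, 8, 4):
        print(cores, simulate_read(files, cores))
```

With eight equal files, sixteen cores leave half of the core time idle, while four cores (half the number of files) keep every core busy at the cost of a longer read. With skewed sizes such as `[100, 10, 10, 10]`, the large file dominates the read time no matter how many cores are added.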

Optimizing Processing Parallelism:

โžก๏ธ Use ๐˜€๐—ฝ๐—ฎ๐—ฟ๐—ธ.๐—ฑ๐—ฒ๐—ณ๐—ฎ๐˜‚๐—น๐˜.๐—ฝ๐—ฎ๐—ฟ๐—ฎ๐—น๐—น๐—ฒ๐—น๐—ถ๐˜€๐—บ and ๐˜€๐—ฝ๐—ฎ๐—ฟ๐—ธ.๐˜€๐—พ๐—น.๐˜€๐—ต๐˜‚๐—ณ๐—ณ๐—น๐—ฒ.๐—ฝ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป๐˜€ configurations to set the number of partitions created after performing wide transformations. โžก๏ธ After reading the files there will be as many partitions as there were files or partitions in splittable files.

โ—๏ธ After data is loaded as partitions into memory - Spark jobs will suffer from the same set of parallelism inefficiencies like when reading the data.

✅ Rule of thumb: set spark.default.parallelism to spark.executor.cores times the number of executors times a small factor between 2 and 8, then tune it for the specific Spark job.
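The rule of thumb is simple arithmetic, so it can be sketched as a small helper. This is plain Python that only builds a dictionary of settings. `suggest_parallelism` and its `tasks_per_core` parameter are hypothetical names, and using the same value for spark.sql.shuffle.partitions is an assumption of this sketch, not something the rule itself states.

```python
def suggest_parallelism(num_executors, executor_cores, tasks_per_core=3):
    """Apply the rule of thumb: cores per executor x executors x small factor."""
    if num_executors < 1 or executor_cores < 1:
        raise ValueError("need at least one executor and one core")
    if not 2 <= tasks_per_core <= 8:
        raise ValueError("tasks_per_core should be between 2 and 8")
    partitions = num_executors * executor_cores * tasks_per_core
    return {
        "spark.executor.instances": str(num_executors),
        "spark.executor.cores": str(executor_cores),
        "spark.default.parallelism": str(partitions),
        # Assumption of this sketch: reuse the same value for SQL shuffles.
        "spark.sql.shuffle.partitions": str(partitions),
    }


if __name__ == "__main__":
    # Print the settings in the key=value form accepted by spark-submit --conf.
    for key, value in suggest_parallelism(10, 4).items():
        print(f"--conf {key}={value}")
```

Treat the result as a starting point: the factor between 2 and 8 is exactly the part that needs tuning per job.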

๐—”๐—ฑ๐—ฑ๐—ถ๐˜๐—ถ๐—ผ๐—ป๐—ฎ๐—น ๐—ก๐—ผ๐˜๐—ฒ๐˜€:

👉 You can use the spark.sql.files.maxPartitionBytes configuration to set the maximum size of a partition when reading files. Files larger than this limit are split into multiple partitions accordingly.

👉 It has been reported that write throughput starts to bottleneck once more than 5 CPU cores are assigned per executor, so keep spark.executor.cores at or below 5.
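As a rough illustration of the first note, the sketch below estimates how many read partitions a set of files would produce for a given spark.sql.files.maxPartitionBytes. It is deliberately simplified: `estimate_read_partitions` is a hypothetical helper, and it ignores that Spark may pack several small files into one partition and also takes other settings (such as spark.sql.files.openCostInBytes) into account, so use it for intuition only.

```python
# Spark's documented default for spark.sql.files.maxPartitionBytes is 128 MB.
DEFAULT_MAX_PARTITION_BYTES = 128 * 1024 * 1024


def estimate_read_partitions(file_sizes_bytes,
                             max_partition_bytes=DEFAULT_MAX_PARTITION_BYTES,
                             splittable=True):
    """Simplified estimate of partitions created when reading files.

    Splittable files are cut into chunks of at most max_partition_bytes;
    unsplittable files always become a single partition each.
    """
    if max_partition_bytes < 1:
        raise ValueError("max_partition_bytes must be positive")
    total = 0
    for size in file_sizes_bytes:
        if size <= 0:
            continue  # empty files contribute no data in this sketch
        if splittable:
            total += -(-size // max_partition_bytes)  # ceiling division
        else:
            total += 1
    return total


if __name__ == "__main__":
    mb = 1024 * 1024
    sizes = [300 * mb, 50 * mb]
    print(estimate_read_partitions(sizes))                    # splittable
    print(estimate_read_partitions(sizes, splittable=False))  # e.g. gzip
```

Lowering the limit increases read parallelism for large splittable files, but it does nothing for unsplittable ones, which is where the bottleneck described earlier comes from.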
