Parallelism in Spark and tuning

What do you need to know about 𝗦𝗽𝗮𝗿𝗸 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹𝗶𝘀𝗺?

When optimizing Spark applications, you usually tune for two things: performance and resource utilization.

Understanding parallelism in Spark, and tuning it to your situation, helps with both.

𝗦𝗼𝗺𝗲 𝗙𝗮𝗰𝘁𝘀:

➡️ A Spark Executor can have multiple CPU Cores assigned to it.

➡️ The number of CPU Cores per Spark Executor is defined by the 𝘀𝗽𝗮𝗿𝗸.𝗲𝘅𝗲𝗰𝘂𝘁𝗼𝗿.𝗰𝗼𝗿𝗲𝘀 configuration.

➡️ A single CPU Core can read one file, or one partition of a splittable file, at a time.

➡️ Once read, a file is transformed into one or more partitions in memory.

𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗶𝗻𝗴 𝗥𝗲𝗮𝗱 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹𝗶𝘀𝗺:

❗️ If the number of Cores equals the number of files, the files are not splittable, and some files are larger than others, the larger files become a bottleneck: Cores that read the smaller files finish early and sit idle.

❗️ If there are more Cores than files, the Cores without a file assigned sit idle. Unless we repartition after reading the files, those Cores also remain idle during the processing stages, at least until a wide transformation redistributes the data.

✅ Rule of thumb: set the number of Cores to about half the number of files being read. Adjust according to your situation.
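To make the idle-Core effect concrete, here is a minimal sketch in plain Python (no Spark required). It assumes non-splittable files, one file per Core at a time, the same read speed on every Core, and greedy assignment of the next file to whichever Core frees up first. The name `simulate_read` and the file sizes are hypothetical, chosen only for illustration; Spark's real scheduler is more involved.

```python
import heapq


def simulate_read(file_sizes, num_cores):
    """Simulate reading non-splittable files with a fixed number of Cores.

    Time is measured in "size units" (assumption: constant, equal read speed).
    Returns (makespan, utilization), where utilization is the fraction of
    Core-time spent reading rather than idling.
    """
    if num_cores <= 0:
        raise ValueError("num_cores must be positive")

    # Each heap entry is the time at which one Core becomes free.
    free_at = [0.0] * num_cores
    heapq.heapify(free_at)

    # Files are handed out in the given order to the earliest-free Core.
    for size in file_sizes:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + size)

    makespan = max(free_at)
    busy_time = sum(file_sizes)
    utilization = busy_time / (makespan * num_cores) if makespan else 0.0
    return makespan, utilization


files = [100, 100, 100, 400]  # hypothetical sizes; one file is much larger

print(simulate_read(files, 4))  # (400.0, 0.4375): as many Cores as files
print(simulate_read(files, 8))  # (400.0, 0.21875): more Cores than files
print(simulate_read(files, 2))  # (500.0, 0.7): half as many Cores as files
```

With half as many Cores as files, the read takes somewhat longer in this toy example, but far less Core-time is wasted. That is the trade-off the rule of thumb is aiming at.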

𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗶𝗻𝗴 𝗣𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹𝗶𝘀𝗺:

➡️ Use the 𝘀𝗽𝗮𝗿𝗸.𝗱𝗲𝗳𝗮𝘂𝗹𝘁.𝗽𝗮𝗿𝗮𝗹𝗹𝗲𝗹𝗶𝘀𝗺 (RDD API) and 𝘀𝗽𝗮𝗿𝗸.𝘀𝗾𝗹.𝘀𝗵𝘂𝗳𝗳𝗹𝗲.𝗽𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝘀 (DataFrame and SQL API) configurations to set the number of partitions created by wide transformations.

➡️ After the files are read, there are as many partitions as there were files, or partitions within splittable files.

❗️ Once data is loaded into memory as partitions, Spark jobs suffer from the same parallelism inefficiencies as when reading the data: too few or unevenly sized partitions leave Cores idle.

✅ Rule of thumb: set 𝘀𝗽𝗮𝗿𝗸.𝗱𝗲𝗳𝗮𝘂𝗹𝘁.𝗽𝗮𝗿𝗮𝗹𝗹𝗲𝗹𝗶𝘀𝗺 to 𝘀𝗽𝗮𝗿𝗸.𝗲𝘅𝗲𝗰𝘂𝘁𝗼𝗿.𝗰𝗼𝗿𝗲𝘀 × the number of Executors × a small factor between 2 and 8, then tune it for the specific Spark job.
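The arithmetic behind this rule of thumb can be sketched in a few lines of plain Python. The helper names `suggest_parallelism` and `to_submit_args` are hypothetical, and applying the same value to `spark.sql.shuffle.partitions` is an assumption made here for DataFrame and SQL jobs; the rule above only names `spark.default.parallelism`.

```python
def suggest_parallelism(num_executors, executor_cores, factor=3):
    """Return Spark settings derived from: cores per Executor x Executors x factor.

    `factor` is the small multiplier from the rule of thumb (2 to 8).
    """
    if num_executors <= 0 or executor_cores <= 0:
        raise ValueError("num_executors and executor_cores must be positive")
    if not 2 <= factor <= 8:
        raise ValueError("factor is expected to be between 2 and 8")

    total_cores = num_executors * executor_cores
    partitions = total_cores * factor
    return {
        "spark.executor.cores": str(executor_cores),
        "spark.default.parallelism": str(partitions),
        # Assumption: reuse the same number for DataFrame/SQL shuffles.
        "spark.sql.shuffle.partitions": str(partitions),
    }


def to_submit_args(settings):
    """Render the settings as `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


settings = suggest_parallelism(num_executors=10, executor_cores=4, factor=3)
print(settings["spark.default.parallelism"])  # 120
print(" ".join(to_submit_args(settings)))
```

Treat the result as a starting point: run the job, check task durations and idle time in the Spark UI, and move the factor up or down from there.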

𝗔𝗱𝗱𝗶𝘁𝗶𝗼𝗻𝗮𝗹 𝗡𝗼𝘁𝗲𝘀:

👉 You can use the 𝘀𝗽𝗮𝗿𝗸.𝘀𝗾𝗹.𝗳𝗶𝗹𝗲𝘀.𝗺𝗮𝘅𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝗕𝘆𝘁𝗲𝘀 configuration to set the maximum partition size when reading files. Splittable files larger than this are split into multiple partitions accordingly.

👉 A commonly cited guideline is that write throughput starts to bottleneck once more than 5 CPU Cores are assigned per Executor, so keep 𝘀𝗽𝗮𝗿𝗸.𝗲𝘅𝗲𝗰𝘂𝘁𝗼𝗿.𝗰𝗼𝗿𝗲𝘀 at or below 5.
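How the maximum partition size influences read parallelism can be approximated as follows. This is a deliberately simplified sketch in plain Python: it assumes each splittable file is cut into chunks of at most the configured size and that every file yields at least one partition. Spark's actual planning also packs small files together and takes other settings into account, so real partition counts can differ. The function name `estimate_read_partitions` is hypothetical.

```python
MB = 1024 * 1024


def estimate_read_partitions(file_sizes_bytes, max_partition_bytes=128 * MB,
                             splittable=True):
    """Rough estimate of the number of partitions produced when reading files.

    Simplification: ignores Spark's packing of small files into one partition.
    """
    if max_partition_bytes <= 0:
        raise ValueError("max_partition_bytes must be positive")

    total = 0
    for size in file_sizes_bytes:
        if splittable and size > max_partition_bytes:
            # Integer ceiling division: number of chunks needed for this file.
            total += -(-size // max_partition_bytes)
        else:
            # Small or non-splittable files are read as a single partition.
            total += 1
    return total


files = [64 * MB, 300 * MB, 1024 * MB]  # hypothetical file sizes

print(estimate_read_partitions(files))                                # 12
print(estimate_read_partitions(files, max_partition_bytes=256 * MB))  # 7
print(estimate_read_partitions(files, splittable=False))              # 3
```

Lowering the maximum partition size yields more, smaller partitions, which helps when there are more Cores than files; raising it reduces per-task overhead when there are very many partitions.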
