
Multi Processing Shuffling on Local #30

Open
pickles-bread-and-butter opened this issue Oct 5, 2022 · 0 comments
Overview:

The basic persistor currently does all of its work synchronously. At the very least this should be done in parallel threads, as it is very easy to parallelize. You can even just parallelize over the partitions if that is easiest.

We currently don't have any parallelization in the BasicPersistor, which leads to sub-optimal performance. We can add parallelism by multiprocessing the upload. I recommend multiprocessing here over multithreading because we will be CPU bound rather than I/O bound. S3 is already very efficient at the actual uploading, so it is doubtful we would see a giant jump in I/O wall time; the gains will come from processing the partitions up to the point where we do the upload.

The process that is, in my opinion, a good candidate for multiprocessing currently looks like this (a sketch of the synchronous flow appears after the list):

  1. We partition our data into K partitions based on PARTITION_SIZE
  2. For each of these partitions we call persist_wicker_partition to move the partition data to S3
    1. The partition is iterated example by example
    2. We create a set of column file writers, one for each partition key we find
    3. Each example has its heavy-pointer columns written to the partition file
    4. We yield the partition and the metadata
    5. We then clean up the column writers
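A minimal sketch of that synchronous flow, assuming persist_wicker_partition is the generator described above; the helper names, its exact signature, and its import path are assumptions for illustration, not wicker's actual API.

```python
def make_partitions(dataset_rows, partition_size):
    # Step 1: split the data into K partitions based on PARTITION_SIZE.
    return [
        dataset_rows[i : i + partition_size]
        for i in range(0, len(dataset_rows), partition_size)
    ]


def persist_all_partitions_sync(partitions):
    # Step 2: persist the partitions one after another -- this loop is
    # the sequential work this issue wants to parallelize.
    metadata = []
    for partition in partitions:
        # Hypothetical call: persist_wicker_partition iterates the
        # examples, writes the column files, uploads to S3, and yields
        # the partition metadata (steps 2.1-2.5 above).
        metadata.extend(persist_wicker_partition(partition))
    return metadata
```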

All of these steps happen sequentially, adding an O(N) term to the runtime (and it is not the only O(N) term). We can improve on this by allocating each partition's work to its own process (or thread). The work is CPU bound, with perhaps some I/O-bound portions, and both would improve by splitting the work across workers. If there are other places where multithreading or multiprocessing is an easy, worthwhile addition, adding it there is a good add as well. A sketch of the parallel version follows:
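This sketch uses the standard library's multiprocessing.Pool to fan the partitions out across processes; persist_partition_worker is a hypothetical wrapper, not an existing wicker function. Pool arguments and results must be picklable, so the worker lives at module scope and drains the generator into a list before returning.

```python
import multiprocessing


def persist_partition_worker(partition):
    # Each worker handles one partition end to end and returns its
    # metadata; list() drains the generator, since a generator cannot be
    # pickled back to the parent process.
    return list(persist_wicker_partition(partition))


def persist_all_partitions_parallel(partitions, num_procs=None):
    # Pool defaults to one process per CPU, a sensible starting point
    # for CPU-bound work like this.
    with multiprocessing.Pool(processes=num_procs) as pool:
        per_partition_metadata = pool.map(persist_partition_worker, partitions)
    # Flatten the per-partition metadata lists back into one list.
    return [meta for result in per_partition_metadata for meta in result]
```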

Acceptance Criteria:

  • The basic persistor is parallelized across threads or processes
  • Verification exists to ensure the parallelized output is consistent with the synchronous output (see the sketch below)
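One way to sketch that verification is to run the same partitions through both paths and compare the resulting metadata. This assumes the metadata entries are dicts keyed by a partition identifier, which is an illustration-only assumption.

```python
def test_parallel_matches_synchronous(partitions):
    sync_metadata = persist_all_partitions_sync(partitions)
    parallel_metadata = persist_all_partitions_parallel(partitions)

    # Process completion order is nondeterministic, so compare the two
    # results order-insensitively, keyed on the partition identifier.
    def by_partition(meta):
        return meta["partition_key"]

    assert sorted(sync_metadata, key=by_partition) == sorted(
        parallel_metadata, key=by_partition
    )
```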