
Multi Processing Shuffling on Local #30

Open
pickles-bread-and-butter opened this issue Oct 5, 2022 · 0 comments
Overview:

The basic persistor currently does all of its work synchronously. At the very least this should be done in parallel threads, as it is very easy to parallelize. You can even just parallelize over the partitions if that is easiest.

We currently don't have any parallelization in the BasicPersistor, which leads to sub-optimal performance. We can add parallelism by multiprocessing the upload. I recommend multiprocessing here over multithreading because we will be CPU bound rather than I/O bound. S3 is already very efficient at the actual uploading, so it is doubtful we would see a giant jump in I/O wall time; the gains will come from processing the partitions up to the point where we do the upload.

The process that is, in my opinion, a good candidate for multiprocessing currently looks like this (a sketch of the synchronous flow appears after the list):

  1. We partition our data into K partitions based on PARTITION_SIZE
  2. For each of these partitions we call persist_wicker_partition to move the partition data to S3
    1. The partition is iterated example by example
    2. We create a set of column file writers, one for each partition key we find
    3. Each example has its heavy-pointer columns written to the partition file
    4. We yield the partition and the metadata
    5. We then clean up the column writers
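A minimal sketch of that synchronous flow, assuming persist_wicker_partition is the generator described above; the helper names, its exact signature, and its import path are assumptions for illustration, not wicker's actual API.

```python
def make_partitions(dataset_rows, partition_size):
    # Step 1: split the data into K partitions based on PARTITION_SIZE.
    return [
        dataset_rows[i : i + partition_size]
        for i in range(0, len(dataset_rows), partition_size)
    ]


def persist_all_partitions_sync(partitions):
    # Step 2: persist the partitions one after another -- this loop is
    # the sequential work this issue wants to parallelize.
    metadata = []
    for partition in partitions:
        # Hypothetical call: persist_wicker_partition iterates the
        # examples, writes the column files, uploads to S3, and yields
        # the partition metadata (steps 2.1-2.5 above).
        metadata.extend(persist_wicker_partition(partition))
    return metadata
```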

All of these steps happen sequentially, adding an O(N) term to the runtime (and it is not the only O(N) term). We can improve on this by allocating each partition's work to its own process (or thread). The work is CPU bound, with perhaps some I/O-bound portions, and both would improve by splitting the work across workers. If there are other places where multithreading or multiprocessing is an easy, worthwhile addition, adding it there is a good add as well. A sketch of the parallel version follows:
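This sketch uses the standard library's multiprocessing.Pool to fan the partitions out across processes; persist_partition_worker is a hypothetical wrapper, not an existing wicker function. Pool arguments and results must be picklable, so the worker lives at module scope and drains the generator into a list before returning.

```python
import multiprocessing


def persist_partition_worker(partition):
    # Each worker handles one partition end to end and returns its
    # metadata; list() drains the generator, since a generator cannot be
    # pickled back to the parent process.
    return list(persist_wicker_partition(partition))


def persist_all_partitions_parallel(partitions, num_procs=None):
    # Pool defaults to one process per CPU, a sensible starting point
    # for CPU-bound work like this.
    with multiprocessing.Pool(processes=num_procs) as pool:
        per_partition_metadata = pool.map(persist_partition_worker, partitions)
    # Flatten the per-partition metadata lists back into one list.
    return [meta for result in per_partition_metadata for meta in result]
```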

Acceptance Criteria:

  • The basic persistor is parallelized across threads or processes
  • Verification exists to ensure the parallelized output is consistent with the synchronous output (see the sketch below)
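One way to sketch that verification is to run the same partitions through both paths and compare the resulting metadata. This assumes the metadata entries are dicts keyed by a partition identifier, which is an illustration-only assumption.

```python
def test_parallel_matches_synchronous(partitions):
    sync_metadata = persist_all_partitions_sync(partitions)
    parallel_metadata = persist_all_partitions_parallel(partitions)

    # Process completion order is nondeterministic, so compare the two
    # results order-insensitively, keyed on the partition identifier.
    def by_partition(meta):
        return meta["partition_key"]

    assert sorted(sync_metadata, key=by_partition) == sorted(
        parallel_metadata, key=by_partition
    )
```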