The basic persistor does all of its work synchronously. At the very least this should be done in parallel, as the work is very easy to parallelize; parallelizing over the partitions is the simplest option.

We currently have no parallelization in the BasicPersistor, leading to sub-optimal performance. We can add it by multiprocessing the upload. I recommend multiprocessing over multithreading here because we will be CPU bound, not I/O bound: S3 is already very efficient at the actual uploading, so it's doubtful we will see a big jump in I/O wall time, but we will see one in the per-partition processing that happens before the upload.
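As a rough sketch of what partition-level multiprocessing could look like (the worker function and its arguments below are hypothetical placeholders, not the actual BasicPersistor API):

```python
from concurrent.futures import ProcessPoolExecutor


def persist_partition(partition: list) -> int:
    # Hypothetical stand-in for the CPU-bound per-partition work
    # (serializing examples, building column files) done before upload.
    return sum(len(str(row)) for row in partition)


def persist_all(partitions: list) -> list:
    # A process pool sidesteps the GIL, so CPU-bound partition
    # processing scales with cores; a thread pool would not.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(persist_partition, partitions))


if __name__ == "__main__":
    partitions = [list(range(10)), list(range(20)), list(range(5))]
    print(persist_all(partitions))  # → [10, 30, 5]
```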
The process that IMO is a good candidate for multiprocessing is currently:

1. We partition our data into K partitions based on `PARTITION_SIZE`.
2. For each of these partitions we call `persist_wicker_partition` to move the partition data to S3:
   - The partition is iterated example by example.
   - We create a set of column file writers, one for each partition key we find.
   - Each example has its heavy-pointer columns written to the partition file.
   - We yield the partition and its metadata.
   - We then clean up the column writers.
All these steps happen sequentially, adding an O(N) runtime cost (not the only O(N) cost, either). We can improve on this by giving each partition its own process (or thread). The work is CPU bound, with perhaps some I/O bound as well, and both improve when split across workers. If there are other places where multithreading/multiprocessing is easy and worthwhile to add, adding it there is a good idea too.
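The per-partition steps above can be sketched as a single worker; the writer class, metadata shape, and file layout below are hypothetical placeholders, not the real `persist_wicker_partition` implementation:

```python
import os
import tempfile
from typing import Dict, Iterable, Iterator, Tuple


class ColumnFileWriter:
    # Hypothetical stand-in for the real column file writer.
    def __init__(self, path: str):
        self._f = open(path, "ab")

    def write(self, data: bytes) -> None:
        self._f.write(data)

    def close(self) -> None:
        self._f.close()


def persist_wicker_partition(
    partition: Iterable[Tuple[str, bytes]],
    out_dir: str,
) -> Iterator[Tuple[str, int]]:
    """Sketch of the per-partition steps listed above."""
    writers: Dict[str, ColumnFileWriter] = {}
    try:
        # Iterate the partition, one example at a time.
        for partition_key, heavy_bytes in partition:
            # Lazily create one column file writer per partition key.
            if partition_key not in writers:
                writers[partition_key] = ColumnFileWriter(
                    os.path.join(out_dir, f"{partition_key}.col")
                )
            # Write the example's heavy-pointer bytes to its file.
            writers[partition_key].write(heavy_bytes)
            # Yield the partition key and metadata (here, a byte count).
            yield partition_key, len(heavy_bytes)
    finally:
        # Clean up the column writers.
        for w in writers.values():
            w.close()


if __name__ == "__main__":
    examples = [("train", b"abc"), ("eval", b"de"), ("train", b"f")]
    with tempfile.TemporaryDirectory() as d:
        meta = list(persist_wicker_partition(examples, d))
    print(meta)  # → [('train', 3), ('eval', 2), ('train', 1)]
```

Because each partition touches only its own files and writers, a worker like this has no shared state and can be dispatched per partition to a process pool as-is.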
Acceptance Criteria:
- Basic persistor is now parallelized on threads or processes
- Verification exists to ensure the parallelized data is consistent with the synchronous output
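For the verification criterion, one possible approach is to run the same partitions through both paths and compare digests of the results (the helper names here are illustrative, not the Wicker API):

```python
import hashlib
from concurrent.futures import ProcessPoolExecutor


def persist_partition(partition: list) -> str:
    # Stand-in for persisting one partition: return a digest of its
    # contents so the two code paths can be compared byte-for-byte.
    h = hashlib.sha256()
    for item in partition:
        h.update(item.encode())
    return h.hexdigest()


def persist_sync(partitions: list) -> list:
    return [persist_partition(p) for p in partitions]


def persist_parallel(partitions: list) -> list:
    # pool.map preserves input order, so results line up with
    # the synchronous path partition-for-partition.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(persist_partition, partitions))


if __name__ == "__main__":
    partitions = [[f"row-{i}-{j}" for j in range(100)] for i in range(4)]
    # The parallel path must produce identical results.
    assert persist_sync(partitions) == persist_parallel(partitions)
    print("consistent")
```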