
Optimization - CPU #11

Open
tasket opened this issue Dec 14, 2018 · 10 comments
Labels: help wanted, optimization

@tasket
Owner

tasket commented Dec 14, 2018

Changes that may improve throughput, especially for send:

  • Multithreaded encoding layer for compression and encryption
  • Other areas for concurrency: getting deltas, dedup init, send and receive main loops
  • Alternatives to tar streaming, such as direct file IO for internal: destination
  • Static buffers to avoid garbage collection
  • Structs, especially in deduplication code
  • Explore new Python optimization options
  • Tighten the main send loop, use locals
  • Use formats instead of + string concat
  • Quicker compression – see issue #23 (Zstandard compression)
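The "static buffers" item above can be sketched as follows. This is a minimal illustration, not Wyng's actual code: one preallocated buffer is refilled via `readinto()` instead of letting each `read()` allocate a fresh bytes object, which is what feeds the garbage collector.

```python
# Illustrative sketch: reuse a single preallocated buffer across reads.
# CHUNK and checksum_volume are hypothetical names, not Wyng internals.
import io

CHUNK = 128 * 1024          # hypothetical chunk size
buf = bytearray(CHUNK)      # allocated once, reused for every chunk
view = memoryview(buf)      # zero-copy slicing for short final reads

def checksum_volume(stream):
    """Consume a binary stream chunk-by-chunk without per-read allocation."""
    total = 0
    while True:
        n = stream.readinto(buf)   # fills the same buffer in place
        if n == 0:
            break
        total += sum(view[:n])     # stand-in for compress/encrypt work
    return total

result = checksum_volume(io.BytesIO(b"\x01" * (CHUNK + 10)))
```

The same pattern applies to any hot loop that currently does `data = f.read(chunksize)` per iteration.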
@tasket
Owner Author

tasket commented Dec 19, 2018

An optimization attempt was posted to the optimize1 branch. Unfortunately, the limited testing I did showed little if any difference.

I may try re-introducing some of these changes on top of alpha4 and do some more extensive trials.

@tasket
Owner Author

tasket commented Feb 22, 2019

(Note) Some optimization of prune/merge was recently done by setting LC_ALL=C and using -m merge where possible. In some cases this slashes pruning time by more than 75%.
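The LC_ALL=C trick works because it makes GNU sort compare raw bytes instead of doing locale-aware collation, and `-m` merges already-sorted inputs without a full re-sort. A sketch of invoking it from Python (the helper name and file paths are illustrative, not Wyng's actual code):

```python
# Illustrative sketch: merge pre-sorted manifest files with byte-order
# collation. merge_sorted_manifests is a hypothetical helper name.
import os
import subprocess

def merge_sorted_manifests(paths, out_path):
    env = dict(os.environ, LC_ALL="C")   # byte-wise comparison, much faster
    with open(out_path, "wb") as out:
        # sort -m assumes each input is already sorted (in the same order)
        subprocess.run(["sort", "-m", *paths], env=env, stdout=out, check=True)
```

Note that `-m` only produces correct output if every input was itself sorted under LC_ALL=C.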

@tasket tasket added the help wanted label Jul 27, 2020
@tasket tasket mentioned this issue Feb 20, 2024
@tasket
Owner Author

tasket commented Feb 23, 2024

#179 (comment)

An interesting observation.

The first 3 bursts of traffic are:

with 128K blocks
with 2048K blocks
with 2048K blocks and compression set to zstd:1

Those tests were run against a 500GB test volume.
The 4th burst is the 22TB volume.
I would expect the B/s to be in the same general ballpark, but there is something else going on.

[image attachment: traffic graph]

@tasket
Owner Author

tasket commented Feb 23, 2024

@alvinstarr I think what may be going on is the large difference in CPython's garbage-collection workload, due to dynamic buffering having to juggle larger buffers. (Hmmm. Does a 1MB buffer behave much differently?)

Wyng does not yet use static buffering for transfer operations. And I always suspected that locally-based archives would someday throw performance issues that were masked by net access into high relief (as your benchmark just did).

It would also be interesting to see the difference, for instance, with the helper script removed from the local transfer loop. That, in combination with static buffers, could make a big difference, IMO. However, the limitations of the zstandard lib I'm currently using preclude static buffering.

One really cheap (and safe) tweak you could try in the Wyng code is to remove the file IO buffering= parameter, letting CPython adjust automatically:

    # Open source volume and its delta bitmap as r, session manifest as w.
    with open(volpath,"rb", buffering=chunksize) as vf,    \

(Moved to issue 11.)

@tasket
Owner Author

tasket commented Feb 26, 2024

@alvinstarr I've tested a simple modification to Wyng that bypasses the tarfile streaming when the destination is a local filesystem. This improves the throughput by 17%.

The test parameters: same SSD source and destination, 2MB chunks, zstd:0, and no encryption.
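The tar-bypass idea can be sketched as follows: when the destination is a local filesystem, write each chunk straight to its own file instead of framing everything through a tar stream. Function and file names here are illustrative, not Wyng's actual code; the tarfile version is shown only for comparison.

```python
# Illustrative sketch of direct file IO vs. tar streaming for send.
import io
import os
import tarfile

def send_chunks_direct(chunks, dest_dir):
    """Write (name, data) chunks as plain files -- no tar framing."""
    os.makedirs(dest_dir, exist_ok=True)
    for name, data in chunks:
        with open(os.path.join(dest_dir, name), "wb") as f:
            f.write(data)

def send_chunks_tar(chunks, dest_stream):
    """The tarfile-streaming equivalent, for comparison."""
    with tarfile.open(fileobj=dest_stream, mode="w|") as tar:
        for name, data in chunks:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
```

The direct path skips per-chunk tar header construction and the extra copy through the stream, which is where the observed gain plausibly comes from.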

@tasket
Owner Author

tasket commented Feb 26, 2024

BTW, removing the buffering=chunksize parameter did not improve throughput.

@alvinstarr

Got sidetracked a bit.
I will take a look at your changes and see if they help our situation.

We have a backup and an incremental of our 27TB volume.
It's good to hear about the speed improvement, since it took 4 days to run the 27TB backup.
The first incremental took close to 2 days.

We are looking at copying our backups to a remote location so that we can have off-site storage.
As part of that process we started copying using rsync and scp, but we ran into bandwidth-delay product (BDP) problems.
To get around this we are using bbcp (https://www.slac.stanford.edu/~abh/bbcp/).
The speed improvement has been from 20 Mbps to 500 Mbps by using bbcp.
Not sure if you can leverage it, but bbcp can provide a huge speed improvement for large data sets.

@tasket
Owner Author

tasket commented Feb 29, 2024

@alvinstarr Sorry, I got sidetracked as well. I just posted the --tar-bypass optimization for send after fixing some bugs. Use it with send when the dest archive is local (file:/...). It will indicate it is bypassing the tar stream.

The kicker is that while I'm seeing throughput increase >17% on an archive with 2MB chunks, I also tried an archive with 1MB chunks. It appears that the 1MB chunk size is the fastest for sending an initial/whole volume, regardless of whether tar-bypass is used, and the gain in throughput going from 2MB to 1MB is about 25%.

Overall, sending to a 1MB-chunk archive while using --tar-bypass, I saw as much as a 37% gain in throughput.

The tar-bypass is considered experimental at this point, although I don't anticipate it causing any issues.

Thanks for the tip about bbcp for backup copies. To be useful inside Wyng, a copy/archive utility would have to handle streams from memory as well as files (this is why Python's tarfile lib was used). Getting a similar multi-thread, multi-stream boost in wyng send will probably require asyncio or one of the newer multiprocess options.
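As a rough sketch of the multi-stream idea, the CPU-bound compression step could be fanned out across a worker pool while the main loop keeps feeding chunks. This uses zlib as a stand-in for zstd (both release the GIL while compressing, so threads genuinely overlap); the function names are illustrative, not Wyng's actual code.

```python
# Illustrative sketch: parallel chunk compression with a thread pool.
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_chunk(data: bytes) -> bytes:
    return zlib.compress(data, 1)     # level 1, i.e. "quicker compression"

def compress_all(chunks, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, which a backup stream requires.
        return list(pool.map(compress_chunk, chunks))
```

Ordering matters here: `map()` yields results in submission order, so the archive stream stays deterministic even though compression completes out of order.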

@tasket
Owner Author

tasket commented Feb 29, 2024

@alvinstarr PS - You may want to look into bbcp's behavior when updating existing file sets. The documentation has scant info on that subject and I could not figure out if it would skip files based on file timestamps, for example. It also doesn't seem to have a delete feature. So my own preference would be, after the initial bbcp copy, to use rsync -aH --delete to update the offsite backups. Of course, if you have doubts about the effect of a copy or update, you can always use Wyng's arch-check feature to check the integrity of the copy.

@alvinstarr

alvinstarr commented Mar 1, 2024

bbcp is definitely not a replacement for rsync, because of the delete and existing-file sync issues you mentioned.
Also, for lots of small files like we are running here, the best way to use bbcp is in pipe mode with tar on both ends.

bbcp may not integrate well into what you're doing, but it may be possible to leverage the knowledge and work that has gone into its network socket processing.
