Memory leak? #179
@alvinstarr The manifest is held in a temporary file located in the archive's metadata directory, for example "/var/lib/wyng/a_b0a0ccba36969efb25c7e08cf060115bc92c8365/Vol_48e331/S_20240212-123456-tmp/manifest.tmp". This should be on regular storage, not RAM (although it's conceivable that /var/lib/ might be set up differently on your system).

The first thing I would suspect is Wyng's deduplicator. If you are enabling it, dedup will use a lot of RAM; the amount depends on how much data is in the archive and in the volume being backed up. Disabling the deduplicator would be the first thing to try.

If deduplication is a hard requirement for your larger volumes, you could try creating a separate archive for those volumes using a larger chunk size setting. The default is 128kB, and setting it to 1MB would reduce the metadata size (and dedup RAM use) by about 80%. (A possible enhancement would be to automatically detect such large volumes and configure the dedup index to reside entirely on disk; however, there would be a quite noticeable performance penalty.)

There is also the possibility of a Python garbage collection issue not related to dedup. That could be more difficult to diagnose and address.
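As a rough back-of-the-envelope illustration (not Wyng code; the 26TB volume size comes from the report below), here is how chunk size drives the number of per-chunk entries Wyng has to track, which is where most of the metadata and dedup RAM footprint comes from:

```python
# Rough illustration only -- not Wyng code.

def chunk_count(volume_bytes, chunk_bytes):
    """Number of chunks a fully-populated volume of this size would produce."""
    return -(-volume_bytes // chunk_bytes)   # ceiling division

TB = 1024 ** 4
volume = 26 * TB                             # the 26TB volume from this issue

for label, size in (("128kB (default)", 128 * 1024), ("1MB", 1024 * 1024)):
    print(f"{label:>16}: ~{chunk_count(volume, size):,} chunk entries")

# 128kB (default): ~218,103,808 chunk entries
#              1MB:  ~27,262,976 chunk entries
# i.e. 8x fewer rows of metadata (and dedup index) to hold and write out.
```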
I should have mentioned that, in addition to the regular manifest.tmp in /var/lib, Wyng's dedup mode will also create a "hashindex.dat" under /tmp (which is usually in RAM). It is an extension of the dedup index located in RAM (Python heap memory).
We have not turned on dedup. I know just enough Python to get myself in trouble, but I would be willing to help find out what this problem is any way I can.
@alvinstarr OK, thanks. As I don't have anything nearly that large, your feedback will be important. I can try monitoring what a few TB of backups does on my end; hopefully I will get some clues that way. It sounds like a basic garbage collection issue, something that is suppressing gc. First I'll try adding a debug mode that reports resource use every 100MB or so, and we can go from there. On your end, I need to know the OS distro, the Python version and any Python-specific env settings, as well as the command line used to invoke Wyng (obfuscating vol names etc. is fine).
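To give a concrete idea of the kind of periodic report meant here, a hypothetical sketch follows; `report_resources` is an illustrative helper name, not something that exists in Wyng:

```python
# Hypothetical sketch of a periodic resource report -- not Wyng's actual debug
# code, just one way such a hook could look inside a send loop.
import gc
import resource

def report_resources(bytes_sent, interval=100 * 1024 * 1024, _state={"next": 0}):
    """Print peak RSS and GC counters roughly every `interval` bytes processed."""
    if bytes_sent >= _state["next"]:
        _state["next"] = bytes_sent + interval
        rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kB on Linux
        print(f"[debug] sent={bytes_sent // 2**20}MB  "
              f"peak_rss={rss_kb // 1024}MB  gc_counts={gc.get_count()}")

# Example: call report_resources(total_bytes_sent) after each chunk is written.
```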
Something for me to try.... I hope it is this simple.
@alvinstarr There is a potential fix now in the 08wip branch. Test by doing 'wyng send' on large volumes. (This does not yet have a debug mode that shows periodic resource status, but the existing --debug will show resources at the end of each volume send.)
It looks like it may have finished but then puked.
The OS is CentOS Linux release 7.5.1804 (Core). Also, these were thick volumes that were converted to thin volumes.
@alvinstarr Go ahead and take the latest update from the 08wip branch and do a send with that. It should behave a lot better.
This has no bearing on the archive chunk size, which is independent of the LVM chunk size. The archive chunk size can only be set by 'wyng init' when creating the archive. If the tarfile issue is really the culprit, I don't think the chunk size will have to change in order for it to work (although I'd still recommend eventually moving to a new archive that is set to a larger chunk size).

Also, BTW (unrelated): You may want to check out the
We would hope to do a full copy and then incremental copies based on the tick/tock snapshots. Given how long it took to run the backup, I am of half a mind to try to manually complete the backup and see if I can then do an incremental from there. Our data set is largely compressed image files, so there is not a whole lot of room for dedup; I am therefore inclined to try bumping up the chunk-factor.

I am really impressed with the work you have done. I will start a new backup with your recent changes to the 08wip and also bump up the chunk size to try to get better performance.
@alvinstarr I note that the last output log you posted shows Wyng release 20231002, not the one with the tarfile fix I posted yesterday. That would mean both the main and helper processes were struggling with swapped-out memory lists from the tarfile module when Wyng was closing the tar stream, which no doubt contributed to the timeout. The fix would alleviate that, but you could also rig the timeout so it waits longer by changing the seconds value in
Yes, chunk factor 6 = 2MB. That would reduce a 6.5GB manifest to roughly 400MB. It sounds like maxing that out would be optimal in your use case. Wyng doesn't allow chunk sizes larger than 2MB, and I'm not sure extending that would noticeably help in your case: it would only take you from 1/16 of the original metadata & dir size to 1/32, for example, and I'm not sure what you would consider optimal. Obviously, Wyng makes the back-end filesystem do a lot of the work and there needs to be a balance, which is why I allowed for a chunk size factor.
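To make the arithmetic explicit, here is a small sketch; the factor-to-size mapping is inferred from the figures in this thread (factor 2 = 128kB default, factor 6 = 2MB), not taken from Wyng's source:

```python
# Assumed mapping, inferred from the numbers in this thread (default 128kB,
# factor 6 = 2MB) rather than from Wyng's source code.
def chunk_size(factor):
    return 64 * 1024 * 2 ** (factor - 1)     # factor 2 -> 128kB, factor 6 -> 2MB

manifest_gb_at_default = 6.5                 # manifest size quoted above at 128kB chunks
scale = chunk_size(6) / chunk_size(2)        # 16x larger chunks -> 16x fewer entries

print(f"factor 6 chunk size: {chunk_size(6) // 1024} kB")
print(f"estimated manifest:  {manifest_gb_at_default / scale * 1024:.0f} MB")
# -> 2048 kB chunks, manifest roughly 416 MB, in line with the ~400MB above.
```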
I'm considering it for a future version of the Wyng format. Manually completing the
They would all have to be present in the archive and renamed without the '.tmp'. Then you should do a

Coping strategy: Instead of trying to make the vol fit in the existing archive, you could create a new archive with a larger chunk factor and back up just that one volume to it (for now). Then schedule a switch-over date for when you would move the rest of your volumes over to the new archive. (Just a suggestion.)
Yes, this is issue 16,
Is there a way to disable compression?
No, compression is a fixture in the send/receive processes. You can get bandwidth similar to uncompressed by using the 'zstd' compressor at a level less than 1. Beyond that, there are optimization issues open to improve send speed, such as #11. OTOH, Wyng is intended to sparse-skip over much of the volume space most of the time, and that is currently where it is most optimized.
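For a sense of what the compression level trades away, here is a small benchmark sketch using the python-zstandard package (an assumption on my part; Wyng's own compressor wiring may differ). Levels below 1 are zstd's "fast" negative levels and need a reasonably recent zstd/python-zstandard:

```python
# Sketch of the speed/ratio trade-off behind "zstd at a level less than 1".
# Assumes the python-zstandard package; drop the -1 entry if your version
# rejects negative levels.
import os
import time
import zstandard

data = os.urandom(4 * 2**20) + bytes(60 * 2**20)   # mixed incompressible + sparse data

for level in (-1, 1, 3, 19):
    cctx = zstandard.ZstdCompressor(level=level)
    t0 = time.perf_counter()
    out = cctx.compress(data)
    dt = time.perf_counter() - t0
    print(f"level {level:>3}: {len(out) / len(data):6.1%} of original size, "
          f"{len(data) / dt / 2**20:7.1f} MB/s")
```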
An interesting observation. The first 3 blurps of traffic are:
Those tests were run against a 500GB test volume.
I think what may be going on is a large difference in CPython's garbage collection workload due to dynamic buffering having to juggle larger buffers. (Hmmm. Does a 1MB buffer behave much differently?) Wyng does not yet use static buffering for transfer operations, and I always suspected that locally-based archives would someday throw performance issues that were masked by net access into high relief (as your benchmark just did).

It would also be interesting to see the difference, for instance, with the helper script removed from the local transfer loop. That, in combination with using static buffers, could make a big difference, IMO. However, the limitations of the zstandard lib I'm currently using preclude static buffering. For now, I might want to move this discussion to issue #11 ...
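To illustrate what dynamic vs. static buffering means here, a minimal sketch (not Wyng code, and as noted above the current zstandard usage precludes the static approach): reusing one preallocated buffer via readinto() instead of allocating a fresh bytes object per chunk.

```python
# Minimal illustration of the two buffering styles discussed above (not Wyng code).

def copy_dynamic(src, dst, chunk=2**20):
    """Dynamic buffering: read() hands back a new bytes object every pass,
    which the allocator/GC then has to reclaim."""
    while True:
        buf = src.read(chunk)
        if not buf:
            break
        dst.write(buf)

def copy_static(src, dst, chunk=2**20):
    """Static buffering: one preallocated bytearray is refilled in place."""
    buf = bytearray(chunk)
    view = memoryview(buf)
    while True:
        n = src.readinto(buf)
        if not n:
            break
        dst.write(view[:n])
```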
Increase tar stream timeout to accommodate large manifests, issue #179
This should be close-able now. Issue 11 is open to handle optimization ideas. |
I am trying to back up a 26TB volume.
About halfway through the process the OOM killer kicked in and killed the backup.
It looks like it consumed 64G on a system with 128G of RAM and a 4G swap.
Is it possible that the manifest is held in memory until the backup is completed?