Examine Compressed Headers #2220
Compressing the main header is okay, as it is signed. Compressing the signature header is a bad idea unless an extra signature is added after compression.
IMO this would somewhat devalue the RPM format as opposed to something like Debian or Arch packages, which basically just store compressed files in an archive. A benefit of RPM is that you don't have to do a lot of processing to get at the metadata, so in theory working with the headers can be more efficient. (I actually have no idea how true that is in practice on modern hardware, comparatively speaking.)

If you put all the metadata into one big compressed blob, which forces you to pay the cost of decompression before you can access any of the metadata, it feels like you would end up with the worst of both worlds: a complex specialty file format that doesn't provide the intended benefits of having a complex specialty file format (at least not all of them).

Random-ish data (like signatures and checksums) is not very compressible anyway, and it sounds like a lot of the recent upward pressure on header sizes is coming from this. Some estimates were done when there was a proposal to add file IMA signatures to repo metadata: rpm-software-management/createrepo_c#313 (comment)

File lists OTOH are fairly compressible, so I won't claim the benefit would be zero.
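The compressibility gap described above is easy to illustrate. This is a rough sketch using `zlib` from the stdlib as a stand-in for zstd/lz4, with synthetic data (random bytes modeling digests and signatures, and a made-up repetitive file list); the exact ratios are illustrative only.

```python
# Random bytes (stand-in for checksums/signatures) barely shrink, while
# repetitive path lists shrink a lot. zlib is a stdlib stand-in for
# zstd/lz4; data is synthetic, ratios are illustrative only.
import os
import zlib

random_blob = os.urandom(64 * 1024)  # models SHA-256 digests, signatures
file_list = "\n".join(
    f"/usr/share/doc/pkg/file{i:05d}.txt" for i in range(2000)
).encode()

for name, blob in [("random", random_blob), ("file list", file_list)]:
    compressed = zlib.compress(blob, 6)
    print(f"{name}: {len(blob)} -> {len(compressed)} bytes "
          f"({len(compressed) / len(blob):.0%} of original)")
```

The random blob typically comes out at roughly 100% of its original size, while the file list collapses to a small fraction of it.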
All of which isn't to say that it's not a problem that requires solving, but if it requires a file format revision to solve then maybe we should also examine if a change like this would enable certain other simplifications. |
Just a data point: when I created ndb I played with putting LZO-compressed headers in the database. I thought that decompression would be faster than IO, so the net result would be faster database access. But it turned out that things got slower with compression, so I removed the feature again. (This was quite some time ago, so take it with a grain of salt.) Compressing the main headers in the package format sounds more promising.
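The trade-off described above (decompression cost vs IO saved) can be sketched like this. This only times in-memory decompression with stdlib `zlib` standing in for LZO/zstd/lz4, on synthetic header-like data; a real measurement would have to compare against actual disk reads on the target hardware.

```python
# Minimal sketch of the decompression-vs-IO trade-off: time how long it
# takes to decompress a header-like blob, and report how many bytes of
# IO the compression would save. zlib is a stand-in for LZO/zstd/lz4;
# the data is synthetic, not a real rpm header.
import time
import zlib

header_like = "\n".join(f"/usr/lib64/libfoo.so.{i}" for i in range(5000)).encode()
compressed = zlib.compress(header_like, 6)

t0 = time.perf_counter()
for _ in range(100):
    out = zlib.decompress(compressed)
t1 = time.perf_counter()

assert out == header_like  # sanity check the round trip
saved = len(header_like) - len(compressed)
print(f"decompress: {(t1 - t0) / 100 * 1e6:.1f} us per header, "
      f"{saved} bytes of IO saved per header")
```

Whether the microseconds spent decompressing beat the IO saved depends entirely on the storage medium, which matches the observation that the result flipped in practice.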
LZO doesn't have particularly fast decompression by modern standards, as far as I'm aware. zstd is 2-3x faster and lz4 is 5x faster, according to the benchmark numbers at https://github.com/facebook/zstd
Here's a script I threw together in 30 minutes for getting a rough estimate of the usefulness:

```python
#!/usr/bin/env python
import os
import sys

import createrepo_c as cr
import lz4.frame
import zstd

results = {}

with os.scandir(sys.argv[1]) as entries:
    for entry in entries:
        if not entry.is_file() or not entry.path.endswith(".rpm"):
            continue

        pkg = cr.package_from_rpm(
            entry.path,
            location_href=str(entry.path),
            checksum_type=cr.SHA256,
        )
        header_size_bytes = pkg.rpm_header_end - pkg.rpm_header_start

        # Read the raw header section and compress it both ways.
        with open(entry.path, "rb") as f:
            f.seek(pkg.rpm_header_start)
            header = f.read(header_size_bytes)

        zstd_header = zstd.ZSTD_compress(header)
        lz4_header = lz4.frame.compress(header)

        results[str(entry.path)] = {
            "header_size": header_size_bytes,
            "package_size": pkg.size_package,
            "archive_size": pkg.size_archive,
            "header_size_zstd": len(zstd_header),
            "header_size_lz4": len(lz4_header),
        }

total_size_headers = 0
total_size_packages = 0
total_size_archives = 0
total_size_lz4 = 0
total_size_zstd = 0

print("Results for {} packages".format(len(results)))

for package, data in results.items():
    total_size_headers += data["header_size"]
    total_size_packages += data["package_size"]
    total_size_archives += data["archive_size"]
    total_size_lz4 += data["header_size_lz4"]
    total_size_zstd += data["header_size_zstd"]

print("Average header size as proportion of package total: {:.2f}%".format(
    total_size_headers / total_size_packages * 100))
print("Average header savings with LZ4 compression: {:.2f}%".format(
    (1 - (total_size_lz4 / total_size_headers)) * 100))
print("Average header savings with ZSTD compression: {:.2f}%".format(
    (1 - (total_size_zstd / total_size_headers)) * 100))
print("Average total savings with LZ4 compressed headers: {:.2f}%".format(
    (1 - (total_size_packages - (total_size_headers - total_size_lz4)) / total_size_packages) * 100))
print("Average total savings with ZSTD compressed headers: {:.2f}%".format(
    (1 - (total_size_packages - (total_size_headers - total_size_zstd)) / total_size_packages) * 100))
```
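To sanity-check the arithmetic in the savings formulas, here is a worked example with made-up round numbers (not real measurements): packages totalling 10 MB whose headers total 1 MB and compress to 0.4 MB.

```python
# Worked example of the script's savings formulas with hypothetical
# numbers: 10 MB of packages, 1 MB of headers, compressing to 0.4 MB.
total_packages = 10_000_000
total_headers = 1_000_000
total_compressed = 400_000

header_share = total_headers / total_packages * 100
header_savings = (1 - total_compressed / total_headers) * 100
total_savings = (1 - (total_packages - (total_headers - total_compressed))
                 / total_packages) * 100

print(f"header share:   {header_share:.2f}%")    # 10.00%
print(f"header savings: {header_savings:.2f}%")  # 60.00%
print(f"total savings:  {total_savings:.2f}%")   # 6.00%
```

The last formula reduces to (bytes saved on headers) / (total package bytes): a 60% reduction on headers that make up 10% of the bytes shaves 6% off the total.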
Note: just fixed some math bugs in the above script.
I ran the script against all the RPMs in RHEL 9 baseos + appstream repos. Results:
How would one compress the header section without altering some aspect of the format significantly (in a way that would not be trivial to backport)?
You can't. So header compression on package level would have to be an optional feature. |
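For context on why it can't be done compatibly: a v4 reader locates and sizes the header section from an uncompressed magic and two big-endian counts at fixed offsets, so a compressed blob in that position simply would not parse. A sketch of that layout, using synthetic bytes rather than a real package:

```python
# Sketch of why compressed headers need a format revision: an RPM v4
# reader expects the header section to start with a fixed 8-byte magic,
# followed by a big-endian index-entry count and data-blob size, and
# computes the section size from them. Synthetic data, not a real rpm.
import struct

HEADER_MAGIC = b"\x8e\xad\xe8\x01\x00\x00\x00\x00"

def header_section_size(buf: bytes) -> int:
    """Size in bytes of the header section starting at buf[0]."""
    if buf[:8] != HEADER_MAGIC:
        raise ValueError("not positioned at an rpm header")
    index_count, data_size = struct.unpack(">II", buf[8:16])
    # 16-byte preamble + 16 bytes per index entry + the data blob.
    return 16 + index_count * 16 + data_size

# Build a synthetic header: 2 index entries, 32 bytes of tag data.
fake = HEADER_MAGIC + struct.pack(">II", 2, 32) + b"\x00" * (2 * 16 + 32)
print(header_section_size(fake))  # 80
```

Any compression of this region would have to be signalled out-of-band (e.g. by a new lead/format revision), since old readers hard-code the expectation of finding this structure in the clear.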
Doesn't seem worth it to me.
While the payload is compressed, the header is not. This wastes time and IO whenever a header is loaded, be it from a package or the rpm database.
For the v6 format we need to examine whether compressing the headers yields enough benefit to justify this change.