Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Examine Compressed Headers #2220

Open
ffesti opened this issue Oct 5, 2022 · 11 comments
Open

Examine Compressed Headers #2220

ffesti opened this issue Oct 5, 2022 · 11 comments
Labels
fileformat Matters concerning package (file) format RFE

Comments

@ffesti
Copy link
Contributor

ffesti commented Oct 5, 2022

While the payload is compressed the header is not. This is wasting time and IO whenever a header is loaded - be it form a package or the RPM BD.

For the v6 format we need to examine if compressing the headers yields enough benefits to justify this change.

@DemiMarie
Copy link
Contributor

While the payload is compressed the header is not. This is wasting time and IO whenever a header is loaded - be it form a package or the RPM BD.

For the v6 format we need to examine if compressing the headers yields enough benefits to justify this change.

Compressing the main header is okay, as it is signed. Compressing the signature header is a bad idea unless an extra signature is added after compression.

@dralley
Copy link
Contributor

dralley commented Dec 13, 2022

IMO this would somewhat devalue the RPM format as opposed to something like Debian or Arch packages, which basically just store compressed files in an archive.

A benefit of RPM is that you don't have to do a lot of processing to get at the metadata, so in theory working with the headers can be more efficient. (I actually have no idea how true that is in practice with modern hardware, comparatively speaking). If you put all the metadata into one big compressed blob which forces you to pay the cost of decompression before you can access any of the metadata, it feels like you would end up in the worst of both worlds - a complex specialty file format that doesn't provide the intended benefits of having a complex specialty file format (at least not all of them).

Random-ish data (like signatures and checksums) is not very compressable, anyway, and it sounds like a lot of the recent upwards pressure on header sizes is coming from this. There were some estimations done when there was a proposal about adding file IMA signatures to repo metadata: rpm-software-management/createrepo_c#313 (comment)

File lists OTOH are fairly compressible, so I won't claim the benefit would be zero.

@dralley
Copy link
Contributor

dralley commented Dec 13, 2022

All of which isn't to say that it's not a problem that requires solving, but if it requires a file format revision to solve then maybe we should also examine if a change like this would enable certain other simplifications.

@mlschroe
Copy link
Contributor

Just a data point: when I created ndb I played with putting lzo compressed headers in the database. I thought that decompression would be faster than IO, so the net result would be faster database access. But it turned out that things got slower with compression, so I removed the feature again. (This was quite some time ago, so take this with a grain of salt.)

Compressing the main headers in the package format sounds more promising.

@dralley
Copy link
Contributor

dralley commented Dec 14, 2022

LZO doesn't have particularly fast decompression by modern standards, as far as I'm aware. zstd is 2-3x faster and lz4 is 5x faster according to these numbers https://github.com/facebook/zstd

@dralley
Copy link
Contributor

dralley commented Dec 21, 2022

Here's a script I threw together in 30 minutes for getting a rough estimate of the usefulness

#!/usr/bin/env python

import os
import sys
import createrepo_c as cr

import lz4.frame
import zstd

results = {}

with os.scandir(sys.argv[1]) as entries:
    for entry in entries:
        if not entry.is_file() or not entry.path.endswith(".rpm"):
            continue

        pkg = cr.package_from_rpm(
            entry.path,
            location_href=str(entry.path),
            checksum_type=cr.SHA256,
        )
        header_size_bytes = pkg.rpm_header_end - pkg.rpm_header_start
        with open(entry.path, "rb") as f:
            f.seek(pkg.rpm_header_start)
            header = f.read(header_size_bytes)

        zstd_header = zstd.ZSTD_compress(header)
        lz4_header = lz4.frame.compress(header)

        results[str(entry.path)] = {
            "header_size": header_size_bytes,
            "package_size": pkg.size_package,
            "archive_size": pkg.size_archive,
            "header_size_zstd": len(zstd_header),
            "header_size_lz4": len(lz4_header),
        }

total_size_headers = 0
total_size_packages = 0
total_size_archives = 0
total_size_lz4 = 0
total_size_zstd = 0

print("Results for {} packages".format(len(results)))

for package, data in results.items():
    total_size_headers += data["header_size"]
    total_size_packages += data["package_size"]
    total_size_archives += data["archive_size"]
    total_size_lz4 += data["header_size_lz4"]
    total_size_zstd += data["header_size_zstd"]

print("Average header size as proportion of package total: {:.2f}%".format(total_size_headers / total_size_packages * 100))
print("Average header savings with LZ4 compression: {:.2f}%".format((1 - (total_size_lz4 / total_size_headers)) * 100))
print("Average header savings with ZSTD compression: {:.2f}%".format((1 - (total_size_zstd / total_size_headers)) * 100))
print("Average total savings with LZ4 compressed headers: {:.2f}%".format(
    (1 - (total_size_packages - (total_size_headers - total_size_lz4)) / total_size_packages) * 100)
)
print("Average total savings with ZSTD compressed headers: {:.2f}%".format(
    (1 - (total_size_packages - (total_size_headers - total_size_zstd)) / total_size_packages) * 100)
)

Note:

  • I didn't tweak the default compression values
  • Didn't test decompression times

@dralley
Copy link
Contributor

dralley commented Dec 21, 2022

Notice: just fixed some math bugs in the above script

@dralley
Copy link
Contributor

dralley commented Jan 12, 2023

I ran the script against all the RPMs in RHEL 9 baseos + appstream repos. Results:

[dalley@thinkpad devel]$ python compressed_header_test.py repos/rhel9-combined-pkgs/Packages/
Results for 10644 packages
Average header size as proportion of package total: 1.97%
Average header savings with LZ4 compression: 53.02%
Average header savings with ZSTD compression: 70.65%
Average total savings with LZ4 compressed headers: 1.05%
Average total savings with ZSTD compressed headers: 1.40%

@dmnks dmnks self-assigned this Nov 2, 2023
@dmnks dmnks added the RFE label Nov 7, 2023
@dmnks dmnks removed their assignment Nov 8, 2023
@pmatilai pmatilai added the fileformat Matters concerning package (file) format label Nov 8, 2023
@dralley
Copy link
Contributor

dralley commented Dec 5, 2023

How one would compress the header section without altering some aspect of the format significantly (in a way that would not be trivial to backport)?

@pmatilai
Copy link
Member

You can't. So header compression on package level would have to be an optional feature.

@dralley
Copy link
Contributor

dralley commented May 4, 2024

Doesn't seem worth it to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fileformat Matters concerning package (file) format RFE
Projects
Status: Backlog
Development

No branches or pull requests

6 participants