Examine Compressed Headers #2220

ffesti · 2022-10-05T10:17:55Z

While the payload is compressed the header is not. This is wasting time and IO whenever a header is loaded - be it form a package or the RPM BD.

For the v6 format we need to examine if compressing the headers yields enough benefits to justify this change.

DemiMarie · 2022-10-05T18:28:18Z

While the payload is compressed the header is not. This is wasting time and IO whenever a header is loaded - be it form a package or the RPM BD.

For the v6 format we need to examine if compressing the headers yields enough benefits to justify this change.

Compressing the main header is okay, as it is signed. Compressing the signature header is a bad idea unless an extra signature is added after compression.

dralley · 2022-12-13T20:35:48Z

IMO this would somewhat devalue the RPM format as opposed to something like Debian or Arch packages, which basically just store compressed files in an archive.

A benefit of RPM is that you don't have to do a lot of processing to get at the metadata, so in theory working with the headers can be more efficient. (I actually have no idea how true that is in practice with modern hardware, comparatively speaking). If you put all the metadata into one big compressed blob which forces you to pay the cost of decompression before you can access any of the metadata, it feels like you would end up in the worst of both worlds - a complex specialty file format that doesn't provide the intended benefits of having a complex specialty file format (at least not all of them).

Random-ish data (like signatures and checksums) is not very compressable, anyway, and it sounds like a lot of the recent upwards pressure on header sizes is coming from this. There were some estimations done when there was a proposal about adding file IMA signatures to repo metadata: rpm-software-management/createrepo_c#313 (comment)

File lists OTOH are fairly compressible, so I won't claim the benefit would be zero.

dralley · 2022-12-13T21:18:41Z

All of which isn't to say that it's not a problem that requires solving, but if it requires a file format revision to solve then maybe we should also examine if a change like this would enable certain other simplifications.

mlschroe · 2022-12-14T10:33:32Z

Just a data point: when I created ndb I played with putting lzo compressed headers in the database. I thought that decompression would be faster than IO, so the net result would be faster database access. But it turned out that things got slower with compression, so I removed the feature again. (This was quite some time ago, so take this with a grain of salt.)

Compressing the main headers in the package format sounds more promising.

dralley · 2022-12-14T14:49:08Z

LZO doesn't have particularly fast decompression by modern standards, as far as I'm aware. zstd is 2-3x faster and lz4 is 5x faster according to these numbers https://github.com/facebook/zstd

dralley · 2022-12-21T06:33:06Z

Here's a script I threw together in 30 minutes for getting a rough estimate of the usefulness

#!/usr/bin/env python

import os
import sys
import createrepo_c as cr

import lz4.frame
import zstd

results = {}

with os.scandir(sys.argv[1]) as entries:
    for entry in entries:
        if not entry.is_file() or not entry.path.endswith(".rpm"):
            continue

        pkg = cr.package_from_rpm(
            entry.path,
            location_href=str(entry.path),
            checksum_type=cr.SHA256,
        )
        header_size_bytes = pkg.rpm_header_end - pkg.rpm_header_start
        with open(entry.path, "rb") as f:
            f.seek(pkg.rpm_header_start)
            header = f.read(header_size_bytes)

        zstd_header = zstd.ZSTD_compress(header)
        lz4_header = lz4.frame.compress(header)

        results[str(entry.path)] = {
            "header_size": header_size_bytes,
            "package_size": pkg.size_package,
            "archive_size": pkg.size_archive,
            "header_size_zstd": len(zstd_header),
            "header_size_lz4": len(lz4_header),
        }

total_size_headers = 0
total_size_packages = 0
total_size_archives = 0
total_size_lz4 = 0
total_size_zstd = 0

print("Results for {} packages".format(len(results)))

for package, data in results.items():
    total_size_headers += data["header_size"]
    total_size_packages += data["package_size"]
    total_size_archives += data["archive_size"]
    total_size_lz4 += data["header_size_lz4"]
    total_size_zstd += data["header_size_zstd"]

print("Average header size as proportion of package total: {:.2f}%".format(total_size_headers / total_size_packages * 100))
print("Average header savings with LZ4 compression: {:.2f}%".format((1 - (total_size_lz4 / total_size_headers)) * 100))
print("Average header savings with ZSTD compression: {:.2f}%".format((1 - (total_size_zstd / total_size_headers)) * 100))
print("Average total savings with LZ4 compressed headers: {:.2f}%".format(
    (1 - (total_size_packages - (total_size_headers - total_size_lz4)) / total_size_packages) * 100)
)
print("Average total savings with ZSTD compressed headers: {:.2f}%".format(
    (1 - (total_size_packages - (total_size_headers - total_size_zstd)) / total_size_packages) * 100)
)

Note:

I didn't tweak the default compression values
Didn't test decompression times

dralley · 2022-12-21T16:30:23Z

Notice: just fixed some math bugs in the above script

dralley · 2023-01-12T23:38:06Z

I ran the script against all the RPMs in RHEL 9 baseos + appstream repos. Results:

[dalley@thinkpad devel]$ python compressed_header_test.py repos/rhel9-combined-pkgs/Packages/
Results for 10644 packages
Average header size as proportion of package total: 1.97%
Average header savings with LZ4 compression: 53.02%
Average header savings with ZSTD compression: 70.65%
Average total savings with LZ4 compressed headers: 1.05%
Average total savings with ZSTD compressed headers: 1.40%

dralley · 2023-12-05T21:24:39Z

How one would compress the header section without altering some aspect of the format significantly (in a way that would not be trivial to backport)?

pmatilai · 2023-12-12T11:33:04Z

You can't. So header compression on package level would have to be an optional feature.

dralley · 2024-05-04T05:01:47Z

Doesn't seem worth it to me

dmnks self-assigned this Nov 2, 2023

dmnks added the RFE label Nov 7, 2023

dmnks removed their assignment Nov 8, 2023

pmatilai added the fileformat Matters concerning package (file) format label Nov 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Examine Compressed Headers #2220

Examine Compressed Headers #2220

ffesti commented Oct 5, 2022

DemiMarie commented Oct 5, 2022

dralley commented Dec 13, 2022 •

edited

Loading

dralley commented Dec 13, 2022

mlschroe commented Dec 14, 2022

dralley commented Dec 14, 2022 •

edited

Loading

dralley commented Dec 21, 2022 •

edited

Loading

dralley commented Dec 21, 2022

dralley commented Jan 12, 2023 •

edited

Loading

dralley commented Dec 5, 2023

pmatilai commented Dec 12, 2023

dralley commented May 4, 2024

Examine Compressed Headers #2220

Examine Compressed Headers #2220

Comments

ffesti commented Oct 5, 2022

DemiMarie commented Oct 5, 2022

dralley commented Dec 13, 2022 • edited Loading

dralley commented Dec 13, 2022

mlschroe commented Dec 14, 2022

dralley commented Dec 14, 2022 • edited Loading

dralley commented Dec 21, 2022 • edited Loading

dralley commented Dec 21, 2022

dralley commented Jan 12, 2023 • edited Loading

dralley commented Dec 5, 2023

pmatilai commented Dec 12, 2023

dralley commented May 4, 2024

dralley commented Dec 13, 2022 •

edited

Loading

dralley commented Dec 14, 2022 •

edited

Loading

dralley commented Dec 21, 2022 •

edited

Loading

dralley commented Jan 12, 2023 •

edited

Loading