RPM with Copy on Write #1470

Closed

Commits on Feb 1, 2021

  1. RPM with Copy on Write

    This is part of https://fedoraproject.org/wiki/Changes/RPMCoW
    
    The majority of changes are in two new programs:
    
    = rpm2extents
    
    Modeled as a 'stream processor'. It reads a regular .rpm file on stdin,
    and produces a modified .rpm file on stdout. The lead, signature and
    headers are preserved 1:1 so that normal metadata inspection and
    signature verification work as expected. Only the 'payload' is
    modified.
    
    The primary motivation for this tool is to re-organize the payload as a
    sequence of raw file extents (hence the name). The files are organized
    by their digest identity instead of path/filename. If any digest is
    repeated, then the file is skipped/de-duped. Only regular files are
    represented. All other entries, such as directories, symlinks and
    devices, are fully described in the headers and are omitted.
    
    The files are padded so they start on `sysconf(_SC_PAGESIZE)` boundaries
    to permit 'reflink' syscalls to work in the `reflink` plugin.
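
    The padding rule above can be sketched as follows. This is only an
    illustration: the helper names `pad_to()` and `pad_to_page()` are
    hypothetical and are not taken from `rpm2extents`, and the 4096-byte
    fallback when `sysconf()` fails is an assumption.

```c
#include <stdint.h>
#include <unistd.h>

/* Number of padding bytes needed so that `offset` moves up to the next
 * multiple of `align`. Returns 0 when `offset` is already aligned.
 * `align` must be non-zero. (Hypothetical helper, for illustration.) */
static uint64_t pad_to(uint64_t offset, uint64_t align)
{
    uint64_t rem = offset % align;
    return rem ? align - rem : 0;
}

/* Same calculation, using the runtime page size as the alignment.
 * Falls back to 4096 if sysconf() fails (an assumption of this sketch). */
static uint64_t pad_to_page(uint64_t offset)
{
    long page = sysconf(_SC_PAGESIZE);
    return pad_to(offset, page > 0 ? (uint64_t)page : 4096);
}
```

    A writer would emit `pad_to_page(current_offset)` zero bytes before
    each file's content, so every extent begins on a page boundary.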
    
    At the end of the file is a footer with 3 sections:
    
    1. List of calculated digests of the input stream. This is used in
       `librepo` because the file *written* is a derivative, and not the
       same as the repo metadata describes. `rpm2extents` takes one or more
       positional arguments that describe which digest algorithms are
       desired. This is often just `SHA256`. This program is only measuring
       and recording the digest - it does not express an opinion on whether
       the file is correct. Because the APIs of most compression libraries
       read the source file directly, the whole file digest is measured
       using a subprocess and pipes. I don't love it, but it works.
    2. Sorted list of (file content digest, offset) pairs. This is used in
       the plugin with a trivial binary search to locate the start of file
       content. The size is not needed because it's part of normal headers.
    3. A triple: (offset of 1., offset of 2., 8-byte MAGIC value).
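
    The binary search over the sorted digest table (section 2) might look
    roughly like this. The entry layout is an assumption made for the
    sketch: a fixed 32-byte (SHA256-sized) digest followed by a 64-bit
    offset. `struct digest_entry` and `find_offset()` are hypothetical
    names, not the plugin's actual definitions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIGEST_LEN 32  /* assumption: SHA256-sized digests */

/* Hypothetical in-memory form of one table entry. */
struct digest_entry {
    unsigned char digest[DIGEST_LEN];
    uint64_t offset;   /* start of the file content within the rpm */
};

/* Binary search in a table sorted by digest bytes (memcmp order).
 * On a match, stores the offset in *offset_out and returns 0.
 * Returns -1 when the digest is not in the table. */
static int find_offset(const struct digest_entry *table, size_t count,
                       const unsigned char *digest, uint64_t *offset_out)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = memcmp(digest, table[mid].digest, DIGEST_LEN);

        if (cmp == 0) {
            *offset_out = table[mid].offset;
            return 0;
        }
        if (cmp < 0)
            hi = mid;       /* key sorts before the middle entry */
        else
            lo = mid + 1;   /* key sorts after the middle entry */
    }
    return -1;
}
```

    Because de-duplicated files share a digest, a single entry can serve
    several paths; the per-file size still comes from the normal headers.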
    
    = reflink plugin
    
    Looks for the 8-byte magic value at the end of the rpm file. If present,
    it alters the `RPMTAG_PAYLOADFORMAT` in memory to `clon`, and reads in
    the digest->offset table.
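
    A minimal sketch of the magic check, operating on a buffer that holds
    the tail of the file. Everything concrete here is an assumption: the
    `EXAMPLE_MAGIC` constant is made up, and the trailer is assumed to be
    three 64-bit values in host byte order. `struct trailer` and
    `parse_trailer()` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Made-up value for illustration; the real magic is defined by the patch. */
#define EXAMPLE_MAGIC 0x0123456789abcdefULL

/* Assumed layout of the final 24 bytes of a transcoded rpm. */
struct trailer {
    uint64_t digests_offset;  /* offset of section 1 (input digests) */
    uint64_t table_offset;    /* offset of section 2 (digest->offset table) */
    uint64_t magic;           /* 8-byte marker */
};

/* Copies the last sizeof(struct trailer) bytes of `buf` into *out.
 * Returns 1 if the magic matches, 0 if the buffer is too short or the
 * magic does not match (i.e. treat the file as a regular rpm). */
static int parse_trailer(const unsigned char *buf, size_t len,
                         struct trailer *out)
{
    if (len < sizeof(*out))
        return 0;
    memcpy(out, buf + len - sizeof(*out), sizeof(*out));
    return out->magic == EXAMPLE_MAGIC;
}
```

    A return value of 0 would leave the payload format untouched, so
    ordinary cpio-based packages keep taking the existing code path.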
    
    `rpmPackageFilesInstall()` in `fsm.c` is
    modified to alter the enumeration strategy from
    `rpmfiNewArchiveReader()` to `rpmfilesIter()` if not `cpio`. This is
    needed because there is no cpio to enumerate. In the same function, if
    `rpmpluginsCallFsmFilePre()` returns `RPMRC_PLUGIN_CONTENTS` then
    `fsmMkfile()` is skipped as it is assumed the plugin did the work.
    
    The majority of the work is in `reflink_fsm_file_pre()` - the per file
    hook for RPM plugins. If the file enumerated in
    `rpmPackageFilesInstall()` is a regular file, this function will look up
    the offset in the digest->offset table and will try to reflink it, then
    fall back to a regular copy. If reflinking does work, we will have
    reflinked a whole number of pages, so we truncate the file to the
    expected size. Therefore installing most files does involve two writes:
    the reflink of the full size, then a fork/copy on write for the last
    page worth.
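
    The reflink-then-truncate sequence, with its fallback to a regular
    copy, could be sketched like this. It assumes Linux and the
    `FICLONERANGE` ioctl as the reflink mechanism; `install_extent()` and
    `copy_range()` are hypothetical helpers, not the functions in
    `reflink.c`, and error handling is reduced to a bare minimum.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/fs.h>   /* FICLONERANGE, struct file_clone_range (Linux only) */

/* Plain copy of `size` bytes from src_fd at src_off to the start of dst_fd. */
static int copy_range(int src_fd, uint64_t src_off, uint64_t size, int dst_fd)
{
    char buf[65536];
    uint64_t done = 0;

    while (done < size) {
        size_t want = size - done < sizeof(buf) ? (size_t)(size - done)
                                                : sizeof(buf);
        ssize_t n = pread(src_fd, buf, want, (off_t)(src_off + done));
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;              /* read error or unexpected EOF */
        for (ssize_t put = 0; put < n; ) {
            ssize_t w = pwrite(dst_fd, buf + put, (size_t)(n - put),
                               (off_t)(done + (uint64_t)put));
            if (w < 0 && errno == EINTR)
                continue;
            if (w <= 0)
                return -1;
            put += w;
        }
        done += (uint64_t)n;
    }
    return 0;
}

/* Create dst_path holding `size` bytes taken from src_fd at src_off.
 * Tries to reflink a whole number of pages and then truncates to `size`;
 * if the clone is refused for any reason, falls back to a regular copy. */
static int install_extent(int src_fd, uint64_t src_off, uint64_t size,
                          const char *dst_path)
{
    int dst_fd = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (dst_fd < 0)
        return -1;

    long page = sysconf(_SC_PAGESIZE);
    uint64_t pg = page > 0 ? (uint64_t)page : 4096;  /* fallback is an assumption */
    struct file_clone_range fcr = {
        .src_fd = src_fd,
        .src_offset = src_off,
        .src_length = (size + pg - 1) / pg * pg,     /* round up to whole pages */
        .dest_offset = 0,
    };

    int rc;
    if (size > 0 && ioctl(dst_fd, FICLONERANGE, &fcr) == 0)
        rc = ftruncate(dst_fd, (off_t)size);         /* trim the last page */
    else
        rc = copy_range(src_fd, src_off, size, dst_fd);

    if (close(dst_fd) != 0)
        rc = -1;
    return rc;
}
```

    On filesystems without reflink support, or when the source range is
    not block-aligned, the ioctl fails and the copy path produces the same
    file contents, only without sharing extents.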
    
    If the file passed to `reflink_fsm_file_pre()` is anything other than a
    regular file, it returns `RPMRC_OK` so the normal mechanics of
    `rpmPackageFilesInstall()` are used. That handles directories, symlinks
    and other non-regular file types.
    
    = New API for internal use
    
    1. `rpmReadPackageRaw()` is used within `rpm2extents` to read all the
       headers without trying to validate signatures. This eliminates the
       runtime dependency on rpmdb.
    2. `rpmteFd()` exposes the Fd behind the rpmte, so plugins can interact
       with the rpm itself.
    3. `RPMRC_PLUGIN_CONTENTS` in `rpmRC_e` for use in
       `rpmpluginsCallFsmFilePre()` specifically.
    4. `pgpStringVal()` is used to help parse the command line in
       `rpm2extents` - the positional arguments are strings, and this
       converts the values back to the values in the table.
    
    Nothing has been removed, and none of the changes are intended to be
    used externally, so I don't think a soname bump is warranted here.
    malmond77 committed Feb 1, 2021
    82b454d
  2. 9de362d
  3. Match formatting/style of existing code

    The existing code contains some variability in formatting. I'm not sure
    if { is meant to be on the end of the line, or on a new line, but I've
    standardized on the former.
    
    The indentation is intended to match the existing convention: 4 column
    indent, but 8 column wide tab characters. This is easy to follow/use in
    vim, but is surprisingly difficult to get right in vscode. I am doing
    this reformat here and now, and future changes will be after this.
    
    I'm keen to fold the patches together, but for now, I'm trying to keep
    the history of rpm-software-management#1470 linear so everyone can follow along.
    malmond77 committed Feb 1, 2021
    91f7284
  4. Fix printf formatting in reflink.c

    There were some mismatches on field "sizes". This should eliminate the
    error messages.
    malmond77 committed Feb 1, 2021
    19694b7