Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add zchunk support #92

Merged
merged 14 commits into from Nov 26, 2018
Merged

Add zchunk support #92

merged 14 commits into from Nov 26, 2018

Conversation

jdieter
Copy link
Contributor

@jdieter jdieter commented Jun 12, 2018

This request adds zchunk support to createrepo_c. It does now pass the python test cases, but there are some untested edge cases that will definitely need some work, namely using zchunk compression for anything other than primary, filelists and other.

The zchunk metadata is in addition to the standard gzip metadata and createrepo_c currently only creates zchunk versions primary, filelists and other.

@Conan-Kudo
Copy link
Member

@jdieter So this means that if I added other files to the metadata using modifyrepo_c, we can't chunk via zck for now?

@jdieter
Copy link
Contributor Author

jdieter commented Jun 13, 2018

That is correct. I don't think it will be too hard to fix that, especially now that zchunk has the ability to chunk using the buzhash algorithm as well as user-specified blocks, but I first wanted to get the big three (well, more the big two: primary and filelists) sorted.

@Conan-Kudo
Copy link
Member

Well, as long as there is a plan for eventually supporting any arbitrary files manipulated by createrepo_c, this looks good to me.

Copy link
Member

@Conan-Kudo Conan-Kudo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good.

A couple of things:

  1. Can you regenerate the man pages to include your new options? The script to do so is in utils directory.
  2. Could you add some tests specifically for zck stuff?

src/cmd_parser.c Outdated
{ "zck-filelists-dict", 0, 0, G_OPTION_ARG_FILENAME, &(_cmd_options.zck_filelists_dict),
"Compression dictionary to use for zchunk filelists file", "ZCK_FILELISTS_DICT" },
{ "zck-other-dict", 0, 0, G_OPTION_ARG_FILENAME, &(_cmd_options.zck_other_dict),
"Compression dictionary to use for zchunk other file", "ZCK_OTHER_DICT" },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the dictionaries mandatory, or optional? My understand of zstd is that they make it better for compression, but aren't necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are optional, but make a significant difference in the size of the zchunked metadata. I've put together dictionaries for primary, filelists and other based on data from current Fedora releases, that I'd suggest packaging for Fedora/EPEL (if that's what our builders are on).

The dictionary is embedded in the zchunk file, so clients don't need a copy.

If the dictionary changes, there will be no common chunks between the pre-change and post-change files, so dictionary changes should happen a maximum of once a release (and probably aren't needed even that often)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgive me for being forgetful, but is there a script or something that people can use to generate dictionaries for their distributions' metadata?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got a small python script that splits the metadata into much smaller files and strips out any checksums (which by definition are unpredictable).

I ran the script over a couple of days worth of updates metadata, which produced thousands of files, and then, in the directory containing those files, ran:

zstd --train --dictID=0 *

The default dict size of roughly 100KB was fairly optimal for Fedora's metadata. Too small and you don't see the savings you'd expect. Too large and you're just wasting space.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed Fedora dicts are available at https://www.jdieter.net/downloads/zchunk-dicts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone ahead and updated the man page. I'll work on the tests but it might take a bit longer to get them done.

src/misc.c Outdated
@@ -1256,6 +1256,46 @@ cr_write_to_file(GError **err, gchar *filename, const char *format, ...)
return ret;
}

gchar *cr_read_from_file(GError **err, gchar *filename, size_t *file_size) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this different from g_file_get_contents()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, I feel stupid. It's exactly the same. I've just pushed a commit that remove cr_read_from_file and uses g_file_get_contents instead.

@jdieter
Copy link
Contributor Author

jdieter commented Jun 22, 2018

@Conan-Kudo, I've added some tests (and found and fixed some problems in the process). Do you have some further suggestions as to which tests I should create?

@jdieter
Copy link
Contributor Author

jdieter commented Jun 22, 2018

Just a heads up that the latest tests require zchunk-0.7.3 (available in the copr) in order to work, as it added the --stdout flag to unzck.

@Conan-Kudo
Copy link
Member

No, it looks good to me.

Copy link
Member

@Conan-Kudo Conan-Kudo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small issue with the CMakeLists...

CMakeLists.txt Outdated
@@ -34,7 +34,13 @@ find_package(LZMA REQUIRED)
find_package(OpenSSL REQUIRED)
find_package(Sqlite3 REQUIRED)
find_package(ZLIB REQUIRED)

find_package(PkgConfig REQUIRED)
find_library(ZCHUNKLIB NAMES zck)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you write a FindZChunk.cmake similar to the FindGLIB2.cmake module and use that instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let me know if this works or you want something different

CMakeLists.txt Outdated
@@ -34,6 +34,8 @@ find_package(LZMA REQUIRED)
find_package(OpenSSL REQUIRED)
find_package(Sqlite3 REQUIRED)
find_package(ZLIB REQUIRED)
find_package(PkgConfig REQUIRED)
find_package(Zchunk REQUIRED)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not to use pkg_check_modules() ? do not over-complicate this. please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know about pkg_check_modules(). Thanks for the pointer!

cr_contentstat_free(pri_zck_stat, NULL);
cr_contentstat_free(fil_zck_stat, NULL);
cr_contentstat_free(oth_zck_stat, NULL);
g_free(pri_zck_filename);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use g_autofree

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

g_autofree() was added in glib-2.44. What's the minimum glib version that we need to support?

int ret = CRE_OK;

assert(cr_file);
assert(!err || *err == NULL);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be g_return_val_if_fail

Copy link
Member

@Conan-Kudo Conan-Kudo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change set looks good to me.

Copy link

@ignatenkobrain ignatenkobrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally looks good, but you miss include_directories(ZCK_INCLUDE_DIRS) or smth like that

@jdieter
Copy link
Contributor Author

jdieter commented Aug 16, 2018

There are no zck-specific include directories. zck.h is in /usr/include. Should I still have include_directories(...)?

@Conan-Kudo
Copy link
Member

@jdieter Yes, because it might be installed to non-standard paths by someone else.

@jdieter
Copy link
Contributor Author

jdieter commented Aug 16, 2018

Ok, is this right?

@Conan-Kudo
Copy link
Member

Looks good to me.

@j-mracek
Copy link
Member

j-mracek commented Aug 21, 2018

Please whats a status of availability of packages that fulfill requirement "pkg_check_modules(ZCK REQUIRED zck)" in Fedora/Centos/RHEL ... ?

@Conan-Kudo
Copy link
Member

@j-mracek It's been available in Fedora, CentOS, RHEL, openSUSE, and Mageia for a couple of weeks now.

@j-mracek
Copy link
Member

Please and whats a package name, because I was unable to find it?

@Conan-Kudo
Copy link
Member

Conan-Kudo commented Aug 21, 2018

zchunk-devel is the package you need installed for this code. It's in EPEL for CentOS/RHEL 7. That package name will work across all distributions (though this will install lib64zck-devel on Mageia, and libzck-devel on openSUSE).

You can also request it by the pkgconfig name: sudo dnf install 'pkgconfig(zck)'.

@jdieter
Copy link
Contributor Author

jdieter commented Aug 23, 2018

I've been thinking about how zchunk-enabled createrepo_c uses dictionaries, and the current method is a bit awkward. To use createrepo_c without dictionaries, you just run:

createrepo_c --zck ./

But to enable dictionary use, you need to run:

createrepo_c --zck --zck-primary-dict=foo-primary.dict --zck-filelists-dict=foo-filelists.dict --zck-other-dict=foo-other.dict ./

As we look at adding zchunk support to prestodelta.xml and other files generated by createrepo_c or added using modifyrepo_c, this starts to become very unwieldy. I suggest a different solution, passing a --zck-dict-dir argument that points to a directory containing foo.dict where foo is primary, filelists, other, prestodelta, etc. The command usage would look like:

createrepo_c --zck --zck-dict-dir=/usr/share/repo-dict

If there is no dictionary in the directory that matches the current file, then the file would be compressed without a dictionary.

Thoughts? Should I go ahead and implement this as part of this patch set?

@Conan-Kudo
Copy link
Member

@jdieter That sounds fine to me, but I suggest that you document what it expects for that.

@Conan-Kudo
Copy link
Member

Rather than having that structure exclusively, it's probably better to support both ways.

So, for now, I'd like to see this go in as-is, and we can add the --zck-dict-dir option later.

@jdieter
Copy link
Contributor Author

jdieter commented Sep 1, 2018

@Conan-Kudo, if you feel strongly about it, we can keep the current options, but I feel like they're really bulky. I think --zck-dict-dir is a more elegant method, and I'd definitely prefer to drop the current options before people start depending on them.

@jdieter
Copy link
Contributor Author

jdieter commented Nov 13, 2018

@dmach, now that F29's out the door, I'd love to start generating zchunk metadata in official Fedora. Is there anything I can do to help get this reviewed?

@Conan-Kudo
Copy link
Member

@jdieter Can you please rebase this change? There are reported conflicts from the other merge earlier today.

Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
--zck-filelists-dict and --zck-other-dict.

Jonathan
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
argument checking in cmd_parser.c rather than createrepo_c.c

Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
@jdieter
Copy link
Contributor Author

jdieter commented Nov 19, 2018

Done! Thanks for the heads up!

@Conan-Kudo
Copy link
Member

@dmach Can we please merge this now?

#include <zlib.h>
#include <bzlib.h>
#include <lzma.h>
#include <zck.h>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be wrapped into #ifdef WITH_ZCHUNK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this is done now. I did a test build on a system without zchunk, and it built and ran just fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for committing these patches!!!

Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
@dmach dmach merged commit 93c9970 into rpm-software-management:master Nov 26, 2018
@dmach
Copy link

dmach commented Nov 26, 2018

I'll submit a patch to Fedora's dist-git, spec needs couple tweaks to handle bcond_with(out) zchunk

kontura added a commit to kontura/createrepo_c that referenced this pull request May 20, 2019
kontura added a commit to kontura/createrepo_c that referenced this pull request May 20, 2019
kontura added a commit to kontura/createrepo_c that referenced this pull request May 21, 2019
kontura added a commit to kontura/createrepo_c that referenced this pull request May 21, 2019
kontura added a commit to kontura/createrepo_c that referenced this pull request May 21, 2019
kontura added a commit to kontura/createrepo_c that referenced this pull request May 21, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants