New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add zchunk support #92
Conversation
@jdieter So this means that if I added other files to the metadata using |
That is correct. I don't think it will be too hard to fix that, especially now that zchunk has the ability to chunk using the buzhash algorithm as well as user-specified blocks, but I first wanted to get the big three (well, more the big two: primary and filelists) sorted. |
Well, as long as there is a plan for eventually supporting any arbitrary files manipulated by createrepo_c, this looks good to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looks good.
A couple of things:
- Can you regenerate the man pages to include your new options? The script to do so is in
utils
directory. - Could you add some tests specifically for zck stuff?
src/cmd_parser.c
Outdated
{ "zck-filelists-dict", 0, 0, G_OPTION_ARG_FILENAME, &(_cmd_options.zck_filelists_dict), | ||
"Compression dictionary to use for zchunk filelists file", "ZCK_FILELISTS_DICT" }, | ||
{ "zck-other-dict", 0, 0, G_OPTION_ARG_FILENAME, &(_cmd_options.zck_other_dict), | ||
"Compression dictionary to use for zchunk other file", "ZCK_OTHER_DICT" }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the dictionaries mandatory, or optional? My understand of zstd is that they make it better for compression, but aren't necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are optional, but make a significant difference in the size of the zchunked metadata. I've put together dictionaries for primary, filelists and other based on data from current Fedora releases, that I'd suggest packaging for Fedora/EPEL (if that's what our builders are on).
The dictionary is embedded in the zchunk file, so clients don't need a copy.
If the dictionary changes, there will be no common chunks between the pre-change and post-change files, so dictionary changes should happen a maximum of once a release (and probably aren't needed even that often)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgive me for being forgetful, but is there a script or something that people can use to generate dictionaries for their distributions' metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've got a small python script that splits the metadata into much smaller files and strips out any checksums (which by definition are unpredictable).
I ran the script over a couple of days worth of updates metadata, which produced thousands of files, and then, in the directory containing those files, ran:
zstd --train --dictID=0 *
The default dict size of roughly 100KB was fairly optimal for Fedora's metadata. Too small and you don't see the savings you'd expect. Too large and you're just wasting space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposed Fedora dicts are available at https://www.jdieter.net/downloads/zchunk-dicts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone ahead and updated the man page. I'll work on the tests but it might take a bit longer to get them done.
src/misc.c
Outdated
@@ -1256,6 +1256,46 @@ cr_write_to_file(GError **err, gchar *filename, const char *format, ...) | |||
return ret; | |||
} | |||
|
|||
gchar *cr_read_from_file(GError **err, gchar *filename, size_t *file_size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this different from g_file_get_contents()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, I feel stupid. It's exactly the same. I've just pushed a commit that remove cr_read_from_file and uses g_file_get_contents instead.
@Conan-Kudo, I've added some tests (and found and fixed some problems in the process). Do you have some further suggestions as to which tests I should create? |
Just a heads up that the latest tests require zchunk-0.7.3 (available in the copr) in order to work, as it added the --stdout flag to unzck. |
No, it looks good to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small issue with the CMakeLists...
CMakeLists.txt
Outdated
@@ -34,7 +34,13 @@ find_package(LZMA REQUIRED) | |||
find_package(OpenSSL REQUIRED) | |||
find_package(Sqlite3 REQUIRED) | |||
find_package(ZLIB REQUIRED) | |||
|
|||
find_package(PkgConfig REQUIRED) | |||
find_library(ZCHUNKLIB NAMES zck) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you write a FindZChunk.cmake
similar to the FindGLIB2.cmake
module and use that instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let me know if this works or you want something different
CMakeLists.txt
Outdated
@@ -34,6 +34,8 @@ find_package(LZMA REQUIRED) | |||
find_package(OpenSSL REQUIRED) | |||
find_package(Sqlite3 REQUIRED) | |||
find_package(ZLIB REQUIRED) | |||
find_package(PkgConfig REQUIRED) | |||
find_package(Zchunk REQUIRED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not to use pkg_check_modules() ? do not over-complicate this. please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know about pkg_check_modules(). Thanks for the pointer!
cr_contentstat_free(pri_zck_stat, NULL); | ||
cr_contentstat_free(fil_zck_stat, NULL); | ||
cr_contentstat_free(oth_zck_stat, NULL); | ||
g_free(pri_zck_filename); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use g_autofree
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
g_autofree() was added in glib-2.44. What's the minimum glib version that we need to support?
int ret = CRE_OK; | ||
|
||
assert(cr_file); | ||
assert(!err || *err == NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be g_return_val_if_fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change set looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally looks good, but you miss include_directories(ZCK_INCLUDE_DIRS) or smth like that
There are no zck-specific include directories. zck.h is in /usr/include. Should I still have include_directories(...)? |
@jdieter Yes, because it might be installed to non-standard paths by someone else. |
Ok, is this right? |
Looks good to me. |
Please whats a status of availability of packages that fulfill requirement "pkg_check_modules(ZCK REQUIRED zck)" in Fedora/Centos/RHEL ... ? |
@j-mracek It's been available in Fedora, CentOS, RHEL, openSUSE, and Mageia for a couple of weeks now. |
Please and whats a package name, because I was unable to find it? |
You can also request it by the pkgconfig name: |
I've been thinking about how zchunk-enabled createrepo_c uses dictionaries, and the current method is a bit awkward. To use createrepo_c without dictionaries, you just run:
But to enable dictionary use, you need to run:
As we look at adding zchunk support to prestodelta.xml and other files generated by createrepo_c or added using modifyrepo_c, this starts to become very unwieldy. I suggest a different solution, passing a
If there is no dictionary in the directory that matches the current file, then the file would be compressed without a dictionary. Thoughts? Should I go ahead and implement this as part of this patch set? |
@jdieter That sounds fine to me, but I suggest that you document what it expects for that. |
Rather than having that structure exclusively, it's probably better to support both ways. So, for now, I'd like to see this go in as-is, and we can add the |
@Conan-Kudo, if you feel strongly about it, we can keep the current options, but I feel like they're really bulky. I think |
@dmach, now that F29's out the door, I'd love to start generating zchunk metadata in official Fedora. Is there anything I can do to help get this reviewed? |
@jdieter Can you please rebase this change? There are reported conflicts from the other merge earlier today. |
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
--zck-filelists-dict and --zck-other-dict. Jonathan
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
argument checking in cmd_parser.c rather than createrepo_c.c Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
Done! Thanks for the heads up! |
@dmach Can we please merge this now? |
#include <zlib.h> | ||
#include <bzlib.h> | ||
#include <lzma.h> | ||
#include <zck.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be wrapped into #ifdef WITH_ZCHUNK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this is done now. I did a test build on a system without zchunk, and it built and ran just fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for committing these patches!!!
Signed-off-by: Jonathan Dieter <jdieter@gmail.com>
I'll submit a patch to Fedora's dist-git, spec needs couple tweaks to handle bcond_with(out) zchunk |
This request adds zchunk support to createrepo_c. It does now pass the python test cases, but there are some untested edge cases that will definitely need some work, namely using zchunk compression for anything other than primary, filelists and other.
The zchunk metadata is in addition to the standard gzip metadata and createrepo_c currently only creates zchunk versions primary, filelists and other.