ETag cache validation patches #137

Merged
merged 14 commits into from
Jan 22, 2017
Conversation

ThePythonicCow
Contributor

As I have written at more length in #136, this set of patches, which replaces the methods of checking whether the cache is still valid with a single method using ETags, is now ready for consideration to pull into the main branch.

See in particular patch "3ee963e ETag fix #3: add ETag consistency checking" for the most interesting details of this change.

Thanks!

Python 2 code must now explicitly invoke "python2", since the
unqualified "python" executable defaults to "python3" in
recent releases.

Existing riofs caches only support up to 4 GBytes of cached
data, because cache_dir_max_size, the maximum cache size in
bytes, is a 32-bit unsigned integer.

Extend that limit by a factor of a million (1024 * 1024, actually).

Existing riofs configurations still work as-is, with no change
in the allowed cache size.  A new configuration option can
optionally override the existing setting with a count of how
many MBytes, rather than how many bytes, are allowed in the
maximum cache directory size.

Remove "filesystem.cache_dir_max_size" from the list of config
settings that must be set at riofs startup, because now either
that setting and/or the new "filesystem.cache_dir_max_megabyte_size"
setting may be set (with the latter taking precedence, if set).

The code that determines the configured size complains if
neither option is set.

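The precedence rule above can be sketched as follows; the function
name and the "zero means unset" convention are illustrative
assumptions, not the actual riofs implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: resolve the effective maximum cache size in bytes, giving
 * the new megabyte setting precedence over the legacy 32-bit byte
 * count.  A value of 0 stands in here for "not configured". */
static uint64_t
effective_cache_max_bytes (uint32_t max_bytes, uint32_t max_mbytes)
{
    if (max_mbytes > 0)                       /* new setting wins if set */
        return (uint64_t) max_mbytes * 1024u * 1024u;
    return max_bytes;                         /* fall back to legacy bytes */
}
```

For example, a megabyte setting of 8192 yields an 8 GiB limit, which
the legacy 32-bit byte count could not express.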
The dir_tree_entry_update_xattrs() routine is defined and used
only in src/dir_tree.c, so mark it static to that file.

Add a (void *) cast to various LOG_err and LOG_debug print
arguments, so that printing them with a "%p" format does not
generate an error when compiling with gcc -pedantic.

Various other tweaks get printf formats to match the value
types on both 32-bit and 64-bit architectures.

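The cast matters because the %p conversion is specified to take a
void * argument; passing any other pointer type is, strictly
speaking, undefined behaviour, which gcc -pedantic flags.  A minimal
illustration (not riofs code):

```c
#include <assert.h>
#include <stdio.h>

/* Format a pointer for a "%p" conversion.  Because the parameter is
 * declared const void *, callers supply the (void *)-compatible
 * argument that %p requires, satisfying gcc -pedantic. */
static int
format_pointer (char *buf, size_t len, const void *ptr)
{
    return snprintf (buf, len, "%p", ptr);
}
```

In the riofs logging calls the same effect is achieved by casting
the argument at the call site, e.g. `(void *) some_struct_ptr`.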
Add --enable-strict-compile configuration option to enable
support for strict compiler warnings.

With a few patches just before this one, riofs now compiles
cleanly with CFLAGS -Wextra -pedantic -Wall (except under some
older gcc versions that do not know about, and so ignore, the
GCC diagnostic push and pop pragmas used to suppress a
variadic-macros warning from include/log.h).

Since the default has been to build riofs without these strict
flags, I left that as the default.  One has to run:

    configure --enable-strict-compile

to create Makefiles that include -Wextra -pedantic -Wall in CFLAGS.

Since the patches that follow make a fairly substantial change
to how riofs recognizes that a cached file has changed on the
Amazon AWS S3 server (thus requiring invalidation of the local
cache), I have bumped the riofs version number.

Some key code in the fileio_read_on_head_cb() routine was
serving a dual purpose:
 1) Set the file's size from the HTTP Content-Length header
 2) Perform one of the consistency checks, this one for size changes

After adding the ETag consistency checking code in the next
patch, I will be able to remove the other, less reliable,
methods of checking, including the check on file size.
However, setting the file's size from the HTTP header is
still needed.

Therefore this patch splits that code into two separate parts,
one of which will remain, and the other of which will be
removed a few patches later.
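The shape of the split can be sketched like this; the type and
function names are hypothetical stand-ins, not the actual riofs
symbols:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical file entry holding the size learned from the server. */
typedef struct {
    uint64_t file_size;
} Entry;

/* Part 1 (kept): record the size reported in the HTTP
 * Content-Length header. */
static void
entry_set_size (Entry *en, uint64_t content_length)
{
    en->file_size = content_length;
}

/* Part 2 (removed a few patches later): the size-based consistency
 * check, superseded by the ETag check. */
static int
entry_size_changed (const Entry *en, uint64_t content_length)
{
    return en->file_size != content_length;
}
```

Separating the two halves lets the later patch delete the check
without disturbing the size recording it was entangled with.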
This is the one very important patch in this series.

The proper way, in my understanding, to check whether the file
we have locally cached is still the same file as on the AWS S3
server is to compare ETags.

Any time we notice that our cached file no longer has exactly
the same contents as the similarly named file on the AWS S3
server, we need to invalidate our cache and download the new
file contents as needed.

AWS S3 guarantees (to the level of confidence of MD5 sums)
that if and only if the content of a stored file changes,
then its ETag changes.  There is no need to compute MD5
sums locally (which aren't the same as ETags on multipart
files anyway) to make this check.  If the ETag hasn't
changed, we're still dealing with the same file contents,
even if we have only cached a few parts of it so far.

Other ways of checking, such as file size, version, or
calculated MD5 sums for single-part files, are considerably
less reliable, subject to both false positives and false
negatives, and not always applicable: for example, when a file
is not versioned on Amazon, has multiple parts, or changes
contents while its size remains the same.

This patch makes the following changes:
 1) Add an etag string to CacheEntry, to track what ETag
    is currently in the cache for that inode.
 2) Add get and update routines, to access that etag string.
 3) Add an aws_etag string to FileReadData, to keep track
    of what was reported in the latest ETag header field
    from the AWS S3 server for that file, so that once the
    corresponding CacheEntry is available as well, we can
    get, update, and compare the two etags, and invalidate
    the cache if necessary.
 4) Add insure_cache_etag_consistent_or_invalidate_cache()
    routine, called from two places, depending on the
    sequencing of events, to handle this ETag checking.
 5) The insure_cache_etag_consistent_or_invalidate_cache()
    routine returns TRUE if it was able to make the check
    and (if necessary), invalidate the cache.  It returns
    FALSE if and only if it encountered an error along the way.
 6) The previously unused evkeyvalq argument to the
    fileio_read_on_get_cb() routine is now used, as
    the ETag header is needed in that routine.
 7) FileReadData *rdata can no longer be free'd using a
    single g_free() call, as it might contain an allocated
    aws_etag.  So add a fileread_destroy() routine to
    handle free'ing it.
 8) Add a cache_etag_is_set flag to FileReadData, to keep
    track of whether we've set the etag string in the
    associated (via the ino number) CacheEntry.  On newly
    cached files, setting up the cache occurs some steps
    after setting up FileReadData, so we need to keep track
    of whether we've set the cache's etag or not.
 9) A few other minor changes in related comments and spacing.
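The core comparison described in the list above can be sketched as
follows.  This is a hypothetical illustration of the check, not the
actual insure_cache_etag_consistent_or_invalidate_cache() routine;
the Cache type and the invalidated flag are assumptions standing in
for riofs's real CacheEntry and invalidation path:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cache entry: the ETag last associated with the cached
 * data, and a flag standing in for real cache invalidation. */
typedef struct {
    char *etag;
    int   invalidated;
} Cache;

/* Compare the cached ETag with the ETag just reported by the AWS S3
 * server; on mismatch (or first sight), invalidate the cached data
 * and remember the new ETag.  Returns 1 (TRUE) if the check was
 * made and any needed invalidation done; 0 (FALSE) only on error. */
static int
cache_etag_consistent_or_invalidate (Cache *c, const char *aws_etag)
{
    if (c == NULL || aws_etag == NULL)
        return 0;                        /* error along the way */
    if (c->etag != NULL && strcmp (c->etag, aws_etag) == 0)
        return 1;                        /* same contents: cache valid */

    /* ETag changed or was never set: drop cached data, store new tag */
    c->invalidated = 1;
    free (c->etag);
    c->etag = malloc (strlen (aws_etag) + 1);
    if (c->etag == NULL)
        return 0;
    strcpy (c->etag, aws_etag);
    return 1;
}
```

Note that a matching ETag validates the cache even when only a few
parts of the file have been cached so far, which is exactly what the
size-based check could not do.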

This patch provides a potentially substantial performance
improvement for a particular use case.  If a new large file is
placed on a riofs-managed server, and multiple clients learn of
it and all go to download it at nearly the same time (so the
downloads overlap), then dreadful cache thrashing could occur:
the cache kept being invalidated whenever another client came
along to start a download and the size of whatever had been
cached so far did not match the total size of what was on the
AWS S3 server.  Repeated partial downloads of the same file
could result in S3 bandwidth charges one or two orders of
magnitude larger than needed for a single download of such a file.
Two of the three existing consistency checking methods used in
fileio_read_on_head_cb() required additional supporting code
elsewhere that had no other use.

So remove each of these methods, along with any now-unneeded
supporting code, one at a time.  This patch removes the first
consistency checking method, which used MD5 sums.

This patch removes the second no longer used consistency
checking method from fileio_read_on_head_cb(), along with what
appears to be a substantial amount of no longer used supporting
code.

This is the easiest of the three old consistency checking
methods to remove: the one based on file size.
@wizzard
Member

wizzard commented Jan 16, 2017

Cool, I'll review your patches in the next few days (will try to do this task asap) and let you know.
Thank you!

Member

@wizzard wizzard left a comment

Everything looks good! Thank you!

@wizzard wizzard merged commit a01ec17 into skoobe:master Jan 22, 2017
@ThePythonicCow
Contributor Author

ThePythonicCow commented Jan 22, 2017 via email

Successfully merging this pull request may close these issues.

add support for fuse options