Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a timestamp in the hash cache #337

Merged

Conversation

adrhumphreys
Copy link
Contributor

@adrhumphreys adrhumphreys commented Sep 24, 2019

Caching between two servers is not working as expected. This address that.

Checks to see if File exists before proceeding
https://github.com/silverstripe/silverstripe-assets/blob/1/src/Storage/DBFile.php#L365
Uses "Flysystem" to check for file existing
https://github.com/silverstripe/silverstripe-assets/blob/master/src/Flysystem/FlysystemAssetStore.php#L1231
Compares "Hashes" of what is in the database, and what is returned by Flysystem (If they do not match, which they dont, it skips)
https://github.com/silverstripe/silverstripe-assets/blob/master/src/Flysystem/FlysystemAssetStore.php#L327
Flysystem checks if Cached value of Hash exists (which it does) and returns the cached value instead of checking directly.
https://github.com/silverstripe/silverstripe-assets/blob/master/src/Storage/Sha1FileHashingService.php#L116

This is the point in which a different Hash value is returned by Cache on the offending server (as the cache has not been invalidated / cleared on the server which did not receive the upload).

Cache is non-memory based and is set to infinite lifecycle (meaning it won't expire unless it is specifically removed).
https://github.com/silverstripe/silverstripe-assets/blob/master/_config/assetscache.yml#L17

By default, all SilverStripe caches are stored in the TempDir (/tmp/silverstripe...). This directory is not shared between servers, so each individual web server contains its own cache values

Parent issue

@adrhumphreys adrhumphreys force-pushed the bugfix/invalid-cache-between-servers branch from 5931c11 to 782fff6 Compare September 24, 2019 04:12
Copy link
Contributor

@maxime-rainville maxime-rainville left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a weakness in this implementation that I didn't think off at the time.

You don't have a way to invalidate old cache entries using timestamp for files that don't exist any more.

I'm thinking the proper way of implementing this could be to store the timestamp along with the hash in the cache. Then, when you get cache value back, you double check that the timestamp from the cache still matches the one from the file ... if the file exists.

src/Storage/Sha1FileHashingService.php Outdated Show resolved Hide resolved
src/Storage/Sha1FileHashingService.php Outdated Show resolved Hide resolved
@chillu
Copy link
Member

chillu commented Sep 26, 2019

You don't have a way to invalidate old cache entries using timestamp for files that don't exist any more.

That depends on how you invalidate those caches on a system level. If "fresh" deployments nuke the cache, you're relying on regular deployments before disk space runs out. Or you have a cron job that nukes a cache folder. In both cases, it won't be limited to older caches only, and this makes it infeasible for large or performance sensitive contexts. Specialised cache adapters such as Memcache or Redis have garbage collection built in, it's mostly an issue with the default filesystem caches (constrained by disk space), and the PHP opcache (somewhat constrained by PHP memory).

There's an open ticket on how to handle cache cleanup in 4.x: silverstripe/silverstripe-framework#6678. Even if we implemented this in framework (by calling prune() via a cronjob or randomly in process), it won't fix the garbage collection issues with non-expiring cache entries.

@maxime-rainville maxime-rainville force-pushed the bugfix/invalid-cache-between-servers branch from e71c25e to 5a62048 Compare September 26, 2019 04:11
$filesystem = $this->getFilesystem($fs);
return $filesystem->has($fileID) ?
$filesystem->getTimestamp($fileID) :
DBDatetime::now()->getTimestamp();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are instances where we'll be generating a hash value for a file that doesn't exists yet, so we need a fallback in case the file doesn't exists.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just had a chat with @NightJar about the virtue of return a fake timestamp. We were originally doing this, because we were making it part of the key and we sometime needed to generate hash keys for files that might not exists.

After some consideration, there's a scenario where this is useful. If you're moving a file around, the FysystemAssetStore will try to move the existing hash value to a different hash key. If you decide to move the hash before the physical file, you won't have a file there yet. By generating, a timestamp equal to "now", FlysystemAssetStore won't have to worry about this intricacy. So we reduce complexity a little bit.

@maxime-rainville
Copy link
Contributor

I tweaked the initial implementation, to store the timestamp on that value, instead of the key. This means we'll only ever have one possible entry for each fileID/Filesystem combination.

Before returning the hash, we compare the timestamp in the cache to the one of the physical file. If they don't match, we assume that the hash has been invalidated and recompute it.

Those timestamp are accurate to the second. There's a possible edge case where you would have an outside process that update the file within the same second as SilverStripe is trying to read it's hash. I'm not sure how we could address that however.

Copy link
Contributor

@NightJar NightJar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good. Unsure how the cache is utilised internally though, so will recommend someone else also reviews before merge.

Also tests are failing ;)

Copy link
Contributor

@maxime-rainville maxime-rainville left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up taking over the work from Adrian ... and I'm delighted with my outstanding work.

@NightJar NightJar changed the title Use a timestamp in the hash key Use a timestamp in the hash cache Sep 29, 2019
@maxime-rainville maxime-rainville force-pushed the bugfix/invalid-cache-between-servers branch from 37f23ff to f475826 Compare September 29, 2019 22:53
@NightJar NightJar merged commit dc8bebb into silverstripe:1.4 Sep 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants