
Cache thrashing when multiple users try to read newly added file at about the same time #136

ThePythonicCow opened this issue Dec 21, 2016 · 1 comment


@ThePythonicCow (Contributor) commented:

I have been working (when I had the time) on the following problem over the last few months. At some point, I expect to propose some patches to address it.

In my use case for riofs, I upload large (often 10 to 1000 megabyte) files directly to my Amazon AWS store, and then announce their availability to a large user community, with links to my riofs download and caching server that fronts that AWS store. Soon after an announcement, multiple users, sometimes dozens at a time, try to download the latest announced file.

If multiple users try to download the same newly announced file at nearly the same time, my riofs log fills with "invalidating local cached file!" lines, due to "Local and remote file sizes do not match" or "Failed to get local MD5 sum".

What was happening was that a second user would ask for the same file that another user had already started to download. The "consistency checking" in src/file_io_ops.c:fileio_read_on_head_cb() would notice that the partially cached file did not (yet) have a size or MD5 sum matching what Amazon AWS reported for that file, and so would invalidate the partially filled cache entry. Since the (more difficult) code to compute the multi-part ETag for large files has not been, and might never be, implemented in riofs, the MD5 sum check has no chance of matching on such large files (Amazon computes a multipart upload's ETag as the MD5 of the concatenated per-part MD5s, suffixed with "-&lt;part count&gt;", not as the MD5 of the whole object). And if a second download request arrives while a file is only partially cached, the size check fails as well.
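
To sketch the failure mode (hypothetical names only, not the actual fileio_read_on_head_cb() code), the check amounts to something like this, and both branches fire spuriously while a first reader is still filling the cache:

```c
/* Hypothetical sketch of the size/MD5 consistency check described
 * above -- not the actual riofs code. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t local_size;     /* bytes cached so far                  */
    char     local_md5[33];  /* hex MD5 of the cached bytes          */
    bool     md5_available;  /* false while a download is in flight  */
} cache_entry_t;

/* Decide whether to invalidate a cache entry, given the size and
 * MD5/ETag that a HEAD request to S3 just returned. */
static bool cache_entry_is_stale(const cache_entry_t *e,
                                 uint64_t remote_size,
                                 const char *remote_md5)
{
    /* While the first reader is still downloading, local_size is
     * less than remote_size, so this fails for a perfectly good
     * in-progress cache entry ...                                 */
    if (e->local_size != remote_size)
        return true;  /* "Local and remote file sizes do not match" */

    /* ... and for multipart uploads the remote ETag is not a plain
     * MD5 at all, so this fails on large files regardless.        */
    if (!e->md5_available || strcmp(e->local_md5, remote_md5) != 0)
        return true;  /* "Failed to get local MD5 sum" / mismatch   */

    return false;
}
```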

In a few days I racked up roughly fifty times more in AWS download fees than it would have cost to download all the big files into the riofs cache once. I have had to stop adding and announcing more big files until I can resolve this severe cache thrashing, as the AWS download fees were exceeding my budget for this project.

I now have a fix for this working (for the first time yesterday), using AWS ETags. I anticipate that this fix will significantly improve the "consistency checking" in src/file_io_ops.c:fileio_read_on_head_cb().

My current plan is to finish the initial development and testing of the patch set, upload it to my clone of riofs at https://github.com/ThePythonicCow/riofs, and then offer it to Paul Jonkins (https://github.com/wizzard) to pull into the main riofs repository, https://github.com/wizzard/riofs. This may well be a significant enough change that he will prefer to let it "bake in the oven" for a while, until he and others find time to consider it carefully.

Initially, I had expected that I would also need to make this change "persistent", preserving the contents of the local riofs cache across riofs restarts, in order to avoid the AWS charges of rebuilding my riofs cache every time I restarted riofs. But I am now of the view that simply making proper use of ETags will avoid the thrashing and the repeated "invalidating local cached file!" messages, and thereby reduce AWS download charges enough for my needs.
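
The general shape of the fix is roughly as follows (again a sketch with hypothetical names; the actual patch is in my clone): record the object's ETag when the cache entry is created, and on later HEAD responses compare ETags instead of sizes or recomputed MD5 sums, so an in-progress download is no longer mistaken for stale data:

```c
/* Sketch only -- hypothetical names, not the actual patch. The idea
 * is to remember the ETag S3 reported when the cache entry was
 * created, and treat the entry as valid for as long as the ETag in
 * the latest HEAD response is unchanged, even if the local copy is
 * still only partially downloaded. */
#include <stdbool.h>
#include <string.h>

typedef struct {
    char etag[128];    /* ETag recorded when the entry was created */
    bool downloading;  /* another reader is still filling this     */
} cache_entry_v2_t;

static bool cache_entry_is_stale_v2(const cache_entry_v2_t *e,
                                    const char *remote_etag)
{
    /* Same ETag => same object version on S3; keep the (possibly
     * still filling) cached copy and let later readers share it.  */
    if (strcmp(e->etag, remote_etag) == 0)
        return false;

    /* Different ETag => the object really changed on S3, so
     * invalidating is now correct rather than thrashing.          */
    return true;
}
```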

@ThePythonicCow (Contributor, Author) commented:

I have now finished this change and uploaded it to my clone of riofs, at https://github.com/ThePythonicCow/riofs.

I will issue a pull request.

However, this is a more ambitious change than most, so as noted above, it would not surprise me if Paul Jonkins delays accepting it until he has time to look at it more closely.

When Jonkins accepts this (or asks for further changes) matters little to me, as I already have this new version, which uses ETags to decide when to invalidate the cache, running in the application that matters to me.
