Caching Example Walkthrough #47

krs85 · 2021-07-08T17:50:20Z

Let's try to write out a simple (if that's even possible) example to demonstrate the cache's workflow with P$.

Example Program:

To start, say file1.txt exists and file2.txt does not exist.
Program opens file1.txt for reading only and reads the contents.
Program creates file2.txt (HOW it makes it is very important. Does it use creat or open? What mode does it open with? Does it use O_TRUNC? O_APPEND? Don't you just love this system call interface? Isn't this just so intuitive? 🧠)
Program writes to file2.txt.
Program exits.
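
To make the example concrete, here is a rough sketch of such a program in Python, using the os module's thin wrappers around the system calls. The function name run_example and the choice of O_CREAT | O_EXCL are assumptions made for illustration; a real program might just as well use creat, O_TRUNC, or O_APPEND, which is exactly why the creation mode matters.

```python
import os


def run_example(src="file1.txt", dst="file2.txt"):
    """Read src through a read-only descriptor, then create dst and write to it."""
    # Step 1: open the input for reading only and read all of its contents.
    fd_in = os.open(src, os.O_RDONLY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd_in, 4096)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        os.close(fd_in)
    data = b"".join(chunks)

    # Step 2: create the output. O_EXCL makes the call fail if dst already
    # exists, so a successful open here is unambiguously a creation.
    # (Assumption: the program creates the file this way; other flag
    # combinations are possible.)
    fd_out = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        # Step 3: write, looping because os.write may write fewer bytes
        # than requested.
        offset = 0
        while offset < len(data):
            offset += os.write(fd_out, data[offset:])
    finally:
        os.close(fd_out)
```

With these flags, a second run against an existing file2.txt fails with FileExistsError rather than silently overwriting it.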

file1.txt is an input, and its contents are read. We hash the file when we see it opened read-only.
I guess the executable is another input. Should it be hashed at the start as well?
Also all the usual suspects: cwd, environment variables, yada yada yada...
file2.txt is an output, as it is created and written to. We hash the file when the program exits. We would then copy the file to our cache.
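
The hashing and copy-to-cache steps could look roughly like the sketch below. Nothing here is decided yet: SHA-256, naming cache entries by their hash, and the helper names hash_file and cache_output are all assumptions for the sake of the example.

```python
import hashlib
import os
import shutil


def hash_file(path):
    """Return the hex SHA-256 digest of a file, read in blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def cache_output(path, cache_dir):
    """Hash an output file at program exit and copy it into the cache.

    Assumption: cache entries are named after the content hash.
    Returns (digest, path of the cached copy).
    """
    digest = hash_file(path)
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, digest)
    shutil.copy2(path, cached)
    return digest, cached
```

The same hash_file helper would serve for inputs (called when the read-only open is seen) and for the executable.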

How do we know we can skip?
The hashes of file1.txt and the executable should match the ones we recorded, and file1.txt should be present in the file system at the same location as before.
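
A minimal sketch of that check, assuming the recorded inputs (file1.txt and the executable) are kept as a mapping from absolute path to hash. The name can_skip, the mapping layout, and SHA-256 are hypothetical choices. cwd and environment variables are left out here.

```python
import hashlib
import os


def hash_file(path):
    """Return the hex SHA-256 digest of a file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def can_skip(recorded_inputs):
    """Return True if every recorded input is still in place and unchanged.

    recorded_inputs maps an absolute path to the hash recorded during
    the cached execution (assumed layout).
    """
    for path, recorded_hash in recorded_inputs.items():
        # The input must still exist at the same location...
        if not os.path.isfile(path):
            return False
        # ...and its current contents must hash to the recorded value.
        if hash_file(path) != recorded_hash:
            return False
    return True
```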

How do we skip?
We skip the execution (#42).
We can then copy our cached file2.txt to its absolute path for the execution. This means we need to record that path, so we know where to copy the output file.

Can we get away with not copying the output file over?
If the hashes of file1.txt and the executable match, and file2.txt also matches its recorded hash and is in the right spot in the file system, we don't have to copy the file over.
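
Putting the last two points together: restoring an output could first check whether a matching file is already in place and only copy when it is not. This is a sketch under assumptions; restore_output, its parameters, and SHA-256 are hypothetical.

```python
import hashlib
import os
import shutil


def hash_file(path):
    """Return the hex SHA-256 digest of a file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def restore_output(cached_path, dest_path, recorded_hash):
    """Put a cached output at dest_path unless a matching file is already there.

    Returns True if a copy was made, False if it was unnecessary.
    """
    if os.path.isfile(dest_path) and hash_file(dest_path) == recorded_hash:
        return False  # the output is already in the right spot and unchanged
    parent = os.path.dirname(dest_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    shutil.copy2(cached_path, dest_path)
    return True
```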

Further thoughts:
What if the program only used one file, file1.txt? It reads the contents, then writes to the same file. I think we can handle this, whether it uses O_APPEND or O_TRUNC. This is a little in the weeds and probably an edge case, but it is important to think about and document nonetheless.

We hash the file when it's opened for reading; this is the input version of the file.
We hash the file at the end of the execution; this is the output version of the file.
When we see this execution again, if our recorded input hash matches the file the new execution is using, we can just replace this file by copying over the output file from the cache.
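
A rough sketch of this single-file case, under a few assumptions: the "program" is modelled as a Python callable, the input hash is taken just before it runs (standing in for "when it's opened for reading"), and the names record_read_write_run and replay are made up for illustration.

```python
import hashlib
import os
import shutil


def hash_file(path):
    """Return the hex SHA-256 digest of a file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def record_read_write_run(path, cache_dir, run):
    """Record one execution that both reads and writes the same file."""
    input_hash = hash_file(path)    # input version of the file
    run(path)                       # the program reads, then writes
    output_hash = hash_file(path)   # output version of the file
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, output_hash)
    shutil.copy2(path, cached)
    return {
        "path": os.path.abspath(path),
        "input_hash": input_hash,
        "output_hash": output_hash,
        "cached": cached,
    }


def replay(entry):
    """Replace the file with its cached output if the input version matches."""
    path = entry["path"]
    if not os.path.isfile(path) or hash_file(path) != entry["input_hash"]:
        return False
    shutil.copy2(entry["cached"], path)
    return True
```

Because only the before and after contents are compared, the sketch behaves the same whether the write used O_APPEND or O_TRUNC.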

Roughly what I need to implement:

Alter data structures to include file name, full path, and hash of the file
Hash input files (access, openat, open, read, pread64, fstat, newfstatat, stat) when we first see the access.
Hash output files (creat, open, openat, write, writev) at the end of the execution.
Serialize the data structure to a file.
Deserialize the data structure.
Lookups in the data structure.
Copy output files to the "cache" at the end of execution.
Copy output files from the "cache".
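
For the data structure items in this list, here is one possible shape, sketched in Python. Everything in it is an assumption: the class names FileRecord and ExecutionRecord, the choice of JSON for serialization, and the lookup key (executable path plus the set of input paths and hashes).

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FileRecord:
    name: str       # file name, e.g. "file1.txt"
    full_path: str  # absolute path seen during the execution
    hash: str       # hash of the file contents


@dataclass
class ExecutionRecord:
    executable: str
    inputs: list = field(default_factory=list)   # list of FileRecord
    outputs: list = field(default_factory=list)  # list of FileRecord


def save(records, path):
    """Serialize a list of ExecutionRecord to a JSON file."""
    with open(path, "w") as f:
        json.dump([asdict(r) for r in records], f)


def load(path):
    """Deserialize a list of ExecutionRecord from a JSON file."""
    with open(path) as f:
        raw = json.load(f)
    return [
        ExecutionRecord(
            executable=r["executable"],
            inputs=[FileRecord(**i) for i in r["inputs"]],
            outputs=[FileRecord(**o) for o in r["outputs"]],
        )
        for r in raw
    ]


def lookup(records, executable, current_inputs):
    """Find a record whose executable and inputs match.

    current_inputs is a set of (full_path, hash) pairs for the new execution.
    Returns the matching ExecutionRecord, or None.
    """
    for rec in records:
        recorded = {(i.full_path, i.hash) for i in rec.inputs}
        if rec.executable == executable and recorded == current_inputs:
            return rec
    return None
```

On a hit, the outputs list of the returned record says which files to copy from the cache and where they go.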

krs85 added documentation Improvements or additions to documentation read and discuss please labels Jul 8, 2021

krs85 mentioned this issue Jul 8, 2021

Put the $ in P$ (Caching Discussion) #44

Closed

krs85 mentioned this issue Oct 15, 2021

Add the functionality required to run at least one clustal job #48

Merged

krs85 linked a pull request Oct 15, 2021 that will close this issue

Add the functionality required to run at least one clustal job #48

Merged

krs85 closed this as completed in #48 Nov 3, 2021
