
Implement Checkpoints #12

Closed
vijay03 opened this issue Jul 20, 2017 · 12 comments

Comments

@vijay03
Member

vijay03 commented Jul 20, 2017

I have a long email thread in my inbox with @ashmrtn about this. Will add the summary from that thread here later.

In short, we want to have some mechanism to know what data/metadata to expect in each crash state. The idea is to allow users to call Checkpoint, which captures the user-visible state (directory tree + data) of the file system somewhere. On a crash, we go back to the latest Checkpoint and see if we have all the data in there.

@ashmrtn
Member

ashmrtn commented Jul 21, 2017

To summarize the email chain, we were thinking that we could use checksums to help us save the user-visible state when Checkpoint is called. For simplicity, we could likely use a hashmap keyed on file paths whose values are the checksums. Each checkpoint could generate a hashmap by walking the directory structure of the file system and checksumming the files found.
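A minimal sketch of this idea, assuming a toy FNV-1a hash as a stand-in for whatever checksum CrashMonkey actually adopts (the function names `checksum_file` and `capture_checkpoint` are illustrative, not CrashMonkey's API):

```cpp
// Sketch: capture user-visible file-system state as a hashmap keyed on
// file paths whose values are content checksums, built by walking the
// directory tree. FNV-1a is a placeholder; any stronger hash would do.
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <unordered_map>

namespace fs = std::filesystem;

// FNV-1a over a byte buffer, continuing from a prior hash value.
inline std::uint64_t fnv1a(const char* data, std::size_t len,
                           std::uint64_t h = 1469598103934665603ULL) {
  for (std::size_t i = 0; i < len; ++i) {
    h ^= static_cast<unsigned char>(data[i]);
    h *= 1099511628211ULL;
  }
  return h;
}

// Checksum the contents of a single file.
std::uint64_t checksum_file(const fs::path& p) {
  std::ifstream in(p, std::ios::binary);
  std::uint64_t h = 1469598103934665603ULL;
  char buf[4096];
  while (in.read(buf, sizeof(buf)) || in.gcount() > 0)
    h = fnv1a(buf, static_cast<std::size_t>(in.gcount()), h);
  return h;
}

// One Checkpoint(): map every regular file under root to its checksum.
std::unordered_map<std::string, std::uint64_t>
capture_checkpoint(const fs::path& root) {
  std::unordered_map<std::string, std::uint64_t> state;
  for (const auto& entry : fs::recursive_directory_iterator(root))
    if (entry.is_regular_file())
      state[entry.path().string()] = checksum_file(entry.path());
  return state;
}
```

Comparing the map captured at Checkpoint() time against one captured from a crash state would then reveal missing or corrupted files.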

@ashmrtn
Member

ashmrtn commented Jul 21, 2017

We may also want to checksum at least some of the data available from calls to stat() (ex. file size and permissions but none of the modified/accessed/created times) so that we can catch user-visible metadata errors as well. @vijay03 may have meant that when he said "(directory tree + data)", but I would like to explicitly put that out there as well.

@ashmrtn
Member

ashmrtn commented Jul 22, 2017

To consolidate what we've said so far and what I've been thinking about this issue:

Overview:

In short, we want to have some mechanism to know what data/metadata to expect in each crash state. The idea is to allow users to call Checkpoint, which captures the user-visible state (directory tree + data) of the file system somewhere. On a crash, we go back to the latest Checkpoint and see if we have all the data in there.

The user-space CrashMonkey test harness needs to be able to receive Checkpoint requests from other processes. Since we are also expanding CrashMonkey to run in the background and have the user kick off their own workload (not one that implements CrashMonkey's BaseTestCase), we cannot assume that the workload will be a child process of CrashMonkey itself. Therefore, the Checkpoint feature must be capable of communicating with processes it does not have a parent-child relationship with. The Checkpoint() call should be available to users regardless of whether they implement BaseTestCase and let CrashMonkey run their workload, or run CrashMonkey in the background and then run their workload themselves.

      Checkpoint()   Workload continues
               |         |
Workload    ---A---------D-----------
                \       /
CrashMonkey -----B-----C-------------
                 |
             walk file system

Collecting Data:
In the CrashMonkey test harness, a call to Checkpoint() should cause CrashMonkey to walk the directory structure on the snapshot for the current workload (this should be /dev/cow_ram_snapshot1_0). During the file system walk, CrashMonkey should checksum the data of each file (ex. read the file and compute checksum) as well as checksum some of the file metadata obtainable by calling stat(). The metadata that is checksummed should not include date/time fields as they are prone to change but generally don't affect program correctness, but should include things like file permissions and file size. These checksums can then be stored in something like a hashmap. A new hashmap containing checksums for the entire file system should be created on each call to Checkpoint().

Implementation Thoughts:
As the cow_brd.c module currently only allows snapshots based off the base disk (/dev/cow_ram0) and the workload runs on a snapshot not the base disk, this can be a synchronous call to start out. This should be achievable by having a stub the user can call which tells CrashMonkey to do a Checkpoint operation and waits for CrashMonkey to reply.

I was planning on using local sockets when implementing #1, thus giving us flexibility down the line if we want to allow RPC calls into CrashMonkey functionality. I believe the implementation for this could also use local sockets as they allow bidirectional communication across processes and can be treated much like files in C code. On a local machine, they may not be as flexible as shared memory regions, but they avoid some of the synchronization/locking issues of shm in addition to allowing easy modification if we decide to allow RPC calls down the road.

@ashmrtn
Member

ashmrtn commented Jul 22, 2017

We also need to be able to associate checkpoints with points in our logged bio sequence so we should timestamp when the checkpoint was done. Logged bios will also need timestamps as that information is not currently recorded.
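Once both sides carry timestamps, the association is a simple search over the bio log. A sketch, assuming the logged bio timestamps are kept in arrival order (the function is hypothetical, not existing CrashMonkey code):

```cpp
// Sketch: given bio log timestamps in arrival order, find the index of
// the last bio recorded at or before a checkpoint's timestamp.
#include <algorithm>
#include <cstdint>
#include <vector>

// Returns the index of the last bio at or before checkpoint_time;
// -1 means the checkpoint predates every logged bio.
long last_bio_before(const std::vector<std::uint64_t>& bio_times,
                     std::uint64_t checkpoint_time) {
  auto it = std::upper_bound(bio_times.begin(), bio_times.end(),
                             checkpoint_time);
  return static_cast<long>(it - bio_times.begin()) - 1;
}
```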

We can assume that the user has just completed a sync operation of some form when Checkpoint() is called.

@vijay03
Member Author

vijay03 commented Jul 22, 2017

Sockets sound reasonable. To associate Checkpoints with the stream of data sent to the device, we could have a file inside the device (let's say called Flag) that is written to every time there is a checkpoint. Using writes to Flag, we can then locate each checkpoint in the data stream.

Another approach would be to have an in-kernel counter that is incremented every time the user calls Checkpoint (via an ioctl, for example). Using the counter, we can associate checkpoints with bios.

@vijay03
Member Author

vijay03 commented Aug 2, 2017

@domingues @ashmrtn progress seems to have stalled on this. Are we blocked on something?

@domingues
Contributor

I have two questions at this point:

  1. Should I ignore the lost+found folder?
  2. On every crash state tested (test_check_random_permutations()), if the user test fails (test_loader.get_instance()->check_test()), should we check whether the data matches the last checkpoint made?

@vijay03
Member Author

vijay03 commented Aug 2, 2017

Let's ignore lost+found for now.

By "user test", do you mean a test that the user runs on top of the mounted file system? If so, this is the default version of that user test. Once the file system mounts, we are basically testing that the data/metadata we expect is in there.

If the file system does not mount at all, we just report an error and return.

@vijay03
Member Author

vijay03 commented Aug 4, 2017

I think implementing checkpoints is a very large task that is unlikely to be merged in a single pull request. @domingues, could you merge in parts of it with pull requests as you code it up?

@ashmrtn ashmrtn assigned ashmrtn and unassigned domingues Sep 1, 2017
@ashmrtn
Member

ashmrtn commented Sep 13, 2017

@vijay03 I think it might be advantageous to split the functionality of the original checkpoint idea into 2 things:

  1. a checkpoint type operation that will be passed to user tests, denoting the most recent checkpoint reached in the generated crash state (only available after sync/fsync)
  2. a watch type operation where the user passes a file path to CrashMonkey, which then monitors that path to make sure no changes occur in it in generated crash states after that point. This will require support from (1) as well. (only available after sync/fsync)
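The watch check itself could be as simple as the following sketch, where `Watch`, `set_watch`, and `check_watch` are hypothetical names and a byte-for-byte snapshot stands in for whatever checksum the checkpoint machinery ends up using:

```cpp
// Sketch: remember a file's bytes when the watch is set, then verify the
// file is unchanged in a later (crash) state.
#include <fstream>
#include <sstream>
#include <string>

struct Watch {
  std::string path;
  std::string snapshot;  // file bytes at watch time
};

static std::string read_all(const std::string& p) {
  std::ifstream in(p, std::ios::binary);
  std::ostringstream out;
  out << in.rdbuf();
  return out.str();
}

Watch set_watch(const std::string& path) {
  return Watch{path, read_all(path)};
}

// True if the watched file is unchanged in the current state.
bool check_watch(const Watch& w) {
  return read_all(w.path) == w.snapshot;
}
```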

@ashmrtn
Member

ashmrtn commented Sep 13, 2017

I'm going to split this issue up into several smaller ones since both checkpoints and watches are somewhat complicated and require support across different parts of CrashMonkey.

@vijay03
Member Author

vijay03 commented Oct 5, 2017

Should we close this issue now @ashmrtn ?

@ashmrtn ashmrtn closed this as completed Oct 5, 2017