
add memory snapshots #1820

Merged: 8 commits merged into unicorn-engine:dev on Jul 11, 2023

Conversation

@PhilippTakacs

Hi,

To implement a fuzzer based on unicorn we would like to have the ability to make snapshots of the memory. This is the current version of my changes. There are some corner cases I still have to check and think about how best to handle.

The basic idea is to use the priority of MemoryRegion and overlap multiple MemoryRegions based on the current snapshot level. On a write (after a snapshot) a new MemoryRegion is created and mapped with the priority of the current snapshot level. To restore a snapshot, the MemoryRegions created after the snapshot (those with higher priority) are unmapped.

This also includes an updated version of split_region that works with aliases. This allows unmapping parts of a mapping or changing its priority without copying.

When snapshots are not used, the implementation only adds a few checks against zero.

@aquynh
Member

aquynh commented Apr 5, 2023 via email

@wtdcode
Member

wtdcode commented Apr 6, 2023

Pretty nice work. For API design, I have two ideas:

1. mimic the `uc_context` API:

```c
uc_err uc_alloc_snapshot(uc_engine* uc, uc_snapshot** snapshot);
uc_err uc_take_snapshot(uc_engine* uc, uc_snapshot* snapshot);
uc_err uc_restore_snapshot(uc_engine* uc, uc_snapshot* snapshot);
uc_err uc_free_snapshot(uc_engine* uc, uc_snapshot* snapshot);
```

2. merge into the `uc_context` API.

If we don't change the current signature of `uc_context`, we could internally first add a new `uc_ctl` to indicate the contents a context will save, say CPU, memory, or CPU+memory. That is to say, `uc_context_save` would save different contents depending on this `uc_ctl`. Note we would record what we have saved (CPU, memory, or both) in the `uc_context` struct so that when we restore or free the context, we know what we should overwrite.
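For the second idea, a minimal sketch of what such a switch might look like. `UC_CTL_CONTEXT_CPU` is the name that later appears in this PR's diff hunks; `UC_CTL_CONTEXT_MEMORY` and the struct field are assumptions here:

```c
/* Hypothetical sketch of a context-content switch (names partly assumed). */
typedef enum uc_context_content {
    UC_CTL_CONTEXT_CPU    = 1,   /* save CPU registers, as today */
    UC_CTL_CONTEXT_MEMORY = 2,   /* additionally snapshot guest memory */
} uc_context_content;

/* uc_context_save() would consult this setting and record in the uc_context
 * what was actually captured, so that restore/free know what to overwrite. */
```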

@PhilippTakacs
Author

PhilippTakacs commented Apr 6, 2023

My approach was to use copy-on-write (CoW) to implement snapshots, so making a snapshot is literally only incrementing a counter. On a write it is then checked whether the priority of the MemoryRegion is smaller than the current snapshot_level; when this is the case, a new MemoryRegion is created and mapped at that address with a higher priority. To restore a snapshot, the MemoryRegions are filtered by priority and the snapshot_level is decreased. So the snapshot_level can only be increased to take a snapshot and decreased to restore the last snapshot.
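A self-contained toy model of this scheme (illustration only, not the actual unicorn/QEMU code):

```c
#include <stdio.h>

/* Each "mapping" carries the snapshot level it was created at as its
 * priority; the entry with the highest priority is the visible one. */
typedef struct { int priority; char data; } Region;

int main(void) {
    Region overlay[8];
    int top = 0, level = 0;

    overlay[top++] = (Region){0, 'A'};        /* initial mapping, level 0 */

    level++;                                  /* take a snapshot: bump the counter */

    if (overlay[top - 1].priority < level)    /* write after snapshot: CoW */
        overlay[top++] = (Region){level, 'B'};
    printf("visible after write:   %c\n", overlay[top - 1].data);  /* B */

    /* restore: drop regions with priority >= level, then decrease the level */
    while (top > 1 && overlay[top - 1].priority >= level)
        top--;
    level--;
    printf("visible after restore: %c\n", overlay[top - 1].data);  /* A */
    return 0;
}
```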

> For API design, I have two ideas:
>
> 1. mimic `uc_context` API

This doesn't work with this implementation. The problem is that dumps of the complete memory would be too big to be useful.

> 2. merge into `uc_context` API.

Some sort of combination with a context save is planned, but it's not that easy, because uc_context and snapshots work differently. You can have multiple uc_context objects, but snapshots can only be increased and decreased.

@wtdcode
Member

wtdcode commented Apr 7, 2023

> My approach was to use copy-on-write (CoW) to implement snapshots, so making a snapshot is literally only incrementing a counter. On a write it is then checked whether the priority of the MemoryRegion is smaller than the current snapshot_level; when this is the case, a new MemoryRegion is created and mapped at that address with a higher priority. To restore a snapshot, the MemoryRegions are filtered by priority and the snapshot_level is decreased. So the snapshot_level can only be increased to take a snapshot and decreased to restore the last snapshot.

I got it. Then:

> > 1. mimic `uc_context` API
>
> This doesn't work with this implementation. The problem is that dumps of the complete memory would be too big to be useful.

I can't quite get this part. Why not just track which memory regions we should restore? I mean, say:

```c
uc_alloc_snapshot(uc, &s1);  // also save this snapshot pointer in the UC instance
uc_take_snapshot(uc, s1);    // point s1 to the current memory regions

uc_mem_write(uc, ...);       // CoW overlays a new memory region

uc_restore_snapshot(uc, s1); // s1 points to all previous memory regions, so we
                             // can remove all regions overlaid since then
```

Note this implies that "snapshots" must be organized as a tree structure (which is also a common design in hypervisors like VMware). That is to say, if we take a second snapshot s2 after s1, then restoring s1 should invalidate s2.

> > 2. merge into `uc_context` API.
>
> Some sort of combination with a context save is planned, but it's not that easy, because uc_context and snapshots work differently. You can have multiple uc_context objects, but snapshots can only be increased and decreased.

@PhilippTakacs
Author

To support this, the internal data structure of the memory would also need to be something that supports CoW (e.g. a Merkle tree). What could be done is to additionally add some serialization and combine it with the snapshot_level so that only changed memory regions are copied. But this is out of scope for this PR.

The memory is internally stored in lists (QTAILQ), therefore I can't do CoW on the AddressSpace/MemoryRegion. But I can use the priority feature to have one MemoryRegion override another without touching the data. This way, however, the "snapshot" is only stored internally and can only be increased and decreased. What I could add is a way to increase or decrease the level by more than one step, but this leads to some corner cases I don't know how to handle, so I haven't implemented it.

> Why not just track which memory regions we should restore

In some way this is already done, but only internally: the snapshot_level is stored as the priority of the MemoryRegion. So a rollback just removes the regions with a priority higher than or equal to the current level and then decreases the level. This comes more or less for free, because the priority is also used to decide which region is active.

What could be added is that context_save also makes a snapshot, and context_restore restores it. But this only works with the same uc_engine object.

A bit more context on what we want to do with this feature: we would like to emulate a program until it reads an input, then make a snapshot (including a context save and saving the internal state of the OS emulation). After that we emulate until the next input, then check whether this has led to a new block; if not, we roll back and test with a different input.

@wtdcode
Member

wtdcode commented Apr 8, 2023

> To support this, the internal data structure of the memory would also need to be something that supports CoW (e.g. a Merkle tree). [...]
>
> The memory is internally stored in lists (QTAILQ), therefore I can't do CoW on the AddressSpace/MemoryRegion. [...]

I got your point, and I forgot that memory regions are organized in lists, which is indeed not ideal for CoW.

> > Why not just track which memory regions we should restore
>
> In some way this is already done, but only internally: the snapshot_level is stored as the priority of the MemoryRegion. [...]
>
> What could be added is that context_save also makes a snapshot, and context_restore restores it. But this only works with the same uc_engine object.

You could assume a context is only restored with the same uc_engine object. Or we could first record all snapshots (contexts) we have saved and check against them when a context is going to be restored. Note this was once implemented as a workaround to support nested uc_emu_start but was soon removed.

Reference: 1044403

> A bit more context on what we want to do with this feature: we would like to emulate a program until it reads an input, then make a snapshot [...]. After that we emulate until the next input, then check whether this has led to a new block; if not, we roll back and test with a different input.

That sounds like what fuzzware once did and what we would like to backport, and I perfectly understand your motivation. Could you try to merge this into uc_context_save and provide a switch as a uc_ctl?

I will probably start to review the implementation details next week; sorry for the possible delay. XD

@PhilippTakacs
Author

> Could you try to merge this into uc_context_save and provide a switch as a uc_ctl?

I have added this. If enabled, uc_context_save makes a snapshot, and uc_context_restore restores to the point directly after the snapshot was taken.
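A minimal usage sketch of the resulting workflow. Only `UC_CTL_CONTEXT_CPU` is visible in the diff hunks quoted later in this thread; `UC_CTL_CONTEXT_MODE`, `UC_CTL_CONTEXT_MEMORY`, and the exact control call are assumptions following unicorn 2's `uc_ctl` conventions:

```c
#include <unicorn/unicorn.h>
#include <stdio.h>

int main(void) {
    uc_engine *uc;
    uc_context *ctx;
    char buf[4] = {0};

    uc_open(UC_ARCH_X86, UC_MODE_64, &uc);
    uc_mem_map(uc, 0x1000, 0x1000, UC_PROT_ALL);

    /* Ask uc_context_save to snapshot memory in addition to CPU state
     * (control and flag names assumed, see lead-in). */
    uc_ctl(uc, UC_CTL_WRITE(UC_CTL_CONTEXT_MODE, 1),
           UC_CTL_CONTEXT_CPU | UC_CTL_CONTEXT_MEMORY);

    uc_mem_write(uc, 0x1000, "old", 4);
    uc_context_alloc(uc, &ctx);
    uc_context_save(uc, ctx);            /* snapshot taken here */

    uc_mem_write(uc, 0x1000, "new", 4);  /* goes into a CoW overlay */
    uc_context_restore(uc, ctx);         /* overlay dropped again */

    uc_mem_read(uc, 0x1000, buf, 4);
    printf("%s\n", buf);                 /* prints "old" */

    uc_context_free(ctx);
    uc_close(uc);
    return 0;
}
```

Presumably, with the switch off (the default), uc_context_save behaves exactly as before.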

> That sounds like what fuzzware once did and what we would like to backport

This sounds interesting. Is there some sort of overview of what they implemented and how?

@wtdcode
Member

wtdcode commented Apr 11, 2023

> > Could you try to merge this into uc_context_save and provide a switch as a uc_ctl?
>
> I have added this. If enabled, uc_context_save makes a snapshot, and uc_context_restore restores to the point directly after the snapshot was taken.

Really nice, thanks!

> > That sounds like what fuzzware once did and what we would like to backport
>
> This sounds interesting. Is there some sort of overview of what they implemented and how?

Check this: https://github.com/fuzzware-fuzzer/fuzzware-emulator/blob/20a3deb9227606f2efd88baa345bc3ad695d34a2/harness/fuzzware_harness/native/uc_snapshot.c

They implemented CoW by manipulating memory protections, as explained in that file.
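For reference, a minimal self-contained sketch of that general technique on Linux (an illustration of the mprotect/fault-handler approach, not fuzzware's actual code): protect the snapshotted pages read-only, record each page on its first write fault, and copy those pages back on restore.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define PAGES 4
#define PAGE  4096
static char *region;
static char backup[PAGES * PAGE];
static int dirty[PAGES];

/* First write to a protected page: mark it dirty, unprotect it, retry.
 * (A real implementation must check the fault lies inside the region.) */
static void on_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    long page = ((char *)si->si_addr - region) / PAGE;
    dirty[page] = 1;
    mprotect(region + page * PAGE, PAGE, PROT_READ | PROT_WRITE);
}

static void take_snapshot(void) {
    memcpy(backup, region, sizeof(backup));
    mprotect(region, PAGES * PAGE, PROT_READ);  /* trap the next write per page */
}

static void restore_snapshot(void) {
    for (int i = 0; i < PAGES; i++)
        if (dirty[i]) {
            memcpy(region + i * PAGE, backup + i * PAGE, PAGE);
            dirty[i] = 0;
        }
    mprotect(region, PAGES * PAGE, PROT_READ);  /* re-arm the write traps */
}

int main(void) {
    struct sigaction sa = { .sa_sigaction = on_fault, .sa_flags = SA_SIGINFO };
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    region = mmap(NULL, PAGES * PAGE, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    region[0] = 'A';
    take_snapshot();
    region[0] = 'B';                   /* faults once, page gets recorded */
    restore_snapshot();
    printf("%c\n", region[0]);         /* back to 'A' */
    return 0;
}
```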

@PhilippTakacs PhilippTakacs marked this pull request as ready for review April 17, 2023 14:52
@PhilippTakacs
Author

I think the code is complete now and works as expected.

I'll write some tests and add more documentation in the next few days.

@PhilippTakacs PhilippTakacs marked this pull request as draft April 24, 2023 11:22
@PhilippTakacs
Author

Found some issues with the alias handling, which need to be resolved before merging this.

The problem is finding the region inside address_space_memory. A simple solution would be to disallow uc_mem_unmap/uc_mem_protect in combination with snapshots.

@PhilippTakacs
Author

Sorry for the force push, but I had to do some rewriting.

The snapshots still work the same way, but uc_mem_unmap and uc_mem_protect can now not be used in combination with snapshots. I don't think this is a big issue, because with the vTLB and the TLB hook, changing protections and unmapping are still possible.

The snapshots should work as expected. I'll write some more tests and documentation.

@PhilippTakacs PhilippTakacs marked this pull request as ready for review April 27, 2023 14:19
@PhilippTakacs
Author

> I will probably start to review the implementation details next week; sorry for the possible delay. XD

Hi, just wanted to ask whether you have had time to look at the code yet, or when you will have time for it?

@wtdcode
Member

wtdcode commented May 15, 2023

> > I will probably start to review the implementation details next week; sorry for the possible delay. XD
>
> Hi, just wanted to ask whether you have had time to look at the code yet, or when you will have time for it?

I have been sick recently and spend most of my time in bed. Sorry, I can't estimate when I will be able to review this, but I have another doctor's appointment coming up soon.

Review thread on qemu/exec.c:
```diff
@@ -1145,6 +1164,7 @@ void qemu_ram_free(struct uc_struct *uc, RAMBlock *block)

     QLIST_REMOVE_RCU(block, next);
     uc->ram_list.mru_block = NULL;
+    uc->ram_list.freed = true;
```
Member

I didn't understand why a new field freed was added.

Author

This is an optimization. With the benchmark we found that RAM write time increases with the number of snapshots. I looked at this with a profiler and found that most time is spent in find_ram_offset(). This function searches for the best fit for a new RAMBlock in the ram_addr_t address space (a special internal address space for RAMBlocks). Because of the data structure, this takes O(n^2).

When no RAMBlock has been freed in this address space, the new RAMBlock can simply be added at the end; ram_list.freed stores this information. Based on this boolean, find_ram_offset_last() is called instead, which skips the expensive search and just adds the RAMBlock at the end of the address space.
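A sketch of what that fast path might look like (list macros and field names assumed, not the actual patch):

```c
/* With no gaps in the ram_addr_t space, the new block can go directly
 * after the highest existing end offset, skipping the O(n^2) gap search. */
static ram_addr_t find_ram_offset_last(struct uc_struct *uc, ram_addr_t size)
{
    RAMBlock *block;
    ram_addr_t end = 0;

    (void)size;  /* no gap to fit into; alignment handled by the caller */
    QLIST_FOREACH(block, &uc->ram_list.blocks, next) {
        if (block->offset + block->max_length > end)
            end = block->offset + block->max_length;
    }
    return end;
}
```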

Member

I read find_ram_offset again and thought about it a bit. Could we get the same effect by giving up the anti-fragmentation search instead?

```c
        /* If it fits remember our place and remember the size
         * of gap, but keep going so that we might find a smaller
         * gap to fill so avoiding fragmentation.
         */
        if (next - candidate >= size && next - candidate < mingap) {
            offset = candidate;
            mingap = next - candidate;
        }
```

In other words:

```c
        /* If it fits remember our place and remember the size
         * of gap, but keep going so that we might find a smaller
         * gap to fill so avoiding fragmentation.
         */
        if (next - candidate >= size && next - candidate < mingap) {
            offset = candidate;
            mingap = next - candidate;
            break; // Stop searching.
        }
```

Could you benchmark whether this solution is fast enough?

Author

As far as I understand, the ram_addr_t address space is an optimization for dirty memory tracking. The allocator avoids fragmentation to reduce the bookkeeping in the memory tracking, so changing the allocator to fragment more might influence execution speed in general, independent of the use of snapshots.

> Could you benchmark whether this solution is fast enough?

Will do, but my benchmark only measures the snapshots, not the overall execution of a complex binary with dynamic maps/unmaps. So I can't tell whether this change influences emulation speed.

Author

Benchmarked it; it is about the same speed as without the optimization.

Looking at the code again, it's clear why: breaking out early would only help if there were some free space to fragment. In my case there is no free space, so the search still has to walk to the block with the highest address. ram_list.blocks is ordered by decreasing block size, so the one-page blocks created by snapshots are added at the end. So for my use case, allowing fragmentation does not help.

Member

By the way, I'm actually curious why QEMU doesn't have this problem (see https://github.com/qemu/qemu/blob/154e3b61ac9cfab9639e6d6207a96fff017040fe/softmmu/physmem.c#L1433), because this optimization seems so straightforward and common. A possible explanation is that most QEMU system emulation cases don't have too many RAM blocks?

Author

I also guess that having many RAM blocks is quite uncommon; dynamically allocating many RAM blocks is not really a thing in QEMU.

Review thread on uc.c (outdated; resolved).
@wtdcode
Member

wtdcode commented May 19, 2023

I carefully checked your code and I love the idea of utilizing existing QEMU mechanisms to support this. However, I have come up with some concerns so far:

  • compatibility with uc_mem_map_ptr, e.g. will we invalidate the pointer?
  • API design, see comments above
  • what happens if the user unmaps regions while snapshot_level > 0?

@PhilippTakacs
Author

PhilippTakacs commented May 22, 2023

> * compatibility with `uc_mem_map_ptr`, e.g. will we invalidate the pointer?

uc_mem_map_ptr will work as before. The change is that on CoW a new RAMBlock is allocated and mapped overlapping, so without snapshots everything behaves the same. Only after a snapshot has been taken will writes through unicorn no longer reach the pointer, and reads from outside might not correspond to the current state.
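To make the caveat concrete, a small illustration (it reuses the snapshot-enabling control sketched earlier, whose exact names are assumptions):

```c
#include <unicorn/unicorn.h>
#include <stdio.h>

int main(void) {
    uc_engine *uc;
    uc_context *ctx;
    static char host[0x1000];            /* host buffer backing guest memory */

    uc_open(UC_ARCH_X86, UC_MODE_64, &uc);
    uc_mem_map_ptr(uc, 0x1000, 0x1000, UC_PROT_ALL, host);

    /* enable memory snapshots for contexts (control names assumed) */
    uc_ctl(uc, UC_CTL_WRITE(UC_CTL_CONTEXT_MODE, 1),
           UC_CTL_CONTEXT_CPU | UC_CTL_CONTEXT_MEMORY);
    uc_context_alloc(uc, &ctx);
    uc_context_save(uc, ctx);            /* snapshot: level goes up */

    uc_mem_write(uc, 0x1000, "X", 1);    /* goes into a CoW RAMBlock, not host */
    printf("host[0] = %d\n", host[0]);   /* still 0: host no longer tracks guest */

    uc_context_restore(uc, ctx);         /* overlay dropped; they agree again */
    uc_context_free(ctx);
    uc_close(uc);
    return 0;
}
```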

> * what if user unmaps the regions with `snapshot_level > 0`?

I have looked for a long time for a solution to this problem and haven't found a good one, so uc_mem_unmap (edit: same for uc_mem_protect) will just return an error if snapshot_level > 0. Unmapping is still possible through the TLB, so I would consider this nice to have but not important.

@PhilippTakacs
Author

Sorry for the force push; I wanted the CI to run and to remove the fixup commits.

@wtdcode
Member

wtdcode commented May 23, 2023

> > * compatibility with `uc_mem_map_ptr`, e.g. will we invalidate the pointer?
>
> uc_mem_map_ptr will work as before. [...] Only after a snapshot has been taken will writes through unicorn no longer reach the pointer, and reads from outside might not correspond to the current state.

That's exactly my concern, because the pointer will point to some outdated content. But I also don't see any obvious solution, so maybe we can only document this clearly.

> > * what if user unmaps the regions with `snapshot_level > 0`?
>
> I have looked for a long time for a solution to this problem and haven't found a good one, so uc_mem_unmap (edit: same for uc_mem_protect) will just return an error if snapshot_level > 0. [...]

What if we just drop all snapshots and unmap the region? Regarding uc_mem_protect, probably more cases have to be handled carefully.

@PhilippTakacs
Author

> That's exactly my concern, because the pointer will point to some outdated content. But I also don't see any obvious solution, so maybe we can only document this clearly.

This is kind of expected, because you can use the same pointer in two places with different writes (e.g. multiple mmaps with MAP_PRIVATE). I'll write some examples and documentation for this.

> What if we just drop all snapshots and unmap the region? Regarding uc_mem_protect, probably more cases have to be handled carefully.

For a full unmap it would be possible to unmap the region and store it inside uc_struct together with the current level; on restore, the region is simply remapped. The problem is partial unmap: with the current CoW implementation it would require iterating over the FlatView and copying the old region, in the worst case per page (except for the unmapped parts). This could be done, but it would be too expensive for us.

I have tried to implement this in a way without copying, so that unmap/protect without snapshots would be faster as well, but I have not found a sane way to do it.

@wtdcode
Member

wtdcode commented May 26, 2023

> > What if we just drop all snapshots and unmap the region? [...]
>
> For a full unmap it would be possible to unmap the region and store it inside uc_struct together with the current level; on restore, the region is simply remapped. The problem is partial unmap: with the current CoW implementation it would require iterating over the FlatView and copying the old region, in the worst case per page (except for the unmapped parts). This could be done, but it would be too expensive for us.

I see, that makes sense. My idea is:

  • If it's a full unmap, we do everything nicely: unmap the region and map it again when restoring the snapshot later on (a sketch of this bookkeeping follows below).
  • If it's a partial unmap, return an error to the user, as you previously proposed. This is fine as long as we document it.
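A rough sketch of the full-unmap bookkeeping this describes. The merged PR does store unmapped regions in the `uc->unmapped_regions` GArray visible in a later diff hunk, but the helper and the remaining field names here are assumed:

```c
/* On uc_mem_unmap while snapshot_level > 0: detach the region but keep it,
 * tagged with the current level, so a restore can remap it later. */
static void snapshot_unmap_region(struct uc_struct *uc, MemoryRegion *mr)
{
    mr->priority = uc->snapshot_level;           /* remember the unmap level */
    memory_region_del_subregion(uc->system_memory, mr);
    g_array_append_val(uc->unmapped_regions, mr);
}
```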

@PhilippTakacs
Author

So unmap is implemented. The hack for uc->mapped_blocks bugs me a bit, but making this clean would require a bigger redesign.

For the types (mr->addr and mr->size) I'm currently looking for a solution. Next I'll add some tests, samples, and documentation.

@PhilippTakacs
Author

Added some more tests and fixed some bugs; unmap now also works with snapshots. I'm not really happy with it, because it needs some dirty hacks. The read of the first subregion I can't fix; the hack with the snapshot_level at unmap could be fixed by using a new struct. All in all it's a lot of fiddling around with the memory regions, and I don't expect this to be used very often.

To fix the merge conflicts I would like to rebase the branch (and force-push). This way I can also remove the fixup commits and update some commit messages.

By the way: can you add me to CREDITS.TXT?

@wtdcode
Member

wtdcode commented Jun 12, 2023

> By the way: can you add me to CREDITS.TXT?

We are fine with this, and please add a trailing newline next time! Thanks for your contributions! ;)

@PhilippTakacs
Author

Hi, just a friendly reminder about this PR. Or did I forget some change requests?

```diff
@@ -242,6 +244,10 @@ static uc_err uc_init(uc_engine *uc)
         uc->reg_reset(uc);
     }

+    uc->context_content = UC_CTL_CONTEXT_CPU;
+
+    uc->unmapped_regions = g_array_new(false, false, sizeof(MemoryRegion*));
```
Member

I can't find the location where this memory is freed. Am I missing something?

Author

No, I missed the free.
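Presumably the fix amounts to releasing the array on engine teardown, along the lines of the following sketch (the exact location in the teardown path is an assumption):

```c
/* In the engine teardown path: free the GArray container as well. */
g_array_free(uc->unmapped_regions, true);
```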

@wtdcode
Member

wtdcode commented Jun 28, 2023

The PR looks pretty good to me and is ready to go! Just a few minor points left to be resolved.

Thanks!

@PhilippTakacs
Author

> The PR looks pretty good to me and is ready to go! Just a few minor points left to be resolved.

Did I miss something, or is everything ready to go now?

@wtdcode
Member

wtdcode commented Jul 11, 2023

> > The PR looks pretty good to me and is ready to go! Just a few minor points left to be resolved.
>
> Did I miss something, or is everything ready to go now?

Really sorry, I thought I had already replied to you. This PR is ready to go; just fixing the conflicting files should be fine.

Commit messages from this PR:

  • Uses copy-on-write to make it possible to restore the memory state after a snapshot was made. To restore, all MemoryRegions created after the snapshot are removed.
  • Simple benchmark for the snapshots.
  • The ram_offset allocator searches for the smallest gap in the ram_offset address space. This is slow, especially in combination with many allocations (i.e. snapshots). When it is known that there is no gap, this is now optimized.
  • Still has todos and needs tests.
@wtdcode wtdcode merged commit e88264c into unicorn-engine:dev Jul 11, 2023
27 checks passed
@wtdcode
Member

wtdcode commented Jul 11, 2023

Great thanks!

@PhilippTakacs
Author

Thanks
