Performance degradation after fixing #95 #100
Why is that causing performance degradation? The underlying object is likely
Agreed. Though a better implementation of virtio-net could avoid the copies by doing
@bonzini We're looking at using
This doesn't work due to TAP API limitations.
I looked at this issue some more and I'd like to propose that it comes down to a limitation of the current vm-memory API. So either we need a
But the atomicity is not the reason for the performance decrease; the current code is faster than a stupid byte-by-byte copy. The reason performance got worse is that the system memcpy does not guarantee atomicity, but even if it did, the new code would likely be slower.
@bonzini My argument is that we don't need atomicity all the time, only for specific operations, which is why there need to be atomic and non-atomic versions of the API.
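To make that split concrete, here is a minimal sketch, not the actual vm-memory API (the helper names and the use of std atomics are illustrative assumptions), of how an atomic path for small device-visible fields could sit next to a plain bulk path where tearing is acceptable:

```rust
use std::sync::atomic::{AtomicU16, Ordering};

/// Atomic path (hypothetical helper): virtio queue indices and similar fields
/// that the guest may read concurrently must never be observed half-written.
fn write_used_idx(idx: &AtomicU16, val: u16) {
    idx.store(val, Ordering::Release);
}

/// Non-atomic path (hypothetical helper): packet payloads and other bulk
/// buffers where no reader expects word-sized atomicity.
fn write_payload(dst: &mut [u8], src: &[u8]) {
    dst[..src.len()].copy_from_slice(src); // compiles down to memcpy
}
```

The point is only the separation of concerns: callers that need atomicity say so explicitly, and everyone else gets the fast bulk copy.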
I agree on this direction. So we have two classes of interfaces:
Looking at the disassembled code, calling try_access() for every guest memory access is really a little heavy. We should build a quick path for those accesses which never cross a region boundary.
No, that would be true if the non-atomic versions provided additional value. Right now the value would be speed, but that should not be the case with a properly optimized
@jiangliu It should not call
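For reference, the fast-path idea from this exchange could look roughly like the following sketch (the Region type and method name are made up for illustration; the real vm-memory regions carry more state):

```rust
/// Illustrative region type; not the real vm-memory GuestMemoryRegion.
struct Region {
    start: u64, // guest-physical start address
    len: u64,   // region length in bytes
}

impl Region {
    /// Returns the offset into this region if [addr, addr + count) lies
    /// entirely inside it; otherwise None, and the caller takes the slow,
    /// cross-region path (the generic try_access-style loop).
    fn fast_offset(&self, addr: u64, count: u64) -> Option<u64> {
        let access_end = addr.checked_add(count)?;
        let region_end = self.start.checked_add(self.len)?;
        if addr >= self.start && access_end <= region_end {
            Some(addr - self.start)
        } else {
            None
        }
    }
}
```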
Hi! What do you think about enforcing (via the
With this in place, we only have to check that a memory access falls within a valid guest physical address range before doing it in one go. Moreover, the valid ranges (potentially spanning multiple adjacent regions) can be precomputed every time the guest memory layout changes, to speed up future validations.
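A rough sketch of that precomputation, under the assumption of a simple sorted-ranges representation (none of these types exist in vm-memory):

```rust
/// Illustrative only: a precomputed, merged view of valid guest-physical
/// ranges, rebuilt whenever the memory layout changes.
struct GpaRange {
    start: u64, // inclusive
    end: u64,   // exclusive
}

/// Merge adjacent/overlapping regions into sorted, disjoint ranges.
fn merge_ranges(mut regions: Vec<GpaRange>) -> Vec<GpaRange> {
    regions.sort_by_key(|r| r.start);
    let mut merged: Vec<GpaRange> = Vec::new();
    for r in regions {
        if let Some(last) = merged.last_mut() {
            if r.start <= last.end {
                // Adjacent or overlapping: extend the previous range.
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}

/// Validate an access with a single range check against the precomputed list.
fn access_is_valid(ranges: &[GpaRange], addr: u64, len: u64) -> bool {
    match addr.checked_add(len) {
        Some(end) => ranges.iter().any(|r| addr >= r.start && end <= r.end),
        None => false,
    }
}
```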
Hi again! Here goes another wall of text :D
Now is a good time to polish and clear up some aspects around the
For example,
It looks like the primitives we're looking for are a set of methods similar to what
In terms of validating assumptions, I wanted to start by asking which use cases we are targeting by allowing certain guest memory ranges not to be backed by memory on the host (i.e.
Currently we are using the vm-memory interfaces in three typical ways:
So it may help to
This is the solution we use with Cloud Hypervisor that mitigates the performance drop: we only use the slower alignment-checked write for copies <= the size of usize. This means the same API can be used for small updates that must be atomic (like those for updating virtio queue offsets) and for large bulk copies where there is no expectation of that behaviour.
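A minimal sketch of that dispatch, assuming a raw destination pointer into guest memory (the real implementation lives in the Cloud Hypervisor and vm-memory patches referenced in this thread and handles more cases):

```rust
use std::convert::TryInto;
use std::mem::size_of;
use std::ptr;

/// # Safety
/// `dst` must be valid for writes of `src.len()` bytes.
unsafe fn copy_to_guest(dst: *mut u8, src: &[u8]) {
    let len = src.len();
    // Small, naturally aligned update (e.g. a virtio used-ring index): a
    // single volatile store so the guest can never observe a torn value.
    if len <= size_of::<usize>() && matches!(len, 2 | 4 | 8) && (dst as usize) % len == 0 {
        match len {
            2 => ptr::write_volatile(dst as *mut u16, u16::from_ne_bytes(src.try_into().unwrap())),
            4 => ptr::write_volatile(dst as *mut u32, u32::from_ne_bytes(src.try_into().unwrap())),
            _ => ptr::write_volatile(dst as *mut u64, u64::from_ne_bytes(src.try_into().unwrap())), // len == 8
        }
        return;
    }
    // Bulk data (packet payloads etc.): no atomicity expectation, so let the
    // optimized system memcpy do the work.
    ptr::copy_nonoverlapping(src.as_ptr(), dst, len);
}
```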
I like @rbradford's solution very much, possibly extended to 16 bytes.
That's a cool implementation! I think we should clear up some things at the interface level as well; I opened #102 and would greatly appreciate it if people could take a look.
Where small objects are those objects that are less than the native data width for the platform. This ensures that volatile and alignment-safe reads/writes are used when updating structures that are sensitive to this, such as virtio devices where the spec requires writes to be atomic.
Fixes: cloud-hypervisor/cloud-hypervisor#1258
Fixes: rust-vmm#100
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When implementing the fix for #95, we introduced a performance degradation on (some?) glibc builds. On Firecracker, with iperf3, we observed a performance degradation of 5% for glibc builds.
On Cloud Hypervisor, the performance degradation is significantly worse. Observed impact is up to 50%. More details here: cloud-hypervisor/cloud-hypervisor#1258
Opening this issue so that we can decide on the next steps for fixing the performance degradation.
My 2 cents: I wouldn't want to introduce this fix only for x86_64 musl builds & aarch64 glibc & musl builds, because we cannot know for sure which glibc versions people are using out there, hence we cannot know for sure that glibc is doing the right thing (i.e. optimizing memcpy at a higher granularity than 1 byte).
I would say that the underlying problem, which is the reason for both the performance degradation and the bug, is that we lose type information about the object that needs to be written/read. I would rather like us to work on improving the interface so that type information doesn't need to be sort-of inferred (by checking alignments & reading/writing in the largest possible chunks). I would need to do some experiments before having a solution here.
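As an illustration of that direction, the existing Bytes trait already separates typed and untyped accesses; the sketch below assumes vm-memory with the backend-mmap feature and uses hypothetical guest addresses, so treat it as an example rather than a proposal:

```rust
use vm_memory::{Bytes, GuestAddress, GuestMemoryError, GuestMemoryMmap};

// Hypothetical addresses and values, purely for illustration.
fn update_device_state(
    mem: &GuestMemoryMmap,
    payload: &[u8],
) -> Result<(), GuestMemoryError> {
    // Typed write: the u16 keeps its type all the way down, so the backend can
    // pick a single, properly aligned 16-bit access.
    mem.write_obj(1u16, GuestAddress(0x1000))?;
    // Untyped bulk write: type information is already gone here, and a plain
    // copy is all that is expected.
    mem.write_slice(payload, GuestAddress(0x2000))?;
    Ok(())
}
```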
Another thing that I believe is of paramount importance at this moment is to add performance testing to vm-memory. Pretty much every other function in vm-memory is on the critical performance path. We should make sure not to introduce regressions here as we continue development.
CC: @sboeuf @rbradford @sameo @alexandruag @serban300 @bonzini