Adding FFI support to Miri, by emarteca #2365

Closed
RalfJung opened this issue Jul 13, 2022 · 12 comments
Labels
A-shims Area: This affects the external function shims C-project Category: a larger project is being tracked here, usually with checkmarks for individual steps

Comments

@RalfJung
Member

RalfJung commented Jul 13, 2022

This issue is specifically tracking the effort by @emarteca and @maurer to add FFI support to Miri. There's a design document here, but it does not go into a ton of the actual implementation details.

@RalfJung RalfJung added C-project Category: a larger project is being tracked here, usually with checkmarks for individual steps A-shims Area: This affects the external function shims labels Jul 13, 2022
@RalfJung
Member Author

Echoing what I said on Zulip:

If this means changes to the core interpreter engine, I am concerned. That engine is carefully structured to prioritize correctness and readability over basically everything else (with some concession to performance, as Miri would probably be >10x slower otherwise).
If this is "just" loads of code in a new subdir in Miri that we can easily disable if it starts causing problems, then that's fine.

@maurer replied

the main changes would be to add another value type (for foreign pointers), and another memory type (a wrapper for Rust objects with a parallel native representation).

@RalfJung
Member Author

add another value type (for foreign pointers)

A new value type would be a Big Deal and is basically not an option. But I also don't think you will need it. We already have different types of memory (data and functions, and we will hopefully have vtables soon). But they are all identified by an AllocId so they can all use the same value type for their pointers. Shouldn't that also work for foreign pointers?
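To illustrate the point about a shared ID, here is a minimal sketch using simplified, hypothetical types (these are not rustc's or Miri's actual definitions, and the `Foreign` variant is only the idea under discussion). Every kind of allocation sits in one table keyed by the same ID, so a single pointer type is enough:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the interpreter's allocation ID.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct AllocId(pub u64);

/// Simplified picture of what an ID can refer to. `Foreign` is the
/// assumption being discussed, not something that exists in Miri.
#[derive(Debug, PartialEq)]
pub enum AllocKind {
    Data(Vec<u8>),
    Function(&'static str),
    Foreign { base: usize },
}

/// One pointer type covers every kind: an ID plus an offset.
/// (Simplified; the real pointer representation differs.)
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Pointer {
    pub alloc_id: AllocId,
    pub offset: u64,
}

pub struct Memory {
    allocs: HashMap<AllocId, AllocKind>,
    next_id: u64,
}

impl Memory {
    pub fn new() -> Self {
        Memory { allocs: HashMap::new(), next_id: 0 }
    }

    /// Registers an allocation of any kind and returns a pointer to its start.
    pub fn insert(&mut self, kind: AllocKind) -> Pointer {
        let id = AllocId(self.next_id);
        self.next_id += 1;
        self.allocs.insert(id, kind);
        Pointer { alloc_id: id, offset: 0 }
    }

    /// Looks up what a pointer refers to; the pointer value itself does not say.
    pub fn kind_of(&self, ptr: Pointer) -> Option<&AllocKind> {
        self.allocs.get(&ptr.alloc_id)
    }
}
```

In this sketch, the kind of memory is discovered only at lookup time via `kind_of`, which is why no new value type is needed for foreign pointers.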

The new memory type seems indeed pretty fundamental. I would be interested to see your design there before you spend a lot of effort implementing it. The key challenges I see are:

  • Doing this with minimal changes to the part of the interpreter that is inside rustc, i.e., as much as possible should be abstracted via the Machine trait so that it can be done locally in Miri without touching the core engine.
  • Relatedly, how do these foreign allocations support get_ptr_alloc? Functions and vtables do not contain any data that the Rust program can actually read, so this is easy for them. But your allocations will obviously actually have data in them, so it should be possible to have an AllocRef to them. And that seems hard to do without invasive changes?

@emarteca

emarteca commented Jul 13, 2022

The new memory type seems indeed pretty fundamental. I would be interested to see your design there before you spend a lot of effort implementing it.

I was thinking we'd extend the MiriMemoryKind enum with a new kind (something like CInternal). This would be the kind of memory used to store pointers to C. Then, when memory of this kind is read, we can dispatch the read to the corresponding C memory.
Essentially, I was thinking this kind would be how we'd distinguish the memory as corresponding to C memory. If this is too invasive of a change, we can likely do this a different way (such as through the AllocExtra).

Doing this with minimal changes to the part of the interpreter that is inside rustc, i.e., as much as possible should be abstracted via the Machine trait so that it can be done locally in Miri without touching the core engine.

Agreed. So far I haven't touched the rustc internals at all.

Relatedly, how do these foreign allocations support get_ptr_alloc?

Good point -- I'm not sure. I've messed around with this a bit and ran into some problems trying to get it to work. I think it'll depend a lot on how we end up representing the pointers internally (which is currently being discussed on zulip)

@emarteca

Some suggestions/discussion on pointer representation from zulip:

@oli-obk says:

I wanted to give direct access to the underlying miri allocation slice to the C code.
Wrt size and alignment, do we ever need to know these for allocations from FFI code? We can just use the most permissive alignment and the size can be to the end of the address space.

...
@maurer

I am still concerned about this "use the same allocator to create pointers" strategy though, since it will mean that Rust Allocations and the C FFI allocation will need to interleave, which makes the "just have a really big allocation encompassing all of virtual memory" strategy not function
(since Rust objects would end up inside it)

@oli-obk

Hmm, good point. We could still maintain a map instead of just a range, but that's expensive
But maybe it's the least expensive strategy?
Everything else has overhead, too, after all. This strategy is simple and cheap until you need to deref a pointer from miri. Then you need to search the map to check if it's a miri alloc and otherwise do the FFI deref
With strict provenance, this is rarely an issue, as Rust code will almost always have the provenance for cheap lookup available
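The map-based lookup described here could look roughly like the following sketch. All names (`AddressMap`, `MiriAlloc`, `Owner`) are hypothetical, and an ordered map keyed by base address is just one plausible choice of data structure:

```rust
use std::collections::BTreeMap;

/// Hypothetical record of one Miri allocation placed at some base address.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct MiriAlloc {
    pub id: u64,
    pub size: usize,
}

/// Result of asking "who owns this address?".
#[derive(Debug, PartialEq)]
pub enum Owner {
    Miri { id: u64, offset: usize },
    Foreign,
}

/// Maps the base address of each Miri allocation to its record.
pub struct AddressMap {
    by_base: BTreeMap<usize, MiriAlloc>,
}

impl AddressMap {
    pub fn new() -> Self {
        AddressMap { by_base: BTreeMap::new() }
    }

    pub fn insert(&mut self, base: usize, alloc: MiriAlloc) {
        self.by_base.insert(base, alloc);
    }

    /// Finds the allocation with the greatest base <= addr and checks whether
    /// addr falls inside it. Anything else is assumed to be foreign memory.
    pub fn classify(&self, addr: usize) -> Owner {
        match self.by_base.range(..=addr).next_back() {
            Some((&base, alloc)) if addr - base < alloc.size => {
                Owner::Miri { id: alloc.id, offset: addr - base }
            }
            _ => Owner::Foreign,
        }
    }
}
```

Each `classify` call costs a logarithmic search, which is the overhead mentioned above; a pointer that already carries provenance can skip the search entirely.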

@maurer

Another possible strategy would be to use mmap on Linux/OSX and VirtualAlloc on Windows and friends to grab a large region of virtual address space for the Miri allocator to select from. As long as we never actually touch it, that address space won't cost anything, but the mapping will prevent the native heap / loaded libraries from intersecting.
e.g. we reserve 4GB at some random location, and if the pointer is in that span, you look it up as a Miri object, and if not, we assume it's a native pointer and use the special AllocId
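A rough sketch of the resulting range check follows. The actual reservation call (mmap or VirtualAlloc) is deliberately left out, and `ReservedRegion`, `Origin`, and `FOREIGN_ALLOC_ID` are hypothetical names chosen for illustration:

```rust
/// Hypothetical description of the address span reserved for Miri's allocator.
/// Reserving the span itself (mmap / VirtualAlloc) is not shown here.
pub struct ReservedRegion {
    pub base: usize,
    pub len: usize,
}

/// Hypothetical placeholder ID shared by all native pointers.
pub const FOREIGN_ALLOC_ID: u64 = u64::MAX;

#[derive(Debug, PartialEq)]
pub enum Origin {
    /// The address lies inside the reserved span; look it up as a Miri object.
    Miri { offset: usize },
    /// The address lies outside the span; assume it is a native pointer.
    Foreign { alloc_id: u64 },
}

impl ReservedRegion {
    pub fn classify(&self, addr: usize) -> Origin {
        // checked_sub yields None for addresses below the base.
        match addr.checked_sub(self.base) {
            Some(offset) if offset < self.len => Origin::Miri { offset },
            _ => Origin::Foreign { alloc_id: FOREIGN_ALLOC_ID },
        }
    }
}
```

Compared with a per-allocation map, this check is a single subtraction and comparison, at the price of having to reserve the span up front.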

@RalfJung
Member Author

RalfJung commented Jul 15, 2022

Looks like the design document moved to https://hackmd.io/eFY7Jyl6QGeGKQlJvBp6pw. (I haven't had the time to look at it yet.) Thanks for putting it on hackmd!

@RalfJung
Member Author

I was thinking we'd extend the MiriMemoryKind enum with a new kind (something like CInternal).

How will that be enough? Memory kinds only serve to make sure the allocator and deallocator match up. They still all have the same representation in the interpreter, the same Allocation type. You will need a completely different representation, won't you? No tracking of pointers and uninit, for example.
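As a small illustration of that role, consider the sketch below. The `MemoryKind` variants and the `deallocate` function are simplified, hypothetical stand-ins rather than Miri's real definitions. The kind is only a tag that is compared at deallocation time, and the `Allocation` representation is the same for every kind:

```rust
/// Simplified, hypothetical set of memory kinds.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MemoryKind {
    Rust,
    C,
    Stack,
}

/// Every kind shares this one representation; the kind is only a tag.
pub struct Allocation {
    pub bytes: Vec<u8>,
    pub kind: MemoryKind,
}

/// Frees an allocation, rejecting a mismatched deallocator.
pub fn deallocate(alloc: Allocation, expected: MemoryKind) -> Result<(), String> {
    if alloc.kind != expected {
        return Err(format!(
            "deallocating {:?} memory using {:?} deallocation operation",
            alloc.kind, expected
        ));
    }
    // Dropping `alloc` releases the bytes.
    Ok(())
}
```

Adding another variant to such an enum would not change how the bytes are stored or accessed, which is the objection raised here.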

I think it'll depend a lot on how we end up representing the pointers internally

What even is the design space here? Pointers are an absolute address plus an Option<Provenance>, I think that is fairly fixed and I would have severe reservations about changing it.

@emarteca

How will that be enough? Memory kinds only serve to make sure the allocator and deallocator match up. They still all have the same representation in the interpreter, the same Allocation type. You will need a completely different representation, won't you? No tracking of pointers and uninit, for example.
What even is the design space here? Pointers are an absolute address plus an Option<Provenance>, I think that is fairly fixed and I would have severe reservations about changing it.

Sorry, I should've updated this issue WRT the discussion on zulip. The design doc on hackMD is up-to-date though -- here is the relevant section.
We're not planning to add any new value types or memory types anymore.

The main high-level idea so far (from discussion with @RalfJung, @maurer, @oli-obk, and a few others on zulip, and taken from the design doc):

Distinguishing Miri memory from external/foreign memory

In order to be able to distinguish between foreign memory and Miri memory (i.e., for a given location in memory, determine if it corresponds to C memory or Miri), we propose to reserve a large section of virtual memory for the Miri allocator. This way, we ensure there is no overlap between the memory used by the Miri allocator and the memory used by external calls. Using this, we can tell if the memory is Miri internal by looking at its AllocId.

If the return of a C call is a Miri pointer (i.e., if it is contained in the memory range reserved by the Miri allocator), then we will need to handle this case: likely we'll need to create the corresponding Allocation object (if it doesn't already exist), and we'll have to track that this Miri memory is shared with C (as discussed in the design doc).
If the return of a C call is a C pointer (i.e., if it is in the foreign range of memory), then maybe we can represent it with a placeholder AllocId that indicates that it is a foreign pointer.

Any advice/ideas on this proposed plan are welcome :)
In particular:

  • How should we go about reserving a section of memory for the Miri allocator?
  • Is it possible to use the same AllocId for all foreign pointers, and what are the pros/cons of this idea?
  • Do the machine hooks we propose to use (memory_read, ptr_get_alloc) seem reasonable (any other hooks we should be using instead / in addition)?
  • Are there any situations in which we will actually need to maintain two versions of some shared memory? (i.e., a Miri version and a C version)
    • And, following this, if so, we need to design a plan for memory synchronization. Advice appreciated!

@RalfJung
Member Author

RalfJung commented Jul 18, 2022

If the return of a C call is a Miri pointer (i.e., if it is contained in the memory range reserved by the Miri allocator)

I don't understand. When you get data back from C it is an untyped raw bag of bytes. How do you even know whether something in there is a pointer?

And when you pass a pointer from Miri to C, how are changes that C is making reflected back on the Miri side? The document says to just pass through the pointer, but that makes no sense to me -- Miri memory carries extra metadata, like which bytes are initialized and pointer provenance. You cannot just give C a pointer to the bytes part of a Miri Allocation and ignore those other parts of it.

If the return of a C call is a C pointer (i.e., if it is in the foreign range of memory), then maybe we can represent it with placeholder AllocID that indicates that it is a foreign pointer.

What happens when a Miri pointer is written there, what do we do with the provenance metadata? What happens when Miri wants to write uninitialized bytes there?

Fundamentally, a 'byte' of memory that we can pass to/from C is just a u8. But a byte of memory in Miri is much closer to the AbstractByte type defined here. (A byte in C is very similar to what it is in Miri, but all those extra details are lost when C is compiled to assembly -- similar to what happens in Rust.) That mismatch in information content has to be accounted for somehow, and I don't see your plan taking this into account.
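The mismatch can be made concrete with a small sketch, loosely modeled on the AbstractByte idea referenced above. The `Provenance` struct and the `to_native` / `from_native` helpers are hypothetical and exist only for illustration:

```rust
/// Hypothetical provenance tag attached to pointer bytes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Provenance(pub u64);

/// A byte as the interpreter sees it: possibly uninitialized, and if
/// initialized, possibly carrying provenance.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum AbstractByte {
    Uninit,
    Init(u8, Option<Provenance>),
}

/// What native code can observe: a plain u8. There is no way to express
/// "uninitialized", so that case yields None in this sketch.
pub fn to_native(byte: AbstractByte) -> Option<u8> {
    match byte {
        AbstractByte::Uninit => None,
        AbstractByte::Init(value, _provenance) => Some(value),
    }
}

/// Reading a byte back from native memory: the value survives, but any
/// provenance it once had is gone.
pub fn from_native(raw: u8) -> AbstractByte {
    AbstractByte::Init(raw, None)
}
```

A round trip through `to_native` and `from_native` keeps the value but drops the provenance, and an uninitialized byte has no native encoding at all. Any FFI design has to decide what to do in both of those cases.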

@emarteca

When you get data back from C it is an untyped raw bag of bytes. How do you even know whether something in there is a pointer?

We'll know if a pointer to C memory is directly returned, since the function returning it will have the return type specified as a pointer. You're right though, that we won't know what's in that C memory: the main problem (as I understand it) is that we won't know if that pointer is to memory that contains another pointer to somewhere else in C memory.
Copy-pasting what I said on zulip: I wonder if we can use the fact that we'll know whether a given pointer is from C. If it is from C, we can dispatch some int2ptr cast explicitly. I think this would only be needed in chained pointer accesses where the root is a C pointer, and the intermediate pointers are not stored in Miri (so Miri won't know their type), right?

So a case like:

let c_ptr: *const *const i32 = some_c_function();
// Miri knows c_ptr is a pointer

let x = unsafe { **c_ptr }; // uh oh, Miri doesn't know *c_ptr is a pointer
                            // so it'll currently auto-cast *c_ptr to a pointer with int2ptr

What about provenance?

We've updated the doc with a proposed plan on how to deal with provenance.
We have two potential ideas, of which we prefer this one. The high-level idea is to have one provenance value (similar to the Wildcard provenance) specifically for pointers to C memory, and for any Miri pointers that are exposed to C. This would only result in a provenance precision loss for data that is exposed to C.
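A minimal sketch of how such a single shared provenance could behave, assuming hypothetical `Machine`, `Pointer`, and `Provenance` types (this is not the design doc's or Miri's actual implementation): passing a pointer to C marks its allocation as exposed, and anything coming back from C carries the wildcard, which is allowed to access only exposed allocations.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Provenance {
    /// Precise provenance: the pointer belongs to exactly this allocation.
    Concrete(u64),
    /// Imprecise provenance shared by everything that went through C.
    Wildcard,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pointer {
    pub addr: usize,
    pub prov: Option<Provenance>,
}

pub struct Machine {
    /// IDs of allocations whose addresses have been handed to C.
    exposed: HashSet<u64>,
}

impl Machine {
    pub fn new() -> Self {
        Machine { exposed: HashSet::new() }
    }

    /// Passing a pointer to C exposes its allocation; C only sees the address.
    pub fn pass_to_c(&mut self, ptr: Pointer) -> usize {
        if let Some(Provenance::Concrete(id)) = ptr.prov {
            self.exposed.insert(id);
        }
        ptr.addr
    }

    /// An address coming back from C gets wildcard provenance.
    pub fn receive_from_c(&self, addr: usize) -> Pointer {
        Pointer { addr, prov: Some(Provenance::Wildcard) }
    }

    /// A wildcard pointer may only be used for allocations that were exposed.
    pub fn may_access(&self, ptr: Pointer, alloc: u64) -> bool {
        match ptr.prov {
            Some(Provenance::Concrete(id)) => id == alloc,
            Some(Provenance::Wildcard) => self.exposed.contains(&alloc),
            None => false,
        }
    }
}
```

Pointers that never cross the FFI boundary keep their `Concrete` provenance in this sketch, so the precision loss is confined to exposed data.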

Thanks for the feedback and for looking at this with us! :)

@RalfJung
Member Author

RalfJung commented Jul 22, 2022

the main problem (as I understand it) is that we won't know if that pointer is to memory that contains another pointer to somewhere else in C memory.

We don't know which part of that memory even is a pointer. There could be pointers to other C memory there, or pointers to Miri memory, and we will have no clue. We cannot use the types to guide us because types are allowed to be wrong in C.

I plan to remove the ptr_from_addr_transmute hook soon, that can help simplify the interpreter a lot. So you shouldn't rely on it for your plans.

We've updated the doc with a proposed plan on how to deal with provenance.

I can't quite figure out what you are even proposing here, sorry. It seems to zoom in on a tiny aspect of the problem when you haven't sketched out the larger architecture of the approach yet.

  • When a Miri pointer is passed to C, how can C make any sense of that memory?
void print(int *x) { printf("%d\n", *x); }
void print2(int **x) { printf("%d\n", **x); }
  • What happens when C writes to that memory?
void set(int *x, int val) { *x = val; }
void set2(int **x, int val) { **x = val; }
  • What about C writing a pointer to that memory?
void setptr(int **x, int *val) { *x = val; }
void setptr2(int ***x, int *val) { **x = val; }

None of these even involve C creating any new allocations. Let's first solve the "simpler" cases. :)

Please start with the first case and describe in detail what exactly happens: given the print argument as a Miri Pointer, what exactly do we do? How does the Miri machine prepare its state to be able to pass a pointer to C that can make all these operations meaningful? Does Miri do any kind of "post-processing" after C returns?

I don't think you will need a "sync list", and Miri already contains what is basically an implementation of PNVI-ae-udi. I don't quite understand which problem that part of the proposal solves, but it seems too complicated on the one hand while also not talking about the real issue I am seeing on the other hand.^^ Mainly, what exactly does the pointer given to C even point to, and how does Miri prepare that memory so that when C does reads and writes, it all makes sense?

similar to the Wildcard provenance

I think wildcard provenance is indeed the key -- it's not just similar to wildcard provenance, it's the same thing! (So indeed, having both FFI support for pointers and strict provenance will just not work.)

@RalfJung RalfJung changed the title Adding FFI support to Miri, by emarteca Adding FFI support to Miri Apr 19, 2024
@RalfJung RalfJung changed the title Adding FFI support to Miri Adding FFI support to Miri, by emarteca Apr 19, 2024
@RalfJung
Member Author

RalfJung commented Apr 19, 2024

This project isn't moving forward any more, so I will close this issue and open a new one instead to track just having more FFI support: #3491.

@RalfJung
Member Author

I should also say thanks a lot to @emarteca and @maurer for the initial work here, this was great to get us started. :) Once the initial support via libffi is in place, it's much easier for me to extend that support -- now it's mostly "just" regular Miri programming, as most of the dark FFI magic is done. I hope. ;)
