
[RFC] Fixing the loophole in unallow & unsubscribe #1905

Closed
gendx opened this issue Jun 3, 2020 · 22 comments
Labels
tock-2.0 Issues and PRs related to Tock version 2.0.

Comments

@gendx
Contributor

gendx commented Jun 3, 2020

Some months ago, an issue was opened on libtock-rs regarding a potential loophole when unsubscribing from a callback or unallowing a slice (tock/libtock-rs#143). This led to some documentation changes in #1831.

The discussion was revived last week, and this loophole is a deeper problem than I previously thought. I'd like to expose what the problem is in the current model, and propose a way to fix it.

I also think this is a major issue that should be addressed before releasing Tock 2.0.

TLDR: The current grant infrastructure gives too much freedom to capsules, and doesn't fit in the threat model of untrusted capsules, because capsules can keep access to callbacks and slices forever.

Observations

Allow and the AppSlice type

When a userspace application uses the allow syscall to allow or disallow a slice, the kernel scheduler creates an AppSlice object via the Process::allow() function and then dispatches it to the relevant capsules with Driver::allow().

tock/kernel/src/sched.rs

Lines 495 to 515 in c814364

Syscall::ALLOW {
    driver_number,
    subdriver_number,
    allow_address,
    allow_size,
} => {
    let res = platform.with_driver(driver_number, |driver| {
        match driver {
            Some(d) => {
                match process.allow(allow_address, allow_size) {
                    Ok(oslice) => d.allow(
                        process.appid(),
                        subdriver_number,
                        oslice,
                    ),
                    Err(err) => err, /* memory not valid */
                }
            }
            None => ReturnCode::ENODEVICE,
        }
    });

tock/kernel/src/driver.rs

Lines 106 to 113 in c814364

fn allow(
    &self,
    app: AppId,
    minor_num: usize,
    slice: Option<AppSlice<Shared, u8>>,
) -> ReturnCode {
    ReturnCode::ENOSUPPORT
}

A typical capsule will take this AppSlice object and transfer it into the grant object corresponding to this (capsule, process) pair.
For example, the RNG driver has an App object (put into the grant), which stores a handle to the allowed buffer in its buffer field.

tock/capsules/src/rng.rs

Lines 33 to 38 in c814364

pub struct App {
    callback: Option<Callback>,
    buffer: Option<AppSlice<Shared, u8>>,
    remaining: usize,
    idx: usize,
}

tock/capsules/src/rng.rs

Lines 135 to 152 in c814364

fn allow(
    &self,
    appid: AppId,
    allow_num: usize,
    slice: Option<AppSlice<Shared, u8>>,
) -> ReturnCode {
    // pass buffer in from application
    match allow_num {
        0 => self
            .apps
            .enter(appid, |app, _| {
                app.buffer = slice;
                ReturnCode::SUCCESS
            })
            .unwrap_or_else(|err| err.into()),
        _ => ReturnCode::ENOSUPPORT,
    }
}

This AppSlice object directly contains a raw pointer and a length. These are private fields, so one might expect that only unsafe operations allow access to the slice. Since capsules are forbidden from using unsafe code (this restriction is the basis of "untrusted capsules" in the threat model), we might think that this isn't problematic.

tock/kernel/src/mem.rs

Lines 64 to 67 in c814364

pub struct AppSlice<L, T> {
    ptr: AppPtr<L, T>,
    len: usize,
}

However, AppSlice implements AsRef<[T]> and AsMut<[T]>, which completely bypass the unsafe-code restriction!

tock/kernel/src/mem.rs

Lines 124 to 128 in c814364

impl<L, T> AsRef<[T]> for AppSlice<L, T> {
    fn as_ref(&self) -> &[T] {
        unsafe { slice::from_raw_parts(self.ptr.ptr.as_ref(), self.len) }
    }
}
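The bypass can be illustrated with a minimal, self-contained sketch (the `RawSlice` type and `sum` function are hypothetical stand-ins, not the actual kernel code): an `AsRef` impl that encapsulates the `unsafe` block internally lets fully safe downstream code read raw memory.

```rust
use std::slice;

// Hypothetical stand-in for AppSlice: a raw pointer + length with private fields.
pub struct RawSlice {
    ptr: *const u8,
    len: usize,
}

impl RawSlice {
    // Constructing it requires unsafe: the caller promises the pointer is valid.
    pub unsafe fn new(ptr: *const u8, len: usize) -> Self {
        RawSlice { ptr, len }
    }
}

// The AsRef impl hides the unsafe block, just like AppSlice's impl in
// kernel/src/mem.rs.
impl AsRef<[u8]> for RawSlice {
    fn as_ref(&self) -> &[u8] {
        unsafe { slice::from_raw_parts(self.ptr, self.len) }
    }
}

// Fully safe code: no `unsafe` keyword needed to reach the raw memory.
pub fn sum(s: &RawSlice) -> u32 {
    s.as_ref().iter().map(|&b| b as u32).sum()
}

fn main() {
    let data = [1u8, 2, 3, 4];
    let raw = unsafe { RawSlice::new(data.as_ptr(), data.len()) };
    println!("sum = {}", sum(&raw)); // a safe caller reads the memory
}
```

In other words, the single `unsafe` keyword check is not enough: once a safe wrapper like this exists, a capsule written entirely in safe Rust can keep and use the handle.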

So where are the loopholes in this setup?

  • Upon a "disallow" request (allow a null pointer), a capsule is not required to remove its reference to the corresponding AppSlice. This means that:
    • The lifetime of any allowed buffer should last forever ('static), because a capsule can potentially access any allowed buffer forever. This is mentioned in Subscriptions and shared memory have a loophole which can lead to UB libtock-rs#143 (comment).
    • More importantly, once a memory slice is allowed from userspace to the kernel, there is no way for userspace to get back exclusive access to this slice! A malicious capsule can keep an AppSlice handle to it forever. In Subscriptions and shared memory have a loophole which can lead to UB libtock-rs#143 (comment), it's proposed to work around this by using a combination of Cell and/or clobbering memory upon context switches and/or using volatile memory accesses. However, none of these work-arounds fix the underlying issue (a data race between userspace and the kernel).
      Indeed, from the userspace's point of view, the kernel runs as a separate thread that can be scheduled after any userspace instruction (due to an interrupt happening at any time). So if we consider the example of a driver sending/receiving packets, and passing them to/from userspace via allowed slices, there will always be a risk that userspace reads half a packet, is then preempted by the capsule which writes a completely new packet, and then userspace resumes to read the second half of a different packet. And vice-versa with userspace writing a packet and the capsule reading it.
      It's also mentioned that the driver cannot rely on the content of an allowed slice to remain unchanged.

      tock/kernel/src/driver.rs

      Lines 102 to 104 in c814364

      /// The buffer is __shared__ between the application and driver, meaning the
      /// driver should not rely on the contents of the buffer to remain
      /// unchanged.
      However, the situation on the driver side is a bit different, because a capsule cannot be preempted. If an interrupt happens, the interrupt handler will just mark it for processing later, but the scheduler will go back to where it was in the driver code. So within the scope of a function inside a capsule, we can consider that an allowed slice is exclusively available to the capsule.
  • A process could allow the same slice to multiple drivers. Then these drivers could concurrently read and write the same memory. This missing check is actually a known bug, documented here:

    tock/kernel/src/process.rs

    Lines 1074 to 1078 in c814364

    // The `unsafe` promise we should be making here is that this
    // buffer is inside of app memory and that it does not create any
    // aliases (i.e. the same buffer has not been `allow`ed twice).
    //
    // TODO: We do not currently satisfy the second promise.

What can we do about it (short version)?

  • The only way to synchronize access to slices is via the allow syscall. allow(ptr) transfers exclusive access from userspace to the kernel. allow(null) must transfer exclusive access back from the kernel to userspace. We can imagine that allow(null) could fail (although I don't have any example of such a capsule), but that should result in an error code so that userspace can e.g. retry later. If allow(null) can silently fail to return exclusive access, then slices allowed to the kernel are effectively thrown from userspace into a black hole, from which information can never be read back.
    The current AppSlice infrastructure fails to offer such guarantees.
  • The kernel should make sure that allowed slices don't overlap between capsules (or within a capsule).

Subscribe and the Callback type

A similar problem happens with callback subscriptions.

What do capsules do?

Most capsules don't actually exploit this loophole. Instead, there is a large amount of code duplication, as pretty much all capsules have the following code.

match allow_num { 
    0 => self 
        .apps 
        .enter(appid, |app, _| { 
            app.buffer = slice; 
            ReturnCode::SUCCESS 
        }) 
        .unwrap_or_else(|err| err.into()), 
    _ => ReturnCode::ENOSUPPORT, 
}

An interesting exception is the UDP driver, for which slice 1 is never disallowed. I assume this is a bug in this driver rather than intentional.

match allow_num {
    0 => app.app_read = slice,
    1 => match slice {
        Some(s) => {
            if s.len() > self.max_tx_pyld_len {
                success = false;
            } else {
                app.app_write = Some(s);
            }
        }
        None => {}
    },
    2 => app.app_cfg = slice,
    3 => app.app_rx_cfg = slice,

Question: Is there any legitimate use case where a capsule denies revocation of an allowed slice or of a subscribed callback?

Another note is that all capsules that I looked at allocate allow/callback numbers sequentially, i.e. allow numbers are 0..N and subscribe numbers are 0..M, where N and M vary by capsule.

About the memory footprint of grants

As I previously mentioned in #1761 (comment), there is a fair amount of redundancy in the storage of AppSlice and Callback. For each (capsule, process, allow/subscribe number) tuple, an AppId is stored, which is completely redundant.

How to fix it? [Currently a rough draft]

In this section, I draft a proposal of how this loophole could be fixed. This makes sure that capsules cannot decide to keep a handle to slices and callbacks. This also optimizes the memory layout of grants, to avoid the pitfalls noted in #1761 (comment).

Memory layout of grants

I first propose to revisit the memory layout of grants, taking into account the fact that each driver uses allow numbers 0..N and subscribe numbers 0..M.

Taking as basis the current memory layout

tock/kernel/src/grant.rs

Lines 180 to 195 in c814364

// 0x0040000 ┌────────────────────
// │ GrantPointer0 [0x003FFC8]
// │ GrantPointer1 [0x003FFC0]
// │ ...
// │ GrantPointerN [0x0000000 (NULL)]
// ├────────────────────
// │ Process Control Block
// 0x003FFE0 ├────────────────────
// │ GrantRegion0
// 0x003FFC8 ├────────────────────
// │ GrantRegion1
// 0x003FFC0 ├────────────────────
// │
// │ --unallocated--
// │
// └────────────────────

I propose to replace the opaque memory layout of each GrantRegion with:

  • a list of N (pointer, usize) pairs (representing the allowed slices),
  • a list of M pointers (representing the callbacks),
  • a driver-specific object.
// 0x0040000  ┌────────────────────
//            │   Grant Pointers
//            ├────────────────────
//            │   Process Control Block
// 0x003FFE0  ├────────────────────
//            │   GrantRegion0:
//            | ├──────────────────
//            │ | AllowPtr0
//            │ | AllowLen0
//            │ | AllowPtr1
//            │ | AllowLen1
//            │ | ...
//            │ | AllowPtrN
//            │ | AllowLenN
//            | ├──────────────────
//            │ | CallbackPtr0
//            │ | CallbackPtr1
//            │ | ...
//            │ | CallbackPtrM
//            | ├──────────────────
//            │ | CapsuleRegion0
//            | ├──────────────────
// 0x003FFC8  ├────────────────────
//            │   GrantRegion1
// 0x003FFC0  ├────────────────────
//            │
//            │   --unallocated--
//            │
//            └────────────────────

There needs to be a way for each capsule to declare the values N and M upon initialization. In the long run, this should be possible to integrate in the Grant and AppliedGrant objects at compile time with const generics (rust-lang/rust#44580). The const generics feature might even be stable enough today if we want to go that route.
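As a rough sketch of what such a const-generics-based grant region could look like (all names here are hypothetical illustrations, not proposed kernel types, and this ignores the fact that the real region lives inside process memory):

```rust
// Sketch of a grant region parameterized by the number of allow slots (N)
// and subscribe slots (M) that the capsule declares at compile time.
struct GrantRegion<T, const N: usize, const M: usize> {
    // (pointer, length) for each allowed slice; (0, 0) means "nothing allowed".
    allows: [(usize, usize); N],
    // Callback function pointer for each subscription; 0 means "none".
    callbacks: [usize; M],
    // The capsule-specific state, e.g. RNG's `remaining`/`idx` fields.
    capsule: T,
}

impl<T: Default, const N: usize, const M: usize> GrantRegion<T, N, M> {
    fn new() -> Self {
        GrantRegion {
            allows: [(0, 0); N],
            callbacks: [0; M],
            capsule: T::default(),
        }
    }
}

// Hypothetical capsule state, mirroring the slimmed-down RNG App below.
#[derive(Default)]
struct RngState {
    remaining: usize,
    idx: usize,
}

fn main() {
    // The RNG driver would declare N = 1, M = 1 at compile time.
    let region: GrantRegion<RngState, 1, 1> = GrantRegion::new();
    println!("allow slots: {}", region.allows.len());
    println!("callback slots: {}", region.callbacks.len());
}
```

With this shape, the kernel knows statically how large each grant region is and where the allow/callback tables sit, without asking the capsule at runtime.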

If we take the example of the RNG driver, then N = 1 (one slice), M = 1 (one callback). The App and RngDriver structures

tock/capsules/src/rng.rs

Lines 32 to 44 in c814364

#[derive(Default)]
pub struct App {
    callback: Option<Callback>,
    buffer: Option<AppSlice<Shared, u8>>,
    remaining: usize,
    idx: usize,
}

pub struct RngDriver<'a> {
    rng: &'a dyn Rng<'a>,
    apps: Grant<App>,
    getting_randomness: Cell<bool>,
}

become the following. Note that the callback and buffer fields are not in the App struct anymore.

#[derive(Default)]
pub struct App {
    remaining: usize,
    idx: usize,
}

pub struct RngDriver<'a> {
    rng: &'a dyn Rng<'a>,
    apps: Grant<App>,
    getting_randomness: Cell<bool>,
}

Kernel scheduler

Upon allow and subscribe syscalls, the kernel scheduler directly modifies the tables containing the allow ptr/len or callback ptr, without asking the driver. The only failure modes worth reporting to userspace as errors are:

  • out-of-bound index for the allow/callback entry (i.e. greater than or equal to N/M respectively),
  • pointer/slice outside of the process memory,
  • allowing a slice overlapping with another allowed slice.
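A sketch of the validation the core kernel could perform before updating the grant table, covering the three failure modes above (all names and types here are hypothetical; a real implementation would also use checked arithmetic against pointer overflow):

```rust
// Hypothetical error type mirroring the three failure modes.
#[derive(Debug, PartialEq)]
enum AllowError {
    BadIndex,   // allow_num >= N
    BadPointer, // slice outside process memory
    Overlap,    // slice overlaps another allowed slice
}

// Bounds of the process's memory region.
struct ProcessMem {
    start: usize,
    end: usize,
}

fn validate_allow(
    n: usize,                 // number of allow slots the capsule declared
    mem: &ProcessMem,
    table: &[(usize, usize)], // currently allowed (ptr, len) pairs
    allow_num: usize,
    ptr: usize,
    len: usize,
) -> Result<(), AllowError> {
    if allow_num >= n {
        return Err(AllowError::BadIndex);
    }
    if ptr < mem.start || ptr + len > mem.end {
        return Err(AllowError::BadPointer);
    }
    for (i, &(p, l)) in table.iter().enumerate() {
        // Skip the slot being replaced and empty slots.
        if i == allow_num || l == 0 {
            continue;
        }
        if ptr < p + l && p < ptr + len {
            return Err(AllowError::Overlap);
        }
    }
    Ok(())
}

fn main() {
    let mem = ProcessMem { start: 0x1000, end: 0x2000 };
    let table = [(0x1100, 0x40), (0, 0)];
    assert_eq!(validate_allow(2, &mem, &table, 1, 0x1200, 0x40), Ok(()));
    assert_eq!(validate_allow(2, &mem, &table, 5, 0x1200, 0x40), Err(AllowError::BadIndex));
    assert_eq!(validate_allow(2, &mem, &table, 1, 0x1120, 0x40), Err(AllowError::Overlap));
    println!("validation checks passed");
}
```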

Capsules could be given some amount of control over what can be allowed. For example, the UDP driver may reject slices that are too large:

if s.len() > self.max_tx_pyld_len {
    success = false;
} else {
    app.app_write = Some(s);
}

In that case, the Driver::allow and Driver::subscribe functions have to change, to avoid giving access to a Callback or AppSlice object. I guess something like the following is reasonable:

trait Driver {
    ...
    // The capsule can:
    // - accept the slice by returning SUCCESS,
    // - reject the slice by returning an error code, forwarded to userspace.
    fn allow(
        &self,
        allow_num: usize,
        slice_size: usize,
    ) -> ReturnCode;
}
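For illustration, here is how the UDP driver's size check could look under this proposed signature. This is a sketch, not a definitive implementation: `ReturnCode` is a minimal stand-in for the kernel's type, and the slot numbering follows the UDP snippet above (slot 1 is the transmit payload).

```rust
// Minimal stand-in for the kernel's ReturnCode.
#[derive(Debug, PartialEq)]
enum ReturnCode {
    Success,
    EInval,
    ENoSupport,
}

struct UdpDriver {
    max_tx_pyld_len: usize,
}

impl UdpDriver {
    // Corresponds to the proposed Driver::allow: the capsule only sees the
    // slice's size, never an AppSlice handle it could stash away.
    fn allow(&self, allow_num: usize, slice_size: usize) -> ReturnCode {
        match allow_num {
            // Slot 1 is the transmit payload: reject oversized slices.
            1 if slice_size > self.max_tx_pyld_len => ReturnCode::EInval,
            0..=3 => ReturnCode::Success,
            _ => ReturnCode::ENoSupport,
        }
    }
}

fn main() {
    let driver = UdpDriver { max_tx_pyld_len: 128 };
    assert_eq!(driver.allow(1, 64), ReturnCode::Success);
    assert_eq!(driver.allow(1, 256), ReturnCode::EInval);
    assert_eq!(driver.allow(9, 64), ReturnCode::ENoSupport);
    println!("ok");
}
```

Note that either way, the kernel (not the capsule) updates the allow table, so a rejecting capsule can delay but never silently capture the slice.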

Capsule access to slices and callbacks

When a capsule wants to access an allowed slice or callback, it queries it from the AppliedGrant object (which has a handle to the necessary AppId). In particular, the enter function would provide handles to new AllowedTable and CallbackTable objects.

tock/kernel/src/grant.rs

Lines 25 to 35 in c814364

impl<T> AppliedGrant<T> {
    pub fn enter<F, R>(self, fun: F) -> R
    where
        F: FnOnce(&mut Owned<T>, &mut Allocator) -> R,
        R: Copy,
    {
        let mut allocator = Allocator { appid: self.appid };
        let mut root = Owned::new(self.grant, self.appid);
        fun(&mut root, &mut allocator)
    }
}

This could become something like:

impl<T> AppliedGrant<T> {
    pub fn enter<F, R>(self, fun: F) -> R
    where
        F: FnOnce(&mut Owned<T>, &mut Allocator, &mut AllowedTable, &CallbackTable) -> R,
        R: Copy,
    {
        let mut allocator = Allocator { appid: self.appid };
        let mut root = Owned::new(self.grant, self.appid);
        let mut allowed_table = /* TBD */;
        let mut callback_table = /* TBD */;
        fun(&mut root, &mut allocator, &mut allowed_table, &callback_table)
    }
}

Details are still TBD, but the tables could essentially provide the following APIs.

impl AllowedTable {
    // Returns a valid slice if i is within bounds and a slice is currently allowed at this index.
    fn get(&self, i: usize) -> Option<&[u8]> { /* TBD */ }
    fn get_mut(&mut self, i: usize) -> Option<&mut [u8]> { /* TBD */ }
}

impl CallbackTable {
    // Returns a valid callback if i is within bounds and a callback is currently subscribed to at this index.
    fn get(&self, i: usize) -> Option<&Callback> { /* TBD */ }
}
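A self-contained sketch of how such an AllowedTable could behave (the struct, lifetimes, and slot representation are illustrative assumptions; the key property is that the capsule can only borrow a slice for the duration of a call, never store a handle):

```rust
// Hypothetical AllowedTable: a view over the kernel-held allow slots.
struct AllowedTable<'s, 'b> {
    slots: &'s mut [Option<&'b mut [u8]>],
}

impl<'s, 'b> AllowedTable<'s, 'b> {
    // Returns a valid slice if `i` is within bounds and a slice is
    // currently allowed at this index.
    fn get(&self, i: usize) -> Option<&[u8]> {
        self.slots.get(i)?.as_deref()
    }

    fn get_mut(&mut self, i: usize) -> Option<&mut [u8]> {
        self.slots.get_mut(i)?.as_deref_mut()
    }
}

fn main() {
    let mut buf = [0u8; 4];
    let mut slots: [Option<&mut [u8]>; 2] = [Some(&mut buf[..]), None];
    let mut table = AllowedTable { slots: &mut slots };

    // A capsule writes into the allowed slice while `enter` runs...
    if let Some(s) = table.get_mut(0) {
        s[0] = 42;
    }
    // ...but slot 1 has nothing allowed, and slot 5 is out of bounds.
    assert!(table.get(1).is_none());
    assert!(table.get(5).is_none());
    println!("first byte: {}", table.get(0).unwrap()[0]);
}
```

Because the borrows end when the closure passed to enter returns, the type system itself prevents a capsule from keeping an AppSlice-like handle past the call.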

Kernel structures

The AppSlice and Callback structures will have to be revisited accordingly and/or won't be needed anymore.

@bradjc
Contributor

bradjc commented Jun 3, 2020

First, I kind of like this proposal. Allowing the core kernel (aka the kernel crate) to actually track app slices and callbacks seems like a more robust design, for a couple reasons:

  • It would also remove the "best effort" conventions we have (like passing in a null pointer clears a callback), since the core kernel could actually enforce it.
  • It would provide structure to capsules, as they would have to follow the rules around grants, appslices, and callbacks, making it harder to write non-virtualized capsules (aka a capsule couldn't use an appslice without also using a grant). I think our experience of writing capsules over the last few years shows that there is a pattern for how these tools are used, and capsules don't need the flexibility we currently give them.
  • It would really solidify our argument that grants are key to making Tock possible.

That being said, I don't think there is anything wrong with what we currently have, it's just a different model. If the kernel is malicious, I don't see how userspace can expect any guarantees, so it's reasonable to expect that capsules do want to do the correct thing. And I also think it's a perfectly reasonable design that once an app shares memory with the kernel the app can't revoke it. I'm not saying it's the best model, necessarily.

@bradjc
Contributor

bradjc commented Jun 3, 2020

As for Tock 2.0, I worry that we won't be able to ship it because of everything that we want to include. For this issue, I think we can reasonably use our current machinery in the 2.0 release, but with a different promised interface to userspace as a temporary solution. The implementation might not exactly match the promised interface, which wouldn't be ideal, but then we could freely implement it post-2.0 without any breaking changes.

@gendx
Contributor Author

gendx commented Jun 4, 2020

That being said, I don't think there is anything wrong with what we currently have, it's just a different model. If the kernel is malicious, I don't see how userspace can expect any guarantees, so it's reasonable to expect that capsules do want to do the correct thing.

Regarding the fact that there is a loophole in what we currently have, this is discussed in tock/libtock-rs#143.

In particular, quoting @jrvanwhy's comment (tock/libtock-rs#143 (comment)).

In Tock's current (pre-2.0) implementation, a shared allow region cannot be revoked. As a result, all buffers passed to allow must have static lifetime.

Unfortunately, this means we cannot share buffers on the stack, and further that all accesses to the buffers must be done in a single instruction (to avoid race conditions with the kernel).

I'm expecting to rework the system call API, so I'll address that as part of the rework. Unfortunately, it won't be nearly as nice to work with as the current setup. This may be worth addressing in Tock 2.0.

In particular, I don't see how one can possibly guarantee things like "all accesses to the buffers must be done in a single instruction" (how can one read or write a USB packet of 64 bytes in a single instruction?).

I also encourage you to re-read the "So where are the loopholes in this setup?" part of my first comment here (#1905 (comment)). If you think that there is nothing wrong about any of these points, could you please be more specific about why?

And I also think it's a perfectly reasonable design that once an app shares memory with the kernel the app can't revoke it. I'm not saying it's the best model, necessarily.

It's totally unreasonable for capsules transmitting/receiving packets such as USB or UDP. When either of the userspace or the kernel reads/writes a packet, it must have exclusive access to the underlying slice for the complete duration of this operation (which necessarily takes multiple CPU instructions for packets of more than 4 bytes on 32-bit CPUs).
If userspace has no way of reclaiming back an allowed slice, there is no way to transmit a packet from the kernel to userspace.

@lschuermann
Member

I'm not entirely sure I understand the reasoning behind capsules not being allowed to decide whether to keep a Slice or Callback upon unallow/unsubscribe.

On the other hand, I can imagine an approach where the Rust ownership guarantees are used. Upon an unallow/unsubscribe, the capsule has to move the Slice or Callback back into the kernel (e.g. allow returning an Option<AppSlice<u8, Shared>>). This makes both allow/subscribe and unallow/unsubscribe fallible. Such an approach still allows the kernel to keep track of exactly what slices or callbacks the capsule currently has access to, and the userspace libraries can simply fail and not return the buffer/closure reference if the capsule refuses to hand back the Slice or Callback, respectively.

This could potentially better encapsulate the idea of a capsule currently working on something that depends on buffers being shared. Imagine a DMA operation: if this were to happen on an allowed buffer, we wouldn't want the app to forcibly revoke access to that buffer while the operation is still ongoing. This isn't an issue yet, but it could bring us closer to actually allowing AppSlices to be passed further down into the kernel for zero-copy DMA to userspace.

In addition to that, we would avoid evaluating the AppSlice's or Callback's validity on a per-access basis, but entirely rely on the Rust ownership guarantees with safe code. The one-time cost of registering and de-registering of the AppSlices or Callbacks in the kernel remains.

The current approach, when a capsule is working on something and a Slice or Callback is removed in the meantime, is to continue the (asynchronous) operation, then notice the resource has vanished and throw away the results - or worse, any report of side effects that this operation has caused.

Permitting a capsule to refuse to give up shared Slices or Callbacks would not violate any of the safety concerns you mentioned, but it introduces the risk of blocking applications that repeatedly try to unallow/unsubscribe. Denial of service is an issue capsules can cause anyway, and as such is not a threat we can reasonably protect against.

@gendx
Contributor Author

gendx commented Jun 4, 2020

Capsules could indeed be allowed to reject the unallow/unsubscribe request, but this should be communicated to userspace via an error code such as EBUSY, so that userspace can retry later.

I still see some problems with @lschuermann's proposal of "transferring back" the slice by returning an Option<AppSlice<u8, Shared>>.

  • Capsules that manipulate multiple slices could (by malice or by mistake) return back the wrong slice. For example, when requested to unallow slice 0, the capsule could return back the previously allowed slice 1. Then userspace will think that it has exclusive access to slice 0 and no access to slice 1, whereas it's the other way around. Making sure the kernel holds the table of slices/callbacks avoids this.
  • The current AppSlice and Callback types incur a memory overhead because each of these objects stores an AppId. This AppId is always redundant in the cases where these objects are used.

So I'd rather suggest to give capsules freedom to accept or reject allow/subscribe requests, by returning either a boolean (surfaced back as EBUSY if the request is rejected) or a return code (with semantics to define), but without giving them ownership of the slice/callback.

@lschuermann
Member

Capsules that manipulate multiple slices could (by malice or by mistake) return back the wrong slice. For example, when requested to unallow slice 0, the capsule could return back the previously allowed slice 1. Then userspace will think that it has exclusive access to slice 0 and no access to slice 1, whereas it's the other way around. Making sure the kernel holds the table of slices/callbacks avoids this.

First and foremost, I'm advocating for a kernel-held table of slices/callbacks, as the kernel would need to ensure that no region is shared twice, etc. I think this issue can be easily circumvented by giving each shared Slice a (counted) ID and storing that in the table as well. In case of a mismatch (which would be a pretty bad error in the capsule), that's an unrecoverable situation.

The current AppSlice and Callback types incur a memory overhead because each of these objects stores an AppId. This AppId is always redundant in the cases where these objects are used.

When the kernel holds a table anyway, an AppSlice could be made as thin as a pointer to the memory, the length, and a (kernel-internal / hidden) pointer to the entry in the kernel-held table. We're talking about 3 usizes here; if we were to include an AppId, that would be another 3 usizes.

I'm advocating for this approach (which from a user perspective should be quite similar to your approach, @gendx) because of:

  • reducing dynamic checks & cost, instead relying on a feature the language gives us for free anyway (accessing the raw slice underneath the AppSlice would probably be optimized out entirely)
  • taking a step in the right direction toward using userspace Slices deep down in the kernel - which is controversial, but something that true Slice ownership (I know, there's still parallel userspace access) could enable

@jrvanwhy
Contributor

jrvanwhy commented Jun 4, 2020

  • More importantly, once a memory slice is allowed from userspace to the kernel, there is no way for the userspace to get back exclusive access to this slice! A malicious capsule can keep an AppSlice handle to it forever. In tock/libtock-rs#143 (comment), it's proposed to work-around that by using a combination of Cell and/or clobbering memory upon context switches and/or using volatile memory accesses. However, none of these work-arounds fix the underlying issue (data race between userspace and the kernel).

This gets into the distinction between a data race (defined in the Rustonomicon) and other race conditions. The exact meaning of volatile hasn't been set in stone yet, but from what I see of the conversation, read_volatile and write_volatile will probably be defined to compile to a sequence of 1 or more read/write operations that each execute in a single instruction. Interrupts are synchronized to occur between instructions of the userspace thread, which should(*) provide the synchronization we need between userspace and the kernel. That allows us to avoid data races in the kernelspace/userspace boundary, even with shared memory.

Volatile reads and writes can still tear across the userspace/kernel boundary, however. That represents a different race condition that doesn't cause undefined behavior if handled appropriately (i.e. only read types where arbitrary bit patterns are acceptable).

(*) If volatile accesses don't have this property, it would be impossible for kernel code to share data between an interrupt handler and the main thread on systems without atomic instructions. That seems like a necessary property for Rust to have given its embedded use cases.
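The tearing point can be made concrete: even if each read_volatile is a single access, copying a multi-byte buffer is still a loop of such accesses, and the copy as a whole can be interrupted between any two of them. A minimal sketch (the `volatile_copy` helper is hypothetical, not libtock code):

```rust
use std::ptr;

// Copy 64 bytes with volatile reads. Each read_volatile is one access,
// but the loop as a whole is many instructions: an interrupt (and hence
// the kernel) can run between any two iterations.
fn volatile_copy(dst: &mut [u8; 64], src: *const u8) {
    for i in 0..64 {
        // Safety: the caller guarantees `src` points to 64 readable bytes.
        dst[i] = unsafe { ptr::read_volatile(src.add(i)) };
    }
}

fn main() {
    let src = [7u8; 64];
    let mut dst = [0u8; 64];
    volatile_copy(&mut dst, src.as_ptr());
    println!("copied, first byte = {}", dst[0]);
}
```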

@gendx
Contributor Author

gendx commented Jun 5, 2020

  • More importantly, once a memory slice is allowed from userspace to the kernel, there is no way for the userspace to get back exclusive access to this slice! A malicious capsule can keep an AppSlice handle to it forever. In tock/libtock-rs#143 (comment), it's proposed to work-around that by using a combination of Cell and/or clobbering memory upon context switches and/or using volatile memory accesses. However, none of these work-arounds fix the underlying issue (data race between userspace and the kernel).

This gets into the distinction between a data race (defined in the Rustonomicon) and other race conditions. The exact meaning of volatile hasn't been set in stone yet, but from what I see of the conversation, read_volatile and write_volatile will probably be defined to compile to a sequence of 1 or more read/write operations that each execute in a single instruction. Interrupts are synchronized to occur between instructions of the userspace thread, which should(*) provide the synchronization we need between userspace and the kernel. That allows us to avoid data races in the kernelspace/userspace boundary, even with shared memory.

I don't see how volatile has anything to do with this. Consider the following example (in userspace with libtock-rs).

// Handler invoked upon a callback from the kernel indicating that a packet was received.
fn handle_receive(allowed_slice: &[u8; 64]) {
    let mut packet = [0; 64];
    packet.copy_from_slice(&allowed_slice);
    // Do something with the received packet.
}

Even with volatile semantics, it's clear that copy_from_slice cannot be implemented with a single CPU instruction (64 bytes are too many). There will be multiple read instructions. So even if the individual reads that compose this copy each have clearly defined semantics, it doesn't change the problem that an interrupt could happen in between (assuming the capsule exploits the loophole).

The following sequence of events can happen, volatile or not:

  1. packet[..32].copy_from_slice(&allowed_slice[..32]);.
  2. An interrupt occurs. The kernel preempts the application.
  3. The capsule, which still has access to the allowed slice, decides to exploit the loophole and write a new packet in the allowed slice.
  4. The kernel schedules the application again.
  5. packet[32..].copy_from_slice(&allowed_slice[32..]);.

Now, the packet buffer contains corrupted contents, a mix of two packets.

I hope that we can all agree that this is a problem.
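This interleaving can be reproduced deterministically in a self-contained sketch (an editorial simulation of the five steps above, not Tock code; packet A is 0xAA bytes, packet B is 0xBB bytes):

```rust
// Simulate the race: the "capsule" overwrites the shared slice between
// the two halves of the userspace copy, producing a torn packet.
fn simulate_torn_read() -> [u8; 64] {
    let mut shared = [0xAAu8; 64]; // allowed slice, holding packet A
    let mut packet = [0u8; 64];

    // 1. Userspace copies the first half of the packet.
    packet[..32].copy_from_slice(&shared[..32]);

    // 2.-3. Interrupt: the capsule, which kept its AppSlice handle,
    //        writes a completely new packet B into the shared slice.
    shared.copy_from_slice(&[0xBBu8; 64]);

    // 4.-5. Userspace resumes and copies the second half.
    packet[32..].copy_from_slice(&shared[32..]);
    packet
}

fn main() {
    let packet = simulate_torn_read();
    // The result mixes bytes of packet A and packet B.
    assert!(packet[..32].iter().all(|&b| b == 0xAA));
    assert!(packet[32..].iter().all(|&b| b == 0xBB));
    println!("userspace observed a torn packet");
}
```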

And that the only way to prevent this from happening is to have a way for the kernel to tell userspace that it guarantees it exclusive access to allowed_slice (i.e. no capsule has any access to the allowed slice). So that even in the event of an interruption, the memory in allowed_slice isn't read nor written by anything in the kernel (capsules included) - in other words that step 3 cannot happen. And the way to communicate between userspace and the kernel who has access to allowed slices is the allow syscall.

Volatile reads and writes can still tear across the userspace/kernel boundary, however. That represents a different race condition that doesn't cause undefined behavior if handled appropriately (i.e. only read types where arbitrary bit patterns are acceptable).

Maybe the problem I outlined is not a "data race" in the Rustonomicon sense, but it's still a race condition that causes memory corruption. In the logical sense, a 64-byte packet is a type of its own. Even if reading a corrupted packet doesn't directly trigger a buffer overflow or a diversion of the control flow (via corruption of the stack, or of pointers in an object's vtable), I disagree with the statement "it's fine, all bit patterns are acceptable in a [u8; 64]".

Because such a statement is equivalent to "the contents of an allowed slice are always arbitrary", and that is contrary to the semantics that we want for a driver.

For example, for a USB driver, we want userspace semantics meaning something like "after allowing a receive slice, sending a receive command, followed by the receive callback, and unallowing the receive slice, the receive slice now contains the next USB packet received by the USB controller". If we replace these semantics by "after allowing a receive slice, sending a receive command, followed by the receive callback, and unallowing the receive slice, the receive slice now contains arbitrary contents", what can we use such a driver for?

As such, it renders the whole system just as unusable as any other form of UB. And it's just as much a potential security vulnerability, depending on what the application does with the now-corrupted packet.

@hudson-ayers
Contributor

hudson-ayers commented Jun 5, 2020

No matter what, an application has to trust that a capsule it uses is written correctly. Even if the core kernel were responsible for managing allowed buffers, the USB driver could write arbitrary contents into the receive slice before the slice is unallowed by the application. So is the current reliance on trusting a capsule to respect an unallow request really any different than how apps must always trust a capsule to correctly populate a packet with the received information? (with the caveat that we are more likely to catch one in code review than the other)

(updated for clarity in response to brads comment)

@lschuermann
Member

lschuermann commented Jun 5, 2020

an application has to trust that a capsule it uses is written correctly

That is correct. However, when the core kernel can guarantee that a capsule no longer has access to a slice of app memory, the app can reasonably assume that the content of that slice never unexpectedly changes, or else the core kernel would have a bug. I side with @gendx here: this would be an improvement over the guarantees the kernel can currently provide for app/capsule interaction. It may not be UB in the ISA / Rust sense, but it significantly reduces the damage that poorly written / malicious capsules can cause. Buffer handling is hard after all 😄; no one wants a capsule holding on to all the AppSlices it's ever seen and at some point writing garbage to userspace.


Changed terminology as per @bradjc's request.

@bradjc
Copy link
Contributor

bradjc commented Jun 5, 2020

So that we can have a coherent discussion, I think we have to use the same terminology. From https://github.com/tock/tock/blob/master/doc/Design.md, everything under the syscall interface is the kernel. That is, a malicious capsule means the kernel is malicious. An easy way to separate out capsules is to refer to the "core kernel".

@bradjc
Copy link
Contributor

bradjc commented Jun 5, 2020

I also encourage you to re-read the "So where are the loopholes in this setup?" part of my first comment here (#1905 (comment)). If you think that there is nothing wrong about any of these points, could you please be more specific about why?

In general, capsules and apps can work together to achieve the desired behavior. Maybe this doesn't map well to a Rust userspace, and maybe userspace wants a less "trusting" interface to the capsules. A shared buffer isn't inherently a problem.

But it's important, when discussing designs, to be open to alternative designs that prioritize different objectives. I think you are arguing that first-class support for a Rust userland is a key objective for the Tock kernel. Someone else with a different goal might argue for a very different design. That doesn't mean one is right or wrong. Once we agree on the objectives, then we can evaluate how well a design achieves those objectives.

In general, I think Tock wants to support a robust libtock-rs. But, it's not the only objective.

@gendx
Copy link
Contributor Author

gendx commented Jun 5, 2020

I understand the trust boundaries as follows.

  • From a functionality point of view, an application has to trust that a capsule implements the functionality it advertises. If a USB capsule writes "lorem ipsum" into the allowed slices instead of received packets, the system doesn't work.
  • However, I think a realistic source of concern with capsules is that they could unintentionally have a bug, and Tock's design is here to limit the risk of such unintentional bugs.

This is, I think, the spirit of https://github.com/tock/tock/blob/master/doc/Design.md#capsules.

In particular, the following sentence doesn't agree with @bradjc's assessment that "a malicious capsule means the kernel is malicious".

[...] Rust’s type and module systems protect the core kernel from buggy or malicious capsules.

Then, the design mentions resources.

Unless a capsule is able to subvert the Rust type system, it can only access resources explicitly granted to it, and only in ways permitted by the interfaces those resources expose.

To me, the issue being discussed here deals exactly with this latter point. The loophole originally described in tock/libtock-rs#143 is exactly that: capsules that hold onto resources (callbacks, application slices), when they shouldn't anymore (due to an unallow/unsubscribe request).

If there is a way to leverage the "Rust type system" (under which I encompass Callback, AppSlice, and potential future types such as CallbackTable proposed in #1905 (comment)), to limit malicious (or more realistically unintentional) use of resources, then we should do it.


I think so we can have a coherent discussion, we have to use the same terminology. From https://github.com/tock/tock/blob/master/doc/Design.md, everything under the syscall interface is the kernel. [...] An easy way to separate out capsules is to refer to the "core kernel".

Well, in practice I interact more with the "kernel" as in everything in the kernel/ folder than with the design documentation. So it would make sense to apply this terminology to the code as well, and git mv kernel core so that the correct terminology is clear for everyone.

@gendx
Copy link
Contributor Author

gendx commented Jun 5, 2020

I also encourage you to re-read the "So where are the loopholes in this setup?" part of my first comment here (#1905 (comment)). If you think that there is nothing wrong about any of these points, could you please be more specific about why?

In general, capsules and apps can work together to achieve the desired behavior. Maybe this doesn't map well to a Rust userpsace, and maybe userspace wants a less "trusting" interface to the capsules. A shared buffer isn't inherently a problem.

A shared buffer is inherently a problem, as shown by the example I've given in #1905 (comment) (unless there's a flaw in this example).

And this has nothing to do with the userspace being written in Rust; the use of Rust was just illustrative. The same problem (copying a 64-byte packet from an allowed slice to a local buffer, being interrupted by the kernel in the middle of it, and a malicious/buggy capsule writing the next packet into the slice) would also exist in a userspace written in C or any other language.

But it's important, when discussing designs, to be open to alternative designs that prioritize different objectives. I think you are arguing that first-class support for a Rust userland is a key objective for the Tock kernel. Someone else with a different goal might argue for a very different design. That doesn't mean one is right or wrong. Once we agree on the objectives, then we can evaluate how well a design achieves those objectives.

In general, I think Tock wants to support a robust libtock-rs. But, it's not the only objective.

I have only given illustrative examples of a Rust userland, but the problem discussed here is by no means Rust-specific. The fact that the problem was first discovered in the libtock-rs repository (tock/libtock-rs#143) has nothing to do with it being a first-class or third-class userspace.


Of course, we can also ignore this potential loophole and just trust capsules to do the right thing (e.g. destroying handles to app slices upon allow(null)). I've proposed that in tock/libtock-rs#143 (comment).

But I also think that fixing this loophole is worth it because:

  • As mentioned in the design, capsules should not be trusted to manage resources.
  • The current setup is redundant (the Driver::allow() implementation is pretty much duplicated code across capsules) and has unnecessary overhead (each AppSlice contains an AppId).
  • This loophole is actually being exploited in practice (maliciously or as a bug) - app_write is never cleared back to None in the UDP capsule:
    match allow_num {
        0 => app.app_read = slice,
        1 => match slice {
            Some(s) => {
                if s.len() > self.max_tx_pyld_len {
                    success = false;
                } else {
                    app.app_write = Some(s);
                }
            }
            None => {}
        },
        2 => app.app_cfg = slice,
        3 => app.app_rx_cfg = slice,
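A minimal fix is to clear the stored handle on the None arm. The following is a standalone model of that behavior (not the actual capsule code; `App`, `allow_buggy`, `allow_fixed`, and `MAX_TX_PYLD_LEN` are illustrative names, with `Vec<u8>` standing in for an AppSlice):

```rust
// Standalone model of the allow-slot loophole: the buggy version stores
// the slice only on Some(_) and silently ignores None, so the previous
// handle is never dropped; the fixed version always drops it on None.

struct App {
    app_write: Option<Vec<u8>>, // stands in for Option<AppSlice<Shared, u8>>
}

const MAX_TX_PYLD_LEN: usize = 8;

// Buggy version: mirrors the snippet above -- None never clears the slot.
fn allow_buggy(app: &mut App, slice: Option<Vec<u8>>) {
    match slice {
        Some(s) => {
            if s.len() <= MAX_TX_PYLD_LEN {
                app.app_write = Some(s);
            }
        }
        None => {} // loophole: the previous slice is retained
    }
}

// Fixed version: an unallow (None) unconditionally drops the stored handle.
fn allow_fixed(app: &mut App, slice: Option<Vec<u8>>) {
    match slice {
        Some(s) => {
            if s.len() <= MAX_TX_PYLD_LEN {
                app.app_write = Some(s);
            }
        }
        None => app.app_write = None,
    }
}

fn main() {
    let mut a = App { app_write: None };
    allow_buggy(&mut a, Some(vec![1, 2, 3]));
    allow_buggy(&mut a, None);
    assert!(a.app_write.is_some()); // bug: the handle survives the unallow

    let mut b = App { app_write: None };
    allow_fixed(&mut b, Some(vec![1, 2, 3]));
    allow_fixed(&mut b, None);
    assert!(b.app_write.is_none()); // fixed: the handle is dropped
}
```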

@phil-levis
Copy link
Contributor

This is great. This is an important problem and I agree we need to solve it for Tock 2.0.

However, I'd like to separate this discussion into 3 steps:

  1. Precisely identifying the problem. I think your text is 90% of the way there.
  2. Precisely defining the properties of a successful solution: what are the criteria for a successful design?
  3. Only after we've done both of above, discussing proposed solutions.

What do you think?

@gendx
Copy link
Contributor Author

gendx commented Jun 5, 2020

This is great. This is an important problem and I agree we need to solve it for Tock 2.0.

However, I'd like to separate this discussion into 3 steps:

  1. Precisely identifying the problem. I think your text is 90% of the way there.
  2. Precisely defining the properties of a successful solution: what are the criteria for a successful design?
  3. Only after we've done both of above, discussing proposed solutions.

What do you think?

I definitely agree. I tried to sketch a solution, to get an understanding of whether this problem could be reasonably solved, but the most important for now is indeed to agree on our understanding of the problem, and on the properties we want to achieve.


Referring back to the threat model (https://github.com/tock/tock/tree/master/doc/threat_model#isolation-provided-to-processes), I think it would be relevant to define the following parts more precisely.

A process' data may not be accessed by other processes or by capsules, unless explicitly permitted by the process.

My understanding of "explicitly permitted by the process" is that allow(capsule, num, ptr, len) grants an explicit permission from the process, and that allow(capsule, num, null, 0) revokes this permission. But there is room for interpretation in the current formulation (e.g. allow(null) could fail with EBUSY, in which case the process has to retry to revoke the permission). So we have to agree on something more precise here, in terms of syscalls, e.g. "A process explicitly permits access to data via the allow syscall with parameters XYZ [...]".

Process data may not be modified by other processes or by capsules, except when allowed by the process.

The same remark applies here: we need to precisely define the starting and ending points of "when allowed".

@jrvanwhy
Copy link
Contributor

jrvanwhy commented Jun 5, 2020

  • More importantly, once a memory slice is allowed from userspace to the kernel, there is no way for the userspace to get back exclusive access to this slice! A malicious capsule can keep an AppSlice handle to it forever. In tock/libtock-rs#143 (comment), it's proposed to work-around that by using a combination of Cell and/or clobbering memory upon context switches and/or using volatile memory accesses. However, none of these work-arounds fix the underlying issue (data race between userspace and the kernel).

This gets into the distinction between a data race (defined in the Rustonomicon) and other race conditions. The exact meaning of volatile hasn't been set in stone yet, but from what I see of the conversation, read_volatile and write_volatile will probably be defined to compile to a sequence of 1 or more read/write operations that each execute in a single instruction. Interrupts are synchronized to occur between instructions of the userspace thread, which should(*) provide the synchronization we need between userspace and the kernel. That allows us to avoid data races in the kernelspace/userspace boundary, even with shared memory.
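As a concrete illustration of the kind of access being described (a sketch, not kernel or libtock-rs code), a per-byte volatile copy out of a shared buffer might look like this. Each read_volatile is a single load, which avoids a data race in the Rust/LLVM sense, but nothing makes the 64-byte copy atomic as a whole:

```rust
use core::ptr;

// Sketch: copy out of a buffer shared with the kernel using per-byte
// volatile reads. Each read is a single load instruction, so no
// individual access is torn -- but the copy as a whole can still be
// interleaved with kernel-side writes between instructions.
fn copy_shared(shared: &[u8; 64]) -> [u8; 64] {
    let mut out = [0u8; 64];
    for i in 0..64 {
        // SAFETY: `shared` is a valid reference, so reads of each of
        // its 64 bytes are in bounds for the lifetime of the borrow.
        out[i] = unsafe { ptr::read_volatile(shared.as_ptr().add(i)) };
    }
    out
}

fn main() {
    let buf = [7u8; 64];
    let copy = copy_shared(&buf);
    assert_eq!(copy, buf);
}
```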

I don't see how volatile has anything to do with this. Consider the following example (in userspace with libtock-rs).

// Handler invoked upon a callback from the kernel that a packet was received.
fn handle_receive(allowed_slice: &[u8; 64]) {
    let mut packet = [0; 64];
    packet.copy_from_slice(&allowed_slice);
    // Do something with the received packet.
}

Even with volatile semantics, it's clear that copy_from_slice cannot be implemented with a single CPU instruction (64 bytes are too many). There will be multiple read instructions. So even if the individual reads that compose this copy each have clearly defined semantics, it doesn't change the problem that an interrupt could happen in between (assuming the capsule exploits the loophole).

The following sequence of events can happen, volatile or not:

  1. packet[..32].copy_from_slice(&allowed_slice[..32]);.
  2. An interrupt occurs. The kernel preempts the application.
  3. The capsule, which still has access to the allowed slice, decides to exploit the loophole and write a new packet in the allowed slice.
  4. The kernel schedules the application again.
  5. packet[32..].copy_from_slice(&allowed_slice[32..]);.

Now, the packet buffer contains corrupted contents, a mix of two packets.
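This interleaving can be simulated directly in plain Rust (a standalone sketch: the "interrupt plus misbehaving capsule" step is modeled as an ordinary write between the two half-copies):

```rust
// Simulation of the torn read: the 64-byte copy is split in two halves,
// and the "capsule" overwrites the allowed slice in between, as it could
// if an interrupt preempted the app mid-copy. The resulting packet mixes
// bytes from two different packets.
fn main() {
    let mut allowed_slice = [0xAAu8; 64]; // first packet
    let mut packet = [0u8; 64];

    // 1. Userspace copies the first half.
    packet[..32].copy_from_slice(&allowed_slice[..32]);

    // 2-4. Interrupt: a capsule still holding the slice writes the next packet.
    allowed_slice = [0xBBu8; 64];

    // 5. The kernel schedules the app again; it copies the second half.
    packet[32..].copy_from_slice(&allowed_slice[32..]);

    // The packet is corrupted: a mix of the two packets.
    assert!(packet[..32].iter().all(|&b| b == 0xAA));
    assert!(packet[32..].iter().all(|&b| b == 0xBB));
}
```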

If the capsule is well-behaved (doesn't touch the buffer after an allow(null) call), then this race condition cannot happen. Corollary: this race condition only happens when a capsule is misbehaving.

If a capsule is malicious, then using volatile operations changes the outcome from "the capsule triggers undefined behavior" to "the app may read corrupt data from the buffer". If the capsule is malicious, then inserting corrupt data into the buffer was always a possibility anyway.

I hope that we can all agree that this is a problem.

I want to distinguish bugs (such as non-compliance with the threat model) from undesirable design decisions (such as requiring all buffers to be static). I agree that both of them are problems, but the level of importance and timelines for solving them are likely different.

@jrvanwhy
Copy link
Contributor

jrvanwhy commented Jun 5, 2020

To clarify: the blast radii of "trusting a capsule to not return corrupt data" and "trusting the capsule to not invoke undefined behavior" are very different.

A malicious capsule that was trusted (incorrectly) to not invoke undefined behavior can potentially compromise any app data or behavior.

A malicious capsule that was trusted to behave correctly can break the parts of the app that rely on that behavior. E.g. a malicious capsule that exposes a logging API cannot compromise an app's cryptographic keys because that app won't share its keys with the logging capsule anyway. It could totally destroy logging and lie about what the app did, of course.

@lschuermann
Copy link
Member

Another possible issue comes up when thinking about app restarts. In the current implementation (as far as I can tell), AppSlices never check whether the app that originally granted this memory is still the app currently running in that RAM region.

As long as that's the case, we not only have the issue that once a process shares a buffer/callback, it must forever trust the capsule not to modify that memory unexpectedly; in addition, a restarted process must assume that all of its memory can be modified by any capsule at any time, because the process does not necessarily know which slices were shared by the previous instance of that same app.

This is clearly undesirable and shows that, even with regard to process restarts, we cannot continue to ignore resources previously shared by the process into the kernel. To avoid this, the kernel must keep track of the associated resources and run a cleanup routine before starting a new instance. This is a step in the right direction.

@gendx
Copy link
Contributor Author

gendx commented Jun 8, 2020

If the capsule is well-behaved (doesn't touch the buffer after an allow(null) call), then this race condition cannot happen. Corollary: this race condition only happens when a capsule is misbehaving.

If a capsule is malicious, then using volatile operations changes the outcome from "the capsule triggers undefined behavior" to "the app may read corrupt data from the buffer". If the capsule is malicious, then inserting corrupt data into the buffer was always a possibility anyway.

Let's keep in mind that there is a whole spectrum between a "perfect capsule" and a "totally malicious capsule". In practice, we likely won't face a "totally malicious capsule", because a basic code review will rule that out. We will rather face more subtle bugs due to a forgotten line of code or complex concurrency logic. See the UDP example, where one of the slices is never returned to the app.

So saying that malicious capsules can do anything, and stopping at that fact, is the wrong approach IMO, because it misses the realistic bugs, such as a packet buffer being corrupted by a subtle bug that leads to concurrent access by the capsule and userspace.

I also don't fully agree that "inserting corrupt data into the buffer was always a possibility anyway". With clearly defined ownership boundaries between allow operations, a malicious capsule cannot corrupt the data while a packet is owned by userspace; it can only do so while the capsule owns it.

In any case, a minimum of code review of the capsules one uses is necessary (as a sanity check to make sure that each capsule provides the functionality it advertises). An obviously malicious capsule won't pass such a review, but more subtle bugs could slip in. And this is where, IMO, the threat model and the safeguards provided by the core kernel can help: they limit the scope of what a capsule can possibly do, and therefore limit the scope of the possible behaviors one has to account for when reviewing the code. It's much easier to reason about resources exclusively owned by userspace/capsules than to manually go through the logic to rule out all possible data races.

I hope that we can all agree that this is a problem.

I want to distinguish bugs (such as non-compliance with the threat model) from undesirable design decisions (such as requiring all buffers to be static). I agree that both of them are problems, but the level of importance and timelines for solving them are likely different.

I'd like to point out that a loophole is not in itself a bug. It can lead to a bug if some capsule exploits this loophole, but until a capsule actually exploits it there is no bug. So even if in principle a capsule could violate the threat model (although as I mentioned in #1905 (comment) the threat model is not precise enough yet to know for sure if there's really a violation) by retaining access to a slice after disallow, the threat model is not violated if none of the capsules violates it. And apart from the UDP example (which I personally don't rely on so far), I haven't seen any violation in practice.

So in terms of timeline, I don't see an urgent need to fix any bug. But I see an opportunity to improve the threat model's wording as well as the design of allowed slices (and callbacks) in the kernel, to reduce the risk of potential future bugs. Whether that should be a priority is up to the core group to decide.

Regarding userspace, unless there is a specific capsule that indeed keeps hold of a slice forever, I don't see a need to require all buffers to be static. That would be far too strong a limitation, neither pragmatic nor necessary (there is currently no capsule exploiting the potential loophole). It's much more pragmatic to instead focus on how to close the loophole so that it never becomes a bug in the future.

@hudson-ayers
Copy link
Contributor

We discussed this in depth on the core call today (see #1929). The conclusion was that there is general consensus that having the core kernel track callbacks and slices, including support for unallow, rather than relying on capsules to "do the right thing", is a positive change, and that a PR that accomplishes this would be welcomed.

We did not discuss exactly what such a design would look like, but there was agreement that a design which makes it impossible for unallow to fail would be vastly preferable to one where unallow can fail, from the perspective of reducing the complexity of userspace error handling. There was also some discussion of whether userspace buffers should be allowed to be passed to hardware; the conclusion was that this should be possible if there is a guarantee that the app will not run until the DMA call succeeds or aborts.
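To illustrate the "unallow cannot fail" property in isolation, one can picture a per-process slot table owned by the core kernel: allow swaps a buffer in, unallow unconditionally swaps it out. The sketch below is purely an illustrative model of that property, not a proposed kernel API; all names (`AllowTable`, `allow`, `unallow`) are hypothetical, and `Vec<u8>` stands in for process memory:

```rust
// Illustrative model only: the core kernel owns the table of allowed
// buffers, so unallow is a plain swap that can never fail -- capsules
// never hold the buffers directly and cannot refuse to return them.
struct AllowTable {
    slots: [Option<Vec<u8>>; 4], // one slot per allow number (model)
}

impl AllowTable {
    fn new() -> Self {
        AllowTable { slots: [None, None, None, None] }
    }

    // allow(num, buf): install a buffer, returning any previous one
    // to the process in the same operation.
    fn allow(&mut self, num: usize, buf: Vec<u8>) -> Option<Vec<u8>> {
        self.slots[num].replace(buf)
    }

    // unallow(num): unconditionally return the buffer to the process.
    // There is no error path, so userspace needs no retry logic.
    fn unallow(&mut self, num: usize) -> Option<Vec<u8>> {
        self.slots[num].take()
    }
}

fn main() {
    let mut table = AllowTable::new();
    assert!(table.allow(1, vec![0u8; 64]).is_none());
    let returned = table.unallow(1);
    assert_eq!(returned, Some(vec![0u8; 64]));
    assert!(table.unallow(1).is_none()); // already empty; still no error
}
```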

@gendx gendx mentioned this issue Jun 15, 2020
@hudson-ayers hudson-ayers added the tock-2.0 Issues and PRs related to Tock version 2.0. label Oct 27, 2020
@phil-levis phil-levis mentioned this issue Feb 15, 2021
@bradjc
Copy link
Contributor

bradjc commented Jul 13, 2021

For now, this issue has been addressed by #2639 (upcalls) and #2632 (process buffers).

A capsule can still deny returning a process buffer to userspace and there is no mechanism to allow userspace to ask the kernel to force the return of the process buffer. But, this may be a sufficient design for our intended use cases. If we find out that the current guarantees are not strong enough we should open a new issue.

@bradjc bradjc closed this as completed Jul 13, 2021