New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add privilege escalation detection support #3471

Open
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
3 participants
@mjg59
Contributor

mjg59 commented Dec 12, 2016

This consists of several components:

  1. Kernel support for sending notifications on certain events.
  2. qemu support for passing those notifications on to a policy agent.
  3. A policy agent that builds a tree of process state and verifies that
    this state is unmodified during in-kernel privilege checks.

There's also a minimal amount of code in rkt to add support for this.

Notifications are handled by hitting an io port and passing arguments in
registers. The kernel then blocks the running process until the notification
is handled. This is achieved by the policy agent clearing the blocking flag
and allowing execution to proceed.

A --monitor flag is added to indicate that this mode should be enabled. If
not passed, qemu will simply instruct the kernel to continue rather than
waiting for monitoring.

@mjg59

This comment has been minimized.

Show comment
Hide comment
@mjg59

mjg59 Dec 12, 2016

Contributor

TODO:

  • Add alternatives support for the hypercall so there's no performance overhead when monitoring is disabled
Contributor

mjg59 commented Dec 12, 2016

TODO:

  • Add alternatives support for the hypercall so there's no performance overhead when monitoring is disabled
@mjg59

This comment has been minimized.

Show comment
Hide comment
@mjg59

mjg59 Dec 13, 2016

Contributor

gif_one

The kernel in this demo has been deliberately backdoored such that opening /proc/interrupts raises the current process to UID and GID 0. This is then detected by the policy agent and the container is shut down automatically.

Contributor

mjg59 commented Dec 13, 2016

gif_one

The kernel in this demo has been deliberately backdoored such that opening /proc/interrupts raises the current process to UID and GID 0. This is then detected by the policy agent and the container is shut down automatically.

@mjg59

This comment has been minimized.

Show comment
Hide comment
@mjg59

mjg59 Dec 13, 2016

Contributor

Anyone wanting to test this out - you'll need to build the qemu-based KVM stage 1 (./configure --with-stage1-flavors=kvm --with-stage1-kvm-hypervisors=qemu), pass an appropriate --stage1 argument to use that and pass the --monitor argument to the run command.

Contributor

mjg59 commented Dec 13, 2016

Anyone wanting to test this out - you'll need to build the qemu-based KVM stage 1 (./configure --with-stage1-flavors=kvm --with-stage1-kvm-hypervisors=qemu), pass an appropriate --stage1 argument to use that and pass the --monitor argument to the run command.

@philips

This comment has been minimized.

Show comment
Hide comment
@philips
Contributor

philips commented Dec 15, 2016

@philips

This comment has been minimized.

Show comment
Hide comment
@philips

philips Dec 16, 2016

Contributor

@mjg59 could you document the ABI with xor on r12 a bit in the relevant patches?

Contributor

philips commented Dec 16, 2016

@mjg59 could you document the ABI with xor on r12 a bit in the relevant patches?

Add privilege escalation detection support
This consists of several components:

1) Kernel support for sending notifications on certain events.
2) qemu support for passing those notifications on to a policy agent.
3) A policy agent that builds a tree of process state and verifies that
   this state is unmodified during in-kernel privilege checks.

There's also a minimal amount of code in rkt to add support for this.

Notifications are handled by hitting an io port and passing arguments in
registers. The kernel then blocks the running process until the notification
is handled. This is achieved by the policy agent clearing the blocking flag
and allowing execution to proceed.

Change in privilege triggered by executing a setuid application is handled
specially. --fixed-r12 is passed to the compiler in order to prevent r12
being used in any normal code. The only write instruction to r12 is in an
inline function that is called at the top of execve(), storing the PID.
When the kernel hits the monitor, the monitor ensures that the PID of the
process requesting the privilege change matches the PID in r12, thus
ensuring that the request came via the full exec() call chain. This means
that all appropriate checks in exec() have already passed, making it
difficult for an attacker to trick the monitor into accepting elevated
privileges.

A --monitor flag is added to indicate that this mode should be enabled. If
not passed, qemu will simply instruct the kernel to continue rather than
waiting for monitoring.

One special-cased event is the kernel notifying qemu when it is marking
kernel text read-only. This triggers an mprotect() call in qemu that marks
these pages as read-only in the host page table. Any attempt by the guest
to mark them read-write (in order to disable the syscall notification,
for instance) will trigger a segfault in the host.
@lucab

This comment has been minimized.

Show comment
Hide comment
@lucab

lucab Jan 20, 2017

Member

Sorry for the delay in getting to this. Concerns here are mostly around exposing, mantaining and distributing these features, in particular:

  • there is a new --monitor flag for run which is very specific to this feature. At the same time it is a very vague keyword and we already have way too many CLI flags. I think this one can be moved to a pod annotation or env flag for stage1.
  • this introduces a dependency on libvmi, which hasn't been released for almost two years now (and last was a -rc). I fearing about codebase size and maintenance, but I don't know that project that well.
  • it also adds a custom patch on top libvmi, which seems to be coming from libvmi/libvmi#295 and bit-rotting there. I'm not sure if it applied cleanly, but it looks like another pain point for updates.
  • it seems to reserve r12 in for its signaling purposes. I honestly don't have any suggestion in that sense, except that I wish there was a better/common way of this without hard-reserving a register and moving all users across the codebase out of it.
  • the new exec_token in task_struct would probably require quite a bit of kernel-design discussion which I don't think I'm qualified for. Also the prototype in syscalls.h smells like shortcut taking.
  • the pmemaccess patch for qemu seems to be coming from the outside, but I didn't manage to track down the source and doesn't seem to be in-tree. Is it somewhere close to be upstreamed?
  • did qemu upstream expressed any comment on the hypercall-monitor command and the associated hypercall_fd? Any interest in having such a feature?

I'm probably missing some more things here. I think this is interesting to see happening as an experiment, but I don't think we should be maintaining it here. Similarly to #3487, this would ideally live a better life in a separate stage1-only repo shipping its own images. I'm not sure how upstreaming all the pieces in other projects could proceed, but from our side we may already start growing generic support for this kind of experimental flags to be passed to stage1.

Member

lucab commented Jan 20, 2017

Sorry for the delay in getting to this. Concerns here are mostly around exposing, mantaining and distributing these features, in particular:

  • there is a new --monitor flag for run which is very specific to this feature. At the same time it is a very vague keyword and we already have way too many CLI flags. I think this one can be moved to a pod annotation or env flag for stage1.
  • this introduces a dependency on libvmi, which hasn't been released for almost two years now (and last was a -rc). I fearing about codebase size and maintenance, but I don't know that project that well.
  • it also adds a custom patch on top libvmi, which seems to be coming from libvmi/libvmi#295 and bit-rotting there. I'm not sure if it applied cleanly, but it looks like another pain point for updates.
  • it seems to reserve r12 in for its signaling purposes. I honestly don't have any suggestion in that sense, except that I wish there was a better/common way of this without hard-reserving a register and moving all users across the codebase out of it.
  • the new exec_token in task_struct would probably require quite a bit of kernel-design discussion which I don't think I'm qualified for. Also the prototype in syscalls.h smells like shortcut taking.
  • the pmemaccess patch for qemu seems to be coming from the outside, but I didn't manage to track down the source and doesn't seem to be in-tree. Is it somewhere close to be upstreamed?
  • did qemu upstream expressed any comment on the hypercall-monitor command and the associated hypercall_fd? Any interest in having such a feature?

I'm probably missing some more things here. I think this is interesting to see happening as an experiment, but I don't think we should be maintaining it here. Similarly to #3487, this would ideally live a better life in a separate stage1-only repo shipping its own images. I'm not sure how upstreaming all the pieces in other projects could proceed, but from our side we may already start growing generic support for this kind of experimental flags to be passed to stage1.

@mjg59

This comment has been minimized.

Show comment
Hide comment
@mjg59

mjg59 Jan 20, 2017

Contributor

Most of the kernel stuff is never going upstream - it's just too special cased and invasive, and I haven't identified a generic solution to the problem (some amount of this could potentially be handled through audit, but that's almost certainly months if not years of upstream conversation and wouldn't cover everything). That means that most of the qemu code is also un-upstreamable.

The libvmi core code is mostly plausible, and I can make an effort to get that upstream.

pmemaccess for qemu - I'm not sure why that's still out of tree. I can look into that.

So I think we're left with an indication that the best way to handle this is to keep it as a separate stage 1 image for now. That requires support for stage 1s to pass options back to stage 0 somehow. Is there a design plan around that? It also requires that someone commit to maintaining the independent repository.

Contributor

mjg59 commented Jan 20, 2017

Most of the kernel stuff is never going upstream - it's just too special cased and invasive, and I haven't identified a generic solution to the problem (some amount of this could potentially be handled through audit, but that's almost certainly months if not years of upstream conversation and wouldn't cover everything). That means that most of the qemu code is also un-upstreamable.

The libvmi core code is mostly plausible, and I can make an effort to get that upstream.

pmemaccess for qemu - I'm not sure why that's still out of tree. I can look into that.

So I think we're left with an indication that the best way to handle this is to keep it as a separate stage 1 image for now. That requires support for stage 1s to pass options back to stage 0 somehow. Is there a design plan around that? It also requires that someone commit to maintaining the independent repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment