
Injection BSOD on W7x64 #576

Closed
ultrapikachu opened this issue Feb 25, 2019 · 12 comments · Fixed by #708

Comments

@ultrapikachu

Hi
At first, the injector works successfully and can inject an executable into the target VM, but when I try to close the process on the target VM, the system goes to a BSOD.
screenshot_24

@tklengyel
Owner

You mean you close a process within the guest itself?

@ultrapikachu
Author

Yes. First I started the process with the injector, then I connected over VNC and clicked the exit button on the target process window, after which I got the BSOD.

@tklengyel
Owner

Did the injector exit correctly?

@ultrapikachu
Author

Yes.
I tried four more times, waiting after each injection, but the system only goes to a BSOD when the process exits.

@tklengyel
Owner

I can't reproduce this on my end

@Dos98
Contributor

Dos98 commented Mar 17, 2019

I can't reproduce this on my end

+1
The process injector is working correctly on Win7 x64. The system is not crashing after closing any process.

@mtarral
Contributor

mtarral commented Apr 17, 2019

Hi,

I think I can shed some light on this issue, because I have been able to reproduce this kind of behavior on Windows 7 x64.

Testing

I have developed a test suite to evaluate DRAKVUF's robustness and ensure its quality in corner cases, and I have focused on testing injection on Windows 7 for now.

I have found it to be very stable in most cases.
For example, the createproc injection was battle-tested with 5000 successive injections:

create_proc_5000

However, shellexec appears to be less stable:

shellexec_fail_1vcpu

I'd like to mention that the VM has only 1 VCPU, but the more VCPUs you have, the more likely the bug is to appear (a race condition?).

Here it crashes much faster with 4 VCPUs:

shellexec_fail_4vcpu
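The VCPU observation fits a simple independence model: if each injection has some small probability p of landing in a race window, the chance of seeing at least one crash over many runs climbs quickly with the run count. A toy calculation (the p values below are invented, purely illustrative, not measured from DRAKVUF):

```python
# Illustrative only: assuming each injection independently hits a race
# window with probability p, the chance of at least one crash in n runs
# is 1 - (1 - p)**n. The p values are hypothetical.
def crash_probability(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

for p in (0.0005, 0.002):  # hypothetical per-injection race probability
    print(f"p={p}: P(crash within 5000 runs) = {crash_probability(p, 5000):.3f}")
```

Even a per-injection probability of 0.05% makes a crash very likely within 5000 runs, which is consistent with the bug showing up reliably only under repeated stress testing.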

DRAKVUF output

1555424244.820902 Found module RPCRT4.dll
1555424244.820919 Found module GDI32.dll
1555424244.820928 Found module USER32.dll
1555424244.820935 Found module LPK.dll
1555424244.820942 Found module USP10.dll
1555424244.820958 Found module SHLWAPI.dll
1555424244.820966 Found module SHELL32.dll
1555424244.821058       ShellExecuteW @ 0x7fefe39983c
Starting injection loop
1555424244.821075 Started DRAKVUF loop
1555424259.906133 DRAKVUF loop broke unexpectedly: [Errno: 4] Interrupted system call
1555424259.906193 DRAKVUF loop finished
[INJECT] TIME:1555424259.906209 STATUS:Error ERROR_CODE:0 ERROR:"(null)"
Finished with injection. Ret: 0. Error: OK(0)
Injector freed
Process startup failed
1555424259.906237 close_vmi starting
1555424259.906266 Removed memtrap for GFN 0xff1b9 in altp2m view 1
1555424260.544601 close_vmi finished

Even though the process executed successfully in the VM, the injection failed, and a BSOD followed with the message PAGE_FAULT_IN_NONPAGED_AREA.

Windbg analysis

Analyzing the bug with WinDBG gives us the following information:

FAULTING_IP:
win32k!RawInputThread+6ed

windbg_win32k_RawInputThread

What's really surprising is that the instruction pointer (Arg3) and the memory referenced (Arg1) are the same address.

However, the disassembly is wrong, since mov edi, eax cannot trigger a page fault.
Using a better disassembler such as IDA should give us the real instruction.

Since you implemented the project and dug into Xen's API, do you have an idea or a lead that we could follow to debug this?

Thanks.

@mtarral
Contributor

mtarral commented Apr 17, 2019

I got a new BSOD, but this time the message is different:
bsod_pfn_corrupt

Bugcheck with WinDBG

bugcheck_pfn

A PTE or PFN is corrupt: this shows that, under specific conditions, DRAKVUF may alter the state of the guest, even if only for a very short time window.

@tklengyel
Owner

DRAKVUF may alter the state of the guest, even for a very short time window

It alters the guest state for the whole duration of DRAKVUF being active, but these changes should not be visible to the guest other than what we discuss in http://dfrws.org/conferences/dfrws-usa-2018/sessions/who-watches-watcher-detecting-hypervisor-introspection. The injector is even more intrusive, and it was not designed to be stealthy. The breakpoints it uses are still hidden by default, but the stack modifications, for example, are not. That said, it should not bluescreen your machine. If you can trigger the BSOD consistently then we may stand a chance to debug it.

@Wenzel

Wenzel commented Apr 18, 2019

but these changes should not be visible to the guest

Yes that's what I meant.
The guest should not crash because one of its pagetables has been modified by the hypervisor.

If you can trigger the BSOD consistently then we may stand a chance to debug it.

Luckily, the answer is yes: this bug is 100% reproducible.
Just running my test suite and aiming for 5000 successive injections in a multi-VCPU context will trigger it at some point.

Would you be interested in me sharing this test suite for DRAKVUF?
If you could reproduce this bug in your environment, it would surely be a step forward toward getting this issue resolved.

@tklengyel
Owner

The guest should not crash because one of its pagetables has been modified by the hypervisor.

We don't modify the guests' pagetables anywhere.

will trigger it at some point

That doesn't sound very easily reproducible to me :)

If you could reproduce this bug on your environment, it would sure be a step forward to see this issue resolved in the future.

Perhaps, but at this point it's very unlikely I would have the time to play with it. Of course, if you can open-source it then it may help someone else with more time to dig into the issue.

@mtarral
Contributor

mtarral commented Jun 13, 2019

Hi,

I have made progress on identifying which part of the code can trigger the BSOD.

1 - breakpoint insert/remove

Since the plugins dealing with syscalls need to insert breakpoints during the analysis to capture their return values, I decided to disable that feature and see whether it changes the BSOD.

I made the following changes:

diff --git a/src/plugins/procmon/procmon.cpp b/src/plugins/procmon/procmon.cpp
index a008838..2d9fe77 100644
--- a/src/plugins/procmon/procmon.cpp
+++ b/src/plugins/procmon/procmon.cpp
@@ -448,6 +448,9 @@ static event_response_t terminate_process_hook(
 
 static event_response_t create_user_process_hook_cb(drakvuf_t drakvuf, drakvuf_trap_info_t* info)
 {
+    PRINT_DEBUG("NtCreateuserProcess cb !\n");
+    return VMI_EVENT_RESPONSE_NONE;
+
     // PHANDLE ProcessHandle
     addr_t process_handle_addr = drakvuf_get_function_argument(drakvuf, info, 1);
     // PRTL_USER_PROCESS_PARAMETERS RtlUserProcessParameters
@@ -457,6 +460,9 @@ static event_response_t create_user_process_hook_cb(drakvuf_t drakvuf, drakvuf_t
 
 static event_response_t terminate_process_hook_cb(drakvuf_t drakvuf, drakvuf_trap_info_t* info)
 {
+    PRINT_DEBUG("NtTerminate cb !\n");
+    return VMI_EVENT_RESPONSE_NONE;
+
     // HANDLE ProcessHandle
     addr_t process_handle = drakvuf_get_function_argument(drakvuf, info, 1);
     // NTSTATUS ExitStatus
@@ -558,6 +564,9 @@ static event_response_t open_process_return_hook_cb(drakvuf_t drakvuf, drakvuf_t
 
 static event_response_t open_process_hook_cb(drakvuf_t drakvuf, drakvuf_trap_info_t* info)
 {
+    PRINT_DEBUG("NtOpenProcess cb !\n");
+    return VMI_EVENT_RESPONSE_NONE;
+
     auto plugin = get_trap_plugin<procmon>(info);
     if (!plugin)
         return VMI_EVENT_RESPONSE_NONE;
@@ -607,6 +616,9 @@ static event_response_t open_process_hook_cb(drakvuf_t drakvuf, drakvuf_trap_inf
 
 static event_response_t protect_virtual_memory_hook_cb(drakvuf_t drakvuf, drakvuf_trap_info_t* info)
 {
+    PRINT_DEBUG("NtProtectVirtualMemory cb !\n");
+    return VMI_EVENT_RESPONSE_NONE;
+
     gchar* escaped_pname = NULL;
     // HANDLE ProcessHandle
     uint64_t process_handle = drakvuf_get_function_argument(drakvuf, info, 1);

Well, it turns out that DRAKVUF is much more stable after these changes:

620b481a6cb0471bbd21583d30fc7721

Remember that it was crashing after ~20 tests before.

With extended testing, I still get a BSOD at some point:

c1e7cf3d65054a3babb24a979efccc79

Conclusions so far:

  • inserting/removing traps during the analysis influences the likelihood of triggering the BSOD
  • you need repeated testing with setup/teardown code on the same VM to trigger it (running DRAKVUF once, for a couple of hours, on a VM with a heavy workload would not trigger the bug cc @skvl)
  • there are other issues left that can lead to the BSOD

I would like to reimplement that in the little xen-drakvuf test that I posted on the xen-devel mailing list.

Would you briefly explain where you set the breakpoint for the syscall return, once you are in the syscall entry handler?

I saw this:

    auto trap = plugin->register_trap<procmon, process_creation_result_t<procmon>>(
                    drakvuf,
                    info,
                    plugin,
                    process_creation_return_hook,
                    breakpoint_by_pid_searcher());
struct breakpoint_by_pid_searcher
{
    drakvuf_trap_t* operator()(drakvuf_t drakvuf, drakvuf_trap_info_t* info, drakvuf_trap_t* trap) const
    {
        if (trap)
        {
            access_context_t ctx =
            {
                .translate_mechanism = VMI_TM_PROCESS_DTB,
                .dtb = info->regs->cr3,
                .addr = info->regs->rsp,
            };

so regs->rsp ?
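For context, here is a minimal sketch of the idea behind reading from regs->rsp at the entry breakpoint: on x86-64, the CALL into the syscall stub has pushed the return address, so the 8 bytes at rsp tell the monitor where execution will resume and where a return breakpoint could go. The dict-based memory below is a stand-in for a LibVMI guest-memory read, not the real API:

```python
import struct

# Toy model: at a syscall's entry breakpoint, read the 8 bytes at
# regs->rsp to recover the return address pushed by the caller's CALL.
def return_address(memory: dict, rsp: int) -> int:
    raw = bytes(memory[rsp + i] for i in range(8))
    return struct.unpack("<Q", raw)[0]  # little-endian 64-bit value

# Fake guest stack: the caller pushed return address 0x7ffe1234.
mem = {0x1000 + i: b for i, b in enumerate(struct.pack("<Q", 0x7FFE1234))}
print(hex(return_address(mem, 0x1000)))  # → 0x7ffe1234
```

This only illustrates the addressing; in DRAKVUF the read goes through the process's CR3 (VMI_TM_PROCESS_DTB), as the access_context_t in the snippet above shows.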

2 - Teardown code

As explained before, it is impossible to crash the VM while DRAKVUF is simply monitoring it for hours.
I think the reason is that you need teardown code to create the BSOD.

I managed to reproduce the bug with a single DRAKVUF run.
My setup is the Windows 7 VM and the following PowerShell script, which randomly spawns 10 processes and then kills them, in multiple batches:

$processes = "notepad", "mspaint", "powershell", "cmd"

For ($i=1; $i -le 10; $i++) {
    Write-Host "batch $i"
    $proc_running = New-Object System.Collections.ArrayList
    For ($j=1; $j -le 10; $j++) {
      $proc_name = $processes[(Get-Random -Maximum ([array]$processes).count)]
      Write-Host "[$i] $j - $proc_name"
      $proc = Start-Process -FilePath "$proc_name" -PassThru -WindowStyle Minimized
      $proc_running.Add($proc) | Out-Null
    }

    foreach ($proc in $proc_running) {
        Stop-Process -InputObject $proc
    }
}
  1. start Drakvuf: sudo ./src/drakvuf -d win7 -r profile.json -a procmon -a bsodmon -a crashmon > /dev/null
  2. start the powershell script on the host
  3. when the script is around batch ~8, stop drakvuf (CTRL-C)
  4. you should have a chance to see a BSOD/app crash.

The idea is that there is a bug in the teardown code, and if the previously breakpointed/remapped pages are hit while drakvuf shuts down, the BSOD triggers.

However, I quickly looked at the teardown code, and it is protected by drakvuf_pause/drakvuf_resume, so the reason is unclear to me.
Is it possible that pause/resume misbehave?
Should we use vmi_pause_vm/vmi_resume_vm?
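To illustrate why teardown is the suspect, here is a toy model of breakpoint bookkeeping (pure simulation, not DRAKVUF code; DRAKVUF actually hides its breakpoints behind altp2m shadow pages rather than patching the live view): an INT3 (0xCC) overwrites one instruction byte, and teardown must restore every saved original byte while the VM is paused. If a restore is skipped, or a VCPU races past a patched page mid-teardown, the guest executes a stray 0xCC.

```python
INT3 = 0xCC

class FakeGuestMemory:
    """Simulated guest code region with in-place breakpoint patching."""

    def __init__(self, code: bytes):
        self.mem = bytearray(code)
        self.saved = {}  # addr -> original byte, needed for teardown

    def set_breakpoint(self, addr: int) -> None:
        self.saved[addr] = self.mem[addr]  # remember the original byte
        self.mem[addr] = INT3              # patch in the trap

    def teardown(self) -> None:
        # Restore every patched byte; forgetting any of them leaves a
        # stray INT3 for the guest to hit after the monitor detaches.
        for addr, byte in self.saved.items():
            self.mem[addr] = byte
        self.saved.clear()

code = bytes([0x8B, 0xF8, 0x90])  # mov edi, eax; nop
g = FakeGuestMemory(code)
g.set_breakpoint(0)
assert g.mem[0] == INT3           # breakpoint is live
g.teardown()
assert bytes(g.mem) == code       # original instruction restored
```

The model suggests where to look: either a byte (or remapped page) that is never restored, or a window where a VCPU runs against a half-torn-down view despite the pause.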

If you want to test it, just copy the script into your VM.
I'm using libguestfs for that:
virt-copy-in -a ~/vms/win7.qcow2 ~/stress_test.ps1 /Users/vagrant/Desktop/

Thanks

5 participants