
Injection BSOD on W7x64 #576

Closed
ultrapikachu opened this issue Feb 25, 2019 · 12 comments · Fixed by #708

Comments

@ultrapikachu

Hi
At first, the injector works successfully and can inject an executable into the target VM, but when I try to close the process on the target VM, the system goes to a BSOD.
screenshot_24

@tklengyel
Owner

You mean you close a process within the guest itself?

@ultrapikachu
Author

Yes. First I started the process with the injector, then I connected over VNC and clicked the exit button on the target process window, after which I got the BSOD.

@tklengyel
Owner

Did the injector exit correctly?

@ultrapikachu
Author

Yes.
I tried four more times, waiting after each injection, but the system only goes to a BSOD when the process exits.

@tklengyel
Owner

I can't reproduce this on my end

@Dos98
Contributor

Dos98 commented Mar 17, 2019

I can't reproduce this on my end

+1
The process injector is working correctly on Win7 x64. The system is not crashing after closing any process.

@mtarral
Contributor

mtarral commented Apr 17, 2019

Hi,

I think I can shed some light on this issue, because I have been able to reproduce this kind of behavior on Windows 7 x64.

Testing

I have developed a test suite to evaluate DRAKVUF's robustness and ensure its quality in corner cases, and I have focused on testing injection on Windows 7 for now.

I have found it to be very stable in most cases.
For example, the createproc injection was battle-tested with 5000 successive injections:

create_proc_5000

However, shellexec appears to be less stable:

shellexec_fail_1vcpu

I'd like to mention that the VM has only 1 VCPU, but the more VCPUs you have, the more likely the bug is to appear (a race condition?).

Here it crashes much faster with 4 VCPUs:

shellexec_fail_4vcpu
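The VCPU observation fits a simple independence model: if each injection has some small probability p of landing in a race window, the chance of seeing at least one crash over many runs climbs quickly with the run count. A toy calculation (the p values below are invented, purely illustrative, not measured from DRAKVUF):

```python
# Illustrative only: assuming each injection independently hits a race
# window with probability p, the chance of at least one crash in n runs
# is 1 - (1 - p)**n. The p values are hypothetical.
def crash_probability(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

for p in (0.0005, 0.002):  # hypothetical per-injection race probability
    print(f"p={p}: P(crash within 5000 runs) = {crash_probability(p, 5000):.3f}")
```

Even a per-injection probability of 0.05% makes a crash very likely within 5000 runs, which is consistent with the bug showing up reliably only under repeated stress testing.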

DRAKVUF output

1555424244.820902 Found module RPCRT4.dll
1555424244.820919 Found module GDI32.dll
1555424244.820928 Found module USER32.dll
1555424244.820935 Found module LPK.dll
1555424244.820942 Found module USP10.dll
1555424244.820958 Found module SHLWAPI.dll
1555424244.820966 Found module SHELL32.dll
1555424244.821058       ShellExecuteW @ 0x7fefe39983c
Starting injection loop
1555424244.821075 Started DRAKVUF loop
1555424259.906133 DRAKVUF loop broke unexpectedly: [Errno: 4] Interrupted system call
1555424259.906193 DRAKVUF loop finished
[INJECT] TIME:1555424259.906209 STATUS:Error ERROR_CODE:0 ERROR:"(null)"
Finished with injection. Ret: 0. Error: OK(0)
Injector freed
Process startup failed
1555424259.906237 close_vmi starting
1555424259.906266 Removed memtrap for GFN 0xff1b9 in altp2m view 1
1555424260.544601 close_vmi finished

Even though the process executed successfully in the VM, the injection failed, and a BSOD followed with the message PAGE_FAULT_IN_NONPAGED_AREA.

Windbg analysis

Analyzing the bug with WinDBG gives us the following information:

FAULTING_IP:
win32k!RawInputThread+6ed

windbg_win32k_RawInputThread

What's really surprising is that the instruction pointer (Arg3) and the memory referenced (Arg1) are the same address.

However, the disassembly is wrong, since mov edi, eax cannot trigger a page fault.
Using a better disassembler such as IDA should give us the real instruction.

Since you implemented the project and dug into Xen's API, do you have an idea or a lead that we could follow to debug this?

Thanks.

@mtarral
Contributor

mtarral commented Apr 17, 2019

I got a new BSOD, but this time the message is different:
bsod_pfn_corrupt

Bugcheck with WinDBG

bugcheck_pfn

A PTE or PFN is corrupt: this shows that, under specific conditions, DRAKVUF may alter the state of the guest, even if only for a very short time window.

@tklengyel
Owner

DRAKVUF may alter the state of the guest, even for a very short time window

It alters the guest state for the whole duration of DRAKVUF being active, but these changes should not be visible to the guest other than what we discuss in http://dfrws.org/conferences/dfrws-usa-2018/sessions/who-watches-watcher-detecting-hypervisor-introspection. The injector is even more intrusive, and it was not designed to be stealthy. The breakpoints it uses are still hidden by default, but the stack modifications, for example, are not. That said, it should not bluescreen your machine. If you can trigger the BSOD consistently then we may stand a chance to debug it.

@Wenzel

Wenzel commented Apr 18, 2019

but these changes should not be visible to the guest

Yes that's what I meant.
The guest should not crash because one of its pagetables has been modified by the hypervisor.

If you can trigger the BSOD consistently then we may stand a chance to debug it.

Luckily, the answer is yes: this bug is 100% reproducible.
Just running my test suite and aiming for 5000 successive injections in a multi-VCPU context will trigger it at some point.

Would you be interested in me sharing this test suite for DRAKVUF?
If you could reproduce this bug in your environment, it would surely be a step forward toward getting this issue resolved.

@tklengyel
Owner

The guest should not crash because one of its pagetables has been modified by the hypervisor.

We don't modify the guests' pagetables anywhere.

will trigger it at some point

That doesn't sound very easily reproducible to me :)

If you could reproduce this bug on your environment, it would sure be a step forward to see this issue resolved in the future.

Perhaps, but at this point it's very unlikely I would have the time to play with it. Of course, if you can open-source it then it may help someone else with more time to dig into the issue.

@mtarral
Contributor

mtarral commented Jun 13, 2019

Hi,

I have made progress on identifying which part of the code can trigger the BSOD.

1 - breakpoint insert/remove

Since the plugins dealing with syscalls need to insert breakpoints during the analysis to capture their return values, I decided to disable that feature and see whether it changes the BSOD.

I made the following changes:

diff --git a/src/plugins/procmon/procmon.cpp b/src/plugins/procmon/procmon.cpp
index a008838..2d9fe77 100644
--- a/src/plugins/procmon/procmon.cpp
+++ b/src/plugins/procmon/procmon.cpp
@@ -448,6 +448,9 @@ static event_response_t terminate_process_hook(
 
 static event_response_t create_user_process_hook_cb(drakvuf_t drakvuf, drakvuf_trap_info_t* info)
 {
+    PRINT_DEBUG("NtCreateuserProcess cb !\n");
+    return VMI_EVENT_RESPONSE_NONE;
+
     // PHANDLE ProcessHandle
     addr_t process_handle_addr = drakvuf_get_function_argument(drakvuf, info, 1);
     // PRTL_USER_PROCESS_PARAMETERS RtlUserProcessParameters
@@ -457,6 +460,9 @@ static event_response_t create_user_process_hook_cb(drakvuf_t drakvuf, drakvuf_t
 
 static event_response_t terminate_process_hook_cb(drakvuf_t drakvuf, drakvuf_trap_info_t* info)
 {
+    PRINT_DEBUG("NtTerminate cb !\n");
+    return VMI_EVENT_RESPONSE_NONE;
+
     // HANDLE ProcessHandle
     addr_t process_handle = drakvuf_get_function_argument(drakvuf, info, 1);
     // NTSTATUS ExitStatus
@@ -558,6 +564,9 @@ static event_response_t open_process_return_hook_cb(drakvuf_t drakvuf, drakvuf_t
 
 static event_response_t open_process_hook_cb(drakvuf_t drakvuf, drakvuf_trap_info_t* info)
 {
+    PRINT_DEBUG("NtOpenProcess cb !\n");
+    return VMI_EVENT_RESPONSE_NONE;
+
     auto plugin = get_trap_plugin<procmon>(info);
     if (!plugin)
         return VMI_EVENT_RESPONSE_NONE;
@@ -607,6 +616,9 @@ static event_response_t open_process_hook_cb(drakvuf_t drakvuf, drakvuf_trap_inf
 
 static event_response_t protect_virtual_memory_hook_cb(drakvuf_t drakvuf, drakvuf_trap_info_t* info)
 {
+    PRINT_DEBUG("NtProtectVirtualMemory cb !\n");
+    return VMI_EVENT_RESPONSE_NONE;
+
     gchar* escaped_pname = NULL;
     // HANDLE ProcessHandle
     uint64_t process_handle = drakvuf_get_function_argument(drakvuf, info, 1);

Well, it turns out that DRAKVUF is much more stable after these changes:

620b481a6cb0471bbd21583d30fc7721

Remember that it was crashing after ~20 tests before.

With extended testing, I still get a BSOD at some point:

c1e7cf3d65054a3babb24a979efccc79

Conclusions so far:

  • inserting/removing traps during the analysis influences the likelihood of triggering the BSOD
  • you need repeated testing with setup/teardown code on the same VM to trigger it (running DRAKVUF once, for a couple of hours, on a VM with a heavy workload would not trigger the bug cc @skvl)
  • there are other issues left that can lead to the BSOD

I would like to reimplement that in the little xen-drakvuf test that I posted on the xen-devel mailing list.

Would you briefly explain where you set the breakpoint for the syscall return, once you are in the syscall entry handler?

I saw this:

    auto trap = plugin->register_trap<procmon, process_creation_result_t<procmon>>(
                    drakvuf,
                    info,
                    plugin,
                    process_creation_return_hook,
                    breakpoint_by_pid_searcher());
struct breakpoint_by_pid_searcher
{
    drakvuf_trap_t* operator()(drakvuf_t drakvuf, drakvuf_trap_info_t* info, drakvuf_trap_t* trap) const
    {
        if (trap)
        {
            access_context_t ctx =
            {
                .translate_mechanism = VMI_TM_PROCESS_DTB,
                .dtb = info->regs->cr3,
                .addr = info->regs->rsp,
            };

so regs->rsp ?
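For context, here is a minimal sketch of the idea behind reading from regs->rsp at the entry breakpoint: on x86-64, the CALL into the syscall stub has pushed the return address, so the 8 bytes at rsp tell the monitor where execution will resume and where a return breakpoint could go. The dict-based memory below is a stand-in for a LibVMI guest-memory read, not the real API:

```python
import struct

# Toy model: at a syscall's entry breakpoint, read the 8 bytes at
# regs->rsp to recover the return address pushed by the caller's CALL.
def return_address(memory: dict, rsp: int) -> int:
    raw = bytes(memory[rsp + i] for i in range(8))
    return struct.unpack("<Q", raw)[0]  # little-endian 64-bit value

# Fake guest stack: the caller pushed return address 0x7ffe1234.
mem = {0x1000 + i: b for i, b in enumerate(struct.pack("<Q", 0x7FFE1234))}
print(hex(return_address(mem, 0x1000)))  # → 0x7ffe1234
```

This only illustrates the addressing; in DRAKVUF the read goes through the process's CR3 (VMI_TM_PROCESS_DTB), as the access_context_t in the snippet above shows.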

2 - Teardown code

As explained before, it is impossible to crash the VM while DRAKVUF is simply monitoring it for hours.
I think the reason is that you need teardown code to create the BSOD.

I managed to reproduce the bug with a single DRAKVUF run.
My setup is the Windows 7 VM and the following PowerShell script, which randomly spawns 10 processes and then kills them, in multiple batches:

$processes = "notepad", "mspaint", "powershell", "cmd"

For ($i=1; $i -le 10; $i++) {
    Write-Host "batch $i"
    $proc_running = New-Object System.Collections.ArrayList
    For ($j=1; $j -le 10; $j++) {
      $proc_name = $processes[(Get-Random -Maximum ([array]$processes).count)]
      Write-Host "[$i] $j - $proc_name"
      $proc = Start-Process -FilePath "$proc_name" -PassThru -WindowStyle Minimized
      $proc_running.Add($proc) | Out-Null
    }

    foreach ($proc in $proc_running) {
        Stop-Process -InputObject $proc
    }
}
  1. start Drakvuf: sudo ./src/drakvuf -d win7 -r profile.json -a procmon -a bsodmon -a crashmon > /dev/null
  2. start the powershell script on the host
  3. when the script is around batch ~8, stop drakvuf (CTRL-C)
  4. you should have a chance to see a BSOD/app crash.

The idea is that there is a bug in the teardown code, and if the previously breakpointed/remapped pages are hit while drakvuf shuts down, the BSOD triggers.

However, I quickly looked at the teardown code, and it is protected by drakvuf_pause/drakvuf_resume, so the reason is unclear to me.
Is it possible that pause/resume misbehave?
Should we use vmi_pause_vm/vmi_resume_vm?
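To illustrate why teardown is the suspect, here is a toy model of breakpoint bookkeeping (pure simulation, not DRAKVUF code; DRAKVUF actually hides its breakpoints behind altp2m shadow pages rather than patching the live view): an INT3 (0xCC) overwrites one instruction byte, and teardown must restore every saved original byte while the VM is paused. If a restore is skipped, or a VCPU races past a patched page mid-teardown, the guest executes a stray 0xCC.

```python
INT3 = 0xCC

class FakeGuestMemory:
    """Simulated guest code region with in-place breakpoint patching."""

    def __init__(self, code: bytes):
        self.mem = bytearray(code)
        self.saved = {}  # addr -> original byte, needed for teardown

    def set_breakpoint(self, addr: int) -> None:
        self.saved[addr] = self.mem[addr]  # remember the original byte
        self.mem[addr] = INT3              # patch in the trap

    def teardown(self) -> None:
        # Restore every patched byte; forgetting any of them leaves a
        # stray INT3 for the guest to hit after the monitor detaches.
        for addr, byte in self.saved.items():
            self.mem[addr] = byte
        self.saved.clear()

code = bytes([0x8B, 0xF8, 0x90])  # mov edi, eax; nop
g = FakeGuestMemory(code)
g.set_breakpoint(0)
assert g.mem[0] == INT3           # breakpoint is live
g.teardown()
assert bytes(g.mem) == code       # original instruction restored
```

The model suggests where to look: either a byte (or remapped page) that is never restored, or a window where a VCPU runs against a half-torn-down view despite the pause.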

If you want to test it, just copy the script into your VM.
I'm using libguestfs for that:
virt-copy-in -a ~/vms/win7.qcow2 ~/stress_test.ps1 /Users/vagrant/Desktop/

Thanks

5 participants