Some questions about the code #10

Closed
98hq opened this issue Nov 6, 2023 · 11 comments

Comments

@98hq

98hq commented Nov 6, 2023

This is great work. I have some questions about the code in main.c; could you help me with them?

  1. In the initialization function, rewrite_code() is executed first, and then load_hook_lib() is executed; load_hook_lib() makes many system calls. Have these system calls already been hooked at that point? If so, will this cause any problems?
  2. I am using libopcodes for the first time; can you recommend some reference materials? In addition, what is the role of the NEW_DIS_ASM macro?
/* discard pushed 0x90 for 0xeb 0x6a 0x90 if rax is n * 3 + 1 */
	"pushq %rdi \n\t"
	"pushq %rax \n\t"
	"movabs $0xaaaaaaaaaaaaaaab, %rdi \n\t"
	"imul %rdi, %rax \n\t"
	"cmp %rdi, %rax \n\t"
	"popq %rax \n\t"
	"popq %rdi \n\t"
	"jb skip_pop \n\t"
	"addq $8, %rsp \n\t"
	"skip_pop: \n\t"

	"cmpq $15, %rax \n\t" // rt_sigreturn
	"je do_rt_sigreturn \n\t"
	"pushq %rbp \n\t"
	"movq %rsp, %rbp \n\t"
  3. What is the principle behind checking whether rax is 3n + 1 in the code above? I don't understand what this code does. In addition, why is rt_sigreturn filtered out in the code above? Did I miss something?
if (rax_on_stack == __NR_clone3)
		return -ENOSYS; /* workaround to trigger the fallback to clone */

	if (rax_on_stack == __NR_clone) {
		if (rdi & CLONE_VM) { // pthread creation
			/* push return address to the stack */
			rsi -= sizeof(uint64_t);
			*((uint64_t *) rsi) = retptr;
		}
	}
  4. Why does the code above need to distinguish the clone system call and apply the subsequent settings?
    Thanks for your help!
@yasukata
Owner

Thank you for your questions.

  1. In the initialization function, rewrite_code() is executed first, and then load_hook_lib() is executed; load_hook_lib() makes many system calls. Have these system calls already been hooked at that point? If so, will this cause any problems?

Yes, some system calls are already hooked during the binary rewriting phase. However, the initially installed hook function (enter_syscall) simply invokes the kernel-space system call that the rewriting program wishes to perform, so the program can continue to run.

  2. I am using libopcodes for the first time; can you recommend some reference materials? In addition, what is the role of the NEW_DIS_ASM macro?

Regarding reference materials for libopcodes, I could not find a decent one; I worked out its usage from scattered online materials.

As for the NEW_DIS_ASM ifdef section, we need it because the API of the disassembler library changed as of version 2.39 (please refer to the Makefile), so we have to separate the code for the new and old APIs.

  3. What is the principle behind checking whether rax is 3n + 1 in the code above? I don't understand what this code does. In addition, why is rt_sigreturn filtered out in the code above? Did I miss something?

Checking whether rax is 3n + 1 is part of a technique that makes the code at virtual addresses 0 to the maximum syscall number more efficient, described in https://github.com/yasukata/zpoline/tree/master/Documentation#reducing-nop-overhead-by-0xeb-0x6a-0x90-may-2023 .
Initially, we filled the address range between 0 and the maximum syscall number with nop (0x90), but sliding down through all of these nops is costly.
One way to mitigate this cost is to place the sequence 0xeb 0x6a 0x90 instead of 0x90 0x90 0x90, up to the address "max syscall number - 0x6a".
What this sequence means to an x86-64 CPU depends on the address where execution lands, and there are three patterns:

  1. if the address is n * 3 + 0: jmp 0x6a; nop; jmp 0x6a; ...
  2. if the address is n * 3 + 1: push 0x90; jmp 0x6a; nop; jmp 0x6a; ...
  3. if the address is n * 3 + 2: nop; jmp 0x6a; nop; ...

As shown, in case 2 (the address is n * 3 + 1), the code pushes 0x90 onto the stack; this 0x90 is unnecessary and we wish to discard it.
The code under the comment /* discard pushed 0x90 for 0xeb 0x6a 0x90 if rax is n * 3 + 1 */ checks whether the jump destination address is n * 3 + 1 and, if so, shrinks the stack by 8 bytes to discard the 0x90 pushed by push 0x90. Essentially, I wanted this assembly code block to implement the following C code.

if (rax_register_value % 3 == 1) {
  rsp_register_value += 8;
}
  rsp_register_value += 8;
}

Regarding rt_sigreturn, we excluded it from the C-based hook function implementation because it is somewhat special and complicated to handle.

  4. Why does the code above need to distinguish the clone system call and apply the subsequent settings?

The reason we handle the clone system call differently comes from the code below.

void ____asm_impl(void)
{
        /*  
         * enter_syscall triggers a kernel-space system call
         */
        asm volatile (
        ".globl enter_syscall \n\t"
        "enter_syscall: \n\t"
        "movq %rdi, %rax \n\t"
        "movq %rsi, %rdi \n\t"
        "movq %rdx, %rsi \n\t"
        "movq %rcx, %rdx \n\t"
        "movq %r8, %r10 \n\t"
        "movq %r9, %r8 \n\t"
        "movq 8(%rsp),%r9 \n\t"
        ".globl syscall_addr \n\t"
        "syscall_addr: \n\t"
        "syscall \n\t"
        "ret \n\t"
        );  

After the hook is applied, we trigger a system call through the code above (enter_syscall). This implementation returns to the caller with ret (at the bottom of the code block above), so it assumes that the caller's address is stored at the top of the stack.

The point here is that, in my understanding, a new thread created by the clone system call starts executing at the instruction right after syscall, which in our case is the ret in enter_syscall above. When the newly created thread reaches this ret for the first time, its stack does not have the caller's address on top. Therefore, we put it there ourselves: the following code in syscall_hook pushes the caller's address onto the stack of the new thread before the parent process/thread invokes the clone system call. (The rsi register holds the clone argument that specifies the stack address for the newly created thread.)

			/* push return address to the stack */
			rsi -= sizeof(uint64_t);
			*((uint64_t *) rsi) = retptr;

Thank you very much for your interest.

@98hq
Author

98hq commented Nov 15, 2023

thank you for your reply
First of all, I am not sure whether my understanding of rt_sigreturn is correct. When rt_sigreturn is encountered, the code first adds 8 to rsp, that is, it discards the return address, and then executes the "syscall" instruction. The return address is discarded here because the call will not return, right?
I have two questions, first, on my 64-bit system, calling rt_sigreturn directly does not seem to execute the syscall instruction, am I missing something? Second, if you don’t add 8 to rsp, will there be an error?

"do_rt_sigreturn:"
	"addq $8, %rsp \n\t"
	"jmp syscall_addr \n\t"
"syscall_addr: \n\t"
	"syscall \n\t"
	"ret \n\t"

I have a few questions regarding the filtering of clone system calls. First, why not filter the fork system call?
Second, clone’s manual points out:

In this case, for correct operation, the CLONE_VM option should not be specified. (If the child shares the parent's memory because of the use of the CLONE_VM flag, then no copy-on-write duplication occurs and chaos is likely to result.)

Then if the flags contain the CLONE_VM flag, it means that the child process and the parent process share the memory of the parent process, then I think there is a return address in the stack at this time.

Finally, I want to understand how the system call hook overhead is measured in 3.2 of the paper.
What tool was used for measurement?
Do the time results in the paper include hook settings? For example, the execution of the setup_trampoline(), rewrite_code(), and load_hook_lib() functions. I think rewrite_code() is more time-consuming; can you provide me with more information? I want to reproduce the hook methods mentioned in the paper and evaluate their overhead.

@yasukata
Owner

Thank you for your message.

I have two questions, first, on my 64-bit system, calling rt_sigreturn directly does not seem to execute the syscall instruction, am I missing something?

I guess it depends on how you "directly" call rt_sigreturn, but the following code, which triggers rt_sigreturn (syscall number 15), caused a segmentation fault in my environment (x86-64); so I think rt_sigreturn is executed and does something.

int main(void)
{
	/* keep both instructions in one asm statement so that the compiler
	 * cannot touch rax between them */
	asm volatile ("movq $15, %rax \n\t"
		      "syscall \n\t");
	return 0;
}

However, according to the manual ( https://man7.org/linux/man-pages/man2/sigreturn.2.html ), rt_sigreturn is not meant to be called directly.

       sigreturn() exists only to allow the implementation of signal
       handlers.  It should never be called directly.

Second, if you don’t add 8 to rsp, will there be an error?

Yes; when I removed this addition to rsp, I found that a handler registered with the signal system call did not work properly.

I have a few questions regarding the filtering of clone system calls.

Regarding clone/fork, please let me first summarize my understanding.

  1. premise: every execution context (process/thread) should have a dedicated stack memory region, meaning that a stack memory region should not be physically shared by multiple execution contexts; therefore, when a new thread or a new process is created, a new physical memory region is allocated for its stack.
  2. how a physical memory region is allocated for the stack of a new process/thread
    1. process creation (fork): by the nature of fork, the kernel allocates new physical memory regions to duplicate the memory of the parent process; the physical memory for the stack of the newly created child process is allocated as part of this duplication.
    2. thread creation (clone + CLONE_VM): the parent thread, as part of user-space application logic, allocates a memory region and requests the kernel to associate it with a newly created child thread as its stack through the argument of the clone system call.
  3. the initial content of the stack of a child process/thread
    1. process creation (fork): as mentioned in 2(i), the stack memory for a newly created child process is "duplicated"; therefore, the content of the stack of the child process is the "same" as that of the parent process.
    2. thread creation (clone + CLONE_VM): as mentioned in 2(ii), the stack memory for a newly created child thread is allocated by the parent thread, and the parent thread can put arbitrary content on it before triggering the clone system call; therefore, when a new thread is created, the content of its stack is usually "different" from that of the parent thread (while it would be possible to have the same content if the parent thread wishes to do so).
  4. regarding the return address
    1. process creation (fork): as mentioned in 3(i), the content of the stack of a child process is the same as that of the parent process; therefore, if the stack of the parent process has the return address on its top, the stack of the child process also has the same return address on its top.
    2. thread creation (clone + CLONE_VM): as mentioned in 3(ii), the content of the stack of a child thread is usually different from that of the parent thread; so, even if the stack of the parent thread has a return address on its top, the stack of the child thread may not have the return address on its top.

Then if the flags contain the CLONE_VM flag, it means that the child process and the parent process share the memory of the parent process, then I think there is a return address in the stack at this time.

Yes, I also think there is a return address in the stack at this time, but that is only for the stack of the parent thread.

As mentioned in 4(ii), even if the stack of the parent thread has a return address at the top of it, the stack of the child thread may not have it; this is why we manually put it on the stack of the child thread in syscall_hook.

First, why not filter the fork system call?

In contrast, as mentioned in 4(i), the stack of a process newly created by fork already has the return address at its top. Therefore, we do not need the procedure that syscall_hook performs for clone + CLONE_VM to manually put the return address on the child's stack; this is why we do not filter the fork system call.

Finally, I want to understand how the system call hook overhead is measured in 3.2 of the paper.
What tool was used for measurement?

To measure the system call overhead, I use the following program. It executes a loop a given number of times (specified by -c), and each iteration executes the getpid system call. I measure the time taken to finish the loop and then obtain the average time of a getpid system call by dividing the measured time by the loop count. Please note that the loop count has to be large enough to stabilize the result.

For this example, let's say we compile the following program and generate an executable file named a.out.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <getopt.h>
#include <assert.h>

extern pid_t do_getpid(void);

void __do_getpid(void)
{
	asm volatile (".globl do_getpid");
	asm volatile ("do_getpid:");
	asm volatile ("movq $39, %rax");
	asm volatile ("syscall");
	asm volatile ("ret");
}

int main(int argc, char* const* argv)
{
	int ch;
	unsigned long loopcnt = 0;

	while ((ch = getopt(argc, argv, "c:")) != -1) {
		switch (ch) {
		case 'c':
			loopcnt = atol(optarg);
			break;

		default:
			printf("unknown option\n");
			exit(1);
		}
	}

	if (!loopcnt) {
		printf("please specify loop count by -c\n");
		exit(0);
	}

	{
		pid_t my_pid = getpid();
		{
			unsigned long t;
			{
				struct timespec ts;
				clock_gettime(CLOCK_REALTIME, &ts);
				t = ts.tv_sec * 1000000000UL + ts.tv_nsec;
			}
			{
				unsigned long i;
				for (i = 0; i < loopcnt; i++)
					assert(my_pid == do_getpid());
			}
			{
				struct timespec ts;
				clock_gettime(CLOCK_REALTIME, &ts);
				t = ts.tv_sec * 1000000000UL + ts.tv_nsec - t;
			}
			printf("average %lu nsec\n", t / loopcnt);
		}
	}

	return 0;
}

For the hook-applied case, to avoid executing the kernel-space getpid system call, I use the following hook program, which always returns a dummy value (10000 this time) for a getpid system call rather than entering the kernel via syscall.

#include <stdio.h>
#include <syscall.h>

typedef long (*syscall_fn_t)(long, long, long, long, long, long, long);

static syscall_fn_t next_sys_call = NULL;

static long hook_function(long a1, long a2, long a3,
			  long a4, long a5, long a6,
			  long a7)
{
	if (a1 == __NR_getpid)
		return 10000;
	else
		return next_sys_call(a1, a2, a3, a4, a5, a6, a7);
}

int __hook_init(long placeholder __attribute__((unused)),
		void *sys_call_hook_ptr)
{
	next_sys_call = *((syscall_fn_t *) sys_call_hook_ptr);
	*((syscall_fn_t *) sys_call_hook_ptr) = hook_function;
	return 0;
}

To try this, please replace the content of apps/basic/main.c in this repository with the code above, then, compile it by make -C apps/basic to generate apps/basic/libzphook_basic.so.

The following command will execute a.out while applying the hook program apps/basic/libzphook_basic.so, and it will show the average time spent on each getpid system call execution. (this assumes that a.out is located in the top directory of this repository.)

LIBZPHOOK=./apps/basic/libzphook_basic.so LD_PRELOAD=./libzpoline.so ./a.out -c 100000

Do the time results in the paper include hook settings?

No, we do not include the hook setup time in the system call hook overhead.

Thank you very much for your questions.

@98hq
Author

98hq commented Nov 20, 2023

Thanks for your detailed reply, I will use the information you provided to reproduce. Finally, bro, this is really a great study.

@98hq
Author

98hq commented Nov 21, 2023

@yasukata Hello, I saw that you used ptrace to hijack system calls in your paper. How is this done? Is it through PTRACE_SYSEMU or PTRACE_SYSCALL? Can you share your code? My understanding is that all system calls are intercepted and most of them are then executed normally, while the getpid system call is redirected to asm_syscall_hook.

@yasukata
Owner

Thank you for your message.

The following program could be used for the getpid test; it leverages PTRACE_SYSCALL rather than PTRACE_SYSEMU.

In the following program, to selectively change the behavior of a specific system call:

  • at the entry of a system call, the tracer checks the system call number in rax.
      if (regs.orig_rax == __NR_getpid) {
  • if it is the system call whose behavior we wish to change, the tracer overwrites the value of rax with __NR_getpid and resumes the tracee.
      assert(!ptrace(PTRACE_POKEUSER, pid, offsetof(struct user_regs_struct, orig_rax), __NR_getpid));
  • the consequence is that the tracee executes the getpid system call instead of the one that the application wished to invoke (the following program happens to filter getpid itself, because this test measures the overhead of emulating the getpid system call); essentially, the system call number in rax is replaced with __NR_getpid at the system call entry in order to cancel the execution of the system call requested by the application.
  • then, at the exit of the (getpid) system call, the tracer can perform an arbitrary emulation procedure, which can be considered a hook function; in the following program, the tracer sets rax to 10000 so that a getpid system call always returns 10000, which is the same behavior as the hook function in the previous post.
      if (skipped) { regs.rax = 10000; assert(!ptrace(PTRACE_SETREGS, pid, 0, &regs)); }

#include <stdio.h>
#include <stddef.h>
#include <stdbool.h>
#include <unistd.h>
#include <assert.h>
#include <sys/wait.h>
#include <syscall.h>
#include <sys/user.h>
#include <sys/ptrace.h>

int main(int argc, char* const* argv)
{
	pid_t pid;
	assert(argc > 1);
	pid = fork();
	assert(pid >= 0);
	if (pid == 0) {
		assert(!ptrace(PTRACE_TRACEME, 0L, 0L, 0L));
		execvp(argv[1], &argv[1]);
	} else {
		int status;
		pid = wait(&status);
		assert(!ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_EXITKILL));
		assert(!ptrace(PTRACE_SYSCALL, pid, 0, 0));
		while (1) {
			bool skipped = false;
			struct user_regs_struct regs;
			pid = wait(&status);
			if (WIFEXITED(status))
				break;
			assert(!ptrace(PTRACE_GETREGS, pid, 0, &regs));
			if (regs.orig_rax == __NR_getpid) {
				assert(!ptrace(PTRACE_POKEUSER, pid, offsetof(struct user_regs_struct, orig_rax), __NR_getpid));
				skipped = true;
			}
			assert(!ptrace(PTRACE_SYSCALL, pid, 0, 0));
			pid = wait(&status);
			if (WIFEXITED(status))
				break;
			if (skipped) {
				regs.rax = 10000;
				assert(!ptrace(PTRACE_SETREGS, pid, 0, &regs));
			}
			assert(!ptrace(PTRACE_SYSCALL, pid, 0, 0));
		}
	}
	return 0;
}

Let's say the code above is saved in a file named ptracegetpid.c; the following will generate an executable file named ptracegetpid that works as the tracer and emulates the getpid system call.

gcc -O3 ptracegetpid.c -o ptracegetpid

The following executes the test where a.out is the program attached in the previous post; it executes getpid for a certain number of times specified by -c.

./ptracegetpid ./a.out -c 1000

Thank you for your question.

@98hq
Author

98hq commented Nov 24, 2023

Thank you for your detailed reply
I have two questions about the above program:

  1. Why is it necessary to set the value of orig_rax to __NR_getpid in the loop when it is found that the value of regs.orig_rax is __NR_getpid? Maybe you just need to set the value of skipped to true
    if (regs.orig_rax == __NR_getpid) { assert(!ptrace(PTRACE_POKEUSER, pid, offsetof(struct user_regs_struct, orig_rax), __NR_getpid)); skipped = true; }
  2. The time measured in this way is actually the tracer directly modifying the return value after the kernel executes getpid, right? Is it possible that at the entry point where the tracee program executes getpid, the tracer program can change rip to skip the execution of the syscall instruction, and then modify rax to 10000. I think that in this way, the execution of getpid will not enter the kernel, but will be completed in user space. Because the previous overhead measurement of the sud hooking system call returns 10000 directly in the user space without entering the kernel for execution.

@98hq
Author

98hq commented Nov 24, 2023

Regarding the second question: after debugging, I found that rip has the same value at the entry and the exit of the system call, so my idea of adjusting rip is not feasible.

@yasukata
Owner

  1. Why is it necessary to set the value of orig_rax to __NR_getpid in the loop when it is found that the value of regs.orig_rax is __NR_getpid? Maybe you just need to set the value of skipped to true
    if (regs.orig_rax == __NR_getpid) { assert(!ptrace(PTRACE_POKEUSER, pid, offsetof(struct user_regs_struct, orig_rax), __NR_getpid)); skipped = true; }

Yes, I think this is correct; even without ptrace(PTRACE_POKEUSER, pid, offsetof(struct user_regs_struct, orig_rax), __NR_getpid) at the entry point, the behavior of the program attached in the previous post would be the same.

The reason I put ptrace(PTRACE_POKEUSER, pid, offsetof(struct user_regs_struct, orig_rax), __NR_getpid) at the entry point is that the program in the previous post is meant to measure the overhead that the hook mechanism adds when emulating a system call. For this purpose, I wanted to include the cost of this ptrace call, because it is generally needed for any system call other than getpid.

  2. The time measured in this way is actually the tracer directly modifying the return value after the kernel executes getpid, right? Is it possible that at the entry point where the tracee program executes getpid, the tracer program can change rip to skip the execution of the syscall instruction, and then modify rax to 10000.

I agree that it would be nice to avoid executing getpid just to cancel the originally requested system call, but I could not find another easy option.

I think the discussion in the paper https://www.usenix.org/conference/atc22/presentation/jansen (Appendix B.2, Reducing Per-syscall ptrace Stops) provides good insight.

Thank you for your message.

@98hq
Author

98hq commented Nov 27, 2023

thanks for your reply

@yasukata
Owner

Thank you very much for the series of questions.

I will close this issue, but please feel free to reopen it or open another one if you have further comments or questions.
