From a9341875b03ef14173c5bffc262f01a0b38d5447 Mon Sep 17 00:00:00 2001 From: Remzi Arpaci-Dusseau Date: Mon, 29 Jan 2018 09:20:14 -0600 Subject: [PATCH] Added first xv6 project and background --- README.md | 1 + initial-xv6/README.md | 52 +++++ initial-xv6/background.md | 422 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 475 insertions(+) create mode 100644 initial-xv6/README.md create mode 100644 initial-xv6/background.md diff --git a/README.md b/README.md index fc0a53ec..aeab5dfa 100644 --- a/README.md +++ b/README.md @@ -18,6 +18,7 @@ journey; you'll have to do more on your own to truly become proficient. * [Unix Utilities](https://github.com/remzi-arpacidusseau/ostep-projects/tree/master/initial-utilities) +* [Intro To xv6](https://github.com/remzi-arpacidusseau/ostep-projects/tree/master/initial-xv6) diff --git a/initial-xv6/README.md b/initial-xv6/README.md new file mode 100644 index 00000000..d392f06e --- /dev/null +++ b/initial-xv6/README.md @@ -0,0 +1,52 @@ + +# Intro To Kernel Hacking + +To develop a better sense of how an operating system works, you will also +do a few projects *inside* a real OS kernel. The kernel we'll be using is a +port of the original Unix (version 6), and is runnable on modern x86 +processors. It was developed at MIT and is a small and relatively +understandable OS and thus an excellent focus for simple projects. +More information about xv6, including a very useful book which you might want +to read, is available [here](https://pdos.csail.mit.edu/6.828/2017/xv6.html). + +This first project is just a warmup, and thus relatively light on work. The +goal of the project is simple: to add a system call to xv6. Your system call, +**getreadcount()**, simply returns how many times the **read()** system +call has been called by user processes since the kernel was +booted. 
+ +## Your System Call + +Your new system call should have the following return type and +parameters: + +```c +int getreadcount(void) +``` + +Your system call returns the value of a counter (perhaps called **readcount** +or something like that) which is incremented every time any process calls the +**read()** system call. That's it! + +## Tips + +Watch this [discussion video](https://www.youtube.com/watch?v=vR6z2QGcoo8) -- +it contains a detailed walk-through of all the things you need to know to +unpack xv6, build it, and modify it to make this project successful. + +One good way to start hacking inside a large code base is to find something +similar to what you want to do and to carefully copy/modify that. Here, you +should find some other system call, like **getpid()** (or any other simple +call). Copy it in all the ways you think are needed, and then modify it to do +what you need. + +Most of your time will be spent on understanding the code. There shouldn't +be a whole lot of code added. + +Using gdb (the debugger) can help you understand the code and trace its +execution, and it will be useful in later projects too. Get familiar with this fine +tool! + + + + diff --git a/initial-xv6/background.md b/initial-xv6/background.md new file mode 100644 index 00000000..40f22668 --- /dev/null +++ b/initial-xv6/background.md @@ -0,0 +1,422 @@ +# xv6 System Call Background + +To be able to implement this project, you'll have to understand a little bit +about how xv6 implements system calls. As you recall from the [OS +book](http://www.ostep.org/), a system call is a protected transfer of control +from an application (running in *user mode*) to the OS (running in *kernel +mode*). The general approach, which we refer to as *limited direct execution* +(*LDE*), enables the kernel to maintain control of the machine while generally +letting user applications run efficiently and without kernel intervention. 
+ +We'll specifically trace what happens in the code in order to understand a +*system call*. System calls allow the operating system to run code on +behalf of user requests but in a protected manner, both by jumping into the +kernel (in a very specific and restricted way) and also by simultaneously +raising the privilege level of the hardware, so that the OS can perform +certain restricted operations. + +## System Call Overview + +Before delving into the details, we first provide an overview of the entire +process. The problem we are trying to solve is simple: how can we build a +system such that the OS is allowed access to all of the resources of the +machine (including access to special instructions, to physical memory, +and to any devices) while user programs are only able to do so in a restricted +manner? + +The way we achieve this goal is with hardware support. The hardware must +explicitly have a notion of privilege built into it, and thus be able to +distinguish when the OS is running versus typical user applications. + +## Getting Into The Kernel: A Trap + +The first step in a system call begins at user-level with an application. The +application that wishes to make a system call (such as **read()**) calls the +relevant library routine. However, all the library version of the system call +does is place the proper arguments in relevant registers and issue some +kind of **trap** instruction, as we see in an expanded version of **usys.S** +(some macros are used to define these functions so as to make life +easier for the kernel developer; the example shows the macro expanded to the +actual assembly code). + +``` +.globl read; +read: + movl $6, %eax; + int $64; + ret +``` +File: **usys.S** + +Here we can see that the **read()** library function actually doesn't do much +at all; it moves the value 6 into the register **%eax** and issues the x86 +trap instruction which is confusingly called **int** (short for *interrupt*). 
+The value in **%eax** is going to be used by the kernel to *vector* to the +right system call, i.e., it determines which system call is being invoked. The +**int** instruction takes one argument (here it is 64), which tells the +hardware which trap type this is. In xv6, trap 64 is used to handle system +calls. Any other arguments which are passed to the system call are passed on +the stack. + +## Kernel Side: Trap Tables + +Once the **int** instruction is executed, the hardware takes over and does a +bunch of work on behalf of the caller. One important thing the hardware does +is to raise the *privilege level* of the CPU to kernel mode; on x86 this +usually means moving from a *CPL* *(Current Privilege Level)* of 3 (the level +at which user applications run) to CPL 0 (in which the kernel runs). Yes, +there are a couple of in-between privilege levels, but most systems do not +make use of these. + +The second important thing the hardware does is to transfer control to the +*trap vectors* of the system. To enable the hardware to know what code to run +when a particular trap occurs, the OS, when booting, must make sure to inform +the hardware of the location of the code to run when such traps take +place. This is done in **main.c** as follows: + +```c +int +mainc(void) +{ + ... + tvinit(); // trap vectors initialized here + ... +} +``` +FILE: **main.c** + +The routine **tvinit()** is the relevant one here. Peeking inside of it, we +see: + +```c +void tvinit(void) +{ + int i; + + for(i = 0; i < 256; i++) + SETGATE(idt[i], 0, SEG_KCODE<<3, vectors[i], 0); + + // this is the line we care about... + SETGATE(idt[T_SYSCALL], 1, SEG_KCODE<<3, vectors[T_SYSCALL], DPL_USER); + + initlock(&tickslock, "time"); +} +``` +FILE: **trap.c** + +The **SETGATE()** macro is the relevant code here. It is used to set the +**idt** array to point to the proper code to execute when various traps and +interrupts occur. 
For system calls, the single **SETGATE()** call (which +comes after the loop) is the one we're interested in. Here is what the macro +does (as well as the gate descriptor it sets): + +```c +// Gate descriptors for interrupts and traps +struct gatedesc { + uint off_15_0 : 16; // low 16 bits of offset in segment + uint cs : 16; // code segment selector + uint args : 5; // # args, 0 for interrupt/trap gates + uint rsv1 : 3; // reserved(should be zero I guess) + uint type : 4; // type(STS_{TG,IG32,TG32}) + uint s : 1; // must be 0 (system) + uint dpl : 2; // descriptor(meaning new) privilege level + uint p : 1; // Present + uint off_31_16 : 16; // high bits of offset in segment +}; + +// Set up a normal interrupt/trap gate descriptor. +// - istrap: 1 for a trap (= exception) gate, 0 for an interrupt gate. +// interrupt gate clears FL_IF, trap gate leaves FL_IF alone +// - sel: Code segment selector for interrupt/trap handler +// - off: Offset in code segment for interrupt/trap handler +// - dpl: Descriptor Privilege Level - +// the privilege level required for software to invoke +// this interrupt/trap gate explicitly using an int instruction. +#define SETGATE(gate, istrap, sel, off, d) \ +{ \ + (gate).off_15_0 = (uint) (off) & 0xffff; \ + (gate).cs = (sel); \ + (gate).args = 0; \ + (gate).rsv1 = 0; \ + (gate).type = (istrap) ? STS_TG32 : STS_IG32; \ + (gate).s = 0; \ + (gate).dpl = (d); \ + (gate).p = 1; \ + (gate).off_31_16 = (uint) (off) >> 16; \ +} +``` +FILE: **mmu.h** + +As you can see from the code, all the **SETGATE()** macro does is set the +values of an in-memory data structure. Most important is the **off** +parameter, which tells the hardware where the trap handling code is. In the +initialization code, the value **vectors[T_SYSCALL]** is passed in; thus, +whatever the **vectors** array points to will be the code to run when a system +call takes place. 
There are other details (which are important too); consult +the [x86 hardware architecture +manuals](http://www.intel.com/products/processor/manuals) (particularly +Volumes 3A and 3B) for more information. + +Note, however, that we still have not informed the hardware of this +information, but rather filled a data structure. The actual hardware informing +occurs a little later in the boot sequence; in xv6, it happens in the routine +**mpmain()** in the file **main.c**, which calls **idtinit()** in **trap.c**, +which calls **lidt()** in the include file **x86.h**: + +```c +static void +mpmain(void) +{ + idtinit(); + ... + +void +idtinit(void) +{ + lidt(idt, sizeof(idt)); +} + +static inline void +lidt(struct gatedesc *p, int size) +{ + volatile ushort pd[3]; + + pd[0] = size-1; + pd[1] = (uint)p; + pd[2] = (uint)p >> 16; + + asm volatile("lidt (%0)" : : "r" (pd)); +} +``` + +Here, you can see how (eventually) a single assembly instruction is executed to +tell the hardware where to find the *interrupt descriptor table (IDT)* in +memory. Note this is done in **mpmain()** as each processor in the system +must have such a table (they all use the same one of course). Finally, after +executing this instruction (which is only possible when the kernel is running, +in privileged mode), we are ready to think about what happens when a user +application invokes a system call. 
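
The two bits of arithmetic above, splitting a 32-bit handler address into the two 16-bit halves that **SETGATE()** stores and packing the table's size and address into the three 16-bit words that **lidt()** hands to the hardware, can be tried out in ordinary user-space C. What follows is only a sketch under simplifying assumptions: the names `toy_gate`, `toy_setgate`, `toy_gate_offset`, and `toy_pack_idtr` are made up for illustration and are not part of xv6, and the toy struct keeps just a few fields rather than the real descriptor layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for xv6's struct gatedesc.
   It keeps only the fields needed to show the offset split and
   does NOT match the real hardware descriptor layout. */
struct toy_gate {
  uint16_t off_15_0;   /* low 16 bits of the handler address */
  uint16_t off_31_16;  /* high 16 bits of the handler address */
  uint8_t  dpl;        /* privilege level allowed to use "int" on this gate */
  uint8_t  is_trap;    /* 1 = trap gate, 0 = interrupt gate */
};

/* Same splitting arithmetic as SETGATE(), applied to the toy struct. */
static void toy_setgate(struct toy_gate *g, int is_trap, uint32_t off, int dpl)
{
  g->off_15_0  = (uint16_t)(off & 0xffff);
  g->off_31_16 = (uint16_t)(off >> 16);
  g->dpl       = (uint8_t)dpl;
  g->is_trap   = (uint8_t)is_trap;
}

/* Reassemble the handler address from its two halves. */
static uint32_t toy_gate_offset(const struct toy_gate *g)
{
  return ((uint32_t)g->off_31_16 << 16) | g->off_15_0;
}

/* Same packing as the pd[3] array in lidt():
   pd[0] = table size in bytes minus one (the "limit"),
   pd[1] = low 16 bits of the table address,
   pd[2] = high 16 bits of the table address. */
static void toy_pack_idtr(uint16_t pd[3], uint32_t base, uint32_t size)
{
  assert(size > 0 && size <= 0x10000);  /* the limit must fit in 16 bits */
  pd[0] = (uint16_t)(size - 1);
  pd[1] = (uint16_t)(base & 0xffff);
  pd[2] = (uint16_t)(base >> 16);
}
```

For instance, calling `toy_setgate()` with a made-up handler address of `0x80105a3c` stores `0x5a3c` and `0x8010` in the two halves, and packing a 256-entry table of 8-byte descriptors yields a limit of 2047.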
+ +```c +struct trapframe { + // registers as pushed by pusha + uint edi; + uint esi; + uint ebp; + uint oesp; // useless & ignored + uint ebx; + uint edx; + uint ecx; + uint eax; + + // rest of trap frame + ushort es; + ushort padding1; + ushort ds; + ushort padding2; + uint trapno; + + // below here defined by x86 hardware + uint err; + uint eip; + ushort cs; + ushort padding3; + uint eflags; + + // below here only when crossing rings, such as from user to kernel + uint esp; + ushort ss; + ushort padding4; +}; +``` +File: **x86.h** + +## From Low-level To The C Trap Handler + +The OS has carefully set up its trap handlers, and thus we are ready to see +what happens on the OS side once an application issues a system call via the +**int** instruction. Before any code is run, the hardware must perform a +number of tasks. The first things it does are those tasks which are +difficult/impossible for the software to do itself, including saving the +current PC (IP or EIP in Intel terminology) onto the stack, as well as a +number of other registers such as the **eflags** register (which contains the +current status of the CPU while the program was running), stack pointer, and +so forth. One can see what the hardware is expected to save by looking at the +**trapframe** structure as defined in **x86.h**. + +As you can see from the bottom of the trapframe structure, some pieces of the +trap frame are filled in by the hardware (up to the **err** field); the rest +will be saved by the OS. The first OS code that is run is **vector64()** +as found in **vectors.S** (which is automatically generated by the script +**vectors.pl**). + +```c +.globl vector64 +vector64: + pushl $64 + jmp alltraps +``` +File: **vectors.S** (generated by **vectors.pl**) + +This code pushes the trap number onto the stack (filling in the **trapno** +field of the trap frame) and then jumps to **alltraps()** to do most of the +saving of context into the trap frame. + +``` + # vectors.S sends all traps here. 
+.globl alltraps +alltraps: + # Build trap frame. + pushl %ds + pushl %es + pushal + + # Set up data segments. + movl $SEG_KDATA_SEL, %eax + movw %ax,%ds + movw %ax,%es + + # Call trap(tf), where tf=%esp + pushl %esp + call trap + addl $4, %esp +``` +File: **trapasm.S** + +The code in **alltraps()** pushes a few more segment registers (not described +here, yet) onto the stack before pushing the remaining general purpose +registers onto the trap frame via a **pushal** instruction. Then, the OS +changes the data segment and extra segment registers so that it can +access its own (kernel) memory. Finally, the C trap handler is called. + +## The C Trap Handler + +Once done with the low-level details of setting up the trap frame, the +low-level assembly code calls up into a generic C trap handler called +**trap()**, which is passed a pointer to the trap frame. This trap handler is +called for all types of interrupts and traps, and thus checks the trap number +field of the trap frame (**trapno**) to determine what to do. The first check +is for the system call trap number (**T_SYSCALL**, or 64 as defined somewhat +arbitrarily in **traps.h**), which then handles the system call, as you see +here: + +```c +void +trap(struct trapframe *tf) +{ + if(tf->trapno == T_SYSCALL){ + if(cp->killed) + exit(); + cp->tf = tf; + syscall(); + if(cp->killed) + exit(); + return; + } + ... // continues +} +``` +FILE: **trap.c** + +The code isn't too complicated. It checks if the current process (that made +the system call) has been killed; if so, it simply exits and cleans up the +process (and thus does not proceed with the system call). It then calls +**syscall()** to actually perform the system call; more details on that +below. Finally, it checks again whether the process has been killed before +returning. Note that we'll follow the return path below in more detail. 
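
The control flow just described (check the trap number, bail out if the process was killed, otherwise look up the routine by the number in **%eax** and store its result back into the trap frame) can be modeled in plain user-space C. This is a toy sketch, not xv6 code: every `toy_` / `TOY_` name, the two system call numbers, and the sample values are assumptions chosen for illustration, and where the real kernel calls **exit()** the sketch simply returns.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_T_SYSCALL  64  /* trap number the text uses for system calls */
#define TOY_SYS_GETPID 1   /* hypothetical numbering, for this sketch only */
#define TOY_SYS_UPTIME 2
#define TOY_NELEM(x) (sizeof(x) / sizeof((x)[0]))

/* Cut-down trap frame: only the two fields the dispatch path reads. */
struct toy_trapframe {
  unsigned int eax;     /* syscall number on entry, return value on exit */
  unsigned int trapno;  /* which trap brought us into the "kernel" */
};

struct toy_proc {
  int pid;
  int killed;
  struct toy_trapframe *tf;
};

/* Stand-in for the current process (cp in the text); values are made up. */
static struct toy_proc toy_cp = { 7, 0, NULL };
static int toy_ticks = 42;

static int toy_sys_getpid(void) { return toy_cp.pid; }
static int toy_sys_uptime(void) { return toy_ticks; }

/* Table indexed by system call number; unlisted slots stay NULL. */
static int (*toy_syscalls[])(void) = {
  [TOY_SYS_GETPID] = toy_sys_getpid,
  [TOY_SYS_UPTIME] = toy_sys_uptime,
};

/* Look up the requested routine and put its result in the eax slot. */
static void toy_syscall(void)
{
  assert(toy_cp.tf != NULL);
  unsigned int num = toy_cp.tf->eax;
  if (num < TOY_NELEM(toy_syscalls) && toy_syscalls[num])
    toy_cp.tf->eax = (unsigned int)toy_syscalls[num]();
  else
    toy_cp.tf->eax = (unsigned int)-1;  /* unknown system call */
}

/* Returns 1 if the trap was a system call trap, 0 for anything else. */
static int toy_trap(struct toy_trapframe *tf)
{
  if (tf->trapno != TOY_T_SYSCALL)
    return 0;
  if (toy_cp.killed)
    return 1;        /* the real kernel calls exit() here instead */
  toy_cp.tf = tf;
  toy_syscall();
  return 1;
}
```

Filling a `toy_trapframe` with `eax = TOY_SYS_GETPID` and `trapno = TOY_T_SYSCALL` and passing it to `toy_trap()` leaves the toy process id in `eax`; an out-of-range number leaves -1 there, mirroring the "unknown sys call" case.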
+ +```c +static int (*syscalls[])(void) = { +[SYS_chdir] sys_chdir, +[SYS_close] sys_close, +[SYS_dup] sys_dup, +[SYS_exec] sys_exec, +[SYS_exit] sys_exit, +[SYS_fork] sys_fork, +[SYS_fstat] sys_fstat, +[SYS_getpid] sys_getpid, +[SYS_kill] sys_kill, +[SYS_link] sys_link, +[SYS_mkdir] sys_mkdir, +[SYS_mknod] sys_mknod, +[SYS_open] sys_open, +[SYS_pipe] sys_pipe, +[SYS_read] sys_read, +[SYS_sbrk] sys_sbrk, +[SYS_sleep] sys_sleep, +[SYS_unlink] sys_unlink, +[SYS_wait] sys_wait, +[SYS_write] sys_write, +}; + +void +syscall(void) +{ + int num; + + num = cp->tf->eax; + if(num >= 0 && num < NELEM(syscalls) && syscalls[num]) + cp->tf->eax = syscalls[num](); + else { + cprintf("%d %s: unknown sys call %d\n", + cp->pid, cp->name, num); + cp->tf->eax = -1; + } +} +``` +File: **syscall.c** + +## Vectoring To The System Call + +Once we finally get to the **syscall()** routine in **syscall.c**, not much +work is left to do (see above). The system call number has been passed to us +in the register **%eax**, and now we unpack that number from the trap frame +and use it to call the appropriate routine as defined in the system call table +**syscalls[]**. Pretty much all operating systems have a table similar to this +to define the various system calls they support. After carefully checking that +the system call number is in bounds, the pointed-to routine is called to +handle the call. For example, if the system call **read()** was called by the +user, the routine **sys_read()** will be invoked here. The return value, you +might note, is stored in **%eax** to pass back to the user. + +## The Return Path + +The return path is pretty easy. First, the system call returns an integer +value, which the code in **syscall()** grabs and places into the **%eax** +field of the trap frame. The code then returns into **trap()**, which simply +returns into where it was called from in the assembly trap handler. + +```c + # Return falls through to trapret... 
+.globl trapret +trapret: + popal + popl %es + popl %ds + addl $0x8, %esp # trapno and errcode + iret +``` +File: **trapasm.S** + +This return code doesn't do too much, just making sure to pop the relevant +values off the stack to restore the context of the running process. Finally, +one more special instruction is executed: **iret**, or the **return-from-trap** +instruction. This instruction is similar to a return from a procedure call, +but simultaneously lowers the privilege level back to user mode and jumps back +to the instruction immediately following the **int** instruction used to +invoke the system call, restoring all the state that has been saved into the +trap frame. At this point, the user stub for **read()** (as seen in the +**usys.S** code) is run again, which just uses a normal +return-from-procedure-call instruction (**ret**) in order to return to the +caller. + +## Summary + +We have seen the path in and out of the kernel on a system call. As you can +tell, it is much more complex than a simple procedure call, and requires a +careful protocol on the part of the OS and hardware to ensure that application +state is properly saved and restored on entry and return. As always, the +concept is easy; with operating systems, the devil is always in the details. + +