OS self-learning material. Powered by ChatGPT4 >w<
Based on Fall 2022.
Lectures: https://www.youtube.com/watch?v=4FpG1DcvHzc&list=PLMcZASLnrVs5qyBC-dc4bFRtC8ROPYmS0&index=2
Website: https://inst.eecs.berkeley.edu/~cs162/fa22/
BaseCode: https://github.com/Berkeley-CS162
Basic usage of C, combined with I/O, an editor, gdb, and other useful tools.
Personally, I would recommend going through these before we start:
- basic vim usage: http://www2.geog.ucl.ac.uk/~plewis/teaching/unix/vimtutor
- tmux usage (in Chinese): https://www.ruanyifeng.com/blog/2019/10/tmux.html
Also, here are some tips:
Why can we get an entire string by using `char *`?
In C, a string is essentially an array of characters terminated by a NUL byte, but it is typically represented as a pointer to its first element: a `char *`. So a `char *` variable that points at such an array is effectively a string.
The `strcmp` function compares two strings. It expects two pointers, one to the beginning of each string (two `char *` values). When you call `strcmp(wc1->word, wc2->word)`, you are saying "compare the string that begins at `wc1->word` with the string that begins at `wc2->word`".
If you dereferenced the pointers with `*wc1->word` and `*wc2->word`, you would be passing the `char` values (the first character of each string) stored at those memory locations instead of the pointers themselves, which is not what `strcmp` expects.
In other words, `strcmp` doesn't want to know what the first characters of the strings are; it wants to know where the strings start. By giving it `char *` arguments, you provide that information.
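To make the difference concrete, here is a minimal sketch. The `WordCount` struct below is a simplified, hypothetical stand-in for the homework's type, and the two helper names are made up for illustration.

```c
#include <string.h>

/* Hypothetical, simplified stand-in for the homework's WordCount struct. */
typedef struct WordCount {
  char *word; /* points at the first character of a NUL-terminated string */
  int count;
} WordCount;

/* Returns nonzero when both entries hold the same word.
 * strcmp receives the two char * values, i.e. where each string starts. */
int same_word(const WordCount *wc1, const WordCount *wc2) {
  return strcmp(wc1->word, wc2->word) == 0;
}

/* Returns nonzero when only the first characters match.
 * *wc1->word is a single char, so this compares one byte, not the string. */
int same_first_char(const WordCount *wc1, const WordCount *wc2) {
  return *wc1->word == *wc2->word;
}
```

With the words "apple" and "avocado", `same_first_char` reports a match while `same_word` does not, which is why `strcmp` needs the pointers rather than the dereferenced characters.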
Why does initialization take `WordCount **wclist` instead of just `WordCount *wclist`?
The double pointer `WordCount **wclist` is used here because the function `init_words` needs to modify the pointer `wclist` itself, not the data it points to.
In C, function arguments are passed by value. If you passed a single pointer `WordCount *wclist` to a function and then assigned to `wclist` inside that function, you would only change a copy of the pointer, not the caller's pointer.
To modify the caller's pointer (for example, to point it at a different memory location or to set it to NULL), you need to pass a pointer to that pointer, hence `WordCount **wclist`.
In the given code snippet, the line `*wclist = NULL;` sets the caller's pointer (whose address was passed into `init_words`) to NULL. If you had passed `wclist` as a single pointer, this operation would not affect the caller's pointer outside of the function.
To summarize, if you just want to change the data that the pointer points to, you can pass a single pointer. But if you need to change what the pointer itself points to, you should pass a double pointer.
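A small sketch of both cases follows. The struct layout and the `reset_copy` helper are assumptions made for illustration; only `init_words` and `*wclist = NULL;` come from the notes above.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the homework's WordCount struct. */
typedef struct WordCount {
  char *word;
  int count;
  struct WordCount *next;
} WordCount;

/* Receives a copy of the caller's pointer; the assignment changes only
 * that local copy and is invisible to the caller. */
void reset_copy(WordCount *wclist) {
  wclist = NULL;
  (void)wclist; /* silence "set but not used" warnings */
}

/* Receives the address of the caller's pointer, so writing through it
 * changes the caller's pointer itself. */
void init_words(WordCount **wclist) {
  *wclist = NULL;
}
```

Calling `reset_copy(list)` leaves `list` unchanged in the caller, whereas `init_words(&list)` sets it to NULL.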
How to use `malloc` in C?
The `new` keyword is used in C++ to dynamically allocate memory. It is not part of the C language.
In C, you use the `malloc()`, `calloc()`, or `realloc()` functions to allocate memory dynamically. Here's how you might do it for a `WordCount` struct:
*wclist = (WordCount*)malloc(sizeof(WordCount));
Remember to check the result of `malloc()`, because it returns `NULL` if the allocation fails. Also, when you're done with the memory, release it with `free()` to avoid memory leaks.
Don't forget to `#include <stdlib.h>` at the top of your file, as `malloc`, `calloc`, `realloc`, and `free` are all declared in that header.
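Putting the allocation, the NULL check, and the matching `free()` together might look like the sketch below. The struct layout and the helper names `new_word` and `free_word` are assumptions for illustration, not the homework's actual API.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the homework's WordCount struct. */
typedef struct WordCount {
  char *word;
  int count;
  struct WordCount *next;
} WordCount;

/* Allocates one node holding a private copy of word.
 * Returns NULL if either allocation fails. */
WordCount *new_word(const char *word) {
  WordCount *wc = malloc(sizeof *wc);
  if (wc == NULL) return NULL;
  wc->word = malloc(strlen(word) + 1); /* +1 for the terminating NUL */
  if (wc->word == NULL) {
    free(wc); /* do not leak the node when the second allocation fails */
    return NULL;
  }
  strcpy(wc->word, word);
  wc->count = 1;
  wc->next = NULL;
  return wc;
}

/* Releases everything new_word allocated; accepts NULL. */
void free_word(WordCount *wc) {
  if (wc == NULL) return;
  free(wc->word);
  free(wc);
}
```

Every successful `malloc` here has exactly one matching `free`, including on the failure path.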
Why do I have to run `infile = fopen(argv[optind], "r");` twice if I want to read the file twice?
Opening a file with `fopen()` returns a `FILE` pointer; the stream behind it tracks the current position in the file, among other things. When you read from the file, the position moves forward, so once you have read everything, the position is at the end. If you try to read again without resetting the position, you won't get any data because you are already at the end of the file.
One way to reset the file position to the beginning is to use `fseek()`, `rewind()`, or `fsetpos()`. For example:
infile = fopen(argv[optind], "r");
// Do something
rewind(infile);
// Do something else
Another way, as in the question above, is to close the file with `fclose()` and then open it again with `fopen()`. This also resets the file position to the beginning:
infile = fopen(argv[optind], "r");
// Do something
fclose(infile);
infile = fopen(argv[optind], "r");
// Do something else
fclose(infile);
Both of these ways work, but the first one is often preferable because it doesn't involve the overhead of closing and opening the file again. However, if you're done with the file after the first operation and you don't need to use it again until some time later, it might be better to close it as soon as you're done with it to free up system resources. In such a case, you would need to open the file again before you can use it.
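The effect of the file position can be checked with a small sketch like the one below. The helper names are hypothetical; the only library calls used are `fgetc` and `rewind`.

```c
#include <stdio.h>

/* Counts the bytes from the current position to the end of the stream. */
long count_bytes(FILE *f) {
  long n = 0;
  while (fgetc(f) != EOF) n++;
  return n;
}

/* Reads the stream to the end, then counts again.
 * Without rewind() the second pass starts at the end and sees nothing. */
long size_on_second_pass(FILE *f, int do_rewind) {
  count_bytes(f);            /* first pass leaves the position at the end */
  if (do_rewind) rewind(f);  /* move back to the beginning, clear EOF flag */
  return count_bytes(f);
}
```

For a stream holding 12 bytes, `size_on_second_pass(f, 0)` returns 0, while `size_on_second_pass(f, 1)` returns 12 once the stream has been positioned at the start.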
Also, here's a full explanation of the hint:
The hint provides a series of Unix commands chained together with pipes (`|`) that accomplish a goal similar to your program: count the frequency of words in a file and then sort them. Here's what each command does:
- `cat <filename>`: prints the content of the file to standard output (stdout).
- `tr " " "\n"`: replaces all spaces with newline characters, effectively putting each word on its own line.
- `tr -s "\n"`: squeezes multiple consecutive newline characters into a single newline.
- `tr "[:upper:]" "[:lower:]"`: translates all uppercase characters to lowercase.
- `tr -d -C "[:lower:]\n"`: deletes every character that is not a lowercase letter or a newline. This removes punctuation and other non-alphabetical characters.
- `sort`: sorts its input lines.
- `uniq -c`: collapses adjacent duplicate lines into one and prefixes each line with its number of occurrences (which is why `sort` runs first).
- `sort -n`: sorts its input lines numerically.
Specifically, in `tr -d -C "[:lower:]\n"`:
The `tr` command stands for "translate"; it is used on Unix systems to translate, squeeze, and/or delete characters from standard input, writing the result to standard output.
In `tr -d -C "[:lower:]\n"`, there are two key parts to understand:
- The `-d` option: tells `tr` to delete characters.
- The `-C` option: complements the set of characters, meaning it inverts the set. It selects all characters that are NOT in the specified set.
So the command `tr -d -C "[:lower:]\n"` means "delete all characters that are not lowercase letters or newline characters".
To understand this in more detail:
- `[:lower:]` matches all lowercase letters.
- `\n` matches newline characters.
The `-C` option inverts this set, so it matches all characters that are not lowercase letters and not newline characters. The `-d` option then deletes these matched characters.
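For comparison, the last two `tr` stages (lowercasing, then deleting everything outside `[:lower:]\n`) can be sketched in C as a single pass over a string. The function name is hypothetical, and the sketch assumes the default "C" locale, where `islower` accepts only `a` to `z`.

```c
#include <ctype.h>
#include <stddef.h>

/* Lowercases s in place and drops every byte that is neither a lowercase
 * letter nor '\n'. Returns the length of the resulting string. */
size_t keep_lower_and_newline(char *s) {
  size_t out = 0;
  for (size_t in = 0; s[in] != '\0'; in++) {
    /* cast to unsigned char: the ctype functions require it */
    unsigned char c = (unsigned char)tolower((unsigned char)s[in]);
    if (islower(c) || c == '\n') s[out++] = (char)c;
  }
  s[out] = '\0';
  return out;
}
```

Given `"Hello, World!\nIt's 2022.\n"`, the buffer ends up holding `"helloworld\nits\n"`. Note that spaces are deleted here too; in the shell pipeline they have already been turned into newlines by the earlier `tr " " "\n"` stage.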
In Unix-based systems, each resource limit of a process has two values: a soft limit and a hard limit.
Soft Limit: This is the value that the kernel enforces for the corresponding resource. A process can raise or lower its soft limit freely, as long as it stays at or below the hard limit. What happens when a process hits the soft limit depends on the resource: for open files (`RLIMIT_NOFILE`), the call that would exceed it fails with the error `EMFILE`, while for CPU time (`RLIMIT_CPU`), the kernel sends a signal (`SIGXCPU`) that terminates the process by default.
Hard Limit: This is the maximum value to which the soft limit can be raised; it acts as a ceiling for the soft limit. An unprivileged process can only lower its hard limit, and once lowered it cannot be raised again by that process; raising it requires superuser (root) privileges.
For example, suppose the limit for open files has a soft limit of 1024 and a hard limit of 2048. By default, a process can have up to 1024 files open simultaneously. It can raise its soft limit up to 2048 (the hard limit) if it needs to, and lower it again later. Going above 2048 requires superuser intervention.
These limits are used for managing resources on a system and ensuring that a single user or process can't consume too much of a resource and thereby impact other users or processes. They also provide a way for users to manage their own resource use.
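As a rough sketch of how a process adjusts its own soft limit for open files, the helpers below use `getrlimit` and `setrlimit` from `<sys/resource.h>`. The helper names are made up for illustration, and clamping to the hard limit is a design choice of this sketch rather than something the API does for you.

```c
#include <sys/resource.h>

/* Reads the current soft limit for open files into *soft.
 * Returns 0 on success, -1 on error. */
int get_soft_nofile(rlim_t *soft) {
  struct rlimit rlim;
  if (getrlimit(RLIMIT_NOFILE, &rlim) == -1) return -1;
  *soft = rlim.rlim_cur;
  return 0;
}

/* Sets the soft limit for open files, clamped to the hard limit.
 * The hard limit itself is left untouched.
 * Returns 0 on success, -1 on error. */
int set_soft_nofile(rlim_t wanted) {
  struct rlimit rlim;
  if (getrlimit(RLIMIT_NOFILE, &rlim) == -1) return -1;
  if (wanted > rlim.rlim_max) wanted = rlim.rlim_max; /* ceiling */
  rlim.rlim_cur = wanted;
  return setrlimit(RLIMIT_NOFILE, &rlim);
}
```

An unprivileged process can call `set_soft_nofile` with a smaller value and later restore the previous value, because both stay within the hard limit.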
Everything else we need to know is in `man getrlimit`.
Also, here are some tips:
if (getrlimit(RLIMIT_NOFILE, &rlim) == -1) perror("getrlimit");
`perror()` is a function provided by the C standard library that prints an error message corresponding to the current value of the `errno` variable. On failure, `getrlimit` returns -1 and sets `errno`.
In the call `perror("getrlimit")`, the string `"getrlimit"` is passed as an argument to `perror()`. This string provides additional context about the error being reported, typically the name of the call that failed.
When `perror()` is called, it prints the provided string, a colon and a space, and then a descriptive message for the current value of `errno` to the standard error stream (stderr). For example, it might output something like:
getrlimit: Invalid argument
The specific error message depends on the value of `errno`, which is set by the system or library function that encountered the error. In the case of `getrlimit()`, it could indicate an invalid resource argument (`EINVAL`) or a pointer outside the accessible address space (`EFAULT`).
Using `perror()` is a convenient way to print descriptive error messages for system or library function failures, so you can tell what went wrong during program execution.
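A minimal sketch of the check-then-report pattern, wrapped in a hypothetical helper so the limits can be used by the caller:

```c
#include <stdio.h>
#include <sys/resource.h>

/* Fills *soft and *hard with the open-file limits of the calling process.
 * Returns 0 on success; on failure reports the error via perror and
 * returns -1. */
int get_nofile_limits(rlim_t *soft, rlim_t *hard) {
  struct rlimit rlim;
  if (getrlimit(RLIMIT_NOFILE, &rlim) == -1) { /* -1 signals failure */
    perror("getrlimit"); /* prints "getrlimit: <message for errno>" */
    return -1;
  }
  *soft = rlim.rlim_cur;
  *hard = rlim.rlim_max;
  return 0;
}
```

Because `rlim_t` is an unsigned type whose exact width varies, cast the values (for example to `unsigned long long`) before printing them with `printf`.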
Manual: http://www.unknownroad.com/rtfm/gdbtut/gdbtoc.html
Official documentation: https://www.sourceware.org/gdb/documentation/
Also, here are the tips:
printf("%i%p", ...)
In C's `printf` function, `%i` and `%p` are format specifiers used to display integers and pointers, respectively.
- `%i`: prints a signed integer in decimal (in `printf` it behaves the same as `%d`). The corresponding argument should be an `int`; use `%u` for an `unsigned int`.
- `%p`: prints a pointer, typically as the memory address it holds (the exact output format is implementation-defined). The corresponding argument should be a pointer, strictly speaking a `void *`.
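Here is a small sketch of both specifiers. It formats into a buffer with `snprintf` (same format rules as `printf`) so the result can be inspected; the function name is hypothetical.

```c
#include <stdio.h>

/* Formats a signed int with %i and the address held by ptr with %p.
 * Returns what snprintf returns: the number of characters the full
 * output needs, not counting the terminating NUL. */
int format_int_and_pointer(char *buf, size_t size, int value, int *ptr) {
  /* %p expects a void *, so the pointer is cast explicitly */
  return snprintf(buf, size, "%i %p", value, (void *)ptr);
}
```

For `value = -42`, the buffer starts with `-42 ` followed by the pointer's textual form, which differs between platforms and runs.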
Difference between `step` and `next`:
- step (`s`): The `step` command executes the program one source line at a time. If the current line contains a function call, GDB enters that function, moves to its first line, and pauses there so you can step through its code as well. (This applies to functions that have line number information; otherwise `step` acts like `next`.)
- next (`n`): The `next` command executes the current line without stepping into any function calls. If the current line contains a function call, GDB runs the entire function as a single step and pauses at the next line after the call.
Also, here's a demo with explanation.
fudanicpc@cat:~/Desktop/student0-main/hw-intro$ gdb map
Reading symbols from map...done.
(gdb) break map.c:6
# 0x699: The memory address where line 6 resides in the compiled program.
Breakpoint 1 at 0x699: file map.c, line 6.
(gdb) break map.c:7
Note: breakpoint 1 also set at pc 0x699.
Breakpoint 2 at 0x699: file map.c, line 7.
(gdb) break recurse.c:7
Breakpoint 6 at 0x6f9: file recurse.c, line 7.
(gdb) run
Starting program: /home/fudanicpc/Desktop/student0-main/hw-intro/map
.....
(gdb) run
Starting program: /home/fudanicpc/Desktop/student0-main/hw-intro/map
Breakpoint 1, main (argc=1, argv=0x7fffffffdee8) at map.c:16
16 volatile int i = 0;
(gdb) print(i)
$1 = 0
(gdb) print(i)
$2 = 0
(gdb) p i
$3 = 0
(gdb) p x
No symbol "x" in current context.
(gdb) backtrace
#0 main (argc=1, argv=0x7fffffffdee8) at map.c:19
(gdb) kill
Kill the program being debugged? (y or n) y
(gdb)
I think we need some prerequisites here:
`objdump` is a command-line program that displays various information about object files on Unix-like operating systems. This includes relocatable object files and executable files. Each file type below carries a different kind of information:
- Executable file: When `objdump` is used on an executable file, it can show the header information, assembly-level instructions, and section information, among others. You can control what it shows through various flags:
  - `-h` displays the section headers
  - `-d` disassembles the executable sections
  - `-s` displays the full contents of all sections requested
  - `-j` restricts the output to the specified section
- Object (*.o) file: Object files are typically generated by a compiler and consist of machine code along with data used by that code. `objdump` can show you similar information as for executable files: header info, assembly code, section info, etc. Object files typically represent modules of a program that are linked together to create the final executable.
- Assembly (*.s) file: This is a file written in assembly language, a low-level programming language. `objdump` wouldn't usually be used directly on an assembly file, as it's not an object file or an executable file. Rather, assemblers like `as` are used to translate assembly files into object files. `objdump` can then be used on the resulting object file to display its contents.
The main difference between these files is their role in the process of going from source code to executable:
- The `.s` file is the assembly language translation of your source code. It is a human-readable text file.
- The `.o` file is the assembled version of your `.s` file (or compiled directly from your source code). It contains machine code, but isn't yet a complete program.
- The executable file is the linked version of one or more `.o` files. It is a complete program that can be run by the operating system.
Remember, the output of `objdump` can be quite extensive and somewhat hard to read unless you're familiar with assembly language and the layout of object files and executables.
Let's take a peek at map's assembly code:
- `.file "map.c"`: Specifies the name of the source file.
- `.text`: Indicates the start of the text (code) section.
- `.comm foo,4,4`: Declares a common symbol `foo` with a size of 4 bytes and an alignment of 4 bytes. Common symbols are uninitialized global variables that may be declared in multiple translation units; the final definition is resolved during linking.
- `.globl stuff`: Declares the symbol `stuff` as a global symbol, meaning it can be accessed from other translation units.
- `.data`: Indicates the start of the data section.
- `.align 4`: Aligns the following data on a 4-byte boundary.
- `.type stuff, @object`: Specifies that the symbol `stuff` is of type "object".
- `.size stuff, 4`: Specifies the size of the `stuff` object as 4 bytes.
- `stuff: .long 7`: Defines the `stuff` object and initializes it with the value 7.
- `.text`: Indicates the start of the text (code) section again.
- `.globl main`: Declares the symbol `main` as a global symbol.
- `.type main, @function`: Specifies that the symbol `main` is of type "function".
- `.LFB5`: A compiler-generated local label that marks the beginning of the function body (here `main`); it is not a function of its own.
- `.cfi_startproc`: Specifies the start of a new procedure for Call Frame Information (CFI).
- `pushq %rbp`: Pushes the value of the base pointer onto the stack.
- `.cfi_def_cfa_offset 16`: Defines the Canonical Frame Address (CFA) offset.
- `.cfi_offset 6, -16`: Specifies the offset of the saved base pointer.
- `movq %rsp, %rbp`: Copies the value of the stack pointer into the base pointer.
- `.cfi_def_cfa_register 6`: Defines the base pointer as the Canonical Frame Address (CFA) register.
- `subq $48, %rsp`: Allocates 48 bytes of space on the stack for local variables.
- `movl %edi, -36(%rbp)`: Stores the first function argument (`%edi`) at a specific location on the stack.
- `movq %rsi, -48(%rbp)`: Stores the second function argument (`%rsi`) at a specific location on the stack.
- `movl $0, -20(%rbp)`: Stores the value 0 at a specific location on the stack.
- `movl $100, %edi`: Moves the value 100 into the `%edi` register, which carries the first argument of the next call.
- `call malloc@PLT`: Calls the `malloc` function to allocate 100 bytes of memory.
- `movq %rax, -16(%rbp)`: Stores the value returned by `malloc` at a specific location on the stack.
- `movl $3, %edi`: Moves the value 3 into the `%edi` register.
- `call recur@PLT`: Calls the `recur` function.
- `movl $0, %eax`: Moves the value 0 into the `%eax` register, which holds the return value.
- `leave`: Tears down the stack frame by copying the base pointer into the stack pointer and then popping the saved base pointer.
- `.cfi_def_cfa 7,
Then the corresponding object file (already processed by `objdump -D`):
- Disassembly of section .text: This section contains the disassembly of the code section.
  - `<main>:`, `0: push %rbp`, `1: mov %rsp, %rbp`, etc.: These lines are the disassembled instructions of the `main` function. Each line shows a machine instruction along with its hexadecimal address (in an object file, an offset from the start of the section).
  - `push %rbp`, `mov %rsp, %rbp`, `sub $0x30, %rsp`, etc.: These instructions set up the function's stack frame by saving the base pointer, adjusting the stack pointer, and allocating space for local variables.
- Disassembly of section .data: `<stuff>:`, `0: 07`, `1: 00 00`, etc.: These lines are the "disassembly" of the `.data` section. What matters here is the hexadecimal values of the bytes stored in the `stuff` object.
- Disassembly of section .comment: This section contains compiler-generated comments, such as the compiler version string.
- Disassembly of section .eh_frame: This section contains information related to Exception Handling (EH) and stack unwinding.
Or use `objdump -h` to show the ELF section headers:
fudanicpc@cat:~/Desktop/student0-main/hw-intro$ objdump -h map.o
map.o: file format elf64-x86-64
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000043 0000000000000000 0000000000000000 00000040 2**0
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
1 .data 00000004 0000000000000000 0000000000000000 00000084 2**2
CONTENTS, ALLOC, LOAD, DATA
2 .bss 00000000 0000000000000000 0000000000000000 00000088 2**0
ALLOC
3 .comment 00000025 0000000000000000 0000000000000000 00000088 2**0
CONTENTS, READONLY
4 .note.GNU-stack 00000000 0000000000000000 0000000000000000 000000ad 2**0
CONTENTS, READONLY
5 .eh_frame 00000038 0000000000000000 0000000000000000 000000b0 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
Here's an explanation of the columns:
- `Idx`: The index of the section.
- `Name`: The name of the section.
- `Size`: The size of the section in bytes (hexadecimal).
- `VMA`: The virtual memory address of the section.
- `LMA`: The load memory address of the section.
- `File off`: The offset of the section in the file, that is, the distance in bytes between the beginning of the file and the start of the section.
- `Algn`: The alignment requirement of the section, written as a power of two (`2**2` means 4 bytes).
Here's a breakdown of the sections in the `map.o` file:
- `.text` (code/text segment): This section contains the executable code of the program. It is marked as CONTENTS, ALLOC, LOAD, RELOC, READONLY, and CODE. It has a size of `00000043` bytes (hexadecimal, i.e. 67 bytes).
- `.data`: This section contains initialized global and static data. It is marked as CONTENTS, ALLOC, LOAD, and DATA. It has a size of `00000004` bytes.
- `.bss` (Block Started by Symbol): This section represents uninitialized global and static data. It is marked as ALLOC. It has a size of `00000000` bytes.
- `.comment`: This section contains compiler-generated comments. It is marked as CONTENTS and READONLY. It has a size of `00000025` bytes.
- `.note.GNU-stack`: This empty marker section tells the linker whether the program needs an executable stack. It is marked as CONTENTS and READONLY. It has a size of `00000000` bytes.
- `.eh_frame`: This section contains exception handling and stack unwinding information. It is marked as CONTENTS, ALLOC, LOAD, RELOC, READONLY, and DATA. It has a size of `00000038` bytes.
Also, please note that the output of `objdump -h` for an object file does not include a section for the heap, because the heap is managed at runtime by the dynamic memory allocation functions (`malloc`, `free`, etc.) provided by the C runtime library.
In the output above for `map.o`, you can see sections such as `.text`, `.data`, `.bss`, `.comment`, `.note.GNU-stack`, and `.eh_frame`. These sections hold the program's code, initialized data, uninitialized data, compiler comments, the stack-permission marker, and stack unwinding information, respectively.
Heap memory, on the other hand, is allocated and released while the program executes, so it is not represented as a section in the object file itself.
Also:
fudanicpc@cat:~/Desktop/student0-main/hw-intro$ objdump -t map.o
map.o: file format elf64-x86-64
SYMBOL TABLE:
0000000000000000 l df *ABS* 0000000000000000 map.c
0000000000000000 l d .text 0000000000000000 .text
0000000000000000 l d .data 0000000000000000 .data
0000000000000000 l d .bss 0000000000000000 .bss
0000000000000000 l d .note.GNU-stack 0000000000000000 .note.GNU-stack
0000000000000000 l d .eh_frame 0000000000000000 .eh_frame
0000000000000000 l d .comment 0000000000000000 .comment
0000000000000004 O *COM* 0000000000000004 foo
0000000000000000 g O .data 0000000000000004 stuff
0000000000000000 g F .text 0000000000000043 main
0000000000000000 *UND* 0000000000000000 _GLOBAL_OFFSET_TABLE_
0000000000000000 *UND* 0000000000000000 malloc
0000000000000000 *UND* 0000000000000000 recur
In this symbol table, each line represents a symbol entry. Here's an explanation of each column:
- The first column is the symbol's value, usually its address or offset within its section. In a relocatable object file these are mostly zeros because the final addresses haven't been determined yet. Once the object file is linked into an executable or a library, each symbol is assigned its final address.
- The second column holds flag characters. `l` indicates a local symbol and `g` a global symbol; if that position is blank, the symbol is neither local nor global (as for the undefined symbols here). Further flags describe the symbol's type: `F` indicates a function, `O` an object (data), `f` a file name, and `d` a debugging symbol (used here for the section symbols).
- The third column is the section the symbol belongs to. `*ABS*` means the symbol is absolute (not tied to any section), `*UND*` means it is referenced in this file but defined elsewhere, and `*COM*` marks a common symbol.
- The fourth column is the size of the symbol in bytes, in hexadecimal. For example, the size of the `main` function is `43` in hexadecimal, which is 67 bytes in decimal.
- The last column is the symbol's name.
Based on this table, you can see the symbols `foo`, `stuff`, `main`, `_GLOBAL_OFFSET_TABLE_`, `malloc`, and `recur`, along with their respective attributes.
https://blog.csdn.net/ingsuifon/article/details/125507849
Also, I personally recommend reading CSAPP chapter 3 beforehand if, like me, you have little knowledge of machine-level programming. : )
This is the pintos manual: https://cs162.org/static/proj/pintos-docs/
Please do set the environment variable:
export PATH=/home/fudanicpc/Desktop/group0-main/src/utils:$PATH
export PATH=/home/fudanicpc/Desktop/student0-main/proj-pregame/src/utils:$PATH
fudanicpc@cat:~/Desktop/student0-main/proj-pregame/src/utils$ pwd
export PATH=/home/fudanicpc/Desktop/student0-main/proj-pregame/src/utils:$PATH
Tips:
- The ebp register, also known as the base pointer or frame pointer, usually points to the base of the current function's stack frame.
- ebp + 4 is used to access the return address of the function.
- ebp + 8 is used to access the first argument passed to the function (argc in this case).
- ebp + 12 (or ebp + 0xc in hexadecimal) is used to access the second argument (argv in this case).
Tips:
Remember to modify src/utils/pintos-gdb: set GDBMACROS to the path of the gdb-macros file so that `debugpintos` is available. The macros file will look something like: https://github.com/Berkeley-CS162/vagrant/blob/15096a1cac8e77a252bd24a4308355dd16d67560/modules/cs162/files/shell/bin/gdb-macros
Official Website: https://pdos.csail.mit.edu/6.S081/2020/
Environment Setup: https://zhuanlan.zhihu.com/p/464386728
My Personal Note: click here
Operating Systems: Three Easy Pieces, friendly to beginners >w<
Official Website: https://pages.cs.wisc.edu/~remzi/OSTEP/
Codes in Each Chapter: https://github.com/remzi-arpacidusseau/ostep-code
My Personal Note: click here
Modern Operating Systems