# Which Libraries Make Syscalls?


---
## Aim

Investigate which libraries make syscalls by running `strace` on the following applications (taken from _Sysfinder_'s evaluation section).
- [ ] hello, world (golang)
- [ ] hello, world (c)
- [ ] `nginx`
- [ ] `redis` 
- [ ] `memcached`
- [ ] `SQLite`
- [ ] `lighttpd`
- [ ] `HAProxy`

All but the first two applications need to be tested under some example load. Luckily, all of them ship with benchmarks, which can be used to get a good approximation of the syscalls each application makes.

---
## Approach #1

(A benchmark might be needed: `strace` uses **dynamic analysis**, so it only sees the syscalls that are actually executed during the run.)

1. Run `strace` on main binary
2. Load trace into a `pandas` dataframe: only keep source application, syscall name
3. Do some exploratory data analysis
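The steps above can be sketched roughly as follows. This is a minimal sketch, assuming the trace was written with `strace -f -o FILE`, so that each line begins with a PID. The names `parse_strace` and `SAMPLE_LINES` are hypothetical, and the sample lines are illustrative rather than taken from a real run. It uses only the standard library.

```python
import re
from collections import Counter

# Assumption: the trace was written with `strace -f -o FILE ...`, so every
# line starts with a PID followed by the syscall name and its arguments.
STRACE_CALL = re.compile(r'^(?P<pid>\d+)\s+(?P<syscall>[a-z_][a-z0-9_]*)\(')


def parse_strace(lines, app):
    """Return one {"app", "pid", "syscall"} row per syscall entry line."""
    rows = []
    for line in lines:
        match = STRACE_CALL.match(line)
        if match:  # skips signals, exit notices and "<... x resumed>" lines
            rows.append({
                "app": app,
                "pid": int(match["pid"]),
                "syscall": match["syscall"],
            })
    return rows


# Illustrative lines only, not taken from a real run.
SAMPLE_LINES = [
    '4242 brk(NULL) = 0x1000',
    '4242 write(1, "Hello, World\\n", 13) = 13',
    '4242 read(3,  <unfinished ...>',
    '4242 <... read resumed>"", 4096) = 0',
    '4242 exit_group(0) = ?',
    '4242 +++ exited with 0 +++',
]

rows = parse_strace(SAMPLE_LINES, app="c-hello-world")
syscall_counts = Counter(row["syscall"] for row in rows)
print(syscall_counts.most_common())
```

The resulting list of rows can be passed straight to `pd.DataFrame(rows)` for the exploratory step.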

### What is `strace`?
- Userspace program for monitoring interactions between the kernel and userspace processes
- Gives a dump of syscalls and their arguments, but **not** the libraries they originate from

### What is `ltrace`?
- `ltrace` is like `strace` but for library calls: it can show which calls to different libraries are made
- For instance, running `ltrace -f -i -x '*' -l libc -o hwc.ltrace ./a.out` on a classic C hello world (below) produces a trace in `hwc.ltrace`.

```
$ head -n 10 hwc.ltrace

924843 [0x5b3e5bd24085] __libc_start_main@libc.so.6(0x5b3e5bd24149, 1, 0x7fff655dbb38, 0 > <unfinished ...>
924843 [0x73174822a233] __cxa_atexit@libc.so.6(0x731748529380, 0, 0, 0)                                                                                 = 0
  924843 [0x73174822a2b4] _init(1, 0x7fff655dbb38, 0x7fff655dbb48, 0x5b3e5bd24000)                                                                        = 0
  924843 [0x73174822a304] frame_dummy(1, 0x7fff655dbb38, 0x7fff655dbb48, 0x5b3e5bd26db8 <unfinished ...>
  924843 [0x73174822a304] register_tm_clones(1, 0x7fff655dbb38, 0x7fff655dbb48, 0x5b3e5bd26db8)                                                           = 0
924843 [0x73174822a304] <... frame_dummy resumed> )                                                                                                     = 0
  924843 [0x73174822a273] _dl_audit_preinit@ld-linux-x86-64.so.2(0x73174855d2e0, 0, 0x7fff655dbb48, 0x5b3e5bd26dc0)                                       = 0
  924843 [0x73174822a181] _setjmp@libc.so.6(0x7fff655dba40, 1, 0x7fff655dbb38, 0x5b3e5bd26dc0 <unfinished ...>
  924843 [0x73174822a181] __sigsetjmp@libc.so.6(0x7fff655dba40, 0, 0x7fff655dbb38, 0x5b3e5bd26dc0)                                                        = 0
  924843 [0x73174822a181] <... _setjmp resumed> )                                                                                                         = 0
```

Libraries seem to be prefixed with an `@`. A list of the libraries called can be obtained with

```
$ grep -o '@[a-zA-Z_.0-9-]*' hwc.ltrace | sort | uniq

@ld-linux-x86-64.so.2
@libc.so.6
```

(The leading `@` can be stripped by appending `| tr -d '@'` to the pipeline.)
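The same extraction can be done in Python, which also drops the `@`. This is a minimal sketch, assuming library names always appear as `symbol@library(` in the trace, as they do in the excerpt above. The helper name `libraries_in_trace` is hypothetical, and the sample lines are shortened copies of lines from `hwc.ltrace`.

```python
import re

# Capture the library name that follows "@" and precedes the opening "(".
LIB_SUFFIX = re.compile(r'@([A-Za-z0-9_.+-]+)\(')


def libraries_in_trace(lines):
    """Return the sorted, de-duplicated library names found in ltrace lines."""
    found = set()
    for line in lines:
        for match in LIB_SUFFIX.finditer(line):
            found.add(match.group(1))
    return sorted(found)


sample = [
    '924843 [0x73174822a233] __cxa_atexit@libc.so.6(0x731748529380, 0, 0, 0) = 0',
    '924843 [0x73174822a2b4] _init(1, 0x7fff655dbb38, 0x7fff655dbb48, 0x5b3e5bd24000) = 0',
    '924843 [0x73174822a273] _dl_audit_preinit@ld-linux-x86-64.so.2(0x73174855d2e0, 0, 0x7fff655dbb48, 0x5b3e5bd26dc0) = 0',
]

libraries = libraries_in_trace(sample)
print(libraries)
```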

=> Unfortunately, neither of these tells us **which library invoked the syscall**. More to be done here...

---
## Approach #2

- Use Program Counter data to try and estimate where syscalls were made

### Golang _"Hello, World"_

- Go, by default, builds statically linked binaries, so start with C as it's easier for now.


### C _"Hello, World"_

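- Run `strace` and `ltrace` on hello world, both with the `-i` flag. This gives wildly different addresses for the instruction pointer
    - Could be a result of having to run the program twice (with address space layout randomisation enabled, each run is likely to get a different memory layout)
    - Could be because the tools report the instruction pointer differently
    - Either way, comparing addresses across the two tools is unsuitable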
    
- `ltrace -S` will print out syscalls as well. Running `ltrace -f -i -S -o c-hw.ltrace ./main` gives

```ltrace
927283 [0x75e51f8039cb] SYS_brk(0)                                         = 0x6318a74bc000
927283 [0x75e51f804d2c] SYS_mmap(0, 8192, 3, 34)                           = 0x75e51f7dd000
927283 [0x75e51f8048cb] SYS_access("/etc/ld.so.preload", 04)               = -2
927283 [0x75e51f804b71] SYS_openat(0xffffff9c, 0x75e51f80d38f, 0x80000, 0) = 3
...
```

Loading the trace into a dataframe makes for easier analysis:

```python
import re

import pandas as pd

# A call that completed on one line: PID [ADDR] name(args) = return
pattern = r'(\d+) \[(0x[0-9a-f]+)\] (SYS_)?([A-Za-z0-9_]+)\(([^)]*)\)\s*=\s*(.+)'
# A call interrupted by nested output: PID [ADDR] name(args <unfinished ...>
unfinished_pattern = r'(\d+) \[(0x[0-9a-f]+)\] ([A-Za-z0-9_]+)\(([^)]*) <unfinished \.\.\.>'
# The matching completion: PID [ADDR] <... name resumed> ) = return
resumed_pattern = r'(\d+) \[(0x[0-9a-f]+)\] <\.\.\. ([A-Za-z0-9_]+) resumed>.*=\s*(.+)'


def parse_ltrace(fp: str) -> pd.DataFrame:
    entries = []
    unfinished_calls = {}  # (pid, raw call name) -> entry awaiting its return value

    with open(fp) as f:
        for line in f:
            if match := re.match(pattern, line):
                pid, ic_addr, syscall_prefix, call, params, return_value = match.groups()
                entries.append({
                    "pid": int(pid),
                    "ic_addr": ic_addr,
                    "call_type": "syscall" if syscall_prefix else "libcall",
                    "call": call,
                    "params": params,
                    "return": return_value.strip(),
                })
                continue

            if unfinished_match := re.match(unfinished_pattern, line):
                pid, ic_addr, call, params = unfinished_match.groups()
                is_syscall = call.startswith("SYS_")
                unfinished_calls[(pid, call)] = {
                    "pid": int(pid),
                    "ic_addr": ic_addr,
                    "call_type": "syscall" if is_syscall else "libcall",
                    # Drop the SYS_ prefix so names match the completed-call branch.
                    "call": call[len("SYS_"):] if is_syscall else call,
                    "params": params,
                    "return": "<unfinished>",
                }
                continue

            if res_match := re.match(resumed_pattern, line):
                pid, ic_addr, call, return_value = res_match.groups()
                # Resumed calls are appended here, so they appear in completion order.
                if (pid, call) in unfinished_calls:
                    entry = unfinished_calls.pop((pid, call))
                    entry["return"] = return_value.strip()
                    entries.append(entry)

    return pd.DataFrame(entries, columns=["pid", "ic_addr", "call_type", "call", "params", "return"])


c_hw_df = parse_ltrace("./c-hw/c-hw.ltrace")
c_hw_df.head()
```

| | pid | ic_addr | call_type | call | params | return |
|---|---|---|---|---|---|---|
| 0 | 927283 | 0x75e51f8039cb | syscall | brk | 0 | 0x6318a74bc000 |
| 1 | 927283 | 0x75e51f804d2c | syscall | mmap | 0, 8192, 3, 34 | 0x75e51f7dd000 |
| 2 | 927283 | 0x75e51f8048cb | syscall | access | "/etc/ld.so.preload", 04 | -2 |
| 3 | 927283 | 0x75e51f804b71 | syscall | openat | 0xffffff9c, 0x75e51f80d38f, 0x80000, 0 | 3 |
| 4 | 927283 | 0x75e51f8048fb | syscall | fstat | 3, 0x7fffb7b1fd40 | 0 |


## Deriving library calls from the trace

- Still non-trivial, especially since functions can be async...
- `ltrace` data alone may not provide enough information
- Not clear how to extract which system calls are made by which function...
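One possible heuristic for the last point is sketched below. It assumes that `ltrace -S` prints a library call as `<unfinished ...>` when it makes syscalls, then prints those syscalls, and then prints a `<... name resumed>` line. It attributes each syscall to the innermost library call still open in the same PID. The function name `attribute_syscalls` and the trace lines are hypothetical (the addresses are placeholders), and the sketch does not solve the async problem noted above.

```python
import re

# PID [ADDR] rest-of-line, in the format produced by `ltrace -f -i -S`.
LINE = re.compile(r'^\s*(?P<pid>\d+) \[0x[0-9a-f]+\] (?P<rest>.*)$')
# Call name, optionally followed by "@library", then the opening "(".
CALL = re.compile(r'^(?P<name>[A-Za-z0-9_]+)(?:@[^\s(]+)?\(')
RESUMED = re.compile(r'^<\.\.\. (?P<name>[A-Za-z0-9_]+) resumed>')


def attribute_syscalls(lines):
    """Pair each syscall with the innermost library call still open in its PID."""
    open_calls = {}  # pid -> stack of library calls awaiting "resumed"
    pairs = []
    for line in lines:
        header = LINE.match(line)
        if not header:
            continue
        stack = open_calls.setdefault(header["pid"], [])
        rest = header["rest"]
        if resumed := RESUMED.match(rest):
            # Close the library call if it is the innermost open one.
            if stack and stack[-1] == resumed["name"]:
                stack.pop()
            continue
        call = CALL.match(rest)
        if not call:
            continue
        name = call["name"]
        if name.startswith("SYS_"):
            pairs.append((name[len("SYS_"):], stack[-1] if stack else None))
        elif rest.rstrip().endswith("<unfinished ...>"):
            stack.append(name)
    return pairs


# Hypothetical trace with placeholder addresses, not output from a real run.
HYPOTHETICAL_TRACE = [
    '1 [0x1000] puts("Hello, World" <unfinished ...>',
    '1 [0x2000] SYS_write(1, "Hello, World\\n", 13) = 13',
    '1 [0x1000] <... puts resumed> ) = 13',
    '1 [0x3000] SYS_exit_group(0 <no return ...>',
]

pairs = attribute_syscalls(HYPOTHETICAL_TRACE)
print(pairs)
```

Syscalls paired with `None` were made outside any traced library call, for example by the dynamic loader during start-up.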

```python
c_hw_df[c_hw_df["call_type"] == "libcall"]
```

| | pid | ic_addr | call_type | call | params | return |
|---|---|---|---|---|---|---|
| 33 | 927283 | 0x6318a732e16b | libcall | puts | "Hello, World" | 13 |


- Shows that the `ic_addr` recorded for the only call into a shared library is `0x6318a732e16b`. This lies outside the address range of every system call made by the program (all of which begin with `0x75e51f`).
- This could mean that `puts` doesn't make a syscall, or that this approach doesn't work...

```python
c_hw_df[c_hw_df["call_type"]=="syscall"]["ic_addr"].unique()
```

```
array(['0x75e51f8039cb', '0x75e51f804d2c', '0x75e51f8048cb',
       '0x75e51f804b71', '0x75e51f8048fb', '0x75e51f804a2b',
       '0x75e51f804be8', '0x75e51f804bbe', '0x75e51f7ff12d',
       '0x75e51f7f3f22', '0x75e51f7f3f7e', '0x75e51f7f3ff6',
       '0x75e51f804dbb', '0x75e51f51d1c4', '0x75e51f804deb',
       '0x75e51f51738b', '0x75e51f4a9d99', '0x75e51f51d77b',
       '0x75e51f51c574'], dtype=object)
```
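One way to turn these addresses into library names would be to look each one up in the traced process's memory map. The sketch below assumes a copy of `/proc/<pid>/maps` captured from the same run is available (addresses will probably not match across runs). The helper names and the contents of `HYPOTHETICAL_MAPS` are made up for illustration and do not come from the trace above.

```python
def parse_maps(text):
    """Parse /proc/<pid>/maps-style text into (start, end, path) tuples."""
    regions = []
    for line in text.splitlines():
        # Fields: address range, perms, offset, dev, inode, optional pathname.
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else "[anonymous]"
        regions.append((start, end, path))
    return regions


def owner(addr, regions):
    """Return the path of the mapping containing addr (a hex string), or None."""
    value = int(addr, 16)
    for start, end, path in regions:
        if start <= value < end:
            return path
    return None


# Hypothetical mappings; a real run would supply its own address ranges.
HYPOTHETICAL_MAPS = """\
555555554000-555555558000 r-xp 00000000 08:01 100 /home/user/main
7ffff7c00000-7ffff7e00000 r-xp 00000000 08:01 200 /usr/lib/x86_64-linux-gnu/libc.so.6
7ffff7fc0000-7ffff7ff0000 r-xp 00000000 08:01 300 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0 [stack]
"""

regions = parse_maps(HYPOTHETICAL_MAPS)
for address in ["0x555555555169", "0x7ffff7c9d1c4", "0x1"]:
    print(address, owner(address, regions))
```

With a real map, applying `owner` to the `ic_addr` column would label each syscall with the object file whose code issued it. That is usually the library containing the syscall wrapper, not necessarily the library that requested the operation.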

---
## Sources
1. https://en.wikipedia.org/wiki/Strace