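### Building the new Synonym_base
- Each description is stored as a `<verb phrase, object>` pair. The verb phrase should use the lemma (base form of the verb) wherever possible, e.g. "copy a file".
- In the manual, each syscall entry has a Name section and a Description section. Every syscall has a Name section that gives a brief description, e.g. `write - write to a file descriptor`. We call the pair taken from the Name section the `FirstLine` field. In principle every syscall has one `<verb phrase, object>` pair and therefore one row of data; if the first line contains two verb pairs, they are written as two separate rows.
- The Description section is a long, unstructured, detailed text that explains how the syscall works, which functional parameters it takes, and how to call it. We try to find at least one sentence with the same meaning as the `FirstLine`, turn it into a pair as well, and store it in the `Synonym` field. A `FirstLine` can have zero or more synonym pairs, all written on the same row with the `<>` pairs separated by `"|"`.
- Question: how fine-grained should the object of a pair be? Answer: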
```
check real user's permissions for a file => "<check,user's permission>"
```
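The field format described above (zero or more `<verb phrase, object>` pairs joined by `"|"`, with `<,>` as the empty placeholder) can be read back into Python tuples. The following is only a minimal sketch: the helper names `parse_pairs` and `format_pairs` are hypothetical and not part of the notebook, and it assumes that a verb phrase never contains a comma and that no pair contains `|`, `<`, or `>`.

```python
import re

# Hypothetical helpers (not from the notebook). Assumes the verb phrase has no
# comma and that "<", ">" and "|" never appear inside a pair.
PAIR_RE = re.compile(r"<([^<>,]*),([^<>]*)>")

def parse_pairs(field):
    """Split a field such as "<a,b>|<c,d>" into a list of (verb, object) tuples."""
    pairs = []
    for chunk in field.split("|"):
        m = PAIR_RE.fullmatch(chunk.strip())
        if m is None:
            continue  # skip malformed chunks such as "<>"
        verb, obj = m.group(1).strip(), m.group(2).strip()
        if verb or obj:  # "<,>" is the empty placeholder, not a real pair
            pairs.append((verb, obj))
    return pairs

def format_pairs(pairs):
    """Inverse of parse_pairs: an empty list becomes the "<,>" placeholder."""
    if not pairs:
        return "<,>"
    return "|".join(f"<{v},{o}>" for v, o in pairs)

print(parse_pairs("<check,user's permission>"))
print(parse_pairs("<,>"))
```

Keeping the placeholder out of the parsed list means an empty result can be used directly as "no pair available".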

Load the original rule_dataset and convert it to pair format.

In [10]:
import pandas as pd

df = pd.read_csv('../rule_dataset_final.csv') # 354 Syscall
# df = pd.read_csv('../rule_dataset.csv')       # 88  Syscall
# df = pd.read_csv('../proc.csv')       # 88  Syscall

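def clean_braces(x):
    # Cast to str first so non-string cells do not raise, then drop "()", commas, and spaces
    return str(x).replace("()", "").replace(",", "").replace(" ", "")

def add_braces(x):
    # Normalize first, then append "()" so every name looks like "read()"
    return clean_braces(x) + "()"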

df['Syscall'] = df['Syscall'].apply(add_braces) # or clean_braces
df.head(3)


Unnamed: 0,EntityType,ActionType,Source,Syscall,EnVerb,Sentence
0,DEVICE,DEVICE,man,io_cancel(),cancel,cancel an outstanding asynchronous I/O operation
1,DEVICE,DEVICE,man,io_destroy(),destroy,destroy an asynchronous I/O context
2,DEVICE,DEVICE,man,io_getevents(),read,read asynchronous I/O events from the completi...


In [57]:
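def make_firstline(v:str, sentence:str, default="<,>"):
    """Build a "<verb,object>" pair: everything after the first word starting with v is the object."""
    spl = sentence.split()
    for i,word in enumerate(spl):
        # Prefix match, so an inflected form such as "reads" still matches the lemma "read"
        if word.startswith(v):
            object_phrase = " ".join(spl[i+1:])
            return f"<{v},{object_phrase}>"
    # No word starts with v: fall back to the empty pair
    return default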
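To make the matching rule concrete, here is a small self-contained sketch that copies `make_firstline` from the cell above and runs it on three sentences. The sentences come from the dataset output in this notebook; the verb paired with the third sentence (`"set"`) is an assumption, since the `EnVerb` value for that row is not shown.

```python
def make_firstline(v: str, sentence: str, default: str = "<,>") -> str:
    spl = sentence.split()
    for i, word in enumerate(spl):
        if word.startswith(v):
            object_phrase = " ".join(spl[i + 1:])
            return f"<{v},{object_phrase}>"
    return default

examples = [
    ("make", "make process 0 idle"),
    ("get", "get or set value of an interval timer"),
    # Assumed EnVerb: "set" is not a prefix of "get/set", so nothing matches.
    ("set", "get/set I/O scheduling class and priority"),
]
results = [make_firstline(v, s) for v, s in examples]
for r in results:
    print(r)
# <make,process 0 idle>
# <get,or set value of an interval timer>
# <,>
```

The second and third results show why some generated `FirstLine` values still need manual cleanup: the object is simply the rest of the sentence, and a verb that appears only after a slash is not found.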

In [58]:
# New columns: FirstLine, Synonym
# Columns worth keeping: (EntityType, ActionType), Syscall

# syscall_88 = df['Syscall'].unique()
# df = df[df['Source'] == 'man']
df['FirstLine'] = df.apply(lambda r: make_firstline(r['EnVerb'],r['Sentence']), axis=1)
df['Synonym']   = "<,>"
df['SynonymSentence']   = ""
df = df[['Syscall', 'Sentence', 'FirstLine', 'Synonym', 'SynonymSentence']]
df.to_csv('proc.csv', index=False)
df

Unnamed: 0,Syscall,Sentence,FirstLine,Synonym,SynonymSentence
0,idle(),make process 0 idle,"<make,process 0 idle>","<,>",
1,move_pages(),move individual pages of a process to another ...,"<move,individual pages of a process to another...","<,>",
2,timer_gettime(),arm/disarm and fetch state of POSIX per,"<fetch,state of POSIX per>","<,>",
3,timer_delete(),delete a POSIX per,"<delete,a POSIX per>","<,>",
4,execveat(),execute program relative to a directory file d...,"<execute,program relative to a directory file ...","<,>",
...,...,...,...,...,...
74,ioprio_set(),get/set I/O scheduling class and priority,"<,>","<,>",
75,sched_getattr(),set and get scheduling policy and attributes,"<get,scheduling policy and attributes>","<,>",
76,get_robust_list(),get/set list of robust futexes,"<get,list of robust futexes>","<,>",
77,getitimer(),get or set value of an interval timer,"<get,or set value of an interval timer>","<,>",


In [None]:
'''
Syscall,Sentence,FirstLine,Synonym
read(),read from a file descriptor,"<read, file descriptor>","<>"
readlink(),read value of a symbolic link,"<read, value of a symbolic link>","<place, the contents of the symbolic link path>"
'''

### Statistics
- numOf: syscall 395, FirstLine verb (in Name), synonym verb (in Description)
- unimplemented: 18 (395+18=413)
- When computing statistics, leave out these two calls: they are not in the strace GitHub repository, although our ASG uses them: ['rm()', 'exec()']

In [94]:
import pandas as pd
dataset = pd.read_csv('./synonym_dataset.csv')
dataset = dataset[~dataset['Syscall'].isin(['rm()', 'exec()'])]
print('dataset len', len(dataset))
dataset.head()

dataset len 426


Unnamed: 0,Syscall,Sentence,FirstLine,Synonym,SynonymSentence
0,shutdown(),shut down part of a full-duplex connection,"<shut down,part of a full-duplex connection>","<,>",
1,recvmmsg(),receive multiple messages on a socket,"<receive,multiple messages>","<,>",
2,sendmmsg(),send multiple messages on a socket,"<send,multiple messages>","<transmit,multiple messages>",transmit multiple messages on a socket using a...
3,pipe2(),create pipe,"<create,pipe>","<,>",
4,mq_unlink(),remove a message queue,"<remove,a message queue>","<,>",


In [95]:
uq = dataset['Syscall'].unique()
len(dataset['Syscall'].unique())

395

In [96]:
# https://man7.org/linux/man-pages/man2/unimplemented.2.html 
unimplemented = '''afs_syscall, break, fattach, fdetach, ftime, getmsg, getpmsg,
       gtty, isastream, lock, madvise1, mpx, prof, profil, putmsg,
       putpmsg, security, stty, tuxcall, ulimit, vserver'''
unimplemented = unimplemented.split(',')
unimplemented = [f"{u.strip()}()" for u in unimplemented]
unimplemented = [u for u in unimplemented if u not in uq]
print('unimplemented:', len(unimplemented))
unimplemented

unimplemented: 18


['afs_syscall()',
 'break()',
 'fattach()',
 'fdetach()',
 'getmsg()',
 'getpmsg()',
 'gtty()',
 'isastream()',
 'lock()',
 'madvise1()',
 'mpx()',
 'prof()',
 'putmsg()',
 'putpmsg()',
 'security()',
 'stty()',
 'tuxcall()',
 'vserver()']

In [97]:
# Syscalls missing a FirstLine: 13
no_fl = dataset[dataset['FirstLine'] == "<,>"]
print(f"num: {len(no_fl)}, syscalls are: {list(no_fl['Syscall'])}")
no_fl


num: 13, syscalls are: ['ipc()', 'pciconfig_read()', 'pciconfig_write()', 'pciconfig_iobase()', 'timerfd_create()', 'clock_settime()', 'clock_gettime()', 'semtimedop()', 'clock_getres()', 'syscall()', 'socketcall()', 'nfsservctl()', 'msgctl()']


Unnamed: 0,Syscall,Sentence,FirstLine,Synonym,SynonymSentence
54,ipc(),System V IPC system calls,"<,>","<,>",
82,pciconfig_read(),pci device information handling,"<,>","<read,to buf>",Reads to buf from device dev at offset off value.
83,pciconfig_write(),pci device information handling,"<,>","<write,from buf>",Writes from buf to device dev at offset off va...
84,pciconfig_iobase(),pci device information handling,"<,>","<,>",
210,timerfd_create(),timers that notify via file descriptors,"<,>","<create,a new timer object>",timerfd_create() creates a new timer object
295,clock_settime(),clock and time functions,"<,>","<retrieve,time>",retrieve and set the time of the specified clo...
296,clock_gettime(),clock and time functions,"<,>","<set,time>",retrieve and set the time of the specified clo...
314,semtimedop(),clock and time functions,"<,>","<,>",
317,clock_getres(),clock and time functions,"<,>","<,>",
320,syscall(),indirect system call,"<,>","<,>",


In [98]:
# Syscalls missing both FirstLine and Synonym: 7
no_fl_syn = dataset[(dataset['FirstLine'] == "<,>") & (dataset['Synonym'] == "<,>")]
print(f"num: {len(no_fl_syn)}, syscalls are: {list(no_fl_syn['Syscall'])}")

num: 7, syscalls are: ['ipc()', 'pciconfig_iobase()', 'semtimedop()', 'clock_getres()', 'syscall()', 'socketcall()', 'nfsservctl()']


In [99]:
# Rows missing a Synonym: 297
no_syn = dataset[(dataset['Synonym'] == "<,>")]
print(f"num: {len(no_syn)}, syscalls are: {list(no_syn['Syscall'])}")

num: 297, syscalls are: ['shutdown()', 'recvmmsg()', 'pipe2()', 'mq_unlink()', 'socketpair()', 'mq_timedsend()', 'mq_timedreceive()', 'getpeername()', 'listen()', 'oldolduname()', 'olduname()', 'madvise()', 'sched_get_priority_max()', 'kexec_load()', 'setuid()', 'geteuid32()', 'ustat()', 'keyctl()', 'setreuid32()', 'times()', 'get_kernel_syms()', '_sysctl()', '_sysctl()', 'getpgid()', 'getegid32()', 'getuid32()', 'setpgid()', 'gettid()', 'stime()', 'setregid()', 'setgid32()', 'getsid()', 'shmctl()', 'getrusage()', 'setreuid()', 'sched_get_priority_min()', 'setregid32()', 'setgid()', 'setuid32()', 'getgid32()', 'ipc()', 'getgroups()', 'setfsuid32()', 'setresuid32()', 'getresgid32()', 'setgroups()', 'setresuid()', 'setfsgid32()', 'getgroups32()', 'setfsuid()', 'setresgid()', 'setrlimit()', 'getresgid()', 'getpriority()', 'setfsgid()', 'setpriority()', 'setgroups32()', 'getresuid()', 'syslog()', 'syslog()', 'syslog()', 'setresgid32()', 'getresuid32()', 'io_cancel()', 'iopl()', 'pciconfig_

#### Counting Synonyms

In [100]:
# Rows with exactly one Synonym: 111
def count_pairs(x):
    numOfSplit = len(x.split('|'))
    if numOfSplit == 1 and (x == '' or x == '<,>'):
        return 0
    return numOfSplit

dataset['numOfSynonym'] = dataset['Synonym'].apply(count_pairs)
dataset['numOfFirstLine'] = dataset['FirstLine'].apply(count_pairs)
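As a quick sanity check of the counting rule, the sketch below copies `count_pairs` from the cell above and applies it to a few field values. The two-pair string is a made-up illustration, not a row from the dataset.

```python
def count_pairs(x):
    numOfSplit = len(x.split('|'))
    if numOfSplit == 1 and (x == '' or x == '<,>'):
        return 0
    return numOfSplit

samples = [
    "",                              # empty field
    "<,>",                           # empty placeholder pair
    "<transmit,multiple messages>",  # one pair
    "<a,b>|<c,d>",                   # hypothetical field with two pairs
]
counts = [count_pairs(s) for s in samples]
print(counts)  # [0, 0, 1, 2]
```

Note that the function counts `"|"`-separated chunks rather than validating them, so a field such as `"<,>|<,>"` would be counted as 2.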

In [111]:
syn_vcnt = dataset['numOfSynonym'].value_counts()
print(syn_vcnt)
print('Synonym >3 :', list(dataset[dataset['numOfSynonym'] > 3]['Syscall']))
print('Synonym =3 :', list(dataset[dataset['numOfSynonym'] == 3]['Syscall']))
print('Synonym =2 :', list(dataset[dataset['numOfSynonym'] == 2]['Syscall']))
print('Synonym =1 :', list(dataset[dataset['numOfSynonym'] == 1]['Syscall'])[:7])

0    297
1    111
2     10
3      6
4      2
Name: numOfSynonym, dtype: int64
Synonym >3 : ['add_key()', 'ptrace()']
Synonym =3 : ['idle()', 'epoll_ctl()', 'dup()', 'dup2()', 'fcntl()', 'fcntl64()']
Synonym =2 : ['request_key()', 'semget()', 'msgget()', 'sigsuspend()', 'renameat2()', 'renameat()', 'readlink()', 'poll()', 'ppoll()', 'futex()']
Synonym =1 : ['sendmmsg()', 'mq_open()', 'accept4()', 'accept()', 'mq_getsetattr()', 'mq_getsetattr()', 'mq_notify()']


In [102]:
# No further significance: this only shows that the Name field of 13 syscalls contains no verb
dataset['numOfFirstLine'].value_counts()

1    413
0     13
Name: numOfFirstLine, dtype: int64

#### Counting FirstLines

In [103]:
syscall_vcnt = dataset['Syscall'].value_counts()
# syscall_vcnt[syscall_vcnt > 1]
print('FirstLine >3 :', len(syscall_vcnt[syscall_vcnt > 3]))
print('FirstLine =3 :', len(syscall_vcnt[syscall_vcnt == 3]))
print('FirstLine =2 :', len(syscall_vcnt[syscall_vcnt == 2]))
print('FirstLine =1 :', len(syscall_vcnt[syscall_vcnt == 1]))

print('FirstLine >3 :', list(syscall_vcnt[syscall_vcnt > 3].index))
print('FirstLine =3 :', list(syscall_vcnt[syscall_vcnt == 3].index))
print('FirstLine =2 :', list(syscall_vcnt[syscall_vcnt == 2].index))

FirstLine >3 : 0
FirstLine =3 : 4
FirstLine =2 : 23
FirstLine =1 : 368
FirstLine >3 : []
FirstLine =3 : ['fanotify_mark()', 'syslog()', 'reboot()', 'bdflush()']
FirstLine =2 : ['open_by_handle_at()', 'open()', 'setsid()', 'mmap()', 'acct()', '_sysctl()', 'timerfd_settime()', 'rt_sigaction()', 'name_to_handle_at()', 'tee()', 'flock()', 'modify_ldt()', 'rt_sigprocmask()', 'timer_settime()', 'sigprocmask()', 'sigaction()', 'ulimit()', 'remove()', 'lchmod()', 'sigaltstack()', 'fanotify_init()', 'mq_getsetattr()', 'vfork()']
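The rows-per-syscall distribution can be cross-checked without pandas. The sketch below first shows the counting idea on toy data (hypothetical rows, not the real dataset) and then checks that the figures printed above are consistent with the 395 unique syscalls and 426 rows reported earlier in this notebook.

```python
from collections import Counter

# Toy data (hypothetical): one entry per dataset row.
rows = ["open()", "open()", "syslog()", "syslog()", "syslog()", "read()"]

rows_per_syscall = Counter(rows)                    # syscall -> number of rows
distribution = Counter(rows_per_syscall.values())   # row count -> number of syscalls

# Figures reported in the output above: {rows per syscall: number of syscalls}
reported = {3: 4, 2: 23, 1: 368}
unique_syscalls = sum(reported.values())
total_rows = sum(k * n for k, n in reported.items())
print(unique_syscalls, total_rows)  # 395 426
```

Both totals match the `len(dataset['Syscall'].unique())` and `dataset len` values printed earlier, which suggests no rows were dropped between the cells.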
