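### Building the new Synonym_base
- Each description is stored as a `<verb phrase, object>` pair. The verb phrase should use the lemma (base-form verb) where possible, e.g. "copy a file".
- In the manual, each syscall's description has a Name block and a Description block. Every syscall has a Name block that describes it briefly, e.g. `write - write to a file descriptor`. We call the pair taken from the Name block the `FirstLine` field. In principle every syscall has one `<verb phrase, object>` pair and therefore one row of data; if the first line contains two verb pairs, they are written as two separate rows.
- The Description block is a long, unstructured, detailed text that explains how the syscall works, which parameters it takes, and how to call it. We try to find at least one sentence with the same meaning as the `FirstLine`, turn it into a pair as well, and store it in the `Synonym` field. A `FirstLine` can have zero or more synonym pairs; all of them go on the same row, with the `<>` pairs separated by `"|"`.
- Question: how fine-grained should the object of a pair be? Answer: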
```
check real user's permissions for a file => "<check,user's permission>"
```
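
The storage format described above can be read back with a few lines of Python. The following is a minimal sketch, not code from this notebook: `parse_pairs` is a hypothetical helper, and it assumes that the verb phrase never contains a comma (so the first comma separates verb phrase from object) and that no object contains `|`.

```python
def parse_pairs(field: str) -> list:
    """Split a FirstLine/Synonym field such as '<a,b>|<c,d>' into (verb, object) tuples."""
    pairs = []
    for chunk in field.split("|"):
        inner = chunk.strip()
        # Drop the surrounding angle brackets if both are present
        if inner.startswith("<") and inner.endswith(">"):
            inner = inner[1:-1]
        # Only the first comma separates the verb phrase from the object
        verb, _, obj = inner.partition(",")
        verb, obj = verb.strip(), obj.strip()
        if verb or obj:  # skip the empty placeholder '<,>'
            pairs.append((verb, obj))
    return pairs


print(parse_pairs("<shut down,part of a full-duplex connection>"))
print(parse_pairs("<,>"))
```

Returning an empty list for the `<,>` placeholder keeps "no pair" distinct from a pair with an empty verb.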

Load the original rule_dataset and convert it to the pair format

In [1]:
import pandas as pd

df = pd.read_csv('../rule_dataset_final.csv') # 354 Syscall
# df = pd.read_csv('../rule_dataset.csv')       # 88  Syscall
# df = pd.read_csv('../proc.csv')       # 88  Syscall

def clean_braces(x):
    x = x.replace("()", "").replace(",", "").replace(" ", "")
    return str(x)
def add_braces(x):
    x = clean_braces(x)
    x = x+"()"
    return x

df['Syscall'] = df['Syscall'].apply(add_braces) # or clean_braces
df.head(3)


Unnamed: 0,EntityType,ActionType,Source,Syscall,EnVerb,Sentence
0,DEVICE,DEVICE,man,io_cancel(),cancel,cancel an outstanding asynchronous I/O operation
1,DEVICE,DEVICE,man,io_destroy(),destroy,destroy an asynchronous I/O context
2,DEVICE,DEVICE,man,io_getevents(),read,read asynchronous I/O events from the completi...


In [57]:
def make_firstline(v:str, sentence:str, default="<,>"):
    spl = sentence.split()
    for i,word in enumerate(spl):
        if word.startswith(v):
            object_phrase = " ".join(spl[i+1:])
            return f"<{v},{object_phrase}>"
    return default
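
A few calls make the matching behaviour of `make_firstline` concrete. The sketch below repeats the function so that it runs on its own; the sentences come from the tables in this notebook, and the verb `set` for the last sentence is an assumption made for illustration. Because matching is a per-word `startswith`, a verb that sits in the second half of a slash-joined token such as `get/set` is not found, and the default `<,>` is returned.

```python
def make_firstline(v: str, sentence: str, default="<,>"):
    spl = sentence.split()
    for i, word in enumerate(spl):
        if word.startswith(v):
            # Everything after the matched word becomes the object phrase
            object_phrase = " ".join(spl[i + 1:])
            return f"<{v},{object_phrase}>"
    return default


examples = [
    ("cancel", "cancel an outstanding asynchronous I/O operation"),
    ("get", "get/set list of robust futexes"),
    ("set", "get/set I/O scheduling class and priority"),  # verb assumed for illustration
]
results = [make_firstline(v, s) for v, s in examples]
for r in results:
    print(r)
```

The third call falls back to `<,>`, which is consistent with the empty FirstLine shown for `ioprio_set()` in the output table of the next cell.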

In [58]:
# New columns: FirstLine, Synonym
# Columns worth keeping: (EntityType, ActionType), Syscall

# syscall_88 = df['Syscall'].unique()
# df = df[df['Source'] == 'man']
df['FirstLine'] = df.apply(lambda r: make_firstline(r['EnVerb'],r['Sentence']), axis=1)
df['Synonym']   = "<,>"
df['SynonymSentence']   = ""
df = df[['Syscall', 'Sentence', 'FirstLine', 'Synonym', 'SynonymSentence']]
# df.to_csv('proc.csv', index=False)
df

Unnamed: 0,Syscall,Sentence,FirstLine,Synonym,SynonymSentence
0,idle(),make process 0 idle,"<make,process 0 idle>","<,>",
1,move_pages(),move individual pages of a process to another ...,"<move,individual pages of a process to another...","<,>",
2,timer_gettime(),arm/disarm and fetch state of POSIX per,"<fetch,state of POSIX per>","<,>",
3,timer_delete(),delete a POSIX per,"<delete,a POSIX per>","<,>",
4,execveat(),execute program relative to a directory file d...,"<execute,program relative to a directory file ...","<,>",
...,...,...,...,...,...
74,ioprio_set(),get/set I/O scheduling class and priority,"<,>","<,>",
75,sched_getattr(),set and get scheduling policy and attributes,"<get,scheduling policy and attributes>","<,>",
76,get_robust_list(),get/set list of robust futexes,"<get,list of robust futexes>","<,>",
77,getitimer(),get or set value of an interval timer,"<get,or set value of an interval timer>","<,>",


In [None]:
'''
Syscall,Sentence,FirstLine,Synonym
read(),read from a file descriptor,"<read, file descriptor>","<>"
readlink(),read value of a symbolic link,"<read, value of a symbolic link>","<place, the contents of the symbolic link path>"
'''

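### Statistics
- Counts: 395 syscalls, FirstLine verbs (from the Name block), synonym verbs (from the Description block)
- unimplemented: 18 (396+18=414)
- When computing the statistics, leave out these two calls, ['rm()', 'exec()']: they are not in the strace GitHub repository, although our ASG uses them.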

In [20]:
import pandas as pd
dataset = pd.read_csv('./synonym_dataset.csv')
dataset = dataset[~dataset['Syscall'].isin(['rm()', 'exec()'])]
print('dataset len', len(dataset))
dataset.head()

dataset len 426


Unnamed: 0,Syscall,Sentence,FirstLine,Synonym,SynonymSentence
0,shutdown(),shut down part of a full-duplex connection,"<shut down,part of a full-duplex connection>","<,>",
1,recvmmsg(),receive multiple messages on a socket,"<receive,multiple messages>","<,>",
2,sendmmsg(),send multiple messages on a socket,"<send,multiple messages>","<transmit,multiple messages>",transmit multiple messages on a socket using a...
3,pipe2(),create pipe,"<create,pipe>","<,>",
4,mq_unlink(),remove a message queue,"<remove,a message queue>","<,>",


In [21]:
uq = dataset['Syscall'].unique()
len(dataset['Syscall'].unique())

395

In [22]:
# https://man7.org/linux/man-pages/man2/unimplemented.2.html 
unimplemented = '''afs_syscall, break, fattach, fdetach, ftime, getmsg, getpmsg,
       gtty, isastream, lock, madvise1, mpx, prof, profil, putmsg,
       putpmsg, security, stty, tuxcall, ulimit, vserver'''
unimplemented = unimplemented.split(',')
unimplemented = [f"{u.strip()}()" for u in unimplemented]
unimplemented = [u for u in unimplemented if u not in uq]
print('unimplemented:', len(unimplemented))
unimplemented

unimplemented: 18


['afs_syscall()',
 'break()',
 'fattach()',
 'fdetach()',
 'getmsg()',
 'getpmsg()',
 'gtty()',
 'isastream()',
 'lock()',
 'madvise1()',
 'mpx()',
 'prof()',
 'putmsg()',
 'putpmsg()',
 'security()',
 'stty()',
 'tuxcall()',
 'vserver()']

In [23]:
# Rows missing a FirstLine: 0
no_fl = dataset[dataset['FirstLine'] == "<,>"]
print(f"num: {len(no_fl)}, syscalls are: {list(no_fl['Syscall'])}")
no_fl


num: 0, syscalls are: []


Unnamed: 0,Syscall,Sentence,FirstLine,Synonym,SynonymSentence


In [24]:
# Rows missing both FirstLine and Synonym: 0
no_fl_syn = dataset[(dataset['FirstLine'] == "<,>") & (dataset['Synonym'] == "<,>")]
print(f"num: {len(no_fl_syn)}, syscalls are: {list(no_fl_syn['Syscall'])}")

num: 0, syscalls are: []


In [25]:
# Rows missing a Synonym: 297 (a syscall with several FirstLine rows is listed more than once)
no_syn = dataset[(dataset['Synonym'] == "<,>")]
print(f"num: {len(no_syn)}, syscalls are: {list(no_syn['Syscall'])}")

num: 297, syscalls are: ['shutdown()', 'recvmmsg()', 'pipe2()', 'mq_unlink()', 'socketpair()', 'mq_timedsend()', 'mq_timedreceive()', 'getpeername()', 'listen()', 'oldolduname()', 'olduname()', 'madvise()', 'sched_get_priority_max()', 'kexec_load()', 'setuid()', 'geteuid32()', 'ustat()', 'keyctl()', 'setreuid32()', 'times()', 'get_kernel_syms()', '_sysctl()', '_sysctl()', 'getpgid()', 'getegid32()', 'getuid32()', 'setpgid()', 'gettid()', 'stime()', 'setregid()', 'setgid32()', 'getsid()', 'shmctl()', 'getrusage()', 'setreuid()', 'sched_get_priority_min()', 'setregid32()', 'setgid()', 'setuid32()', 'getgid32()', 'ipc()', 'getgroups()', 'setfsuid32()', 'setresuid32()', 'getresgid32()', 'setgroups()', 'setresuid()', 'setfsgid32()', 'getgroups32()', 'setfsuid()', 'setresgid()', 'setrlimit()', 'getresgid()', 'getpriority()', 'setfsgid()', 'setpriority()', 'setgroups32()', 'getresuid()', 'syslog()', 'syslog()', 'syslog()', 'setresgid32()', 'getresuid32()', 'io_cancel()', 'iopl()', 'pciconfig_

#### Counting Synonyms
(1) Per FirstLine pair

In [26]:
# Count the pairs in a field ('' and '<,>' count as 0); rows with exactly one Synonym: 111
def count_pairs(x):
    numOfSplit = len(x.split('|'))
    if numOfSplit == 1 and (x == '' or x == '<,>'):
        return 0
    return numOfSplit

dataset['numOfSynonym'] = dataset['Synonym'].apply(count_pairs)
dataset['numOfFirstLine'] = dataset['FirstLine'].apply(count_pairs)
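
To make the counting rule explicit, here is a small self-contained check of `count_pairs` on hand-written field values (the sample strings are illustrative, not rows of the dataset). Note one edge case of the rule as written: a field made of several empty placeholders, such as `<,>|<,>`, is still counted as 2.

```python
def count_pairs(x):
    numOfSplit = len(x.split('|'))
    # A single empty field or a single placeholder means "no pair"
    if numOfSplit == 1 and (x == '' or x == '<,>'):
        return 0
    return numOfSplit


samples = [
    "",
    "<,>",
    "<transmit,multiple messages>",
    "<place,the contents of the symbolic link path>|<read,value of a symbolic link>",
    "<,>|<,>",
]
counts = [count_pairs(s) for s in samples]
print(counts)  # [0, 0, 1, 2, 2]
```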

In [27]:
syn_vcnt = dataset['numOfSynonym'].value_counts()
print(syn_vcnt)
print('Synonym >3 :', list(dataset[dataset['numOfSynonym'] > 3]['Syscall']))
print('Synonym =3 :', list(dataset[dataset['numOfSynonym'] == 3]['Syscall']))
print('Synonym =2 :', list(dataset[dataset['numOfSynonym'] == 2]['Syscall']))
print('Synonym =1 :', list(dataset[dataset['numOfSynonym'] == 1]['Syscall'])[:7])

0    297
1    111
2     10
3      6
4      2
Name: numOfSynonym, dtype: int64
Synonym >3 : ['add_key()', 'ptrace()']
Synonym =3 : ['idle()', 'epoll_ctl()', 'dup()', 'dup2()', 'fcntl()', 'fcntl64()']
Synonym =2 : ['request_key()', 'semget()', 'msgget()', 'sigsuspend()', 'renameat2()', 'renameat()', 'readlink()', 'poll()', 'ppoll()', 'futex()']
Synonym =1 : ['sendmmsg()', 'mq_open()', 'accept4()', 'accept()', 'mq_getsetattr()', 'mq_getsetattr()', 'mq_notify()']


In [28]:
# No other meaning: a count of 0 would mark a row whose Name block has no verb; every row here has exactly one FirstLine pair
dataset['numOfFirstLine'].value_counts()

1    426
Name: numOfFirstLine, dtype: int64

(2) Per syscall (summing the synonym counts of all FirstLine pairs under the same syscall)

In [29]:
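# Sum only the two count columns so that the text columns are not aggregated
gp = dataset.groupby('Syscall')[['numOfSynonym', 'numOfFirstLine']].sum()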
gp


Unnamed: 0_level_0,numOfSynonym,numOfFirstLine
Syscall,Unnamed: 1_level_1,Unnamed: 2_level_1
_llseek(),0,1
_newselect(),1,1
_sysctl(),0,2
accept(),1,1
accept4(),1,1
...,...,...
wait4(),0,1
waitid(),0,1
waitpid(),0,1
write(),0,1


In [30]:
print('sum:',sum(gp['numOfSynonym'].value_counts()))
gp['numOfSynonym'].value_counts() # 273 syscalls have no synonym; 97 have exactly one

sum: 395


0    273
1     97
2     17
3      6
4      2
Name: numOfSynonym, dtype: int64

In [31]:
gp['numOfFirstLine'].value_counts() # same result as the "Counting FirstLines" block below

1    368
2     23
3      4
Name: numOfFirstLine, dtype: int64

(3) Count how many verbs each syscall has

In [34]:
gp['Sum'] = gp['numOfSynonym'] + gp['numOfFirstLine']
gp.head()
gp['Sum'].value_counts()

1    257
2    105
3     18
4     13
5      2
Name: Sum, dtype: int64
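
The difference between the per-row view in (1) and the per-syscall view in (2) and (3) is easiest to see on a toy example. The sketch below uses plain Python instead of pandas, and the rows are made up (`foo()` and `bar()` are placeholder names, not dataset entries). It sums the counts of all rows that share a syscall, then adds FirstLine and Synonym counts to get the number of verbs per syscall.

```python
from collections import defaultdict

# (syscall, numOfFirstLine, numOfSynonym), one tuple per dataset row
rows = [
    ("foo()", 1, 0),
    ("foo()", 1, 2),
    ("bar()", 1, 1),
]

# Per-syscall sums: [numOfFirstLine, numOfSynonym]
per_syscall = defaultdict(lambda: [0, 0])
for syscall, n_first, n_syn in rows:
    per_syscall[syscall][0] += n_first
    per_syscall[syscall][1] += n_syn

# Verbs per syscall = FirstLine pairs + Synonym pairs
totals = {name: first + syn for name, (first, syn) in per_syscall.items()}
print(dict(per_syscall))  # {'foo()': [2, 2], 'bar()': [1, 1]}
print(totals)             # {'foo()': 4, 'bar()': 2}
```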

#### Counting FirstLines

In [21]:
syscall_vcnt = dataset['Syscall'].value_counts()
# syscall_vcnt[syscall_vcnt > 1]
print('FirstLine >3 :', len(syscall_vcnt[syscall_vcnt > 3]))
print('FirstLine =3 :', len(syscall_vcnt[syscall_vcnt == 3]))
print('FirstLine =2 :', len(syscall_vcnt[syscall_vcnt == 2]))
print('FirstLine =1 :', len(syscall_vcnt[syscall_vcnt == 1]) - len(no_fl))
print('FirstLine =0 :', len(no_fl))

print('FirstLine >3 :', list(syscall_vcnt[syscall_vcnt > 3].index))
print('FirstLine =3 :', list(syscall_vcnt[syscall_vcnt == 3].index))
print('FirstLine =2 :', list(syscall_vcnt[syscall_vcnt == 2].index))

FirstLine >3 : 0
FirstLine =3 : 4
FirstLine =2 : 23
FirstLine =1 : 368
FirstLine =0 : 0
FirstLine >3 : []
FirstLine =3 : ['fanotify_mark()', 'syslog()', 'reboot()', 'bdflush()']
FirstLine =2 : ['open_by_handle_at()', 'open()', 'setsid()', 'mmap()', 'acct()', '_sysctl()', 'timerfd_settime()', 'rt_sigaction()', 'name_to_handle_at()', 'tee()', 'flock()', 'modify_ldt()', 'rt_sigprocmask()', 'timer_settime()', 'sigprocmask()', 'sigaction()', 'ulimit()', 'remove()', 'lchmod()', 'sigaltstack()', 'fanotify_init()', 'mq_getsetattr()', 'vfork()']


### Common verbs

In [1]:
import pandas as pd
dataset = pd.read_csv('./synonym_dataset.csv')

In [2]:
def extract_verb(x:str):
    ''' '<,>' => '' (empty string), '<get,xxx>' => 'get' '''
    x = x.replace('<','').replace('>','')
    x = x.split(',')[0]
    return x.strip()

dataset['FirstVerb'] = dataset['FirstLine'].apply(extract_verb)

In [38]:
dataset['FirstVerb'].value_counts()[:10]

get           75
set           55
change        30
create        28
read          12
remove        11
wait for       9
manipulate     8
send           7
open           7
Name: FirstVerb, dtype: int64

In [39]:
rules = pd.read_csv('../rule_dataset_final.csv')
verbs_inrule = set(rules['EnVerb'].unique())

verbs_1st = set(dataset['FirstVerb'].unique())
print('verbs_inrule:', len(verbs_inrule), 'verbs_syn:', len(verbs_1st))
add = verbs_1st-verbs_inrule
print('added verbs in firstline:', len(add), add)

verbs_inrule: 113 verbs_syn: 104
added verbs in firstline: 31 {'wait for', 'enable', 'set up', 'shut down', 'switch off', 'communicate with', 'listen for', 'detach', 'flush', 'queue', 'provide', 'predeclare', 'stop', 'operate on', 'tune', 'disable', 'arm', 'reboot', 'send signal', 'splice', 'invoke', 'register for', 'switch on', 'hangup', 'unmap', 'disarm', 'unlock', 'issue', 'unmount', 'multiplex', 'block'}


In [3]:
# Extract the verb of every synonym pair; a field may hold several pairs separated by '|'
verbs_from_syn = list()
for pair_str in dataset['Synonym']:
    # split('|') also covers a field that holds a single pair
    verbs_from_syn.extend(extract_verb(pair) for pair in pair_str.split('|'))

from collections import Counter
counter = Counter(verbs_from_syn)
counter.most_common(12)

[('', 298),
 ('retrieve', 19),
 ('create', 16),
 ('set', 7),
 ('change', 6),
 ('monitor', 5),
 ('read', 4),
 ('remove', 4),
 ('copy', 4),
 ('creates', 3),
 ('find', 3),
 ('write', 3)]
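
The counts above list `creates` next to `create`, so some synonym verbs were not stored as lemmas. One possible clean-up is to fold such variants together before counting. The sketch below assumes a small hand-written mapping is acceptable; the `LEMMA_FIXES` table and the `normalize_verb` helper are hypothetical and not part of this notebook, and the input list is illustrative.

```python
from collections import Counter

# Hypothetical hand-maintained table of inflected forms seen in the data
LEMMA_FIXES = {"creates": "create"}


def normalize_verb(verb: str) -> str:
    """Map a known inflected form to its lemma; leave other verbs unchanged."""
    verb = verb.strip().lower()
    return LEMMA_FIXES.get(verb, verb)


raw_verbs = ["create", "creates", "retrieve", "creates"]
normalized = Counter(normalize_verb(v) for v in raw_verbs)
print(normalized.most_common())  # [('create', 3), ('retrieve', 1)]
```

A lookup table only covers forms that someone has already spotted; a lemmatizer from an NLP library would generalize better, at the cost of an extra dependency.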

### Total verbs

In [4]:
counter_1st = Counter(dataset['FirstVerb']) # verbs in firstline
numv_from_des = len(counter) -1
counter = counter + counter_1st
print(f"num of verbs = {len(counter) -1}, from name section = {len(counter_1st)}, from description section = {numv_from_des}")

num of verbs = 136, from name section = 104, from description section = 70


In [45]:
counter.most_common()

[('', 298),
 ('get', 77),
 ('set', 62),
 ('create', 44),
 ('change', 36),
 ('retrieve', 24),
 ('read', 16),
 ('remove', 15),
 ('manipulate', 11),
 ('write', 10),
 ('send', 10),
 ('wait for', 9),
 ('open', 8),
 ('load', 7),
 ('examine', 7),
 ('receive', 6),
 ('obtain', 5),
 ('lock', 5),
 ('fetch', 5),
 ('copy', 5),
 ('allocate', 5),
 ('monitor', 5),
 ('truncate', 5),
 ('control', 5),
 ('return', 5),
 ('operate', 5),
 ('add', 4),
 ('delete', 4),
 ('handle', 4),
 ('execute', 4),
 ('transfer', 4),
 ('creates', 3),
 ('find', 3),
 ('attach', 3),
 ('suspend', 3),
 ('check', 3),
 ('place', 3),
 ('arm', 3),
 ('disarm', 3),
 ('rename', 3),
 ('access', 3),
 ('move', 3),
 ('perform', 3),
 ('block', 3),
 ('share', 3),
 ('wait', 3),
 ('make', 3),
 ('tune', 3),
 ('initialize', 3),
 ('duplicate', 3),
 ('synchronize', 3),
 ('multiplex', 3),
 ('modify', 2),
 ('enter', 2),
 ('replace', 2),
 ('request', 2),
 ('enable', 2),
 ('disable', 2),
 ('moving', 2),
 ('free', 2),
 ('accept', 2),
 ('determine', 2),
 

In [48]:
a = [c[0] for c in counter.most_common()]
print(a)

['', 'get', 'set', 'create', 'change', 'retrieve', 'read', 'remove', 'manipulate', 'wait for', 'write', 'send', 'open', 'examine', 'load', 'receive', 'return', 'operate', 'lock', 'truncate', 'control', 'obtain', 'handle', 'execute', 'transfer', 'fetch', 'allocate', 'add', 'delete', 'copy', 'make', 'tune', 'initialize', 'duplicate', 'synchronize', 'multiplex', 'check', 'arm', 'disarm', 'monitor', 'move', 'wait', 'attach', 'perform', 'block', 'share', 'accept', 'determine', 'start', 'unlock', 'flush', 'reposition', 'sync', 'list', 'initiate', 'commit', 'predeclare', 'unmount', 'sleep', 'queue', 'splice', 'provide', 'map', 'unmap', 'creates', 'modify', 'find', 'enter', 'suspend', 'request', 'place', 'enable', 'disable', 'rename', 'access', 'free', 'replace', 'moving', 'shut down', 'listen for', 'register for', 'give', 'set up', 'communicate with', 'clear', 'cancel', 'destroy', 'stop', 'submit', 'disassociate', 'restart', 'yield', 'reassociate', 'terminate', 'unload', 'operate on', 'query'

### 5/10 new task: also convert the sentences selected from CTI reports into pair form and store them

In [2]:
import pandas as pd

df = pd.read_csv('../rule_dataset_final.csv') # 354 Syscall
df = df[df['Source'] != 'man']

def clean_braces(x):
    x = x.replace("()", "").replace(",", "").replace(" ", "")
    return str(x)
def add_braces(x):
    x = clean_braces(x)
    x = x+"()"
    return x

df['Syscall'] = df['Syscall'].apply(add_braces) # or clean_braces
df.head(3)

def make_firstline(v:str, sentence:str, default="<,>"):
    spl = sentence.split()
    for i,word in enumerate(spl):
        if word.startswith(v):
            object_phrase = " ".join(spl[i+1:])
            return f"<{v},{object_phrase}>"
    return default

# New columns: FirstLine, Synonym
# Columns worth keeping: (EntityType, ActionType), Syscall

# syscall_88 = df['Syscall'].unique()
# df = df[df['Source'] == 'man']
df['FirstLine'] = df.apply(lambda r: make_firstline(r['EnVerb'],r['Sentence']), axis=1)
df['Synonym']   = "<,>"
df['SynonymSentence']   = ""
df = df[['Syscall', 'Sentence', 'FirstLine', 'Synonym', 'SynonymSentence']]
# df.to_csv('synonym_117fromCTI.csv', index=False)
df

Unnamed: 0,Syscall,Sentence,FirstLine,Synonym,SynonymSentence
77,link(),Screenshot of the randomly-named files the Xor...,"<extract,to the /usr/bin folder.>","<,>",
78,link(),This Backdoor adds the following processes: se...,"<add,the following processes: sed -i -e '/exit...","<,>",
79,link(),The file k.sh that was dropped and executed on...,"<drop,and executed on the attacked server>","<,>",
83,mkdir(),big-ip mounts the /usr partition read-only.,"<mount,the /usr partition read-only.>","<,>",
98,open(),Screenshot of the randomly-named files the Xor...,"<extract,to the /usr/bin folder.>","<,>",
...,...,...,...,...,...
549,exit_group(),Kill all current packeting.,"<,>","<,>",
550,exit_group(),"this fallback function also checks if cron, cr...","<end,>","<,>",
552,kill(),It then kills all Telnet and SSH related proce...,"<kill,all Telnet and SSH related processes, an...","<,>",
553,kill(),"this fallback function also checks if cron, cr...","<end,>","<,>",
