### Examples:

1. Predict the parent process name from the child process name

### Methodology:

1. Create "seen" and "unseen" data sets from data you captured earlier or from new captures
1. Learn the relationship between process names and parent process names
1. Explore the effectiveness of the trained model on the unseen data set
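
The three steps above can be sketched with a simple frequency baseline. This is only an illustration under stated assumptions, not the model this tutorial trains: the `seen` and `unseen` lists are invented (child, parent) name pairs, and `learn_most_common_parent` and `accuracy` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Toy (child_name, parent_name) observations; invented for illustration.
seen = [
    ("bash", "login"),
    ("bash", "login"),
    ("bash", "sshd"),
    ("python3.6", "bash"),
    ("assertiond", "launchd"),
]
unseen = [
    ("bash", "login"),
    ("python3.6", "bash"),
    ("assertiond", "kernel_task"),
    ("Wireshark", "launchd"),  # child never observed in `seen`
]


def learn_most_common_parent(pairs):
    """Map each child name to the parent name observed most often with it."""
    parents_by_child = defaultdict(Counter)
    for child, parent in pairs:
        parents_by_child[child][parent] += 1
    return {
        child: parents.most_common(1)[0][0]
        for child, parents in parents_by_child.items()
    }


def accuracy(model, pairs):
    """Fraction of pairs whose parent is predicted correctly."""
    hits = sum(1 for child, parent in pairs if model.get(child) == parent)
    return hits / len(pairs)


model = learn_most_common_parent(seen)
print(model["bash"])
print(accuracy(model, unseen))
```

A child that never appears in the seen data has no prediction and counts as a miss here, which is one reason the unseen score is usually lower than the seen score.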

### Try for yourself:

1. Predict the working directory from the process name
1. Predict the time of day from the child process name

In [1]:
import matplotlib
import pandas
import psutil
import time

from sklearn.model_selection import train_test_split

%matplotlib inline

In [2]:
columns = ['pid', 'ppid', 'cwd', 'create_time', 'cmdline', 'name']

In [None]:
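vvalues = []

# Sample the process table every 5 seconds for half an hour (360 samples)
for _ in range(360):
    for proc in psutil.process_iter(attrs=columns):
        # Read the fields in `columns` order so each row lines up with the DataFrame columns
        vvalues.append([proc.info[column] for column in columns])

    time.sleep(5)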


In [None]:
df = pandas.DataFrame(columns=columns, data=vvalues)

In [None]:
df['create_time'] = pandas.to_datetime(df['create_time'], unit='s')
df

In [None]:
# df.to_pickle('my_processes.pickle')

In [3]:
df = pandas.read_pickle('my_processes.pickle')

In [4]:
len(df)

96631

In [5]:
names = set(df['name'])
len(names)

260

In [6]:
list(names)[:5]

['captiveagent', 'login', 'netbiosd', 'Wireshark', 'LocationMenu']

In [7]:
name_to_index = {name: index for index, name in enumerate(names)}
index_to_name = {index: name for index, name in enumerate(names)}

In [8]:
list(name_to_index.items())[:5]

[('captiveagent', 0),
 ('login', 1),
 ('netbiosd', 2),
 ('Wireshark', 3),
 ('LocationMenu', 4)]

In [9]:
index_to_name[4]

'LocationMenu'

## Our Goal

Create a dictionary showing the child process name, the parent process name, and the number of times that parent was observed to launch the child.

```
{ 
   (name1, name2): 3,
   (name1, name4): 1,
   (name5, name2): 2
   ...
}
```

But the model works on numeric IDs rather than names, so we really need to train it on:

```
{ 
   (name_id_1, name_id_2): 3,
   (name_id_1, name_id_4): 1,
   (name_id_5, name_id_2): 2
   ...
}
```
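
As a rough sketch of how both dictionaries could be built, the block below counts (child, parent) pairs with `collections.Counter`. The `rows` list is invented toy data, and the sketch assumes each row is one distinct process. In the captured data the same process is sampled every 5 seconds, so rows would need de-duplicating first (for example on `pid` and `create_time`) before the counts mean "launches".

```python
from collections import Counter

# Toy process rows: (pid, ppid, name). Invented for illustration.
rows = [
    (1, 0, "launchd"),
    (100, 1, "login"),
    (200, 100, "bash"),
    (201, 100, "bash"),
    (300, 200, "python3.6"),
]

pid_to_name = {pid: name for pid, _ppid, name in rows}

# Count (child_name, parent_name) pairs, skipping rows whose parent is unknown.
name_pair_counts = Counter(
    (name, pid_to_name[ppid])
    for _pid, ppid, name in rows
    if ppid in pid_to_name
)

# Assign IDs from the sorted unique names so they are reproducible across runs.
name_to_id = {name: i for i, name in enumerate(sorted(set(pid_to_name.values())))}

# The same counts, keyed by (child_id, parent_id).
id_pair_counts = Counter(
    {
        (name_to_id[child], name_to_id[parent]): count
        for (child, parent), count in name_pair_counts.items()
    }
)

print(name_pair_counts)
print(id_pair_counts)
```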

A deeper, more complex goal might be:

```
{ 
   (name1, name2, name3): 3,
   (name1, name4, name5): 1,
   (name5, name2, name5): 2
   ...
}
```

Each sequence is like a sentence. A sequence of n items used for learning is called an "n-gram".

In this way, by sequencing the IDs of "anything of note" with a causal relationship, we can learn the prevalence of a causal chain and start to apply machine learning to these event sequences in the same way as to word sequences.
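
One way to picture the longer sequences is to walk each process up through its ancestors and count the resulting name tuples. The sketch below does this for n = 3. The `procs` table is invented toy data and `ancestry_ngram` is a hypothetical helper, so treat it as an illustration of the idea rather than the tutorial's implementation.

```python
from collections import Counter

# Toy process table: pid -> (ppid, name). Invented for illustration.
procs = {
    1: (0, "launchd"),
    100: (1, "login"),
    200: (100, "bash"),
    201: (100, "bash"),
    300: (200, "python3.6"),
}


def ancestry_ngram(pid, n):
    """Return the names of `pid` and its ancestors, child first.

    Returns None when the chain holds fewer than n known processes.
    """
    names = []
    while len(names) < n:
        if pid not in procs:
            return None
        ppid, name = procs[pid]
        names.append(name)
        pid = ppid
    return tuple(names)


# Count every complete 3-gram (child, parent, grandparent).
trigram_counts = Counter(
    gram
    for gram in (ancestry_ngram(pid, 3) for pid in procs)
    if gram is not None
)

print(trigram_counts)
```

Setting n = 2 gives the child and parent pairs again, so the pair dictionary is the bigram case of the same idea.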


In [10]:
df['name_id'] = df['name'].map(name_to_index)

In [11]:
len(set(df['name_id']))

260

In [12]:
pid_name_pair = df.apply(lambda x: (x['pid'], x['name_id']), axis=1)
pid_to_name_id = {pid: name for (pid, name) in list(pid_name_pair)}
pid_to_name_id[998]  

113

In [13]:
index_to_name[221]  # expected bash; IDs come from set iteration order, which can differ between sessions

'airportd'

In [14]:
train, test = train_test_split(df, test_size=0.2)
train.head()

| | pid | ppid | cwd | create_time | cmdline | name | name_id |
|---|---|---|---|---|---|---|---|
| 75405 | 95198 | 90715 | /Users/nblue/dev/AusCERT-2019/tutorial/pcap | 2019-05-28 03:43:53.158512115 | [/Users/nblue/dev/env/me/bin/python, -m, ipyke... | python3.6 | 105 |
| 48639 | 5976 | 1 | / | 2019-05-16 00:51:54.609373093 | [/usr/libexec/assertiond] | assertiond | 74 |
| 79106 | 1095 | 1 | / | 2019-05-15 10:33:02.263432979 | [/Applications/Visual Studio Code.app/Contents... | crashpad_handler | 23 |
| 62132 | 94474 | 90715 | /Users/nblue/dev/AusCERT-2019/tutorial/behaviour | 2019-05-28 02:13:34.698369026 | [/Users/nblue/dev/env/me/bin/python, -m, ipyke... | python3.6 | 105 |
| 71611 | 336 | 1 | / | 2019-05-15 10:25:14.948797941 | [/Library/Little Snitch/Little Snitch Agent.ap... | Little Snitch Agent | 82 |
