## AutoEncoders to detect unusual groups of processes

This notebook provides a reference we use for training [autoencoders](https://en.wikipedia.org/wiki/Autoencoder) to perform anomaly detection.  Autoencoders are neural networks that attempt to faithfully reconstruct their input by first compressing it into a low-dimensional encoding and then decompressing that encoding.  These networks can be useful for anomaly detection because data unlike the training data tends to be reconstructed poorly.  For cybersecurity, we can leverage anomaly detection to find possible attacks without having to perform significant feature engineering.

<div>
<img src="https://upload.wikimedia.org/wikipedia/commons/3/37/Autoencoder_schema.png" width="250px" />
    <p style="font-size: 9pt">Diagram by <a href="https://en.wikipedia.org/wiki/Autoencoder#/media/File:Autoencoder_schema.png">Michaela Massi</a>, some rights reserved</p>
</div>

For our purposes, we will build an autoencoder to identify anomalous groups of processes.  We focus on processes with the prefix \\\\device\Windows since attackers leverage these executables to [live off the land](https://conf.splunk.com/files/2019/slides/SEC1375.pdf).  We use a technique called [feature hashing](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html) to project the input (a map of process -> counts) into a [vector space](https://en.wikipedia.org/wiki/Vector_space) (convenient for machine learning).
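
To make the idea of feature hashing concrete, here is a minimal pure-Python sketch of the hashing trick. It is an illustration only: the function name `hash_features` is hypothetical, and it uses MD5 from the standard library, whereas scikit-learn's `FeatureHasher` uses MurmurHash3, so the indices and signs it produces will not match the output shown later in this notebook.

```python
import hashlib

def hash_features(counts, n_features=16):
    """Project a {name: count} map into a fixed-length vector (hashing trick).

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    vector = [0.0] * n_features
    for name, count in counts.items():
        digest = hashlib.md5(name.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")   # 64-bit integer hash
        index = h % n_features                  # which slot the feature lands in
        sign = -1.0 if (h >> 63) else 1.0       # top bit picks the sign
        vector[index] += sign * count           # colliding features add or cancel
    return vector

example = {"cmd.exe": 3, "conhost.exe": 3, "svchost.exe": 5, "werfault.exe": 0}
vec = hash_features(example)
print(vec)
```

Because the output length is fixed, process names never seen during training still map to some slot, which is what lets the model score unfamiliar processes later on. The trade-off is that distinct processes can collide in the same slot, and collisions become more likely as `n_features` shrinks.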


In [8]:
import numpy as np
from sklearn.feature_extraction import FeatureHasher
import tensorflow as tf
from sklearn.pipeline import Pipeline


### Training data
We will create a toy dataset recording which processes launched, and how often, during some time window (e.g. an hour), correlated on one or more entities (e.g. user and machine, or machine alone).  For this demonstration, normal data consists of four processes, each of which can occur 0-5 times within a sampling period.  We assume independence between the processes.  The code block below generates the data.

In [38]:
# Let's create some dummy data using processes
# commonly seen with the prefix C:\Windows
num_samples = 10000
def create_dataset(num_samples=10000):
    data = []
    for i in range(num_samples):
        datum = {'cmd.exe': np.round(np.random.uniform(high=5)),
                 'conhost.exe': np.round(np.random.uniform(high=5)),
                 'svchost.exe': np.round(np.random.uniform(high=5)),
                 'werfault.exe': np.round(np.random.uniform(high=5))}
        data.append(datum)
    return data

training_data = create_dataset()
test_data = create_dataset()


### Transforming the data
We use a scikit-learn pipeline to feature hash the input into a 16-dimensional vector.  An example of the input and the resulting output is shown below.

In [59]:
pipe = Pipeline([('hasher', FeatureHasher(n_features=16))])
# Vectorize the training set generated above
X = pipe.fit_transform(training_data)

print("Input data (process -> count map):")
print(", ".join([f"{k}->{int(v)}" for (k, v) in training_data[0].items()]))
print("\n")
print("Vectorized input (16 dimensional)")
print(X[0].todense())

Input data (process -> count map):
cmd.exe->3, conhost.exe->3, svchost.exe->5, werfault.exe->0


Vectorized input (16 dimensional)
[[-3.  2.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]


## Network
We build our model using TensorFlow Keras.  Since the input is already vectorized, we will stack vanilla dense layers with leaky ReLU activations to compress the input into a four-dimensional encoding and then decompress it back into the original 16 dimensions.

In [55]:
ae_input_layer = tf.keras.layers.Input(shape=(16,), name="win_processes_hashed")
ae_net = tf.keras.layers.Dense(8, name="enc_1")(ae_input_layer)
ae_net = tf.keras.layers.LeakyReLU()(ae_net)
ae_net = tf.keras.layers.Dense(4, name="enc_2")(ae_net)
ae_net = tf.keras.layers.LeakyReLU()(ae_net)
ae_net = tf.keras.layers.Dense(8, name="dec_1")(ae_net)
ae_net = tf.keras.layers.LeakyReLU()(ae_net)
ae_net = tf.keras.layers.Dense(16, name="reconstruction")(ae_net)
ae_net = tf.keras.layers.LeakyReLU()(ae_net)
ae_model = tf.keras.models.Model(ae_input_layer, ae_net)
ae_model.compile('adam', 'mse', ['mae'])
ae_model.summary()

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
win_processes_hashed (InputL [(None, 16)]              0         
_________________________________________________________________
enc_1 (Dense)                (None, 8)                 136       
_________________________________________________________________
leaky_re_lu_8 (LeakyReLU)    (None, 8)                 0         
_________________________________________________________________
enc_2 (Dense)                (None, 4)                 36        
_________________________________________________________________
leaky_re_lu_9 (LeakyReLU)    (None, 4)                 0         
_________________________________________________________________
dec_1 (Dense)                (None, 8)                 40        
_________________________________________________________________
leaky_re_lu_10 (LeakyReLU)   (None, 8)                 0   
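
As a sanity check on the parameter counts in the summary, a dense layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. The short sketch below recomputes the counts from the layer sizes in the model definition; the helper name `dense_params` is hypothetical.

```python
def dense_params(n_in, n_out):
    """Trainable parameters in a dense layer: weights plus biases."""
    return n_in * n_out + n_out

# Layer widths taken from the model definition: 16 -> 8 -> 4 -> 8 -> 16
layer_sizes = [16, 8, 4, 8, 16]
names = ["enc_1", "enc_2", "dec_1", "reconstruction"]

counts = {
    name: dense_params(n_in, n_out)
    for name, n_in, n_out in zip(names, layer_sizes, layer_sizes[1:])
}
print(counts)                # per-layer parameter counts
print(sum(counts.values()))  # total trainable parameters
```

The `LeakyReLU` layers contribute no parameters, which is why they appear with a count of 0.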

In [56]:
# Training
ae_model.fit(X, X.todense(), epochs=10, batch_size=8)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fdcf91dd580>

### Anomaly detection
We use Euclidean distance between the input and its reconstruction as a measure of reconstruction error.  We expect this distance to be small for normal data and large for anomalous data.
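
Written out, with $x$ denoting a hashed 16-dimensional input and $\hat{x}$ its reconstruction by the autoencoder (notation introduced here for illustration), the distance for a single sample is:

$$
d(x, \hat{x}) = \lVert x - \hat{x} \rVert_2 = \sqrt{\sum_{i=1}^{16} \left(x_i - \hat{x}_i\right)^2}
$$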

First we apply the model and get the mean distance to the test data (which is generated the same way as the training data).  We expect this to be small and it is.

In [57]:
X_test = pipe.transform(test_data)
np.average(np.sqrt(np.sum(np.square(X_test - ae_model.predict(X_test)), axis=1)))

0.010183748708750814

Now let's apply the model to an unusual set of commands that might be seen during [discovery](https://attack.mitre.org/tactics/TA0007/).  Typically, we may see at most one of these processes in a sampling window.  Notice how much larger the reconstruction distance for the anomalous input is than the mean distance for normal data.  We can therefore surface this unusual collection of processes, seen within a short period of time, to an analyst to get a disposition on whether the behavior is malicious.  We may also call out this activity if there are other secondary or weakly predictive signals related to the same user or device.

In [58]:
unusual_command = [{
    'whoami.exe': 1,
    'net.exe': 3,
    'ver.exe': 1,
    'query.exe': 2,
    'sc.exe': 5}
]
X_u = pipe.transform(unusual_command)
np.sqrt(np.sum(np.square(X_u - ae_model.predict(X_u))))

5.716755530643157
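
One way to turn these distances into alerts is to pick a threshold from the distribution of distances on normal data and flag anything above it. The sketch below is an assumption about how that could be done, not part of the original notebook: the helper names, the 0.99 quantile, and the synthetic `normal_distances` are all hypothetical stand-ins for distances computed from real held-out data.

```python
import numpy as np

def anomaly_threshold(normal_distances, quantile=0.99):
    """Distance below which `quantile` of the normal samples fall."""
    return float(np.quantile(np.asarray(normal_distances, dtype=float), quantile))

def flag_anomalies(distances, threshold):
    """Boolean mask: True where the reconstruction distance exceeds the threshold."""
    return np.asarray(distances, dtype=float) > threshold

# Synthetic stand-in for per-sample distances on normal data (not real results)
rng = np.random.default_rng(0)
normal_distances = rng.uniform(0.0, 0.05, size=1000)

threshold = anomaly_threshold(normal_distances)
# Score one small distance and one large one
flags = flag_anomalies([0.01, 5.7], threshold)
print(threshold, flags)
```

A quantile-based threshold caps the share of normal windows that get flagged, so the quantile can be tuned to how many alerts analysts are able to review.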

### Summary
Cybersecurity has long employed anomaly detection to identify unusual activity that may be attributable to cyber attacks.  This notebook shows how an autoencoder, a type of neural network, can take a map of process counts during a sampling window and identify unusual groups of processes.  To accomplish this, we use feature hashing to vectorize the map of process -> counts and train an autoencoder on the vectorized data.  Inputs that the trained network reconstructs poorly are flagged as unusual, which may be useful for discovering attacks.