![DLI Header](../images/DLI_Header.png)

# Digital Fingerprinting

In this notebook we are going to introduce the autoencoder (AE) pipeline and use it to perform "digital fingerprinting," a process by which we use deep learning to obtain a unique digital signature of a user or service's typical activities which can then be leveraged to identify when that user or service may have been hacked.

## Objectives

By the time you complete this notebook you will:

- Understand how autoencoders can be used to create digital fingerprints representing typical user/service behavior.
- Know how to construct an autoencoder pipeline with Morpheus to create and utilize a digital fingerprint.
- Be able to analyze autoencoder pipeline results to determine if a user account has been compromised.

---

## The Problem

A breached credential for any given app can give an attacker a huge world of permissions that will not be obvious or static over time. In 2021, compromised credentials were at the root of 61% of attacks.

Traditional approaches to finding and stopping threats are no longer effective enough. One reason is that the ways an attacker can enter a system and do damage have proliferated as the interconnections between apps and systems have multiplied.

Applying AI to the problem seems like a natural choice, but in some sense this broadens the data problem. A typical user may interact with 100 or more apps while doing their job, and integrations between apps mean that there may be tens of thousands of interconnections and permissions shared across those 100 apps. If you have 10,000 users, you’d need 10,000 models just to begin.

The good news is that NVIDIA Morpheus addresses this problem.

While most apps and systems will create logs, the variety, volume, and velocity of these logs means that much of the response possible is “closing the barn door after the horse has left.” Identifying credential breaches and the damage done can take weeks if you’re lucky, months if you’re average.

Once the number of users grows beyond “modest” or “very modest,” traditional rule-based warning systems are insufficient. By contrast, a person who knows how another person or system typically behaves can notice something fishy almost immediately when that user or system starts doing something unusual.

---

## A Solution

Every user or service in a system has a digital fingerprint: a typical set of things it does or doesn’t do in a specific sequence in time. Using Morpheus we can create a "digital fingerprint" for every user and service. Once we have obtained this digital fingerprint, we can compare all incoming data against it, flagging interactions that deviate from the expected behavior.

To create these digital fingerprints we are going to use a deep learning neural network called an autoencoder, which we discussed in the previous notebook.

---

## Digital Fingerprinting with Autoencoders

In the context of digital fingerprinting we can, for a given user, train an autoencoder on logs of the user's typical activities. During training, the autoencoder learns what is essential about the user's behavior rather than noise. It can then encode the user's activity into a reduced format and, from that reduced format, reconstruct the original activity with very little error.

Once the autoencoder has been trained in this way on a user's typical behavior, it ought to be able to use what it has learned to first encode and then decode novel behavior from this user in a similar way and end up with an accurate reconstruction of the data that was given to it.

If, however, the user begins to behave in a way that is very much unlike that user's usual behavior, then when the autoencoder uses what it has learned about that user to perform autoencoding, it will do a poor job at reconstructing the data it was given. In this case we might suspect that the user has been compromised and is being driven by some malicious agent.
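The idea of "good reconstruction for typical behavior, poor reconstruction for unusual behavior" can be illustrated with a minimal sketch. Note the assumptions: this is not the Morpheus autoencoder. It swaps in a linear encoder/decoder learned with PCA (via NumPy's SVD) as a simple stand-in, and all of the data is synthetic and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "typical" activity: 200 rows of 5 numeric features that
# really vary along only 2 underlying directions, plus a little noise.
mixing = rng.normal(size=(2, 5))
typical = rng.normal(size=(200, 2)) @ mixing + 0.05 * rng.normal(size=(200, 5))

# "Train": learn a 2-dimensional linear encoding from typical data (PCA via SVD).
# This is a stand-in for a real autoencoder, not the Morpheus model.
mean = typical.mean(axis=0)
_, _, vt = np.linalg.svd(typical - mean, full_matrices=False)
components = vt[:2]  # shape (2, 5)

def reconstruction_error(rows):
    """Encode rows to 2 dimensions, decode back, return per-row squared error."""
    encoded = (rows - mean) @ components.T
    decoded = encoded @ components + mean
    return ((rows - decoded) ** 2).sum(axis=1)

# New behavior that follows the learned pattern reconstructs well...
new_typical = rng.normal(size=(50, 2)) @ mixing + 0.05 * rng.normal(size=(50, 5))
# ...while behavior that ignores the learned pattern does not.
unusual = 3.0 * rng.normal(size=(50, 5))

typical_error = reconstruction_error(new_typical).mean()
unusual_error = reconstruction_error(unusual).mean()
print(f"typical: {typical_error:.4f}  unusual: {unusual_error:.4f}")
```

In this sketch the average reconstruction error for the unusual rows is far larger than for the typical rows, which is the signal digital fingerprinting relies on.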

---

## The Autoencoder Pipeline

One of the essential differences of `pipeline-ae` is that in addition to performing inference on source data, we can also perform model training as a part of the pipeline. It is in this way that we leverage the power of autoencoders to create a digital fingerprint for any number of users or services.

### Fields for Model Training

When running the pipeline we specify a `--columns_file` to indicate which fields in incoming data should be used for model training.
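Conceptually, a columns file is just a list of field names, one per line. The following pandas sketch shows how such a list could be used to restrict a dataframe to the training fields. It is an illustration only: the file contents and log rows are hypothetical, and how Morpheus itself handles listed fields that are missing from the data is an assumption here (the sketch simply skips them).

```python
import io
import pandas as pd

# Stand-in for a columns file: one field name per line (hypothetical contents).
columns_file = io.StringIO("eventName\neventSource\nsourceIPAddress\nuserAgent\n")
training_columns = [line.strip() for line in columns_file if line.strip()]

# Tiny hypothetical log sample with one extra field (eventID) and one
# listed field (userAgent) missing.
logs = pd.DataFrame({
    "eventID": [0, 1],
    "eventName": ["GetSendQuota", "ListTagsForResource"],
    "eventSource": ["lopez-byrd.info", "robinson.com"],
    "sourceIPAddress": ["208.49.113.40", "208.49.113.40"],
})

# Keep only the listed fields that are actually present in the data.
present = [column for column in training_columns if column in logs.columns]
features = logs[present]
print(list(features.columns))
```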

In [1]:
!morpheus run pipeline-ae --help | grep 'columns_file'

  --columns_file FILE        [default: data/columns_ae.txt]


We've provided a columns file for you:

In [2]:
!cat data/columns_ae.txt

apiVersion
errorCode
errorMessage
eventName
eventSource
sourceIPAddress
tlsDetailsclientProvidedHostHeader
userAgent
userIdentityaccessKeyId
userIdentityaccountId
userIdentityarn
userIdentityprincipalId
userIdentitysessionContextsessionIssueruserName


### User ID Location

Morpheus will look for every unique entry in the source data in the specified `--userid_column_name` column and will train a model for each of these entries.

In [3]:
!morpheus run pipeline-ae --help | grep 'userid_column_name TEXT' -A 1

  --userid_column_name TEXT  Which column to use as the User ID.  [default:
                             userIdentityaccountId; required]


### Training for Only a Single User

Optionally, you can set `--userid_filter` to only train for a specific user/service. We will use this option presently, and will look later in the course at removing it to run the pipeline against multiple users.
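The effect of these two options can be pictured with a small pandas sketch. This is not how Morpheus implements them internally; it only illustrates, on hypothetical rows, the difference between "one model per unique user ID" and "only rows matching a single user ID."

```python
import pandas as pd

USERID_COLUMN = "userIdentitysessionContextsessionIssueruserName"

# Hypothetical rows for two users.
logs = pd.DataFrame({
    USERID_COLUMN: ["role-g", "user123", "role-g", "user123", "role-g"],
    "eventName": ["GetSendQuota", "ListBuckets", "ListTagsForResource",
                  "GetObject", "GetSendQuota"],
})

# Without a filter: one group of rows (and so one model) per unique user ID.
per_user = {user: rows for user, rows in logs.groupby(USERID_COLUMN)}
print(sorted(per_user))

# With a filter: only rows whose user ID matches are kept.
userid_filter = "role-g"
filtered = logs[logs[USERID_COLUMN] == userid_filter]
print(len(filtered))
```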

In [4]:
!morpheus run pipeline-ae --help | grep 'userid_filter TEXT' -A 3

  --userid_filter TEXT       Specifying this value will filter all incoming
                             data to only use rows with matching User IDs.
                             Which column is used for the User ID is specified
                             by `userid_column_name`


---

## Autoencoder Training Configuration

In `pipeline-ae` we perform autoencoder model training (aka digital fingerprinting) with the `train-ae` stage.

By setting `--train_data_glob` you can provide attack-free data for training the autoencoder models.

In [5]:
!morpheus run pipeline-ae train-ae --help | grep 'train_data_glob TEXT' -A 2

  --train_data_glob TEXT          On startup, all files matching this glob
                                  pattern will be loaded and used to train a
                                  model for each unique user ID.


If you wish, you could instead provide your own pre-trained model with `--pretrained_filename`, which would then be used for every user.

In [6]:
!morpheus run pipeline-ae train-ae --help | grep 'pretrained_filename' -A 1

  --pretrained_filename FILE      Loads a single pre-trained model for all
                                  users.


If you do not provide a `train_data_glob` or a `pretrained_filename` then Morpheus will train for users found in the incoming data stream, either all users, or if `--userid_filter` is set, just the filtered user.

In this scenario, when a pre-trained model or training glob is not provided, Morpheus will train a new model for each user whenever new data for that user comes through the stream. Using `--train_max_history` you can configure how much prior data will be kept around to be a part of subsequent model trainings.
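The idea of a capped training history can be sketched with a bounded buffer from the Python standard library. This is an illustration of the concept only, under the assumption that the oldest rows are discarded first; it is not the Morpheus implementation, and `on_new_batch` is a hypothetical helper.

```python
from collections import deque

TRAIN_MAX_HISTORY = 5  # small value for illustration only

# A bounded buffer: once full, appending a new row drops the oldest one.
history = deque(maxlen=TRAIN_MAX_HISTORY)

def on_new_batch(rows):
    """Add incoming rows to the history and return the rows to retrain on."""
    history.extend(rows)
    return list(history)

first = on_new_batch(["row1", "row2", "row3"])
second = on_new_batch(["row4", "row5", "row6", "row7"])
print(first)
print(second)
```

After the second batch, only the five most recent rows remain available for retraining; `row1` and `row2` have been dropped.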

In [7]:
!morpheus run pipeline-ae train-ae --help | grep 'train_max_history' -A 5

  --train_max_history INTEGER RANGE
                                  Maximum amount of rows that will be retained
                                  in history. As new data arrives, models will
                                  be retrained with a maximum number of rows
                                  specified by this value.  [default: 1000;
                                  x>=1]


---

## CloudTrail Data for Training

In this notebook, we are going to train an autoencoder model for a user called `role-g` using anonymized CloudTrail data:

In [8]:
!ls data/training-data/

role-g-training-data.csv  user123-training-data.csv


This data was collected from a period of time when we know that `role-g` was not compromised.

In [9]:
import pandas as pd

In [10]:
df = pd.read_csv('data/training-data/role-g-training-data.csv')

In [11]:
df.shape

(1257, 16)

In [12]:
df.dtypes

Unnamed: 0                                          int64
eventID                                             int64
eventTime                                          object
userIdentityaccountId                              object
eventSource                                        object
eventName                                          object
sourceIPAddress                                    object
userAgent                                          object
userIdentitytype                                   object
userIdentityprincipalId                            object
userIdentityarn                                    object
userIdentityaccessKeyId                            object
userIdentitysessionContextsessionIssueruserName    object
errorCode                                          object
errorMessage                                       object
apiVersion                                         object
dtype: object

In [13]:
df.head()

Unnamed: 0.1,Unnamed: 0,eventID,eventTime,userIdentityaccountId,eventSource,eventName,sourceIPAddress,userAgent,userIdentitytype,userIdentityprincipalId,userIdentityarn,userIdentityaccessKeyId,userIdentitysessionContextsessionIssueruserName,errorCode,errorMessage,apiVersion
0,0,0,2021-10-01T01:10:14Z,Account-123456789,lopez-byrd.info,GetSendQuota,208.49.113.40,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...,FederatedUser,39c71b3a-ad54-4c28-916b-3da010b92564,arn:aws:4a40df8e-c56a-4e6c-acff-f24eebbc4512,,role-g,success,,
1,1,1,2021-10-01T01:10:14Z,Account-123456789,lopez-byrd.info,GetSendQuota,208.49.113.40,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...,FederatedUser,0baf594e-28c1-46cf-b261-f60b4c4790d1,arn:aws:573fd2d9-4345-487a-9673-87de888e4e10,,role-g,success,,
2,2,2,2021-10-01T01:10:14Z,Account-123456789,robinson.com,ListTagsForResource,208.49.113.40,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...,FederatedUser,7f8a985f-df3b-4c5c-92c0-e8bffd68abbf,arn:aws:c8c23266-13bb-4d89-bce9-a6eef8989214,ACPOSBUM5JG5BOW7B2TR,role-g,success,,1984-11-26
3,3,3,2021-10-01T01:10:15Z,Account-123456789,lin.com,DescribeManagedPrefixLists,208.49.113.40,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...,FederatedUser,3771df71-71a0-4c1b-a9df-130574a78999,arn:aws:1f1aab81-e00c-4364-bb39-9ec6d522353c,ABTHWOIIC0L5POZJM2FF,role-g,success,,1990-05-27
4,4,4,2021-10-01T01:10:15Z,Account-123456789,lopez-byrd.info,GetSendQuota,208.49.113.40,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...,FederatedUser,930bbed6-7bdd-4799-97c1-a50d04c2df79,arn:aws:fa23d2c0-3692-451f-ac5c-62cc45577a0d,,role-g,success,,


---

## CloudTrail Data for Pipeline Source

The incoming source for the pipeline, the data on which we will be performing processing and inference, is also CloudTrail data that may or may not be compromised:

In [14]:
!ls data/input-data

role-g-input-data.csv  user123-input-data.csv


---

## Explicitly Setting the Number of Pipeline Threads

Morpheus pipelines can be configured to execute using multiple CPU threads. At the time of writing, the default number of internal pipeline threads that Morpheus uses is 4:

In [15]:
!morpheus run --help | grep 'num_threads' -A 1

  --num_threads INTEGER RANGE     Number of internal pipeline threads to use
                                  [default: 4; x>=1]


In upcoming releases of Morpheus, the [default number of threads will be set to 1](https://github.com/nv-morpheus/Morpheus/blob/64564e1eb051f6820b13772074736e09bdc8941f/morpheus/config.py#L179-L180). For the interactive programming environment you are using right now, running the autoencoder pipeline with more than 1 internal pipeline thread can cause unexpected memory access errors. With that in mind, we will explicitly set the number of pipeline threads to 1 when using the autoencoder pipeline:

```bash
morpheus run \
  --num_threads=1 \
  ...
```

---

## The Autoencoder Pipeline

We are now ready to look at the actual pipeline we are going to be running with an understanding of all its constituent parts:

```sh
morpheus run \
  --num_threads=1 \
  pipeline-ae \
    --userid_filter="role-g" \
    --userid_column_name="userIdentitysessionContextsessionIssueruserName" \
  from-cloudtrail \
    --input_glob="data/input-data/*.csv" \
  train-ae \
    --train_data_glob="data/training-data/*.csv" \
    --seed 42 \
  preprocess \
  inf-pytorch \
  add-scores \
  serialize \
  to-file \
    --filename="data/output/role-g-output.csv" \
    --overwrite
```

In summary, this pipeline will train a digital fingerprint autoencoder model for user `role-g` using all CSV files in `data/training-data`. It will feed all CSV data from `data/input-data` into the pipeline, performing inference against the trained autoencoder for `role-g`'s activity. The pipeline will add inference results (using the `add-scores` stage) to each message and finally write the results to `data/output/role-g-output.csv`.

---

## Run the Pipeline

Execute the following cell to run the pipeline.

In [16]:
!morpheus run \
  --num_threads=1 \
  pipeline-ae \
    --userid_filter="role-g" \
    --userid_column_name="userIdentitysessionContextsessionIssueruserName" \
  from-cloudtrail \
    --input_glob="data/input-data/*.csv" \
  train-ae \
    --train_data_glob="data/training-data/*.csv" \
    --seed 42 \
  preprocess \
  inf-pytorch \
  add-scores \
  serialize \
  to-file \
    --filename="data/output/role-g-output.csv" \
    --overwrite

Configuring Pipeline via CLI
C++ is disabled for AutoEncoder pipelines at this time.
Starting pipeline via CLI... Ctrl+C to Quit
512


---

## Examine the Output

The pipeline was configured to write its output to `data/output/role-g-output.csv`.

In [17]:
import pandas as pd

In [18]:
output = pd.read_csv('data/output/role-g-output.csv')

In [19]:
output.dtypes

Unnamed: 0                                           int64
_index_                                              int64
eventID                                              int64
eventTime                                           object
userIdentityaccountId                               object
eventSource                                         object
eventName                                           object
sourceIPAddress                                     object
userAgent                                           object
userIdentitytype                                    object
apiVersion                                          object
userIdentityprincipalId                             object
userIdentityarn                                     object
userIdentityaccessKeyId                             object
userIdentitysessionContextsessionIssueruserName     object
errorCode                                           object
errorMessage                                        obje

Of particular interest to us in the output is the newly added column `ae_anomaly_score` to which we will now turn our attention.

---

## Autoencoder Anomaly Scores

The added `ae_anomaly_score` column gives each row's reconstruction loss: a measure of how much the autoencoder's reconstruction of that row differs from the original row. A greater loss indicates that the trained autoencoder model did less well at reconstructing the row. The higher the loss, the less the incoming data matches the digital fingerprint created during training of the autoencoder model.

Because the digital fingerprint was created from data collected when no attack was taking place, incoming data whose loss exceeds a threshold we define should be considered anomalous and sent to teams for further investigation.

In [20]:
output['ae_anomaly_score'].describe()

count    314.000000
mean       1.670528
std        0.374099
min        1.156825
25%        1.426554
50%        1.545133
75%        1.858500
max        2.911405
Name: ae_anomaly_score, dtype: float64

In [21]:
output.sort_values(by='ae_anomaly_score', ascending=False)['ae_anomaly_score']

284    2.911405
242    2.797282
37     2.781512
153    2.647699
148    2.647699
         ...   
245    1.225285
70     1.194002
11     1.184147
67     1.156832
208    1.156825
Name: ae_anomaly_score, Length: 314, dtype: float64

Although we can see that there is a range of loss scores over the data, they are not yet meaningful to us. In order to really understand whether these loss scores indicate anomalous behavior, we should construct z-scores.

---

## Z-Scores

A high loss score by itself may not be enough to decide whether the data should be investigated further. A better, easy-to-obtain metric is the [z-score](https://en.wikipedia.org/wiki/Standard_score), which describes how many standard deviations a given value lies from the mean.

The z-score is calculated by taking a data point's value minus the average value for this data and then dividing by the data's standard deviation:

```python
z_score = (data.value - data.mean) / data.standard_deviation
```
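As a quick worked check, the formula can be applied by hand to the summary statistics that `describe()` reported for `ae_anomaly_score` in the previous section (mean 1.670528, standard deviation 0.374099, maximum 2.911405). The variable names here are only illustrative.

```python
# Summary statistics copied from the describe() output for ae_anomaly_score.
mean = 1.670528
standard_deviation = 0.374099
max_score = 2.911405

# z-score of the largest reconstruction loss.
z_score = (max_score - mean) / standard_deviation
print(round(z_score, 3))  # 3.317
```

The largest reconstruction loss therefore sits roughly 3.3 standard deviations above the mean.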

Here we calculate the z-score for each row's reconstruction loss:

In [22]:
output['zscore'] = ( output['ae_anomaly_score'] - output['ae_anomaly_score'].mean() ) / output['ae_anomaly_score'].std()

In [23]:
output['zscore'].describe()

count    3.140000e+02
mean    -1.923444e-16
std      1.000000e+00
min     -1.373172e+00
25%     -6.521624e-01
50%     -3.351919e-01
75%      5.024665e-01
max      3.316975e+00
Name: zscore, dtype: float64

In [24]:
output.sort_values(by='zscore', ascending=False)['zscore']

284    3.316975
242    3.011914
37     2.969760
153    2.612067
148    2.612067
         ...   
245   -1.190172
70    -1.273795
11    -1.300140
67    -1.373154
208   -1.373172
Name: zscore, Length: 314, dtype: float64

The lower the z-score threshold you set, the more events you will flag for further investigation. The higher the threshold, the fewer false positives you will investigate, at the risk of missing genuinely anomalous events. Your organization can choose, and adjust over time, the most appropriate threshold for its data.

---

## Analysis

We typically recommend starting with a z-score threshold of 2 or 4 to warrant further investigation. For this example let us assume a z-score threshold of 4.

In the above output the highest z-score was less than 4. In this case it is reasonable to assume that we did not find any particularly anomalous behavior for this user in the data that passed through the pipeline.
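Applying a threshold can be sketched in a few lines of pandas. The scores below are hypothetical (they are not the pipeline output), and `flag_anomalies` is an illustrative helper rather than part of Morpheus; the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical anomaly scores; in the notebook these come from the pipeline output.
output = pd.DataFrame({
    "ae_anomaly_score": [1.2, 1.3, 1.25, 1.4, 1.35, 1.3, 1.2, 1.45, 1.3, 4.0]
})

scores = output["ae_anomaly_score"]
output["zscore"] = (scores - scores.mean()) / scores.std()

def flag_anomalies(frame, threshold):
    """Return the rows whose z-score exceeds the threshold."""
    return frame[frame["zscore"] > threshold]

# Compare how many rows each candidate threshold would send for investigation.
for threshold in (2, 4):
    flagged = flag_anomalies(output, threshold)
    print(threshold, len(flagged))
```

With this toy data, a threshold of 2 flags the single outlying row while a threshold of 4 flags nothing, which mirrors the trade-off between catching more events and investigating fewer false positives.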

---

## Next

Now that you have gained some familiarity using the autoencoder pipeline to both create a digital fingerprint for a user, and to perform inference against that fingerprint with new incoming data, it's your turn to use this technique to identify a hacked user account.

Please continue to the next notebook.