![DLI Header](../images/DLI_Header.png)

# Running the Autoencoder Pipeline on Multiple Users

In previous notebooks we built pipelines for autoencoder-based digital fingerprinting and time series analysis of a single user/service. In this notebook we show how Morpheus pipelines can train models and perform inference for an arbitrary number of users.

## Objectives

By the time you complete this notebook you will:

- Construct autoencoder pipelines that create models and perform inference on arbitrary numbers of users.
- Be able to perform autoencoder z-score and time series anomaly analysis on output data containing multiple users.

---

## Removing the User ID Filter

In previous notebooks, when using `pipeline-ae`, we have been setting the optional `--userid_filter` option to specify a single user for which the pipeline should train a model and perform inference. Note its use in the following pipeline, representative of our work in previous notebooks:

```bash
morpheus run \
  --num_threads=1 \
  pipeline-ae \
    --userid_filter="role-g" \
    --userid_column_name="userIdentitysessionContextsessionIssueruserName" \
  from-cloudtrail \
    --input_glob="data/input-data/*.csv" \
  train-ae \
    --train_data_glob="data/training-data/*.csv" \
    --seed 42 \
  preprocess \
  inf-pytorch \
  add-scores \
  timeseries \
    --resolution=10m \
    --zscore_threshold=8.0 \
  serialize \
  to-file \
    --filename="data/output/output.csv" \
    --overwrite
```

If we choose to omit the `--userid_filter` option, then the pipeline will look for every unique value in the `--userid_column_name` column and will train an autoencoder model and perform inference for each of these unique values.
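As a rough illustration of the difference between filtering on one user and handling every user, the sketch below splits a small table by the user ID column. The `split_by_user` helper and the toy records are hypothetical and are not part of Morpheus; the sketch only mirrors the idea of "one model per unique value", not how the pipeline implements it.

```python
import pandas as pd

USER_COL = "userIdentitysessionContextsessionIssueruserName"


def split_by_user(df, userid_filter=None):
    """Hypothetical helper: return {user_id: rows}, optionally for one user only."""
    if userid_filter is not None:
        df = df[df[USER_COL] == userid_filter]
    # One group per unique value in the user ID column.
    return {user: rows for user, rows in df.groupby(USER_COL)}


# Toy records standing in for CloudTrail rows (made-up values).
records = pd.DataFrame({
    USER_COL: ["role-g", "user123", "role-g", "user123", "role-g"],
    "eventName": ["AssumeRole", "DescribeInstances", "AssumeRole",
                  "RunInstances", "GetObject"],
})

# With a filter: a single user, so a single model would be trained.
print(sorted(split_by_user(records, userid_filter="role-g")))  # ['role-g']

# Without a filter: every unique user gets its own group.
print(sorted(split_by_user(records)))  # ['role-g', 'user123']
```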

## Running the AE Pipeline for Multiple Users

Here we remove the `--userid_filter` option to train autoencoders and perform inference for multiple users. We have also set the `--log_level` option to `DEBUG` so that the pipeline output includes log messages indicating that models have been trained for more than one user. In this case our data contains two unique users in the `--userid_column_name` column, `user123` and `role-g`:

In [1]:
!morpheus --log_level=DEBUG run \
    --num_threads=1 \
  pipeline-ae \
    --userid_column_name="userIdentitysessionContextsessionIssueruserName" \
  from-cloudtrail \
    --input_glob="data/input-data/*.csv" \
  train-ae \
    --train_data_glob="data/training-data/*.csv" \
    --seed 42 \
  preprocess \
  inf-pytorch \
  add-scores \
  timeseries \
    --resolution=10m \
    --zscore_threshold=8.0 \
  serialize \
  to-file \
    --filename="data/output/output.csv" \
    --overwrite

Configuring Pipeline via CLI
C++ is disabled for AutoEncoder pipelines at this time.
Loaded columns. Current columns: [['eventSource', 'eventName', 'sourceIPAddress', 'userAgent', 'userIdentitytype', 'requestParametersroleArn', 'requestParametersroleSessionName', 'requestParametersdurationSeconds', 'responseElementsassumedRoleUserassumedRoleId', 'responseElementsassumedRoleUserarn', 'apiVersion', 'userIdentityprincipalId', 'userIdentityarn', 'userIdentityaccountId', 'userIdentityaccessKeyId', 'userIdentitysessionContextsessionIssuerprincipalId', 'userIdentitysessionContextsessionIssueruserName', 'tlsDetailsclientProvidedHostHeader', 'requestParametersownersSetitems', 'requestParametersmaxResults', 'requestParametersinstancesSetitems', 'errorCode', 'errorMessage', 'requestParametersmaxItems', 'responseElementsrequestId', 'responseElementsinstancesSetitems', 'requestParametersgroupSetitems', 'requestParametersinstanceType', 'requestParametersmonitoringenabled', 'req

---

## Explore the Results

The pipeline was configured to write its output to `data/output/output.csv`.

In [2]:
import pandas as pd

In [3]:
output = pd.read_csv('data/output/output.csv')

## Identify Number of Unique Users

Here we obtain the number of unique users in the output data.

In [4]:
unique_users = output['userIdentitysessionContextsessionIssueruserName'].unique()

In [5]:
print(len(unique_users))

2
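Beyond the number of users, it can help to see how many rows each user contributes, since per-user statistics computed later depend on that. A minimal sketch on a toy dataframe (the values are made up; only the column name comes from the pipeline output):

```python
import pandas as pd

USER_COL = "userIdentitysessionContextsessionIssueruserName"

# Toy stand-in for the `output` dataframe (hypothetical values).
toy_output = pd.DataFrame({
    USER_COL: ["role-g", "role-g", "user123", "role-g", "user123"],
})

# Number of distinct users, equivalent to len(series.unique()).
n_users = toy_output[USER_COL].nunique()

# Number of rows per user.
rows_per_user = toy_output[USER_COL].value_counts()

print(n_users)
print(rows_per_user)
```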


## Create Z-Scores for Each User

Recall that rather than using the autoencoder reconstruction loss values directly, we instead convert them to their z-scores, which tell us how many standard deviations away from a mean reconstruction loss value they are:

```python
output['zscore'] = ( output['ae_anomaly_score'] - output['ae_anomaly_score'].mean() ) / output['ae_anomaly_score'].std()
```

Now that we have an `output` dataframe containing multiple users, we need to modify our z-score calculation since both `output['ae_anomaly_score'].mean()` and `output['ae_anomaly_score'].std()` would give us the mean and standard deviation, respectively, for all users instead of for just the values associated with the single user represented in a given row of the dataframe.
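One way to write the per-user calculation, using notation introduced here for illustration only (it does not appear in the pipeline or its output): let $s_i$ be the `ae_anomaly_score` in row $i$, and let $u(i)$ be the user in that row. Then

$$
z_i = \frac{s_i - \mu_{u(i)}}{\sigma_{u(i)}}
$$

where $\mu_{u(i)}$ and $\sigma_{u(i)}$ are the mean and standard deviation of the anomaly scores taken over only the rows that belong to user $u(i)$.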

With that in mind, here is one approach to creating the `zscore` column: for each user, compute the mean and standard deviation of only that user's anomaly scores, then use them to calculate the z-scores for that user's rows:

In [6]:
# `unique_users` was defined above.
anomaly_score = output['ae_anomaly_score']

for user in unique_users:
    # Match rows for this user.
    is_user = output['userIdentitysessionContextsessionIssueruserName'] == user

    # Mean for this user's anomaly scores.
    user_anomaly_score_mean = anomaly_score[is_user].mean()

    # Standard deviation for this user's anomaly scores.
    user_anomaly_score_std = anomaly_score[is_user].std()

    # Create z-scores for this user's rows only.
    output.loc[is_user, 'zscore'] = ( anomaly_score[is_user] - user_anomaly_score_mean ) / user_anomaly_score_std
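The same per-user z-score can also be computed without an explicit loop by using pandas `groupby` with `transform`, which returns each group's statistic aligned to the original rows. This is an alternative sketch on a toy dataframe with made-up scores, not the approach the notebook itself uses:

```python
import pandas as pd

USER_COL = "userIdentitysessionContextsessionIssueruserName"

# Toy stand-in for the `output` dataframe (hypothetical values).
toy_output = pd.DataFrame({
    USER_COL: ["role-g"] * 3 + ["user123"] * 3,
    "ae_anomaly_score": [1.0, 2.0, 3.0, 10.0, 20.0, 60.0],
})

# Group the anomaly scores by user.
scores_by_user = toy_output.groupby(USER_COL)["ae_anomaly_score"]

# transform() broadcasts each user's mean/std back onto that user's rows.
user_mean = scores_by_user.transform("mean")
user_std = scores_by_user.transform("std")

toy_output["zscore"] = (toy_output["ae_anomaly_score"] - user_mean) / user_std
print(toy_output)
```

Within each user, the resulting z-scores have a mean of 0 and a standard deviation of 1, regardless of how different the users' raw score ranges are.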

## Check for High Z-Scores

Let's assume any z-score greater than 2 is considered high, and obtain the scores and user IDs for any entry with a high z-score:

In [7]:
# Set threshold for a high z-score.
zscore_threshold = 2

# Get all z-scores higher than threshold.
high_zscore = output['zscore'] > zscore_threshold

# Get z-score and user name for all z-scores exceeding high z-score threshold.
high_zscores_names = output[high_zscore][['zscore', 'userIdentitysessionContextsessionIssueruserName']]

In [8]:
# Print z-score and user name for all z-scores exceeding high z-score threshold.
high_zscores_names

|  | zscore | userIdentitysessionContextsessionIssueruserName |
|---|---|---|
| 1 | 2.048213 | role-g |
| 29 | 2.668827 | role-g |
| 37 | 3.152065 | role-g |
| 75 | 2.601259 | role-g |
| 101 | 2.668827 | role-g |
| ... | ... | ... |
| 1129 | 2.365076 | user123 |
| 1130 | 3.133815 | user123 |
| 1131 | 2.174497 | user123 |
| 1146 | 2.116456 | user123 |


In [9]:
# Print users with high zscore entries.
high_zscores_names['userIdentitysessionContextsessionIssueruserName'].unique()

array(['role-g', 'user123'], dtype=object)

With the threshold set to 2, both `role-g` and `user123` have entries with high z-scores. As in previous notebooks, `user123` is the user/service with z-scores exceeding 4.
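To compare how strongly each user exceeds a threshold, one option is to summarize the high z-score rows per user. The sketch below counts those rows and reports the largest z-score for each user. The dataframe values are invented for illustration and do not reflect the actual pipeline output:

```python
import pandas as pd

USER_COL = "userIdentitysessionContextsessionIssueruserName"

# Toy stand-in for the `output` dataframe (hypothetical z-scores).
toy_output = pd.DataFrame({
    USER_COL: ["role-g", "role-g", "role-g", "user123", "user123", "user123"],
    "zscore": [0.5, 2.1, 3.2, 1.0, 2.4, 4.6],
})

zscore_threshold = 2

# Keep only rows above the threshold.
high = toy_output[toy_output["zscore"] > zscore_threshold]

# Per user: how many high rows there are, and the largest z-score.
summary = high.groupby(USER_COL)["zscore"].agg(["count", "max"])
print(summary)
```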

## Get Users with Time Series Anomalies

In addition to autoencoder-based anomaly detection, we also configured the pipeline above to identify time series anomalies over each 10-minute period, setting the time series z-score threshold to 8:

```bash
  timeseries \
    --resolution=10m \
    --zscore_threshold=8.0 \
```

Here we get any user in our output that exhibited this kind of time series anomaly:

In [10]:
output[output['ts_anomaly'] == True]['userIdentitysessionContextsessionIssueruserName'].unique()

array(['user123'], dtype=object)
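To build intuition for what a 10-minute resolution means, the sketch below buckets toy events into 10-minute windows per user and counts them. It illustrates only the bucketing idea and does not reproduce how the `timeseries` stage scores each window. The `event_dt` column name and all timestamps are assumptions made for this example:

```python
import pandas as pd

USER_COL = "userIdentitysessionContextsessionIssueruserName"

# Toy events (hypothetical users' timestamps; `event_dt` is an assumed column name).
events = pd.DataFrame({
    USER_COL: ["role-g", "role-g", "user123", "user123", "user123", "user123"],
    "event_dt": pd.to_datetime([
        "2022-01-01 00:01", "2022-01-01 00:14",
        "2022-01-01 00:02", "2022-01-01 00:03",
        "2022-01-01 00:07", "2022-01-01 00:25",
    ]),
})

# Round each timestamp down to the start of its 10-minute window.
events["window"] = events["event_dt"].dt.floor("10min")

# Count events per user per window; bursts show up as unusually large counts.
counts = events.groupby([USER_COL, "window"]).size()
print(counts)
```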

---

## Next

In the next section you will complete your final coding exercise of the workshop, building a digital fingerprint and time series analysis pipeline from scratch, creating models for and performing inference on multiple users.

Please continue to the next notebook.