![DLI Header](../images/DLI_Header.png)

# Exercise: End to End Pipeline

Your task in this final exercise is to build pipelines that perform digital fingerprinting and time series analysis, and to use their output to identify a user whose behavior is anomalous compared to their typical patterns.

## Guidelines

The data/training-data directory contains data from periods when the behavior of 20 different users was typical. Using all of this data, your pipeline should train an autoencoder for each unique user found there.

Perform inference on all data in the data/input-data directory, which contains input data for the same 20 users represented in the data/training-data directory.
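
As a quick sanity check before building the pipeline, the sketch below counts the unique users in each directory and confirms that the two sets match. It assumes the files are CSVs containing the `userIdentitysessionContextsessionIssueruserName` column used later in this notebook; `users_in` is a hypothetical helper name, not part of Morpheus.

```python
import glob

import pandas as pd

# Column assumed to identify the user in every CSV file.
USER_COLUMN = "userIdentitysessionContextsessionIssueruserName"


def users_in(pattern, user_column=USER_COLUMN):
    """Return the set of unique user names across all CSV files matching `pattern`."""
    users = set()
    for path in sorted(glob.glob(pattern)):
        # Read only the user column to keep memory use low.
        frame = pd.read_csv(path, usecols=[user_column])
        users.update(frame[user_column].dropna().unique())
    return users


training_users = users_in("data/training-data/*.csv")
input_users = users_in("data/input-data/*.csv")

print(len(training_users), len(input_users))
print("Same users in both sets:", training_users == input_users)
```

If the two sets differ, some users in the input data would have no trained autoencoder to compare against, which is worth knowing before running inference.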

In addition to autoencoder-based digital fingerprinting, your pipeline should also perform time series analysis on all incoming data, looking for anomalies within each 10-minute period.
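
To make the idea of a 10-minute window concrete, here is a minimal pandas sketch that counts events per window and flags windows whose count has a high z-score. It illustrates the general technique only and is not how the Morpheus `timeseries` stage is implemented. The `event_time` column name, the `flag_windows` helper, the made-up events, and the threshold of 3.0 in the usage lines are all assumptions.

```python
import pandas as pd


def flag_windows(events, time_column="event_time", resolution="10min", zscore_threshold=10.0):
    """Count events per fixed-width time window and flag unusually busy windows."""
    timestamps = pd.DatetimeIndex(pd.to_datetime(events[time_column]))
    # Number of events in each window; empty windows count as 0.
    counts = pd.Series(1, index=timestamps).sort_index().resample(resolution).sum()
    # Z-score of each window's count relative to all windows.
    zscores = (counts - counts.mean()) / counts.std()
    return pd.DataFrame({"count": counts, "zscore": zscores, "anomaly": zscores > zscore_threshold})


# Made-up events: 13 windows with 2 events each, plus a burst of 48 extra events in one window.
window_starts = pd.date_range("2022-01-01 00:00", periods=13, freq="10min")
times = list(window_starts) * 2 + [window_starts[6]] * 48
events = pd.DataFrame({"event_time": times})

result = flag_windows(events, zscore_threshold=3.0)
print(result[result["anomaly"]])
```

With only 13 windows, a single outlier cannot reach a z-score of 10, which is why the usage lines lower the threshold. On a longer series a stricter threshold becomes meaningful.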

## Completing the Course Assessment

Before you begin your work, please open and read [the course assessment questions](https://courses.nvidia.com/courses/course-v1:DLI+C-DS-03+V1/courseware/85f2a3ac16a0476685257996b84001ad/9ef2f68fb10d40c5b54b783392938d04/2?activate_block_id=block-v1%3ADLI%2BC-DS-03%2BV1%2Btype%40vertical%2Bblock%40674c42f0c40a4956971946f596a0ff78), which should guide your work.

After you have successfully answered each of the assessment questions, you will be qualified to [generate a certificate of competency](https://courses.nvidia.com/courses/course-v1:DLI+C-DS-03+V1/progress) for the workshop.

## Continuing Your Work at a Later Time

If you cannot complete the assessment during the time allotted for the workshop, you may return to this interactive environment later to continue working on it. To save work done in this notebook, open the JupyterLab _File_ menu and select _Download_. When you restart the interactive environment, upload the notebook by dragging and dropping it into the JupyterLab file viewer, then resume your work.

---

## Your Work Here

If you prefer, you can also open a JupyterLab terminal and run your pipelines there.

In [1]:
!morpheus --log_level=DEBUG run \
    --num_threads=1 \
  pipeline-ae \
    --userid_column_name="userIdentitysessionContextsessionIssueruserName" \
  from-cloudtrail \
    --input_glob="data/input-data/*.csv" \
  train-ae \
    --train_data_glob="data/training-data/*.csv" \
    --seed 42 \
  preprocess \
  inf-pytorch \
  add-scores \
  timeseries \
    --resolution=10m \
    --zscore_threshold=10.0 \
  serialize \
  to-file \
    --filename="data/output/output.csv" \
    --overwrite

Configuring Pipeline via CLI
C++ is disabled for AutoEncoder pipelines at this time.
Loaded columns. Current columns: [['apiVersion', 'errorCode', 'errorMessage', 'eventName', 'eventSource', 'sourceIPAddress', 'tlsDetailsclientProvidedHostHeader', 'userAgent', 'userIdentityaccessKeyId', 'userIdentityaccountId', 'userIdentityarn', 'userIdentityprincipalId', 'userIdentitysessionContextsessionIssueruserName']]
Starting pipeline via CLI... Ctrl+C to Quit
Config: 
{
  "ae": {
    "feature_columns": [
      "apiVersion",
      "errorCode",
      "errorMessage",
      "eventName",
      "eventSource",
      "sourceIPAddress",
      "tlsDetailsclientProvidedHostHeader",
      "userAgent",
      "userIdentityaccessKeyId",
      "userIdentityaccountId",
      "userIdentityarn",
      "userIdentityprincipalId",
      "userIdentitysessionContextsessionIssueruserName"
    ],
    "userid_column_name": "userIdentitysessionContextsessionIssueruserName",
    "userid_f

In [2]:
import pandas as pd
output = pd.read_csv('data/output/output.csv')
unique_users = output['userIdentitysessionContextsessionIssueruserName'].unique()
print(len(unique_users))

20


In [3]:
# Compute a per-user z-score for each row's autoencoder anomaly score.
# `unique_users` was defined above.
user_column = 'userIdentitysessionContextsessionIssueruserName'

for user in unique_users:
    # Match rows for this user.
    is_user = output[user_column] == user

    # Anomaly scores for this user only.
    user_scores = output.loc[is_user, 'ae_anomaly_score']

    # Z-score: distance from this user's mean, in units of this user's standard deviation.
    # Assigning to a single column label creates the `zscore` column on first use.
    output.loc[is_user, 'zscore'] = (user_scores - user_scores.mean()) / user_scores.std()
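
The loop above can also be written without explicit iteration. The sketch below computes the same per-user z-score with `groupby(...).transform(...)`. It is an alternative formulation rather than the approach the exercise requires, and the small `sample` frame is made-up data so that the block runs on its own.

```python
import pandas as pd

USER_COLUMN = "userIdentitysessionContextsessionIssueruserName"

# Made-up scores for two users, standing in for the pipeline output.
sample = pd.DataFrame({
    USER_COLUMN: ["user1"] * 4 + ["user2"] * 4,
    "ae_anomaly_score": [1.0, 1.2, 0.8, 1.0, 5.0, 5.5, 4.5, 9.0],
})

# Group each row with the other rows for the same user.
scores_by_user = sample.groupby(USER_COLUMN)["ae_anomaly_score"]

# transform() returns per-user statistics aligned to the original rows,
# so the subtraction and division operate row by row.
user_mean = scores_by_user.transform("mean")
user_std = scores_by_user.transform("std")
sample["zscore"] = (sample["ae_anomaly_score"] - user_mean) / user_std

print(sample)
```

Both versions use the sample standard deviation (pandas' default), so they should produce the same values.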

In [4]:
# Set threshold for a high z-score.
zscore_threshold = 5

# Get all z-scores higher than threshold.
high_zscore = output['zscore'] > zscore_threshold

# Get z-score and user name for all z-scores exceeding high z-score threshold.
high_zscores_names = output[high_zscore][['zscore', 'userIdentitysessionContextsessionIssueruserName']]


In [5]:
# Print z-score and user name for all z-scores exceeding high z-score threshold.
high_zscores_names

|      | zscore   | userIdentitysessionContextsessionIssueruserName |
|------|----------|-------------------------------------------------|
| 6281 | 5.896379 | user17                                          |


In [6]:
# Print users with high zscore entries.
high_zscores_names['userIdentitysessionContextsessionIssueruserName'].unique()

In [7]:
output[output['ts_anomaly'] == True]['userIdentitysessionContextsessionIssueruserName'].unique()

array(['user17'], dtype=object)
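
One way to cross-check the two signals is to intersect the users flagged by each. The sketch below does this on a small made-up frame. The `zscore` and `ts_anomaly` column names follow the cells above, while the data values and the threshold of 5 are for illustration only.

```python
import pandas as pd

USER_COLUMN = "userIdentitysessionContextsessionIssueruserName"

# Made-up rows standing in for the pipeline output with z-scores added.
sample = pd.DataFrame({
    USER_COLUMN: ["user1", "user2", "user2", "user3"],
    "zscore": [0.3, 6.1, 0.2, 5.4],
    "ts_anomaly": [False, True, False, False],
})

# Users with at least one high per-user z-score.
high_zscore_users = set(sample.loc[sample["zscore"] > 5, USER_COLUMN])

# Users with at least one row flagged by the time series stage.
ts_anomaly_users = set(sample.loc[sample["ts_anomaly"], USER_COLUMN])

# Users flagged by both methods.
flagged_by_both = sorted(high_zscore_users & ts_anomaly_users)
print(flagged_by_both)
```

A user who appears in both sets is flagged by two independent checks, which makes a stronger case than either signal alone.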

---

## Next

In the final section, you will learn how to get access to Morpheus, about services that can enhance your use of Morpheus, and about additional resources to assist you in your development.

Please continue to the next notebook.