![DLI Header](../images/DLI_Header.png)

# Time Series Analysis in the Autoencoder Pipeline

While the autoencoder (AE) model is a useful tool for describing whether incoming data has deviated from its established digital fingerprint, Morpheus also lets us perform time series analysis on incoming data. In this notebook you'll learn how to incorporate time series analysis into `pipeline-ae`.

## Objectives

By the time you complete this notebook you will:

- Understand how time series analysis can provide insight in addition to the use of autoencoders.
- Be able to perform time series analysis as part of `pipeline-ae`.

---

## Time Series Analysis Using Fast Fourier Transform (FFT)

Machine application activity tends to oscillate over time, and attacker activities can be difficult to detect among the periodic noise in data with just a volumetric alert. To find subtle anomalies inside periodic data, you transform the data from the time domain to the frequency domain using the [fast Fourier transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform) (FFT). You then reconstruct the signal back to the time domain with the inverse FFT (iFFT), but use only the top 90% of frequencies. A large difference between the original signal and the reconstructed signal indicates the times at which the machine's activity is unusual and potentially compromised by malicious human activity.
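
As a rough illustration of the transform, filter, and reconstruct steps described above, here is a minimal NumPy sketch. It is not Morpheus' implementation: the function name `fft_reconstruct`, the `keep_fraction` parameter, and the synthetic hourly signal are all hypothetical, and the sketch assumes that "top 90% of frequencies" means the 90% of frequency bins with the largest magnitude.

```python
import numpy as np


def fft_reconstruct(signal, keep_fraction=0.9):
    """Rebuild a signal from its strongest frequency components.

    Assumption: "top 90%" is read here as the 90% of frequency bins with
    the largest magnitude. Morpheus' internal selection rule may differ.
    """
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal)                 # time -> frequency domain
    n_keep = int(np.ceil(keep_fraction * spectrum.size))
    strongest = np.argsort(np.abs(spectrum))[-n_keep:]
    filtered = np.zeros_like(spectrum)
    filtered[strongest] = spectrum[strongest]      # zero out the weakest bins
    return np.fft.irfft(filtered, n=signal.size)   # frequency -> time domain


# Hypothetical activity counts: a clean daily cycle sampled hourly for 10 days.
hours = np.arange(240)
periodic = 10 + 3 * np.sin(2 * np.pi * hours / 24)

# The same cycle with one unexpected burst of activity at hour 100.
spiked = periodic.copy()
spiked[100] += 5

periodic_residual = periodic - fft_reconstruct(periodic)
spiked_residual = spiked - fft_reconstruct(spiked)

print("max residual, periodic only:", np.abs(periodic_residual).max())
print("residual at the burst:", abs(spiked_residual[100]))
```

For the purely periodic signal the residual is effectively zero, because the few bins that carry the cycle are among the strongest and are kept. For the signal with a burst, part of the burst cannot be rebuilt from the retained bins, so the residual is clearly nonzero at hour 100.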

Morpheus applies FFTs by learning what a normal period, or periods, of activity looks like for a given user/service and machine/service system interaction. It then performs the decomposition and applies a rolling z-score to the transformed data, which we can use to flag anomalous periods.
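
To make the idea of a rolling z-score concrete, the following pandas sketch scores each point against the window of points that precede it. The helper `rolling_zscore`, the window length, and the sample values are hypothetical and chosen only for illustration. Morpheus' own windowing and scoring details are not shown in this notebook and may differ.

```python
import numpy as np
import pandas as pd


def rolling_zscore(values, window):
    """Z-score of each point relative to the `window` points before it."""
    series = pd.Series(values, dtype=float)
    # shift(1) so each point is compared against its history, not itself.
    history = series.shift(1).rolling(window)
    return (series - history.mean()) / history.std()


# Hypothetical reconstruction errors: small and repetitive, with one outlier.
errors = np.tile([0.10, 0.20, 0.15, 0.25], 12)
errors[40] = 3.0

threshold = 8.0  # same role as --zscore_threshold
z = rolling_zscore(errors, window=12)
flagged = z[z > threshold].index.tolist()
print("flagged positions:", flagged)
```

The first `window` points have no complete history, so their score is `NaN` and they are never flagged in this sketch.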

---

## More on FFT

Understanding that FFT is a technique well suited to spotting anomalous activity over a period of time is sufficient to use it effectively in Morpheus pipelines, especially for the duration of this workshop. However, for those of you who would like a deeper intuition for how FFT works, consider watching [this 20-minute 3Blue1Brown video](https://www.youtube.com/watch?v=spUNpyF58BY&ab_channel=3Blue1Brown) at a later time.

---

## Using Time Series in Morpheus

To perform time series anomaly detection in `pipeline-ae`, we add the `timeseries` stage to the pipeline.

In [1]:
!morpheus run pipeline-ae --help | grep timeseries

  timeseries       Perform time series anomaly detection and add prediction.


We set `--resolution` to the time period we wish to bin data into, that is, the length of each period that will be judged anomalous or not.

In [2]:
!morpheus run pipeline-ae timeseries --help | grep 'resolution' -A 3

  --resolution TEXT               Time series resolution. Logs will be binned
                                  into groups of this size. Uses the pandas
                                  time delta format, i.e. '10m' for 10 minutes
                                  [default: 1 h]
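
To see what binning at a given resolution looks like, here is a small pandas sketch. The timestamps are made up, and `resample` is used only to illustrate the idea of grouping logs into fixed-width bins. It is not a claim about how the `timeseries` stage bins data internally.

```python
import pandas as pd

# Hypothetical event timestamps standing in for the `eventTime` column.
events = pd.DataFrame({
    "eventTime": pd.to_datetime([
        "2022-01-01 00:01:00",
        "2022-01-01 00:04:30",
        "2022-01-01 00:09:59",
        "2022-01-01 00:12:00",
        "2022-01-01 00:31:00",
    ])
})

# Parse the resolution as a pandas time delta, where 'm' means minutes.
resolution = pd.Timedelta("10m")

# Count events in consecutive bins of width `resolution`.
counts = events.set_index("eventTime").resample(resolution).size()
print(counts)
```

Each row of `counts` is one 10-minute period. Empty periods appear with a count of 0, so a quiet stretch is still represented in the series.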


We also set `--zscore_threshold` to determine how many standard deviations a given period of time needs to deviate from the mean values in order to be identified as anomalous by Morpheus.

In [3]:
!morpheus run pipeline-ae timeseries --help | grep 'zscore_threshold' -A 6

  --zscore_threshold FLOAT RANGE  The z-score threshold required to flag
                                  datapoints. The value indicates the number
                                  of standard deviations from the mean that is
                                  required to be flagged. Increasing this
                                  value will decrease the number of
                                  detections.  [default: 8.0; x>=0.0;
                                  required]
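
The effect of raising the threshold can be sketched with a handful of made-up z-scores. The values below are hypothetical, and the strict `>` comparison is an assumption, since the help text does not say whether a score equal to the threshold is flagged.

```python
import numpy as np

# Hypothetical per-period z-scores, as a time series stage might compute them.
zscores = np.array([0.3, 1.1, 2.7, 0.8, 4.5, 9.2, 0.2, 6.1, 12.4, 1.9])

# Number of periods flagged at each candidate threshold.
detections = {t: int((zscores > t).sum()) for t in (2.0, 4.0, 8.0)}

for threshold, count in detections.items():
    print(f"zscore_threshold={threshold}: {count} period(s) flagged")
```

A higher threshold flags fewer periods, which trades missed anomalies for fewer false alarms.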


When using `timeseries`, Morpheus adds a new boolean column, `ts_anomaly`, indicating whether or not the time series stage identified anomalous behavior in the data.

---

## Run the Pipeline with Time Series Analysis

Below is the same pipeline we used when learning how to train and perform inference with autoencoders, only now it includes a `timeseries` stage with the configurations described above.

In [4]:
!morpheus run \
  --num_threads=1 \
  pipeline-ae \
    --userid_filter="role-g" \
    --userid_column_name="userIdentitysessionContextsessionIssueruserName" \
  from-cloudtrail \
    --input_glob="data/input-data/*.csv" \
  train-ae \
    --train_data_glob="data/training-data/*.csv" \
    --seed 42 \
  preprocess \
  inf-pytorch \
  add-scores \
  timeseries \
    --resolution=10m \
    --zscore_threshold=8.0 \
  serialize \
  to-file \
    --filename="data/output/output.csv" \
    --overwrite

Configuring Pipeline via CLI
C++ is disabled for AutoEncoder pipelines at this time.
Starting pipeline via CLI... Ctrl+C to Quit
512


---

## Explore the Results

In [5]:
import pandas as pd

In [6]:
output = pd.read_csv('data/output/output.csv')

Note the new boolean column `ts_anomaly` created by the pipeline. The value of this column will indicate whether or not the data exceeded the set z-score threshold when passed through the time series analysis.

In [7]:
output.dtypes

Unnamed: 0                                           int64
_index_                                              int64
eventID                                              int64
eventTime                                           object
userIdentityaccountId                               object
eventSource                                         object
eventName                                           object
sourceIPAddress                                     object
userAgent                                           object
userIdentitytype                                    object
apiVersion                                          object
userIdentityprincipalId                             object
userIdentityarn                                     object
userIdentityaccessKeyId                             object
userIdentitysessionContextsessionIssueruserName     object
errorCode                                           object
errorMessage                                        obje

In [8]:
output['ts_anomaly'].describe()

count       314
unique        1
top       False
freq        314
Name: ts_anomaly, dtype: object

It appears that the pipeline identified no time series anomalies in the `role-g` data we ran through it.
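
When a run does flag anomalies, a boolean mask on `ts_anomaly` is one simple way to pull out the affected rows. The sketch below uses a small made-up DataFrame in place of `data/output/output.csv`. The column names come from the output above, but the row values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the pipeline output.
output = pd.DataFrame({
    "eventTime": [
        "2022-01-01T00:01:00Z",
        "2022-01-01T00:12:00Z",
        "2022-01-01T00:14:00Z",
    ],
    "eventName": ["GetObject", "PutObject", "DeleteObject"],
    "ts_anomaly": [False, True, True],
})

# Keep only the rows the time series stage marked as anomalous.
anomalies = output[output["ts_anomaly"]]
print(f"{len(anomalies)} of {len(output)} rows flagged")
print(anomalies[["eventTime", "eventName"]])
```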

---

## Next

Now that you are familiar with using the `timeseries` stage as part of `pipeline-ae`, you will perform time series analysis yourself in the next exercise to try to identify anomalous periods of user activity.

Please continue to the next notebook.