# Anomaly Detection using LSTM Encoder-Decoder

This is the second of four notebooks that define and train an encoder-decoder LSTM architecture for anomaly detection. Each of the following four steps is completed in its own notebook:

- __Preprocess__: Preprocess raw ICEWS data into time series for training and evaluating a model.
- __Train__: Create and train a model with pre-processed and cleaned data.
- __Threshold calculation__: Use the residuals from a validation set to determine an anomaly detection threshold.
- __Inference__: Run anomaly detection on data from various countries to assess performance.

In [None]:
import sys
sys.path.append('..')

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, Dataset
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
import sklearn.preprocessing as pre

import joblib
from util import data, preprocess, icews
from models import networks

from tqdm.notebook import tqdm

sns.set()
plt.rcParams['figure.figsize'] = (10, 5)
np.random.seed(123)

## Loading pre-processed Data

In [None]:
# Load training and validation data:
train_data = joblib.load('data/icews_ml/7day_training.joblib')
valid_data = joblib.load('data/icews_ml/7day_validate.joblib')

# Create train/validation DataLoaders
train_loader = DataLoader(train_data, batch_size=1, shuffle=True, num_workers=0)
valid_loader = DataLoader(valid_data, batch_size=1, shuffle=True, num_workers=0)

len(train_data), len(valid_data)

## Defining a model
- The variable ``window`` defines the length (in days) of the subsequences that the LSTM autoencoder reconstructs. The time series are chunked into subsequences of this length.

- ``n_features`` is the number of features in the unencoded data. Together with ``window`` and a scaling factor (``factor``), it sets the size of the embedding space: ``emb_size = n_features * window * factor``.
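
As a rough illustration of the two points above, the sketch below chunks a toy series into windows and computes the embedding size. The array ``series`` and the choice of non-overlapping windows are assumptions made for illustration; the actual chunking is done in the preprocessing notebook and may differ (for example, it could use sliding windows).

```python
import numpy as np

window = 7       # days per subsequence
n_features = 5   # Intensity, QC1, ..., QC4
factor = 4       # scaling factor for the embedding size

# Hypothetical series: 30 daily observations of 5 features (not real ICEWS data)
series = np.arange(30 * n_features, dtype=float).reshape(30, n_features)

# Assumption: non-overlapping windows, dropping the incomplete tail
n_windows = len(series) // window
chunks = series[: n_windows * window].reshape(n_windows, window, n_features)

emb_size = int(n_features * window * factor)

print(chunks.shape)  # (4, 7, 5)
print(emb_size)      # 140
```

With these values, each 7-day window of 5 features (35 numbers) maps to an embedding of size 140, so the embedding is wider than the flattened input.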



In [None]:
# Define Anomaly Detection Period (days)
window = 7 

# Number of input features and embedding size
n_features = 5 # Intensity, QC1, ..., QC4
factor = 4
emb_size = int(n_features * window * factor)

# Initialize new model
model = networks.LSTMEncoderDecoder(n_features, emb_size)

print("embedding space size: ", emb_size)

## Train model
The model is then trained with a predefined autoencoder training function, which can be found in ``models/networks.py``.
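
The body of ``networks.train_encoder`` is not shown in this notebook. As a minimal sketch of what a reconstruction training loop of this kind typically looks like, the hypothetical function below feeds each batch through the model and scores the output against the input itself. The function name, the Adam optimizer, the per-batch loss averaging, and the toy linear model are all assumptions, not the project's implementation.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset


def train_autoencoder_sketch(model, n_epochs, train_loader, criterion,
                             lr=1e-4, valid_loader=None):
    """Hypothetical reconstruction loop; NOT networks.train_encoder."""
    optimizer = optim.Adam(model.parameters(), lr=lr)  # optimizer is an assumption
    history = {"train": [], "valid": []}
    for _ in range(n_epochs):
        model.train()
        train_loss = 0.0
        for (batch,) in train_loader:
            optimizer.zero_grad()
            # Autoencoder objective: the target is the input itself
            loss = criterion(model(batch), batch)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        history["train"].append(train_loss / len(train_loader))

        if valid_loader is not None:
            model.eval()
            valid_loss = 0.0
            with torch.no_grad():  # no gradient updates on validation data
                for (batch,) in valid_loader:
                    valid_loss += criterion(model(batch), batch).item()
            history["valid"].append(valid_loss / len(valid_loader))
    return history


# Toy usage: random windows and a stand-in linear autoencoder (not an LSTM)
torch.manual_seed(0)
toy_data = TensorDataset(torch.rand(16, 7, 5))  # 16 windows, 7 days, 5 features
toy_loader = DataLoader(toy_data, batch_size=4)
toy_model = nn.Sequential(nn.Linear(5, 3), nn.Linear(3, 5))
history = train_autoencoder_sketch(toy_model, 3, toy_loader,
                                   nn.MSELoss(reduction="sum"),
                                   lr=1e-2, valid_loader=toy_loader)
```

Tracking a training and a validation loss per epoch gives the two curves that a loss plot would compare. With ``reduction='sum'``, the loss is summed over every element in a batch rather than averaged, so its magnitude scales with the window length and feature count.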

In [None]:
objective = nn.MSELoss(reduction='sum')
loss = networks.train_encoder(model, 500, 
                     train_loader, 
                     criterion=objective,
                     lr=1e-4,
                     testload=valid_loader, 
                     reverse=False)

In [None]:
sns.lineplot(data=loss)
plt.legend(['Training Loss', 'Valid Loss']);

In [None]:
torch.save(model.state_dict(), f'data/models/ae_lstm_mse_sum_{window}d_{factor}f.pt')