## Unsupervised Autoencoder Anomaly Detection

1. Train an autoencoder to capture a latent representation of the entire dataset.
2. Extract the latent representation of every data point.
3. Append each point's reconstruction error to its latent feature vector.
4. Cluster the augmented dataset to find its natural groupings.
5. Train a second autoencoder on only the data points in the groupings treated as normal.
6. Apply automatic thresholding using head/tail breaks.
7. Test performance.
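
Step 6 can be illustrated with a minimal sketch of head/tail breaks applied to reconstruction errors. This is not the notebook's own implementation: the function name `head_tail_breaks`, the `head_limit` parameter, and the sample error values are assumptions made for illustration. The idea is to split the values at their mean, keep the part above the mean (the head), and repeat while the head remains a minority of the current values.

```python
from statistics import fmean


def head_tail_breaks(values, head_limit=0.4):
    """Return the list of break points found by iterative head/tail breaks.

    values: iterable of non-negative scores, e.g. reconstruction errors.
    head_limit: assumed cut-off; splitting stops once the head makes up
        more than this fraction of the values it was split from.
    """
    values = list(values)
    breaks = []
    while len(values) > 1:
        mean = fmean(values)
        head = [v for v in values if v > mean]
        breaks.append(mean)
        # Stop when nothing lies above the mean or the head is no longer a minority.
        if not head or len(head) / len(values) > head_limit:
            break
        values = head
    return breaks


# Hypothetical reconstruction errors: mostly small, with a few large ones.
errors = [0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.05, 0.9, 1.0, 5.0]
breaks = head_tail_breaks(errors)

# Using the last break as the anomaly threshold is one possible choice;
# an earlier break gives a more permissive threshold.
threshold = breaks[-1]
flagged = [e for e in errors if e > threshold]
```

Which break to use as the threshold is a design decision: the first break flags everything above the overall mean, while later breaks flag only the most extreme errors.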

In [2]:
import sys
import os
import pandas as pd
import torch
sys.path.append('..')

from lib.autoencoder import Autoencoder

## Experimental Parameters

In [3]:
# Location of csv training file
ds_train_file = "../Datasets/thyroid_train.csv"
# Location of csv test file
ds_test_file = "../Datasets/thyroid_test.csv"

# Dataframe instance for training data
df_train = pd.read_csv(ds_train_file)
# Dataframe instance for test data
df_test = pd.read_csv(ds_test_file)

# Expected number of features
n_features = 21

# Training features: every column except the label column 'y'
df_train_x = df_train.drop(['y'], axis=1)
# Training labels (a pandas Series)
df_train_y = df_train['y']

# Test features: every column except the label column 'y'
df_test_x = df_test.drop(['y'], axis=1)
# Test labels (a pandas Series)
df_test_y = df_test['y']

# Autoencoder layer sizes: the first element is the size of the input vector; the remaining elements are the encoder's hidden layer sizes
layers = [n_features, 10]

# Autoencoder parameter for hidden activation
h_activation = 'relu'

# Autoencoder parameter for output activation
o_activation = 'sigmoid'

# Autoencoder parameter for learning rate
learning_rate = 0.001

# Torch parameter for device
device = 'cpu'

# Training parameter for number of epochs
epochs = 100

# Training parameter for batch size
batch_size = 10
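
The `layers` convention described above (input size first, then the encoder's hidden sizes) can be made concrete with a small sketch. It assumes the decoder mirrors the encoder, which is a common design but is not confirmed here for `lib.autoencoder`; the helper name `mirrored_layer_sizes` is hypothetical.

```python
def mirrored_layer_sizes(layers):
    """Expand an encoder spec into (in_dim, out_dim) pairs.

    layers: list such as [21, 10]; the first element is the input size,
        the rest are the encoder's hidden layer sizes.
    Returns (encoder_pairs, decoder_pairs), where the decoder is assumed
    to be the encoder reversed.
    """
    if len(layers) < 2:
        raise ValueError("layers needs an input size and at least one hidden size")
    encoder = list(zip(layers[:-1], layers[1:]))
    decoder = [(out_dim, in_dim) for in_dim, out_dim in reversed(encoder)]
    return encoder, decoder


# With the notebook's setting, the 21 input features are compressed to a
# 10-dimensional latent code and then expanded back to 21 outputs.
encoder, decoder = mirrored_layer_sizes([21, 10])
deep_encoder, deep_decoder = mirrored_layer_sizes([21, 15, 10])
```

Under this assumption the last element of `layers` is the dimensionality of the latent representation used in the later clustering step.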

## Train Autoencoder Model

The first autoencoder is trained on all of the training data, ignoring the labels, to learn a latent representation of the dataset.
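
The output of this first model feeds steps 2 and 3 of the outline: each latent vector is extended with that row's reconstruction error. The sketch below uses plain Python lists as stand-ins for the model's latent codes and reconstructions, so it stays independent of the project's `Autoencoder` class. The function name `append_reconstruction_error` and the use of mean squared error as the error measure are assumptions.

```python
def append_reconstruction_error(latent, inputs, reconstructions):
    """Return latent rows with each row's mean squared reconstruction error appended.

    latent: list of latent vectors, one per data point.
    inputs: list of original feature vectors.
    reconstructions: list of autoencoder outputs for those inputs.
    """
    if not (len(latent) == len(inputs) == len(reconstructions)):
        raise ValueError("latent, inputs and reconstructions must have the same length")
    augmented = []
    for z, x, x_hat in zip(latent, inputs, reconstructions):
        # Mean squared error between the input row and its reconstruction.
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        augmented.append(list(z) + [mse])
    return augmented


# Toy stand-ins: 2-dimensional latent codes for two data points.
latent = [[0.1, 0.2], [0.3, 0.4]]
inputs = [[1.0, 0.0], [0.0, 1.0]]
reconstructions = [[1.0, 0.0], [0.5, 0.5]]
augmented = append_reconstruction_error(latent, inputs, reconstructions)
```

Each augmented row has one more feature than the latent code, so a poorly reconstructed point can separate from well reconstructed ones during clustering even when their latent codes are close.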

In [None]:
model = Autoencoder(layers=layers, h_activation=h_activation, o_activation=o_activation, device=device)

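# Represent the training features as a float tensor on the target device
x = torch.tensor(df_train_x.values).float().to(device)

# Wrap the tensor in the project's dataset class
# (AutoencoderDataset is not imported in the cells shown above)
train_ds = AutoencoderDataset(x=x)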