# AutoThreshold Test

This notebook tests the following features of the `AutoThresholdRe` class:

1. Passing a pre-trained autoencoder to determine a threshold
2. Classifying anomalous data
3. Measuring performance

In this example, we train an autoencoder on `creditcardfraud.csv`. The `AutoThresholdRe` class is then used to compute an automatic threshold for identifying anomalous data.

In [1]:
# Import necessary libraries and path relative to project
import torch
import pandas as pd
import numpy as np

import sys
import os

sys.path.append(os.path.join(os.path.abspath(''), '../pyno/lib'))

from autoencoder import Autoencoder
from auto_threshold_re import AutoThresholdRe

## Autoencoder Training

Similar to Autoencoder Test, train an autoencoder with topology `[29, 27, 25]` against the credit card dataset.
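
The topology list gives layer widths from the input layer down to the innermost latent layer. As a minimal sketch of how such a list could expand into a full encoder/decoder stack, the snippet below assumes the decoder mirrors the encoder. The helper `mirror_topology` is hypothetical and is not part of the `Autoencoder` class, whose internals are not shown in this notebook.

```python
def mirror_topology(layers):
    """Expand encoder widths into (in, out) pairs for a symmetric autoencoder.

    Assumption: the decoder mirrors the encoder. This helper is hypothetical
    and only illustrates the topology; it is not the project's implementation.
    """
    # Encoder widths followed by the same widths reversed, without
    # repeating the innermost latent layer.
    widths = list(layers) + list(reversed(layers[:-1]))
    # Each consecutive pair of widths corresponds to one dense layer.
    return list(zip(widths[:-1], widths[1:]))


pairs = mirror_topology([29, 27, 25])
print(pairs)
```

Under that assumption, `[29, 27, 25]` would describe four dense layers: two compressing the 29 input features to a 25-dimensional code, and two expanding it back to 29 outputs.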

In [2]:
# The topology of the model from input layer to innermost latent layer
layers = [29, 27, 25]

h_activation = 'relu'
o_activation = 'sigmoid'
device = torch.device('cpu')
error_type = 'mse'
optimizer_type = 'adam'

# Initialize the autoencoder
autoencoder = Autoencoder(
                layers=layers, 
                h_activation=h_activation, 
                o_activation=o_activation, 
                device=device, 
                error_type=error_type, 
                optimizer_type=optimizer_type)

# Chunk size for reading data
chunksize = 10000

# Path to the dataset; change this if your copy lives elsewhere
dataset_file = '../data/creditcardfraud.csv'

print("Loading dataset '{}'...".format(dataset_file))

# Read each chunk, then concatenate once at the end
# (DataFrame.append was removed in pandas 2.0)
chunks = []
for i, chunk in enumerate(pd.read_csv(dataset_file, header=None, chunksize=chunksize)):
    print("Reading chunk %d" % (i + 1))
    chunks.append(chunk)

data = pd.concat(chunks)

print("Done loading dataset...")
    
# Check for proper value of input dimensionality to be used by model
input_dim = len(data.columns) - 1
print("Input Dimensionality: %d" % (input_dim))

# Partition the data into positive_data and negative_data
positive_data = data[data[input_dim] == 1].iloc[:,:input_dim]
negative_data = data[data[input_dim] == -1].iloc[:,:input_dim]

# x holds only the normal samples (label 1); the autoencoder is trained on these.
# Convert it to a float tensor before passing it to the model for training
x = torch.tensor(positive_data.values).float()

epochs = 100
lr = 0.005
batch_size = 10000

autoencoder.fit(
    x, 
    epochs=epochs, 
    lr=lr,
    batch_size=batch_size)

Loading dataset '../data/creditcardfraud.csv'...
Reading chunk 1
Reading chunk 2
Reading chunk 3
Reading chunk 4
Reading chunk 5
Reading chunk 6
Reading chunk 7
Reading chunk 8
Reading chunk 9
Reading chunk 10
Reading chunk 11
Reading chunk 12
Reading chunk 13
Reading chunk 14
Reading chunk 15
Reading chunk 16
Reading chunk 17
Reading chunk 18
Reading chunk 19
Reading chunk 20
Reading chunk 21
Reading chunk 22
Reading chunk 23
Reading chunk 24
Reading chunk 25
Reading chunk 26
Reading chunk 27
Reading chunk 28
Reading chunk 29
Done loading dataset...
Input Dimensionality: 29
Epoch: 1	Loss: 0.59087
Epoch: 2	Loss: 0.22608
Epoch: 3	Loss: 0.21531
Epoch: 4	Loss: 0.21393
Epoch: 5	Loss: 0.21254
Epoch: 6	Loss: 0.21088
Epoch: 7	Loss: 0.20770
Epoch: 8	Loss: 0.20238
Epoch: 9	Loss: 0.19742
Epoch: 10	Loss: 0.19293
Epoch: 11	Loss: 0.18980
Epoch: 12	Loss: 0.18780
Epoch: 13	Loss: 0.18690
Epoch: 14	Loss: 0.18621
Epoch: 15	Loss: 0.18515
Epoch: 16	Loss: 0.18374
Epoch: 17	Loss: 0.18060
Epoch: 18	Loss: 0.1

## AutoThreshold computation

Create an instance of `AutoThresholdRe` and compute the threshold from the initial training data.

In [3]:
cmd = AutoThresholdRe(x, autoencoder)
cmd.execute()

print("Optimal Threshold: {}".format(cmd.optimal_threshold))

Optimal Threshold: 0.19802218160861892
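
This notebook does not show how `AutoThresholdRe` derives its threshold. As a rough illustration of what a reconstruction-error threshold can look like, the sketch below computes a per-sample mean squared error between inputs and reconstructions, then takes a high percentile of the training errors as the cut-off. The percentile rule, the function names, and the synthetic arrays are all assumptions made for illustration, not the method used by the class.

```python
import numpy as np


def reconstruction_errors(x, x_hat):
    """Per-sample mean squared error between inputs and reconstructions."""
    return np.mean((x - x_hat) ** 2, axis=1)


def percentile_threshold(errors, q=95.0):
    """Hypothetical rule: flag the top (100 - q)% of training errors."""
    return float(np.percentile(errors, q))


# Synthetic stand-ins for the training inputs and the autoencoder output.
rng = np.random.default_rng(0)
x_train = rng.random((1000, 29))
x_recon = x_train + rng.normal(scale=0.05, size=x_train.shape)

errors = reconstruction_errors(x_train, x_recon)
threshold = percentile_threshold(errors, q=95.0)
print("Threshold: {:.5f}".format(threshold))
```

A percentile rule requires choosing `q` by hand, which is the kind of manual tuning an automatic threshold is meant to avoid.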


## Classification with AutoThresholdRe

Use the `predict(x)` method of `AutoThresholdRe` to classify data as either normal (`1`) or anomalous (`-1`) using the `optimal_threshold` computed by the class. `x` should be a PyTorch tensor; `predict` returns a NumPy array.

In the code below, we pass all the anomalous data and check whether the model classifies it correctly as `-1`.

In [4]:
outliers = torch.tensor(negative_data.values).float()

predictions = cmd.predict(outliers)

print(predictions)

[-1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1  1  1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1  1 -1 -1 -1  1 -1 -1
 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1  1 -1  1 -1 -1 -1  1  1  1 -1 -1  1  1 -1 -1 -1 -1 -1 -1
 -1  1  1 -1  1 -1 -1 -1 -1  1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1  1  1
 -1  1 -1  1 -1 -1 -1  1  1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1  1 -1 -1
  1 -1 -1 -1  1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1  1 -1 -1
 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1  1  1 -1  1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1
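
The labelling convention used here (normal `1`, anomalous `-1`) could be applied to a vector of reconstruction errors roughly as follows. This is a sketch that assumes samples whose error exceeds the threshold are treated as anomalous. `label_by_threshold` and the sample errors are hypothetical, and the actual `predict` implementation may differ.

```python
import numpy as np


def label_by_threshold(errors, threshold):
    """Return 1 where error <= threshold (normal), -1 otherwise (anomalous)."""
    errors = np.asarray(errors, dtype=float)
    return np.where(errors > threshold, -1, 1)


# Made-up reconstruction errors, compared against the threshold
# reported earlier in this notebook.
sample_errors = [0.05, 0.31, 0.19, 0.20, 1.40]
labels = label_by_threshold(sample_errors, threshold=0.19802218160861892)
print(labels)
```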

In [5]:
# Classify the normal training data with the same threshold
diffs = cmd.predict(x)
print(diffs)
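
For the performance-measurement step, predictions on normal and anomalous data can be combined into standard detection metrics. The following is a minimal sketch that treats `-1` (anomalous) as the class of interest. The helper name and the toy label arrays are hypothetical and are not results from this notebook.

```python
import numpy as np


def anomaly_metrics(y_true, y_pred):
    """Precision, recall, and F1 with -1 (anomalous) as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == -1) & (y_pred == -1)))  # anomalies caught
    fp = int(np.sum((y_true == 1) & (y_pred == -1)))   # normal flagged as anomalous
    fn = int(np.sum((y_true == -1) & (y_pred == 1)))   # anomalies missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels: four normal samples followed by four anomalous ones.
y_true = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1, 1, -1]
metrics = anomaly_metrics(y_true, y_pred)
print(metrics)
```

In this notebook, `y_pred` would be the concatenation of `cmd.predict(x)` and `cmd.predict(outliers)`, with `y_true` built from the matching labels.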