# Autoencoder Test

This notebook tests the following features of the `Autoencoder` class:

1. Initialization with parameters
2. Training with `fit` method
3. Dimensionality reduction with `encode`

This example loads the `creditcardfraud.csv` dataset for testing. The dataset has `29` dimensions. The model topology is `29` neurons for the input layer, `27` for the first hidden layer, and `25` for the innermost hidden layer (also known as the latent layer).

In [1]:
# Import necessary libraries and path relative to project
import torch
import pandas as pd

import sys
import os

sys.path.append(os.path.join(os.path.abspath(''), '../pyno/lib'))

from autoencoder import Autoencoder

## Initialization Parameters

### `layers`

An array of integers giving the neuron count of each layer, from the input layer to the innermost latent layer of the autoencoder.
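
As an illustration of what such a list can describe, the sketch below expands it into per-layer weight shapes. It assumes the decoder simply mirrors the encoder, which this notebook does not confirm for the `Autoencoder` class; the helper name `mirrored_topology` is hypothetical.

```python
def mirrored_topology(layers):
    """Return (encoder, decoder) lists of (in_features, out_features) pairs.

    Assumption: the decoder is the encoder reversed.
    """
    encoder = list(zip(layers[:-1], layers[1:]))
    reversed_layers = layers[::-1]
    decoder = list(zip(reversed_layers[:-1], reversed_layers[1:]))
    return encoder, decoder


encoder_shapes, decoder_shapes = mirrored_topology([29, 27, 25])
print(encoder_shapes)  # [(29, 27), (27, 25)]
print(decoder_shapes)  # [(25, 27), (27, 29)]
```

Under that assumption, listing only the encoder half is enough to determine the full network.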

### `h_activation`

The activation function to be used for hidden layers. Possible values:

* `relu`

Default value: `relu`

### `o_activation`

The activation function to be used for the output layer. Possible values:

* `sigmoid`

Default value: `sigmoid`
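
To make the two activation names concrete, here is a minimal sketch that maps them to the corresponding `torch.nn` modules and applies them to a small tensor. The `ACTIVATIONS` lookup table is a hypothetical illustration; the `Autoencoder` class may resolve these names differently.

```python
import torch

# Hypothetical name-to-module lookup, for illustration only.
ACTIVATIONS = {"relu": torch.nn.ReLU, "sigmoid": torch.nn.Sigmoid}

h_act = ACTIVATIONS["relu"]()
o_act = ACTIVATIONS["sigmoid"]()

z = torch.tensor([-2.0, 0.0, 2.0])
hidden = h_act(z)  # negative values are clipped to 0
output = o_act(z)  # every value is squashed into the open interval (0, 1)

print(hidden)
print(output)
```

Because a sigmoid output always lies strictly between 0 and 1, reconstructions can only match inputs that fall in that range.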

### `device`

Torch device to be used for training. Default value: `torch.device("cpu")` (the machine's CPU).
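
A common way to choose the device is to use a GPU when one is available and fall back to the CPU otherwise. The sketch below uses only standard PyTorch calls; whether the `Autoencoder` class moves its inputs to the device itself is not shown in this notebook.

```python
import torch

# Use CUDA if PyTorch can see a GPU, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors can be created directly on the chosen device.
sample = torch.zeros(2, 29, device=device)
print(sample.device.type)
```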

### `error_type`

The algorithm used to compute the overall error for an epoch. It also determines the `diff` function, which computes the residual error between input and output. Possible values:

* `mse`

Default value: `mse`
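
For reference, mean squared error can be computed per row (a residual for each record) and then averaged into a single value. The sketch below is an assumption about what such a computation could look like; `row_mse` is a hypothetical helper, not the class's actual `diff` implementation.

```python
import torch


def row_mse(x, x_hat):
    """Per-row mean squared error between inputs and reconstructions."""
    return ((x - x_hat) ** 2).mean(dim=1)


x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
x_hat = torch.tensor([[1.0, 0.0], [3.0, 4.0]])

residuals = row_mse(x, x_hat)  # tensor([2., 0.])
overall = residuals.mean()     # tensor(1.)

# With equal-length rows, the mean of the row errors matches PyTorch's MSE loss.
reference = torch.nn.functional.mse_loss(x_hat, x)
print(residuals, overall, reference)
```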

### `optimizer_type`

The torch optimizer to be used for backpropagation. Possible values:

* `adam`

Default value: `adam`
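
The name `adam` presumably refers to `torch.optim.Adam`. A minimal sketch of resolving the name and building the optimizer is shown below; the `OPTIMIZERS` table and the stand-in `model` are hypothetical and only illustrate the idea.

```python
import torch

# Hypothetical name-to-class lookup, for illustration only.
OPTIMIZERS = {"adam": torch.optim.Adam}

# Stand-in model: a single layer with the same shape as the first hidden layer.
model = torch.nn.Linear(29, 27)

optimizer = OPTIMIZERS["adam"](model.parameters(), lr=0.005)
print(type(optimizer).__name__)  # Adam
```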

In [2]:
# The topology of the model from input layer to innermost latent layer
layers = [29, 27, 25]

h_activation = 'relu'
o_activation = 'sigmoid'
device = torch.device('cpu')
error_type = 'mse'
optimizer_type = 'adam'

# Initialize the autoencoder
autoencoder = Autoencoder(
                layers=layers, 
                h_activation=h_activation, 
                o_activation=o_activation, 
                device=device, 
                error_type=error_type, 
                optimizer_type=optimizer_type)

## Loading the Dataset

Loading the dataset means reading a file into a `pandas` `DataFrame` instance. The dataset should be a CSV file without a header row. To avoid consuming too much memory in a single read of a large dataset, provide a `chunksize` value (integer) that sets how many rows are loaded into the `DataFrame` per read. Each row of the file should have the following format:

```
x1,x2,x3...xn,y
```

where:

`x1,x2,x3...xn` are the multivariate feature values of a row and `y` is the label for that row. `y = 1` if the row corresponds to normal data while `y = -1` if the row corresponds to an anomaly.

### Data Segregation

The last column will be stripped from the data so that only the multivariate features remain. The rows will then be separated into the variables `positive_data` for all normal data and `negative_data` for all anomalies.
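
The file format and the split can be tried on a tiny in-memory example before loading the real file. The three rows below are made-up values used only for illustration; they are not taken from `creditcardfraud.csv`.

```python
import io

import pandas as pd

# Three made-up rows in the x1,x2,y layout (two features, then the label).
csv_text = "0.1,0.2,1\n0.3,0.4,-1\n0.5,0.6,1\n"

# header=None makes pandas name the columns 0, 1, 2, ...
frame = pd.read_csv(io.StringIO(csv_text), header=None)

label_col = frame.columns[-1]   # the last column holds the label
features = frame.iloc[:, :-1]   # everything except the label

positive = features[frame[label_col] == 1]    # normal rows
negative = features[frame[label_col] == -1]   # anomalies

print(len(positive), len(negative))  # 2 1
```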

In [3]:
# Chunk size for reading data
chunksize = 10000

# Path to the dataset file. Change this as needed.
dataset_file = '../data/creditcardfraud.csv'

print("Loading dataset '{}'...".format(dataset_file))

# Read the file chunk by chunk, then concatenate once at the end.
# (DataFrame.append was removed in pandas 2.0.)
chunks = []
for i, chunk in enumerate(pd.read_csv(dataset_file, header=None, chunksize=chunksize)):
    print("Reading chunk %d" % (i + 1))
    chunks.append(chunk)

data = pd.concat(chunks)

print("Done loading dataset...")
    
# Check for proper value of input dimensionality to be used by model
input_dim = len(data.columns) - 1
print("Input Dimensionality: %d" % (input_dim))

# Partition the data into positive_data and negative_data
positive_data = data[data[input_dim] == 1].iloc[:,:input_dim]
negative_data = data[data[input_dim] == -1].iloc[:,:input_dim]

# x representing all data regardless of label.
# Need to convert it to a tensor before passing it to the model for training
x = torch.tensor(data.iloc[:,:input_dim].values).float()

Loading dataset '../data/creditcardfraud.csv'...
Reading chunk 1
Reading chunk 2
Reading chunk 3
Reading chunk 4
Reading chunk 5
Reading chunk 6
Reading chunk 7
Reading chunk 8
Reading chunk 9
Reading chunk 10
Reading chunk 11
Reading chunk 12
Reading chunk 13
Reading chunk 14
Reading chunk 15
Reading chunk 16
Reading chunk 17
Reading chunk 18
Reading chunk 19
Reading chunk 20
Reading chunk 21
Reading chunk 22
Reading chunk 23
Reading chunk 24
Reading chunk 25
Reading chunk 26
Reading chunk 27
Reading chunk 28
Reading chunk 29
Done loading dataset...
Input Dimensionality: 29


## Training with `fit` method

The `fit` method of the `Autoencoder` class trains the model on the dataset using backpropagation. Its parameters are as follows:

### `x`

The tensor of multivariate data representing the training set. No labels should be included. Because an autoencoder learns to reconstruct its input, `x` also serves as the training target `y`.

### `epochs`

Number of passes over the entire dataset during training. Default value is `100`.

### `lr`

The learning rate for backpropagation. Values should be `> 0` and `< 1`. Default value is `0.005`.

### `batch_size`

Number of records to be included in a batch for mini-batch training. Default value is `5`.
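
To show how these four parameters typically interact, here is a minimal mini-batch training loop written directly in PyTorch. It is a sketch under stated assumptions, not the body of `Autoencoder.fit`: the toy data, the small `model`, and the way the epoch loss is averaged are all chosen for illustration.

```python
import torch

torch.manual_seed(0)

# Toy data: 20 rows, 4 features, values in [0, 1). The input is also the target.
x = torch.rand(20, 4)

# Small stand-in autoencoder: 4 -> 3 -> 4 with ReLU hidden and sigmoid output.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 3), torch.nn.ReLU(),
    torch.nn.Linear(3, 4), torch.nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
loss_fn = torch.nn.MSELoss()

epochs, batch_size = 3, 5
epoch_losses = []
for epoch in range(epochs):
    total = 0.0
    # Walk through the data in consecutive slices of batch_size rows.
    for start in range(0, len(x), batch_size):
        batch = x[start:start + batch_size]
        optimizer.zero_grad()
        loss = loss_fn(model(batch), batch)  # reconstruction error
        loss.backward()                      # backpropagation
        optimizer.step()                     # Adam update scaled by lr
        total += loss.item() * len(batch)
    # Average the batch losses, weighted by batch length.
    epoch_losses.append(total / len(x))
    print("Epoch: %d\tLoss: %.5f" % (epoch + 1, epoch_losses[-1]))
```

A larger `batch_size` means fewer weight updates per epoch, which is one reason the value is often tuned together with `lr`.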

In [None]:
epochs = 100
lr = 0.005
batch_size = 10000

autoencoder.fit(
    x, 
    epochs=epochs, 
    lr=lr,
    batch_size=batch_size)


Epoch: 1	Loss: 0.68170
Epoch: 2	Loss: 0.22920
Epoch: 3	Loss: 0.21586
Epoch: 4	Loss: 0.21478
Epoch: 5	Loss: 0.21440
Epoch: 6	Loss: 0.21405
Epoch: 7	Loss: 0.21366
Epoch: 8	Loss: 0.21322
Epoch: 9	Loss: 0.21240
Epoch: 10	Loss: 0.21094
Epoch: 11	Loss: 0.20734
Epoch: 12	Loss: 0.20266
Epoch: 13	Loss: 0.19867
Epoch: 14	Loss: 0.19734
Epoch: 15	Loss: 0.19683
Epoch: 16	Loss: 0.19628
Epoch: 17	Loss: 0.19557
Epoch: 18	Loss: 0.19450
Epoch: 19	Loss: 0.19266
Epoch: 20	Loss: 0.18931
