# Decoders: Binary Recurrent Example

#### Placeholder
In theory it should be possible to create auto-encoders using recurrent layers. After some research and experiments with various combinations, it turned out to be difficult to build something that came even remotely close to the performance of convolutional auto-encoders on this data. The recurrent models only ever reached an F1 score of around 0.5, well below the *near* 0.7 of their convolutional counterparts.

Various approaches were tested:
- Using the last output of an LSTM as the latent vector and repeating it in the decoder.
- Using all of the outputs as the latent vector.
- Using the hidden state of the encoder as input to the decoder.
- Building a Machine Translation *Seq2Seq* style decoder.

None of these worked all that well. It is not clear whether this is due to the data, the implementation, or both. *We might revisit this in the future.*
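
As an illustration of the first approach in the list above (last LSTM output as latent vector, repeated in the decoder), here is a minimal PyTorch sketch. It is not the model that was used in these experiments and it is not part of `d373c7`; the class name `RepeatVectorLSTMAutoEncoder`, the latent size of 16 and the choice of a binary cross-entropy loss are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn


class RepeatVectorLSTMAutoEncoder(nn.Module):
    """Hypothetical sketch: encode a series into one vector, repeat it, decode it."""

    def __init__(self, n_features: int, latent_size: int, series_length: int):
        super().__init__()
        self.series_length = series_length
        self.encoder = nn.LSTM(n_features, latent_size, batch_first=True)
        self.decoder = nn.LSTM(latent_size, latent_size, batch_first=True)
        self.out = nn.Linear(latent_size, n_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, series_length, n_features)
        encoded, _ = self.encoder(x)
        # Keep only the output of the last time step as the latent vector
        latent = encoded[:, -1, :]
        # Repeat the latent vector once per time step to feed the decoder
        repeated = latent.unsqueeze(1).repeat(1, self.series_length, 1)
        decoded, _ = self.decoder(repeated)
        # Return logits; the sigmoid is applied inside the loss
        return self.out(decoded)


# Random binary input with the same layout as the series in this notebook
model = RepeatVectorLSTMAutoEncoder(n_features=107, latent_size=16, series_length=5)
batch = torch.randint(0, 2, (4, 5, 107)).float()
logits = model(batch)
loss = nn.BCEWithLogitsLoss()(logits, batch)
```

The reconstruction has the same shape as the input, so the per-record reconstruction error could be used as a fraud score, in the same way as for the other auto-encoders.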

----
#### Note on the data set 
The data set used here is not particularly complex or big, and it's not all that challenging to find the fraud. In an ideal world we'd be using more complex data sets to show the real power of Deep Learning. There are a bunch of PCA'ed data sets available, but the PCA obfuscates some of the elements that are useful.
*These examples are meant to show the possibilities; it's not so useful to interpret their performance on this data set.*

# Imports

In [1]:
import torch
import numpy as np
import gc
import datetime as dt

import d373c7.features as ft
import d373c7.engines as en
import d373c7.pytorch as pt
import d373c7.pytorch.models as pm
import d373c7.plot as pl

## Set a random seed for Numpy and Torch
> This makes sure we always sample in the same way, which makes it easier to compare results. At some point it should be removed to test the model stability.

In [2]:
# Numpy
np.random.seed(42)
# Torch
torch.manual_seed(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

## Define base features and read the file
The base features are the features found in the input file. They need to be defined first, after which the file can be read using the `EnginePandasNumpy`.
Its `from_csv` method reads the file and returns a Pandas DataFrame object. The cell below calls `to_series_stacked` instead, which reads the same file and returns the learning features as stacked series.

In [3]:
# Change this to read from another location
file = '../../../../data/bs140513_032310.csv'

In [4]:
step = ft.FeatureSource('step', ft.FEATURE_TYPE_INT_16) 
customer = ft.FeatureSource('customer', ft.FEATURE_TYPE_STRING)
age = ft.FeatureSource('age', ft.FEATURE_TYPE_CATEGORICAL)
gender = ft.FeatureSource('gender', ft.FEATURE_TYPE_CATEGORICAL)
merchant = ft.FeatureSource('merchant', ft.FEATURE_TYPE_CATEGORICAL)
category = ft.FeatureSource('category', ft.FEATURE_TYPE_CATEGORICAL)
amount = ft.FeatureSource('amount', ft.FEATURE_TYPE_FLOAT)
fraud = ft.FeatureSource('fraud', ft.FEATURE_TYPE_INT_8)


# Function to calculate the date and time from the step
def step_to_date(step_count: int):
    return dt.datetime(2020, 1, 1) + dt.timedelta(days=int(step_count))

# Derived Features
amount_binned = ft.FeatureBin('amount_bin', ft.FEATURE_TYPE_INT_16, amount, 30)
date_time = ft.FeatureExpression('date', ft.FEATURE_TYPE_DATE_TIME, step_to_date, [step])

amount_oh = ft.FeatureOneHot('amount_one_hot', ft.FEATURE_TYPE_INT_8, amount_binned)
age_oh = ft.FeatureOneHot('age_one_hot', ft.FEATURE_TYPE_INT_8, age)
gender_oh = ft.FeatureOneHot('gender_one_hot', ft.FEATURE_TYPE_INT_8, gender)
merchant_oh = ft.FeatureOneHot('merchant_one_hot', ft.FEATURE_TYPE_INT_8, merchant)
category_oh = ft.FeatureOneHot('category_one_hot', ft.FEATURE_TYPE_INT_8, category)
fraud_label = ft.FeatureLabelBinary('fraud_label', ft.FEATURE_TYPE_INT_8, fraud)

learning_features = ft.TensorDefinition(
    'learning', 
    [
        age_oh,
        gender_oh,
        merchant_oh,
        category_oh,
        amount_oh
    ])

with en.EnginePandasNumpy(num_threads=8) as e:
    series_list = e.to_series_stacked(
        learning_features, file, key_feature=customer, time_feature=date_time, window=5, inference=False
    )

print('Series Shapes')
print(series_list.shapes)
print(series_list.dtype_names)

2021-12-29 14:54:28.924 d373c7.engines.common          INFO     Start Engine...
2021-12-29 14:54:28.924 d373c7.engines.panda_numpy     INFO     Pandas Version : 1.1.4
2021-12-29 14:54:28.925 d373c7.engines.panda_numpy     INFO     Numpy Version : 1.19.2
2021-12-29 14:54:28.925 d373c7.engines.panda_numpy     INFO     Building Panda for : InternalKeyTime from file ../../../../data/bs140513_032310.csv
2021-12-29 14:54:29.137 d373c7.engines.panda_numpy     INFO     Building Panda for : <Source_Derive_Source> from DataFrame. Inference mode <False>
2021-12-29 14:54:29.137 d373c7.engines.panda_numpy     INFO     Reshaping DataFrame to: Source_Derive_Source
2021-12-29 14:54:29.143 d373c7.engines.panda_numpy     INFO     Done creating Source_Derive_Source. Shape=(594643, 7)
2021-12-29 14:54:29.592 d373c7.engines.panda_numpy     INFO     Reshaping DataFrame to: InternalKeyTime
2021-12-29 14:54:29.637 d373c7.engines.panda_numpy     INFO     Start creating stacked series for Target Tensor Definiti

Series Shapes
[(594643, 5, 107)]
['int8']
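
The shape `(594643, 5, 107)` reads as one series per input record, each holding 5 time steps of 107 one-hot encoded values. The sketch below shows one way such a stacked series could be built with plain NumPy. It is an assumption about the general technique, not the actual `to_series_stacked` implementation; the helper `stack_series` and the zero-padding of missing history are hypothetical.

```python
import numpy as np


def stack_series(features: np.ndarray, keys: np.ndarray, window: int) -> np.ndarray:
    """Hypothetical helper: stack each row with the previous rows of the same key.

    Rows are assumed to be sorted by time. Keys with fewer than `window`
    rows of history get zero-padding at the front of the series.
    """
    n_rows, n_features = features.shape
    out = np.zeros((n_rows, window, n_features), dtype=features.dtype)
    history = {}  # key -> row indices seen so far for that key
    for i, key in enumerate(keys):
        rows = history.setdefault(key, [])
        rows.append(i)
        recent = rows[-window:]
        # Right-align the available history, the current row comes last
        out[i, window - len(recent):] = features[recent]
    return out


# Toy data: 6 records, 1 feature, 2 customers
features = np.arange(1, 7, dtype=np.int8).reshape(6, 1)
keys = np.array(['a', 'b', 'a', 'a', 'b', 'a'])
series = stack_series(features, keys, window=3)
print(series.shape)        # (6, 3, 1)
print(series[0, :, 0])     # first record of 'a': mostly padding
print(series[5, :, 0])     # last record of 'a': a complete series
```

In this toy example the first record of each customer is mostly padding, while later records carry a full window of history.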


## Wrangle the data
Time to split the data. For time series data it is very important to keep the order of the data. The split below starts from the end and works its way to the front of the data. That way the training, validation and test data are nicely co-located in time. You almost *never* want to plainly shuffle time-based data.

> 1. Split out a test-set of size `test_records`. This is used for model testing.
> 2. Split out a validation-set of size `validation_records`. It will be used to monitor overfitting during training.
> 3. All the rest is considered training data.

For time-series we'll perform an additional action.
> 1. The series at the beginning of the data set will all be more or less empty as there is no history. That is not so useful during training; ideally we have records with history and complete series, sometimes named 'mature' series. We'll throw away the first couple of entries.

__Important__: please make sure the data is ordered in ascending fashion on a date(time) field. The split function does not order the data; it assumes the data is in the correct order.

For auto-encoders we perform a 5th step: all fraud records will be removed from the training and validation data. The auto-encoder will only see *non-fraud* records during training.
> 1. Remove fraud records from training and validation
