# Encoders: Binary Example
Auto-encoders are one of the more interesting neural network architectures. They are networks designed to predict their own input. An auto-encoder consists of an `encoder`, which encodes the input into a set of __latent variables__, and a `decoder`, which decodes the latent variables and tries to reconstruct the full input. The loss measures how close the reconstruction is to the original input.

Initially this may sound like a somewhat trivial exercise, but auto-encoders have some potentially interesting use cases. They do not need labels; one only needs transactions. Auto-encoders might be used for:
- __Anomaly detection__. If the loss after reconstruction is high, we could assume the input is something the model has not really seen before. It might be an anomaly.
- __Transfer learning__. Transfer learning is an interesting concept where 
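
To make the anomaly-detection idea concrete, here is a minimal sketch of scoring a record by its reconstruction loss. It assumes binary (one-hot) inputs and a binary cross-entropy loss. The function `reconstruction_loss` and the sample values are illustrative only and are not part of `d373c7`; in practice a trained decoder would supply the reconstructed probabilities.

```python
import math


def reconstruction_loss(original, reconstructed, eps=1e-7):
    """Mean binary cross-entropy between a 0/1 input and the decoder output."""
    total = 0.0
    for x, p in zip(original, reconstructed):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() never sees 0
        total += -(x * math.log(p) + (1 - x) * math.log(1.0 - p))
    return total / len(original)


record = [0, 1, 0, 0, 1]
# Hypothetical decoder outputs: one close to the input, one far from it.
good_reconstruction = [0.05, 0.90, 0.10, 0.05, 0.95]
poor_reconstruction = [0.60, 0.30, 0.40, 0.70, 0.20]

loss_good = reconstruction_loss(record, good_reconstruction)
loss_poor = reconstruction_loss(record, poor_reconstruction)
print(f'well reconstructed  : {loss_good:.3f}')
print(f'poorly reconstructed: {loss_poor:.3f}')
```

A record with a loss well above what is typical for the training data would be a candidate anomaly; where to put that threshold is a separate choice that this sketch does not cover.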

----
#### Note on the data set 
The data set used here is not particularly complex or big. It's not really all that challenging to find the fraud. In an ideal world we'd be using more complex data sets to show the real power of Deep Learning. There are a bunch of PCA'ed data sets available, but the PCA obfuscates some of the elements that are useful. 
*These examples are meant to show the possibilities; it is not very useful to interpret their performance on this data set.*

# Imports

In [1]:
import torch
import numpy as np
import gc

import d373c7.features as ft
import d373c7.engines as en
import d373c7.pytorch as pt
import d373c7.pytorch.models as pm
import d373c7.plot as pl

## Set-up device

In [2]:
print(f'Torch Version : {torch.__version__}')

# Set up the GPU if available. This will be the default device
if torch.cuda.is_available():
    device = torch.device('cuda:0')
    print(f'Cuda Version  : {torch.version.cuda}')
    print(f'GPU found. Using GPU <{device.index}>')
else:
    device = torch.device('cpu')
    print(f'No GPU found ... Using CPU {device}')

# Also set up a cpu device
cpu = torch.device('cpu')

Torch Version : 1.6.0
Cuda Version  : 10.2
GPU found. Using GPU <0>


## Set a random seed for Numpy and Torch
> Makes sure we always sample in the same way, which makes it easier to compare results. At some point it should be removed to test the model stability.

In [3]:
# Numpy
np.random.seed(42)
# Torch
torch.manual_seed(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

## Define base features and read the file
The base features are the features found in the input file. They need to be defined first, after which the file can be read using the `from_csv` method of `EnginePandasNumpy`.
The `from_csv` method reads the file and returns a Pandas DataFrame object.

In [4]:
# Change this to read from another location
file = '../../../data/bs140513_032310.csv'

In [5]:
age = ft.FeatureSource('age', ft.FEATURE_TYPE_CATEGORICAL)
gender = ft.FeatureSource('gender', ft.FEATURE_TYPE_CATEGORICAL)
merchant = ft.FeatureSource('merchant', ft.FEATURE_TYPE_CATEGORICAL)
category = ft.FeatureSource('category', ft.FEATURE_TYPE_CATEGORICAL)
amount = ft.FeatureSource('amount', ft.FEATURE_TYPE_FLOAT)
fraud = ft.FeatureSource('fraud', ft.FEATURE_TYPE_INT_8)

base_features = ft.TensorDefinition(
    'base', 
    [
        age,
        gender,
        merchant,
        category,
        amount,
        fraud
    ])

amount_binned = ft.FeatureBin('amount_bin', ft.FEATURE_TYPE_INT_16, amount, 30)

intermediate_features = ft.TensorDefinition(
    'base', 
    [
        age,
        gender,
        merchant,
        category,
        amount_binned,
        fraud
    ])

with en.EnginePandasNumpy() as e:
    df = e.from_csv(base_features, file, inference=False)
    df = e.from_df(intermediate_features, df, inference=False)
    
df

2020-10-30 15:18:48.262 d373c7.engines.common          INFO     Start Engine...
2020-10-30 15:18:48.263 d373c7.engines.panda_numpy     INFO     Pandas Version : 1.1.2
2020-10-30 15:18:48.263 d373c7.engines.panda_numpy     INFO     Numpy Version : 1.19.2
2020-10-30 15:18:48.263 d373c7.engines.panda_numpy     INFO     Building Panda for : base from file ../../../data/bs140513_032310.csv
2020-10-30 15:18:48.467 d373c7.engines.panda_numpy     INFO     Building Panda for : <base> from DataFrame. Inference mode <False>
2020-10-30 15:18:48.473 d373c7.engines.panda_numpy     INFO     Done creating base. Shape=(594643, 6)
2020-10-30 15:18:48.473 d373c7.engines.panda_numpy     INFO     Building Panda for : <base> from DataFrame. Inference mode <False>
2020-10-30 15:18:48.488 d373c7.engines.panda_numpy     INFO     Done creating base. Shape=(594643, 6)


Unnamed: 0,age,gender,merchant,category,amount_bin,fraud
0,4,M,M348934600,es_transportation,1,0
1,2,M,M348934600,es_transportation,1,0
2,4,F,M1823072687,es_transportation,1,0
3,3,M,M348934600,es_transportation,1,0
4,5,M,M348934600,es_transportation,1,0
...,...,...,...,...,...,...
594638,3,F,M1823072687,es_transportation,1,0
594639,4,F,M1823072687,es_transportation,1,0
594640,2,F,M349281107,es_fashion,1,0
594641,5,M,M1823072687,es_transportation,1,0


In [6]:
amount_oh = ft.FeatureOneHot('amount_one_hot', amount_binned)
age_oh = ft.FeatureOneHot('age_one_hot', age)
gender_oh = ft.FeatureOneHot('gender_one_hot', gender)
merchant_oh = ft.FeatureOneHot('merchant_one_hot', merchant)
category_oh = ft.FeatureOneHot('category_one_hot', category)
fraud_label = ft.FeatureLabelBinary('fraud_label', fraud)

learning_features = ft.TensorDefinition(
    'learning', 
    [
        age_oh,
        gender_oh,
        merchant_oh,
        category_oh,
        amount_oh,
        fraud_label
    ])

with en.EnginePandasNumpy() as e:
    df = e.from_df(learning_features, df, inference=False)
df

2020-10-30 15:18:50.656 d373c7.engines.common          INFO     Start Engine...
2020-10-30 15:18:50.656 d373c7.engines.panda_numpy     INFO     Pandas Version : 1.1.2
2020-10-30 15:18:50.656 d373c7.engines.panda_numpy     INFO     Numpy Version : 1.19.2
2020-10-30 15:18:50.657 d373c7.engines.panda_numpy     INFO     Building Panda for : <learning> from DataFrame. Inference mode <False>
2020-10-30 15:18:50.795 d373c7.engines.panda_numpy     INFO     Done creating learning. Shape=(594643, 108)


Unnamed: 0,age__0,age__1,age__2,age__3,age__4,age__5,age__6,age__U,gender__E,gender__F,...,amount_bin__22,amount_bin__23,amount_bin__24,amount_bin__25,amount_bin__26,amount_bin__27,amount_bin__28,amount_bin__29,amount_bin__0,fraud_label
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
594638,0,0,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
594639,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
594640,0,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
594641,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
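
The one-hot columns in the output above follow a `<feature>__<value>` naming pattern. As a rough illustration of the encoding (not the `d373c7` implementation), the sketch below builds such columns in plain Python. The helper `one_hot` and the sample values are hypothetical.

```python
def one_hot(name, values):
    """Return the column names and one 0/1 row per input value."""
    categories = sorted(set(values))
    columns = [f'{name}__{c}' for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


columns, rows = one_hot('age', ['4', '2', '4', '3', '5'])
print(columns)
for row in rows:
    print(row)
```

Each row contains exactly one `1` per encoded feature, which is why the learning data can be stored as a compact `int8` array.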


## Convert to Numpy
> From here on down we follow the same path

In [7]:
with en.EnginePandasNumpy() as e:
    data_list = e.to_numpy_list(learning_features, df)
print(data_list.shapes)
print(data_list.dtype_names)

2020-10-30 15:18:52.354 d373c7.engines.common          INFO     Start Engine...
2020-10-30 15:18:52.354 d373c7.engines.panda_numpy     INFO     Pandas Version : 1.1.2
2020-10-30 15:18:52.355 d373c7.engines.panda_numpy     INFO     Numpy Version : 1.19.2
2020-10-30 15:18:52.355 d373c7.engines.panda_numpy     INFO     Converting DataFrame to Numpy of type: int8
2020-10-30 15:18:52.355 d373c7.engines.panda_numpy     INFO     Reshaping DataFrame to: Binary
2020-10-30 15:18:52.386 d373c7.engines.panda_numpy     INFO     Converting DataFrame to Numpy of type: int8
2020-10-30 15:18:52.386 d373c7.engines.panda_numpy     INFO     Reshaping DataFrame to: Label


[(594643, 107), (594643,)]
['int8', 'int8']


## Wrangle the data
Time split the data. For time series data it is very important to keep the order of the data. This split starts from the end and works its way to the front of the data. This way the training, validation and test data are nicely co-located in time.

> 1. Split out a test-set of size `test_records`. This is used for model testing.
> 2. Split out a validation-set of size `val_records`. It will be used to monitor overfitting during training.
> 3. All the rest is considered training data.
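
The split described above can be sketched in plain Python. This is an illustration of the idea only: the helper `time_split` is hypothetical and is not the library's `split_time` method.

```python
def time_split(records, val_records, test_records):
    """Split time-ordered records into train, validation and test parts.

    The test set is taken from the end, the validation set sits just
    before it, and everything earlier is training data.
    """
    if val_records + test_records > len(records):
        raise ValueError('Not enough records for the requested split')
    test_start = len(records) - test_records
    val_start = test_start - val_records
    return records[:val_start], records[val_start:test_start], records[test_start:]


train, val, test = time_split(list(range(10)), val_records=2, test_records=3)
print(train, val, test)
```

Because nothing is shuffled, the model is always validated and tested on records that come later in time than the ones it was trained on.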

In [8]:
test_records = 100000
val_records  = 30000 

train_data, val_data, test_data = data_list.split_time(val_records, test_records) 

print(f'Training Data shapes {train_data.shapes}')
print(f'Validation Data shapes {val_data.shapes}')
print(f'Test Data shapes {test_data.shapes}')
del data_list
gc.collect()
print('Done')

Training Data shapes [(464643, 107), (464643,)]
Validation Data shapes [(30000, 107), (30000,)]
Test Data shapes [(100000, 107), (100000,)]
Done


## Define model

In [9]:
# Setup Pytorch Datasets for the training and validation
batch_size = 128
train_ds = pt.NumpyListDataSet(learning_features, train_data)
val_ds = pt.NumpyListDataSet(learning_features, val_data)
train_sampler = pt.ClassSampler(learning_features, train_data).over_sampler()

# Wrap them in a Pytorch Dataloader
train_dl = train_ds.data_loader(cpu, batch_size, num_workers=2, sampler=train_sampler)
val_dl = val_ds.data_loader(cpu, batch_size, num_workers=2)
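
The training loader above uses an over-sampler. How `ClassSampler` works internally is not shown in this notebook. A common approach, assumed here, is to draw training records with a probability inversely proportional to their class frequency so that the rare class shows up about as often as the common one. The sketch uses only the standard library; the names `oversample_indices` and `labels` are hypothetical.

```python
import random
from collections import Counter


def oversample_indices(labels, n_samples, seed=42):
    """Draw indices with replacement so each class is roughly equally likely."""
    counts = Counter(labels)
    # Records of a rare class get a proportionally larger weight.
    weights = [1.0 / counts[label] for label in labels]
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=weights, k=n_samples)


labels = [0] * 990 + [1] * 10   # toy data with a rare positive class
sampled = oversample_indices(labels, n_samples=2000)
positive_share = sum(labels[i] for i in sampled) / len(sampled)
print(f'share of positives after over-sampling: {positive_share:.2f}')
```

Sampling with replacement means the same rare records are repeated many times, so a validation set that is not over-sampled remains important for spotting overfitting.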