# Encoders: Binary Example

----
#### Note on the data set 
The data set used here is small and not particularly complex, so finding the fraud is not very challenging. Ideally we would use more complex data sets to show the real power of Deep Learning. Several PCA-transformed data sets are available, but the PCA obfuscates some of the elements that are useful.
*These examples are meant to show the possibilities; their performance on this data set should not be over-interpreted.*

## Imports

In [5]:
import torch
import numpy as np
import gc

import d373c7.features as ft
import d373c7.engines as en
import d373c7.pytorch as pt
import d373c7.pytorch.models as pm
import d373c7.plot as pl

## Set-up device

In [6]:
print(f'Torch Version : {torch.__version__}')

# Set up the GPU if available. This will be the default device
if torch.cuda.is_available():
    device = torch.device('cuda:0')
    print(f'Cuda Version  : {torch.version.cuda}')
    print(f'GPU found. Using GPU <{device.index}>')
else:
    device = torch.device('cpu')
    print(f'No GPU found ... Using CPU {device}')

# Also set up a cpu device
cpu = torch.device('cpu')

Torch Version : 1.6.0
Cuda Version  : 10.2
GPU found. Using GPU <0>


## Set a random seed for Numpy and Torch
> Makes sure we always sample in the same way, which makes it easier to compare results. At some point the seed should be removed to test the model's stability.

In [8]:
# Numpy
np.random.seed(42)
# Torch
torch.manual_seed(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
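
As a minimal sketch of why the seed matters, the snippet below draws random numbers twice with the same seed and once with a different one. The helper `sample_with_seed` is a hypothetical name introduced for illustration only; it uses NumPy alone and says nothing about how Torch or d373c7 sample internally.

```python
import numpy as np


def sample_with_seed(seed: int, n: int = 5) -> np.ndarray:
    """Seed NumPy's global generator, then draw n uniform samples."""
    np.random.seed(seed)
    return np.random.rand(n)


first = sample_with_seed(42)
second = sample_with_seed(42)
other = sample_with_seed(7)

# Same seed: identical draws. Different seed: different draws.
print(np.array_equal(first, second))
print(np.array_equal(first, other))
```

Re-running a notebook with a fixed seed therefore repeats the same sampling decisions, which is what makes two training runs comparable.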

## Define base feature and read the File
The base features are the features found in the input file. They need to be defined first, after which the file can be read using the `from_csv` method of `EnginePandasNumpy`.
The `from_csv` method reads the file and returns a Pandas DataFrame object.

In [9]:
# Change this to read from another location
file = '../../../data/bs140513_032310.csv'

In [10]:
age = ft.FeatureSource('age', ft.FEATURE_TYPE_CATEGORICAL)
gender = ft.FeatureSource('gender', ft.FEATURE_TYPE_CATEGORICAL)
merchant = ft.FeatureSource('merchant', ft.FEATURE_TYPE_CATEGORICAL)
category = ft.FeatureSource('category', ft.FEATURE_TYPE_CATEGORICAL)
amount = ft.FeatureSource('amount', ft.FEATURE_TYPE_FLOAT)
fraud = ft.FeatureSource('fraud', ft.FEATURE_TYPE_INT_8)

base_features = ft.TensorDefinition(
    'base', 
    [
        age,
        gender,
        merchant,
        category,
        amount,
        fraud
    ])


age_i = ft.FeatureIndex('age_index', ft.FEATURE_TYPE_INT_8, age)
gender_i = ft.FeatureIndex('gender_index', ft.FEATURE_TYPE_INT_8, gender)
merchant_i = ft.FeatureIndex('merchant_index', ft.FEATURE_TYPE_INT_16, merchant)
category_i = ft.FeatureIndex('category_index', ft.FEATURE_TYPE_INT_16, category)
amount_binned = ft.FeatureBin('amount_bin', ft.FEATURE_TYPE_INT_16, amount, 30)
fraud_label = ft.FeatureLabelBinary('fraud_label', fraud)

learning_features = ft.TensorDefinition(
    'learning', 
    [
        age_i,
        gender_i,
        merchant_i,
        category_i,
        amount_binned,
        fraud_label
    ])

with en.EnginePandasNumpy() as e:
    df = e.from_csv(base_features, file, inference=False)
    df = e.from_df(learning_features, df, inference=False)

df

2020-11-03 14:40:22.364 d373c7.engines.common          INFO     Start Engine...
2020-11-03 14:40:22.364 d373c7.engines.panda_numpy     INFO     Pandas Version : 1.1.2
2020-11-03 14:40:22.365 d373c7.engines.panda_numpy     INFO     Numpy Version : 1.19.2
2020-11-03 14:40:22.371 d373c7.engines.panda_numpy     INFO     Building Panda for : base from file ../../../data/bs140513_032310.csv
2020-11-03 14:40:22.586 d373c7.engines.panda_numpy     INFO     Building Panda for : <base> from DataFrame. Inference mode <False>
2020-11-03 14:40:22.592 d373c7.engines.panda_numpy     INFO     Done creating base. Shape=(594643, 6)
2020-11-03 14:40:22.592 d373c7.engines.panda_numpy     INFO     Building Panda for : <learning> from DataFrame. Inference mode <False>
2020-11-03 14:40:22.634 d373c7.engines.panda_numpy     INFO     Done creating learning. Shape=(594643, 6)


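| | age_index | gender_index | merchant_index | category_index | amount_bin | fraud_label |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 0 |
| 1 | 2 | 1 | 1 | 1 | 1 | 0 |
| 2 | 1 | 2 | 2 | 1 | 1 | 0 |
| 3 | 3 | 1 | 1 | 1 | 1 | 0 |
| 4 | 4 | 1 | 1 | 1 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 594638 | 3 | 2 | 2 | 1 | 1 | 0 |
| 594639 | 1 | 2 | 2 | 1 | 1 | 0 |
| 594640 | 2 | 2 | 15 | 11 | 1 | 0 |
| 594641 | 4 | 1 | 2 | 1 | 1 | 0 |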
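
The learning features above are all small integers: categorical values become indices and the amount becomes a bin number. The sketch below illustrates that general idea in plain Python. It is not the d373c7 implementation: `build_index` and `equal_width_bin` are hypothetical helpers, starting the index at 1 and using equal-width bins are assumptions, and the input values are made-up toy data rather than rows from the file.

```python
from typing import Dict, Hashable, Sequence


def build_index(values: Sequence[Hashable]) -> Dict[Hashable, int]:
    """Map each distinct value to an integer, in order of first appearance.

    Indices start at 1 (assumption: 0 is kept free, e.g. for unseen values).
    """
    mapping: Dict[Hashable, int] = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


def equal_width_bin(value: float, low: float, high: float, n_bins: int) -> int:
    """Return a bin number in 1..n_bins, using equal-width bins on [low, high]."""
    if high <= low:
        return 1
    width = (high - low) / n_bins
    raw = int((value - low) // width) + 1
    # Clip so that the maximum value falls in the last bin.
    return max(1, min(n_bins, raw))


# Toy values, invented for illustration only.
genders = ['M', 'M', 'F', 'M', 'F']
gender_map = build_index(genders)
gender_index = [gender_map[g] for g in genders]

amounts = [4.55, 39.68, 26.89, 900.0, 17.25]
amount_bin = [equal_width_bin(a, min(amounts), max(amounts), 30) for a in amounts]

print(gender_index)
print(amount_bin)
```

With equal-width bins, a single large amount stretches the range so that most small amounts land in the first bin. That would be consistent with the many `1` values in the `amount_bin` column, although the actual binning strategy of `FeatureBin` is not confirmed here.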


## Convert to Numpy

In [12]:
with en.EnginePandasNumpy() as e:
    data_list = e.to_numpy_list(learning_features, df)
print(data_list.shapes)
print(data_list.dtype_names)

2020-11-03 14:41:14.406 d373c7.engines.common          INFO     Start Engine...
2020-11-03 14:41:14.406 d373c7.engines.panda_numpy     INFO     Pandas Version : 1.1.2
2020-11-03 14:41:14.407 d373c7.engines.panda_numpy     INFO     Numpy Version : 1.19.2
2020-11-03 14:41:14.407 d373c7.engines.panda_numpy     INFO     Converting DataFrame to Numpy of type: int16
2020-11-03 14:41:14.407 d373c7.engines.panda_numpy     INFO     Reshaping DataFrame to: Categorical
2020-11-03 14:41:14.411 d373c7.engines.panda_numpy     INFO     Converting DataFrame to Numpy of type: int8
2020-11-03 14:41:14.411 d373c7.engines.panda_numpy     INFO     Reshaping DataFrame to: Label


[(594643, 5), (594643,)]
['int16', 'int8']


## Wrangle the data
Time split the data. For time series data it is very important to keep the order of the data. This split starts from the end and works its way to the front of the data. This way the training, validation and test data each form a contiguous block in time.

> 1. Split out a test-set of size `test_records`. This is used for model testing.
> 2. Split out a validation-set of size `val_records`. It will be used to monitor overfitting during training.
> 3. All the rest is considered training data.

__Important__: for auto-encoders we perform a 4th step. All fraud records are removed from the training and validation data, so the auto-encoder only sees *non-fraud* records during training.
> 4. Remove fraud from training and validation

In [13]:
test_records = 100000
val_records  = 30000 

train_data, val_data, test_data = data_list.split_time(val_records, test_records) 

# Filter. Only keep non-fraud records with label 0. 
train_data = train_data.filter_label(learning_features, 0)
val_data = val_data.filter_label(learning_features, 0)

print(f'Training Data shapes {train_data.shapes}')
print(f'Validation Data shapes {val_data.shapes}')
print(f'Test Data shapes {test_data.shapes}')
del data_list
gc.collect()
print('Done')

Training Data shapes [(458847, 5), (458847,)]
Validation Data shapes [(29670, 5), (29670,)]
Test Data shapes [(100000, 5), (100000,)]
Done
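
The time split and fraud filter used above can be sketched in plain Python to make the ordering explicit. This is only an illustration under stated assumptions: `split_time_sketch` and `keep_label` are hypothetical helpers, not the d373c7 `split_time` and `filter_label` methods, and the records are synthetic `(position, label)` pairs.

```python
from typing import List, Sequence, Tuple

Record = Tuple[int, int]  # (position in time, label); label 1 means fraud


def split_time_sketch(
    records: Sequence[Record], val_records: int, test_records: int
) -> Tuple[List[Record], List[Record], List[Record]]:
    """Split time-ordered records from the end.

    The last `test_records` rows become the test set, the `val_records`
    rows before them the validation set, and everything earlier is training.
    """
    if val_records < 0 or test_records < 0:
        raise ValueError('split sizes must be non-negative')
    if val_records + test_records > len(records):
        raise ValueError('split sizes exceed the number of records')
    test_start = len(records) - test_records
    val_start = test_start - val_records
    return (
        list(records[:val_start]),
        list(records[val_start:test_start]),
        list(records[test_start:]),
    )


def keep_label(records: Sequence[Record], label: int) -> List[Record]:
    """Keep only the records whose label equals `label`."""
    return [r for r in records if r[1] == label]


# Synthetic data: 20 records in time order, every 4th one marked as fraud.
records = [(i, 1 if i % 4 == 0 else 0) for i in range(20)]

train, val, test = split_time_sketch(records, val_records=4, test_records=6)

# Step 4: the auto-encoder only trains and validates on non-fraud records.
train = keep_label(train, 0)
val = keep_label(val, 0)

print(len(train), len(val), len(test))
```

The test set is left unfiltered on purpose: it still contains fraud, so the trained auto-encoder can be evaluated on records it never saw during training.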


## Define model

> Define a __LinearToCategoryAutoEncoder__. As input it takes the size of the latent dimension, in this case *3*, and a list of integers indicating the number and size of the hidden layers. *We define it with 1 hidden layer of size 16*.