# Recurrent Neural Net (RNN) w/Series Expression

The previous example showed some improvements we can make, but there is always more we can do to make it easier for the model to detect fraud. Here are a couple of options:

- Add more derived features. For instance, the fact that a transaction had a round amount gets lost because of scaling and binning. Such a feature can be added with a `FeatureExpression`, since it only depends on a single transaction.
- Some derived features might depend on the series. For instance, we could argue that the time between transactions is a notion that gets lost in the 'stacking' of the series. It cannot be calculated from just one transaction; we need a set of them. We have a `FeatureExpressionSeries` to help with that. A `FeatureExpressionSeries` works similarly to a regular `FeatureExpression`, but it is executed as the series are built, whereas a regular `FeatureExpression` is evaluated prior to constructing the series. The input to a `FeatureExpressionSeries` is the ordered set of transaction fields as a DataFrame. This syntax may seem a bit awkward, but it allows us to run highly efficient vectorized code on the DataFrames, which takes a fraction of the time Python would need to loop over the elements natively.
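To make the vectorization point concrete, here is a minimal sketch using plain pandas (not the d373c7 API). The toy dates and variable names are made up for illustration. It computes the gap in days to the previous transaction once with a single vectorized call and once with a native Python loop, and both give the same result.

```python
import pandas as pd

# Toy transaction dates for one customer (illustrative values, not from the data set)
dates = pd.DataFrame({'date': pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-10'])})

# Vectorized: one call computes the gap to the previous row for the whole column
vectorized = dates['date'].diff().dt.days.fillna(0).tolist()

# Equivalent native Python loop over the individual elements
values = dates['date'].tolist()
looped = [0.0] + [float((values[i] - values[i - 1]).days) for i in range(1, len(values))]

print(vectorized)
print(looped)
```

On three rows the difference in run time is negligible; the benefit of the vectorized form shows up when the same operation runs over many series.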

We will not use the round amount example, as there are hardly any round amounts in the data set, which is a bit unrealistic. Instead we'll add a date-delta, and we'll use the scaled amount again rather than a binned amount, so as to mix things up a bit.
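For reference only, a round-amount check could be as small as the sketch below. The function name `is_round_amount` is hypothetical and the feature is not used anywhere in this notebook. Because it only looks at one transaction, it would be a candidate for a regular `FeatureExpression`, in the same way `step_to_date` is used further down.

```python
def is_round_amount(amount: float) -> int:
    """Return 1 if the amount has no fractional part, else 0 (hypothetical helper)."""
    return int(float(amount).is_integer())

# Quick check on a few illustrative amounts
flags = [is_round_amount(a) for a in (25.0, 25.5, 100.0)]
print(flags)
```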

---
#### Note on the data set 
The data set used here is not particularly complex or big, and it's not all that challenging to find the fraud. In an ideal world we'd be using more complex data sets to show the real power of Deep Learning. There are a bunch of PCA'ed data sets available, but the PCA obfuscates some of the elements that are useful.
*These examples are meant to show the possibilities; it's not so useful to interpret their performance on this data set.*

In [1]:
import torch
import numpy as np
import pandas as pd
import datetime as dt
import gc

import d373c7.features as ft
import d373c7.engines as en
import d373c7.pytorch as pt
import d373c7.pytorch.models as pm
import d373c7.plot as pl

## Set a random seed for Numpy and Torch
> Will make sure we always sample in the same way. Makes it easier to compare results. At some point it should be removed to test the model stability.

In [2]:
# Numpy
np.random.seed(42)
# Torch
torch.manual_seed(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

## Define base features and read the File
The base features are features found in the input file. They need to be defined first, after which the file can be read with the `from_csv` method of the `EnginePandasNumpy`.

A new function has been added to calculate the time-delta between consecutive payments for the same customer. Its input is a Pandas DataFrame and it outputs a time-delta normalized to 190 days. Remember that __normalizing__ the data is important for Neural Nets; it allows them to converge much faster.

The new function is used in the `FeatureExpressionSeries` feature `date_time_delta`, which is named __'delta'__.
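As a worked example of the normalization, the sketch below repeats the `calc_delta` logic from the cell further down so it runs on its own, and applies it to three made-up dates. The gaps of 19 and 95 days divide by 190 to give roughly 0.1 and 0.5, and the first row has no predecessor so it becomes 0.

```python
import numpy as np
import pandas as pd

def calc_delta(dates):
    # Same logic as the notebook's function, repeated so this example is self-contained
    if isinstance(dates, pd.DataFrame):
        res = dates.diff() / np.timedelta64(190, 'D')
        return res.fillna(0).abs()
    # Anything that is not a DataFrame is treated as a single row
    return 0

# Toy dates (illustrative only): gaps of 19 and 95 days
toy = pd.DataFrame({'date': pd.to_datetime(['2020-01-01', '2020-01-20', '2020-04-24'])})
delta = calc_delta(toy)['date'].tolist()
print(delta)
```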

The customer data has a similar shape as in the previous example, but the series output is different. As the printed shapes further down show, the series list holds 2 Numpy arrays, one per data type:
- The first Numpy array in the series list is of type `int16` and has rank 3. The first dimension is the batch, the second dimension is the time and the third dimension holds the 'category' and 'merchant' indexes.
- The second Numpy array in the series list is of type `float32` and has rank 3. The first dimension is the batch, the second dimension is the time and the third dimension holds the scaled amount and the time-delta respectively.
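To picture where the rank-3 shape comes from, here is a rough NumPy sketch of window stacking for a single customer. It is an illustration under assumptions, not the d373c7 implementation: the helper `stack_windows` is hypothetical, and zero-padding at the start of short histories is an assumed policy.

```python
import numpy as np

def stack_windows(features: np.ndarray, window: int) -> np.ndarray:
    """Build one window per row; each window ends at that row and is zero-padded in front."""
    n_rows, n_feat = features.shape
    pad = np.zeros((window - 1, n_feat), dtype=features.dtype)
    padded = np.concatenate([pad, features])
    # Row i of the result selects padded rows i .. i + window - 1
    idx = np.arange(n_rows)[:, None] + np.arange(window)[None, :]
    return padded[idx]

# 4 transactions for one hypothetical customer, 2 float features each
trx = np.arange(8, dtype=np.float32).reshape(4, 2)
series = stack_windows(trx, window=5)
print(series.shape)
```

The result has one entry per transaction (batch), 5 time steps per entry, and 2 features per time step, which mirrors the `(batch, time, feature)` layout described above.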


In [3]:
# Change this to read from another location
file = '../../../../data/bs140513_032310.csv'

In [4]:
# Base Features
step = ft.FeatureSource('step', ft.FEATURE_TYPE_INT_16) 
customer = ft.FeatureSource('customer', ft.FEATURE_TYPE_STRING)
age = ft.FeatureSource('age', ft.FEATURE_TYPE_CATEGORICAL)
gender = ft.FeatureSource('gender', ft.FEATURE_TYPE_CATEGORICAL)
merchant = ft.FeatureSource('merchant', ft.FEATURE_TYPE_CATEGORICAL)
category = ft.FeatureSource('category', ft.FEATURE_TYPE_CATEGORICAL)
amount = ft.FeatureSource('amount', ft.FEATURE_TYPE_FLOAT)
fraud = ft.FeatureSource('fraud', ft.FEATURE_TYPE_INT_8)

# Function to calculate the date and time from the step
def step_to_date(step_count: int):
    return dt.datetime(2020, 1, 1) + dt.timedelta(days=int(step_count))

# Function to calculate the time difference between all rows and normalise
def calc_delta(dates):
    if isinstance(dates, pd.DataFrame):
        res = dates.diff() / np.timedelta64(190, 'D')
        res = res.fillna(0).abs()
        return res
    else:
        # There was only 1 row
        return 0

# Derived Features
date_time = ft.FeatureExpression('date', ft.FEATURE_TYPE_DATE_TIME, step_to_date, [step])
date_time_delta = ft.FeatureExpressionSeries('delta', ft.FEATURE_TYPE_FLOAT_32, calc_delta, [date_time])
age_i = ft.FeatureIndex('age_index', ft.FEATURE_TYPE_INT_8, age)
gender_i = ft.FeatureIndex('gender_index', ft.FEATURE_TYPE_INT_8, gender)
merchant_i = ft.FeatureIndex('merchant_index', ft.FEATURE_TYPE_INT_16, merchant)
category_i = ft.FeatureIndex('category_index', ft.FEATURE_TYPE_INT_16, category)
amount_scale = ft.FeatureNormalizeScale('amount_scale', ft.FEATURE_TYPE_FLOAT_32, amount)
fraud_label = ft.FeatureLabelBinary('fraud_label', ft.FEATURE_TYPE_INT_8, fraud)

cust_learn_features = ft.TensorDefinition(
    'customer_learning', 
    [
        age_i,
        gender_i,
    ])

trx_learn_features = ft.TensorDefinition(
    'transaction_learning', 
    [
        customer,
        merchant_i,
        category_i,
        amount_scale,
        date_time_delta
    ])


label = ft.TensorDefinition('label', [fraud_label])

model_features = ft.TensorDefinitionMulti([cust_learn_features, trx_learn_features, label])

with en.EnginePandasNumpy(num_threads=8) as e:
    cust_df     = e.from_csv(cust_learn_features, file, inference=False)
    series_list = e.to_series_stacked(
        trx_learn_features, file, key_feature=customer, time_feature=date_time, window=5, inference=False
    )
    lb_df       = e.from_csv(label, file, inference=False)
    cust_list   = e.to_numpy_list(cust_learn_features, cust_df)
    lb_np       = e.to_numpy_list(label, lb_df)
    
print('Customer data Shapes')
print(cust_list.shapes)
print(cust_list.dtype_names)
print('Series Shapes')
print(series_list.shapes)
print(series_list.dtype_names)
print('Label Shapes')
print(lb_np.shapes)
print(lb_np.dtype_names)

data_list = en.NumpyList(cust_list.lists + series_list.lists + lb_np.lists)
print('Numpy Shapes')
print(data_list.shapes)

2021-12-29 10:47:49.527 d373c7.engines.common          INFO     Start Engine...
2021-12-29 10:47:49.527 d373c7.engines.panda_numpy     INFO     Pandas Version : 1.1.4
2021-12-29 10:47:49.528 d373c7.engines.panda_numpy     INFO     Numpy Version : 1.19.2
2021-12-29 10:47:49.528 d373c7.engines.panda_numpy     INFO     Building Panda for : customer_learning from file ../../../../data/bs140513_032310.csv
2021-12-29 10:47:49.659 d373c7.engines.panda_numpy     INFO     Building Panda for : <Source_Derive_Source> from DataFrame. Inference mode <False>
2021-12-29 10:47:49.659 d373c7.engines.panda_numpy     INFO     Reshaping DataFrame to: Source_Derive_Source
2021-12-29 10:47:49.660 d373c7.engines.panda_numpy     INFO     Done creating Source_Derive_Source. Shape=(594643, 2)
2021-12-29 10:47:49.671 d373c7.engines.panda_numpy     INFO     Reshaping DataFrame to: customer_learning
2021-12-29 10:47:49.672 d373c7.engines.panda_numpy     INFO     Building Panda for : InternalKeyTime from file ../..

Customer data Shapes
[(594643, 2)]
['int8']
Series Shapes
[(594643, 5, 2), (594643, 5, 2)]
['int16', 'float32']
Label Shapes
[(594643,)]
['int8']
Numpy Shapes
[(594643, 2), (594643, 5, 2), (594643, 5, 2), (594643,)]


## End

This notebook merely shows how to add a series feature. It turns out that if you add it, on this data, the models perform worse. It is not totally clear whether the date is actually a useful feature. The data might not be constructed in a way where the elapsed time between payments is indicative. In real life one would expect it to be.