############################################################################
##
## Copyright (C) 2022 NVIDIA Corporation.  All rights reserved.
##
## NVIDIA Sample Code
##
## Please refer to the NVIDIA end user license agreement (EULA) associated
## with this source code for terms and conditions that govern your use of
## this software. Any use, reproduction, disclosure, or distribution of
## this software and related documentation outside the terms of the EULA
## is strictly prohibited.
##
############################################################################

# Synthetic Data Generation

Synthetic data generation is a data augmentation technique that increases the robustness of models by supplying additional training data.

An ideal <em>synthetic dataset</em> generated on top of real data shares the following with the real data:
- the same features (columns)
- the same data type (integer, float, string, etc.) for each feature
- the same distribution within each individual column
- the same joint distributions across multiple columns
- the same conditional distributions (i.e. the distribution of one column after applying a condition on another)
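
The properties above can be checked on toy data. The following is a minimal sketch using only the Python standard library, not the evaluation used in this workshop; the helper names (`same_schema`, `column_frequencies`, `joint_frequencies`, `conditional_frequencies`, `total_variation`), the column names, and the toy rows are all assumptions made for illustration. It uses total variation distance (0 means identical, 1 means disjoint) to compare categorical distributions.

```python
from collections import Counter


def same_schema(real_rows, synthetic_rows):
    """True when both datasets share column names and per-column Python types."""
    def schema(rows):
        return {column: type(value) for column, value in rows[0].items()}
    return schema(real_rows) == schema(synthetic_rows)


def joint_frequencies(rows, columns):
    """Empirical joint distribution over the given columns."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def column_frequencies(rows, column):
    """Empirical (marginal) distribution of a single column."""
    return joint_frequencies(rows, [column])


def conditional_frequencies(rows, target, given, given_value):
    """Distribution of `target` restricted to rows where `given == given_value`."""
    subset = [row for row in rows if row[given] == given_value]
    return column_frequencies(subset, target)


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical toy data: identical marginals, different joint behaviour.
real = [
    {"use_chip": "Chip", "is_fraud": 0},
    {"use_chip": "Chip", "is_fraud": 0},
    {"use_chip": "Online", "is_fraud": 1},
    {"use_chip": "Online", "is_fraud": 0},
]
synthetic = [
    {"use_chip": "Chip", "is_fraud": 0},
    {"use_chip": "Chip", "is_fraud": 1},
    {"use_chip": "Online", "is_fraud": 0},
    {"use_chip": "Online", "is_fraud": 0},
]

marginal_gap = total_variation(
    column_frequencies(real, "is_fraud"), column_frequencies(synthetic, "is_fraud")
)  # 0.0: the fraud rate matches
joint_gap = total_variation(
    joint_frequencies(real, ["use_chip", "is_fraud"]),
    joint_frequencies(synthetic, ["use_chip", "is_fraud"]),
)  # 0.5: fraud is attached to the wrong transaction type
conditional_gap = total_variation(
    conditional_frequencies(real, "is_fraud", "use_chip", "Online"),
    conditional_frequencies(synthetic, "is_fraud", "use_chip", "Online"),
)  # 0.5: the fraud rate among online transactions differs
```

The toy example illustrates why matching individual columns is not enough: the synthetic rows reproduce every marginal distribution exactly, yet the joint and conditional distributions are off.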

A <em>synthetic data generator</em> is a model that can be trained on the real data and then used to create new synthetic data with the properties described above.

# Motivation and Financial Services Use Cases

Some sample applications of synthetic data generation are listed below:

<strong>Fraud Detection</strong> - simulate payments data, insurance claims, or images

<strong>Backtesting</strong> - simulate stock market data / order book data

<strong>Loan Delinquency</strong> - simulate mortgage data

<strong>Financial News or Forms</strong> - simulate financial news such as a macro event, or information on a 10-K 


Common themes in the sample applications above are:
- Conditional generation
- Upsampling of infrequent data or edge cases
- Regularization

# Methods

Synthetic data generation methods can be divided into classical and deep learning approaches.

Classical approaches such as SMOTE oversample certain data points or generate new data points by interpolating between existing ones. This can add bias to the model, or produce unrealistic or nonsensical data points, for example when interpolating between zip codes or checking account numbers.

Deep learning methods involve training a model, such as a Variational Autoencoder (VAE), which encodes the training data into a latent space and then decodes it back to the original feature space. Once trained, the user need only sample from the latent space and pass these samples through the decoder to produce synthetic data. However, VAEs, GANs, and similar models can suffer from posterior collapse, where the synthetic data generator only outputs a single value, or catastrophic forgetting, where the model forgets previous information upon learning new information. In practice this makes VAEs and GANs difficult to use for generating synthetic tabular data, especially when there are multiple high-cardinality categorical features.
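
To make the interpolation problem concrete, here is a minimal sketch of the interpolation step at the heart of SMOTE-style oversampling. It is not a full SMOTE implementation (which also selects nearest neighbours within the minority class); the `interpolate` helper and the `[amount, zip code]` encoding are assumptions made for illustration.

```python
import random


def interpolate(point_a, point_b, rng):
    """Return a new point on the line segment between two existing points."""
    t = rng.random()  # interpolation fraction in [0, 1)
    return [a + t * (b - a) for a, b in zip(point_a, point_b)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Two transactions encoded as [amount, zip code treated as a number].
beverly_hills = [123.10, 90210.0]
new_york = [68.00, 10017.0]

synthetic_point = interpolate(beverly_hills, new_york, rng)
synthetic_amount, synthetic_zip = synthetic_point
```

The interpolated amount is a plausible value between 68.00 and 123.10, but the interpolated "zip code" is just a number somewhere between 10017 and 90210 that has no relationship to either city, which is why interpolation is a poor fit for categorical identifiers.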

### Classical:

- Oversampling - e.g. SMOTE (Synthetic Minority Oversampling Technique)
- Bagging - Bootstrap aggregation
- Monte Carlo
- PCA - Principal component analysis
- Rotation, scaling, and cropping of images

### Deep Learning:
- (Variational) AutoEncoders
- Generative Adversarial Networks (GANs)
- RNNs, LSTMs, and others
- Transformers (what we will focus on today)

### Issues with Classical and earlier DL methods:
- Loss of temporal information
- Re-use of existing datapoints, which can add bias to a model
- Linear interpolation of existing datapoints, which may not be accurate or make sense for certain data (e.g. categorical data such as zip codes)
- Difficulty creating a model that captures information on conditional distributions
- Catastrophic forgetting - the model forgets previous information upon learning new information [[1]](#0_1)
- Posterior collapse - the synthetic data generator only outputs a single value [[2]](#0_2)


### In this workshop, we will be exploring Transformers for synthetic tabular data generation. In our experience, using Transformers has yielded more accurate results in a shorter amount of time compared to the laborious and iterative process involved in training VAEs on large amounts of data. 

# Synthetic Tabular Data Generation

In addition to the properties of an ideal synthetic dataset listed above, our synthetic tabular data generator should be:
- Privacy-focused: does not leak information about users in the real data
- Representative:
    - Synthetic data accurately represents global and local trends in the real data
    - Relationships between categorical features across columns are preserved
    - Correlations within subsets of the data are maintained
- Conditionally generated:
    - Generates new data based on provided context
    - Generates new edge case data
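
As a rough illustration of the privacy requirement, the sketch below measures how many synthetic rows are exact copies of real rows. This is only a crude first check and not a privacy guarantee; the `exact_match_rate` helper, the column names, and the toy rows are assumptions made for illustration.

```python
def exact_match_rate(real_rows, synthetic_rows, columns):
    """Fraction of synthetic rows that exactly duplicate some real row."""
    real_keys = {tuple(row[c] for c in columns) for row in real_rows}
    copies = sum(
        tuple(row[c] for c in columns) in real_keys for row in synthetic_rows
    )
    return copies / len(synthetic_rows)


# Hypothetical toy rows for illustration only.
real_rows = [
    {"user": 791, "amount": 68.00, "zip": "10017"},
    {"user": 21, "amount": 42.04, "zip": "22015"},
]
synthetic_rows = [
    {"user": 791, "amount": 68.00, "zip": "10017"},  # verbatim copy of a real row
    {"user": 791, "amount": 71.25, "zip": "10017"},
    {"user": 21, "amount": 39.90, "zip": "22015"},
    {"user": 21, "amount": 45.10, "zip": "22015"},
]

leak_rate = exact_match_rate(real_rows, synthetic_rows, ["user", "amount", "zip"])
```

A non-zero rate means the generator is replaying real records; a rate of zero does not by itself prove that no user information is leaked.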
    


## Credit Card Payments

Throughout the rest of this workshop, we will explore credit card payments using the TabFormer Dataset [[3]](#0_3), which itself was synthetically generated. A sample of the credit card payments is shown below.

| user | card | amount | date                | year | month | day  | hour | minute | use chip              | merchant name | merchant city | merchant state | zip   | mcc   | errors | is fraud |
|------|------|------- |-------------------- |------|------ |------| -----| ------ | ----------------------| --------------| ------------- | -------------- | ----- | ---   | ------ | -------- |
| 791  | 1    | 68.00  | 2018-01-02 09:10:00 | 2018 |  1    |  2   |  9   |     10 |    Swipe Transaction  | 12345536      |  New York     | NY             | 10017 |  8005 |  \<NA> | 0        |
| 1572 | 0    | 572.42 | 2018-04-12 07:11:00 | 2018 |  4    |  12  |  7   |     11 |    Chip Transaction   | 49908535      |  Princeton    | NJ             | 19406 |  5634 |  \<NA> | 0        |
| 2718 | 7    | 123.10 | 2019-01-04 10:14:00 | 2019 |  1    |  4   |  10  |     14 |    Chip Transaction   | 43211536      |  Beverly Hills| CA             | 90210 |  4800 |  \<NA> | 0        |
| 21   | 2    | 42.04  | 2020-06-23 11:18:00 | 2020 |  6    |  23  |  11  |     18 |    Swipe Transaction  | 65423006      |  Burke        | VA             | 22015 |  5604 |  \<NA> | 0        |
| 1001 | 1    | 5000.00| 2020-11-03 01:22:00 | 2020 |  11   |  3   |  1   |     22 |    Online Transaction | 75434546      |  \<NA>        | \<NA>          | \<NA> |  1234 |  \<NA> | 1        |


A description of the columns is as follows:

- <strong>user</strong> - (<em>integer</em>) the user id, between 0 and 2000
- <strong>card</strong> - (<em>integer</em>) the card id for a user, between 0 and 8
- <strong>amount</strong> - (<em>float</em>) the amount spent on a transaction, from -500.00 (for a return) to ~10,000
- <strong>year, month, day, hour, minute</strong> - (<em>integer</em>) the corresponding time a transaction occurred
- <strong>use chip</strong> - (<em>string</em>) the transaction type, one of <em>Swipe Transaction, Chip Transaction, or Online Transaction</em>
- <strong>merchant name</strong> - (<em>integer</em>) the merchant id, there are about 100,000 merchants total
- <strong>merchant city, merchant state, zip</strong> - (<em>string</em>) the city, state, and zip code where the transaction occurred. Will be NA if the transaction was online.
- <strong>mcc</strong> - (<em>integer</em>) the <a href="https://www.investopedia.com/terms/m/merchant-category-codes-mcc.asp">merchant category code</a> which is a 4-digit number categorizing the type of transaction. We will use this <a href="https://github.com/jleclanche/python-iso18245">iso18245 repository</a> for looking up merchant category codes as needed.
- <strong>errors</strong> - (<em>string</em>) a comma separated list of errors that occurred in the transaction.
- <strong>is fraud</strong> - (<em>boolean</em>) whether the transaction is labeled fraudulent or not.
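
The column descriptions above can be written down as a typed record. This is a minimal sketch, not the representation used by the TabFormer dataset or later in this workshop: the `Transaction` class, its snake_case field names, and the `time_features` helper are assumptions made for illustration. It also shows that the year, month, day, hour, and minute columns are derivable from the date column.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Transaction:
    user: int
    card: int
    amount: float
    date: datetime
    use_chip: str                  # Swipe, Chip, or Online Transaction
    merchant_name: int             # merchant id
    merchant_city: Optional[str]   # None (NA) for online transactions
    merchant_state: Optional[str]  # None (NA) for online transactions
    zip_code: Optional[str]        # None (NA) for online transactions
    mcc: int                       # 4-digit merchant category code
    errors: Optional[str]          # comma separated list, None when no errors
    is_fraud: bool

    def time_features(self) -> dict:
        """Derive the year/month/day/hour/minute columns from the date."""
        d = self.date
        return {"year": d.year, "month": d.month, "day": d.day,
                "hour": d.hour, "minute": d.minute}


# First row of the example table above.
first_row = Transaction(
    user=791,
    card=1,
    amount=68.00,
    date=datetime.strptime("2018-01-02 09:10:00", "%Y-%m-%d %H:%M:%S"),
    use_chip="Swipe Transaction",
    merchant_name=12345536,
    merchant_city="New York",
    merchant_state="NY",
    zip_code="10017",
    mcc=8005,
    errors=None,
    is_fraud=False,
)

features = first_row.time_features()
```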



#### A synthetic data generator for credit card transactions has some intricacies that make the modeling process particularly difficult.
    
#### At a high level, there are <strong>(1)</strong> different users, <strong>(2)</strong> in different geographic locations, <strong>(3)</strong> with different transaction profiles and <strong>(4)</strong> different payment methods/preferences, as well as <strong>(5)</strong> features that are time dependent. There is also a mix of high-cardinality categorical columns (zip code, city, state) and continuous columns (amount spent) whose values must make sense together. For example, for zip code 90210, Beverly Hills, CA is the only acceptable city and state.
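
One way to picture the cross-column constraint is as a validity check applied to generated rows. The sketch below is an assumption-laden illustration, not part of the workshop code: `ZIP_TO_LOCATION` is a hypothetical lookup table holding only a few entries taken from the example table, and `location_is_consistent` is a hypothetical helper.

```python
# Hypothetical lookup; a real one would come from a postal reference dataset.
ZIP_TO_LOCATION = {
    "90210": ("Beverly Hills", "CA"),
    "10017": ("New York", "NY"),
    "22015": ("Burke", "VA"),
}


def location_is_consistent(row):
    """Check that city, state, and zip agree with each other and with the transaction type."""
    location = (row["merchant city"], row["merchant state"], row["zip"])
    if row["use chip"] == "Online Transaction":
        # Online transactions carry no physical location.
        return all(value is None for value in location)
    # Unknown zip codes return None from the lookup and are treated as inconsistent.
    expected = ZIP_TO_LOCATION.get(row["zip"])
    return expected == (row["merchant city"], row["merchant state"])


valid_row = {"use chip": "Chip Transaction", "merchant city": "Beverly Hills",
             "merchant state": "CA", "zip": "90210"}
invalid_row = {"use chip": "Chip Transaction", "merchant city": "New York",
               "merchant state": "NY", "zip": "90210"}
online_row = {"use chip": "Online Transaction", "merchant city": None,
              "merchant state": None, "zip": None}

results = [location_is_consistent(r) for r in (valid_row, invalid_row, online_row)]
```

A generator that models each column independently can easily emit rows like `invalid_row`, where every individual value is realistic but the combination is not.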

# References

<a id="0_1">[1]</a>
<a href="https://www.researchgate.net/profile/Truyen-Tran-2/publication/326342681_On_catastrophic_forgetting_and_mode_collapse_in_Generative_Adversarial_Networks/links/5db7848992851c81801152e1/On-catastrophic-forgetting-and-mode-collapse-in-Generative-Adversarial-Networks.pdf">Catastrophic forgetting and mode collapse in GANs</a><br>

<a id="0_2">[2]</a>
<a href="https://openreview.net/pdf?id=r1xaVLUYuE">Understanding Posterior Collapse in Generative Latent Variable Models</a><br>

<a id="0_3">[3]</a>
<a href="https://github.com/IBM/TabFormer/tree/main">Tabular Transformers for Modeling Multivariate Time Series</a><br>
