# Sequential data synthesis with YData

## The PaySim use case

Payments data is one of the most common types of transactional datasets, and for many business areas one of the most valuable. However, it is particularly sensitive and has complex underlying logic governing it, which makes it a perfect test bed for data quality assessment and synthetic data generation.

The dataset _“PaySim: A financial mobile money simulator for fraud detection”_ is a case study based on a real company that developed a mobile money implementation. The service lets mobile phone users transfer money between themselves, using the phone as a kind of electronic wallet.

**Outline of this document**

- Ingesting and understanding a transactional financial dataset (including business logic)
- Training a synthetic data generator over the dataset
- Generating a synthetic version of the transactional dataset
- Evaluating the statistical fidelity and analytical utility of the generated data, including abidance by business logic

## Dataset

PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company, which provides a mobile financial service currently running in more than 14 countries worldwide.

The PaySim version used is the one [publicly available on Kaggle](https://www.kaggle.com/datasets/ealaxi/paysim1), generated through the code on the [PaySim GitHub repository](https://github.com/EdgarLopezPhD/PaySim) and based on the paper [_PaySim: A financial mobile money simulator for fraud detection_](https://www.researchgate.net/publication/313138956_PAYSIM_A_FINANCIAL_MOBILE_MONEY_SIMULATOR_FOR_FRAUD_DETECTION).

The dataset should be downloaded from Kaggle and the CSV file placed under the `data` folder.

### Loading the dataset

YData's platform is fully integrated - this means that any source of data created through the UI can be easily consumed within the Labs (via the [Platform SDK](https://github.com/ydataai/academy/blob/master/1%20-%20platform-sdk/ui-to-sdk-examples.ipynb)) for further exploration, leveraging the platform's built-in scalability. Alternatively, YData's [Connectors](https://github.com/ydataai/academy/tree/master/2%20-%20connectors) could also be used.

Below, the Platform SDK will be used. 

In [4]:
# Importing YData's packages
from ydata.platform.datasources import DataSources
from ydata.metadata import Metadata

# Creating a Dataset from the Data Source
datasource = DataSources.get(uid='5759d69f-e127-419d-a382-71d2cc01025a',
                             namespace='45685d15-0577-4001-834b-701ed6a52ad0')

dataset = datasource.read()
# Quickly previewing the Dataset
dataset.head()

idx,time,step,action,amount,nameOrig,oldBalanceOrig,newBalanceOrig,nameDest,oldBalanceDest,newBalanceDest,isFraud,isFlaggedFraud,isUnauthorizedOverdraft
0,0,1,CASH_IN,229382.36,C3568019779,67.45,229449.8,M1836583048,0.0,0.0,0,0,0
1,2099485,215,PAYMENT,3898.67,C7787814982,4330561.39,4326662.72,M8323358603,119874.35,123773.02,0,0,0
2,1422612,162,CASH_IN,328906.66,C5933160999,2738664.99,3067571.64,M5203480520,62420.96,62420.96,0,0,0
3,2771031,304,PAYMENT,5893.84,C4556212256,355476.1,349582.27,M7504842284,134119.24,140013.08,0,0,0
4,735308,41,CASH_IN,130091.39,C6471076107,2732966.8,2863058.19,M6673769831,8268.4,8268.4,0,0,0


In [5]:
dataset.shape(lazy_eval=False)

(3440390, 13)

A quick look at the dataset reveals several groups of features:

- `time` is our sequential indicator
- features describing the financial operation and its flow (`action`, `amount`, `nameOrig` and `nameDest`)
- features accounting for the variation in balances (`oldBalanceOrig`, `newBalanceOrig`, `oldBalanceDest`, `newBalanceDest`)
- features related to fraud and rejected operations (`isFlaggedFraud`, the automatically detected potential frauds; `isFraud`, the actual fraud label; and `isUnauthorizedOverdraft`, which marks rejected overdrafts)

Automated profiling complements these basic insights by detecting potential data quality issues.

### Evaluating data quality

A `Metadata` object holds information about the dataset, including data quality warnings that help identify issues to be solved prior to data synthesis.

In [2]:
# Creating a Metadata (where warnings can be reviewed) from the Dataset
metadata = Metadata(dataset)
print(metadata)

Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 13
% of duplicate rows: 0
Target column:

Column detail:
                     Column    Data type Variable type
0                      time    numerical           int
1                      step  categorical           int
2                    action  categorical        string
3                    amount    numerical         float
4                  nameOrig  categorical        string
5            oldBalanceOrig    numerical         float
6            newBalanceOrig    numerical         float
7                  nameDest  categorical        string
8            oldBalanceDest    numerical         float
9            newBalanceDest    numerical         float
10                  isFraud  categorical           int
11           isFlaggedFraud  categorical           int
12  isUnauthorizedOverdraft  categorical           int




We can, for the moment, drop the `step` column, as it encodes the same sequence information as `time` (order of transactions). 

In [6]:
dataset=dataset[['time',
                 'action',
                 'amount',
                 'nameOrig',
                 'oldBalanceOrig',
                 'newBalanceOrig',
                 'nameDest',
                 'oldBalanceDest',
                 'newBalanceDest',
                 'isFraud',
                 'isFlaggedFraud',
                 'isUnauthorizedOverdraft']]

In [7]:
dataset.columns

['time',
 'action',
 'amount',
 'nameOrig',
 'oldBalanceOrig',
 'newBalanceOrig',
 'nameDest',
 'oldBalanceDest',
 'newBalanceDest',
 'isFraud',
 'isFlaggedFraud',
 'isUnauthorizedOverdraft']
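The claim that `step` carries the same ordering as `time` can be checked before trusting `time` alone. The snippet below is a minimal sketch in plain pandas, not part of the YData API: it uses only the five `(time, step)` pairs shown in the dataset preview, and `step_is_redundant` is a hypothetical helper name introduced for illustration.

```python
import pandas as pd

# (time, step) pairs copied from the five preview rows shown earlier.
preview = pd.DataFrame(
    {
        "time": [0, 2099485, 1422612, 2771031, 735308],
        "step": [1, 215, 162, 304, 41],
    }
)


def step_is_redundant(df: pd.DataFrame) -> bool:
    """Return True when ordering rows by `time` leaves `step` non-decreasing."""
    ordered = df.sort_values("time", kind="stable")
    return bool(ordered["step"].is_monotonic_increasing)


print(step_is_redundant(preview))
```

On the full data, the same function could be applied to a pandas copy of the dataset taken before `step` is dropped. Five rows are of course only an illustration, not evidence for the whole dataset.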

### Validation of business rules 

`Metadata` is also required to validate **business rules and domain-specific knowledge**. A dataset not abiding by the expected business rules may silently go unnoticed through all the phases of a project and fail to deliver any business value when finally live.

Stemming from its transactional-financial nature and given the extra context available on Kaggle and in the original paper, some implicit and explicit business rules can be derived for this dataset:
    
- `CASH_IN`, `CASH_OUT` and `PAYMENT` operations must have `Merchant` accounts as their destination
- `DEBIT` operations must have `BANKS` as their destination
- If the operation is a rejected overdraft (`isUnauthorizedOverdraft==1`), the resulting balance needs to be 0
- For every origin entity, if the operation is not a rejected overdraft: $balance(t) = balance(t-1) + s(t) \times amount(t)$, where $s(t) = -1$ if $action(t) \in \{\text{CASH\_OUT}, \text{PAYMENT}, \text{TRANSFER}, \text{DEBIT}\}$ and $s(t) = 1$ otherwise
- The **amount** column should only take positive values (as the sign of the operation is encoded in the `action` column)
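The first rule in the list above can also be sketched in plain pandas. This is an illustration under assumptions, not the YData `ConstraintEngine` API: it assumes merchant accounts are those whose name starts with `M`, which is inferred only from the `nameDest` values in the preview. The function name and the `C...` destination names in the toy data are hypothetical.

```python
import pandas as pd

# Actions whose destination must be a merchant account, per the rule above.
MERCHANT_ACTIONS = ["CASH_IN", "CASH_OUT", "PAYMENT"]


def merchant_destination_violations(
    df: pd.DataFrame, merchant_prefix: str = "M"
) -> pd.DataFrame:
    """Return rows whose action requires a merchant destination but lacks one.

    Assumption: merchant names start with `merchant_prefix`.
    """
    needs_merchant = df["action"].isin(MERCHANT_ACTIONS)
    is_merchant = df["nameDest"].str.startswith(merchant_prefix)
    return df[needs_merchant & ~is_merchant]


# Toy data: the last row is a CASH_OUT sent to a non-merchant name.
toy = pd.DataFrame(
    {
        "action": ["CASH_IN", "PAYMENT", "TRANSFER", "CASH_OUT"],
        "nameDest": ["M1836583048", "M8323358603", "C0000000001", "C0000000002"],
    }
)

violations = merchant_destination_violations(toy)
print(violations)
```

The `TRANSFER` row is not reported because the rule does not cover that action. Only the `CASH_OUT` row with a non-merchant destination is returned.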

These kinds of constraints can be validated with YData's flexible `ConstraintEngine`. For demo purposes, let's validate the last two:

TODO: revise constraints to check if they make sense

In [14]:
#double check this with Quemy
from ydata.constraints.engine import ConstraintEngine
from ydata.constraints.constraint import GreaterThan, Positive, CustomConstraint

# Custom constraint validating the integrity of the newBalanceOrig column
def check_originBalance(df):
    # balance(t) is in newBalanceOrig; balance(t-1) is in oldBalanceOrig.
    # The sign of each operation is encoded in the action column.
    action_mapping = {'DEBIT': -1, 'TRANSFER': -1, 'CASH_IN': 1, 'CASH_OUT': -1, 'PAYMENT': -1}
    signed_amount = df['amount'] * df['action'].map(action_mapping)
    # Rejected overdrafts (isUnauthorizedOverdraft == 1) contribute no balance change
    return df['newBalanceOrig'] == df['oldBalanceOrig'] + signed_amount * (1 - df['isUnauthorizedOverdraft'])

# Destination-side counterpart, currently disabled:
# balance(t) is in newBalanceDest; balance(t-1) is in oldBalanceDest
# def check_destBalance(df):
#     action_mapping = {'DEBIT': -1, 'TRANSFER': -1, 'CASH_IN': 1, 'CASH_OUT': -1, 'PAYMENT': -1}
#     return df['newBalanceDest'] == df['oldBalanceDest'] + (df['amount']*df['action'].map(action_mapping))*(1-df['isUnauthorizedOverdraft'])

# Some out-of-the-box constraints are also available, such as GreaterThan,
# which checks whether one or more columns are greater than the provided value(s)

c1 = Positive(columns=['amount'])
c2 = CustomConstraint(name="Balance Check Origin | Entity | axis=0",
                      check=check_originBalance,
                      available_columns=['nameOrig', 'oldBalanceOrig', 'newBalanceOrig', 'amount', 'action', 'isUnauthorizedOverdraft'],
                      entity='nameOrig',
                      axis=0)

engine = ConstraintEngine()
engine.add_constraints([c1, c2])
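One caveat about the origin balance check: it compares floating-point values with exact equality. In the five preview rows shown earlier, three recorded `newBalanceOrig` values differ from the recomputed balance by one cent, presumably because of rounding. Exact comparison may therefore count such rows as violations. The sketch below is a plain pandas/NumPy illustration of a tolerance-based alternative. It is not part of the YData API, and the function names and the `atol` default of 0.015 are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

ACTION_SIGN = {"DEBIT": -1, "TRANSFER": -1, "CASH_IN": 1, "CASH_OUT": -1, "PAYMENT": -1}


def expected_origin_balance(df: pd.DataFrame) -> pd.Series:
    """Recompute the origin balance from the old balance and the signed amount."""
    signed_amount = df["amount"] * df["action"].map(ACTION_SIGN)
    return df["oldBalanceOrig"] + signed_amount * (1 - df["isUnauthorizedOverdraft"])


def check_origin_balance_with_tolerance(df: pd.DataFrame, atol: float = 0.015) -> pd.Series:
    """Boolean Series: True where the recorded balance is within `atol` of the expected one."""
    close = np.isclose(df["newBalanceOrig"], expected_origin_balance(df), rtol=0.0, atol=atol)
    return pd.Series(close, index=df.index)


# Values copied from the five preview rows of the original dataset.
preview = pd.DataFrame(
    {
        "action": ["CASH_IN", "PAYMENT", "CASH_IN", "PAYMENT", "CASH_IN"],
        "amount": [229382.36, 3898.67, 328906.66, 5893.84, 130091.39],
        "oldBalanceOrig": [67.45, 4330561.39, 2738664.99, 355476.1, 2732966.8],
        "newBalanceOrig": [229449.8, 4326662.72, 3067571.64, 349582.27, 2863058.19],
        "isUnauthorizedOverdraft": [0, 0, 0, 0, 0],
    }
)

exact = preview["newBalanceOrig"] == expected_origin_balance(preview)
tolerant = check_origin_balance_with_tolerance(preview)
print(pd.DataFrame({"exact": exact, "tolerant": tolerant}))
```

Like `check_originBalance`, the tolerant version returns a boolean Series, so it has the same shape of result. Whether a tolerance of about one cent is acceptable is a business decision rather than a technical one.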

An additional set of constraints stems from the entity lifecycle, or trajectory: how the state of each entity evolves across its successive transactions.

TODO: add more detail on why this constraint validation is important for the use case, and how much it impacts the data synthesis process.

YData's `ConstraintEngine` also supports this scenario.

In [7]:
from pandas import DataFrame as pdDataframe
from utils import check_balance_with_interaction

In [8]:
from typing import List, Union, Optional

from ydata.dataset.dataset import Dataset
from ydata.constraints.base import RowConstraint

# Defining the class used to create the state constraint
class PaySimStateConstraint(RowConstraint):
    def __init__(
        self,
        name: Optional[str] = None
    ):
        self.name = name
        
    def validate(self, dataset: Dataset):
        return check_balance_with_interaction(dataset.to_pandas())

c = PaySimStateConstraint(name="Balance Check with interaction via CustomStateConstraint")

engine.add_constraint(c)

In [9]:
# Getting the summary of the constraints validation
engine.validate(dataset)
engine.summary()

{'violation_count': 3440390,
 'violation_ratio': 1.0,
 'violation_per_constraint': {"Positive(columns=['amount'])": {'violation_count': 0,
   'violation_ratio': 0.0,
   'validation_time': (18.17375874519348,)},
  'Balance Check Origin | Entity | axis=0': {'violation_count': 1500879,
   'violation_ratio': 0.4362525760160912,
   'validation_time': (110.14000582695007,)},
  'Balance Check with interaction via CustomStateConstraint': {'violation_count': 3440390,
   'violation_ratio': 1.0,
   'validation_time': (753.0259964466095,)}}}

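The summary makes the state of the original data explicit. No row violates the `Positive` constraint on `amount`. 1,500,879 rows (about 43.6%) fail the origin balance check. All 3,440,390 rows are flagged by the interaction-based state constraint, which is why the overall violation ratio is 1.0. These ratios provide a baseline against which the synthetic data can later be compared.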
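To make a summary like this easier to act on, the per-constraint results can be ranked by violation ratio. The sketch below is plain Python operating on a dictionary with the structure printed above. The counts and ratios are copied from that output, with the `validation_time` entries omitted for brevity, and `rank_constraints` is a hypothetical helper rather than a YData function.

```python
# Abridged copy of the validation summary shown above (validation_time omitted).
summary = {
    "violation_count": 3440390,
    "violation_ratio": 1.0,
    "violation_per_constraint": {
        "Positive(columns=['amount'])": {
            "violation_count": 0,
            "violation_ratio": 0.0,
        },
        "Balance Check Origin | Entity | axis=0": {
            "violation_count": 1500879,
            "violation_ratio": 0.4362525760160912,
        },
        "Balance Check with interaction via CustomStateConstraint": {
            "violation_count": 3440390,
            "violation_ratio": 1.0,
        },
    },
}


def rank_constraints(summary: dict) -> list:
    """Return (constraint name, violation ratio) pairs, worst offender first."""
    per_constraint = summary["violation_per_constraint"]
    pairs = [(name, stats["violation_ratio"]) for name, stats in per_constraint.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranked = rank_constraints(summary)
for name, ratio in ranked:
    print(f"{ratio:8.2%}  {name}")
```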

## Creating a synthetic replica of the PaySim dataset

Only a subset of the columns is synthesized directly. The balance columns are left out of the synthesis: they are constrained by `amount` and `action`, so they are derived afterwards from the synthesized values.

In [10]:
from ydata.utils.data_types import DataType, VariableType
from ydata.dataset.dataset_type import DatasetType
from ydata.synthesizers.timeseries.model import TimeSeriesSynthesizer



In [11]:
# Selecting the columns to be synthesized
sel_dataset = dataset[['time',
                       'action',
                       'amount',
                       'nameOrig',
                       'nameDest',
                       'isFraud',
                       'isFlaggedFraud',
                       'isUnauthorizedOverdraft']]

In [12]:
#Creating the Metadata object for the synthesis
dataset_attrs = {
     "sortbykey": "time",
     "entity_id_cols": ["nameOrig", "nameDest"]
}

m = Metadata(sel_dataset,
             dataset_attrs,
             dataset_type=DatasetType.TIMESERIES)

In [13]:
synthesizer = TimeSeriesSynthesizer()
synthesizer.fit(sel_dataset, metadata=m)

INFO: 2022-07-06 22:04:27,263 [SYNTHESIZER] - Initializing Time Series SYNTHESIZER.
INFO: 2022-07-06 22:04:27,267 [SYNTHESIZER] - Number columns considered for synth: 8
INFO: 2022-07-06 22:05:00,288 [SYNTHESIZER] - Starting the synthetic data modeling process over 115x1 blocks.
INFO: 2022-07-06 22:05:00,297 [SYNTHESIZER] - Generating pipeline for segment (-0.001, 29916.426]
INFO: 2022-07-06 22:05:00,449 [SYNTHESIZER] - Preprocess segment
INFO: 2022-07-06 22:05:01,058 [SYNTHESIZER] - Synthesizer init.
INFO: 2022-07-06 22:05:01,064 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.
INFO: 2022-07-06 22:05:08,436 [SYNTHESIZER] - Generating pipeline for segment (29916.426, 59832.852]
INFO: 2022-07-06 22:05:08,444 [SYNTHESIZER] - Preprocess segment
INFO: 2022-07-06 22:05:09,071 [SYNTHESIZER] - Synthesizer init.
INFO: 2022-07-06 22:05:09,073 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.
INFO: 2022-07-06 22:05:10,292 [SYNTHESIZER] - Generating pipeline for 

<ydata.synthesizers.timeseries.model.TimeSeriesSynthesizer at 0x7f6ddf279c10>

In [14]:
# Generating a sample of synthetic data
sample = synthesizer.sample(n_samples=len(dataset))

INFO: 2022-07-06 22:11:27,160 [SYNTHESIZER] - Start generating model samples.
INFO: 2022-07-06 22:11:27,161 [SYNTHESIZER] - Sample segment (-0.001, 29916.426]
INFO: 2022-07-06 22:11:28,443 [SYNTHESIZER] - Sample segment (29916.426, 59832.852]
INFO: 2022-07-06 22:11:29,363 [SYNTHESIZER] - Sample segment (59832.852, 89749.278]
INFO: 2022-07-06 22:11:30,509 [SYNTHESIZER] - Sample segment (89749.278, 119665.704]
INFO: 2022-07-06 22:11:31,208 [SYNTHESIZER] - Sample segment (119665.704, 149582.13]
INFO: 2022-07-06 22:11:31,979 [SYNTHESIZER] - Sample segment (149582.13, 179498.557]
INFO: 2022-07-06 22:11:32,621 [SYNTHESIZER] - Sample segment (179498.557, 209414.983]
INFO: 2022-07-06 22:11:33,429 [SYNTHESIZER] - Sample segment (209414.983, 239331.409]
INFO: 2022-07-06 22:11:35,006 [SYNTHESIZER] - Sample segment (239331.409, 269247.835]
INFO: 2022-07-06 22:11:35,759 [SYNTHESIZER] - Sample segment (269247.835, 299164.261]
INFO: 2022-07-06 22:11:36,437 [SYNTHESIZER] - Sample segment (299164.261, 

In [15]:
### The synthesized dataset
sample.head(10)

idx,time,action,amount,nameOrig,nameDest,isFraud,isFlaggedFraud,isUnauthorizedOverdraft
0,0,CASH_IN,230163.11,C5831266343,M1836583048,0,0,0
1,1,CASH_IN,96795.66,C1838435211,M4632515968,0,0,0
2,2,CASH_IN,227298.43,C1911378560,M7113244499,0,0,0
3,3,CASH_IN,155631.49,C9648741489,M3688842852,0,0,0
4,4,CASH_IN,118296.85,C5502668364,M4934014946,0,0,0
5,5,CASH_IN,238892.12,C7292130163,M0618525076,0,0,0
6,6,CASH_IN,312997.55,C6572617605,M8734902957,0,0,0
7,7,CASH_IN,326002.9,C0199859122,M9290819283,0,0,0
8,8,CASH_IN,297653.14,C6837295018,M0640225594,0,0,0
9,9,CASH_IN,61676.28,C3323406427,M5755289891,0,0,0


### Calculating the constrained features - balances
Origin and destination balances need to take into account the generated amount.

In [None]:
# Building the function to calculate the origin and destination balances
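As a starting point for this step, the sketch below derives origin balances from synthesized amounts in plain pandas. It rests on several assumptions that the source does not confirm: every origin account starts from `initial_balance`, each account's old balance equals its previous new balance, rejected overdrafts leave the balance unchanged, and the action signs are those used in `check_originBalance`. The function name is hypothetical. Destination balances are left out because they would need their own sign mapping.

```python
import pandas as pd

# Sign of each action on the origin balance (same mapping as check_originBalance).
ACTION_SIGN = {"CASH_IN": 1, "CASH_OUT": -1, "DEBIT": -1, "PAYMENT": -1, "TRANSFER": -1}


def add_origin_balances(df: pd.DataFrame, initial_balance: float = 0.0) -> pd.DataFrame:
    """Add oldBalanceOrig/newBalanceOrig as a running per-account sum of signed amounts."""
    out = df.sort_values("time", kind="stable").copy()
    # Rejected overdrafts (flag == 1) contribute no balance change.
    delta = out["amount"] * out["action"].map(ACTION_SIGN) * (1 - out["isUnauthorizedOverdraft"])
    running = delta.groupby(out["nameOrig"]).cumsum()
    out["newBalanceOrig"] = (initial_balance + running).round(2)
    out["oldBalanceOrig"] = (out["newBalanceOrig"] - delta).round(2)
    return out


# Toy synthetic sample: account C1 has three operations, the last one rejected.
toy = pd.DataFrame(
    {
        "time": [0, 1, 2, 3],
        "action": ["CASH_IN", "PAYMENT", "CASH_IN", "CASH_OUT"],
        "amount": [100.0, 30.0, 50.0, 500.0],
        "nameOrig": ["C1", "C1", "C2", "C1"],
        "isUnauthorizedOverdraft": [0, 0, 0, 1],
    }
)

balances = add_origin_balances(toy)
print(balances[["time", "nameOrig", "oldBalanceOrig", "newBalanceOrig"]])
```

This treatment ignores accounts that appear both as origin and as destination. Handling those would require a single ledger per account rather than one running sum per column.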

## Evaluating the quality of the synthetic dataset

- Arunn's whitepaper
- Quemy notebooks on MultiEntity
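
As a very first fidelity check, the distribution of a categorical column such as `action` can be compared between the original and the synthetic data. The sketch below uses the total variation distance, computed with plain pandas. It is an illustrative metric chosen for this example, not the evaluation method of the YData platform or of the resources listed above, and the toy series are made up.

```python
import pandas as pd


def total_variation_distance(real: pd.Series, synth: pd.Series) -> float:
    """Half the L1 distance between two categorical frequency distributions.

    0.0 means identical frequencies, 1.0 means no category in common.
    """
    p = real.value_counts(normalize=True)
    q = synth.value_counts(normalize=True)
    # Align on the union of categories, treating missing ones as frequency 0.
    p, q = p.align(q, fill_value=0.0)
    return float(0.5 * (p - q).abs().sum())


# Made-up example: the synthetic sample over-represents CASH_IN.
real_actions = pd.Series(["CASH_IN", "CASH_IN", "PAYMENT", "PAYMENT"])
synth_actions = pd.Series(["CASH_IN", "CASH_IN", "CASH_IN", "PAYMENT"])

tvd = total_variation_distance(real_actions, synth_actions)
print(f"Total variation distance on action: {tvd:.2f}")
```

A per-column metric like this says nothing about sequential structure or business-rule compliance, so it can only complement the constraint validation performed earlier.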