# Sequential data synthesis with YData

## The PaySim use case

Payments data is one of the most common types of transactional datasets, and for many business areas one of the most valuable. However, it is particularly sensitive and has complex underlying logic governing it, which makes it a perfect test bed for data quality assessment and synthetic data generation.

The dataset _“PaySim: A financial mobile money simulator for fraud detection”_ is a case study based on a real company that developed a mobile money implementation. It gave mobile phone users the ability to transfer money between themselves, using the phone as a sort of electronic wallet.

**Outline of this document**

- Ingesting and understanding a transactional financial dataset (including business logic)
- Training a synthetic data generator over the dataset
- Generating a synthetic version of the transactional dataset
- Evaluating the statistical fidelity and analytical utility of the generated data, including abidance by business logic

## Dataset

PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company that provides the mobile financial service, which currently runs in more than 14 countries around the world.

The PaySim version used is a custom one generated through the code on the [PaySim GitHub repository](https://github.com/EdgarLopezPhD/PaySim) and based on the paper [_PaySim: A financial mobile money simulator for fraud detection_](https://www.researchgate.net/publication/313138956_PAYSIM_A_FINANCIAL_MOBILE_MONEY_SIMULATOR_FOR_FRAUD_DETECTION).

This dataset version can be downloaded from **⚠️ TODO: add link to bucket with dataset**. 

### Loading the dataset

YData's platform is fully integrated - this means that any source of data created through the UI can be easily consumed within the Labs (via the [Platform SDK](https://github.com/ydataai/academy/blob/master/1%20-%20platform-sdk/ui-to-sdk-examples.ipynb)) for further exploration, leveraging the platform's built-in scalability. Alternatively, YData's [Connectors](https://github.com/ydataai/academy/tree/master/2%20-%20connectors) could also be used.

Below, the Platform SDK will be used. 

In [1]:
# Importing YData's packages
from ydata.platform.datasources import DataSources
from ydata.metadata import Metadata

# Creating a Dataset from the Data Source
datasource = DataSources.get(uid='5759d69f-e127-419d-a382-71d2cc01025a',
                             namespace='45685d15-0577-4001-834b-701ed6a52ad0')

dataset = datasource.read()

# Quickly previewing the Dataset
dataset.head()




| idx | time | step | action | amount | nameOrig | oldBalanceOrig | newBalanceOrig | nameDest | oldBalanceDest | newBalanceDest | isFraud | isFlaggedFraud | isUnauthorizedOverdraft |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | CASH_IN | 229382.36 | C3568019779 | 67.45 | 229449.8 | M1836583048 | 0.0 | 0.0 | 0 | 0 | 0 |
| 1 | 2099485 | 215 | PAYMENT | 3898.67 | C7787814982 | 4330561.39 | 4326662.72 | M8323358603 | 119874.35 | 123773.02 | 0 | 0 | 0 |
| 2 | 1422612 | 162 | CASH_IN | 328906.66 | C5933160999 | 2738664.99 | 3067571.64 | M5203480520 | 62420.96 | 62420.96 | 0 | 0 | 0 |
| 3 | 2771031 | 304 | PAYMENT | 5893.84 | C4556212256 | 355476.1 | 349582.27 | M7504842284 | 134119.24 | 140013.08 | 0 | 0 | 0 |
| 4 | 735308 | 41 | CASH_IN | 130091.39 | C6471076107 | 2732966.8 | 2863058.19 | M6673769831 | 8268.4 | 8268.4 | 0 | 0 | 0 |


In [2]:
dataset.shape(lazy_eval=False)

(3440390, 13)

**⚠️ TODO**: missing `value_counts()` to show imbalance

By quickly taking a peek at the dataset, we can identify the several types of features we have:
    
- `time` is our sequential indicator
- features describing the financial operation and its flow (`action`, `amount`, `nameOrig` and `nameDest`)
- features accounting for the variation in balances (`oldBalanceOrig`, `newBalanceOrig`, `oldBalanceDest`, `newBalanceDest`)
- features related to fraud (`isFlaggedFraud`, the automatically detected potential frauds, and `isFraud`, the real assessment). Fraudulent events are a small minority, accounting for 0.09% of all transactions.
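
The class imbalance mentioned above can be quantified with a relative frequency count. The following is a minimal sketch, assuming a pandas sample of the dataset is available; here a tiny hand-made frame (`toy`, illustrative values only, not real PaySim rows) stands in for it.

```python
import pandas as pd

# Toy stand-in for a pandas sample of the dataset (values are illustrative only)
toy = pd.DataFrame({"isFraud": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

# normalize=True returns the share of each class instead of raw counts
fraud_share = toy["isFraud"].value_counts(normalize=True)
print(fraud_share)
```

On the real data, the same call on the `isFraud` column of a pandas sample would give an estimate of the fraud share.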

### Evaluating data quality

The automated profiling available on the platform complements the insights above with automatic detection of potential data quality issues. These data quality warnings help identify issues that must be solved prior to data synthesis. To access them, create a `Metadata` object, which holds information about the dataset.

In [3]:
# Creating a Metadata (where warnings can be reviewed) from the Dataset
metadata = Metadata(dataset)
print(metadata)

Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 13
% of duplicate rows: 0
Target column:

Column detail:
                     Column    Data type Variable type
0                      time    numerical           int
1                      step  categorical           int
2                    action  categorical        string
3                    amount    numerical         float
4                  nameOrig  categorical        string
5            oldBalanceOrig    numerical         float
6            newBalanceOrig    numerical         float
7                  nameDest  categorical        string
8            oldBalanceDest    numerical         float
9            newBalanceDest    numerical         float
10                  isFraud  categorical           int
11           isFlaggedFraud  categorical           int
12  isUnauthorizedOverdraft  categorical           int




This gives us additional information about the dataset: balances and amounts have highly skewed distributions, which suggests different types of transactions exist. We also seem to have a large number of entities represented in this dataset. 

For now, we will drop the `step` column, as it encodes the same sequence information as `time`: a `step` may contain multiple transactions, while `time` is an artificial column that linearly orders all the transactions within each `step`.
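
Before dropping `step`, the claim that `time` refines it can be checked quickly. The following is a minimal sketch on a toy pandas frame with made-up values; it assumes the data fits in a pandas DataFrame (for the full dataset, a sample obtained with `to_pandas()` could be used instead).

```python
import pandas as pd

# Toy frame with hypothetical values: three steps, several transactions per step
toy = pd.DataFrame({
    "time": [4, 0, 2, 1, 3, 5],
    "step": [2, 1, 1, 1, 2, 3],
})

# If `time` refines `step`, ordering by `time` must leave `step` non-decreasing
ordered = toy.sort_values("time")
step_is_redundant = bool(ordered["step"].is_monotonic_increasing)
print(step_is_redundant)
```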

In [4]:
# Keeping every column except `step`
dataset = dataset[['time',
                   'action',
                   'amount',
                   'nameOrig',
                   'oldBalanceOrig',
                   'newBalanceOrig',
                   'nameDest',
                   'oldBalanceDest',
                   'newBalanceDest',
                   'isFraud',
                   'isFlaggedFraud',
                   'isUnauthorizedOverdraft']]

In [5]:
dataset.columns

['time',
 'action',
 'amount',
 'nameOrig',
 'oldBalanceOrig',
 'newBalanceOrig',
 'nameDest',
 'oldBalanceDest',
 'newBalanceDest',
 'isFraud',
 'isFlaggedFraud',
 'isUnauthorizedOverdraft']

In [6]:
dataset.head()

| idx | time | action | amount | nameOrig | oldBalanceOrig | newBalanceOrig | nameDest | oldBalanceDest | newBalanceDest | isFraud | isFlaggedFraud | isUnauthorizedOverdraft |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | CASH_IN | 229382.36 | C3568019779 | 67.45 | 229449.8 | M1836583048 | 0.0 | 0.0 | 0 | 0 | 0 |
| 1 | 2099485 | PAYMENT | 3898.67 | C7787814982 | 4330561.39 | 4326662.72 | M8323358603 | 119874.35 | 123773.02 | 0 | 0 | 0 |
| 2 | 1422612 | CASH_IN | 328906.66 | C5933160999 | 2738664.99 | 3067571.64 | M5203480520 | 62420.96 | 62420.96 | 0 | 0 | 0 |
| 3 | 2771031 | PAYMENT | 5893.84 | C4556212256 | 355476.1 | 349582.27 | M7504842284 | 134119.24 | 140013.08 | 0 | 0 | 0 |
| 4 | 735308 | CASH_IN | 130091.39 | C6471076107 | 2732966.8 | 2863058.19 | M6673769831 | 8268.4 | 8268.4 | 0 | 0 | 0 |


In [None]:
dataset.value_counts(col='isFraud')

**⚠️ TODO:** 
- Behaviour when transactions are flagged as fraud. Do the transactions go through, and are the balances updated? In some cases they seem to go through, in others not (see the sample below, where in _TRANSFERS_ both balances are updated, but in _CASH_OUT_ only the origin balances are updated)
- _Please note that there is no record of balances for clients whose names start with M (Merchants)_

In [7]:
data_sample = dataset.sample(100000).to_pandas()
data_sample[data_sample.isFraud == 1].head(10)

| idx | time | action | amount | nameOrig | oldBalanceOrig | newBalanceOrig | nameDest | oldBalanceDest | newBalanceDest | isFraud | isFlaggedFraud | isUnauthorizedOverdraft |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2693 | 2789951 | TRANSFER | 3402623.41 | C5256384465 | 3402623.41 | 0.0 | CC1029674054 | 0.0 | 3402623.41 | 1 | 0 | 0 |
| 3278 | 2122246 | TRANSFER | 2696073.0 | C9158728708 | 2696073.0 | 0.0 | CC5177879110 | 0.0 | 2696073.0 | 1 | 0 | 0 |
| 3279 | 2122260 | CASH_OUT | 1499555.12 | CC8862219637 | 1499555.12 | 0.0 | M6215644884 | 108079.04 | 108079.04 | 1 | 0 | 0 |
| 20258 | 2238415 | TRANSFER | 4057858.22 | C5707006942 | 4057858.22 | 0.0 | CC5793434220 | 0.0 | 4057858.22 | 1 | 0 | 0 |
| 23162 | 1581917 | CASH_OUT | 1948499.0 | CC8802731505 | 1948499.0 | 0.0 | M8330551572 | 97641.89 | 97641.89 | 1 | 0 | 0 |
| 31705 | 953519 | TRANSFER | 2673360.16 | C0241370014 | 2673360.16 | 0.0 | CC7381838191 | 0.0 | 2673360.16 | 1 | 0 | 0 |
| 31865 | 954716 | TRANSFER | 3419383.12 | C8058883304 | 3419383.12 | 0.0 | CC9411585546 | 0.0 | 3419383.12 | 1 | 0 | 0 |
| 31867 | 954724 | CASH_OUT | 3590999.18 | CC6549428035 | 3590999.18 | 0.0 | M3600853125 | 35365.97 | 35365.97 | 1 | 0 | 0 |
| 31869 | 954745 | TRANSFER | 6747536.62 | C9130150296 | 6747536.62 | 0.0 | CC5355430220 | 0.0 | 6747536.62 | 1 | 0 | 0 |
| 31871 | 954754 | CASH_OUT | 1847907.25 | CC8450828994 | 1847907.25 | 0.0 | M2295217524 | 20195.46 | 20195.46 | 1 | 0 | 0 |


In [8]:
data_sample[data_sample.action == 'CASH_IN'].head(10)

| idx | time | action | amount | nameOrig | oldBalanceOrig | newBalanceOrig | nameDest | oldBalanceDest | newBalanceDest | isFraud | isFlaggedFraud | isUnauthorizedOverdraft |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | CASH_IN | 229382.36 | C3568019779 | 67.45 | 229449.8 | M1836583048 | 0.0 | 0.0 | 0 | 0 | 0 |
| 3 | 2771044 | CASH_IN | 48808.44 | C4556212256 | 75075.95 | 123884.39 | M8086589413 | 156003.4 | 156003.4 | 0 | 0 | 0 |
| 4 | 2099500 | CASH_IN | 109885.57 | C0678112597 | 2747184.04 | 2857069.61 | M9909231856 | 62174.51 | 62174.51 | 0 | 0 | 0 |
| 5 | 25 | CASH_IN | 107426.96 | C6274596817 | 91.66 | 107518.62 | M6999620024 | 0.0 | 0.0 | 0 | 0 | 0 |
| 6 | 49 | CASH_IN | 271236.81 | C2392540098 | 16.94 | 271253.74 | M7362150288 | 0.0 | 0.0 | 0 | 0 | 0 |
| 9 | 2099557 | CASH_IN | 169301.48 | C0347915649 | 1901202.55 | 2070504.03 | M8226185334 | 40041.35 | 40041.35 | 0 | 0 | 0 |
| 10 | 1422692 | CASH_IN | 43618.68 | C2681990324 | 2724214.21 | 2767832.89 | M1960862483 | 49165.64 | 49165.64 | 0 | 0 | 0 |
| 17 | 144 | CASH_IN | 125139.81 | C5324201498 | 13.25 | 125153.06 | M5042326640 | 0.0 | 0.0 | 0 | 0 | 0 |
| 20 | 153 | CASH_IN | 22193.31 | C8819492197 | 258725.99 | 280919.31 | M9859718536 | 0.0 | 0.0 | 0 | 0 | 0 |
| 22 | 159 | CASH_IN | 41000.75 | C4892645723 | 10917.77 | 51918.51 | M1770382943 | 0.0 | 0.0 | 0 | 0 | 0 |


### Validation of business rules 

A dataset that does not abide by the expected business rules may go unnoticed through all the phases of a project and fail to deliver any business value once live. As such, it's important to validate these kinds of constraints. `Metadata` objects can be used to validate several types of arbitrary row-wise and column-wise constraints (per entity, if required), through their integration with YData's `ConstraintEngine`.

#### Which business rules and constraints does this dataset have?

Stemming from its transactional-financial nature and given the [extra context available on Kaggle](https://www.kaggle.com/datasets/ealaxi/paysim1) and in the original paper, some implicit and explicit business rules can be derived for this dataset:
    
- `CASH_IN`, `CASH_OUT` and `PAYMENT` operations need to have `Merchant` accounts as their destination
- `DEBIT` operations need to have `BANKS` as the destination
- If the operation is a rejected overdraft (`isUnauthorizedOverdraft==1`), the balance of the origin entity is kept
- The `amount` column should only assume positive values (as the sign of the transaction is encoded in the `action` column)
- For every origin entity, the trajectory of its balance over time must be coherent with its interactions with other entities.
    - In practice, if the operation is not a rejected overdraft: $balance(t+1) = balance(t) + amount(t) \times (-1 \text{ if } action(t) \in \{\text{CASH\_OUT}, \text{PAYMENT}, \text{TRANSFER}, \text{DEBIT}\} \text{ else } 1)$
    - For destination entities, the same rule applies with the signs inverted: `amount` is multiplied by $+1$ for the actions listed above (the entity receives money) and by $-1$ otherwise
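
The origin-side balance rule can be sketched in plain pandas. This is an illustration under stated assumptions, not the platform's implementation: the function name `origin_balance_ok` and the tolerance `atol` are hypothetical. The first two rows are copied from the dataset preview above and the last two are made up. A tolerance is used because recorded balances are rounded: in the first row, 67.45 + 229382.36 = 229449.81, while the recorded `newBalanceOrig` is 229449.8, so an exact equality check would flag it.

```python
import numpy as np
import pandas as pd

# Sign of `amount` from the origin's point of view
ORIGIN_SIGN = {"CASH_IN": 1, "CASH_OUT": -1, "PAYMENT": -1, "TRANSFER": -1, "DEBIT": -1}

def origin_balance_ok(df: pd.DataFrame, atol: float = 0.02) -> pd.Series:
    """Row-wise check of newBalanceOrig against the balance rule, within `atol`."""
    sign = df["action"].map(ORIGIN_SIGN)
    # Rejected overdrafts leave the balance untouched
    applied = 1 - df["isUnauthorizedOverdraft"]
    expected = df["oldBalanceOrig"] + df["amount"] * sign * applied
    ok = np.isclose(df["newBalanceOrig"], expected, rtol=0.0, atol=atol)
    return pd.Series(ok, index=df.index)

rows = pd.DataFrame({
    "action": ["CASH_IN", "PAYMENT", "CASH_OUT", "TRANSFER"],
    "amount": [229382.36, 3898.67, 100.0, 10.0],
    "oldBalanceOrig": [67.45, 4330561.39, 50.0, 100.0],
    "newBalanceOrig": [229449.8, 4326662.72, 50.0, 100.0],
    "isUnauthorizedOverdraft": [0, 0, 1, 0],
})
print(origin_balance_ok(rows))
```

The last row is the only violation: a transfer of 10 that is not a rejected overdraft should have lowered the balance from 100 to 90.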

For demo purposes, let's validate only the last two constraints we defined. 

In [None]:
#double check this with Quemy
from ydata.constraints.engine import ConstraintEngine
from ydata.constraints.constraint import GreaterThan, Positive, CustomConstraint

# Create the custom constraints for the dataset

# Validate coherence in balance of the origin entity during the transaction
def check_originBalance(df):
    # balance(t) is in newBalanceOrig; balance(t-1) is in oldBalanceOrig
    # Sign of `amount` from the origin's point of view: only CASH_IN adds money
    action_mapping = {'DEBIT': -1, 'TRANSFER': -1, 'CASH_IN': 1, 'CASH_OUT': -1, 'PAYMENT': -1}
    # Rejected overdrafts (isUnauthorizedOverdraft == 1) must leave the balance unchanged
    return df['newBalanceOrig'] == df['oldBalanceOrig'] + (df['amount']*df['action'].map(action_mapping))*(1-df['isUnauthorizedOverdraft'])

# Validate coherence in balance of the destination entity during the transaction
def check_destBalance(df):
    # balance(t) is in newBalanceDest; balance(t-1) is in oldBalanceDest
    # For the receiving entity, the transaction signs are inverted
    action_mapping = {'DEBIT': 1, 'TRANSFER': 1, 'CASH_IN': -1, 'CASH_OUT': 1, 'PAYMENT': 1}
    # Rejected overdrafts (isUnauthorizedOverdraft == 1) must leave the balance unchanged
    return df['newBalanceDest'] == df['oldBalanceDest'] + (df['amount']*df['action'].map(action_mapping))*(1-df['isUnauthorizedOverdraft'])


# Some other out-of-the-box constraints are also available, like GreaterThan,
# which checks whether one or more columns are greater than the provided value(s)

c1 = Positive(columns=['amount'])
c2 = CustomConstraint(name="Balance Check Origin Entity | axis=1", 
                       check=check_originBalance,
                       available_columns=['nameOrig', 'oldBalanceOrig', 'newBalanceOrig', 'amount', 'action', 'isUnauthorizedOverdraft'],
                       #entity='nameOrig', 
                       axis=0)

c3 = CustomConstraint(name="Balance Check Destination Entity | axis=1", 
                       check=check_destBalance,
                       available_columns=['nameDest', 'oldBalanceDest', 'newBalanceDest', 'amount', 'action', 'isUnauthorizedOverdraft'],
                       #entity='nameDest', 
                       axis=0)


engine = ConstraintEngine()
engine.add_constraints([c1, c2, c3])

**⚠️ TODO:** Why are we checking by entity? This seems to be a simple row constraint given we keep the running balances of both entities in the `oldBalance*` variables.  

Running the Constraints engine with and without this verification yields very similar violation ratios.  

- With entity set: Positive amount 0.0, Balance Check origin 0.43625, Balance Check Destination 0.8145
- Without entity set: Positive amount 0.0, Balance Check origin 0.43625, Balance Check Destination 0.8145 (but takes longer)

In [None]:
"""
from pandas import DataFrame as pdDataframe
from utils import check_balance_with_interaction
from typing import List, Union, Optional

from ydata.dataset.dataset import Dataset
from ydata.constraints.base import RowConstraint

#Setting the class that allows to create the constraint
class PaySimStateConstraint(RowConstraint):
    def __init__(
        self,
        name: Optional[str] = None
    ):
        self.name = name
        
    def validate(self, dataset: Dataset):
        return check_balance_with_interaction(dataset.to_pandas())

c = PaySimStateConstraint(name="Balance Check with interaction via CustomStateConstraint")

engine.add_constraint(c)
"""

**⚠️ TODO**

_It is not possible to check a constraint on the balance by simply looking at each entity's trajectory. In this case, checking the constraint on the balance requires maintaining a state of the balance for each entity while iterating row by row. The current Constraint Engine does not directly support such a scenario. However, it is possible to define a constraint object which will be checked row by row while maintaining this state._ -> **well, the PaySim dataset we have maintains this running balance with the `oldBalance` columns. So maybe doing the _Balance Check Origin_ and _Balance Check Destination_ constraints, per row, is enough to verify this?**
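
To make the difference between the two options concrete, here is a minimal pure-Python sketch of a stateful check for the origin side. It is not the `check_balance_with_interaction` helper referenced above (whose implementation is not shown); the function name `replay_origin_balances`, the tolerance `tol` and the toy rows are all hypothetical. It replays transactions in `time` order and keeps one running balance per origin entity.

```python
from typing import Dict, Iterable, List

ORIGIN_SIGN = {"CASH_IN": 1, "CASH_OUT": -1, "PAYMENT": -1, "TRANSFER": -1, "DEBIT": -1}

def replay_origin_balances(rows: Iterable[dict], tol: float = 0.02) -> List[bool]:
    """Replay rows in time order; flag rows whose newBalanceOrig matches the carried balance."""
    state: Dict[str, float] = {}
    flags: List[bool] = []
    for row in sorted(rows, key=lambda r: r["time"]):
        entity = row["nameOrig"]
        # First sighting of an entity: trust its recorded opening balance
        balance = state.get(entity, row["oldBalanceOrig"])
        if not row["isUnauthorizedOverdraft"]:
            balance += ORIGIN_SIGN[row["action"]] * row["amount"]
        flags.append(abs(balance - row["newBalanceOrig"]) <= tol)
        # Carry the recorded balance forward so one mismatch does not cascade
        state[entity] = row["newBalanceOrig"]
    return flags

toy_rows = [
    {"time": 0, "nameOrig": "C1", "action": "CASH_IN", "amount": 100.0,
     "oldBalanceOrig": 0.0, "newBalanceOrig": 100.0, "isUnauthorizedOverdraft": 0},
    {"time": 1, "nameOrig": "C1", "action": "PAYMENT", "amount": 30.0,
     "oldBalanceOrig": 100.0, "newBalanceOrig": 70.0, "isUnauthorizedOverdraft": 0},
    # Consistent within the row (50 - 10 = 40) but not with the previous balance of 70
    {"time": 2, "nameOrig": "C1", "action": "PAYMENT", "amount": 10.0,
     "oldBalanceOrig": 50.0, "newBalanceOrig": 40.0, "isUnauthorizedOverdraft": 0},
]
print(replay_origin_balances(toy_rows))
```

The third toy row would pass a per-row check and still fail the replay, because its `oldBalanceOrig` does not match the `newBalanceOrig` of the entity's previous transaction. Per-row checks are therefore sufficient only if the `oldBalance*` columns are themselves consistent across consecutive rows of the same entity.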

#### Running the validation 

In [None]:
# Getting the summary view of the constraints engine

engine.validate(dataset)
engine.summary()

As we can see, the PaySim dataset includes massive inconsistencies in how the balances of each entity evolve across transactions.

TODO: analysis and comments on these -> this is our actionability


Add here more detail on why this constraint validation is important for the use case, and on how much it impacts the process of data synthesis.

## Creating a synthetic replica of the PaySim dataset

Add here more detail. We are only synthesizing the amount, all the other columns will be derived from the process of synthesis.

In [None]:
from ydata.utils.data_types import DataType, VariableType
from ydata.dataset.dataset_type import DatasetType
from ydata.synthesizers.timeseries.model import TimeSeriesSynthesizer

In [None]:
#Selecting the columns to be synthesized
sel_dataset = dataset[['time',
                     'action',
                     'amount',
                     'nameOrig',
                     'nameDest',
                     'isFraud',
                     'isFlaggedFraud',
                     'isUnauthorizedOverdraft']]

In [None]:
#Creating the Metadata object for the synthesis
dataset_attrs = {
     "sortbykey": "time",
     "entity_id_cols": ["nameOrig", "nameDest"]
}

m = Metadata(sel_dataset,
             dataset_attrs,
             dataset_type=DatasetType.TIMESERIES)

In [None]:
synthesizer = TimeSeriesSynthesizer()
synthesizer.fit(sel_dataset, metadata=m)

In [None]:
# Generating a sample of synthetic data
sample = synthesizer.sample(n_samples=len(dataset))

In [None]:
# The synthesized dataset
sample.head(10)

### Calculating the constrained features - balances
Origin and destination balances need to take into account the generated amount.

Let's just calculate the new balances (`newBalanceOrig`, `newBalanceDest`), as the old ones can be trivially deduced from the new ones and the transaction amount (or by tracking the transaction amount across time).

In [None]:
### Building the function to calculate the balances origin, Destination balances
## Todo: import from utils.py
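
Until the helper from `utils.py` is added, the idea can be sketched for the origin side as follows. This is an illustration under stated assumptions, not the final function: the name `add_origin_balances` and the `opening` balance of 0 for every entity are hypothetical, the synthetic sample is assumed to be a pandas DataFrame, and non-negative balances are not enforced.

```python
import pandas as pd

ORIGIN_SIGN = {"CASH_IN": 1, "CASH_OUT": -1, "PAYMENT": -1, "TRANSFER": -1, "DEBIT": -1}

def add_origin_balances(df: pd.DataFrame, opening: float = 0.0) -> pd.DataFrame:
    """Derive origin balances by accumulating signed amounts per entity, in time order."""
    out = df.sort_values("time").copy()
    # Rejected overdrafts contribute nothing to the balance
    signed = out["amount"] * out["action"].map(ORIGIN_SIGN) * (1 - out["isUnauthorizedOverdraft"])
    out["newBalanceOrig"] = opening + signed.groupby(out["nameOrig"]).cumsum()
    # The old balance is the new balance minus the effect of the current transaction
    out["oldBalanceOrig"] = out["newBalanceOrig"] - signed
    return out

# Toy synthetic sample with made-up values
synthetic = pd.DataFrame({
    "time": [0, 1, 2, 3],
    "action": ["CASH_IN", "PAYMENT", "CASH_IN", "CASH_OUT"],
    "amount": [100.0, 30.0, 50.0, 20.0],
    "nameOrig": ["C1", "C1", "C2", "C1"],
    "isUnauthorizedOverdraft": [0, 0, 0, 0],
})
result = add_origin_balances(synthetic)
print(result[["nameOrig", "oldBalanceOrig", "newBalanceOrig"]])
```

The destination balances could be derived the same way by grouping on `nameDest` and inverting the signs.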

## Evaluating the quality of the synthetic dataset

Visualizations:
- Number of transactions per client
- Number of clients
- Imbalance
- Distribution of balance (marginals)
- Distribution of amount (marginals)
- Trajectories show the time properties (balance over time for specific account IDs)
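
Before plotting, a few of the quantities listed above can be compared numerically. The following is a minimal sketch, assuming both the real and the synthetic data are available as pandas DataFrames; the function name `summarize` and the toy frames (made-up values) are hypothetical.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Collect a few headline statistics for comparing real and synthetic data."""
    tx_per_client = df["nameOrig"].value_counts()
    return {
        "n_clients": int(tx_per_client.size),
        "median_tx_per_client": float(tx_per_client.median()),
        "fraud_rate": float(df["isFraud"].mean()),
        "amount_median": float(df["amount"].median()),
    }

# Toy stand-ins for the real and synthetic datasets (values are illustrative only)
real = pd.DataFrame({
    "nameOrig": ["C1", "C1", "C2", "C3"],
    "amount": [10.0, 20.0, 30.0, 40.0],
    "isFraud": [0, 0, 0, 1],
})
synth = pd.DataFrame({
    "nameOrig": ["S1", "S2", "S2", "S3"],
    "amount": [12.0, 18.0, 33.0, 41.0],
    "isFraud": [0, 0, 1, 0],
})

# One column per dataset, one row per statistic
comparison = pd.DataFrame({"real": summarize(real), "synthetic": summarize(synth)})
print(comparison)
```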

_Balance is a hidden state with deterministic rules and constraints (balance depends on previous row) - if that fails, so does the synthesis._