# Synthetic Payments Data Generation For Federated Payments Anomaly Detection

## Introduction

As with any Machine Learning experiment, training and evaluation of models requires good quality datasets. Quality is relative, but generally agreed upon indicators are good data distribution, and a sufficiently large enough number of samples.

This is even more pertinent with Federated Learning, as the premise of FL is to be able to train a better model than individual participant models, based on the fact that they individually have been trained on a variety of different distributions. Combined with the lack of good quality data (real-life, partially-synthetic, purely-synthetic) from payment systems, or customizable dataset generators for this domain creates a fundamental problem.

We attempt to solve this problem by spinning up a custom purely-synthetic data generation tool. The tool is capable of

1. Generating arbitrary number of training and validation datasets.
2. Generating data with different distributions which are controllable for repeatability and explain-ability.
3. Customizable features and attributes.
4. Inducing rule-based anomalies by
   1. Enable writing custom rules
   2. Enable complex anomaly scenarios via singular complex rules, or via overlapping several simple independent, and compatible rules.
   3. Control anomaly class sizes for a variety of scenarios
5. Generating datasets for different relevant classification scenarios (binary, multi-label, multi-class), and for different types of models based on their eagerness.


## Caveats

Even though this tool is fairly flexible, the tool and the data that we generate have their own caveats as it currently stands

1. The distribution of the data is completely hypothetical. It is hard to source true payments data distributions, let alone the dataset itself because of strict regulation and privacy control measures. So, the resulting models may not work in the real world simply because the distribution and the class size might be entirely different.
2. The dataset inflates the anomaly class samples to avoid minority class problems during data processing. It introduces anomalies into about 25-35% of the records, but in reality payment anomalies constitute < 1% of all the samples in payments datasets. However, this can be adjusted and a datasets with fewer anomalies can be generated.


## Import Required Modules

We start off by importing all the modules that we will need for our synthetic payments data generation.

In [1]:
# standard library imports
import os

os.environ['MPLCONFIGDIR'] = "~/"

import uuid
import logging
from logging import config
from datetime import date, datetime, timedelta

# third-party imports
import faker
import numpy as np
import pandas as pd
from babel import numbers
from currency_converter import CurrencyConverter

## Static Data

### Countries and Currencies

For CBP datasets, this serves a few purposes

1. These countries provide bounds for geo-regions that we add cross border payments data for.
2. These countries help find land geo-coordinates which might not correspond to a real address, but are real land geo-coordinates from a country.
3. These country codes also help us generate more realistic currency and rates data. Libraries such as `babel` accept countries as `alpha_2` codes to return corresponding real currency codes (`alpha_3` format e.g. `USD`).
4. Overall, this gives the whole dataset a more "realistic" feel and aligns us to use good quality information when we expand the process to create more realistic distributions.

In [2]:
# you can use pycountry to get a list of alpha_2 codes for countries of interest. I am hardcoding this list, because sometimes static data generation can become
# highly unpredictable. reference data libs do not work predictably always.
# it might be something that I am doing but don't have the time right now to fix or debug
COUNTRIES = (
    "US",
    "CA",
    "MX",
    "BR",
    "GB",
    "FR",
    "CH",
    "DE",
    "NL",
    "TR",
    "RU",
    "IN",
    "CN",
    "HK",
    "JP",
    "KR",
    "ZA",
    "AU",
    "NZ",
)
COUNTRY_STATIC_DATA = pd.DataFrame(
    {
        "country": COUNTRIES,
        "currency": [numbers.get_territory_currencies(cty)[0] for cty in COUNTRIES],
    }
)
COUNTRY_STATIC_DATA

Unnamed: 0,country,currency
0,US,USD
1,CA,CAD
2,MX,MXN
3,BR,BRL
4,GB,GBP
5,FR,EUR
6,CH,CHF
7,DE,EUR
8,NL,EUR
9,TR,TRY


### Define currency exchange rates

We use real-life historic currency exchange rate information to mimic the real-world "feel" of the dataset. We can use/generate artificial rates; however, in the spirit of data realism, we leverage the `currency_converter` module to get us some historic rates for currency pairs defined in the previous step from public data published by European Central Bank (ECB).

> Since this library under the hood calls a rest endpoint, we need to ensure that we do not call this library unnecessarily. This data though available freely, we want to ensure respectable usage.

A few points to note
- The function `generate_exchange_rates` fetches rates and saves them to a CSV file. The function defaults to using the static rates file. There is an existing file checked under `/path/to/fl/payments/synthetic-data-gen/data/EXCHANGE_RATES.csv`.
- If you wish to force fetch the rates from the API, you must pass the `force_load_from_api` flag to the `generate_exchange_rates` function as `True`.
- Be careful that if you do this, it will override the data in the `EXCHANGE_RATES.csv` file which comes with this notebook in the repository.**

In [3]:
# define some exchange rates
def generate_exchange_rates(
        rates_file_name="./data/EXCHANGE_RATES.csv", force_load_from_api=False
):
    abs_filepath = os.path.abspath(os.path.expanduser(os.path.expandvars(rates_file_name)))

    if force_load_from_api or not os.path.exists(abs_filepath):
        os.makedirs(os.path.dirname(abs_filepath), exist_ok=True)
        print(" ==> [WARNING] <== LOADING CURRENCY EXCHANGE RATES FROM API")

        curr_conv = CurrencyConverter(fallback_on_missing_rate=True)
        all_ccy_pairs = pd.MultiIndex.from_product(
            [COUNTRY_STATIC_DATA["currency"], COUNTRY_STATIC_DATA["currency"]],
            names=["CCY1", "CCY2"],
        ).to_frame(index=False)

        def _xchg_rate(row):
            rate = (
                curr_conv.convert(
                    1,
                    row["CCY1"],
                    row["CCY2"],
                    date=date(2019, 1, 1),
                )
                if row["CCY1"] != row["CCY2"]
                else 1
            )
            return round((1.0 if rate == 0.0 or not rate else rate), 4)

        all_ccy_pairs["RATE"] = all_ccy_pairs.loc[:, ["CCY1", "CCY2"]].apply(
            _xchg_rate, axis=1
        )
        all_ccy_pairs.to_csv(rates_file_name, index=False, mode="w", quotechar='"')
    return pd.read_csv(rates_file_name)


# define a helper function which simplifies fetching rates from the rates data frame
def get_exchange_rates(curr1, curr2):
    result = EXCHANGE_RATES[
        (EXCHANGE_RATES["CCY1"] == curr1) & (EXCHANGE_RATES["CCY2"] == curr2)
        ]
    return result["RATE"].values[0] if len(result) else 1


EXCHANGE_RATES: pd.DataFrame = generate_exchange_rates(force_load_from_api=False)
EXCHANGE_RATES.head()

Unnamed: 0,CCY1,CCY2,RATE
0,USD,USD,1.0
1,USD,CAD,1.3635
2,USD,MXN,19.6464
3,USD,BRL,3.8679
4,USD,GBP,0.7862


### Define reference data

Besides the currency and country reference data that we generated earlier, we define a few more constants which will help define the parameters over which data generation will take place.

This includes (but not limited to)
- Payment participant prefixes
- Payment statuses which will be sampled randomly.
- Payment transaction types. E.g. could include `REQUEST_FOR_PAYMENT` from creditor to the debitor.

In [4]:
PAYMENT_CRDTR_PREFIX = "CREDITOR"
PAYMENT_DBTR_PREFIX = "DEBITOR"
PAYMENT_TRANSACTION_TYPES = ["PAYMENT"]
PAYMENT_STATUS = ["PENDING", "PROCESSING", "COMPLETED", "FAILED", "CANCELLED"]

## Initialize Faker

Faker is a python library for generating realistic-looking fake data. It uses a pseudo-random generator backed data provider for those curious.

- For the sake of brevity we only use locales which have English as a language. in practice however, data could belong to any locale.
- We also peg the seed to a fixed value to ensure that the data generated everytime is the same. But we can avoid this if we want some degree of randomness
- Additionally, we weight the data so the data sampled from providers though completely random ascribes to a more normal data distribution.
- We also use weighting to generate data out of a distribution to make it more "realistic". if you want to randomize data, set `use_weighting=False`

In [5]:
faker.Faker.seed(0)
fake: faker.Faker = faker.Faker(locale=["en", "tr_TR"], use_weighting=True)

## Data Generation Quick Access - Functions to change sampled attributes

These functions should help you quickly tune the values of the dataset attribute normal and anomalous values

In [6]:
from enum import Enum


class Distribution(str, Enum):
    Uniform = "uniform"
    Normal = "normal"


def sample_value_from_distribution(dist: Distribution, *args, **kwargs):
    if dist == Distribution.Uniform:
        if "low" not in kwargs or "high" not in kwargs:
            raise RuntimeError(
                "If uniform distribution, you need to provide distribution intervals as low and high values.")
        return np.random.uniform(low=kwargs["low"], high=kwargs["high"])

    elif dist == Distribution.Normal:
        if "mean" not in kwargs or "std_dev" not in kwargs:
            raise RuntimeError(
                "If normal distribution, you need to provide distribution parameters as mean and std_dev, where std_dev > 0")
        return np.random.normal(loc=kwargs["mean"], scale=kwargs["std_dev"])

    raise RuntimeError("Distribution not recognized.")

#### CHANGE ME FOR DIFFERENT DATASETS

In [7]:
## functions for generating NORMAL tower lat and long. note that this returns a delta, not a absolute value
tower_latitude_perturbation_fn = lambda: float(sample_value_from_distribution(Distribution.Uniform, low=-1, high=1))
tower_longitude_perturbation_fn = lambda: float(sample_value_from_distribution(Distribution.Uniform, low=-1, high=1))

## function for generating tower lat and long perturbance. note that this returns a delta, not a absolute value
get_north_or_east_perturbation_factor = lambda: -9.1
get_south_or_west_perturbation_factor = lambda: -8.9

## function for generating normal debitor amount
debitor_amount_generator_fn = lambda: float(
    np.round(sample_value_from_distribution(Distribution.Uniform, low=10_000_000, high=20_000_000), 2))

## function for generating anomalous debitor amount
anomalous_debitor_amount_generator_fn = lambda: float(
    np.round(sample_value_from_distribution(Distribution.Uniform, low=15_000_000, high=25_000_000), 2))

## Create dataset attributes template

- To keep row data generation as simple as possible and yet be flexible enough, we define the data fields as properties which have a `name` and `value` associated with them. (c.f. `PaymentAttribute`)
- We then define parts of the attributes using 
- We define simple class method interfaces so that we can quickly and uniformly generate data over a concise API.
- This API also serves well to generate more complex forms of data with data relationships and dependencies.
- We then tie all of this together with a schema based row generator `PaymentRowGenerator` which generates the row using a class method.

### Attribute Specifier

This class helps represent all attributes as properties, where the name is pre-formatter, but the values are assigned dyn

In [8]:
class PaymentAttribute:
    def __init__(self, attr_name: str):
        self._name = attr_name.upper()
        self._value = None

    @property
    def name(self):
        return self._name

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, value):
        self._value = value

    def get_key_value(self):
        return self.name, self.value

### Payment Participant Attributes (attributes common to actors of the payment - DEBITOR and CREDITOR)

In [9]:
def get_coordinate(addr_country_code: str) -> float:
    # ST - This is annoying. faker.geo.local_latlng accepts a country code to GUARANTEE a coord.
    #   however, calling this API is not consistent, and it is prone to return None. my guess as to probably why is that
    #   it has an internal marker/ idx over a subset of locations filtered on country. if this marker/ idx falls outside set bounds,
    #   it returns none. since this is a random marker, it jumps wildly. a simple loop with 10 tries to fetch
    #   10 coords from say US sometimes returns nothing 8/10 times. retrying is the only option.
    #   10 seems to be a faithful number of times to try
    coord = None
    while not coord:
        coord = fake.local_latlng(country_code=addr_country_code)

    # print(f"------country: {addr_country}, coord: {coord} -------")
    # idx 0 is lat and 1 is long
    return coord


class PaymentParticipantAttributes:
    def __init__(self, participant_prefix: str):
        super().__init__()
        self.participant_prefix = participant_prefix
        self.username = PaymentAttribute(f"{self.participant_prefix}_username")
        self.first_name = PaymentAttribute(f"{self.participant_prefix}_first_name")
        self.last_name = PaymentAttribute(f"{self.participant_prefix}_last_name")
        self.email_address = PaymentAttribute(
            f"{self.participant_prefix}_email_address"
        )
        self.phone_number = PaymentAttribute(f"{self.participant_prefix}_phone_number")
        self.birth_year = PaymentAttribute(f"{self.participant_prefix}_birth_year")
        self.birth_month = PaymentAttribute(f"{self.participant_prefix}_birth_month")
        self.birth_day = PaymentAttribute(f"{self.participant_prefix}_birth_day")
        self.gender = PaymentAttribute(f"{self.participant_prefix}_gender")
        self.addr_building = PaymentAttribute(
            f"{self.participant_prefix}_addr_building"
        )
        self.addr_street = PaymentAttribute(f"{self.participant_prefix}_addr_street")
        self.addr_city = PaymentAttribute(f"{self.participant_prefix}_addr_city")
        self.addr_state = PaymentAttribute(f"{self.participant_prefix}_addr_state")
        self.addr_zipcode = PaymentAttribute(f"{self.participant_prefix}_addr_zipcode")
        self.addr_country = PaymentAttribute(f"{self.participant_prefix}_addr_country")
        self.geo_latitude = PaymentAttribute(f"{self.participant_prefix}_geo_latitude")
        self.geo_longitude = PaymentAttribute(
            f"{self.participant_prefix}_geo_longitude"
        )
        self.account_number = PaymentAttribute(
            f"{self.participant_prefix}_account_number"
        )
        self.bic_code = PaymentAttribute(f"{self.participant_prefix}_bic_code")
        self.account_create_timestamp = PaymentAttribute(
            f"{self.participant_prefix}_account_create_timestamp"
        )
        self.currency = PaymentAttribute(f"{self.participant_prefix}_currency")
        self.ip_address = PaymentAttribute(f"{self.participant_prefix}_ip_address")
        self.tower_latitude = PaymentAttribute(
            f"{self.participant_prefix}_tower_latitude"
        )
        self.tower_longitude = PaymentAttribute(
            f"{self.participant_prefix}_tower_longitude"
        )
        self.comment = PaymentAttribute(f"{self.participant_prefix}_comment")

    def _set_base_data(self):
        self.username.value = fake.user_name()
        self.first_name.value = fake.first_name()
        self.last_name.value = fake.last_name()
        self.email_address.value = fake.email()
        self.phone_number.value = fake.phone_number()
        self.gender.value = str(np.random.choice(["M", "F", "NB"]))

    def _set_dates(self):
        dob = fake.date_of_birth(minimum_age=21, maximum_age=80)
        self.birth_year.value = dob.year
        self.birth_month.value = dob.month
        self.birth_day.value = dob.day

    def _set_account_details(self):
        self.account_number.value = fake.iban()
        self.bic_code.value = fake.aba()
        # "normal" accounts are at least 3 months old (relative to today's date)
        # NOTE: If you change this - please review PaymentCoreFinancialAttributes.payment_init_timestamp and payment_update_timestamp
        self.account_create_timestamp.value = fake.date_time_between(
            start_date=datetime(1985, 1, 1, 0, 0, 0),
            end_date=(datetime.today() - timedelta(weeks=12)),
        ).timestamp()

    def _set_address_details(self):
        self.addr_building.value = fake.building_number()
        self.addr_street.value = f"{fake.street_name()} {fake.street_suffix()}"
        self.addr_city.value = fake.city()
        self.addr_state.value = fake.state()
        self.addr_zipcode.value = fake.postcode()
        self.addr_country.value = str(np.random.choice(COUNTRIES))
        geo_coord = get_coordinate(self.addr_country.value)
        self.geo_latitude.value = np.float64(geo_coord[0])
        self.geo_longitude.value = np.float64(geo_coord[1])

    def _set_operational_details(self):
        self.currency.value = COUNTRY_STATIC_DATA[
            COUNTRY_STATIC_DATA["country"] == self.addr_country.value
            ]["currency"].values[0]
        self.ip_address.value = fake.ipv4(network=False)
        # tower location is typically within +-1 deg N-S and +-1 deg E-W for "normal" transactions
        # 1 degree lat shift ~ 69 miles and 1 degree long shift ~ 54.6 miles
        self.tower_latitude.value = self.geo_latitude.value + tower_latitude_perturbation_fn()
        self.tower_longitude.value = self.geo_longitude.value + tower_longitude_perturbation_fn()
        self.comment.value = fake.text(max_nb_chars=60)

    def set_mock_data(self):
        self._set_base_data()
        self._set_address_details()
        self._set_dates()
        self._set_account_details()
        self._set_operational_details()

    def get_mock_data_row(self) -> dict:
        self.set_mock_data()
        mock_data_row: dict = {
            self.__dict__[item].name: self.__dict__[item].value
            for item in self.__dict__
            if isinstance(self.__dict__[item], PaymentAttribute)
        }
        return mock_data_row

### Payment Row Core Financial Attributes

In [10]:
class PaymentCoreFinancialAttributes:
    def __init__(self):
        super().__init__()
        self.payment_id = PaymentAttribute("payment_id")
        self.payment_init_timestamp = PaymentAttribute("payment_init_timestamp")
        self.payment_last_update_timestamp = PaymentAttribute(
            "payment_last_update_timestamp"
        )
        self.payment_status = PaymentAttribute("payment_status")
        self.payment_type = PaymentAttribute("payment_type")

    def set_mock_data(
            self, dbtr: PaymentParticipantAttributes, crdtr: PaymentParticipantAttributes
    ):
        self.payment_id.value = str(uuid.uuid4())

        # compare dbtr acc create and crdtr acc create timestamps, and determine whose account was created later.
        latest_ts = datetime.fromtimestamp(
            dbtr.account_create_timestamp.value
            if dbtr.account_create_timestamp.value
               >= crdtr.account_create_timestamp.value
            else crdtr.account_create_timestamp.value
        )
        # the idea here is that "normal" transactions will only happen at least 15 days AFTER the users accounts were created.
        # since the account age is guaranteed to be at least 3 months old relative to TODAY (check PaymentParticipantAttributes)
        # the date generated will never be past today
        self.payment_init_timestamp.value = fake.date_time_between(
            start_date=(latest_ts + timedelta(days=15)), end_date=datetime.now()
        ).timestamp()
        self.payment_last_update_timestamp.value = fake.date_time_between(
            start_date=datetime.fromtimestamp(
                self.payment_last_update_timestamp.value
                if self.payment_last_update_timestamp.value
                else self.payment_init_timestamp.value
            ),
            end_date=datetime.now(),
        ).timestamp()
        self.payment_status.value = str(np.random.choice(PAYMENT_STATUS))
        self.payment_type.value = str(np.random.choice(PAYMENT_TRANSACTION_TYPES))

    def get_mock_data_row(
            self, dbtr: PaymentParticipantAttributes, crdtr: PaymentParticipantAttributes
    ) -> dict:
        self.set_mock_data(dbtr, crdtr)
        mock_data_row: dict = {
            self.__dict__[item].name: self.__dict__[item].value
            for item in self.__dict__
            if isinstance(self.__dict__[item], PaymentAttribute)
        }
        return mock_data_row

### Payment Row Derived Attributes

In [11]:
class PaymentDerivedAttributes:
    def __init__(self):
        self.dbtr_ccy_crdtr_ccy_rate = PaymentAttribute(
            f"{PAYMENT_DBTR_PREFIX}_ccy_{PAYMENT_CRDTR_PREFIX}_ccy_rate"
        )
        self.crdtr_ccy_dbtr_ccy_rate = PaymentAttribute(
            f"{PAYMENT_CRDTR_PREFIX}_ccy_{PAYMENT_DBTR_PREFIX}_ccy_rate"
        )
        self.dbtr_amount = PaymentAttribute(f"{PAYMENT_DBTR_PREFIX}_amount")
        self.crdtr_amount = PaymentAttribute(f"{PAYMENT_CRDTR_PREFIX}_amount")

    def set_mock_data(
            self, dbtr: PaymentParticipantAttributes, crdtr: PaymentParticipantAttributes
    ):
        self.dbtr_ccy_crdtr_ccy_rate.value = get_exchange_rates(
            dbtr.currency.value, crdtr.currency.value
        )
        self.crdtr_ccy_dbtr_ccy_rate.value = get_exchange_rates(
            crdtr.currency.value, dbtr.currency.value
        )
        self.dbtr_amount.value = debitor_amount_generator_fn()
        self.crdtr_amount.value = float(np.round(
            self.dbtr_amount.value * self.dbtr_ccy_crdtr_ccy_rate.value, 2
        ))

    def get_mock_data_row(
            self, dbtr: PaymentParticipantAttributes, crdtr: PaymentParticipantAttributes
    ) -> dict:
        self.set_mock_data(dbtr, crdtr)
        mock_data_row: dict = {
            self.__dict__[item].name: self.__dict__[item].value
            for item in self.__dict__
            if isinstance(self.__dict__[item], PaymentAttribute)
        }
        return mock_data_row

### Payment Row Generator

In [12]:
class PaymentRowGenerator:
    dbtr_attributes = PaymentParticipantAttributes(PAYMENT_DBTR_PREFIX)
    crdtr_attributes = PaymentParticipantAttributes(PAYMENT_CRDTR_PREFIX)
    core_fin_attributes = PaymentCoreFinancialAttributes()
    derived_fin_attributes = PaymentDerivedAttributes()

    @classmethod
    def generate_row(cls):
        payment_row: dict = {"FRAUD_FLAG": 0}
        payment_row.update(cls.dbtr_attributes.get_mock_data_row())
        payment_row.update(cls.crdtr_attributes.get_mock_data_row())
        payment_row.update(
            cls.core_fin_attributes.get_mock_data_row(
                dbtr=cls.dbtr_attributes, crdtr=cls.crdtr_attributes
            )
        )
        payment_row.update(
            cls.derived_fin_attributes.get_mock_data_row(
                dbtr=cls.dbtr_attributes, crdtr=cls.crdtr_attributes
            )
        )
        return payment_row

## Anomaly Generation

### Define rules for inserting a payment anomalies

The way we approach generating anomalous data is by assuming that the base data we generate are "good" transactions - which are free of anomalies.
We then define perturbative functions to nudge a non-anomalous row to an anomalous one.

1. These simple rules are self-contained. They can be applied to a data set to generate very simple and and easily separable classes.
2. These rules can be combined to create more complex scenarios where anomaly elements are spread across multiple dimensions.  

### Anomaly Type 1 - Geo Location and Tower Location are too far apart

We define a simple rule which will mutate the tower location to be further away from the supposed physical location of a creditor and/or debitor.
This is inspired from payment app features where transactions initiated further away from city region where the customer normally operates usually are flagged and considered fraudulent.

In [13]:
def generate_faraway_coord(coord, max_coord, min_coord):
    # print(coord, max_coord, min_coord)
    # first find the leeway we have towards either directions
    curr_north_or_east_delta = max_coord - coord + get_north_or_east_perturbation_factor()
    curr_south_or_west_delta = abs(min_coord - coord) + get_south_or_west_perturbation_factor()
    # print("Deltas: ", curr_north_or_east_delta, curr_south_or_west_delta)

    # based on the range above, we can generate a coordinate range to move
    north_or_east_coord_range = (
        (max_coord - curr_north_or_east_delta, max_coord)
        if curr_north_or_east_delta > 0
        else (0, 0)
    )
    south_or_west_coord_range = (
        (min_coord, curr_south_or_west_delta - abs(min_coord))
        if curr_south_or_west_delta > 0
        else (0, 0)
    )
    # print("Coord ranges: ", north_or_east_coord_range, south_or_west_coord_range)

    new_north_or_east_coord = float(np.random.uniform(*north_or_east_coord_range))
    new_south_or_west_coord = float(np.random.uniform(*south_or_west_coord_range))
    # print("New Coordinates: ", new_north_or_east_coord, new_south_or_west_coord)

    if curr_north_or_east_delta > 0 and curr_south_or_west_delta > 0:
        return float(np.random.choice((new_north_or_east_coord, new_south_or_west_coord)))
    elif curr_north_or_east_delta > 0:
        return new_north_or_east_coord
    elif curr_south_or_west_delta > 0:
        return new_south_or_west_coord


def type_1_tower_loc_phy_loc_mismatch(row):
    row["DEBITOR_TOWER_LATITUDE"] = generate_faraway_coord(
        row["DEBITOR_TOWER_LATITUDE"], 90, -90
    )
    row["DEBITOR_TOWER_LONGITUDE"] = generate_faraway_coord(
        row["DEBITOR_TOWER_LONGITUDE"], 180, -180
    )
    row["CREDITOR_TOWER_LATITUDE"] = generate_faraway_coord(
        row["CREDITOR_TOWER_LATITUDE"], 90, -90
    )
    row["CREDITOR_TOWER_LONGITUDE"] = generate_faraway_coord(
        row["CREDITOR_TOWER_LONGITUDE"], 180, -180
    )
    return row

### Anomaly Type 2 - Account age is too low and the amount is above a certain threshold

We define a simple rule which will mutate the account creation timestamp for the debitor to within a few minutes of the payment.
This is inspired from features where Debitors with relatively "young" accounts have restrictions in place to ensure that they cannot transfer large sums outright. 

In [14]:
def type_2_account_too_young_amount_above_threshold(row):
    payment_create_dt = datetime.fromtimestamp(row["PAYMENT_INIT_TIMESTAMP"])
    t = payment_create_dt - timedelta(
        hours=int(np.random.choice([0, 1, 2, 3, 4, 5])),
        minutes=int(np.random.choice([0, 1, 4, 9, 16, 25])),
        seconds=int(np.random.choice([1, 10, 20, 30, 40, 50])),
    )

    row["DEBITOR_ACCOUNT_CREATE_TIMESTAMP"] = t.timestamp()
    row["DEBITOR_AMOUNT"] = anomalous_debitor_amount_generator_fn()
    row["CREDITOR_AMOUNT"] = float(np.round(row["DEBITOR_AMOUNT"] * row["DEBITOR_CCY_CREDITOR_CCY_RATE"], 2))
    return row

### Define anomaly mapping

To wrap up the fraud generation mechanism, we define a simple map of these fraud generating functions, and map them to be used later.
We also define a function to insert arbitrary fraudulent rows based on certain input parameters

In [15]:
FRAUD_TYPE_DEFINITION = {
    "type1": type_1_tower_loc_phy_loc_mismatch,
    "type2": type_2_account_too_young_amount_above_threshold,
}

In [16]:
def get_row_indices_with_controlled_overlap(
        dataset_df: pd.DataFrame,
        fraudulent_frac: float = 0.3,
        random_state=42,
        fraud_overlap_frac: float = 0.0,
):
    existing_fraud_rows = dataset_df[dataset_df["FRAUD_FLAG"] == 1]
    existing_non_fraud_rows = dataset_df[dataset_df["FRAUD_FLAG"] == 0]
    # if we do not want to overlap fraud rules, then we sample only non-fraud rows
    if fraud_overlap_frac <= 0:
        return existing_non_fraud_rows.sample(
            frac=fraudulent_frac, random_state=random_state
        ).index

    # if we want to overlap fraud rules, then we sample based on fraud_overlap_frac.
    # we subsample the fraud rows by fraud_overlap_frac and select the rest from non-fraud rows
    total_fraud_row_count = int(np.ceil(dataset_df.shape[0] * fraudulent_frac))
    fraud_subsample_row_count = int(np.ceil(total_fraud_row_count * fraud_overlap_frac))
    fraud_subsample_row_count = (
        fraud_subsample_row_count
        if existing_fraud_rows.shape[0] > fraud_subsample_row_count
        else existing_fraud_rows.shape[0]
    )
    non_fraud_subsample_row_count = total_fraud_row_count - fraud_subsample_row_count
    non_fraud_subsample_row_count = (
        non_fraud_subsample_row_count
        if existing_non_fraud_rows.shape[0] > non_fraud_subsample_row_count
        else existing_non_fraud_rows.shape[0]
    )
    fraud_subsample = existing_fraud_rows.sample(
        n=fraud_subsample_row_count, random_state=random_state
    )
    non_fraud_subsample = existing_non_fraud_rows.sample(
        n=non_fraud_subsample_row_count, random_state=random_state
    )
    return pd.concat([fraud_subsample, non_fraud_subsample]).index


def insert_fraud_rows(
        fraud_insertion_func,
        dataset_df: pd.DataFrame,
        fraudulent_frac: float = 0.3,
        random_state=42,
        fraud_overlap_frac: float = 0.1,
):
    fraudulent_txn_indexes = get_row_indices_with_controlled_overlap(
        dataset_df,
        fraudulent_frac,
        random_state=random_state,
        fraud_overlap_frac=fraud_overlap_frac,
    )
    # print(fraudulent_txn_indexes)
    if fraudulent_txn_indexes.empty:
        print(
            "COULD NOT SAMPLE INDEXES BASED ON THE FRAUD FRACTION AND OVERLAP FRACTION!"
        )
        return dataset_df

    # fraudulent_txn_indexes = dataset_df.sample(frac=fraudulent_frac, random_state=random_state).index
    dataset_df.loc[fraudulent_txn_indexes, :] = dataset_df.loc[
                                                fraudulent_txn_indexes, :
                                                ].apply(fraud_insertion_func, axis=1)
    dataset_df.loc[fraudulent_txn_indexes, ["FRAUD_FLAG"]] = 1
    return dataset_df

## Test Data Generation

### Generate Sample Payment Data
To see all of what we have defined in action, let us generate some sample rows.
We then display this in a simple tabular format.

In [17]:
def generate_mock_payment_data(num_payment_rows=1):
    return [
        PaymentRowGenerator.generate_row()
        for _ in range(num_payment_rows if num_payment_rows > 0 else 1)
    ]


# payments dataset
dataset = pd.DataFrame(generate_mock_payment_data(num_payment_rows=10))
dataset.columns

Index(['FRAUD_FLAG', 'DEBITOR_USERNAME', 'DEBITOR_FIRST_NAME',
       'DEBITOR_LAST_NAME', 'DEBITOR_EMAIL_ADDRESS', 'DEBITOR_PHONE_NUMBER',
       'DEBITOR_BIRTH_YEAR', 'DEBITOR_BIRTH_MONTH', 'DEBITOR_BIRTH_DAY',
       'DEBITOR_GENDER', 'DEBITOR_ADDR_BUILDING', 'DEBITOR_ADDR_STREET',
       'DEBITOR_ADDR_CITY', 'DEBITOR_ADDR_STATE', 'DEBITOR_ADDR_ZIPCODE',
       'DEBITOR_ADDR_COUNTRY', 'DEBITOR_GEO_LATITUDE', 'DEBITOR_GEO_LONGITUDE',
       'DEBITOR_ACCOUNT_NUMBER', 'DEBITOR_BIC_CODE',
       'DEBITOR_ACCOUNT_CREATE_TIMESTAMP', 'DEBITOR_CURRENCY',
       'DEBITOR_IP_ADDRESS', 'DEBITOR_TOWER_LATITUDE',
       'DEBITOR_TOWER_LONGITUDE', 'DEBITOR_COMMENT', 'CREDITOR_USERNAME',
       'CREDITOR_FIRST_NAME', 'CREDITOR_LAST_NAME', 'CREDITOR_EMAIL_ADDRESS',
       'CREDITOR_PHONE_NUMBER', 'CREDITOR_BIRTH_YEAR', 'CREDITOR_BIRTH_MONTH',
       'CREDITOR_BIRTH_DAY', 'CREDITOR_GENDER', 'CREDITOR_ADDR_BUILDING',
       'CREDITOR_ADDR_STREET', 'CREDITOR_ADDR_CITY', 'CREDITOR_ADDR_STATE',
       '

### Conditioning generated sample data

Before we induce anomalies in this sample dataset, let us condition it a bit we randomize the row order for better predictive value.

In [18]:
# shuffle the row order randomly
print("Shuffling row order...")
dataset = dataset.sample(frac=1, replace=False, random_state=42)

# post shuffling, previous indexes are preserved. let us trash them and generate new ones.
print("Resetting indexes...")
dataset.reset_index(drop=True, inplace=True)
dataset

Shuffling row order...
Resetting indexes...


Unnamed: 0,FRAUD_FLAG,DEBITOR_USERNAME,DEBITOR_FIRST_NAME,DEBITOR_LAST_NAME,DEBITOR_EMAIL_ADDRESS,DEBITOR_PHONE_NUMBER,DEBITOR_BIRTH_YEAR,DEBITOR_BIRTH_MONTH,DEBITOR_BIRTH_DAY,DEBITOR_GENDER,...,CREDITOR_COMMENT,PAYMENT_ID,PAYMENT_INIT_TIMESTAMP,PAYMENT_LAST_UPDATE_TIMESTAMP,PAYMENT_STATUS,PAYMENT_TYPE,DEBITOR_CCY_CREDITOR_CCY_RATE,CREDITOR_CCY_DEBITOR_CCY_RATE,DEBITOR_AMOUNT,CREDITOR_AMOUNT
0,0,anthonyday,Natalie,Tevetoğlu,zerickson@example.com,001-826-727-9061,2002,3,10,F,...,Computer doctor up high southern job high.,1979176c-f87f-49c9-b0b7-8ff28e2241d2,1691787000.0,1749631000.0,FAILED,PAYMENT,0.2263,4.4185,18765728.51,4246684.0
1,0,kaversezer,Stephen,Rogers,dennis75@example.net,(992)466 1093,1964,1,1,F,...,Blanditiis id dignissimos aliquam veniam.,d805a55c-7b03-41a3-bd4d-ba977d182df0,1748336000.0,1730731000.0,PENDING,PAYMENT,0.5864,1.7052,13180665.4,7729142.0
2,0,temizkalcetin,Öge,Clark,uzbaybilge@example.com,952-240-5980x250,1976,6,19,NB,...,Provident inventore consequuntur ab maiores.,41523862-e624-4406-8bb2-e3016e9426e1,1140835000.0,1749295000.0,COMPLETED,PAYMENT,3.9072,0.2559,14561646.7,56895270.0
3,0,qturk,Ünübol,Güçlü,fayizeyaman@example.com,001-719-848-9241x15781,1958,10,5,F,...,Garden economy others kind.,daf401da-852a-41f7-a4d3-c88ff3cfcc18,1612870000.0,1727436000.0,FAILED,PAYMENT,77.546,0.0129,14508788.97,1125099000.0
4,0,yilmazcagdan,Gücal,Martin,sherryvilla@example.com,960.733.4523x5326,1991,9,15,NB,...,Popular word book read pass heart soldier action.,7185f035-6632-4314-9509-57f9b1903263,1703913000.0,1749616000.0,FAILED,PAYMENT,0.1118,8.9484,10566916.26,1181381.0
5,0,hakgunduz,Turcein,Karadeniz,qsener@example.org,4558508249,1985,2,26,M,...,Ut doloremque consequuntur.,f3bec399-2cb8-4127-84bd-e94608a0d332,908103800.0,1734211000.0,PROCESSING,PAYMENT,1118.9434,0.0009,11801660.79,13205390000.0
6,0,sierra14,Heather,Gray,salizorlu@example.net,+1-717-418-7727,1972,12,12,NB,...,Choice phone nor. Western month itself history.,148ee983-8929-4615-823a-9aa94fc12b79,1561908000.0,1749632000.0,PENDING,PAYMENT,80.2934,0.0125,13195452.69,1059508000.0
7,0,cynthiapitts,Bariş,Hall,demirelfugen@example.com,708 8 697,1964,10,24,NB,...,Nulla ad nisi laudantium.,30dcbdcb-4eee-49b3-a9fe-819fb3976939,1608652000.0,1748563000.0,PROCESSING,PAYMENT,0.1118,8.9484,11832526.22,1322876.0
8,0,fergusonkyle,Filit,Brock,necmettininonu@example.org,(602)906 1126,2001,8,1,F,...,Corporis ut corporis et distinctio deleniti.,e554c6c3-d608-4194-8ffa-6f7cfa38aa8b,1547128000.0,1748186000.0,FAILED,PAYMENT,0.476,2.1009,12256402.94,5834048.0
9,0,uyildirim,Ashley,Hill,durduaygonenc@example.org,+90(132)716-6446x5951,1999,9,12,NB,...,Similique modi illo quos.,d7b2e2e4-816f-4221-aacd-d81a6641c191,1346524000.0,1749348000.0,FAILED,PAYMENT,79.5371,0.0126,15868036.71,1262098000.0


### Explore the anomalous features

In [19]:
dataset[
    [
        "FRAUD_FLAG",
        "DEBITOR_TOWER_LATITUDE",
        "DEBITOR_TOWER_LONGITUDE",
        "CREDITOR_TOWER_LATITUDE",
        "CREDITOR_TOWER_LONGITUDE",
        "DEBITOR_ACCOUNT_CREATE_TIMESTAMP",
        "DEBITOR_AMOUNT",
    ]
]

Unnamed: 0,FRAUD_FLAG,DEBITOR_TOWER_LATITUDE,DEBITOR_TOWER_LONGITUDE,CREDITOR_TOWER_LATITUDE,CREDITOR_TOWER_LONGITUDE,DEBITOR_ACCOUNT_CREATE_TIMESTAMP,DEBITOR_AMOUNT
0,0,-19.666333,-44.283454,52.379328,7.846328,780088000.0,18765728.51
1,0,-44.041372,171.673782,51.710423,5.903726,1633097000.0,13180665.4
2,0,44.719465,-73.591908,38.21967,28.675109,793772600.0,14561646.7
3,0,-28.00375,31.765801,34.996994,128.614218,706714200.0,14508788.97
4,0,22.812854,113.107726,48.126139,2.535672,1161613000.0,10566916.26
5,0,45.174548,-123.779113,37.015912,129.358294,552391700.0,11801660.79
6,0,43.909447,-79.664648,37.828868,140.802051,533197700.0,13195452.69
7,0,23.299002,114.533494,51.204337,5.946537,909844500.0,11832526.22
8,0,-27.808292,29.079701,40.386313,121.58955,1258558000.0,12256402.94
9,0,52.584392,8.617212,45.871595,38.7581,680120000.0,15868036.71


### Test with sample dataset

Let us to a sample fraud data generation exercise where we apply these transformations to the same data set

In [20]:
dataset = insert_fraud_rows(
    FRAUD_TYPE_DEFINITION["type1"],
    dataset, fraudulent_frac=0.2,
    random_state=int(np.random.randint(1, 50)),
)
dataset[["FRAUD_FLAG", "DEBITOR_TOWER_LATITUDE", "DEBITOR_TOWER_LONGITUDE", "CREDITOR_TOWER_LATITUDE",
         "CREDITOR_TOWER_LONGITUDE"]]

Unnamed: 0,FRAUD_FLAG,DEBITOR_TOWER_LATITUDE,DEBITOR_TOWER_LONGITUDE,CREDITOR_TOWER_LATITUDE,CREDITOR_TOWER_LONGITUDE
0,0,-19.666333,-44.283454,52.379328,7.846328
1,0,-44.041372,171.673782,51.710423,5.903726
2,1,26.136469,-125.451734,12.767367,-118.149713
3,0,-28.00375,31.765801,34.996994,128.614218
4,0,22.812854,113.107726,48.126139,2.535672
5,0,45.174548,-123.779113,37.015912,129.358294
6,0,43.909447,-79.664648,37.828868,140.802051
7,0,23.299002,114.533494,51.204337,5.946537
8,0,-27.808292,29.079701,40.386313,121.58955
9,1,-22.444291,149.739651,88.369248,9.468436


In [21]:
dataset = insert_fraud_rows(
    FRAUD_TYPE_DEFINITION["type2"], dataset, fraudulent_frac=0.2, fraud_overlap_frac=0.3
)
dataset[["FRAUD_FLAG", "DEBITOR_ACCOUNT_CREATE_TIMESTAMP", "DEBITOR_AMOUNT"]]

Unnamed: 0,FRAUD_FLAG,DEBITOR_ACCOUNT_CREATE_TIMESTAMP,DEBITOR_AMOUNT
0,0,780088000.0,18765728.51
1,1,1748317000.0,24392322.74
2,1,793772600.0,14561646.7
3,0,706714200.0,14508788.97
4,0,1161613000.0,10566916.26
5,0,552391700.0,11801660.79
6,0,533197700.0,13195452.69
7,0,909844500.0,11832526.22
8,0,1258558000.0,12256402.94
9,1,1346523000.0,22135176.93


As one can see that specific data transformations to convert a payment transaction row to a fraudulent one are successful.

## Putting Everything Together

We can now combine all the elements developed to generate fraud datasets for mock participating institutions

In [22]:
bank_fraud_rules = {
    "bank2": [
        {
            "fraud_insertion_rule_stack": [],
            "num_datasets": 2,
            # "min_num_rows": 2500,
            "max_num_rows": 2500,
        },
        # Genesis dataset with Type 1 frauds for bank 1
        {
            "fraud_insertion_rule_stack": ["type1"],
            "num_datasets": 2,
            # "min_num_rows": 25000,
            "max_num_rows": 25000,
            "apply_probability": 0.9,
            "fname_label": "gen_train",
        },
        {
            "fraud_insertion_rule_stack": ["type1"],
            "num_datasets": 1,
            "apply_probability": 0.9,
            "fname_label": "scaling",
            "min_num_rows": 2500,
            "max_num_rows": 2500,
        },
        {
            "fraud_insertion_rule_stack": ["type2"],
            "num_datasets": 1,
            "apply_probability": 0.9,
            "fname_label": "scaling",
            "min_num_rows": 2500,
            "max_num_rows": 2500,
        },
        # for eval set for bank1
        {
            "fraud_insertion_rule_stack": ["type1"],
            "num_datasets": 5,
            "apply_probability": 0.9,
            "fname_label": "eval",
            "min_num_rows": 2500,
            "max_num_rows": 2500,
        },
        {
            "fraud_insertion_rule_stack": ["type2"],
            "num_datasets": 5,
            "apply_probability": 0.9,
            "fname_label": "eval",
            "min_num_rows": 2500,
            "max_num_rows": 2500,
        }
        # {
        #     "fraud_insertion_rule_stack": ["type1", "type2"],
        #     "num_datasets": 1,
        #     "apply_probability": 0.9,
        #     "fname_label": "eval",
        #     "min_num_rows": 2500,
        #     "max_num_rows": 2500,
        # },
        # {
        #     "fraud_insertion_rule_stack": ["type1", "type2"],
        #     "num_datasets": 1,
        #     "apply_probability": 0.9,
        #     "fraud_overlap_frac": 0.33,
        #     "fname_label": "eval",
        #     "min_num_rows": 2500,
        #     "max_num_rows": 2500,
        # },
    ]
}

### Define a map of banks to the rule set for data generation

This can then be looped over in a repeatable pattern

In [23]:
def apply_fraud_with_probability(
        dataset_df: pd.DataFrame, fraud_apply_probability: float = 0.9
) -> pd.DataFrame:
    # we want to confuse the model just a bit by giving some rows fraud like values, but they might not actually be fraudulent
    # if we want to apply fraud with a 100% probability, then just return
    if fraud_apply_probability >= 1:
        return dataset_df

    # we want to take 1 - fraud_apply_probability fraction of rows and then sift through them
    # select rows where we induced fraud
    fraud_rows_idx = (
        dataset_df[dataset_df["FRAUD_FLAG"] == 1]
        .sample(frac=(1 - fraud_apply_probability), random_state=38)
        .index
    )
    # then, reset the fraud flag value for some rows
    dataset_df.loc[fraud_rows_idx, "FRAUD_FLAG"] = dataset_df.loc[
        fraud_rows_idx, "FRAUD_FLAG"
    ].apply(lambda _: 1 - _)
    return dataset_df

### Define quick logging utilities

In [24]:
log_file_path = os.path.join(
    os.path.abspath(os.path.expanduser(os.path.expandvars("./"))),
    "data_generation_stats.log",
)
log_file_path

'/home/jupyter-sarthakt/data_generation_stats.log'

In [25]:
logging_dict_config = {
    "version": 1,
    # "disable_existing_loggers": True, # enable if third-party logging is annoying
    "formatters": {
        "defaultFormatter": {
            "format": "[%(asctime)s] [%(levelname)s] [%(threadName)s] - %(message)s"
        }
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "level": "INFO",
            "formatter": "defaultFormatter",
            "stream": "ext://sys.stdout",
        },
        "file": {
            "class": "logging.FileHandler",
            "level": "INFO",
            "formatter": "defaultFormatter",
            "filename": log_file_path,
        },
    },
    "loggers": {"": {"level": "INFO", "handlers": ["console", "file"]}},  # root logger
}
logging.config.dictConfig(logging_dict_config)

### Generate all bank data

In [26]:
## make dataset directory
dir_path = os.path.abspath(os.path.expanduser(os.path.expandvars("./payment_anom_datasets_experiment_1")))
os.makedirs(dir_path, exist_ok=True)

In [27]:
logging.info("------------------------------------------------------------")
logging.info(f"Starting data generation...")
for bank_name in bank_fraud_rules:
    logging.info(
        f"Generating {len(bank_fraud_rules[bank_name])} dataset group(s) for {bank_name}..."
    )

    for idx, dataset_gen_config in enumerate(bank_fraud_rules[bank_name]):
        fraud_insertion_rule_stack = dataset_gen_config["fraud_insertion_rule_stack"]
        num_datasets = dataset_gen_config["num_datasets"]
        fraud_overlap_fraction = (
            dataset_gen_config["fraud_overlap_frac"]
            if "fraud_overlap_frac" in dataset_gen_config
            else -1
        )
        min_num_rows = (
            dataset_gen_config["min_num_rows"]
            if "min_num_rows" in dataset_gen_config
               and dataset_gen_config["min_num_rows"]
            else 1000
        )
        max_num_rows = (
            dataset_gen_config["max_num_rows"]
            if "max_num_rows" in dataset_gen_config
               and dataset_gen_config["max_num_rows"]
            else 3000
        )
        fname_label = (
            dataset_gen_config["fname_label"]
            if "fname_label" in dataset_gen_config and dataset_gen_config["fname_label"]
            else ""
        )
        apply_probability = (
            dataset_gen_config["apply_probability"]
            if "apply_probability" in dataset_gen_config
               and dataset_gen_config["apply_probability"]
            else 1
        )

        logging.info(
            f"Generating {num_datasets} datasets for {bank_name} in group {idx}..."
        )

        for i in range(1, num_datasets + 1):
            num_rows = max_num_rows  #int(np.random.randint(min_num_rows, max_num_rows))
            logging.info(
                f"Generating dataset '{i}' with '{num_rows}' payment rows for '{bank_name}'..."
            )
            bank_dataset: pd.DataFrame = pd.DataFrame(
                generate_mock_payment_data(num_payment_rows=num_rows)
            )
            bank_dataset = bank_dataset.sample(frac=1, replace=False, random_state=42)
            bank_dataset.reset_index(drop=True, inplace=True)

            for fraud_type in fraud_insertion_rule_stack:
                # if we apply more than 1 fraud rules, we can't sample >30% for every rule given how we generate fraud rows
                # in our case specifically, it ends up generating > 50% fraud rows which is not great.
                # so we cap the amount of fraud if we apply more than 1 rule
                fraud_row_frac = (
                    float(np.random.uniform(0.3, 0.4))
                    if len(fraud_insertion_rule_stack) == 1
                    else float(np.random.uniform(0.15, 0.25))
                )
                logging.info("Starting fraudulent row insertion... ")
                logging.info(
                    f"{round(fraud_row_frac * 100)}%' fraudulent rows will be induced by fraud rule '{fraud_type}'"
                )
                if fraud_overlap_fraction > 0:
                    logging.info(
                        f"Approximately {round(fraud_overlap_fraction * 100)}% previously generated fraud rows will re-sampled"
                    )
                else:
                    logging.info(
                        "Will only select previously non-fraudulent rows to sample from..."
                    )

                bank_dataset = insert_fraud_rows(
                    FRAUD_TYPE_DEFINITION[fraud_type],
                    bank_dataset,
                    fraudulent_frac=fraud_row_frac,
                    random_state=int(np.random.randint(20, 50)),
                    fraud_overlap_frac=fraud_overlap_fraction,
                )

            logging.info(
                f"Fraud rows: {bank_dataset[bank_dataset["FRAUD_FLAG"] == 1].shape[0]}"
            )
            bank_dataset = apply_fraud_with_probability(bank_dataset, apply_probability)
            logging.info(
                f"Fraud rows adjusted for application probability: {bank_dataset[bank_dataset["FRAUD_FLAG"] == 1].shape[0]}"
            )
            fname_part_fraud_type = (
                "_".join(fraud_insertion_rule_stack)
                if fraud_insertion_rule_stack
                else "no_fraud"
            )
            fname_part_label = f"{fname_label}_{i}" if fname_label else i
            fname_part_overlap = (
                f"pct_overlap_{round(fraud_overlap_fraction * 100)}"
                if fraud_overlap_fraction > 0
                else "no_overlap"
            )
            fname_part_apply_probability = f"app_frac_{apply_probability}"

            dataset_file_name = os.path.join(
                dir_path,
                f"{bank_name}_[{fname_part_fraud_type}]_[{fname_part_apply_probability}]_[{fname_part_overlap}]_[{fname_part_label}].csv",
            )
            bank_dataset.to_csv(dataset_file_name, index=False)

            logging.info(f"Saved data to {dataset_file_name}")
            logging.info("-----------------------------------------------------------")

logging.info(f"Finished dataset generation!")
logging.info("-----------------------------------------------------------")
logging.info("ALL DATASETS PLACED IN THE DIRECTORY: %s", dir_path)

[2025-06-11 10:49:27,192] [INFO] [MainThread] - ------------------------------------------------------------
[2025-06-11 10:49:27,193] [INFO] [MainThread] - Starting data generation...
[2025-06-11 10:49:27,194] [INFO] [MainThread] - Generating 6 dataset group(s) for bank2...
[2025-06-11 10:49:27,195] [INFO] [MainThread] - Generating 2 datasets for bank2 in group 0...
[2025-06-11 10:49:27,195] [INFO] [MainThread] - Generating dataset '1' with '2500' payment rows for 'bank2'...
[2025-06-11 10:49:36,151] [INFO] [MainThread] - Fraud rows: 0
[2025-06-11 10:49:36,152] [INFO] [MainThread] - Fraud rows adjusted for application probability: 0
[2025-06-11 10:49:36,261] [INFO] [MainThread] - Saved data to /home/jupyter-sarthakt/payment_anom_datasets_experiment_1/bank2_[no_fraud]_[app_frac_1]_[no_overlap]_[1].csv
[2025-06-11 10:49:36,262] [INFO] [MainThread] - -----------------------------------------------------------
[2025-06-11 10:49:36,262] [INFO] [MainThread] - Generating dataset '2' with '25

### Add a summary of all files generated

In [28]:
stats = []
dataset_filenames = sorted(os.listdir(dir_path))
for dataset_filename in dataset_filenames:
    filepath = os.path.join(dir_path, dataset_filename)
    if os.path.isfile(filepath):
        df = pd.read_csv(filepath)
        stats.append([
            dataset_filename,
            df.shape[1],
            df.shape[0],
            df[df["FRAUD_FLAG"] == 1].shape[0],
            round((df[df["FRAUD_FLAG"] == 1].shape[0] / df.shape[0]) * 100, 2),
        ])

stats_table = pd.DataFrame(
    columns=[
        "File Name",
        "Column Count",
        "Total Rows",
        "Fraudulent Rows",
        "% Fraudulent Rows",
    ],
    data=stats
)
pd.set_option('display.max_colwidth', 0)
# stats_table.style.set_properties(subset=['File Name'], **{'width-min': '1000px'})
stats_table

Unnamed: 0,File Name,Column Count,Total Rows,Fraudulent Rows,% Fraudulent Rows
0,bank1_[no_fraud]_[app_frac_1]_[no_overlap]_[1].csv,60,2500,0,0.0
1,bank1_[no_fraud]_[app_frac_1]_[no_overlap]_[2].csv,60,2500,0,0.0
2,bank1_[type1]_[app_frac_0.9]_[no_overlap]_[eval_1].csv,60,2500,823,32.92
3,bank1_[type1]_[app_frac_0.9]_[no_overlap]_[eval_2].csv,60,2500,777,31.08
4,bank1_[type1]_[app_frac_0.9]_[no_overlap]_[eval_3].csv,60,2500,815,32.6
5,bank1_[type1]_[app_frac_0.9]_[no_overlap]_[eval_4].csv,60,2500,676,27.04
6,bank1_[type1]_[app_frac_0.9]_[no_overlap]_[eval_5].csv,60,2500,887,35.48
7,bank1_[type1]_[app_frac_0.9]_[no_overlap]_[gen_train_1].csv,60,25000,7943,31.77
8,bank1_[type1]_[app_frac_0.9]_[no_overlap]_[gen_train_2].csv,60,25000,8463,33.85
9,bank1_[type1]_[app_frac_0.9]_[no_overlap]_[scaling_1].csv,60,2500,875,35.0


In [None]:
dataset.columns

In [None]:
import io

buffer = io.StringIO()
dataset.info(buf=buffer)
s = buffer.getvalue()
with open("df_info.txt", "w",
          encoding="utf-8") as f:  
    f.write(s)
