# Synthetic data generation with Calculated Features

The [Lending Club dataset](https://www.kaggle.com/datasets/husainsb/lendingclub-issued-loans), available on Kaggle, provides a comprehensive collection of data on issued loans. It includes numerous columns with detailed information about borrowers, which makes it a good dataset for demonstrating how to enforce business rules and expectations during synthetic data generation.

**Why calculated features might be relevant**

The financial industry is governed by various business rules and regulatory requirements, so synthetic data generated from the Lending Club dataset must comply with them. For instance, the relationships between a borrower's credit score, income, and loan repayment history should be realistically maintained in the synthetic data to ensure its utility and relevance.
Calculated features can be used to enforce or replicate relations between the columns of a dataset. This yields synthetic data of higher quality and better business compliance, which is particularly important when complex, well-defined rules and relationships need to be preserved.

In [None]:
# Importing YData's packages
from ydata.labs import DataSources
# Reading the Dataset from the DataSource
datasource = DataSources.get(uid='{dataset-uid}', namespace='{project-id}')
dataset = datasource.dataset
# Getting the calculated Metadata to get the profile overview information in the labs
metadata = datasource.metadata

In [12]:
dataset.head(10)

| | installment | int_rate | loan_amnt | revol_bal | revol_util | term | total_pymnt | total_rec_int | total_rec_late_fee | total_rec_prncp | total_rev_hi_lim |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 111.97 | 0.0749 | 3600 | 5658 | 0.149 | 36m | 0.0 | 0.0 | 0.0 | 0.0 | 37900 |
| 1 | 356.78 | 0.1499 | 15000 | 53167 | 0.753 | 60m | 0.0 | 0.0 | 0.0 | 0.0 | 70600 |
| 2 | 276.56 | 0.1139 | 8400 | 12831 | 0.303 | 36m | 0.0 | 0.0 | 0.0 | 0.0 | 42300 |
| 3 | 130.0 | 0.1049 | 4000 | 4388 | 0.332 | 36m | 0.0 | 0.0 | 0.0 | 0.0 | 13200 |
| 4 | 185.93 | 0.0724 | 6000 | 9571 | 0.413 | 36m | 164.21 | 14.48 | 0.0 | 149.73 | 23200 |
| 5 | 703.05 | 0.1599 | 20000 | 11843 | 0.26 | 36m | 0.0 | 0.0 | 0.0 | 0.0 | 45500 |
| 6 | 173.31 | 0.1499 | 5000 | 10276 | 0.901 | 36m | 0.0 | 0.0 | 0.0 | 0.0 | 11400 |
| 7 | 628.95 | 0.0824 | 20000 | 16206 | 0.6 | 36m | 0.0 | 0.0 | 0.0 | 0.0 | 27000 |
| 8 | 375.99 | 0.0799 | 12000 | 25423 | 0.521 | 36m | 0.0 | 0.0 | 0.0 | 0.0 | 48800 |
| 9 | 298.14 | 0.0532 | 9900 | 6585 | 0.345 | 36m | 0.0 | 0.0 | 0.0 | 0.0 | 19100 |


In [14]:
from ydata.metadata import Metadata

# Computing the Metadata directly from the dataset
m = Metadata(dataset)

print(m)

Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 11
Number of rows: 92618
Duplicate rows: 0
Target column:

Column detail:
                Column    Data type Variable type Characteristics
0          installment    numerical         float                
1             int_rate    numerical         float                
2            loan_amnt    numerical           int                
3            revol_bal    numerical           int                
4           revol_util    numerical         float                
5                 term  categorical        string                
6          total_pymnt    numerical         float                
7        total_rec_int    numerical         float                
8   total_rec_late_fee    numerical         float                
9      total_rec_prncp    numerical         float                
10    total_rev_hi_lim    numerical           int   

## Creating the calculated features definition

In the lending dataset, we identified at least 3 columns that can be written as a combination of others. These columns are the result of business logic embedded in the data originally provided. The business rules will be enforced as Python functions for the following columns:
- revol_util
- installment
- total_pymnt

For each of these columns, we define a Python function that encodes the business expectation.

In [16]:
def get_revolving_util(revol_bal, total_rev_hi_lim):
    """Computes revolving utilization as the revolving balance divided by the total revolving limit."""
    return (revol_bal / total_rev_hi_lim).values

def get_installment(int_rate, loan_amnt, term):
    """Computes the installment values due monthly based on an amortization loan schedule."""
    n = term.str.rstrip("m").astype("int")  # Total number of monthly periods
    period_int = int_rate / 12  # Annual interest rate converted to a monthly rate
    return (
        loan_amnt
        * (
            (period_int * (1 + period_int) ** n) /
            ((1 + period_int) ** n - 1)
        ).values
    )

def get_total_payment(total_rec_int, total_rec_late_fee, total_rec_prncp):
    """Computes total payment as the sum of all payment parcels."""
    return (total_rec_int + total_rec_late_fee + total_rec_prncp).values
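
For reference, the expression inside `get_installment` is the standard fixed-rate amortization formula. Written out, with $P$ the loan amount (`loan_amnt`), $r$ the monthly rate and $n$ the number of monthly periods taken from `term` (the symbols $P$, $r$ and $n$ are introduced here for illustration only):

$$
\text{installment} = P \cdot \frac{r\,(1 + r)^{n}}{(1 + r)^{n} - 1}, \qquad r = \frac{\text{int\_rate}}{12}
$$

As a rough check against row 0 of `dataset.head(10)` ($P = 3600$, `int_rate` $= 0.0749$, $n = 36$): $r \approx 0.0062417$ and $(1 + r)^{36} \approx 1.25107$, so the installment is approximately $3600 \times 0.031102 \approx 111.97$, which agrees with the `installment` value listed for that row.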

In [17]:
calculated_features = [
    {
        "calculated_features": "revol_util",
        "function": get_revolving_util,
        "calculated_from": ["revol_bal", "total_rev_hi_lim"],
    },
    {
        "calculated_features": "installment",
        "function": get_installment,
        "calculated_from": ["int_rate", "loan_amnt", "term"],
    },
    {
        "calculated_features": "total_pymnt",
        "function": get_total_payment,
        "calculated_from": [
            "total_rec_int",
            "total_rec_late_fee",
            "total_rec_prncp",
        ],
    },
]

### Training a synthesizer given the calculated features

In [18]:
from ydata.synthesizers import RegularSynthesizer

synth = RegularSynthesizer()
synth.fit(dataset,
          metadata=m,
          calculated_features=calculated_features)

INFO: 2023-11-07 00:06:01,875 [SYNTHESIZER] - Number columns considered for synth: 8
INFO: 2023-11-07 00:06:03,370 [SYNTHESIZER] - Starting the synthetic data modeling process over 3x1 blocks.
INFO: 2023-11-07 00:06:03,371 [SYNTHESIZER] - Generating pipeline for segment (-0.001, 24697.667]
INFO: 2023-11-07 00:06:03,374 [SYNTHESIZER] - Preprocess segment
INFO: 2023-11-07 00:06:03,377 [SYNTHESIZER] - Synthesizer init.
INFO: 2023-11-07 00:06:03,378 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.
INFO: 2023-11-07 00:06:06,257 [SYNTHESIZER] - Generating pipeline for segment (24697.667, 49395.333]
INFO: 2023-11-07 00:06:06,262 [SYNTHESIZER] - Preprocess segment
INFO: 2023-11-07 00:06:06,265 [SYNTHESIZER] - Synthesizer init.
INFO: 2023-11-07 00:06:06,266 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.
INFO: 2023-11-07 00:06:09,376 [SYNTHESIZER] - Generating pipeline for segment (49395.333, 74093.0]
INFO: 2023-11-07 00:06:09,380 [SYNTHESIZER] - Preprocess 

<ydata.synthesizers.regular.model.RegularSynthesizer at 0x7f11cc58bfa0>

In [19]:
# Generating as many synthetic rows as there are in the original dataset
n_synth_samples = len(dataset)

sample = synth.sample(n_synth_samples)

INFO: 2023-11-07 00:06:12,431 [SYNTHESIZER] - Start generating model samples.
INFO: 2023-11-07 00:06:12,432 [SYNTHESIZER] - Sample segment (-0.001, 24697.667]
INFO: 2023-11-07 00:06:13,520 [SYNTHESIZER] - Sample segment (24697.667, 49395.333]
INFO: 2023-11-07 00:06:15,016 [SYNTHESIZER] - Sample segment (49395.333, 74093.0]


In [20]:
sample.head()


| | installment | int_rate | loan_amnt | revol_bal | revol_util | term | total_pymnt | total_rec_int | total_rec_late_fee | total_rec_prncp | total_rev_hi_lim |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 194.699366 | 0.1274 | 5800 | 5767 | 0.341243 | 36m | 205.11 | 58.21 | 0.0 | 146.9 | 16900 |
| 1 | 195.094134 | 0.1599 | 5550 | 4845 | 0.35365 | 36m | 0.0 | 0.0 | 0.0 | 0.0 | 13700 |
| 2 | 77.467367 | 0.0724 | 2500 | 1577 | 0.089096 | 36m | 0.0 | 0.0 | 0.0 | 0.0 | 17700 |
| 3 | 150.573906 | 0.0532 | 5000 | 5365 | 0.216331 | 36m | 150.58 | 0.0 | 0.0 | 150.58 | 24800 |
| 4 | 745.415974 | 0.1699 | 30000 | 16702 | 0.297718 | 60m | 0.0 | 0.0 | 0.0 | 0.0 | 56100 |
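
One way to confirm that the calculated features were respected is to recompute each of them from its source columns and compare the result with the generated values. The sketch below does this with plain pandas and NumPy on the five rows copied from the `sample.head()` output above. The names `sample_head`, `expected_installment`, `rules` and `check_rules` are hypothetical helpers introduced for illustration and are not part of the YData API. The tolerance is also an assumption, chosen to absorb the rounding in the printed values.

```python
import numpy as np
import pandas as pd

# First five synthetic rows, copied from the sample.head() output.
sample_head = pd.DataFrame({
    "installment": [194.699366, 195.094134, 77.467367, 150.573906, 745.415974],
    "int_rate": [0.1274, 0.1599, 0.0724, 0.0532, 0.1699],
    "loan_amnt": [5800, 5550, 2500, 5000, 30000],
    "revol_bal": [5767, 4845, 1577, 5365, 16702],
    "revol_util": [0.341243, 0.35365, 0.089096, 0.216331, 0.297718],
    "term": ["36m", "36m", "36m", "36m", "60m"],
    "total_pymnt": [205.11, 0.0, 0.0, 150.58, 0.0],
    "total_rec_int": [58.21, 0.0, 0.0, 0.0, 0.0],
    "total_rec_late_fee": [0.0, 0.0, 0.0, 0.0, 0.0],
    "total_rec_prncp": [146.9, 0.0, 0.0, 150.58, 0.0],
    "total_rev_hi_lim": [16900, 13700, 17700, 24800, 56100],
})


def expected_installment(int_rate, loan_amnt, term):
    """Amortization formula, same logic as get_installment in the notebook."""
    n = term.str.rstrip("m").astype(int)  # number of monthly periods
    r = int_rate / 12                     # monthly interest rate
    growth = (1 + r) ** n
    return loan_amnt * r * growth / (growth - 1)


# Hypothetical mapping: calculated column -> function recomputing it from a DataFrame.
rules = {
    "revol_util": lambda df: df["revol_bal"] / df["total_rev_hi_lim"],
    "installment": lambda df: expected_installment(
        df["int_rate"], df["loan_amnt"], df["term"]
    ),
    "total_pymnt": lambda df: (
        df["total_rec_int"] + df["total_rec_late_fee"] + df["total_rec_prncp"]
    ),
}


def check_rules(df, rules, atol=1e-2):
    """Returns, per calculated column, whether every row matches its rule within atol."""
    return {
        col: bool(np.allclose(df[col], fn(df), atol=atol))
        for col, fn in rules.items()
    }


results = check_rules(sample_head, rules)
print(results)
```

The same `check_rules` call could be run on the full `sample` after converting it to a pandas DataFrame. How to do that conversion depends on the object returned by `synth.sample`, so it is left out here.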
