

<div align="center">
<p align="center">


# 🚀 Synthetic Data Generator

</p>
</div>
The Synthetic Data Generator (SDG) is a specialized framework designed to generate high-quality structured tabular data.

Synthetic data does not contain any sensitive information, yet it retains the essential characteristics of the original data, making it exempt from privacy regulations such as GDPR and ADPPA.

High-quality synthetic data can be safely utilized across various domains including data sharing, model training and debugging, system development and testing, etc.

In [3]:
# install from git
!pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting git+https://github.com/hitsz-ids/synthetic-data-generator.git
  Cloning https://github.com/hitsz-ids/synthetic-data-generator.git to c:\users\elvin\appdata\local\temp\pip-req-build-cyx2iv05
  Resolved https://github.com/hitsz-ids/synthetic-data-generator.git to commit 0fc9ea290d5836d079d029c8d6702e526c2676a4
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting pandas (from sdgx==0.2.4.dev0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/ed/30/b97456e7063edac0e5a405128065f0cd2033adfe3716fb2256c186bd41d0/pandas-2.0.3-cp310-cp310-win_amd64.whl (10.7 MB)
Collecting psutil (from table-evaluator->sdgx==0.2.4.dev0)
  Downloading

  Running command git clone --filter=blob:none --quiet https://github.com/hitsz-ids/synthetic-data-generator.git 'C:\Users\Elvin\AppData\Local\Temp\pip-req-build-cyx2iv05'
  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
copulas 0.8.0 requires pandas<2,>=1.3.4; python_version >= "3.10" and python_version < "3.11", but you have pandas 2.0.3 which is incompatible.
sdmetrics 0.9.0 requires pandas<2,>=1.5.0; python_version >= "3.10", but you have pandas 2.0.3 which is incompatible.


We demonstrate with a single-table data synthesis example.

In [4]:
from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.data_loader import DataLoader
from sdgx.data_models.metadata import Metadata

# 1. Load data

The demo dataset is a risk-control dataset used to predict whether an individual will default on a loan. It contains the following features:

| Column name | Meaning |
|-----------------------|-----------------------|
| loan_id | loan ID |
| user_id | user ID |
| total_loan | Total loan amount |
| year_of_loan | Loan period |
...

This code shows the process of loading real data:

In [8]:
# In the future, this part of the function will be integrated into `sdgx.processor`
import os 
import requests


def download_file(url, path):
    response = requests.get(url)
    if response.status_code == 200:
        with open(path, 'wb') as file:
            file.write(response.content)
        print(f"File downloaded successfully to {path}")
    else:
        print(f"Failed to download file from {url}")

# download dataset from github
# This dataset can be downloaded through sdgx.utils
dataset_url = "https://raw.githubusercontent.com/aialgorithm/Blog/master/projects/一文梳理风控建模全流程/train_internet.csv"
file_path = 'train_internet.csv'

if not os.path.exists(file_path):
    download_file(dataset_url, file_path)

File downloaded successfully to train_internet.csv


In [9]:

from pathlib import Path

path_obj = Path(file_path)

# Create a data connector and data loader for large CSV data
# SDG loads the data in chunks, which reduces memory usage.
data_connector = CsvConnector(path=path_obj)
# For small data you can use DataFrameConnector
# from sdgx.data_connectors.dataframe_connector import DataFrameConnector
# data_connector = DataFrameConnector(dataframe)
data_loader = DataLoader(data_connector)
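
To illustrate why chunked loading keeps memory usage bounded, here is a minimal standard-library sketch. It is not how `CsvConnector` or `DataLoader` are implemented; `iter_csv_chunks`, the chunk size, and the sample rows are hypothetical and chosen only for illustration.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO


def iter_csv_chunks(handle: TextIO, chunk_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most `chunk_size` rows, so only one chunk is held in memory."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# A tiny in-memory CSV standing in for a large file on disk (made-up values).
sample = io.StringIO("loan_id,total_loan\n1,1000\n2,2500\n3,4000\n")
chunk_lengths = [len(chunk) for chunk in iter_csv_chunks(sample, chunk_size=2)]
print(chunk_lengths)
```

Each chunk can be processed and discarded before the next one is read, which is the general idea behind loading large CSV files piece by piece.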

# 2. Create Metadata from Dataloader

sdgx supports creating metadata from a pd.DataFrame or a DataLoader. It also supports building metadata from scratch, starting from blank metadata, but this is not recommended because it is more tedious.

In this example, we use `from_dataloader` to create the first Metadata.

In [10]:
loan_metadata = Metadata.from_dataloader(data_loader)

2024-11-29 11:18:27.846 | INFO     | sdgx.data_models.metadata:from_dataloader:318 - Inspecting metadata...
2024-11-29 11:18:30.355 | INFO     | sdgx.data_models.metadata:update_primary_key:527 - Primary Key updated: {'user_id', 'loan_id'}.


Let's first look at some commonly used member variables of Metadata.

The most important and most commonly used one is `column_list`, which returns the column names as a list. The order of the columns matches the order in the actual table.

In [11]:
loan_metadata.column_list

['loan_id',
 'user_id',
 'total_loan',
 'year_of_loan',
 'interest',
 'monthly_payment',
 'class',
 'sub_class',
 'work_type',
 'employer_type',
 'industry',
 'work_year',
 'house_exist',
 'house_loan_status',
 'censor_status',
 'marriage',
 'offsprings',
 'issue_date',
 'use',
 'post_code',
 'region',
 'debt_loan_ratio',
 'del_in_18month',
 'scoring_low',
 'scoring_high',
 'pub_dero_bankrup',
 'early_return',
 'early_return_amount',
 'early_return_amount_3mon',
 'recircle_b',
 'recircle_u',
 'initial_list_status',
 'earlies_credit_mon',
 'title',
 'policy_code',
 'f0',
 'f1',
 'f2',
 'f3',
 'f4',
 'f5',
 'is_default']

# 3. Use Inspectors to automatically label column types

Currently, when sdgx's Metadata is created from a pd.DataFrame or a DataLoader, it loads Inspectors, automatically scans a portion of the data (not all of it), and labels the columns in the table according to the logic of each Inspector.

Currently, we support automatic inference of multiple data types, and sdgx supports the following basic types:
- bool
- int
- float
- datetime
- discrete
- id

The basic data types guarantee that every column is labeled with one of these data types.
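
As a rough illustration of how a sample of values can be mapped to one basic type, consider the simplified sketch below. It does not reproduce the logic of sdgx's Inspectors; `infer_basic_type`, its rules, the single datetime format it tries, and the sample values are all assumptions made for this example.

```python
from datetime import datetime
from typing import Callable, List


def infer_basic_type(samples: List[str]) -> str:
    """Guess a basic type from a small sample of string values (simplified rules)."""

    def all_parse(parse: Callable[[str], object]) -> bool:
        # True only if every sampled value can be parsed without a ValueError.
        try:
            for value in samples:
                parse(value)
        except ValueError:
            return False
        return True

    if not samples:
        return "discrete"
    if all(value.lower() in {"true", "false"} for value in samples):
        return "bool"
    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "datetime"
    # Fallback label, so every column receives some type.
    return "discrete"


# Made-up sample values for a few columns of the demo table.
columns = {
    "f0": ["0", "1", "3"],
    "interest": ["11.47", "8.9"],
    "issue_date": ["2016-10-01"],
    "class": ["A", "C"],
}
inferred = {name: infer_basic_type(values) for name, values in columns.items()}
print(inferred)
```

The final fallback mirrors the guarantee described above: a column that matches no stricter rule still ends up with a label.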

sdgx also supports the following data types, and the sdgx team will continue to add data types:

- english_name
- email
- china_mainland_mobile_phone
- china_mainland_id
- china_mainland_postcode
- unified_social_credit_code
- china_mainland_address
- chinese_name

If you need to query the columns of a certain data type, you can access them through `.{data_type}_columns`. For example, access the datetime columns through `.datetime_columns` and the english_name columns through `.english_name_columns`.

Likewise, we can access discrete columns through `.discrete_columns`, which returns a set of the column names that are considered **discrete** columns.


In [12]:
# Automatically infer discrete columns
loan_metadata.discrete_columns

{'class',
 'earlies_credit_mon',
 'employer_type',
 'industry',
 'issue_date',
 'sub_class',
 'work_type',
 'work_year'}

Similarly, we can view `int_columns`, `bool_columns` and other columns as follows:

In [13]:
# No Bool columns in current tabular data.
loan_metadata.bool_columns

set()

In [14]:
# check int columns
loan_metadata.int_columns

{'censor_status',
 'del_in_18month',
 'early_return',
 'early_return_amount',
 'early_return_amount_3mon',
 'f0',
 'f1',
 'f2',
 'f3',
 'f4',
 'f5',
 'house_exist',
 'house_loan_status',
 'initial_list_status',
 'is_default',
 'loan_id',
 'marriage',
 'offsprings',
 'policy_code',
 'post_code',
 'pub_dero_bankrup',
 'recircle_b',
 'region',
 'scoring_high',
 'scoring_low',
 'title',
 'total_loan',
 'use',
 'user_id',
 'year_of_loan'}

Use `datetime_columns` to view datetime columns. Note that the datetime type requires formats to be added before data processing: **the datetime formats must correspond exactly to the datetime columns**.

For specific operations, please refer to the manual interface of metadata below.

In [15]:
loan_metadata.datetime_columns

{'earlies_credit_mon', 'issue_date'}

⚠️ Inspectors work well in most cases, but they may not cover every type that appears in tabular data.

Therefore, before training a model or processing the data further, we recommend that data analysts **check** the data type label of every column.

# 4. Understand the inspect_level mechanism in Metadata
 
Since Metadata runs multiple Inspectors when it is created, the same data column may be labeled more than once. For example, a column may be labeled as both postcode and discrete when it is really a postcode column.

To resolve this, we use `inspect_level`. Different Inspectors have different inspect levels, and the final label of a column is determined by the label with the highest inspect level.

In [16]:
# `column_inspect_level` records the inspect_level values of all inspectors
# the default inspect_level is 10
loan_metadata.column_inspect_level

defaultdict(<function sdgx.data_models.metadata.Metadata.<lambda>()>,
            {'email_columns': 30,
             'unified_social_credit_code_columns': 30,
             'chinese_company_name_columns': 40,
             'id_columns': 20,
             'china_mainland_address_columns': 30,
             'china_mainland_postcode_columns': 20,
             'english_name_columns': 40,
             'china_mainland_id_columns': 30,
             'datetime_columns': 20,
             'int_columns': 10,
             'float_columns': 10,
             'empty_columns': 90,
             'bool_columns': 10,
             'china_mainland_mobile_phone_columns': 30,
             'chinese_name_columns': 40,
             'discrete_columns': 10,
             'const_columns': 80})
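
The resolution rule can be sketched in a few lines. This is an illustration of the idea rather than sdgx's code: `resolve_final_tag` is a hypothetical helper, the levels are copied from a subset of the output above, and the alphabetical tie-break is an assumption.

```python
from typing import Dict, Set

# Inspect levels copied from a subset of the `column_inspect_level` output above.
INSPECT_LEVELS: Dict[str, int] = {
    "int_columns": 10,
    "discrete_columns": 10,
    "datetime_columns": 20,
    "china_mainland_postcode_columns": 20,
    "const_columns": 80,
}


def resolve_final_tag(tags: Set[str], levels: Dict[str, int], default_level: int = 10) -> str:
    """Return the tag with the highest inspect level among those a column received.

    Ties are broken alphabetically here, which is an assumption for determinism.
    """
    return max(sorted(tags), key=lambda tag: levels.get(tag, default_level))


# A column tagged as both discrete and datetime resolves to datetime (20 > 10).
issue_date_tag = resolve_final_tag({"discrete_columns", "datetime_columns"}, INSPECT_LEVELS)
# A column tagged as both int and const resolves to const (80 > 10).
policy_code_tag = resolve_final_tag({"int_columns", "const_columns"}, INSPECT_LEVELS)
print(issue_date_tag, policy_code_tag)
```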

# 5. Metadata manual interface

Metadata supports the following manual interfaces, which let you modify column labels one by one as needed:
- query: Query the tag of a certain column.
- get: Get all tags by key.
- set: Set tags; the value is converted to a set if it is not one already.
- add: Add tags.
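
To make the difference between the four operations concrete, the sketch below implements them on a toy class. `TagStore` is a hypothetical stand-in written for this example, not the sdgx `Metadata` class, and its exact behavior (for instance, what `query` returns) is an assumption based only on the descriptions in the list.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Union


class TagStore:
    """Hypothetical stand-in that mimics the four operations described above."""

    def __init__(self) -> None:
        self._tags: Dict[str, Set[str]] = defaultdict(set)

    def set(self, key: str, value: Union[str, Iterable[str]]) -> None:
        # Replace the tag set; a non-set value is converted to a set.
        self._tags[key] = {value} if isinstance(value, str) else set(value)

    def add(self, key: str, value: Union[str, Iterable[str]]) -> None:
        # Extend the existing tag set instead of replacing it.
        self._tags[key] |= {value} if isinstance(value, str) else set(value)

    def get(self, key: str) -> Set[str]:
        # Return a copy of all columns stored under one key.
        return set(self._tags[key])

    def query(self, column: str) -> Set[str]:
        # Return every key whose tag set contains the given column.
        return {key for key, columns in self._tags.items() if column in columns}


store = TagStore()
store.set("id_columns", "loan_id")                # a plain string becomes {'loan_id'}
store.add("int_columns", ["loan_id", "user_id"])  # extends the (empty) int tag set
print(store.get("id_columns"))
print(sorted(store.query("loan_id")))
```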



In [17]:
loan_metadata.set('id_columns', {'loan_id'})

loan_metadata.id_columns

{'loan_id'}

Note that currently only the datetime type requires formats to be added. Before data processing, the datetime formats must correspond exactly to the datetime columns; otherwise the column will be deleted during data preprocessing. Other data types do not need this.

In [18]:
# datetime_format has no content, which will cause an error in the subsequent process.
loan_metadata.datetime_format

defaultdict(str, {})

The four basic methods above apply only to column tags.

For the dict-typed `datetime_format`, we recommend assigning values directly.

In [19]:
datetime_format = {
    'issue_date': '%Y-%m-%d',
    'earlies_credit_mon': '%b-%Y'
}
loan_metadata.datetime_format = datetime_format
# You can also try this.
# loan_metadata.datetime_format["issue_date"] = "%Y-%m-%d"
# loan_metadata.datetime_format["earlies_credit_mon"] = "%b-%Y"
loan_metadata.datetime_format

{'issue_date': '%Y-%m-%d', 'earlies_credit_mon': '%b-%Y'}
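
Before moving on, it can help to confirm that every datetime column has a format and that the format actually parses the data. The sketch below uses only the standard library; `missing_formats` and `format_matches` are hypothetical helpers, and the sample values are made up rather than taken from the dataset.

```python
from datetime import datetime
from typing import Dict, List, Set


def missing_formats(datetime_columns: Set[str], datetime_format: Dict[str, str]) -> Set[str]:
    """Return the datetime columns that have no format string assigned."""
    return datetime_columns - set(datetime_format)


def format_matches(samples: List[str], fmt: str) -> bool:
    """Check that every sample value parses with the given strptime format."""
    try:
        for value in samples:
            datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


datetime_columns = {"issue_date", "earlies_credit_mon"}
datetime_format = {"issue_date": "%Y-%m-%d", "earlies_credit_mon": "%b-%Y"}

uncovered = missing_formats(datetime_columns, datetime_format)
print(uncovered)                                     # columns still lacking a format
print(format_matches(["2016-10-01"], "%Y-%m-%d"))    # ISO-style date
print(format_matches(["Dec-2001"], "%b-%Y"))         # %b depends on the active locale
print(format_matches(["2016/10/01"], "%Y-%m-%d"))    # wrong separator, does not parse
```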

# 6. Get the exact data type of each column

We provide the `get_column_data_type` method to query the final data type of each column:

In [20]:
loan_metadata.get_column_data_type("f0")

'int'

In [21]:
loan_metadata.get_column_data_type("recircle_u")

'float'

If you need the exact data type of every column, you can combine it with the `.column_list` attribute:

In [22]:
for each_col in loan_metadata.column_list:
    print(f'{each_col}: {loan_metadata.get_column_data_type(each_col)}')

loan_id: id
user_id: int
total_loan: int
year_of_loan: int
interest: float
monthly_payment: float
class: discrete
sub_class: discrete
work_type: discrete
employer_type: discrete
industry: discrete
work_year: discrete
house_exist: int
house_loan_status: int
censor_status: int
marriage: int
offsprings: int
issue_date: datetime
use: int
post_code: int
region: int
debt_loan_ratio: float
del_in_18month: int
scoring_low: int
scoring_high: int
pub_dero_bankrup: int
early_return: int
early_return_amount: int
early_return_amount_3mon: int
recircle_b: int
recircle_u: float
initial_list_status: int
earlies_credit_mon: datetime
title: int
policy_code: const
f0: int
f1: int
f2: int
f3: int
f4: int
f5: int
is_default: int


# 7. Setting the categorical encoder for ML models

This feature is currently available only in `CTGANSynthesizerModel`.
Some ML models, such as `CTGANSynthesizerModel`, support specifying the encoder for categorical columns. You can use `CategoricalEncoderType` to check which encoders are available.

In [1]:
from sdgx.data_models.metadata import CategoricalEncoderType
CategoricalEncoderType

{'label', 'onehot', 'frequency'}
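
For readers unfamiliar with the three encoder names, the sketch below shows the textbook form of each encoding on a tiny made-up column. It illustrates the general techniques only; the encoders inside sdgx may differ in details such as category ordering or normalization.

```python
from collections import Counter
from typing import Dict, List

values = ["A", "B", "A", "C", "A", "B"]  # made-up category values

# Label encoding: map each category to an integer code.
categories = sorted(set(values))
label_codes: Dict[str, int] = {cat: idx for idx, cat in enumerate(categories)}
label_encoded: List[int] = [label_codes[v] for v in values]

# One-hot encoding: one indicator position per category.
onehot_encoded: List[List[int]] = [[int(v == cat) for cat in categories] for v in values]

# Frequency encoding: replace each category with its relative frequency.
counts = Counter(values)
frequency_encoded: List[float] = [counts[v] / len(values) for v in values]

print(label_encoded)
print(onehot_encoded[0])
print(frequency_encoded[0])
```

Note that the one-hot representation needs one position per distinct category, while label and frequency encoding always produce a single value per row.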

In [24]:
loan_metadata.discrete_columns

{'class',
 'earlies_credit_mon',
 'employer_type',
 'industry',
 'issue_date',
 'sub_class',
 'work_type',
 'work_year'}

Then you can specify each column's encoder directly by setting `categorical_encoder`.

Tips:
1. For datetime columns, if `DatetimeFormatter` is used in the processors, there is no need to select an encoder, because the processor has already transformed the column to float.
2. If no encoder is specified for a column, the model uses its default logic to select one. `CTGANSynthesizerModel` uses 'onehot'.

In [25]:
loan_metadata.categorical_encoder = {
    "class": "label",
    "sub_class": "label",
    "employer_type": "onehot", # this line can be removed for CTGANSynthesizerModel, because "onehot" is the model's default encoder.
    "industry": "frequency",
    "work_year": "label"
    # "work_type" uses the default encoder because no encoder is specified for it.
    # "issue_date" and "earlies_credit_mon" are datetime columns. They need no encoder when DatetimeFormatter is used in training, because it transforms them to float.
}

Furthermore, if a column has too many unique values, using the onehot encoder can cause performance problems for the ML model because of the large training dimensions. We can use `categorical_threshold` to select an encoder automatically.

In [26]:
loan_metadata.categorical_threshold = {
    # if the number of unique values is less than 100, use the onehot encoder.
    100: "frequency", # if the number of unique values is greater than 100, use the frequency encoder.
    10000: "label" # if the number of unique values is greater than 10000, use the label encoder.
}

In `CTGANSynthesizerModel`, if both `categorical_threshold` and `categorical_encoder` are specified, `categorical_encoder` takes precedence even if the column matches a rule in `categorical_threshold`.
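
The precedence between the two settings can be sketched as follows. `choose_encoder` is a hypothetical helper, not part of sdgx; it assumes strict "greater than" comparisons, as the comments in the threshold cell suggest, and 'onehot' as the default.

```python
from typing import Dict


def choose_encoder(
    column: str,
    n_unique: int,
    categorical_encoder: Dict[str, str],
    categorical_threshold: Dict[int, str],
    default: str = "onehot",
) -> str:
    """Pick an encoder: explicit per-column setting first, then thresholds, then the default."""
    if column in categorical_encoder:
        return categorical_encoder[column]
    chosen = default
    # Walk thresholds in ascending order; the largest threshold exceeded wins.
    for threshold in sorted(categorical_threshold):
        if n_unique > threshold:
            chosen = categorical_threshold[threshold]
    return chosen


encoders = {"class": "label", "employer_type": "onehot"}
thresholds = {100: "frequency", 10000: "label"}

print(choose_encoder("work_type", 8, encoders, thresholds))        # below every threshold
print(choose_encoder("work_type", 500, encoders, thresholds))      # above 100
print(choose_encoder("work_type", 20000, encoders, thresholds))    # above 10000
print(choose_encoder("employer_type", 500, encoders, thresholds))  # explicit setting wins
```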