# TabDDPM: Modelling Tabular Data with Diffusion Models

Directly applying diffusion models to general tabular problems can be challenging because data points are typically represented by vectors of heterogeneous features. The inherent heterogeneity of tabular data complicates accurate modeling, as individual features can vary widely in nature; some may be continuous, while others are discrete. In this notebook, we explore **TabDDPM** — a diffusion model that can be universally applied to tabular datasets and effectively handles both categorical and numerical features.

Our primary focus in this work is synthetic data generation, which is in high demand for many tabular tasks. Firstly, tabular datasets are often limited in size, unlike vision or NLP problems where large amounts of additional data are readily available online. Secondly, properly generated synthetic datasets do not contain actual user data, thus avoiding GDPR-like regulations and allowing for public sharing without compromising anonymity.

In this notebook, we work with the ClavaDDPM implementation, which was originally designed for multi-table data synthesis. By applying a single-table configuration, we can use it for single-table synthesis as well. This configuration activates TabDDPM, the component within ClavaDDPM tailored for single-table scenarios.

In the following sections, we will delve deeper into the implementation of this method. The notebook is organized as follows:

1. Imports and Setup
2. Load Configuration
3. Data Loading and Preprocessing
4. TabDDPM Algorithm

    4.1. Overview

    4.2. Model Training

    4.3. Model Sampling


# Imports and Setup

In this section, we import the libraries and modules needed to set up the environment: the standard `json` module for displaying the configuration, and the ClavaDDPM pipeline functions for configuration loading, data loading, clustering, training, and sampling. These modules in turn depend on packages such as PyTorch, NumPy, and `faiss`, which must be installed beforehand.

In [7]:
import json

from midst_models.single_table_TabDDPM.complex_pipeline import (
    clava_clustering,
    clava_training,
    clava_load_pretrained,
    clava_synthesizing,
    load_configs,
)
from midst_models.single_table_TabDDPM.pipeline_modules import load_multi_table


# Load Configuration

In this section, we set up model training by loading the configuration file, which contains the parameters and settings for the training process. The configuration file, stored in JSON format, is read and parsed into a dictionary. The code cell below prints the entire configuration; the individual hyperparameters are explained in the later sections where they are used.

A sample configuration file is available at `configs/trans.json`, where general parameters can be modified as needed.

In [None]:
# Load config
config_path = "configs/trans.json"
configs, save_dir = load_configs(config_path)

# Display config
json_str = json.dumps(configs, indent=4)
print(json_str)
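
The later cells index into several top-level sections of the configuration. As a minimal sketch, the check below reports which of those sections are missing before any training starts. The helper `missing_sections` and the example values are hypothetical and not part of ClavaDDPM; only the section names are taken from the cells in this notebook.

```python
import json

# Top-level sections that the cells in this notebook index into.
REQUIRED_SECTIONS = ("general", "clustering", "diffusion", "sampling")


def missing_sections(configs):
    """Return the required top-level sections absent from a config dict."""
    return [name for name in REQUIRED_SECTIONS if name not in configs]


# Hypothetical, deliberately incomplete config used only for illustration.
example_configs = json.loads(
    '{"general": {"data_dir": "path/to/data"}, "diffusion": {"batch_size": 4096}}'
)
print(missing_sections(example_configs))  # ['clustering', 'sampling']
```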

# Data Loading and Preprocessing

In this notebook, we use the Transactions table from the Berka dataset. You can access the Berka dataset files for TabDDPM [here](https://drive.google.com/drive/folders/1rmJ_E6IzG25eCL3foYAb2jVmAstXktJ1?usp=drive_link).

The Berka dataset is a banking dataset of real, anonymized data from a Czech bank, released for the PKDD'99 Discovery Challenge in 1999. It provides detailed financial data on transactions, accounts, loans, credit cards, and demographic information for thousands of customers over multiple years.

In this section, we load and preprocess the dataset based on the configuration settings. The following files must be present in the data directory:

- `train.csv`: The transactions subset from the Berka dataset used for training. Note that the ID columns (columns ending in `_id`) should be removed from the training data.
- `test.csv`: The transactions subset from the Berka dataset used for evaluation. Note that the ID columns (columns ending in `_id`) should be removed from the test data.
- `trans_label_encoders.pkl`: The label encoders used to encode the transactions table; needed if you are using the already preprocessed data from the shared files.
- `trans_domain.json`: The domain information for each column in the transactions table. A sample domain file is available at `configs/trans_domain.json`.
- `dataset_meta.json`: The configuration file that defines the relationships between the tables in the dataset. For single-table synthesis, it should include only one table. A sample configuration file is available at `configs/dataset_meta.json`.
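
If you prepare `train.csv` and `test.csv` yourself, the ID columns can be dropped with pandas. The sketch below is one possible way to do this; the helper name `drop_id_columns` and the toy column values are illustrative assumptions, not part of the shared preprocessing.

```python
import pandas as pd


def drop_id_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the columns whose name ends in '_id'."""
    id_cols = [col for col in df.columns if str(col).endswith("_id")]
    return df.drop(columns=id_cols)


# Toy frame standing in for the raw transactions table (values are made up).
raw = pd.DataFrame(
    {
        "trans_id": [1, 2],
        "account_id": [10, 11],
        "amount": [700.0, 1200.5],
        "type": ["credit", "debit"],
    }
)
train = drop_id_columns(raw)
print(list(train.columns))  # ['amount', 'type']
```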

In [None]:
# Load dataset
# In this step, we load the dataset according to the 'dataset_meta.json' file located in data_dir.
tables, relation_order, dataset_meta = load_multi_table(configs["general"]["data_dir"])
print("")

# Tables is a dictionary of the multi-table dataset
print(
    "{} We show the keys of the tables dictionary below {}".format("=" * 20, "=" * 20)
)
print(list(tables.keys()))

# TabDDPM Algorithm

In this section, we will describe the design of TabDDPM as well as its main hyperparameters loaded through config, which affect the model’s effectiveness. 

**TabDDPM** uses multinomial diffusion to model the categorical and binary features, and Gaussian diffusion to model the numerical ones. The model is trained using the diffusion process, which is a discrete-time Markov chain that models the data distribution. In more detail, for a tabular data sample that consists of $N$ numerical features and $C$ categorical features with $K_i$ categories each, TabDDPM takes one-hot encoded versions of the categorical features and normalized numerical features as input. The figure below illustrates the diffusion process for classification problems; $t$, $y$ and $\ell$ denote a diffusion timestep, a class label, and logits, respectively.

<p align="center">
<img src="https://github.com/user-attachments/assets/1b772284-de6a-44ad-8346-39b5f040cd31" width="1000"/>
</p>
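
The input layout described above can be sketched with NumPy as follows. This is an illustration only: plain standardization stands in for whatever normalization the pipeline actually applies, and the helper names `one_hot` and `encode_batch` are hypothetical.

```python
import numpy as np


def one_hot(codes, num_categories):
    """One-hot encode integer category codes of shape (batch,)."""
    return np.eye(num_categories)[codes]


def encode_batch(x_num, x_cat, cat_sizes):
    """Concatenate normalized numerical features with one-hot categorical blocks."""
    std = x_num.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant columns
    x_num_norm = (x_num - x_num.mean(axis=0)) / std
    blocks = [one_hot(x_cat[:, i], k) for i, k in enumerate(cat_sizes)]
    return np.concatenate([x_num_norm, *blocks], axis=1)


# N = 2 numerical features; C = 2 categorical features with K = (2, 3) categories.
x_num = np.array([[100.0, 1.0], [300.0, 3.0], [200.0, 2.0]])
x_cat = np.array([[0, 2], [1, 0], [0, 1]])
x_in = encode_batch(x_num, x_cat, cat_sizes=[2, 3])
print(x_in.shape)  # (3, 7): 2 numerical columns + 2 + 3 one-hot columns
```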

**Diffusion models** are likelihood-based generative models that handle the data through forward and reverse Markov processes. The forward process gradually adds noise to an initial sample $x_0$ from the data distribution $q(x_0)$, sampling noise from the predefined distributions $q(x_t \mid x_{t-1})$ with variances $\{\beta_1, \dots, \beta_T\}$.

<p align="center">
<img src="https://github.com/user-attachments/assets/6f610e06-ab5b-4974-97ce-9767baf254ea" width="300"/>
</p>

The reverse diffusion process gradually denoises a latent variable $x_T \sim q(x_T)$ and allows generating new data samples from $q(x_0)$. The distributions $p(x_{t-1} \mid x_t)$ are usually unknown and approximated by a neural network with parameters $\theta$.

<p align="center">
<img src="https://github.com/user-attachments/assets/2c641eda-1678-4009-8d6e-88bf2ab24600" width="280"/>
</p>

**Gaussian diffusion models** operate in continuous spaces, where the forward and reverse processes are characterized by Gaussian distributions:

<p align="center">
<img src="https://github.com/user-attachments/assets/c0cfa4a8-9281-4a7a-aaaa-b220ffd05734" width="330"/>
</p>
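
As a minimal sketch of the Gaussian forward process, the snippet below noises a sample directly to step $t$ using the standard closed form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, with $\bar\alpha_t = \prod_{s \le t}(1-\beta_s)$. The linear schedule and its endpoints are assumptions chosen for illustration and may differ from the schedule configured in this pipeline.

```python
import numpy as np


def linear_beta_schedule(num_timesteps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear variance schedule beta_1, ..., beta_T."""
    return np.linspace(beta_start, beta_end, num_timesteps)


def q_sample(x0, t, alphas_cumprod, rng):
    """Sample x_t from q(x_t | x_0); t is a 0-based index into the schedule."""
    noise = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    return x_t, noise


rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
alphas_cumprod = np.cumprod(1.0 - betas)

x0 = rng.standard_normal((4, 3))  # toy batch of normalized numerical features
x_early, _ = q_sample(x0, 0, alphas_cumprod, rng)              # almost identical to x0
x_late, noise_late = q_sample(x0, 999, alphas_cumprod, rng)    # almost pure noise
```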

While in general the parameters $\theta$ are learned from the data by optimizing a variational lower bound, in practice for Gaussian modeling this objective can be simplified to the sum of mean-squared errors between $\epsilon_\theta(x_t, t)$ and $\epsilon$ over all timesteps $t$:

<p align="center">
<img src="https://github.com/user-attachments/assets/61f34373-3890-4785-98c6-6e103bd81950" width="330"/>
</p>
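
The simplified objective can be sketched as below. Instead of summing over all timesteps, the sketch draws one random timestep per sample, which is a common stochastic estimate of the same quantity. The two stand-in predictors (`zero_model` and `oracle_model`) and the linear schedule are illustrative assumptions; in the actual pipeline the predictor is the trained network.

```python
import numpy as np


def simple_loss(eps_model, x0, alphas_cumprod, rng):
    """Monte Carlo estimate of the MSE between true and predicted noise."""
    t = rng.integers(0, len(alphas_cumprod), size=x0.shape[0])
    eps = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t][:, None]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))


rng = np.random.default_rng(0)
alphas_cumprod = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.standard_normal((256, 5))


def zero_model(x_t, t):
    """Uninformed predictor: always predicts zero noise."""
    return np.zeros_like(x_t)


def oracle_model(x_t, t):
    """Cheating predictor that knows x0 and inverts the forward process."""
    a_bar = alphas_cumprod[t][:, None]
    return (x_t - np.sqrt(a_bar) * x0) / np.sqrt(1.0 - a_bar)


loss_zero = simple_loss(zero_model, x0, alphas_cumprod, rng)      # close to 1
loss_oracle = simple_loss(oracle_model, x0, alphas_cumprod, rng)  # close to 0
```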

**Multinomial diffusion models** are designed to generate categorical data, where each sample is a one-hot encoded categorical variable with $K$ values. The multinomial forward diffusion process defines $q(x_t \mid x_{t-1})$ as a categorical distribution that corrupts the data with uniform noise over the $K$ classes:

<p align="center">
<img src="https://github.com/user-attachments/assets/ced8bc14-9296-4a09-9881-64f90bed537d" width="440"/>
</p>
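
A single multinomial forward step can be sketched as follows, assuming the usual form $q(x_t \mid x_{t-1}) = \mathrm{Cat}\big(x_t;\ (1-\beta_t)\,x_{t-1} + \beta_t / K\big)$. The function names and the value of `beta_t` are illustrative.

```python
import numpy as np


def forward_probs(x_prev, beta_t):
    """Class probabilities of q(x_t | x_{t-1}) for one-hot rows x_prev."""
    num_classes = x_prev.shape[-1]
    return (1.0 - beta_t) * x_prev + beta_t / num_classes


def sample_one_hot(probs, rng):
    """Draw one category per row and return the result one-hot encoded."""
    num_classes = probs.shape[-1]
    idx = np.array([rng.choice(num_classes, p=row) for row in probs])
    return np.eye(num_classes)[idx]


rng = np.random.default_rng(0)
x_prev = np.eye(4)[[0, 2, 3]]  # three samples, K = 4 classes
probs = forward_probs(x_prev, beta_t=0.2)
x_t = sample_one_hot(probs, rng)
```

With `beta_t = 0` the sample is left unchanged, and with `beta_t = 1` it is replaced by a uniform draw over the $K$ classes.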

The reverse distribution $p_\theta(x_{t-1} \mid x_t)$ is parameterized as $q(x_{t-1} \mid x_t, \hat{x}_0(x_t, t))$, where $\hat{x}_0$ is predicted by a neural network.
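
For reference, the posterior used in this parameterization has a closed form in multinomial diffusion: it is proportional to $[\alpha_t x_t + (1-\alpha_t)/K] \odot [\bar\alpha_{t-1} x_0 + (1-\bar\alpha_{t-1})/K]$, with $\alpha_t = 1 - \beta_t$. The sketch below evaluates it with a hand-written $\hat{x}_0$ in place of a network prediction; the function name and the numeric values are illustrative assumptions.

```python
import numpy as np


def multinomial_posterior(x_t, x0_hat, alpha_t, alpha_bar_prev):
    """Probabilities of q(x_{t-1} | x_t, x0_hat) for one-hot or soft rows."""
    num_classes = x_t.shape[-1]
    term_t = alpha_t * x_t + (1.0 - alpha_t) / num_classes
    term_0 = alpha_bar_prev * x0_hat + (1.0 - alpha_bar_prev) / num_classes
    unnormalized = term_t * term_0
    return unnormalized / unnormalized.sum(axis=-1, keepdims=True)


x_t = np.array([[0.0, 1.0, 0.0]])     # current noisy sample is class 1
x0_hat = np.array([[0.7, 0.2, 0.1]])  # stand-in for the network's prediction
post = multinomial_posterior(x_t, x0_hat, alpha_t=0.9, alpha_bar_prev=0.5)
```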

## Model Training
Note that ClavaDDPM introduces relation-aware clustering to model parent-child constraints and leverages diffusion models for controlled tabular data synthesis. In the single-table scenario, however, the clustering step is still executed but does not affect how the model is trained or sampled.


In [None]:
# Display important clustering parameters
params_clustering = configs["clustering"]
print("{} We show the clustering parameters below {}".format("=" * 20, "=" * 20))
for key, val in params_clustering.items():
    print(f"{key}: {val}")
print("")

# Clustering on the multi-table dataset
tables, all_group_lengths_prob_dicts = clava_clustering(
    tables, relation_order, save_dir, configs
)

Important parameters for the training process include:

- `d_layers`: the dimension of layers in the diffusion model. 
- `num_timesteps`: the number of diffusion steps for adding noise and denoising. 
- `iterations`: the number of training iterations. The default is 10000. Recommended range for tuning: 5000 to 20000.
- `batch_size`: the batch size for training. The default is 4096. 

In [None]:
# Display important training (diffusion) parameters
params_diffusion = configs["diffusion"]
print(
    "{} We show the important training parameters below {}".format("=" * 20, "=" * 20)
)
for key, val in params_diffusion.items():
    print(f"{key}: {val}")
print("")

### PyTorch Training from Scratch
The training process is implemented using a custom PyTorch function, specifying parameters such as the number of epochs and checkpoints. Various callbacks are configured to monitor and save the model during training. The training process is then started and logs its progress until training completes. Finally, the trained models are saved to the specified directory and returned for further use. This process is driven by the `clava_training` function, which takes the following inputs:

- `tables`: the relational tables with data augmentation.
- `configs`: the configuration dictionary with hyperparameters and settings for the training process.
- `relation_order`: the parent-child relationships between tables.
- `save_dir`: the directory to save the trained models and logs.


In [None]:
# Launch training from scratch
models = clava_training(tables, relation_order, save_dir, configs)

### Loading Pretrained Models
If training from scratch takes too long, run the following cell to load the pretrained models instead.

In [None]:
# Use the pre-trained models
## save_dir was determined when loading the config file
models = clava_load_pretrained(relation_order, save_dir)

## Model Sampling

Important parameters for the sampling process include:
- `batch_size`: Mini-batch size for sampling.

In [None]:
# Display important sampling parameters
params_sampling = configs["sampling"]
print(
    "{} We show the important sampling parameters below {}".format("=" * 20, "=" * 20)
)
for key, val in params_sampling.items():
    print(f"{key}: {val}")
print("")

### Generating Data from Scratch
To generate synthetic data from scratch, we run the following code cell. The `clava_synthesizing` function takes the following inputs:

- `tables`: the relational tables with data augmentation.
- `relation_order`: the parent-child relationships between tables.
- `save_dir`: the directory to save the synthetic data.
- `all_group_lengths_prob_dicts`: a dictionary holding the group size distributions for each table, used in the sampling stage to determine the size of the tables to generate.
- `models`: the trained diffusion models.
- `configs`: the configuration dictionary with hyperparameters and settings for the sampling process.
- `sample_scale`: the scale factor for the sampling process.

The synthetic data will be saved in the specified output directory.

In [None]:
# Generate synthetic data from scratch
cleaned_tables, synthesizing_time_spent, matching_time_spent = clava_synthesizing(
    tables,
    relation_order,
    save_dir,
    all_group_lengths_prob_dicts,
    models,
    configs,
    sample_scale=1 if "debug" not in configs else configs["debug"]["sample_scale"],
)

Finally, as some integer values are saved as strings during this process, we convert them back to integers for further evaluation.

In [None]:
# Cast integer values that were saved as strings back to int for further evaluation
for key in cleaned_tables.keys():
    for col in cleaned_tables[key].columns:
        if cleaned_tables[key][col].dtype == "object":
            try:
                cleaned_tables[key][col] = cleaned_tables[key][col].astype(int)
            except ValueError:
                print(f"Column {col} cannot be converted to int.")

## References

**Pang, Wei, et al.** "ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models." *preprint* (2024).

**GitHub Repository:** [ClavaDDPM](https://github.com/weipang142857/ClavaDDPM)