<center>
  
# TABSYN: Tabular Data Synthesis with Diffusion Models

</center>

Two challenges in extending diffusion models to tabular data are:
1. **Diverse data types:** a single table can have columns of different data types, including numerical, categorical, text, etc.
2. **Varied distributions:** the distribution of values varies widely from column to column within a single table.

**TabSyn** addresses these challenges by introducing a latent space where the data of all columns are jointly represented. It then proceeds to train a diffusion model on the latent representations.
This tactic allows TabSyn to:
1. Train a single diffusion model for all data types in the dataset (i.e. Generality).
2. Optimize the distribution of latent embeddings to facilitate training of the subsequent diffusion model, thus generating higher quality synthetic data (i.e. Quality).
3. Use far fewer reverse steps when sampling from the diffusion model, and thus synthesize data faster (i.e. Speed).

In this notebook, we review and implement the TabSyn model. The notebook is organized as follows:

1. [Imports and Setup]()


2. [Berka Dataset]()
    
    
3. [TabSyn Algorithm]()
    
    3.1. [Load Config]()
    
    3.2. [Make Dataset]()
    
    3.3. [Instantiate Model]()
    
    3.4. [Train Model]()
        
    3.5. [Load Pretrained Model]()
    
    3.6. [Sample Data]()
    
    3.7. [Review Synthetic Data]()


# Imports and Setup

In this section, we import all necessary libraries and modules required for setting up the environment.

In [1]:
import os
import json
import pandas as pd
from pprint import pprint

import torch
from torch.utils.data import DataLoader

from midst_models.single_table_TabSyn.scripts.process_dataset import process_data

from midst_models.single_table_TabSyn.src.data import preprocess, TabularDataset
from midst_models.single_table_TabSyn.src.tabsyn.pipeline import TabSyn
from midst_models.single_table_TabSyn.src import load_config

# Berka Dataset

In this section, we will process the Transactions table from the Berka dataset. You can access the Berka dataset files for TabSyn [here](https://drive.google.com/drive/folders/18KHv3VQuRphMHqZQsQc-x2ALoIiAggA0?usp=drive_link).
The Berka dataset is a banking dataset of anonymized records from a Czech bank, originally released for the PKDD'99 Discovery Challenge in 1999. It provides detailed financial data on transactions, accounts, loans, credit cards, and demographic information for thousands of customers over multiple years.

Download the data files from the link above and place the train set in the `RAW_DATA_DIR` directory.
Note that the id columns (columns ending in "_id") should be removed from the training and test data.
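
As a minimal sketch of the id-column removal described above, the helper below drops every column whose name ends in `_id` from a pandas DataFrame. The function name `drop_id_columns` and the toy column names are assumptions made for illustration; they are not part of the TabSyn scripts or the real Berka schema.

```python
import pandas as pd


def drop_id_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the columns whose name ends in '_id'."""
    id_cols = [col for col in df.columns if col.endswith("_id")]
    return df.drop(columns=id_cols)


# Toy frame with made-up column names (not the real Berka schema).
toy = pd.DataFrame(
    {
        "trans_id": [1, 2],
        "account_id": [10, 11],
        "amount": [100.0, 250.5],
        "type": ["credit", "debit"],
    }
)
cleaned = drop_id_columns(toy)
print(list(cleaned.columns))
```

The same call would be applied to both the training and the test file before they are placed in `RAW_DATA_DIR`.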

Data info files are required for running the scripts. A sample info file for the transaction data is available in `data_info/trans.json`. The paths for the training and test data in the file can be modified as needed.

In [None]:
INFO_DIR = "data_info"

DATA_DIR = "data/"
RAW_DATA_DIR = os.path.join(DATA_DIR, "raw_data")
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, "processed_data")
SYNTH_DATA_DIR = os.path.join(DATA_DIR, "synthetic_data")
DATA_NAME = "trans"

MODEL_PATH = "models/tabsyn"

In [None]:
# process data
process_data(DATA_NAME, INFO_DIR, DATA_DIR)

# review data
df = pd.read_csv(os.path.join(PROCESSED_DATA_DIR, DATA_NAME, "train.csv"))
df.head(10)

In [4]:
# review json file and its contents
with open(f"{PROCESSED_DATA_DIR}/{DATA_NAME}/info.json", "r") as file:
    data_info = json.load(file)
data_info


Note that if you want to use a subset of the entire transaction table, you must still preprocess the full table, retain the main table, and pass it as the reference data to `preprocess` later. This is because the model should have access to all the categories for categorical columns in the data.
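
The toy sketch below illustrates why the reference table matters: an encoding fitted only on a subset can miss categories that exist in the full table. The helper `fit_category_codes` and the example values are hypothetical and only mimic the idea; they do not reproduce what `preprocess` does internally.

```python
def fit_category_codes(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}


# Made-up categorical column of the full table and a subset of it.
full_column = ["credit", "debit", "withdrawal", "credit", "debit"]
subset_column = full_column[:2] + full_column[3:]  # drops the "withdrawal" row

codes_from_subset = fit_category_codes(subset_column)
codes_from_full = fit_category_codes(full_column)

# Categories the subset-only encoding cannot represent.
missing = set(full_column) - set(codes_from_subset)
print("missing from subset encoding:", missing)
print("codes from full table:", codes_from_full)
```

Fitting the codes on the full table keeps the category set fixed, regardless of which rows end up in the subset.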

A sample data info file for the full table is available in `data_info/trans_all.json`. The paths for the training and test data in the file can be modified as needed.

In [None]:
DATA_DIR_ALL = "all_data/"
RAW_DATA_DIR_ALL = os.path.join(DATA_DIR_ALL, "raw_data")
PROCESSED_DATA_DIR_ALL = os.path.join(DATA_DIR_ALL, "processed_data")
DATA_NAME_ALL = "trans_all"

process_data(DATA_NAME_ALL, INFO_DIR, DATA_DIR_ALL)

REF_DATA_PATH = os.path.join(PROCESSED_DATA_DIR_ALL, DATA_NAME_ALL)

# TabSyn Algorithm

In this section, we will describe the design of TabSyn as well as its main hyperparameters loaded through config, which affect the model’s effectiveness. 

**TabSyn** consists of two parts:
1. A *variational auto-encoder (VAE)* which learns a joint representation space for the given tabular data.
2. A *diffusion model* which learns the distribution of data in the joint representation space.

The figure below shows a diagram of the TabSyn model.

<p align="center">
<img src="https://github.com/user-attachments/assets/a7e6a218-dd8e-4ae8-a8e5-6fc3974b2e9b" width="1000"/>
</p>

**VAE**

The left side of the figure shows the VAE, which operates in the original data space. The VAE itself consists of two parts: an encoder and a decoder. It also contains the corresponding tokenizer and detokenizer.
Each row of the input tabular data ($\pmb{x}$) is tokenized, then embedded by a transformer. Another transformer decodes the embeddings and a detokenizer reconstructs the table ($\pmb{\tilde{x}}$). The VAE is trained by minimizing the reconstruction loss between $\pmb{x}$ and $\pmb{\tilde{x}}$.

After the VAE is fully trained, the whole data ($\pmb{x}$) is tokenized and embedded. The embedding of each row is flattened to form a 1-dimensional vector $\pmb{z}$.
These 1-dimensional embeddings for all rows are stored on disk, and will later be used to train the diffusion model.
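
The flattening step can be pictured with a small NumPy sketch. We assume here that the encoder output for a batch has shape `(num_rows, num_tokens, token_dim)`; the sizes and variable names are illustrative and are not read from the TabSyn code.

```python
import numpy as np

# Illustrative sizes only: 4 rows, 6 tokens per row, 3 dimensions per token.
num_rows, num_tokens, token_dim = 4, 6, 3

# Stand-in for the per-token embeddings produced by a trained encoder.
token_embeddings = np.arange(
    num_rows * num_tokens * token_dim, dtype=np.float32
).reshape(num_rows, num_tokens, token_dim)

# Flatten each row's token embeddings into one 1-dimensional vector z.
z = token_embeddings.reshape(num_rows, num_tokens * token_dim)

# The flattening loses nothing: knowing token_dim lets us restore the tokens.
restored = z.reshape(num_rows, num_tokens, token_dim)
print(z.shape, restored.shape)
```

Because the reshape is reversible once `token_dim` is known, sampled latent vectors can be unflattened again before decoding.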

**Diffusion**

The right side of the figure shows the diffusion model, which operates in the latent representation space; in other words, it only *sees* the embeddings obtained by the VAE, not the original tabular data.
The diffusion model can be similarly divided into two parts: a forward process, and a reverse process.

The forward process receives the embedded data points. A single data point is denoted by $\pmb{z_0}$ in the figure. During the forward process, Gaussian noise is added to the embeddings in many small steps. The number of steps is denoted by $T$ in the figure. $T$ should be high enough that the distribution of embeddings at step $t=T$ is essentially a standard Gaussian distribution; in other words, the signal-to-noise ratio is practically zero.
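
A rough NumPy sketch of the forward process follows. It assumes a variance-exploding formulation in which the noised embedding is $\pmb{z_t} = \pmb{z_0} + \sigma_t \pmb{\epsilon}$, which is the closed form of accumulating many independent Gaussian increments. The noise levels, array sizes, and the helper name `add_noise` are illustrative assumptions, not values from the TabSyn config.

```python
import numpy as np


def add_noise(z0, sigma, rng):
    """Perturb clean embeddings z0 with Gaussian noise of standard deviation sigma."""
    return z0 + sigma * rng.standard_normal(z0.shape)


rng = np.random.default_rng(0)

# Stand-in for VAE embeddings: 2000 rows with 8 latent dimensions each.
z0 = rng.normal(loc=1.0, scale=0.5, size=(2000, 8))

# Increasing noise levels; the last one plays the role of step t = T.
sigmas = np.geomspace(0.01, 80.0, num=10)

for sigma in sigmas:
    z_t = add_noise(z0, sigma, rng)
    snr = z0.var() / sigma**2  # signal variance over noise variance
    print(f"sigma={sigma:8.3f}  std(z_t)={z_t.std():8.3f}  SNR={snr:.6f}")

# At the largest noise level the sample is dominated by noise.
final = add_noise(z0, sigmas[-1], rng)
final_snr = z0.var() / sigmas[-1] ** 2
```

The printed signal-to-noise ratio shrinks as the noise level grows, which is the sense in which the original embedding is gradually erased.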

The reverse process, on the other hand, learns to *predict* an earlier-step embedding (e.g. $\pmb{z_{t-\Delta t}}$) from a later-step embedding (e.g. $\pmb{z_t}$) via a neural network.

After the diffusion model is fully trained, the reverse process can estimate the data distribution at step $t=0$ if it receives a standard Gaussian distribution at step $t=T$. New data points can be synthesized by sampling from this estimated distribution.
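
To make the reverse process concrete, here is a toy sketch in one dimension. It assumes the data distribution is a single Gaussian $N(\mu, s^2)$, so the score of every noised distribution is known in closed form and can stand in for the neural network. It also assumes a variance-exploding schedule with $\sigma(t) = t$ and plain Euler steps on the probability-flow ODE. These choices, the function names, and all constants are illustrative and are not taken from the TabSyn code.

```python
import numpy as np


def analytic_score(z, sigma, mu, s):
    """Score of N(mu, s^2 + sigma^2), the noised marginal of the toy Gaussian data."""
    return -(z - mu) / (s**2 + sigma**2)


def reverse_sample(num_samples, mu=2.0, s=0.5, sigma_max=80.0,
                   sigma_min=0.002, num_steps=200, seed=0):
    """Integrate the probability-flow ODE from pure noise down to sigma = 0."""
    rng = np.random.default_rng(seed)

    # Decreasing noise levels, log-spaced, with a final step to exactly zero.
    sigmas = np.append(np.geomspace(sigma_max, sigma_min, num_steps), 0.0)

    # Start from Gaussian noise at the largest noise level.
    z = rng.normal(0.0, sigma_max, size=num_samples)

    for sigma_cur, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # With sigma(t) = t the ODE reads dz/dsigma = -sigma * score(z, sigma).
        drift = -sigma_cur * analytic_score(z, sigma_cur, mu, s)
        z = z + (sigma_next - sigma_cur) * drift  # Euler step toward less noise
    return z


samples = reverse_sample(5000)
print(f"sample mean={samples.mean():.3f}, sample std={samples.std():.3f}")
```

In a trained model the closed-form score is replaced by the network's prediction; the stepping logic from noise toward data stays the same in spirit.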


## Load Config

In this section, we will load the configuration file that contains the hyperparameters for the TabSyn model. 

In [None]:
config_path = os.path.join("src/configs", f"{DATA_NAME}.toml")
raw_config = load_config(config_path)

pprint(raw_config)

The configuration file is a TOML file that contains the following hyperparameters:

1. **model_params:** specifies the structure of the transformers (both encoder and decoder) in the VAE model, including the number of transformer layers, the number of self-attention heads, and the token dimension.

2. **transforms:** specifies the transformations and preprocessing of the data before tokenization, such as cleaning, normalization, and encoding.
    - For numerical features, we apply a Gaussian quantile transformation and replace NaN values with the mean of the corresponding column.
    - For categorical features, we use one-hot encoding. NaN values are left unchanged by default, but can optionally be replaced. Values that appear less often than a given minimum frequency in a column can optionally be dropped, and an extra encoding step can optionally be applied to categorical features during tokenization.

3. **train.vae:** specifies training parameters of the VAE, including batch size, number of epochs, and number of dataset workers.

4. **train.diffusion:** specifies the same training parameters as above for the diffusion model.

5. **train.optim.vae:** specifies the parameters of the *Adam* optimizer and the `ReduceLROnPlateau` learning rate scheduler used to train the VAE. Optimizer parameters include the initial learning rate and weight decay. LR scheduler parameters include `factor` and `patience`.

6. **train.optim.diffusion:** specifies the same parameters as above for the diffusion model.

7. **loss_params:** specifies parameters of the loss function used to train the VAE including `max_beta`, `min_beta` and `lambd`.

$\beta$ is the coefficient of the KL divergence term in the VAE loss formula,

$\mathcal{L}_{vae} = \mathcal{L}_{mse} + \mathcal{L}_{ce} + \beta \mathcal{L}_{kl}.$

Parameters `max_beta` and `min_beta` determine the range of $\beta$. $\beta$ is first set to `max_beta`. If the loss stops decreasing for a certain number of epochs (e.g. $10$ epochs), then at the end of each subsequent epoch (e.g. epoch $11$, $12$, etc.) $\beta$ is multiplied by `lambd`,
$\beta_{new} = \lambda \beta_{curr}$,
until it reaches `min_beta`.
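
The $\beta$ schedule described above can be sketched in a few lines of plain Python. The helper names `update_beta` and `vae_loss`, the numeric values, and the exact point at which the first decay happens (here, the first epoch after `patience` stale epochs) are assumptions chosen to follow the description; the training loop in the library may differ in detail.

```python
def update_beta(beta, stale_epochs, patience, lambd, min_beta):
    """Multiply beta by lambd once the loss has stalled for more than
    `patience` epochs, never going below min_beta."""
    if stale_epochs > patience:
        beta = max(beta * lambd, min_beta)
    return beta


def vae_loss(mse, ce, kl, beta):
    """Total VAE loss: reconstruction terms plus the beta-weighted KL term."""
    return mse + ce + beta * kl


# Illustrative values, not read from the config file.
max_beta, min_beta, lambd, patience = 1e-2, 1e-5, 0.7, 10

# Made-up loss curve: improves for three epochs, then plateaus.
losses = [1.0, 0.8, 0.7] + [0.7] * 40

beta, best_loss, stale_epochs = max_beta, float("inf"), 0
history = []
for loss in losses:
    if loss < best_loss:
        best_loss, stale_epochs = loss, 0
    else:
        stale_epochs += 1
    beta = update_beta(beta, stale_epochs, patience, lambd, min_beta)
    history.append(beta)

print(history[12], history[13], history[-1])
```

Lowering $\beta$ over time shifts the emphasis from the KL term toward reconstruction once training has stalled.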


## Make Dataset

In this section, we pre-process the data and make a dataset object.

First, we specify the transformations needed for the dataset, such as normalization and cleaning, in `transforms`. Next, using the `preprocess` function, we load the data from disk into arrays that contain both training and test data (`X_num` and `X_cat`), as well as the number of categories for each categorical feature (`categories`) and the number of numerical features (`d_numerical`).
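
As a rough illustration of the two numerical transformations named in the config section (mean imputation of NaN values and a Gaussian quantile transformation), the sketch below implements both with NumPy and the standard library. It imputes with each column's mean and uses a simple rank-based mapping to normal quantiles. The function names and the toy array are assumptions, and the actual `preprocess` function may rely on a different implementation.

```python
from statistics import NormalDist

import numpy as np


def impute_column_mean(x):
    """Replace NaNs in each column with that column's mean over observed values."""
    x = np.array(x, dtype=float)  # copy so the input stays untouched
    col_means = np.nanmean(x, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(x))
    x[nan_rows, nan_cols] = col_means[nan_cols]
    return x


def gaussian_quantile_transform(column):
    """Map a 1-D array to roughly standard-normal values via its empirical ranks."""
    ranks = np.argsort(np.argsort(column))   # 0 .. n-1, ties broken by position
    quantiles = (ranks + 0.5) / len(column)  # strictly inside (0, 1)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(float(q)) for q in quantiles])


# Toy numerical features with missing values and very different scales.
raw = np.array(
    [[1.0, 200.0], [np.nan, 50.0], [3.0, np.nan], [10.0, 5000.0], [2.0, 75.0]]
)
filled = impute_column_mean(raw)
transformed = np.column_stack(
    [gaussian_quantile_transform(filled[:, j]) for j in range(filled.shape[1])]
)
print(transformed.round(3))
```

After the transformation both columns live on a comparable, roughly Gaussian scale even though the raw values differ by orders of magnitude.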

We then separate the train and test data into different arrays and convert them to PyTorch tensors.
We create a dataset object (`TabularDataset`) with the train data. `TabularDataset` is a simple module which returns the tokens of a single row at a time. Each row constitutes a single data sample in TabSyn. Afterwards, we create a `DataLoader` for the train data using the `batch_size` and `num_workers` specified in config.

In contrast, we keep the test data as tensors (`X_test_num` and `X_test_cat`). If a GPU is available, we move these tensors to GPU so that they can be accessed by the model later on.
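
For intuition, a dataset that returns one row at a time only needs `__len__` and `__getitem__`, which is the interface PyTorch expects from a map-style dataset. The class `ToyTabularDataset` and the helper `iterate_batches` below are hypothetical stand-ins written with NumPy; they are not the library's `TabularDataset` or PyTorch's `DataLoader`.

```python
import numpy as np


class ToyTabularDataset:
    """Hypothetical stand-in for a map-style dataset: one table row per item."""

    def __init__(self, x_num, x_cat):
        if len(x_num) != len(x_cat):
            raise ValueError("numerical and categorical parts need equal row counts")
        self.x_num = x_num
        self.x_cat = x_cat

    def __len__(self):
        return len(self.x_num)

    def __getitem__(self, index):
        # A single sample is the numerical and categorical part of one row.
        return self.x_num[index], self.x_cat[index]


def iterate_batches(dataset, batch_size, rng):
    """Yield shuffled mini-batches, mimicking what a data loader does."""
    order = rng.permutation(len(dataset))
    for start in range(0, len(order), batch_size):
        rows = [dataset[i] for i in order[start:start + batch_size]]
        yield np.stack([r[0] for r in rows]), np.stack([r[1] for r in rows])


# Toy data: 5 rows, 2 numerical features and 1 categorical feature.
toy_dataset = ToyTabularDataset(
    np.arange(10.0).reshape(5, 2), np.arange(5).reshape(5, 1)
)
batches = list(iterate_batches(toy_dataset, 2, np.random.default_rng(0)))
print([batch_num.shape for batch_num, _ in batches])
```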

In [None]:
# preprocess data
X_num, X_cat, categories, d_numerical = preprocess(
    os.path.join(PROCESSED_DATA_DIR, DATA_NAME),
    ref_dataset_path=REF_DATA_PATH,
    transforms=raw_config["transforms"],
    task_type=raw_config["task_type"],
)

# separate train and test data
X_train_num, X_test_num = X_num
X_train_cat, X_test_cat = X_cat

# convert to float tensor
X_train_num, X_test_num = (
    torch.tensor(X_train_num).float(),
    torch.tensor(X_test_num).float(),
)
X_train_cat, X_test_cat = torch.tensor(X_train_cat), torch.tensor(X_test_cat)

# create dataset module
train_data = TabularDataset(X_train_num.float(), X_train_cat)

# move test data to gpu if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
X_test_num = X_test_num.float().to(device)
X_test_cat = X_test_cat.to(device)

# create train dataloader
train_loader = DataLoader(
    train_data,
    batch_size=raw_config["train"]["vae"]["batch_size"],
    shuffle=True,
    num_workers=raw_config["train"]["vae"]["num_dataset_workers"],
)

## Instantiate Model

Next, we instantiate the model using the `TabSyn` class. The `TabSyn` class takes the following arguments:

1. `train_loader`: dataloader for train data.
2. `X_test_num`: numerical features of the test data.
3. `X_test_cat`: categorical features of the test data.
4. `num_numerical_features`: number of numerical features in the dataset.
5. `num_classes`: number of classes (i.e. categories) of each categorical feature in the dataset.
6. `device`: the device on which the model and data exist, either "cpu" or "cuda".

In [7]:
tabsyn = TabSyn(
    train_loader,
    X_test_num,
    X_test_cat,
    num_numerical_features=d_numerical,
    num_classes=categories,
    device=device,
)

The `TabSyn` class has the tools to instantiate the VAE and diffusion models, train both, and sample from the trained diffusion model.
We will demonstrate how to use these tools in the following sections.

## Train Model


The VAE and the diffusion model are trained independently. The following subsections explain each training process.


### A. Train VAE

First, we need to instantiate the VAE using the `instantiate_vae` method. This method takes the VAE model hyperparameters, optimizer and lr scheduler parameters from config, and instantiates them.

In [None]:
# instantiate VAE model for training
tabsyn.instantiate_vae(
    **raw_config["model_params"], optim_params=raw_config["train"]["optim"]["vae"]
)

Now that we have instantiated the VAE, we can train it using the `train_vae` function.
This function receives the loss hyperparameters and number of epochs from the config.
Moreover, it receives `save_path`, which is the directory where trained model checkpoints will be saved.

In [None]:
os.makedirs(f"{MODEL_PATH}/{DATA_NAME}/vae", exist_ok=True)
tabsyn.train_vae(
    **raw_config["loss_params"],
    num_epochs=raw_config["train"]["vae"]["num_epochs"],
    save_path=os.path.join(MODEL_PATH, DATA_NAME, "vae"),
)

After training the VAE, we embed the training data with the trained encoder and store the embeddings in the directory specified by `vae_ckpt_dir`.

In [None]:
# embed all inputs in the latent space
tabsyn.save_vae_embeddings(
    X_train_num, X_train_cat, vae_ckpt_dir=os.path.join(MODEL_PATH, DATA_NAME, "vae")
)

### B. Train Diffusion Model

Now that we have stored the training data embeddings, we need to load and prepare them for the diffusion model.
We load the embeddings using `load_latent_embeddings`. We then center the embeddings by subtracting their mean and divide them by a constant factor of 2; the standard deviation is computed but not used for scaling. Finally, we create a `DataLoader` with the `batch_size` and `num_workers` specified in the config.

In [10]:
# load latent space embeddings
train_z, token_dim = tabsyn.load_latent_embeddings(
    os.path.join(MODEL_PATH, DATA_NAME, "vae")
)  # train_z dim: B x in_dim

# center embeddings and scale by a fixed factor of 2 (std is not used for scaling)
mean, std = train_z.mean(0), train_z.std(0)
latent_train_data = (train_z - mean) / 2

# create data loader
latent_train_loader = DataLoader(
    latent_train_data,
    batch_size=raw_config["train"]["diffusion"]["batch_size"],
    shuffle=True,
    num_workers=raw_config["train"]["diffusion"]["num_dataset_workers"],
)
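
The centering and fixed scaling in the cell above is an affine map, so it can be undone exactly as long as the same mean and scale are kept. The NumPy sketch below shows the round trip on toy data. The names `SCALE`, `normalize_latents`, and `denormalize_latents` are hypothetical, and the idea that sampling applies this exact inverse is an assumption, suggested only by the fact that the mean of the embeddings is passed to `sample` later in this notebook.

```python
import numpy as np

SCALE = 2.0  # fixed divisor, matching the "/ 2" in the cell above


def normalize_latents(z, mean):
    """Center the embeddings and divide by the fixed scale."""
    return (z - mean) / SCALE


def denormalize_latents(z_norm, mean):
    """Invert normalize_latents: rescale and add the mean back."""
    return z_norm * SCALE + mean


rng = np.random.default_rng(0)

# Toy stand-in for train_z: 100 rows with 4 latent dimensions.
train_z_toy = rng.normal(3.0, 1.5, size=(100, 4))
mean_toy = train_z_toy.mean(axis=0)

z_norm = normalize_latents(train_z_toy, mean_toy)
z_back = denormalize_latents(z_norm, mean_toy)
print(np.abs(z_back - train_z_toy).max())
```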

Now that the data is ready, we instantiate the diffusion model with `instantiate_diffusion`. The input dimension and hidden dimension of the diffusion model are both set to the dimension of the embeddings.
Moreover, we instantiate the optimizer and lr scheduler using hyperparameters from config.

In [None]:
# instantiate diffusion model for training
tabsyn.instantiate_diffusion(
    in_dim=train_z.shape[1],
    hid_dim=train_z.shape[1],
    optim_params=raw_config["train"]["optim"]["diffusion"],
)

We train the diffusion model with the `train_diffusion` function.
This function takes the following arguments:
1. `latent_train_loader`: dataloader for the latent representations which are used to train the diffusion model.
2. `num_epochs`: number of training epochs.
3. `ckpt_path`: directory where the model checkpoints will be stored.

In [None]:
os.makedirs(f"{MODEL_PATH}/{DATA_NAME}", exist_ok=True)
# train diffusion model
tabsyn.train_diffusion(
    latent_train_loader,
    num_epochs=raw_config["train"]["diffusion"]["num_epochs"],
    ckpt_path=os.path.join(MODEL_PATH, DATA_NAME),
)

## Load Pretrained Model

Instead of training the model from scratch, we can load the weights of a pre-trained model from a given checkpoint with the `load_model_state` function.
If we haven't instantiated the VAE and diffusion model beforehand, we need to instantiate them first using the `instantiate_vae` and `instantiate_diffusion` methods.

In [None]:
latent_embeddings_path = os.path.join(MODEL_PATH, DATA_NAME, "vae")
pretrained_model_path = os.path.join(MODEL_PATH, DATA_NAME)

# instantiate VAE model
tabsyn.instantiate_vae(**raw_config["model_params"], optim_params=None)

# load latent embeddings of input data
train_z, token_dim = tabsyn.load_latent_embeddings(latent_embeddings_path)

# instantiate diffusion model
tabsyn.instantiate_diffusion(
    in_dim=train_z.shape[1], hid_dim=train_z.shape[1], optim_params=None
)

# load state from checkpoint
tabsyn.load_model_state(ckpt_dir=pretrained_model_path, dif_ckpt_name="model.pt")

## Sample Data

Now that the model is trained, we can generate synthetic data starting from complete noise with the `sample` function. Its inputs are as follows:

1. `num_samples`: number of synthetic rows to generate; here, the number of rows in the training data.
2. `in_dim`: dimension of the latent embeddings.
3. `mean_input_emb`: mean of the latent embeddings of the training data.
4. `info`: info about the data from the json file we reviewed at the beginning of this notebook.
5. `num_inverse`: detokenizer for numerical features.
6. `cat_inverse`: detokenizer for categorical features.
7. `save_path`: file path where the synthetic table will be saved.

In [None]:
# load data info file
with open(os.path.join(PROCESSED_DATA_DIR, DATA_NAME, "info.json"), "r") as file:
    data_info = json.load(file)
data_info["token_dim"] = token_dim

# get inverse tokenizers
_, _, categories, d_numerical, num_inverse, cat_inverse = preprocess(
    os.path.join(PROCESSED_DATA_DIR, DATA_NAME),
    ref_dataset_path=REF_DATA_PATH,
    transforms=raw_config["transforms"],
    task_type=raw_config["task_type"],
    inverse=True,
)

os.makedirs(os.path.join(SYNTH_DATA_DIR, DATA_NAME), exist_ok=True)

# sample data
num_samples = train_z.shape[0]
in_dim = train_z.shape[1]
mean_input_emb = train_z.mean(0)
tabsyn.sample(
    num_samples,
    in_dim,
    mean_input_emb,
    info=data_info,
    num_inverse=num_inverse,
    cat_inverse=cat_inverse,
    save_path=os.path.join(SYNTH_DATA_DIR, DATA_NAME, "tabsyn.csv"),
)

## Review Synthetic Data

Finally, we review the synthesized data. In the follow-up notebook, `evaluate_synthetic_data.ipynb`, we evaluate this synthesized data with respect to various metrics.

In [None]:
df = pd.read_csv(os.path.join(SYNTH_DATA_DIR, DATA_NAME, "tabsyn.csv"))
df.head(10)
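
Beyond looking at the first rows, a quick sanity check is to put summary statistics of real and synthetic numerical columns side by side. The helper `compare_numeric_summary` and the toy tables below are hypothetical and only sketch the idea; a proper evaluation is left to the evaluation notebook.

```python
import pandas as pd


def compare_numeric_summary(real: pd.DataFrame, synth: pd.DataFrame) -> pd.DataFrame:
    """Side-by-side mean and std for numeric columns present in both tables."""
    shared = [
        col for col in real.select_dtypes(include="number").columns
        if col in synth.columns
    ]
    return pd.DataFrame(
        {
            "real_mean": real[shared].mean(),
            "synth_mean": synth[shared].mean(),
            "real_std": real[shared].std(),
            "synth_std": synth[shared].std(),
        }
    )


# Toy tables with made-up values standing in for the real and synthetic data.
real_toy = pd.DataFrame({"amount": [1.0, 2.0, 3.0], "type": ["a", "b", "a"]})
synth_toy = pd.DataFrame({"amount": [2.0, 3.0, 4.0], "type": ["a", "a", "b"]})

summary = compare_numeric_summary(real_toy, synth_toy)
print(summary)
```

Large gaps between the real and synthetic columns of such a table are a hint that training or preprocessing deserves a closer look.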

## References

**Zhang, Hengrui, et al.** "Mixed-type tabular data synthesis with score-based diffusion in latent space." *International Conference on Learning Representations (ICLR)* (2024).

**GitHub Repository:** [Amazon Science - Tabsyn](https://github.com/amazon-science/tabsyn)