# TabDDPM for Adult dataset

This notebook needs a GPU runtime: `Runtime` -> `Change runtime type` -> `T4 GPU` -> `Save`.

Check GPU availability using `nvidia-smi`:

In [None]:
!nvidia-smi

Thu Mar 28 02:52:05 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
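
The same check can be done from Python so that later cells can fail early when no GPU is attached. The following is a minimal sketch, not part of the original notebook: the helper name `list_gpus` is hypothetical, and it only relies on the `nvidia-smi` query flags `--query-gpu` and `--format`.

```python
import shutil
import subprocess


def list_gpus() -> list:
    """Return one description string per visible NVIDIA GPU (empty if none)."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools are not installed, so this is not a GPU runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []  # nvidia-smi exists but cannot reach a GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No GPU detected: switch the runtime type before training.")
```

An empty list means the runtime type should be changed before running the training cell.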
                                                                    

## Configure environment and dependencies

Set up the environment as described in the [TabDDPM GitHub repository](https://github.com/yandex-research/tab-ddpm#setup-the-environment). If you are prompted to restart your session or runtime, select "cancel" to continue execution.

In [None]:
%%capture

# Takes about 3 minutes

# Change the Python version of Colab to 3.9.x
!apt-get update -y
!apt-get install python3.9 python3.9-distutils
!update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.9 1
!update-alternatives --config python3
!apt-get install python3-pip
!python3 -m pip install --upgrade pip --user


# Download code repo, data and configuration files
!git clone https://github.com/yandex-research/tab-ddpm.git
%cd tab-ddpm

!wget "https://www.dropbox.com/s/rpckvcs3vx7j605/data.tar?dl=0" -O data.tar
!tar -xvf data.tar
!rm -rf data.tar

# Set up some environment variables
%env REPO_DIR=.
%env PROJECT_DIR=.
%env PYTHONPATH=/usr/local/bin/python:.


# Install dependencies
!pip install torch==1.12.1+cu102 torchvision==0.13.1+cu102 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu102
!pip install -r requirements.txt

## Tabular diffusion model

For simplicity, we use the predefined train/validation/test split under `data/adult` for modeling and inference.

The training and sampling hyperparameters are in `exp/adult/ddpm_cb_best/config.toml`, which can also be viewed [in the repo](https://github.com/yandex-research/tab-ddpm/blob/main/exp/adult/ddpm_cb_best/config.toml).

You may change the configurations for better performance. For this notebook, let us use the default setup.

### Training

First, we train the model with the default configuration for the Adult dataset. Model checkpoints are saved under `exp/adult/ddpm_cb_best/`. Colab's runtime is not persistent, so everything is lost when the session ends. Copy the checkpoints somewhere persistent to avoid re-training the next time you open the notebook.
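
One way to keep the checkpoints is to copy the whole experiment folder to persistent storage once training has finished. The sketch below is an illustration rather than part of the original notebook: `backup_experiment` is a hypothetical helper, and the Google Drive destination in the comment is an assumed path that only exists after Drive has been mounted.

```python
import shutil
from pathlib import Path


def backup_experiment(src_dir: str, dst_dir: str) -> list:
    """Copy src_dir (checkpoints, config, samples) to dst_dir; return the copied file names."""
    src, dst = Path(src_dir), Path(dst_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"Nothing to back up: {src} does not exist")
    # dirs_exist_ok lets a later backup overwrite an earlier one (Python 3.8+).
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sorted(p.name for p in dst.iterdir() if p.is_file())


# Assumed usage in Colab, after mounting Google Drive with
#   from google.colab import drive; drive.mount("/content/drive")
# backup_experiment("exp/adult/ddpm_cb_best", "/content/drive/MyDrive/tabddpm_adult")
```

Calling the same helper with the two arguments swapped restores the folder in a fresh runtime, so sampling can run without re-training.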

In [None]:
!python scripts/pipeline.py --config exp/adult/ddpm_cb_best/config.toml --train

[ 9 16  7 15  6  5  2 42]
108
{'num_classes': 2, 'is_y_cond': True, 'rtdl_params': {'d_layers': [256, 1024, 1024, 1024, 1024, 256], 'dropout': 0.0}, 'd_in': 108}
mlp
Step 500/30000 MLoss: 1.1682 GLoss: 0.9567 Sum: 2.1249
Step 1000/30000 MLoss: 1.0481 GLoss: 0.7786 Sum: 1.8267
Step 1500/30000 MLoss: 0.9865 GLoss: 0.6422 Sum: 1.6287
Step 2000/30000 MLoss: 0.9672 GLoss: 0.6247 Sum: 1.5918999999999999
Step 2500/30000 MLoss: 0.9621 GLoss: 0.618 Sum: 1.5800999999999998
Step 3000/30000 MLoss: 0.9588 GLoss: 0.6126 Sum: 1.5714000000000001
Step 3500/30000 MLoss: 0.9529 GLoss: 0.6073 Sum: 1.5602
Step 4000/30000 MLoss: 0.9487 GLoss: 0.604 Sum: 1.5527
Step 4500/30000 MLoss: 0.9549 GLoss: 0.6083 Sum: 1.5632
Step 5000/30000 MLoss: 0.9417 GLoss: 0.5997 Sum: 1.5413999999999999
Step 5500/30000 MLoss: 0.9575 GLoss: 0.611 Sum: 1.5685
Step 6000/30000 MLoss: 0.9435 GLoss: 0.5822 Sum: 1.5257
Step 6500/30000 MLoss: 0.9436 GLoss: 0.5684 Sum: 1.512
Step 7000/30000 MLoss: 0.9438 GLoss: 0.5833 Sum: 1.5271
Step 75

### Sampling

Sample from the saved checkpoints.

You can change the sampling configuration (e.g. the number of synthetic samples to generate) by editing the hyperparameters in the configuration file `exp/adult/ddpm_cb_best/config.toml`, as illustrated [here](https://github.com/yandex-research/tab-ddpm/blob/b476257dd460b778ba09eb97f7a51d6490fa17f8/exp/adult/ddpm_cb_best/config.toml#L42).
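
As an illustration, the number of samples could be changed programmatically instead of editing the file by hand. This is a sketch under assumptions: it supposes the value lives in a `num_samples` key under a `[sample]` table, and the excerpt below (including the `[train.main]` table name) is illustrative rather than a copy of the real file. The helper `set_num_samples` is hypothetical and does plain line-based editing, so it does not handle every TOML construct.

```python
def set_num_samples(config_text: str, num_samples: int) -> str:
    """Return config_text with `num_samples` under the [sample] table replaced."""
    out, in_sample = [], False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # A new table header starts; remember whether it is [sample].
            in_sample = stripped == "[sample]"
        elif in_sample and stripped.split("=")[0].strip() == "num_samples":
            line = f"num_samples = {num_samples}"
        out.append(line)
    return "\n".join(out) + "\n"


# Illustrative excerpt; the key and table names are assumptions about the config.
excerpt = """\
[train.main]
steps = 30000

[sample]
num_samples = 216000
"""
print(set_num_samples(excerpt, 50000))

# Assumed usage on the real file:
# path = "exp/adult/ddpm_cb_best/config.toml"
# text = open(path).read()
# open(path, "w").write(set_num_samples(text, 50000))
```

Keys outside the `[sample]` table are left untouched, so the training settings stay as they were.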

To sample from the trained model, run the pipeline with the `--sample` flag.

In [None]:
!python scripts/pipeline.py --config exp/adult/ddpm_cb_best/config.toml --sample

mlp
Sample timestep    0
Discrete cols: [2]
Num shape:  (216000, 6)
Elapsed time: 0:01:01


## Check synthetic data


The generated synthetic samples can be found under `exp/adult/ddpm_cb_best/`.

Let's process and prepare them as a regular dataframe.

In [None]:
import numpy as np
import pandas as pd

import os

true_data_dir = "./data/adult/"
syn_data_dir = "./exp/adult/ddpm_cb_best"

# Columns of the Adult dataset

num_features_list = [
    "age",
    "fnlwgt",
    "educationl-num",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
]

cat_features_list = [
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "gender",
    "native-country",
]

y_feature = "income"  # <= 50K or > 50K

is_y_cat = True

names_dict = {
    "num_features_list": num_features_list,
    "cat_features_list": cat_features_list,
    "y_feature": y_feature,
    "is_y_cat": is_y_cat,
}


In [None]:
def concat_data(
    data_path: str,
    split: str = "train",
    num_features_list: list = None,
    cat_features_list: list = None,
    y_feature: str = "y",
    is_y_cat: bool = False,
    **kwargs,
) -> pd.DataFrame:
    """
    Aggregate the features and the response of one data split into a dataframe.

    - data_path: folder containing the arrays, named as in the sampling part of
      the TabDDPM repo. For example, the folder may contain:
        - y_{split}.npy
        - X_num_{split}.npy
        - X_cat_{split}.npy
    - split: which split to load; one of "train", "val" or "test".
    - num_features_list: names of the numerical features
      (defaults to num_0, num_1, ... when None).
    - cat_features_list: names of the categorical features
      (defaults to cat_0, cat_1, ... when None).
    - y_feature: name of the response.
    - is_y_cat: whether the response is categorical; determines the dtype of
      the response column.

    Returns a dataframe with columns [y_feature] + num_features_list +
    cat_features_list, in the original scale.
    """
    assert split in ["train", "val", "test"], "split should be one of train/val/test"
    concat_list, col_names = [], []

    # response
    y_test_syn = np.load(os.path.join(data_path, f"y_{split}.npy"))

    concat_list.append(y_test_syn[:, None])
    col_names += [y_feature]

    X_num_path = os.path.join(data_path, f"X_num_{split}.npy")
    if os.path.exists(X_num_path):
        X_num_test_syn = np.load(X_num_path)

        concat_list.append(X_num_test_syn)
        if num_features_list is not None:
            assert len(num_features_list) == X_num_test_syn.shape[1]
        else:
            num_features_list = [f"num_{i}" for i in range(X_num_test_syn.shape[1])]
        col_names += num_features_list

    X_cat_path = os.path.join(data_path, f"X_cat_{split}.npy")
    if os.path.exists(X_cat_path):
        X_cat_test_syn = np.load(X_cat_path, allow_pickle=True)

        concat_list.append(X_cat_test_syn)
        if cat_features_list is not None:
            assert len(cat_features_list) == X_cat_test_syn.shape[1]
        else:
            cat_features_list = [f"cat_{i}" for i in range(X_cat_test_syn.shape[1])]
        col_names += cat_features_list
    else:
        # for cat_list created later
        cat_features_list = []

    temp_df = pd.DataFrame(
        np.concatenate(concat_list, axis=1),
        columns=col_names,
    )

    # Columns to cast to "category"; the response is included when it is categorical.
    cat_list = cat_features_list + [y_feature] if is_y_cat else cat_features_list

    new_types = {
        col_name: "category" if col_name in cat_list else "float"
        for col_name in col_names
    }

    temp_df = temp_df.astype(new_types)

    return temp_df


In [None]:
# The generated synthetic data
syn_df = concat_data(syn_data_dir, split="train", **names_dict)
syn_df

Unnamed: 0,income,age,fnlwgt,educationl-num,capital-gain,capital-loss,hours-per-week,workclass,education,marital-status,occupation,relationship,race,gender,native-country
0,1,53.0,160117.953729,9.0,0.0,0.0,50.0,Private,HS-grad,Married-civ-spouse,Sales,Husband,White,Male,United-States
1,1,45.0,37308.084980,13.0,7688.0,0.0,52.0,Self-emp-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States
2,0,20.0,170092.997220,10.0,0.0,0.0,20.0,Private,Some-college,Never-married,Other-service,Own-child,White,Female,United-States
3,0,36.0,86633.394992,9.0,0.0,0.0,40.0,Private,HS-grad,Separated,Machine-op-inspct,Not-in-family,White,Male,United-States
4,1,50.0,47226.330627,14.0,15024.0,0.0,48.0,Private,Masters,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
215995,0,29.0,136387.776687,13.0,0.0,0.0,40.0,Private,Bachelors,Never-married,Prof-specialty,Own-child,White,Female,United-States
215996,0,29.0,191459.068134,9.0,0.0,0.0,45.0,Private,HS-grad,Married-civ-spouse,Adm-clerical,Husband,White,Male,United-States
215997,1,28.0,175256.149196,14.0,0.0,0.0,42.0,Private,Masters,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States
215998,0,45.0,223211.886047,9.0,0.0,0.0,40.0,,HS-grad,Divorced,,Unmarried,Black,Female,United-States


In [None]:
# The original training data
true_df_train = concat_data(true_data_dir, split="train", **names_dict)
true_df_train

Unnamed: 0,income,age,fnlwgt,educationl-num,capital-gain,capital-loss,hours-per-week,workclass,education,marital-status,occupation,relationship,race,gender,native-country
0,0,32.0,228265.0,9.0,0.0,0.0,30.0,Private,HS-grad,Never-married,Handlers-cleaners,Own-child,White,Female,United-States
1,0,21.0,89154.0,7.0,0.0,0.0,42.0,Private,11th,Never-married,Other-service,Own-child,White,Male,El-Salvador
2,0,33.0,43716.0,10.0,0.0,0.0,4.0,State-gov,Some-college,Married-civ-spouse,Craft-repair,Husband,White,Male,United-States
3,0,43.0,81243.0,13.0,0.0,1876.0,40.0,Private,Bachelors,Divorced,Tech-support,Not-in-family,White,Male,United-States
4,1,50.0,155574.0,14.0,7298.0,0.0,50.0,Self-emp-inc,Masters,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26043,0,18.0,98667.0,7.0,0.0,0.0,16.0,Private,11th,Never-married,Other-service,Own-child,White,Female,United-States
26044,0,26.0,170786.0,10.0,0.0,0.0,40.0,Private,Some-college,Married-civ-spouse,Transport-moving,Husband,White,Male,United-States
26045,0,49.0,75673.0,9.0,0.0,0.0,40.0,Private,HS-grad,Never-married,Adm-clerical,Own-child,Asian-Pac-Islander,Female,United-States
26046,0,48.0,95661.0,10.0,0.0,0.0,43.0,Private,Some-college,Never-married,Adm-clerical,Not-in-family,White,Female,United-States
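
A quick way to check the synthetic data against the original is to compare simple marginal statistics. The sketch below is one possible check, not the notebook's own method: the helper names are hypothetical, and two toy frames stand in for `true_df_train` and `syn_df` so that the block runs on its own.

```python
import pandas as pd


def compare_numeric_summary(real_df, syn_df, columns):
    """Side-by-side mean and standard deviation of the given numeric columns."""
    return pd.DataFrame(
        {
            "real_mean": real_df[columns].mean(),
            "syn_mean": syn_df[columns].mean(),
            "real_std": real_df[columns].std(),
            "syn_std": syn_df[columns].std(),
        }
    )


def compare_category_shares(real_df, syn_df, column):
    """Relative frequency of each category in the real and the synthetic data."""
    shares = pd.DataFrame(
        {
            "real": real_df[column].value_counts(normalize=True),
            "syn": syn_df[column].value_counts(normalize=True),
        }
    )
    return shares.fillna(0.0)  # a category absent on one side has share 0


# Toy frames standing in for true_df_train and syn_df.
real_toy = pd.DataFrame({"age": [30.0, 40.0, 50.0], "gender": ["Male", "Female", "Male"]})
syn_toy = pd.DataFrame({"age": [32.0, 38.0, 50.0], "gender": ["Male", "Male", "Male"]})
print(compare_numeric_summary(real_toy, syn_toy, ["age"]))
print(compare_category_shares(real_toy, syn_toy, "gender"))
```

With the notebook's data, the calls would be `compare_numeric_summary(true_df_train, syn_df, num_features_list)` and, for example, `compare_category_shares(true_df_train, syn_df, "workclass")`.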


## Concluding remarks

We suggest saving the synthetic and true dataframes for downstream analysis. That way, you don't have to regenerate the synthetic data every time you start a new Colab runtime (again, it is not persistent).
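
A minimal sketch of saving and reloading a dataframe follows, assuming CSV as the storage format. The helper names `save_frame` and `load_frame` and the file path in the comment are hypothetical. CSV does not store dtypes, so the categorical columns are declared again when reading.

```python
import pandas as pd


def save_frame(df: pd.DataFrame, path: str) -> None:
    """Write df to CSV without the row index."""
    df.to_csv(path, index=False)


def load_frame(path: str, cat_columns: list) -> pd.DataFrame:
    """Read a CSV written by save_frame, parsing cat_columns as categorical."""
    return pd.read_csv(path, dtype={col: "category" for col in cat_columns})


# Assumed usage with the notebook's variables (the path is hypothetical):
# save_frame(syn_df, "/content/drive/MyDrive/adult_syn.csv")
# syn_df = load_frame(
#     "/content/drive/MyDrive/adult_syn.csv", cat_features_list + [y_feature]
# )
```

The columns not listed in `cat_columns` are left to pandas' default type inference, which reads the numerical features back as numbers.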

You may use the synthetic data to do any downstream analysis as suggested in the assignment.