# Tutorial 10: Sequential Synthesis
In this tutorial, we explore the **Sequential Synthesis** approach using
the `syn_seq` plugin in `synthcity`. Sequential synthesis allows us to
model variables one-by-one (column-by-column), using conditional relationships
learned from the real data. The main idea is:
1. Synthesize the first variable (often with sample-without-replacement, "SWR"),
2. Then synthesize the second variable conditioned on the first,
3. And so on for each subsequent variable.
This approach can better preserve complex dependencies among columns than
simple marginal or naive methods.
We'll demonstrate this using the **Adult** (census income) dataset in its categorical encoding,
and inspect the resulting synthetic data.
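
The three steps above can be sketched in plain Python. This is a toy illustration, not the `syn_seq` implementation: it swaps the per-column model (CART by default in `syn_seq`) for simple lookup tables keyed on the previously generated values, and the names `fit_sequential` and `generate_sequential` are hypothetical.

```python
import random
from collections import defaultdict


def fit_sequential(rows, order):
    """Fit one lookup table per column.

    The first column has no parents, so its table holds every observed value.
    Each later column stores its observed values grouped by the values of the
    columns that precede it in `order`.
    """
    models = {}
    for i, col in enumerate(order):
        parents = order[:i]
        table = defaultdict(list)
        for row in rows:
            key = tuple(row[p] for p in parents)
            table[key].append(row[col])
        models[col] = (parents, table)
    return models


def generate_sequential(models, order, count, seed=0):
    """Generate rows column by column, conditioning on what was already drawn."""
    rng = random.Random(seed)
    out = []
    for _ in range(count):
        row = {}
        for col in order:
            parents, table = models[col]
            key = tuple(row[p] for p in parents)
            row[col] = rng.choice(table[key])
        out.append(row)
    return out


real = [
    {"age": "young", "sex": "F", "income": "low"},
    {"age": "young", "sex": "M", "income": "low"},
    {"age": "old", "sex": "F", "income": "high"},
    {"age": "old", "sex": "M", "income": "low"},
]
order = ["age", "sex", "income"]
models = fit_sequential(real, order)
synthetic = generate_sequential(models, order, count=5)
```

Because this toy conditions on the exact values of all earlier columns, every generated row is a copy of some real row. A tree-based model such as CART groups similar rows into leaves instead, which is what lets a sequential synthesizer produce new combinations.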

In [None]:
!pip install synthcity

In [2]:
# stdlib
import sys
import warnings

# synthcity absolute
import synthcity.logger as log
from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import Syn_SeqDataLoader

log.add(sink=sys.stderr, level="INFO")
warnings.filterwarnings("ignore")



## Load Dataset

When we run the dataloader, it automatically prints the order of synthesis and the variable selection matrix. The variable selection matrix indicates which variables are used as predictors when each variable is synthesized.
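
As a rough sketch of what such a matrix encodes, the snippet below builds the default pattern in which every variable is predicted from all variables synthesized before it. The helper `default_predictor_matrix` is hypothetical and only mirrors the lower-triangular layout the dataloader prints; it is not synthcity code.

```python
def default_predictor_matrix(syn_order):
    """Row = variable being synthesized, column = candidate predictor.

    An entry is 1 when the predictor comes earlier in the synthesis order,
    and 0 otherwise.
    """
    position = {col: i for i, col in enumerate(syn_order)}
    return {
        target: {pred: int(position[pred] < position[target]) for pred in syn_order}
        for target in syn_order
    }


order = ["age", "sex", "workclass"]
matrix = default_predictor_matrix(order)
for target in order:
    print(target, [matrix[target][pred] for pred in order])
# age [0, 0, 0]
# sex [1, 0, 0]
# workclass [1, 1, 0]
```

The first variable has an all-zero row because nothing has been synthesized before it.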

In [3]:
# Load the reference data
# Note: preprocessing data with OneHotEncoder or StandardScaler is not needed or recommended. Synthcity handles feature encoding and standardization internally.
from synthcity.utils.datasets.categorical.categorical_adult import CategoricalAdultDataloader

X = CategoricalAdultDataloader().load()

X

Unnamed: 0,age,workclass,fnlwgt,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income>50K
0,23,5,4,12,2,8,3,0,1,2,0,39,0,0
1,34,1,4,12,0,4,2,0,1,0,0,12,0,0
2,22,0,13,8,1,6,3,0,1,0,0,39,0,0
3,37,0,15,6,0,6,2,4,1,0,0,39,0,0
4,12,0,22,12,0,5,0,4,0,0,0,39,12,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,23,0,13,12,1,5,3,0,0,0,0,35,0,0
48838,48,8,20,8,4,14,4,4,1,0,0,39,0,0
48839,22,0,24,12,0,5,2,0,1,0,0,49,0,0
48840,28,0,4,12,1,8,1,1,1,5,0,39,0,0


In [4]:
X["capital-gain"].value_counts()

capital-gain
0     44888
7       803
15      531
3       504
2       473
4       379
5       296
99      244
1       147
6       100
10       91
14       83
8        82
27       58
20       49
13       42
9        36
25       20
34        6
11        4
41        3
18        2
22        1
Name: count, dtype: int64

## Preprocess the data for special values and imbalanced dataset

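In real-world datasets, a numeric column is often dominated by a single special value. Here, `capital-gain` and `capital-loss` are 0 in more than 90% of rows. The preprocessor adds a categorical indicator column (`<column>_cat`) next to each such column, marking every row as either the special value or `NUMERIC`.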

In [5]:
# synthcity absolute
from synthcity.plugins.core.models.syn_seq.syn_seq_preprocess import SynSeqPreprocessor

# 1) Create the preprocessor instance and configure it
prep = SynSeqPreprocessor(
    user_dtypes={
        "workclass": "category",
        "occupation": "category",
        "relationship": "category",
        "race": "category",
        "native-country": "category",
        "marital-status": "category",
        "sex": "category",
        "income>50K": "category",
    },
    user_special_values={
        "capital-gain": [0],
        "capital-loss": [0],
    },
    max_categories=15,
)

# 2) Preprocess (date -> offset, numeric split, etc.)
X_processed = prep.preprocess(X)

[auto_assign] age -> numeric (nuniq=74)
[auto_assign] fnlwgt -> numeric (nuniq=77)
[auto_assign] education-num -> numeric (nuniq=16)
[auto_assign] marital-status -> category (nuniq=7)
[auto_assign] capital-gain -> numeric (nuniq=23)
[auto_assign] capital-loss -> numeric (nuniq=48)
[auto_assign] hours-per-week -> numeric (nuniq=96)
[detect_special] Column 'capital-gain' is highly imbalanced: 0 occurs in 91.9% of non-null rows.
[detect_special] Column 'capital-loss' is highly imbalanced: 0 occurs in 95.3% of non-null rows.


## Define the dataloader with custom user settings

In [6]:
X_processed.columns

Index(['age', 'workclass', 'fnlwgt', 'education-num', 'marital-status',
       'occupation', 'relationship', 'race', 'sex', 'capital-gain_cat',
       'capital-gain', 'capital-loss_cat', 'capital-loss', 'hours-per-week',
       'native-country', 'income>50K'],
      dtype='object')

In [7]:
X_processed["capital-gain_cat"].value_counts()

capital-gain_cat
0          44888
NUMERIC     3954
Name: count, dtype: int64

In [8]:
X_processed["capital-loss_cat"].value_counts()

capital-loss_cat
0          46560
NUMERIC     2282
Name: count, dtype: int64
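
The `_cat` columns shown above can be thought of as a per-row label. The sketch below reproduces that labelling on a small hand-made list. It assumes the indicator keeps the special value itself and labels everything else `NUMERIC`, as the counts above suggest; `special_indicator` is a hypothetical helper, not part of `SynSeqPreprocessor`.

```python
from collections import Counter


def special_indicator(values, special_values, numeric_label="NUMERIC"):
    """Label each value with itself if it is special, otherwise with `numeric_label`."""
    special = set(special_values)
    return [v if v in special else numeric_label for v in values]


# Toy stand-in for a column such as capital-gain (not the real data).
capital_gain = [2, 0, 0, 0, 0, 7, 0, 15]
capital_gain_cat = special_indicator(capital_gain, special_values=[0])
counts = Counter(capital_gain_cat)
print(counts)
```

Synthesizing the indicator first lets the model decide "special value or not" as a classification problem before it has to produce an actual number.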

In [10]:
user_custom = {
    # Order in which the columns are synthesized.
    "syn_order": [
        "age", "sex", "workclass", "education-num", "marital-status",
        "occupation", "relationship", "fnlwgt", "race", "capital-loss",
        "hours-per-week", "native-country", "income>50K", "capital-gain",
    ],
    # Method to use for specific variables. 'CART' is the default.
    "method": {"relationship": "rf"},
    # Predictors to use when synthesizing each listed variable.
    # A variable cannot be its own predictor.
    "variable_selection": {
        "capital-loss": [
            "age", "sex", "workclass", "education-num", "marital-status",
            "occupation", "relationship", "fnlwgt", "race",
        ],
        "hours-per-week": [
            "age", "workclass", "fnlwgt", "education-num", "marital-status",
            "occupation", "relationship", "race", "sex",
        ],
        "native-country": [
            "age", "workclass", "fnlwgt", "education-num", "marital-status",
            "occupation", "relationship", "race", "sex", "hours-per-week",
        ],
        "income>50K": [
            "age", "workclass", "fnlwgt", "education-num", "marital-status",
            "occupation", "relationship", "race", "sex", "hours-per-week",
            "native-country",
        ],
    },
}
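
A predictor is only useful if it has already been synthesized when its target's turn comes. Whether `syn_seq` checks this itself is not confirmed here, so the sketch below shows one way to sanity-check a configuration before building the loader. `check_variable_selection` is a hypothetical helper, and the toy configuration is made up for illustration.

```python
def check_variable_selection(syn_order, variable_selection):
    """Return a list of (target, predictor, reason) tuples describing problems."""
    position = {col: i for i, col in enumerate(syn_order)}
    problems = []
    for target, predictors in variable_selection.items():
        if target not in position:
            problems.append((target, None, "target not in syn_order"))
            continue
        for pred in predictors:
            if pred not in position:
                problems.append((target, pred, "predictor not in syn_order"))
            elif position[pred] >= position[target]:
                # Covers both a later variable and the target itself.
                problems.append((target, pred, "predictor not synthesized before target"))
    return problems


toy_order = ["age", "sex", "income"]
toy_selection = {
    "sex": ["age"],               # fine: age comes before sex
    "income": ["age", "income"],  # problem: income listed as its own predictor
}
problems = check_variable_selection(toy_order, toy_selection)
print(problems)
```

An empty list means every listed predictor precedes its target in `syn_order`.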

In [11]:
loader = Syn_SeqDataLoader(X_processed,
                           user_custom=user_custom,
                           target_column="income>50K", sensitive_columns=["sex", "race"])

loader.dataframe()


[INFO] Syn_SeqEncoder summary:
  (column, method)

  (age, swr)
    --> 
  (sex, cart)
    --> 
  (workclass, cart)
    --> 
  (education-num, cart)
    --> 
  (marital-status, cart)
    --> 
  (occupation, cart)
    --> 
  (relationship, rf)
    --> 
  (fnlwgt, cart)
    --> 
  (race, cart)
    --> 
  (capital-loss_cat, cart)
    --> 
  (capital-loss, cart)
    --> 
  (hours-per-week, cart)
    --> 
  (native-country, cart)
    --> 
  (income>50K, cart)
    --> 
  (capital-gain_cat, cart)
    --> 
  (capital-gain, cart)

  - variable_selection_:
                  age  sex  workclass  education-num  marital-status  \
age                 0    0          0              0               0   
sex                 1    0          0              0               0   
workclass           1    1          0              0               0   
education-num       1    1          1              0               0   
marital-status      1    1          1              1               0   
occupation    

Unnamed: 0,age,workclass,fnlwgt,education-num,marital-status,occupation,relationship,race,sex,capital-gain_cat,capital-gain,capital-loss_cat,capital-loss,hours-per-week,native-country,income>50K
0,23,5,4,12,2,8,3,0,1,NUMERIC,2,0,0,39,0,0
1,34,1,4,12,0,4,2,0,1,0,0,0,0,12,0,0
2,22,0,13,8,1,6,3,0,1,0,0,0,0,39,0,0
3,37,0,15,6,0,6,2,4,1,0,0,0,0,39,0,0
4,12,0,22,12,0,5,0,4,0,0,0,0,0,39,12,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,23,0,13,12,1,5,3,0,0,0,0,0,0,35,0,0
48838,48,8,20,8,4,14,4,4,1,0,0,0,0,39,0,0
48839,22,0,24,12,0,5,2,0,1,0,0,0,0,49,0,0
48840,28,0,4,12,1,8,1,1,1,NUMERIC,5,0,0,39,0,0


## Existing plugins

In [12]:
# synthcity absolute
from synthcity.plugins import Plugins

generators = Plugins()

generators.list()

[2025-02-19T21:49:08.588848+0900][1593][CRITICAL] module disabled: /Users/minkeychang/synthcity/src/synthcity/plugins/generic/plugin_goggle.py


['uniform_sampler',
 'survival_nflow',
 'image_cgan',
 'ctgan',
 'bayesian_network',
 'survival_ctgan',
 'timegan',
 'decaf',
 'aim',
 'privbayes',
 'arf',
 'radialgan',
 'timevae',
 'survival_gan',
 'pategan',
 'dummy_sampler',
 'survae',
 'marginal_distributions',
 'rtvae',
 'great',
 'dpgan',
 'nflow',
 'ddpm',
 'image_adsgan',
 'syn_seq',
 'adsgan',
 'tvae',
 'fflows']

In [13]:
syn_model = Plugins().get("syn_seq")

[2025-02-19T21:49:10.199391+0900][1593][CRITICAL] module disabled: /Users/minkeychang/synthcity/src/synthcity/plugins/generic/plugin_goggle.py


In [14]:
syn_model.fit(loader)

[INFO] Syn_Seq aggregator: fitting columns...
Fitting 'age' => stored distribution from real data. Done.
Fitting 'sex' with 'cart' ... Done!
Fitting 'workclass' with 'cart' ... Done!
Fitting 'education-num' with 'cart' ... Done!
Fitting 'marital-status' with 'cart' ... Done!
Fitting 'occupation' with 'cart' ... Done!
Fitting 'relationship' with 'cart' ... Done!
Fitting 'fnlwgt' with 'cart' ... Done!
Fitting 'race' with 'cart' ... Done!
Fitting 'capital-loss_cat' with 'cart' ... Done!
Fitting 'capital-loss' with 'cart' ... Done!
Fitting 'hours-per-week' with 'cart' ... Done!
Fitting 'native-country' with 'cart' ... Done!
Fitting 'income>50K' with 'cart' ... Done!
Fitting 'capital-gain_cat' with 'cart' ... Done!
Fitting 'capital-gain' with 'cart' ... Done!


<synthcity.plugins.generic.plugin_syn_seq.Syn_SeqPlugin at 0x1102cccd0>

In [16]:
synthetic_loader = syn_model.generate(
    count = len(X)
    )

Generating 'age' => done.
Generating 'sex' => done.
Generating 'workclass' => done.
Generating 'education-num' => done.
Generating 'marital-status' => done.
Generating 'occupation' => done.
Generating 'relationship' => done.
Generating 'fnlwgt' => done.
Generating 'race' => done.
Generating 'capital-loss_cat' => done.
Generating 'capital-loss' => done.
Generating 'hours-per-week' => done.
Generating 'native-country' => done.
Generating 'income>50K' => done.
Generating 'capital-gain_cat' => done.
Generating 'capital-gain' => done.


In [17]:
synthetic_df = synthetic_loader.dataframe()
synthetic_df.head(20)

Unnamed: 0,age,sex,workclass,education-num,marital-status,occupation,relationship,fnlwgt,race,capital-loss_cat,capital-loss,hours-per-week,native-country,income>50K,capital-gain_cat,capital-gain
0,9,1,0,4,0,6,2,22,0,0,50,39,0,0,0,7
1,31,1,0,12,0,3,2,7,0,0,48,39,0,1,0,2
2,25,0,0,8,0,8,0,12,0,0,39,59,0,0,0,4
3,45,1,8,8,0,14,2,11,0,0,37,39,0,0,0,4
4,49,0,8,8,4,14,3,8,4,0,34,39,0,0,0,2
5,40,1,2,12,0,3,2,13,0,0,38,49,0,1,0,7
6,18,1,0,12,0,5,2,6,0,0,46,49,0,1,0,4
7,22,0,0,8,0,4,0,11,0,0,38,54,0,0,0,3
8,1,0,0,5,2,3,1,8,0,0,48,14,0,0,0,15
9,27,1,0,9,0,1,2,2,0,0,32,49,0,0,0,8


In [18]:
synthetic_df["capital-gain_cat"].value_counts()

capital-gain_cat
0    48842
Name: count, dtype: int64

In [None]:
synthetic_df["capital-loss_cat"].value_counts()

In [None]:
user_rules = {
    "marital-status": [
        ("age", "<=", 18),
        ("marital-status", "=", 2),
    ]
}

In [None]:
synthetic_df = prep.postprocess(synthetic_df, rules=user_rules)

In [None]:
# third party
import matplotlib.pyplot as plt

syn_model.plot(plt, loader)

plt.show()

## Benchmarking metrics

| **Metric**                                         | **Description**                                                                                                            |
|----------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------|
| sanity.data\_mismatch.score                        | Data types mismatch between the real/synthetic features                                                                    |
| sanity.common\_rows\_proportion.score              | Real data copy-paste in the synthetic data                                                                                 |
| sanity.nearest\_syn\_neighbor\_distance.mean       | Computes the `<reduction>` (here, mean) of the distance from the real data to the closest neighbor in the synthetic data   |
| sanity.close\_values\_probability.score            | the probability of close values between the real and synthetic data.                                                       |
| sanity.distant\_values\_probability.score          | the probability of distant values between the real and synthetic data.                                                     |
| stats.jensenshannon\_dist.marginal                 | the average Jensen-Shannon distance                                                                                        |
| stats.chi\_squared\_test.marginal                  | the one-way chi-square test.                                                                                               |
| stats.feature\_corr.joint                          | the correlation/strength-of-association of features in data-set with both categorical and continuous features              |
| stats.inv\_kl\_divergence.marginal                 | the average inverse of the Kullback–Leibler Divergence metric.                                                             |
| stats.ks\_test.marginal                            | the Kolmogorov-Smirnov test for goodness of fit.                                                                           |
| stats.max\_mean\_discrepancy.joint                 | Empirical maximum mean discrepancy. The lower the result the more evidence that distributions are the same.                |
| stats.prdc.precision                               | precision between the two manifolds                                                                                        |
| stats.prdc.recall                                  | recall between the two manifolds                                                                                           |
| stats.prdc.density                                 | density between the two manifolds                                                                                          |
| stats.prdc.coverage                                | coverage between the two manifolds                                                                                         |
| stats.alpha\_precision.delta\_precision\_alpha\_OC | Delta precision                                                                                                            |
| stats.alpha\_precision.delta\_coverage\_beta\_OC   | Delta coverage                                                                                                             |
| stats.alpha\_precision.authenticity\_OC            | Authenticity                                                                                                               |
| stats.survival\_km\_distance.optimism              | Kaplan-Meier distance between real-synthetic data                                                                          |
| stats.survival\_km\_distance.abs\_optimism         | Kaplan-Meier metrics absolute distance between real-syn data                                                               |
| stats.survival\_km\_distance.sightedness           | Kaplan-Meier metrics distance on the temporal axis                                                                         |
| performance.linear\_model.gt.c\_index              | Train on real, test on the test real data using CoxPH: C-Index                                                             |
| performance.linear\_model.gt.brier\_score          | Train on real, test on the test real data using CoxPH: Brier score                                                         |
| performance.linear\_model.syn\_id.c\_index         | Train on synthetic, test on the train real data using CoxPH: C-Index                                                       |
| performance.linear\_model.syn\_id.brier\_score     | Train on synthetic, test on the train real data using CoxPH: Brier score                                                   |
| performance.linear\_model.syn\_ood.c\_index        | Train on synthetic, test on the test real data using CoxPH: C-Index                                                        |
| performance.linear\_model.syn\_ood.brier\_score    | Train on synthetic, test on the test real data using CoxPH: Brier score                                                    |
| performance.mlp.gt.c\_index                        | Train on real, test on the test real data using NN: C-Index                                                                |
| performance.mlp.gt.brier\_score                    | Train on real, test on the test real data using NN : Brier score                                                           |
| performance.mlp.syn\_id.c\_index                   | Train on synthetic, test on the train real data using NN: C-Index                                                          |
| performance.mlp.syn\_id.brier\_score               | Train on synthetic, test on the train real data using NN: Brier score                                                      |
| performance.mlp.syn\_ood.c\_index                  | Train on synthetic, test on the test real data using NN: C-Index                                                           |
| performance.mlp.syn\_ood.brier\_score              | Train on synthetic, test on the test real data using NN: Brier score                                                       |
| performance.xgb.gt.c\_index                        | Train on real, test on the test real data using XGB: C-Index                                                               |
| performance.xgb.gt.brier\_score                    | Train on real, test on the test real data using XGB : Brier score                                                          |
| performance.xgb.syn\_id.c\_index                   | Train on synthetic, test on the train real data using XGB: C-Index                                                         |
| performance.xgb.syn\_id.brier\_score               | Train on synthetic, test on the train real data using XGB: Brier score                                                     |
| performance.xgb.syn\_ood.c\_index                  | Train on synthetic, test on the test real data using XGB: C-Index                                                          |
| performance.xgb.syn\_ood.brier\_score              | Train on synthetic, test on the test real data using XGB: Brier score                                                      |
| performance.feat\_rank\_distance.corr              | Correlation for the rank distances between the feature importance on real and synthetic data                               |
| performance.feat\_rank\_distance.pvalue            | p-value for the rank distances between the feature importance on real and synthetic data                                   |
| detection.detection\_xgb.mean                      | The average AUCROC score for detecting synthetic data using an XGBoost.                                                    |
| detection.detection\_mlp.mean                      | The average AUCROC score for detecting synthetic data using a NN.                                                          |
| detection.detection\_gmm.mean                      | The average AUCROC score for detecting synthetic data using a GMM.                                                         |
| privacy.delta-presence.score                       | the maximum re-identification probability on the real dataset from the synthetic dataset.                                  |
| privacy.k-anonymization.gt                         | the k-anon for the real data                                                                                               |
| privacy.k-anonymization.syn                        | the k-anon for the synthetic data                                                                                          |
| privacy.k-map.score                                | the minimum value k that satisfies the k-map rule.                                                                         |
| privacy.distinct l-diversity.gt                    | the l-diversity for the real data                                                                                          |
| privacy.distinct l-diversity.syn                   | the l-diversity for the synthetic data                                                                                     |
| privacy.identifiability\_score.score               | the re-identification score on the real dataset from the synthetic dataset.                                                |

## Benchmark the quality of plugins

The benchmark below evaluates the `syn_seq` plugin on the loader defined above.

In [None]:
# synthcity absolute
from synthcity.benchmark import Benchmarks

score = Benchmarks.evaluate(
    [(f"test_{model}", model, {}) for model in ["syn_seq"]],
    loader,
    synthetic_size=1000,
    repeats=2,
)

## Congratulations!

Congratulations on completing this notebook tutorial! If you enjoyed this and would like to join the movement towards Machine learning and AI for medicine, you can do so in the following ways!

### Star [Synthcity](https://github.com/vanderschaarlab/synthcity) on GitHub

- The easiest way to help our community is just by starring the Repos! This helps raise awareness of the tools we're building.


### Check out other projects from vanderschaarlab
- [HyperImpute](https://github.com/vanderschaarlab/hyperimpute)
- [AutoPrognosis](https://github.com/vanderschaarlab/autoprognosis)
