# Tutorial 10: Sequential Synthesis
In this tutorial, we explore the **Sequential Synthesis** approach using
the `syn_seq` plugin in `synthcity`. Sequential synthesis models variables
one by one (column by column), using conditional relationships learned from
the real data. The main idea is:

1. Synthesize the first variable (often with sample-without-replacement, "SWR"),
2. Then synthesize the second variable conditioned on the first,
3. And so on for each subsequent variable.

Because each column is drawn conditionally on the columns before it, this
approach can preserve dependencies among columns better than methods that
sample each column independently from its marginal distribution.

We'll demonstrate this using the **diabetes** dataset, as in the other tutorials,
and compare the resulting synthetic data with the real data.
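
To make the three steps above concrete, here is a minimal, standard-library sketch of sequential synthesis. It is not how `syn_seq` is implemented: the function name `sequential_synthesis`, the toy rows, and the nearest-neighbour donor draw (used here in place of a fitted model such as CART) are all assumptions made for illustration.

```python
import random


def sequential_synthesis(real_rows, order, n_rows, k=3, seed=0):
    """Toy column-by-column synthesis (illustrative, not the syn_seq code)."""
    rng = random.Random(seed)
    first = order[0]
    # Step 1: draw the first column from its observed values, without replacement.
    first_values = rng.sample([r[first] for r in real_rows], n_rows)
    synthetic = [{first: v} for v in first_values]
    # Steps 2..n: fill each later column conditioned on the columns already drawn.
    for i, col in enumerate(order[1:], start=1):
        done = order[:i]
        for row in synthetic:
            # Find the k real rows closest to this synthetic row on the
            # already-synthesized columns, then copy `col` from a random one.
            donors = sorted(
                real_rows,
                key=lambda r: sum((r[c] - row[c]) ** 2 for c in done),
            )[:k]
            row[col] = rng.choice(donors)[col]
    return synthetic


# Made-up toy data, only to exercise the function.
real = [
    {"age": 25, "bmi": 21.0, "target": 90},
    {"age": 31, "bmi": 23.5, "target": 110},
    {"age": 47, "bmi": 27.0, "target": 150},
    {"age": 52, "bmi": 29.5, "target": 170},
    {"age": 63, "bmi": 31.0, "target": 200},
]
synthetic = sequential_synthesis(real, ["age", "bmi", "target"], n_rows=5)
```

Because each later value is copied from a real row that is close on the earlier columns, the synthetic rows tend to keep the relationship between `age`, `bmi`, and `target` without having to be exact copies of any single real row.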


In [12]:
!pip install synthcity



In [1]:

import sys
import warnings

warnings.filterwarnings("ignore")

from sklearn.datasets import load_diabetes

# synthcity absolute
import synthcity.logger as log
from synthcity.plugins import Plugins

log.add(sink=sys.stderr, level="INFO")




## 1) Load the data
We will use the diabetes dataset for simplicity.


In [2]:

X, y = load_diabetes(return_X_y=True, as_frame=True)
X["target"] = y

ods = X

ods.head()


Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,135.0



## 2) Create a Syn_SeqDataLoader
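Instead of a `GenericDataLoader`, we use the specialized `Syn_SeqDataLoader`.
It accepts a `syn_order`, the sequence in which columns get synthesized. If
`syn_order` is not provided, as in the cell below, it defaults to the data's
column order. Here we also pass `col_type` to override the detected type of
`age`, and `special_value` to register two special values for `bp`.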
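
The default-order behaviour described above can be sketched as a small helper. `resolve_syn_order` is a hypothetical name, not part of `synthcity`, and the rule that a custom order must be a permutation of the columns is an assumption made for this sketch.

```python
def resolve_syn_order(columns, syn_order=None):
    """Return the synthesis order, falling back to the data's column order.

    Hypothetical helper for illustration; not a synthcity function.
    """
    if syn_order is None:
        return list(columns)
    # Assumption: a custom order must list every column exactly once.
    if sorted(syn_order) != sorted(columns):
        raise ValueError("syn_order must be a permutation of the data columns")
    return list(syn_order)


columns = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6", "target"]
default_order = resolve_syn_order(columns)
custom_order = resolve_syn_order(
    columns,
    ["sex", "bmi", "age", "bp", "s1", "s2", "s3", "s4", "s5", "s6", "target"],
)
```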


In [3]:
from synthcity.plugins.core.dataloader import Syn_SeqDataLoader

##############################################################################
# 2. Create the Syn_SeqDataLoader
##############################################################################
custom_col_type = {
    "age": "category",  # auto-detection also works; overridden here as an example
}

special_value_map = {
    "bp": [-0.040099, -0.005670],
}

loader = Syn_SeqDataLoader(
    data=ods,
    special_value=special_value_map,
    col_type=custom_col_type,
)

# On DataLoader init, the internal Syn_SeqEncoder is also fit()
# => column splits, date offsets, etc. are not applied until transform
print("\n[INFO] Loader created. loader.shape =", loader.shape)
print("loader columns =", loader.columns)




[INFO] Most of the time, it is recommened to have category variables before synthesizing numeric variables.
[INFO] Syn_SeqDataLoader init complete:
  - syn_order: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'target']
  - special_value: {'bp': [-0.040099, -0.00567]}
  - col_type: {'age': 'category'}
  - data shape: (442, 11)
[DEBUG] After encoder.fit(), detected info:
  - encoder.col_map =>
       age : {'original_dtype': 'float64', 'converted_type': 'category', 'method': 'cart'}
       sex : {'original_dtype': 'float64', 'converted_type': 'category', 'method': 'cart'}
       bmi : {'original_dtype': 'float64', 'converted_type': 'numeric', 'method': 'cart'}
       bp : {'original_dtype': 'float64', 'converted_type': 'numeric', 'method': 'cart'}
       s1 : {'original_dtype': 'float64', 'converted_type': 'numeric', 'method': 'cart'}
       s2 : {'original_dtype': 'float64', 'converted_type': 'numeric', 'method': 'cart'}
       s3 : {'original_dtype': 'float64', 'conve

In [4]:
ods

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,75.0
2,0.085299,0.050680,0.044451,-0.005670,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.025930,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,135.0
...,...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207,178.0
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485,104.0
439,0.041708,0.050680,-0.015906,0.017293,-0.037344,-0.013840,-0.024993,-0.011080,-0.046883,0.015491,132.0
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044529,-0.025930,220.0



The `Syn_SeqDataLoader` also prints out debug info, including the
automatically-detected numeric vs categorical columns.

## 3) List available plugins
Recall from earlier tutorials that you can see all generative model plugins
with `Plugins().list()`. We'll specifically focus on `"syn_seq"` here.



In [5]:
Plugins().list()

[2025-01-15T21:22:52.711185+0900][20168][CRITICAL] module disabled: /Users/minkeychang/synthcity/src/synthcity/plugins/generic/plugin_goggle.py


['syn_seq',
 'survival_gan',
 'bayesian_network',
 'survival_nflow',
 'privbayes',
 'survae',
 'survival_ctgan',
 'tvae',
 'great',
 'dpgan',
 'pategan',
 'ctgan',
 'uniform_sampler',
 'fflows',
 'marginal_distributions',
 'timegan',
 'image_cgan',
 'aim',
 'adsgan',
 'rtvae',
 'image_adsgan',
 'arf',
 'dummy_sampler',
 'timevae',
 'radialgan',
 'nflow',
 'ddpm',
 'decaf']

You should see `"syn_seq"` in the returned list.

## 4) Load and train the Sequential Synthesis Model
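The `syn_seq` plugin allows you to specify how each column is synthesized:

- `"SWR"`: sample without replacement
- `"CART"`, `"rf"`, `"pmm"`, `"logreg"`, etc. for the remaining columns

Typically, we use `"SWR"` for the first column and `"CART"` or `"rf"` for subsequent columns,
but you can choose any method for each column.

The `user_custom` dictionary below sets a custom `syn_order`, assigns the `"norm"`
method to `bp`, and lists, under `variable_selection`, the predictor columns for
`s4` and `target`.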


In [6]:
user_custom = {
  'syn_order' : ['sex', 'bmi', 'age', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'target'],
  'method' : {'bp':'norm'},
  'variable_selection' : {
    "s4": ['sex', 'bmi', 'age', 'bp', 's1', 's2'],
    "target": ['sex', 'bmi', 'age', 'bp', 's1', 's2', 's3']
  }
}

In [7]:
syn_model = Plugins().get("syn_seq")


[2025-01-15T21:22:55.821330+0900][20168][CRITICAL] module disabled: /Users/minkeychang/synthcity/src/synthcity/plugins/generic/plugin_goggle.py


In [8]:
syn_model.fit(loader, user_custom)

[INFO] Syn_SeqDataLoader init complete:
  - syn_order: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'target']
  - special_value: {'bp': [-0.040099, -0.00567]}
  - col_type: {'age': 'category'}
  - data shape: (442, 11)
[DEBUG] After encoder.fit(), detected info:
  - encoder.col_map =>
       age : {'original_dtype': 'category', 'converted_type': 'category', 'method': 'cart'}
       sex : {'original_dtype': 'category', 'converted_type': 'category', 'method': 'cart'}
       bmi : {'original_dtype': 'float64', 'converted_type': 'numeric', 'method': 'cart'}
       bp : {'original_dtype': 'float64', 'converted_type': 'numeric', 'method': 'cart'}
       s1 : {'original_dtype': 'float64', 'converted_type': 'numeric', 'method': 'cart'}
       s2 : {'original_dtype': 'float64', 'converted_type': 'numeric', 'method': 'cart'}
       s3 : {'original_dtype': 'float64', 'converted_type': 'numeric', 'method': 'cart'}
       s4 : {'original_dtype': 'float64', 'converted_type': 'nume

<synthcity.plugins.generic.plugin_syn_seq.Syn_SeqPlugin at 0x324b00e50>


**Note**: During training, you'll see some printed info about which method
is used for each column, plus the final variable selection matrix.
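
One way to picture a variable selection matrix is as a 0/1 table with one row per synthesized column and one column per candidate predictor. The sketch below builds such a table from the `syn_order` and `variable_selection` values used in this tutorial. The helper `build_variable_selection` is hypothetical, and the default of "every earlier column is a predictor" is an assumption for illustration, not a statement about what `syn_seq` does internally.

```python
def build_variable_selection(syn_order, overrides=None):
    """Build a 0/1 predictor table: matrix[target][predictor].

    Hypothetical helper; assumes each column defaults to using all
    columns that come before it in syn_order.
    """
    overrides = overrides or {}
    matrix = {}
    for position, target in enumerate(syn_order):
        earlier = syn_order[:position]
        predictors = overrides.get(target, earlier)
        # A column can only be conditioned on columns synthesized before it.
        too_late = [p for p in predictors if p not in earlier]
        if too_late:
            raise ValueError(f"{target!r} cannot depend on {too_late}")
        matrix[target] = {col: int(col in predictors) for col in syn_order}
    return matrix


syn_order = ["sex", "bmi", "age", "bp", "s1", "s2", "s3", "s4", "s5", "s6", "target"]
variable_selection = {
    "s4": ["sex", "bmi", "age", "bp", "s1", "s2"],
    "target": ["sex", "bmi", "age", "bp", "s1", "s2", "s3"],
}
matrix = build_variable_selection(syn_order, variable_selection)

# Print one row per synthesized column.
for target, row in matrix.items():
    print(f"{target:>6}", " ".join(str(v) for v in row.values()))
```

In this sketch the row for `s4` has a 0 under `s3` because the override leaves it out, while a column with no override, such as `s5`, keeps every earlier column as a predictor.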

## 5) Generate synthetic data
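Here we generate as many synthetic rows as there are real rows (442) and pass a `constraints` dictionary of `(column, operator, value)` rules to `generate`.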



In [9]:
my_constraints = {
    "target": [
        ("bmi", ">", 0.15),
        ("target", ">", 0)
    ]
}

n = len(ods)
syn_model.generate(nrows=n, constraints=my_constraints)


Generating 'age' ... Done!
Generating 'sex' ... Done!
Generating 'bmi' ... Done!
Generating 'bp' ... Done!
Generating 's1' ... Done!
Generating 's2' ... Done!
Generating 's3' ... Done!
Generating 's4' ... Done!
Generating 's5' ... Done!
Generating 's6' ... Done!
Generating 'target' ... Done!
[INFO] Syn_SeqDataLoader init complete:
  - syn_order: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'target']
  - special_value: {'bp': [-0.040099, -0.00567]}
  - col_type: {'age': 'category'}
  - data shape: (442, 11)
[DEBUG] After encoder.fit(), detected info:
  - encoder.col_map =>
       age : {'original_dtype': 'float64', 'converted_type': 'category', 'method': 'cart'}
       sex : {'original_dtype': 'float64', 'converted_type': 'category', 'method': 'cart'}
       bmi : {'original_dtype': 'float64', 'converted_type': 'numeric', 'method': 'cart'}
       bp : {'original_dtype': 'float64', 'converted_type': 'numeric', 'method': 'cart'}
       s1 : {'original_dtype': 'float64'

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.019913,-0.044642,-0.057941,-0.040099,-0.001569,-0.005072,0.074412,-0.039493,-0.042571,-0.059067,96.0
1,-0.012780,-0.044642,-0.023451,-0.033213,-0.016704,0.004636,-0.047082,0.034309,0.022688,0.036201,69.0
2,0.038076,0.050680,-0.024529,-0.049291,0.049341,-0.004132,-0.069172,0.145012,0.106351,0.019633,170.0
3,-0.012780,-0.044642,-0.035307,-0.033213,-0.009825,-0.003193,0.019187,-0.039493,-0.022517,-0.079778,127.0
4,-0.023677,0.050680,0.059541,0.052858,0.053469,0.056619,-0.021311,0.034309,0.015568,0.056912,222.0
...,...,...,...,...,...,...,...,...,...,...,...
437,0.070769,0.050680,0.020739,0.039087,0.061725,0.056305,-0.039719,0.108111,0.068986,0.015491,212.0
438,0.056239,-0.044642,-0.009439,0.021872,0.027326,0.040022,0.011824,-0.002592,-0.041176,-0.017646,230.0
439,0.059871,-0.044642,0.041218,0.076958,0.035582,0.037830,0.015505,-0.002592,-0.001496,-0.009362,173.0
440,-0.078165,0.050680,0.055229,0.018430,-0.037344,-0.047347,-0.058127,0.034309,0.054720,-0.017646,202.0
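
The `(column, operator, value)` tuples in `my_constraints` can be read as simple comparisons. The sketch below shows one plausible reading, in which a row is kept only if it satisfies every rule. This is an assumption for illustration: `filter_rows` and `satisfies` are hypothetical helpers, the rows are made up, and `syn_seq` may apply constraints differently.

```python
import operator

# Map the operator strings used in the constraint tuples to functions.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def satisfies(row, rules):
    """True if the row passes every (column, operator, value) rule."""
    return all(OPS[op](row[col], value) for col, op, value in rules)


def filter_rows(rows, constraints):
    """Keep rows that satisfy all rules (assumed semantics: logical AND)."""
    rules = [rule for group in constraints.values() for rule in group]
    return [row for row in rows if satisfies(row, rules)]


example_constraints = {"target": [("bmi", ">", 0.15), ("target", ">", 0)]}
# Made-up rows, only to exercise the filter.
rows = [
    {"bmi": 0.16, "target": 220.0},
    {"bmi": 0.02, "target": 135.0},
    {"bmi": 0.17, "target": -1.0},
]
kept = filter_rows(rows, example_constraints)
```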


In [None]:
# third party
import matplotlib.pyplot as plt

syn_model.plot(plt, loader)

plt.show()
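
Alongside a plot, a quick numeric check is to compare per-column summary statistics of the real and synthetic data. The sketch below uses only the standard library; `compare_columns` is a hypothetical helper, and the inputs are just the `target` values from the first five rows displayed earlier in this tutorial, so the numbers illustrate the mechanics rather than the quality of the synthesis.

```python
import statistics


def compare_columns(real_rows, syn_rows, columns):
    """Return mean and standard deviation per column for both datasets."""
    report = {}
    for col in columns:
        real = [r[col] for r in real_rows]
        syn = [r[col] for r in syn_rows]
        report[col] = {
            "real_mean": statistics.mean(real),
            "syn_mean": statistics.mean(syn),
            "real_std": statistics.stdev(real),
            "syn_std": statistics.stdev(syn),
        }
    return report


# `target` values from the first five real and synthetic rows shown above.
real_head = [{"target": v} for v in (151.0, 75.0, 141.0, 206.0, 135.0)]
syn_head = [{"target": v} for v in (96.0, 69.0, 170.0, 127.0, 222.0)]

report = compare_columns(real_head, syn_head, ["target"])
for col, stats in report.items():
    print(col, {name: round(value, 2) for name, value in stats.items()})
```

On the full data, the same comparison could be run over every column of the real and generated tables.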

## Congratulations!

Congratulations on completing this notebook tutorial! If you enjoyed this and would like to join the movement towards machine learning and AI for medicine, you can do so in the following ways:

### Star [Synthcity](https://github.com/vanderschaarlab/synthcity) on GitHub

- The easiest way to help our community is to star the repos. This helps raise awareness of the tools we're building.


### Check out other projects from vanderschaarlab
- [HyperImpute](https://github.com/vanderschaarlab/hyperimpute)
- [AutoPrognosis](https://github.com/vanderschaarlab/autoprognosis)
