# Tutorial 10: Sequential Synthesis
In this tutorial, we explore the **Sequential Synthesis** approach using
the `syn_seq` plugin in `synthcity`. Sequential synthesis models variables
one at a time (column by column), using conditional relationships
learned from the real data. The main idea is:

1. Synthesize the first variable (often by sampling without replacement, "SWR").
2. Synthesize the second variable conditioned on the first.
3. Continue in the same way for each subsequent variable, conditioning on the columns already synthesized.

Because each column is conditioned on earlier ones, this approach can preserve
dependencies among columns better than sampling each column independently from its marginal distribution.

We'll demonstrate this on the **diabetes** dataset, as in the other tutorials,
and compare the resulting synthetic data with the real data.
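
The three steps above can be sketched in plain Python for purely categorical data. This is a toy illustration, not the `syn_seq` implementation: the function name `sequential_synthesis` and the example columns are hypothetical, and each later column is drawn from the real rows that match the values synthesized so far.

```python
import random
from collections import defaultdict


def sequential_synthesis(rows, order, count, seed=0):
    """Toy column-by-column synthesizer for categorical data (illustrative only)."""
    rng = random.Random(seed)
    first = order[0]
    # Step 1: draw the first column from the observed values without replacement.
    # random.sample requires count <= len(rows).
    synthetic = [{first: v} for v in rng.sample([r[first] for r in rows], count)]

    # Steps 2..n: draw each later column conditioned on the columns before it.
    for i, col in enumerate(order[1:], start=1):
        parents = order[:i]
        pools = defaultdict(list)
        for r in rows:
            pools[tuple(r[p] for p in parents)].append(r[col])
        for s in synthetic:
            key = tuple(s[p] for p in parents)
            s[col] = rng.choice(pools[key])
    return synthetic


rows = [
    {"sex": "F", "smoker": "no"},
    {"sex": "F", "smoker": "no"},
    {"sex": "M", "smoker": "yes"},
    {"sex": "M", "smoker": "yes"},
]
synthetic = sequential_synthesis(rows, ["sex", "smoker"], count=4)
print(synthetic)
```

In this example `smoker` is fully determined by `sex` in the real rows, so every synthetic row keeps that relationship. Real methods such as CART replace the exact-match lookup with a fitted model.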


In [1]:
!pip install synthcity



In [1]:

import sys
import warnings

warnings.filterwarnings("ignore")

from sklearn.datasets import load_diabetes

# synthcity absolute
import synthcity.logger as log
from synthcity.plugins import Plugins

log.add(sink=sys.stderr, level="INFO")




## 1) Load the data
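We will use the diabetes dataset for simplicity. Note that the cell below instead reads a local copy of `SD2011.csv`, while the loader and training outputs later in this tutorial come from the diabetes data (442 rows, 11 columns).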
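
The loader cells later in this tutorial reference a DataFrame named `X` that is never defined in the notebook. A minimal sketch of one way to build it, assuming `X` is meant to be the scikit-learn diabetes features with the target appended as a `target` column (this matches the column names and shape shown in the loader output, but it is an assumption):

```python
from sklearn.datasets import load_diabetes

# Load features and target as pandas objects (as_frame=True requires pandas).
X, y = load_diabetes(return_X_y=True, as_frame=True)

# Append the target so the frame has a "target" column,
# which the loader cell refers to via target_column="target".
X["target"] = y

print(X.shape)
print(list(X.columns))
```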


In [2]:
import pandas as pd

# Use double backslashes in the Windows path
df = pd.read_csv("C:\\Users\\hsrhe\\Desktop\\SQRG\\SD2011.csv")

# Inspect the data
print(df.head())


      sex  age  agegr               placesize               region  \
0  FEMALE   57  45-59   URBAN 100,000-200,000             Lubuskie   
1    MALE   20  16-24             RURAL AREAS            Podlaskie   
2  FEMALE   18  16-24  URBAN 500,000 AND OVER          Mazowieckie   
3  FEMALE   78    65+             RURAL AREAS            Podlaskie   
4  FEMALE   54  45-59   URBAN 100,000-200,000  Zachodnio-pomorskie   

                    edu                                            eduspec  \
0    VOCATIONAL/GRAMMAR  services for the population and transport serv...   
1    VOCATIONAL/GRAMMAR                                  no specialisation   
2    VOCATIONAL/GRAMMAR                                  no specialisation   
3  PRIMARY/NO EDUCATION                                  no specialisation   
4    VOCATIONAL/GRAMMAR                     agriculture, forestry, fishing   

            socprof  unempdur  income  ... alcsol  workab  wkabdur  wkabint  \
0           RETIRED         0  


## 2) Create a Syn_SeqDataLoader
Instead of a `GenericDataLoader`, we use the specialized
`Syn_SeqDataLoader`. We define a `syn_order`: the sequence in which columns
are synthesized. If it is not provided, it defaults to the column order of the data.
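
The default described above can be sketched as a small helper. The name `resolve_syn_order` is hypothetical, and the validation step is an assumption added for illustration; the loader's own handling of unknown or omitted columns is not shown in this tutorial.

```python
def resolve_syn_order(columns, syn_order=None):
    """Return the synthesis order, falling back to the data's column order."""
    columns = list(columns)
    if not syn_order:
        # No order given: synthesize columns in the order they appear in the data.
        return columns
    # Assumption: reject names that are not columns of the data.
    unknown = [c for c in syn_order if c not in columns]
    if unknown:
        raise ValueError(f"Unknown columns in syn_order: {unknown}")
    return list(syn_order)


columns = ["age", "sex", "bmi", "target"]
print(resolve_syn_order(columns))
print(resolve_syn_order(columns, ["sex", "bmi", "age", "target"]))
```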


In [3]:
from synthcity.plugins.core.dataloader import Syn_SeqDataLoader

user_custom = {
    "syn_order": ["sex", "age", "edu", "marital", "income", "ls", "wkabint"],
    "method": {},
    "special_value": {},
    "col_type": {},
    "variable_selection": {
        "wkabint": [""],
        "ls": ["sex", "age", "income"],
    },
}


[WARN] user did not specify 'syn_order'; using raw_df.columns order.
[INFO] Syn_SeqDataLoader init complete. splitted_df shape= (442, 11)
  - syn_order: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'target']
  - special_value: {}
  - col_type: {}
  - max_categories: 20
  - encoder.col_map =>
     age: original_dtype=float64, converted_type=numeric, method=cart
     sex: original_dtype=float64, converted_type=category, method=cart
     bmi: original_dtype=float64, converted_type=numeric, method=cart
     bp: original_dtype=float64, converted_type=numeric, method=cart
     s1: original_dtype=float64, converted_type=numeric, method=cart
     s2: original_dtype=float64, converted_type=numeric, method=cart
     s3: original_dtype=float64, converted_type=numeric, method=cart
     s4: original_dtype=float64, converted_type=numeric, method=cart
     s5: original_dtype=float64, converted_type=numeric, method=cart
     s6: original_dtype=float64, converted_type=numeric, method

In [None]:
loader = Syn_SeqDataLoader(X, target_column="target", sensitive_columns=["sex"], user_custom=user_custom)

In [6]:
info_dict = loader.info()

print("=== Syn_SeqDataLoader Info ===")
print(info_dict)

=== Syn_SeqDataLoader Info ===
{'data_type': 'syn_seq', 'len': 442, 'train_size': 0.8, 'random_state': 0, 'syn_order': ['sex', 'bmi', 'age', 'bp_cat', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'target'], 'method': {'sex': 'cart', 'bmi': 'cart', 'age': 'cart', 'bp': 'cart', 's1': 'cart', 's2': 'cart', 's3': 'cart', 's4': 'cart', 's5': 'cart', 's6': 'cart', 'target': 'cart', 'bp_cat': 'cart'}, 'special_value': {'bp': [-0.040099, -0.00567]}, 'original_type': {'sex': 'float64', 'bmi': 'float64', 'age': 'float64', 'bp': 'float64', 's1': 'float64', 's2': 'float64', 's3': 'float64', 's4': 'float64', 's5': 'float64', 's6': 'float64', 'target': 'float64', 'bp_cat': None}, 'converted_type': {'sex': 'category', 'bmi': 'numeric', 'age': 'category', 'bp': 'numeric', 's1': 'numeric', 's2': 'numeric', 's3': 'numeric', 's4': 'numeric', 's5': 'numeric', 's6': 'numeric', 'target': 'numeric', 'bp_cat': 'category'}, 'variable_selection':         sex  bmi  age  bp  s1  s2  s3  s4  s5  s6  target  bp_cat
se


The `Syn_SeqDataLoader` also prints out debug info, including the
automatically-detected numeric vs categorical columns.
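
The loader output lists `max_categories: 20` alongside the detected types. One plausible reading is a threshold on the number of distinct values. The sketch below implements that reading only; the helper `infer_column_type` is hypothetical, and the exact rule used by `Syn_SeqDataLoader` is not shown in this tutorial.

```python
def infer_column_type(values, max_categories=20):
    """Label a column 'category' when it has few distinct values, else 'numeric'."""
    distinct = {v for v in values if v is not None}
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    # Non-numeric columns are always categorical; numeric columns are
    # categorical only when they take at most max_categories distinct values.
    if not all_numeric or len(distinct) <= max_categories:
        return "category"
    return "numeric"


print(infer_column_type([0.05068, -0.044642] * 10))      # two distinct values
print(infer_column_type([i / 100 for i in range(50)]))   # fifty distinct values
```

Under this rule a float column with two distinct values, like `sex` in the diabetes data, would be treated as categorical, which is consistent with the loader output above.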

## 3) List available plugins
Recall from earlier tutorials that you can see all generative model plugins
with `Plugins().list()`. We'll specifically focus on `"syn_seq"` here.



In [5]:
Plugins().list()

[2025-01-15T21:32:59.642370+0900][13168][CRITICAL] module disabled: C:\Users\hsrhe\Desktop\synthcity\src\synthcity\plugins\generic\plugin_goggle.py


['ddpm',
 'arf',
 'tvae',
 'survival_ctgan',
 'survae',
 'privbayes',
 'syn_seq',
 'timevae',
 'bayesian_network',
 'timegan',
 'rtvae',
 'nflow',
 'aim',
 'dpgan',
 'pategan',
 'image_cgan',
 'radialgan',
 'adsgan',
 'image_adsgan',
 'survival_nflow',
 'dummy_sampler',
 'marginal_distributions',
 'ctgan',
 'decaf',
 'great',
 'uniform_sampler',
 'survival_gan',
 'fflows']

You should see `"syn_seq"` in the returned list.

## 4) Load and train the Sequential Synthesis Model
The `syn_seq` plugin lets you specify how each column is synthesized:

- `"SWR"`: sample without replacement
- `"CART"`, `"rf"`, `"pmm"`, `"logreg"`, etc. for the remaining columns

Typically, we use `"SWR"` for the first column and `"CART"` or `"rf"` for subsequent columns,
but you can choose any method for each column.
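
To make the `"CART"` option more concrete, here is a minimal sketch of CART-style conditional synthesis for one numeric column, written with scikit-learn. It is not the plugin's code: the function `cart_synthesize` and its parameters are hypothetical. A tree is fitted on the real predictors, each synthetic row is dropped down the tree, and its value is sampled from the real values that fell into the same leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def cart_synthesize(X_real, y_real, X_syn, min_samples_leaf=5, seed=0):
    """Sample y for synthetic rows from the real donors in the matching tree leaf."""
    rng = np.random.default_rng(seed)
    y_real = np.asarray(y_real)
    tree = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf, random_state=seed)
    tree.fit(X_real, y_real)

    real_leaves = tree.apply(X_real)  # leaf index of every real row
    syn_leaves = tree.apply(X_syn)    # leaf index of every synthetic row

    y_syn = np.empty(len(X_syn), dtype=y_real.dtype)
    for leaf in np.unique(syn_leaves):
        donors = y_real[real_leaves == leaf]
        mask = syn_leaves == leaf
        # Draw observed values (with replacement) from the donors in this leaf.
        y_syn[mask] = rng.choice(donors, size=int(mask.sum()), replace=True)
    return y_syn


rng = np.random.default_rng(1)
X_real = rng.normal(size=(100, 2))
y_real = 2 * X_real[:, 0] + rng.normal(scale=0.1, size=100)
X_syn = rng.normal(size=(30, 2))
y_syn = cart_synthesize(X_real, y_real, X_syn)
print(y_syn[:5])
```

Because values are drawn from real donors, every synthetic value is one that was observed in the real data.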


In [6]:
user_custom = {
  'syn_order' : ['sex', 'bmi', 'age', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'target'],
  'method' : {'bp':'norm'},
  'variable_selection' : {
    "s4": ['sex', 'bmi', 'age', 'bp', 's1', 's2'],
    "target": ['sex', 'bmi', 'age', 'bp', 's1', 's2', 's3']
  }
}

In [7]:
syn_model = Plugins().get("syn_seq")


[2025-01-15T21:33:02.354537+0900][13168][CRITICAL] module disabled: C:\Users\hsrhe\Desktop\synthcity\src\synthcity\plugins\generic\plugin_goggle.py


In [8]:
syn_model.fit(loader, user_custom)

[INFO] Syn_SeqDataLoader init complete:
  - syn_order: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'target']
  - special_value: {'bp': [-0.040099, -0.00567]}
  - col_type: {'age': 'category'}
  - data shape: (442, 11)

[DEBUG] After encoder.fit(), ready for preprocessing with detected info:
  - special_value: {'bp': [-0.040099, -0.00567]}
  - encoder.col_map =>
       age : converted_type=category, method=cart
       sex : converted_type=category, method=cart
       bmi : converted_type=numeric, method=cart
       bp : converted_type=numeric, method=cart
       s1 : converted_type=numeric, method=cart
       s2 : converted_type=numeric, method=cart
       s3 : converted_type=numeric, method=cart
       s4 : converted_type=numeric, method=cart
       s5 : converted_type=numeric, method=cart
       s6 : converted_type=numeric, method=cart
       target : converted_type=numeric, method=cart
  - variable_selection_:
        age  sex  bmi  bp  s1  s2  s3  s4  s5  s6  tar

<synthcity.plugins.generic.plugin_syn_seq.Syn_SeqPlugin at 0x1c2ad85a490>

In [11]:
df = loader.dataframe()
print(df.dtypes)

age       float64
sex       float64
bmi       float64
bp        float64
s1        float64
s2        float64
s3        float64
s4        float64
s5        float64
s6        float64
target    float64
dtype: object



**Note**: During training, you'll see some printed info about which method
is used for each column, plus the final variable selection matrix.
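
A variable selection matrix records, for each column, which other columns serve as its predictors. The sketch below builds such a matrix under the assumption that a column uses all columns earlier in `syn_order` unless an override is given. The helper `build_variable_selection` is hypothetical and the plugin's own defaults may differ.

```python
def build_variable_selection(syn_order, overrides=None):
    """Return {target: {column: 0 or 1}} marking the predictors of each target."""
    overrides = overrides or {}
    matrix = {}
    for i, target in enumerate(syn_order):
        # Default: every column that comes earlier in the synthesis order.
        predictors = overrides.get(target, syn_order[:i])
        matrix[target] = {col: int(col in predictors) for col in syn_order}
    return matrix


order = ["sex", "bmi", "age", "bp"]
matrix = build_variable_selection(order, {"bp": ["sex", "bmi"]})
print("    ", *order)
for target, row in matrix.items():
    print(f"{target:>4}", *row.values())
```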

## 5) Generate synthetic data
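We generate as many synthetic rows as there are real rows (`count=len(df)`) and pass a `constraints` dictionary whose entries are `(column, operator, value)` tuples.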
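
The `constraints` dictionary used in this section holds rules written as `(column, operator, value)` tuples. As a rough illustration of what such rules express, the sketch below checks rows against a list of rules in plain Python. The helper `satisfies` is hypothetical, and this is not necessarily how `syn_seq` applies its constraints.

```python
import operator

# Map the operator strings used in the rules to Python comparison functions.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def satisfies(row, rules):
    """Return True when the row meets every (column, operator, value) rule."""
    return all(OPS[op](row[col], value) for col, op, value in rules)


rules = [("bmi", ">", 0.15), ("target", ">", 0)]
rows = [
    {"bmi": 0.17, "target": 120.0},
    {"bmi": 0.02, "target": 95.0},
]
kept = [row for row in rows if satisfies(row, rules)]
print(kept)
```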



In [13]:
constraints = {
  "target":[
    ("bmi", ">", 0.15),
    ("target", ">", 0)
  ]
}

encoded_loader = syn_model._aggregator.get_encoded_loader()

encoded_loader.data.dtypes

sex       category
bmi        float64
age       category
bp_cat    category
bp         float64
s1         float64
s2         float64
s3         float64
s4         float64
s5         float64
s6         float64
target     float64
dtype: object

In [17]:
# Assumes the loader is passed along as X here
synthetic_loader = syn_model.generate(
    count=len(df),
    constraints=constraints,
    X=encoded_loader # <-- be sure to include this!
)


# 6) Results
synthetic_df = synthetic_loader.dataframe()
print(synthetic_df.head())

        sex       bmi       age  bp_cat        bp        s1        s2  \
0  0.050680 -0.006206  0.027178    -777  0.018430 -0.016704  0.017788   
1 -0.044642  0.071397 -0.052738    -777 -0.074527 -0.015328 -0.034508   
2  0.050680  0.127443  0.019913    -777  0.097615 -0.011201 -0.015092   
3 -0.044642 -0.038540 -0.092695    -777 -0.019442 -0.068991 -0.074277   
4 -0.044642 -0.047163 -0.045472    -777 -0.026328 -0.015328 -0.022608   

         s3        s4        s5        s6  target  
0 -0.039719 -0.002592 -0.018114 -0.025930    55.0  
1  0.004460 -0.039493 -0.015999  0.069338    90.0  
2 -0.054446  0.034309  0.060791  0.032059   295.0  
3 -0.054446  0.012906  0.001148 -0.038357   252.0  
4  0.015505 -0.002592  0.024055 -0.038357    66.0  


In [18]:
# 6) Results
orginal_df = encoded_loader.dataframe()
synthetic_df = synthetic_loader.dataframe()
print(orginal_df.head())
print(synthetic_df.head())

        sex       bmi       age bp_cat        bp        s1        s2  \
0  0.050680  0.061696  0.038076   -777  0.021872 -0.044223 -0.034821   
1 -0.044642 -0.051474 -0.001882   -777 -0.026328 -0.008449 -0.019163   
2  0.050680  0.044451  0.085299   -777 -0.005670 -0.045599 -0.034194   
3 -0.044642 -0.011595 -0.089063   -777 -0.036656  0.012191  0.024991   
4 -0.044642 -0.036385  0.005383   -777  0.021872  0.003935  0.015596   

         s3        s4        s5        s6  target  
0 -0.043401 -0.002592  0.019907 -0.017646   151.0  
1  0.074412 -0.039493 -0.068332 -0.092204    75.0  
2 -0.032356 -0.002592  0.002861 -0.025930   141.0  
3 -0.036038  0.034309  0.022688 -0.009362   206.0  
4  0.008142 -0.002592 -0.031988 -0.046641   135.0  
        sex       bmi       age  bp_cat        bp        s1        s2  \
0  0.050680 -0.006206  0.027178    -777  0.018430 -0.016704  0.017788   
1 -0.044642  0.071397 -0.052738    -777 -0.074527 -0.015328 -0.034508   
2  0.050680  0.127443  0.019913    -

In [21]:
print(orginal_df.dtypes)
print(synthetic_df.dtypes)
print(loader.dataframe().dtypes)

sex       category
bmi        float64
age       category
bp_cat    category
bp         float64
s1         float64
s2         float64
s3         float64
s4         float64
s5         float64
s6         float64
target     float64
dtype: object
sex       category
bmi        float64
age       category
bp_cat       int64
bp         float64
s1         float64
s2         float64
s3         float64
s4         float64
s5         float64
s6         float64
target     float64
dtype: object
age       float64
sex       float64
bmi       float64
bp        float64
s1        float64
s2        float64
s3        float64
s4        float64
s5        float64
s6        float64
target    float64
dtype: object


In [15]:
# third party
import matplotlib.pyplot as plt

syn_model.plot(plt, loader)

plt.show()

ValueError: Need a DataLoader for decoding. (Syn_SeqDataLoader, etc.)

## Benchmarking metrics

| **Metric**                                         | **Description**                                                                                                            |
|----------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------|
| sanity.data\_mismatch.score                        | Data type mismatch between the real and synthetic features                                                                 |
| sanity.common\_rows\_proportion.score              | Real data copy-paste in the synthetic data                                                                                 |
| sanity.nearest\_syn\_neighbor\_distance.mean       | Computes the `<reduction>(distance)` from the real data to the closest neighbor in the synthetic data                      |
| sanity.close\_values\_probability.score            | the probability of close values between the real and synthetic data.                                                       |
| sanity.distant\_values\_probability.score          | the probability of distant values between the real and synthetic data.                                                     |
| stats.jensenshannon\_dist.marginal                 | the average Jensen-Shannon distance                                                                                        |
| stats.chi\_squared\_test.marginal                  | the one-way chi-square test.                                                                                               |
| stats.feature\_corr.joint                          | the correlation/strength-of-association of features in data-set with both categorical and continuous features              |
| stats.inv\_kl\_divergence.marginal                 | the average inverse of the Kullback–Leibler Divergence metric.                                                             |
| stats.ks\_test.marginal                            | the Kolmogorov-Smirnov test for goodness of fit.                                                                           |
| stats.max\_mean\_discrepancy.joint                 | Empirical maximum mean discrepancy. The lower the result the more evidence that distributions are the same.                |
| stats.prdc.precision                               | precision between the two manifolds                                                                                        |
| stats.prdc.recall                                  | recall between the two manifolds                                                                                           |
| stats.prdc.density                                 | density between the two manifolds                                                                                          |
| stats.prdc.coverage                                | coverage between the two manifolds                                                                                         |
| stats.alpha\_precision.delta\_precision\_alpha\_OC | Delta precision                                                                                                            |
| stats.alpha\_precision.delta\_coverage\_beta\_OC   | Delta coverage                                                                                                             |
| stats.alpha\_precision.authenticity\_OC            | Authenticity                                                                                                               |
| stats.survival\_km\_distance.optimism              | Kaplan-Meier distance between real-synthetic data                                                                          |
| stats.survival\_km\_distance.abs\_optimism         | Kaplan-Meier metrics absolute distance between real-syn data                                                               |
| stats.survival\_km\_distance.sightedness           | Kaplan-Meier metrics distance on the temporal axis                                                                         |
| performance.linear\_model.gt.c\_index              | Train on real, test on the test real data using CoxPH: C-Index                                                             |
| performance.linear\_model.gt.brier\_score          | Train on real, test on the test real data using CoxPH: Brier score                                                         |
| performance.linear\_model.syn\_id.c\_index         | Train on synthetic, test on the train real data using CoxPH: C-Index                                                       |
| performance.linear\_model.syn\_id.brier\_score     | Train on synthetic, test on the train real data using CoxPH: Brier score                                                   |
| performance.linear\_model.syn\_ood.c\_index        | Train on synthetic, test on the test real data using CoxPH: C-Index                                                        |
| performance.linear\_model.syn\_ood.brier\_score    | Train on synthetic, test on the test real data using CoxPH: Brier score                                                    |
| performance.mlp.gt.c\_index                        | Train on real, test on the test real data using NN: C-Index                                                                |
| performance.mlp.gt.brier\_score                    | Train on real, test on the test real data using NN : Brier score                                                           |
| performance.mlp.syn\_id.c\_index                   | Train on synthetic, test on the train real data using NN: C-Index                                                          |
| performance.mlp.syn\_id.brier\_score               | Train on synthetic, test on the train real data using NN: Brier score                                                      |
| performance.mlp.syn\_ood.c\_index                  | Train on synthetic, test on the test real data using NN: C-Index                                                           |
| performance.mlp.syn\_ood.brier\_score              | Train on synthetic, test on the test real data using NN: Brier score                                                       |
| performance.xgb.gt.c\_index                        | Train on real, test on the test real data using XGB: C-Index                                                               |
| performance.xgb.gt.brier\_score                    | Train on real, test on the test real data using XGB : Brier score                                                          |
| performance.xgb.syn\_id.c\_index                   | Train on synthetic, test on the train real data using XGB: C-Index                                                         |
| performance.xgb.syn\_id.brier\_score               | Train on synthetic, test on the train real data using XGB: Brier score                                                     |
| performance.xgb.syn\_ood.c\_index                  | Train on synthetic, test on the test real data using XGB: C-Index                                                          |
| performance.xgb.syn\_ood.brier\_score              | Train on synthetic, test on the test real data using XGB: Brier score                                                      |
| performance.feat\_rank\_distance.corr              | Correlation for the rank distances between the feature importance on real and synthetic data                               |
| performance.feat\_rank\_distance.pvalue            | p-value for the rank distances between the feature importance on real and synthetic data                                   |
| detection.detection\_xgb.mean                      | The average AUCROC score for detecting synthetic data using an XGBoost.                                                    |
| detection.detection\_mlp.mean                      | The average AUCROC score for detecting synthetic data using a NN.                                                          |
| detection.detection\_gmm.mean                      | The average AUCROC score for detecting synthetic data using a GMM.                                                         |
| privacy.delta-presence.score                       | the maximum re-identification probability on the real dataset from the synthetic dataset.                                  |
| privacy.k-anonymization.gt                         | the k-anon for the real data                                                                                               |
| privacy.k-anonymization.syn                        | the k-anon for the synthetic data                                                                                          |
| privacy.k-map.score                                | the minimum value k that satisfies the k-map rule.                                                                         |
| privacy.distinct l-diversity.gt                    | the l-diversity for the real data                                                                                          |
| privacy.distinct l-diversity.syn                   | the l-diversity for the synthetic data                                                                                     |
| privacy.identifiability\_score.score               | the re-identification score on the real dataset from the synthetic dataset.                                                |
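
As an illustration of what a marginal metric such as the Jensen-Shannon distance measures, here is a small standard-library sketch for a single categorical column. It uses base-2 logarithms, so the result lies between 0 (identical distributions) and 1 (no overlap). The function `js_distance` is hypothetical and is not the `synthcity` implementation, which may bin, average, or normalise differently.

```python
import math
from collections import Counter


def js_distance(real, synthetic):
    """Jensen-Shannon distance (base-2 logs) between two categorical samples."""
    categories = set(real) | set(synthetic)
    p_counts, q_counts = Counter(real), Counter(synthetic)
    p = {c: p_counts[c] / len(real) for c in categories}
    q = {c: q_counts[c] / len(synthetic) for c in categories}
    m = {c: 0.5 * (p[c] + q[c]) for c in categories}  # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a[c] == 0 contribute nothing.
        return sum(a[c] * math.log2(a[c] / b[c]) for c in categories if a[c] > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(0.0, divergence))  # guard against tiny negative rounding


print(js_distance(["F", "M", "F", "M"], ["F", "M", "M", "F"]))
print(js_distance(["F", "F"], ["M", "M"]))
```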

## Benchmark the quality of plugins

We benchmark `syn_seq` against `ctgan` on the same loader. The diabetes `target` column is continuous, so the task type is set to `"regression"` here.

In [None]:
# synthcity absolute
from synthcity.benchmark import Benchmarks

score = Benchmarks.evaluate(
    [(f"test_{model}", model, {}) for model in ["syn_seq", "ctgan"]],
    loader,
    synthetic_size=1000,
    repeats=2,
    task_type="regression",
)

## Congratulations!

Congratulations on completing this notebook tutorial! If you enjoyed it and would like to join the movement towards machine learning and AI for medicine, you can do so in the following ways.

### Star [Synthcity](https://github.com/vanderschaarlab/synthcity) on GitHub

- The easiest way to help our community is to star the repos. This helps raise awareness of the tools we're building.


### Check out other projects from vanderschaarlab
- [HyperImpute](https://github.com/vanderschaarlab/hyperimpute)
- [AutoPrognosis](https://github.com/vanderschaarlab/autoprognosis)
