# Tutorial 10: Sequential Synthesis
In this tutorial, we explore the **Sequential Synthesis** approach using
the `syn_seq` plugin in `synthcity`. Sequential synthesis allows us to
model variables one-by-one (column-by-column), using conditional relationships
learned from the real data. The main idea is:
1. Synthesize the first variable (often with sample-without-replacement, "SWR"),
2. Then synthesize the second variable conditioned on the first,
3. And so on for each subsequent variable.
This approach can better preserve complex dependencies among columns than
simple marginal or naive methods.
We'll demonstrate this using a survey dataset loaded from a local CSV file (`ods.csv`),
which mixes categorical, numeric, and date columns, and then inspect the resulting synthetic data.
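
To make the column-by-column idea concrete, here is a minimal toy sketch in plain Python. It is not how `syn_seq` is implemented: the data values are made up, `synthesize` is a hypothetical helper, and the second column is drawn from a simple empirical conditional distribution rather than from a fitted model such as CART.

```python
import random
from collections import defaultdict

# Toy "real" data: each row is (sex, smoke). Values are made up for illustration.
real_rows = [
    ("FEMALE", "NO"), ("FEMALE", "NO"), ("FEMALE", "YES"),
    ("MALE", "YES"), ("MALE", "YES"), ("MALE", "NO"),
]


def synthesize(rows, n, seed=0):
    """Draw n synthetic (first, second) pairs, one column at a time."""
    if n > len(rows):
        raise ValueError("sampling without replacement needs n <= len(rows)")
    rng = random.Random(seed)

    # Step 1: draw the first column from its observed values, without replacement.
    first_values = [first for first, _ in rows]
    sampled_first = rng.sample(first_values, k=n)

    # Step 2: group the observed second-column values by the first column.
    # This is an empirical stand-in for a model of P(second | first).
    second_given_first = defaultdict(list)
    for first, second in rows:
        second_given_first[first].append(second)

    # Step 3: draw the second column conditioned on the sampled first column.
    return [(first, rng.choice(second_given_first[first])) for first in sampled_first]


synthetic_rows = synthesize(real_rows, n=5)
print(synthetic_rows)
```

With more columns, step 3 simply repeats: each new column is drawn conditioned on the columns already synthesized.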


In [1]:
!pip install synthcity



In [2]:

import sys
import warnings

warnings.filterwarnings("ignore")

# synthcity absolute
import synthcity.logger as log
from synthcity.plugins import Plugins

log.add(sink=sys.stderr, level="INFO")




## 1) Load the data
We load a survey dataset from a local CSV file and keep only the columns we want to synthesize.


In [3]:
import pandas as pd

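# Adjust this path to wherever your copy of the data lives.
ods = pd.read_csv("C:\\Users\\hsrhe\\Desktop\\SQRG\\ods.csv")
# Preprocessing: keep only the columns we want to synthesize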
target = ["sex", "age", "edu", "income", "smoke", "nociga", "wkabdur", "ls", "wkabint", "date"]
ods = ods[target]

print(ods.head())

      sex  age                   edu  income smoke  nociga  wkabdur  \
0  FEMALE   57    VOCATIONAL/GRAMMAR   800.0    NO      -8       -8   
1    MALE   20    VOCATIONAL/GRAMMAR   350.0    NO      -8       -8   
2  FEMALE   18    VOCATIONAL/GRAMMAR     NaN    NO      -8       -8   
3  FEMALE   78  PRIMARY/NO EDUCATION   900.0    NO      -8       -8   
4  FEMALE   54    VOCATIONAL/GRAMMAR  1500.0   YES      20       -8   

                 ls wkabint        date  
0           PLEASED      NO  1979-10-07  
1  MOSTLY SATISFIED      NO         NaN  
2           PLEASED      NO         NaN  
3             MIXED      NO  1958-08-11  
4  MOSTLY SATISFIED      NO  1980-06-08  



## 2) Create a Syn_SeqDataLoader
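Instead of a `GenericDataLoader`, we use the specialized `Syn_SeqDataLoader`.
The `user_custom` dictionary configures it: `syn_order` is the sequence in which columns
are synthesized (if omitted, it defaults to the column order of the data), `method` overrides
the synthesis method for individual columns, `special_value` lists codes such as `-8` that
are handled separately from regular numeric values, `col_type` overrides the detected column
type, and `variable_selection` restricts which columns are used as predictors for a given column.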


In [4]:
from synthcity.plugins.core.dataloader import Syn_SeqDataLoader

user_custom = {
  'syn_order' : ["sex", "age", "edu", "income", "smoke", "nociga", "wkabdur", "ls", "wkabint", "date"],
  'method' : {"age": "cart"},
  'special_value': {'income': [-8], 'nociga': [-8]},
  'col_type' : {"date": "date"},
  'variable_selection' : {
    "nociga": ["sex", "age", "edu", "smoke"],
    "ls": ['sex', 'age', 'income'],
  }
}


In [5]:
loader = Syn_SeqDataLoader(
    ods,
    target_column="income",
    sensitive_columns=["income"],
    user_custom=user_custom,
    max_categories=15,
)


[INFO] Syn_SeqEncoder summary:
  (column, converted_type, method)

  (sex, category, swr)
    --> 
  (age, numeric, cart)
    --> 
  (edu, category, cart)
    --> 
  (income, numeric, cart)
    --> 
  (smoke, category, cart)
    --> 
  (nociga, numeric, cart)
    --> 
  (wkabdur, numeric, cart)
    --> 
  (ls, category, cart)
    --> 
  (wkabint, category, cart)
    --> 
  (date, numeric, cart)

  - special_value => {'income': [-8], 'nociga': [-8], 'wkabdur': [-8]}

  - variable_selection_:
         sex  age  edu  income  smoke  nociga  wkabdur  ls  wkabint  date
sex        0    0    0       0      0       0        0   0        0     0
age        1    0    0       0      0       0        0   0        0     0
edu        1    1    0       0      0       0        0   0        0     0
income     1    1    1       0      0       0        0   0        0     0
smoke      1    1    1       1      0       0        0   0        0     0
nociga     1    1    1       0      1       0        0   0 


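The `Syn_SeqDataLoader` prints a summary of how it encoded the data: the converted type
and synthesis method for each column (including the automatically detected numeric vs.
categorical split), the special values, and the variable selection matrix. In that matrix,
a `1` in a given row and column means the column is used as a predictor when synthesizing the row's variable.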
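
The variable selection matrix can be reproduced by hand from `syn_order` and the `variable_selection` overrides. The sketch below is an illustration of that logic, not the library's code: `build_predictor_matrix` is a hypothetical helper, and it assumes that a column without an override is predicted by every column that precedes it in the order.

```python
syn_order = ["sex", "age", "edu", "income", "smoke",
             "nociga", "wkabdur", "ls", "wkabint", "date"]
variable_selection = {
    "nociga": ["sex", "age", "edu", "smoke"],
    "ls": ["sex", "age", "income"],
}


def build_predictor_matrix(order, overrides):
    """Return {target: [0/1 per column in order]}; 1 means 'used as a predictor'."""
    matrix = {}
    for position, target in enumerate(order):
        # Assumed default: all columns earlier in the order are predictors.
        predictors = set(overrides.get(target, order[:position]))
        matrix[target] = [1 if column in predictors else 0 for column in order]
    return matrix


matrix = build_predictor_matrix(syn_order, variable_selection)
for target in syn_order:
    print(f"{target:<8}", *matrix[target])
```

The `nociga` row produced this way has a `0` under `income`, matching the override that leaves `income` out of its predictors.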

## 3) List available plugins
Recall from earlier tutorials that you can see all generative model plugins
with `Plugins().list()`. We'll specifically focus on `"syn_seq"` here.



In [6]:
Plugins().list()

[2025-02-07T15:21:28.011905+0900][11664][CRITICAL] module disabled: C:\Users\hsrhe\Desktop\synthcity\src\synthcity\plugins\generic\plugin_goggle.py


['image_cgan',
 'rtvae',
 'timevae',
 'fflows',
 'survival_ctgan',
 'adsgan',
 'tvae',
 'dummy_sampler',
 'survival_gan',
 'bayesian_network',
 'privbayes',
 'ddpm',
 'image_adsgan',
 'nflow',
 'great',
 'survival_nflow',
 'survae',
 'syn_seq',
 'decaf',
 'radialgan',
 'aim',
 'dpgan',
 'ctgan',
 'pategan',
 'arf',
 'uniform_sampler',
 'timegan',
 'marginal_distributions']

You should see `"syn_seq"` in the returned list.

## 4) Load and train the Sequential Synthesis Model
The `syn_seq` plugin synthesizes each column with its own method, which you can set per column through the `method` entry of the data loader configuration:
- `"swr"`: sample without replacement
- `"cart"`, `"rf"`, `"pmm"`, `"logreg"`, etc. for the rest

Typically, we use `"swr"` for the first column and `"cart"` or `"rf"` for subsequent columns,
but you can choose any method for each column.
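
As a small illustration of this convention, the sketch below assigns a method to every column: explicit overrides win, the first column falls back to `"swr"`, and all others fall back to `"cart"`. `assign_methods` is a hypothetical helper written for this tutorial, and the fallback rule is an assumption read off the loader summary printed earlier, not a documented guarantee of the plugin.

```python
def assign_methods(order, overrides=None, first_method="swr", default_method="cart"):
    """Pick a synthesis method per column: overrides first, then position-based defaults."""
    overrides = overrides or {}
    methods = {}
    for position, column in enumerate(order):
        fallback = first_method if position == 0 else default_method
        methods[column] = overrides.get(column, fallback)
    return methods


syn_order = ["sex", "age", "edu", "income", "smoke"]
methods = assign_methods(syn_order, overrides={"income": "rf"})
for column, method in methods.items():
    print(f"{column}: {method}")
```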


In [7]:
syn_model = Plugins().get("syn_seq")




In [8]:
syn_model.fit(loader)

Fitting 'age' with 'cart' ... Done!
Fitting 'edu' with 'cart' ... Done!
Fitting 'income_cat' with 'cart' ... Done!
Fitting 'income' with 'cart' ... Done!
Fitting 'smoke' with 'cart' ... Done!
Fitting 'nociga_cat' with 'cart' ... Done!
Fitting 'nociga' with 'cart' ... Done!
Fitting 'wkabdur_cat' with 'cart' ... Done!
Fitting 'wkabdur' with 'cart' ... Done!
Fitting 'ls' with 'cart' ... Done!
Fitting 'wkabint' with 'cart' ... Done!
Fitting 'date' with 'cart' ... Done!


<synthcity.plugins.generic.plugin_syn_seq.Syn_SeqPlugin at 0x1677a63e3d0>


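**Note**: During training, a progress line is printed for each column as it is fitted with its method.
Columns with special values (`income`, `nociga`, `wkabdur`) are fitted in two steps: first an auxiliary
`<column>_cat` column, then the column itself. The first column, `sex`, does not appear because it uses
`swr`, which samples values directly instead of fitting a model.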
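
One plausible reading of the `<column>_cat` step is that special codes such as `-8` are separated from the regular numeric values before modeling. The sketch below illustrates that idea only. `split_special` and the labels `"numeric"` and `"missing"` are assumptions made for this example, and the plugin's actual encoding may differ.

```python
SPECIAL = {-8}  # codes treated as special values, mirroring the 'special_value' setting


def split_special(values, special=SPECIAL):
    """Split a numeric column into category flags and the regular numeric part."""
    flags, regular = [], []
    for value in values:
        if value is None:
            flags.append("missing")
        elif value in special:
            flags.append(str(value))  # keep the special code as a category label
        else:
            flags.append("numeric")
            regular.append(value)
    return flags, regular


# Made-up values standing in for the 'nociga' column.
nociga = [-8, -8, 20, 15, None, -8, 30]
nociga_cat, nociga_regular = split_special(nociga)
print(nociga_cat)
print(nociga_regular)
```

Under this reading, a categorical model decides whether a synthetic row gets a special code, and the numeric model is fitted only on the regular values.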

## 5) Generate synthetic data
Here we generate as many synthetic rows as there are rows in the original data.



In [9]:
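# Note from the original notebook: this assumes X=loader is passed along as well;
# the call below passes only the number of rows.
synthetic_loader = syn_model.generate(
    nrows=len(ods),
)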

Generating 'age' => done.
Generating 'edu' => done.
Generating 'income_cat' => done.
Generating 'income' => done.
Generating 'smoke' => done.
Generating 'nociga_cat' => done.
Generating 'nociga' => done.
Generating 'wkabdur_cat' => done.
Generating 'wkabdur' => done.
Generating 'ls' => done.
Generating 'wkabint' => done.
Generating 'date' => done.


In [10]:
# 6) Results
synthetic_df = synthetic_loader.dataframe()
print(synthetic_df.head(20))

       sex   age                       edu  income smoke  nociga  wkabdur  \
0   FEMALE  51.0        VOCATIONAL/GRAMMAR  1000.0   YES    30.0     24.0   
1     MALE  25.0  POST-SECONDARY OR HIGHER  2000.0    NO    20.0      3.0   
2   FEMALE  56.0        VOCATIONAL/GRAMMAR  1350.0    NO    22.0      6.0   
3     MALE  53.0        VOCATIONAL/GRAMMAR  2500.0    NO    15.0      8.0   
4     MALE  30.0        VOCATIONAL/GRAMMAR  1500.0   YES    15.0     24.0   
5   FEMALE  92.0      PRIMARY/NO EDUCATION   850.0    NO    40.0     24.0   
6     MALE  53.0        VOCATIONAL/GRAMMAR  1700.0    NO    20.0      8.0   
7   FEMALE  48.0        VOCATIONAL/GRAMMAR  1200.0   YES    15.0      5.0   
8   FEMALE  73.0      PRIMARY/NO EDUCATION  1115.0    NO     3.0     36.0   
9   FEMALE  42.0                 SECONDARY  1200.0    NO    15.0      6.0   
10  FEMALE  76.0  POST-SECONDARY OR HIGHER  2400.0    NO     5.0     36.0   
11  FEMALE  35.0  POST-SECONDARY OR HIGHER  1100.0    NO    10.0      4.0   

In [11]:
from synthcity.metrics.eval import Metrics
metrics = Metrics()
result = metrics.evaluate(
    X_gt = loader,
    X_syn = synthetic_loader
)

KeyError: 'converted_type'

In [17]:
# user_rules = {
#   "smoke":[
#     ("nociga", "=", -8),
#     ("smoke", "=", "NO")
    
#   ]
# }

In [81]:
# #test
# encoded_loader, enc_dict = loader.encode()
# encoded_df = encoded_loader.dataframe()
# print(encoded_df.dtypes)
#     # print(encoded_df)
# print(encoded_df)
# print(enc_dict['syn_order'])
# print(enc_dict['converted_type'])

In [82]:
# #test
# # 2) Decoding
# decoded_loader = encoded_loader.decode()
# decoded_df = decoded_loader.dataframe()

# print("\n=== Decoded DataFrame ===")
# print(decoded_df.head())
# print(decoded_df.dtypes)

In [37]:
# # third party
# import matplotlib.pyplot as plt

# syn_model.plot(plt, loader)

# plt.show()

In [None]:
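# Test: flag values that dominate a column (share at or above the threshold)
threshold = 0.7
for col in synthetic_df.columns:
    value_ratios = synthetic_df[col].value_counts(normalize=True)
    high_ratios = value_ratios[value_ratios >= threshold]
    print(f"Values in column '{col}' with a share of at least {threshold}:")
    print(high_ratios, "\n")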

In [None]:
# Check the unique values and their frequencies for each column of the original data
for col in ods.columns:
    print(f"Value counts for column '{col}':")
    print(ods[col].value_counts(), "\n")

In [None]:
# Check the unique values and their frequencies for each column of the synthetic data
for col in synthetic_df.columns:
    print(f"Value counts for column '{col}':")
    print(synthetic_df[col].value_counts(), "\n")

In [None]:
# 6) Results: compare the first rows of the original and synthetic data
original_df = loader.dataframe()
synthetic_df = synthetic_loader.dataframe()
print(original_df.head())
print(synthetic_df.head())

## Benchmarking metrics

| **Metric**                                         | **Description**                                                                                                            |
|----------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------|
| sanity.data\_mismatch.score                        | Data type mismatch between the real/synthetic features                                                                     |
| sanity.common\_rows\_proportion.score              | Real data copy-paste in the synthetic data                                                                                 |
| sanity.nearest\_syn\_neighbor\_distance.mean       | Computes the `<reduction>`(distance) from the real data to the closest neighbor in the synthetic data                      |
| sanity.close\_values\_probability.score            | the probability of close values between the real and synthetic data.                                                       |
| sanity.distant\_values\_probability.score          | the probability of distant values between the real and synthetic data.                                                     |
| stats.jensenshannon\_dist.marginal                 | the average Jensen-Shannon distance                                                                                        |
| stats.chi\_squared\_test.marginal                  | the one-way chi-square test.                                                                                               |
| stats.feature\_corr.joint                          | the correlation/strength-of-association of features in data-set with both categorical and continuous features              |
| stats.inv\_kl\_divergence.marginal                 | the average inverse of the Kullback–Leibler Divergence metric.                                                             |
| stats.ks\_test.marginal                            | the Kolmogorov-Smirnov test for goodness of fit.                                                                           |
| stats.max\_mean\_discrepancy.joint                 | Empirical maximum mean discrepancy. The lower the result the more evidence that distributions are the same.                |
| stats.prdc.precision                               | precision between the two manifolds                                                                                        |
| stats.prdc.recall                                  | recall between the two manifolds                                                                                           |
| stats.prdc.density                                 | density between the two manifolds                                                                                          |
| stats.prdc.coverage                                | coverage between the two manifolds                                                                                         |
| stats.alpha\_precision.delta\_precision\_alpha\_OC | Delta precision                                                                                                            |
| stats.alpha\_precision.delta\_coverage\_beta\_OC   | Delta coverage                                                                                                             |
| stats.alpha\_precision.authenticity\_OC            | Authenticity                                                                                                               |
| stats.survival\_km\_distance.optimism              | Kaplan-Meier distance between real-synthetic data                                                                          |
| stats.survival\_km\_distance.abs\_optimism         | Kaplan-Meier metrics absolute distance between real-syn data                                                               |
| stats.survival\_km\_distance.sightedness           | Kaplan-Meier metrics distance on the temporal axis                                                                         |
| performance.linear\_model.gt.c\_index              | Train on real, test on the test real data using CoxPH: C-Index                                                             |
| performance.linear\_model.gt.brier\_score          | Train on real, test on the test real data using CoxPH: Brier score                                                         |
| performance.linear\_model.syn\_id.c\_index         | Train on synthetic, test on the train real data using CoxPH: C-Index                                                       |
| performance.linear\_model.syn\_id.brier\_score     | Train on synthetic, test on the train real data using CoxPH: Brier score                                                   |
| performance.linear\_model.syn\_ood.c\_index        | Train on synthetic, test on the test real data using CoxPH: C-Index                                                        |
| performance.linear\_model.syn\_ood.brier\_score    | Train on synthetic, test on the test real data using CoxPH: Brier score                                                    |
| performance.mlp.gt.c\_index                        | Train on real, test on the test real data using NN: C-Index                                                                |
| performance.mlp.gt.brier\_score                    | Train on real, test on the test real data using NN: Brier score                                                            |
| performance.mlp.syn\_id.c\_index                   | Train on synthetic, test on the train real data using NN: C-Index                                                          |
| performance.mlp.syn\_id.brier\_score               | Train on synthetic, test on the train real data using NN: Brier score                                                      |
| performance.mlp.syn\_ood.c\_index                  | Train on synthetic, test on the test real data using NN: C-Index                                                           |
| performance.mlp.syn\_ood.brier\_score              | Train on synthetic, test on the test real data using NN: Brier score                                                       |
| performance.xgb.gt.c\_index                        | Train on real, test on the test real data using XGB: C-Index                                                               |
| performance.xgb.gt.brier\_score                    | Train on real, test on the test real data using XGB: Brier score                                                           |
| performance.xgb.syn\_id.c\_index                   | Train on synthetic, test on the train real data using XGB: C-Index                                                         |
| performance.xgb.syn\_id.brier\_score               | Train on synthetic, test on the train real data using XGB: Brier score                                                     |
| performance.xgb.syn\_ood.c\_index                  | Train on synthetic, test on the test real data using XGB: C-Index                                                          |
| performance.xgb.syn\_ood.brier\_score              | Train on synthetic, test on the test real data using XGB: Brier score                                                      |
| performance.feat\_rank\_distance.corr              | Correlation for the rank distances between the feature importance on real and synthetic data                               |
| performance.feat\_rank\_distance.pvalue            | p-value for the rank distances between the feature importance on real and synthetic data                                   |
| detection.detection\_xgb.mean                      | The average AUCROC score for detecting synthetic data using an XGBoost.                                                    |
| detection.detection\_mlp.mean                      | The average AUCROC score for detecting synthetic data using a NN.                                                          |
| detection.detection\_gmm.mean                      | The average AUCROC score for detecting synthetic data using a GMM.                                                         |
| privacy.delta-presence.score                       | the maximum re-identification probability on the real dataset from the synthetic dataset.                                  |
| privacy.k-anonymization.gt                         | the k-anon for the real data                                                                                               |
| privacy.k-anonymization.syn                        | the k-anon for the synthetic data                                                                                          |
| privacy.k-map.score                                | the minimum value k that satisfies the k-map rule.                                                                         |
| privacy.distinct l-diversity.gt                    | the l-diversity for the real data                                                                                          |
| privacy.distinct l-diversity.syn                   | the l-diversity for the synthetic data                                                                                     |
| privacy.identifiability\_score.score               | the re-identification score on the real dataset from the synthetic dataset.                                                |
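
To give a feel for what a marginal metric such as the Jensen-Shannon distance measures, here is a minimal sketch for a single categorical column in plain Python. It is an illustration only: `js_distance` is a hypothetical helper, the sample values are made up, and base-2 logarithms are chosen so the result lies in [0, 1]. The library's own metric may bin values and choose the log base differently.

```python
import math
from collections import Counter


def js_distance(real, synthetic):
    """Jensen-Shannon distance between the category shares of two samples."""
    categories = sorted(set(real) | set(synthetic))
    real_counts, syn_counts = Counter(real), Counter(synthetic)
    p = [real_counts[c] / len(real) for c in categories]
    q = [syn_counts[c] / len(synthetic) for c in categories]
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence in bits; terms with a == 0 contribute 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(0.0, divergence))  # guard against tiny negative rounding


# Made-up values standing in for one column of the real and synthetic data.
real_smoke = ["NO", "NO", "NO", "YES"]
syn_smoke = ["NO", "NO", "YES", "YES"]
print(js_distance(real_smoke, syn_smoke))
```

A value of 0 means the two samples have identical category shares, and a value of 1 means they share no categories at all.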

## Benchmark the quality of plugins

The benchmarking workflow used for other generators also applies to `syn_seq`. Here we evaluate it on the data loader created above.

In [None]:
loader

In [None]:
# synthcity absolute
from synthcity.benchmark import Benchmarks

score = Benchmarks.evaluate(
    [(f"test_{model}", model, {}) for model in ["syn_seq"]],
    loader,
    synthetic_size=1000,
    repeats=2,
)

## Congratulations!

Congratulations on completing this notebook tutorial! If you enjoyed this and would like to join the movement towards Machine learning and AI for medicine, you can do so in the following ways!

### Star [Synthcity](https://github.com/vanderschaarlab/synthcity) on GitHub

- The easiest way to help our community is just by starring the Repos! This helps raise awareness of the tools we're building.


### Check out other projects from vanderschaarlab
- [HyperImpute](https://github.com/vanderschaarlab/hyperimpute)
- [AutoPrognosis](https://github.com/vanderschaarlab/autoprognosis)
