# PETsARD Operator Tests

This notebook tests the functionality of each Operator, with settings adapted from test.ipynb.

## LoaderOperator Tests

In [1]:
from petsard.operator import LoaderOperator

# Test loading a benchmark dataset
loader_op = LoaderOperator(config={"filepath": "benchmark://adult-income"})

# Run the load
loader_op.run(input={})

# Retrieve the results
data = loader_op.get_result()
meta = loader_op.get_metadata()

print(f"data: {type(data)}, meta: {type(meta)}")
print(f"data shape: {data.shape}")
print(f"metadata schema_id: {meta.schema_id}")

data: <class 'pandas.core.frame.DataFrame'>, meta: <class 'petsard.metadater.schema.schema_types.SchemaMetadata'>
data shape: (48842, 15)
metadata schema_id: adult-income


In [2]:
# Test loading a local file
loader_op_local = LoaderOperator(config={"filepath": "benchmark/adult-income.csv"})

loader_op_local.run(input={})

data_local = loader_op_local.get_result()
meta_local = loader_op_local.get_metadata()

print(f"data: {type(data_local)}, meta: {type(meta_local)}")
print(f"data shape: {data_local.shape}")

data: <class 'pandas.core.frame.DataFrame'>, meta: <class 'petsard.metadater.schema.schema_types.SchemaMetadata'>
data shape: (48842, 15)


## Simplify the data

In [3]:
# .loc slices by label and includes the end label, so 0:99 keeps 100 rows
data = data.loc[0:99, :]
data.shape

(100, 15)

## SplitterOperator Tests

In [4]:
from petsard.operator import SplitterOperator

# Test data splitting, using the same configuration as test.ipynb
splitter_op = SplitterOperator(config={"num_samples": 5, "train_split_ratio": 0.8})

# Prepare the input data
splitter_input = {"data": data, "metadata": meta, "exclude_index": []}

splitter_op.run(input=splitter_input)

# Get the split results (first split)
split_results = splitter_op.get_result()
split_meta = splitter_op.get_metadata()

# Take the training and test sets from the first split; only the first split can be retrieved
train_data = split_results["train"]
test_data = split_results["validation"]

print(f"train_data: {type(train_data)}, test_data: {type(test_data)}")
print(f"train_data shape: {train_data.shape}, test_data shape: {test_data.shape}")

train_data: <class 'pandas.core.frame.DataFrame'>, test_data: <class 'pandas.core.frame.DataFrame'>
train_data shape: (80, 15), test_data shape: (20, 15)
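
The 80/20 shapes above are what `train_split_ratio=0.8` gives on 100 rows. As a rough illustration of a ratio-based split, here is a minimal standard-library sketch. The function `split_indices` and its `seed` parameter are hypothetical names introduced for this example; this is not PETsARD's implementation.

```python
import random


def split_indices(n_rows, train_ratio=0.8, seed=0):
    """Shuffle row positions and cut them into train / validation lists."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    positions = list(range(n_rows))
    rng.shuffle(positions)
    cut = int(round(n_rows * train_ratio))
    return sorted(positions[:cut]), sorted(positions[cut:])


train_idx, val_idx = split_indices(100, train_ratio=0.8)
print(len(train_idx), len(val_idx))
```

The two lists are disjoint and together cover every row position, which is the property the train and validation sets need.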


In [5]:
# Test splitting with custom data
splitter_op_custom = SplitterOperator(
    config={
        "method": "custom_data",
        "filepath": {
            "ori": "benchmark://adult-income_ori",
            "control": "benchmark://adult-income_control",
        },
    }
)

# Custom data needs no input data
splitter_op_custom.run(input={"exclude_index": []})

custom_split_result = splitter_op_custom.get_result()
print(f"Custom split result keys: {custom_split_result.keys()}")

Custom split result keys: dict_keys(['train', 'validation'])


## PreprocessorOperator Tests

In [6]:
from petsard.operator import PreprocessorOperator

# Test default preprocessing
preproc_op = PreprocessorOperator(config={"method": "default"})

preproc_input = {"data": train_data, "metadata": meta}

preproc_op.run(input=preproc_input)

default_preproc_data = preproc_op.get_result()
preproc_meta = preproc_op.get_metadata()

print(f"default_preproc_data shape: {default_preproc_data.shape}")

default_preproc_data shape: (38, 15)


In [7]:
# Test missing-value handling only
preproc_op_missing = PreprocessorOperator(
    config={
        "method": "custom",
        "config": {
            "missing": {
                "age": "missing_mean",
            },
        },
        "sequence": ["missing"],
    }
)

preproc_op_missing.run(input=preproc_input)

missing_preproc_data = preproc_op_missing.get_result()

print(f"missing_preproc_data shape: {missing_preproc_data.shape}")
print("Age column after missing value processing:")
print(missing_preproc_data["age"].head(10))

missing_preproc_data shape: (80, 15)
Age column after missing value processing:
0    25
1    38
2    28
3    44
4    18
5    34
6    63
7    24
8    55
9    65
Name: age, dtype: int8
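
For readers unfamiliar with mean imputation, the idea behind a setting such as `missing_mean` can be sketched as follows. This is a plain-Python illustration under the assumption that missing entries are replaced by the mean of the observed ones; `fill_missing_with_mean` is a hypothetical helper, and PETsARD's own handling may differ in detail.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the non-missing entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


ages = [25, None, 28, 44, None, 34]
filled = fill_missing_with_mean(ages)
print(filled)
```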


In [8]:
# Test outlier handling only
preproc_op_outlier = PreprocessorOperator(
    config={
        "method": "custom",
        "config": {
            "outlier": {
                "age": "outlier_zscore",
            },
        },
        "sequence": ["outlier"],
    }
)

preproc_op_outlier.run(input=preproc_input)

outlier_preproc_data = preproc_op_outlier.get_result()

print(f"outlier_preproc_data shape: {outlier_preproc_data.shape}")
print("Age column after outlier processing:")
print(outlier_preproc_data["age"].head(10))

outlier_preproc_data shape: (38, 15)
Age column after outlier processing:
0    38
1    28
2    18
3    24
4    26
5    58
6    20
7    43
8    37
9    34
Name: age, dtype: int8
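
One common form of z-score outlier filtering drops values that lie more than a fixed number of standard deviations from the mean. The sketch below assumes a threshold of 3 and a population standard deviation; both are assumptions for illustration, and `drop_zscore_outliers` is a hypothetical helper rather than the code behind `outlier_zscore`.

```python
from statistics import mean, pstdev


def drop_zscore_outliers(values, threshold=3.0):
    """Keep only values whose absolute z-score is within the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:                     # all values identical: nothing to drop
        return list(values)
    return [v for v in values if abs(v - mu) / sigma <= threshold]


ages = [30] * 20 + [300]               # one extreme value among 21
kept = drop_zscore_outliers(ages)
print(len(ages), len(kept))
```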


In [9]:
# Test encoding only
preproc_op_encoder = PreprocessorOperator(
    config={
        "method": "custom",
        "config": {
            "encoder": {
                "workclass": "encoder_onehot",
            },
        },
        "sequence": ["encoder"],
    }
)

preproc_op_encoder.run(input=preproc_input)

encoder_preproc_data = preproc_op_encoder.get_result()

print(f"encoder_preproc_data shape: {encoder_preproc_data.shape}")
print("Workclass columns after encoding:")
workclass_cols = [
    col for col in encoder_preproc_data.columns if col.startswith("workclass_")
]
print(encoder_preproc_data[workclass_cols].head(10))

encoder_preproc_data shape: (80, 20)
Workclass columns after encoding:
   workclass_Federal-gov  workclass_Local-gov  workclass_Private  \
0                    0.0                  0.0                1.0   
1                    0.0                  0.0                1.0   
2                    0.0                  1.0                0.0   
3                    0.0                  0.0                1.0   
4                    0.0                  0.0                0.0   
5                    0.0                  0.0                1.0   
6                    0.0                  0.0                0.0   
7                    0.0                  0.0                1.0   
8                    0.0                  0.0                1.0   
9                    0.0                  0.0                1.0   

   workclass_Self-emp-inc  workclass_Self-emp-not-inc  workclass_State-gov  
0                     0.0                         0.0                  0.0  
1                     0.0 
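
The column count of 20 is consistent with replacing the single `workclass` column by the six indicator columns visible in the output (15 - 1 + 6 = 20). A minimal sketch of one-hot encoding in plain Python, where `one_hot` is a hypothetical helper and not the library's encoder:

```python
def one_hot(values, prefix):
    """Return one dict per row with a 0.0/1.0 indicator for each category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{cat}": 1.0 if value == cat else 0.0 for cat in categories}
        for value in values
    ]


rows = one_hot(["Private", "Local-gov", "Private"], prefix="workclass")
print(rows[1])
```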

In [10]:
# Test scaling only
preproc_op_scaler = PreprocessorOperator(
    config={
        "method": "custom",
        "config": {
            "scaler": {
                "age": "scaler_minmax",
            },
        },
        "sequence": ["scaler"],
    }
)

preproc_op_scaler.run(input=preproc_input)

scaler_preproc_data = preproc_op_scaler.get_result()

print(f"scaler_preproc_data shape: {scaler_preproc_data.shape}")
print("Age column after scaling:")
print(scaler_preproc_data["age"].head(10))

scaler_preproc_data shape: (80, 15)
Age column after scaling:
0    0.145455
1    0.381818
2    0.200000
3    0.490909
4    0.018182
5    0.309091
6    0.836364
7    0.127273
8    0.690909
9    0.872727
Name: age, dtype: float64
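
Min-max scaling maps each value to `(x - min) / (max - min)`. The notebook does not print the bounds of `age`; the values 17 and 72 used below are inferred from the output above (they reproduce the printed numbers), so treat them as an assumption. `minmax_scale` is a hypothetical helper for illustration.

```python
def minmax_scale(value, lo, hi):
    """Scale value linearly so that lo maps to 0.0 and hi maps to 1.0."""
    if hi == lo:                       # avoid division by zero on constant columns
        return 0.0
    return (value - lo) / (hi - lo)


ages = [25, 38, 28, 44, 18]            # first five ages from the earlier cell
scaled = [round(minmax_scale(a, 17, 72), 6) for a in ages]
print(scaled)
```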


## SynthesizerOperator Tests

In [11]:
from petsard.operator import SynthesizerOperator

# Test the synthesizer with the default method
synth_op = SynthesizerOperator(config={"method": "default"})

synth_input = {"data": default_preproc_data, "metadata": preproc_meta}

synth_op.run(input=synth_input)

syn_data = synth_op.get_result()

print(f"syn_data shape: {syn_data.shape}")
print(f"syn_data type: {type(syn_data)}")

syn_data shape: (38, 15)
syn_data type: <class 'pandas.core.frame.DataFrame'>


## PostprocessorOperator Tests

In [12]:
from petsard.operator import PostprocessorOperator

# Test postprocessing
postproc_op = PostprocessorOperator(config={"method": "default"})

postproc_input = {
    "data": syn_data,
    "preprocessor": preproc_op.processor,
}

postproc_op.run(input=postproc_input)

postproc_data = postproc_op.get_result()

print(f"postproc_data shape: {postproc_data.shape}")
print(f"postproc_data type: {type(postproc_data)}")

postproc_data shape: (38, 15)
postproc_data type: <class 'pandas.core.frame.DataFrame'>


## ConstrainerOperator Tests

In [13]:
from petsard.operator import ConstrainerOperator

# Test the constrainer with an empty config
constr_op = ConstrainerOperator(config={})

constr_input = {"data": postproc_data}

constr_op.run(input=constr_input)

cnst_data = constr_op.get_result()

print(f"cnst_data shape: {cnst_data.shape}")
print(f"cnst_data type: {type(cnst_data)}")

cnst_data shape: (38, 15)
cnst_data type: <class 'pandas.core.frame.DataFrame'>


In [14]:
# Test the constrainer using resample_until_satisfy
satisfy_data = constr_op.constrainer.resample_until_satisfy(
    data=postproc_data,
    target_rows=postproc_data.shape[0],
    synthesizer=synth_op.synthesizer,
    max_trials=300,
    sampling_ratio=10.0,
    verbose_step=10,
)

print(f"satisfy_data shape: {satisfy_data.shape}")

satisfy_data shape: (38, 15)
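
The general idea of resampling until enough rows satisfy the constraints can be sketched as a rejection loop. The parameter names below mirror the keyword arguments in the call above, but the body is only a guess at the general technique: `sample_fn` and `is_valid` are hypothetical stand-ins for the synthesizer and the constraint check, and none of this is taken from PETsARD's source.

```python
import random


def resample_until_satisfy(sample_fn, is_valid, target_rows,
                           max_trials=300, sampling_ratio=10.0):
    """Draw batches until target_rows valid rows are collected."""
    kept = []
    for trial in range(1, max_trials + 1):
        need = target_rows - len(kept)
        # Oversample by sampling_ratio, since some rows will be rejected
        batch = sample_fn(max(1, int(need * sampling_ratio)))
        kept.extend(row for row in batch if is_valid(row))
        if len(kept) >= target_rows:
            return kept[:target_rows], trial
    raise RuntimeError(f"only {len(kept)} valid rows after {max_trials} trials")


rng = random.Random(0)
rows, trials = resample_until_satisfy(
    sample_fn=lambda n: [{"age": rng.randint(0, 100)} for _ in range(n)],
    is_valid=lambda row: 18 <= row["age"] <= 65,
    target_rows=38,
)
print(len(rows), trials)
```

Oversampling each batch trades extra generation work for fewer trials, which matters when the constraints reject most rows.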


## EvaluatorOperator Tests

In [15]:
from petsard.operator import EvaluatorOperator

# Test the evaluator with the default method
eval_op = EvaluatorOperator(config={"method": "default"})

eval_input = {
    "data": {
        "ori": train_data,
        "control": test_data,
        "syn": syn_data,
    }
}

eval_op.run(input=eval_input)

eval_result = eval_op.get_result()

print(f"Evaluation result keys: {eval_result.keys()}")
print("\nGlobal evaluation results:")
print(eval_result["global"].head(1))
print("\nColumnwise evaluation results:")
print(eval_result["columnwise"].head(1))
print("\nPairwise evaluation results:")
print(eval_result["pairwise"].head(1))

Generating report ...

(1/2) Evaluating Column Shapes: |██████████| 15/15 [00:00<00:00, 874.43it/s]|
Column Shapes Score: 13.04%

(2/2) Evaluating Column Pair Trends: |██████████| 105/105 [00:00<00:00, 242.81it/s]|
Column Pair Trends Score: 1.95%

Overall Score (Average): 7.49%

Evaluation result keys: dict_keys(['global', 'columnwise', 'pairwise'])

Global evaluation results:
        Score  Column Shapes  Column Pair Trends
result   0.07            NaN                 NaN

Columnwise evaluation results:
          Property        Metric  Score
age  Column Shapes  KSComplement    0.0

Pairwise evaluation results:
                         Property                 Metric  Score  \
age workclass  Column Pair Trends  ContingencySimilarity    0.0   

               Real Correlation  Synthetic Correlation Error  
age workclass               NaN                    NaN  None  


In [16]:
# Test the evaluator with MLUtility classification
eval_op_mlutility = EvaluatorOperator(
    config={
        "method": "mlutility-classification",
        "target": "income",
    }
)

eval_input_anon = {
    "data": {
        "ori": train_data,
        "control": test_data,
        "syn": satisfy_data,
    }
}

eval_op_mlutility.run(input=eval_input_anon)

eval_result_mlutility = eval_op_mlutility.get_result()

print(eval_result_mlutility["global"].head(1))

The cardinality of the column workclass is too high. Ori: Over row numbers 80, column cardinality 7. Syn: Over row numbers 38, column cardinality 6. The column workclass is removed.
The cardinality of the column education is too high. Ori: Over row numbers 80, column cardinality 12. Syn: Over row numbers 38, column cardinality 3. The column education is removed.
The cardinality of the column marital-status is too high. Ori: Over row numbers 80, column cardinality 6. Syn: Over row numbers 38, column cardinality 4. The column marital-status is removed.
The cardinality of the column occupation is too high. Ori: Over row numbers 80, column cardinality 14. Syn: Over row numbers 38, column cardinality 11. The column occupation is removed.
The cardinality of the column relationship is too high. Ori: Over row numbers 80, column cardinality 5. Syn: Over row numbers 38, column cardinality 5. The column relationship is removed.


   ori_mean  ori_std  syn_mean  syn_std  diff
0      0.79     0.04      0.71     0.02 -0.08


## DescriberOperator Tests

In [17]:
from petsard.operator import DescriberOperator

# Test the describer
desc_op = DescriberOperator(config={"method": "default"})

desc_input = {"data": {"data": satisfy_data}}

desc_op.run(input=desc_input)

desc_result = desc_op.get_result()

print(f"Description result type: {type(desc_result)}")
if isinstance(desc_result, dict):
    print(f"Description result keys: {desc_result.keys()}")

    print(desc_result["global"].head(1))
    print(desc_result["columnwise"].head(1))
    print(desc_result["pairwise"].head(1))

Description result type: <class 'dict'>
Description result keys: dict_keys(['global', 'columnwise', 'pairwise'])
   row_count  col_count  na_count
0         38         15         0
      mean  median    std    min   max  kurtosis  skew     q1    q3  na_count  \
age  33.88   33.57  12.91  18.07  69.0      0.39  0.87  22.77  40.0       0.0   

    nunique  
age    <NA>  
  column1 column2  corr
0     age     age   1.0


## ReporterOperator Tests

In [18]:
from petsard.operator import ReporterOperator

# Test the reporter: save data
reporter_op_data = ReporterOperator(
    config={
        "method": "save_data",
        "source": "Postprocessor",
    }
)

report_data_input = {"data": {("Postprocessor", "exp1"): satisfy_data}}

reporter_op_data.run(input=report_data_input)

report_result_data = reporter_op_data.get_result()

print(f"Save data report result: {report_result_data}")

Now is petsard_Postprocessor[exp1] save to csv...
Save data report result: {'Postprocessor[exp1]':           age         workclass         fnlwgt     education  educational-num  \
0   21.368340       Federal-gov  136036.234374  Some-college        12.000000   
1   18.192728           Private  262862.772405       HS-grad        12.000000   
2   69.000000           Private  259823.864157       HS-grad        12.000000   
3   30.074558      Self-emp-inc   59099.210250       HS-grad         9.026702   
4   35.140393           Private  236828.797072       HS-grad        12.000000   
5   39.069126         State-gov  162763.149486     Assoc-voc        12.000000   
6   37.411303           Private  220947.186829  Some-college        10.642865   
7   35.126467           Private  141962.609039       HS-grad        12.000000   
8   35.732487           Private   69002.496064  Some-college         9.477471   
9   64.298642                 ?  131163.179409       HS-grad         9.065433   
10  53.899

In [19]:
# Test the reporter: save report
reporter_op_report = ReporterOperator(
    config={
        "method": "save_report",
        "granularity": "global",
    }
)

report_report_input = {
    "data": {
        ("Evaluator", "eval1_[global]"): desc_result["global"],
    }
}

reporter_op_report.run(input=report_report_input)

report_result_report = reporter_op_report.get_result()

print(f"Save report result: {report_result_report}")

Now is petsard[Report]_[global] save to csv...
Save report result: {'[global]':       full_expt_name Evaluator  eval1_row_count  eval1_col_count  \
0  Evaluator[global]  [global]               38               15   

   eval1_na_count  
0               0  }


## Summary

This notebook shows how to use each Operator class to run PETsARD functionality, with settings adapted from test.ipynb:

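1. **LoaderOperator**: loads data and metadata
2. **SplitterOperator**: splits data into training and test sets
3. **PreprocessorOperator**: preprocesses data (missing values, outliers, encoding, scaling)
4. **SynthesizerOperator**: synthesizes data
5. **PostprocessorOperator**: postprocesses synthetic data
6. **ConstrainerOperator**: applies data constraints
7. **EvaluatorOperator**: evaluates synthetic data quality
8. **DescriberOperator**: describes data characteristics
9. **ReporterOperator**: generates reports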

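Every Operator follows the same pattern:
- Initialize it with a config
- Prepare the input data
- Call the `run()` method
- Retrieve the result with `get_result()`
- Retrieve the metadata with `get_metadata()` (where applicable)

This design lets each module be tested independently, without depending on a Status object.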
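
Because every Operator in this notebook shares the same call sequence, the repeated steps could be wrapped in a small helper. The sketch below is illustrative only: `run_operator` is a hypothetical function, and `EchoOperator` is a hypothetical stand-in class that mimics the interface so the example runs without PETsARD installed.

```python
class EchoOperator:
    """Hypothetical stand-in exposing the same config / run / get_result steps."""

    def __init__(self, config):
        self.config = config
        self._result = None

    def run(self, input):
        # Record the config and how many rows were passed in
        self._result = {"config": self.config, "rows": len(input.get("data", []))}

    def get_result(self):
        return self._result


def run_operator(operator_cls, config, input):
    """Initialize, run, and collect the result plus metadata if available."""
    op = operator_cls(config=config)
    op.run(input=input)
    get_meta = getattr(op, "get_metadata", None)   # not every operator has it
    metadata = get_meta() if callable(get_meta) else None
    return op.get_result(), metadata


result, metadata = run_operator(EchoOperator, {"method": "default"}, {"data": [1, 2, 3]})
print(result, metadata)
```

Passing the class rather than an instance keeps the helper usable for any Operator that accepts `config=` and `input=` keyword arguments, as the calls in this notebook do.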