# Quick start

*EPAM Syngen* is an unsupervised tabular data generation tool based on a variational autoencoder (VAE). 
It supports common tabular data types (floats, integers, datetime, text, categorical, binary) and can generate linked tables that share keys, using a simple statistical approach. 
The SDK exposes simple programmatic entry points for training, inference, report generation, and loading and saving data in the supported formats: *CSV*, *Avro* and *Excel*. The data must be stored locally and encoded in UTF-8.
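
Because the input data is expected to be UTF-8 encoded, it can be useful to check a file before training. The following is a minimal sketch using only the standard library; the helper name `is_utf8` is hypothetical and is not part of the Syngen SDK.

```python
import codecs


def is_utf8(path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` decodes cleanly as UTF-8.

    Hypothetical helper (not part of syngen). The file is read in chunks
    so that large files do not have to fit into memory at once.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        with open(path, "rb") as handle:
            while chunk := handle.read(chunk_size):
                decoder.decode(chunk)
        # Flush the decoder to catch a multi-byte sequence cut off at EOF.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

For example, `is_utf8("../examples/example-data/housing.csv")` could be called before passing that path as `source`.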

This notebook demonstrates how to use the SDK. After installing the package, you can call the main SDK class `Syngen` to run training, inference, or report generation, and the class `DataIO` to load and save data in the supported formats.

Python *3.10* or *3.11* is required to run the library. The library is tested on Linux and Windows operating systems.
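
A quick way to confirm that the current interpreter matches the requirement above is sketched below. The helper `python_is_supported` is a hypothetical name introduced for illustration; it only encodes the two versions named in this notebook.

```python
import sys

# Versions named in this notebook as required by the library.
SUPPORTED_VERSIONS = {(3, 10), (3, 11)}


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if (major, minor) is one of the supported versions."""
    return (version_info[0], version_info[1]) in SUPPORTED_VERSIONS


if __name__ == "__main__":
    status = "supported" if python_is_supported() else "not supported"
    print(f"Python {sys.version_info[0]}.{sys.version_info[1]} is {status}")
```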

# Installation

Install the *syngen* library. The command below uses TestPyPI as the primary package index and PyPI as an additional index:

In [1]:
!pip install  --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ --use-pep517 --no-cache-dir syngen==0.10.28rc15

Looking in indexes: https://test.pypi.org/simple/, https://pypi.org/simple/

## Initialization

```python
Syngen(
    metadata_path: Optional[str] = None,  # path to a metadata file ('.yaml' or '.yml') that provides all parameters for training or inference of multiple tables
    table_name: Optional[str] = None,     # required for single-table training or inference
    source: Optional[str] = None,         # path to the original data; required for single-table training
)
```

### Key notes:

- **Single table training**: Both `source` and `table_name` are required when `metadata_path` is not provided.
- **Single table inference**: `table_name` is required when `metadata_path` is not provided.
- **Multiple tables training and inference**: Use `metadata_path` to define multiple tables with or without relationships.
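
The rules in the key notes can be summarized as a small decision function. This is only an illustrative sketch: `resolve_mode` and its `for_training` flag are hypothetical names, and the actual validation performed inside `Syngen` may differ.

```python
from typing import Optional


def resolve_mode(
    metadata_path: Optional[str] = None,
    table_name: Optional[str] = None,
    source: Optional[str] = None,
    for_training: bool = True,
) -> str:
    """Mirror the key notes above (hypothetical helper, not the SDK's code)."""
    # A metadata file describes all tables, so nothing else is required.
    if metadata_path is not None:
        return "metadata"
    # Without metadata, the table name is always required.
    if table_name is None:
        raise ValueError("'table_name' is required when 'metadata_path' is not provided")
    # Single-table training additionally needs the path to the original data.
    if for_training and source is None:
        raise ValueError("'source' is required for single-table training")
    return "single_table"
```

For instance, `resolve_mode(table_name="housing", for_training=False)` returns `"single_table"`, while omitting both `metadata_path` and `table_name` raises a `ValueError`.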

## Launch training

You can start a training process using the SDK entry point `Syngen().train(...)`. This trains a model and saves the model artifacts to disk in the *'./model_artifacts'* directory. The SDK mirrors the CLI options, so you can pass the same parameters programmatically. Below is a complete description of all available parameters:

```python
train(
    epochs: int = 10,                     # a number of training epochs
    drop_null: bool = False,              # whether to drop rows with at least one missing value
    row_limit: Optional[int] = None,      # a number of rows to train over
    reports: Union[str, List[str]] = "none",  # report types: 'none', 'accuracy', 'sample', 'metrics_only', 'all'
    log_level: str = "INFO",              # a logging level
    batch_size: int = 32,                 # a training batch size
    fernet_key: Optional[str] = None      # a name of the environment variable containing the Fernet key for secure storage of the data subset
)
```

### Parameters description:

- **`epochs`** *(int, default: 10)*: The number of training epochs. Must be ≥ 1. Because an early stopping mechanism is implemented, a larger number of epochs is generally the better choice.

- **`drop_null`** *(bool, default: False)*: Whether to drop rows containing at least one missing value before training. When `False`, missing values are handled during the training process.

- **`row_limit`** *(Optional[int], default: None)*: A maximum number of rows to use for training. If specified and less than the total rows, a random subset of the specified size will be selected. Useful for testing or working with large datasets.

- **`reports`** *(Union[str, List[str]], default: "none")*: Controls the generation of quality reports. Accepts a single string or a list of strings:
  - `"none"` - no reports are generated (default)
  - `"accuracy"` - generates an accuracy report that compares synthetic data (of the same size as the original) with the original dataset to estimate the quality of the training process
  - `"sample"` - generates a sample report that compares the distributions of the original data and of the subset of this data
  - `"metrics_only"` - outputs metrics to stdout without generating an accuracy report
  - `"all"` - generates both accuracy and sample reports

  List example: `["accuracy", "sample"]` to generate multiple report types

  *Note*: Report generation may require significant time for large tables (>10,000 rows)

- **`log_level`** *(str, default: "INFO")*: A logging level for the training process. Accepted values: `"TRACE"`, `"DEBUG"`, `"INFO"`, `"WARNING"`, `"ERROR"`, `"CRITICAL"`.

- **`batch_size`** *(int, default: 32)*: A training batch size. Must be ≥ 1. Splits training into batches to optimize memory usage. Smaller batches use less RAM but may increase training time.

- **`fernet_key`** *(Optional[str], default: None)*: The name of the environment variable containing a 44-character URL-safe base64-encoded Fernet key. When provided, the data subset is encrypted on disk (stored in the `.dat` format). If not provided, the data is stored unencrypted in the `.pkl` format. **Important**: The same key must be used during inference and report generation to decrypt the data.

*Note:* For full documentation, metadata file format, and additional details, please refer to [README.md](../README.md)
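
To make the accepted `reports` values concrete, the sketch below normalizes a string or a list into the set of reports that would be produced during training. `normalize_train_reports` is a hypothetical helper written for this notebook; how the library itself treats mixed lists (for example `"none"` combined with other values) is an assumption here.

```python
from typing import List, Union

# Values listed in the `reports` description for training.
TRAIN_REPORT_TYPES = ("none", "accuracy", "sample", "metrics_only", "all")


def normalize_train_reports(reports: Union[str, List[str]] = "none") -> List[str]:
    """Return a sorted list of concrete report types (hypothetical helper)."""
    requested = [reports] if isinstance(reports, str) else list(reports)
    unknown = [item for item in requested if item not in TRAIN_REPORT_TYPES]
    if unknown:
        raise ValueError(f"Unknown report type(s): {unknown}")
    # "all" is documented as generating both accuracy and sample reports.
    if "all" in requested:
        return ["accuracy", "sample"]
    # Assumption: "none" contributes nothing when mixed with other values.
    return sorted({item for item in requested if item != "none"})
```

With this sketch, `normalize_train_reports("all")` and `normalize_train_reports(["sample", "accuracy"])` describe the same set of reports.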

In [2]:
# Example: training a single table

from syngen.sdk import Syngen


launcher_for_single_table = Syngen(
    source="../examples/example-data/housing.csv", 
    table_name="housing"
)

launcher_for_single_table.train(
    epochs=5,
    drop_null=False,
    row_limit=1000, 
    batch_size=32, 
    reports="all"
)

2025-10-29 16:12:31.844474: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-10-29 16:12:31.844527: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-10-29 16:12:31.845450: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-10-29 16:12:34.806 | INFO     | syngen.train:validate_required_parameters:58 - The training process will be executed according to the information mentioned in 'train_settings' in the metadata file. If appropriate information is absent from the metadata file, then the values of parameters sent through CLI will b

2025-10-29 16:12:50.893 | INFO     | syngen.ml.vae.models.model:fit_sampler:187 - Creating BayesianGaussianMixture
2025-10-29 16:12:50.894 | INFO     | syngen.ml.vae.models.model:fit_sampler:189 - Fitting BayesianGaussianMixture
2025-10-29 16:12:52.827 | INFO     | syngen.ml.vae.models.model:fit_sampler:191 - Finished fitting BayesianGaussianMixture
2025-10-29 16:12:52.874 | INFO     | syngen.ml.vae.wrappers.wrappers:save_state:545 - Saved VAE state in model_artifacts/resources/housing/vae/checkpoints
2025-10-29 16:12:52.874 | INFO     | syngen.ml.handlers.handlers:__fit_model:191 - Finished VAE training
2025-10-29 16:12:52.875 | INFO     | syngen.ml.handlers.handlers:handle:143 - No

 1/32 [..............................] - ETA: 4s

2025-10-29 16:12:53.349 | INFO     | syngen.ml.vae.wrappers.wrappers:load_state:554 - Loaded VAE state from model_artifacts/resources/housing/vae/checkpoints
2025-10-29 16:12:53.350 | INFO     | syngen.ml.handlers.handlers:handle:491 - Total of 1 batch(es)
2025-10-29 16:12:53.350 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing'. Generating the batch 1 of 1
2025-10-29 16:12:53.350 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-29 16:12:53.350 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing' started.

Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 7956.09it/s]
2025-10-29 16:12:53.615 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-29 16:12:53.620 | INFO     | syngen.ml.strategies.strategies:run:243 - Synthesis of the table - 'housing' was completed. Synthetic data saved in 'model_artifacts/tmp_store/housing/merged_infer_housing.csv'
2025-10-29 16:12:53.621 | INFO     | syngen.ml.reporters.reporters:_log_and_update_progress:300 - The calculation of sample metrics for the table - 'housing' has started
2025-10-29 16:12:56.871 | INFO     | syngen.ml.reporters.reporters:_log_and_update_progress:300 - The sample report of the table - 'housing' has been generated
2025-10-29

In [4]:
launcher_for_single_table.execution_artifacts

{'housing': {'losses_path': 'model_artifacts/system_store/losses/losses-housing-2025-10-29-16-12-34-895251.csv',
  'path_to_input_data': 'model_artifacts/tmp_store/housing/input_data_housing.pkl',
  'generated_reports': {'sample_report': 'model_artifacts/resources/housing/reports/sample-report-2025_10_29_16_12_56_869466.html',
   'accuracy_report': 'model_artifacts/resources/housing/reports/accuracy-report-2025_10_29_16_13_19_142749.html'}}}
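
The dictionary shown above can be traversed with ordinary Python to collect the generated report paths. The sketch below reuses the structure from this output as a literal; `report_paths` is a hypothetical helper, not an SDK method.

```python
from typing import Dict, List, Tuple


def report_paths(execution_artifacts: Dict[str, dict]) -> List[Tuple[str, str, str]]:
    """Return (table, report_type, path) for every generated report.

    Hypothetical helper; assumes the nested layout shown in the output above.
    """
    found = []
    for table, artifacts in execution_artifacts.items():
        for report_type, path in artifacts.get("generated_reports", {}).items():
            found.append((table, report_type, path))
    return found


# Structure copied from the cell output above.
example_artifacts = {
    "housing": {
        "losses_path": "model_artifacts/system_store/losses/losses-housing-2025-10-29-16-12-34-895251.csv",
        "path_to_input_data": "model_artifacts/tmp_store/housing/input_data_housing.pkl",
        "generated_reports": {
            "sample_report": "model_artifacts/resources/housing/reports/sample-report-2025_10_29_16_12_56_869466.html",
            "accuracy_report": "model_artifacts/resources/housing/reports/accuracy-report-2025_10_29_16_13_19_142749.html",
        },
    }
}

for table, report_type, path in report_paths(example_artifacts):
    print(f"{table}: {report_type} -> {path}")
```

In a live session, `launcher_for_single_table.execution_artifacts` would be passed instead of the literal.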

In [5]:
# Example: training multiple tables with relationships

from syngen.sdk import Syngen


launcher_for_multiple_tables = Syngen(
    metadata_path="../examples/example-metadata/housing_metadata.yaml"
)

launcher_for_multiple_tables.train(log_level="DEBUG")

2025-10-29 16:18:57.123 | INFO     | syngen.train:validate_required_parameters:58 - The training process will be executed according to the information mentioned in 'train_settings' in the metadata file. If appropriate information is absent from the metadata file, then the values of parameters sent through CLI will be used. Otherwise, the values of parameters will be defaulted
2025-10-29 16:18:57.128 | DEBUG    | syngen.ml.validation_schema.validation_schema:validate_schema:300 - The schema of the metadata is valid
2025-10-29 16:18:57.129 | INFO     | syngen.ml.worker.worker:_remove_existed_artifact:106 - The artifacts located in the path - 'model_artifacts/resources/housing-properties/' was removed
2025-10-29 16:18:57.130 | INFO     | syngen.ml.worker.worker:_remove_existed_artifact:

2025-10-29 16:19:12.345 | INFO     | syngen.ml.vae.models.model:fit_sampler:187 - Creating BayesianGaussianMixture
2025-10-29 16:19:12.346 | INFO     | syngen.ml.vae.models.model:fit_sampler:189 - Fitting BayesianGaussianMixture
2025-10-29 16:19:13.256 | INFO     | syngen.ml.vae.models.model:fit_sampler:191 - Finished fitting BayesianGaussianMixture
2025-10-29 16:19:13.294 | INFO     | syngen.ml.vae.wrappers.wrappers:save_state:545 - Saved VAE state in model_artifacts/resources/housing-properties/vae/checkpoints
2025-10-29 16:19:13.294 | INFO     | syngen.ml.handlers.handlers:__fit_model:191 - Finished VAE training
2025-10-29 16:19:13.295 | INFO     | syngen.ml.handlers.handlers:handle:143

2025-10-29 16:19:29.023 | INFO     | syngen.ml.vae.models.model:fit_sampler:187 - Creating BayesianGaussianMixture
2025-10-29 16:19:29.023 | INFO     | syngen.ml.vae.models.model:fit_sampler:189 - Fitting BayesianGaussianMixture
2025-10-29 16:19:30.521 | INFO     | syngen.ml.vae.models.model:fit_sampler:191 - Finished fitting BayesianGaussianMixture
2025-10-29 16:19:30.554 | INFO     | syngen.ml.vae.wrappers.wrappers:save_state:545 - Saved VAE state in model_artifacts/resources/housing-conditions/vae/checkpoints
2025-10-29 16:19:30.554 | INFO     | syngen.ml.handlers.handlers:__fit_model:191 - Finished VAE training
2025-10-29 16:19:30.555 | INFO     | syngen.ml.handlers.handlers:handle:143

2025-10-29 16:19:30.904 | INFO     | syngen.ml.vae.wrappers.wrappers:load_state:554 - Loaded VAE state from model_artifacts/resources/housing-properties/vae/checkpoints
2025-10-29 16:19:30.905 | DEBUG    | syngen.ml.handlers.handlers:handle:490 - Infer model with parameters: size=790, run_parallel=False, batch_size=790, random_seed=1
2025-10-29 16:19:30.905 | INFO     | syngen.ml.handlers.handlers:handle:491 - Total of 1 batch(es)
2025-10-29 16:19:30.905 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_properties'. Generating the batch 1 of 1
2025-10-29 16:19:30.905 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-29 16:19:30.906 | I

2025-10-29 16:19:31.336 | INFO     | syngen.ml.vae.wrappers.wrappers:load_state:554 - Loaded VAE state from model_artifacts/resources/housing-conditions/vae/checkpoints
2025-10-29 16:19:31.337 | DEBUG    | syngen.ml.handlers.handlers:handle:490 - Infer model with parameters: size=1799, run_parallel=False, batch_size=1000, random_seed=1
2025-10-29 16:19:31.337 | INFO     | syngen.ml.handlers.handlers:handle:491 - Total of 2 batch(es)
2025-10-29 16:19:31.337 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_conditions'. Generating the batch 1 of 2
2025-10-29 16:19:31.337 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-29 16:19:31.338 |

Generation of the data...: 100%|██████████| 3/3 [00:00<00:00, 3031.30it/s]
2025-10-29 16:19:31.617 | INFO     | syngen.ml.handlers.handlers:generate_keys:428 - The 'households' assigned as a foreign_key feature
2025-10-29 16:19:31.645 | INFO     | syngen.ml.strategies.strategies:run:243 - Synthesis of the table - 'housing_conditions' was completed. Synthetic data saved in 'model_artifacts/tmp_store/housing-conditions/merged_infer_housing-conditions.csv'
2025-10-29 16:19:31.646 | INFO     | syngen.ml.reporters.reporters:_log_and_update_progress:300 - The calculation of sample metrics for the table - 'housing_conditions' has started
2025-10-29 16:19:32.421 | INFO     | syngen.ml.reporters.reporters:_log_and_update_progress:300 - The sample report of the table - 'housing_conditions' has b

The *'execution_artifacts'* attribute of the **Syngen** class provides information about the artifacts generated during the training process:

In [6]:
from pprint import pprint

pprint(launcher_for_multiple_tables.execution_artifacts)

{'housing_conditions': {'generated_reports': {'accuracy_report': 'model_artifacts/resources/housing-conditions/reports/accuracy-report-2025_10_29_16_19_36_864750.html',
                                              'sample_report': 'model_artifacts/resources/housing-conditions/reports/sample-report-2025_10_29_16_19_32_420554.html'},
                        'losses_path': 'model_artifacts/system_store/losses/losses-housing-conditions-2025-10-29-16-19-13-306306.csv',
                        'path_to_input_data': 'model_artifacts/tmp_store/housing-conditions/input_data_housing-conditions.pkl'},
 'housing_properties': {'generated_reports': {'accuracy_report': 'model_artifacts/resources/housing-properties/reports/accuracy-report-2025_10_29_16_19_46_776242.html',
                                              'sample_report': 'model_artifacts/resources/housing-properties/reports/sample-report-2025_10_29_16_19_38_027541.html'},
                        'losses_path': 'model_artifacts/system_sto

# Launch inference

You can start an inference process using the SDK entry point `Syngen().infer(...)`. The SDK mirrors the CLI options, so you can pass the same parameters programmatically. Below is a complete description of all available parameters:

```python
infer(
    size: int = 100,                      # the desired number of rows to generate
    run_parallel: bool = False,           # whether to use multiprocessing (feasible for tables > 5000 rows)
    batch_size: Optional[int] = None,     # an inference batch size
    random_seed: Optional[int] = None,    # if specified, generates a reproducible result
    reports: Union[str, List[str]] = "none",  # report types: 'none', 'accuracy', 'metrics_only', 'all'
    log_level: str = "INFO",              # a logging level
    fernet_key: Optional[str] = None      # a name of the environment variable containing the Fernet key for decrypting the data subset
)
```

### Parameters description:

- **`size`** *(int, default: 100)*: The desired number of synthetic rows to generate. Must be ≥ 1.

- **`run_parallel`** *(bool, default: False)*: Whether to use multiprocessing for data generation. Set to `True` to enable parallel processing, which is recommended when generating large tables (>5000 rows).

- **`batch_size`** *(Optional[int], default: None)*: The inference batch size. Must be ≥ 1. If specified, the generation is split into batches to optimize memory usage and save RAM.

- **`random_seed`** *(Optional[int], default: None)*: A random seed for reproducible generation. Must be ≥ 0.

- **`reports`** *(Union[str, List[str]], default: "none")*: Controls the generation of quality reports. Accepts a single string or a list of strings:
  - `"none"` - no reports are generated (default)
  - `"accuracy"` - generates an accuracy report that compares patterns in the original and synthetic data to verify the quality of the generated data
  - `"metrics_only"` - outputs metrics to stdout without generating an accuracy report
  - `"all"` - generates an accuracy report (same as `"accuracy"`)

  List example: `["accuracy", "metrics_only"]` to generate multiple report types

  *Note*: Report generation may require significant time for large generated tables (>10,000 rows)

- **`log_level`** *(str, default: "INFO")*: A logging level for the inference process. Accepted values: `"TRACE"`, `"DEBUG"`, `"INFO"`, `"WARNING"`, `"ERROR"`, `"CRITICAL"`.

- **`fernet_key`** *(Optional[str], default: None)*: The name of the environment variable containing a 44-character URL-safe base64-encoded Fernet key. When provided, the data subset is decrypted for report generation. **Important**: The key used during training must also be used during report generation to decrypt the data successfully.

*Note:* For full documentation, metadata file format, and additional details, please refer to [README.md](../README.md)
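
The interaction between `size` and `batch_size` can be illustrated with simple arithmetic. The logs in this notebook report 7 batches for `size=200, batch_size=32` and 2 batches for `size=1799, batch_size=1000`, which is consistent with ceiling division. The sketch below assumes that rule; `number_of_batches` is a hypothetical helper, and treating `batch_size=None` as a single batch is also an assumption.

```python
import math
from typing import Optional


def number_of_batches(size: int, batch_size: Optional[int] = None) -> int:
    """Estimate how many batches are needed to generate `size` rows."""
    if size < 1:
        raise ValueError("'size' must be >= 1")
    # Assumption: without a batch size, everything is generated in one batch.
    if batch_size is None:
        return 1
    if batch_size < 1:
        raise ValueError("'batch_size' must be >= 1")
    # The last batch may be smaller than the others, hence the ceiling.
    return math.ceil(size / batch_size)


print(number_of_batches(200, 32))     # 7
print(number_of_batches(1799, 1000))  # 2
```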

In [7]:
# Example: inference for a single table
from syngen.sdk import Syngen


launcher_for_single_table.infer(
    size=200,
    run_parallel=False,
    batch_size=32, 
    random_seed=42,
    reports="all"
)

2025-10-29 16:20:10.514 | INFO     | syngen.infer:validate_required_parameters:27 - The inference process will be executed according to the information mentioned in 'infer_settings' in the metadata file. If appropriate information is absent from the metadata file, then the values of parameters sent through CLI will be used. Otherwise, the values of parameters will be defaulted
2025-10-29 16:20:10.516 | INFO     | syngen.ml.worker.worker:_remove_existed_artifact:106 - The artifacts located in the path - 'model_artifacts/tmp_store/housing/merged_infer_housing.csv' was removed
2025-10-29 16:20:10.516 | INFO     | syngen.ml.worker.worker:_remove_existed_artifact:106 - The artifacts located in the path - 'model_artifacts/tmp_store/housing/infer_message.success' was removed
2025-10-29 16:20:10.517 | INFO     | syngen

2025-10-29 16:20:10.981 | INFO     | syngen.ml.vae.wrappers.wrappers:load_state:554 - Loaded VAE state from model_artifacts/resources/housing/vae/checkpoints
2025-10-29 16:20:10.982 | INFO     | syngen.ml.handlers.handlers:handle:491 - Total of 7 batch(es)
2025-10-29 16:20:10.982 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing'. Generating the batch 1 of 7
2025-10-29 16:20:10.982 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-29 16:20:10.983 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 10752.12it/s]
2025-10-29 16:20:11.152 |

Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 11462.69it/s]

2025-10-29 16:20:11.207 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-29 16:20:11.207 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing'. Generating the batch 3 of 7
2025-10-29 16:20:11.207 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-29 16:20:11.207 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 11707.01it/s]

2025-10-29 16:20:11.254 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-29 16:20:11.254 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing'. Generating the batch 4 of 7
2025-10-29 16:20:11.254 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-29 16:20:11.255 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 11322.05it/s]

2025-10-29 16:20:11.303 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-29 16:20:11.303 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing'. Generating the batch 5 of 7
2025-10-29 16:20:11.304 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-29 16:20:11.304 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 11173.98it/s]
2025-10-29 16:20:11.351 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values

2025-10-29 16:20:11.352 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing'. Generating the batch 6 of 7
2025-10-29 16:20:11.352 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-29 16:20:11.352 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 12837.32it/s]
2025-10-29 16:20:11.404 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-29 16:20:11.405 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing'. Generating the batch 7 of

Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 11736.80it/s]
2025-10-29 16:20:11.583 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-29 16:20:11.587 | INFO     | syngen.ml.strategies.strategies:run:243 - Synthesis of the table - 'housing' was completed. Synthetic data saved in 'model_artifacts/tmp_store/housing/merged_infer_housing.csv'
2025-10-29 16:20:11.588 | INFO     | syngen.ml.reporters.reporters:_log_and_update_progress:300 - The calculation of accuracy metrics for the table - 'housing' has started
2025-10-29 16:20:11.977 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch_metrics:171 - Median accuracy is 0.8722
Generating bivariate distributions...: 100%

In [8]:
launcher_for_single_table.execution_artifacts

{'housing': {'path_to_input_data': 'model_artifacts/tmp_store/housing/input_data_housing.pkl',
  'path_to_generated_data': 'model_artifacts/tmp_store/housing/merged_infer_housing.csv',
  'generated_reports': {'accuracy_report': 'model_artifacts/tmp_store/housing/reports/accuracy-report-2025_10_29_16_20_34_300574.html'}}}
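
Since `path_to_generated_data` points to a CSV file, the synthetic rows can be inspected with the standard library. This is a minimal sketch: `preview_generated_data` is a hypothetical helper, and it assumes a comma-delimited, UTF-8 encoded file with a header row.

```python
import csv
from typing import Dict, List


def preview_generated_data(
    execution_artifacts: Dict[str, dict], table_name: str, limit: int = 5
) -> List[Dict[str, str]]:
    """Return up to `limit` generated rows for `table_name` as dictionaries."""
    path = execution_artifacts[table_name]["path_to_generated_data"]
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # zip() stops as soon as `limit` rows have been read.
        return [row for _, row in zip(range(limit), reader)]
```

In a live session this could be called as `preview_generated_data(launcher_for_single_table.execution_artifacts, "housing")`. All values are returned as strings, because the `csv` module does not infer column types.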

In [9]:
# Example: inference for multiple tables with relationships

from syngen.sdk import Syngen


launcher_for_multiple_tables.infer(log_level="DEBUG")

2025-10-29 16:21:34.623 | INFO     | syngen.infer:validate_required_parameters:27 - The inference process will be executed according to the information mentioned in 'infer_settings' in the metadata file. If appropriate information is absent from the metadata file, then the values of parameters sent through CLI will be used. Otherwise, the values of parameters will be defaulted
2025-10-29 16:21:34.629 | DEBUG    | syngen.ml.validation_schema.validation_schema:validate_schema:300 - The schema of the metadata is valid
2025-10-29 16:21:34.629 | INFO     | syngen.ml.worker.worker:_remove_existed_artifact:106 - The artifacts located in the path - 'model_artifacts/tmp_store/housing-properties/merged_infer_housing-properties.csv' was removed
2025-10-29 16:21:34.629 | INFO     | syngen.ml.worker.worker:

2025-10-29 16:21:34.979 | INFO     | syngen.ml.vae.wrappers.wrappers:load_state:554 - Loaded VAE state from model_artifacts/resources/housing-properties/vae/checkpoints
2025-10-29 16:21:34.980 | DEBUG    | syngen.ml.handlers.handlers:handle:490 - Infer model with parameters: size=90, run_parallel=False, batch_size=90, random_seed=10, reports - 'accuracy'
2025-10-29 16:21:34.980 | INFO     | syngen.ml.handlers.handlers:handle:491 - Total of 1 batch(es)
2025-10-29 16:21:34.980 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_properties'. Generating the batch 1 of 1
2025-10-29 16:21:34.980 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-29 16:

2025-10-29 16:21:35.357 | INFO     | syngen.ml.vae.wrappers.wrappers:load_state:554 - Loaded VAE state from model_artifacts/resources/housing-conditions/vae/checkpoints
2025-10-29 16:21:35.358 | DEBUG    | syngen.ml.handlers.handlers:handle:490 - Infer model with parameters: size=90, run_parallel=False, batch_size=90, random_seed=10, reports - 'accuracy'
2025-10-29 16:21:35.358 | INFO     | syngen.ml.handlers.handlers:handle:491 - Total of 1 batch(es)
2025-10-29 16:21:35.358 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_conditions'. Generating the batch 1 of 1
2025-10-29 16:21:35.358 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-29 16:

The *'execution_artifacts'* attribute of the **Syngen** class provides information about the artifacts generated during the inference process:

In [10]:
from pprint import pprint


pprint(launcher_for_multiple_tables.execution_artifacts)

{'housing_conditions': {'generated_reports': {'accuracy_report': 'model_artifacts/tmp_store/housing-conditions/reports/accuracy-report-2025_10_29_16_21_39_909242.html'},
                        'path_to_generated_data': 'model_artifacts/tmp_store/housing-conditions/merged_infer_housing-conditions.csv',
                        'path_to_input_data': 'model_artifacts/tmp_store/housing-conditions/input_data_housing-conditions.pkl'},
 'housing_properties': {'generated_reports': {'accuracy_report': 'model_artifacts/tmp_store/housing-properties/reports/accuracy-report-2025_10_29_16_21_48_512312.html'},
                        'path_to_generated_data': 'model_artifacts/tmp_store/housing-properties/merged_infer_housing-properties.csv',
                        'path_to_input_data': 'model_artifacts/tmp_store/housing-properties/input_data_housing-properties.pkl'}}


# Data security: Using Fernet Key for encryption

In the current implementation, a sample of the original data is stored on disk during the training process. To protect sensitive information, you can encrypt this data with a **Fernet key**.

## What is a Fernet Key?

A Fernet key is a 44-character URL-safe base64-encoded string used for symmetric encryption. When a key is provided, the data subset is encrypted on disk (stored in `.dat` format instead of the unencrypted `.pkl` format).
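
The 44-character length follows from the key format: 32 random bytes encoded as URL-safe base64. The sketch below reproduces that format with the standard library only, purely as an illustration; for real keys, prefer `Fernet.generate_key()` from the `cryptography` package.

```python
import base64
import os

# A Fernet key is 32 random bytes, URL-safe base64-encoded
raw_key = os.urandom(32)
encoded_key = base64.urlsafe_b64encode(raw_key).decode("utf-8")

print(len(encoded_key))  # 44

# Decoding the string gives back the original 32 bytes
decoded_key = base64.urlsafe_b64decode(encoded_key)
print(len(decoded_key))  # 32
```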

## How to Generate a Fernet Key

You can generate a Fernet key using the following code:

In [11]:
# Generate a Fernet key
from cryptography.fernet import Fernet

fernet_key = Fernet.generate_key().decode("utf-8")

## Setting the Fernet Key as an environment variable

After generating the key, you need to store it as an environment variable. This can be done in your terminal or programmatically in Python.

### Option 1: Set in Terminal (Linux/macOS)

```bash
export MY_FERNET_KEY='your_generated_fernet_key_here'
```

### Option 2: Set in Terminal (Windows)

```cmd
set MY_FERNET_KEY=your_generated_fernet_key_here
```

### Option 3: Set programmatically in Python

```python
import os
os.environ['MY_FERNET_KEY'] = 'your_generated_fernet_key_here'
```

## Using the Fernet Key in training

When training with encryption, pass the name of the environment variable (not the key itself) to the `fernet_key` parameter:

In [12]:
# Example: training with Fernet key encryption

import os
from syngen.sdk import Syngen

# Step 1: Set the Fernet key as an environment variable
os.environ["MY_FERNET_KEY"] = fernet_key  # Using the key generated above

# Step 2: Train with encryption enabled

launcher_for_encrypted_data = Syngen(
    source="../examples/example-data/housing.csv", 
    table_name="housing_encrypted"
)

launcher_for_encrypted_data.train(
    epochs=5,
    row_limit=1000, 
    batch_size=32, 
    fernet_key="MY_FERNET_KEY"  # Pass the environment variable name, not the key itself
)

2025-10-29 16:22:59.165 | INFO     | syngen.train:validate_required_parameters:58 - The training process will be executed according to the information mentioned in 'train_settings' in the metadata file. If appropriate information is absent from the metadata file, then the values of parameters sent through CLI will be used. Otherwise, the values of parameters will be defaulted
2025-10-29 16:22:59.167 | INFO     | syngen.ml.worker.worker:_remove_existed_artifact:106 - The artifacts located in the path - 'model_artifacts/resources/housing-encrypted/' was removed
2025-10-29 16:22:59.170 | INFO     | syngen.ml.config.validation:_collect_errors:435 - The validation of the metadata has been passed successfully
2025-10-29 16:22:59.240 | INFO     | syngen.ml.processors.processors:_preprocess_data:149

2025-10-29 16:23:15.351 | INFO     | syngen.ml.vae.models.model:fit_sampler:187 - Creating BayesianGaussianMixture
2025-10-29 16:23:15.352 | INFO     | syngen.ml.vae.models.model:fit_sampler:189 - Fitting BayesianGaussianMixture
2025-10-29 16:23:17.226 | INFO     | syngen.ml.vae.models.model:fit_sampler:191 - Finished fitting BayesianGaussianMixture
2025-10-29 16:23:17.271 | INFO     | syngen.ml.vae.wrappers.wrappers:save_state:545 - Saved VAE state in model_artifacts/resources/housing-encrypted/vae/checkpoints
2025-10-29 16:23:17.271 | INFO     | syngen.ml.handlers.handlers:__fit_model:191 - Finished VAE training
2025-10-29 16:23:17.272 | INFO     | syngen.ml.handlers.handlers:handle:143

In [13]:
launcher_for_encrypted_data.execution_artifacts

{'housing_encrypted': {'losses_path': 'model_artifacts/system_store/losses/losses-housing-encrypted-2025-10-29-16-22-59-246000.csv',
  'path_to_input_data': 'model_artifacts/tmp_store/housing-encrypted/input_data_housing-encrypted.dat',
  'generated_reports': {}}}

## Using the Fernet Key in inference

**Important**: When generating synthetic data with inference, you must use the **same Fernet key** that was used during training. This allows the system to decrypt the stored data subset for report generation.

If the Fernet key is not provided or doesn't match the training key, the inference process will fail when trying to access the encrypted data.

In [14]:
# Example: Inference with Fernet key decryption

from syngen.sdk import Syngen

# The environment variable 'MY_FERNET_KEY' is already set from the training step

# Inference with the same Fernet key
launcher_for_encrypted_data.infer(
    size=200,
    batch_size=32, 
    random_seed=42,
    reports="all",
    fernet_key="MY_FERNET_KEY"  # Must use the same key as in training
)

2025-10-29 16:24:06.358 | INFO     | syngen.infer:validate_required_parameters:27 - The inference process will be executed according to the information mentioned in 'infer_settings' in the metadata file. If appropriate information is absent from the metadata file, then the values of parameters sent through CLI will be used. Otherwise, the values of parameters will be defaulted
2025-10-29 16:24:06.361 | INFO     | syngen.ml.worker.worker:_remove_existed_artifact:106 - The artifacts located in the path - 'model_artifacts/tmp_store/housing-encrypted/merged_infer_housing-encrypted.csv' was removed
2025-10-29 16:24:06.361 | INFO     | syngen.ml.worker.worker:_remove_existed_artifact:106 - The artifacts located in the path - 'model_artifacts/tmp_store/housing-encrypted/infer_message.success' was removed
2025-10-29 16:24:06.363 |

2025-10-29 16:24:06.834 | INFO     | syngen.ml.vae.wrappers.wrappers:load_state:554 - Loaded VAE state from model_artifacts/resources/housing-encrypted/vae/checkpoints
2025-10-29 16:24:06.834 | INFO     | syngen.ml.handlers.handlers:handle:491 - Total of 7 batch(es)
2025-10-29 16:24:06.834 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_encrypted'. Generating the batch 1 of 7
2025-10-29 16:24:06.835 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-29 16:24:06.835 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing_encrypted' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 8125.63it/s]
2025-

Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 8039.27it/s]

2025-10-29 16:24:07.059 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-29 16:24:07.060 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_encrypted'. Generating the batch 3 of 7
2025-10-29 16:24:07.060 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-29 16:24:07.060 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing_encrypted' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 12100.01it/s]

2025-10-29 16:24:07.108 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-29 16:24:07.108 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_encrypted'. Generating the batch 4 of 7
2025-10-29 16:24:07.108 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-29 16:24:07.109 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing_encrypted' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 8672.43it/s]
2025-10-29 16:24:07.163 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0

Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 12277.10it/s]
2025-10-29 16:24:07.219 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-29 16:24:07.219 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_encrypted'. Generating the batch 6 of 7
2025-10-29 16:24:07.219 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-29 16:24:07.220 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing_encrypted' started.

Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 13522.08it/s]
2025-10-29 16:24:07.267 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-29 16:24:07.268 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_encrypted'. Generating the batch 7 of 7
2025-10-29 16:24:07.268 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-29 16:24:07.268 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing_encrypted' started.

Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 11739.78it/s]
2025-10-29 16:24:07.448 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-29 16:24:07.451 | INFO     | syngen.ml.strategies.strategies:run:243 - Synthesis of the table - 'housing_encrypted' was completed. Synthetic data saved in 'model_artifacts/tmp_store/housing-encrypted/merged_infer_housing-encrypted.csv'
2025-10-29 16:24:07.451 | INFO     | syngen.ml.reporters.reporters:_log_and_update_progress:300 - The calculation of accuracy metrics for the table - 'housing_encrypted' has started
2025-10-29 16:24:07.452 | INFO     | syngen.ml.data_loaders.data_loaders:load_data:706 - Data stored at the path - 'model_artifacts/tmp_s

In [15]:
launcher_for_encrypted_data.execution_artifacts

{'housing_encrypted': {'path_to_input_data': 'model_artifacts/tmp_store/housing-encrypted/input_data_housing-encrypted.dat',
  'path_to_generated_data': 'model_artifacts/tmp_store/housing-encrypted/merged_infer_housing-encrypted.csv',
  'generated_reports': {'accuracy_report': 'model_artifacts/tmp_store/housing-encrypted/reports/accuracy-report-2025_10_29_16_24_30_237918.html'}}}

## Using the Fernet Key with the metadata file

You can also specify the Fernet key in the metadata file for both training and inference:

```yaml
global:
  encryption:
    fernet_key: MY_FERNET_KEY  # Name of the environment variable

TABLE_NAME:
  train_settings:
    source: "./data/table.csv"
  
  infer_settings:
    size: 100
  
  # You can also specify per-table encryption
  encryption:
    fernet_key: MY_FERNET_KEY
```

Then use it in your code:

```python
# Training with the metadata file and the Fernet key
Syngen(metadata_path="path/to/metadata.yaml").train()

# Inference with the metadata file and the Fernet key
Syngen(metadata_path="path/to/metadata.yaml").infer()
```
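
If the metadata file is produced by a script, it can be rendered from a small template. This is only a sketch using the standard library: `write_metadata` and `METADATA_TEMPLATE` are hypothetical names, the table name `housing` is an assumed example, and the layout simply follows the YAML shown above.

```python
import tempfile
from pathlib import Path

# Template following the YAML layout shown above
METADATA_TEMPLATE = """\
global:
  encryption:
    fernet_key: {env_var}

{table_name}:
  train_settings:
    source: "{source}"
  infer_settings:
    size: {size}
"""


def write_metadata(path: Path, table_name: str, source: str, env_var: str, size: int = 100) -> Path:
    """Render the template into a YAML file and return its path."""
    path.write_text(
        METADATA_TEMPLATE.format(table_name=table_name, source=source, env_var=env_var, size=size),
        encoding="utf-8",
    )
    return path


metadata_path = write_metadata(
    Path(tempfile.mkdtemp()) / "metadata.yaml",
    table_name="housing",          # assumed example table name
    source="./data/table.csv",
    env_var="MY_FERNET_KEY",       # name of the environment variable, not the key
)
metadata_text = metadata_path.read_text(encoding="utf-8")
print(metadata_text)
```

The resulting path could then be passed as `Syngen(metadata_path=str(metadata_path))`.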

## Important security notes

⚠️ **Critical security considerations:**

1. **Store the key securely**: Never hardcode the Fernet key directly in your code or commit it to version control systems.
2. **Key recovery is impossible**: If you lose the Fernet key, the encrypted data cannot be recovered.
3. **Same key required**: Always use the same Fernet key for training and inference.
4. **Environment variables**: Use environment variables to pass the key to the SDK.
5. **Key length**: The Fernet key must be exactly 44 characters (URL-safe base64-encoded).
6. **Production environments**: In production, use secure secret management services (AWS Secrets Manager, Azure Key Vault, HashiCorp Vault, etc.).
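
A quick sanity check before training or inference can catch a missing or malformed key early. The sketch below is an assumption about how such a check might look, not the SDK's own validation: `read_fernet_key` and the `DEMO_FERNET_KEY` variable are hypothetical, and only the standard library is used.

```python
import base64
import binascii
import os


def read_fernet_key(env_var: str) -> str:
    """Return the key stored in `env_var`, or raise ValueError if it is missing or malformed."""
    key = os.environ.get(env_var)
    if key is None:
        raise ValueError(f"Environment variable '{env_var}' is not set")
    if len(key) != 44:
        raise ValueError(f"Expected a 44-character key, got {len(key)} characters")
    try:
        decoded = base64.urlsafe_b64decode(key)
    except (binascii.Error, ValueError) as error:
        raise ValueError("The key is not valid URL-safe base64") from error
    if len(decoded) != 32:
        raise ValueError("The key must decode to exactly 32 bytes")
    return key


# Demonstration with a throwaway key stored under a hypothetical variable name
os.environ["DEMO_FERNET_KEY"] = base64.urlsafe_b64encode(os.urandom(32)).decode("utf-8")
print(len(read_fernet_key("DEMO_FERNET_KEY")))  # 44
```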

## What happens without a Fernet key?

If you don't provide a `fernet_key` parameter:
- Data subset is stored **unencrypted** in `.pkl` format
- No decryption is needed during inference
- Suitable for non-sensitive data or development environments

With a `fernet_key`:
- Data subset is stored **encrypted** in `.dat` format
- Decryption is required during inference using the same key
- Recommended for sensitive or production data
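
Since the file extension differs between the two cases, the `path_to_input_data` entry of `execution_artifacts` indicates whether a stored subset was encrypted. A minimal sketch follows; `input_data_is_encrypted` is a hypothetical helper, and the sample paths are copied from the outputs shown earlier in this notebook.

```python
from pathlib import Path


def input_data_is_encrypted(execution_artifacts: dict, table_name: str) -> bool:
    """Report whether the stored data subset of a table has the '.dat' (encrypted) extension."""
    path = execution_artifacts[table_name]["path_to_input_data"]
    return Path(path).suffix == ".dat"


# Sample values copied from the outputs shown earlier in this notebook
artifacts = {
    "housing_encrypted": {
        "path_to_input_data": "model_artifacts/tmp_store/housing-encrypted/input_data_housing-encrypted.dat"
    },
    "housing_conditions": {
        "path_to_input_data": "model_artifacts/tmp_store/housing-conditions/input_data_housing-conditions.pkl"
    },
}

print(input_data_is_encrypted(artifacts, "housing_encrypted"))   # True
print(input_data_is_encrypted(artifacts, "housing_conditions"))  # False
```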

# Generate reports separately

Sometimes you may want to generate reports after training and/or inference has already completed. The SDK provides the `Syngen().generate_reports(...)` method, which generates quality reports for a table from existing artifacts without re-running training or inference.

This method is useful when:
- You completed training/inference without reports (with `reports="none"`)
- You want to generate additional report types later
- You want to separate the computation-intensive training/inference from report generation


```python
generate_reports(
    table_name: str,                        # required: the name of the table to generate reports for
    reports: Union[str, List[str]],         # required: report types to generate
    fernet_key: Optional[str] = None        # optional: a Fernet key for decrypting encrypted data
)
```

### Parameters description:

- **`table_name`** *(str, required)*: The name of the table to generate reports for.

- **`reports`** *(Union[str, List[str]], required)*: Controls which quality reports to generate. Accepts a single string or a list of strings:
  - `"accuracy"` - generates an accuracy report comparing original and synthetic data
  - `"metrics_only"` - outputs metrics information to stdout without generating an accuracy report
  - `"sample"` - generates a sample report showing distribution comparisons between the original data and the data subset used for training
  - `"all"` - generates all available reports (*"accuracy"* and *"sample"*)
  
  List example: `["accuracy", "sample"]` to generate multiple report types.

  *Note*: Report generation may take significant time for large tables (>10,000 rows).

- **`fernet_key`** *(Optional[str], default: None)*: The name of the environment variable containing the Fernet key used to decrypt the original data subset. **Important**: Must be the same key used during training if the data was encrypted.
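
One way to read the `reports` values listed above is as a normalization step that turns a string or a list into concrete report types. The helper below is a hypothetical sketch of that reading, not the SDK's actual validation logic.

```python
from typing import List, Union

VALID_REPORTS = {"accuracy", "metrics_only", "sample", "all"}


def normalize_reports(reports: Union[str, List[str]]) -> List[str]:
    """Turn the `reports` argument into a de-duplicated list of concrete report types."""
    requested = [reports] if isinstance(reports, str) else list(reports)
    unknown = [name for name in requested if name not in VALID_REPORTS]
    if unknown:
        raise ValueError(f"Unknown report type(s): {unknown}")
    expanded: List[str] = []
    for name in requested:
        # "all" stands for the "accuracy" and "sample" reports
        for concrete in (["accuracy", "sample"] if name == "all" else [name]):
            if concrete not in expanded:
                expanded.append(concrete)
    return expanded


print(normalize_reports("all"))                   # ['accuracy', 'sample']
print(normalize_reports(["sample", "accuracy"]))  # ['sample', 'accuracy']
```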

### Required artifacts

To generate reports, the following artifacts must exist:

**For accuracy reports (`"accuracy"` or `"metrics_only"`):**
- Training must be completed successfully
- Inference must be completed successfully

**For sample reports (`"sample"`):**
- Training must be completed successfully
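
The prerequisites above can be expressed as a small lookup table. The sketch below is illustrative only: `REQUIRED_STAGES` and `missing_stages` are hypothetical names, and the stage labels `"train"` and `"infer"` are assumptions used to stand for completed training and inference.

```python
from typing import Iterable, Set

# Stages that must have completed before each report type can be built
REQUIRED_STAGES = {
    "accuracy": {"train", "infer"},
    "metrics_only": {"train", "infer"},
    "sample": {"train"},
}


def missing_stages(reports: Iterable[str], completed: Set[str]) -> Set[str]:
    """Return the stages still needed before the requested reports can be generated."""
    needed: Set[str] = set()
    for report in reports:
        needed |= REQUIRED_STAGES[report]
    return needed - completed


print(missing_stages(["sample"], completed={"train"}))              # set()
print(missing_stages(["accuracy", "sample"], completed={"train"}))  # {'infer'}
```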

### Key notes:

- The method uses existing artifacts and does not re-run training or inference
- All required artifacts must be present in the `model_artifacts` directory
- The `table_name` must match exactly what was used in training/inference
- If data was encrypted during training, the same `fernet_key` must be provided

*Note:* For full documentation and additional details, please refer to [README.md](../README.md)

In [16]:
# Example 1: Generate an accuracy report after training and inference completed without reports

from syngen.sdk import Syngen

# Assume training and inference were already completed with reports="none"
# Now generate an accuracy report separately
launcher_for_multiple_tables.generate_reports(
    table_name="housing_conditions",
    reports="accuracy"
)

2025-10-29 16:25:29.242 | INFO     | syngen.ml.reporters.reporters:_log_and_update_progress:300 - The calculation of accuracy metrics for the table - 'housing_conditions' has started
2025-10-29 16:25:29.399 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch_metrics:171 - Median accuracy is 0.8552
Generating bivariate distributions...: 100%|██████████| 3/3 [00:01<00:00,  1.60it/s]
2025-10-29 16:25:32.111 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch_metrics:199 - Median of differences of correlations is 0.1553
2025-10-29 16:25:32.260 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch_metrics:207 - Mean clusters homogeneity is 0.1889
2025-10-29 16:25:32.261 | INFO     | syngen.ml.metrics.metrics_classes.metrics

In [17]:
pprint(launcher_for_multiple_tables.execution_artifacts)

{'housing_conditions': {'generated_reports': {'accuracy_report': 'model_artifacts/tmp_store/housing-conditions/reports/accuracy-report-2025_10_29_16_25_33_032225.html'}}}


In [25]:
# Example 2: Generate a sample report after training completed

from syngen.sdk import Syngen

# Generate a sample report to compare original data with its subset
launcher_for_multiple_tables.generate_reports(
    table_name="housing_properties",
    reports="sample"
)

2025-10-29 16:27:31.220 | INFO     | syngen.ml.reporters.reporters:_log_and_update_progress:300 - The calculation of sample metrics for the table - 'housing_properties' has started
2025-10-29 16:27:32.453 | INFO     | syngen.ml.reporters.reporters:_log_and_update_progress:300 - The sample report of the table - 'housing_properties' has been generated


In [26]:
pprint(launcher_for_multiple_tables.execution_artifacts)

{'housing_properties': {'generated_reports': {'sample_report': 'model_artifacts/resources/housing-properties/reports/sample-report-2025_10_29_16_27_32_452396.html'}}}


In [27]:
# Example 3: Generate multiple reports at once

from syngen.sdk import Syngen

# Generate both accuracy and sample reports
launcher_for_multiple_tables.generate_reports(
    table_name="housing_conditions",
    reports=["accuracy", "sample"]
)

# Or use "all" to generate all available reports
# launcher_for_multiple_tables.generate_reports(
#     table_name="housing_conditions",
#     reports="all"
# )

2025-10-29 16:28:10.607 | INFO     | syngen.ml.reporters.reporters:_log_and_update_progress:300 - The calculation of accuracy metrics for the table - 'housing_conditions' has started
2025-10-29 16:28:10.770 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch_metrics:171 - Median accuracy is 0.8552
Generating bivariate distributions...: 100%|██████████| 3/3 [00:02<00:00,  1.21it/s]
2025-10-29 16:28:14.149 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch_metrics:199 - Median of differences of correlations is 0.1553
2025-10-29 16:28:14.301 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch_metrics:207 - Mean clusters homogeneity is 0.1889
2025-10-29 16:28:14.302 | INFO     | syngen.ml.metrics.metrics_classes.metrics

In [28]:
launcher_for_multiple_tables.execution_artifacts

{'housing_conditions': {'generated_reports': {'sample_report': 'model_artifacts/resources/housing-conditions/reports/sample-report-2025_10_29_16_28_15_875450.html'}}}

In [None]:
# Example 4: Generate reports for encrypted data

import os
from syngen.sdk import Syngen

# Ensure the Fernet key environment variable is set
# (Should be the same key used during training)
os.environ['MY_FERNET_KEY'] = fernet_key

# Generate reports with decryption
launcher_for_encrypted_data.generate_reports(
    table_name="housing_encrypted",
    reports="accuracy",
    fernet_key="MY_FERNET_KEY"  # Same key used in training
)

In [None]:
# Example 5: Generate metrics only (without a full accuracy report)

from syngen.sdk import Syngen

# Output metrics to stdout without generating HTML/PDF reports
launcher_for_multiple_tables.generate_reports(
    table_name="housing",
    reports="metrics_only"
)

# Data loading and saving: DataIO class

The SDK provides the `DataIO` class for loading and saving data in various supported formats with optional encryption and format settings. This class is useful when you need to:
- Load and save data in different file formats (CSV, Avro, Excel, etc.)
- Load and save data with specific format settings
- Work with encrypted data files

## Class initialization

```python
DataIO(
    path: str,                          # required: path to the data file
    fernet_key: Optional[str] = None,   # optional: Fernet key for encrypted data
    **kwargs                            # optional: format settings for CSV or Excel tables, or a schema for an Avro file
)
```

### Parameters description:

- **`path`** *(str, required)*: the path to the data file to load or save. Supported formats include:
  - CSV files: `.csv`, `.psv`, `.tsv`, `.txt`
  - Avro files: `.avro`
  - Excel files: `.xls`, `.xlsx`

- **`fernet_key`** *(Optional[str], default: None)*: the name of the environment variable containing the Fernet key for encrypted data operations.

- **`**kwargs`**: Optional format settings and/or a schema for reading and writing data. Available parameters depend on the file format:

  **For tables in '.csv' format:**
  - `sep` *(str)*: Delimiter to use (e.g., `','`, `';'`, `'\t'`)
  - `quotechar` *(str)*: Character used to denote the start and end of a quoted item (default: `'"'`)
  - `quoting` *(str)*: Quoting behavior - `"all"`, `"minimal"`, `"non-numeric"`, `"none"`
  - `escapechar` *(str)*: Character used to escape other characters
  - `encoding` *(str)*: Encoding to use (e.g., `'utf-8'`, `'latin-1'`)
  - `header` *(Optional[Union[int, List[int], Literal["infer"]]])*: Row number(s) containing the column labels and marking the start of the data
  - `skiprows` *(Optional[Union[int, List[int]]])*: Lines to skip at the start of the file
  - `on_bad_lines` *(Literal["error", "warn", "skip"])*: Action on bad lines - `"error"`, `"warn"`, `"skip"`
  - `engine` *(Optional[Literal["c", "python"]])*: Parser engine - `"c"`, `"python"`
  - `na_values` *(Optional[List[str]])*: Additional strings to recognize as NA/NaN

  **For Excel formats (.xls, .xlsx):**
  - `sheet_name` *(Optional[Union[str, int, List[Union[int, str]]]])*: Name or index of the sheet to read
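
To illustrate what the CSV settings `sep`, `quotechar` and `quoting` control, here is a standalone sketch using Python's built-in `csv` module rather than `DataIO` itself. The mapping from the string values listed above to `csv` constants is an assumption made for the illustration; how the SDK interprets them internally is not shown in this notebook.

```python
import csv
import io

# Assumed mapping from the string values listed above to `csv` module constants
QUOTING = {
    "all": csv.QUOTE_ALL,
    "minimal": csv.QUOTE_MINIMAL,
    "non-numeric": csv.QUOTE_NONNUMERIC,
    "none": csv.QUOTE_NONE,
}

rows = [["id", "comment"], ["1", 'said "hi", then left']]

# Write with ';' as the delimiter; only fields that need quoting are quoted
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", quotechar='"', quoting=QUOTING["minimal"])
writer.writerows(rows)
print(buffer.getvalue())

# Reading back with the same settings restores the original values
buffer.seek(0)
restored = list(csv.reader(buffer, delimiter=";", quotechar='"', quoting=QUOTING["minimal"]))
print(restored == rows)  # True
```

Reading a file with different settings than it was written with is a common source of shifted or merged columns, which is why these options are passed explicitly when a table does not use the defaults.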

## Available methods

`load_data(**kwargs)`

Loads data from the specified file path and returns it as a pandas DataFrame.

`load_schema()`

Returns the original schema of the loaded data, including column names and data types. Available only for data in the '.avro' format.

`save_data(df, **kwargs)`

Saves a pandas DataFrame to the specified file path, with or without the configured format settings or schema.

*Note:* For full documentation and additional details, please refer to [README.md](../README.md)

In [None]:
# Example 1: Load CSV data with default settings

from syngen.sdk import DataIO

data_io = DataIO(path="../examples/example-data/housing.csv")

df = data_io.load_data()

print(f"Loaded {df.shape[0]} rows and {df.shape[1]} columns")
print("\nFirst few rows:")
print(df.head())

In [None]:
# Example 2: Load CSV data with custom format settings

from syngen.sdk import DataIO

data_io = DataIO(
    path="../examples/example-data/escaped_quoted_table.csv",
    sep=',',           # delimiter
    quotechar='"',     # quote character
    quoting="minimal", # quoting style
    encoding='utf-8',  # encoding
    header=0           # use first row as header
)

df = data_io.load_data()

print(f"Loaded {df.shape[0]} rows and {df.shape[1]} columns")
print("\nFirst few rows:")
print(df.head())

In [None]:
# Example 3: Load Avro data and get its schema

from syngen.sdk import DataIO


data_io = DataIO(path="../examples/example-data/avro_file.avro")

df = data_io.load_data()

schema = data_io.load_schema()

print("Data schema:")
print(schema)

In [None]:
# Example 4: Save data to a file

import pandas as pd
from syngen.sdk import DataIO


sample_data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 40, 45],
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney']
})


data_io = DataIO(
    path="../examples/example-data/sample_output.csv",
    sep=',',
    encoding='utf-8'
)


data_io.save_data(sample_data)

print("Data saved successfully to '../examples/example-data/sample_output.csv'")