# Quick start

*EPAM Syngen* is an unsupervised tabular data generation tool based on a variational autoencoder (VAE).
It supports common tabular data types (floats, ints, datetime, text, categorical, binary) and can generate linked tables that share keys using a simple statistical approach.
The SDK exposes simple programmatic entry points for training, inference, and report generation, and for loading and saving data in the supported formats: *CSV*, *Avro*, and *Excel*. The data must be stored locally and encoded in UTF-8.

This notebook demonstrates how to use the SDK. After installing the package, use the main SDK class `Syngen` to run training, inference, or report generation, and the class `DataIO` to load and save data in the supported formats.

Python *3.10* or *3.11* is required to run the library. The library is tested on Linux and Windows operating systems.
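
A quick way to confirm that the interpreter meets this requirement before installing is a version check like the sketch below. The helper name `is_supported_python` and the constant `SUPPORTED_VERSIONS` are illustrative and are not part of the SDK.

```python
import sys

# Minor versions this notebook lists as supported (Python 3.10 and 3.11).
SUPPORTED_VERSIONS = {(3, 10), (3, 11)}


def is_supported_python(version_info=None) -> bool:
    """Return True if the (major, minor) pair is one of the supported versions."""
    info = sys.version_info if version_info is None else version_info
    return (info[0], info[1]) in SUPPORTED_VERSIONS


if __name__ == "__main__":
    current = f"{sys.version_info[0]}.{sys.version_info[1]}"
    status = "supported" if is_supported_python() else "not listed as supported"
    print(f"Python {current} is {status}")
```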

# Installation

Install the *syngen* library from PyPI (the command below uses TestPyPI as the primary index and PyPI as an extra index):

In [2]:
!pip install  --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ --use-pep517 --no-cache-dir syngen==0.10.28rc0

Looking in indexes: https://test.pypi.org/simple/, https://pypi.org/simple/


# Launch training

You can start the training process with the SDK entry point `Syngen.train(...)`. It trains a model and saves the model artifacts to disk. The SDK mirrors the CLI options, so you can pass the same parameters programmatically. All available parameters are described below:

```python
Syngen.train(
    metadata_path: Optional[str] = None,  # metadata file in the '.yaml' or '.yml' format; use it to train multiple tables or to provide all params in a file
    table_name: Optional[str] = None,     # required for single-table training
    source: Optional[str] = None,         # path to the original data
    epochs: int = 10,                     # number of training epochs
    drop_null: bool = False,              # whether to drop rows with at least one missing value
    row_limit: Optional[int] = None,      # maximum number of rows to train on
    reports: Union[str, List[str]] = "none",  # report types: 'none', 'accuracy', 'sample', 'metrics_only', 'all'
    log_level: str = "INFO",              # logging level
    batch_size: int = 32,                 # training batch size
    fernet_key: Optional[str] = None      # name of the environment variable containing the Fernet key for secure storage of the data subset
)
```

### Parameters description:

- **`metadata_path`** *(Optional[str], default: None)*: Path to a metadata file in the *'.yaml'* or *'.yml'* format for training multiple tables or specifying per-table settings. If provided, individual table parameters (`source`, `table_name`) are ignored in favor of metadata values.

- **`table_name`** *(Optional[str], default: None)*: **Required for single-table training**. An arbitrary name used to identify the table and name the artifact folders. Must be provided when not using `metadata_path`.

- **`source`** *(Optional[str], default: None)*: **Required for single-table training**. Path to the original data file.

- **`epochs`** *(int, default: 10)*: Number of training epochs. Must be ≥ 1. Because an early stopping mechanism is implemented, larger values are generally preferable.

- **`drop_null`** *(bool, default: False)*: Whether to drop rows containing at least one missing value before training. When `False`, missing values are handled during the training process.

- **`row_limit`** *(Optional[int], default: None)*: Maximum number of rows to use for training. If specified and less than the total rows, a random subset of the specified size will be selected. Useful for testing or working with large datasets.

- **`reports`** *(Union[str, List[str]], default: "none")*: Controls generation of quality reports. Accepts a single string or a list of strings:
  - `"none"` - no reports generated (default)
  - `"accuracy"` - generates an accuracy report that compares synthetic data (the same size as the original) with the original dataset to estimate the quality of the training process
  - `"sample"` - generates a sample report that compares the distributions of the original data and the subset of this data
  - `"metrics_only"` - outputs metrics to stdout
  - `"all"` - generates both accuracy and sample reports
  - List example: `["accuracy", "sample"]` to generate multiple report types
  - *Note*: Report generation may require significant time for large tables (>10,000 rows)

- **`log_level`** *(str, default: "INFO")*: Logging level for the training process. Accepted values: `"TRACE"`, `"DEBUG"`, `"INFO"`, `"WARNING"`, `"ERROR"`, `"CRITICAL"`.

- **`batch_size`** *(int, default: 32)*: Training batch size. Must be ≥ 1. Splits training into batches to optimize memory usage. Smaller batches use less RAM but may increase training time.

- **`fernet_key`** *(Optional[str], default: None)*: Name of the environment variable containing a 44-character URL-safe base64-encoded Fernet key. When provided, the data subset is encrypted on disk (stored in the `.dat` format). If not provided, the data is stored unencrypted in the `.pkl` format. **Important**: The same key must be used during inference and report generation to decrypt the data.
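
To make the `row_limit` behaviour described above concrete, here is a minimal sketch of selecting a random subset when the limit is smaller than the table. It only illustrates the idea and is not the library's implementation; the function `select_rows` and its `seed` argument are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def select_rows(rows: Sequence[dict], row_limit: Optional[int], seed: int = 0) -> List[dict]:
    """Return all rows, or a random subset of `row_limit` rows when the limit is smaller."""
    if row_limit is None or row_limit >= len(rows):
        return list(rows)
    # A fixed seed is used here only to make the illustration repeatable.
    rng = random.Random(seed)
    return rng.sample(list(rows), row_limit)


rows = [{"id": i} for i in range(10)]
subset = select_rows(rows, row_limit=4)
print(len(subset))
```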

### Key notes:

- **Single table training**: Both `source` and `table_name` are required when `metadata_path` is not provided.
- **Multiple tables training**: Use `metadata_path` to define multiple tables with or without relationships.
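
The two modes above can be summarised as a small pre-flight check. This is a sketch under the rules stated in this section, not the SDK's own validation; `check_train_args` is a hypothetical helper.

```python
from typing import Optional


def check_train_args(
    metadata_path: Optional[str] = None,
    source: Optional[str] = None,
    table_name: Optional[str] = None,
) -> str:
    """Return 'metadata' or 'single-table', or raise ValueError if arguments are incomplete."""
    if metadata_path is not None:
        # With a metadata file, `source` and `table_name` are ignored.
        return "metadata"
    if source is None or table_name is None:
        raise ValueError("Provide `metadata_path`, or both `source` and `table_name`.")
    return "single-table"


print(check_train_args(source="housing.csv", table_name="housing"))
```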

*Note:* For full documentation, metadata file format, and additional details, please refer to [README.md](../README.md)

In [3]:
# Example: training a single table

from syngen.sdk import Syngen

Syngen.train(
    source="../examples/example-data/housing.csv", 
    table_name="housing", 
    epochs=5,
    drop_null=False,
    row_limit=1000, 
    batch_size=32, 
    reports="all"
)

2025-10-16 09:25:11.134 | INFO     | syngen.train:launch_train:69 - The training process will be executed according to the information mentioned in 'train_settings' in the metadata file. If appropriate information is absent from the metadata file, then the values of parameters sent through CLI will be used. Otherwise, the values of parameters will be defaulted
2025-10-16 09:25:11.149 | INFO     | syngen.ml.config.validation:_collect_errors:435 - The validation of the metadata has been passed successfully
2025-10-16 09:25:11.229 | INFO     | syngen.ml.processors.processors:_preprocess_data:149 - The subset of rows was set to 1000
2025-10-16 09:25:11.233 | INFO     | syngen.ml.worker.worker:_train_table:433 - Training process of the table - 'housing' has started
2025-10-16 09:25

2025-10-16 09:25:24.446 | INFO     | syngen.ml.vae.models.model:fit_sampler:187 - Creating BayesianGaussianMixture
2025-10-16 09:25:24.447 | INFO     | syngen.ml.vae.models.model:fit_sampler:189 - Fitting BayesianGaussianMixture
2025-10-16 09:25:26.053 | INFO     | syngen.ml.vae.models.model:fit_sampler:191 - Finished fitting BayesianGaussianMixture
2025-10-16 09:25:26.092 | INFO     | syngen.ml.vae.wrappers.wrappers:save_state:545 - Saved VAE state in model_artifacts/resources/housing/vae/checkpoints
2025-10-16 09:25:26.092 | INFO     | syngen.ml.handlers.handlers:__fit_model:191 - Finished VAE training
2025-10-16 09:25:26.092 | INFO     | syngen.ml.handlers.handlers:handle:143 - No

2025-10-16 09:25:26.486 | INFO     | syngen.ml.vae.wrappers.wrappers:load_state:554 - Loaded VAE state from model_artifacts/resources/housing/vae/checkpoints
2025-10-16 09:25:26.486 | INFO     | syngen.ml.handlers.handlers:handle:491 - Total of 1 batch(es)
2025-10-16 09:25:26.486 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing'. Generating the batch 1 of 1
2025-10-16 09:25:26.486 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:25:26.487 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 9076.79it/s]
2025-10-16 09:25:26.728 | I

In [4]:
# Example: training multiple tables with relationships

from syngen.sdk import Syngen

Syngen.train(
    metadata_path="../examples/example-metadata/housing_metadata.yaml",
    log_level="DEBUG"
)

2025-10-16 09:26:48.628 | INFO     | syngen.train:launch_train:69 - The training process will be executed according to the information mentioned in 'train_settings' in the metadata file. If appropriate information is absent from the metadata file, then the values of parameters sent through CLI will be used. Otherwise, the values of parameters will be defaulted
2025-10-16 09:26:48.635 | DEBUG    | syngen.ml.validation_schema.validation_schema:validate_schema:300 - The schema of the metadata is valid
2025-10-16 09:26:48.639 | INFO     | syngen.ml.config.validation:_collect_errors:435 - The validation of the metadata has been passed successfully
2025-10-16 09:26:48.646 | INFO     | syngen.ml.processors.processors:_preprocess_data:149 - The subset of rows was set to 790

2025-10-16 09:27:00.871 | INFO     | syngen.ml.vae.models.model:fit_sampler:187 - Creating BayesianGaussianMixture
2025-10-16 09:27:00.872 | INFO     | syngen.ml.vae.models.model:fit_sampler:189 - Fitting BayesianGaussianMixture
2025-10-16 09:27:01.632 | INFO     | syngen.ml.vae.models.model:fit_sampler:191 - Finished fitting BayesianGaussianMixture
2025-10-16 09:27:01.664 | INFO     | syngen.ml.vae.wrappers.wrappers:save_state:545 - Saved VAE state in model_artifacts/resources/housing-properties/vae/checkpoints
2025-10-16 09:27:01.664 | INFO     | syngen.ml.handlers.handlers:__fit_model:191 - Finished VAE training
2025-10-16 09:27:01.664 | INFO     | syngen.ml.handlers.handlers:handle:143

2025-10-16 09:27:13.877 | INFO     | syngen.ml.vae.models.model:fit_sampler:187 - Creating BayesianGaussianMixture
2025-10-16 09:27:13.878 | INFO     | syngen.ml.vae.models.model:fit_sampler:189 - Fitting BayesianGaussianMixture
2025-10-16 09:27:15.241 | INFO     | syngen.ml.vae.models.model:fit_sampler:191 - Finished fitting BayesianGaussianMixture
2025-10-16 09:27:15.266 | INFO     | syngen.ml.vae.wrappers.wrappers:save_state:545 - Saved VAE state in model_artifacts/resources/housing-conditions/vae/checkpoints
2025-10-16 09:27:15.266 | INFO     | syngen.ml.handlers.handlers:__fit_model:191 - Finished VAE training
2025-10-16 09:27:15.266 | INFO     | syngen.ml.handlers.handlers:handle:143

2025-10-16 09:27:15.559 | INFO     | syngen.ml.vae.wrappers.wrappers:load_state:554 - Loaded VAE state from model_artifacts/resources/housing-properties/vae/checkpoints
2025-10-16 09:27:15.560 | DEBUG    | syngen.ml.handlers.handlers:handle:490 - Infer model with parameters: size=790, run_parallel=False, batch_size=790, random_seed=1
2025-10-16 09:27:15.560 | INFO     | syngen.ml.handlers.handlers:handle:491 - Total of 1 batch(es)
2025-10-16 09:27:15.560 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_properties'. Generating the batch 1 of 1
2025-10-16 09:27:15.560 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:27:15.560 | I

Generation of the data...: 100%|██████████| 3/3 [00:00<00:00, 4319.57it/s]

2025-10-16 09:27:16.041 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_conditions'. Generating the batch 2 of 2
2025-10-16 09:27:16.041 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:27:16.041 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing_conditions' started.
Generation of the data...: 100%|██████████| 3/3 [00:00<00:00, 4875.21it/s]
2025-10-16 09:27:16.119 | INFO     | syngen.ml.handlers.handlers:generate_keys:428 - The 'households' assigned as a foreign_key feature

# Launch inference

You can start the inference process with the SDK entry point `Syngen.infer(...)`. The SDK mirrors the CLI options, so you can pass the same parameters programmatically. All available parameters are described below:

```python
Syngen.infer(
    metadata_path: Optional[str] = None,  # metadata file in the '.yaml' or '.yml' format; use it to infer multiple tables or to provide all params in a file
    table_name: Optional[str] = None,     # required for single-table inference
    size: int = 100,                      # the desired number of rows to generate
    run_parallel: bool = False,           # whether to use multiprocessing (feasible for tables > 5000 rows)
    batch_size: Optional[int] = None,     # inference batch size
    random_seed: Optional[int] = None,    # if specified, generates a reproducible result
    reports: Union[str, List[str]] = "none",  # report types: 'none', 'accuracy', 'metrics_only', 'all'
    log_level: str = "INFO",              # logging level
    fernet_key: Optional[str] = None      # name of the environment variable containing the Fernet key for decrypting the data subset
)
```

### Parameters description:

- **`metadata_path`** *(Optional[str], default: None)*: Path to a metadata file in the *'.yaml'* or *'.yml'* format for inferring multiple tables or specifying per-table settings. If provided, the individual table parameter `table_name` is ignored in favor of metadata values.

- **`table_name`** *(Optional[str], default: None)*: **Required for single-table inference**. The name of the table to generate synthetic data for. Must match the name used during training.

- **`size`** *(int, default: 100)*: The desired number of synthetic rows to generate. Must be ≥ 1.

- **`run_parallel`** *(bool, default: False)*: Whether to use multiprocessing for data generation. Set to `True` to enable parallel processing, which is recommended when generating large tables (>5000 rows).

- **`batch_size`** *(Optional[int], default: None)*: Inference batch size. Must be ≥ 1. If specified, the generation is split into batches to optimize memory usage and save RAM.

- **`random_seed`** *(Optional[int], default: None)*: Random seed for reproducible generation. Must be ≥ 0.

- **`reports`** *(Union[str, List[str]], default: "none")*: Controls generation of quality reports. Accepts a single string or a list of strings:
  - `"none"` - no reports generated (default)
  - `"accuracy"` - generates an accuracy report that compares original and synthetic data patterns to verify the quality of the generated data
  - `"metrics_only"` - outputs metrics to stdout without generating an accuracy report
  - `"all"` - generates an accuracy report (same as `"accuracy"`)
  - List example: `["accuracy", "metrics_only"]` to generate multiple report types
  - *Note*: Report generation may require significant time for large generated tables (>10,000 rows)

- **`log_level`** *(str, default: "INFO")*: Logging level for the inference process. Accepted values: `"TRACE"`, `"DEBUG"`, `"INFO"`, `"WARNING"`, `"ERROR"`, `"CRITICAL"`.

- **`fernet_key`** *(Optional[str], default: None)*: Name of the environment variable containing a 44-character URL-safe base64-encoded Fernet key. When provided, the data subset is decrypted for report generation. **Important**: The same key used during training must be used during inference to successfully decrypt the data.
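
The relationship between `size` and `batch_size` can be sketched as a ceiling division. This is an illustration inferred from the log output in this notebook (for example, `size=200` with `batch_size=32` is reported as "Total of 7 batch(es)"), not the library's code; `expected_batches` is a hypothetical helper, and treating a missing `batch_size` as a single batch is an assumption.

```python
import math
from typing import Optional


def expected_batches(size: int, batch_size: Optional[int] = None) -> int:
    """Number of batches needed to generate `size` rows, `batch_size` rows at a time."""
    if size < 1:
        raise ValueError("size must be >= 1")
    if batch_size is None:
        # Assumption: without a batch size, everything is generated in one batch.
        return 1
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return math.ceil(size / batch_size)


print(expected_batches(200, 32))  # 7
print(expected_batches(90, 32))   # 3
```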

### Key notes:

- **Single table inference**: `table_name` is required when `metadata_path` is not provided, and it must match the name used during the training process.
- **Multiple tables inference**: Use `metadata_path` to generate multiple tables with or without relationships simultaneously.

*Note:* For full documentation, metadata file format, and additional details, please refer to [README.md](../README.md)
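
After inference completes, the logs in this notebook report where the synthetic table was written, for example `model_artifacts/tmp_store/housing/merged_infer_housing.csv`. The sketch below rebuilds that path from a table name. The layout and the underscore-to-hyphen rule are inferred only from those log messages, so treat them as assumptions that may not hold for other versions or output formats; `synthetic_csv_path` is a hypothetical helper, not an SDK function.

```python
from pathlib import Path


def synthetic_csv_path(table_name: str, root: str = "model_artifacts") -> Path:
    """Build the output path seen in this notebook's inference logs (assumed layout)."""
    # Assumption: folder and file names use hyphens where the table name has underscores.
    slug = table_name.replace("_", "-")
    return Path(root) / "tmp_store" / slug / f"merged_infer_{slug}.csv"


print(synthetic_csv_path("housing").as_posix())
print(synthetic_csv_path("housing_properties").as_posix())
```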

In [5]:
# Example: inference for a single table
from syngen.sdk import Syngen

Syngen.infer(
    table_name="housing", 
    size=200,
    run_parallel=False,
    batch_size=32, 
    random_seed=42,
    reports="all"
)

2025-10-16 09:27:37.243 | INFO     | syngen.infer:launch_infer:43 - The inference process will be executed according to the information mentioned in 'infer_settings' in the metadata file. If appropriate information is absent from the metadata file, then the values of parameters sent through CLI will be used. Otherwise, the values of parameters will be defaulted
2025-10-16 09:27:37.245 | INFO     | syngen.ml.worker.worker:_remove_existed_artifact:106 - The artifacts located in the path - 'model_artifacts/tmp_store/housing/merged_infer_housing.csv' was removed
2025-10-16 09:27:37.245 | INFO     | syngen.ml.worker.worker:_remove_existed_artifact:106 - The artifacts located in the path - 'model_artifacts/tmp_store/housing/infer_message.success' was removed
2025-10-16 09:27:37.246 | INFO     | syngen.ml.config.valid

2025-10-16 09:27:37.649 | INFO     | syngen.ml.vae.wrappers.wrappers:load_state:554 - Loaded VAE state from model_artifacts/resources/housing/vae/checkpoints
2025-10-16 09:27:37.649 | INFO     | syngen.ml.handlers.handlers:handle:491 - Total of 7 batch(es)
2025-10-16 09:27:37.649 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing'. Generating the batch 1 of 7
2025-10-16 09:27:37.650 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:27:37.650 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 13458.97it/s]

2025-10-16 09:27:37.799 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-16 09:27:37.800 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing'. Generating the batch 2 of 7
2025-10-16 09:27:37.800 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:27:37.800 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 13601.81it/s]
2025-10-16 09:27:37.842 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values

Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 14577.36it/s]

2025-10-16 09:27:37.884 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-16 09:27:37.884 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing'. Generating the batch 4 of 7
2025-10-16 09:27:37.884 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:27:37.885 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 15773.45it/s]

2025-10-16 09:27:37.927 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-16 09:27:37.927 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing'. Generating the batch 5 of 7
2025-10-16 09:27:37.927 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:27:37.928 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 12068.36it/s]
2025-10-16 09:27:37.968 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values

Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 15778.85it/s]
2025-10-16 09:27:38.008 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-16 09:27:38.009 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing'. Generating the batch 7 of 7
2025-10-16 09:27:38.009 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:27:38.009 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing' started.

Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 12952.65it/s]
2025-10-16 09:27:38.160 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-16 09:27:38.162 | INFO     | syngen.ml.strategies.strategies:run:242 - Synthesis of the table - 'housing' was completed. Synthetic data saved in 'model_artifacts/tmp_store/housing/merged_infer_housing.csv'
2025-10-16 09:27:38.163 | INFO     | syngen.ml.reporters.reporters:_log_and_update_progress:308 - The calculation of accuracy metrics for the table - 'housing' has started

In [6]:
# Example: inference for multiple tables with relationships

from syngen.sdk import Syngen

Syngen.infer(
    metadata_path="../examples/example-metadata/housing_metadata.yaml",
    log_level="DEBUG"
)

2025-10-16 09:28:01.939 | INFO     | syngen.infer:launch_infer:43 - The inference process will be executed according to the information mentioned in 'infer_settings' in the metadata file. If appropriate information is absent from the metadata file, then the values of parameters sent through CLI will be used. Otherwise, the values of parameters will be defaulted
2025-10-16 09:28:01.945 | DEBUG    | syngen.ml.validation_schema.validation_schema:validate_schema:300 - The schema of the metadata is valid
2025-10-16 09:28:01.945 | INFO     | syngen.ml.worker.worker:_remove_existed_artifact:106 - The artifacts located in the path - 'model_artifacts/tmp_store/housing-properties/merged_infer_housing-properties.csv' was removed
2025-10-16 09:28:01.945 | INFO     | syngen.ml.worker.worker:_remove_existe

2025-10-16 09:28:02.249 | INFO     | syngen.ml.vae.wrappers.wrappers:load_state:554 - Loaded VAE state from model_artifacts/resources/housing-properties/vae/checkpoints
2025-10-16 09:28:02.249 | DEBUG    | syngen.ml.handlers.handlers:handle:490 - Infer model with parameters: size=90, run_parallel=False, batch_size=32, random_seed=10, reports - 'accuracy'
2025-10-16 09:28:02.249 | INFO     | syngen.ml.handlers.handlers:handle:491 - Total of 3 batch(es)
2025-10-16 09:28:02.249 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_properties'. Generating the batch 1 of 3
2025-10-16 09:28:02.249 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:

2025-10-16 09:28:02.369 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_properties'. Generating the batch 2 of 3
2025-10-16 09:28:02.369 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:28:02.369 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing_properties' started.
Generation of the data...: 100%|██████████| 7/7 [00:00<00:00, 12072.42it/s]
2025-10-16 09:28:02.410 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_properties'. Generating the batch 3 of 3
2025-10-16 09:28:02.410 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:2

Generation of the data...: 100%|██████████| 7/7 [00:00<00:00, 11577.34it/s]
2025-10-16 09:28:02.532 | INFO     | syngen.ml.strategies.strategies:run:242 - Synthesis of the table - 'housing_properties' was completed. Synthetic data saved in 'model_artifacts/tmp_store/housing-properties/merged_infer_housing-properties.csv'
2025-10-16 09:28:02.532 | INFO     | syngen.ml.worker.worker:_infer_table:518 - Infer process of the table - 'housing_conditions' has started

2025-10-16 09:28:02.731 | INFO     | syngen.ml.handlers.handlers:handle:491 - Total of 3 batch(es)
2025-10-16 09:28:02.731 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_conditions'. Generating the batch 1 of 3
2025-10-16 09:28:02.732 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:28:02.732 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing_conditions' started.
Generation of the data...: 100%|██████████| 3/3 [00:00<00:00, 7157.52it/s]

2025-10-16 09:28:02.818 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_conditions'. Generating the batch 2 of 3
2025-10-16 09:28:02.818 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:28:02.818 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing_conditions' started.
Generation of the data...: 100%|██████████| 3/3 [00:00<00:00, 7235.72it/s]
2025-10-16 09:28:02.857 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_conditions'. Generating the batch 3 of 3
2025-10-16 09:28:02.858 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:28

Generation of the data...: 100%|██████████| 3/3 [00:00<00:00, 7073.03it/s]
2025-10-16 09:28:02.948 | INFO     | syngen.ml.handlers.handlers:generate_keys:428 - The 'households' assigned as a foreign_key feature
2025-10-16 09:28:02.952 | INFO     | syngen.ml.strategies.strategies:run:242 - Synthesis of the table - 'housing_conditions' was completed. Synthetic data saved in 'model_artifacts/tmp_store/housing-conditions/merged_infer_housing-conditions.csv'
2025-10-16 09:28:02.952 | INFO     | syngen.ml.reporters.reporters:_log_and_update_progress:308 - The calculation of accuracy metrics for the table - 'housing_conditions' has started

# Data security: Using Fernet Key for encryption

In the current implementation, a sample of the original data is stored on disk during training. To protect sensitive information, you can encrypt this data with a **Fernet key**.

## What is a Fernet Key?

A Fernet key is a 44-character URL-safe base64-encoded string used for symmetric encryption. When a key is provided, the data subset is encrypted on disk (stored in `.dat` format instead of the unencrypted `.pkl` format).
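
The 44-character length follows from the key format: a Fernet key is 32 random bytes, and URL-safe base64 encoding of 32 bytes yields 44 characters (including one `=` padding character). The sketch below illustrates this with the standard library only; it is meant to explain the format, not to replace the key generator from the `cryptography` package.

```python
import base64
import os

# 32 random bytes, URL-safe base64-encoded: the same shape as a Fernet key.
raw_key = os.urandom(32)
encoded_key = base64.urlsafe_b64encode(raw_key).decode("utf-8")

print(len(encoded_key))                            # 44
print(len(base64.urlsafe_b64decode(encoded_key)))  # 32
print(encoded_key.endswith("="))                   # True
```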

## How to Generate a Fernet Key

You can generate a Fernet key using the following code:

In [8]:
# Generate a Fernet key
from cryptography.fernet import Fernet

fernet_key = Fernet.generate_key().decode("utf-8")

## Setting the Fernet Key as an environment variable

After generating the key, you need to store it as an environment variable. This can be done in your terminal or programmatically in Python.

### Option 1: Set in Terminal (Linux/macOS)

```bash
export MY_FERNET_KEY='your_generated_fernet_key_here'
```

### Option 2: Set in Terminal (Windows Command Prompt)

```cmd
set MY_FERNET_KEY=your_generated_fernet_key_here
```

### Option 3: Set programmatically in Python

```python
import os
os.environ['MY_FERNET_KEY'] = 'your_generated_fernet_key_here'
```
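
Before starting a long training run, it can help to confirm that the variable is actually visible to the Python process and holds a well-formed key. The helper below is a minimal sketch: `check_fernet_env` and the `DEMO_FERNET_KEY` variable name are hypothetical and not part of the Syngen SDK. It checks only the format (44 characters that decode to 32 bytes), not whether the key matches any encrypted data.

```python
import base64
import os


def check_fernet_env(var_name: str) -> None:
    """Raise ValueError unless `var_name` holds a well-formed Fernet key."""
    value = os.environ.get(var_name)
    if not value:
        raise ValueError(f"Environment variable {var_name!r} is not set")
    try:
        raw = base64.urlsafe_b64decode(value)
    except ValueError as exc:  # binascii.Error is a subclass of ValueError
        raise ValueError(f"{var_name!r} is not valid URL-safe base64") from exc
    if len(value) != 44 or len(raw) != 32:
        raise ValueError(f"{var_name!r} must be 44 characters encoding 32 bytes")


# Demo with a throwaway key-shaped value (hypothetical variable name)
os.environ["DEMO_FERNET_KEY"] = base64.urlsafe_b64encode(os.urandom(32)).decode("utf-8")
check_fernet_env("DEMO_FERNET_KEY")  # passes silently
```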

## Using the Fernet Key in training

When training with encryption, pass the name of the environment variable (not the key itself) to the `fernet_key` parameter:

In [9]:
# Example: training with Fernet key encryption

import os
from syngen.sdk import Syngen

# Step 1: Set the Fernet key as an environment variable
os.environ['MY_FERNET_KEY'] = fernet_key  # Using the key generated above

# Step 2: Train with encryption enabled
Syngen.train(
    source="../examples/example-data/housing.csv", 
    table_name="housing_encrypted", 
    epochs=5,
    row_limit=1000, 
    batch_size=32, 
    fernet_key="MY_FERNET_KEY"  # Pass the environment variable name, not the key itself
)

2025-10-16 09:28:52.989 | INFO     | syngen.train:launch_train:69 - The training process will be executed according to the information mentioned in 'train_settings' in the metadata file. If appropriate information is absent from the metadata file, then the values of parameters sent through CLI will be used. Otherwise, the values of parameters will be defaulted
2025-10-16 09:28:52.994 | INFO     | syngen.ml.config.validation:_collect_errors:435 - The validation of the metadata has been passed successfully
2025-10-16 09:28:53.057 | INFO     | syngen.ml.processors.processors:_preprocess_data:149 - The subset of rows was set to 1000
2025-10-16 09:28:53.062 | INFO     | syngen.ml.worker.worker:_train_table:433 - Training process of the table - 'housing_encrypted' has started
2025-1

2025-10-16 09:29:06.169 | INFO     | syngen.ml.vae.models.model:fit_sampler:187 - Creating BayesianGaussianMixture
2025-10-16 09:29:06.169 | INFO     | syngen.ml.vae.models.model:fit_sampler:189 - Fitting BayesianGaussianMixture
2025-10-16 09:29:07.993 | INFO     | syngen.ml.vae.models.model:fit_sampler:191 - Finished fitting BayesianGaussianMixture
2025-10-16 09:29:08.029 | INFO     | syngen.ml.vae.wrappers.wrappers:save_state:545 - Saved VAE state in model_artifacts/resources/housing-encrypted/vae/checkpoints
2025-10-16 09:29:08.030 | INFO     | syngen.ml.handlers.handlers:__fit_model:191 - Finished VAE training
2025-10-16 09:29:08.030 | INFO     | syngen.ml.handlers.handlers:handle:143

## Using the Fernet Key in inference

**Important**: When generating synthetic data with inference, you must use the **same Fernet key** that was used during training. This allows the system to decrypt the stored data subset for report generation.

If the Fernet key is not provided or doesn't match the training key, the inference process will fail when trying to access the encrypted data.
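
One way to catch a key mismatch before inference starts is to record a short fingerprint of the key at training time and compare it later. The sketch below uses only `hashlib`; `key_fingerprint` and the `DEMO_*` variable names are hypothetical and not part of the Syngen SDK, and where the fingerprint is stored is left to you.

```python
import hashlib
import os


def key_fingerprint(var_name: str) -> str:
    """Return a short SHA-256 fingerprint of the key held in `var_name`."""
    key = os.environ[var_name]  # raises KeyError if the variable is not set
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


# Demo with placeholder values standing in for real keys
os.environ["DEMO_TRAIN_KEY"] = "placeholder-key-used-for-training"
os.environ["DEMO_INFER_KEY"] = "placeholder-key-used-for-training"

fingerprint_at_training = key_fingerprint("DEMO_TRAIN_KEY")
if key_fingerprint("DEMO_INFER_KEY") != fingerprint_at_training:
    raise ValueError("The Fernet key differs from the one used during training")
print("Key fingerprints match")
```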

In [11]:
# Example: Inference with Fernet key decryption

from syngen.sdk import Syngen

# The environment variable 'MY_FERNET_KEY' is already set from the training step

# Inference with the same Fernet key
Syngen.infer(
    table_name="housing_encrypted",  # Must match the name used in training
    size=200,
    batch_size=32, 
    random_seed=42,
    reports="all",
    fernet_key="MY_FERNET_KEY"  # Must use the same key as in training
)

2025-10-16 09:29:29.767 | INFO     | syngen.infer:launch_infer:43 - The inference process will be executed according to the information mentioned in 'infer_settings' in the metadata file. If appropriate information is absent from the metadata file, then the values of parameters sent through CLI will be used. Otherwise, the values of parameters will be defaulted
2025-10-16 09:29:29.771 | INFO     | syngen.ml.data_loaders.data_loaders:load_data:706 - Data stored at the path - 'model_artifacts/tmp_store/housing-encrypted/input_data_housing-encrypted.dat' has been successfully decrypted and loaded.
2025-10-16 09:29:29.771 | INFO     | syngen.ml.config.validation:_collect_errors:435 - The validation of the metadata has been passed successfully
2025-10-16 09:29:29.771 | INFO     | syngen.ml.worker.worker:_in

2025-10-16 09:29:30.164 | INFO     | syngen.ml.vae.wrappers.wrappers:load_state:554 - Loaded VAE state from model_artifacts/resources/housing-encrypted/vae/checkpoints
2025-10-16 09:29:30.165 | INFO     | syngen.ml.handlers.handlers:handle:491 - Total of 7 batch(es)
2025-10-16 09:29:30.165 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_encrypted'. Generating the batch 1 of 7
2025-10-16 09:29:30.165 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:29:30.165 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing_encrypted' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 13349.93it/s]

2025-10-16 09:29:30.313 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-16 09:29:30.313 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_encrypted'. Generating the batch 2 of 7
2025-10-16 09:29:30.313 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:29:30.313 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing_encrypted' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 12826.62it/s]
2025-10-16 09:29:30.355 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0



Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 15004.01it/s]

2025-10-16 09:29:30.394 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-16 09:29:30.394 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_encrypted'. Generating the batch 4 of 7
2025-10-16 09:29:30.395 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:29:30.395 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing_encrypted' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 13427.63it/s]

2025-10-16 09:29:30.434 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-16 09:29:30.434 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_encrypted'. Generating the batch 5 of 7
2025-10-16 09:29:30.434 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:29:30.434 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing_encrypted' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 14152.56it/s]

2025-10-16 09:29:30.474 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-16 09:29:30.475 | INFO     | syngen.ml.handlers.handlers:handle:503 - Data synthesis for the table - 'housing_encrypted'. Generating the batch 6 of 7
2025-10-16 09:29:30.475 | INFO     | syngen.ml.handlers.handlers:run:348 - Start data synthesis
2025-10-16 09:29:30.475 | INFO     | syngen.ml.handlers.handlers:run_separate:323 - VAE generation for 'housing_encrypted' started.
Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 17037.42it/s]
2025-10-16 09:29:30.513 | INFO     | syngen.ml.vae.wrappers.wrappers

Generation of the data...: 100%|██████████| 11/11 [00:00<00:00, 13036.83it/s]
2025-10-16 09:29:30.668 | INFO     | syngen.ml.vae.wrappers.wrappers:_restore_nan_values:135 - Column 'total_bedrooms' has 0 (0.0%) empty values generated
2025-10-16 09:29:30.671 | INFO     | syngen.ml.strategies.strategies:run:242 - Synthesis of the table - 'housing_encrypted' was completed. Synthetic data saved in 'model_artifacts/tmp_store/housing-encrypted/merged_infer_housing-encrypted.csv'
2025-10-16 09:29:30.671 | INFO     | syngen.ml.reporters.reporters:_log_and_update_progress:308 - The calculation of accuracy metrics for the table - 'housing_encrypted' has started
2025-10-16 09:29:30.672 | INFO     | syngen.ml.data_loaders.data_loaders:load_data:706 - Data stored at the path - 'model_artifacts/tmp_s

## Using the Fernet Key with the metadata file

You can also specify the Fernet key in the metadata file for both training and inference:

```yaml
global:
  encryption:
    fernet_key: MY_FERNET_KEY  # Name of the environment variable

TABLE_NAME:
  train_settings:
    source: "./data/table.csv"
  
  infer_settings:
    size: 100
  
  # You can also specify per-table encryption
  encryption:
    fernet_key: MY_FERNET_KEY
```

Then use it in your code:

```python
from syngen.sdk import Syngen

# Training with the metadata file and the Fernet key
Syngen.train(metadata_path="path/to/metadata.yaml")

# Inference with the metadata file and the Fernet key
Syngen.infer(metadata_path="path/to/metadata.yaml")
```
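
If you prefer to create the metadata file from Python, the structure above can be written with a plain string template. This is a minimal sketch: `write_metadata` is a hypothetical helper, the table name and paths are placeholders, and only the keys shown in the YAML example above are used.

```python
import tempfile
from pathlib import Path

METADATA_TEMPLATE = """\
global:
  encryption:
    fernet_key: {env_var}  # Name of the environment variable

{table_name}:
  train_settings:
    source: "{source}"

  infer_settings:
    size: {size}
"""


def write_metadata(path, table_name, source, env_var, size=100):
    """Write a metadata file that references the key by environment variable name."""
    path = Path(path)
    path.write_text(
        METADATA_TEMPLATE.format(
            env_var=env_var, table_name=table_name, source=source, size=size
        ),
        encoding="utf-8",
    )
    return path


# Demo: write to a temporary directory (placeholder table name and source path)
demo_dir = Path(tempfile.mkdtemp())
metadata_file = write_metadata(
    demo_dir / "metadata.yaml", "housing_encrypted", "./data/table.csv", "MY_FERNET_KEY"
)
print(metadata_file.read_text(encoding="utf-8"))
```

The resulting path can then be passed as `metadata_path` to `Syngen.train(...)` and `Syngen.infer(...)`.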

## Important security notes

⚠️ **Critical security considerations:**

1. **Store the key securely**: Never hardcode the Fernet key in your code or commit it to a version control system.
2. **Key recovery is impossible**: If you lose the Fernet key, the encrypted data cannot be recovered.
3. **Same key required**: Always use the same Fernet key for training and inference.
4. **Environment variables**: Use environment variables to pass the key to the SDK.
5. **Key length**: The Fernet key must be exactly 44 characters (URL-safe base64-encoded).
6. **Production environments**: In production, use a secure secret management service (AWS Secrets Manager, Azure Key Vault, HashiCorp Vault, etc.).
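
As a small step toward notes 1 and 4, the key can be kept in a file outside the repository and loaded into the environment at runtime, so it never appears in source code. The sketch below assumes a hypothetical `load_key_into_env` helper and a key file that you create and protect yourself; in production, a call to your secret manager would replace the file read.

```python
import os
import tempfile
from pathlib import Path


def load_key_into_env(key_file, var_name):
    """Read a Fernet key from `key_file` and expose it as an environment variable."""
    key = Path(key_file).read_text(encoding="utf-8").strip()
    if len(key) != 44:
        raise ValueError("A Fernet key must be exactly 44 characters long")
    os.environ[var_name] = key


# Demo with a throwaway file and a placeholder 44-character value
demo_file = Path(tempfile.mkdtemp()) / "fernet.key"
demo_file.write_text("A" * 43 + "=\n", encoding="utf-8")
load_key_into_env(demo_file, "DEMO_FERNET_KEY")
print(len(os.environ["DEMO_FERNET_KEY"]))  # 44
```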

## What happens without a Fernet key?

If you don't provide a `fernet_key` parameter:
- Data subset is stored **unencrypted** in `.pkl` format
- No decryption is needed during inference
- Suitable for non-sensitive data or development environments

With a `fernet_key`:
- Data subset is stored **encrypted** in `.dat` format
- Decryption is required during inference using the same key
- Recommended for sensitive or production data
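
To check which of the two modes a finished training run used, you can look at the file extension of the stored subset. The sketch below assumes the subset lives in a per-table directory such as `model_artifacts/tmp_store/housing-encrypted/` (the location that appears in the inference log above for the `.dat` file). The location of the `.pkl` file and the helper name `stored_subset_format` are assumptions.

```python
import tempfile
from pathlib import Path
from typing import Optional


def stored_subset_format(table_dir) -> Optional[str]:
    """Return 'encrypted', 'unencrypted', or None, based on file extensions."""
    table_dir = Path(table_dir)
    if not table_dir.is_dir():
        return None
    if any(table_dir.glob("*.dat")):
        return "encrypted"
    if any(table_dir.glob("*.pkl")):
        return "unencrypted"
    return None


# Demo with a throwaway directory that mimics an encrypted run
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "input_data_housing-encrypted.dat").write_bytes(b"")
print(stored_subset_format(demo_dir))  # encrypted
```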

# Generate reports separately

Sometimes you may want to generate reports after training and/or inference has already completed. The SDK provides the `Syngen().generate_reports(...)` method, which generates quality reports for a table from existing artifacts without re-running training or inference.

This method is useful when:
- You completed training/inference without reports (with `reports="none"`)
- You want to generate additional report types later
- You want to separate the computation-intensive training/inference from report generation


```python
Syngen().generate_reports(
    table_name: str,                        # required: table name used in training/inference
    reports: Union[str, List[str]],         # required: report types to generate
    fernet_key: Optional[str] = None        # optional: Fernet key for decrypting encrypted data
)
```

### Parameters description:

- **`table_name`** *(str, required)*: The name of the table to generate reports for. Must match exactly the name used during the training or inference process.

- **`reports`** *(Union[str, List[str]], required)*: Controls which quality reports to generate. Accepts a single string or a list of strings:
  - `"accuracy"` - generates an accuracy report comparing original and synthetic data
  - `"metrics_only"` - outputs metrics information to stdout without generating an accuracy report
  - `"sample"` - generates a sample report showing distribution comparisons between the original data and the data subset used for training
  - `"all"` - generates all available reports (*"accuracy"* and *"sample"*)
  
  List example: `["accuracy", "sample"]` to generate multiple report types.

  *Note*: Report generation may take significant time for large tables (>10,000 rows).

- **`fernet_key`** *(Optional[str], default: None)*: The name of the environment variable containing the Fernet key used to decrypt the original data subset. **Important**: Must be the same key used during training if the data was encrypted.
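
The accepted `reports` values can be summarized in a small normalization helper. This is a sketch of the rules listed above, not the SDK's own validation code. `normalize_reports` is a hypothetical name, and expanding `"all"` alongside other values is an assumption.

```python
from typing import List, Union

VALID_REPORTS = {"accuracy", "metrics_only", "sample", "all"}


def normalize_reports(reports: Union[str, List[str]]) -> List[str]:
    """Turn a string or list into a sorted list of concrete report types."""
    items = [reports] if isinstance(reports, str) else list(reports)
    unknown = sorted(set(items) - VALID_REPORTS)
    if unknown:
        raise ValueError(f"Unknown report type(s): {unknown}")
    selected = set(items)
    if "all" in selected:
        # "all" stands for the accuracy and sample reports
        selected.discard("all")
        selected |= {"accuracy", "sample"}
    return sorted(selected)


print(normalize_reports("all"))                   # ['accuracy', 'sample']
print(normalize_reports(["sample", "accuracy"]))  # ['accuracy', 'sample']
print(normalize_reports("metrics_only"))          # ['metrics_only']
```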

### Required artifacts

To generate reports, the following artifacts must exist:

**For accuracy reports (`"accuracy"` or `"metrics_only"`):**
- Training must be completed successfully
- Inference must be completed successfully

**For sample reports (`"sample"`):**
- Training must be completed successfully
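
The artifact requirements above amount to a lookup from report type to the stages that must have finished. A minimal sketch, with `required_stages` as a hypothetical helper name that is not part of the SDK:

```python
from typing import List, Set, Union

REQUIRED_STAGES = {
    "accuracy": {"training", "inference"},
    "metrics_only": {"training", "inference"},
    "sample": {"training"},
}


def required_stages(reports: Union[str, List[str]]) -> Set[str]:
    """Return the stages that must be complete before generating `reports`."""
    items = [reports] if isinstance(reports, str) else list(reports)
    stages: Set[str] = set()
    for report in items:
        # "all" covers the accuracy and sample reports
        names = ["accuracy", "sample"] if report == "all" else [report]
        for name in names:
            stages |= REQUIRED_STAGES[name]
    return stages


print(sorted(required_stages("sample")))  # ['training']
print(sorted(required_stages("all")))     # ['inference', 'training']
```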

### Key notes:

- The method uses existing artifacts and does not re-run training or inference
- All required artifacts must be present in the `model_artifacts` directory
- The `table_name` must match exactly what was used in training/inference
- If data was encrypted during training, the same `fernet_key` must be provided

*Note:* For full documentation and additional details, please refer to [README.md](../README.md)

In [12]:
# Example 1: Generate an accuracy report after training and inference completed without reports

from syngen.sdk import Syngen

# Assume training and inference were already completed with reports="none"
# Now generate an accuracy report separately
Syngen().generate_reports(
    table_name="housing",
    reports="accuracy"
)

2025-10-16 09:30:03.764 | INFO     | syngen.ml.reporters.reporters:_log_and_update_progress:308 - The calculation of accuracy metrics for the table - 'housing' has started
2025-10-16 09:30:04.116 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch_metrics:169 - Median accuracy is 0.8204
Generating bivariate distributions...: 100%|██████████| 45/45 [00:14<00:00,  3.10it/s]

2025-10-16 09:30:20.619 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch_metrics:197 - Median of differences of correlations is 0.7433
2025-10-16 09:30:20.619 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch

In [13]:
# Example 2: Generate a sample report after training completed

from syngen.sdk import Syngen

# Generate a sample report to compare original data with its subset
Syngen().generate_reports(
    table_name="housing",
    reports="sample"
)

2025-10-16 09:30:29.332 | INFO     | syngen.ml.reporters.reporters:_log_and_update_progress:308 - The calculation of accuracy metrics for the table - 'housing' has started
2025-10-16 09:30:29.680 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch_metrics:169 - Median accuracy is 0.8204
Generating bivariate distributions...: 100%|██████████| 45/45 [00:14<00:00,  3.10it/s]

2025-10-16 09:30:46.155 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch_metrics:197 - Median of differences of correlations is 0.7433
2025-10-16 09:30:46.155 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch

In [14]:
# Example 3: Generate multiple reports at once

from syngen.sdk import Syngen

# Generate both accuracy and sample reports
Syngen().generate_reports(
    table_name="housing",
    reports=["accuracy", "sample"]
)

# Or use "all" to generate all available reports
Syngen().generate_reports(
    table_name="housing",
    reports="all"
)

2025-10-16 09:30:52.845 | INFO     | syngen.ml.reporters.reporters:_log_and_update_progress:308 - The calculation of accuracy metrics for the table - 'housing' has started
2025-10-16 09:30:53.189 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch_metrics:169 - Median accuracy is 0.8204
Generating bivariate distributions...: 100%|██████████| 45/45 [00:14<00:00,  3.10it/s]

2025-10-16 09:31:09.668 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch_metrics:197 - Median of differences of correlations is 0.7433
2025-10-16 09:31:09.668 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch

In [16]:
# Example 4: Generate reports for encrypted data

import os
from syngen.sdk import Syngen

# Ensure the Fernet key environment variable is set
# (Should be the same key used during training)
os.environ['MY_FERNET_KEY'] = fernet_key

# Generate reports with decryption
Syngen().generate_reports(
    table_name="housing_encrypted",
    reports="accuracy",
    fernet_key="MY_FERNET_KEY"  # Same key used in training
)

2025-10-16 09:34:22.672 | INFO     | syngen.ml.reporters.reporters:_log_and_update_progress:308 - The calculation of accuracy metrics for the table - 'housing' has started
2025-10-16 09:34:23.017 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch_metrics:169 - Median accuracy is 0.8204
Generating bivariate distributions...: 100%|██████████| 45/45 [00:14<00:00,  3.07it/s]

2025-10-16 09:34:39.683 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch_metrics:197 - Median of differences of correlations is 0.7433
2025-10-16 09:34:39.683 | INFO     | syngen.ml.metrics.accuracy_test.accuracy_test:_fetch

ValueError: It seems that the Fernet key is absent

In [None]:
# Example 5: Generate metrics only (without a full accuracy report)

from syngen.sdk import Syngen

# Output metrics to stdout without generating HTML/PDF reports
Syngen().generate_reports(
    table_name="housing",
    reports="metrics_only"
)