# Quick start

*EPAM Syngen* is an unsupervised tabular data generation tool based on a variational autoencoder (VAE). 
It supports common tabular data types (floats, integers, datetime, text, categorical, binary) and can generate linked tables that share keys, using a simple statistical approach. 
The SDK exposes simple programmatic entry points for training, inference, report generation, and loading and saving data in the supported formats: *CSV*, *Avro* and *Excel*. The data must be stored locally and encoded in UTF-8.

This notebook demonstrates the SDK usage. After installing the package, call the main SDK class `Syngen` to run training, inference or report generation, and the class `DataIO` to load and save data in the supported formats.

Python *3.10* or *3.11* is required to run the library. The library is tested on Linux and Windows operating systems.
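
A notebook can check the interpreter version before anything is installed. The sketch below uses only the standard library; the helper name `is_supported_python` is hypothetical and is not part of the SDK.

```python
import sys

# Versions this guide lists as required
SUPPORTED_VERSIONS = {(3, 10), (3, 11)}


def is_supported_python(version=None):
    """Return True if the (major, minor) version is one listed in this guide."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_VERSIONS


if not is_supported_python():
    print(
        f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} "
        "is outside the versions listed in this guide (3.10, 3.11)"
    )
```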

# Installation

Install the *syngen* library with pip (the command below pulls this release candidate from the TestPyPI index and uses PyPI as an additional index):

In [None]:
!pip install  --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ --use-pep517 --no-cache-dir syngen==0.10.28rc45

# Class initialization

```python
Syngen(
    metadata_path: Optional[str] = None,  # use a metadata file in the '.yaml', '.yml' format for a training or an inference of one or multiple tables to centralize all parameters in one place
    table_name: Optional[str] = None,     # required for single-table training or inference process; an arbitrary string used to name the directories where artifacts are stored
    source: Optional[str] = None,         # required for single-table training or inference process; a path to the file that you want to use as a reference
)
```
### Attributes description:

- **`metadata_path`** *(Optional[str], default: None)*: a path to a metadata file in *'.yaml'* or *'.yml'* format, used during training or inference to centralize all parameters for one or multiple tables.
- **`table_name`** *(Optional[str], default: None)*: a required parameter for training or inference of a single table; an arbitrary string used to name the directories where artifacts are stored.
- **`source`** *(Optional[str], default: None)*: a required parameter for single-table training or inference process; a path to the file that you want to use as a reference.

***Note***: Both `source` and `table_name` are required when `metadata_path` is not provided. Use `metadata_path` to define multiple tables with or without relationships.
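
The rule in the note above can be restated as a small pre-flight check. This is only a sketch: `check_syngen_arguments` is a hypothetical helper, not an SDK function, and it says nothing about how the SDK behaves when all three arguments are passed together.

```python
from typing import Optional


def check_syngen_arguments(
    metadata_path: Optional[str] = None,
    table_name: Optional[str] = None,
    source: Optional[str] = None,
) -> str:
    """Return "metadata" or "single-table" depending on which arguments are usable.

    Hypothetical helper: it only restates the rule from the note above.
    """
    if metadata_path is not None:
        return "metadata"
    if table_name is not None and source is not None:
        return "single-table"
    raise ValueError(
        "Provide either 'metadata_path' or both 'table_name' and 'source'"
    )
```

Calling it with only `table_name` raises `ValueError`, which surfaces the mistake before a `Syngen` instance is created.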

# Launch training

You can start a training process using the SDK entrypoint `Syngen().train(...)`. This trains a model and saves the model artifacts to disk in the directory *'./model_artifacts'*. The SDK mirrors the CLI options, so you can pass the same parameters programmatically. Below is a complete description of all available parameters:

```python
train(
    self,
    epochs: int = 10,                     # a number of training epochs
    drop_null: bool = False,              # whether to drop rows with at least one missing value
    row_limit: Optional[int] = None,      # a number of rows to train over
    reports: Union[str, Tuple[str], List[str]] = "none",  # report types: "none", "accuracy", "sample", "metrics_only", "all"
    log_level: Literal["TRACE", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"] = "INFO", # a logging level
    batch_size: int = 32,                 # a training batch size
    fernet_key: Optional[str] = None      # a name of the environment variable containing the Fernet key for secure storage of the data subset
)
```

### Parameters description:

- **`epochs`** *(int, default: 10)*: The number of training epochs. Must be ≥ 1. Because an early stopping mechanism is implemented, a larger number of epochs is generally better.

- **`drop_null`** *(bool, default: False)*: Whether to drop rows containing at least one missing value before training. When `False`, missing values are handled during the training process.

- **`row_limit`** *(Optional[int], default: None)*: A maximum number of rows to use for training. If specified and less than the total rows, a random subset of the specified size will be selected. Useful for testing or working with large datasets.

- **`reports`** *(Union[str, Tuple[str], List[str]], default: "none")*: Controls generation of quality reports. Accepts a single string or a list of strings:
  - `"none"` - no reports generated (default)
  - `"accuracy"` - generates an accuracy report comparing synthetic data (same size as original) with original dataset to estimate the quality of training process
  - `"sample"` - generates a sample report showing distribution comparisons between the original data and the subset of this data
  - `"metrics_only"` - outputs metrics to stdout without generation of an accuracy report
  - `"all"` - generates both accuracy and sample reports

  List example: `["accuracy", "sample"]` to generate multiple report types

  *Note*: Report generation may require significant time for large tables (>10,000 rows)

- **`log_level`** *(str, default: "INFO")*: A logging level for the training process. Accepted values: `"TRACE"`, `"DEBUG"`, `"INFO"`, `"WARNING"`, `"ERROR"`, `"CRITICAL"`.

- **`batch_size`** *(int, default: 32)*: A training batch size. Must be ≥ 1. Splits training into batches to optimize memory usage. Smaller batches use less RAM but may increase training time.

- **`fernet_key`** *(Optional[str], default: None)*: The name of the environment variable containing a 44-character URL-safe base64-encoded Fernet key. When provided, the data subset is encrypted on disk (stored in the `.dat` format). If not provided, the data is stored unencrypted in the `.pkl` format. **Important**: The same key must be used during inference and report generation to decrypt the data.

*Note:* For full documentation, metadata file format, and additional details, please refer to [README.md](../README.md)
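
Since `reports` accepts either a single string or a list of strings, it can help to normalize and validate the value before a long training run starts. The sketch below is an illustration only: `normalize_reports` is a hypothetical helper, and how the SDK validates this parameter internally is not covered in this guide.

```python
from typing import List, Sequence, Union

# Report types listed above for training
TRAIN_REPORT_TYPES = ("none", "accuracy", "sample", "metrics_only", "all")


def normalize_reports(reports: Union[str, Sequence[str]]) -> List[str]:
    """Turn a single string or a list/tuple of strings into a validated list."""
    items = [reports] if isinstance(reports, str) else list(reports)
    unknown = [item for item in items if item not in TRAIN_REPORT_TYPES]
    if unknown:
        raise ValueError(f"Unknown report type(s): {unknown}")
    return items
```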

In [None]:
# The example of training the single table

from syngen.sdk import Syngen


launcher_for_single_table = Syngen(
    source="../examples/example-data/housing.csv", 
    table_name="housing"
)

launcher_for_single_table.train(
    epochs=5,
    drop_null=False,
    row_limit=1000, 
    batch_size=32, 
    reports="all"
)

In [None]:
# The example of training of multiple tables with relationships

from syngen.sdk import Syngen


launcher_for_multiple_tables = Syngen(
    metadata_path="../examples/example-metadata/housing_metadata.yaml"
)

launcher_for_multiple_tables.train(log_level="DEBUG")

The *'execution_artifacts'* attribute of the **Syngen** class provides information about the artifacts generated during the training process:

In [None]:
from pprint import pprint

pprint(launcher_for_multiple_tables.execution_artifacts)

# Launch generation of synthetic data

You can start an inference process using the SDK entrypoint `Syngen().infer(...)`. The SDK mirrors the CLI options so you can pass the same parameters programmatically. Below is a complete description of all available parameters:

```python
infer(
    self,
    size: int = 100,                      # the desired number of rows to generate
    run_parallel: bool = False,           # whether to use multiprocessing (feasible for tables > 5000 rows)
    batch_size: Optional[int] = None,     # an inference batch size
    random_seed: Optional[int] = None,    # if specified, generates a reproducible result
    reports: Union[str, List[str]] = "none",  # report types: "none", "accuracy", "metrics_only", "all"
    log_level: str = "INFO",              # a logging level
    fernet_key: Optional[str] = None      # a name of the environment variable containing the Fernet key for decrypting the data subset
)
```
### Parameters description:

- **`size`** *(int, default: 100)*: The desired number of synthetic rows to generate. Must be ≥ 1.

- **`run_parallel`** *(bool, default: False)*: Whether to use multiprocessing for data generation. Set to `True` to enable parallel processing, which is recommended and feasible for generating large tables (>5000 rows).

- **`batch_size`** *(Optional[int], default: None)*: The inference batch size. Must be ≥ 1. If specified, the generation is split into batches to optimize memory usage and save RAM.

- **`random_seed`** *(Optional[int], default: None)*: A random seed for reproducible generation. Must be ≥ 0.

- **`reports`** *(Union[str, Tuple[str], List[str]], default: "none")*: Controls generation of quality reports. Accepts a single string or a list of strings:
  - `"none"` - no reports generated (default)
  - `"accuracy"` - generates an accuracy report comparing original and synthetic data patterns to verify quality of a generated data
  - `"metrics_only"` - outputs metrics to stdout without generating an accuracy report
  - `"all"` - generates an accuracy report (same as `"accuracy"`)

  List example: `["accuracy", "metrics_only"]` to generate multiple report types

  *Note*: Report generation may require significant time for large generated tables (>10,000 rows)

- **`log_level`** *(str, default: "INFO")*: A logging level for the inference process. Accepted values: `"TRACE"`, `"DEBUG"`, `"INFO"`, `"WARNING"`, `"ERROR"`, `"CRITICAL"`.

- **`fernet_key`** *(Optional[str], default: None)*: The name of the environment variable containing a 44-character URL-safe base64-encoded Fernet key. When provided, the data subset is decrypted for report generation. **Important**: The key used during training must also be used during report generation to decrypt the data successfully.

*Note:* For full documentation, metadata file format, and additional details, please refer to [README.md](../README.md)
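
To get a feel for how `size` and `batch_size` interact, the arithmetic of splitting a request into batches can be sketched as follows. This is not the SDK's batching code, only an illustration; `plan_batches` is a hypothetical name.

```python
from typing import List, Optional


def plan_batches(size: int, batch_size: Optional[int] = None) -> List[int]:
    """Split `size` rows into chunks of at most `batch_size` rows."""
    if size < 1:
        raise ValueError("size must be >= 1")
    if batch_size is None:
        return [size]  # no batching requested: a single chunk
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    full, remainder = divmod(size, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


print(plan_batches(200, 32))  # [32, 32, 32, 32, 32, 32, 8]
```

With `size=200` and `batch_size=32`, the request splits into six full batches and one final batch of 8 rows.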

In [None]:
# The example of inference for the single table

launcher_for_single_table.infer(
    size=200,
    run_parallel=False,
    batch_size=32, 
    random_seed=42,
    reports="all"
)

In [None]:
# The example of inference of multiple tables with relationships


launcher_for_multiple_tables.infer(log_level="DEBUG")

The *'execution_artifacts'* attribute of the **Syngen** class provides information about the artifacts generated during the inference process:

In [None]:
pprint(launcher_for_multiple_tables.execution_artifacts)

# Data security: Using Fernet Key for encryption

In the current implementation, a sample of the original data is stored on disk during training. To protect sensitive information, you can use a **Fernet key** to encrypt this data.

## What is a Fernet Key?

A Fernet key is a 44-character URL-safe base64-encoded string used for symmetric encryption. When provided, the data subset is encrypted on disk (stored in the `.dat` format instead of the unencrypted `.pkl` format).
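
The 44-character length follows from the key being 32 random bytes in URL-safe base64 form. The sketch below checks that shape with the standard library only. `looks_like_fernet_key` is a hypothetical helper and a rough sanity check; it does not prove that a key can decrypt any particular file.

```python
import base64
import binascii


def looks_like_fernet_key(key: str) -> bool:
    """Check the shape of a key: 44 base64 characters that decode to 32 bytes."""
    if len(key) != 44:
        return False
    try:
        raw = base64.urlsafe_b64decode(key.encode("ascii"))
    except (binascii.Error, UnicodeEncodeError, ValueError):
        return False
    return len(raw) == 32
```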

## How to Generate a Fernet Key

You can generate a Fernet key using the following code:

In [None]:
# Generate a Fernet key
from cryptography.fernet import Fernet

fernet_key = Fernet.generate_key().decode("utf-8")

## Setting the Fernet Key as an environment variable

After generating the key, you need to store it as an environment variable. This can be done in your terminal or programmatically in Python.

### Option 1: Set in Terminal (Linux/macOS)

```bash
export MY_FERNET_KEY='your_generated_fernet_key_here'
```

### Option 2: Set in Terminal (Windows)

```cmd
set MY_FERNET_KEY=your_generated_fernet_key_here
```

### Option 3: Set programmatically in Python

```python
import os
os.environ['MY_FERNET_KEY'] = 'your_generated_fernet_key_here'
```
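
Because the SDK receives the name of the variable rather than the key itself, a typo in that name is easy to miss. A minimal pre-flight check, assuming nothing about the SDK's own error handling, could look like this (`read_key_from_env` is a hypothetical helper):

```python
import os


def read_key_from_env(variable_name: str) -> str:
    """Return the value of the environment variable or raise a clear error."""
    key = os.environ.get(variable_name)
    if not key:
        raise RuntimeError(
            f"Environment variable '{variable_name}' is not set or is empty"
        )
    return key
```

Running `read_key_from_env("MY_FERNET_KEY")` before training or inference confirms that the variable is visible to the current Python process.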

## Using the Fernet Key in a training

When training with encryption, pass the name of the environment variable (not the Fernet key itself) to the `fernet_key` parameter:

In [None]:
# The example: the training with the Fernet key encryption

import os
from syngen.sdk import Syngen

# Step 1: Set the Fernet key as an environment variable
os.environ["MY_FERNET_KEY"] = fernet_key  # Using the key generated above

# Step 2: Train with encryption enabled

launcher_for_encrypted_data = Syngen(
    source="../examples/example-data/housing.csv", 
    table_name="housing_encrypted"
)

launcher_for_encrypted_data.train(
    epochs=5,
    row_limit=1000, 
    batch_size=32, 
    fernet_key="MY_FERNET_KEY"  # Pass the environment variable name, not the key itself
)

## Using the Fernet Key in an inference

**Important**: When generating synthetic data during inference, you must use the **same Fernet key** that was used during training. This allows the system to decrypt the stored data subset for report generation.

If the Fernet key is not provided or doesn't match the Fernet key used in the training process, the inference process will fail when trying to access the encrypted data.

In [None]:
# Example: Inference with Fernet key decryption
# The environment variable 'MY_FERNET_KEY' is already set from the training step

# Inference with the same Fernet key
launcher_for_encrypted_data.infer(
    size=200,
    batch_size=32, 
    random_seed=42,
    reports="all",
    fernet_key="MY_FERNET_KEY"  # Must be the same key that was used for training
)

## Using the Fernet Key with the metadata file

You can also specify the Fernet key in the metadata file for both training and inference:

```yaml
global:
  encryption:
    fernet_key: MY_FERNET_KEY  # Name of the environment variable

TABLE_NAME:
  train_settings:
    source: "./data/table.csv"
  
  infer_settings:
    size: 100
  
  # You can also specify per-table encryption
  encryption:
    fernet_key: MY_FERNET_KEY
```

Then use it in your code:

```python
# Training with the metadata file and the Fernet key
Syngen(metadata_path="path/to/metadata.yaml").train()

# Inference with the metadata file and the Fernet key
Syngen(metadata_path="path/to/metadata.yaml").infer()
```

## Important security notes

⚠️ **Critical security considerations:**

1. **Store the key securely**: Never hardcode the Fernet key directly in your code or commit it to version control systems.
2. **Key recovery is impossible**: If you lose the Fernet key, the encrypted data cannot be recovered.
3. **Same key required**: Always use the same Fernet key for training, inference and report generation.
4. **Environment variable**: Use an environment variable to store the Fernet key securely.
5. **Key length**: The Fernet key must be exactly 44 characters (URL-safe base64-encoded).
6. **Production environments**: In production, use secure secret management services (AWS Secrets Manager, Azure Key Vault, HashiCorp Vault, etc.).

## What happens without a Fernet key?

If you don't provide a `fernet_key` parameter:
- Data subset is stored **unencrypted** in the `.pkl` format
- No decryption is needed during the inference or the report generation
- Suitable for non-sensitive data or development environments

With a `fernet_key`:
- Data subset is stored **encrypted** in the `.dat` format
- Decryption is required during the inference or the report generation using the same Fernet key
- Recommended for sensitive or production data

# Generate quality reports separately

Sometimes you may want to generate quality reports after training and/or inference has already completed, to evaluate the quality of the input data or the generated data. The SDK provides the `Syngen().generate_quality_reports(...)` method, which generates quality reports for a table from existing artifacts without re-running training or inference.

This method is useful when:
- You completed training/inference without quality reports (with `reports="none"`)
- You want to generate additional report types later
- You want to separate the computation-intensive training/inference from report generation


```python
generate_quality_reports(
    self,
    table_name: str,                        # required: the name of the table to generate quality reports for
    reports: Union[str, Tuple[str], List[str]], # required: report types to generate
    fernet_key: Optional[str] = None        # optional: the name of the environment variable containing the Fernet key for decrypting encrypted data
)
```

### Parameters description:

- **`table_name`** *(str, required)*: The name of the table to generate reports for.

- **`reports`** *(Union[str, Tuple[str], List[str]], required)*: Controls which quality reports to generate. Accepts a single string or a list of strings:
  - `"accuracy"` - generates an accuracy report comparing original and synthetic data
  - `"metrics_only"` - outputs metrics information to stdout without generating an accuracy report
  - `"sample"` - generates a sample report showing distribution comparisons between original data and the data subset used for a training process
  - `"all"` - generates all available reports (*"accuracy"* and *"sample"*)
  
  List example: `["accuracy", "sample"]` to generate multiple report types.

  *Note*: Report generation may require significant time for large tables (>10,000 rows)

- **`fernet_key`** *(Optional[str], default: None)*: The name of the environment variable containing the Fernet key used to decrypt the original data subset. **Important**: Must be the same key used during training if the data was encrypted.

### Required artifacts

To generate quality reports, the following artifacts must exist:

**For accuracy reports (`"accuracy"` or `"metrics_only"`):**
- Training must be completed successfully
- Inference must be completed successfully

**For a `"sample"` report:**
- Training must be completed successfully
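
The dependencies listed above can be written down as a small lookup, which is handy when deciding whether inference still has to be run before asking for a report. The stage labels `"train"` and `"infer"` and the helper `required_stages` are hypothetical names used for illustration; they are not SDK identifiers.

```python
from typing import Iterable, Set, Union

# Stages each report type depends on, as listed above
REQUIRED_STAGES = {
    "accuracy": {"train", "infer"},
    "metrics_only": {"train", "infer"},
    "sample": {"train"},
}


def required_stages(reports: Union[str, Iterable[str]]) -> Set[str]:
    """Return the stages that must have completed for the requested reports."""
    items = [reports] if isinstance(reports, str) else list(reports)
    stages: Set[str] = set()
    for item in items:
        if item == "all":  # "all" covers "accuracy" and "sample"
            stages |= REQUIRED_STAGES["accuracy"] | REQUIRED_STAGES["sample"]
        else:
            stages |= REQUIRED_STAGES[item]
    return stages
```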

### Key notes:

- The method uses existing artifacts and does not re-run training or inference
- All required artifacts must be present in the `model_artifacts` directory
- The `table_name` must exactly match the `table_name` used during training/inference
- If the data was encrypted during training, the same `fernet_key` must be provided

*Note:* For full documentation and additional details, please refer to [README.md](../README.md)

In [None]:
# Example 1: Generate an accuracy report after a training and inference completed without a generation of quality reports

# Assume a training and inference were already completed with reports="none"
# Now generate an accuracy report separately
launcher_for_multiple_tables.generate_quality_reports(
    table_name="housing_conditions",
    reports="accuracy"
)

In [None]:
# Example 2: Generate a sample report after training completed

# Generate a sample report to compare original data with its subset
launcher_for_multiple_tables.generate_quality_reports(
    table_name="housing_properties",
    reports="sample"
)

In [None]:
# Example 3: Generate multiple reports at once

# Generate both accuracy and sample reports
launcher_for_multiple_tables.generate_quality_reports(
    table_name="housing_conditions",
    reports=["accuracy", "sample"]
)

# Or use "all" to generate all available reports
launcher_for_multiple_tables.generate_quality_reports(
    table_name="housing_conditions",
    reports="all"
)

In [None]:
# Example 4: Generate quality reports for encrypted data
import os

# Ensure the Fernet key environment variable is set
# (Should be the same key used during training)
os.environ['MY_FERNET_KEY'] = fernet_key

# Generate quality reports with decryption
launcher_for_encrypted_data.generate_quality_reports(
    table_name="housing_encrypted",
    reports="accuracy",
    fernet_key="MY_FERNET_KEY"  # Same key used in training
)

In [None]:
# Example 5: Generate the "metrics_only" report (without a full accuracy report)
# Output metrics to stdout


launcher_for_multiple_tables.generate_quality_reports(
    table_name="housing",
    reports="metrics_only"
)

# Data loading and saving: DataIO class

The SDK provides the `DataIO` class for loading and saving data in various supported formats with optional encryption and format settings. This class is useful when you need to:
- Load and save data in different file formats (*CSV*, *Avro*, *Excel*, etc.)
- Load and save data with specific format settings
- Work with encrypted data files

## Class initialization

```python
DataIO(
    path: str,                          # required: a path to the data file
    fernet_key: Optional[str] = None,   # optional: the name of the environment variable containing the Fernet key for encrypted data
    **kwargs                            # optional: format settings for CSV or Excel tables, or schema for AVRO file
)
```

### Parameters description:

- **`path`** *(str, required)*: the path to the data file to load or save. Supported formats include:
  - CSV files: `.csv`, `.psv`, `.tsv`, `.txt`
  - Avro files: `.avro`
  - Excel files: `.xls`, `.xlsx`

- **`fernet_key`** *(Optional[str], default: None)*: the name of the environment variable containing the Fernet key for encrypted data operations.

- **`**kwargs`**: Optional format settings and/or a schema for reading and writing data. Available parameters depend on the file format:

  **For tables in '.csv', '.psv', '.tsv', '.txt' formats:**
  - `sep` *(str)*: Delimiter to use (e.g., `','`, `';'`, `'\t'`)
  - `quotechar` *(str)*: Character used to denote the start and end of a quoted item (default: `'"'`)
  - `quoting` *(str)*: Quoting behavior - `"all"`, `"minimal"`, `"non-numeric"`, `"none"`
  - `escapechar` *(str)*: Character used to escape other characters
  - `encoding` *(str)*: Encoding to use (e.g., `'utf-8'`, `'latin-1'`)
  - `header` *(Optional[Union[int, List[int], Literal["infer"]]])*: Row number(s) containing column labels and marking the start of the data
  - `skiprows` *(Optional[Union[int, List[int]]])*: Lines to skip at the start of the file
  - `on_bad_lines` *(Literal["error", "warn", "skip"])*: Action on bad lines - `"error"`, `"warn"`, `"skip"`
  - `engine` *(Optional[Literal["c", "python"]])*: Parser engine - `"c"`, `"python"`
  - `na_values` *(Optional[List[str]])*: Additional strings to recognize as NA/NaN

  **For Excel formats (.xls, .xlsx):**
  - `sheet_name` *(Optional[Union[str, int, List[Union[int, str]]]])*: Name or index of the sheet to read
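
The effect of the `quoting` option is easiest to see on a single row. The sketch below uses Python's standard `csv` module rather than `DataIO`, and the mapping from the option strings to the `csv.QUOTE_*` constants is an assumption made for illustration; `render_row` is a hypothetical helper.

```python
import csv
import io

# Assumed correspondence between the option strings above and csv constants
QUOTING_MODES = {
    "all": csv.QUOTE_ALL,
    "minimal": csv.QUOTE_MINIMAL,
    "non-numeric": csv.QUOTE_NONNUMERIC,
    "none": csv.QUOTE_NONE,
}


def render_row(row, quoting: str, sep: str = ",", quotechar: str = '"') -> str:
    """Write one row with the given quoting mode and return the resulting line."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=sep,
        quotechar=quotechar,
        quoting=QUOTING_MODES[quoting],
        escapechar="\\",  # needed by QUOTE_NONE when a field contains the delimiter
        lineterminator="\n",
    )
    writer.writerow(row)
    return buffer.getvalue().rstrip("\n")


for mode in QUOTING_MODES:
    print(f"{mode:12} {render_row([1, 'Paris', 'New York, NY'], mode)}")
```

Only the field containing the delimiter is quoted in `"minimal"` mode, every string field is quoted in `"non-numeric"` mode, and `"none"` falls back to the escape character instead of quotes.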

### Available methods

`load_data(**kwargs)`

Loads data from the specified file path and returns it as a pandas DataFrame.

`load_schema()`

Returns the original schema of the loaded data, including column names and data types. Available only for data in the *'.avro'* format.

`save_data(df, **kwargs)`

Saves a pandas DataFrame to the specified file path, with or without the configured format settings or schema.

*Note:* For full documentation and additional details, please refer to [README.md](../README.md)

In [None]:
# The example 1: Load CSV data with default settings

from syngen.sdk import DataIO

data_io = DataIO(path="../examples/example-data/housing.csv")

df = data_io.load_data()

print(f"Loaded {df.shape[0]} rows and {df.shape[1]} columns")
print("\nFirst few rows:")
print(df.head())

In [None]:
# The example 2: Load CSV data with custom format settings

from syngen.sdk import DataIO

data_io = DataIO(
    path="../examples/example-data/escaped_quoted_table.csv",
    sep=',',           # delimiter
    quotechar='"',     # quote character
    quoting="minimal", # quoting style
    encoding='utf-8',  # encoding
    header=0           # use first row as header
)

df = data_io.load_data()

print(f"Loaded {df.shape[0]} rows and {df.shape[1]} columns")
print("\nFirst few rows:")
print(df.head())

In [None]:
# The example 3: Load data and get schema information

from syngen.sdk import DataIO


data_io = DataIO(path="../examples/example-data/avro_file.avro")

df = data_io.load_data()

schema = data_io.load_schema()

print("Data schema:")
print(schema)

In [None]:
# The example 4: Save data to a file

import pandas as pd
from syngen.sdk import DataIO


sample_data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 40, 45],
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney']
})


data_io = DataIO(
    path="../examples/example-data/sample_output.csv",
    sep=',',
    encoding='utf-8'
)


data_io.save_data(sample_data)

print("Data saved successfully to '../examples/example-data/sample_output.csv'")