# Quick start

*EPAM Syngen* is an unsupervised tabular data generation tool based on a variational autoencoder (VAE). 
It supports common tabular data types (floats, ints, datetime, text, categorical, binary) and can generate linked tables that share keys using a simple statistical approach. 
The SDK exposes simple programmatic entry points for training, inference, and report generation, and for loading and saving data in the supported formats: *CSV*, *Avro*, and *Excel*. The data must be stored locally and encoded in UTF-8.
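Because the input is expected to be UTF-8 encoded, it can help to check a text file such as a CSV before training. The helper below is a minimal sketch using only the standard library; the name `is_utf8` is hypothetical and is not part of the SDK. The check is only meaningful for text formats, since Avro and Excel files are binary containers.

```python
import codecs
from pathlib import Path


def is_utf8(path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` decodes cleanly as UTF-8.

    Hypothetical helper for illustration; not part of the syngen SDK.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        with Path(path).open("rb") as handle:
            # Decode chunk by chunk so large files are never fully loaded into memory.
            while chunk := handle.read(chunk_size):
                decoder.decode(chunk)
            # final=True raises if the file ends in the middle of a multi-byte character.
            decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

A call such as `is_utf8('./examples/example-data/housing.csv')` returns a boolean that can gate the training step.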

This notebook demonstrates how to use the SDK. After installing the package, call the main SDK class `Syngen` to run training, inference, or report generation, and use `DataIO` to load and save data in the supported formats.

Python *3.10* or *3.11* is required to run the library. The library is tested on Linux and Windows operating systems.
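The version requirement can be checked up front before installing anything. The snippet below is a small sketch using only the standard library; `SUPPORTED_VERSIONS` and `is_supported_python` are hypothetical names introduced for illustration.

```python
import sys

# Versions named in the requirement above (major, minor).
SUPPORTED_VERSIONS = {(3, 10), (3, 11)}


def is_supported_python(version_info=sys.version_info) -> bool:
    """Return True if the major.minor version is 3.10 or 3.11."""
    return tuple(version_info[:2]) in SUPPORTED_VERSIONS


if not is_supported_python():
    # Warn instead of raising so the notebook can still be read on other versions.
    print(
        f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} "
        "is outside the supported versions (3.10 or 3.11)."
    )
```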

To get started:

1. Install the package from PyPI (the install command appears later in this section).
2. Use the SDK:

- `Syngen.train(...)` — train a model for a table. Key parameters include `source`, `table_name`, `epochs`, and `batch_size`.
- `Syngen.infer(...)` — generate synthetic rows. Key parameters include `table_name`, `size`, `run_parallel` and `random_seed`.

Notes: the library can encrypt stored original data with a Fernet key (pass the name of the environment variable that holds the key via the `fernet_key` parameter). You can also generate accuracy and sample reports by passing `reports` through either the CLI or the SDK methods.

Install the *syngen* library from PyPI:

In [None]:
!pip install syngen==0.10.28rc0

## Launch training

You can start the training process using the SDK entry point `Syngen.train(...)`. The SDK mirrors the CLI options, so you can pass the same parameters programmatically. The most useful parameters are summarized below:

```python
Syngen.train(
    metadata_path: Optional[str] = None,  # use a metadata YAML to train multiple tables or provide all params in a file
    table_name: Optional[str] = None,     # required for single-table training
    source: Optional[str] = None,         # path to original data
    epochs: int = 10,                     # number of training epochs
    drop_null: bool = False,              # whether to drop rows with at least one missing value
    row_limit: Optional[int] = None,      # a number of rows to train over
    reports: Union[str, List[str]] = "none",  # report types: 'none', 'accuracy', 'sample', 'metrics_only', 'all'
    log_level: str = "INFO",               # logging level
    batch_size: int = 32,                  # training batch size
    fernet_key: Optional[str] = None       # name of the environment variable containing the Fernet key for secure storage
)
```

Key notes:

- `metadata_path`: provide a YAML metadata file to train multiple tables or to specify per-table settings. If `metadata_path` is given, SDK arguments for an individual table such as `source` and `table_name` are ignored in favor of metadata values.
- `source` and `table_name`: when training a single table, both are required (the `source` is the path to the data file and `table_name` is an arbitrary name used for artifact folders).
- `reports`: can be a single string or a list. Accepted values are `none` (default), `accuracy`, `sample`, `metrics_only`, or `all`. You can pass multiple reports (e.g. `['accuracy', 'sample']`) to generate more outputs after training.
- `fernet_key`: pass the *name* of the environment variable that stores a 44-character URL-safe base64 Fernet key. When it is set, original data samples are encrypted on disk, and the same key must be used during inference and reporting to decrypt them.

*Note*: For full documentation and additional details, see the project's *README*: [README.md](../README.md)
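To make the `fernet_key` parameter concrete, the sketch below creates a key in the expected format and exposes it through an environment variable. It assumes only that a Fernet key is 32 random bytes encoded as URL-safe base64, which is the same format that `Fernet.generate_key()` from the `cryptography` package produces. The variable name `SYNGEN_FERNET_KEY` and the helper `generate_fernet_key` are hypothetical.

```python
import base64
import os

# Hypothetical variable name chosen for this example; any name works.
FERNET_ENV_VAR = "SYNGEN_FERNET_KEY"


def generate_fernet_key() -> str:
    """Return 32 random bytes encoded as URL-safe base64 (44 characters)."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")


# Make the key visible to the current process only.
os.environ[FERNET_ENV_VAR] = generate_fernet_key()

# Pass the variable *name*, not the key itself, to the SDK, for example:
# Syngen.train(..., fernet_key=FERNET_ENV_VAR)
```

Setting `os.environ` affects only the current process. Since the same key is needed later for inference and reporting, store it somewhere durable and secure rather than regenerating it on each run.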

In [None]:
# Train on the example dataset provided in the repository: "./examples/example-data/housing.csv"

from syngen.sdk import Syngen

Syngen.train(
    source='./examples/example-data/housing.csv', 
    table_name='Housing', 
    epochs=5,
    drop_null=False,
    row_limit=1000, 
    batch_size=32, 
    reports='all'
)