### Early stopping

The model has an early stopping mechanism: by default, training stops when the loss has not improved by more than 0.005 for 10 epochs in a row. Because of this, you can safely specify a deliberately large number of epochs.
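The stopping rule described above can be sketched in a few lines of Python. This is a minimal illustration, not syngen's actual implementation: the function name `early_stopping_epoch`, the reading of 0.005 as a minimum required improvement, and the loss curve are all assumptions made for the example.

```python
from typing import Iterable, Optional


def early_stopping_epoch(
    losses: Iterable[float],
    min_delta: float = 0.005,  # assumed meaning of the 0.005 default
    patience: int = 10,        # epochs in a row without improvement
) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(losses, start=1):
        if best - loss > min_delta:
            # The loss improved enough: remember it and reset the counter.
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return epoch
    return None


# Made-up loss curve: improves for 12 epochs, then stays flat up to epoch 100.
losses = [1.0 - 0.05 * i for i in range(12)] + [0.45] * 88
print(early_stopping_epoch(losses))
```

With this made-up curve the counter starts at epoch 13 and reaches the patience limit ten epochs later, so the remaining epochs are never run.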

In [1]:
import requests
import pandas as pd

In [2]:
dataset = requests.get("https://raw.githubusercontent.com/tdspora/syngen/main/example-data/housing.csv")
dataset.raise_for_status()  # fail early if the download did not succeed

In [3]:
# Save the downloaded bytes unchanged to avoid encoding and newline translation issues
with open("sample-dataset.csv", "wb") as file:
    file.write(dataset.content)

Since syngen is a command-line tool, we will invoke its commands in Jupyter's command-line mode (the %%cmd cell magic).
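The %%cmd cell magic used in this notebook is specific to Windows. As a portable alternative, a command can also be launched from plain Python with the standard `subprocess` module. The sketch below is only an illustration: the helper name `run_cli` is hypothetical, and the command it runs is a harmless placeholder rather than a syngen command.

```python
import subprocess
import sys
from typing import List


def run_cli(args: List[str]) -> str:
    """Run a command, raise if it exits with an error, and return its stdout."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


# Placeholder command: ask the current Python interpreter to print a line.
output = run_cli([sys.executable, "-c", "print('hello from a subprocess')"])
print(output)
```

The same helper could be given the argument list of the train or infer commands shown in this notebook, provided syngen is installed in the active environment.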

First, let's install the latest version of syngen.

In [8]:
%%cmd

python.exe -m pip install --upgrade pip
pip3 install syngen

Microsoft Windows [Version 10.0.19044.2251]
(c) Microsoft Corporation. All rights reserved.

C:\Users\Pavel_Bobyrev>
C:\Users\Pavel_Bobyrev>python.exe -m pip install --upgrade pip

C:\Users\Pavel_Bobyrev>pip3 install syngen
Collecting syngen
  Downloading syngen-0.0.48-py3-none-any.whl (63 kB)
     ---------------------------------------- 63.9/63.9 kB 1.1 MB/s eta 0:00:00
Installing collected packages: syngen
Successfully installed syngen-0.0.48

C:\Users\Pavel_Bobyrev>

Then we will invoke the training process. To demonstrate the early stopping feature, we limit training to a small subset of the table (2048 rows) using the row_limit argument.

In [15]:
%%cmd

train --source "./sample-dataset.csv" --table_name "housing" --epochs 100 --row_limit 2048

Microsoft Windows [Version 10.0.19044.2251]
(c) Microsoft Corporation. All rights reserved.

C:\Users\Pavel_Bobyrev>
C:\Users\Pavel_Bobyrev>train --source "./sample-dataset.csv" --table_name "housing" --epochs 100 --row_limit 2048


2022-12-01 07:10:42.814 | INFO     | syngen.ml.worker.worker:__train_table:186 - Training process of the table - housing has started.
2022-12-01 07:10:42.835 | INFO     | syngen.ml.interface.interface:run:147 - Generator: 'vae', mode: 'train'
2022-12-01 07:10:42.843 | DEBUG    | syngen.ml.strategies.strategies:run:27 - Train model with parameters: epochs=100, drop_null=False
2022-12-01 07:10:42.941 | DEBUG    | syngen.ml.pipeline.pipeline:data_pipeline:128 - Count of string columns: 0; Count of float columns: 3; Count of int columns: 6; Count of categorical columns: 1; Count of date columns: 0; Count of binary columns: 0
2022-12-01 07:10:48.086 | INFO     | syngen.ml.reporters.reporters:report:158 - Corresponding p


C:\Users\Pavel_Bobyrev>

When the early stopping mechanism is triggered, you should see the log message "The loss does not become lower for 10 epochs in a row. Stopping the training". In this run, training stopped after only 22 of the 100 specified epochs.
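If the training output is captured as text, the check for this message can be automated. The sketch below assumes the log is available as a string that may still contain ANSI color codes. The helper names and the sample log line are made up for illustration; only the message text comes from the log quoted above.

```python
import re

# Matches ANSI color sequences such as "\x1b[32m" or "\x1b[0m".
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")

# Message text quoted in this notebook.
EARLY_STOP_MARKER = "The loss does not become lower for 10 epochs in a row"


def strip_ansi(text: str) -> str:
    """Remove ANSI color codes so the log reads as plain text."""
    return ANSI_ESCAPE.sub("", text)


def stopped_early(log_text: str) -> bool:
    """Return True if the early stopping message appears in the log."""
    return EARLY_STOP_MARKER in strip_ansi(log_text)


# Hypothetical captured line; the real log formatting may differ.
sample_log = (
    "\x1b[1mINFO\x1b[0m | \x1b[1mThe loss does not become lower "
    "for 10 epochs in a row. Stopping the training\x1b[0m"
)
print(stopped_early(sample_log))
```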

You can then run inference as usual by calling the infer CLI command and passing the table name as an argument.

In [16]:
%%cmd

infer --size 10000 --table_name "housing" --print_report false

Microsoft Windows [Version 10.0.19044.2251]
(c) Microsoft Corporation. All rights reserved.

C:\Users\Pavel_Bobyrev>
C:\Users\Pavel_Bobyrev>infer --size 10000 --table_name "housing" --print_report false


2022-12-01 07:14:57.910 | INFO     | syngen.ml.worker.worker:__infer_table:208 - Infer process of the table - housing has started.
2022-12-01 07:14:57.910 | INFO     | syngen.ml.train_chain.train_chain:handle:282 - Total of 1 batch(es)
2022-12-01 07:14:57.918 | INFO     | syngen.ml.train_chain.train_chain:run:204 - Start data synthesis
2022-12-01 07:15:06.820 | INFO     | syngen.ml.vae.wrappers.wrappers:load_state:303 - Loaded VAE state from model_artifacts/resources/housing/vae/checkpoints
2022-12-01 07:15:06.868 | INFO     | syngen.ml.vae.wrappers.wrappers:load_state:303 - Loaded VAE state from model_artifacts/resources/housing/vae/checkpoints
2022-12-01 07:15:06.884 | INFO     | syngen.ml.vae.wrappers.wrap


C:\Users\Pavel_Bobyrev>

When the inference process is completed, you will see the log message "Synthesis of the table - housing was complited. Synthetic data saved in model_artifacts/tmp_store/housing/merged_infer_housing.csv".

That's it! You can find the generated table at the path mentioned above. Let's take a quick look at the generated data using pandas.

In [17]:
import pandas as pd
pd.read_csv("./model_artifacts/tmp_store/housing/merged_infer_housing.csv")

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-119.251472,36.466984,29,2681,773,350,712,1.927160,64992,INLAND
1,-118.764351,34.118820,29,2513,602,1292,544,4.084972,275731,<1H OCEAN
2,-117.930855,33.815987,38,2328,322,1276,375,1.947785,118588,<1H OCEAN
3,-119.139587,36.653145,31,2469,775,443,776,2.211595,82714,INLAND
4,-122.550522,37.724789,29,2300,545,1221,741,3.395561,201079,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
9995,-118.788651,33.926743,40,1843,714,1346,343,1.662773,167284,<1H OCEAN
9996,-118.379105,33.689243,12,5303,1119,2632,899,4.352195,264923,<1H OCEAN
9997,-118.783112,34.071365,27,2883,652,1296,634,4.688275,309480,<1H OCEAN
9998,-117.862831,34.892715,40,4202,689,341,276,3.216156,252473,INLAND
