# Using SyGNet

This notebook demonstrates the basic functionality of the **sygnet** package in Python.

To download the package, simply run `pip install sygnet` at the command line.

## Prerequisites

To start, we load the required packages and define a data generating process (DGP).

In [1]:
%%capture
%cd ..

import pandas as pd
import numpy as np

from numpy.random import default_rng
from torch import manual_seed

# NB: once installed via pip, can run `from sygnet import SygnetModel`
from src.sygnet.sygnet_interface import SygnetModel




Next, we focus on a very simple case: learning a parametric relationship between numeric variables. The DGP comprises two uniform random variables (`x1` and `x2`), which in turn determine the distribution of two other variables (`x3` and `y`). Once the DGP is defined, we take 100,000 draws from it to use as our training data:

In [2]:
rng = default_rng(seed = 100)
manual_seed(100)

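def gen_sim_data(rng, n=100000):
    """Draw n observations from the simulated DGP and return them as a DataFrame."""
    # Two independent uniform(0, 1) covariates
    x1 = rng.uniform(low = 0, high = 1, size = n)
    x2 = rng.uniform(low = 0, high = 1, size = n)
    # x3 and y are normal draws whose means depend on x1 and x2
    x3 = rng.normal(loc = x1 + x2, scale = 0.1)
    y = rng.normal(loc = 3*x1 + 2*x2 + 1, scale = 1)

    sim_data = pd.DataFrame({
        'x1' : x1,
        'x2' : x2,
        'x3' : x3,
        'y' : y
    })

    return sim_data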

train_data = gen_sim_data(rng)
train_data.head()

|   | x1 | x2 | x3 | y |
|---|---|---|---|---|
| 0 | 0.834982 | 0.531162 | 1.494573 | 3.945647 |
| 1 | 0.596554 | 0.292134 | 0.962634 | 2.442326 |
| 2 | 0.288863 | 0.066026 | 0.146251 | 0.943672 |
| 3 | 0.042952 | 0.919107 | 1.011053 | 3.074215 |
| 4 | 0.973654 | 0.510645 | 1.548556 | 1.854276 |
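As a quick sanity check on the DGP described above, the sketch below re-draws the variables with NumPy alone and confirms that the noise around each conditional mean has the scale set in `gen_sim_data`. The names `rng_check`, `x3_resid` and `y_resid` are introduced here for illustration only and are not part of the notebook.

```python
import numpy as np

# Re-create the DGP with the same seed as above (NumPy only, no pandas)
rng_check = np.random.default_rng(seed=100)
n = 100_000
x1 = rng_check.uniform(low=0, high=1, size=n)
x2 = rng_check.uniform(low=0, high=1, size=n)
x3 = rng_check.normal(loc=x1 + x2, scale=0.1)
y = rng_check.normal(loc=3 * x1 + 2 * x2 + 1, scale=1)

# Residuals around the conditional means defined in the DGP
x3_resid = x3 - (x1 + x2)
y_resid = y - (3 * x1 + 2 * x2 + 1)

print(f"sd(x3 residual) = {x3_resid.std():.3f}  (DGP scale: 0.1)")
print(f"sd(y residual)  = {y_resid.std():.3f}  (DGP scale: 1)")
print(f"x1 range = [{x1.min():.3f}, {x1.max():.3f}]")
```

A well-trained generator should reproduce these features: `x1` and `x2` confined to the unit interval, `x3` tightly clustered around `x1 + x2`, and `y` much noisier around its mean.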


## **sygnet** pipeline

We follow a very similar pipeline to **scikit-learn**: 

1. Instantiate a model:
   * Users must specify what type of GAN architecture to use: we recommend "wgan" for non-conditional synthetic data, and "cgan" when conditional labels will be supplied 
   * Optional arguments allow the user to customise the hidden layer structure, dropout proportions, layer normalisation, ReLU leakage, and whether to range-match the final output
2. Fit the model to the training data
   * Users must supply the training data
   * Optional arguments allow the user to alter the default hyperparameters (epochs, learning rate, batch size etc.)
3. Sample from the trained model
   * Users must specify the number of synthetic observations to draw from the model
   * Optional arguments allow the user to control the format of the returned results, as well as to save the synthetic data to disk



## Basic example

In this first example, we set `mode = "wgan"` to use the Wasserstein GAN architecture. We fit the model to our simulated data for a single epoch (real applications will require more), and then generate 1,000 synthetic observations:

In [3]:
model = SygnetModel(mode = "wgan")
model.fit(data = train_data, epochs = 1)
synth_data1 = model.sample(nobs = 1000)

synth_data1.head()

Epoch: 100%|██████████| 1/1 [00:08<00:00,  8.03s/it]


|   | x1 | x2 | x3 | y |
|---|---|---|---|---|
| 0 | 1.987882 | 1.251058 | 1.305152 | 1.415426 |
| 1 | 1.328185 | 0.96184 | 1.358526 | 0.521271 |
| 2 | 1.748668 | 0.997697 | 1.398263 | 1.574589 |
| 3 | 2.192904 | 1.09027 | 1.388196 | 1.089555 |
| 4 | 1.821552 | 1.141728 | 0.738089 | 1.536895 |


*Note: In the above example, we "sample" observations from the trained model. In keeping with the **scikit-learn** API, users can instead use `.transform()`, which is an alias of `.sample()`.*

### GPU support

**sygnet** allows users to train the model using GPU computation, which should improve training times considerably. To run the synthetic generator on the GPU, we simply fit the model with the parameter `device = 'cuda'`. Comparing the progress bars of the two runs in this notebook, an epoch takes about 8.0 s on the CPU and about 1.1 s on the GPU, roughly a seven-fold reduction:

In [4]:
model_gpu = SygnetModel(mode = "wgan")
model_gpu.fit(data = train_data, epochs = 3, device='cuda')
synth_data2 = model_gpu.sample(nobs = 1000)

synth_data2.head()

Epoch: 100%|██████████| 3/3 [00:03<00:00,  1.08s/it]


|   | x1 | x2 | x3 | y |
|---|---|---|---|---|
| 0 | 0.0 | 0.349543 | 0.406956 | 4.025612 |
| 1 | 0.0 | 1.165 | 0.444231 | 4.278828 |
| 2 | 0.0 | 0.795632 | 0.307328 | 3.525852 |
| 3 | 0.0 | 0.854147 | 0.784692 | 4.134717 |
| 4 | 0.0 | 0.922386 | 0.86844 | 4.546636 |
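The cell above assumes a CUDA-capable GPU is present. A minimal sketch of choosing the device at run time follows; it relies on `torch.cuda.is_available()` and assumes that `.fit()` also accepts `'cpu'` as a device string, which this notebook does not confirm.

```python
# Pick a device string, falling back to the CPU when no GPU (or no torch) is found
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"

print(f"Selected device: {device}")

# Hypothetical usage, mirroring the cell above
# (assumes 'cpu' is a valid value for the `device` argument):
# model_gpu.fit(data = train_data, epochs = 3, device = device)
```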


## Custom architectures

The above models are trained with the default parameters and for only a small number of epochs. They are therefore not well-trained, as can be seen in the resulting synthetic data.
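One way to make this visible is to compare the synthetic `x1` values with the support of the DGP, where `x1` is drawn from a uniform distribution on [0, 1]. The sketch below uses only the first five rows printed above for each model; `share_outside` is a helper introduced here for illustration, not part of **sygnet**.

```python
import numpy as np

# First five x1 values printed above for each model (copied from the outputs)
x1_cpu = np.array([1.987882, 1.328185, 1.748668, 2.192904, 1.821552])
x1_gpu = np.array([0.0, 0.0, 0.0, 0.0, 0.0])

def share_outside(values, low=0.0, high=1.0):
    """Fraction of values falling outside the closed interval [low, high]."""
    values = np.asarray(values)
    return float(np.mean((values < low) | (values > high)))

# Share of the displayed rows that fall outside the DGP's support for x1
print("CPU model, share of x1 outside [0, 1]:", share_outside(x1_cpu))

# Spread of the displayed rows; zero means every displayed value is identical
gpu_spread = x1_gpu.max() - x1_gpu.min()
print("GPU model, spread of x1:", gpu_spread)
```

In the displayed rows, the first model places every `x1` value above 1, and the second returns the same `x1` value each time. Neither pattern is consistent with a uniform variable on the unit interval.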

To improve the quality of our model, we can adjust the hyperparameter settings. In this instance, we reduce the batch size and dropout proportion (relative to the default), increase the learning rate, and switch off the mixed activation function, since our data contain no categorical columns. We then train for 50 epochs on the GPU:



In [5]:
model_custom = SygnetModel(mode = "wgan", dropout_p=0.1, mixed_activation=False)

model_custom.fit(
    train_data, 
    device = 'cuda', 
    epochs = 50, 
    batch_size=512,
    learning_rate=0.001
)

synth_data3 = model_custom.sample(1000)

synth_data3.head()

Not using mixed activation function -- generated data may not conform to real data if it contains categorical columns.
Epoch: 100%|██████████| 50/50 [02:27<00:00,  2.96s/it]


|   | x1 | x2 | x3 | y |
|---|---|---|---|---|
| 0 | 0.217527 | 0.71818 | 1.001947 | -0.600885 |
| 1 | 0.410736 | 1.004193 | 1.417894 | 3.793028 |
| 2 | 0.274179 | 0.412281 | 0.687893 | 2.317762 |
| 3 | 0.750253 | 0.186607 | 0.933108 | 4.225198 |
| 4 | 0.211625 | 0.682338 | 0.891227 | 2.560545 |


To assess the quality of our model, we can check how well the synthetic outcome variable `y` is modelled by the synthetic independent variables. To do so, we regress `y` on `x1` and `x2`. Recall that in our DGP

$$y \sim \mathcal{N}(\mu = 3\times x_1 + 2 \times x_2 + 1, \sigma = 1).$$

Therefore, we should expect the two slope coefficients followed by the intercept to be close to `[3,2,1]`.

In [6]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(synth_data3.drop(['y','x3'], axis=1), synth_data3['y'])

print(f"Synthetic coefficients = {['%.2f' % val for val in reg.coef_.tolist() + [reg.intercept_]]}")


Synthetic coefficients = ['3.12', '1.63', '0.86']


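The slope on `x1` (3.12) and the intercept (0.86) are close to their true values, while the slope on `x2` is underestimated (1.63 against 2). The relationship is therefore not captured perfectly, but the synthetic data nevertheless resemble it.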
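To put these numbers in context, it helps to know how far the same regression strays from `[3,2,1]` when it is fitted to 1,000 real draws from the DGP. The sketch below does this with NumPy's least-squares solver rather than **scikit-learn**, so that it is self-contained; the seed and the names `rng_base`, `coefs` and `synth_coefs` are choices made for this illustration.

```python
import numpy as np

rng_base = np.random.default_rng(seed=0)  # arbitrary seed, chosen for this sketch
n = 1_000  # same size as the synthetic sample

# Fresh draws from the DGP (x3 is omitted because the regression does not use it)
x1 = rng_base.uniform(0, 1, size=n)
x2 = rng_base.uniform(0, 1, size=n)
y = rng_base.normal(loc=3 * x1 + 2 * x2 + 1, scale=1)

# Design matrix with a trailing column of ones so the intercept comes last
X = np.column_stack([x1, x2, np.ones(n)])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

true_coefs = np.array([3.0, 2.0, 1.0])
synth_coefs = np.array([3.12, 1.63, 0.86])  # values reported in the output above

print("Real-data coefficients:", np.round(coefs, 2))
print("Real-data abs. error:  ", np.round(np.abs(coefs - true_coefs), 2))
print("Synthetic abs. error:  ", np.round(np.abs(synth_coefs - true_coefs), 2))
```

If the synthetic errors are of a similar size to the real-data errors, the remaining gap is mostly sampling noise; a clearly larger error, as for `x2` here, points to the generator itself.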