# Usage Example

In this notebook, we show how to use the **GANBLR** models.

Currently, the following GANBLR models are available in this package:

- GANBLR
- GANBLR++


## 1. GANBLR

### 1.1. Load the data

The first step is to get the data we will use. For `GANBLR`, the data must be discrete.

Here, the built-in `get_demo_data` function returns the discretized `adult` dataset as a `pandas.DataFrame`.

In [1]:
from ganblr.utils import get_demo_data

df = get_demo_data('adult')
df.head()

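| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 7 | 0 | 9 | 2 | 4 | 1 | 1 | 4 | 1 | 13 | 0 | 2 | 39 | 0 |
| 1 | 7 | 6 | 0 | 9 | 2 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 0 | 39 | 0 |
| 2 | 6 | 4 | 0 | 11 | 5 | 0 | 6 | 1 | 4 | 1 | 0 | 0 | 2 | 39 | 0 |
| 3 | 7 | 4 | 0 | 1 | 0 | 2 | 6 | 0 | 2 | 1 | 0 | 0 | 2 | 39 | 0 |
| 4 | 4 | 4 | 0 | 9 | 2 | 2 | 10 | 5 | 2 | 0 | 0 | 0 | 2 | 5 | 0 |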
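
Because `GANBLR` expects discrete inputs, a dataset with continuous columns has to be binned before it can be used. The sketch below shows one possible way to do that with scikit-learn's `KBinsDiscretizer`. The `toy` frame, its column names, and the choice of 5 quantile bins are illustrative assumptions; this is not necessarily how the demo `adult` data was prepared.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical toy data: two continuous columns and one categorical column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 90, size=200).astype(float),
    "hours": rng.normal(40, 10, size=200),
    "sex": rng.choice(["Male", "Female"], size=200),
})

# Bin each continuous column into (at most) 5 equal-frequency intervals,
# encoded as integer bin indices starting at 0.
discretizer = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
binned = discretizer.fit_transform(toy[["age", "hours"]]).astype(int)

# Replace the continuous columns with their bin indices; "sex" is left untouched.
toy_discrete = toy.assign(age=binned[:, 0], hours=binned[:, 1])
print(toy_discrete.head())
```

The bin edges are kept in `discretizer.bin_edges_`, which is useful if the bin indices later need to be mapped back to value ranges.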


### 1.2. Train the GANBLR Model

Next, we use `sklearn.model_selection.train_test_split` to split the data into training and test sets, then fit the `GANBLR` model on the training set.

Note that the `GANBLR` class has a built-in `sklearn.preprocessing.OrdinalEncoder` and `sklearn.preprocessing.LabelEncoder` to convert the data format.
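
To make the note above concrete, here is a minimal standalone sketch of what these two scikit-learn encoders do to categorical features and labels. It only illustrates the encoders themselves; how `GANBLR` wires them up internally is not shown in this notebook, and the tiny arrays below are made-up examples.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Made-up categorical features and labels.
X = np.array([
    ["Private", "Bachelors"],
    ["State-gov", "HS-grad"],
    ["Private", "HS-grad"],
], dtype=object)
y = np.array(["<=50K", ">50K", "<=50K"])

# OrdinalEncoder maps each feature column's categories to integer codes.
feature_encoder = OrdinalEncoder(dtype=int)
X_codes = feature_encoder.fit_transform(X)

# LabelEncoder does the same for the 1-D target vector.
label_encoder = LabelEncoder()
y_codes = label_encoder.fit_transform(y)

print(X_codes)
print(y_codes)

# Both encoders can map the integer codes back to the original values.
X_back = feature_encoder.inverse_transform(X_codes)
y_back = label_encoder.inverse_transform(y_codes)
```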

In [2]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

x, y = df.iloc[:,:-1], df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

In [3]:
print("Training shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)

Training shape: (24421, 14) (24421,)
Test shape: (24421, 14) (24421,)


In [4]:
from ganblr import GANBLR
model = GANBLR()
model.fit(X_train, y_train, k = 0, epochs = 10, batch_size=64)

warmup run:
Epoch 1/10: G_loss = 4.58913516998291, G_accuracy = 0.8543057441711426, D_loss = 0.4685502052307129, D_accuracy = 0.8620274662971497
Epoch 2/10: G_loss = 2.933922052383423, G_accuracy = 0.8625773191452026, D_loss = 2.430967330932617, D_accuracy = 0.648273766040802
Epoch 3/10: G_loss = 2.145524501800537, G_accuracy = 0.8656893372535706, D_loss = 2.1573619842529297, D_accuracy = 0.5535280108451843
Epoch 4/10: G_loss = 3.6396114826202393, G_accuracy = 0.8678186535835266, D_loss = 3.219583034515381, D_accuracy = 0.5890768766403198
Epoch 5/10: G_loss = 2.9528706073760986, G_accuracy = 0.8692927956581116, D_loss = 1.330764651298523, D_accuracy = 0.6770915985107422
Epoch 6/10: G_loss = 2.9507458209991455, G_accuracy = 0.8715859055519104, D_loss = 0.9282518029212952, D_accuracy = 0.7218539714813232
Epoch 7/10: G_loss = 2.4131100177764893, G_accuracy = 0.8710945248603821, D_loss = 3.151639461517334, D_accuracy = 0.5258618593215942
Epoch 8/10: G_loss = 2.853835344314575, G_accuracy =

<ganblr.models.ganblr.GANBLR at 0x210cdfde790>

### 1.3. Generate the synthetic data

Once the model is trained, we can use the `GANBLR.sample` method to generate synthetic data.

The `size` parameter specifies the number of samples to generate. If it is not specified, the method generates as many samples as there are rows in the training data.

In [5]:
size = 1000

syn_data = model.sample(size)

  0%|          | 0/15 [00:00<?, ?it/s]

In [6]:
print(f"{type(syn_data)}, {syn_data.shape}")

<class 'numpy.ndarray'>, (1000, 15)


In [7]:
import pandas as pd
pd.DataFrame(data = syn_data, columns=df.columns).head(10)

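| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 5 | 0 | 7 | 2 | 1 | 5 | 3 | 1 | 1 | 1 | 20 | 5 | 9 | 1 |
| 1 | 10 | 8 | 0 | 15 | 6 | 2 | 5 | 1 | 2 | 0 | 2 | 13 | 1 | 5 | 0 |
| 2 | 0 | 7 | 0 | 5 | 4 | 1 | 13 | 4 | 3 | 0 | 12 | 12 | 0 | 37 | 1 |
| 3 | 1 | 1 | 0 | 8 | 2 | 0 | 5 | 1 | 0 | 1 | 8 | 11 | 2 | 33 | 0 |
| 4 | 0 | 3 | 0 | 3 | 3 | 4 | 13 | 0 | 1 | 0 | 8 | 10 | 0 | 35 | 0 |
| 5 | 0 | 6 | 0 | 0 | 6 | 0 | 8 | 3 | 3 | 0 | 2 | 11 | 0 | 32 | 0 |
| 6 | 1 | 2 | 0 | 0 | 1 | 4 | 14 | 4 | 1 | 1 | 5 | 11 | 3 | 35 | 0 |
| 7 | 6 | 0 | 0 | 0 | 0 | 1 | 11 | 5 | 2 | 0 | 9 | 17 | 1 | 3 | 1 |
| 8 | 1 | 8 | 0 | 3 | 6 | 1 | 0 | 5 | 2 | 0 | 4 | 19 | 1 | 16 | 0 |
| 9 | 5 | 1 | 0 | 13 | 2 | 0 | 1 | 4 | 3 | 0 | 0 | 8 | 5 | 27 | 0 |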


### 1.4. TSTR evaluation

Finally, as in our paper, we perform a simple TSTR (Train on Synthetic, Test on Real) evaluation to measure the quality of the generated data.

We evaluate with three scikit-learn models: `LogisticRegression`, `RandomForestClassifier`, and `MLPClassifier`.

TRTR (Train on Real, Test on Real) is used as the baseline for comparison.

In [8]:
acc_score_lr = model.evaluate(X_test, y_test, model='lr')
acc_score_mlp = model.evaluate(X_test, y_test, model='mlp')
acc_score_rf = model.evaluate(X_test, y_test, model='rf')

  0%|          | 0/15 [00:00<?, ?it/s]

  0%|          | 0/15 [00:00<?, ?it/s]

  0%|          | 0/15 [00:00<?, ?it/s]

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
lbe = LabelEncoder()
X_train_ohe = ohe.fit_transform(X_train)
X_test_ohe = ohe.transform(X_test)
y_train_lbe = lbe.fit_transform(y_train)
y_test_lbe = lbe.transform(y_test)

trtr_score_lr  = LogisticRegression().fit(X_train_ohe, y_train_lbe).score(X_test_ohe, y_test_lbe)
trtr_score_rf  = RandomForestClassifier().fit(X_train, y_train_lbe).score(X_test, y_test_lbe)
trtr_score_mlp = MLPClassifier().fit(X_train_ohe, y_train_lbe).score(X_test_ohe, y_test_lbe)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [10]:
df_evaluate = pd.DataFrame([
    ['TSTR', acc_score_lr, acc_score_rf, acc_score_mlp],
    ['TRTR', trtr_score_lr,trtr_score_rf,trtr_score_mlp]
], columns=['Evaluated Item', 'LR', 'RF', 'MLP'])
df_evaluate

| | Evaluated Item | LR | RF | MLP |
|---|---|---|---|---|
| 0 | TSTR | 0.853323 | 0.829122 | 0.843905 |
| 1 | TRTR | 0.873183 | 0.848532 | 0.859793 |
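
For readers who want to see the TSTR idea spelled out, the sketch below computes a TSTR accuracy by hand: a classifier is trained on synthetic rows only and scored on real rows. It is an illustration under stated assumptions, not necessarily what `GANBLR.evaluate` does internally. The helper `tstr_accuracy` and the toy generator `make_rows` are hypothetical names, and the random arrays merely stand in for `model.sample(...)` output and the real test split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder


def tstr_accuracy(syn_data, X_real, y_real):
    """Train on synthetic rows (label in the last column), test on real data."""
    X_syn, y_syn = syn_data[:, :-1], syn_data[:, -1]
    # One-hot encode the discrete features; unseen real categories are ignored.
    ohe = OneHotEncoder(handle_unknown="ignore")
    clf = LogisticRegression(max_iter=1000)
    clf.fit(ohe.fit_transform(X_syn), y_syn)
    return clf.score(ohe.transform(X_real), y_real)


rng = np.random.default_rng(0)


def make_rows(n):
    """Hypothetical stand-in data: 4 discrete features plus a label column."""
    X = rng.integers(0, 3, size=(n, 4))
    y = (X[:, 0] >= 1).astype(int)  # the label depends only on the first feature
    return np.column_stack([X, y])


syn = make_rows(500)   # plays the role of the synthetic sample
real = make_rows(500)  # plays the role of the real test set
score = tstr_accuracy(syn, real[:, :-1], real[:, -1])
print(f"TSTR accuracy: {score:.3f}")
```

Replacing `syn` with real training rows in the same function gives the corresponding TRTR baseline.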


## 2. GANBLR++

### 2.1. Load the data

Unlike `GANBLR`, which can only handle discrete data, `GANBLR++` can handle numerical data as well.

To test `GANBLR++`, we use the built-in `get_demo_data` function to load the raw `adult` dataset as a `pandas.DataFrame`.

In [11]:
from ganblr.utils import get_demo_data
df = get_demo_data('adult-raw')
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [12]:
from sklearn.model_selection import train_test_split
x, y = df.values[:,:-1], df.values[:,-1]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.5)


In [13]:
print("Training shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)

Training shape: (24421, 14) (24421,)
Test shape: (24421, 14) (24421,)


In [14]:
df.dtypes.values

array([dtype('int64'), dtype('O'), dtype('int64'), dtype('O'),
       dtype('int64'), dtype('O'), dtype('O'), dtype('O'), dtype('O'),
       dtype('O'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('O'), dtype('O')], dtype=object)

In [15]:
import numpy as np
def is_numerical(dtype):
    '''
    If the dtype kind is one of 'i' (signed integer), 'u' (unsigned integer), or 'f' (floating point), we recognize the column as numerical.
    
    Reference: https://numpy.org/doc/stable/reference/generated/numpy.dtype.kind.html#numpy.dtype.kind
    '''
    return dtype.kind in 'iuf'

column_is_numerical = df.dtypes.apply(is_numerical).values
numerical_columns = np.argwhere(column_is_numerical).ravel()
numerical_columns

array([ 0,  2,  4, 10, 11, 12], dtype=int64)
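
As an alternative to checking `dtype.kind` by hand, pandas can select numeric columns directly. The sketch below is one possible equivalent on a small made-up frame (`toy` and its values are illustrative assumptions). Note that `select_dtypes(include="number")` also matches complex dtypes, which the `'iuf'` check above does not.

```python
import pandas as pd

# Made-up frame mixing numerical and categorical (object) columns.
toy = pd.DataFrame({
    "age": [39, 50],
    "workclass": ["State-gov", "Private"],
    "fnlwgt": [77516, 83311],
    "capital-gain": [2174.0, 0.0],
})

# Names of all numeric columns, in their original order.
numeric_names = toy.select_dtypes(include="number").columns

# Convert the names to positional indices.
numerical_columns = toy.columns.get_indexer(numeric_names)
print(numerical_columns)
```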

In [16]:
from ganblr import GANBLRPP
ganblrpp = GANBLRPP(numerical_columns)
ganblrpp.fit(X_train, y_train, epochs=10)



warmup run:
Epoch 1/10: G_loss = 4.002487659454346, G_accuracy = 0.853855311870575, D_loss = 0.8450406789779663, D_accuracy = 0.788242518901825
Epoch 2/10: G_loss = 3.051306962966919, G_accuracy = 0.859588086605072, D_loss = 0.6335127949714661, D_accuracy = 0.767793595790863
Epoch 3/10: G_loss = 2.7741646766662598, G_accuracy = 0.8607346415519714, D_loss = 1.0663396120071411, D_accuracy = 0.6937015056610107
Epoch 4/10: G_loss = 3.409590721130371, G_accuracy = 0.8624544739723206, D_loss = 1.5724759101867676, D_accuracy = 0.6476083397865295
Epoch 5/10: G_loss = 3.0426745414733887, G_accuracy = 0.8631505966186523, D_loss = 2.7285985946655273, D_accuracy = 0.5808870792388916
Epoch 6/10: G_loss = 3.376784563064575, G_accuracy = 0.8635600805282593, D_loss = 1.1194065809249878, D_accuracy = 0.739666759967804
Epoch 7/10: G_loss = 3.1782145500183105, G_accuracy = 0.8649522662162781, D_loss = 3.2684273719787598, D_accuracy = 0.5583139061927795
Epoch 8/10: G_loss = 3.9810738563537598, G_accuracy 

<ganblr.models.ganblr.GANBLR at 0x210d2eeac10>

In [17]:
size = 1000
syn_data = ganblrpp.sample(size)

Step 1/2: Sampling discrete data from GANBLR.


  0%|          | 0/15 [00:00<?, ?it/s]

step 2/2: Sampling numerical data.


sampling: 100%|██████████| 6/6 [00:00<00:00, 499.55it/s]


In [18]:
import pandas as pd
pd.DataFrame(syn_data, columns=df.columns).head(10)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.015505 | Local-gov | 124832.703997 | Assoc-acdm | 4.236034 | Divorced | Prof-specialty | Own-child | Black | Female | 12782.024782 | 1305.85982 | 37.369192 | Scotland | <=50K |
| 1 | 64.05973 | Federal-gov | 42297.443078 | Masters | 11.127331 | Never-married | Exec-managerial | Own-child | Black | Male | 12797.652436 | 1118.612571 | 58.679131 | Laos | <=50K |
| 2 | 18.013976 | State-gov | 737270.73941 | 12th | 14.033178 | Never-married | Sales | Not-in-family | Amer-Indian-Eskimo | Female | -38.559027 | 971.729398 | 39.89928 | Portugal | <=50K |
| 3 | 25.231355 | Self-emp-inc | 27948.685155 | 1st-4th | 5.085956 | Never-married | Machine-op-inspct | Husband | Amer-Indian-Eskimo | Male | 123.163091 | 1114.484584 | 18.331086 | Trinadad&Tobago | <=50K |
| 4 | 46.881386 | Self-emp-not-inc | 169893.245237 | Assoc-acdm | 9.995286 | Married-AF-spouse | Craft-repair | Husband | Other | Male | -8.044544 | 920.248413 | 58.718122 | Trinadad&Tobago | >50K |
| 5 | 43.072856 | Never-worked | 82376.139864 | Doctorate | 14.020086 | Divorced | Exec-managerial | Husband | Other | Male | 7895.626146 | 640.290188 | 16.597398 | Honduras | >50K |
| 6 | 22.290981 | Federal-gov | 156881.114418 | 12th | 16.1027 | Never-married | Farming-fishing | Unmarried | Amer-Indian-Eskimo | Male | 8847.349935 | 8.160812 | 71.206713 | Taiwan | <=50K |
| 7 | 45.528674 | Self-emp-not-inc | 135455.933899 | 11th | 2.049141 | Married-civ-spouse | Craft-repair | Not-in-family | Amer-Indian-Eskimo | Female | 29747.940528 | 920.791223 | 25.569059 | Greece | <=50K |
| 8 | 23.342695 | ? | 45860.773743 | Bachelors | 12.273781 | Divorced | Tech-support | Wife | White | Male | 12828.814516 | 8.421561 | 14.312241 | Hong | <=50K |
| 9 | 80.155594 | Self-emp-not-inc | 220467.058339 | Preschool | 6.845098 | Separated | Armed-Forces | Own-child | White | Male | 6836.488561 | 573.535073 | 58.594679 | Taiwan | <=50K |


In [19]:
acc_score_lr  = ganblrpp.evaluate(X_test, y_test, model='lr')
acc_score_mlp = ganblrpp.evaluate(X_test, y_test, model='mlp')
acc_score_rf  = ganblrpp.evaluate(X_test, y_test, model='rf')

Step 1/2: Sampling discrete data from GANBLR.


  0%|          | 0/15 [00:00<?, ?it/s]

step 2/2: Sampling numerical data.


sampling: 100%|██████████| 6/6 [00:00<00:00, 230.56it/s]


Step 1/2: Sampling discrete data from GANBLR.


  0%|          | 0/15 [00:00<?, ?it/s]

step 2/2: Sampling numerical data.


sampling: 100%|██████████| 6/6 [00:00<00:00, 239.78it/s]


Step 1/2: Sampling discrete data from GANBLR.




  0%|          | 0/15 [00:00<?, ?it/s]

step 2/2: Sampling numerical data.


sampling: 100%|██████████| 6/6 [00:00<00:00, 230.56it/s]


In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.metrics import accuracy_score

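# Column indices that are not numerical, in ascending order.
categorical_columns = sorted(set(range(X_train.shape[1])) - set(numerical_columns))

# Note: scikit-learn >= 1.2 renames the `sparse` argument to `sparse_output`.
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
X_train_ohe = ohe.fit_transform(X_train[:,categorical_columns])
X_test_ohe  = ohe.transform(X_test[:,categorical_columns])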
X_train_num = X_train[:,numerical_columns]
X_test_num  = X_test[:,numerical_columns]

scaler = StandardScaler()
X_train_concat = scaler.fit_transform(np.hstack([X_train_num, X_train_ohe]))
X_test_concat  = scaler.transform(np.hstack([X_test_num, X_test_ohe]))

lbe = LabelEncoder()
y_train_lbe = lbe.fit_transform(y_train)
y_test_lbe = lbe.transform(y_test)

trtr_score_lr = LogisticRegression().fit(X_train_concat, y_train).score(X_test_concat, y_test)
trtr_score_rf = RandomForestClassifier().fit(X_train_concat, y_train).score(X_test_concat, y_test)
trtr_score_mlp = MLPClassifier().fit(X_train_concat, y_train).score(X_test_concat, y_test)



In [21]:
df_evaluate = pd.DataFrame([
    ['TSTR', acc_score_lr, acc_score_rf, acc_score_mlp],
    ['TRTR', trtr_score_lr,trtr_score_rf,trtr_score_mlp]
], columns=['Evaluated Item', 'LR', 'RF', 'MLP'])
df_evaluate

| | Evaluated Item | LR | RF | MLP |
|---|---|---|---|---|
| 0 | TSTR | 0.768232 | 0.796896 | 0.795504 |
| 1 | TRTR | 0.240694 | 0.799844 | 0.75865 |
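
The manual preprocessing used for the TRTR baseline (one-hot encoding for categorical columns, scaling for numerical ones) can also be expressed as a single scikit-learn pipeline. The sketch below is one possible arrangement on made-up data, not a reproduction of the cell above: the toy columns, the label rule, and the 200/200 split are assumptions, and only the numerical column is scaled here.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up mixed-type data stored in an object array, similar to `df.values`.
rng = np.random.default_rng(0)
n = 400
age = rng.integers(18, 80, size=n)
work = rng.choice(["Private", "State-gov", "Self-emp"], size=n)
label = np.where(age > 45, ">50K", "<=50K")  # hypothetical label rule

X = np.empty((n, 2), dtype=object)
X[:, 0] = age
X[:, 1] = work

numerical_columns = [0]
categorical_columns = [1]

# Scale numerical columns and one-hot encode categorical ones in a single step.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numerical_columns),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
])
baseline = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# Fit on the first half, score on the second half.
baseline.fit(X[:200], label[:200])
score = baseline.score(X[200:], label[200:])
print(f"Baseline accuracy: {score:.3f}")
```

Keeping the preprocessing inside the pipeline ensures that the encoder and scaler are fitted on the training rows only and applied unchanged to the test rows.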
