# Usage Example

In this notebook, we show how to use the **GANBLR** models.

Currently, the following GANBLR models are available in this package:

- GANBLR
- GANBLR++


## 1. GANBLR

### 1.1. Load the data

The first step is to load the data. `GANBLR` requires discrete data.

Here, the built-in `get_demo_data` function returns the discrete `adult` data as a `pandas.DataFrame`.

In [None]:
from ganblr import get_demo_data

df = get_demo_data('adult')
df.head()
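
If your own data contains continuous columns, they have to be discretized before being passed to `GANBLR`. The following is a minimal sketch of one way to do that with `pandas.cut`. The toy frame, the choice of three equal-width bins, and the `is_discrete` helper are all assumptions made for illustration and are not part of the `ganblr` package.

```python
import pandas as pd

# Hypothetical toy frame standing in for a dataset with one continuous column.
toy = pd.DataFrame({
    "age": [23, 35, 47, 59, 61, 18],
    "workclass": ["Private", "Gov", "Private", "Self", "Gov", "Private"],
})

# Equal-width binning into 3 intervals; labels=False returns the bin index.
toy["age"] = pd.cut(toy["age"], bins=3, labels=False)

def is_discrete(frame, max_levels=20):
    """Return True if every column has at most `max_levels` distinct values.

    `max_levels` is an arbitrary threshold chosen for this sketch.
    """
    return bool((frame.nunique() <= max_levels).all())

print(toy)
print("discrete:", is_discrete(toy))
```

Equal-width bins are only one option. Quantile bins (`pandas.qcut`) or a supervised discretizer may suit skewed columns better.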

### 1.2. Train the GANBLR Model

Next, we use `sklearn.model_selection.train_test_split` to split the data into training and test sets, then fit the `GANBLR` model on the training set.

Note that the `GANBLR` class has a built-in `sklearn.preprocessing.OrdinalEncoder` and `sklearn.preprocessing.LabelEncoder` to convert the data format.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

x, y = df.iloc[:,:-1], df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

In [None]:
print("Training shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)
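
The split above is random, so the scores reported later will vary from run to run. As a sketch of how to make it repeatable, the example below passes `random_state` and `stratify` to `train_test_split` on a hypothetical toy array. The seed value and the toy data are assumptions, not settings taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features, balanced binary label.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# random_state fixes the shuffle; stratify keeps the class ratio in both halves.
a_train, a_test, b_train, b_test = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0, stratify=y_toy
)

print("Training shape:", a_train.shape, b_train.shape)
print("Test shape:", a_test.shape, b_test.shape)
print("Positive labels in train/test:", b_train.sum(), b_test.sum())
```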

In [None]:
from ganblr.models import GANBLR
model = GANBLR()
model.fit(X_train, y_train, k = 0, epochs = 10, batch_size=64)
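
To picture what the two encoders mentioned above do to categorical values, here is a small sketch on hypothetical toy arrays. It uses scikit-learn directly and does not reproduce the internals of `GANBLR`. The data and variable names are assumptions.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

# Hypothetical toy features and labels.
X_toy = np.array([
    ["Private", "HS-grad"],
    ["Gov", "Bachelors"],
    ["Private", "Bachelors"],
])
y_toy = np.array(["<=50K", ">50K", "<=50K"])

# OrdinalEncoder maps each feature's categories (sorted) to 0..n-1, per column.
oe = OrdinalEncoder(dtype=np.int64)
X_codes = oe.fit_transform(X_toy)

# LabelEncoder does the same for the one-dimensional target.
le = LabelEncoder()
y_codes = le.fit_transform(y_toy)

print(X_codes)
print(y_codes)

# Both encoders can map the integer codes back to the original values.
X_back = oe.inverse_transform(X_codes)
y_back = le.inverse_transform(y_codes)
```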

### 1.3. Generate the synthetic data

Once the model is ready, we can use the `GANBLR.sample` method to draw synthetic data.

The `size` parameter specifies the number of samples to generate. If it is omitted, the model generates as many samples as there are in the training data.

In [None]:
size = 1000

syn_data = model.sample(size)

In [None]:
print(f"{type(syn_data)}, {syn_data.shape}")

In [None]:
import pandas as pd
pd.DataFrame(data = syn_data, columns=df.columns).head(10)

### 1.4. TSTR evaluation

Finally, as we did in our paper, we perform a simple TSTR (Train on Synthetic, Test on Real) evaluation to demonstrate the quality of the generated data.

We evaluate with three models from sklearn: `LogisticRegression`, `RandomForestClassifier`, and `MLPClassifier`.

TRTR (Train on Real, Test on Real) is used as the baseline for comparison.

In [None]:
acc_score_lr = model.evaluate(X_test, y_test, model='lr')
acc_score_mlp = model.evaluate(X_test, y_test, model='mlp')
acc_score_rf = model.evaluate(X_test, y_test, model='rf')
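
The idea behind TSTR can also be sketched by hand: fit a classifier on synthetic data and score it on held-out real data, then compare with the same classifier fitted on real data. The example below is a minimal illustration only. The `make_toy` generator, the `fit_and_score` helper, and the "synthetic" set (which is just another draw from the same toy generator) are assumptions and stand in for real and sampled data. This is not how `evaluate` is implemented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

def make_toy(n):
    """Hypothetical discrete dataset: 4 features in {0, 1, 2}, binary label."""
    X = rng.integers(0, 3, size=(n, 4))
    y = (X[:, 0] + X[:, 1] >= 2).astype(int)
    return X, y

X_real_train, y_real_train = make_toy(300)
X_real_test, y_real_test = make_toy(300)
X_syn, y_syn = make_toy(300)  # stand-in for data drawn from a generative model

def fit_and_score(X_fit, y_fit, X_eval, y_eval):
    """One-hot encode, fit logistic regression, return accuracy on X_eval."""
    ohe = OneHotEncoder(handle_unknown="ignore")
    clf = LogisticRegression(max_iter=1000)
    clf.fit(ohe.fit_transform(X_fit), y_fit)
    return clf.score(ohe.transform(X_eval), y_eval)

tstr = fit_and_score(X_syn, y_syn, X_real_test, y_real_test)              # train on synthetic
trtr = fit_and_score(X_real_train, y_real_train, X_real_test, y_real_test)  # train on real

print(f"TSTR accuracy: {tstr:.3f}")
print(f"TRTR accuracy: {trtr:.3f}")
```

A TSTR score close to the TRTR score suggests that the synthetic data preserves the feature-label relationship that the classifier relies on.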

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
lbe = LabelEncoder()
X_train_ohe = ohe.fit_transform(X_train)
X_test_ohe = ohe.transform(X_test)
y_train_lbe = lbe.fit_transform(y_train)
y_test_lbe = lbe.transform(y_test)

trtr_score_lr  = LogisticRegression().fit(X_train_ohe, y_train_lbe).score(X_test_ohe, y_test_lbe)
trtr_score_rf  = RandomForestClassifier().fit(X_train_ohe, y_train_lbe).score(X_test_ohe, y_test_lbe)
trtr_score_mlp = MLPClassifier().fit(X_train_ohe, y_train_lbe).score(X_test_ohe, y_test_lbe)

In [None]:
df_evaluate = pd.DataFrame([
    ['TSTR', acc_score_lr, acc_score_rf, acc_score_mlp],
    ['TRTR', trtr_score_lr,trtr_score_rf,trtr_score_mlp]
], columns=['Evaluated Item', 'LR', 'RF', 'MLP'])
df_evaluate

## 2. GANBLR++

### 2.1. Load the data

Unlike `GANBLR`, which can only handle discrete data, `GANBLR++` can handle numerical data as well.

To test `GANBLR++`, we use the built-in `get_demo_data` function to get the raw `adult` data as a `pandas.DataFrame`.

In [None]:
from ganblr import get_demo_data
df = get_demo_data('adult-raw')
df.head()

### 2.2. Train the GANBLR++ model

Next, we use `sklearn.model_selection.train_test_split` to split the data into training and test sets, then fit the `GANBLRPP` model on the training set.

In [None]:
from sklearn.model_selection import train_test_split
x, y = df.values[:,:-1], df.values[:,-1]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

In [None]:
print("Training shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)

`GANBLRPP` takes an additional parameter, `numerical_columns`, that tells the model which columns contain numerical data.

`numerical_columns` is a list of integers giving the indexes of the numerical columns.

In most cases it can be derived from the column dtypes with the following code, but sometimes it still has to be specified manually, for example when a categorical column is stored as integers.

In [None]:
import numpy as np
def is_numerical(dtype):
    '''
    If the dtype kind is one of signed integer ('i'), unsigned integer ('u'), or floating point ('f'), we recognize the column as numerical.
    
    Reference: https://numpy.org/doc/stable/reference/generated/numpy.dtype.kind.html#numpy.dtype.kind
    '''
    return dtype.kind in 'iuf'

column_is_numerical = df.dtypes.apply(is_numerical).values
numerical_columns = np.argwhere(column_is_numerical).ravel()
numerical_columns
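
As an alternative sketch, the same indexes can be derived with `DataFrame.select_dtypes` and `Index.get_indexer`. The toy frame below is hypothetical, and treating `education-num` as an integer-coded category is an assumption made only to show a manual override.

```python
import pandas as pd

# Hypothetical toy frame; the last column plays the role of the label.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
    "hours-per-week": [40.0, 13.0, 40.0],
    "education-num": [13, 13, 9],
    "class": ["<=50K", "<=50K", ">50K"],
})
features = toy.iloc[:, :-1]

# Positions of all columns with a numeric dtype.
numeric_names = features.select_dtypes(include="number").columns
auto_columns = features.columns.get_indexer(numeric_names)
print(auto_columns)

# Manual override: drop a column that is numeric by dtype but categorical by meaning.
manual_columns = [int(i) for i in auto_columns if features.columns[i] != "education-num"]
print(manual_columns)
```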

In [None]:
from ganblr.models import GANBLRPP
ganblrpp = GANBLRPP(numerical_columns)
ganblrpp.fit(X_train, y_train, epochs=10)

### 2.3. Generate the synthetic data

Once the model is ready, we can use the `GANBLRPP.sample` method to draw synthetic data.

The `size` parameter specifies the number of samples to generate. If it is omitted, the model generates as many samples as there are in the training data.

In [None]:
size = 1000
syn_data = ganblrpp.sample(size)

In [None]:
import pandas as pd
pd.DataFrame(syn_data, columns=df.columns).head(10)

### 2.4. TSTR evaluation

Finally, as we did in our paper, we perform a simple TSTR (Train on Synthetic, Test on Real) evaluation to demonstrate the quality of the generated data.

We evaluate with three models from sklearn: `LogisticRegression`, `RandomForestClassifier`, and `MLPClassifier`.

TRTR (Train on Real, Test on Real) is used as the baseline for comparison.

In [None]:
acc_score_lr  = ganblrpp.evaluate(X_test, y_test, model='lr')
acc_score_mlp = ganblrpp.evaluate(X_test, y_test, model='mlp')
acc_score_rf  = ganblrpp.evaluate(X_test, y_test, model='rf')

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder

# Every column that is not numerical is treated as categorical.
categorical_columns = sorted(set(range(X_train.shape[1])) - set(numerical_columns))

# One-hot encode the categorical columns; .toarray() densifies the default sparse output.
ohe = OneHotEncoder(handle_unknown='ignore')
X_train_ohe = ohe.fit_transform(X_train[:,categorical_columns]).toarray()
X_test_ohe  = ohe.transform(X_test[:,categorical_columns]).toarray()
X_train_num = X_train[:,numerical_columns]
X_test_num  = X_test[:,numerical_columns]

# Standardize the concatenated numerical and one-hot features.
scaler = StandardScaler()
X_train_concat = scaler.fit_transform(np.hstack([X_train_num, X_train_ohe]))
X_test_concat  = scaler.transform(np.hstack([X_test_num, X_test_ohe]))

# Encode the labels as integers.
lbe = LabelEncoder()
y_train_lbe = lbe.fit_transform(y_train)
y_test_lbe = lbe.transform(y_test)

trtr_score_lr  = LogisticRegression().fit(X_train_concat, y_train_lbe).score(X_test_concat, y_test_lbe)
trtr_score_rf  = RandomForestClassifier().fit(X_train_concat, y_train_lbe).score(X_test_concat, y_test_lbe)
trtr_score_mlp = MLPClassifier().fit(X_train_concat, y_train_lbe).score(X_test_concat, y_test_lbe)
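
The manual slicing and stacking in the cell above can also be expressed with `sklearn.compose.ColumnTransformer`. The following is a hedged sketch on a hypothetical toy array, not the preprocessing used by `GANBLRPP.evaluate`. Unlike the cell above, it scales only the numerical columns and leaves the one-hot columns as 0/1.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type rows: columns 0 and 2 numerical, column 1 categorical.
X_toy = np.array([
    [39, "State-gov", 40.0],
    [50, "Private", 13.0],
    [38, "Private", 40.0],
], dtype=object)
num_idx = [0, 2]
cat_idx = [1]

# sparse_threshold=0 forces a dense array regardless of the encoder output.
pre = ColumnTransformer(
    [
        ("num", StandardScaler(), num_idx),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_idx),
    ],
    sparse_threshold=0,
)
X_out = np.asarray(pre.fit_transform(X_toy), dtype=float)

# Two scaled numerical columns followed by two one-hot columns.
print(X_out.shape)
print(X_out)
```

Because the transformer is fitted once on the training rows and then reused with `transform`, the same column layout is guaranteed for the test rows.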

In [None]:
df_evaluate = pd.DataFrame([
    ['TSTR', acc_score_lr, acc_score_rf, acc_score_mlp],
    ['TRTR', trtr_score_lr,trtr_score_rf,trtr_score_mlp]
], columns=['Evaluated Item', 'LR', 'RF', 'MLP'])
df_evaluate