# Usage Example

In this notebook, we show how to use the **GANBLR** models.

The following GANBLR models are currently available in this package:

- GANBLR
- GANBLR++


## 1. GANBLR

### 1.1. Load the data

The first step is to get the data we will use. For `GANBLR`, the data must be discrete. 

Here, the built-in `get_demo_data` function returns a discretized version of the `adult` dataset as a `pandas.DataFrame`.

In [None]:
from ganblr.utils import get_demo_data

df = get_demo_data('adult')
df.head()
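
If your own data contains continuous columns, they need to be binned before being passed to `GANBLR`. The following is a minimal sketch of one way to do that with `pandas.qcut`; the toy frame and its column names are hypothetical and are not part of the `adult` demo data.

```python
import pandas as pd

# Hypothetical toy data with one continuous and one categorical column.
toy = pd.DataFrame({
    "age": [23.0, 35.5, 47.2, 52.1, 61.8, 29.4],
    "workclass": ["private", "gov", "private", "self", "gov", "private"],
})

# Replace the continuous column with 3 equal-frequency bin codes (0, 1, 2).
toy["age"] = pd.qcut(toy["age"], q=3, labels=False)

# Every column now holds a small number of discrete levels.
levels_per_column = toy.nunique()
print(levels_per_column)
```

Equal-frequency binning is only one option; the number of bins and the binning strategy are choices that depend on the data.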

### 1.2. Train the GANBLR Model

Next, we use `sklearn.model_selection.train_test_split` to split the data into training and test sets, then fit the `GANBLR` model on the training set.

Note that the `GANBLR` class has a built-in `sklearn.preprocessing.OrdinalEncoder` and `sklearn.preprocessing.LabelEncoder` to convert the data format, so no manual encoding is needed before calling `fit`.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
#data = OrdinalEncoder(dtype=int).fit_transform(df)
x, y = df.iloc[:,:-1], df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
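
To make the built-in encoding step more concrete, here is a small sketch of what `OrdinalEncoder` and `LabelEncoder` do to discrete data. It uses hypothetical toy arrays and only illustrates the two scikit-learn encoders; how `GANBLR` wires them together internally is not shown here.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical discrete features and labels.
X_toy = np.array([["gov", "hs"], ["private", "college"], ["gov", "college"]])
y_toy = np.array([">50K", "<=50K", ">50K"])

# Each feature column is mapped to integer codes, in sorted category order.
X_codes = OrdinalEncoder(dtype=int).fit_transform(X_toy)

# The label column is mapped to integer class ids in the same way.
y_codes = LabelEncoder().fit_transform(y_toy)

print(X_codes)
print(y_codes)
```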

In [None]:
print("Training shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)

In [None]:
from ganblr import GANBLR
model = GANBLR()
model.fit(X_train, y_train, k=0, epochs=10, batch_size=64)

### 1.3. Generate the synthetic data

Once the model is trained, we can use the `GANBLR.sample` method to generate synthetic data.

The `size` parameter specifies the number of samples to generate. If it is omitted, the model generates as many samples as there are in the training data.

In [None]:
size = 1000

syn_data = model.sample(size)

In [None]:
print(f"{type(syn_data)}, {syn_data.shape}")

In [None]:
import pandas as pd
pd.DataFrame(data = syn_data, columns=df.columns).head(10)

### 1.4. TSTR evaluation

Finally, as we did in our paper, we perform a simple TSTR (Train on Synthetic, Test on Real) evaluation to demonstrate the quality of the generated data.

We evaluate with three scikit-learn models: `LogisticRegression`, `RandomForestClassifier`, and `MLPClassifier`.

TRTR (Train on Real, Test on Real) serves as the baseline for comparison.

In [None]:
acc_score_lr = model.evaluate(X_test, y_test, model='lr')
acc_score_mlp = model.evaluate(X_test, y_test, model='mlp')
acc_score_rf = model.evaluate(X_test, y_test, model='rf')

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
lbe = LabelEncoder()
X_train_ohe = ohe.fit_transform(X_train)
X_test_ohe = ohe.transform(X_test)
y_train_lbe = lbe.fit_transform(y_train)
y_test_lbe = lbe.transform(y_test)

trtr_score_lr  = LogisticRegression().fit(X_train_ohe, y_train_lbe).score(X_test_ohe, y_test_lbe)
trtr_score_rf  = RandomForestClassifier().fit(X_train, y_train_lbe).score(X_test, y_test_lbe)
trtr_score_mlp = MLPClassifier().fit(X_train_ohe, y_train_lbe).score(X_test_ohe, y_test_lbe)

In [None]:
import pandas as pd
df_evaluate = pd.DataFrame([
    ['TSTR', acc_score_lr, acc_score_rf, acc_score_mlp],
    ['TRTR', trtr_score_lr, trtr_score_rf, trtr_score_mlp]
], columns=['Evaluated Item', 'LR', 'RF', 'MLP'])
df_evaluate
#df_evaluate.set_index('Evaluate Item')
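
The TSTR and TRTR protocols compared above can also be sketched without any generative model. In the example below, the data is randomly generated and a bootstrap resample of the training set stands in for the generator output; both are assumptions made only to keep the sketch self-contained, and the helper `score_on_real` is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Hypothetical discrete data: 3 categorical features, label driven by feature 0.
X = rng.integers(0, 4, size=(600, 3))
y = (X[:, 0] >= 2).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Stand-in for generated data: a bootstrap resample of the training set.
idx = rng.integers(0, len(X_train), size=len(X_train))
X_syn, y_syn = X_train[idx], y_train[idx]

def score_on_real(fit_X, fit_y):
    """Fit on the given data, then report accuracy on the real test set."""
    ohe = OneHotEncoder(handle_unknown="ignore")
    clf = LogisticRegression(max_iter=1000)
    clf.fit(ohe.fit_transform(fit_X), fit_y)
    return clf.score(ohe.transform(X_test), y_test)

tstr = score_on_real(X_syn, y_syn)      # Train on Synthetic, Test on Real
trtr = score_on_real(X_train, y_train)  # Train on Real, Test on Real
print(f"TSTR={tstr:.3f}  TRTR={trtr:.3f}")
```

The closer the TSTR score is to the TRTR baseline, the better the synthetic data preserves the signal a classifier needs.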

## 2. GANBLR++

In [None]:
from ganblr import GANBLRPP
from pandas import DataFrame, read_csv
df = read_csv('../uci-datasets/raw_csv/adult.csv', index_col=0)

In [None]:
df

In [None]:
import numpy as np
numerical_columns = np.argwhere(df.dtypes.values == float).ravel()
numerical_columns
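
The dtype comparison above selects the positional indices of the float columns. The sketch below reproduces it on a hypothetical toy frame and shows an equivalent formulation with `select_dtypes`. Note that either approach only picks up columns stored as floats; a numeric column stored as integers would not be selected.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing float and string columns.
toy = pd.DataFrame({
    "age": [39.0, 50.0],
    "workclass": ["gov", "private"],
    "hours": [40.0, 13.0],
    "label": ["<=50K", ">50K"],
})

# Same approach as in the cell above: compare each column dtype to float.
by_dtype = np.argwhere(toy.dtypes.values == float).ravel()

# Equivalent: select float columns by dtype, then look up their positions.
by_select = toy.columns.get_indexer(toy.select_dtypes(include="float").columns)

print(by_dtype, by_select)
```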

In [None]:
ganblrpp = GANBLRPP(numerical_columns)

In [None]:
from sklearn.model_selection import train_test_split
x, y = df.values[:,:-1], df.values[:,-1]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=20)

In [None]:
ganblrpp._GANBLRPP__discritizer._DMMDiscritizer__arr_mu

In [None]:
[len(mu) for mu in ganblrpp._GANBLRPP__discritizer._DMMDiscritizer__arr_mu]

In [None]:
dmmd = ganblrpp._GANBLRPP__discritizer
x = dmmd._DMMDiscritizer__scaler.fit_transform(X_train[:,numerical_columns])
print(x.shape)
arr_modes = []
for i, dmm in enumerate(dmmd._DMMDiscritizer__dmms):
    cur = x[:,i:i+1]
    print(cur.shape)
    modes = dmm.predict(cur)
    modes = LabelEncoder().fit_transform(modes)#.astype(int)
    arr_modes.append(modes)
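
The loop above turns each scaled numerical column into integer mode labels by asking a fitted mixture model which component each value belongs to. The sketch below illustrates that idea on one hypothetical column, using scikit-learn's `BayesianGaussianMixture` and `MinMaxScaler` as stand-ins. These are assumptions for illustration; the discretizer inside `GANBLRPP` may use different components and settings.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

rng = np.random.default_rng(0)

# Hypothetical numerical column with two well-separated clusters.
col = np.concatenate(
    [rng.normal(20, 1, 200), rng.normal(60, 1, 200)]
).reshape(-1, 1)

# Scale the column, then fit a mixture with a Dirichlet process prior,
# which lets unused components shrink towards zero weight.
scaled = MinMaxScaler().fit_transform(col)
mixture = BayesianGaussianMixture(
    n_components=5,
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
)
raw_modes = mixture.fit(scaled).predict(scaled)

# Re-index the component ids that are actually used to 0..n_modes-1.
modes = LabelEncoder().fit_transform(raw_modes)
print(np.unique(modes))
```

The re-indexing step mirrors the `LabelEncoder().fit_transform(modes)` call in the loop: it keeps the discrete codes contiguous even when some mixture components end up unused.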

In [None]:
ganblrpp.fit(X_train, y_train, epochs=10)

In [None]:
acc_score_lr  = ganblrpp.evaluate(X_test, y_test, model='lr')
acc_score_mlp = ganblrpp.evaluate(X_test, y_test, model='mlp')
acc_score_rf  = ganblrpp.evaluate(X_test, y_test, model='rf')

In [None]:
numerical_columns = [0,1,2,3]
categorical_columns = np.argwhere([col not in numerical_columns for col in range(8)])
list(set(range(8)) - set(numerical_columns))

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.metrics import accuracy_score

# Step-by-step TSTR evaluation. `evaluator` is a name introduced here so that
# the GANBLR instance stored in `model` is not mistaken for the classifier.
evaluator = 'lr'  # 'lr', 'rf', 'mlp', or any estimator with a `fit` method

if evaluator == 'lr':
    eval_model = LogisticRegression()
elif evaluator == 'rf':
    eval_model = RandomForestClassifier()
elif evaluator == 'mlp':
    eval_model = MLPClassifier()
elif hasattr(evaluator, 'fit'):
    eval_model = evaluator
else:
    raise ValueError('Invalid argument')

# Draw synthetic data and split it into features and labels.
synthetic_data = ganblrpp.sample()
synthetic_x, synthetic_y = synthetic_data[:,:-1], synthetic_data[:,-1]

numerical_columns = ganblrpp._numerical_columns
categorical_columns = sorted(set(range(X_test.shape[1])) - set(numerical_columns))
categories = ganblrpp._GANBLRPP__ganblr._d.get_categories(categorical_columns)
ode = OrdinalEncoder(categories=categories)
# Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`.
ohe = OneHotEncoder(categories=categories, sparse=False)
lbe = ganblrpp._GANBLRPP__ganblr._label_encoder
scaler = StandardScaler()

real_x_num = scaler.fit_transform(X_test[:,numerical_columns])
syn_x_num  = scaler.fit_transform(synthetic_x[:,numerical_columns])
# Random forests take ordinal codes; the other models take one-hot features.
cat_encoder = ode if evaluator == 'rf' else ohe
real_x_cat = cat_encoder.fit_transform(X_test[:,categorical_columns])
syn_x_cat  = cat_encoder.fit_transform(synthetic_x[:,categorical_columns])

# Encode the real test labels (not the full label vector) and the synthetic labels.
real_y = lbe.transform(y_test)
syn_y  = lbe.transform(synthetic_y)

# Train on synthetic data, test on real data.
eval_model.fit(np.hstack([syn_x_num, syn_x_cat]), syn_y)
pred = eval_model.predict(np.hstack([real_x_num, real_x_cat]))
acc = accuracy_score(real_y, pred)