<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_07_5_tabular_synthetic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Google CoLab Instructions

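The following code detects whether the notebook is running in Google CoLab and records the result in `COLAB`. CoLab now includes only TensorFlow 2.x, so the `%tensorflow_version` magic has no effect.
  Note that the cell imports `drive` but does not mount it; mapping your GDrive to ```/content/drive``` additionally requires calling `drive.mount('/content/drive')`.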

In [None]:
try:
    from google.colab import drive
    COLAB = True
    print("Note: using Google CoLab")
    %tensorflow_version 2.x
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

Note: using Google CoLab
Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.


# Part 7.5: GANs for Tabular Synthetic Data Generation

Typically GANs are used to generate images. However, we can also generate tabular data from a GAN. In this part, we will use the Python tabgan utility to create fake data from tabular data. Specifically, we will use a Dysgraphia dataset to train a GAN to generate synthetic records. [Cite:ashrapov2020tabular](https://arxiv.org/pdf/2010.00638.pdf)

## Installing Tabgan

PyTorch is the foundation of the tabgan neural network utility. The following code installs the software needed to run tabgan in Google Colab.

In [None]:
# HIDE OUTPUT
CMD = "wget https://raw.githubusercontent.com/Diyago/"\
  "GAN-for-tabular-data/master/requirements.txt"

!{CMD}
!pip install -r requirements.txt
!pip install tabgan

--2023-02-20 06:32:24--  https://raw.githubusercontent.com/Diyago/GAN-for-tabular-data/master/requirements.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 183 [text/plain]
Saving to: ‘requirements.txt’


2023-02-20 06:32:24 (11.7 MB/s) - ‘requirements.txt’ saved [183/183]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scipy==1.4.1
  Downloading scipy-1.4.1-cp38-cp38-manylinux1_x86_64.whl (26.0 MB)
Collecting category_encoders==2.1.0
  Downloading category_encoders-2.1.0-py2.py3-none-any.whl (100 kB)

Note: after installing, you may see this message:

* You must restart the runtime in order to use newly installed versions.

If so, click the "restart runtime" button just under the message. Then rerun this notebook, and you should not encounter further issues.
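
Because the restart is needed only when the installed packages differ from the pinned ones, it can help to check for mismatches directly. The following is a minimal sketch, not part of tabgan: `find_version_mismatches` and `fake_lookup` are hypothetical helpers, the two pins are taken from the install log above, and the "installed" version `1.7.3` is an invented value for illustration.

```python
from importlib import metadata


def find_version_mismatches(requirement_lines, installed_version=metadata.version):
    """Return {package: (pinned, found)} for every 'name==version' pin that is not met."""
    mismatches = {}
    for line in requirement_lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # only exact pins are checked in this sketch
        name, pinned = (part.strip() for part in line.split("==", 1))
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != pinned:
            mismatches[name] = (pinned, found)
    return mismatches


# Pins seen in the install log above.
pins = ["scipy==1.4.1", "category_encoders==2.1.0"]

# Hypothetical environment, used so the example runs without the real packages.
fake_installed = {"scipy": "1.7.3", "category_encoders": "2.1.0"}


def fake_lookup(name):
    if name not in fake_installed:
        raise metadata.PackageNotFoundError(name)
    return fake_installed[name]


mismatches = find_version_mismatches(pins, fake_lookup)
print(mismatches)
```

In a real session you would read the lines of `requirements.txt` and omit the second argument, so that the check uses the packages actually installed. An empty result suggests that no restart is required for the pinned packages.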

## Loading the Dysgraphia data


In [None]:
import os
import random
import numpy as np
import tensorflow as tf

def tf_seed():
    # Fix every source of randomness so that runs are repeatable
    os.environ['PYTHONHASHSEED'] = str(0)
    # If your machine has GPUs, the following hides them from TensorFlow
    os.environ['CUDA_VISIBLE_DEVICES'] = ''
    np.random.seed(0)
    random.seed(0)
    tf.random.set_seed(0)

In [None]:
# HIDE OUTPUT
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
import pandas as pd
import io
import os
import requests
import numpy as np
from sklearn import metrics
from tensorflow.keras.initializers import RandomNormal

df = pd.read_excel("/content/reduced_data.xlsx")


# Split into training and test sets
df_x_train, df_x_test, df_y_train, df_y_test = train_test_split(
    df.drop("Dysgraphia", axis=1),
    df["Dysgraphia"],
    test_size = 0.20,
    #shuffle=False,
    random_state=42,
    stratify = df["Dysgraphia"]
)

# Create dataframe versions for tabular GAN
df_x_test, df_y_test = df_x_test.reset_index(drop=True), \
df_y_test.reset_index(drop=True)
df_y_train = pd.DataFrame(df_y_train)
df_y_test = pd.DataFrame(df_y_test)

# Pandas to Numpy
x_train = df_x_train.values
x_test = df_x_test.values
y_train = df_y_train.values
y_test = df_y_test.values


tf_seed()
# Build the neural network
model = Sequential()
# Hidden 1
model.add(Dense(20, input_dim=x_train.shape[1], activation='relu',
                kernel_initializer=RandomNormal(mean=0, stddev=0.05)))
model.add(Dense(20, activation='relu'))  # Hidden 2
model.add(Dense(25, activation='relu'))  # Hidden 3
model.add(Dense(12, activation='sigmoid'))  # Hidden 4
# Output: sigmoid so that binary_crossentropy and AUC receive probabilities
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['AUC'])

monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3,
                        patience=5, verbose=1, mode='auto',
                        restore_best_weights = True)
model.fit(x_train, y_train, validation_split = 0.2,
        callbacks = [monitor], verbose = 2, epochs = 1000)

Epoch 1/1000
6/6 - 2s - loss: 0.5526 - auc: 0.5654 - val_loss: 0.5051 - val_auc: 0.7193 - 2s/epoch - 279ms/step
Epoch 2/1000
6/6 - 0s - loss: 0.4800 - auc: 0.6984 - val_loss: 0.5002 - val_auc: 0.7553 - 55ms/epoch - 9ms/step
Epoch 3/1000
6/6 - 0s - loss: 0.4683 - auc: 0.7339 - val_loss: 0.4905 - val_auc: 0.8463 - 65ms/epoch - 11ms/step
Epoch 4/1000
6/6 - 0s - loss: 0.4537 - auc: 0.7672 - val_loss: 0.4939 - val_auc: 0.8570 - 50ms/epoch - 8ms/step
Epoch 5/1000
6/6 - 0s - loss: 0.4500 - auc: 0.7751 - val_loss: 0.4890 - val_auc: 0.8422 - 51ms/epoch - 8ms/step
Epoch 6/1000
6/6 - 0s - loss: 0.4474 - auc: 0.7892 - val_loss: 0.4854 - val_auc: 0.8249 - 64ms/epoch - 11ms/step
Epoch 7/1000
6/6 - 0s - loss: 0.4376 - auc: 0.8047 - val_loss: 0.4889 - val_auc: 0.8275 - 63ms/epoch - 10ms/step
Epoch 8/1000
6/6 - 0s - loss: 0.4405 - auc: 0.7927 - val_loss: 0.4868 - val_auc: 0.8382 - 65ms/epoch - 11ms/step
Epoch 9/1000
6/6 - 0s - loss: 0.4473 - auc: 0.7716 - val_loss: 0.4787 - val_auc: 0.8155 - 50ms/epoch

<keras.callbacks.History at 0x7f4620e08bb0>

We now evaluate the trained neural network to see its AUC on the test set. We will use this trained neural network to compare performance between the original data and the GAN-generated data. We will later see that you can use such comparisons for anomaly detection, a technique that can be used in security systems. If a neural network trained on original data does not perform well on new data, then the new data may be suspect or fake.

In [None]:
score = model.evaluate(x_test, y_test)
print("Final score (AUC): {}".format(score[1]))

Final score (AUC): 0.8020833730697632
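
To make the AUC score easier to interpret, the following sketch computes it from first principles on a toy example: AUC is the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function `pairwise_auc` and the toy labels and scores are illustrative assumptions and do not come from the Dysgraphia data.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of positive/negative pairs that are ranked correctly."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities (hypothetical values).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_score)
print(f"AUC: {auc:.2f}")
```

Three of the four positive/negative pairs are ordered correctly here, so the toy AUC is 0.75. A score of 0.5 corresponds to random ranking and 1.0 to perfect ranking; `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently.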


## Training a GAN for the Dysgraphia dataset

Next, we will train the GAN to generate fake data from the original data. There are quite a few options that you can fine-tune for the GAN. The example presented here uses most of the default values. These are the usual hyperparameters that must be tuned for any model and require some experimentation for optimal results. To learn more about tabgan, refer to its paper or this [Medium article](https://towardsdatascience.com/review-of-gans-for-tabular-data-a30a2199342), written by the creator of tabgan.

In [None]:
from tabgan.sampler import GANGenerator
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

gen_x, gen_y = GANGenerator(
    gen_x_times=1.1,
    cat_cols=None,
    bot_filter_quantile=0.001,
    top_filter_quantile=0.999,
    is_post_process=True,
    adversarial_model_params={
        "metrics": "AUC", "max_depth": 2, "max_bin": 100,
        "learning_rate": 0.02, "random_state": 42, "n_estimators": 500,
    },
    pregeneration_frac=2,
    only_generated_data=False,
    gan_params={"batch_size": 500, "patience": 25, "epochs": 500},
).generate_data_pipe(
    df_x_train, df_y_train, df_x_test,
    deep_copy=True, only_adversarial=False, use_adversarial=True,
)



Fitting CTGAN transformers for each column:   0%|          | 0/16 [00:00<?, ?it/s]

Training CTGAN, epochs::   0%|          | 0/500 [00:00<?, ?it/s]

Note: if you receive an error running the above code, you likely need to restart the runtime. You should have a "restart runtime" button in the output from the second cell. Once you restart the runtime, rerun all of the cells. This step is necessary as tabgan requires specific versions of some packages.

## Evaluating the GAN Results

If we display the results, we can see that the GAN-generated data looks similar to the original. Some values, typically whole numbers in the original data, have fractional values in the synthetic data.

In [None]:
gen_x

Unnamed: 0,BHK_raw_quality_score,median_Freq_speed,mean_d_P,dist_Freq_tilt_x,dist_Freq_speed,median_Freq_tilt_y,Space_Between_Words,bandwidth_tilt_x,std_d_P,mean_Pressure,in_Air,BHK_raw_speed_score,std_Density,median_Freq_tremolo,Age
0,21.333333,0.001132,0.096496,0.000225,0.000355,0.003374,1418.925895,0.003258,1.335429,588.924280,0.543732,58.333333,252.647261,0.003335,6
1,20.333333,0.001050,0.191805,0.000237,0.000458,0.003336,1524.264209,0.003102,1.758775,460.662390,0.540825,101.333333,181.095889,0.003334,6
2,16.000000,0.001310,0.134668,0.000320,0.000210,0.003348,1125.804989,0.003230,1.548513,309.809359,0.642660,159.000000,280.365147,0.003373,8
3,29.000000,0.001128,0.179636,0.000300,0.000368,0.003401,1678.745264,0.003318,1.802776,489.141456,0.707339,76.000000,285.035458,0.003320,7
4,16.500000,0.001181,0.204028,0.000233,0.000279,0.003370,1317.883933,0.003259,2.334114,380.491356,0.574225,167.500000,219.778932,0.003335,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,9.500000,0.001409,0.129914,0.000142,0.000108,0.003313,930.921346,0.003461,2.038072,440.719243,0.527535,143.000000,157.706077,0.003068,8
153,13.000000,0.001281,0.200997,0.000216,0.000219,0.003311,1252.633855,0.003766,2.313594,543.811315,0.467764,212.500000,185.753591,0.003185,9
154,18.500000,0.001468,0.324858,0.000192,0.000142,0.003331,1474.418533,0.003777,3.302690,433.376566,0.584087,348.500000,178.891976,0.003044,11
155,9.000000,0.001184,0.218088,0.000209,0.000304,0.003320,1215.593437,0.003473,2.653659,507.230910,0.463467,222.000000,94.598949,0.003176,8
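
If whole-number columns do come back with fractional values, one option is to round them after generation. The following is a minimal sketch under stated assumptions: `restore_integer_columns` is a hypothetical helper (not part of tabgan), the column names are borrowed from the table above, and the toy values are invented for illustration. It also assumes the synthetic columns contain no missing values.

```python
import pandas as pd


def restore_integer_columns(original, synthetic):
    """Round synthetic columns whose original values were all whole numbers."""
    fixed = synthetic.copy()
    for col in original.columns:
        if col not in fixed.columns:
            continue
        values = original[col].dropna()
        # A column counts as "integer" when every original value is whole.
        if pd.api.types.is_numeric_dtype(values) and (values % 1 == 0).all():
            fixed[col] = fixed[col].round().astype("int64")
    return fixed


# Toy frames with hypothetical values.
original = pd.DataFrame({"Age": [6, 8, 10],
                         "mean_Pressure": [588.9, 309.8, 380.5]})
synthetic = pd.DataFrame({"Age": [6.4, 7.6, 9.2],
                          "mean_Pressure": [460.7, 489.1, 433.4]})

fixed = restore_integer_columns(original, synthetic)
print(fixed)
```

Only `Age` is rounded in this example, because `mean_Pressure` already had fractional values in the original frame and is left untouched.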


In [None]:
gen_y.value_counts()

0    130
1     27
Name: Dysgraphia, dtype: int64

Finally, we present the synthetic data to the previously trained neural network to see how accurately we can predict the synthetic targets. As we can see, the AUC drops slightly on the synthetic data, from about 0.80 on the test set to about 0.78.

In [None]:
# Predict
score = model.evaluate(gen_x, gen_y)
print("Final score (AUC): {}".format(score[1]))

Final score (AUC): 0.7840455770492554
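
The comparison between the two scores can be turned into a simple check of the kind described earlier for anomaly detection. This is only a sketch: `auc_drop_is_suspect` is a hypothetical helper, and the tolerance of 0.05 is an arbitrary assumption that would need tuning for a real application. The two AUC values are the ones reported in this notebook.

```python
def auc_drop_is_suspect(reference_auc, new_auc, tolerance=0.05):
    """Return True when new data scores worse than the reference by more than tolerance."""
    return (reference_auc - new_auc) > tolerance


reference_auc = 0.8021  # test-set AUC reported earlier
synthetic_auc = 0.7840  # synthetic-data AUC reported above

drop = reference_auc - synthetic_auc
suspect = auc_drop_is_suspect(reference_auc, synthetic_auc)
print(f"AUC drop: {drop:.4f}, suspect: {suspect}")
```

With these values the drop is about 0.018, which stays within the assumed tolerance, so the synthetic data would not be flagged as suspect by this rule.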


In [None]:
new_data = pd.concat([gen_x, gen_y], axis  = 1)
new_data.head()

Unnamed: 0,BHK_raw_quality_score,median_Freq_speed,mean_d_P,dist_Freq_tilt_x,dist_Freq_speed,median_Freq_tilt_y,Space_Between_Words,bandwidth_tilt_x,std_d_P,mean_Pressure,in_Air,BHK_raw_speed_score,std_Density,median_Freq_tremolo,Age,Dysgraphia
0,21.333333,0.001132,0.096496,0.000225,0.000355,0.003374,1418.925895,0.003258,1.335429,588.92428,0.543732,58.333333,252.647261,0.003335,6,0
1,20.333333,0.00105,0.191805,0.000237,0.000458,0.003336,1524.264209,0.003102,1.758775,460.66239,0.540825,101.333333,181.095889,0.003334,6,0
2,16.0,0.00131,0.134668,0.00032,0.00021,0.003348,1125.804989,0.00323,1.548513,309.809359,0.64266,159.0,280.365147,0.003373,8,0
3,29.0,0.001128,0.179636,0.0003,0.000368,0.003401,1678.745264,0.003318,1.802776,489.141456,0.707339,76.0,285.035458,0.00332,7,0
4,16.5,0.001181,0.204028,0.000233,0.000279,0.00337,1317.883933,0.003259,2.334114,380.491356,0.574225,167.5,219.778932,0.003335,10,0


In [None]:
new_data.shape

(157, 16)

In [None]:
new_data.to_excel('synthetic_data.xlsx', index = False)